CHG-2025-04-30 replace hetzner2 sda
Status
2025-04-30 14:30 UTC
This change was completed successfully
2025-04-30 14:18 UTC
- I'm going to run the grub install a second time before rebooting
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
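As an extra sanity check that boot code actually landed on both drives (a sketch; not part of the original steps), you can look for the GRUB signature in each disk's MBR:

# sketch: dump the first 512 bytes (MBR) of each disk and look for the "GRUB" string
for disk in /dev/sda /dev/sdb; do
   echo "${disk}:"
   dd if="${disk}" bs=512 count=1 2>/dev/null | strings | grep GRUB
done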
- and I rebooted it
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
user@personal:~$ autossh opensourceecology.org
Last login: Wed Apr 30 11:28:26 2025 from REDACTED
[maltfield@opensourceecology ~]$ uptime
 14:17:14 up 1 min,  1 user,  load average: 0.85, 0.24, 0.08
[maltfield@opensourceecology ~]$
- cool, it came back.
- cool, the RAID looks healthy
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[3] sdb3[2]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[3]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[3]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk
├─sda1    8:1    0    32G  0 part
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sda2    8:2    0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:3    0 200.4G  0 part
  └─md2   9:2    0 200.3G  0 raid1 /
sdb       8:16   0   477G  0 disk
├─sdb1    8:17   0    32G  0 part
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]#
- and SMART isn't yelling about failed disks anymore
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
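One optional follow-up I didn't run as part of this change (a sketch): kick off a SMART extended self-test on the new disk and check its result once it finishes.

# sketch: start an extended offline self-test on the new disk, then review the log later
smartctl -t long /dev/sda
smartctl -l selftest /dev/sda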
2025-04-30 14:13 UTC
The RAID sync is finished; I guess these Micron 512G disks have better I/O throughput than our old 250G Crucial disks
Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
      33521664 blocks super 1.2 [2/1] [_U]
      [====>................]  recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec

md2 : active raid1 sda3[3] sdb3[2]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
      33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[3] sdb3[2]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
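For future reference, if a resync ever crawls, the kernel's md rate limits are the first thing to check; a sketch (these sysctl paths are standard md tunables, but the value below is just an example):

# check the kernel's md resync rate limits (KB/s per device)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# example only: temporarily raise the floor to push a slow resync harder
echo 100000 > /proc/sys/dev/raid/speed_limit_min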
2025-04-30 13:48 UTC
Since Hetzner can't provide a new drive, I went ahead and added the drive they gave us to the RAID
- looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
[root@opensourceecology ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk
sdb       8:16   0   477G  0 disk
├─sdb1    8:17   0    32G  0 part
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
      33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
      209984640 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
      523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
- I made a backup of the partition tables
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0       /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K    /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
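For the record, restoring from one of these dumps is just the reverse pipe; a minimal sketch (not run during this change):

# sketch: restore a saved partition table from its sfdisk dump
sfdisk /dev/sdb < ${chg_dir}/sdb_parttable_mbr.bak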
- the sda partition table dump is empty, which makes sense (the new disk hasn't been partitioned yet)
- I copied the sdb partition table to sda
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track

sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1          2048  67110912   67108865  fd  Linux raid autodetect
/dev/sda2      67112960  68161536    1048577  fd  Linux raid autodetect
/dev/sda3      68163584 488395120  420231537  fd  Linux raid autodetect
/dev/sda4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
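Worth noting for future replacements: the sfdisk pipe only works because these disks use MBR partition tables. If we ever end up with GPT disks, the equivalent copy would look something like this sketch (sgdisk comes from the gdisk package; not applicable to this change):

# sketch: GPT equivalent of the sfdisk copy (our disks are MBR, so this was not used)
sgdisk --replicate=/dev/sda /dev/sdb   # copy sdb's partition table onto sda
sgdisk --randomize-guids /dev/sda      # give the new disk unique GUIDs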
- and told the kernel to re-read the new partition table
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
- and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think they'll sync in that order of priority
[root@opensourceecology chg.20250430_134343]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk
├─sda1    8:1    0    32G  0 part
├─sda2    8:2    0   512M  0 part
└─sda3    8:3    0 200.4G  0 part
sdb       8:16   0   477G  0 disk
├─sdb1    8:17   0    32G  0 part
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
- cool, that worked. /boot is already done, and it's syncing root (/) now
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
      33521664 blocks super 1.2 [2/1] [_U]
        resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
      209984640 blocks super 1.2 [2/1] [_U]
      [=>...................]  recovery =  9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
- I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was added first and has already finished syncing
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
- as noted in the docs, those warnings can be safely ignored
2025-04-30 13:26 UTC
- I got a response back from hetzner 4 minutes later
Dear Client. We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
- so it looks like we got the drive for free, but that still wasted most of my time. I replied and asked them how long it would take them to order a new drive
I emailed last week about this to make sure you had time to order a new drive (check my support tickets). This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new. How long would it take you to order a new drive?
2025-04-30 13:20 UTC
We had been waiting on Hetzner, and they have now swapped the disk. Unfortunately, they replaced the drive with one that has already been used for 18,623 hours, which means it has only 32% of its life left.
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       18623
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   032   032   000    Old_age   Always       -       1030
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       2
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   047   000    Old_age   Always       -       32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   032   032   001    Old_age   Offline      -       68
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       96994281182
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       3059820027
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       31429771271
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2467
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
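For reference, the "32% of its life left" figure is just the normalized VALUE column of attribute 202 (Percent_Lifetime_Remain). A quick sketch for pulling only that number (note the attribute name is specific to these Micron/Crucial SSDs; other vendors report wear differently):

# sketch: print the normalized VALUE column (remaining life, in percent) of attribute 202
smartctl -A /dev/sda | awk '/Percent_Lifetime_Remain/ {print $4}'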
In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.
I asked them to insert an actually new drive with <1,000 hours of use.
2025-04-30 11:44 UTC
- I confirmed that the RAID is currently healthy
- and today's backup (from a few hours ago) is sane and uploaded
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
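The ~20 GB size looks sane by eyeball; if we wanted to automate that judgment, something like this sketch would do (the 10 GB floor is an arbitrary example threshold, not part of our backup scripts):

# sketch: fail loudly if today's backup is missing or suspiciously small
size=$(${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d") | awk '{print $1}')
if [ -z "$size" ] || [ "$size" -lt 10000000000 ]; then echo "backup MISSING or too small!"; else echo "backup OK (${size} bytes)"; fi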
- I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
- I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       3
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       52223
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       46
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   004   004   000    Old_age   Always       -       1452
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       29
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   049   000    Old_age   Always       -       36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   004   004   001    Old_age   Offline      -       96
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       601634812550
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       18904241237
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       11849811867
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2470
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       12

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78658
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       63
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3454
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       56
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   046   000    Old_age   Always       -       38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       408221767008
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12873452848
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26389101858

[root@opensourceecology ~]#
- and I confirmed again that the serial of the disk we want to replace matches the one listed in this CHG ticket
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
- ok, I'm removing sda from the RAID
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
      33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
      209984640 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
      523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
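Another way to confirm the removal (a sketch; I didn't bother, since /proc/mdstat already shows [_U] on all three arrays) is to ask mdadm for each array's state and member list:

# sketch: each array should report one active device and list only sdb partitions
for md in /dev/md0 /dev/md1 /dev/md2; do
   mdadm --detail "$md" | grep -E 'State :|Active Devices|/dev/sd'
done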
- and I submitted the request for support to swap the disk
SMART says disk is FAILED and needs to be replaced asap. I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
2025-04-30 11:29 UTC
Starting Change
2025-04-24 17:56 UTC
Marcin approved the start time of this CHG
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:
> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> * https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
2025-04-24 17:37 UTC
Marcin approved purchasing a new disk for this replacement
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:
> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
2025-04-18 22:15 UTC
Initial Ticket draft created on wiki (WIP)
Change Info
Scheduled Time
This change will take place on 2025-04-30 11:00 UTC
- = 2025-04-30 06:00 Kansas City, US
- = 2025-04-30 06:00 Guayaquil, EC
https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93
Purpose
This change will physically replace one of our two SSDs (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on hetzner2
On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it was unable to recover from the corruption due to a bug in mariadb. And because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said both of our two redundant disks are going to fail within 24 hours and that we should replace them immediately
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]#
Points of Contact
Change being performed by: Michael Altfield
Service owners: Catarina Mota & Marcin Jakubowski
Time Length
We expect at most 5 hours of downtime.
Re-partitioning the new disk, adding it to the RAID, and updating grub should take less than 2 hours.
Rebuilding the RAID1 mirror across the two disks might take a day or more. During this time we'll be vulnerable, because we'll only have one disk (no redundancy). That risk is compounded by the fact that both disks currently report that they will fail within 24 hours.
Systems Impacted
This change impacts hetzner2 and every service/website that runs on it will go down.
Staging Test
n/a
Change Steps
First, before we do anything, get the status of the RAID
# verify RAID status
cat /proc/mdstat
Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
Sometime during the morning in Germany (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3

mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
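Before running the block above, it's probably worth double-checking that no resync is in progress and that all three arrays are otherwise healthy; a minimal pre-check sketch (not an official step of this change):

# sketch: warn if any array is mid-resync, and show each array's state before degrading it
grep -E 'resync|recovery' /proc/mdstat && echo "WARNING: a resync is in progress -- do not remove sda yet"
for md in /dev/md0 /dev/md1 /dev/md2; do mdadm --detail "$md" | grep 'State :'; done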
Log into the Hetzner WUI https://robot.your-server.de/
Go to the servers page https://robot.hetzner.com/server
- Click the "Support" tab under hetzner2
- Click "Technical"
- Select "Server - Disk Failure"
- Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
- Select "At cost"
- Select "Swap while the system is running"
- Select "As soon as possible"
- In the "Entire SMART log" textarea, enter this:
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
- Click "Send request"
Wait until hetzner confirms that the replacement drive has been inserted
# monitor for I/O events in kernel logs
dmesg -w
After the replacement drive has been inserted, get some info about it
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
Before we modify the partition tables of any of our drives, let's make backups
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp

mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
Copy the partition table from our surviving disk (sdb) to the new disk (sda)
# dump the partition table of the surviving disk and pipe it into the new disk
sfdisk -d /dev/sdb | sfdisk /dev/sda
Tell the kernel to re-read the partition table
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
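If blockdev refuses to re-read the table (e.g. because it thinks the device is busy), partprobe from the parted package is a common fallback; a sketch, in case it's ever needed:

# fallback sketch: ask the kernel to re-read the partition table via partprobe
partprobe /dev/sda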
Now add the new drive to the RAID array
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
Copy our grub configuration and boot files onto the new disk using `grub2-install` (the grub2-* name is what CentOS 7 ships)
grub2-install /dev/sda
Execute this command to monitor the status of the RAID replication
while true; do date; cat /proc/mdstat; echo; sleep 300; done
You may need to wait several hours (hopefully less than 1 day) before proceeding.
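If you'd rather read a machine-parsable progress number than eyeball /proc/mdstat, the md sysfs files expose the same information; a sketch:

# sketch: print each array's current sync action and progress (sectors done / total) from sysfs
for md in /sys/block/md*/md; do
   echo "$(basename $(dirname $md)): $(cat $md/sync_action) $(cat $md/sync_completed)"
done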
Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected
sudo reboot
Revert Steps
Not sure if this is even possible, but we would have to contact Hetzner and ask them to physically remove the new drive and re-install the old one that they just removed.