CHG-2025-04-24 replace hetzner2 sdb
Status
2025-04-24 16:18 UTC
The new disk has been fully in sync with the old (failing) disk since sometime between 15:15 and 15:20 UTC
Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1] 
md2 : active raid1 sdb3[2] sda3[0]
      209984640 blocks super 1.2 [2/1] [U_]
      [===================>.]  recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1] 
md2 : active raid1 sdb3[2] sda3[0]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
Now SMART says /dev/sdb is PASSED and /dev/sda is still FAILED
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
Full info
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78516
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3445
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       47
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   060   046   000    Old_age   Always       -       40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       407132499909
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12839097351
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26313144762

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       3
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       52083
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   004   004   000    Old_age   Always       -       1449
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       20
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   061   049   000    Old_age   Always       -       39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   004   004   001    Old_age   Offline      -       96
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       600236629947
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       18860233219
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       11828985935
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2470
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       12
[root@opensourceecology ~]#
I'm marking this change as completed successfully. Next up, we replace the other failing disk; see CHG-2025-04-30_replace_hetzner2_sda (linked under See Also below).
2025-04-24 14:23 UTC
The wiki is back!
Unfortunately, hetzner fucked up and removed *both* disks
Dear Client,

we've replaced the drive via hotswap as wished. The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.

Kind regards

Nils Weißer
The result was that /dev/sda was listed as /dev/sdc, the new drive appeared as /dev/sdb, and dmesg was being spammed with I/O and RAID errors. The wiki was down. The disks were read-only, so I couldn't even take backups. I tried to reboot, but even the reboot command failed due to I/O errors.
I used the WUI to trigger a reboot, and, thank god, the server came up again. I immediately took down all the web services while I investigated the damage and triggered a new backup.
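For reference, "taking down all the web services" amounted to stopping the public-facing daemons. A minimal sketch, assuming the usual nginx/varnish/httpd/php-fpm service names on this host (these names are assumptions, not from this ticket; adjust to whatever is actually installed):

# stop public-facing web services while investigating (service names are assumptions)
systemctl stop nginx varnish httpd php-fpm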
I was able to partition the new disk and add it to the RAID. At the time of writing, both the swap and boot partitions are synced (and grub is installed on the new disk), and the root partition is still syncing (currently at 35% and writing at 58 MB/s).
When the backup finished uploading, I put the web services back online and typed this status message.
2025-04-24 10:32 UTC
I finished submitting the request to hetzner to replace the disk for free.
It says we should expect the new disk to be inserted within 2-4 hours. One part of the form said this would happen without downtime, but the (required) checkbox at the bottom said I understand that downtime is required, so that's ambiguous.
2025-04-24 10:22 UTC
Because the RAID didn't consider the disk faulty, the hot remove was initially refused ("Device or resource busy"), so I first had to mark the sdb partitions as failed before removing them
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]# 
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0] sdb2[1](F)
      523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
      33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
      209984640 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]# 
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0]
      523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
      33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
      209984640 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
2025-04-24 10:07 UTC
I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0] sdb2[1]
      523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]# 
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]# 
[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
2025-04-24 10:04 UTC
Starting CHG
2025-04-19 11:49 UTC
Marcin approved this CHG for 2025-04-24 10:00 UTC
2025-04-18 22:15 UTC
Initial Ticket draft created on wiki (WIP)
Change Info
Scheduled Time
This change will take place on 2025-04-24 10:00 UTC
- 2025-04-24 05:00 Kansas City, US
- 2025-04-24 05:00 Guayaquil, EC
https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93
Purpose
This change will physically replace one of our two drives (/dev/sdb = Crucial_CT250MX200SSD1_154410FA4520) on hetzner2.
On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it was unable to recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output says that both of our two redundant disks are going to fail within 24 hours and that we should replace them immediately:
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864
[root@opensourceecology ~]#
Points of Contact
Change being performed by: Michael Altfield
Service owners: Catarina Mota & Marcin Jakubowski
Time Length
We expect at most 4 hours of downtime.
Re-partitioning the new disk, adding it to the RAID, and updating grub should take less than 2 hours.
Rebuilding the RAID1 mirror across the two disks might take a day or more. During this time we'll be vulnerable, as we'll have only one disk (no redundancy). This is made worse by the fact that both of the current disks report they're going to fail within 24 hours.
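For reference, the kernel reports its own rebuild ETA in /proc/mdstat (the finish= field visible in the mdstat output in the Status section above), so the remaining time can be checked at any point:

# show any in-progress resync/recovery, including the kernel's finish= estimate
grep -E -A 2 'resync|recovery' /proc/mdstat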
Systems Impacted
This change impacts hetzner2; every service and website that runs on it will go down.
Staging Test
n/a
Change Steps
First, before we do anything, get the status of the RAID
# verify RAID status
cat /proc/mdstat
Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
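The grep above only confirms that a matching file exists. If desired, a small extension (not part of the original plan; the 10 GB threshold is an arbitrary assumption) can also flag a suspiciously small upload, since rclone ls prints the size in bytes before the filename:

# warn if today's backup is missing or suspiciously small (10 GB threshold is arbitrary)
size_bytes=$(${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d") | head -n1 | awk '{print $1}')
if [ "${size_bytes:-0}" -lt 10000000000 ]; then echo "WARNING: today's backup is missing or too small"; fi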
Sometime in the German morning (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID arrays
# remove all sdb partitions from our software RAID
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
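Note: on a healthy array the hot remove can fail with "Device or resource busy" (this is exactly what happened during execution; see the 10:22 UTC status entry above). In that case, mark each partition as faulty first and then retry the removal:

# mark the partitions as failed so mdadm will allow the hot remove
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3

# then re-run the removal commands above
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3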
Log into the Hetzner WUI https://robot.your-server.de/
Go to the servers page https://robot.hetzner.com/server
- Click the "Support" tab under hetzner2
- Click "Technical"
- Select "Server - Disk Failure"
- Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA4520"
- Select "Free"
- Select "Swap while the system is running"
- Select "As soon as possible"
- In the "Entire SMART log" textarea, enter this:
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
- Click "Send request"
Wait until hetzner confirms that the replacement drive has been inserted
# monitor for I/O events in kernel logs
dmesg -w
After the replacement drive has been inserted, get some info about it
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sda is the same and sdb has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
Before we modify the partition tables of any of our drives, let's make backups
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
Copy the partition table from our old disk to our new disk
# dump the partition table of the first disk and pipe it to the second disk
sfdisk -d /dev/sda | sfdisk /dev/sdb
Tell the kernel to re-read the partition table
# kernel reload of the new partition table
blockdev --rereadpt /dev/sdb
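Optionally (not part of the original plan), the copy can be sanity-checked by diffing the two partition dumps before touching the RAID; the device names and disk identifier will legitimately differ, so normalize them first:

# optional sanity check: the partition layouts should now match
diff <(sfdisk -d /dev/sda | sed -e 's|/dev/sda|DISK|' -e '/^label-id/d') \
     <(sfdisk -d /dev/sdb | sed -e 's|/dev/sdb|DISK|' -e '/^label-id/d')
lsblk /dev/sda /dev/sdb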
Now add the new drive to the RAID array
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb3
Copy our grub configuration and files onto the new disk using `grub-install`
grub-install /dev/sdb
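Assumption (not from the original plan): since hetzner2 runs CentOS 7, the GRUB2 installer binary may be named grub2-install rather than grub-install. If the command above isn't found, the equivalent would be:

# CentOS 7 packages GRUB2, where the installer binary is named grub2-install
grub2-install /dev/sdb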
Execute this command to monitor the status of the RAID replication
while true; do date; cat /proc/mdstat; echo; sleep 300; done
You may need to wait several hours (hopefully less than 1 day) before proceeding.
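If a hands-off wait is preferred, a minimal sketch like this (not part of the original plan) blocks until /proc/mdstat no longer reports a resync or recovery in progress:

# block until no md array reports an active resync/recovery
while grep -qE 'resync|recovery' /proc/mdstat; do
  date
  grep -E -A 2 'resync|recovery' /proc/mdstat
  sleep 300
done
echo "RAID sync complete"
cat /proc/mdstat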
Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected
sudo reboot
Revert Steps
It's not clear this is even possible, but we would have to contact hetzner and ask them to physically remove the new drive and re-install the old one that they just pulled; the software side is sketched below.
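If Hetzner did swap the original disk back in, the software side of the revert would be roughly the reverse of the steps above. A sketch, assuming the old disk reappears as /dev/sdb:

# re-add the old disk's partitions to the arrays and let them resync
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb3
cat /proc/mdstat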
See Also
- Maltfield_Log/2025_Q2 Investigation into failed disks (after db corruption event in April)
- CHG-2025-04-30_replace_hetzner2_sda replacement of the second disk
- List of other CHG "tickets"