CHG-2025-04-24 replace hetzner2 sdb
Revision as of 16:13, 20 April 2025
Status
2025-04-19 11:49 UTC
Marcin approved this CHG for 2025-04-24 10:00 UTC
2025-04-18 22:15 UTC
Initial Ticket draft created on wiki (WIP)
Change Info
Scheduled Time
This change will take place on 2025-04-24 10:00 UTC
- = 2025-04-24 05:00 Kansas City, US
- = 2025-04-24 05:00 Guayaquil, EC
https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93
Purpose
This change will physically replace one of our two disks (/dev/sdb = Crucial_CT250MX200SSD1_154410FA4520) on hetzner2.
On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it was not able to recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output says that both of our two redundant disks are expected to fail within 24 hours, so we should replace them immediately.
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]#
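In the output above, the only attribute flagged in the WHEN_FAILED column is 202 (Percent_Lifetime_Remain), on both disks. A minimal sketch for pulling the flagged attributes out of `smartctl -A` output follows; the function name `smart_failing_attrs` is hypothetical, and it assumes the ten-column attribute table layout shown above.

```shell
# Print every SMART attribute whose WHEN_FAILED column (field 9) is not "-".
# Reads `smartctl -A` output on stdin. Helper name is hypothetical.
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $2 " (" $9 ")" }'
}

# Example usage (run as root on the host):
#   smartctl -A /dev/sdb | smart_failing_attrs
```

Run against either disk's output above, this should list only Percent_Lifetime_Remain, which suggests the health check is failing on wear-out rather than on read or reallocation errors.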
Points of Contact
Change being performed by: Michael Altfield
Service owners: Catarina Mota & Marcin Jakubowski
Time Length
We expect at most 4 hours of downtime.
Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.
Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable, as we'll only have one disk (no redundancy). This is made worse by the fact that both disks currently report that they're going to fail within 24 hours.
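As a rough bound on the rebuild time, the Linux md driver throttles resync between /proc/sys/dev/raid/speed_limit_min and speed_limit_max, whose stock defaults are 1000 and 200000 KiB/s. The sketch below does that arithmetic; the helper name `rebuild_minutes` is hypothetical, and the 250 GB figure is an assumption taken from the drive's model name rather than from the real array sizes.

```shell
# Estimate RAID rebuild time in whole minutes.
#   $1 = amount of data to sync, in KiB
#   $2 = sustained sync speed, in KiB/s
# Helper name is hypothetical; integer arithmetic, rounds down.
rebuild_minutes() {
    echo $(( $1 / $2 / 60 ))
}

# Assumption: 250 GB nominal capacity = 250000000000 bytes = 244140625 KiB
# rebuild_minutes 244140625 1000      # at the default speed_limit_min
# rebuild_minutes 244140625 200000    # at the default speed_limit_max
```

Under those assumptions the result ranges from about 20 minutes at the maximum throttle to roughly 68 hours at the minimum, so the actual duration depends mostly on how busy the disks are during the sync.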
Systems Impacted
This change impacts hetzner2 and every service/website that runs on it will go down.
Staging Test
n/a
Change Steps
First, before we do anything, get the status of the RAID
# verify RAID status
cat /proc/mdstat
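When reading that output, a healthy two-disk mirror shows `[UU]` and a degraded one shows an underscore in place of the missing member (for example `[U_]`). A minimal sketch that lists degraded arrays is below; the function name `md_degraded` is hypothetical, and it assumes the usual /proc/mdstat layout where the status brackets sit on the "blocks" line after the array name.

```shell
# Print the name of every md array whose member-status brackets contain "_".
# Reads /proc/mdstat-style text on stdin. Helper name is hypothetical.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /blocks/ && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}

# Example usage:
#   md_degraded < /proc/mdstat
```

Before the change this should print nothing; if it already lists an array, the RAID is degraded and pulling sdb would be riskier than planned.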
Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep "$(date "+%Y%m%d")"
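The "sane size" part of that check can be made mechanical. `rclone ls` prints one file per line as a size in bytes followed by the path, so a sketch like the following can test both the date stamp and a minimum size. The function name `backup_is_sane` and the threshold you pass are assumptions; pick a minimum based on the size of recent backups.

```shell
# Succeed if stdin (output of `rclone ls`) has a file whose path contains
# the given date stamp and whose size is at least the given number of bytes.
#   $1 = date stamp, e.g. 20250424
#   $2 = minimum acceptable size in bytes (hypothetical threshold)
backup_is_sane() {
    awk -v stamp="$1" -v min="$2" '
        {
            path = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", path)   # strip the size column
            if (index(path, stamp) && $1 + 0 >= min + 0) ok = 1
        }
        END { exit (ok ? 0 : 1) }
    '
}

# Example usage:
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 1000000000 \
#     && echo "backup OK" || echo "backup missing or too small"
```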
At some time in Germany's morning-ish (and also very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array
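# remove all sdb partitions from our software RAID
# (mdadm refuses to remove an active member, so mark each one as failed first;
# confirm which array each partition belongs to in the /proc/mdstat output above)
mdadm /dev/md0 -f /dev/sdb1 -r /dev/sdb1
mdadm /dev/md1 -f /dev/sdb2 -r /dev/sdb2
mdadm /dev/md1 -f /dev/sdb3 -r /dev/sdb3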
Log into the Hetzner WUI https://robot.your-server.de/
Go to the servers page https://robot.hetzner.com/server
- Click the "Support" tab under hetzner2
- Click "Technical"
- Select "Server - Disk Failure"
- Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA4520"
- Select "Free"
- Select "Swap while the system is running"
- Select "As soon as possible"
- In the "Entire SMART log" textarea, enter this:
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
- Click "Send request"
Wait until hetzner confirms that the replacement drive has been inserted
# monitor for I/O events in kernel logs
dmesg -w
After the replacement drive has been inserted, get some info about it
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sda is the same and sdb has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
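The serial comparison can also be scripted rather than eyeballed. The sketch below extracts the `ID_SERIAL_SHORT` property from `udevadm` output; the function name `udev_serial` is hypothetical, and it assumes udev populates `ID_SERIAL_SHORT` for these SATA disks (if it does not, fall back to reading the `ID_SER` lines by hand).

```shell
# Print the short serial number from `udevadm info --query=property` output
# read on stdin. Helper name is hypothetical.
udev_serial() {
    sed -n 's/^ID_SERIAL_SHORT=//p' | head -n 1
}

# Example usage: the old sdb serial is the suffix of the model string
# recorded in this ticket (Crucial_CT250MX200SSD1_154410FA4520).
#   old=154410FA4520
#   new=$(udevadm info --query=property --name sdb | udev_serial)
#   [ -n "$new" ] && [ "$new" != "$old" ] && echo "sdb has been replaced"
```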
Before we modify the partition tables of any of our drives, let's make backups
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir="/var/tmp/chg.${stamp}"
mkdir "${chg_dir}"
chown root:root "${chg_dir}"
chmod 0700 "${chg_dir}"
pushd "${chg_dir}"

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > "${chg_dir}/sda_parttable_mbr.bak"
sfdisk --dump /dev/sdb > "${chg_dir}/sdb_parttable_mbr.bak"

# verify
du -sh "${chg_dir}"/*
Copy the partition table from our old disk to our new disk
# dump the partition table of the first disk and pipe it to the second disk
sfdisk -d /dev/sda | sfdisk /dev/sdb
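After the copy, it is worth confirming that both disks now have the same layout before adding anything to the RAID. A minimal sketch: strip the device name from each `sfdisk -d` dump and compare what is left. The function name `normalize_dump` is hypothetical, and it assumes partition lines in the dump start with `/dev/<disk><number> :`, as they do for sda/sdb style names.

```shell
# Reduce an `sfdisk -d` dump (on stdin) to its partition lines, with the
# device name replaced by a generic "p<N>:" prefix so that dumps from two
# different disks can be compared directly.
#   $1 = disk name without /dev/, e.g. sda
# Helper name is hypothetical.
normalize_dump() {
    sed -n "s|^/dev/$1\([0-9][0-9]*\) *:|p\1:|p"
}

# Example usage:
#   a=$(sfdisk -d /dev/sda | normalize_dump sda)
#   b=$(sfdisk -d /dev/sdb | normalize_dump sdb)
#   [ "$a" = "$b" ] && echo "partition tables match" || echo "MISMATCH"
```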
Tell the kernel to re-read the partition table
# kernel reload of the new partition table
blockdev --rereadpt /dev/sdb
Now add the new drive to the RAID array
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
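Copy our grub configuration and files onto the new disk using `grub-install` (on CentOS 7 this tool is usually packaged as `grub2-install`; use whichever one exists on hetzner2)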
grub-install /dev/sdb
Execute this command to monitor the status of the RAID replication
while true; do date; cat /proc/mdstat; echo; sleep 300; done
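The loop above runs until interrupted. As a variation, the sketch below stops on its own once no array reports a rebuild in progress. The function name `md_syncing` is hypothetical; it assumes that /proc/mdstat shows a `recovery =` or `resync =` progress line (or `resync=DELAYED` / `resync=PENDING`) for as long as any array is still syncing.

```shell
# Succeed while the mdstat text on stdin still shows a rebuild in progress
# or queued. Helper name is hypothetical.
md_syncing() {
    grep -Eq '(recovery|resync) *='
}

# Example usage: print status every 5 minutes, exit when the sync is done.
#   while md_syncing < /proc/mdstat; do
#       date; cat /proc/mdstat; echo; sleep 300
#   done
#   echo "RAID sync finished"; cat /proc/mdstat
```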
You may need to wait several hours (hopefully less than 1 day) before proceeding.
Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected
sudo reboot
Revert Steps
Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.
See Also
- Maltfield_Log/2025_Q2 Last (possible) update to hetzner2
- List of other CHG "tickets"