Revision as of 16:13, 20 April 2025

Status

2025-04-19 11:49 UTC

Marcin approved this CHG for 2025-04-24 10:00 UTC

2025-04-18 22:15 UTC

Initial Ticket draft created on wiki (WIP)

Change Info

Scheduled Time

This change will take place on 2025-04-24 10:00 UTC

= 2025-04-24 05:00 Kansas City, US
= 2025-04-24 05:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93
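To double-check the local times listed above, the conversion can be done from a shell. This is a minimal sketch, assuming GNU date and an installed tz database; the helper name to_local is hypothetical, and America/Chicago is assumed to be the zone that covers Kansas City.

```shell
# to_local ZONE TIMESTAMP: print TIMESTAMP converted to ZONE.
# Hypothetical helper; assumes GNU date (for -d) and an installed tz database.
to_local() {
    TZ="$1" date -d "$2" '+%Y-%m-%d %H:%M'
}

# Example:
#   to_local America/Chicago   '2025-04-24 10:00 UTC'
#   to_local America/Guayaquil '2025-04-24 10:00 UTC'
```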

Purpose

This change will physically replace one of our two disks (/dev/sdb = Crucial_CT250MX200SSD1_154410FA4520) on hetzner2.

On 2025-04-17, a database corruption event took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and a bug in mariadb prevented it from recovering from the corruption. Because hetzner2 runs an EOL release of CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART health check reports that both of our two redundant disks are expected to fail within 24 hours, so we should replace them immediately.

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]#
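In both attribute tables above, the only row with anything other than "-" in the WHEN_FAILED column is 202 Percent_Lifetime_Remain. A minimal sketch for picking such rows out of the output automatically; the function name failing_attrs is hypothetical, and it assumes the ten-column layout shown above.

```shell
# failing_attrs: read `smartctl -A` output on stdin and print the ID and name
# of every attribute row whose WHEN_FAILED column (9th field) is not "-".
# Hypothetical helper; assumes the ten-column table layout shown above.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2 }'
}

# Example (on the host):
#   smartctl -A /dev/sdb | failing_attrs
```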

Points of Contact

Change being performed by: Michael Altfield

Service owners: Catarina Mota & Marcin Jakubowski

Time Length

We expect at most 4 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable, as we'll only have one disk (no redundancy). This risk is heightened because both disks currently report that they will fail within 24 hours.

Systems Impacted

This change impacts hetzner2; every service and website that runs on it will go down.

Staging Test

n/a

Change Steps

First, before we do anything, get the status of the RAID

# verify RAID status
cat /proc/mdstat
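When reading /proc/mdstat, a healthy two-disk mirror shows a status field like [UU], while a missing member shows up as an underscore (for example [U_]). A minimal sketch that lists any degraded arrays; the function name degraded_arrays is hypothetical.

```shell
# degraded_arrays: read /proc/mdstat text on stdin and print the name of each
# md array whose member-status field (e.g. [U_]) shows a missing member.
# Hypothetical helper, not part of mdadm.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }'
}

# Example (on the host):
#   degraded_arrays < /proc/mdstat
```

Before the change this should print nothing; after the sdb partitions are removed it should list every array.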

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
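The "sane size" check can also be made explicit instead of eyeballed. A minimal sketch, assuming the usual `rclone ls` output of a size in bytes followed by the path; the function name backup_present and the threshold used in the example are hypothetical and should be set from the size of recent backups.

```shell
# backup_present STAMP MIN_BYTES: read `rclone ls` output (size, then path) on
# stdin; succeed only if some line containing STAMP has a size of at least
# MIN_BYTES. Hypothetical helper.
backup_present() {
    awk -v stamp="$1" -v min="$2" '
        index($0, stamp) && $1 + 0 >= min + 0 { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Example (the 1000000000-byte threshold is an arbitrary placeholder):
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_present "$(date +%Y%m%d)" 1000000000 \
#     && echo "backup OK" || echo "backup MISSING or too small"
```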

During the morning in Germany, and very shortly after our daily backups complete, execute these commands to remove the drive from our RAID array

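# mark each sdb partition as failed, then remove it from our software RAID
# (mdadm refuses to remove a member that is still active)
# confirm the md device for each partition against /proc/mdstat first
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md2 --fail /dev/sdb3 --remove /dev/sdb3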

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

Click the "Support" tab under hetzner2
Click "Technical"
Select "Server - Disk Failure"
Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA4520"
Select "Free"
Select "Swap while the system is running"
Select "As soon as possible"
In the "Entire SMART log" textarea, enter this:

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

# monitor for I/O events in kernel logs
dmesg -w

After the replacement drive has been inserted, get some info about it

# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sda is the same and sdb has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
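The serial-number comparison can be done mechanically rather than by eye. A minimal sketch, assuming udev exposes an ID_SERIAL_SHORT property for these disks; the function name serial_of is hypothetical.

```shell
# serial_of: read `udevadm info --query=property` output on stdin and print
# the value of the ID_SERIAL_SHORT property. Hypothetical helper.
serial_of() {
    sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Example (on the host):
#   udevadm info --query=property --name sdb | serial_of
```

Recording this value for sdb before the swap and comparing it afterwards confirms that the drive was actually replaced.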

Before we modify the partition tables of any of our drives, let's make backups

# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir="/var/tmp/chg.${stamp}"
mkdir "${chg_dir}"
chown root:root "${chg_dir}"
chmod 0700 "${chg_dir}"
pushd "${chg_dir}"

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*

Copy the partition table from our old disk to our new disk

# dump the partition table of the first disk and pipe it to the second disk
sfdisk -d /dev/sda | sfdisk /dev/sdb
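After copying, it is worth confirming that both disks now carry the same layout. A minimal sketch that strips the device name from each partition line of an sfdisk dump so the two dumps can be compared; the function name layout_of is hypothetical.

```shell
# layout_of DEV: read an `sfdisk -d /dev/DEV` dump on stdin and print only the
# partition lines, with the leading "/dev/DEV" removed, so that dumps from two
# different disks can be compared directly. Hypothetical helper.
layout_of() {
    grep "^/dev/$1" | sed "s#^/dev/$1##"
}

# Example (on the host, bash):
#   diff <(sfdisk -d /dev/sda | layout_of sda) <(sfdisk -d /dev/sdb | layout_of sdb) \
#     && echo "partition layouts match"
```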

Tell the kernel to re-read the partition table

# kernel reload of the new partition table
blockdev --rereadpt /dev/sdb

Now add the new drive to the RAID array

# add all of the new disk's partitions to the software RAID
# (confirm the md device for each partition against /proc/mdstat first)
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb3

Install the grub boot loader onto the new disk. On CentOS 7 the installer command is named `grub2-install`

grub2-install /dev/sdb
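Once the boot loader has been installed, a quick sanity check is to look for the boot code in the first sector of the new disk. A minimal sketch, assuming a BIOS/MBR setup where GRUB's boot sector contains the ASCII string "GRUB", and GNU grep for the -a option; the function name has_grub_mbr is hypothetical.

```shell
# has_grub_mbr PATH: succeed if the first 512-byte sector of PATH (a disk
# device or an image file) contains the ASCII string "GRUB".
# Hypothetical helper; assumes GNU grep (-a treats binary input as text).
has_grub_mbr() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -aq 'GRUB'
}

# Example (on the host):
#   has_grub_mbr /dev/sdb && echo "GRUB found on sdb" || echo "no GRUB on sdb"
```

This only shows that boot code is present; the test reboot later in the change is still the real check.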

Execute this command to monitor the status of the RAID replication

while true; do date; cat /proc/mdstat; echo; sleep 300; done

You may need to wait several hours (hopefully less than 1 day) before proceeding.
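Rather than watching the loop by hand, the wait can end on its own once /proc/mdstat no longer reports a rebuild. A minimal sketch; the function name sync_running is hypothetical, and it assumes a rebuild in progress is reported with the word "recovery" or "resync".

```shell
# sync_running: read /proc/mdstat text on stdin; succeed while any line
# mentions a recovery or resync (including delayed or pending ones).
# Hypothetical helper.
sync_running() {
    grep -Eq 'recovery|resync'
}

# Example (on the host):
#   while sync_running < /proc/mdstat; do date; cat /proc/mdstat; echo; sleep 300; done
#   echo "RAID sync complete"
```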

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

sudo reboot

Revert Steps

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

External Links

https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

CHG-2025-04-24 replace hetzner2 sdb: Difference between revisions