CHG-2025-04-24 replace hetzner2 sdb


Status

2025-04-24 16:18 UTC

The new disk has been fully in sync with the old (failing) disk since sometime between 15:15 and 15:20 UTC

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
      209984640 blocks super 1.2 [2/1] [U_]
      [===================>.]  recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
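
The `finish=` estimate on the recovery line can be reproduced from the other figures on the same line: remaining blocks divided by the reported speed. A minimal sketch, assuming the /proc/mdstat layout shown above (block counts in 1K units, speed in K/sec); `resync_eta_seconds` is a hypothetical helper name, not part of mdadm.

```shell
# Print the estimated seconds remaining for each array that is rebuilding.
# Usage: resync_eta_seconds [FILE]   (FILE defaults to /proc/mdstat)
resync_eta_seconds() {
    awk '
        /recovery|resync/ {
            s = 0
            for (i = 1; i <= NF; i++) {
                # "(done/total)" field, e.g. (202794752/209984640)
                if ($i ~ /^\([0-9]+\/[0-9]+\)$/) {
                    gsub(/[()]/, "", $i)
                    split($i, p, "/")
                }
                # "speed=NNNK/sec" field
                if ($i ~ /^speed=[0-9]+K\/sec$/) {
                    s = $i
                    gsub(/[^0-9]/, "", s)
                }
            }
            if (s + 0 > 0) printf "%d\n", (p[2] - p[1]) / s
        }
    ' "${1:-/proc/mdstat}"
}
```

For the 15:15:59 sample this gives (209984640 - 202794752) / 46324, or about 155 seconds, which agrees with the reported finish=2.5min.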

Now SMART says /dev/sdb is PASSED and /dev/sda is still FAILED

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# 

Full info

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78516
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3445
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       47
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   060   046   000    Old_age   Always       -       40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       407132499909
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12839097351
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26313144762

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       3
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       52083
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   004   004   000    Old_age   Always       -       1449
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       20
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   061   049   000    Old_age   Always       -       39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   004   004   001    Old_age   Offline      -       96
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       600236629947
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       18860233219
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       11828985935
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2470
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       12

[root@opensourceecology ~]# 
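
In the tables above, the only row with a value in the WHEN_FAILED column is attribute 202 on /dev/sda. A small filter makes such rows easier to spot. This is a sketch that assumes the ten-column `smartctl -A` layout printed above; `failing_attributes` is a hypothetical helper name.

```shell
# Read `smartctl -A` output on stdin and print "ID NAME RAW_VALUE" for every
# attribute whose WHEN_FAILED column (9th field) is not "-".
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $10 }'
}
```

Usage: `smartctl -A /dev/sda | failing_attributes`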

I'm marking this change as completed successfully. Next up, we replace the other failing disk. See CHG-2025-XX-XX_replace_hetzner2_sda

2025-04-24 14:23 UTC

The wiki is back!

Unfortunately, hetzner fucked-up and removed *both* disks

Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.


Kind regards

 Nils Weißer

The result was that /dev/sda was listed as /dev/sdc, the new drive was /dev/sdb, and dmesg was being spammed with I/O and RAID errors. The wiki was down. The disks were read-only, so I couldn't even take backups. I tried to reboot, but even the reboot failed due to I/O errors.

I used the WUI to trigger a reboot, and--thank god--the server came-up again. I immediately took down all the web services as I investigated the damage and triggered a new backup.

I was able to partition the new disk and add it to the RAID. At the time of writing, both swap and boot are synced (and grub is installed on the new disk), and the root partition is still syncing to the new disk (currently at 35%, writing at 58 MB/s).

When the backup finished uploading, I put the web services back online and typed this status message.

2025-04-24 10:32 UTC

I finished submitting the request to Hetzner to replace the disk for free.

It says we should expect the new disk to be inserted in 2-4 hours. One part of the form said this would happen without downtime. But the (required) checkbox at the bottom said that I understand that downtime is required. So that's ambiguous.

2025-04-24 10:22 UTC

Because none of the RAID members were marked faulty, mdadm refused to hot-remove the sdb partitions, so I first had to mark them as failed

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]# 

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0] sdb2[1](F)
      523712 blocks super 1.2 [2/1] [U_]
      
md0 : active raid1 sda1[0] sdb1[1](F)
      33521664 blocks super 1.2 [2/1] [U_]
      
md2 : active raid1 sda3[0] sdb3[1](F)
      209984640 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]# 

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0]
      523712 blocks super 1.2 [2/1] [U_]
      
md0 : active raid1 sda1[0]
      33521664 blocks super 1.2 [2/1] [U_]
      
md2 : active raid1 sda3[0]
      209984640 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]# 
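
The fail-then-remove sequence above could be generated by a loop rather than typed six times. The sketch below only prints the commands so they can be reviewed first. It assumes the layout shown on this page (mdN is built from partition N+1 of each disk); `detach_commands` is a hypothetical helper name.

```shell
# Print the mdadm commands that detach one disk from md0, md1 and md2.
# Nothing is executed unless the output is piped to a shell.
detach_commands() {
    disk=$1                          # e.g. /dev/sdb
    for n in 0 1 2; do
        part="${disk}$((n + 1))"     # md0 -> sdb1, md1 -> sdb2, md2 -> sdb3
        echo "mdadm --manage /dev/md${n} --fail ${part}"
        echo "mdadm --manage /dev/md${n} --remove ${part}"
    done
}
```

Usage: run `detach_commands /dev/sdb` to review the output, then `detach_commands /dev/sdb | sh` to execute it.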

2025-04-24 10:07 UTC

I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0] sdb2[1]
      523712 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0] sdb1[1]
      33521664 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda3[0] sdb3[1]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]# 

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]# 

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]# 

2025-04-24 10:04 UTC

Starting CHG

2025-04-19 11:49 UTC

Marcin approved this CHG for 2025-04-24 10:00 UTC

2025-04-18 22:15 UTC

Initial Ticket draft created on wiki (WIP)

Change Info

Scheduled Time

This change will take place on 2025-04-24 10:00 UTC

  • = 2025-04-24 05:00 Kansas City, US
  • = 2025-04-24 05:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93

Purpose

This change will physically replace one of our two disks (/dev/sdb = Crucial_CT250MX200SSD1_154410FA4520) on hetzner2

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks would fail within 24 hours and that we should replace them immediately

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]# 

Points of Contact

Change being performed by: Michael Altfield

Service owners: Catarina Mota & Marcin Jakubowski

Time Length

We expect at most 4 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

Systems Impacted

This change impacts hetzner2 and every service/website that runs on it will go down.

Staging Test

n/a

Change Steps

First, before we do anything, get the status of the RAID

# verify RAID status
cat /proc/mdstat

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
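
The "sane size" check can be made explicit instead of eyeballed. A minimal sketch, assuming `rclone ls` prints one "SIZE NAME" pair per line as in the status log on this page; `backup_ok` is a hypothetical helper, and the 10 GB floor in the usage line is an arbitrary assumption that should be tuned to the real backup size.

```shell
# Read `rclone ls` output on stdin. Succeed only if some file whose name
# contains DATESTAMP is at least MIN_BYTES large.
# Usage: ... | backup_ok DATESTAMP MIN_BYTES
backup_ok() {
    awk -v d="$1" -v min="$2" '
        index($2, d) && $1 + 0 >= min + 0 { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_ok "$(date "+%Y%m%d")" 10000000000 && echo "backup looks OK"`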

At some time in Germany's morning-ish (and also very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

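# mark all sdb partitions as failed first; mdadm refuses to hot-remove
# an active member ("Device or resource busy")
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
# remove all sdb partitions from our software RAID
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3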

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

  1. Click the "Support" tab under hetzner2
  2. Click "Technical"
  3. Select "Server - Disk Failure"
  4. Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA4520"
  5. Select "Free"
  6. Select "Swap while the system is running"
  7. Select "As soon as possible"
  8. In the "Entire SMART log" textarea, enter this:
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
  9. Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

# monitor for I/O events in kernel logs
dmesg -w

After the replacement drive has been inserted, get some info about it

# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sda is the same and sdb has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
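
The serial comparison can be scripted so that a swapped or mislabeled drive is caught before any partition table is written. This is a sketch under the assumption that udev's `ID_SERIAL` property carries the same model-plus-serial string used elsewhere on this page (Crucial_CT250MX200SSD1_154410FA4520); `serial_of` and `is_new_disk` are hypothetical helper names.

```shell
# Read `udevadm info --query=property` output on stdin and print ID_SERIAL.
serial_of() {
    sed -n 's/^ID_SERIAL=//p'
}

# Succeed if the serial read from stdin is non-empty and differs from OLD.
# Usage: udevadm info --query=property --name sdb | is_new_disk OLD
is_new_disk() {
    old=$1
    new=$(serial_of)
    [ -n "$new" ] && [ "$new" != "$old" ]
}
```

Usage: `udevadm info --query=property --name sdb | is_new_disk Crucial_CT250MX200SSD1_154410FA4520 || echo "sdb is still the old disk"`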

Before we modify the partition tables of any of our drives, let's make backups

# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*

Copy the partition table from our old disk to our new disk

# dump the partition table of the first disk and pipe it to the second disk
sfdisk -d /dev/sda | sfdisk /dev/sdb
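
After the copy, the two partition tables should be identical apart from the device name. A minimal sketch for checking that from two `sfdisk -d` dump files; it assumes the only legitimate differences are the `/dev/sdX` prefixes and, on newer util-linux versions, the `label-id:` and `device:` header lines. `normalize_dump` and `same_layout` are hypothetical helper names.

```shell
# Strip the parts of an `sfdisk -d` dump that legitimately differ per disk.
normalize_dump() {
    sed -e '/^label-id:/d' -e '/^device:/d' -e 's#/dev/sd[a-z]#DISK#g' "$1"
}

# Succeed if both dump files describe the same partition layout.
same_layout() {
    [ "$(normalize_dump "$1")" = "$(normalize_dump "$2")" ]
}
```

Usage: `sfdisk -d /dev/sda > sda.dump; sfdisk -d /dev/sdb > sdb.dump; same_layout sda.dump sdb.dump && echo "layouts match"`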

Tell the kernel to re-read the partition table

# kernel reload of the new partition table
blockdev --rereadpt /dev/sdb

Now add the new drive to the RAID array

# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb3

Copy our grub configuration and files onto the new disk using `grub-install`

# on CentOS 7 this command is normally packaged as grub2-install
grub-install /dev/sdb

Execute this command to monitor the status of the RAID replication

while true; do date; cat /proc/mdstat; echo; sleep 300; done
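
A variant of the loop above can stop by itself once every array is back to full redundancy. The sketch assumes the /proc/mdstat format shown in the status log, where an underscore in the member-status field (for example `[U_]`) marks a missing or rebuilding member; `degraded_arrays` and `wait_for_sync` are hypothetical helper names.

```shell
# Print the name of every md array whose status field contains "_".
# Usage: degraded_arrays [FILE]   (FILE defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /blocks/ && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }
    ' "${1:-/proc/mdstat}"
}

# Poll once a minute and return when no array is degraded.
wait_for_sync() {
    while [ -n "$(degraded_arrays "$@")" ]; do
        date
        sleep 60
    done
    echo "all arrays in sync"
}
```

Usage: `wait_for_sync` (it only reads /proc/mdstat, so it is safe to interrupt and restart).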

You may need to wait several hours (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected

sudo reboot

Revert Steps

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

See Also

  1. Maltfield_Log/2025_Q2 Investigation into failed disks (after db corruption event in April)
  2. CHG-2025-04-30_replace_hetzner2_sda replacement of the second disk
  3. List of other CHG "tickets"

External Links