CHG-2025-04-30 replace hetzner2 sda

From Open Source Ecology

Status

2025-04-30 14:30 UTC

This change was completed successfully

2025-04-30 14:18 UTC

  1. I'm going to double-tap the grub install before giving it a reboot
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]# 
  1. and I rebooted it
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
user@personal:~$ autossh opensourceecology.org
Last login: Wed Apr 30 11:28:26 2025 from REDACTED
[maltfield@opensourceecology ~]$ uptime
 14:17:14 up 1 min,  1 user,  load average: 0.85, 0.24, 0.08
[maltfield@opensourceecology ~]$ 
  1. cool, it came back.
  2. cool, raid looks healthy
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md2 : active raid1 sda3[3] sdb3[2]
	  209984640 blocks super 1.2 [2/2] [UU]
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[3]
	  33521664 blocks super 1.2 [2/2] [UU]
      
md1 : active raid1 sdb2[2] sda2[3]
	  523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
[root@opensourceecology ~]# 
[root@opensourceecology ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk  
├─sda1    8:1    0    32G  0 part  
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sda2    8:2    0   512M  0 part  
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:3    0 200.4G  0 part  
  └─md2   9:2    0 200.3G  0 raid1 /
sdb       8:16   0   477G  0 disk  
├─sdb1    8:17   0    32G  0 part  
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part  
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part  
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]# 
  1. and SMART isn't yelling about failed disks anymore
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#

2025-04-30 14:13 UTC

The RAID sync is finished; I guess these Micron 500G disks have better I/O throughput than our old 250G Crucial disks.

Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1] 
md0 : active raid1 sda1[3] sdb1[2]
	  33521664 blocks super 1.2 [2/1] [_U]
	  [====>................]  recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec
      
md2 : active raid1 sda3[3] sdb3[2]
	  209984640 blocks super 1.2 [2/2] [UU]
	  bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
	  523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1] 
md0 : active raid1 sda1[3] sdb1[2]
	  33521664 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda3[3] sdb3[2]
	  209984640 blocks super 1.2 [2/2] [UU]
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
	  523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
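
Timestamped snapshots like the ones above could be collected with a small polling loop. This is only a sketch: the `sync_in_progress` helper name and the 5-minute interval are assumptions (the interval is inferred from the timestamps above), not something recorded in this ticket.

```shell
# Return 0 while /proc/mdstat-style text on stdin still reports a rebuild,
# e.g. "recovery = 21.2%" or "resync=DELAYED"; return 1 once none remains.
sync_in_progress() {
  grep -Eq '(recovery|resync) *='
}

# Hypothetical usage on the server (not run when this block is sourced):
#   while sync_in_progress < /proc/mdstat; do
#     date -u; cat /proc/mdstat; echo; sleep 300
#   done
```

The helper reads stdin rather than opening /proc/mdstat itself, so it can be tried against saved output before it is trusted on the live server.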

2025-04-30 13:48 UTC

Since we can't get a new drive, I went ahead and added the drive they gave us to the RAID

  1. looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
[root@opensourceecology ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk  
sdb       8:16   0   477G  0 disk  
├─sdb1    8:17   0    32G  0 part  
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part  
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part  
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[2]
	  33521664 blocks super 1.2 [2/1] [_U]
      
md2 : active raid1 sdb3[2]
	  209984640 blocks super 1.2 [2/1] [_U]
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
	  523712 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
[root@opensourceecology ~]# 
  1. I made a backup of the partitions
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]# 
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# 
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0       /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K    /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# 
  1. the sda partition table is empty, which makes sense
  2. I copied the sdb partition table to sda
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk:  /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1          2048  67110912   67108865  fd  Linux raid autodetect
/dev/sda2      67112960  68161536    1048577  fd  Linux raid autodetect
/dev/sda3      68163584 488395120  420231537  fd  Linux raid autodetect
/dev/sda4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]# 
  1. and told the kernel to re-read the partition table
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]# 
  1. and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
[root@opensourceecology chg.20250430_134343]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0   477G  0 disk  
├─sda1    8:1    0    32G  0 part  
├─sda2    8:2    0   512M  0 part  
└─sda3    8:3    0 200.4G  0 part  
sdb       8:16   0   477G  0 disk  
├─sdb1    8:17   0    32G  0 part  
│ └─md0   9:0    0    32G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part  
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 200.4G  0 part  
  └─md2   9:2    0 200.3G  0 raid1 /
[root@opensourceecology chg.20250430_134343]# 

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]# 
  1. cool, that worked. /boot is already done, and it's syncing root (/) now
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sda1[3] sdb1[2]
	  33521664 blocks super 1.2 [2/1] [_U]
		resync=DELAYED
      
md2 : active raid1 sda3[3] sdb3[2]
	  209984640 blocks super 1.2 [2/1] [_U]
	  [=>...................]  recovery =  9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
	  523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
[root@opensourceecology chg.20250430_134343]# 
  1. I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was added first and has already finished syncing
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]# 
  1. as noted in the docs, those warnings can be safely ignored

2025-04-30 13:26 UTC

  1. I got a response back from hetzner 4 minutes later
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
  1. so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?


2025-04-30 13:20 UTC

We're still waiting on hetzner.

Hetzner replaced the drive with one that has already been used for 18,623 hours and, according to SMART, has only 32% of its life left.

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       18623
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   032   032   000    Old_age   Always       -       1030
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       2
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   047   000    Old_age   Always       -       32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   032   032   001    Old_age   Offline      -       68
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       96994281182
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       3059820027
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       31429771271
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2467
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]# 

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.
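
To compare drives without reading the whole table, the two attributes quoted above can be pulled out of `smartctl -A` output. A minimal sketch: `smart_attr` is a hypothetical helper, and reading the VALUE column as "percent of life remaining" is an assumption based on how the outputs in this ticket line up (VALUE 032 on the drive reported as 32%).

```shell
# Print "<normalized VALUE> <RAW_VALUE>" for one attribute, given
# `smartctl -A` output on stdin. "$4 + 0" strips the leading zeros.
smart_attr() {
  awk -v name="$1" '$2 == name { print $4 + 0, $10 }'
}

# Hypothetical usage on the server (not run when this block is sourced):
#   smartctl -A /dev/sda | smart_attr Percent_Lifetime_Remain
#   smartctl -A /dev/sda | smart_attr Power_On_Hours
```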

2025-04-30 11:44 UTC

  1. I confirmed that the RAID is currently healthy
  2. and today's backup (from a few hours ago) is sane and uploaded
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]# 
  1. I confirmed again that /dev/sdb is PASSED and /dev/sda is FAIL
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
  1. I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       3
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       52223
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       46
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   004   004   000    Old_age   Always       -       1452
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       29
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   049   000    Old_age   Always       -       36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   004   004   001    Old_age   Offline      -       96
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       601634812550
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       18904241237
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       11849811867
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2470
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       12

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78658
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       63
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3454
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       56
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   046   000    Old_age   Always       -       38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       408221767008
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12873452848
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26389101858

[root@opensourceecology ~]# 
  1. and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]# 
  1. ok, I'm removing sda from the raid
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]# 
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]# 
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[2]
	  33521664 blocks super 1.2 [2/1] [_U]
      
md2 : active raid1 sdb3[2]
	  209984640 blocks super 1.2 [2/1] [_U]
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
	  523712 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
[root@opensourceecology ~]# 
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]# 
  1. and I submitted the request for support to swap the disk
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.

2025-04-30 11:29 UTC

Starting Change

2025-04-24 17:56 UTC

Marcin approved the start time of this CHG

Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
>     * 13:00 in Germany (where the server lives)
>     * 06:00 here in Ecuador, and
>     * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
>   *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972  644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net

2025-04-24 17:37 UTC

Marcin approved purchasing a new disk for this replacement

Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972  644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net

2025-04-18 22:15 UTC

Initial Ticket draft created on wiki (WIP)

Change Info

Scheduled Time

This change will take place on 2025-04-30 11:00 UTC

  • = 2025-04-30 06:00 Kansas City, US
  • = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

Purpose

This change will physically replace one of our two SSDs (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on hetzner2

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it was not able to recover from the corruption due to a bug in mariadb. And because hetzner2 is EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART log output said both of our two redundant disks are going to fail within 24 hours, so we should replace them immediately.

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]# 

Points of Contact

Change being performed by: Michael Altfield

Service owners: Catarina Mota & Marcin Jakubowski

Time Length

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror across the two disks might take a day or more. During this time we'll be vulnerable, because only one disk will hold a complete copy of our data (no redundancy). This is made worse by the fact that both disks currently report that they will fail within 24 hours.

Systems Impacted

This change impacts hetzner2. Every service and website that runs on it will go down.

Staging Test

n/a

Change Steps

First, before we do anything, get the status of the RAID

# verify RAID status
cat /proc/mdstat
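
Reading /proc/mdstat by eye works, but the check can also be scripted. The following is a minimal sketch, not part of the original change: raid_is_healthy is a hypothetical helper name, and it assumes the usual mdstat layout where a healthy two-disk mirror shows [UU] and a missing member shows up as an underscore (for example [U_] or [_U]).

```shell
# Hypothetical helper (not part of the original change steps).
# Returns 0 if no md array in the given mdstat-format file is missing a member,
# 1 if at least one array is degraded, 2 if the file cannot be read.
# The optional path argument exists only so the check can be tried on a saved copy.
raid_is_healthy() {
  mdstat_file="${1:-/proc/mdstat}"
  [ -r "$mdstat_file" ] || return 2
  # member status looks like [UU] when healthy and [U_] or [_U] when degraded
  if grep -Eq '\[[U_]*_[U_]*\]' "$mdstat_file"; then
    return 1
  fi
  return 0
}

# example usage on the server:
#   raid_is_healthy && echo "RAID healthy" || echo "RAID degraded or unreadable"
```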

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep "$(date "+%Y%m%d")"
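
The grep above only shows that a file with today's date exists; judging whether its size is sane is left to the reader. Here is a rough sketch of how both checks could be combined. backup_present is a hypothetical helper, the minimum size is an assumption that would have to be set to whatever is normal for our backups, and it relies on `rclone ls` printing one "<size in bytes> <path>" entry per line.

```shell
# Hypothetical helper (not part of the original change steps).
# Reads a listing in `rclone ls` format ("<size in bytes> <path>") on stdin.
# Succeeds only if some path contains the date stamp ($1) and that file is
# at least $2 bytes in size.
backup_present() {
  stamp="$1"
  min_bytes="$2"
  awk -v stamp="$stamp" -v min="$min_bytes" '
    {
      # everything after the leading size column is the path
      path = $0
      sub(/^[ \t]*[0-9]+[ \t]+/, "", path)
      if (index(path, stamp) > 0 && ($1 + 0) >= (min + 0)) { found = 1 }
    }
    END { exit(found ? 0 : 1) }
  '
}

# example usage on the server (the 1000000000 byte minimum is an assumption):
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_present "$(date "+%Y%m%d")" 1000000000 \
#     && echo "backup OK" || echo "backup missing or too small"
```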

During the morning in Germany (and very shortly after our daily backup completes), execute these commands to remove the drive from our RAID array

# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
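
Failing the wrong disk here would take out our only good copy, so it can help to generate the six commands from a single disk name and review them before running anything. This is only a sketch: print_remove_cmds is a hypothetical helper, it prints the commands rather than executing them, and it assumes the mapping shown above (md0 ↔ partition 1, md1 ↔ partition 2, md2 ↔ partition 3).

```shell
# Hypothetical helper (not part of the original change steps).
# Prints, but does not run, the mdadm fail/remove commands for one disk.
# Assumes mdN is built from partition N+1, as on this server.
print_remove_cmds() {
  disk="$1"
  # first mark each partition as failed
  for n in 0 1 2; do
    echo "mdadm --manage /dev/md${n} --fail /dev/${disk}$((n + 1))"
  done
  # then remove each failed partition from its array
  for n in 0 1 2; do
    echo "mdadm /dev/md${n} -r /dev/${disk}$((n + 1))"
  done
}

# example usage: review the output, then paste it into the root shell
#   print_remove_cmds sda
```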

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

  1. Click the "Support" tab under hetzner2
  2. Click "Technical"
  3. Select "Server - Disk Failure"
  4. Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
  5. Select "At cost"
  6. Select "Swap while the system is running"
  7. Select "As soon as possible"
  8. In the "Entire SMART log" textarea, enter this:
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 

  9. Click "Send request"

Wait until Hetzner confirms that the replacement drive has been inserted

# monitor for I/O events in kernel logs
dmesg -w

After the replacement drive has been inserted, get some info about it

# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER
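
The serial comparison can be scripted so that we don't have to eyeball two long strings. A minimal sketch follows. Both function names are hypothetical, and it assumes that udev reports an ID_SERIAL property in the same "model_serial" form that we gave Hetzner for the defective drive; that should be confirmed against the real output before relying on it.

```shell
# Hypothetical helpers (not part of the original change steps).

# Print the ID_SERIAL value from `udevadm info --query=property` output on stdin.
serial_from_props() {
  sed -n 's/^ID_SERIAL=//p' | head -n 1
}

# Succeed if the new serial is non-empty and differs from the old one.
serial_changed() {
  old="$1"
  new="$2"
  [ -n "$new" ] && [ "$old" != "$new" ]
}

# example usage on the server (old serial taken from the support request):
#   new_sda=$(udevadm info --query=property --name sda | serial_from_props)
#   serial_changed "Crucial_CT250MX200SSD1_154410FA336C" "$new_sda" \
#     && echo "sda was replaced" || echo "sda serial is unchanged or unreadable"
```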

# verify RAID status
cat /proc/mdstat

Before we modify the partition tables of any of our drives, let's make backups

# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*

Copy the partition table from our old disk to our new disk

# dump the partition table of the remaining disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
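
After the copy, it is worth confirming that both disks really have the same layout before adding anything to the RAID. The sketch below compares two `sfdisk -d` dumps while ignoring the parts that legitimately differ per disk. The function names and the dump file names in the usage comment are hypothetical, and it assumes sdX-style device names (it would need adjusting for nvme-style names).

```shell
# Hypothetical helpers (not part of the original change steps).

# Reduce an `sfdisk -d` dump on stdin to just its layout: drop the per-disk
# header lines, replace the device name with a generic prefix, and strip
# whitespace so column alignment does not matter.
normalize_parttable() {
  sed -E \
    -e '/^(label-id|device):/d' \
    -e '/^# partition table of /d' \
    -e 's#^/dev/[a-z]+([0-9]+)#part\1#' \
    -e 's/[[:space:]]+//g'
}

# Succeed if the two dump files describe the same partition layout.
same_layout() {
  a=$(normalize_parttable < "$1")
  b=$(normalize_parttable < "$2")
  [ "$a" = "$b" ]
}

# example usage on the server (file names are only illustrative):
#   sfdisk -d /dev/sda > "${chg_dir}/sda_after_copy.dump"
#   sfdisk -d /dev/sdb > "${chg_dir}/sdb_after_copy.dump"
#   same_layout "${chg_dir}/sda_after_copy.dump" "${chg_dir}/sdb_after_copy.dump" \
#     && echo "layouts match" || echo "layouts differ"
```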

Tell the kernel to re-read the partition table

# kernel reload of the new partition table
blockdev --rereadpt /dev/sda

Now add the new drive to the RAID array

# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3

Copy our grub configuration and files onto the new disk using `grub2-install`

grub2-install /dev/sda

Execute this command to monitor the status of the RAID replication

while true; do date; cat /proc/mdstat; echo; sleep 300; done
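
The loop above runs until we interrupt it. As an alternative, here is a sketch of a loop that stops on its own once the kernel no longer reports a rebuild. raid_sync_in_progress is a hypothetical helper, and it assumes that /proc/mdstat shows a line containing "recovery =" or "resync =" (or "resync=DELAYED") while an array is still syncing.

```shell
# Hypothetical helper (not part of the original change steps).
# Returns 0 while the given mdstat-format file shows a rebuild that is running
# or queued, 1 when none is shown, 2 if the file cannot be read.
raid_sync_in_progress() {
  mdstat_file="${1:-/proc/mdstat}"
  [ -r "$mdstat_file" ] || return 2
  if grep -Eq '(recovery|resync)[[:space:]]*=' "$mdstat_file"; then
    return 0
  fi
  return 1
}

# example usage on the server:
#   while raid_sync_in_progress; do date; cat /proc/mdstat; echo; sleep 300; done
#   echo "no sync in progress"; cat /proc/mdstat
```

Even after the loop exits, /proc/mdstat should still be checked to confirm that every array shows [UU] before rebooting.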

You may need to wait several hours (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

sudo reboot

Revert Steps

Reverting may not be possible. We would have to contact Hetzner and ask them to physically remove the new drive and re-install the old one that they just removed.

See Also

  1. Maltfield_Log/2025_Q2
  2. CHG-2025-04-24_replace_hetzner2_sdb
  3. List of other CHG "tickets"

External Links