CHG-2025-04-30 replace hetzner2 sda
Revision as of 22:26, 18 April 2025
Status
2025-04-18 22:15 UTC
Initial Ticket draft created on wiki (WIP)
Change Info
Scheduled Time
This change will take place on 2025-??-?? ??:00 UTC
- = 2025-??-?? ??:00 Kansas City, US
- = 2025-??-?? ??:00 Guayaquil, EC
https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93
Purpose
This change will physically replace one of our two HDDs (/dev/sda) on hetzner2.
On 2025-04-17, a database corruption event took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in MariaDB. Because hetzner2 runs an EOL release of CentOS, we can't update MariaDB. While I don't think the corruption was caused by disk failure, the SMART health check reported that both of our two redundant disks are expected to fail within 24 hours, so we should replace them immediately.
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
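The same health check could be repeated from a script before and after the swap, to confirm which disks still report a failure. The following is only a minimal sketch: `smart_verdict` is a hypothetical helper name (not part of smartmontools), and it assumes the overall-health line has the same wording as in the output above.

```shell
#!/bin/sh
# Sketch: classify the output of `smartctl -H <device>`, read from stdin.
# smart_verdict is a hypothetical helper, not a smartmontools command.
smart_verdict() {
    # Keep only the text that follows "test result:" on the overall-health line.
    result=$(sed -n 's/.*self-assessment test result: *//p')
    case "$result" in
        PASSED*) echo "PASSED" ;;
        FAILED*) echo "FAILED" ;;
        *)       echo "UNKNOWN" ;;   # line missing or in an unexpected format
    esac
}
```

It would be used as `smartctl -H /dev/sda | smart_verdict` (as root). Anything other than `PASSED` should be treated as a reason to look at the full SMART report by hand.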
Points of Contact
Change being performed by: Michael Altfield
Service owners: Catarina Mota & Marcin Jakubowski
Time Length
We expect at most 4 hours of downtime.
Re-partitioning the new disk, adding it to the RAID, and updating GRUB should take less than 2 hours.
Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable, as we'll only have one disk holding the data (no redundancy). The risk is higher than usual because SMART currently reports that both disks are expected to fail within 24 hours.
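Because the rebuild is the long, unprotected part of this change, it may help to watch its progress. The sketch below assumes the mirror is Linux software RAID (md), which this ticket does not state explicitly, and that `/proc/mdstat` uses the usual `recovery = N%` progress line. `md_rebuild_progress` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read /proc/mdstat-style text from stdin and print "<array> <percent>"
# for every md array that is currently recovering or resyncing.
# Assumption: software RAID (md); md_rebuild_progress is a hypothetical helper.
md_rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }            # remember the current array name
        /(recovery|resync) *=/ {                # progress line for that array
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print array, $i  # the field ending in "%"
        }
    '
}
```

It would be used as `md_rebuild_progress < /proc/mdstat`. No output means that no array is rebuilding at that moment, which is expected both before the new disk is added and after the rebuild finishes.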
Systems Impacted
This change impacts hetzner2: every service and website that runs on it will go down during the change.
Staging Test
n/a
Change Steps
TODO
Revert Steps
Not sure if this is even possible, but we would have to contact Hetzner and ask them to physically remove the new drive and re-install the old one that they had just removed.
See Also
- Maltfield_Log/2025_Q2 Last (possible) update to hetzner2
- List of other CHG "tickets"