CHG-2025-04-30 replace hetzner2 sda

Status

2025-04-18 22:15 UTC

Initial Ticket draft created on wiki (WIP)

Change Info

Scheduled Time

This change will take place on 2025-??-?? ??:00 UTC

  • = 2025-??-?? ??:00 Kansas City, US
  • = 2025-??-?? ??:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93

Purpose

This change will physically replace one of our two HDDs (/dev/sda) on hetzner2.

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database was corrupt and wouldn't start, and a bug in MariaDB prevented it from recovering. Because hetzner2 runs EOL CentOS, we can't update MariaDB. While I don't think the corruption was caused by disk failure, the SMART health check reports that both of our two redundant disks are expected to fail within 24 hours, so we should replace them immediately:

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
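
The same check can be scripted so that both disks are polled in one pass. The following is only a minimal sketch and not part of our existing tooling: the function name `smart_health` is hypothetical, and it does nothing more than parse the overall-health line that `smartctl -H` prints for ATA disks, as seen in the output above.

```shell
#!/bin/bash
# Hypothetical helper: reads `smartctl -H` output on stdin and prints
# PASSED, FAILED, or UNKNOWN based on the overall-health line.
smart_health() {
  local line
  # grep exits non-zero when the line is missing; "|| true" keeps that
  # from aborting callers that run with "set -e".
  line=$(grep -m1 'overall-health self-assessment test result:' || true)
  case "$line" in
    *PASSED*) echo PASSED ;;
    *FAILED*) echo FAILED ;;
    *)        echo UNKNOWN ;;
  esac
}

# Usage (as root, on the real host):
#   for d in /dev/sda /dev/sdb; do
#     echo "$d: $(smartctl -H "$d" | smart_health)"
#   done
```

Running `smartctl -A` on each disk as well would show the individual attributes, which may help explain why the overall result is FAILED while "No failed Attributes found" is also reported.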

Points of Contact

Change being performed by: Michael Altfield

Service owners: Catarina Mota & Marcin Jakubowski

Time Length

We expect at most 4 hours of downtime.

Re-partitioning the new disk, adding it to the RAID, and updating GRUB should take less than 2 hours.
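
To give a feel for what those three tasks involve, here is a rough sketch that only prints a candidate command sequence; it executes nothing. It rests on assumptions that have not been verified against hetzner2: that the mirror is Linux software RAID managed by `mdadm`, that the arrays are `md0` to `md2` built from partitions 1 to 3 of each disk, and that the disks use an MBR partition table (on GPT disks, `sgdisk --replicate` would take the place of the `sfdisk` pipe). The function name `print_sda_replacement_plan` is hypothetical.

```shell
#!/bin/bash
# Prints (does not execute) a candidate command sequence for swapping /dev/sda.
# ASSUMPTIONS, not verified against hetzner2: Linux software RAID (mdadm),
# arrays md0..md2 built from partitions 1..3, and an MBR partition table.
print_sda_replacement_plan() {
  local new=/dev/sda good=/dev/sdb n

  echo "# before the physical swap: drop the old disk's members from each array"
  for n in 1 2 3; do
    echo "mdadm /dev/md$((n - 1)) --fail ${new}${n} --remove ${new}${n}"
  done

  echo "# after the swap: copy the partition table from the surviving disk"
  echo "sfdisk -d ${good} | sfdisk ${new}"

  echo "# add the new partitions back; the kernel then starts the RAID1 resync"
  for n in 1 2 3; do
    echo "mdadm /dev/md$((n - 1)) --add ${new}${n}"
  done

  echo "# make the new disk bootable, then watch the rebuild"
  echo "grub2-install ${new}"
  echo "cat /proc/mdstat"
}

print_sda_replacement_plan
```

The real array and partition names should be read from `cat /proc/mdstat` and `lsblk` on the server before any of these commands are run by hand.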

Rebuilding the RAID1 mirror across the two disks might take a day or more. During this time we'll be vulnerable, as we'll have only one disk (no redundancy). The risk is higher than usual because both disks currently report that they will fail within 24 hours.
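
Since the rebuild window is where we are most exposed, it may be worth tracking its progress. The sketch below assumes the mirror is Linux software RAID, so that progress appears in `/proc/mdstat`. The function name `mdstat_progress` is hypothetical; it reads mdstat-formatted text on stdin and prints one line per array that is currently recovering or resyncing.

```shell
#!/bin/bash
# Hypothetical helper: summarizes rebuild progress from /proc/mdstat text.
# Prints "<array> <percent> finish=<eta>" for each array being rebuilt.
mdstat_progress() {
  awk '
    # Remember which array the following lines belong to.
    /^md[0-9]+ :/ { dev = $1 }
    # Progress lines look like:
    #   [>....]  recovery =  0.5% (1048/193607) finish=160.2min speed=...
    /(recovery|resync) *=/ {
      pct = ""; eta = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /%$/)       pct = $i
        if ($i ~ /^finish=/) eta = $i
      }
      sub(/^=/, "", pct)   # the kernel prints "=100.0%" without a space
      print dev, pct, eta
    }
  '
}

# Usage on the real host:
#   mdstat_progress < /proc/mdstat
#   watch -n 60 cat /proc/mdstat
```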

Systems Impacted

This change impacts hetzner2. Every service and website that runs on it will go down.

Staging Test

n/a

Change Steps

TODO


Revert Steps

It's not clear that a revert is even possible. We would have to contact Hetzner and ask them to physically remove the new drive and re-install the old one that they had just removed.

See Also

  1. Maltfield_Log/2025_Q2 Last (possible) update to hetzner2
  2. List of other CHG "tickets"