Revision as of 22:26, 18 April 2025

Status

2025-04-18 22:15 UTC

Initial Ticket draft created on wiki (WIP)

Change Info

Scheduled Time

This change will take place on 2025-??-?? ??:00 UTC

= 2025-??-?? ??:00 Kansas City, US
= 2025-??-?? ??:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20240727T160000&p1=405&p2=1440&p3=93

Purpose

This change will physically replace one of our two HDDs (/dev/sda) on hetzner2.

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database was corrupt and wouldn't start, and a bug in MariaDB prevented it from recovering from the corruption. Because hetzner2 runs EOL CentOS, we can't update MariaDB. While I don't think the corruption was caused by disk failure, the SMART health check reports that both of our two redundant disks are expected to fail within 24 hours, so we should replace them immediately:

[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
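
The health check above can be scripted so that both disks are summarised in one pass. The following is a minimal sketch, not part of the original ticket: the function name `smart_health` is hypothetical, and it assumes the `smartctl -H` output format shown above (a line beginning with `SMART overall-health self-assessment test result:`).

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -H` output on stdin and prints the
# overall-health verdict (e.g. PASSED or FAILED!), or UNKNOWN if no
# verdict line is present.
smart_health() {
    awk -F': ' '
        /^SMART overall-health self-assessment test result:/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Example usage (needs root and smartmontools, so it is left as a comment):
#   for disk in /dev/sda /dev/sdb; do
#       printf '%s: ' "$disk"
#       smartctl -H "$disk" | smart_health
#   done
```

Since the output above also says "No failed Attributes found", the full report from `smartctl -a` may be worth reviewing alongside the one-line verdict.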

Points of Contact

Change being performed by: Michael Altfield

Service owners: Catarina Mota & Marcin Jakubowski

Time Length

We expect at most 4 hours of downtime.

Re-partitioning the new disk, adding it to the RAID, and updating GRUB should take less than 2 hours.
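
The three tasks above could look roughly like the following. This is a dry-run sketch only, not the ticket's actual change steps (those are still TODO): it prints commands rather than running them. The GPT partition table, the single array `/dev/md0` built from partition 1 of each disk, and the function name `plan_disk_replacement` are all assumptions for illustration. The real layout on hetzner2 would need to be checked with `lsblk` and `/proc/mdstat` first.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# ASSUMPTIONS (hypothetical, not from the ticket): GPT partition tables,
# one array /dev/md0 made of partition 1 on each disk, and /dev/sdb as
# the surviving disk that the layout is copied from.
NEW_DISK="${NEW_DISK:-/dev/sda}"
GOOD_DISK="${GOOD_DISK:-/dev/sdb}"

plan_disk_replacement() {
    # 1. Copy the partition table from the surviving disk onto the new
    #    disk, then give the new disk its own random GUIDs.
    echo "sgdisk --replicate=$NEW_DISK $GOOD_DISK"
    echo "sgdisk --randomize-guids $NEW_DISK"
    # 2. Add the new partition to the degraded array (starts the rebuild).
    echo "mdadm --manage /dev/md0 --add ${NEW_DISK}1"
    # 3. Reinstall the bootloader on the new disk (CentOS 7 ships GRUB 2).
    echo "grub2-install $NEW_DISK"
}

# Example usage: review the printed plan before running anything by hand.
#   plan_disk_replacement
```

Printing the plan first gives the commands a chance to be reviewed and pasted into the Change Steps section before anything touches the disks.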

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable, as we'll only have one disk (no redundancy). The risk is heightened because both disks currently report that they will fail within 24 hours.
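
Because the rebuild window is when we are most exposed, it helps to keep an eye on its progress. The sketch below is one hedged way to do that, assuming Linux software RAID (md), which reports rebuild progress in `/proc/mdstat`. The function name `raid_rebuild_progress` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the progress line of any md array that is
# currently recovering or resyncing. Reads /proc/mdstat by default; a
# different file can be passed as the first argument (useful for testing).
raid_rebuild_progress() {
    mdstat="${1:-/proc/mdstat}"
    if [ ! -r "$mdstat" ]; then
        echo "cannot read $mdstat" >&2
        return 1
    fi
    # Progress lines look like:
    #   [=>....]  recovery =  5.3% (123/456) finish=120.5min speed=98765K/sec
    grep -E '(recovery|resync) *=' "$mdstat" || echo "no rebuild in progress"
}

# Example usage (left as a comment): poll once a minute during the rebuild.
#   while sleep 60; do raid_rebuild_progress; done
```

The `finish=` field is the kernel's own estimate of the remaining time, which gives a rough check on the "a day or more" expectation.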

Systems Impacted

This change impacts hetzner2. Every service and website that runs on it will go down during the change.

Staging Test

n/a

Change Steps

TODO

Revert Steps

I'm not sure a revert is even possible. We would have to contact Hetzner and ask them to physically remove the new drive and re-install the old one that they had just removed.

CHG-2025-04-30 replace hetzner2 sda: Difference between revisions
