Open Source Ecology - User contributions [en]

Google Workspace

2026-02-03T17:23:25Z

Maltfield: paragraph formatting

As a legally-registered NGO (non-profit in the US), Open Source Ecology has a free Google Workspace account.

Note that Google Workspace is also known as:

# Google Apps (or gapps) and
# Google Suite (or gsuite)

= Why? =

Google Workspace lets us create Google accounts with a username on the <code>@opensourceecology.org</code> domain. For example, when OSE users manage email, we can do so from a gmail-like UI. While we have access to numerous apps in Google Workspace, OSE specifically makes heavy use of the following apps:

# Google Mail
# Google Calendar
# Google Docs
# Google Drive
# Google Meet
# Google Groups
# Google Slides
# etc

=Google Groups=

OSE uses (internal-only) Google Groups for creating one-to-many email lists (a designated email account that reaches the inbox of many people at OSE).

We use Google Groups for this because:

# Google doesn't support the concept of "shared accounts"<ref>https://support.google.com/a/answer/33330?hl=en</ref>
# Google may lock you out of your account if their anomaly detection system thinks the account is being shared<ref>https://support.google.com/a/answer/6002699?hl=en&ref_topic=2759193&sjid=814912251894340756-EU#zippy=%2Cwhen-does-google-consider-a-sign-in-attempt-suspicious</ref>
# Google won't let you turn off their "suspicious login" feature that locks you out of your own account -- even if their system is faulty and blocks you from logging in after you entered the correct password<ref>https://knowledge.workspace.google.com/kb/how-to-disable-login-challenge-security-method-permanently-000007696</ref>
# Google doesn't let you forward mail from one account to many accounts

If you want to create a one-to-many email address (eg <code>tractor-team@opensourceecology.org</code>) for which there are many recipients, the way to do this in Google Workspace is to create a "Google Group".

== System Alerts ==

System alerts are a good example of why this matters. In September 2024, OSE nearly lost all of its backup data (on [[Backblaze]]) due to a few missed payments (amounting to <$10) because our bank false-positive blocked the transaction as "suspicious". The issue was exacerbated by the fact that our backblaze-specific email address (which received many, many "payment failed" alerts) was not being forwarded to the email inboxes of Marcin (or anyone else).

For security reasons, it's always better to use services that ''don't'' use shared logins. If possible, create one user account per person and grant that user account access to the OSE account. Unfortunately, this isn't possible with many services -- and we're forced to use one shared account.

For more flexibility and security, rather than signing up for a service directly with an address that belongs to a Google Group (eg <code>some-google-group-list@opensourceecology.org</code>), we create a new, service-specific user account. Then you can [1] forward all of that account's mail to a Google Group and [2] grant other users access to that account's mail.

To set up email forwarding, log in as the <code>service-specific-shared-account@opensourceecology.org</code> account in Gmail. Click on the settings "gear icon" in the top-right of the webpage. Click on the "Forwarding and POP/IMAP" tab. Under the "Forwarding" section, enter the email address of the Google Group. Make sure to check the correct radio button that says "Forward a copy of incoming mail to ..." and also leave the drop-down set to "keep ... copy in the inbox". This will ensure that, even if the Google Group gets moved or deleted in the future, all of the mail for this specific account will be retained in Gmail. Finally, click "Save Changes".

To grant Marcin or anyone else access to this new service-specific account's mail, log in as the account in Gmail. Click on the settings "gear icon" in the top-right of the webpage. Click on the "Accounts" tab. Under the "Grant access to your account" section, click "Add an account" and enter the email address of the person (eg Marcin) who should be able to read and write mail on behalf of this user.

{{Warning|Please note that "reset password" functionality usually works by sending a link to a user's email address, so we should assume that '''anyone either on the Google Groups list or under the "Grant access to your account" list will be able to log in''' to these services, '''even if they don't have the account password'''. So please only ever put trusted users on these lists.
}}

= Why can't I login? =

The best way to avoid lockout issues on Google is to use a [https://tech.michaelaltfield.net/2026/02/03/single-site-browser-firejail-proxychains/ persistent single-site browser]. For more info, see:

* https://tech.michaelaltfield.net/2026/02/03/single-site-browser-firejail-proxychains/

Unfortunately, Google employs an infamously faulty anomaly detection system<ref>https://support.google.com/a/answer/6002699?hl=en&ref_topic=2759193&sjid=814912251894340756-EU#zippy=%2Cwhen-does-google-consider-a-sign-in-attempt-suspicious</ref> that may false-positive on a "suspicious login" and lock you out of your own account -- even when you entered the correct password on the first try. Google is aware of the issue but refuses to let Google Workspace admins (or individual users) disable this broken "feature" for their accounts, even if it causes more harm than good<ref>https://knowledge.workspace.google.com/kb/how-to-disable-login-challenge-security-method-permanently-000007696</ref>.

If this happens, try enabling 2FA (with TOTP) in your account. It ''should'' prevent Google from locking you out of your own account, even if you enter the correct password on the first try.
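
For background, a TOTP code is derived from a shared secret and the current time, as described in RFC 6238. The following is a minimal sketch using only the Python standard library. The function name <code>totp</code> and its parameters are illustrative assumptions, not anything specific to Google or OSE, and in practice you would use an authenticator app rather than this code.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, for_time=None, digits=6, step=30):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for a raw byte secret."""
    if for_time is None:
        for_time = time.time()
    # Number of whole time steps since the Unix epoch
    counter = int(for_time) // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Because the code depends only on the secret and the clock, it does not change with the network or browser you log in from.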

Of course, you need to log in before you can add 2FA to your account. To bypass the lockout, ask an OSE member with Admin access to Google Workspace to temporarily turn off "two step authentication" (which is a distinct Google concept from "two factor authentication") as follows:

# Log into the admin.google.com panel
# Click Directory -> Users
# Click on your username
# Click on the "Security" tab
# Scroll down to "Login challenge" and click the "TURN OFF FOR 10 MINS" button <ref>https://knowledge.workspace.google.com/kb/how-to-turn-off-2-step-verification-for-specific-users-000007496</ref>

Now you should be able to log in and set up 2FA with TOTP to prevent this from happening again.

=References=
{{reflist}}

[[Category: IT Infrastructure]]
[[Category: Software]]

Google Workspace

2026-02-03T17:22:43Z

Maltfield: add link to article with more background info, and a solution to prevent google lockouts

As a legally-registered NGO (non-profit in the US), Open Source Ecology has a free Google Workspace account.

Note that Google Workspace is also known as:

# Google Apps (or gapps) and
# Google Suite (or gsuite)

= Why? =

Google Workspace lets us create Google accounts with a username on the <code>@opensourceecology.org</code> domain. For example, when OSE users manage email, we can do so from a gmail-like UI. While we have access to numerous apps in Google Workspace, OSE specifically makes heavy use of the following apps:

# Google Mail
# Google Calendar
# Google Docs
# Google Drive
# Google Meet
# Google Groups
# Google Slides
# etc

=Google Groups=

OSE uses (internal-only) Google Groups for creating one-to-many email lists (a designated email account that reaches the inbox of many people at OSE).

We use Google Groups for this because:

# Google doesn't support the concept of "shared accounts"<ref>https://support.google.com/a/answer/33330?hl=en</ref>
# Google may lock you out of your account if their anomaly detection system thinks the account is being shared<ref>https://support.google.com/a/answer/6002699?hl=en&ref_topic=2759193&sjid=814912251894340756-EU#zippy=%2Cwhen-does-google-consider-a-sign-in-attempt-suspicious</ref>
# Google won't let you turn off their "suspicious login" feature that locks you out of your own account -- even if their system is faulty and blocks you from logging in after you entered the correct password<ref>https://knowledge.workspace.google.com/kb/how-to-disable-login-challenge-security-method-permanently-000007696</ref>
# Google doesn't let you forward mail from one account to many accounts

If you want to create a one-to-many email address (eg <code>tractor-team@opensourceecology.org</code>) for which there are many recipients, the way to do this in Google Workspace is to create a "Google Group".

== System Alerts ==

System alerts are a good example of why this matters. In September 2024, OSE nearly lost all of its backup data (on [[Backblaze]]) due to a few missed payments (amounting to <$10) because our bank false-positive blocked the transaction as "suspicious". The issue was exacerbated by the fact that our backblaze-specific email address (which received many, many "payment failed" alerts) was not being forwarded to the email inboxes of Marcin (or anyone else).

For security reasons, it's always better to use services that ''don't'' use shared logins. If possible, create one user account per person and grant that user account access to the OSE account. Unfortunately, this isn't possible with many services -- and we're forced to use one shared account.

For more flexibility and security, rather than signing up for a service directly with an address that belongs to a Google Group (eg <code>some-google-group-list@opensourceecology.org</code>), we create a new, service-specific user account. Then you can [1] forward all of that account's mail to a Google Group and [2] grant other users access to that account's mail.

To set up email forwarding, log in as the <code>service-specific-shared-account@opensourceecology.org</code> account in Gmail. Click on the settings "gear icon" in the top-right of the webpage. Click on the "Forwarding and POP/IMAP" tab. Under the "Forwarding" section, enter the email address of the Google Group. Make sure to check the correct radio button that says "Forward a copy of incoming mail to ..." and also leave the drop-down set to "keep ... copy in the inbox". This will ensure that, even if the Google Group gets moved or deleted in the future, all of the mail for this specific account will be retained in Gmail. Finally, click "Save Changes".

To grant Marcin or anyone else access to this new service-specific account's mail, log in as the account in Gmail. Click on the settings "gear icon" in the top-right of the webpage. Click on the "Accounts" tab. Under the "Grant access to your account" section, click "Add an account" and enter the email address of the person (eg Marcin) who should be able to read and write mail on behalf of this user.

{{Warning|Please note that "reset password" functionality usually works by sending a link to a user's email address, so we should assume that '''anyone either on the Google Groups list or under the "Grant access to your account" list will be able to log in''' to these services, '''even if they don't have the account password'''. So please only ever put trusted users on these lists.
}}

= Why can't I login? =

Unfortunately, Google employs an infamously faulty anomaly detection system<ref>https://support.google.com/a/answer/6002699?hl=en&ref_topic=2759193&sjid=814912251894340756-EU#zippy=%2Cwhen-does-google-consider-a-sign-in-attempt-suspicious</ref> that may false-positive on a "suspicious login" and lock you out of your own account -- even when you entered the correct password on the first try. Google is aware of the issue but refuses to let Google Workspace admins (or individual users) disable this broken "feature" for their accounts, even if it causes more harm than good<ref>https://knowledge.workspace.google.com/kb/how-to-disable-login-challenge-security-method-permanently-000007696</ref>.

If this happens, try enabling 2FA (with TOTP) in your account. It ''should'' prevent Google from locking you out of your own account, even if you enter the correct password on the first try.

Of course, you need to log in before you can add 2FA to your account. To bypass the lockout, ask an OSE member with Admin access to Google Workspace to temporarily turn off "two step authentication" (which is a distinct Google concept from "two factor authentication") as follows:

# Log into the admin.google.com panel
# Click Directory -> Users
# Click on your username
# Click on the "Security" tab
# Scroll down to "Login challenge" and click the "TURN OFF FOR 10 MINS" button <ref>https://knowledge.workspace.google.com/kb/how-to-turn-off-2-step-verification-for-specific-users-000007496</ref>

Now you should be able to log in and set up 2FA with TOTP to prevent this from happening again.

The best way to avoid lockout issues on Google is to use a [https://tech.michaelaltfield.net/2026/02/03/single-site-browser-firejail-proxychains/ persistent single-site browser]. For more info, see:

* https://tech.michaelaltfield.net/2026/02/03/single-site-browser-firejail-proxychains/

=References=
{{reflist}}

[[Category: IT Infrastructure]]
[[Category: Software]]

OSE Piping Workbench

2025-09-16T21:39:49Z

Maltfield: prevent error if ~/.FreeCAD doesn't exist yet

{{Hint|See Workbench Source Code at '''[[PVC_Pipe_and_Fittings_Library#OSE_Piping_Workbench]]'''}}

=Introduction=
The OSE piping workbench is a FreeCAD workbench with pipes and fittings. It creates pipes and fittings using the FreeCAD Part workbench and [https://github.com/oddtopus/flamingo Flamingo].

[[File:OsePiningWorkbenchScreenshot.png | 512px]]

= Installation =
On a Linux system:
<pre>
$ mkdir -p ~/.FreeCAD/Mod
$ cd ~/.FreeCAD/Mod
$ git clone https://github.com/rkrenzler/ose-piping-workbench.git
</pre>

[[File:check.png]] Command line instructions work on Ubuntu 16.04

Hint: For those new to Linux, always remember that Linux is case sensitive.
<code>mkdir -p ~/.FreeCAD/Mod</code> creates the <code>Mod</code> directory inside <code>~/.FreeCAD</code>. This directory might already exist, and that is fine.

=Pipes=

The dimensions of the PVC pipes can be found here [[PVC_Pipe]].
See also Wikipedia on [https://en.wikipedia.org/wiki/Nominal_Pipe_Size Nominal Pipe Size (NPS)].

A pipe is described by its outer diameter '''OD''', its wall thickness '''Thk''' and its height<ref>We use height instead of length in order to make a pipe similar to a FreeCAD cylinder. This particular choice of pipe dimensions makes it more compatible with pipes from the Flamingo workbench.</ref> '''H'''.
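
To illustrate how these three dimensions relate, here is a minimal sketch in plain Python (it does not use FreeCAD or the workbench code). The function names are hypothetical, and the only relation assumed is that the inner diameter equals OD minus twice the wall thickness.

```python
import math


def pipe_inner_diameter(od, thk):
    """Inner diameter of a pipe with outer diameter od and wall thickness thk."""
    if thk <= 0 or 2 * thk >= od:
        raise ValueError("wall thickness must be positive and less than OD / 2")
    return od - 2 * thk


def pipe_material_volume(od, thk, h):
    """Volume of pipe material: outer cylinder minus inner cylinder."""
    inner = pipe_inner_diameter(od, thk)
    return math.pi / 4.0 * (od ** 2 - inner ** 2) * h
```

All three arguments must use the same length unit; the volume is then in that unit cubed.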

To create a pipe, click [[File:CreatePipe.svg]] in OSE piping workbench. Select pipe dimensions and click "OK".

[[File:create-pipe-screenshot.png| 512px]]

To add new dimensions, edit '''pipe.csv''' in the ''tables'' directory within the workbench directory.

=Elbows=

An elbow is described by an angle alpha, outer pipe diameter POD, inner pipe diameter PID, H, J, M.

To create an elbow, click [[File:CreateElbow.svg]] in OSE piping workbench.

[[File:create-elbow-screenshot.png|512px]]
[[File:create-elbow-cad-screenshot.png|thumb]]

To add new elbows, adjust '''elbow.csv''' in ''tables'' directory within workbench directory.

=Sweep Elbows=

A sweep elbow is a special elbow with a larger radius of the bent part. It is described by outer pipe diameter POD, pipe thickness PThk, G, H, and M.
To create a sweep elbow, click [[File:CreateSweepElbow.svg]].

[[File:create-sweep-elbow-screenshot.png|512px]]
[[File:create-sweep-elbow-cad-screenshot.png|thumb]]

To add new sweep elbows, adjust '''sweep-elbow.csv''' in ''tables'' directory within workbench directory.

=Couplings=

A (general) coupling is described by dimensions: POD, POD1, PID, PID1, L, M, M1, N. The dimensions POD1 and PID1 are not from an official specification.
They are derived from pipe size and schedule. In a reducer coupling, the pipe dimensions on one side (POD and PID) differ from those on the other side (POD1 and PID1).

To create a coupling, click [[File:CreateCoupling.svg]] in OSE piping workbench.

[[File:create-coupling-screenshot.png|512px]]
[[File:create-coupling-cad-screenshot.png|thumb]]

To add new couplings, adjust '''coupling.csv''' in ''tables'' directory within workbench directory.

=Bushings=
{{Hint|Correction needed from octagonal shape to hex shape bushing flange, as bushings like bolts are hexagonal.}}
A bushing is described by dimensions N, L and pipe dimensions. As pipe dimensions we use POD, PID1, and POD1.

To create a bushing, click [[File:CreateBushing.svg]] in OSE piping workbench.

[[File:create-bushing-screenshot.png|512px]]
[[File:create-bushing-cad-screenshot.png|thumb]]

To add a new bushing to the part list, adjust '''bushing.csv''' in ''tables'' directory within workbench directory.

=Tees=

A tee is described by parameters G, G1, H, H1, M, M1, and pipe dimensions. As pipe dimensions we use POD, POD1, PID, and PID1.

To create a tee, click [[File:CreateTee.svg]] in OSE piping workbench.

[[File:create-tee-screenshot.png|512px]]
[[File:create-tee-cad-screenshot.png|thumb]]

To add a new tee to the part list, adjust '''tee.csv''' in ''tables'' directory within workbench directory.

=Crosses=

A cross is described by parameters G, G1, H, H1, L, L1, M, M1, and pipe dimensions. As pipe dimensions we use POD, POD1, PThk, and PThk1.

To create a cross, click [[File:CreateCross.svg]] in OSE piping workbench.

[[File:create-cross-screenshot.png|512px]]
[[File:create-cross-cad-screenshot.png|thumb]]

To add a new cross to the part list, adjust '''cross.csv''' in ''tables'' directory within workbench directory.

=Corners=

A corner is described by dimensions G, H, M and pipe dimensions. As pipe dimensions we use POD and PID.

To create a corner, click [[File:CreateCorner.svg]] in OSE piping workbench.

[[File:create-corner-screenshot.png|512px]]
[[File:create-corner-cad-screenshot.png|thumb]]

To add a new corner to the part list, adjust '''corner.csv''' in ''tables'' directory within workbench directory.
=Customization=
The dimensions of the fittings are saved in [https://en.wikipedia.org/wiki/Comma-separated_values CSV files].
If you want to add new dimensions or change old ones, modify these CSV files.

The CSV files are in <code>~/.FreeCAD/Mod/ose-piping-workbench/tables</code>. The columns are separated by commas ",". Always keep this format.

To modify CSV files with LibreOffice Calc follow these steps:

# Open CSV file in LibreOffice Calc. Calc must correctly detect the column-separator "Comma". If it does not, check "Comma" manually. Click OK.<p> [[File:calc-imports-csv.png]]</p>
# Now you can add, remove and modify dimensions of the fittings. Each row of the table must contain a '''unique''' part number and dimensions. You do not need to specify every dimension. To find out which dimensions are mandatory for a particular part, click on the button for this part in the OSE piping workbench. The dialog will tell you which dimensions are mandatory.<p>[[File:piping-workbench-mandatory-dimensions.png]]</p>
# Save the CSV file. Calc will ask you which format to use.<p> [[File:calc-store-csv.png]]</p> Select "Use Text CSV Format"
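
Since every row needs a unique part number, it can help to check a table for duplicates after editing it. The following is a minimal sketch using Python's standard <code>csv</code> module. The column name <code>PartNumber</code> is an assumption made for illustration; pass the actual header used in your CSV file if it differs. The function name is hypothetical and is not part of the workbench.

```python
import csv
import io
from collections import Counter


def find_duplicate_part_numbers(csv_text, column="PartNumber"):
    """Return a sorted list of part numbers that occur more than once.

    `column` is the header of the part-number column (assumed name).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError("column not found in CSV header: " + column)
    # Count each part number, ignoring surrounding whitespace
    counts = Counter((row[column] or "").strip() for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)
```

To check a real table, read the file contents first (for example with <code>open(path, newline="").read()</code>) and pass the text to the function. An empty list means no duplicates were found.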

=Programming=
* [https://www.freecadweb.org/wiki/Scripted_objects FreeCAD scripted object]
* It should be possible to represent the object with "classic" FreeCAD forms like cylinders, spheres, sweeping objects ...
* It should be possible to use solids.
* The main purpose is to create tools for moving, rotations, and fittings.

=Documentation=

==Programming==
* [https://www.freecadweb.org/wiki/Scripted_objects FreeCAD scripted objects]
* [https://forum.freecadweb.org/viewtopic.php?f=8&t=27641&sid=0f829d3bd056ec5add5407879796451a Forum entry on freecadweb.org]

== Remarks about the coupling code ==

To create a simple coupling or a reducer, we internally use a more general coupling.
This general coupling is described by 9 dimensions: POD, PID, POD1, PID1, X1, X2, N, M, M1. The dimensions POD, PID, POD1, and PID1 are derived from the pipe sizes.
These are abbreviations of '''P'''ipe '''O'''uter '''D'''iameter and '''P'''ipe '''I'''nner '''D'''iameter.
The dimensions X1 and X2 are not official dimension names.

[[File:coupling-calculations.png]]

The offset a1 is calculated in such a way that the thinnest part of the middle section is not thinner than the walls of either socket.
Lengths a2, a3, a4 and angle b1 are derived from the dimensions and are only used to calculate a1.

=Useful links=
* An example of fittings with dimensioned drawings produced by [https://www.aetnaplastics.com/site_media/media/attachments/aetna_product_aetnaproduct/204/PVC%20Sch%2040%20Fittings%20Dimensions.pdf Aetna Plastics].
* [https://forum.freecadweb.org/viewtopic.php?f=8&t=27641&sid=0f829d3bd056ec5add5407879796451a Forum entry on freecadweb.org]

* [https://youtu.be/1FBudfRcQv4 Using Flamingo to move parts]

File Simplification

2025-09-09T17:02:31Z

Maltfield: fix italic syntax

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

The file simplification below refers to simplifying the actual features of a part - the Level of Detail section below. Another type of simplification can be done on the part tree to simplify the part tree during the design phase. This is the Part Tree Simplification section.

=Identifying problem objects=

If you have a large/slow FreeCAD file, you'll first want to identify ''which'' object is causing the problem.

There are two distinct sizes to consider:

# The (compressed) on-disk size of the .FCStd file
# The (uncompressed) MemSize of each object

The two are ''sometimes'' correlated, but it's possible to have a <1 MB .FCStd file that is completely unusable because of a very large MemSize. This would happen, for example, if you made a very simple sketch and then an enormous array of the sketch in three dimensions (eg for a mesh object). That would compress to a very small file size, but explode to a very large (uncompressed) MemSize, crashing FreeCAD.
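
To see how much compression hides, note that an .FCStd file is a ZIP archive, so its members can be listed with Python's standard <code>zipfile</code> module outside of FreeCAD. The sketch below compares compressed and uncompressed sizes per member. The function name is hypothetical, and the uncompressed size of an archive member is not the same thing as an object's MemSize; it only illustrates how a small file on disk can expand.

```python
import io
import zipfile


def fcstd_member_sizes(fcstd_file):
    """Return (name, compressed_bytes, uncompressed_bytes), largest first.

    `fcstd_file` is a path or a binary file-like object for a .FCStd file.
    """
    with zipfile.ZipFile(fcstd_file) as archive:
        rows = [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]
    # Sort by uncompressed size so the biggest members come first
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Highly repetitive content, such as a large array of identical shapes, tends to show a very large gap between the two sizes.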

Fortunately, FreeCAD is a very robust software that exposes the "python console" to the user, where you can paste custom code to interact with the objects. The snippet below will:

# Iterate through every layer in the [https://wiki.freecad.org/Document_structure FreeCAD Document's Tree]
# Get the size [https://github.com/FreeCAD/FreeCAD/blob/6ab8589a03b498b237f8ba88c6ae4692bb3adba6/src/Mod/TemplatePyMod/DocumentObject.py#L117-L119 MemSize] of each layer,
# Sort the list of layers by their size, and
# Print the list of layers (sorted by size)

To use this, you first need to open the [https://wiki.freecad.org/index.php?title=Python_Console Python Console] in FreeCAD. Do this by clicking '''View -> Panels -> Python Console'''. Then '''paste the following snippet''' into the Python Console and '''press enter'''.

<pre>
def printMem():
    # Collect every object in the active document, plus the document itself
    objs = list(FreeCAD.ActiveDocument.Objects)
    objs.append(FreeCAD.ActiveDocument)  # add doc to list
    objs.sort(reverse=True, key=lambda x: x.MemSize)  # max mem is first

    # Build a two-column table: size in bytes, then the object's label
    hdr = "MemSize (bytes) | Object Label\n"
    hLine = "-"*len(hdr) + "\n"
    linesList = ["\n", hLine, hdr, hLine]
    for obj in objs:
        linesList.append("{:>15,d} | {}\n".format(obj.MemSize, obj.Label))
    linesList.append(hLine)
    s = "".join(linesList)
    print(s)

printMem()
</pre>

Note that it may take several seconds to finish the calculation.

For more information on (and an example of) using the above code snippet to find the MemSize of every object in your FreeCAD file's tree, please see [https://www.eco-libre.org/big-freecad-file-size/ Troubleshooting Large FreeCAD File Sizes].

* https://www.eco-libre.org/big-freecad-file-size/

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules - it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done by either removing information from the CAD file, or retaining it. To retain all information, right click on a part tree heading and Create Group - which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts - or selecting a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches by Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound - collapsing a bunch of parts into one. However, Make Compound does not reduce file size further - in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see this prompt:

[[File:dependenciescopy.png|300px]]

Select no, and your paste will be lower in size.

'''To summarize - remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file - and you will have the minimum-size file possible under one item in the part tree. Such a format makes the overall assembly file in a team workflow the smallest possible, allowing for large scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size - we can go to file simplification in terms of Level of Detail - in the next section. This is like making thumbnails of pictures available: you can work with it, but it doesn't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows - the part tree simplification and level of detail simplification can be pursued ad infinitum - abstracting the design further and further - so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day - with thousands or even millions of people collaborating in realtime.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, we can download a file for a valve from [[McMaster-Carr]] and the thing is a few MB because it has details like threads. But - the problem comes in when we have an assembly of many parts. This leads easily to 100MB or GB size files if one doesn't pay attention to file size. This is rather unworkable - as the computer bogs down to very slow operations.

The solution is creating very small part files that represent the original - but instead of say 2 MB - it would be like 10k or so. Just a placeholder - which shows relatively accurate dimensions (important for analyzing part interference and fit) - but shows them in the crudest way possible. Such that - say we have a file with 200 parts of 10k each - so the entire assembly remains at only 2MB. As a general practice - files above 50MB are unusable - the practical limit is 10-20MB. But if kept down to around 1MB, navigation is lightning fast and no time is wasted.
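
The arithmetic above can be written as a small size budget check. This is only a sketch: the function names are hypothetical, 1 MB is taken as 1000 kB to match the figures in the text, and the thresholds are the rough limits quoted above rather than measured values.

```python
def assembly_size_mb(part_count, part_size_kb):
    """Estimated assembly size in MB, assuming 1 MB = 1000 kB."""
    return part_count * part_size_kb / 1000.0


def describe(size_mb):
    """Classify a file size using the rough limits quoted in the text."""
    if size_mb > 50:
        return "unusable"
    if size_mb > 20:
        return "above the practical limit"
    return "workable"
```

For example, 200 placeholder parts of 10 kB each give 2 MB, while 200 detailed parts of 2 MB each give 400 MB, far past the 50 MB mark.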

We save these small files as individual files, and assemblies of individual files, in the OSE [[Part Library]]. Thus, if we want to create an excessively large file - we can handle complex files of hundreds of parts without any visible slowdown of the computer. Read more about our workflow of merging files together - see [[Merge Workflow]].

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should be only a few bytes - l, w, h. 8 bits are a byte. About 65,000 divisions is 2 bytes (16 bit depth). So each dimension should be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, memory size of FreeCAD files can be reduced by at least 100x if files were stored in their most efficient form, because minimum file size is on the order of kilobytes, not bytes. Just sayin'.
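
The byte counts in the last note can be checked with Python's standard <code>struct</code> module. This sketch assumes each value is an unsigned 16-bit integer, and that "angle and position" means three angles plus three coordinates. The layout and names are illustrative only and are not a FreeCAD file format.

```python
import struct

# Length, width, height as three unsigned 16-bit integers (little-endian)
CUBE_DIMS = struct.Struct("<3H")

# Dimensions plus three rotation angles and three position coordinates
CUBE_POSED = struct.Struct("<9H")

# Pack an example cube; the values are arbitrary units in the range 0..65535
packed = CUBE_DIMS.pack(100, 200, 300)
unpacked = CUBE_DIMS.unpack(packed)
```

<code>CUBE_DIMS.size</code> is 6 and <code>CUBE_POSED.size</code> is 18, matching the figures in the note.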

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-09-09T17:00:13Z

Maltfield: fix link syntax

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

The file simplification below refers to simplifying the actual features of a part - the Level of Detail section below. Another type of simplification can be done on the part tree to simplify the part tree during the design phase. This is the Part Tree Simplification section.

=Identifying problem objects=

If you have a large/slow FreeCAD file, you'll first want to identify ''which'' object is causing the problem.

There are two distinct sizes to consider:

# The (compressed) on-disk size of the .FCStd file
# The (uncompressed) MemSize of each object

The two are ''sometimes'' correlated, but it's possible to have a <1 MB .FCStd file that is completely unusable because of a very large MemSize. This would happen, for example, if you made a very simple sketch and then an enormous array of the sketch in three dimensions (eg for a mesh object). That would compress to a very small file size, but explode to a very large (uncompressed) MemSize, crashing FreeCAD.

Fortunately, FreeCAD is robust software that exposes a Python console to the user, where you can paste custom code to interact with the objects in a document. The snippet below will:

# Iterate through every object in the [https://wiki.freecad.org/Document_structure FreeCAD Document's Tree],
# Get the [https://github.com/FreeCAD/FreeCAD/blob/6ab8589a03b498b237f8ba88c6ae4692bb3adba6/src/Mod/TemplatePyMod/DocumentObject.py#L117-L119 MemSize] of each object,
# Sort the list of objects by their size, and
# Print the list of objects (sorted by size)

To use this, first open the [https://wiki.freecad.org/index.php?title=Python_Console Python Console] in FreeCAD by clicking '''View -> Panels -> Python Console'''. Then '''paste the following snippet''' into the Python Console and '''press Enter'''.

<pre>
def printMem():
    # Collect every object in the active document, plus the document itself
    objs = list(FreeCAD.ActiveDocument.Objects)
    objs.append(FreeCAD.ActiveDocument)  # add doc to list
    objs.sort(reverse=True, key=lambda x: x.MemSize)  # max mem is first
    # Build the table as a list of lines, then print it in one go
    hdr = "MemSize (bytes) | Object Label\n"
    hLine = "-"*len(hdr) + "\n"
    linesList = ["\n", hLine, hdr, hLine]
    for obj in objs:
        linesList.append("{:>15,d} | {}\n".format(obj.MemSize, obj.Label))
    linesList.append(hLine)
    s = "".join(linesList)
    print(s)

printMem()
</pre>

Note that it may take several seconds to finish the calculation.

For more information (and an example) of using the above code snippet to find the MemSize of every object in your FreeCAD file's tree, please see [https://www.eco-libre.org/big-freecad-file-size/ Troubleshooting Large FreeCAD File Sizes]

* https://www.eco-libre.org/big-freecad-file-size/
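
MemSize covers only the in-memory side of the distinction above. For the on-disk side, a .FCStd file is a standard zip archive, so Python's <code>zipfile</code> module can list how much space each member takes compressed and uncompressed. The sketch below is only an illustration and not part of the OSE workflow: the function names and the example path are hypothetical, and it is meant to be run with any Python 3 interpreter, outside FreeCAD.

```python
import zipfile


def fcstd_member_sizes(path):
    """Return (name, compressed_bytes, uncompressed_bytes) for each archive
    member, largest uncompressed size first."""
    with zipfile.ZipFile(path) as archive:
        rows = [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def print_member_sizes(path):
    """Print a table of member sizes, similar in layout to the MemSize table."""
    print("{:>15} | {:>15} | {}".format("Compressed", "Uncompressed", "Member"))
    for name, compressed, uncompressed in fcstd_member_sizes(path):
        print("{:>15,d} | {:>15,d} | {}".format(compressed, uncompressed, name))


# Example call (hypothetical path):
# print_member_sizes("my_assembly.FCStd")
```

A large gap between the two columns for a member suggests content that compresses well on disk but expands when loaded.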

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules, it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done either by retaining all information in the CAD file or by removing some of it. To retain all information, right-click on a part tree heading and choose Create Group, which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts, or to select a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches by using Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound, collapsing a bunch of parts into one. However, Make Compound does not reduce file size further - in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see this prompt:

[[File:dependenciescopy.png|300px]]

Select no, and your paste will be lower in size.

'''To summarize - remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file - and you will have the minimum-size file possible under one item in the part tree. Such a format makes the overall assembly file in a team workflow the smallest possible, allowing for large scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size - we can go to file simplification in terms of Level of Detail - in the next section. This is like making thumbnails of pictures available: you can work with it, but it doesn't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows - the part tree simplification and level of detail simplification can be pursued ad infinitum - abstracting the design further and further - so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day - with thousands or even millions of people collaborating in real time.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, a valve file downloaded from [[McMaster-Carr]] is a few MB because it includes details like threads. The problem comes when we have an assembly of many such parts: it easily leads to files of 100 MB or even GB size if one doesn't pay attention to file size. This is rather unworkable, as the computer bogs down to very slow operations.

The solution is to create very small part files that represent the original: instead of, say, 2 MB, a part would be about 10 kB. It is just a placeholder that shows relatively accurate dimensions (important for analyzing part interference and fit) in the crudest way possible. For example, a file with 200 parts of 10 kB each keeps the entire assembly at only 2 MB. As a general practice, files above 50 MB are unusable, and the practical limit is 10-20 MB. But if kept down to around 1 MB, navigation is lightning fast and no time is wasted.

We save these small files as individual files, and as assemblies of individual files, in the OSE [[Part Library]]. Thus, we can handle complex files of hundreds of parts without any visible slowdown of the computer. To read more about our workflow of merging files together, see [[Merge Workflow]].
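
The arithmetic behind these rules of thumb can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, sizes use decimal units (1 MB = 1000 kB), and the label for the 20-50 MB band is an assumption, since the text only names the 50 MB and 10-20 MB thresholds.

```python
def assembly_size_kb(part_count, part_size_kb):
    """Rough assembly size: number of parts times the size of each part."""
    return part_count * part_size_kb


def classify_file_size(size_mb):
    """Map a file size in MB onto the rules of thumb quoted in the text."""
    if size_mb > 50:
        return "unusable"
    if size_mb > 20:
        return "past the practical limit"  # assumed label for 20-50 MB
    if size_mb >= 10:
        return "at the practical limit"
    return "workable"


# 200 placeholder parts of 10 kB each
placeholder_mb = assembly_size_kb(200, 10) / 1000
print(placeholder_mb, classify_file_size(placeholder_mb))  # 2.0 workable

# The same 200 parts at 2 MB (2000 kB) each, with full detail
detailed_mb = assembly_size_kb(200, 2000) / 1000
print(detailed_mb, classify_file_size(detailed_mb))  # 400.0 unusable
```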

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should need only a few bytes: l, w, h. 8 bits are a byte. About 65,000 divisions fit in 2 bytes (16-bit depth), so each dimension could be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, the size of FreeCAD files could be reduced by at least 100x if files were stored in their most efficient form, because the minimum file size is currently on the order of kilobytes, not bytes. Just sayin'.
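
The byte counts in the last note can be checked with Python's <code>struct</code> module. This is a sketch of the arithmetic only, not a proposed file format: the record layout and function names are hypothetical, each value is assumed to be quantized to an unsigned 16-bit integer, and "2.8k" is read as 2800 bytes. A real FreeCAD file also stores far more than this (for example the shape geometry and display data).

```python
import struct

# Hypothetical compact record: nine unsigned 16-bit integers, little-endian:
# l, w, h, then x, y, z position, then three rotation angles.
CUBE_FORMAT = "<9H"


def pack_cube(size, position, angles):
    """Pack (l, w, h), (x, y, z) and three angles, each in the range 0..65535."""
    return struct.pack(CUBE_FORMAT, *size, *position, *angles)


def unpack_cube(data):
    """Inverse of pack_cube: return (size, position, angles) tuples."""
    values = struct.unpack(CUBE_FORMAT, data)
    return values[0:3], values[3:6], values[6:9]


record = pack_cube((100, 200, 300), (0, 0, 0), (0, 90, 180))
print(len(struct.pack("<3H", 100, 200, 300)))  # 6 bytes for l, w, h alone
print(len(record))                             # 18 bytes with position and angle
print(round(2800 / len(record)))               # 156, i.e. more than 100x
```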

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-09-09T16:58:52Z

Maltfield: fix bold syntax

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. The OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

The first level simplifies the actual features of a part; see the Level of Detail section below. The second level simplifies the part tree during the design phase; see the Part Tree Simplification section.

=Identifying problem objects=

If you have a large/slow FreeCAD file, you'll first want to identify ''which'' object is causing the problem.

Two different sizes matter here:

# The (compressed) on-disk size of the .FCStd file
# The (uncompressed) MemSize of each object

The two are ''sometimes'' correlated, but it's possible to have a <1 MB .FCStd file that is completely unusable because of a very large MemSize. This would happen, for example, if you made a very simple sketch and then an enormous array of the sketch in three dimensions (eg for a mesh object). That would compress to a very small file size, but explode to a very large (uncompressed) MemSize, crashing FreeCAD.

Fortunately, FreeCAD is robust software that exposes a Python console to the user, where you can paste custom code to interact with the objects in a document. The snippet below will:

# Iterate through every object in the [https://wiki.freecad.org/Document_structure FreeCAD Document's Tree],
# Get the [https://github.com/FreeCAD/FreeCAD/blob/6ab8589a03b498b237f8ba88c6ae4692bb3adba6/src/Mod/TemplatePyMod/DocumentObject.py#L117-L119 MemSize] of each object,
# Sort the list of objects by their size, and
# Print the list of objects (sorted by size)

To use this, first open the [https://wiki.freecad.org/index.php?title=Python_Console Python Console] in FreeCAD by clicking '''View -> Panels -> Python Console'''. Then '''paste the following snippet''' into the Python Console and '''press Enter'''.

<pre>
def printMem():
    # Collect every object in the active document, plus the document itself
    objs = list(FreeCAD.ActiveDocument.Objects)
    objs.append(FreeCAD.ActiveDocument)  # add doc to list
    objs.sort(reverse=True, key=lambda x: x.MemSize)  # max mem is first
    # Build the table as a list of lines, then print it in one go
    hdr = "MemSize (bytes) | Object Label\n"
    hLine = "-"*len(hdr) + "\n"
    linesList = ["\n", hLine, hdr, hLine]
    for obj in objs:
        linesList.append("{:>15,d} | {}\n".format(obj.MemSize, obj.Label))
    linesList.append(hLine)
    s = "".join(linesList)
    print(s)

printMem()
</pre>

Note that it may take several seconds to finish the calculation.

For more information (and an example) of using the above code snippet to find the MemSize of every object in your FreeCAD file's tree, please see [https://www.eco-libre.org/big-freecad-file-size/ Troubleshooting Large FreeCAD File Sizes]

* https://www.eco-libre.org/big-freecad-file-size/

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules, it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done either by retaining all information in the CAD file or by removing some of it. To retain all information, right-click on a part tree heading and choose Create Group, which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts, or to select a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches by using Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound, collapsing a bunch of parts into one. However, Make Compound does not reduce file size further - in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see this prompt:

[[File:dependenciescopy.png|300px]]

Select no, and your paste will be lower in size.

'''To summarize - remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file - and you will have the minimum-size file possible under one item in the part tree. Such a format makes the overall assembly file in a team workflow the smallest possible, allowing for large scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size - we can go to file simplification in terms of Level of Detail - in the next section. This is like making thumbnails of pictures available: you can work with it, but it doesn't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows - the part tree simplification and level of detail simplification can be pursued ad infinitum - abstracting the design further and further - so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day - with thousands or even millions of people collaborating in real time.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, a valve file downloaded from [[McMaster-Carr]] is a few MB because it includes details like threads. The problem comes when we have an assembly of many such parts: it easily leads to files of 100 MB or even GB size if one doesn't pay attention to file size. This is rather unworkable, as the computer bogs down to very slow operations.

The solution is to create very small part files that represent the original: instead of, say, 2 MB, a part would be about 10 kB. It is just a placeholder that shows relatively accurate dimensions (important for analyzing part interference and fit) in the crudest way possible. For example, a file with 200 parts of 10 kB each keeps the entire assembly at only 2 MB. As a general practice, files above 50 MB are unusable, and the practical limit is 10-20 MB. But if kept down to around 1 MB, navigation is lightning fast and no time is wasted.

We save these small files as individual files, and as assemblies of individual files, in the OSE [[Part Library]]. Thus, we can handle complex files of hundreds of parts without any visible slowdown of the computer. To read more about our workflow of merging files together, see [[Merge Workflow]].

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should need only a few bytes: l, w, h. 8 bits are a byte. About 65,000 divisions fit in 2 bytes (16-bit depth), so each dimension could be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, the size of FreeCAD files could be reduced by at least 100x if files were stored in their most efficient form, because the minimum file size is currently on the order of kilobytes, not bytes. Just sayin'.

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-09-09T16:58:17Z

Maltfield: fix syntax of code

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. The OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

The first level simplifies the actual features of a part; see the Level of Detail section below. The second level simplifies the part tree during the design phase; see the Part Tree Simplification section.

=Identifying problem objects=

If you have a large/slow FreeCAD file, you'll first want to identify ''which'' object is causing the problem.

Two different sizes matter here:

# The (compressed) on-disk size of the .FCStd file
# The (uncompressed) MemSize of each object

The two are ''sometimes'' correlated, but it's possible to have a <1 MB .FCStd file that is completely unusable because of a very large MemSize. This would happen, for example, if you made a very simple sketch and then an enormous array of the sketch in three dimensions (eg for a mesh object). That would compress to a very small file size, but explode to a very large (uncompressed) MemSize, crashing FreeCAD.

Fortunately, FreeCAD is robust software that exposes a Python console to the user, where you can paste custom code to interact with the objects in a document. The snippet below will:

# Iterate through every object in the [https://wiki.freecad.org/Document_structure FreeCAD Document's Tree],
# Get the [https://github.com/FreeCAD/FreeCAD/blob/6ab8589a03b498b237f8ba88c6ae4692bb3adba6/src/Mod/TemplatePyMod/DocumentObject.py#L117-L119 MemSize] of each object,
# Sort the list of objects by their size, and
# Print the list of objects (sorted by size)

To use this, first open the [https://wiki.freecad.org/index.php?title=Python_Console Python Console] in FreeCAD by clicking '''View -> Panels -> Python Console'''. Then '''paste the following snippet''' into the Python Console and '''press Enter'''.

<pre>
def printMem():
    # Collect every object in the active document, plus the document itself
    objs = list(FreeCAD.ActiveDocument.Objects)
    objs.append(FreeCAD.ActiveDocument)  # add doc to list
    objs.sort(reverse=True, key=lambda x: x.MemSize)  # max mem is first
    # Build the table as a list of lines, then print it in one go
    hdr = "MemSize (bytes) | Object Label\n"
    hLine = "-"*len(hdr) + "\n"
    linesList = ["\n", hLine, hdr, hLine]
    for obj in objs:
        linesList.append("{:>15,d} | {}\n".format(obj.MemSize, obj.Label))
    linesList.append(hLine)
    s = "".join(linesList)
    print(s)

printMem()
</pre>

Note that it may take several seconds to finish the calculation.

For more information (and an example) of using the above code snippet to find the MemSize of every object in your FreeCAD file's tree, please see [https://www.eco-libre.org/big-freecad-file-size/ Troubleshooting Large FreeCAD File Sizes]

* https://www.eco-libre.org/big-freecad-file-size/

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules, it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done either by retaining all information in the CAD file or by removing some of it. To retain all information, right-click on a part tree heading and choose Create Group, which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts, or to select a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches by using Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound, collapsing a bunch of parts into one. However, Make Compound does not reduce file size further - in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see this prompt:

[[File:dependenciescopy.png|300px]]

Select no, and your paste will be lower in size.

'''To summarize - remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file - and you will have the minimum-size file possible under one item in the part tree. Such a format makes the overall assembly file in a team workflow the smallest possible, allowing for large scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size - we can go to file simplification in terms of Level of Detail - in the next section. This is like making thumbnails of pictures available: you can work with it, but it doesn't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows - the part tree simplification and level of detail simplification can be pursued ad infinitum - abstracting the design further and further - so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day - with thousands or even millions of people collaborating in real time.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, a valve file downloaded from [[McMaster-Carr]] is a few MB because it includes details like threads. The problem comes when we have an assembly of many such parts: it easily leads to files of 100 MB or even GB size if one doesn't pay attention to file size. This is rather unworkable, as the computer bogs down to very slow operations.

The solution is to create very small part files that represent the original: instead of, say, 2 MB, a part would be about 10 kB. It is just a placeholder that shows relatively accurate dimensions (important for analyzing part interference and fit) in the crudest way possible. For example, a file with 200 parts of 10 kB each keeps the entire assembly at only 2 MB. As a general practice, files above 50 MB are unusable, and the practical limit is 10-20 MB. But if kept down to around 1 MB, navigation is lightning fast and no time is wasted.

We save these small files as individual files, and as assemblies of individual files, in the OSE [[Part Library]]. Thus, we can handle complex files of hundreds of parts without any visible slowdown of the computer. To read more about our workflow of merging files together, see [[Merge Workflow]].

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should need only a few bytes: l, w, h. 8 bits are a byte. About 65,000 divisions fit in 2 bytes (16-bit depth), so each dimension could be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, the size of FreeCAD files could be reduced by at least 100x if files were stored in their most efficient form, because the minimum file size is currently on the order of kilobytes, not bytes. Just sayin'.

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-09-09T16:56:47Z

Maltfield: added section for calculating MemSize of every object

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. The OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

The first level simplifies the actual features of a part; see the Level of Detail section below. The second level simplifies the part tree during the design phase; see the Part Tree Simplification section.

=Identifying problem objects=

If you have a large/slow FreeCAD file, you'll first want to identify ''which'' object is causing the problem.

Two different sizes matter here:

# The (compressed) on-disk size of the .FCStd file
# The (uncompressed) MemSize of each object

The two are ''sometimes'' correlated, but it's possible to have a <1 MB .FCStd file that is completely unusable because of a very large MemSize. This would happen, for example, if you made a very simple sketch and then an enormous array of the sketch in three dimensions (eg for a mesh object). That would compress to a very small file size, but explode to a very large (uncompressed) MemSize, crashing FreeCAD.

Fortunately, FreeCAD is robust software that exposes a Python console to the user, where you can paste custom code to interact with the objects in a document. The snippet below will:

# Iterate through every object in the [https://wiki.freecad.org/Document_structure FreeCAD Document's Tree],
# Get the [https://github.com/FreeCAD/FreeCAD/blob/6ab8589a03b498b237f8ba88c6ae4692bb3adba6/src/Mod/TemplatePyMod/DocumentObject.py#L117-L119 MemSize] of each object,
# Sort the list of objects by their size, and
# Print the list of objects (sorted by size)

To use this, first open the [https://wiki.freecad.org/index.php?title=Python_Console Python Console] in FreeCAD by clicking '''View -> Panels -> Python Console'''. Then '''paste the following snippet''' into the Python Console and '''press Enter'''.

```
def printMem():
    # Collect every object in the active document, plus the document itself
    objs = list(FreeCAD.ActiveDocument.Objects)
    objs.append(FreeCAD.ActiveDocument)  # add doc to list
    objs.sort(reverse=True, key=lambda x: x.MemSize)  # max mem is first
    # Build the table as a list of lines, then print it in one go
    hdr = "MemSize (bytes) | Object Label\n"
    hLine = "-"*len(hdr) + "\n"
    linesList = ["\n", hLine, hdr, hLine]
    for obj in objs:
        linesList.append("{:>15,d} | {}\n".format(obj.MemSize, obj.Label))
    linesList.append(hLine)
    s = "".join(linesList)
    print(s)

printMem()
```

Note that it may take several seconds to finish the calculation.

For more information (and an example) of using the above code snippet to find the MemSize of every object in your FreeCAD file's tree, please see [https://www.eco-libre.org/big-freecad-file-size/ Troubleshooting Large FreeCAD File Sizes]

* https://www.eco-libre.org/big-freecad-file-size/

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules, it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done either by retaining all information in the CAD file or by removing some of it. To retain all information, right-click on a part tree heading and choose Create Group, which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts, or to select a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches by using Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound, collapsing a bunch of parts into one. However, Make Compound does not reduce file size further - in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see this prompt:

[[File:dependenciescopy.png|300px]]

Select no, and your paste will be lower in size.

'''To summarize - remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file - and you will have the minimum-size file possible under one item in the part tree. Such a format makes the overall assembly file in a team workflow the smallest possible, allowing for large scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size - we can go to file simplification in terms of Level of Detail - in the next section. This is like making thumbnails of pictures available: you can work with it, but it doesn't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows - the part tree simplification and level of detail simplification can be pursued ad infinitum - abstracting the design further and further - so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day - with thousands or even millions of people collaborating in real time.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, we can download a file for a valve from [[McMaster-Carr]] and the thing is a few MB because it has details like threads. But - the problem comes in when we have an assembly of many parts. This leads easily to 100MB or GB size files if one doesn't pay attention to file size. This is rather unworkable - as the computer bogs down to very slow operations.

The solution is creating very small part files that represent the original - but instead of say 2 MB - it would be like 10k or so. Just a placeholder - which shows relatively accurate dimensions (important for analyzing part interference and fit) - but shows them in the crudest way possible. Such that - say we have a file with 200 parts of 10k each - so the entire assembly remains at only 2MB. As a general practice - files above 50MB are unusable - the practical limit is 10-20MB. But if kept down to around 1MB, navigation is lightning fast and no time is wasted.

We save these small files, both as individual files and as assemblies of individual files, in the OSE [[Part Library]]. Thus, when we need to build a large assembly, we can handle complex files of hundreds of parts without any visible slowdown of the computer. To read more about our workflow of merging files together, see [[Merge Workflow]].

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should need only a few bytes: l, w, h. A byte is 8 bits. About 65,000 divisions fit in 2 bytes (16-bit depth), so each dimension could be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, the size of FreeCAD files could be reduced by at least 100x if files were stored in their most efficient form, because the minimum file size is on the order of kilobytes, not bytes. Just sayin'.
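
The byte count in the last note can be checked with a short sketch. This is only an illustration of the arithmetic, not how FreeCAD stores anything: the layout (nine unsigned 16-bit integers for length, width, height, three angles and three position coordinates) and the name <code>pack_box</code> are assumptions made up for this example.

```python
import struct

# Hypothetical compact layout (an assumption for illustration only):
# nine unsigned 16-bit little-endian integers in the order
# length, width, height, three rotation angles, three position coordinates.
BOX_FORMAT = "<9H"


def pack_box(length, width, height, angles, position):
    """Pack a box into a byte string; every value must be an integer in 0..65535."""
    return struct.pack(BOX_FORMAT, length, width, height, *angles, *position)


record = pack_box(1200, 600, 90, (0, 0, 16384), (100, 200, 300))

dims_only = struct.calcsize("<3H")   # bytes needed for l, w, h alone
full_record = len(record)            # bytes with angle and position added
ratio = 2800 // full_record          # versus the 2.8k minimum file noted above

print(dims_only, full_record, ratio)
```

With these assumptions the dimensions take 6 bytes, the full record takes 18 bytes, and the 2.8k minimum file is roughly 155 times larger, which is consistent with the "at least 100x" estimate.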

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-08-27T18:51:51Z

Maltfield: add link to SE

=Introduction=
With FreeCAD, OSE practices 2 levels of file simplification. In both cases, the goals are to reduce file size and to simplify the part tree. The OSE workflow assumes that we work with the part tree (especially the very useful feature of hiding and un-hiding parts for build instructional purposes), and that we reduce file size as much as possible to make complex files quick to open and easy to manipulate without bogging down the computer. This is especially important when large teams are collaborating.

One type of simplification reduces the actual features of a part; this is covered in the Level of Detail section below. Another type is done on the part tree, to simplify it during the design phase; this is covered in the Part Tree Simplification section.

=Part Tree Simplification=
When doing design work with multiple modules of similar parts, such as the Seed Eco-Home wall modules, it is useful to collapse the part tree into a single item.

OSE usually creates detailed CAD where every single part (such as the tens of parts of a wall module) appears as an individual item in the Part Tree. This is useful for making instructionals, where parts can be hidden and unhidden to allow for step-by-step build sequences. Also, exploded part animations can be done using the [[Exploded Assembly Workbench]].

However, in the design phase, it is challenging to keep track of dozens of parts, so it is useful to collapse the part tree into a more manageable form. This can be done either by removing information from the CAD file or by retaining it. To retain all information, right click on a part tree heading and Create Group, which creates a folder. Then you can drag and drop parts into that folder. This makes it easy to keep track of parts, or to select a bunch of parts at once by selecting that folder. This does not reduce file size.

To reduce file size, we can remove sketches using Create Simple Copy in the Part Workbench, or by clicking on a sketch and deleting it. We can also Make Compound, collapsing a bunch of parts into one. However, Make Compound does not reduce file size further; in fact, a Compound of a bunch of simple parts takes more memory than the simple copies themselves. To reduce the file size of a compound, Ctrl-C and Ctrl-V it into a new document. Ctrl-V into the same document doesn't seem to reduce the file size. When you copy a compound or a part with a sketch, you will typically see:

[[File:dependenciescopy.png|300px]]

Select No, and the pasted copy will be smaller in size.

'''To summarize: remove sketches to reduce memory, make a compound to collapse all parts into one, and then copy-paste without detail into a new file. The result is the minimum-size file possible under one item in the part tree. This format makes the overall assembly file in a team workflow as small as possible, allowing for large-scale design. The limit here is a few thousand part files that can be manipulated readily. Once a file reaches an unmanageable size, we can move on to file simplification in terms of Level of Detail, covered in the next section. This is like making thumbnails of pictures available: you can work with them, but they don't contain all the detail. The simple version is an abstract version of the original file. Thus, in large-scale team workflows, part tree simplification and level of detail simplification can be pursued ad infinitum, abstracting the design further and further, so that complex assemblies can be created. In principle, the complexity of design that this process can handle has no limit. Therefore, even the largest design problems can be solved in a day, with thousands or even millions of people collaborating in real time.'''

=Level of Detail=

We work with CAD files at different levels of detail. For example, a valve file downloaded from [[McMaster-Carr]] is a few MB because it includes details like threads. The problem arises when we have an assembly of many parts: this easily leads to files of 100 MB or even GB size if one doesn't pay attention to file size. Such files are unworkable, as the computer bogs down to very slow operations.

The solution is to create very small part files that represent the original: instead of, say, 2 MB, a part would be about 10 kB. Each is just a placeholder that shows relatively accurate dimensions (important for analyzing part interference and fit) in the crudest way possible. For example, a file with 200 parts of 10 kB each keeps the entire assembly at only 2 MB. As a general practice, files above 50 MB are unusable, and the practical limit is 10-20 MB. If kept down to around 1 MB, navigation is lightning fast and no time is wasted.
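
The size budget above can be turned into a quick estimate. The following is a minimal sketch that assumes an assembly is roughly the sum of its part files (1 MB taken as 1000 kB); the function names and the labels are made up for this example, and the band between 20 MB and 50 MB is not described in the text, so the label for it is only a guess.

```python
# Thresholds in MB, taken from the paragraph above.
FAST_MB = 1
PRACTICAL_LIMIT_MB = 20
UNUSABLE_MB = 50


def assembly_size_mb(part_count, part_size_kb):
    """Rough assembly size, assuming it is simply the sum of its part files."""
    return part_count * part_size_kb / 1000


def rate(size_mb):
    """Label a file size using the thresholds above (labels are illustrative)."""
    if size_mb <= FAST_MB:
        return "fast"
    if size_mb <= PRACTICAL_LIMIT_MB:
        return "workable"
    if size_mb <= UNUSABLE_MB:
        return "past the practical limit"
    return "unusable"


simplified = assembly_size_mb(200, 10)     # 200 placeholder parts of 10 kB
detailed = assembly_size_mb(200, 2000)     # the same parts at 2 MB each

print(simplified, rate(simplified))
print(detailed, rate(detailed))
```

With 200 parts, the simplified assembly comes to 2 MB and stays workable, while the same assembly built from 2 MB parts comes to 400 MB and is far past the unusable mark.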

We save these small files, both as individual files and as assemblies of individual files, in the OSE [[Part Library]]. Thus, when we need to build a large assembly, we can handle complex files of hundreds of parts without any visible slowdown of the computer. To read more about our workflow of merging files together, see [[Merge Workflow]].

=Working Doc=

<html><iframe src="https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></html>

[https://docs.google.com/presentation/d/11CXpjC2phOyV40SSsKuAET6jEAIGK036BmqTNzkxwmM/edit#slide=id.g22c1dd84ad_1_132 edit]

=Notes=
*Note that FreeCAD file size is 2.8k minimum for a cubic shape in the above presentation.
*Thus, the simplest useful files start at about 10k. Files with about
*A cube should need only a few bytes: l, w, h. A byte is 8 bits. About 65,000 divisions fit in 2 bytes (16-bit depth), so each dimension could be stored in 2 bytes. Thus, a cube should be 6 bytes large. If we add angle and position, we have 18 bytes. Thus, the size of FreeCAD files could be reduced by at least 100x if files were stored in their most efficient form, because the minimum file size is on the order of kilobytes, not bytes. Just sayin'.

=External Links=
* [https://forum.freecad.org/viewtopic.php?p=844168#p844168 Why is my FreeCAD file so large? (grainular file size view)] question on FreeCAD Forums
* [https://engineering.stackexchange.com/questions/63647/why-is-my-freecad-file-so-large Why is my FreeCAD file so large?] question on Engineering Stack Exchange

File Simplification

2025-08-27T18:51:07Z

Maltfield: fix syntax


File Simplification

2025-08-27T18:50:55Z

Maltfield: add link to freecad forums


Maltfield Log/2025 Q2

2025-05-31T19:39:21Z

Maltfield: apr 30

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Wed Apr 30, 2025=
# This morning we're going to replace /dev/sda on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-30_replace_hetzner2_sda
# unfortunately my computer was off when I woke up
# worse, my personal keepass db appeared to be corrupt
# I had to restore from my most recent on-boot backup of my keepass, which means I have ~3 weeks of data loss
# so I'm starting this change about half an hour late due to ^ that
# first-off, I logged into hetzner and the wiki to make damn sure I have those creds before continuing
# last week I had asked hetzner support to ensure they had a stock of the replacement drive we needed
## they responded asking me to update an existing ticket, but idk how to even view existing tickets. When I click "support" after logging in, it just sends me to create a new ticket
## probably too late, but I responded (by email) to this response
<pre>
> you have another ticket open and running parallel to this one.

Can you please tell me where I can see my existing open tickets in your
website?

When I click on "Support" after logging-in, I am only given an option to
create new tickets. I can't see any existing tickets here

* https://robot.hetzner.com/support/index

If your request about an additional drive refers to REDACTED then
please copy / paste you request from this ticket into Ticket#
2025042403016013, as it will then land in the data center where your server
is online and the DC Support staff can respond to you according. Thank-you.

Yes, this is regarding REDACTED

Sorry, I can't paste it into any existing ticket because [a] existing
tickets are not visible when I login and [b] you didn't tell me how to
access the existing tickets..

We need the disk for a scheduled change in a couple hours.

Thank you,
Michael

On Mon, Apr 28, 2025 at 5:03 AM Support - Hetzner Online GmbH <
support@hetzner.com> wrote:

> Dear Mr Altfield
>
> Thank-you for your request.
>
> I notice that you are not referring to a specific server ID number in this
> ticket and that you have another ticket open and running parallel to this
> one.
>
> If your request about an additional drive refers to REDACTED then
> please copy / paste you request from this ticket into Ticket#
> REDACTED, as it will then land in the data center where your server
> is online and the DC Support staff can respond to you according. Thank-you.
>
> Kind regards
>
> Robin Rabe
>
> Sales & Product Advic
</pre>
# well it was worth trying; let's proceed in hopes they have stock.
# I logged into hetzner2 and confirmed that it completed its daily reboot just 48 minutes ago, so we can proceed without worrying that it'll reboot again for ~23 hours
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:29:46 UTC 2025
[root@opensourceecology ~]# uptime
11:29:48 up 48 min, 5 users, load average: 1.29, 1.27, 1.01
[root@opensourceecology ~]#
</pre>
# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAIL
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
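
For a future sysadmin who wants to script this check: below is a minimal sketch of parsing the attribute rows that <code>smartctl -A</code> printed above. It assumes, as I read it in this log for these Crucial drives, that the normalized VALUE of attribute 202 is the percent of life remaining and the raw value is the percent used. The function name and the sample constants are made up for this example; the sample rows are copied from the output above.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: row_dict} for each attribute row of `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # banner lines and the column header row do not.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "when_failed": fields[8],
                # The raw column may contain spaces, e.g. "36 (Min/Max 22/51)".
                "raw": " ".join(fields[9:]),
            }
    return attrs


# Rows copied from the /dev/sdb and /dev/sda output above.
SDB_ROWS = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
"""
SDA_ROWS = """\
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
"""

for name, rows in (("sdb", SDB_ROWS), ("sda", SDA_ROWS)):
    life = parse_smart_attributes(rows)["Percent_Lifetime_Remain"]
    failing = life["when_failed"] != "-"
    print(name, "life remaining:", life["value"], "% failing:", failing)
```

On the sample rows this reports 4% remaining for sdb (not failing) and 0% remaining for sda (failing), matching what I read off by hand.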
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
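
The <code>[2/1] [_U]</code> status in <code>/proc/mdstat</code> above is what tells us each array is now running on one of its two members. As a rough sketch (not something that is installed on the server), the same check could be scripted like this. The function name is made up for this example; the degraded sample is copied from the output above, and the healthy sample is hypothetical.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member, e.g. [2/1] [_U]."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[expected/active] [UU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Copied from the output above (after removing sda from the RAID).
MDSTAT_DEGRADED = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2]
      33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
      209984640 blocks super 1.2 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
      523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

# Hypothetical healthy array, for comparison only.
MDSTAT_HEALTHY = """\
md0 : active raid1 sda1[3] sdb1[2]
      33521664 blocks super 1.2 [2/2] [UU]
"""

print(degraded_arrays(MDSTAT_DEGRADED))
print(degraded_arrays(MDSTAT_HEALTHY))
```

On the real server the input would come from reading <code>/proc/mdstat</code> instead of the sample strings.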
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>
# last time it took about 1 hour for them to respond saying the new disk was installed. I'll come back in about an hour
# ...
# I got an email at 13:20 UTC (08:20 my time), saying the drive was replaced
# ugh, they gave us a drive with 18,623 hours of use. It only has 32% of its life left
# I replied to the support ticket within 2 minutes telling them to replace it again with a drive that has <1,000 hours of use
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

START OF READ SMART DATA SECTION
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>
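# side note: the two numbers that matter in that table (the raw Power_On_Hours and the normalized Percent_Lifetime_Remain value) can be pulled out with awk. This is only a rough sketch, assuming the 10-column <code>smartctl -A</code> layout shown above. The function names are made up, and reading the VALUE column as "percent of life left" is an assumption that matches these Micron/Crucial drives but is vendor-specific.

```shell
# Print the RAW_VALUE (10th column) of the attribute named in $1,
# reading `smartctl -A` output on stdin.
smart_raw() {
  awk -v attr="$1" '$2 == attr { print $10 }'
}

# Print the normalized VALUE (4th column) of the attribute named in $1,
# with leading zeros stripped (032 becomes 32).
smart_value() {
  awk -v attr="$1" '$2 == attr { print $4 + 0 }'
}

# Succeed only if Power_On_Hours is below the limit in $1 (default 1000).
# The 1000-hour default mirrors the "<1,000 hours" request in the ticket.
drive_is_fresh() {
  hours="$(smart_raw Power_On_Hours)"
  [ -n "$hours" ] && [ "$hours" -lt "${1:-1000}" ]
}

# usage (not run here):
#   smartctl -A /dev/sda | drive_is_fresh 1000 || echo "ask for another drive"
```

# fed the output above, <code>smart_raw Power_On_Hours</code> would give 18623, so <code>drive_is_fresh</code> would fail for this replacement drive.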
# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>
# I'm going to go ahead and provision it
# I tried to update the wiki, but it looks like I got logged-out and I can't login again
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# the disk isn't full, and I'm not getting read only i/o errors like last time (when they removed both drives by mistake)
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 157G 31G 84% /
/dev/md1 486M 383M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
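# side note: the <code>[_U]</code> markers in <code>/proc/mdstat</code> are what show that each array is running on one member. A minimal sketch of listing the degraded arrays automatically, assuming the mdstat layout shown above (<code>degraded_arrays</code> is a made-up helper name, not an mdadm feature):

```shell
# Print the name of every md array whose member-status field (e.g. [_U])
# contains an underscore, meaning a member is missing or failed.
# Reads mdstat-formatted text on stdin.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                        # remember the current array
    $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }    # status field with a gap
  '
}

# usage (not run here):
#   degraded_arrays < /proc/mdstat
```

# with the mdstat output above, this would list md0, md2 and md1, since all three show <code>[_U]</code>.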
# I made a backup of the partitions
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0 /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
</pre>
# the sda partition table dump is empty, which makes sense (the new disk has no partitions yet)
# I copied the partition table from sdb to sda
<pre>
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sda2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sda3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sda4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I had the kernel re-read the partition table on sda
<pre>
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
<pre>
[root@opensourceecology chg.20250430_134343]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 200.4G 0 part
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology chg.20250430_134343]#

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
</pre>
# cool, that worked. /boot is already done, and it's syncing root (/) now
<pre>
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
[=>...................] recovery = 9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
</pre>
# I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was done first and is already done syncing
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# as noted in the docs, those warnings can be safely ignored
# replication is finished; I guess these Micron 500G disks have better i/o throughput than our old 250G Crucial disks
<pre>
Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
[====>................] recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
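# side note: the timestamped snapshots above can be produced by a small polling loop. This is a sketch under assumptions, not necessarily the command that was used here; the function names and the 300-second default interval are made up for illustration.

```shell
# Succeed (exit 0) if the mdstat text on stdin shows any array that is
# still rebuilding ("recovery = ...") or queued to rebuild ("resync=DELAYED").
mdstat_is_syncing() {
  grep -Eq 'recovery *=|resync *='
}

# Print a UTC timestamp plus the full mdstat text every $2 seconds
# (default 300) until nothing is syncing, then print one final snapshot.
# $1 is the file to read and defaults to /proc/mdstat.
wait_for_sync() {
  file="${1:-/proc/mdstat}"
  while mdstat_is_syncing < "$file"; do
    date -u
    cat "$file"
    sleep "${2:-300}"
  done
  date -u
  cat "$file"
}

# usage (not run here):
#   wait_for_sync /proc/mdstat 300 && grub2-install /dev/sda
```

# chaining the grub install after the loop would make sure it only runs once every array reports <code>[UU]</code>.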
# I'm going to double-tap the grub install before giving it a reboot
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# and I rebooted it
<pre>
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
user@personal:~$ autossh opensourceecology.org
Last login: Wed Apr 30 11:28:26 2025 from REDACTED
[maltfield@opensourceecology ~]$ uptime
14:17:14 up 1 min, 1 user, load average: 0.85, 0.24, 0.08
[maltfield@opensourceecology ~]$
</pre>
# cool, it came back.
# cool, raid looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[3]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[3]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# and SMART isn't yelling about failed disks anymore
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# I'm marking this CHG as completed successfully
# ...
# Marcin asked me how much longer the 4% disk will last; I replied
# so it says it's been online for 52,235 hours, and it has 4% remaining
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52235
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 47
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 30
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 049 000 Old_age Always - 35 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601655717236
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904918036
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11850643256
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
# that's 52,235/96 = ~544.11 hours per percent
# so I guess we have 4*544.11 = ~2,176 hours = ~90 days left
<pre>
Likely longer than 90 days. Plus or Minus a very large uncertainty.

SMART data keys are a standard, but the values are very vendor-specific. So they vary *a lot*. And, for the "life left" percent -- obviously something can be said about designed obsolescence. Or at least it's in the interest of the vendor to tell you to replace a drive earlier than needed. I have no idea how long you've been running on two disks that have 0% "life left". In any case, I wouldn't be cheap about disks; it's not worth the risk.

Anyway, the disk with 4% "life left" says that it's been online for 52,235 hours. So dividing that by 96% and then multiplying by 4 suggests that you have maybe 2,176 hours before it says it has 0% "life left". But it might also depend on the read/write frequency/pattern used by the previous customer. So take it with a big grain of salt.

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/30/25 15:53, Marcin Jakubowski wrote:
> 4% of life left means how many days?
</pre>
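# side note: the estimate above is a plain linear extrapolation, which can be sketched in shell integer arithmetic. <code>estimate_hours_left</code> is a made-up name, and the linear-wear assumption is exactly the "big grain of salt" mentioned in the email.

```shell
# Linear extrapolation: if the drive used (100 - remaining)% of its rated
# life in $1 power-on hours, how many more hours until it reports 0%?
#   $1 = Power_On_Hours
#   $2 = percent of life remaining (1..99)
# Integer arithmetic, so the result is rounded down.
estimate_hours_left() {
  hours_on="$1"
  pct_left="$2"
  echo $(( hours_on * pct_left / (100 - pct_left) ))
}

hours="$(estimate_hours_left 52235 4)"
echo "${hours} hours (~$(( hours / 24 )) days)"
# prints: 2176 hours (~90 days)
```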

=Sun Apr 27, 2025=
# Tom created a GitHub account https://github.com/tgriff-ose
# I invited this new account to become a member of the official OSE GitHub org, and sent them an email
<pre>
Hey Tom,

I've invited you to join the official OSE GitHub org:

* https://github.com/orgs/OpenSourceEcology

Please check your GitHub notifications and accept the invite.

PS: If you haven't yet, can you please enable 2FA on your GitHub account?

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/26/25 22:42, REDACTED@tutanota.com wrote:
> Account name: tgriff-ose
>
> --
> Tom Griffing
>
>
>
> Apr 27, 2025, 03:24 by REDACTED@disroot.org:
>
>> GitHub is owned by Microsoft, and it's free (as in beer) to create an account.
>>
>> Could you please create a free GitHub account?
>>
>> Michael Altfield
>> https://www.michaelaltfield.net
>> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>>
>> Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
>>
>> On 4/26/25 21:06, REDACTED@tutanota.com wrote:
>>
>>> Michael;
>>>
>>> I don't have a github account, as it's a Microsoft thing requiring a paid account. I don't intent to support them.
>>>
>>> Is there any other way to access the ansible repo?
>>> --
>>> Tom Griffing
</pre>
# ...
# Marcin confirmed that he has not received a bill from AWS for some time, so it appears we did finally delete all of the glacier crap
<pre>
I have not received another bill since January, so it looks like there is
nothing owed.
MJ

On Sat, Apr 26, 2025 at 6:28 PM Michael Altfield <REDACTED>
wrote:

> Hey Marcin,
>
> Speaking of aws, can you confirm that your bill for last month was $0?
>
>
> Thank you,
</pre>
# ...
# I updated my wiki and osedev work logs for April so-far

=Sat Apr 26, 2025=
# Marcin authorized me to add Tom to our ops google groups mailing list and to give him access to our shared ose keepass
<pre>
Yes.

On Fri, Apr 25, 2025, 12:43 PM Michael Altfield <REDACTED@disroot.org> wrote:

> (re-sending without encryption)
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
>
> On 4/25/25 12:41, Michael Altfield wrote:
>> Hey Marcin,
>>
>> Do you authorize:
>>
>> 1. Giving Tom access to the shared OSE keepass file
>>
>> 2. Adding Tom to the ops mailing list (this would allow him to password
>> reset many of our important accounts)
>>
>> Please let me know if you authorize the above.
>>
>> Thank you,
</pre>
# Tom sent me his gpg public key, which I can use to add him to the wazuh emails
<pre>
user@ose:~$ gpg
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: Go ahead and type your message ...
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----

pub rsa4096 2025-04-26 [SC]
13300901348A985115679165FB137A633FD1EB4C
uid Tom Griffing (OSE PGP Key 4-25-2025) <REDACTED@tutanota.com>
sub rsa4096 2025-04-26 [E]
user@ose:~$
</pre>
# I added Tom to the wazuh recipients, per https://wiki.opensourceecology.org/wiki/Wazuh
<pre>
mkdir -p /var/tmp/gpg
pushd /var/tmp/gpg
# write multi-line to file for documentation copy & paste
cat << EOF > /var/tmp/gpg/tom.pubkey.asc
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----
EOF
gpg --homedir /var/ossec/.gnupg --import /var/tmp/gpg/tom.pubkey.asc
popd

# add tom's email (that matches an email on a UID of his key above) to the space-delimited "recipients" variable
vim /var/ossec/sent_encrypted_alarm.settings
</pre>
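# side note: after an import like this, it's worth confirming that the fingerprint in the wazuh keyring matches the one printed when the key was first inspected. A minimal sketch, assuming gpg's machine-readable <code>--with-colons</code> listing (where the fingerprint is the 10th field of an <code>fpr</code> record); the helper names are made up.

```shell
# Print the first fingerprint found in gpg --with-colons output on stdin.
first_fingerprint() {
  awk -F: '$1 == "fpr" { print $10; exit }'
}

# Succeed if the first fingerprint on stdin equals the expected one in $1,
# ignoring spaces and letter case.
fingerprint_matches() {
  want="$(printf '%s' "$1" | tr -d ' ' | tr 'a-f' 'A-F')"
  got="$(first_fingerprint | tr -d ' ' | tr 'a-f' 'A-F')"
  [ -n "$got" ] && [ "$got" = "$want" ]
}

# usage (not run here; <email> is a placeholder):
#   gpg --homedir /var/ossec/.gnupg --with-colons --fingerprint <email> \
#     | fingerprint_matches "13300901348A985115679165FB137A633FD1EB4C"
```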
# and I sent him an email asking him to confirm that it's working
<pre>
Hey Tom,

Can you please confirm that you're now receiving alerts from wazuh?

Wazuh is our HIDS (Host-Based Intrusion Detection System). It's a fork of the HIDS and FIM (File Integrity Monitor) OSSEC. Because it sometimes sends sensitive information (eg diffs of config files with passwords), it's important that we encrypt its email notifications end-to-end with PGP.

And because someone who compromises the server could "clean up" after themselves, these (off-server) alerts are critical to post-compromise investigations.

For more info, see:

* https://wiki.opensourceecology.org/wiki/Wazuh
* https://en.wikipedia.org/wiki/OSSEC
* https://documentation.wazuh.com/current/getting-started/index.html

Out-of-the-box, Wazuh has a ton of features, but probably where we use it the most is its ingestion of apache's mod_security WAF and its tie-in to Wazuh's Active Response. If an IP is found doing something bad (eg multiple consecutive 403 responses, such as a brute-force attack on wordpress [or ssh]), then the IP will get temp blocked by the firewall for 10 minutes. If it does it again shortly after the ban is lifted, it'll be banned for 12 hours. If again, 1 day. Then 2 days. Then 4 days. And the max ban for 5x repeat offenses is 8 days

* https://github.com/OpenSourceEcology/ansible/blob/master/hetzner3/roles/maltfield.wazuh/templates/ossec.conf.j2#L256-L271

It also has rootkit detection, and lots of other useful alerts that "just work" out of the box.

Please confirm that you're now receiving encrypted wazuh alerts.

Thank you,
</pre>
# I tried to add Tom to our ops google groups email list, but it said I wasn't allowed to add members outside of our google workspace
<pre>
An error occurred
1 user is outside of your organization. Based on your group or organization settings, you can only add organization users to this group. Contact your group owner or domain administrator for help.
</pre>
# I checked our users list; it appears that Tom doesn't have an account @opensourceecology.org in gsuite
# I found the setting to change that here https://admin.google.com/ac/managedsettings/864450622151/GROUPS_SHARING_SETTINGS_TAB
## https://support.google.com/a/thread/63692725/
## https://support.google.com/a/answer/167097
# I checked the box that said "Group owners can allow external members"
## curiously the subline said "Organization admins can always add external members" – but I'm a damn org admin, and I couldn't add him :/
# I tried to add him again, but I got the same error
# this time I went to the group settings https://groups.google.com/a/opensourceecology.org/g/REDACTED/settings
# I found the "allow external members" and changed it from "off" to "on" and clicked "save changes"
## this wasn't possible before. So first I had to change the workspace-wide settings to allow me to change the groups-specific settings. now it's changed.
# this time it worked.
# I sent an email to our ops google group, asking Tom to reply if he saw it
# ...
# I checked-in on hetzner2 to make sure it rebooted this morning
# looks like the cron is set to reboot at 10:40 UTC every day, and – indeed – uptime says it's been online for a bit less than 13 hours. And its last boot time was today at 10:41:25
<pre>
[root@opensourceecology ~]# uptime
23:30:25 up 12:49, 7 users, load average: 1.02, 0.98, 0.74
[root@opensourceecology ~]# journalctl | head
-- Logs begin at Sat 2025-04-26 10:41:25 UTC, end at Sat 2025-04-26 23:30:26 UTC. --
Apr 26 10:41:25 localhost systemd-journal[129]: Runtime journal is using 8.0M (max allowed 3.1G, trying to leave 4.0G free of 31.2G available → current limit 3.1G).
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuset
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpu
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuacct
Apr 26 10:41:25 localhost kernel: Linux version 3.10.0-1160.119.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 4 14:43:51 UTC 2024
Apr 26 10:41:25 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
Apr 26 10:41:25 localhost kernel: e820: BIOS-provided physical RAM map:
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009c7ff] usable
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x000000000009c800-0x000000000009ffff] reserved
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /etc/cron.d/reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology ~]# date -u
Sat Apr 26 23:31:32 UTC 2025
[root@opensourceecology ~]#
</pre>
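# side note: this "did it reboot today?" check could be automated by reading <code>/proc/uptime</code>, whose first field is the uptime in seconds. A rough sketch; <code>rebooted_recently</code> is a made-up name and the 24-hour default is an assumption based on the daily cron above.

```shell
# Succeed if the uptime in seconds (first field of /proc/uptime-formatted
# text on stdin) is below the limit in $1 (default 86400 = 24 hours).
# Fails if stdin is empty.
rebooted_recently() {
  awk -v limit="${1:-86400}" '{ ok = ($1 < limit) } END { exit ok ? 0 : 1 }'
}

# usage (not run here):
#   rebooted_recently < /proc/uptime || echo "hetzner2 missed its daily reboot"
```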
# so it looks like we'll have ~2 minutes of downtime every day in the very early morning in the US. I can live with that.
# and grub clearly is fixed
# oh, also the RAID looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I asked Tom for his GitHub account profile username, so I can grant him write access to our OSE ansible repo
# I added Tom's new ssh key to his authorized_keys file on hetzner2
# I sent Tom an email asking to confirm his access to hetzner2

=Fri Apr 25, 2025=
# I woke up this morning and discovered the wiki was offline
# I tried to ssh into the server; it's not responding
# I figured I'd log into the hetzner wui, but – uhh – the credentials are in keepass and live on the server
# I had mitigated this by giving Marcin a copy of the keepass file on his veracrypt drive, but he has since changed the password (a month or two ago), and we don't have a new local copy
# I sent an email to Marcin asking him to login to hetzner wui and boot hetzner2. if it doesn't come-up, then I'll have to get the password from him so I can load it in the wui from a rescue disk
# oh, I did find the new hetzner password in my personal keepass
# I logged-in, and I found the server was listed as being on. But I can't ping it. I gave it an "automatic hardware reset" from the wui
# I'll give it a few minutes before trying the rescue system
# their rescue systems are much nicer for their cloud product than their dedicated server product
# it looks like I have two options
## rescue boot mode: where I'm given ssh access
## vnc
# the problem with the rescue boot is that – if this is a grub issue – I wouldn't be able to "see" the error
# I enabled VNC and gave the server a reboot
# I was able to connect via vnc, but it was the damn installation wizard for almalinux. I quit the installation, and the vnc session died.
# damn, I guess vnc won't let me see the boot process, after all
# instead I tried the "rescue system"
# that didn't work; I can't access ssh on either of the IP addresses
# the docs say to activate the rescue system and then reboot it; that's what I did https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system/
# this time I fully shut down the server, and then I enabled the rescue system (while it's off)
# I went back to the Reset tab, and it's still off. So I booted it
# somehow I was able to login from my ose vm using my personal ssh key, but with user root
<pre>
user@ose:~$ ssh -v root@138.201.84.223
OpenSSH_9.2p1 Debian-2+deb12u5, OpenSSL 3.0.15 3 Sep 2024
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 138.201.84.223 [138.201.84.223] port 22.
debug1: Connection established.
...
Linux rescue 6.12.19 #1 SMP Fri Mar 14 05:34:52 UTC 2025 x86_64

--------------------

Welcome to the Hetzner Rescue System.

This Rescue System is based on Debian GNU/Linux 12 (bookworm) with a custom kernel.
You can install software like you would in a normal system.

To install a new operating system from one of our prebuilt images, run 'installimage' and follow the instructions.

Important note: Any data that was not written to the disks will be lost during a reboot.

For additional information, check the following resources:
Rescue System: https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system
Installimage: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installimage
Install custom software: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installing-custom-images
other articles: https://docs.hetzner.com/robot

--------------------

Rescue System (via Legacy/CSM) up since 2025-04-25 17:24 +02:00

Hardware data:

CPU1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Cores 8)
Memory: 64153 MB (Non-ECC)
Disk /dev/sda: 250 GB (=> 232 GiB)
Disk /dev/sdb: 512 GB (=> 476 GiB)
Total capacity 709 GiB with 2 Disks

Network data:
eth0 LINK: yes
MAC: 90:1b:0e:94:07:c4
IP: 138.201.84.223
IPv6: 2a01:4f8:172:209e::2/64
Intel(R) PRO/1000 Network Driver

root@rescue ~ #
</pre>
# I was able to mount the root drive
<pre>
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@rescue ~ # mount /dev/md2 /mnt
root@rescue ~ # ls /mnt
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # ls /mnt/home
b2user crupp hart lberezhny marcin stagingsync wp
cmota Flipo jthomas maltfield not-apache tgriffing
root@rescue ~ #
</pre>
# I don't know what the point of this is; I can't fix it if I can't watch it boot and see what's breaking
# ok, at the bottom of the docs, hetzner lists another option = vKVM Rescue System https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/
# it specifically says that's for debugging boot issues
# last thing before I try that: I downloaded a local copy of the keepass files from hetzner2
<pre>
user@ose:~/tmp/hetzner2$ rsync -av --progress root@138.201.84.223:/mnt/etc/keepass ./etc-keepass-20250525
receiving incremental file list
created directory ./etc-keepass-20250525
keepass/
keepass/passwords.kdbx
46,142 100% 44.00MB/s 0:00:00 (xfr#1, to-chk=6/8)
keepass/passwords.kdbx.20170728.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#2, to-chk=5/8)
keepass/passwords.kdbx.20170804.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#3, to-chk=4/8)
keepass/passwords.kdbx.20190820.bak
33,726 100% 143.20kB/s 0:00:00 (xfr#4, to-chk=3/8)
keepass/passwords.kdbx.20190909.bak
34,238 100% 71.75kB/s 0:00:00 (xfr#5, to-chk=2/8)
keepass/passwords.kdbx.20250316.bak
45,406 100% 94.55kB/s 0:00:00 (xfr#6, to-chk=1/8)
keepass/passwords.kdbxs.20180525.bak
27,102 100% 56.31kB/s 0:00:00 (xfr#7, to-chk=0/8)

sent 161 bytes received 196,407 bytes 35,739.64 bytes/sec
total size is 195,794 speedup is 1.00
user@ose:~/tmp/hetzner2$

user@ose:~/tmp/hetzner2$ du -sh etc-keepass-20250525/keepass/*
48K etc-keepass-20250525/keepass/passwords.kdbx
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170728.bak
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170804.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190820.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190909.bak
48K etc-keepass-20250525/keepass/passwords.kdbx.20250316.bak
28K etc-keepass-20250525/keepass/passwords.kdbxs.20180525.bak
user@ose:~/tmp/hetzner2$
</pre>
# so this time was the same as the rescue system, except I chose "vKVM" instead of "Linux" in the "Operating System" dropdown
# strange, it gave me an error
<pre>
Public key authentication is not available for the selected operating system.
</pre>
# I unselected my ssh key, and chose "no key" instead
# it gave me a URL and a password. I booted the server, but the URL didn't load ("Unable to connect" error)
# ok, it took a few minutes and had a self-signed cert
# I bypassed the cert error, and entered the username and password into the basic auth popup. It failed! Could I really have been MITM'd?
# I immediately shut down the server from the wui, and I tried again.
# this time I was able to login – both from ssh and in the wui.
# as soon as it opened, I saw the error
<pre>
No more network devices

Booting from Hard Disk...
.
error: symbol 'grub_calloc' not found.
Entering rescue mode...
grub rescue>
</pre>
# I wonder if this is grub or grub2. I didn't have a "grub-install" binary before. I assumed it was an error in the hetzner docs when I ran "grub2-install" instead, which said it worked (there was a warning that the docs said was safe to ignore)
# curiously, the opposite is true for the ssh session in vkvm: I have grub-install but not grub2-install
<pre>
root@vKVM-rescue ~ # which grub-install
/usr/sbin/grub-install
root@vKVM-rescue ~ #
root@vKVM-rescue ~ # which grub2-install
root@vKVM-rescue ~ #
</pre>
# here's the docs in question https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# I don't want to fuck with the grub without first taking a backup of these disks. But, uh, it looks like I can't access the RAID from inside this vkvm setup
# yeah, that's one of the limitations listed for VKVM https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/#raid-controllers
<pre>
Configured units are passed through as SCSI devices to the VM. However it is not possible to access the controller. Please use the regular Hetzner Rescue System for this purpose.
</pre>
# I shutdown VKVM and booted it into the regular rescue mode
# it took a few minutes to get back into the old rescue system, but here I can use the raid
<pre>
root@rescue ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 3.4G 1 loop
sda 8:0 0 476.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
root@rescue ~ # mkdir /mnt/md1
root@rescue ~ # mkdir /mnt/md2
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # mount /dev/md2 /mnt/md2
root@rescue ~ #
</pre>
# I created a dir for these backups
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ #

root@rescue ~ # mkdir /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chown root:root /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chmod 0700 /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ #
</pre>
# first I made a backup from the raid
<pre>
root@rescue ~ # rsync -av --progress /mnt/md1 /mnt/md2/var/tmp/20250425-grub-fail/md1.$(date "+%Y%m%d_%H%M%S")
...
md1/grub2/locale/zh_TW.mo
30,882 100% 31.38kB/s 0:00:00 (xfr#345, to-chk=0/355)
md1/lost+found/

sent 399,450,301 bytes received 6,709 bytes 159,782,804.00 bytes/sec
total size is 399,330,989 speedup is 1.00
root@rescue ~ #
</pre>
# then I figured I'd make a backup of the two disk partitions directly, but I couldn't even mount them
<pre>
root@rescue ~ # umount /mnt/md1
root@rescue ~ # mkdir /mnt/sda2
root@rescue ~ # mkdir /mnt/sdb2
root@rescue ~ # mount /dev/sda2 /mnt/sda2
mount: /mnt/sda2: unknown filesystem type 'linux_raid_member'.
dmesg(1) may have more information after failed mount system call.
root@rescue ~ #
</pre>
# I tried this command (from the docs), which I skipped before because it said that the next command (grub-install) was enough; sure enough, it didn't work https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
<pre>
root@rescue ~ # grub-mkdevicemap -n
grub-mkdevicemap: error: cannot open /boot/grub/device.map.
root@rescue ~ #
</pre>
# I investigated this before, and I thought I decided we're using grub2, not grub1
<pre>
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # ls /mnt/md1/
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue ~ #
</pre>
# oh, shit, even the grub-install command is v2 https://askubuntu.com/questions/107486/how-to-know-the-version-of-grub
<pre>
root@rescue ~ # grub-install --version
grub-install (GRUB) 2.06-13+deb12u1
root@rescue ~ #
</pre>
# ok, this indicates we're not using lilo https://askubuntu.com/questions/24459/how-do-i-find-out-which-boot-loader-i-have
<pre>
root@rescue ~ # ls /mnt/md2/etc/ | grep lilo
root@rescue ~ #
</pre>
# we can dd straight from the disk to read the MBR. And, yeah, it appears we are using grub via MBR .. and this info is stored on the disks, not the raid
<pre>
root@rescue ~ # dd if=/dev/md1 bs=512 count=1 2>/dev/null | strings
root@rescue ~ #

root@rescue ~ # dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
214fb5736d1e5ad63e515dc2fffe44bd928cd8dab2c019dc11fb9fcaef5ea90dbf51f1ac507ab1cfbbe74ff
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
DA/jjF
root@rescue ~ #

root@rescue ~ # dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
root@rescue ~ #
</pre>
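# the check above can be wrapped in a small helper. This is only a sketch, and the name <code>has_grub_mbr</code> is hypothetical. It tests for the literal marker "GRUB" in the first 512-byte sector of a device or image file. It only shows that GRUB boot code is present in sector 0; it says nothing about whether that boot code matches the modules under /boot.

```shell
# Succeed if the first 512-byte sector of the given device (or image file)
# contains the literal string "GRUB". LC_ALL=C keeps grep from tripping over
# bytes that are not valid text in the current locale.
has_grub_mbr() {
  dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}
```

# for example, <code>for d in /dev/sda /dev/sdb; do has_grub_mbr "$d" && echo "$d: GRUB in MBR"; done</code>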
# idk what to do; I tried the grub-install again, but it gives me this error
<pre>
root@rescue ~ # grub-install /dev/sda
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #

root@rescue ~ # grub-install /dev/sdb
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #
</pre>
# I tried creating a chroot of our real raid disks first
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # umount /mnt/md1
root@rescue ~ # chroot-prepare /mnt/md2
root@rescue ~ # chroot /mnt/md2
root@rescue / # ls /boot
root@rescue / # mount /dev/md1 /boot
root@rescue / # ls /boot
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue / #
</pre>
# I then tried the grub install again
<pre>
root@rescue / # grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #

root@rescue / # grub2-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #
</pre>
# I exited the chroot and shutdown the rescue system
# I activated the VKVM rescue system, and booted it again
# when I connected to the KVM wui, I was shown a password prompt. So I think booting works!
# I rebooted it from the ssh
# and now I can ssh into the real system
<pre>
user@personal:~$ autossh opensourceecology.org
Last login: Thu Apr 24 23:12:44 2025 from 146.70.199.15
[maltfield@opensourceecology ~]$
</pre>
# and now the wiki loads too
# I did another reboot test
<pre>
[maltfield@opensourceecology ~]$ sudo su -
[sudo] password for maltfield:
Last login: Thu Apr 24 16:25:15 UTC 2025 on pts/0
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
ssh: connect to host opensourceecology.org port 32415: Connection refused
Last login: Fri Apr 25 16:29:21 2025 from 185.204.1.184
[maltfield@opensourceecology ~]$
</pre>
# idk, my takeaway is that one or more of these assumptions is correct
## grub-install needs to be run *after* the RAID sync is finished
## grub-install needs to be run on *both* the new *and* the old disk
## grub-install needs to be run inside a chroot on the rescue system
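# a sketch that simply combines all three assumptions above, since the log doesn't establish which one mattered. It uses only the commands that appear in this entry (<code>mount</code>, <code>chroot-prepare</code>, <code>chroot</code>, <code>grub2-install</code>) with the device names from this host. The helper names <code>run</code> and <code>reinstall_grub</code> and the <code>DRY_RUN</code> switch are made up for illustration. By default it only prints the commands; it does not wait for the RAID sync, so check /proc/mdstat first.

```shell
# Echo the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Reinstall GRUB2 to both RAID1 member disks from inside a chroot of the
# installed system, run from the rescue system.
reinstall_grub() {
  root_md="${1:-/dev/md2}"   # array holding / on this host
  boot_md="${2:-/dev/md1}"   # array holding /boot on this host
  target="${3:-/mnt/md2}"    # mount point for the root array

  run mkdir -p "$target"
  run mount "$root_md" "$target"
  run chroot-prepare "$target"
  run chroot "$target" mount "$boot_md" /boot
  for disk in /dev/sda /dev/sdb; do
    run chroot "$target" grub2-install "$disk"
  done
}
```

# calling <code>reinstall_grub</code> prints the six commands for review; <code>DRY_RUN=0 reinstall_grub</code> would execute them.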
# anyway, we're stable again
# I got an email from Marcin saying Tom could help with the migrations. I sent him some wiki articles to get caught-up
<pre>
Hey Tom,

I'll try to get you ssh access on hetzner2 soon. In the meantime, please read the following articles:

* https://wiki.opensourceecology.org/wiki/Hetzner2

* https://wiki.opensourceecology.org/wiki/Hetzner3

I've started preparing draft "change tickets" for migrating each of the websites from hetzner2 to hetzner3. Note that some of these are not fully tested, so you'll want to execute them manually and make corrections as-needed.

Please also read-through these:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_store_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_microfactory_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_fef

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_oswh

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_phplist_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_wiki_to_hetzner3

(There's also one CHG for the forum that I think needs to be made)

The next item TODO is to finish the migration plan for these websites:

1. www.opensourceecology.org (osemain)
2. www.openbuildinginstiture.org (obi)

We decided that there would be 2 simultaneous versions of obi:

1. A static site scraped with curl on hetzner3
2. The (broken) dynamic wordpress site on hetzner3

And we decided that there would be 3 simultaneous versions of osemain:

1. The live/current site on hetzner2
2. A static site scraped with curl on hetzner3
3. The (broken) dynamic wordpress site on hetzner3

To have multiple sites with the same domain on the same server, we bought a second IPv4 address (FeF isn't setup with IPv6). This week I just finished updating the hetzer3 server to persist this new IPv4 address.

The next item for you would be to update our ansible to push out new vhosts (in nginx, varnish, and apache) for the static sites that are bound to the second IPv4 address using the same hostname.

Please read-through the ansible playbook and roles (most importantly for nginx, varnish, and apache) to understand how they're provisioned

* https://github.com/OpenSourceEcology/ansible

Since you have access to hetzner3, you can also poke around (read-only please) the configs for these three web services to understand how ansible provisions them.

Once you've updated and pushed-out the new vhosts with ansible, you'll need to update the migration plan

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_obi_to_hetzner3
* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

And then you'll want to go-through each migration plan to create a temp "snapshot" of all the sites on hetzner3, where Marcin & Catarina can do a thorough verification of each site (by updating /etc/hosts) before we do the *real* migration -- which is nearly the same as the "snapshot" except we actually migrate DNS.

Please let me know when you've finished reading the above articles.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/24/25 22:16, REDACTED@tutanota.com wrote:
> Michael;
>
> I need to reset my ssh key on hetzner2. Can you use the same as on 3 or best to generate a new one?
>
> I spoke with Marcin and I think I can help with the admin, as I have time available.
>
> Can you give a run-down of its status and what needs to be done for completing the migration to hetzner3?
> --
> Tom Griffing
</pre>

=Thu Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner 2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
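# the "is there a backup from today" check above can be made pass/fail. This is a sketch only, and the function name <code>has_backup_for</code> is hypothetical. It assumes the listing has the shape shown above (size, then file name) and succeeds only if an entry matching the date stamp has a non-zero size.

```shell
# Read a listing of "<size> <name>" lines on stdin and succeed if at least
# one entry whose name contains the given date stamp has a non-zero size.
has_backup_for() {
  awk -v d="$1" '
    index($2, d) && $1 > 0 { ok = 1 }
    END { exit ok ? 0 : 1 }
  '
}
```

# for example, <code>${RCLONE} ls "b2:${B2_BUCKET_NAME}" | has_backup_for "$(date "+%Y%m%d")" || echo "no backup for today"</code>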
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
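# the CHG typo noted above (md1 listed for sdb3) is the kind of mistake that comes from hand-copying near-identical lines. A sketch of generating the fail-then-remove commands from a single mapping instead; the function name <code>fail_and_remove_cmds</code> is made up, and the mapping (md0=partition 1, md1=2, md2=3) is the one on this host. It only prints the commands.

```shell
# Print the mdadm commands that first mark each partition of a disk as
# faulty and then hot-remove it from its array. Nothing is executed.
fail_and_remove_cmds() {
  disk="${1:-/dev/sdb}"
  for pair in md0:1 md1:2 md2:3; do
    md="/dev/${pair%%:*}"        # text before the colon: the array
    part="${disk}${pair##*:}"    # text after the colon: the partition number
    echo "mdadm --manage $md --fail $part"
    echo "mdadm $md -r $part"
  done
}
```

# <code>fail_and_remove_cmds /dev/sdb</code> prints six lines that can be reviewed before being pasted into a CHG.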
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shutdown, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
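# since device names (sda/sdb/sdc) can shift after a hotswap, as they did here, the serial number is the safer way to identify a disk. A sketch of pulling it out of the udevadm output shown above; the function names <code>serial_short</code> and <code>disk_matches_serial</code> are hypothetical.

```shell
# Read `udevadm info --query=property` output on stdin and print the value
# of ID_SERIAL_SHORT, if present.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Succeed if the udevadm output on stdin belongs to the disk with the
# expected short serial given as $1.
disk_matches_serial() {
  [ "$(serial_short)" = "$1" ]
}
```

# for example, <code>udevadm info --query=property --name sdc | disk_matches_serial 154410FA336C && echo "sdc is the original disk"</code>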
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
# anyway, I tried to backup the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
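# a sketch of detecting this state up front rather than discovering it from a failed mkdir. It parses /proc/mounts-formatted text (device, mount point, type, options) and succeeds if the given mount point has the exact option "ro". The function name <code>is_readonly_mount</code> is made up for illustration.

```shell
# Read /proc/mounts-formatted text on stdin and succeed if the mount point
# given as $1 has "ro" as one of its comma-separated options. An option
# such as errors=remount-ro does not count.
is_readonly_mount() {
  awk -v mp="$1" '
    $2 == mp {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "ro") found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

# for example, <code>is_readonly_mount / < /proc/mounts && echo "root filesystem is read-only"</code>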
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shutdown all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
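# the <code>[U_]</code> markers in the mdstat output above are what show each mirror running on one member. A small sketch for pulling the degraded arrays out of that output automatically; <code>degraded_arrays</code> is a hypothetical name, and it only understands the mdstat layout shown in this log

```shell
# Print the name of every md array whose member-status field (e.g. [U_])
# contains an underscore, meaning at least one mirror member is missing.
# Reads text in /proc/mdstat format on stdin.
# degraded_arrays is a hypothetical helper name.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /blocks/ && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```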
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
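# an empty dump for the blank replacement disk is expected, but an empty dump for the source disk would be a reason to stop before writing anything. A sketch of that guard, assuming the two file names used above; <code>empty_files</code> is a made-up helper

```shell
# Print the path of every given file that is missing or zero bytes long.
# empty_files is a hypothetical helper name.
empty_files() {
    for f in "$@"; do
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}

# Usage (paths as in the log above; abort if the source dump is empty):
#   if [ -n "$(empty_files "${chg_dir}/sda_parttable_mbr.bak")" ]; then
#       echo "source partition table dump is empty; aborting" >&2
#   fi
```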
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
# that looked good, other than the warning that the DOS MBR will not boot this disk; I'll check later what LILO is and whether this matters for grub on the RAID
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
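# the last mdstat output above shows one array recovering while the other two sit at <code>resync=DELAYED</code>; md queues recoveries for arrays that share the same physical disks. A sketch that summarizes each array's state from that output; <code>md_sync_states</code> and the state words it prints are invented for illustration

```shell
# Summarize each md array as "idle", "delayed", or "syncing <percent>",
# from text in /proc/mdstat format on stdin.
# "idle" only means no recovery is running or queued; it does not by
# itself mean the array is healthy.
# md_sync_states is a hypothetical helper name.
md_sync_states() {
    awk '
        function emit() { if (name != "") print name, state }
        /^md[0-9]+ :/ { emit(); name = $1; state = "idle" }
        /resync=DELAYED/ { state = "delayed" }
        /(recovery|resync) =/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) state = "syncing " $i
        }
        END { emit() }
    '
}

# Usage on a live system:
#   md_sync_states < /proc/mdstat
```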
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
# I tried to go ahead and install grub on the new disk, but I got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
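# the installer name differs between distros, as the tab-completion above shows. A sketch of picking whichever name exists on the PATH rather than hard-coding one; <code>first_available</code> is a made-up helper, and it only checks that the command exists, not that it is the right tool for this boot setup

```shell
# Print the first of the given command names that exists on PATH;
# return non-zero if none of them do.
# first_available is a hypothetical helper name.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Usage:
#   installer=$(first_available grub2-install grub-install) || echo "no grub installer found" >&2
```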
# looks like it should be 'grub2-install'; I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# calculations at that speed would be 250*1024/58 = 4,413.79 seconds / 60 = 73.6 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
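# the back-of-envelope sync estimates in this entry can be reproduced with a couple of one-liners. This is a sketch: both function names are made up, and the second one assumes the mdstat progress line counts blocks in KiB and speed in K/sec

```shell
# Minutes needed to copy size_gib GiB at rate_mib_s MiB/s.
# sync_minutes is a hypothetical helper name.
sync_minutes() {
    awk -v g="$1" -v r="$2" 'BEGIN { printf "%.1f\n", g * 1024 / r / 60 }'
}

# Minutes remaining for an md recovery, from the three numbers on the
# /proc/mdstat progress line: blocks done, blocks total, speed in K/sec.
# md_eta_minutes is a hypothetical helper name.
md_eta_minutes() {
    awk -v d="$1" -v t="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (t - d) / s / 60 }'
}

sync_minutes 32 1      # 546.1  (swap at ~1 MB/s, about 9 hours)
sync_minutes 250 1     # 4266.7 (whole disk at ~1 MB/s, about 71 hours)
sync_minutes 250 58    # 73.6   (whole disk at ~58 MB/s)
md_eta_minutes 14334208 33521664 47845   # 6.7 (mdstat itself showed finish=6.6min)
```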
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're about 10% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
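# a sketch of reducing that <code>smartctl -H</code> output to a single word, e.g. for a monitoring check. It only handles the ATA-style "overall-health self-assessment" line shown above, and <code>smart_health</code> is a made-up name

```shell
# Print PASSED, FAILED, or UNKNOWN from "smartctl -H" output on stdin.
# Only the ATA-style result line shown in this log is recognized.
# smart_health is a hypothetical helper name.
smart_health() {
    awk '
        /overall-health self-assessment test result:/ {
            if ($0 ~ /PASSED/) result = "PASSED"
            else if ($0 ~ /FAILED/) result = "FAILED"
        }
        END { print (result == "" ? "UNKNOWN" : result) }
    '
}

# Usage on a live system (needs root):
#   smartctl -H /dev/sda | smart_health
```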
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
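# a sketch for pulling single attributes out of the two tables above, instead of eyeballing them. It prints the normalized VALUE and the RAW_VALUE for one attribute ID; reading attribute 202 as "VALUE = percent remaining, RAW = percent used" is my interpretation for these particular drives and is vendor-specific. The same tables carry Power_On_Hours (ID 9), which converts to years by dividing by 24 and 365. Both function names are made up

```shell
# Print "<normalized VALUE> <RAW_VALUE>" for one attribute ID from
# "smartctl -A" output on stdin. Expected columns:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
# smart_attr is a hypothetical helper name.
smart_attr() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $4 + 0, $10 }'
}

# Convert power-on hours to years (365-day years).
# hours_to_years is a hypothetical helper name.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 78516   # 8.96 (old /dev/sda)
hours_to_years 52083   # 5.95 (replacement /dev/sdb)

# Usage on a live system (needs root):
#   smartctl -A /dev/sdb | smart_attr 202
```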
# Shame. I was hoping for at least something <50%. Well, I wonder how long that remaining 4% will last us :/
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle counts are crazy low; our old disk was only power-cycled 50 times in all those years, and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended that we set up a daily reboot of hetzner2 at 10:40 (a good buffer after the backups finish uploading)
# I asked Marcin if he'd like me to set up a daily reboot at 10:40
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to set up the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
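# a sketch for that follow-up check on the reboot cron: given the boot time in the <code>YYYY-MM-DD HH:MM:SS</code> format that <code>uptime -s</code> prints, test whether it falls shortly after the scheduled time. This assumes the server clock is UTC (as the <code>date</code> output earlier suggests); <code>booted_near</code> and the 10-minute window are made up for illustration

```shell
# Succeed if a boot timestamp ("YYYY-MM-DD HH:MM:SS") falls within
# `window` minutes after the scheduled HH:MM on the same day.
# Arguments: timestamp, scheduled HH:MM, window in minutes.
# booted_near is a hypothetical helper name.
booted_near() {
    awk -v ts="$1" -v sched="$2" -v window="$3" 'BEGIN {
        split(ts, parts, " ")
        split(parts[2], t, ":")
        split(sched, s, ":")
        delta = (t[1] * 60 + t[2]) - (s[1] * 60 + s[2])
        exit !(delta >= 0 && delta <= window)
    }'
}

# Usage on a live system:
#   booted_near "$(uptime -s)" 10:40 10 && echo "rebooted on schedule"
```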
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never set up for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add the verbose flag, but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in `ip a`
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
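# the verbose run above shows <code>ip addr add</code> answering "File exists", and <code>ip a</code> shows the address already on the interface. A minimal sketch of checking for that condition before adding an address; the helper name <code>addr_present</code> is hypothetical, and it only parses the one-line-per-address format of <code>ip -o -4 addr show</code>

```shell
# addr_present ADDRESS
# Reads `ip -o -4 addr show` output on stdin and succeeds if ADDRESS
# (given without a prefix length) is already assigned.
addr_present() {
    awk -v want="$1" '
        # in -o output the 3rd field is the family, the 4th is ADDRESS/PREFIX
        $3 == "inet" { split($4, a, "/"); if (a[1] == want) found = 1 }
        END { exit !found }
    '
}
```

# for example: <code>ip -o -4 addr show dev enp0s31f6 | addr_present 144.76.164.195 && echo "already assigned"</code>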
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find these hetzner-specific docs on adding a second IP; they're totally different from what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
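# the approach that worked hangs the extra address off the existing stanza as <code>up</code>/<code>down</code> hook commands instead of defining a second <code>iface</code> stanza. A minimal sketch for generating that pair of lines for any additional address; the helper name <code>extra_ip_lines</code> is hypothetical, and the /32 prefix simply mirrors the diff above

```shell
# extra_ip_lines ADDRESS DEVICE
# Prints the up/down hook lines for one additional /32 address,
# in the same form as the lines added to /etc/network/interfaces above.
extra_ip_lines() {
    printf 'up ip addr add %s/32 dev %s\n' "$1" "$2"
    printf 'down ip addr del %s/32 dev %s\n' "$1" "$2"
}
```

# <code>extra_ip_lines 144.76.164.195 enp0s31f6</code> prints the two lines from the diff; they still need to be placed by hand inside the matching <code>iface</code> stanza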
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I set up email alerts for the RAID, but not for SMART.
## on hetzner2, we had no RAID errors, but we did have SMART errors. I guess if the disk eventually failed badly enough to break RAID replication, we would have gotten alerts. But it would be good to get alerts *before* that happens.
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the file on hetzner3
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being executed as shell instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
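# a shebang only counts when <code>#!</code> are the very first two bytes of the file. A minimal sketch of a check that could run against templated plugins before they get pushed out; the helper name <code>shebang_first</code> is hypothetical and not part of the ansible repo

```shell
# shebang_first FILE
# Succeeds only if the first two bytes of FILE are "#!", i.e. nothing
# (not even a comment or blank line) precedes the interpreter line.
shebang_first() {
    [ "$(head -c 2 "$1")" = '#!' ]
}
```

# for example: <code>shebang_first /usr/share/munin/plugins/varnish5_ || echo "shebang is not on line 1"</code>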
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
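# <code>smart_</code> is a wildcard plugin, so the part of the symlink name after <code>smart_</code> selects the device. A minimal sketch that creates the links for a list of devices in one go; the helper name <code>link_smart_plugins</code> is hypothetical

```shell
# link_smart_plugins PLUGIN DEST_DIR DEVICE...
# Creates DEST_DIR/smart_<DEVICE> -> PLUGIN for every DEVICE given.
link_smart_plugins() {
    plugin=$1
    dest=$2
    shift 2
    for dev in "$@"; do
        # -f replaces an existing link; -n stops ln from following it first
        ln -sfn "$plugin" "$dest/smart_$dev"
    done
}
```

# for example: <code>link_smart_plugins /usr/share/munin/plugins/smart_ /etc/munin/plugins sda sdb</code>, followed by a restart of munin-node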
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
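# the touch/chown/chmod/cp sequence above makes sure the backup is never readable by other users, even briefly. A minimal sketch of a similar result in one step using a restrictive umask; the helper name <code>backup_private</code> is hypothetical, and unlike the sequence above it does not change the owner

```shell
# backup_private SRC DEST
# Copies SRC to DEST so that a newly created DEST has no group/other
# permission bits. The body runs in a subshell, so the umask change does
# not leak into the calling shell. If DEST already exists, cp keeps its
# existing mode.
backup_private() (
    umask 077
    cp -- "$1" "$2"
)
```

# for example: <code>backup_private zzz-myconf /var/tmp/munin-zzz-myconf.20250420</code>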
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits without an error, but the only value it reports is U (munin's marker for "unknown"). no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
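# since munin reports <code>U</code> for values it could not determine, a plugin that "works" can still be charting nothing. A minimal sketch for spotting that from the command line; the helper name <code>unknown_fields</code> is hypothetical

```shell
# unknown_fields
# Reads `munin-run <plugin>` output on stdin and prints the name of every
# field whose value is "U" (unknown), without the ".value" suffix.
unknown_fields() {
    awk '$2 == "U" { sub(/\.value$/, "", $1); print $1 }'
}
```

# for example: <code>munin-run smart_nvme0n1 | unknown_fields</code> would print <code>smartctl_exit_status</code> for the output above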
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we setup for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the files
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even if I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
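# the excerpt above contains one delivery with <code>status=sent</code> and one with <code>status=deferred</code>. A minimal sketch for summarising those outcomes over a longer stretch of log; the helper name <code>delivery_summary</code> is hypothetical

```shell
# delivery_summary
# Reads postfix log lines on stdin and prints "<status> <count>" for every
# delivery status seen (sent, deferred, bounced, ...), sorted by status.
# Lines without a status= field are ignored.
delivery_summary() {
    sed -n 's/.*[ ,]status=\([a-z]*\).*/\1/p' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

# for example: <code>journalctl -u postfix@- --since today | delivery_summary</code>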
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if SMART info was now being charted
## on hetzner2, there are no changes. I don't see any charts related to SMART
## on hetzner3, there are two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; each has only 1 value (smartctl_exit_status), and it's "nan" on all of the time charts. This is expected, since the plugin can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2, where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
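# for future reference, a minimal sketch (not something I ran on the server) of how to count how many delivery attempts Gmail deferred with this 4.7.28 reply. The function name is made up for illustration, and <code>/var/log/maillog</code> in the usage comment is an assumed path; it reads the log on stdin and relies on each postfix log entry being a single line.

```shell
#!/usr/bin/env bash

# Count postfix/smtp log lines that carry Gmail's 421-4.7.28 rate-limit reply.
# Reads log text on stdin so it can be fed from any file or pipeline.
# (count_gmail_rate_limits is a hypothetical helper name.)
count_gmail_rate_limits() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
  grep -c 'postfix/smtp.*421-4\.7\.28' || true
}

# Assumed usage (the log path is an assumption, adjust as needed):
#   count_gmail_rate_limits < /var/log/maillog
```

## a count that keeps climbing after a campaign would suggest the queue is still retrying faster than Gmail will accept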
# Marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# woah, I'm tailing the mail log and I just saw probably hundreds or thousands of emails being attempted. phpList is *supposed* to send in small batches, but I wonder if, once a send fails and gets added to the queue, the re-send happens without batching..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
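# the send-time arithmetic above can be sketched as a small helper. This is only a rough estimate under the assumption that the throttle pause dominates (it ignores the time each send itself takes); the function name is made up for illustration.

```shell
#!/usr/bin/env bash

# Hours (rounded up) needed to send a campaign when phpList pauses
# throttle_seconds between messages. Hypothetical helper; it ignores the
# time spent on each individual send.
campaign_hours() {
  local recipients=$1 throttle_seconds=$2
  local total_seconds=$(( recipients * throttle_seconds ))
  # Integer ceiling division by 3600 seconds per hour.
  echo $(( (total_seconds + 3599) / 3600 ))
}

campaign_hours 14111 1   # prints 4 (today's campaign size)
campaign_hours 15000 1   # prints 5
```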
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as some layered security for ssh (to force keys), but that kinda breaks sudo, which requires the password. I could make sudo NOPASSWD, but in general I think it's safer to have a user password set (and still have password logins disabled in ssh) than to set sudoers to NOPASSWD
# disabled passwords are marked with a '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
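# a minimal sketch for auditing the second field of shadow-format entries. Per shadow(5), an empty password field means no password is required for that login, which is different from a locked field (one starting with '!' or '*'), so an account left in the empty state would normally get a real password set with <code>passwd</code> soon after. The function name is made up for illustration; it reads shadow-format lines on stdin.

```shell
#!/usr/bin/env bash

# Print "<user> <state>" for each shadow-format line on stdin, where state is:
#   empty  - password field is blank (no password required, see shadow(5))
#   locked - password field starts with '!' or '*'
#   set    - anything else (normally a password hash)
# (shadow_password_state is a hypothetical helper name.)
shadow_password_state() {
  awk -F: '
    $2 == ""     { print $1, "empty";  next }
    $2 ~ /^[!*]/ { print $1, "locked"; next }
                 { print $1, "set" }
  '
}

# Sample entries in the same format as the tail output above.
printf '%s\n' \
  'marcin:!:20133:0:99999:7:::' \
  'tgriffing::20133:0:99999:7:::' | shadow_password_state
```

## assumed usage on a real host (needs root): <code>shadow_password_state < /etc/shadow</code>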
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
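# the MBR-vs-GPT check above can be sketched as a filter over fdisk output. The function name is made up for illustration; the pattern also accepts the "Disklabel type" spelling that newer fdisk versions print.

```shell
#!/usr/bin/env bash

# Read `fdisk -l <device>` output on stdin and print the partition scheme.
# "dos" means an MBR partition table; "gpt" means GPT.
# (partition_scheme is a hypothetical helper name.)
partition_scheme() {
  awk -F': *' '
    /^Disk ?label type:/ {
      if ($2 == "dos")      print "MBR"
      else if ($2 == "gpt") print "GPT"
      else                  print "unknown (" $2 ")"
    }
  '
}

# Sample line copied from the output above.
printf 'Disk label type: dos\n' | partition_scheme   # prints MBR
```

## assumed usage on a real host (needs root): <code>fdisk -l /dev/sda | partition_scheme</code>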
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but better to do this early in Germany time..
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode (3) that we can try before the modes become destructive
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set innodb_force_recovery back down to 2 and added a line setting innodb_purge_threads to 0; I'll try that when the restart is no longer blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
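# for the record, a minimal sketch of the combination that let the server start: <code>innodb_force_recovery</code> at 2 plus <code>innodb_purge_threads</code> at 0. The function name and the <code>[mysqld]</code> section header are assumptions for illustration (the diff for this step isn't shown above); the function only prints the fragment and does not touch my.cnf.

```shell
#!/usr/bin/env bash

# Print a my.cnf fragment for starting InnoDB in forced-recovery mode.
# $1 is the innodb_force_recovery level (defaults to 2, the level that worked).
# (recovery_fragment is a hypothetical helper; placing the settings under
# [mysqld] is an assumption.)
recovery_fragment() {
  local level=${1:-2}
  printf '[mysqld]\n'
  printf '# temporary: remove both lines once the data has been dumped\n'
  printf 'innodb_force_recovery = %s\n' "$level"
  printf 'innodb_purge_threads = 0\n'
}

recovery_fragment 2
```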
# the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
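# a minimal sketch for boiling the mysqlcheck log down to a list of databases that have at least one table without an "OK" status. It assumes the layout seen above, where a healthy table is reported as "db.table OK" on one line and a flagged table name sits on a line by itself, followed by note/warning lines. The function name is made up for illustration.

```shell
#!/usr/bin/env bash

# Read mysqlcheck output on stdin and print each database (once) that has a
# table listed without a status on the same line.
# (flagged_databases is a hypothetical helper name.)
flagged_databases() {
  awk 'NF == 1 && /^[^.]+\.[^.]+$/ { split($0, parts, "."); print parts[1] }' \
    | LC_ALL=C sort -u
}

# Sample lines in the same layout as the log above.
printf '%s\n' \
  '3dp_db.wp_posts OK' \
  'cacti_db.poller_output' \
  "note : The storage engine for the table doesn't support check" \
  'oswh_db.wp_options' \
  "warning : 1 client is using or hasn't closed the table properly" \
  | flagged_databases
```

## assumed usage: <code>flagged_databases < mysqlcheck.20250418.log</code>. Run against the grep output above, this kind of filter would also list osemain_s_db, since its wp_options table carries the same warning.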
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_s_db |
| osewiki_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
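# for the restore step, a minimal sketch of pulling a single database out of an all-databases dump. This assumes the backup is a plain-text <code>mysqldump --all-databases</code> file, which marks each section with a <code>-- Current Database: `name`</code> comment; the function name and the file names in the usage note are hypothetical. Session-level SET statements from the top of the dump are not included in the extracted section.

```shell
#!/usr/bin/env bash

# Read an all-databases mysqldump on stdin and print only the section that
# belongs to the database named in $1, using the "-- Current Database:"
# comment lines as section boundaries.
# (extract_database is a hypothetical helper name.)
extract_database() {
  awk -v db="$1" '
    /^-- Current Database: `/ { keep = ($0 == ("-- Current Database: `" db "`")) }
    keep { print }
  '
}
```

## assumed usage (file names are hypothetical): <code>zcat backup.sql.gz | extract_database cacti_db > cacti_db.sql</code>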
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time just as 1. This time it did start, but this loop gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
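# the status above showed "active (running)" while mysqld was crash-looping underneath mysqld_safe, so a probe loop is a more honest check than systemctl. A minimal sketch in POSIX shell; the function name <code>wait_until_up</code> is made up for illustration, and the probe is whatever command gets passed in

```shell
# Retry a probe command until it succeeds or the attempts run out.
# Usage: wait_until_up ATTEMPTS DELAY_SECONDS PROBE [ARGS...]
# (hypothetical helper; not part of MariaDB or systemd)
wait_until_up() {
  _attempts="$1"
  _delay="$2"
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    # The probe's output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$_delay"
    _i=$((_i + 1))
  done
  return 1
}
```

# for example, <code>wait_until_up 10 2 mysqladmin -uroot -p"$mysqlPass" ping</code> should only succeed once the server actually answers on its socket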
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
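# note that <code>grep -vi OK</code> also hides any line that merely contains "ok" somewhere (a table name, say), so an empty result is weaker evidence than it looks. A stricter filter, sketched under the assumption that healthy tables are reported as a line ending in the word OK; <code>non_ok_lines</code> is a hypothetical name

```shell
# Read mysqlcheck output on stdin and print every line that does not end in
# the status word "OK" (optionally followed by trailing whitespace).
non_ok_lines() {
  # grep exits 1 when nothing is left to print; treat that as success.
  grep -Ev '[[:space:]]OK[[:space:]]*$' || true
}
```

# usage would be something like <code>non_ok_lines < mysqlcheck.20250418_201559.log</code>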
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
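# since the import only fails partway in (error 1030 above), it may be worth checking the recovery level before piping a dump into mysql. A rough sketch that reads a my.cnf-style file on stdin; the name <code>force_recovery_level</code> is hypothetical, and which config file this host actually uses is an assumption

```shell
# Print the last uncommented innodb_force_recovery value found on stdin.
# Prints nothing when the option is absent or commented out.
force_recovery_level() {
  sed -n 's/^[[:space:]]*innodb_force_recovery[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1
}
```

# for example <code>force_recovery_level < /etc/my.cnf</code> (path assumed); any output other than empty or 0 suggests writes will be refused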
# well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, information_schema and performance_schema are missing from the dump (only `mysql` is in there)
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
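# the zgrep pipeline above can be reduced to just the database names, which makes it easier to compare against <code>show databases</code>. A minimal sketch, assuming the dump writes each CREATE DATABASE statement at the start of its own line with the name in backticks (as in the output above); <code>dump_databases</code> is a made-up name

```shell
# Read a mysqldump stream on stdin and print the name of each database it
# recreates, one per line. Anchoring at the start of the line keeps huge
# INSERT lines that happen to contain the phrase from matching.
dump_databases() {
  sed -n 's/^CREATE DATABASE [^`]*`\([^`]*\)`.*/\1/p'
}
```

# usage would be along the lines of <code>zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | dump_databases</code>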
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# mysql looks good
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't login
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
</pre>
# now I can see our three databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot`. That would be better for him than logging into the hetzner wui and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin access to *just* the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
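# the same kind of one-command rule could be generated rather than typed by hand. A sketch only: <code>single_command_rule</code> is a hypothetical helper, and it prints the rule instead of touching /etc/sudoers

```shell
# Print a sudoers rule letting USER run exactly one COMMAND as root without a
# password. sudoers expects a fully-qualified path for a command, so anything
# that is not an absolute path is rejected here.
single_command_rule() {
  _user="$1"
  _cmd="$2"
  case "$_cmd" in
    /*) ;;
    *) echo "command must be an absolute path: $_cmd" >&2; return 1 ;;
  esac
  printf '%s ALL = NOPASSWD: %s\n' "$_user" "$_cmd"
}
```

# the output of <code>single_command_rule marcin /sbin/reboot</code> matches the line added above; whatever file it ends up in should still be checked with <code>visudo -c</code> before relying on it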
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here are our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
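# for the CHG article it may be handy to pull out just the short serial per drive. A small sketch that parses the <code>udevadm info --query=property</code> output shown above; <code>serial_short</code> is a made-up name

```shell
# Read `udevadm info --query=property` output on stdin and print only the
# value of ID_SERIAL_SHORT.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}
```

# e.g. <code>udevadm info --query=property --name sda | serial_short</code>, repeated for sdb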
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to setup grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
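# a more direct check than inspecting /boot/efi: the kernel only exposes <code>/sys/firmware/efi</code> when the machine was booted via UEFI. A minimal sketch; <code>boot_mode</code> is a hypothetical helper, and its optional argument (an alternate root directory) exists only so the check can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print "uefi" if the running kernel exposes EFI
# firmware data under /sys/firmware/efi, otherwise print "bios".
# $1 (optional) is an alternate root prefix, for testing only.
boot_mode() {
    root="${1:-}"
    if [ -d "$root/sys/firmware/efi" ]; then
        echo uefi
    else
        echo bios
    fi
}

# Usage on the live box (not executed here):
#   boot_mode
```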
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
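# since the menuentry above finds its root via the md UUID rather than a disk serial, the only per-disk piece is the boot code that the installer writes to each drive. A dry-run sketch that only prints the commands; <code>grub_install_cmds</code> is a hypothetical helper, and the installer name is passed in because the hetzner doc says <code>grub-install</code> while CentOS 7 ships it as <code>grub2-install</code>.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT run) the bootloader install command
# for each RAID member.
# $1 is the installer binary name; the remaining args are device names.
grub_install_cmds() {
    installer="$1"
    shift
    for dev in "$@"; do
        printf '%s /dev/%s\n' "$installer" "$dev"
    done
}

# Usage (prints two lines; review them before running anything by hand):
#   grub_install_cmds grub2-install sda sdb
```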
# right, so if I understand this correctly: we're not updating grub. We're using 'grub-install' to write the grub boot code *to* the drive (its MBR); the config itself stays in /boot/grub2 on the raid. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with <code>sudo reboot</code>. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>
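# a small sketch for keeping an eye on this while the extra copies sit in /var/tmp/: it lists the mounts at or above a usage threshold. <code>full_mounts</code> is a hypothetical helper and the default of 80 is an arbitrary assumption, not an OSE policy.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` style output on stdin and print the
# mount point and Use% of every filesystem at or above the threshold.
# $1 (optional) is the threshold in percent; defaults to 80.
full_mounts() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the trailing percent sign
            if (use + 0 >= limit) print $6, use "%"
        }'
}

# Usage on the live box (not executed here):
#   df -P | full_mounts 80
```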

=Thr Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I hadn't touched OSE infra in 6 days
# the wiki is still on hetzner2, which is on EOL Cent, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
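# a sketch for checking the same thing without reading the whole response: it pulls the status code of the last response out of a redirect chain. <code>final_status</code> is a hypothetical helper; <code>curl -s -o /dev/null -w '%{http_code}' -L URL</code> gives the same number straight from curl.

```shell
#!/bin/sh
# Hypothetical helper: read `curl -siL` output on stdin and print the
# status code of the LAST response in the redirect chain.
final_status() {
    tr -d '\r' | awk '
        /^HTTP\/[0-9.]+ [0-9]+/ { code = $2 }   # remember each status line
        END { if (code != "") print code }'
}

# Usage (not executed here):
#   curl -siL wiki.opensourceecology.org | final_status
```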
# disk is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there are no new entries in the apache error log when I hit the site in real-time (bypassing the cache)
# there are also no new entries in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but a bunch gets written to the actual log file that the journal mentions, /var/log/mariadb/mariadb.log (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
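# most of that log is boilerplate; the two lines that matter are the assertion location and the failing condition. A sketch for pulling just those out; <code>innodb_assertions</code> is a hypothetical helper, and the log path in the usage line is the one mysqld_safe reports above.

```shell
#!/bin/sh
# Hypothetical helper: read a MariaDB/MySQL error log on stdin and print
# only the InnoDB assertion location and the failing condition.
innodb_assertions() {
    grep -E 'InnoDB: (Assertion failure|Failing assertion)'
}

# Usage on the live box (not executed here):
#   innodb_assertions < /var/log/mariadb/mariadb.log | tail -n 2
```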
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from 13 hours ago and one from 37 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
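# a sketch for working out backup age from the file name instead of doing the subtraction by hand. <code>backup_age_hours</code> is a hypothetical helper; it assumes the <code>YYYYMMDD_HHMMSS</code> stamp in the name is UTC (the server clock above is UTC) and it needs GNU <code>date</code>.

```shell
#!/bin/sh
# Hypothetical helper: print the age, in whole hours, of a backup named
# like daily_hetzner2_YYYYMMDD_HHMMSS.tar.gpg.
# $1 is the file name or path; $2 is "now" as a Unix epoch (seconds).
# Assumes the stamp in the name is UTC; requires GNU date (-d).
backup_age_hours() {
    when=$(basename "$1" | sed -n \
        's/.*_\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)_\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\..*/\1-\2-\3 \4:\5:\6/p')
    [ -n "$when" ] || return 1
    then_epoch=$(date -u -d "$when" +%s) || return 1
    echo $(( ($2 - then_epoch) / 3600 ))
}

# Usage on the live box (not executed here):
#   backup_age_hours /home/b2user/sync/daily_hetzner2_20250417_072001.tar.gpg "$(date +%s)"
```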
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week, because it has kept breaking for some months now (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
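# rsync reported success, but with disks under suspicion it may be worth confirming the copies byte-for-byte before relying on them. A sketch; <code>same_checksum</code> is a hypothetical helper, and hashing two ~21G files will take a few minutes each.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if both files have the same
# SHA-256 digest. Exits 2 if either file cannot be read.
same_checksum() {
    sum_a=$(sha256sum < "$1") || return 2
    sum_b=$(sha256sum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}

# Usage on the live box (not executed here):
#   same_checksum /home/b2user/sync/daily_hetzner2_20250417_072001.tar.gpg \
#                 /var/tmp/daily_hetzner2_20250417_072001.tar.gpg && echo identical
```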
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
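# a sketch of a guard for the "increase it until it starts" approach, so nobody goes past 3 by accident. <code>write_recovery_cnf</code> is a hypothetical helper that writes a small <code>[mysqld]</code> config fragment; the drop-in path in the usage line is an assumption about this box's <code>my.cnf</code> include setup and should be checked first.

```shell
#!/bin/sh
# Hypothetical helper: write a config fragment that sets
# innodb_force_recovery to the requested level.
# $1 = level (0-6), $2 = output file, $3 = "force" to allow levels above 3.
# Levels above 3 may corrupt data further, so they are refused by default.
write_recovery_cnf() {
    level="$1"
    out="$2"
    force="${3:-}"
    case "$level" in
        [0-6]) ;;
        *) echo "level must be 0-6" >&2; return 2 ;;
    esac
    if [ "$level" -gt 3 ] && [ "$force" != "force" ]; then
        echo "refusing level $level without 'force'" >&2
        return 1
    fi
    printf '[mysqld]\ninnodb_force_recovery = %s\n' "$level" > "$out"
}

# Usage (not executed here; the path is an assumption):
#   write_recovery_cnf 1 /etc/my.cnf.d/zz-force-recovery.cnf
#   systemctl start mariadb   # if it still fails, retry with 2, then 3
```

# whichever level works, the fragment has to be removed (or set back to 0) before the db goes back into normal service.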
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running MariaDB 5.5.68 (see the startup log above), so the nearest MySQL doc should be something like this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1; apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up, put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that dumps taken in recovery mode 1, 2, or 3 are relatively safe, and only the data on the individual corrupt pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# so a page is just the fixed-size block that InnoDB moves between disk and memory, not a whole table. So I *think* it should be OK to trust a dump where only some pages were corrupt (we'd only lose the rows on those pages)?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
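# a sketch for reducing that output to a single word per drive, which works for both the SATA output on hetzner2 and the NVMe output on hetzner3. <code>smart_health</code> is a hypothetical helper; it relies only on the "overall-health" line shown in the transcripts above.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -H` output on stdin and print only
# the overall-health verdict (e.g. "PASSED" or "FAILED!").
smart_health() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Usage on the live box (not executed here):
#   for d in /dev/sda /dev/sdb; do
#       printf '%s %s\n' "$d" "$(smartctl -H "$d" | smart_health)"
#   done
```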
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given the option to hot swap while the system is on, or to shut down first. I'm going to say shutdown. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
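# I've never done this before, but it's a software raid (the filesystems are on /dev/md1 and /dev/md2). My understanding was that as soon as it comes-up, it'll begin copying the data from one disk to the other disk, but with a software raid the new disk probably has to be partitioned and added back to the arrays by hand first (see the hetzner software-raid doc below). But, christ, if both disks are fucked then which disk should I have them replace? Can I see which one is more fucked than the other?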
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
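# a sketch for picking the flagged rows out of those tables instead of eyeballing 23 lines per disk. <code>failing_attrs</code> is a hypothetical helper; it assumes the ten-column attribute layout shown above, where column 9 (WHEN_FAILED) is "-" for healthy attributes.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output (ATA attribute table) on
# stdin and print ID, name, WHEN_FAILED and RAW_VALUE for every attribute
# whose WHEN_FAILED column is not "-".
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9, $10 }'
}

# Usage on the live box (not executed here):
#   smartctl -A /dev/sda | failing_attrs
```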
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
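# a minimal sketch (not part of the original session) for pulling only the flagged rows out of smartctl -A on an ATA disk. it assumes the usual ten-column attribute table, where column 9 is WHEN_FAILED; the function name is made up for illustration. attribute 202 is flagged above because its normalized VALUE (000) is at or below its THRESH (001)

```shell
# List ATA SMART attributes whose WHEN_FAILED column is not "-".
# Reads `smartctl -A` output on stdin.
# Prints: ID NAME WHEN_FAILED VALUE THRESH
flagged_smart_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9, $4, $6 }'
}

# Example usage (needs root and smartmontools):
#   smartctl -A /dev/sda | flagged_smart_attrs
```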
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner2 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

START OF SMART DATA SECTION
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

START OF SMART DATA SECTION
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
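# a small sketch (not part of the original session) for extracting just that "Percentage Used" figure from the NVMe output, so both disks can be compared at a glance. the function name is hypothetical; the device names are the ones shown above

```shell
# Print the "Percentage Used" figure (digits only) from NVMe
# `smartctl -A` output read on stdin.
nvme_percentage_used() {
    sed -n 's/^Percentage Used:[[:space:]]*\([0-9][0-9]*\)%.*$/\1/p'
}

# Example usage (needs root and smartmontools):
#   for d in /dev/nvme0n1 /dev/nvme1n1; do
#       printf '%s: %s%% used\n' "$d" "$(smartctl -A "$d" | nvme_percentage_used)"
#   done
```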
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
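# the [UU] vs [U_] rule above can be checked mechanically. a minimal sketch, assuming the /proc/mdstat layout shown above (array name on one line, status brackets at the end of the following "blocks" line); the function name is made up

```shell
# Print the name of every md array in mdstat-format input (stdin) whose
# status brackets contain "_" (a missing member), e.g. [U_].
# Prints nothing when every array is fully populated, e.g. [UU].
degraded_md_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]+\]/ {
            status = $NF
            if (name != "" && status ~ /_/) print name
        }
    '
}

# Example usage:
#   degraded_md_arrays < /proc/mdstat
```

# empty output means all arrays have both members; any printed name is an array running on one disk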
# ah crap, the process to bring the new drive back into the RAID is non-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back-up.
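## a rough sketch of those three steps (not part of the original session, and never run here). it only prints the commands so they can be reviewed first. assumptions: an MBR partition table (GPT disks need sgdisk instead of sfdisk), md0/md1/md2 built from partitions 1/2/3 as in the mdstat output above, and grub2-install being the CentOS 7 name of the grub installer. the function name is made up

```shell
# Print (do NOT run) the commands for re-adding a replaced disk to the
# three RAID1 arrays. Review against the Hetzner guide before running anything.
print_raid_rejoin_plan() {
    good="$1"   # surviving disk, e.g. /dev/sda
    new="$2"    # freshly swapped-in disk, e.g. /dev/sdb

    # 1. copy the partition table from the surviving disk (MBR assumed)
    echo "sfdisk -d $good | sfdisk $new"

    # 2. add each new partition back into its array (md0<-1, md1<-2, md2<-3)
    for n in 0 1 2; do
        echo "mdadm /dev/md$n --add $new$((n + 1))"
    done

    # 3. watch the rebuild, then reinstall the bootloader on the new disk
    echo "cat /proc/mdstat"
    echo "grub2-install $new"
}

# Example usage:
#   print_raid_rejoin_plan /dev/sda /dev/sdb
```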
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
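# to check the results later, smartctl -l selftest prints the self-test log. a minimal sketch (not part of the original session) that tests whether the newest entry passed, assuming the usual ATA log layout where the most recent test is the row numbered "# 1"; the function name is hypothetical

```shell
# Exit 0 if the newest entry ("# 1") in `smartctl -l selftest` output
# on stdin reads "Completed without error"; exit 1 otherwise
# (including when no entry is found at all).
latest_selftest_ok() {
    awk '
        /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
        END { exit !(found && ok) }
    '
}

# Example usage (needs root and smartmontools):
#   for d in /dev/sda /dev/sdb; do
#       smartctl -l selftest "$d" | latest_selftest_ok && echo "$d: last self-test passed"
#   done
```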
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
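# the sizes above look inconsistent (215.0 GB vs 200.3G) but are the same byte count in different units: fdisk uses GB (10^9 bytes) and lsblk uses GiB (2^30 bytes). a worked sketch of the arithmetic; the function names are made up. the 197G from df is smaller again, presumably because it is the filesystem size after ext4 overhead rather than the size of the block device

```shell
# Convert a byte count to GB (10^9 bytes), one decimal place.
bytes_to_gb() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000000 }'
}

# Convert a byte count to GiB (2^30 bytes), one decimal place.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Example usage, with the byte count fdisk reported for /dev/md2:
#   bytes_to_gb  215024271360   # fdisk's unit
#   bytes_to_gib 215024271360   # lsblk's unit
```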
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2 (innodb_force_recovery = 2); this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.
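# a minimal sketch (not part of the original session) for stepping the recovery level without hand-editing in vim each time. it assumes the setting belongs directly under the [mysqld] header, which is what the diff above suggests but does not show outright; the function name is hypothetical. the MySQL doc linked above advises raising the level one step at a time and warns that values of 4 or greater can permanently corrupt data files

```shell
# Print a copy of the given my.cnf with innodb_force_recovery set to LEVEL
# directly under the [mysqld] header; any earlier setting is dropped.
# The input file itself is not modified.
with_force_recovery() {
    cnf="$1"
    level="$2"
    awk -v level="$level" '
        /^[ \t]*innodb_force_recovery[ \t]*=/ { next }
        { print }
        /^\[mysqld\][ \t]*$/ { print "innodb_force_recovery = " level }
    ' "$cnf"
}

# Example usage, writing the result from the dated backup made above:
#   with_force_recovery /etc/my.cnf.20250417 2 > /etc/my.cnf
#   systemctl restart mariadb
```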

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
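## a minimal sketch (not part of the original session) of adding a key idempotently, so re-running it never duplicates a line. the function name and arguments are made up for illustration; when run as root for another user, the .ssh directory and file also need a chown to that user

```shell
# Append PUBKEY to HOME_DIR/.ssh/authorized_keys unless that exact line is
# already present, creating the directory and file with strict permissions.
add_authorized_key() {
    home_dir="$1"
    pubkey="$2"
    mkdir -p "$home_dir/.ssh" || return 1
    chmod 700 "$home_dir/.ssh" || return 1
    keys="$home_dir/.ssh/authorized_keys"
    touch "$keys" || return 1
    chmod 600 "$keys" || return 1
    # -x whole-line match, -F fixed string: only skip on an exact duplicate
    grep -qxF -- "$pubkey" "$keys" || printf '%s\n' "$pubkey" >> "$keys"
}

# Example usage (the key text is a placeholder):
#   add_authorized_key /home/someuser 'ssh-ed25519 AAAA... someuser@laptop'
```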
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously documented scattered articles about backups, luks, veracrypt, pgp, cybersec general, etc in a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be more easily able to do this in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there's too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
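# so 13107200 / 1024 / 1024 = 12.5 MB.
# note that the 413 message above quotes a limit of 1000000 bytes and says "no files", which looks like ModSecurity's separate SecRequestBodyNoFilesLimit directive rather than SecRequestBodyLimit; the grep for 'BodyLimit' wouldn't match that name
# jesus, either way that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(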
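# a small sketch (not part of the original session) for reading the limit straight out of the 413 message and converting it, since the figure ModSecurity logged (1000000) differs from the 13107200 found by grep. the function names are made up. SecRequestBodyNoFilesLimit is a directive from the ModSecurity v2 reference manual; whether it is set explicitly anywhere in this server's config is not shown in the log above

```shell
# Extract the byte limit quoted in a ModSecurity
# "configured limit (N)" message read on stdin.
modsec_limit_from_log() {
    sed -n 's/.*configured limit (\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Convert a byte count to MiB (2^20 bytes), two decimal places.
bytes_to_mib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024) }'
}

# Example usage (from /etc/httpd), matching both directive names:
#   grep -rEi 'SecRequestBody(NoFiles)?Limit' conf.d modsecurity.d
#   bytes_to_mib 13107200
```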
# ...
# so Marcin is stressing the urgency of getting Catarina a sandbox so she can rebuild osemain on hetzner3 using some new theme that's not broken on the latest version of wordpress, php, etc
# I didn't want to do this site before the other lower-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

Maltfield Log/2025 Q2

2025-05-31T19:36:36Z

Maltfield: apr 27

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Sun Apr 27, 2025=
# Tom created a GitHub account https://github.com/tgriff-ose
# I invited this new account to become a member of the official OSE GitHub org, and sent them an email
<pre>
Hey Tom,

I've invited you to join the official OSE GitHub org:

* https://github.com/orgs/OpenSourceEcology

Please check your GitHub notifications and accept the invite.

PS: If you haven't yet, can you please enable 2FA on your GitHub account?

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/26/25 22:42, REDACTED@tutanota.com wrote:
> Account name: tgriff-ose
>
> --
> Tom Griffing
>
>
>
> Apr 27, 2025, 03:24 by REDACTED@disroot.org:
>
>> GitHub is owned by Microsoft, and it's free (as in beer) to create an account.
>>
>> Could you please create a free GitHub account?
>>
>> Michael Altfield
>> https://www.michaelaltfield.net
>> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>>
>> Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
>>
>> On 4/26/25 21:06, REDACTED@tutanota.com wrote:
>>
>>> Michael;
>>>
>>> I don't have a github account, as it's a Microsoft thing requiring a paid account. I don't intent to support them.
>>>
>>> Is there any other way to access the ansible repo?
>>> --
>>> Tom Griffing
</pre>
# ...
# Marcin confirmed that he has not received a bill from AWS for some time, so it appears we did finally delete all of the glacier crap
<pre>
I have not received another bill since January, so it looks like there is
nothing owed.
MJ

On Sat, Apr 26, 2025 at 6:28 PM Michael Altfield <REDACTED>
wrote:

> Hey Marcin,
>
> Speaking of aws, can you confirm that your bill for last month was $0?
>
>
> Thank you,
</pre>
# ...
# I updated my wiki and osedev work logs for April so-far

=Sat Apr 26, 2025=
# Marcin authorized me to add Tom to our ops google groups mailing list and to give him access to our shared ose keepass
<pre>
Yes.

On Fri, Apr 25, 2025, 12:43 PM Michael Altfield <REDACTED@disroot.org> wrote:

> (re-sending without encryption)
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
>
> On 4/25/25 12:41, Michael Altfield wrote:
>> Hey Marcin,
>>
>> Do you authorize:
>>
>> 1. Giving Tom access to the shared OSE keepass file
>>
>> 2. Adding Tom to the ops mailing list (this would allow him to password
>> reset many of our important accounts)
>>
>> Please let me know if you authorize the above.
>>
>> Thank you,
</pre>
# Tom sent me his gpg public key, which I can use to add him to the wazuh emails
<pre>
user@ose:~$ gpg
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: Go ahead and type your message ...
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----

pub rsa4096 2025-04-26 [SC]
13300901348A985115679165FB137A633FD1EB4C
uid Tom Griffing (OSE PGP Key 4-25-2025) <REDACTED@tutanota.com>
sub rsa4096 2025-04-26 [E]
user@ose:~$
</pre>
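# before importing a key into the wazuh keyring, the fingerprint can be compared against one confirmed out-of-band. a minimal sketch (not part of the original session), assuming a GnuPG 2.x recent enough to have --show-keys, and relying on the --with-colons format where an "fpr" record carries the fingerprint in field 10. the function name is made up; the file path is the one used in the import commands in this entry

```shell
# Print the first fingerprint found in `gpg --with-colons` output on stdin.
first_fingerprint() {
    awk -F: '$1 == "fpr" { print $10; exit }'
}

# Example usage: only import when the fingerprint matches the expected one.
#   expected=13300901348A985115679165FB137A633FD1EB4C
#   actual=$(gpg --show-keys --with-colons /var/tmp/gpg/tom.pubkey.asc | first_fingerprint)
#   [ "$actual" = "$expected" ] && gpg --homedir /var/ossec/.gnupg --import /var/tmp/gpg/tom.pubkey.asc
```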
# I added Tom to the wazuh recipients, per https://wiki.opensourceecology.org/wiki/Wazuh
<pre>
mkdir -p /var/tmp/gpg
pushd /var/tmp/gpg
# write multi-line to file for documentation copy & paste
cat << EOF > /var/tmp/gpg/tom.pubkey.asc
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----
EOF
gpg --homedir /var/ossec/.gnupg --import /var/tmp/gpg/tom.pubkey.asc
popd

# add tom's email (that matches an email on a UID of his key above) to the space-delimited "recipients" variable
vim /var/ossec/sent_encrypted_alarm.settings
</pre>
# and I sent him an email asking him to confirm that it's working
<pre>
Hey Tom,

Can you please confirm that you're now receiving alerts from wazuh?

Wazuh is our HIDS (Host-Based Intrusion Detection System). It's a fork of the HIDS and FIM (File Integrity Monitor) OSSEC. Because it sometimes sends sensitive information (eg diffs of config files with passwords), it's important that we encrypt its email notifications end-to-end with PGP.

And because someone who compromises the server could "clean up" after themselves, these (off-server) alerts are critical to post-compromise investigations.

For more info, see:

* https://wiki.opensourceecology.org/wiki/Wazuh
* https://en.wikipedia.org/wiki/OSSEC
* https://documentation.wazuh.com/current/getting-started/index.html

Out-of-the-box, Wazuh has a ton of features, but probably where we use it the most is its ingestion of apache's mod_security WAF and its tie-in to Wazuh's Active Response. If an IP is found doing something bad (eg multiple consecutive 403 responses, such as a brute-force attack on wordpress [or ssh]), then the IP will get temp blocked by the firewall for 10 minutes. If it does it again shortly after the ban is lifted, it'll be banned for 12 hours. If again, 1 day. Then 2 days. Then 4 days. And the max ban for 5x repeat offenses is 8 days

* https://github.com/OpenSourceEcology/ansible/blob/master/hetzner3/roles/maltfield.wazuh/templates/ossec.conf.j2#L256-L271

It also has rootkit detection, and lots of other useful alerts that "just work" out of the box.

Please confirm that you're now receiving encrypted wazuh alerts.

Thank you,
</pre>
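The escalating ban schedule described in the email above can be sketched as a small lookup. This is only an illustration of the arithmetic, assuming the durations quoted in the email (10 minutes, then 12 hours, 1 day, 2 days, 4 days, 8 days). The function name <code>ban_minutes</code> is hypothetical and is not part of Wazuh; the real values live in the ossec.conf template linked in the email.

```shell
# Hypothetical helper (not part of Wazuh): map the number of repeat offenses
# to the ban length in minutes, using the durations quoted in the email above.
ban_minutes() {
  case "$1" in
    0) echo 10 ;;      # first offense: 10 minutes
    1) echo 720 ;;     # 1st repeat: 12 hours
    2) echo 1440 ;;    # 2nd repeat: 1 day
    3) echo 2880 ;;    # 3rd repeat: 2 days
    4) echo 5760 ;;    # 4th repeat: 4 days
    *) echo 11520 ;;   # 5th repeat and anything else: capped at 8 days
  esac
}
```

For example, <code>ban_minutes 2</code> prints the 1-day ban expressed in minutes.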
# I tried to add Tom to our ops google groups email list, but it said I wasn't allowed to add members outside of our google workspace
<pre>
An error occurred
1 user is outside of your organization. Based on your group or organization settings, you can only add organization users to this group. Contact your group owner or domain administrator for help.
</pre>
# I checked our users' group; it appears that Tom doesn't have an account @opensourceecology.org in gsuite
# I found the setting to change that here https://admin.google.com/ac/managedsettings/864450622151/GROUPS_SHARING_SETTINGS_TAB
## https://support.google.com/a/thread/63692725/
## https://support.google.com/a/answer/167097
# I checked the box that said "Group owners can allow external members"
## curiously the subline said "Organization admins can always add external members" – but I'm a damn org admin, and I couldn't add him :/
# I tried to add him again, but I got the same error
# this time I went to the group settings https://groups.google.com/a/opensourceecology.org/g/REDACTED/settings
# I found the "allow external members" and changed it from "off" to "on" and clicked "save changes"
## this wasn't possible before. So first I had to change the workspace-wide settings to allow me to change the groups-specific settings. now it's changed.
# this time it worked.
# I sent an email to our ops google group, asking Tom to reply if he saw it
# ...
# I checked-in on hetzner2 to make sure it rebooted this morning
# looks like the cron is set to reboot at 10:40 UTC every day, and – indeed – uptime says it's been online for a bit less than 13 hours. And its last boot time was today at 10:41:25
<pre>
[root@opensourceecology ~]# uptime
23:30:25 up 12:49, 7 users, load average: 1.02, 0.98, 0.74
[root@opensourceecology ~]# journalctl | head
-- Logs begin at Sat 2025-04-26 10:41:25 UTC, end at Sat 2025-04-26 23:30:26 UTC. --
Apr 26 10:41:25 localhost systemd-journal[129]: Runtime journal is using 8.0M (max allowed 3.1G, trying to leave 4.0G free of 31.2G available → current limit 3.1G).
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuset
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpu
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuacct
Apr 26 10:41:25 localhost kernel: Linux version 3.10.0-1160.119.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 4 14:43:51 UTC 2024
Apr 26 10:41:25 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
Apr 26 10:41:25 localhost kernel: e820: BIOS-provided physical RAM map:
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009c7ff] usable
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x000000000009c800-0x000000000009ffff] reserved
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /etc/cron.d/reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology ~]# date -u
Sat Apr 26 23:31:32 UTC 2025
[root@opensourceecology ~]#
</pre>
# so it looks like we'll have ~2 minutes of downtime every day in the very early morning in the US. I can live with that.
# and grub clearly is fixed
# oh, also the RAID looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
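The "RAID looks healthy" check above is done by eye on the <code>[UU]</code> fields. A minimal sketch of the same check as a script, assuming the mdstat layout shown above; the function name <code>mdstat_degraded</code> is hypothetical:

```shell
# Hypothetical helper: succeed (exit 0) if any array in an mdstat-style file
# shows a missing member, i.e. an underscore inside the [UU]-style status field.
# Defaults to /proc/mdstat when no file is given.
mdstat_degraded() {
  grep -Eq '\[[U_]*_[U_]*]' "${1:-/proc/mdstat}"
}
```

A possible use would be <code>if mdstat_degraded; then echo "RAID degraded"; fi</code> in a cron job. Note that this only looks at the status field; it says nothing about SMART health or an in-progress resync.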
# I asked Tom for his GitHub account profile username, so I can grant him write access to our OSE ansible repo
# I added Tom's new ssh key to his authorized_keys file on hetzner2
# I sent Tom an email asking to confirm his access to hetzner2

=Fri Apr 25, 2025=
# I woke up this morning and discovered the wiki was offline
# I tried to ssh into the server; it's not responding
# I figured I'd log into the hetzner wui, but – uhh – the credentials are in keepass and live on the server
# I had mitigated this by giving Marcin a copy of the keepass file on his veracrypt drive, but he has since changed the password (a month or two ago), and we don't have a new local copy
# I sent an email to Marcin asking him to login to hetzner wui and boot hetzner2. if it doesn't come-up, then I'll have to get the password from him so I can load it in the wui from a rescue disk
# oh, I did find the new hetzner password in my personal keepass
# I logged-in, and I found the server was listed as being on. But I can't ping it. I gave it an "automatic hardware reset" from the wui
# I'll give it a few minutes before trying the rescue system
# their rescue systems are much nicer for their cloud product than their dedicated server product
# it looks like I have two options
## rescue boot mode: where I'm given ssh access
## vnc
# the problem with the rescue boot is that – if this is a grub issue – I wouldn't be able to "see" the error
# I enabled VNC and gave the server a reboot
# I was able to connect via vnc, but it was the damn installation wizard for almalinux. I quit the installation, and the vnc session died.
# damn, I guess vnc won't let me see the boot process, after all
# instead I tried the "rescue system"
# that didn't work; I can't access ssh on either of the IP addresses
# the docs say to activate the rescue system and then reboot it; that's what I did https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system/
# this time I fully shut down the server, and then I enabled the rescue system (while it's off)
# I went back to the Reset tab, and it's still off. So I booted it
# somehow I was able to login from my ose vm using my personal ssh key, but with user root
<pre>
user@ose:~$ ssh -v root@138.201.84.223
OpenSSH_9.2p1 Debian-2+deb12u5, OpenSSL 3.0.15 3 Sep 2024
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 138.201.84.223 [138.201.84.223] port 22.
debug1: Connection established.
...
Linux rescue 6.12.19 #1 SMP Fri Mar 14 05:34:52 UTC 2025 x86_64

--------------------

Welcome to the Hetzner Rescue System.

This Rescue System is based on Debian GNU/Linux 12 (bookworm) with a custom kernel.
You can install software like you would in a normal system.

To install a new operating system from one of our prebuilt images, run 'installimage' and follow the instructions.

Important note: Any data that was not written to the disks will be lost during a reboot.

For additional information, check the following resources:
Rescue System: https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system
Installimage: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installimage
Install custom software: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installing-custom-images
other articles: https://docs.hetzner.com/robot

--------------------

Rescue System (via Legacy/CSM) up since 2025-04-25 17:24 +02:00

Hardware data:

CPU1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Cores 8)
Memory: 64153 MB (Non-ECC)
Disk /dev/sda: 250 GB (=> 232 GiB)
Disk /dev/sdb: 512 GB (=> 476 GiB)
Total capacity 709 GiB with 2 Disks

Network data:
eth0 LINK: yes
MAC: 90:1b:0e:94:07:c4
IP: 138.201.84.223
IPv6: 2a01:4f8:172:209e::2/64
Intel(R) PRO/1000 Network Driver

root@rescue ~ #
</pre>
# I was able to mount the root drive
<pre>
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@rescue ~ # mount /dev/md2 /mnt
root@rescue ~ # ls /mnt
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # ls /mnt/home
b2user crupp hart lberezhny marcin stagingsync wp
cmota Flipo jthomas maltfield not-apache tgriffing
root@rescue ~ #
</pre>
# I don't know what the point of this is; I can't fix it if I can't watch it boot and see what's breaking
# ok, at the bottom of the docs, hetzner lists another option = vKVM Rescue System https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/
# it specifically says that's for debugging boot issues
# last thing before I try that: I downloaded a local copy of the keepass files from hetzner2
<pre>
user@ose:~/tmp/hetzner2$ rsync -av --progress root@138.201.84.223:/mnt/etc/keepass ./etc-keepass-20250525
receiving incremental file list
created directory ./etc-keepass-20250525
keepass/
keepass/passwords.kdbx
46,142 100% 44.00MB/s 0:00:00 (xfr#1, to-chk=6/8)
keepass/passwords.kdbx.20170728.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#2, to-chk=5/8)
keepass/passwords.kdbx.20170804.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#3, to-chk=4/8)
keepass/passwords.kdbx.20190820.bak
33,726 100% 143.20kB/s 0:00:00 (xfr#4, to-chk=3/8)
keepass/passwords.kdbx.20190909.bak
34,238 100% 71.75kB/s 0:00:00 (xfr#5, to-chk=2/8)
keepass/passwords.kdbx.20250316.bak
45,406 100% 94.55kB/s 0:00:00 (xfr#6, to-chk=1/8)
keepass/passwords.kdbxs.20180525.bak
27,102 100% 56.31kB/s 0:00:00 (xfr#7, to-chk=0/8)

sent 161 bytes received 196,407 bytes 35,739.64 bytes/sec
total size is 195,794 speedup is 1.00
user@ose:~/tmp/hetzner2$

user@ose:~/tmp/hetzner2$ du -sh etc-keepass-20250525/keepass/*
48K etc-keepass-20250525/keepass/passwords.kdbx
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170728.bak
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170804.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190820.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190909.bak
48K etc-keepass-20250525/keepass/passwords.kdbx.20250316.bak
28K etc-keepass-20250525/keepass/passwords.kdbxs.20180525.bak
user@ose:~/tmp/hetzner2$
</pre>
# so this time was the same as the rescue system, except I chose "vKVM" instead of "Linux" in the "Operating System" dropdown
# strange, it gave me an error
<pre>
Public key authentication is not available for the selected operating system.
</pre>
# I unselected my ssh key, and chose "no key" instead
# it gave me a URL and a password. I booted the server, but the URL didn't load ("Unable to connect" error)
# ok, it took a few minutes and had a self-signed cert
# I bypassed the cert error, and entered the username and password into the basic auth popup. It failed! Could I really have been MITM'd?
# I immediately shut down the server from the wui, and I tried again.
# this time I was able to login – both from ssh and in the wui.
# as soon as it opened, I saw the error
<pre>
No more network devices

Booting from Hard Disk...
.
error: symbol 'grub_calloc' not found.
Entering rescue mode...
grub rescue>
</pre>
# I wonder if this is grub or grub2. I didn't have a "grub-install" binary before. I assumed it was an error in the hetzner docs, so I ran "grub2-install" instead, which said it worked (there was a warning that the docs said was safe to ignore)
# curiously, the opposite is true for the ssh session in vkvm: I have grub-install but not grub2-install
<pre>
root@vKVM-rescue ~ # which grub-install
/usr/sbin/grub-install
root@vKVM-rescue ~ #
root@vKVM-rescue ~ # which grub2-install
root@vKVM-rescue ~ #
</pre>
# here's the docs in question https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# I don't want to fuck with the grub without first taking a backup of these disks. But, uh, it looks like I can't access the RAID from inside this vkvm setup
# yeah, that's one of the limitations listed for VKVM https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/#raid-controllers
<pre>
Configured units are passed through as SCSI devices to the VM. However it is not possible to access the controller. Please use the regular Hetzner Rescue System for this purpose.
</pre>
# I shut down VKVM and booted it into the regular rescue mode
# it took a few minutes to get back into the old rescue system, but here I can use the raid
<pre>
root@rescue ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 3.4G 1 loop
sda 8:0 0 476.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
root@rescue ~ # mkdir /mnt/md1
root@rescue ~ # mkdir /mnt/md2
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # mount /dev/md2 /mnt/md2
root@rescue ~ #
</pre>
# I created a dir for these backups
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ #

root@rescue ~ # mkdir /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chown root:root /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chmod 0700 /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ #
</pre>
# first I made a backup from the raid
<pre>
root@rescue ~ # rsync -av --progress /mnt/md1 /mnt/md2/var/tmp/20250425-grub-fail/md1.$(date "+%Y%m%d_%H%M%S")
...
md1/grub2/locale/zh_TW.mo
30,882 100% 31.38kB/s 0:00:00 (xfr#345, to-chk=0/355)
md1/lost+found/

sent 399,450,301 bytes received 6,709 bytes 159,782,804.00 bytes/sec
total size is 399,330,989 speedup is 1.00
root@rescue ~ #
</pre>
# then I figured I'd make a backup of the two disk partitions directly, but I couldn't even mount them
<pre>
root@rescue ~ # umount /mnt/md1
root@rescue ~ # mkdir /mnt/sda2
root@rescue ~ # mkdir /mnt/sdb2
root@rescue ~ # mount /dev/sda2 /mnt/sda2
mount: /mnt/sda2: unknown filesystem type 'linux_raid_member'.
dmesg(1) may have more information after failed mount system call.
root@rescue ~ #
</pre>
# I tried this command (from the docs), which I skipped before because it said that the next command (grub-install) was enough; sure enough, it didn't work https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
<pre>
root@rescue ~ # grub-mkdevicemap -n
grub-mkdevicemap: error: cannot open /boot/grub/device.map.
root@rescue ~ #
</pre>
# I investigated this before, and I thought I decided we're using grub2, not grub1
<pre>
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # ls /mnt/md1/
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue ~ #
</pre>
# oh, shit, even the grub-install command is v2 https://askubuntu.com/questions/107486/how-to-know-the-version-of-grub
<pre>
root@rescue ~ # grub-install --version
grub-install (GRUB) 2.06-13+deb12u1
root@rescue ~ #
</pre>
# ok, this indicates we're not using lilo https://askubuntu.com/questions/24459/how-do-i-find-out-which-boot-loader-i-have
<pre>
root@rescue ~ # ls /mnt/md2/etc/ | grep lilo
root@rescue ~ #
</pre>
# we can dd straight from the disk to read the MBR. And, yeah, it appears we are using grub via MBR .. and this info is stored on the disks, not the raid
<pre>
root@rescue ~ # dd if=/dev/md1 bs=512 count=1 2>/dev/null | strings
root@rescue ~ #

root@rescue ~ # dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
214fb5736d1e5ad63e515dc2fffe44bd928cd8dab2c019dc11fb9fcaef5ea90dbf51f1ac507ab1cfbbe74ff
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
DA/jjF
root@rescue ~ #

root@rescue ~ # dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
root@rescue ~ #
</pre>
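The <code>dd | strings</code> check above can be wrapped into a yes/no test. A minimal sketch, assuming GNU or BSD <code>grep</code> (for the <code>-a</code> flag); the function name <code>has_grub_bootsector</code> is hypothetical:

```shell
# Hypothetical helper: succeed if the first 512-byte sector of a device (or an
# image file) contains the literal string "GRUB", as in the strings output above.
has_grub_bootsector() {
  # -a treats the binary sector as text; -q suppresses output
  dd if="$1" bs=512 count=1 2>/dev/null | grep -aq 'GRUB'
}
```

On the rescue system this could be called as <code>has_grub_bootsector /dev/sda</code>. It only shows that some GRUB boot code is present in the first sector; it can't tell whether that boot code matches the modules under /boot.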
# idk what to do; I tried the grub-install again, but it gives me this error
<pre>
root@rescue ~ # grub-install /dev/sda
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #

root@rescue ~ # grub-install /dev/sdb
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #
</pre>
# I tried creating a chroot of our real raid disks first
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # umount /mnt/md1
root@rescue ~ # chroot-prepare /mnt/md2
root@rescue ~ # chroot /mnt/md2
root@rescue / # ls /boot
root@rescue / # mount /dev/md1 /boot
root@rescue / # ls /boot
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue / #
</pre>
# I then tried the grub install again
<pre>
root@rescue / # grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #

root@rescue / # grub2-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #
</pre>
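The sequence that worked above (prepare a chroot of the mounted root array, mount /boot inside it, then run <code>grub2-install</code> on both disks) can be summarized as a dry-run sketch. It only prints commands and executes nothing. The function name and argument order are hypothetical, and the non-interactive <code>chroot DIR COMMAND</code> form is an assumption (the session above used an interactive chroot); <code>chroot-prepare</code> is the helper used above in the Hetzner rescue system.

```shell
# Dry-run sketch: print (do not execute) the commands from the session above.
# Usage: print_grub_reinstall_plan ROOT_MOUNTPOINT BOOT_DEVICE DISK...
# Assumes the root array is already mounted at ROOT_MOUNTPOINT.
print_grub_reinstall_plan() {
  root_mnt="$1"
  boot_dev="$2"
  shift 2
  echo "chroot-prepare ${root_mnt}"
  echo "chroot ${root_mnt} mount ${boot_dev} /boot"
  # install the bootloader on every disk in the mirror, not just the new one
  for disk in "$@"; do
    echo "chroot ${root_mnt} grub2-install ${disk}"
  done
}
```

For the layout above, the call would be <code>print_grub_reinstall_plan /mnt/md2 /dev/md1 /dev/sda /dev/sdb</code>. Review the printed commands before running any of them by hand.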
# I exited the chroot and shut down the rescue system
# I activated the VKVM rescue system, and booted it again
# when I connected to the KVM wui, I was shown a password prompt. So I think booting works!
# I rebooted it from the ssh
# and now I can ssh into the real system
<pre>
user@personal:~$ autossh opensourceecology.org
Last login: Thu Apr 24 23:12:44 2025 from 146.70.199.15
[maltfield@opensourceecology ~]$
</pre>
# and now the wiki loads too
# I did another reboot test
<pre>
[maltfield@opensourceecology ~]$ sudo su -
[sudo] password for maltfield:
Last login: Thu Apr 24 16:25:15 UTC 2025 on pts/0
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
ssh: connect to host opensourceecology.org port 32415: Connection refused
Last login: Fri Apr 25 16:29:21 2025 from 185.204.1.184
[maltfield@opensourceecology ~]$
</pre>
# idk, my takeaway is that one or more of these assumptions is correct
## grub-install needs to be run *after* the RAID sync is finished
## grub-install needs to be run on *both* the new *and* the old disk
## grub-install needs to be run inside a chroot on the rescue system
# anyway, we're stable again
# I got an email from Marcin saying Tom could help with the migrations. I sent him some wiki articles to get caught-up
<pre>
Hey Tom,

I'll try to get you ssh access on hetzner2 soon. In the meantime, please read the following articles:

* https://wiki.opensourceecology.org/wiki/Hetzner2

* https://wiki.opensourceecology.org/wiki/Hetzner3

I've started preparing draft "change tickets" for migrating each of the websites from hetzner2 to hetzner3. Note that some of these are not fully tested, so you'll want to execute them manually and make corrections as-needed.

Please also read-through these:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_store_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_microfactory_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_fef

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_oswh

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_phplist_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_wiki_to_hetzner3

(There's also one CHG for the forum that I think needs to be made)

The next item TODO is to finish the migration plan for these websites:

1. www.opensourceecology.org (osemain)
2. www.openbuildinginstitute.org (obi)

We decided that there would be 2 simultaneous versions of obi:

1. A static site scraped with curl on hetzner3
2. The (broken) dynamic wordpress site on hetzner3

And we decided that there would be 3 simultaneous versions of osemain:

1. The live/current site on hetzner2
2. A static site scraped with curl on hetzner3
3. The (broken) dynamic wordpress site on hetzner3

To have multiple sites with the same domain on the same server, we bought a second IPv4 address (FeF isn't setup with IPv6). This week I just finished updating the hetzner3 server to persist this new IPv4 address.

The next item for you would be to update our ansible to push out new vhosts (in nginx, varnish, and apache) for the static sites that are bound to the second IPv4 address using the same hostname.

Please read-through the ansible playbook and roles (most importantly for nginx, varnish, and apache) to understand how they're provisioned

* https://github.com/OpenSourceEcology/ansible

Since you have access to hetzner3, you can also poke around (read-only please) the configs for these three web services to understand how ansible provisions them.

Once you've updated and pushed-out the new vhosts with ansible, you'll need to update the migration plan

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_obi_to_hetzner3
* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

And then you'll want to go-through each migration plan to create a temp "snapshot" of all the sites on hetzner3, where Marcin & Catarina can do a thorough verification of each site (by updating /etc/hosts) before we do the *real* migration -- which is nearly the same as the "snapshot" except we actually migrate DNS.

Please let me know when you've finished reading the above articles.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/24/25 22:16, REDACTED@tutanota.com wrote:
> Michael;
>
> I need to reset my ssh key on hetzner2. Can you use the same as on 3 or best to generate a new one?
>
> I spoke with Marcin and I think I can help with the admin, as I have time available.
>
> Can you give a run-down of its status and what needs to be done for completing the migration to hetzner3?
> --
> Tom Griffing
</pre>

=Thr Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
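The lesson from the session above is that a healthy member has to be marked faulty before it can be hot-removed, and that each partition has to be paired with the right array (the CHG draft had <code>/dev/md1</code> twice). A dry-run sketch that prints the commands for each pair, assuming the <code>array:partition</code> argument format, which is hypothetical, as is the function name:

```shell
# Dry-run sketch: for each "array:partition" pair, print the mdadm command that
# marks the member faulty, followed by the command that removes it.
# Nothing is executed.
print_mdadm_removal_plan() {
  for pair in "$@"; do
    array="${pair%%:*}"   # text before the first colon, e.g. /dev/md0
    part="${pair#*:}"     # text after the first colon, e.g. /dev/sdb1
    echo "mdadm --manage ${array} --fail ${part}"
    echo "mdadm ${array} -r ${part}"
  done
}
```

For this server the call would be <code>print_mdadm_removal_plan /dev/md0:/dev/sdb1 /dev/md1:/dev/sdb2 /dev/md2:/dev/sdb3</code>. Printing the pairs first makes a mismatched array/partition easier to spot before anything is run.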
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shut down, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
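# (sketch) since the kernel's sdX names moved around after the hotswap, it helps to list every disk's serial in one pass instead of querying one device at a time. The helper below is hypothetical (it's not part of our tooling); it only filters the <code>udevadm info --query=property</code> output shown above

```shell
# Print the ID_SERIAL_SHORT value from `udevadm info --query=property`
# output supplied on stdin. Prints nothing if the property is absent.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Hypothetical usage on the server (not executed here):
#   for dev in sda sdb sdc; do
#     echo "$dev: $(udevadm info --query=property --name "$dev" 2>/dev/null | serial_short)"
#   done
```

# a device that doesn't exist (like sda above) would just show an empty serial, which makes the renaming easier to spot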
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
# anyway, I tried to backup the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shutdown all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
# that looked good, other than the complaint about not being able to boot from this disk; I'll check later what LILO is and whether this matters for grub on the RAID
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
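# (sketch) the same back-of-the-envelope estimate as a shell function. The function name is made up for illustration; it assumes the "blocks" figure in /proc/mdstat is in 1 KiB units and the "speed=" figure is in K/sec, and it uses integer division, so it rounds down

```shell
# Estimate minutes needed to sync an md array.
#   $1 = size to sync in 1 KiB blocks (the "blocks" figure in /proc/mdstat)
#   $2 = sync speed in K/sec (the "speed=" figure in /proc/mdstat)
sync_eta_minutes() {
  echo $(( $1 / $2 / 60 ))
}

# 250 GiB expressed in KiB, at roughly 1024 K/sec:
sync_eta_minutes $(( 250 * 1024 * 1024 )) 1024                # prints 4266 (about 71 hours)

# the same size at roughly 58 MB/sec:
sync_eta_minutes $(( 250 * 1024 * 1024 )) $(( 58 * 1024 ))    # prints 73
```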
# I tried to go ahead and install grub on the new disk, but i got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
# looks like it should be 'grub2-install' I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# calculations at that speed would be 250*1024/58 = 4,413.793103448 seconds / 60 = 73 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
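# (sketch) rather than eyeballing /proc/mdstat on every check, a small awk filter can list which arrays are still degraded (an underscore in the <code>[U_]</code> status). The function name is hypothetical, and it assumes the mdstat layout shown above: an "mdN :" line followed by a "blocks" line

```shell
# Read /proc/mdstat-style text on stdin and print the name of every
# array whose member status (e.g. [U_]) contains a missing member.
degraded_arrays() {
  awk '
    # remember the array name from lines like "md2 : active raid1 ..."
    /^md[0-9]+ :/ { name = $1; next }
    # the status token looks like [UU] or [U_]; "_" marks a missing member
    /blocks/ && match($0, /\[[U_]+\]/) {
      if (index(substr($0, RSTART, RLENGTH), "_")) print name
    }
  '
}

# Hypothetical usage on the server (not executed here):
#   degraded_arrays < /proc/mdstat
```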
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're 9% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
# Shame. I was hoping for at least something <50%. Well, I wonder how long that remaining 4% will last us :/
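# (sketch) to pull one attribute out of the <code>smartctl -A</code> table without scrolling, something like the filter below should work. The function name is made up; it naively takes the last column as the raw value, which holds for attribute 202 above but not for rows with trailing annotations (eg the temperature row)

```shell
# Print the RAW_VALUE (last column) of one SMART attribute.
#   $1    = attribute ID, e.g. 202 for Percent_Lifetime_Remain
#   stdin = output of `smartctl -A <device>`
smart_raw_value() {
  awk -v id="$1" '$1 == id { print $NF }'
}

# Hypothetical usage on the server (not executed here):
#   smartctl -A /dev/sdb | smart_raw_value 202
```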
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle counts are crazy low; our old disk was only power-cycled 50 times and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
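# (sketch) the Power_On_Hours conversion above as a one-liner, using awk for the floating-point division. The function name is just for illustration, and it ignores leap years like the math above does

```shell
# Convert a SMART Power_On_Hours raw value to years (365-day years).
hours_to_years() {
  awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 78516   # old /dev/sda; prints 8.96
hours_to_years 52083   # replacement /dev/sdb; prints 5.95
```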
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended that we setup a daily reboot of hetzner2 at 10:40 (a good buffer after the backups finish uploading)
# I asked Marcin if he'd like me to setup a daily reboot at 10:40
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
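# (sketch) what the proposed cron entry could look like, in the same /etc/cron.d format as our backup cron. This is not what's installed; the file name and the <code>/sbin/shutdown</code> path are assumptions, and it assumes the server's clock is UTC (as the <code>date</code> output above suggests), since cron schedules in the system's local time

```shell
# Print a candidate /etc/cron.d entry for a daily reboot at 10:40.
# Nothing is installed by this function; it only prints the line.
daily_reboot_cron() {
  # fields: minute hour day-of-month month day-of-week user command
  echo "40 10 * * * root /sbin/shutdown -r now"
}

daily_reboot_cron

# To install (hypothetical file name; review the output first):
#   daily_reboot_cron > /etc/cron.d/daily_reboot
```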
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
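# (sketch) double-checking that duration with date arithmetic. This assumes GNU <code>date</code> (for the <code>-d</code> option); the function name is made up

```shell
# Print the whole minutes elapsed between two UTC timestamps.
#   $1 = start, $2 = end, both in "YYYY-MM-DD HH:MM" form
elapsed_minutes() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}

elapsed_minutes "2025-04-24 13:32" "2025-04-24 15:20"   # prints 108
```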
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to setup the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
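# a minimal sketch of how the reboot time could be checked, assuming <code>uptime -s</code> is available on hetzner2; the function name <code>check_boot_time</code> and the one-hour window are hypothetical choices for illustration

```shell
#!/bin/sh
# Print "ok" if a boot timestamp (in the "YYYY-MM-DD HH:MM:SS" format
# printed by `uptime -s`) falls between 10:40 and 11:40, else "unexpected".
check_boot_time() {
  printf '%s\n' "$1" | awk '{
    split($2, t, ":")                 # t[1] = hours, t[2] = minutes
    minutes = t[1] * 60 + t[2]        # minutes since midnight
    if (minutes >= 640 && minutes <= 700) print "ok"; else print "unexpected"
  }'
}

# usage on the server (TZ=UTC so the result is comparable to the cron time):
#   check_boot_time "$(TZ=UTC uptime -s)"
```

# <code>journalctl --list-boots</code> should show the same boot if the journal is persistent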
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never setup for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add a verbose flag. but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in `ip a`
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
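# one plausible reading of the failure (an assumption, not confirmed in this log): <code>ip addr add</code> answers "File exists" when the address is already on the interface, and the verbose ifup output shows it trying to re-add 144.76.164.201. A minimal sketch to test for that; the helper name <code>has_addr</code> is hypothetical

```shell
#!/bin/sh
# Succeed if the given IPv4 address appears in `ip -o -4 addr show`
# output supplied on stdin.
has_addr() {
  # -F: treat the dots in the address literally; the trailing "/" stops
  # 144.76.164.20 from matching 144.76.164.201
  grep -qF "inet $1/"
}

# demo with a line shaped like `ip -o -4 addr show dev enp0s31f6` output
sample='2: enp0s31f6    inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6'
printf '%s\n' "$sample" | has_addr 144.76.164.201 && echo "already configured"

# usage on the server:
#   ip -o -4 addr show dev enp0s31f6 | has_addr 144.76.164.201
```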
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find these hetzner-specific docs on adding a second IP; the approach is totally different from what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)
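# the config that worked boils down to one <code>up</code>/<code>down</code> pair per extra address inside the existing iface stanza. A minimal sketch that prints that pair for a given address and device; the helper name <code>second_ip_lines</code> is hypothetical, and the /32 prefix follows the hetzner docs linked above

```shell
#!/bin/sh
# Print the up/down lines for one additional IPv4 address.
# $1 = address, $2 = network device
second_ip_lines() {
  printf 'up ip addr add %s/32 dev %s\n' "$1" "$2"
  printf 'down ip addr del %s/32 dev %s\n' "$1" "$2"
}

# prints the two lines that were added to /etc/network/interfaces
second_ip_lines 144.76.164.195 enp0s31f6
```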

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I setup email alerts for the RAID, but not for SMART.
## on hetzner2, we had no errors from the RAID, but we did have SMART errors. I guess eventually, if the disk failed badly enough that RAID replication was breaking, we would have gotten alerts. But it would be good if we could get alerts *before* that happened.
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the file on hetzner3
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being interpreted as shell, instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
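# a minimal sketch of a check that would catch this class of mistake: it reports any file whose first two bytes are not <code>#!</code>. The function name <code>check_shebang</code> is hypothetical

```shell
#!/bin/sh
# For each file given, print a warning if it does not start with "#!".
check_shebang() {
  for f in "$@"; do
    if [ "$(head -c 2 "$f")" != '#!' ]; then
      echo "missing shebang on line 1: $f"
    fi
  done
}

# usage on the server:
#   check_shebang /usr/share/munin/plugins/varnish5_
```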
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
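# the same symlinks as a loop, as a minimal sketch for servers with more disks; the function name <code>link_smart_plugins</code> is hypothetical

```shell
#!/bin/sh
# Create one smart_<disk> symlink per disk name.
# $1 = path to the smart_ plugin, $2 = munin plugins directory, rest = disks
link_smart_plugins() {
  src="$1"
  dest_dir="$2"
  shift 2
  for disk in "$@"; do
    ln -s "$src" "$dest_dir/smart_$disk"
  done
}

# usage on the server:
#   link_smart_plugins /usr/share/munin/plugins/smart_ /etc/munin/plugins sda sdb
```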
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
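# these look like normalized SMART values, where lower is generally worse. A minimal sketch that filters the plugin output for low values; the function name <code>low_smart_values</code> and the threshold of 10 are arbitrary assumptions

```shell
#!/bin/sh
# Read "name.value N" lines on stdin and print those whose value is
# below the threshold in $1. The smartctl exit status is skipped because
# it is an exit code, not a normalized SMART value.
low_smart_values() {
  awk -v limit="$1" '
    $1 != "smartctl_exit_status.value" && $2 + 0 < limit { print $1, $2 }
  '
}

# usage on the server:
#   munin-run smart_sdb | low_smart_values 10
```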
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits without an error, but it only prints a U (munin's marker for an unknown value); no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we setup for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the smartd config file
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even if I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
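# for reference, a minimal Python sketch for pulling the queue ID, dsn and status out of delivery lines like the two above. The regex and the name parse_delivery are assumptions fitted only to these sample lines, not a general postfix log parser. In these two lines, the first message shows status=sent with a 250 from Google, and the deferral reason is postfix's own "bounce or trace service failure" text rather than an SMTP reply

```python
import re

# Two delivery lines copied from the journalctl output above.
LOG_LINES = [
    "Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: "
    "to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, "
    "delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, "
    "status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)",
    "Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: "
    "to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, "
    "delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, "
    "status=deferred (bounce or trace service failure)",
]

# Hypothetical pattern, fitted only to the sample lines above.
DELIVERY_RE = re.compile(
    r": (?P<queue_id>[0-9A-F]+): to=<(?P<rcpt>[^>]*)>.*?"
    r"dsn=(?P<dsn>\d\.\d+\.\d+), status=(?P<status>\w+) \((?P<reason>.*)\)$"
)


def parse_delivery(line):
    """Return queue_id, rcpt, dsn, status and reason, or None if the line is not a delivery line."""
    match = DELIVERY_RE.search(line)
    return match.groupdict() if match else None


deliveries = [parse_delivery(line) for line in LOG_LINES]
for d in deliveries:
    print(d["queue_id"], d["dsn"], d["status"], d["reason"])
```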
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if SMART info was now being charted
## on hetzner2, there's no changes. I don't see any charts related to SMART
## on hetzner3, there are two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; each has only one value (smartctl_exit_status), and it's "nan" on all of the time charts. This is expected, since it can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2, where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
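# a small Python sketch for reading a reply like the one above: the three-digit code's first digit gives the class (per RFC 5321, 4yz is a transient failure and 5yz a permanent one), and the dotted number after it is the enhanced status code. The function name classify_reply and the class labels are assumptions chosen for illustration. A 421 lands in the transient class, which is why postfix keeps the message queued and retries instead of bouncing it

```python
import re

# First segment of the reply logged above.
REPLY = "421-4.7.28 Gmail has detected an unusual rate of unsolicited mail."

# Reply classes keyed by the first digit of the three-digit SMTP code (RFC 5321).
CLASSES = {
    2: "success",
    3: "intermediate",
    4: "transient failure",
    5: "permanent failure",
}


def classify_reply(reply):
    """Return (code, enhanced_status_or_None, class_label) for one SMTP reply line."""
    match = re.match(r"(\d{3})[- ](\d\.\d{1,3}\.\d{1,3})?", reply)
    if match is None:
        raise ValueError(f"not an SMTP reply: {reply!r}")
    code = int(match.group(1))
    return code, match.group(2), CLASSES.get(code // 100, "unknown")


print(classify_reply(REPLY))
```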
# Marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# whoa, I'm tailing the mail log and I just saw what was probably hundreds or thousands of emails attempt to send. phpList is *supposed* to send in small batches, but I wonder if, once a message fails and gets added to the queue, the re-send happens without batching..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
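# the send-time arithmetic for the one-per-second throttle, as a quick Python sketch. It assumes MAILQUEUE_THROTTLE is simply a fixed pause in seconds per email and ignores the time each send itself takes; the helper name campaign_hours is made up for illustration

```python
def campaign_hours(recipients, seconds_per_email):
    """Hours needed to reach every recipient at a fixed pause per email."""
    return recipients * seconds_per_email / 3600


emails_per_hour = 3600 // 1                  # one email per second
todays_campaign = campaign_hours(14111, 1)   # size of today's campaign
full_list = campaign_hours(15000, 1)         # rounded-up list size

print(f"{emails_per_hour} emails/hour")
print(f"14,111 recipients: {todays_campaign:.1f} h")
print(f"15,000 recipients: {full_list:.1f} h")
```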
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
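# a minimal Python sketch of how the $default_destination_rate_delay references in the postconf output above resolve to the new 2s value. Postfix does this expansion itself; the sketch only handles the plain $name form (not ${name}), and the helper names are assumptions. The per-hour figure is a rough ceiling that assumes one delivery at a time per destination

```python
# A few lines taken from the postconf output above.
POSTCONF_OUTPUT = """\
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
smtp_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
"""


def parse_postconf(text):
    """Turn 'name = value' lines into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def resolve(settings, name, max_depth=10):
    """Follow plain $name references until a literal value is reached."""
    value = settings[name]
    for _ in range(max_depth):
        if not value.startswith("$"):
            return value
        value = settings[value[1:]]
    raise ValueError(f"too many levels of indirection for {name}")


settings = parse_postconf(POSTCONF_OUTPUT)
smtp_delay = resolve(settings, "smtp_destination_rate_delay")
delay_seconds = int(smtp_delay.rstrip("s"))
max_deliveries_per_hour = 3600 // delay_seconds

print(smtp_delay, max_deliveries_per_hour)
```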
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as layered security for ssh (to force keys), but that breaks sudo, which requires the user's password. I could set sudo to NOPASSWD, but in general I think it's safer to set a user password (and keep ssh password logins disabled) than to set sudoers to NOPASSWD
# disabled passwords are indicated by the '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point from the tgriffing entry
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
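# a short Python sketch for reading the second field of /etc/shadow lines like the ones above. Per shadow(5), a field starting with '!' or '*' can't match any password, while an empty field means no password is required, which is a different state from having a password set. The function name and the three labels are assumptions, and "set" just means "anything else" (presumably a hash)

```python
def password_state(shadow_line):
    """Return (username, state) where state is 'locked', 'empty' or 'set'."""
    fields = shadow_line.split(":")
    name, password = fields[0], fields[1]
    if password == "":
        state = "empty"    # no password required to authenticate
    elif password.startswith(("!", "*")):
        state = "locked"   # no password can ever match this field
    else:
        state = "set"      # assumed to hold a password hash
    return name, state


# Lines copied from the tail output above (before and after the edit).
before = password_state("tgriffing:!:20133:0:99999:7:::")
after = password_state("tgriffing::20133:0:99999:7:::")
print(before, after)
```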
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but better to do this early in Germany time..
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>
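# the timezone conversion behind the proposed slot, as a Python sketch. It uses fixed UTC offsets rather than a timezone database, and the offsets are assumptions for late April 2025: UTC+2 for Germany (CEST), UTC-5 for Ecuador, and UTC-5 for FeF (assuming US Central daylight time)

```python
from datetime import datetime, timedelta, timezone

# Proposed change window from the email above.
proposed = datetime(2025, 4, 24, 10, 0, tzinfo=timezone.utc)

# Assumed fixed offsets (hours from UTC) for the date in question.
OFFSETS = {
    "Germany": 2,
    "Ecuador": -5,
    "FeF": -5,
}

local_times = {
    place: proposed.astimezone(timezone(timedelta(hours=hours))).strftime("%Y-%m-%d %H:%M")
    for place, hours in OFFSETS.items()
}

for place, when in local_times.items():
    print(f"{place}: {when}")
```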

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode we can try before it becomes destructive: innodb_force_recovery = 3
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set the recovery mode back down to 2 and added a line setting innodb_purge_threads to 0; I'll try that when it's not blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
# the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
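# a Python sketch of a slightly stricter filter than `grep -vi OK`: it treats OK as a status only when it's the last word on the line (so a table name that happens to contain "ok" isn't dropped), and it keeps each table's message level, which separates the "note" entries from the "warning" entries above. The sample is a subset of the log lines above; the function name and the list of message levels are assumptions

```python
import re

# A few lines copied from mysqlcheck.20250418.log as shown above.
SAMPLE = """\
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
cacti_db.poller_output
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
"""

# Message lines look like "<level> : <text>"; the level list is an assumption.
MESSAGE_RE = re.compile(r"^(status|info|note|warning|error)\s*:\s*(.*)$", re.IGNORECASE)


def non_ok_tables(text):
    """Map each table that did not report OK to its list of (level, message) pairs."""
    findings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        message = MESSAGE_RE.match(line)
        if message:
            if current is not None:
                findings[current].append((message.group(1).lower(), message.group(2)))
        elif line.split()[-1] == "OK":
            current = None
        else:
            current = line
            findings[current] = []
    return findings


findings = non_ok_tables(SAMPLE)
for table, messages in findings.items():
    for level, text in messages:
        print(f"{level:8} {table}: {text}")
```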
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_s_db |
| osewiki_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
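# for reference, switching recovery mode on and off comes down to one line in <code>my.cnf</code>. The sketch below is an assumption about how that edit could be scripted, not what was typed in vim: it assumes the option lives under a <code>[mysqld]</code> section, that GNU sed is available, and the function name <code>set_force_recovery</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set (or remove) innodb_force_recovery in a my.cnf-style file.
# Assumptions: GNU sed, and a "[mysqld]" section header already present in the file.
set_force_recovery() {
  cnf="$1"
  level="$2"
  # drop any existing setting so the option is never duplicated
  sed -i '/^[[:space:]]*innodb_force_recovery[[:space:]]*=/d' "$cnf"
  # level 0 means a normal startup: leave the option out entirely
  if [ "$level" -gt 0 ]; then
    # append the option directly below the [mysqld] header
    sed -i "/^\[mysqld\]/a innodb_force_recovery = $level" "$cnf"
  fi
}
# Usage (path assumed from the session above; back the file up first):
#   cp /etc/my.cnf /etc/my.cnf.bak
#   set_force_recovery /etc/my.cnf 1 && systemctl restart mariadb
```

# if the file has no <code>[mysqld]</code> header, the sketch silently adds nothing, so check the result before restarting.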
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time just as 1. This time it did start, but this loop gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
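# a crash loop like the one above can be spotted without reading the whole log by counting the restart and assertion lines. This is only a sketch: the function name <code>crash_loop_summary</code> is made up, and the two patterns are copied from the log excerpt above, so other versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize how often mysqld_safe restarted mysqld and
# how many InnoDB assertion failures were logged.
crash_loop_summary() {
  log="$1"
  # grep -c exits non-zero when the count is 0, hence the "|| true"
  restarts=$(grep -c 'mysqld_safe mysqld restarted' "$log" || true)
  asserts=$(grep -c 'InnoDB: Assertion failure' "$log" || true)
  echo "restarts=$restarts assertion_failures=$asserts"
}
# Usage (log path taken from the systemctl status output in this session):
#   crash_loop_summary /var/log/mariadb/mariadb.log
```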
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
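# one caveat with <code>grep -vi OK</code>: it also hides any line whose table name happens to contain "ok" (for example a table called <code>bookmarks</code>). A slightly stricter filter, sketched here under the assumption that mysqlcheck prints the status word at the end of each line, only drops lines that actually end in <code>OK</code>. The function name <code>mysqlcheck_problems</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a mysqlcheck log that does not end
# with the status word "OK" (assumes the usual "db.table   OK" layout).
mysqlcheck_problems() {
  # "|| true" because grep exits non-zero when nothing is left to print
  grep -Ev '[[:space:]]OK[[:space:]]*$' "$1" || true
}
# Usage, with the log file name from this session:
#   mysqlcheck_problems mysqlcheck.20250418_201559.log
```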
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
# well, fuck, now I don't know why it won't start, and it doesn't tell me why. The good news is that I was able to get a db dump. Maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
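# typing each <code>DROP DATABASE</code> by hand invites typos like the two above. As a sketch only (nothing like this was run here), the statements could be generated from the database list and reviewed before use. The function name <code>gen_drop_statements</code> is made up, and the skip list of system schemas is an assumption that matches this server.

```shell
#!/bin/sh
# Hypothetical helper: read one database name per line on stdin and print a
# DROP DATABASE statement for each, skipping the system schemas.
# It only generates text; it never executes anything.
gen_drop_statements() {
  grep -Evx 'information_schema|performance_schema|mysql' \
    | sed 's/.*/DROP DATABASE `&`;/'
}
# Usage (review drop.sql by hand before feeding it back to mysql):
#   mysql -uroot -p"$mysqlPass" -N -e 'SHOW DATABASES' | gen_drop_statements > drop.sql
```

# database names containing a backtick are not handled by this sketch.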
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, two of the three are missing: the dump contains mysql, but not information_schema or performance_schema
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
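# the same check can be reduced to a plain list of database names, which is easier to compare against <code>show databases</code>. This is a sketch, assuming the dump uses the <code>CREATE DATABASE /*!32312 IF NOT EXISTS*/</code> form shown above; the function name <code>dump_databases</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the database names that a gzipped mysqldump file
# will create. Only lines that START with "CREATE DATABASE" are considered, so
# row data that merely mentions the phrase is ignored.
dump_databases() {
  gzip -dc "$1" \
    | sed -n 's/^CREATE DATABASE .*IF NOT EXISTS\*\/ `\([^`]*\)`.*/\1/p'
}
# Usage, with the dump file name from this session:
#   dump_databases mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
```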
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to back up
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# mysql looks good
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't log in
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
</pre>
# now I can see our three databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot` – that would be better for him than logging into the hetzner wui and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin access to *just* the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
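# an alternative worth noting (a sketch, not what was done above): since <code>/etc/sudoers.d</code> is included, the rule could live in its own drop-in file, be syntax-checked with <code>visudo -cf</code>, and be inspected as root with <code>sudo -l -U</code>, which does not need the user's password. The function name <code>reboot_rule</code> and the drop-in file name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a sudoers rule that lets one user run only
# /sbin/reboot as root, without a password (same rule as in the session above).
reboot_rule() {
  printf '%s ALL = NOPASSWD: /sbin/reboot\n' "$1"
}
# Assumed usage, as root (the drop-in file name is hypothetical):
#   reboot_rule marcin > /etc/sudoers.d/marcin-reboot
#   chmod 0440 /etc/sudoers.d/marcin-reboot
#   visudo -cf /etc/sudoers.d/marcin-reboot   # syntax check only
#   sudo -l -U marcin                         # list what marcin may run
```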
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
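# the email above mentions running on one non-redundant disk while the RAID re-builds. A minimal sketch for spotting a degraded array, assuming the usual /proc/mdstat layout for a two-disk mirror ([UU] when healthy, [U_] or [_U] when one member is missing):

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print each array that has a missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the current array name
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }     # status field contains a "_"
    '
}

# Assumed usage on the server:
#   degraded_arrays < /proc/mdstat
#   watch -n 10 cat /proc/mdstat     # follow the rebuild progress
```

# no output means every array reports all members present.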
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here are our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to setup grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
# right, so if I understand this correctly: we're not updating grub's config. We're using 'grub-install' to write the grub boot code *to* the new drive; grub.cfg itself stays in /boot on the raid. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>

=Thu Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I hadn't touched OSE infra in 6 days
# the wiki is still on hetzner2, which is on EOL CentOS 7, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
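# a small sketch for pulling just the final status code out of a redirect chain like the one above. The function name is made up; it only parses the header text that curl prints.

```shell
#!/bin/sh
# Read `curl -siL` output on stdin and print the status code of the last response.
last_status() {
    tr -d '\r' | awk '/^HTTP\// { code = $2 } END { print code }'
}

# Assumed usage:
#   curl -siL wiki.opensourceecology.org | last_status
```

# that makes it easy to poll the site from a loop while working on the db.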
# disk is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
# there's also no new logs in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but a bunch gets written to the actual log file that the journal log mentions (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
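# the interesting part of that log is just the two assertion lines. A minimal sketch for pulling them out of a long log; the function name is made up, and the log path in the usage comment is the one mysqld_safe reports above.

```shell
#!/bin/sh
# Print only the InnoDB assertion lines from the given log files (or stdin).
innodb_assertions() {
    grep -E 'InnoDB: (Assertion failure|Failing assertion)' "$@"
}

# Assumed usage:
#   innodb_assertions /var/log/mariadb/mariadb.log | sort | uniq -c
```

# the file name and line number in the "Assertion failure" line (trx0purge.c line 822 here) are what to search bug trackers for.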
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from 13 hours ago and one from 37 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
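# to double-check the backup ages, here's a sketch that does the arithmetic from the timestamps in the listing above. It assumes GNU date (the -d option); the function name is made up.

```shell
#!/bin/sh
# hours_between START END: hours between two UTC timestamps, rounded to the nearest hour.
hours_between() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( (end - start + 1800) / 3600 ))
}

# mtimes of the two backup files vs. the `date` output above (all UTC)
hours_between '2025-04-17 07:48' '2025-04-17 20:36:56'   # newest backup
hours_between '2025-04-16 07:52' '2025-04-17 20:36:56'   # previous backup
```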
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it keeps breaking since some months ago (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
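# a hedged sketch of how the advice quoted above could be applied: generate a config fragment for a given innodb_force_recovery level and refuse anything above 3. The function name, the drop-in file name, and the dump path are all hypothetical, and it assumes /etc/my.cnf includes /etc/my.cnf.d. This is not something that was run here.

```shell
#!/bin/sh
# Print a my.cnf fragment for the given innodb_force_recovery level.
# Levels above 3 are refused, since the quote above warns they may corrupt data further.
force_recovery_cnf() {
    case "$1" in
        1|2|3) printf '[mysqld]\ninnodb_force_recovery = %s\n' "$1" ;;
        *) echo "refusing innodb_force_recovery=$1 (only 1-3 allowed here)" >&2
           return 1 ;;
    esac
}

# Assumed procedure, as root, ideally on a copy of the data (paths are hypothetical):
#   force_recovery_cnf 1 > /etc/my.cnf.d/force-recovery.cnf
#   systemctl start mariadb && mysqldump --all-databases > /var/tmp/dump.sql
#   # if mariadb still won't start, retry with 2, then 3
#   rm /etc/my.cnf.d/force-recovery.cnf     # remove before returning to normal operation
force_recovery_cnf 3
```

# starting at 1 and only stepping up when the server still won't start keeps the level as low as possible.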
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running MariaDB 5.5.68 (per the startup log above), so the closest MySQL doc should be this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1; apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up, put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that dumps taken in recovery mode 1, 2, or 3 are relatively safe: only some data on the corrupt individual pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
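# a minimal sketch for reducing that output to one word per disk, so several disks can be compared at a glance. The function name is made up; it only parses the "overall-health" line that smartctl -H prints above.

```shell
#!/bin/sh
# Read `smartctl -H` output on stdin and print just the overall-health verdict.
smart_health() {
    sed -n 's/^SMART overall-health self-assessment test result: //p'
}

# Assumed usage, as root:
#   for d in /dev/sda /dev/sdb; do
#       printf '%s: ' "$d"; smartctl -H "$d" | smart_health
#   done
```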
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given an option if we should hot swap while the system is on or shutdown. I'm going to say shutdown. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
# I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I ask them to replace? Can I see which one is more fucked than the other?
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
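# a minimal sketch for pulling the flagged rows out of <code>smartctl -A</code> output like the above, assuming the 10-column ATA attribute table layout shown there (the function name <code>smart_failed_attrs</code> is made up for this example)

```shell
#!/bin/sh
# Print ID, name and WHEN_FAILED value for every SMART attribute whose
# WHEN_FAILED column (9th field) is anything other than "-".
# Reads `smartctl -A` style output on stdin.
smart_failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}

# Two sample rows copied from the sda output above.
smart_failed_attrs <<'EOF'
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
EOF
```

# on the real host this would be run as <code>smartctl -A /dev/sda | smart_failed_attrs</code>; an empty result would mean no attribute is flagged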
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner3 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
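# a small sketch for reading that wear figure out of the NVMe health log, assuming the <code>Percentage Used:</code> line format shown in the hetzner3 output (the helper name <code>nvme_percent_used</code> is made up here)

```shell
#!/bin/sh
# Print the bare number from the "Percentage Used:" line of
# `smartctl -A /dev/nvmeXnY` output read on stdin.
nvme_percent_used() {
    awk -F: '/^Percentage Used:/ { gsub(/[ %]/, "", $2); print $2 }'
}

# Sample lines copied from the nvme0n1 output above.
nvme_percent_used <<'EOF'
Available Spare Threshold: 10%
Percentage Used: 6%
EOF
```

# on hetzner3 this would be <code>smartctl -A /dev/nvme0n1 | nvme_percent_used</code>; the SATA disks on hetzner2 don't print this line, so there the function prints nothing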
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
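# the <code>[UU]</code> vs <code>[U_]</code> check can be scripted; a minimal sketch, assuming the <code>/proc/mdstat</code> layout shown above (the function name <code>mdstat_degraded</code> and the degraded sample are made up for illustration)

```shell
#!/bin/sh
# Print every md array whose member-status field (e.g. [UU]) contains
# an underscore, meaning a mirror member is missing.
# Reads /proc/mdstat style text on stdin.
mdstat_degraded() {
    awk '
        /^md[0-9]+ :/ { dev = $1 }
        match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print dev, status
        }
    '
}

# Hypothetical sample: md0 has lost a member, md1 is healthy.
mdstat_degraded <<'EOF'
md0 : active raid1 sdb1[1]
33521664 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]
EOF
```

# on the server itself this would be <code>mdstat_degraded < /proc/mdstat</code>; no output means every array still has all its members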
# ah crap, the process to bring the new drive back into the RAID is not-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back-up.
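## a dry-run sketch of those three steps, as I understand the hetzner guide for MBR disks. It only prints the commands, it doesn't run them. Assumptions: sda is the replaced disk and sdb the survivor (not decided yet at this point), partition N backs md(N-1) as in the mdstat output above, and the function name <code>print_raid_rebuild_plan</code> is made up

```shell
#!/bin/sh
# Print (do NOT execute) the commands to bring a blank replacement disk
# back into the three RAID1 mirrors.
#   $1 = surviving disk to copy the MBR partition table from
#   $2 = new, blank disk
print_raid_rebuild_plan() {
    src=$1
    dst=$2
    # 1. clone the partition layout from the surviving disk
    echo "sfdisk -d $src | sfdisk $dst"
    # 2. add each new partition to its array (sdX1->md0, sdX2->md1, sdX3->md2)
    for n in 1 2 3; do
        echo "mdadm /dev/md$((n - 1)) -a ${dst}${n}"
    done
    # 3. reinstall the bootloader on the new disk
    echo "grub2-install $dst"
}

print_raid_rebuild_plan /dev/sdb /dev/sda
```

## printing the plan first makes it easy to double-check the source and target disks before anything destructive is typed; getting the <code>sfdisk</code> direction backwards would wipe the only good copy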
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
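# to grab just the short serial (handy for pasting into the hetzner ticket), a tiny sketch assuming the <code>KEY=value</code> property format shown above (the helper name <code>serial_short</code> is made up)

```shell
#!/bin/sh
# Print only the value of ID_SERIAL_SHORT from
# `udevadm info --query=property` output read on stdin.
serial_short() {
    awk -F= '$1 == "ID_SERIAL_SHORT" { print $2 }'
}

# Sample lines copied from the sda output above.
serial_short <<'EOF'
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
EOF
```

# on the host: <code>udevadm info --query=property --name sda | serial_short</code>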
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2 (<code>innodb_force_recovery = 2</code>); this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously documented scattered articles about backups, luks, veracrypt, pgp, cybersec general, etc in a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be more easily able to do this in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there's too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
# so 13107200 / 1024 / 1024 = 12.5 MB.
# jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
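# the byte arithmetic as a quick sketch (the helper name <code>bytes_to_mib</code> is made up). Worth noting: the 413 log line quotes a limit of 1000000, not 13107200, and its "no files" wording may correspond to a separate ModSecurity directive (<code>SecRequestBodyNoFilesLimit</code>), which a grep for 'BodyLimit' would not match. That's a guess from the message text; I haven't confirmed it against this server's config

```shell
#!/bin/sh
# Convert a byte count to MiB with two decimal places.
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024 / 1024 }'
}

bytes_to_mib 13107200   # SecRequestBodyLimit from mod_security.conf -> 12.50
bytes_to_mib 1000000    # limit quoted in the 413 log line -> 0.95
```

# if the second number is the one that actually applied, the page only had to pass roughly 1 MB (not 10+ MB) to trigger the 413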
# ...
# so Marcin is stressing urgency to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
# I didn't want to do this site before the other less-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:29:40Z

Maltfield: /* Status */

=Status=

==2025-04-30 14:30 UTC==

This change was completed successfully

==2025-04-30 14:18 UTC==

# I'm going to double-tap the grub install before giving it a reboot
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# and I rebooted it
<pre>
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
user@personal:~$ autossh opensourceecology.org
Last login: Wed Apr 30 11:28:26 2025 from REDACTED
[maltfield@opensourceecology ~]$ uptime
14:17:14 up 1 min, 1 user, load average: 0.85, 0.24, 0.08
[maltfield@opensourceecology ~]$
</pre>
# cool, it came back.
# cool, raid looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[3]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[3]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
  └─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
  └─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# and SMART isn't yelling about failed disks anymore
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>

==2025-04-30 14:13 UTC==

The RAID sync is finished; I guess these Micron 500G disks have better i/o throughput than our old 250G Crucial disks

<pre>
Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
[====>................] recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>

==2025-04-30 13:48 UTC==

Since we can't get a new drive from hetzner, I went ahead and added the drive they gave us to the RAID

# looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0 /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
</pre>
# the sda partition table backup is empty, which makes sense for a blank replacement disk
# I copied the sdb partition table to sda
<pre>
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sda2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sda3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sda4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
</pre>
# and told the kernel to re-read the partition table
<pre>
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
<pre>
[root@opensourceecology chg.20250430_134343]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 200.4G 0 part
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology chg.20250430_134343]#

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
</pre>
# cool, that worked. /boot is already done, and it's syncing root (/) now
<pre>
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
[=>...................] recovery = 9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
</pre>
# I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was done first and is already done syncing
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# as noted in the docs, those warnings can be safely ignored

==2025-04-30 13:26 UTC==

# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that has already been used for 18,623 hours and, according to SMART, has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb reports PASSED and /dev/sda reports FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]]

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks are expected to fail within 24 hours and should be replaced immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
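
The wear level behind these SMART failures is reported in attribute 202. The following is a minimal sketch for pulling that figure out of <code>smartctl -A</code> output. It assumes (as on these Crucial and Micron disks) that the attribute is named <code>Percent_Lifetime_Remain</code> and that its normalized VALUE column is the percentage of life remaining; other vendors label this attribute differently. <code>lifetime_remaining</code> is a hypothetical helper name, not part of smartmontools.

```shell
# Print the normalized VALUE (4th column) of the Percent_Lifetime_Remain
# attribute from `smartctl -A` output on stdin. Adding 0 strips the leading
# zeros, so "032" is printed as 32.
lifetime_remaining() {
  awk '$2 == "Percent_Lifetime_Remain" { print $4 + 0 }'
}
```

Usage would be <code>smartctl -A /dev/sda | lifetime_remaining</code>. For the two disks above it would print 0, since both rows show a VALUE of 000.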

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.
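
Once a rebuild is running, the remaining time can be estimated as (blocks left) / (sync speed). The sketch below does that arithmetic on figures as they appear in <code>/proc/mdstat</code>, where blocks are 1 KiB and speed is in K/sec. <code>resync_eta_seconds</code> is a hypothetical helper name. The speed is only known after the sync has started, so this cannot replace the conservative estimate above.

```shell
# Estimate the seconds left in an md rebuild.
#   $1 = total blocks in the array
#   $2 = blocks already synced
#   $3 = current sync speed in K/sec
resync_eta_seconds() {
  total=$1
  synced=$2
  speed=$3
  echo $(( (total - synced) / speed ))
}
```

With the md2 figures recorded in the status log (209984640 blocks total, 19231872 synced, 192161K/sec), <code>resync_eta_seconds 209984640 19231872 192161</code> prints 992, roughly 16.5 minutes, which agrees with the <code>finish=16.5min</code> that mdstat itself reported.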

==Systems Impacted==

This change impacts [[hetzner2]]; every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
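
To avoid eyeballing the output, a small filter can list only the arrays that are missing a member. This is a sketch, assuming the usual mdstat layout where a healthy two-disk RAID1 shows <code>[UU]</code> and an underscore (eg <code>[_U]</code>) marks a missing member; <code>degraded_arrays</code> is a hypothetical helper name.

```shell
# Print the name of every md array whose status shows a missing member,
# given /proc/mdstat-style text on stdin.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/       { name = $1 }    # remember which array this block describes
    /\[[U_]*_[U_]*\]/   { print name }   # status such as [_U] or [U_]
  '
}
```

Usage would be <code>degraded_arrays < /proc/mdstat</code>; empty output means every array has all of its members.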

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
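
"Sane" here mostly means "present and not suspiciously small". The following sketch checks that against the <code>rclone ls</code> listing, which prints one <code>size path</code> pair per line. <code>backup_is_sane</code> and the minimum size are hypothetical; pick a threshold that fits the usual size of our daily backups.

```shell
# Succeed if the `rclone ls` listing on stdin contains a file whose path
# includes the given day (YYYYMMDD) and whose size is at least min_bytes.
backup_is_sane() {
  day=$1        # e.g. 20250430
  min_bytes=$2  # hypothetical threshold, in bytes
  awk -v day="$day" -v min="$min_bytes" '
    index($2, day) && $1 + 0 >= min + 0 { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

For example, <code>${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 1000000000</code> would fail if today's backup is missing or smaller than about 1 GB.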

During the morning in Germany (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>
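
Failing a member is only safe if the other disk is still an active member of every array. As a guard that could be run before the commands above, this sketch lists any array that does not currently have an active partition of the given disk. <code>arrays_missing_disk</code> is a hypothetical helper name; it assumes members are printed in mdstat as <code>sdb1[2]</code> and that a failed member carries a trailing <code>(F)</code>.

```shell
# Given /proc/mdstat-style text on stdin, print every md array that has no
# active member on the named disk (e.g. "sdb").
arrays_missing_disk() {
  disk=$1
  awk -v disk="$disk" '
    /^md[0-9]+ :/ {
      ok = 0
      for (i = 3; i <= NF; i++)
        # matches "sdb1[2]" but not a failed "sdb1[2](F)"
        if ($i ~ ("^" disk "[0-9]+\\[[0-9]+\\]$")) ok = 1
      if (!ok) print $1
    }
  '
}
```

Before failing the sda partitions, <code>arrays_missing_disk sdb < /proc/mdstat</code> should print nothing.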

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
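
The serial comparison can also be scripted. This is a sketch that reads <code>udevadm info --query=property</code> output and checks the short serial against the one we asked hetzner to remove; <code>serial_from_udev</code> and <code>disk_was_swapped</code> are hypothetical helper names.

```shell
# Print the short serial number from udevadm property output on stdin.
serial_from_udev() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Succeed if the disk described on stdin reports a serial number and that
# serial differs from the old disk's serial given as $1.
disk_was_swapped() {
  old_serial=$1
  new_serial=$(serial_from_udev)
  [ -n "$new_serial" ] && [ "$new_serial" != "$old_serial" ]
}
```

For this change, <code>udevadm info --query=property --name sda | disk_was_swapped 154410FA336C</code> should succeed once the replacement drive is in.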

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the healthy disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>
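
To confirm the copy, fresh dumps of both disks can be compared once the device names are masked out. This is a sketch, assuming each partition line in the <code>sfdisk -d</code> output starts with the device path (eg <code>/dev/sdb1 : start=...</code>); <code>same_layout</code> is a hypothetical helper name, and only the partition lines are compared.

```shell
# Succeed if two `sfdisk -d` dump files describe the same partitions,
# ignoring which disk (/dev/sda, /dev/sdb, ...) each dump came from.
# An empty dump never counts as a match.
same_layout() {
  a=$(sed -n 's#^/dev/[a-z]*#DISK#p' "$1")
  b=$(sed -n 's#^/dev/[a-z]*#DISK#p' "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example: <code>sfdisk -d /dev/sda > sda.after; sfdisk -d /dev/sdb > sdb.after; same_layout sda.after sdb.after</code>. The dumps must be taken after the copy, because a blank replacement disk produces an empty dump beforehand.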

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Install grub onto the new disk so that the server can boot from it. On this CentOS 7 host the command is <code>grub2-install</code>, not <code>grub-install</code>

<pre>
grub2-install /dev/sda
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
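
The loop above runs until interrupted. As an alternative, this sketch stops by itself once no array is rebuilding or degraded. It assumes mdstat shows <code>recovery =</code>, <code>resync =</code> or <code>resync=DELAYED</code> while a sync is pending and <code>[_U]</code>-style status while a member is missing. <code>sync_in_progress</code> and <code>wait_for_sync</code> are hypothetical helper names.

```shell
# Succeed while the mdstat text on stdin still shows a rebuild in progress,
# a delayed rebuild, or an array with a missing member.
sync_in_progress() {
  grep -Eq 'recovery =|resync ?=|\[[U_]*_[U_]*\]'
}

# Print a timestamp and the mdstat contents every $2 seconds (default 300)
# until the file at $1 (default /proc/mdstat) shows no rebuild.
wait_for_sync() {
  mdstat=${1:-/proc/mdstat}
  interval=${2:-300}
  while sync_in_progress < "$mdstat"; do
    date -u
    cat "$mdstat"
    echo
    sleep "$interval"
  done
}
```

Running <code>wait_for_sync</code> with no arguments polls <code>/proc/mdstat</code> every 5 minutes and returns when all arrays are back to <code>[UU]</code>.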

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:29:17Z

Maltfield: /* Status */

=Status=

==2025-04-30 14:18 UTC==

# I'm going to double-tap the grub install before giving it a reboot
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# and I rebooted it
<pre>
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
user@personal:~$ autossh opensourceecology.org
Last login: Wed Apr 30 11:28:26 2025 from REDACTED
[maltfield@opensourceecology ~]$ uptime
14:17:14 up 1 min, 1 user, load average: 0.85, 0.24, 0.08
[maltfield@opensourceecology ~]$
</pre>
# cool, it came back.
# cool, raid looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[3]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[3]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# and SMART isn't yelling about failed disks anymore
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>

==2025-04-30 14:13 UTC==

The RAID sync is finished; I guess these Micron 500G disks have better I/O throughput than our old 250G Crucial disks

<pre>
Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
[====>................] recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>

==2025-04-30 13:48 UTC==

Since we can't get a new drive from hetzner, I went ahead and added the drive they gave us to the RAID

# looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0 /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
</pre>
# the sda partition table backup is empty, which makes sense (the replacement disk has no partitions yet)
# I copied the sdb partition table to sda
<pre>
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sda2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sda3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sda4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
</pre>
# and told the kernel to re-read the partition table
<pre>
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
<pre>
[root@opensourceecology chg.20250430_134343]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 200.4G 0 part
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology chg.20250430_134343]#

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
</pre>
# cool, that worked. /boot is already done, and it's syncing root (/) now
<pre>
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
[=>...................] recovery = 9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
</pre>
# I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was done first and is already done syncing
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# as noted in the docs, those warnings can be safely ignored

==2025-04-30 13:26 UTC==

# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that already has been used for 18,623 hours, which means it has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>
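The "32% of its life left" figure above comes from the normalized VALUE column of attribute 202, not from its RAW_VALUE (which counts the percentage already used). A minimal sketch for pulling that number out of <code>smartctl -A</code> output follows. The function name <code>lifetime_remaining</code> is hypothetical, and the attribute name <code>Percent_Lifetime_Remain</code> is assumed to match what these Micron/Crucial drives report; other vendors may use a different name.

```shell
# Print the normalized VALUE of the Percent_Lifetime_Remain attribute.
# Reads "smartctl -A" output on stdin (hypothetical helper, not part of smartmontools).
lifetime_remaining() {
  # Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
  # "+ 0" strips the leading zeros, so "032" is printed as 32.
  awk '$2 == "Percent_Lifetime_Remain" { print $4 + 0 }'
}

# Example (not executed here):
#   smartctl -A /dev/sda | lifetime_remaining
```

Reading from stdin keeps the helper usable on saved SMART logs as well as on live output.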

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]]

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database was corrupt and wouldn't start, and a bug in mariadb prevented it from recovering. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said both of our two redundant disks would fail within 24 hours and should be replaced immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at-most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.
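For a rough sense of rebuild time, the remaining time can be estimated from the numbers that <code>/proc/mdstat</code> prints during a recovery. This is only a sketch: the function name <code>sync_eta_seconds</code> is hypothetical, and it assumes the block counts are in KiB and the speed is in KiB/sec, as the mdstat output in the Status log suggests.

```shell
# Estimate the seconds left in an md recovery.
# Arguments (all integers): total blocks, blocks already synced, speed in K/sec.
sync_eta_seconds() {
  total_kib="$1"
  done_kib="$2"
  speed_kib_s="$3"
  # Integer division; the result is truncated to whole seconds.
  echo $(( (total_kib - done_kib) / speed_kib_s ))
}

# Example with the md2 figures logged at 13:48 UTC:
#   sync_eta_seconds 209984640 19231872 192161
```

With the md2 figures from the 13:48 UTC status entry this gives 992 seconds, about 16.5 minutes, which agrees with the <code>finish=16.5min</code> that mdstat itself reported.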

==Systems Impacted==

This change impacts [[hetzner2]] and every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
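The "sane size" check above is done by eye. A scripted version might look like the following sketch, which reads <code>rclone ls</code> output (size, then path) on stdin. The function name <code>backup_is_sane</code> and the minimum-size threshold are assumptions for illustration, not part of our backup scripts.

```shell
# Succeed if stdin (in "rclone ls" format: SIZE PATH) lists a file whose
# name contains the given date and whose size is at least MIN_BYTES.
# Usage: backup_is_sane YYYYMMDD MIN_BYTES
backup_is_sane() {
  day="$1"
  min_bytes="$2"
  awk -v day="$day" -v min="$min_bytes" '
    index($2, day) > 0 && ($1 + 0) >= (min + 0) { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example (not executed here; the 1 GB threshold is a placeholder):
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 1000000000
```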

At some time in Germany's morning-ish (and also very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
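The serial comparison above can also be scripted, so that a later step can refuse to run if the wrong disk is in a slot. A minimal sketch, assuming the <code>ID_SERIAL</code> property is present in the udevadm output as it is in the Status log; the function name <code>serial_matches</code> is hypothetical.

```shell
# Succeed only if the udevadm property list on stdin contains exactly
# the line ID_SERIAL=<expected>.
# Usage: udevadm info --query=property --name sdb | serial_matches EXPECTED
serial_matches() {
  expected="$1"
  # -x: match the whole line, -F: fixed string, -q: no output
  grep -qxF "ID_SERIAL=${expected}"
}

# Example (not executed here):
#   udevadm info --query=property --name sdb \
#     | serial_matches "Micron_1100_MTFDDAK512TBN_171416BD4379" \
#     || echo "WARNING: sdb is not the disk we expected"
```

Matching the whole line avoids a false positive on <code>ID_SERIAL_SHORT</code>, which shares the same prefix.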

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the healthy disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>
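Because this step overwrites a partition table, a cautious variant could first check that the saved dump of the healthy disk is actually usable. The sketch below is only an illustration: the function name <code>parttable_dump_ok</code> is hypothetical, and it assumes that each partition line in an sfdisk dump contains a <code>start=</code> field.

```shell
# Succeed if the given sfdisk dump file is non-empty and lists at least
# one partition (a line containing "start=").
parttable_dump_ok() {
  dump_file="$1"
  [ -s "$dump_file" ] && grep -q 'start=' "$dump_file"
}

# Example (not executed here): only write to sda if the sdb dump looks valid.
#   parttable_dump_ok "${chg_dir}/sdb_parttable_mbr.bak" \
#     && sfdisk /dev/sda < "${chg_dir}/sdb_parttable_mbr.bak"
```

An empty dump, like the one produced for a blank replacement disk, fails the check.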

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub2-install</code>

<pre>
grub2-install /dev/sda
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.
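Instead of reading the mdstat output by eye, the wait can be scripted by counting arrays that still show a missing member. This is a sketch under the assumption that a degraded or rebuilding array shows an underscore in its status field (e.g. <code>[_U]</code>), as seen in the Status log; the function name <code>raid_degraded_count</code> is hypothetical.

```shell
# Print the number of md arrays that are missing a member.
# Takes an mdstat-formatted file; defaults to the live /proc/mdstat.
raid_degraded_count() {
  mdstat_file="${1:-/proc/mdstat}"
  # Status lines end in e.g. "[2/1] [_U]"; an underscore marks a missing member.
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -cE '\[[U_]*_[U_]*\]' "$mdstat_file" || true
}

# Example (not executed here): poll every 5 minutes until all arrays are [UU].
#   while [ "$(raid_degraded_count)" -gt 0 ]; do date -u; sleep 300; done
```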

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:27:33Z

Maltfield: /* Status */

=Status=

==2025-04-30 14:13 UTC==

The RAID sync is finished; I guess these Micron 500G disks have better I/O throughput than our old 250G Crucial disks

<pre>
Wed Apr 30 14:07:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
[====>................] recovery = 21.2% (7124992/33521664) finish=2.2min speed=191533K/sec

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Wed Apr 30 14:12:12 UTC 2025
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>

==2025-04-30 13:48 UTC==

Since we can't get a new drive from hetzner, I went ahead and added the drive they gave us to the RAID

# looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partition tables
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0 /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
</pre>
# the sda partition table backup is empty, which makes sense (the replacement disk has no partition table yet)
# I copied the sdb partition table to sda
<pre>
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sda2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sda3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sda4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
</pre>
# and told the kernel to re-read the partition table
<pre>
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
<pre>
[root@opensourceecology chg.20250430_134343]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 200.4G 0 part
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology chg.20250430_134343]#

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
</pre>
# cool, that worked. /boot is already done, and it's syncing root (/) now
<pre>
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
[=>...................] recovery = 9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
</pre>
# I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was done first and is already done syncing
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# as noted in the docs, those warnings can be safely ignored

==2025-04-30 13:26 UTC==

# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that already has been used for 18,623 hours, which means it has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>
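
The "life left" figures quoted in this log are read from SMART attribute 202 (<code>Percent_Lifetime_Remain</code>). As a minimal sketch, assuming the column layout shown in the output above, the hypothetical helper <code>smart_life_remaining</code> (not part of smartmontools) prints the normalized VALUE column for that attribute. On these Micron and Crucial drives that column appears to track the percentage remaining, while RAW_VALUE tracks the percentage used; other vendors may report this differently.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# normalized VALUE (4th column) of attribute 202 as a plain integer.
# Assumption: VALUE is "percent of life remaining", as this log interprets it.
smart_life_remaining() {
  awk '$1 == 202 { print $4 + 0 }'
}

# Example (run as root on the server):
#   smartctl -A /dev/sda | smart_life_remaining
```

A value well below what was requested from support (as happened here) is a reason to push back before adding the disk to the RAID.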

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (SSDs) (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]]

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks were going to fail within 24 hours and should be replaced immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

==Systems Impacted==

This change impacts [[hetzner2]], and every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
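
Reading <code>/proc/mdstat</code> by eye works, but the check can also be scripted. The following is only a sketch: <code>raid_all_healthy</code> is a hypothetical helper name, and it assumes the member map format shown in this ticket's logs (<code>[UU]</code> for a healthy two-disk mirror, <code>[_U]</code> for a degraded one).

```shell
# Hypothetical helper (not part of mdadm).
# Returns 0 only if the mdstat file lists at least one array and no array
# has a missing member (an underscore in its bracketed member map).
raid_all_healthy() {
  local mdstat_file="${1:-/proc/mdstat}"
  # Any "_" inside a member map such as [_U] or [U_] means degraded.
  if grep -Eq '\[[U_]*_[U_]*\]' "$mdstat_file"; then
    return 1
  fi
  # Require at least one fully-up member map such as [UU].
  grep -Eq '\[U+\]' "$mdstat_file"
}

# Example:
#   raid_all_healthy && echo "all arrays healthy" || echo "at least one array is degraded"
```

If this reports a degraded array before we have removed anything, stop and investigate before failing the remaining good member.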

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
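
The <code>grep</code> above only shows whether a file with today's date exists; judging whether its size is "sane" is left to the reader. A rough sketch of automating both checks follows. The helper name <code>backup_is_sane</code> and the 10 GB threshold in the example are assumptions (the backup in the log above was about 20 GB), and the helper expects the <code>size name</code> line format shown in that log.

```shell
# Hypothetical helper: reads `rclone ls`-style lines ("<bytes> <name>") on stdin.
# Returns 0 if some file name contains STAMP and is at least MIN_BYTES large.
backup_is_sane() {
  local stamp="$1" min_bytes="$2"
  awk -v stamp="$stamp" -v min="$min_bytes" '
    index($2, stamp) && ($1 + 0) >= (min + 0) { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example (the threshold is an assumed value; adjust to the usual backup size):
#   source /root/backups/backup.settings
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 10000000000 \
#     && echo "today's backup looks sane"
```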

During the morning in Germany (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
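
Comparing serial numbers by eye is easy to get wrong under time pressure. As a sketch only, the hypothetical helper <code>serial_short</code> below extracts the <code>ID_SERIAL_SHORT</code> property from the <code>udevadm</code> output so that it can be compared against the serial recorded in this ticket.

```shell
# Hypothetical helper: reads `udevadm info --query=property` output on stdin
# and prints the value of the first ID_SERIAL_SHORT line.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p' | head -n 1
}

# Example: confirm that sda no longer reports the serial of the removed disk.
#   old_sda_serial=154410FA336C
#   new_sda_serial=$(udevadm info --query=property --name sda | serial_short)
#   [ "$new_sda_serial" != "$old_sda_serial" ] && echo "sda has been replaced"
```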

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>
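
The backup of the new, blank disk is expected to be empty, but the backup of the surviving disk must not be. A minimal sketch of that check follows; <code>parttable_backup_ok</code> is a hypothetical helper, and it assumes that an <code>sfdisk --dump</code> file lists each partition on a line starting with <code>/dev/</code>.

```shell
# Hypothetical helper: returns 0 if the given sfdisk dump file is non-empty
# and contains at least one partition line (a line starting with "/dev/").
parttable_backup_ok() {
  local dump_file="$1"
  [ -s "$dump_file" ] && grep -q '^/dev/' "$dump_file"
}

# Example:
#   parttable_backup_ok "${chg_dir}/sdb_parttable_mbr.bak" \
#     || echo "sdb partition table backup is missing or empty; do not proceed"
```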

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the healthy disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub2-install</code> (the name of the GRUB 2 installer on this CentOS 7 host)

<pre>
grub2-install /dev/sda
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
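
The loop above runs until it is interrupted. A variant that stops by itself once no rebuild is running or queued could look like the sketch below. The function names are hypothetical, and it assumes that an active or pending rebuild always shows up in <code>/proc/mdstat</code> as a line containing <code>recovery</code> or <code>resync</code> (including <code>resync=DELAYED</code>), as in the logs in this ticket.

```shell
# Hypothetical helpers (not part of mdadm).

# Returns 0 while any array is rebuilding or waiting to rebuild.
raid_sync_in_progress() {
  local mdstat_file="${1:-/proc/mdstat}"
  grep -Eq 'recovery|resync' "$mdstat_file"
}

# wait_for_raid_sync [INTERVAL_SECONDS] [MDSTAT_FILE]
# Prints the RAID status every INTERVAL_SECONDS until no rebuild remains.
wait_for_raid_sync() {
  local interval="${1:-300}" mdstat_file="${2:-/proc/mdstat}"
  while raid_sync_in_progress "$mdstat_file"; do
    date -u
    cat "$mdstat_file"
    echo
    sleep "$interval"
  done
  echo "no recovery or resync in progress"
}
```

This only detects an active or queued rebuild. Before rebooting, still confirm that every array in <code>/proc/mdstat</code> shows <code>[UU]</code>.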

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

<pre>
sudo reboot
</pre>

==Revert Steps==

It is unclear whether a revert is even possible. If one were needed, we would have to contact hetzner and ask them to physically remove the new drive and re-install the old one that they just removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:24:22Z

Maltfield: /* Status */

=Status=

==2025-04-30 13:48 UTC==

Since we can't add a new drive, I went ahead and added the drive they gave us to the RAID

# looks like they gave us another 500G disk; I bet they just don't stock the 250G anymore
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_18301DC6A088
ID_SERIAL_SHORT=18301DC6A088
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partition tables
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250430_134343 ~
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
[root@opensourceecology chg.20250430_134343]# du -sh ${chg_dir}/*
0 /var/tmp/chg.20250430_134343/sda_parttable_mbr.bak
4.0K /var/tmp/chg.20250430_134343/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250430_134343]#
</pre>
# the sda partition table backup is empty, which makes sense (the replacement disk has no partition table yet)
# I copied the sdb partition table to sda
<pre>
[root@opensourceecology chg.20250430_134343]# sfdisk -d /dev/sdb | sfdisk /dev/sda
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sda: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sda2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sda3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sda4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250430_134343]#
</pre>
# and told the kernel to re-read the partition table
<pre>
[root@opensourceecology chg.20250430_134343]# blockdev --rereadpt /dev/sda
[root@opensourceecology chg.20250430_134343]#
</pre>
# and I added the three partitions of the new disk to the RAID; note that this time I added /boot first, then /, then swap. I think it'll sync in that order (of priority)
<pre>
[root@opensourceecology chg.20250430_134343]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk
├─sda1 8:1 0 32G 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 200.4G 0 part
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology chg.20250430_134343]#

[root@opensourceecology chg.20250430_134343]# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md2 -a /dev/sda3
mdadm: added /dev/sda3
[root@opensourceecology chg.20250430_134343]# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
[root@opensourceecology chg.20250430_134343]#
</pre>
# cool, that worked. /boot is already done, and it's syncing root (/) now
<pre>
[root@opensourceecology chg.20250430_134343]# date -u
Wed Apr 30 13:48:43 UTC 2025
[root@opensourceecology chg.20250430_134343]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[3] sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]
resync=DELAYED

md2 : active raid1 sda3[3] sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
[=>...................] recovery = 9.1% (19231872/209984640) finish=16.5min speed=192161K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[3] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology chg.20250430_134343]#
</pre>
# I went ahead and installed grub. I guess I'll do this again after all the partitions sync, but I think it should actually work this time because the /boot partition was done first and is already done syncing
<pre>
[root@opensourceecology ~]# grub2-install /dev/sda
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# as noted in the docs, those warnings can be safely ignored

==2025-04-30 13:26 UTC==

# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that has already been used for 18,623 hours and, according to SMART, has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.
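
For reference, a minimal sketch of how the "life left" figure can be pulled out of the <code>smartctl -A</code> output. The function name <code>life_remaining</code> is hypothetical. It assumes (as seen on the Micron/Crucial drives in this ticket) that the normalized VALUE column of attribute 202 is the percent remaining and the RAW_VALUE is the percent used; other vendors may report this differently.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print the
# normalized VALUE of attribute 202 (Percent_Lifetime_Remain) as an integer.
# Assumption: VALUE is "percent of life left", as on the drives in this ticket.
life_remaining() {
  awk '$1 == 202 { print $4 + 0 }'
}

# Usage on the live system:
#   smartctl -A /dev/sda | life_remaining
```

On the replacement drive above, this would report the same 32 that appears in the VALUE column.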

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]].

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said both of our two redundant disks were going to fail within 24 hours and should be replaced immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the RAID, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

==Systems Impacted==

This change impacts [[hetzner2]]; every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
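
If it helps, the "is anything degraded?" check can be sketched as a small function. This is only an illustration: the name <code>degraded_arrays</code> is hypothetical, and it assumes the usual <code>/proc/mdstat</code> layout shown in this ticket, where a member map such as <code>[_U]</code> marks a missing or not-yet-synced member.

```shell
# Hypothetical helper: print the name of every md array whose member map
# (e.g. [_U]) contains "_". Reads mdstat-formatted text on stdin so it can
# also be tried on saved output.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }               # remember the current array name
    /\[[U_]+\]$/ && $NF ~ /_/ { print name }  # member map with a missing member
  '
}

# Usage on the live system:
#   degraded_arrays < /proc/mdstat
```

No output means every array reports all members up (for example <code>[UU]</code>).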

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
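
A rough sketch of what "a sane size" could mean in script form. The function name <code>backup_is_sane</code> and the 10 GB threshold are assumptions for illustration only; the listing format (size in bytes, then file name) is taken from the <code>rclone ls</code> output recorded in this ticket.

```shell
# Hypothetical helper: succeed only if the listing on stdin contains a file
# whose name includes the given date stamp and whose size is >= min_bytes.
backup_is_sane() {
  local stamp="$1" min_bytes="$2"
  awk -v stamp="$stamp" -v min="$min_bytes" '
    index($2, stamp) && $1 + 0 >= min + 0 { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Usage on the live system (10000000000 bytes is an assumed threshold):
#   source /root/backups/backup.settings
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 10000000000 \
#     && echo "today's backup looks sane"
```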

Sometime in Germany's morning (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
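
The serial comparison can also be scripted. This is a minimal sketch, assuming the <code>udevadm</code> property output looks like the output recorded in this ticket; the function name <code>serial_short</code> is hypothetical.

```shell
# Hypothetical helper: read "udevadm info --query=property" output on stdin
# and print the value of ID_SERIAL_SHORT.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Serial of the disk being replaced, as listed in this ticket
old_serial="154410FA336C"

# Usage on the live system:
#   new_serial=$(udevadm info --query=property --name sda | serial_short)
#   [ "$new_serial" != "$old_serial" ] && echo "sda was swapped: $new_serial"
```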

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the first disk and pipe it to the second disk
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

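Copy our grub configuration and files onto the new disk using <code>grub2-install</code> (on this CentOS 7 server the command is <code>grub2-install</code>, not <code>grub-install</code>)

<pre>
grub2-install /dev/sda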
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
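
As an alternative to watching the loop by hand, here is a sketch of a loop that stops on its own once nothing is rebuilding. The function name <code>sync_in_progress</code> is hypothetical; it assumes that an active or queued rebuild shows up in <code>/proc/mdstat</code> as a line containing <code>recovery</code> or <code>resync</code> (including <code>resync=DELAYED</code>), as in the output recorded in this ticket.

```shell
# Hypothetical helper: succeed if the mdstat text on stdin shows a rebuild
# that is running or queued.
sync_in_progress() {
  grep -Eq 'recovery|resync'
}

# Usage on the live system: poll every 5 minutes, stop when nothing is syncing
#   while sync_in_progress < /proc/mdstat; do date; cat /proc/mdstat; echo; sleep 300; done
```

Note that a degraded array with no rebuild running would also end the loop, so still confirm that every array shows <code>[UU]</code> before rebooting.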

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:22:41Z

Maltfield: /* Status */

=Status=

==2025-04-30 13:26 UTC==

# I got a response back from hetzner 4 minutes later
<pre>
Dear Client.

We do not have these drives "new" anymore. Therefore, this is not possible. We already selected a drive with less than 20.000h. We also did not charge the fee for a new drive.
</pre>
# so it looks like we got the drive free, but that's still nearly a waste of my time. I replied and asked them how long it would take for them to order a new drive
<pre>
I emailed last week about this to make sure you had time to order a new drive (check my support tickets).

This drive you inserted has only 32% of its life left, according to SMART. It's closer to dead than new.

How long would it take you to order a new drive?
</pre>

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that has already been used for 18,623 hours and, according to SMART, has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C, an SSD) on [[hetzner2]].

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in MariaDB. Because hetzner2 runs EOL CentOS, we can't update MariaDB. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks would fail within 24 hours and should be replaced immediately:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
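The "life left" percentages quoted in this ticket come from the normalized VALUE column of SMART attribute 202 (Percent_Lifetime_Remain) in the output above. A minimal sketch for pulling that number out of <code>smartctl -A</code> output follows; the function name <code>lifetime_remaining_pct</code> is hypothetical, and it assumes the column layout shown in this ticket.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# normalized VALUE (4th column) of attribute 202 as a plain integer.
# Assumes the column layout shown in this ticket (ID#, name, flag, VALUE, ...).
lifetime_remaining_pct() {
  # "$4 + 0" strips the leading zeros, e.g. 004 becomes 4
  awk '$1 == 202 { print $4 + 0 }'
}
```

For example, <code>smartctl -A /dev/sda | lifetime_remaining_pct</code> would print the remaining-life percentage for /dev/sda.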

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at-most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.
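As a rough sanity check on the rebuild time, the arithmetic can be sketched as below. The array sizes are the 1 KiB block counts that <code>/proc/mdstat</code> reports for md0, md1 and md2 on hetzner2; the resync speed is purely an assumption, and the real speed depends on server load and kernel limits, so treat the result as an idealized lower bound rather than a prediction.

```shell
# Sizes of md0, md1 and md2 in 1 KiB blocks, as reported by /proc/mdstat on hetzner2
total_kib=$(( 33521664 + 523712 + 209984640 ))

# ASSUMPTION: sustained resync speed in KiB/s (not measured on this server)
speed_kib_per_s=50000

# integer division; good enough for a back-of-the-envelope estimate
est_seconds=$(( total_kib / speed_kib_per_s ))
echo "total: ${total_kib} KiB, idealized estimate: $(( est_seconds / 60 )) minutes"
```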

==Systems Impacted==

This change impacts [[hetzner2]]. Every service and website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
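If you'd rather not eyeball the output, the check can be sketched as a small function. The name <code>mdstat_is_healthy</code> is hypothetical; it only looks for a missing member (an underscore in the status field, such as <code>[_U]</code>) and does not cover every possible failure state.

```shell
# Hypothetical helper: succeeds only if no md array in the given mdstat text
# is missing a member. A healthy two-disk RAID1 shows "[UU]"; a missing
# member shows "_" in that field (e.g. "[_U]").
mdstat_is_healthy() {
  local mdstat_file="${1:-/proc/mdstat}"
  # look for a status field such as [_U] or [U_]
  if grep -Eq '\[[U_]*_[U_]*\]' "${mdstat_file}"; then
    return 1
  fi
  return 0
}
```

For example, <code>mdstat_is_healthy && echo "RAID OK"</code> reads <code>/proc/mdstat</code> by default.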

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
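The "sane size" check can also be sketched as a function that reads the <code>rclone ls</code> output (size in bytes, then file name). The name <code>backup_is_sane</code> and the minimum-size threshold are assumptions; choose a threshold that fits the usual size of your backups.

```shell
# Hypothetical helper: reads `rclone ls` output (size, then path) on stdin and
# succeeds only if a file whose name contains the given date stamp is at
# least min_bytes in size. The threshold is an assumption, not an OSE policy.
backup_is_sane() {
  local stamp="$1" min_bytes="$2"
  awk -v stamp="${stamp}" -v min="${min_bytes}" '
    index($2, stamp) && ($1 + 0) >= (min + 0) { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

For example: <code>${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_is_sane "$(date "+%Y%m%d")" 10000000000</code>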

At some time in Germany's morning-ish (and also very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>
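Because failing the wrong disk would leave the array with no working member, it can help to generate the commands first and review them before running anything. The sketch below only prints the commands; the function name <code>print_mdadm_removal</code> is hypothetical, and the array-to-partition mapping (md0 = partition 1, md1 = partition 2, md2 = partition 3) is the one used on hetzner2 above.

```shell
# Hypothetical dry-run helper: prints (does NOT execute) the fail and remove
# commands for every partition of the given disk, using the hetzner2 layout
# md0 = partition 1, md1 = partition 2, md2 = partition 3.
print_mdadm_removal() {
  local disk="$1" pair
  # first mark each partition as faulty
  for pair in md0:1 md1:2 md2:3; do
    echo "mdadm --manage /dev/${pair%%:*} --fail /dev/${disk}${pair##*:}"
  done
  # then remove each faulty partition from its array
  for pair in md0:1 md1:2 md2:3; do
    echo "mdadm /dev/${pair%%:*} -r /dev/${disk}${pair##*:}"
  done
}
```

Running <code>print_mdadm_removal sda</code> lists the six commands from the block above so they can be compared against the disk named in this ticket.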

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
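The serial comparison can be sketched as a pair of small functions that read the <code>udevadm</code> property output. The function names are hypothetical; the old serial to compare against is the one recorded in this ticket (154410FA336C).

```shell
# Hypothetical helper: reads `udevadm info --query=property` output on stdin
# and prints the value of ID_SERIAL_SHORT.
serial_short() {
  awk -F= '$1 == "ID_SERIAL_SHORT" { print $2 }'
}

# Hypothetical helper: succeeds if the serial read from stdin is non-empty and
# differs from the old serial, i.e. the disk really was swapped.
disk_was_swapped() {
  local old_serial="$1" current
  current="$(serial_short)"
  [ -n "${current}" ] && [ "${current}" != "${old_serial}" ]
}
```

For example: <code>udevadm info --query=property --name sda | disk_was_swapped 154410FA336C && echo "sda is a different disk"</code>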

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the remaining old disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub-install</code>

<pre>
grub-install /dev/sda
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
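If you want the loop to stop on its own when the rebuild finishes, a completion check can be sketched like this. The name <code>raid_sync_done</code> is hypothetical; it treats any "recovery" or "resync" line, or any missing member, as "not done yet".

```shell
# Hypothetical helper: succeeds only when the given mdstat text shows no
# rebuild in progress and no missing array member.
raid_sync_done() {
  local mdstat_file="${1:-/proc/mdstat}"
  # a rebuild in progress (or queued) shows a "recovery" or "resync" line
  if grep -Eq 'recovery|resync' "${mdstat_file}"; then
    return 1
  fi
  # a member that is still missing shows "_" in the status field, e.g. [_U]
  if grep -Eq '\[[U_]*_[U_]*\]' "${mdstat_file}"; then
    return 1
  fi
  return 0
}
```

For example: <code>until raid_sync_done; do date; cat /proc/mdstat; echo; sleep 300; done</code>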

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as-expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T14:21:15Z

Maltfield: hetzner gave us a very used drive :(

=Status=

==2025-04-30 13:20 UTC==

We're still waiting on hetzner.

Hetzner replaced the drive with one that has already been used for 18,623 hours. SMART reports that it has only 32% of its life left.

<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18623
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 032 032 000 Old_age Always - 1030
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 23/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 032 032 001 Old_age Offline - 68
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 96994281182
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3059820027
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31429771271
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2467
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

[root@opensourceecology ~]# date -u
Wed Apr 30 13:23:39 UTC 2025
[root@opensourceecology ~]#
</pre>

In the remote-hands support request, I was very clear that they should replace it with a new drive (with <1,000 hours of use). We're paying for that.

I asked them to insert an actually new drive with <1,000 hours of use.

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C, an SSD) on [[hetzner2]].

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in MariaDB. Because hetzner2 runs EOL CentOS, we can't update MariaDB. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks would fail within 24 hours and should be replaced immediately:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
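The "life left" figure can be pulled out of an attribute table like the one above with a short filter. This is only a sketch: the function name <code>lifetime_remaining</code> is made up for illustration, and it assumes the drive exposes a <code>Percent_Lifetime_Remain</code> attribute whose normalized VALUE column (the fourth field) is the percentage remaining, as the Crucial disks in this ticket appear to do. Other vendors use different attribute names.

```shell
# Hypothetical helper: print the normalized VALUE of the
# Percent_Lifetime_Remain attribute from `smartctl -A` output on stdin.
# Assumption: VALUE (field 4) is "percent of life remaining" on this drive.
lifetime_remaining() {
    # "$4 + 0" strips the zero padding, so "004" becomes 4
    awk '$2 == "Percent_Lifetime_Remain" { print $4 + 0 }'
}

# Example (needs root and smartmontools):
#   smartctl -A /dev/sda | lifetime_remaining
```

If the attribute is missing, the function prints nothing, so an empty result should be treated as "unknown" rather than as zero.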

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

==Systems Impacted==

This change impacts [[hetzner2]] and every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
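Reading <code>/proc/mdstat</code> by eye works, but the check can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name <code>degraded_arrays</code> is hypothetical, and it assumes the usual mdstat layout in which each array's status map (for example <code>[UU]</code> or <code>[_U]</code>) appears on a line after the <code>mdN :</code> line.

```shell
# Hypothetical helper: read /proc/mdstat-formatted text on stdin and print
# the name of every md array whose status map contains "_" (missing member).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        /\[[U_]+\]/ && name != "" {            # line carrying the status map
            match($0, /\[[U_]+\]/)
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
            name = ""
        }
    '
}

# Example:
#   degraded_arrays < /proc/mdstat
```

No output means every array reported all of its members as up.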

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
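The "sane size" part of that check can be made explicit. This is a sketch under assumptions: <code>backup_ok</code> is a hypothetical helper, the minimum size is a threshold you pick yourself, and the input is assumed to be <code>rclone ls</code> output (size in bytes, then the file name).

```shell
# Hypothetical helper: succeed only if the listing on stdin contains a file
# whose name includes the given date stamp and whose size is at least
# MIN_BYTES.
# usage: ... | backup_ok YYYYMMDD MIN_BYTES
backup_ok() {
    awk -v day="$1" -v min="$2" '
        index($2, day) > 0 && $1 + 0 >= min + 0 { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example (the 1 GB threshold is an arbitrary illustration):
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_ok "$(date "+%Y%m%d")" 1000000000 \
#     && echo "backup looks sane" || echo "backup missing or too small"
```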

At some point during Germany's morning (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log in to the Hetzner web UI (WUI) https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
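The serial comparison can also be scripted so that nothing is touched until the swap is confirmed. A minimal sketch, assuming <code>udevadm</code> prints an <code>ID_SERIAL_SHORT=</code> property as it did for the old disk; <code>serial_changed</code> is a hypothetical name.

```shell
# Hypothetical helper: read `udevadm info --query=property` output on stdin
# and succeed only if ID_SERIAL_SHORT is present and differs from OLD_SERIAL.
# usage: ... | serial_changed OLD_SERIAL
serial_changed() (
    old="$1"
    new=$(sed -n 's/^ID_SERIAL_SHORT=//p' | head -n 1)
    [ -n "$new" ] && [ "$new" != "$old" ]
)

# Example, using the serial of the disk being replaced in this ticket:
#   udevadm info --query=property --name sda | serial_changed 154410FA336C \
#     && echo "sda is a different disk" || echo "sda unchanged or not detected"
```

The function body runs in a subshell, so its variables do not leak into the calling shell.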

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the remaining disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>
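After copying, it may be worth confirming that both disks now have the same layout. The sketch below is an assumption-laden illustration: <code>normalize_parttable</code> and <code>same_layout</code> are hypothetical helpers, and they assume <code>sfdisk -d</code> dumps in which the only disk-specific parts are comment lines, <code>label-id:</code> and <code>device:</code> header lines, and the <code>/dev/sdX</code> prefix of each partition line.

```shell
# Hypothetical helper: strip the disk-specific parts of an `sfdisk -d` dump
# (comments, label-id/device headers, the /dev/sdX prefix) from stdin.
normalize_parttable() {
    sed -E -e '/^(#|label-id:|device:)/d' -e 's#^/dev/[a-z]+##'
}

# Hypothetical helper: succeed if two dump files describe the same layout.
# usage: same_layout DUMP_A DUMP_B
same_layout() {
    [ "$(normalize_parttable < "$1")" = "$(normalize_parttable < "$2")" ]
}

# Example (sda_after.dump is an illustrative file name):
#   sfdisk -d /dev/sda > "${chg_dir}/sda_after.dump"
#   same_layout "${chg_dir}/sdb_parttable_mbr.bak" "${chg_dir}/sda_after.dump" \
#     && echo "layouts match" || echo "layouts differ"
```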

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub-install</code>

<pre>
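# note: on CentOS 7 the GRUB2 installer is usually named grub2-install; use whichever exists on the host
grub-install /dev/sda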
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
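Instead of watching the loop above by hand, the wait can end automatically once no array is rebuilding. This is a sketch, not the procedure that was actually run: <code>raid_syncing</code> and <code>raid_progress</code> are hypothetical helpers, and they assume that <code>/proc/mdstat</code> marks an active or queued rebuild with the word <code>recovery</code> or <code>resync</code>.

```shell
# Hypothetical helper: succeed while the mdstat text on stdin still shows a
# rebuild in progress or queued ("recovery" or "resync").
raid_syncing() {
    grep -Eq 'recovery|resync'
}

# Hypothetical helper: print the progress percentage of each active rebuild.
raid_progress() {
    sed -En 's/.*(recovery|resync) *= *([0-9.]+)%.*/\2/p'
}

# Example: poll every 5 minutes until the rebuild is finished
#   while raid_syncing < /proc/mdstat; do
#       date; raid_progress < /proc/mdstat; sleep 300
#   done
#   echo "RAID sync complete"
```

Arrays that are only queued (shown as <code>resync=DELAYED</code>) keep the loop waiting but print no percentage.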

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T11:45:02Z

Maltfield: status update: removed sda from RAID and submitted remote hands request to swap disk

=Status=

==2025-04-30 11:44 UTC==

# I confirmed that the RAID is currently healthy
# and today's backup (from a few hours ago) is sane and uploaded
<pre>
[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20133744108 daily_hetzner3_20250430_080904.tar.gpg
[root@opensourceecology ~]#
</pre>
# I confirmed again that /dev/sdb is PASSED and /dev/sda is FAILED
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# I confirmed that our "new" (used) /dev/sdb (replaced last week) still has 4% of its life left (no change from last week)
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52223
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1452
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 29
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 049 000 Old_age Always - 36 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 601634812550
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18904241237
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11849811867
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78658
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3454
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 56
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 046 000 Old_age Always - 38 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 408221767008
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12873452848
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26389101858

[root@opensourceecology ~]#
</pre>
# and I confirmed again the serial of the disk we want to replace matches the one listed in this CHG ticket
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# ok, I'm removing sda from the raid
<pre>
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:06 UTC 2025
[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2]
33521664 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[2]
209984640 blocks super 1.2 [2/1] [_U]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2]
523712 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@opensourceecology ~]#
[root@opensourceecology ~]# date -u
Wed Apr 30 11:38:58 UTC 2025
[root@opensourceecology ~]#
</pre>
# and I submitted the request for support to swap the disk
<pre>
SMART says disk is FAILED and needs to be replaced asap.

I've removed /dev/sda (Crucial_CT250MX200SSD1_154410FA336C) from the RAID, and it is now ready to be replaced with a new disk (with <1,000 hours of operation). Please replace the disk asap.
</pre>

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two SSDs (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]].

On 2025-04-17, we had a database corruption event that took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs an EOL version of CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART log output said that both of our redundant disks would fail within 24 hours and should be replaced immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

==Systems Impacted==

This change impacts [[hetzner2]] and every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>

At some point during Germany's morning (and very shortly after our daily backups complete), execute these commands to remove the drive from our RAID array

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log in to the Hetzner web UI (WUI) https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the remaining disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub-install</code>

<pre>
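# note: on CentOS 7 the GRUB2 installer is usually named grub2-install; use whichever exists on the host
grub-install /dev/sda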
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

CHG-2025-04-30 replace hetzner2 sda

2025-04-30T11:30:36Z

Maltfield: change begun

=Status=

==2025-04-30 11:29 UTC==

Starting Change

==2025-04-24 17:56 UTC==

Marcin approved the start time of this CHG

<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-24 17:37 UTC==

Marcin approved purchasing a new disk for this replacement

<pre>
Yes.

On Thu, Apr 24, 2025, 9:37 AM Michael Altfield <me4eapr@disroot.org> wrote:

> Hey Marcin,
>
> Would you authorize spending €41.18 on a new disk for your server?
>
> Update: Your websites are back online. The RAID is still syncing.
>
> I was a bit disappointed to learn that hetzner replaced a disk with 0%
> "life left" with a disk with 4% "life left". That's what we get for
> choosing the free disk replacement..
>
> The "free" option said it would replace it with a "Replacement drive
> nearly new or used and tested; depends on what is in stock." Obviously
> they didn't give us a "nearly new" drive..
>
> Your other disk is also at 0% "life left". I was already planning on
> replacing that one next week too, but I would recommend that you pay for
> a new drive for this one. The cost listed on the website is €41.18.
>
> Do you authorize me selecting €41.18 for the replacement of /dev/sda on
> hetzner2?
>
>
> Thank you,
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>

==2025-04-18 22:15 UTC==

Initial Ticket draft created on wiki (WIP)

=Change Info=

==Scheduled Time==

This change will take place on 2025-04-30 11:00 UTC

* = 2025-04-30 06:00 Kansas City, US
* = 2025-04-30 06:00 Guayaquil, EC

https://www.timeanddate.com/worldclock/converter.html?iso=20250430T110000&p1=405&p2=1440&p3=93

==Purpose==

This change will physically replace one of our two disks (/dev/sda = Crucial_CT250MX200SSD1_154410FA336C) on [[hetzner2]].

On 2025-04-17, a database corruption event took down all of the websites on hetzner2. The database wouldn't start because it was corrupt, and it could not recover from the corruption due to a bug in mariadb. Because hetzner2 runs EOL CentOS, we can't update mariadb. While I don't think the corruption was caused by disk failure, the SMART output said that both of our two redundant disks were going to fail within 24 hours and that we should replace them immediately.

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
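To compare the wear indicator of both disks at a glance, the attribute tables above can be reduced to just SMART attribute 202. This is a minimal sketch and was not run as part of this change; the function name <code>lifetime_remaining</code> is made up for illustration, and it assumes the column layout shown in the <code>smartctl -A</code> output above.

```shell
# Read `smartctl -A` output on stdin and print
# "<label> <normalized VALUE> <WHEN_FAILED>" for SMART attribute 202.
# Columns assumed: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
lifetime_remaining() {
    awk -v label="$1" '$1 == 202 { print label, $4 + 0, $9 }'
}

# Hypothetical usage on the server (requires root and smartmontools):
#   smartctl -A /dev/sda | lifetime_remaining sda
#   smartctl -A /dev/sdb | lifetime_remaining sdb
```

Fed the attribute 202 line shown above for /dev/sda, this prints <code>sda 0 FAILING_NOW</code>.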

==Points of Contact==

Change being performed by: [[User:Maltfield|Michael Altfield]]

Service owners: [[User:Catarina|Catarina Mota]] & [[User:Marcin|Marcin Jakubowski]]

==Time Length==

We expect at most 5 hours of downtime.

Re-partitioning the new disk, adding it to the raid, and updating grub should take less than 2 hours.

Rebuilding the RAID1 mirror of the two disks might take a day or more. During this time we'll be vulnerable as we'll only have one disk (no redundancy). This is worse because both of the disks currently say they're going to fail within 24 hours.

==Systems Impacted==

This change impacts [[hetzner2]] and every service/website that runs on it will go down.

==Staging Test==

n/a

=Change Steps=

First, before we do anything, get the status of the RAID

<pre>
# verify RAID status
cat /proc/mdstat
</pre>
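If reading <code>/proc/mdstat</code> by eye feels error-prone, the check can be scripted. The sketch below is an illustration only (the function name <code>degraded_arrays</code> is hypothetical); it assumes the usual mdstat layout, where the member status such as <code>[UU]</code> or <code>[_U]</code> is the last field of the "blocks" line.

```shell
# Read mdstat-formatted text on stdin and print the name of every md array
# whose member-status field contains "_" (a missing or failed member).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }
    '
}

# Hypothetical usage on the server:
#   degraded_arrays < /proc/mdstat
# No output means that every array has all of its members.
```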

Before removing the second redundant disk from the RAID, confirm that today's backup was successfully uploaded to Backblaze

<pre>
# verify today's backup is present and a sane size
source /root/backups/backup.settings
${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
</pre>
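The <code>grep</code> above only shows whether a file with today's date exists; judging whether its size is "sane" is left to the reader. A minimal sketch of automating that judgement follows. It assumes the <code>rclone ls</code> output format (size in bytes, then path); the function name <code>backup_present</code> and the size threshold in the usage line are hypothetical and should be adjusted to our real backup sizes.

```shell
# Succeed if stdin (in `rclone ls` format: "<size-bytes> <path>") lists a file
# whose path contains the date stamp $1 and whose size is at least $2 bytes.
backup_present() {
    awk -v stamp="$1" -v min="$2" '
        {
            size = $1 + 0
            path = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", path)   # drop the size column
            if (index(path, stamp) && size >= min + 0) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Hypothetical usage (the 1 GB threshold is an assumption, not a known value):
#   ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | backup_present "$(date '+%Y%m%d')" 1000000000 \
#     && echo "backup looks ok" || echo "backup missing or too small"
```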

During the morning in Germany (and shortly after our daily backups complete), execute these commands to mark the old drive's partitions as failed and remove them from our RAID arrays

<pre>
# remove all sda partitions from our software RAID
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
mdadm /dev/md2 -r /dev/sda3
</pre>

Log into the Hetzner WUI https://robot.your-server.de/

Go to the servers page https://robot.hetzner.com/server

# Click the "Support" tab under hetzner2
# Click "Technical"
# Select "Server - Disk Failure"
# Select "Specification of the defective HDD/SSD" and enter "Crucial_CT250MX200SSD1_154410FA336C"
# Select "At cost"
# Select "Swap while the system is running"
# Select "As soon as possible"
# In the "Entire SMART log" textarea, enter this:

<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#

</pre>
# Click "Send request"

Wait until hetzner confirms that the replacement drive has been inserted

<pre>
# monitor for I/O events in kernel logs
dmesg -w
</pre>

After the replacement drive has been inserted, get some info about it

<pre>
# get disks and partition info
lsblk

# get serial numbers of both disks; confirm sdb is the same and sda has changed
udevadm info --query=property --name sda | grep ID_SER
udevadm info --query=property --name sdb | grep ID_SER

# verify RAID status
cat /proc/mdstat
</pre>
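The serial number comparison can also be done mechanically, which reduces the risk of partitioning the wrong disk. This is a sketch under the assumption that udev reports an <code>ID_SERIAL</code> property for the disk; the function names <code>disk_serial</code> and <code>serial_changed</code> are made up for illustration.

```shell
# Print the value of ID_SERIAL from `udevadm info --query=property` output (stdin).
disk_serial() {
    sed -n 's/^ID_SERIAL=//p'
}

# Succeed only if the serial read from stdin is non-empty and differs from
# the old serial given as $1.
serial_changed() {
    new="$(disk_serial)"
    [ -n "$new" ] && [ "$new" != "$1" ]
}

# Hypothetical usage on the server, with the old sda serial from this ticket:
#   udevadm info --query=property --name sda \
#     | serial_changed "Crucial_CT250MX200SSD1_154410FA336C" \
#     && echo "sda is a different disk" || echo "sda is unchanged; STOP"
```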

Before we modify the partition tables of any of our drives, let's make backups

<pre>
# create a temp dir for this change
stamp=$(date "+%Y%m%d_%H%M%S")
chg_dir=/var/tmp/chg.$stamp
mkdir $chg_dir
chown root:root $chg_dir
chmod 0700 $chg_dir
pushd $chg_dir

# make backups of both disks' partition tables
sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak

# verify
du -sh ${chg_dir}/*
</pre>

Copy the partition table from our old disk to our new disk

<pre>
# dump the partition table of the surviving disk (sdb) and write it to the new disk (sda)
sfdisk -d /dev/sdb | sfdisk /dev/sda
</pre>
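Before adding the new partitions to the RAID, it may be worth confirming that both disks now have the same layout. The sketch below strips the device name from each <code>sfdisk -d</code> partition line so the two dumps can be compared directly. The function name <code>partition_layout</code> is hypothetical, and the pattern assumes the <code>/dev/sdXN : start= ..., size= ..., ...</code> line format that <code>sfdisk -d</code> prints.

```shell
# Reduce `sfdisk -d` output (stdin) to one "<partition-number> <start> <size>"
# line per partition, dropping the device name so two disks can be compared.
partition_layout() {
    sed -n 's|^/dev/[a-z]*\([0-9][0-9]*\) *: *start= *\([0-9][0-9]*\), *size= *\([0-9][0-9]*\),.*|\1 \2 \3|p'
}

# Hypothetical usage on the server:
#   [ "$(sfdisk -d /dev/sdb | partition_layout)" = "$(sfdisk -d /dev/sda | partition_layout)" ] \
#     && echo "layouts match" || echo "layouts differ"
```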

Tell the kernel to re-read the partition table

<pre>
# kernel reload of the new partition table
blockdev --rereadpt /dev/sda
</pre>

Now add the new drive to the RAID array

<pre>
# add all of the new disk's partitions to the software RAID
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
</pre>

Copy our grub configuration and files onto the new disk using <code>grub-install</code>

<pre>
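# note: hetzner2 had no grub-install binary during the sdb replacement; only
# grub2-install was present (see Maltfield_Log/2025_Q2)
grub-install /dev/sda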
</pre>

Execute this command to monitor the status of the RAID replication

<pre>
while true; do date; cat /proc/mdstat; echo; sleep 300; done
</pre>
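The loop above runs until it is interrupted by hand. A variant that stops on its own once the rebuild is done could look like the sketch below. It assumes that <code>/proc/mdstat</code> shows a <code>recovery = N%</code> or <code>resync = N%</code> line (or <code>resync=DELAYED</code>) while any array is still syncing; the function name <code>sync_in_progress</code> is made up for illustration.

```shell
# Succeed if the mdstat-formatted text on stdin shows a rebuild that is
# still running or queued (a "recovery =" or "resync =" marker).
sync_in_progress() {
    grep -Eq '(recovery|resync) *='
}

# Hypothetical usage on the server: poll every 5 minutes until the sync ends.
#   while sync_in_progress < /proc/mdstat; do
#       date; cat /proc/mdstat; echo; sleep 300
#   done
#   echo "no sync in progress"; cat /proc/mdstat
```

Even after the loop exits, check that every array shows <code>[UU]</code> before testing the reboot.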

You may need to '''wait several hours''' (hopefully less than 1 day) before proceeding.

Once the sync is finally complete, test a reboot to make sure that grub is still functioning as expected

<pre>
sudo reboot
</pre>

==Revert Steps==

Not sure if this is even possible, but we would have to contact hetzner and tell them to physically remove the new drive and re-install the old one that they just physically removed.

=See Also=

# [[Maltfield_Log/2025_Q2]]
# [[CHG-2025-04-24_replace_hetzner2_sdb]]
# [[:Category: CHGs|List of other CHG "tickets"]]

=External Links=
* https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/

[[Category: CHGs]]

Maltfield Log/2025 Q2

2025-04-27T22:04:56Z

Maltfield:

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admins' work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Sat Apr 26, 2025=
# Marcin authorized me to add Tom to our ops google groups mailing list and to give him access to our shared ose keepass
<pre>
Yes.

On Fri, Apr 25, 2025, 12:43 PM Michael Altfield <REDACTED@disroot.org> wrote:

> (re-sending without encryption)
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
>
> On 4/25/25 12:41, Michael Altfield wrote:
>> Hey Marcin,
>>
>> Do you authorize:
>>
>> 1. Giving Tom access to the shared OSE keepass file
>>
>> 2. Adding Tom to the ops mailing list (this would allow him to password
>> reset many of our important accounts)
>>
>> Please let me know if you authorize the above.
>>
>> Thank you,
</pre>
# Tom sent me his gpg public key, which I can use to add him to the wazuh emails
<pre>
user@ose:~$ gpg
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: Go ahead and type your message ...
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----

pub rsa4096 2025-04-26 [SC]
13300901348A985115679165FB137A633FD1EB4C
uid Tom Griffing (OSE PGP Key 4-25-2025) <REDACTED@tutanota.com>
sub rsa4096 2025-04-26 [E]
user@ose:~$
</pre>
# I added Tom to the wazuh recipients, per https://wiki.opensourceecology.org/wiki/Wazuh
<pre>
mkdir -p /var/tmp/gpg
pushd /var/tmp/gpg
# write multi-line to file for documentation copy & paste
cat << EOF > /var/tmp/gpg/tom.pubkey.asc
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----
EOF
gpg --homedir /var/ossec/.gnupg --import /var/tmp/gpg/tom.pubkey.asc
popd

# add Tom's email (that matches an email on a UID of his key above) to the space-delimited "recipients" variable
vim /var/ossec/sent_encrypted_alarm.settings
</pre>
# and I sent him an email asking him to confirm that it's working
<pre>
Hey Tom,

Can you please confirm that you're now receiving alerts from wazuh?

Wazuh is our HIDS (Host-Based Intrusion Detection System). It's a fork of the HIDS and FIM (File Integrity Monitor) OSSEC. Because it sometimes sends sensitive information (eg diffs of config files with passwords), it's important that we encrypt its email notifications end-to-end with PGP.

And because someone who compromises the server could "clean up" after themselves, these (off-server) alerts are critical to post-compromise investigations.

For more info, see:

* https://wiki.opensourceecology.org/wiki/Wazuh
* https://en.wikipedia.org/wiki/OSSEC
* https://documentation.wazuh.com/current/getting-started/index.html

Out-of-the-box, Wazuh has a ton of features, but probably where we use it the most is its ingestion of apache's mod_security WAF and its tie-in to Wazuh's Active Response. If an IP is found doing something bad (eg multiple consecutive 403 responses, such as a brute-force attack on wordpress [or ssh]), then the IP will get temp blocked by the firewall for 10 minutes. If it does it again shortly after the ban is lifted, it'll be banned for 12 hours. If again, 1 day. Then 2 days. Then 4 days. And the max ban for 5x repeat offenses is 8 days

* https://github.com/OpenSourceEcology/ansible/blob/master/hetzner3/roles/maltfield.wazuh/templates/ossec.conf.j2#L256-L271

It also has rootkit detection, and lots of other useful alerts that "just work" out of the box.

Please confirm that you're now receiving encrypted wazuh alerts.

Thank you,
</pre>
# I tried to add Tom to our ops google groups email list, but it said I wasn't allowed to add members outside of our google workspace
<pre>
An error occurred
1 user is outside of your organization. Based on your group or organization settings, you can only add organization users to this group. Contact your group owner or domain administrator for help.
</pre>
# I checked our user's group. it appears that Tom doesn't have an account @opensourceecology.org in gsuite
# I found the setting to change that here https://admin.google.com/ac/managedsettings/864450622151/GROUPS_SHARING_SETTINGS_TAB
## https://support.google.com/a/thread/63692725/
## https://support.google.com/a/answer/167097
# I checked the box that said "Group owners can allow external members"
## curiously the subline said "Organization admins can always add external members" – but I'm a damn org admin, and I couldn't add him :/
# I tried to add him again, but I got the same error
# this time I went to the group settings https://groups.google.com/a/opensourceecology.org/g/REDACTED/settings
# I found the "allow external members" and changed it from "off" to "on" and clicked "save changes"
## this wasn't possible before. So first I had to change the workspace-wide settings to allow me to change the groups-specific settings. now it's changed.
# this time it worked.
# I sent an email to our ops google group, asking Tom to reply if he saw it
# ...
# I checked-in on hetzner2 to make sure it rebooted this morning
# looks like the cron is set to reboot at 10:40 UTC every day, and – indeed – uptime says it's been online for a bit less than 13 hours. And its last boot time was today at 10:41:25
<pre>
[root@opensourceecology ~]# uptime
23:30:25 up 12:49, 7 users, load average: 1.02, 0.98, 0.74
[root@opensourceecology ~]# journalctl | head
-- Logs begin at Sat 2025-04-26 10:41:25 UTC, end at Sat 2025-04-26 23:30:26 UTC. --
Apr 26 10:41:25 localhost systemd-journal[129]: Runtime journal is using 8.0M (max allowed 3.1G, trying to leave 4.0G free of 31.2G available → current limit 3.1G).
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuset
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpu
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuacct
Apr 26 10:41:25 localhost kernel: Linux version 3.10.0-1160.119.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 4 14:43:51 UTC 2024
Apr 26 10:41:25 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
Apr 26 10:41:25 localhost kernel: e820: BIOS-provided physical RAM map:
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009c7ff] usable
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x000000000009c800-0x000000000009ffff] reserved
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /etc/cron.d/reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology ~]# date -u
Sat Apr 26 23:31:32 UTC 2025
[root@opensourceecology ~]#
</pre>
# so it looks like we'll have ~2 minutes of downtime every day in the very early morning in the US. I can live with that.
# and grub clearly is fixed
# oh, also the RAID looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I asked Tom for his GitHub account profile username, so I can grant him write access to our OSE ansible repo
# I updated Tom's new ssh key to his authorized_keys file on hetzner2
# I sent Tom an email asking to confirm his access to hetzner2
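The escalating ban schedule described in the email to Tom above (10 minutes, then 12 hours, then 1, 2, 4 and at most 8 days) can be written down as a small lookup. This is only an illustration of the schedule as stated in that email, not Wazuh's actual configuration or code; the function name <code>ban_minutes</code> is hypothetical.

```shell
# Print the ban length in minutes for an IP, given how many times it has
# already been banned before ($1 = number of repeat offenses).
ban_minutes() {
    case "$1" in
        0) echo 10 ;;       # first offense: 10 minutes
        1) echo 720 ;;      # 12 hours
        2) echo 1440 ;;     # 1 day
        3) echo 2880 ;;     # 2 days
        4) echo 5760 ;;     # 4 days
        *) echo 11520 ;;    # 5 or more repeats: capped at 8 days
    esac
}

# Example: ban_minutes 3   prints 2880
```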

=Fri Apr 25, 2025=
# I woke up this morning and discovered the wiki was offline
# I tried to ssh into the server; it's not responding
# I figured I'd log into the hetzner wui, but – uhh – the credentials are in keepass and live on the server
# I mitigated this by giving Marcin a copy of the keepass file on his veracrypt drive, but he has since changed the password (a month or two ago), and we don't have a new local copy
# I sent an email to Marcin asking him to login to hetzner wui and boot hetzner2. if it doesn't come-up, then I'll have to get the password from him so I can load it in the wui from a rescue disk
# oh, I did find the new hetzner password in my personal keepass
# I logged-in, and I found the server was listed as being on. But I can't ping it. I gave it an "automatic hardware reset" from the wui
# I'll give it a few minutes before trying the rescue system
# their rescue systems are much nicer for their cloud product than their dedicated server product
# it looks like I have two options
## rescue boot mode: where I'm given ssh access
## vnc
# the problem with the rescue boot is that – if this is a grub issue – I wouldn't be able to "see" the error
# I enabled VNC and gave the server a reboot
# I was able to connect via vnc, but it was the damn installation wizard for almalinux. I quit the installation, and the vnc session died.
# damn, I guess vnc won't let me see the boot process, after all
# instead I tried the "rescue system"
# that didn't work; I can't access ssh on either of the IP addresses
# the docs say to activate the rescue system and then reboot it; that's what I did https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system/
# this time I fully shut down the server, and then I enabled the rescue system (while it's off)
# I went back to the Reset tab, and it's still off. So I booted it
# somehow I was able to login from my ose vm using my personal ssh key, but with user root
<pre>
user@ose:~$ ssh -v root@138.201.84.223
OpenSSH_9.2p1 Debian-2+deb12u5, OpenSSL 3.0.15 3 Sep 2024
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 138.201.84.223 [138.201.84.223] port 22.
debug1: Connection established.
...
Linux rescue 6.12.19 #1 SMP Fri Mar 14 05:34:52 UTC 2025 x86_64

--------------------

Welcome to the Hetzner Rescue System.

This Rescue System is based on Debian GNU/Linux 12 (bookworm) with a custom kernel.
You can install software like you would in a normal system.

To install a new operating system from one of our prebuilt images, run 'installimage' and follow the instructions.

Important note: Any data that was not written to the disks will be lost during a reboot.

For additional information, check the following resources:
Rescue System: https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system
Installimage: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installimage
Install custom software: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installing-custom-images
other articles: https://docs.hetzner.com/robot

--------------------

Rescue System (via Legacy/CSM) up since 2025-04-25 17:24 +02:00

Hardware data:

CPU1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Cores 8)
Memory: 64153 MB (Non-ECC)
Disk /dev/sda: 250 GB (=> 232 GiB)
Disk /dev/sdb: 512 GB (=> 476 GiB)
Total capacity 709 GiB with 2 Disks

Network data:
eth0 LINK: yes
MAC: 90:1b:0e:94:07:c4
IP: 138.201.84.223
IPv6: 2a01:4f8:172:209e::2/64
Intel(R) PRO/1000 Network Driver

root@rescue ~ #
</pre>
# I was able to mount the root drive
<pre>
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@rescue ~ # mount /dev/md2 /mnt
root@rescue ~ # ls /mnt
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # ls /mnt/home
b2user crupp hart lberezhny marcin stagingsync wp
cmota Flipo jthomas maltfield not-apache tgriffing
root@rescue ~ #
</pre>
# I don't know what the point of this is; I can't fix it if I can't watch it boot and see what's breaking
# ok, at the bottom of the docs, hetzner lists another option = vKVM Rescue System https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/
# it specifically says that's for debugging boot issues
# last thing before I try that: I downloaded a local copy of the keepass files from hetzner2
<pre>
user@ose:~/tmp/hetzner2$ rsync -av --progress root@138.201.84.223:/mnt/etc/keepass ./etc-keepass-20250525
receiving incremental file list
created directory ./etc-keepass-20250525
keepass/
keepass/passwords.kdbx
46,142 100% 44.00MB/s 0:00:00 (xfr#1, to-chk=6/8)
keepass/passwords.kdbx.20170728.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#2, to-chk=5/8)
keepass/passwords.kdbx.20170804.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#3, to-chk=4/8)
keepass/passwords.kdbx.20190820.bak
33,726 100% 143.20kB/s 0:00:00 (xfr#4, to-chk=3/8)
keepass/passwords.kdbx.20190909.bak
34,238 100% 71.75kB/s 0:00:00 (xfr#5, to-chk=2/8)
keepass/passwords.kdbx.20250316.bak
45,406 100% 94.55kB/s 0:00:00 (xfr#6, to-chk=1/8)
keepass/passwords.kdbxs.20180525.bak
27,102 100% 56.31kB/s 0:00:00 (xfr#7, to-chk=0/8)

sent 161 bytes received 196,407 bytes 35,739.64 bytes/sec
total size is 195,794 speedup is 1.00
user@ose:~/tmp/hetzner2$

user@ose:~/tmp/hetzner2$ du -sh etc-keepass-20250525/keepass/*
48K etc-keepass-20250525/keepass/passwords.kdbx
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170728.bak
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170804.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190820.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190909.bak
48K etc-keepass-20250525/keepass/passwords.kdbx.20250316.bak
28K etc-keepass-20250525/keepass/passwords.kdbxs.20180525.bak
user@ose:~/tmp/hetzner2$
</pre>
# so this time was the same as the rescue system, except I chose "vKVM" instead of "Linux" in the "Operating System" dropdown
# strange, it gave me an error
<pre>
Public key authentication is not available for the selected operating system.
</pre>
# I unselected my ssh key, and chose "no key" instead
# it gave me a URL and a password. I booted the server, but the URL didn't load ("Unable to connect" error)
# ok, it took a few minutes and had a self-signed cert
# I bypassed the cert error, and entered the username and password into the basic auth popup. It failed! Could I really have been MITM'd?
# I immediately shut down the server from the wui, and I tried again.
# this time I was able to login – both from ssh and in the wui.
# as soon as it opened, I saw the error
<pre>
No more network devices

Booting from Hard Disk...
.
error: symbol 'grub_calloc' not found.
Entering rescue mode...
grub rescue>
</pre>
# I wonder if this is grub or grub2. I didn't have a "grub-install" binary before. I assumed it was an error in the hetzner docs, so I ran "grub2-install" instead, which said it worked (there was a warning that the docs said was safe to ignore)
# curiously, the opposite is true for the ssh session in vkvm: I have grub-install but not grub2-install
<pre>
root@vKVM-rescue ~ # which grub-install
/usr/sbin/grub-install
root@vKVM-rescue ~ #
root@vKVM-rescue ~ # which grub2-install
root@vKVM-rescue ~ #
</pre>
# here's the docs in question https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# I don't want to fuck with the grub without first taking a backup of these disks. But, uh, it looks like I can't access the RAID from inside this vkvm setup
# yeah, that's one of the limitations listed for VKVM https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/#raid-controllers
<pre>
Configured units are passed through as SCSI devices to the VM. However it is not possible to access the controller. Please use the regular Hetzner Rescue System for this purpose.
</pre>
# I shutdown VKVM and booted it into the regular rescue mode
# it took a few minutes to get back into the old rescue system, but here I can use the raid
<pre>
root@rescue ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 3.4G 1 loop
sda 8:0 0 476.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
root@rescue ~ # mkdir /mnt/md1
root@rescue ~ # mkdir /mnt/md2
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # mount /dev/md2 /mnt/md2
root@rescue ~ #
</pre>
# I created a dir for these backups
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ #

root@rescue ~ # mkdir /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chown root:root /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chmod 0700 /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ #
</pre>
# first I made a backup from the raid
<pre>
root@rescue ~ # rsync -av --progress /mnt/md1 /mnt/md2/var/tmp/20250425-grub-fail/md1.$(date "+%Y%m%d_%H%M%S")
...
md1/grub2/locale/zh_TW.mo
30,882 100% 31.38kB/s 0:00:00 (xfr#345, to-chk=0/355)
md1/lost+found/

sent 399,450,301 bytes received 6,709 bytes 159,782,804.00 bytes/sec
total size is 399,330,989 speedup is 1.00
root@rescue ~ #
</pre>
# then I figured I'd make a backup of the two disk partitions directly, but I couldn't even mount them
<pre>
root@rescue ~ # umount /mnt/md1
root@rescue ~ # mkdir /mnt/sda2
root@rescue ~ # mkdir /mnt/sdb2
root@rescue ~ # mount /dev/sda2 /mnt/sda2
mount: /mnt/sda2: unknown filesystem type 'linux_raid_member'.
dmesg(1) may have more information after failed mount system call.
root@rescue ~ #
</pre>
# I tried this command (from the docs), which I skipped before because it said that the next command (grub-install) was enough; sure enough, it didn't work https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
<pre>
root@rescue ~ # grub-mkdevicemap -n
grub-mkdevicemap: error: cannot open /boot/grub/device.map.
root@rescue ~ #
</pre>
# I investigated this before, and I thought I decided we're using grub2, not grub1
<pre>
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # ls /mnt/md1/
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue ~ #
</pre>
# oh, shit, even the grub-install command is v2 https://askubuntu.com/questions/107486/how-to-know-the-version-of-grub
<pre>
root@rescue ~ # grub-install --version
grub-install (GRUB) 2.06-13+deb12u1
root@rescue ~ #
</pre>
# ok, this indicates we're not using lilo https://askubuntu.com/questions/24459/how-do-i-find-out-which-boot-loader-i-have
<pre>
root@rescue ~ # ls /mnt/md2/etc/ | grep lilo
root@rescue ~ #
</pre>
# we can dd straight from the disk to read the MBR. And, yeah, it appears we are using grub via MBR .. and this info is stored on the disks, not the raid
<pre>
root@rescue ~ # dd if=/dev/md1 bs=512 count=1 2>/dev/null | strings
root@rescue ~ #

root@rescue ~ # dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
214fb5736d1e5ad63e515dc2fffe44bd928cd8dab2c019dc11fb9fcaef5ea90dbf51f1ac507ab1cfbbe74ff
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
DA/jjF
root@rescue ~ #

root@rescue ~ # dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
root@rescue ~ #
</pre>
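# for reference, a minimal sketch of the MBR check above as a reusable function. The name <code>has_grub_marker</code> is hypothetical. It only confirms that the "GRUB" string is present in the first sector; it says nothing about whether the installed boot code actually works (the <code>grub_calloc</code> error above shows it can be present and still broken)

```shell
#!/usr/bin/env bash
# Report whether the first 512-byte sector of a disk (or image file) contains
# the "GRUB" marker string seen in the strings output above.
# has_grub_marker is a hypothetical helper name; reading a real disk such as
# /dev/sda requires root.
has_grub_marker() {
    local device="$1"
    # -a treats the binary sector as text, -q only sets the exit status
    dd if="$device" bs=512 count=1 2>/dev/null | grep -aq 'GRUB'
}

# Example use on the rescue system (not executed here):
# for disk in /dev/sda /dev/sdb; do
#     if has_grub_marker "$disk"; then
#         echo "$disk: GRUB marker found in first sector"
#     else
#         echo "$disk: no GRUB marker in first sector"
#     fi
# done
```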
# idk what to do; I tried the grub-install again, but it gives me this error
<pre>
root@rescue ~ # grub-install /dev/sda
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #

root@rescue ~ # grub-install /dev/sdb
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #
</pre>
# I tried creating a chroot of our real raid disks first
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # umount /mnt/md1
root@rescue ~ # chroot-prepare /mnt/md2
root@rescue ~ # chroot /mnt/md2
root@rescue / # ls /boot
root@rescue / # mount /dev/md1 /boot
root@rescue / # ls /boot
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue / #
</pre>
# I then tried the grub install again
<pre>
root@rescue / # grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #

root@rescue / # grub2-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #
</pre>
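# the chroot steps above can be sketched as a small script. This is an illustration under assumptions, not a tested procedure: the function names and the <code>DRY_RUN</code> switch are hypothetical, the root filesystem is assumed to be mounted already (as <code>/mnt/md2</code> was above), and <code>chroot-prepare</code> is the helper from the hetzner rescue system used in the transcript. By default it only prints the commands

```shell
#!/usr/bin/env bash
# Sketch of the chroot-based GRUB reinstall from the transcript above.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reinstall_grub ROOT_MOUNTPOINT BOOT_ARRAY DISK...
# ROOT_MOUNTPOINT must already hold the mounted root filesystem.
reinstall_grub() {
    local root_mnt="$1" boot_array="$2"
    shift 2
    run chroot-prepare "$root_mnt" || return 1
    # /boot lives on its own array, so mount it inside the chroot first
    run chroot "$root_mnt" mount "$boot_array" /boot || return 1
    local disk
    for disk in "$@"; do
        # install to every disk in the mirror, old and new
        run chroot "$root_mnt" grub2-install "$disk" || return 1
    done
}

# Example (prints the four commands in dry-run mode):
# reinstall_grub /mnt/md2 /dev/md1 /dev/sda /dev/sdb
```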
# I exited the chroot and shutdown the rescue system
# I activated the VKVM rescue system, and booted it again
# when I connected to the KVM wui, I was shown a password prompt. So I think booting works!
# I rebooted it from the ssh
# and now I can ssh into the real system
<pre>
user@personal:~$ autossh opensourceecology.org
Last login: Thu Apr 24 23:12:44 2025 from 146.70.199.15
[maltfield@opensourceecology ~]$
</pre>
# and now the wiki loads too
# I did another reboot test
<pre>
[maltfield@opensourceecology ~]$ sudo su -
[sudo] password for maltfield:
Last login: Thu Apr 24 16:25:15 UTC 2025 on pts/0
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
ssh: connect to host opensourceecology.org port 32415: Connection refused
Last login: Fri Apr 25 16:29:21 2025 from 185.204.1.184
[maltfield@opensourceecology ~]$
</pre>
# idk, my takeaway is that one or more of these assumptions is correct
## grub-install needs to be run *after* the RAID sync is finished
## grub-install needs to be run on *both* the new *and* the old disk
## grub-install needs to be run inside a chroot on the rescue system
# anyway, we're stable again
# I got an email from Marcin saying Tom could help with the migrations. I sent him some wiki articles to get caught-up
<pre>
Hey Tom,

I'll try to get you ssh access on hetzner2 soon. In the meantime, please read the following articles:

* https://wiki.opensourceecology.org/wiki/Hetzner2

* https://wiki.opensourceecology.org/wiki/Hetzner3

I've started preparing draft "change tickets" for migrating each of the websites from hetzner2 to hetzner3. Note that some of these are not fully tested, so you'll want to execute them manually and make corrections as-needed.

Please also read-through these:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_store_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_microfactory_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_fef

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_oswh

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_phplist_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_wiki_to_hetzner3

(There's also one CHG for the forum that I think needs to be made)

The next item TODO is to finish the migration plan for these websites:

1. www.opensourceecology.org (osemain)
2. www.openbuildinginstiture.org (obi)

We decided that there would be 2 simultaneous versions of obi:

1. A static site scraped with curl on hetzner3
2. The (broken) dynamic wordpress site on hetzner3

And we decided that there would be 3 simultaneous versions of osemain:

1. The live/current site on hetzner2
2. A static site scraped with curl on hetzner3
3. The (broken) dynamic wordpress site on hetzner3

To have multiple sites with the same domain on the same server, we bought a second IPv4 address (FeF isn't setup with IPv6). This week I just finished updating the hetzer3 server to persist this new IPv4 address.

The next item for you would be to update our ansible to push out new vhosts (in nginx, varnish, and apache) for the static sites that are bound to the second IPv4 address using the same hostname.

Please read-through the ansible playbook and roles (most importantly for nginx, varnish, and apache) to understand how they're provisioned

* https://github.com/OpenSourceEcology/ansible

Since you have access to hetzner3, you can also poke around (read-only please) the configs for these three web services to understand how ansible provisions them.

Once you've updated and pushed-out the new vhosts with ansible, you'll need to update the migration plan

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_obi_to_hetzner3
* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

And then you'll want to go-through each migration plan to create a temp "snapshot" of all the sites on hetzner3, where Marcin & Catarina can do a thorough verification of each site (by updating /etc/hosts) before we do the *real* migration -- which is nearly the same as the "snapshot" except we actually migrate DNS.

Please let me know when you've finished reading the above articles.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/24/25 22:16, REDACTED@tutanota.com wrote:
> Michael;
>
> I need to reset my ssh key on hetzner2. Can you use the same as on 3 or best to generate a new one?
>
> I spoke with Marcin and I think I can help with the admin, as I have time available.
>
> Can you give a run-down of its status and what needs to be done for completing the migration to hetzner3?
> --
> Tom Griffing
</pre>

=Thr Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner 2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
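# the "does the RAID look healthy" check above can also be done mechanically. A minimal sketch, assuming the mdstat layout shown in the transcript; <code>degraded_arrays</code> is a hypothetical name. It prints every array whose status field has a missing member (<code>[U_]</code>) and prints nothing when all arrays show <code>[UU]</code>

```shell
#!/usr/bin/env bash
# Read /proc/mdstat-formatted text on stdin and print the name of each array
# whose member status field (e.g. [UU] or [U_]) contains an underscore,
# which marks a missing or failed member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/        { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/    { print name }    # status field with a gap
    '
}

# Example use on the live system (not executed here):
# if [ -n "$(degraded_arrays < /proc/mdstat)" ]; then
#     echo "at least one array is degraded; do not pull a disk"
# fi
```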
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
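# the fail-then-remove sequence above, sketched as a function so the array/partition pairs can't get mixed up the way they did in the CHG draft. The names <code>retire_member</code> and <code>DRY_RUN</code> are hypothetical, and this fails and removes one array at a time rather than failing all three first, as was done above. By default it only prints the commands

```shell
#!/usr/bin/env bash
# Mark a RAID member as failed, then hot-remove it, using the same two
# mdadm invocations that appear in the transcript above.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# retire_member ARRAY MEMBER_PARTITION
retire_member() {
    local array="$1" member="$2"
    # a healthy member cannot be removed until it has been marked faulty
    run mdadm --manage "$array" --fail "$member" &&
        run mdadm "$array" -r "$member"
}

# Example: the three pairs for sdb on this server (prints six commands)
# retire_member /dev/md0 /dev/sdb1
# retire_member /dev/md1 /dev/sdb2
# retire_member /dev/md2 /dev/sdb3
```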
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shutdown, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
# anyway, I tried to backup the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shutdown all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
# that looked good, other than the complaint about not being able to boot from this disk; I'll check later what LILO is and whether this matters for grub on the raid
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
# I tried to go ahead and install grub on the new disk, but I got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
# looks like it should be 'grub2-install' I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# calculations at that speed would be 250*1024/58 = 4,413.793103448 seconds / 60 = 73 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
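# the progress lines in the mdstat output above can be pulled out with a little awk. This is only a sketch: <code>mdstat_progress</code> is a hypothetical name, and it assumes the line layout shown above (device line starting with <code>mdN :</code>, followed by a line containing <code>recovery =</code> or <code>resync =</code>)

```shell
# Hypothetical helper, not part of the original session.
# stdin  = contents of /proc/mdstat
# stdout = one line per array that is currently syncing:
#          <device> <percent> <time left> <speed>
mdstat_progress() {
  awk '
    /^md[0-9]+ :/ { dev = $1 }                # remember which array we are in
    /recovery =|resync =/ {                   # only lines with an active sync
      for (i = 1; i <= NF; i++) {
        if ($i ~ /%$/)       pct = $i
        if ($i ~ /^finish=/) fin = substr($i, 8)
        if ($i ~ /^speed=/)  spd = substr($i, 7)
      }
      print dev, pct, fin, spd
    }
  '
}

# example: mdstat_progress < /proc/mdstat
```

# arrays that only say <code>resync=DELAYED</code> are skipped, since they have no percentage to report yet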
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
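# the log doesn't show how timestamped snapshots like the one above were produced. One way to get them would be a small loop like this; it's a guess, and <code>watch_mdstat</code> is a hypothetical name

```shell
# Hypothetical helper; the original session does not show the command used.
# $1 = file to print (default /proc/mdstat)
# $2 = seconds between snapshots (default 300)
# $3 = number of snapshots, 0 = run until interrupted (default 0)
watch_mdstat() (
  file="${1:-/proc/mdstat}"
  interval="${2:-300}"
  count="${3:-0}"
  i=0
  while :; do
    date -u          # timestamp in UTC, like the snapshots above
    cat "$file"
    echo
    i=$((i + 1))
    if [ "$count" -gt 0 ] && [ "$i" -ge "$count" ]; then
      break
    fi
    sleep "$interval"
  done
)

# example: watch_mdstat /proc/mdstat 300 >> /root/mdstat.log
```

# the function body runs in a subshell, so its variables don't leak into the calling shell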
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're about 10% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
# Shame. I was hoping for at least something <50%. Well, I wonder how long that remaining 4% will last us :/
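# for comparing single attributes between the two disks without reading the whole table, something like this could work. It's a sketch: <code>smart_raw</code> is a hypothetical helper, and it assumes the 10-column attribute layout shown above (RAW_VALUE in the 10th column)

```shell
# Hypothetical helper, not part of the original session.
# stdin = output of `smartctl -A <device>`
# $1    = attribute name, e.g. Power_On_Hours
# Prints the first token of the RAW_VALUE column for that attribute.
smart_raw() {
  awk -v name="$1" '$2 == name { print $10 }'
}

# example (needs root and smartmontools):
#   smartctl -A /dev/sdb | smart_raw Percent_Lifetime_Remain
```

# attribute names are vendor-specific, so a name that exists on one disk may print nothing on another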
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle count is crazy low; our old disk was only power-cycled 50 times, and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
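# the Power_On_Hours conversion above, as a small sketch (<code>hours_to_years</code> is a hypothetical helper; like the arithmetic above, it uses 365-day years and ignores leap days)

```shell
# Hypothetical helper, not part of the original session.
# $1 = power-on hours as reported by SMART
hours_to_years() {
  awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 78516   # old /dev/sda
hours_to_years 52083   # replacement /dev/sdb
hours_to_years 1000    # the "nearly new" limit for the paid replacement
```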
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended that we setup a daily reboot of hetzner2 at 10:40 (a good buffer after the backups finish uploading)
# I asked Marcin if he'd like me to setup a daily reboot at 10:40
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
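# one way to check the claim in that email from the command line is to look at the response headers of a logged-out request. This sketch relies on stock varnish behaviour, where the <code>X-Varnish</code> header carries two IDs on a cache hit and one on a miss; it's an assumption that this server still sends that header, and <code>varnish_hit</code> is a hypothetical helper

```shell
# Hypothetical helper, not part of the original session.
# stdin = HTTP response headers (e.g. from `curl -sI <url>`)
# Exit status 0 if the X-Varnish header holds two IDs (cache hit), 1 otherwise.
varnish_hit() {
  tr -d '\r' | awk '
    tolower($1) == "x-varnish:" { found = (NF >= 3) }   # header name + two IDs
    END { exit found ? 0 : 1 }
  '
}

# example, run without any wiki login cookies:
#   curl -sI https://wiki.opensourceecology.org/ | varnish_hit && echo "served from cache"
```

# a request that carries a login cookie should come back as a miss, which matches what a logged-in editor sees when the backend is down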
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
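# the same-day duration can be double-checked with a tiny sketch (<code>elapsed_minutes</code> is a hypothetical helper; it assumes both times are HH:MM on the same day)

```shell
# Hypothetical helper, not part of the original session.
# $1 = start time HH:MM, $2 = end time HH:MM (same day)
elapsed_minutes() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":")
    split(b, e, ":")
    print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
  }'
}

elapsed_minutes 13:32 15:20
```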
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to setup the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
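# files in /etc/cron.d use the system crontab layout: minute, hour, day-of-month, month, day-of-week, user, command. A small sketch for listing the entries that fire once a day (<code>cron_daily_entries</code> is a hypothetical helper, and it only understands plain numeric minute and hour fields)

```shell
# Hypothetical helper, not part of the original session.
# stdin  = contents of a file from /etc/cron.d/
# stdout = "HH:MM user command" for each entry that runs every day at a fixed time
cron_daily_entries() {
  awk '
    !/^[ \t]*#/ && NF >= 7 &&
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ &&
    $3 == "*" && $4 == "*" && $5 == "*" {
      cmd = $7
      for (i = 8; i <= NF; i++) cmd = cmd " " $i
      printf "%02d:%02d %s %s\n", $2, $1, $6, cmd
    }
  '
}

# example: cron_daily_entries < /etc/cron.d/reboot
# after the next scheduled run, `uptime -s` or `last reboot` shows when the machine last booted
```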
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never setup for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add a verbose flag. but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in `ip a`
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
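# "RTNETLINK answers: File exists" is what <code>ip addr add</code> returns when the address is already assigned to the interface, which would fit the verbose output above: ifup tried to add 144.76.164.201 again while it was still configured. A sketch for checking that before adding anything (<code>addr_present</code> is a hypothetical helper; it assumes the one-line format of <code>ip -o -4 addr show</code>, where the 4th field is address/prefix)

```shell
# Hypothetical helper, not part of the original session.
# stdin = output of `ip -o -4 addr show dev <iface>`
# $1    = IPv4 address without the prefix length
# Exit status 0 if the address is already on the interface, 1 otherwise.
addr_present() {
  awk -v want="$1" '
    { split($4, part, "/"); if (part[1] == want) found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# example:
#   ip -o -4 addr show dev enp0s31f6 | addr_present 144.76.164.195 && echo "already configured"
```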
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find these hetzner-specific docs on adding a second IP; they're totally different from what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)
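# for reference, a minimal sketch of checking that an address is bound, assuming the one-line-per-address output of <code>ip -o -4 addr show</code>; the function name <code>has_ipv4</code> is hypothetical (not an existing tool)

```shell
# has_ipv4 ADDRESS
# Reads `ip -o -4 addr show` style output on stdin and exits 0 if ADDRESS
# (without the /prefix) appears after an "inet" keyword, else exits 1.
has_ipv4() {
  awk -v want="$1" '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "inet") {
          split($(i + 1), parts, "/")   # strip the /27 or /32 prefix length
          if (parts[1] == want) found = 1
        }
      }
    }
    END { exit !found }
  '
}
```

# example usage (interface name taken from the log above): <code>ip -o -4 addr show dev enp0s31f6 | has_ipv4 144.76.164.195 && echo "second IP is up"</code>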

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I setup email alerts for the RAID, but not for SMART.
## on hetzner2, we had no errors of the RAID, but we did have SMART errors. I guess eventually if it failed enough that RAID replication was breaking, we would have gotten alerts. But it would be good if we could get alerts *before* that happened..
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the file on hetzner3
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being interpreted as shell, instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
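# a minimal sketch of a check that would have caught this, assuming the only requirement is that the file's first two bytes are <code>#!</code>; the function name <code>has_shebang_first</code> is hypothetical

```shell
# has_shebang_first FILE
# Exits 0 only if FILE begins with "#!" at byte 0. A comment or blank line
# ahead of the shebang (like the "# Ansible managed" header) makes it exit 1.
has_shebang_first() {
  [ "$(head -c 2 "$1")" = "#!" ]
}
```

# example usage: <code>has_shebang_first /usr/share/munin/plugins/varnish5_ || echo "shebang is not on line 1"</code>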
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
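# side note on <code>smartctl_exit_status</code>: assuming the plugin reports smartctl's raw exit code, that value is a bitmask documented in smartctl(8). A minimal sketch of decoding it (the function name <code>decode_smartctl_status</code> is hypothetical; the messages paraphrase the man page)

```shell
# decode_smartctl_status STATUS
# Prints one line per bit set in smartctl's exit status (bits 0-7).
decode_smartctl_status() {
  status=$1
  bit=0
  for msg in \
    "command line did not parse" \
    "device open failed" \
    "a SMART or other ATA command to the disk failed" \
    "SMART status check returned DISK FAILING" \
    "prefail attributes at or below threshold" \
    "attributes were at or below threshold in the past" \
    "device error log contains errors" \
    "self-test log contains errors"
  do
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $msg"
    fi
    bit=$((bit + 1))
  done
}
```

# example usage: <code>decode_smartctl_status 8</code> reports bit 3 (DISK FAILING), which is the value munin showed for sdb on hetzner2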
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits with a non-error, just a U. no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we setup for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the files
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even if I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if smart info was now being charted
## on hetzner2, there's no changes. I don't see any charts related to SMART
## on hetzner3, there's two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; it only has 1 value (smartctl_exit_status), and it's "nan" for all time charts. This is expected, since it can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2, where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
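# a minimal sketch for listing which queued messages hit this rate limit, assuming one log entry per line and the default (short, uppercase-hex) Postfix queue IDs seen above; the function name <code>rate_limited_queue_ids</code> is hypothetical

```shell
# rate_limited_queue_ids
# Reads postfix log lines on stdin and prints the unique queue IDs of
# messages that Gmail answered with the 421-4.7.28 rate-limit response.
rate_limited_queue_ids() {
  grep 'said: 421-4\.7\.28' \
    | sed -n 's|.*postfix/smtp\[[0-9]*]: \([0-9A-F]*\): host .*|\1|p' \
    | sort -u
}
```

# example usage: <code>journalctl -u postfix | rate_limited_queue_ids | wc -l</code> (the journalctl unit name differs per host; adjust to wherever the mail log lives)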
# marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# woah, I'm tailing the mail log and I just saw probably hundreds or thousands of emails attempting to be sent. phpList is *supposed* to do it in small batches, but I wonder if, once it fails and gets added to the queue, it'll do the re-send without batching it..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
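# the arithmetic behind that throttle, as a rough sketch: it assumes the pause between messages dominates (it ignores the time each send itself takes), and the function name <code>campaign_minutes</code> is hypothetical

```shell
# campaign_minutes RECIPIENTS DELAY_SECONDS
# Prints the minimum number of whole minutes (rounded up) needed to send
# one message per recipient with DELAY_SECONDS between messages.
campaign_minutes() {
  recipients=$1
  delay=$2
  echo $(( (recipients * delay + 59) / 60 ))
}
```

# example usage: <code>campaign_minutes 15000 1</code> prints 250 (a bit over 4 hours), and <code>campaign_minutes 14111 1</code> prints 236 for today's campaign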
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7
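# a rough sketch of what the postfix delay means for throughput, assuming (my reading of postconf(5), not something tested here) that a non-zero <code>default_destination_rate_delay</code> spaces deliveries to the same destination domain one at a time; the function name <code>max_per_hour</code> is hypothetical

```shell
# max_per_hour DELAY_SECONDS
# Prints the upper bound on deliveries per hour to a single destination
# domain for a given rate delay; a delay of 0 means no spacing is enforced.
max_per_hour() {
  delay=$1
  if [ "$delay" -eq 0 ]; then
    echo "unlimited"
  else
    echo $(( 3600 / delay ))
  fi
}
```

# example usage: <code>max_per_hour 2</code> prints 1800, ie at most about 1,800 messages per hour to any one domain (such as gmail.com) with the 2s setting above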

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as some layered security for ssh (to force keys), but that kinda breaks sudo, which requires the password. I could make sudo NOPASSWD, but I think it's safer, in general, to have a user password set (and still have ssh password logins disabled) rather than set sudoers to NOPASSWD
# disabled passwords are set with the '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
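# As a minimal sketch of how the second field can be read (following the shadow(5) man page; the function name and state labels are made up for illustration), this classifies lines like the ones above. Note that an empty field is not the same as a set password: per shadow(5) it means no password is required.

```python
def password_state(shadow_line: str) -> tuple[str, str]:
    """Return (username, state) for one /etc/shadow line.

    States follow shadow(5): a field starting with '!' or '*' cannot match
    any password, an empty field means no password is required, and anything
    else is treated as a password hash.
    """
    user, pw_field = shadow_line.strip().split(":")[:2]
    if pw_field == "":
        return user, "empty (no password required)"
    if pw_field[0] in "!*":
        return user, "locked"
    return user, "hash set"


# two lines copied from the output above
for line in ("cmota:!:20133:0:99999:7:::", "tgriffing::20133:0:99999:7:::"):
    print(password_state(line))
```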
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but it's better to do this early in the day, German time.
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>
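# The time conversions in that email can be double-checked with a short sketch. The fixed UTC offsets are assumptions for late April 2025 (daylight time in Germany and US Central, no daylight time in Ecuador), and treating FeF as US Central is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets for late April 2025
OFFSETS = {
    "Germany (CEST)": timedelta(hours=2),
    "Ecuador (ECT)": timedelta(hours=-5),
    "FeF (US Central, CDT)": timedelta(hours=-5),
}

# proposed start of the change window
window = datetime(2025, 4, 24, 10, 0, tzinfo=timezone.utc)

local_times = {
    place: window.astimezone(timezone(offset)).strftime("%a %H:%M")
    for place, offset in OFFSETS.items()
}
for place, local in local_times.items():
    print(f"{place}: {local}")
```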

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode we can try before it becomes destructive: innodb_force_recovery = 3
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set innodb_force_recovery back down to 2 and added an innodb_purge_threads = 0 line; I'll try that when it's not blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
# the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
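# A minimal sketch of a parser for this log, assuming the layout shown above (healthy tables end in "OK"; anything else is a bare db.table line followed by "level : message" lines). It groups the messages per table so that "note" entries can be told apart from "warning" entries. The function name is made up for illustration.

```python
from collections import defaultdict


def non_ok_tables(log_text: str) -> dict[str, list[tuple[str, str]]]:
    """Group the messages mysqlcheck printed for every table that is not OK.

    Returns {"db.table": [(level, message), ...]}.
    """
    findings = defaultdict(list)
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(" OK"):
            current = None  # healthy table (or a trailing status line)
        elif " : " in line and current is not None:
            level, message = (part.strip() for part in line.split(":", 1))
            findings[current].append((level, message))
        else:
            current = line  # a bare "db.table" line opens a finding
    return dict(findings)


# a few lines copied from the log above
sample = """\
3dp_db.wp_posts OK
cacti_db.poller_output
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
"""
for table, messages in non_ok_tables(sample).items():
    print(table, messages)
```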
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
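# Since this dump was taken from a server in recovery mode, it may be worth confirming that it is complete before relying on it. A minimal sketch, assuming mysqldump was run with its default comment output (which ends the file with a "-- Dump completed" line); the function name is made up for illustration.

```python
import gzip
import zlib
from collections import deque


def dump_looks_complete(path: str) -> bool:
    """True if the gzip stream decompresses cleanly and the final line is
    mysqldump's "-- Dump completed" trailer."""
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            last = deque(fh, maxlen=1)  # streams the file, keeps only the last line
    except (OSError, EOFError, zlib.error):
        return False  # truncated or corrupt gzip stream
    return bool(last) and last[0].startswith("-- Dump completed")
```

# It would be called with the path of the dump file listed above.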
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| 3dp_db             |
| cacti_db           |
| d3d_db             |
| fef_db             |
| microfactory_db    |
| mysql              |
| obi_db             |
| obi_staging_db     |
| oseforum_db        |
| osemain_db         |
| osemain_s_db       |
| osewiki_db         |
| oswh_db            |
| performance_schema |
| phplist_db         |
| seedhome_db        |
| store_db           |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| 3dp_db             |
| d3d_db             |
| fef_db             |
| microfactory_db    |
| mysql              |
| obi_db             |
| obi_staging_db     |
| oseforum_db        |
| osemain_s_db       |
| osewiki_db         |
| performance_schema |
| phplist_db         |
| seedhome_db        |
| store_db           |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time at level 1. This time it did start, but this loop gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
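# The status output above lists mysqld_safe as the Main PID, and the log shows mysqld_safe restarting mysqld after each crash, which would explain why systemd reports "active (running)" while the socket refuses connections. A minimal sketch for spotting that pattern in the log: the marker strings are copied from the log excerpts above, and the function name is made up for illustration.

```python
def crash_loop_signs(log_text: str) -> dict[str, int]:
    """Count log lines that hint at mysqld crash-looping under mysqld_safe."""
    markers = {
        "assertion_failures": "InnoDB: Assertion failure",
        "fatal_signals": "[ERROR] mysqld got signal",
        "safe_restarts": "mysqld_safe mysqld restarted",
        "ready": "ready for connections",
    }
    lines = log_text.splitlines()
    return {
        name: sum(marker in line for line in lines)
        for name, marker in markers.items()
    }


# a few lines copied from the log above
sample = """\
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
250418 20:11:42 [ERROR] mysqld got signal 6 ;
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
"""
print(crash_loop_signs(sample))
```

# Any nonzero safe_restarts count alongside assertion failures suggests the service is flapping rather than up.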
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
# well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, two of them (information_schema and performance_schema) are missing; only mysql is in the dump
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
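# the same check can be sketched as a small function that prints just the database names. It anchors on lines that ''start'' with <code>CREATE DATABASE</code>, so long INSERT lines that happen to contain that text are skipped without needing the length filter. The function name <code>dump_databases</code> is an assumption for illustration, not an existing tool.

```shell
# Hypothetical helper: list the database names created by a gzipped mysqldump.
dump_databases() {
  # mysqldump writes one "CREATE DATABASE ... `name` ..." line per database;
  # capture whatever sits between the first pair of backticks on such lines
  gzip -dc < "$1" | sed -n 's/^CREATE DATABASE [^`]*`\([^`]*\)`.*/\1/p'
}

# usage (dump name taken from the transcript above):
#   dump_databases mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
```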
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# mysql looks good
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
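# the two failed dumps above still left 4.0K files behind, because a plain pipeline reports gzip's exit status, not mysqldump's. A minimal bash sketch of a wrapper that removes the output file when the dump command fails. It relies on bash's <code>set -o pipefail</code>; the function name <code>dump_gz</code> is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a dump command, gzip its output into $1, and
# delete the output file if the dump command itself failed.
dump_gz() {
  local out="$1"; shift
  # run in a subshell so pipefail does not leak into the caller's shell
  if ( set -o pipefail; "$@" | gzip -c > "$out" ); then
    return 0
  fi
  rm -f -- "$out"
  return 1
}

# usage (assumes the same $mysqlPass variable as in the transcript above):
#   dump_gz mysql.sql.gz mysqldump -uroot -p"$mysqlPass" mysql
```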
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't login
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
</pre>
# now I can see our three databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
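# for planning future restores, the numbers above work out to roughly 10 MiB/s. A small sketch of the arithmetic, assuming the 6.8G reported by <code>du</code> (on-disk size of the data directory, not the size of the dump) is a fair stand-in for the amount of data written; the helper name <code>restore_rate</code> is made up.

```shell
# Hypothetical helper: average restore throughput in MiB/s.
# $1 = GiB restored, $2 = minutes, $3 = seconds (from the `time` output)
restore_rate() {
  awk -v g="$1" -v m="$2" -v s="$3" \
    'BEGIN { printf "%.1f MiB/s\n", (g * 1024) / (m * 60 + s) }'
}

# 6.8G in 11m36.530s
restore_rate 6.8 11 36.530   # prints "10.0 MiB/s"
```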
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot` – that would be better for him than logging into the hetzner wui and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin *just* access to the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
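# an alternative to editing /etc/sudoers directly would be a drop-in file under /etc/sudoers.d, which that file already includes. A minimal sketch under stated assumptions: the function name <code>reboot_rule</code> and the drop-in file name <code>marcin-reboot</code> are hypothetical, and the commented steps would need to be run as root.

```shell
# Hypothetical helper: print a sudoers rule that lets one user run only
# /sbin/reboot without a password (same rule as in the transcript above).
reboot_rule() {
  printf '%s ALL = NOPASSWD: /sbin/reboot\n' "$1"
}

# sketch of use as root (file name is an assumption):
#   reboot_rule marcin > /etc/sudoers.d/marcin-reboot
#   chmod 0440 /etc/sudoers.d/marcin-reboot
#   visudo -cf /etc/sudoers.d/marcin-reboot   # syntax check only
#   sudo -l -U marcin                         # list what marcin may run
```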
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here's our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
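# a short sketch for pulling just the short serial out of that output, which may be handy when a disk-replacement request has to name the exact drive. The function name <code>serial_short</code> is hypothetical; it only filters text on stdin and does not call udevadm itself.

```shell
# Hypothetical helper: read `udevadm info --query=property` output on stdin
# and print only the value of ID_SERIAL_SHORT.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# usage on the server (device names as in the transcript above):
#   for d in sda sdb; do
#     printf '%s %s\n' "$d" "$(udevadm info --query=property --name "$d" | serial_short)"
#   done
```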
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to setup grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
# right, so if I understand this correctly: we're not updating grub. We're using 'grub-install' to copy our grub config *to* the drive. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>
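# since the disk is filling up, here is a minimal sketch of a check that lists mount points at or above a given use%. It assumes POSIX-format input from <code>df -P</code> and mount points without spaces; the function name <code>mounts_over</code> is made up for illustration.

```shell
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <use%>" for every filesystem at or above the threshold in $1.
mounts_over() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign before comparing
    if (use + 0 >= t) print $6, $5
  }'
}

# usage:
#   df -P | mounts_over 80
```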

=Thr Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I hadn't touched OSE infra in 6 days
# the wiki is still on hetzner2, which is on EOL CentOS, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
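# for a quicker check next time, the status code of each hop in the redirect chain can be pulled out of the headers. A minimal sketch; the function name <code>status_chain</code> is hypothetical and it only filters header text on stdin.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# status code of every response in the chain, one per line.
status_chain() {
  # status lines start with "HTTP/"; drop the trailing carriage return first
  awk '/^HTTP\// { sub(/\r$/, ""); print $2 }'
}

# usage: follow redirects, discard the body, dump headers to stdout
#   curl -sL -o /dev/null -D - wiki.opensourceecology.org | status_chain
```

# for the response above this would print 301 and then 500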
# disk is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
# there's also no new logs in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but there's a bunch to the actual log file that the journal log mentions (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
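Most of that crash dump is noise; the two InnoDB assertion lines are what I ended up searching for. A small sketch for pulling them out of the log, assuming the log path that mysqld_safe reported above (<code>/var/log/mariadb/mariadb.log</code>); the function name is hypothetical:

```shell
# Print only the InnoDB assertion lines from a MariaDB/MySQL error log.
# $1 = path to the log file.
innodb_assertions() {
    grep -E 'InnoDB: (Assertion failure|Failing assertion)' "$1"
}

# Usage (on the DB host):
#   innodb_assertions /var/log/mariadb/mariadb.log
```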
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from about 13 hours ago and one from about 37 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
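The backup age can also be worked out from the timestamp embedded in the filename rather than by eye. A rough sketch, assuming GNU <code>date</code> and <code>sed</code>, and assuming the filename timestamp is in UTC (the server clock above is UTC); <code>backup_age_hours</code> is a hypothetical helper, not something that exists on the server:

```shell
# Whole hours between the timestamp embedded in a backup filename
# (daily_<host>_YYYYMMDD_HHMMSS.tar.gpg) and a reference time.
# $1 = filename (with or without a directory)
# $2 = reference time as a UTC epoch (optional; defaults to now)
backup_age_hours() {
    # Rewrite the embedded stamp as "YYYY-MM-DD HH:MM:SS" so date can parse it.
    iso="$(basename "$1" | sed -E 's/^.*_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})\.tar\.gpg$/\1-\2-\3 \4:\5:\6/')"
    then_epoch="$(date -u -d "$iso" +%s)"
    now_epoch="${2:-$(date -u +%s)}"
    echo $(( (now_epoch - then_epoch) / 3600 ))
}

# Usage, with the reference time taken from the `date` output above:
#   now="$(date -u -d '2025-04-17 20:36:56' +%s)"
#   backup_age_hours daily_hetzner2_20250417_072001.tar.gpg "$now"   # -> 13
#   backup_age_hours daily_hetzner2_20250416_072001.tar.gpg "$now"   # -> 37
```

Note that the embedded timestamp is a bit earlier than the file's mtime, so it looks like the time the backup job started rather than when it finished.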
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it keeps breaking since some months ago (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
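Since these copies are the safety net, it may be worth confirming they match the originals byte-for-byte. A minimal sketch, assuming <code>sha256sum</code> is available (it is part of coreutils); <code>same_sha256</code> is a made-up helper name:

```shell
# Succeed (exit 0) only if both files have the same SHA-256 digest.
# $1, $2 = paths of the two files to compare.
same_sha256() {
    a="$(sha256sum < "$1" | awk '{ print $1 }')"
    b="$(sha256sum < "$2" | awk '{ print $1 }')"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (reads every byte of each ~21G file twice, so it takes a few minutes):
#   for f in /home/b2user/sync*/*.tar.gpg; do
#       same_sha256 "$f" "/var/tmp/$(basename "$f")" && echo "OK  $f" || echo "BAD $f"
#   done
```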
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running 5.5.68, so the 5.6 page should be the closest match https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that it redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1 – apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up, put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that dumps from recovery modes 1, 2, or 3 are relatively safe, and that only some data on corrupt individual pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
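For reference, here is roughly what I expect the recovery attempt to look like. This is a sketch under assumptions, not something that has been run: it assumes <code>/etc/my.cnf</code> includes <code>/etc/my.cnf.d/</code> (check for an <code>!includedir</code> line first), the drop-in filename and the dump filename are invented, and <code>write_force_recovery_cnf</code> is a hypothetical helper that deliberately refuses any level above 3.

```shell
# Write a my.cnf drop-in that sets innodb_force_recovery.
# $1 = recovery level (only 1, 2, or 3 are accepted here)
# $2 = path of the config fragment to write
write_force_recovery_cnf() {
    case "$1" in
        1|2|3) ;;
        *)
            echo "refusing innodb_force_recovery=$1 (only 1-3 allowed here)" >&2
            return 1
            ;;
    esac
    printf '[mysqld]\ninnodb_force_recovery = %s\n' "$1" > "$2"
}

# Usage sketch (as root on the DB host; paths are assumptions):
#   write_force_recovery_cnf 1 /etc/my.cnf.d/zz-force-recovery.cnf
#   systemctl start mariadb
#   mysqldump --all-databases > /var/tmp/dbFail.20250417/all-databases.sql
#   systemctl stop mariadb
#   rm /etc/my.cnf.d/zz-force-recovery.cnf   # back to the default of 0
```

If the server still won't start at level 1, the same steps can be repeated with 2 and then 3; anything higher belongs on a separate machine, per the notes above.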
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given an option if we should hot swap while the system is on or shutdown. I'm going to say shutdown. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
# I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I ask them to replace? Can I see which one is more fucked than the other?
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
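With this many rows, the one that matters is easy to miss. A small sketch that filters <code>smartctl -A</code> output down to the attributes whose WHEN_FAILED column is set, assuming the ten-column ATA layout shown above (NVMe output looks different and won't match); the function name is hypothetical:

```shell
# Read `smartctl -A` output on stdin and print "ATTRIBUTE_NAME WHEN_FAILED"
# for every attribute row whose WHEN_FAILED column (field 9) is not "-".
smart_failed_attrs() {
    # Attribute rows start with a numeric ID and have at least ten fields.
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $2, $9 }'
}

# Usage (as root):
#   smartctl -A /dev/sda | smart_failed_attrs
```

On both disks above this would print a single line: <code>Percent_Lifetime_Remain FAILING_NOW</code>.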
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner2 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
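The <code>[UU]</code> vs <code>[U_]</code> check is easy to script for later monitoring. A minimal sketch that reads mdstat-style text on stdin and prints the name of any array with a missing member; <code>md_degraded</code> is a made-up name:

```shell
# Read /proc/mdstat-style text on stdin. Print the name of each md array whose
# member-status field (e.g. [UU]) contains an underscore, meaning a member is
# missing. Exit status is 1 if any degraded array was found, 0 otherwise.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember which array we are in
        match($0, /\[[U_]+\]/) {             # the [UU] / [U_] status field
            if (index(substr($0, RSTART, RLENGTH), "_")) {
                print name
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage:
#   md_degraded < /proc/mdstat || echo "RAID degraded"
```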
# ah crap, the process to bring the new drive back into the RAID is non-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back up.
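To keep the steps straight for when we do get to it, here is a dry-run sketch that only prints the commands instead of running them. It is my reading of the steps above, not a tested procedure, and it makes several assumptions: the disks use an MBR partition table (<code>sfdisk</code> on this old util-linux doesn't handle GPT; that would need <code>sgdisk</code> instead), the array-to-partition mapping is the one shown in <code>/proc/mdstat</code> above (md0/md1/md2 on partitions 1/2/3), and the bootloader is reinstalled with <code>grub2-install</code>. The function name <code>raid_rebuild_plan</code> is invented.

```shell
# Print (do NOT run) the commands to bring a replacement disk back into the
# three RAID1 arrays on this host.
# $1 = surviving disk (e.g. sda), $2 = replacement disk (e.g. sdb)
raid_rebuild_plan() {
    good="$1"
    new="$2"
    # 1. Copy the partition layout from the surviving disk (MBR assumed).
    echo "sfdisk -d /dev/$good | sfdisk /dev/$new"
    # 2. Add each new partition to its array: md0 <- 1, md1 <- 2, md2 <- 3.
    n=0
    while [ "$n" -le 2 ]; do
        echo "mdadm /dev/md$n --add /dev/$new$((n + 1))"
        n=$((n + 1))
    done
    # 3. Reinstall the bootloader on the new disk.
    echo "grub2-install /dev/$new"
}

# Usage: review the output carefully before running any of it by hand.
#   raid_rebuild_plan sda sdb
```

After the <code>mdadm --add</code> steps, <code>cat /proc/mdstat</code> should show the resync progress for each array.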
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
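# (sketch) to pull a single field out of the tune2fs output instead of eyeballing the grep, something like the following should work. The helper name <code>tune2fs_field</code> is made up for illustration, and the sample text is copied from the /dev/md2 output above rather than read from the device

```shell
# Print the value of one "Key: value" field from `tune2fs -l` output on stdin.
# tune2fs_field is a hypothetical helper name, not part of e2fsprogs.
tune2fs_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {      # line starts with "<key>:"
      sub(/^[^:]*:[ \t]*/, "")     # drop the key, the colon and the padding
      print
    }
  '
}

# Sample lines copied from the /dev/md2 output in this log
sample='Filesystem state: clean
Last mount time: Thu Apr 17 17:39:11 2025
Last checked: Thu Apr 17 17:39:00 2025'

printf '%s\n' "$sample" | tune2fs_field 'Filesystem state'
printf '%s\n' "$sample" | tune2fs_field 'Last checked'
```

# on the server this would be fed from the real command, eg <code>tune2fs -l /dev/md2 | tune2fs_field 'Filesystem state'</code>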
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2 (innodb_force_recovery = 2); this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously documented backups, luks, veracrypt, pgp, general cybersec, etc across a ton of scattered articles. So I spent some time adding categories and "see also" sections to those articles, in the hope that he can find them more easily in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there are too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
# so 13107200 / 1024 / 1024 = 12.5 MB.
# jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
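# (sketch) the byte-to-MB arithmetic above can be double-checked with a one-liner. The helper name <code>bytes_to_mib</code> is made up for illustration. Note that the 413 line in the log quotes a limit of 1000000, which is not the 13107200 found by grep; the "no files" wording suggests it may come from a different directive (in ModSecurity 2.x, probably <code>SecRequestBodyNoFilesLimit</code>, which a grep for 'BodyLimit' would not match). That last part is an assumption and was not verified on this server

```shell
# Convert a byte count to MiB with two decimals.
# bytes_to_mib is a hypothetical helper name used only for this sketch.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024 / 1024 }'
}

bytes_to_mib 13107200   # SecRequestBodyLimit found in conf.d/mod_security.conf
bytes_to_mib 1000000    # limit quoted in the 413 log line
```

# the first call prints 12.50 and the second prints 0.95, so the limit that actually rejected the edit is under 1 MB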
# ...
# so Marcin is stressing urgency to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
# I didn't want to do this site before the other lower-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

Maltfield Log/2025 Q2

2025-04-27T22:02:03Z

Maltfield: apr 26

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admins' work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Sat Apr 26, 2025=
# Marcin authorized me to add Tom to our ops google groups mailing list and to give him access to our shared ose keepass
<pre>
Yes.

On Fri, Apr 25, 2025, 12:43 PM Michael Altfield <REDACTED@disroot.org> wrote:

> (re-sending without encryption)
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
>
> On 4/25/25 12:41, Michael Altfield wrote:
>> Hey Marcin,
>>
>> Do you authorize:
>>
>> 1. Giving Tom access to the shared OSE keepass file
>>
>> 2. Adding Tom to the ops mailing list (this would allow him to password
>> reset many of our important accounts)
>>
>> Please let me know if you authorize the above.
>>
>> Thank you,
</pre>
# Tom sent me his gpg public key, which I can use to add him to the wazuh emails
<pre>
user@ose:~$ gpg
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: Go ahead and type your message ...
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----

pub rsa4096 2025-04-26 [SC]
13300901348A985115679165FB137A633FD1EB4C
uid Tom Griffing (OSE PGP Key 4-25-2025) <REDACTED@tutanota.com>
sub rsa4096 2025-04-26 [E]
user@ose:~$
</pre>
# I added Tom to the wazuh recipients, per https://wiki.opensourceecology.org/wiki/Wazuh
<pre>
mkdir -p /var/tmp/gpg
pushd /var/tmp/gpg
# write multi-line to file for documentation copy & paste
cat << EOF > /var/tmp/gpg/tom.pubkey.asc
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGgMJ7ABEACwllLJu87blFKJ8aZMR7pCjRzhhp266Rjxz7071iow43a7FkvN
pcXmYsuwW4dLhqA+Sose7Fjo9o9+7bOLcBAso9x9hk55+pDQm67wyXmxp+7pWVhj
hdLBsdB4faLQDHkHymKUs/UKRViN0an/6nARxVyah58Dh/OcnSIv0bnozze8YRJX
aklCs+OF2Jv+gBH5VWNMLloX+l+MsBYj9N14MsMeWJ8lSNFWBl/SOBGuOftZbljp
qb8dBZRo/4OR/Dr5zCUQ1KuPu2wFKfMRwi3NEdmUKpFf/U7Ydn7ZK2T+ZKl+x1eb
+0I0ZM0DgaTYTqd82wlag1hfrYM7SONYb0C03x5T4y+CsG9IchgQ2yihYIKgHOIW
Wiz6vC4N4EKmuKAqCOGS/gzp7xDqzXl2R2sWHyRuOn3yUr2z9HdDk2sjnobtaVli
wYaIoes9zrBgunLoK9S0FaHzSPX0FGwygV50E73BFxJBmL6eHeRVuYOi0FkAQmsN
dJeOvpCwKgBModyPbxin78KKbgF/0OnxWL+Zde6+J5l+aW81xbwNZYuyxWHSb7m3
2RM4dXhxAWM2cBQ5+b5yKopO8T4OzKl5C/rYzhuEYqpSEQJccFNHmQexkwqACVNl
h/D97jm0580ctnGCZuNzmLlsXX2mzqOj6UU2LlUFy0HT5tr93KBA+HkGhwARAQAB
tEBUb20gR3JpZmZpbmcgKE9TRSBQR1AgS2V5IDQtMjUtMjAyNSkgPHRvbS5ncmlm
ZmluZ0B0dXRhbm90YS5jb20+iQJRBBMBCgA7FiEEEzAJATSKmFEVZ5Fl+xN6Yz/R
60wFAmgMJ7ACGwMFCwkIBwICIgIGFQoJCAsCBBYCAwECHgcCF4AACgkQ+xN6Yz/R
60xHURAAqIUawudDI3dmIVPa/RHTOusoJA4KIXLNCMiILWd3iwZQFQNrt6YHpwJU
pyvsXAM4QWd/qt0D9IF6K9waOIA5ipX0yXFVxZ0V1BQ6aq3cK1r+NvQUcLJzS02W
T9UIJtHOs+8EbIIS6ybcnxS6RARinrJpTkoCWspWXMDnXcX3n4pbbhHQLViswf1C
tOE7uSfNPcxGLK4cYLxLL1VHC45eB2CTEAxfXSavCPI62IcYkZBdwWz7E8q1QpsP
vxgxe31b+v9NcaxW5tc2/4NwaObqKSZYlhK/pce3X18+uWzpmE3ubhPb7Ptb5GLo
42U9ymRFg7a14VFfq+wcwSlZR01o7Q2FofAOFpX+EoDBkughAX6hWyYxErJ4vD7k
ogYX25J5suxrixkTzDMJ0cCsZyt/Bu0liVnojaETUhrNUwBp7Rz7xx5x6Go/sZHK
mzhCe1q4xwSHeTZTjyG3oby4KDPgb0WEKCdUpa5BobgT9goGGXjCxe9dS8ZVUu4I
bso+h/SK95nmgsl/EDrmDXvWOh/Zy76GixCq48ydEkGbVz/6ri1+pD0NXYN/ijAu
h6EsLnoBLQCLlYYsBTfg31X2Sbzigeloy6iRWoHtCOAfI2Azdhby+BCGuSIvUOXa
Q4CQjmjYpsx7nwtjWOgCZ4rObTekj4O9ZnI8Gtxfpzy1gFdyfw65Ag0EaAwnsAEQ
ANnD6PMPT0CU1RqbAQtVw7eJksV96+tl/xG8mtje631n2uBe9WzyLch0fgC99eID
ZDGXfJUEdODuI9/H8037PnJmmMtP2eP1c/ztrql6pxPj9c0jIRWjtwmNhyYNaaEn
i0JyLz5SiTbuftlHXaKhVTuLc/Qp44FH5XK6LVHphDR8Ck43Mhj7enfvGvmAUgLW
OLQMst84oOCywYX+nUmov2rCIhuc6RhX4OcOBZcEA2W/CSsoNXR4To9mn8Gg3/dH
ZKS/3sDwJQxjFvkqc89+aTPY85TBoUGBUzbQG+KFQgDyVt4kABK1iyUA1PKZOb4Q
MZJnR9g0UI/ctfrOpz4hhEFaQ+rEYwdm5MSXOQGfjrnGu3t85IQzmxUXovqmfsjn
oFPSPd/91/rJJKxci+rCX7CpQSObPrwHNgPNQ5zleDV7d9/u9UaGRFeOaaM+abd0
RhPh4nJWbDdNOWpj3pxJkG3tzmbazBogxTq0SDRP8wvBAD0JYESoPVGWQ6czlTnu
T0ov9QKMb21mfUQ6DmfxTFQbkr1g1r2uYfJ1TbP0AcAK+Q/IMtt8F7chulfAe7/0
9nk7HwqWHTkj8+YB9+Ro2hkUTpL57uEYdG/ukGODfTNhu02wxG02zlYFsTyd/H62
VIgT1Cpf5HBb73lzdiSVtl45C34Fwu8ZO6dBdmk2c1nFABEBAAGJAjYEGAEKACAW
IQQTMAkBNIqYURVnkWX7E3pjP9HrTAUCaAwnsAIbDAAKCRD7E3pjP9HrTNxGD/wN
syvVZxm4hyw4l8U6J3B/3rKAup+l7GQCXthNK+f3YPwWdWc8DOo3kBrP4ppR5Ry9
YKb700wBDAYwWfy+ZJPHMi0vVUf8kX2QQEj4sFZHj9suTFvfLdsLTAhNtRXVtZiu
xfr1T3R3T0XSSFFdhiBO+BYRnlgFRiiR9FCTDaxrLRfhAhOwC6LHOarHnRi5nQS8
2PaHIYbWN7c5CdpH9dsPUt3xi1sEf8E87HTZo30Of/FYtB4eTOdx2DMqKscbJvZS
1ugK+2v7DMaiBMZCfbZSVNjn8+VcTOPW5KzJFsVR7UmfvTZu6c3jrshHuPOSguT7
l63AcfrJZOJe+djndWws2u0FpyMu0AHoS2r3EtBd/OydjEKG2P7qFb3KX9I9Tv35
zQmpHc4e2TJTYKpXyfarzgKFuUfOmZpm8maUTqFdEBL6pgwi1zcQ704g7Kzo/YUr
dHTA5yQ2WBBsrVKAZIt6Llkt0jIkpSyjjs5CAPJ2jsg61nq4uYw7w3jpwe80nbyc
7GgvdkJlTS7TfcYk3vlDQOQBpXqDZagQVUT8jc6mGiY/jbSzjGNt/8qObKSywFLY
XnxLVnGhKyzsWhR5fEbUCqywwc/c14gbjNguNZbU7e0Krf9ggYoglfPIOOp8XDX1
XwH+EXkSGW96dHXIYidONcMxClnA04zZY52Sr/r6Lw==
=UsaD
-----END PGP PUBLIC KEY BLOCK-----
EOF
gpg --homedir /var/ossec/.gnupg --import /var/tmp/gpg/tom.pubkey.asc
popd

# add tom's email (that matches an email on a UID of his key above) to the space-delimited "recipients" variable
vim /var/ossec/sent_encrypted_alarm.settings
</pre>
# and I sent him an email asking him to confirm that it's working
<pre>
Hey Tom,

Can you please confirm that you're now receiving alerts from wazuh?

Wazuh is our HIDS (Host-Based Intrusion Detection System). It's a fork of the HIDS and FIM (File Integrity Monitor) OSSEC. Because it sometimes sends sensitive information (eg diffs of config files with passwords), it's important that we encrypt its email notifications end-to-end with PGP.

And because someone who compromises the server could "clean up" after themselves, these (off-server) alerts are critical to post-compromise investigations.

For more info, see:

* https://wiki.opensourceecology.org/wiki/Wazuh
* https://en.wikipedia.org/wiki/OSSEC
* https://documentation.wazuh.com/current/getting-started/index.html

Out-of-the-box, Wazuh has a ton of features, but probably where we use it the most is its ingestion of apache's mod_security WAF and its tie-in to Wazuh's Active Response. If an IP is found doing something bad (eg multiple consecutive 403 responses, such as a brute-force attack on wordpress [or ssh]), then the IP will get temp blocked by the firewall for 10 minutes. If it does it again shortly after the ban is lifted, it'll be banned for 12 hours. If again, 1 day. Then 2 days. Then 4 days. And the max ban for 5x repeat offenses is 8 days

* https://github.com/OpenSourceEcology/ansible/blob/master/hetzner3/roles/maltfield.wazuh/templates/ossec.conf.j2#L256-L271

It also has rootkit detection, and lots of other useful alerts that "just work" out of the box.

Please confirm that you're now receiving encrypted wazuh alerts.

Thank you,
</pre>
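# (sketch) the ban escalation described in the email above, written out as a lookup. The durations are taken from the email text only, not read from the live ossec.conf, and the helper name <code>ban_minutes</code> is made up for illustration

```shell
# Map the number of repeat offenses to a ban length in minutes,
# following the schedule described in the email (not read from ossec.conf).
# ban_minutes is a hypothetical helper name.
ban_minutes() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: ban_minutes <repeat-count>" >&2; return 1 ;;
    0) echo 10 ;;       # first offense: 10 minutes
    1) echo 720 ;;      # 12 hours
    2) echo 1440 ;;     # 1 day
    3) echo 2880 ;;     # 2 days
    4) echo 5760 ;;     # 4 days
    *) echo 11520 ;;    # 5 or more repeats: capped at 8 days
  esac
}

for n in 0 1 2 3 4 5 6; do
  printf 'repeat %s -> %s minutes\n' "$n" "$(ban_minutes "$n")"
done
```

# each step from 12 hours onward doubles the previous ban until the 8 day cap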
# I tried to add Tom to our ops google groups email list, but it said I wasn't allowed to add members outside of our google workspace
<pre>
An error occurred
1 user is outside of your organization. Based on your group or organization settings, you can only add organization users to this group. Contact your group owner or domain administrator for help.
</pre>
# I checked our user's group. it appears that Tom doesn't have an account @opensourceecology.org in gsuite
# I found the setting to change that here https://admin.google.com/ac/managedsettings/864450622151/GROUPS_SHARING_SETTINGS_TAB
## https://support.google.com/a/thread/63692725/
## https://support.google.com/a/answer/167097
# I checked the box that said "Group owners can allow external members"
## curiously the subline said "Organization admins can always add external members" – but I'm a damn org admin, and I couldn't add him :/
# I tried to add him again, but I got the same error
# this time I went to the group settings https://groups.google.com/a/opensourceecology.org/g/REDACTED/settings
# I found the "allow external members" and changed it from "off" to "on" and clicked "save changes"
## this wasn't possible before. So first I had to change the workspace-wide settings to allow me to change the groups-specific settings. now it's changed.
# this time it worked.
# I sent an email to our ops google group, asking Tom to reply if he saw it
# ...
# I checked-in on hetzner2 to make sure it rebooted this morning
# looks like the cron is set to reboot at 10:40 UTC every day, and – indeed – uptime says it's been online for a bit less than 13 hours. And its last boot time was today at 10:41:25
<pre>
[root@opensourceecology ~]# uptime
23:30:25 up 12:49, 7 users, load average: 1.02, 0.98, 0.74
[root@opensourceecology ~]# journalctl | head
-- Logs begin at Sat 2025-04-26 10:41:25 UTC, end at Sat 2025-04-26 23:30:26 UTC. --
Apr 26 10:41:25 localhost systemd-journal[129]: Runtime journal is using 8.0M (max allowed 3.1G, trying to leave 4.0G free of 31.2G available → current limit 3.1G).
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuset
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpu
Apr 26 10:41:25 localhost kernel: Initializing cgroup subsys cpuacct
Apr 26 10:41:25 localhost kernel: Linux version 3.10.0-1160.119.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 4 14:43:51 UTC 2024
Apr 26 10:41:25 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
Apr 26 10:41:25 localhost kernel: e820: BIOS-provided physical RAM map:
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009c7ff] usable
Apr 26 10:41:25 localhost kernel: BIOS-e820: [mem 0x000000000009c800-0x000000000009ffff] reserved
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /etc/cron.d/reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology ~]# date -u
Sat Apr 26 23:31:32 UTC 2025
[root@opensourceecology ~]#
</pre>
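# (sketch) the uptime figure can be cross-checked against the first journal timestamp. This assumes GNU date (for the -d option), and the helper name <code>elapsed_hm</code> is made up for illustration

```shell
# Print the time between two timestamps as H:MM.
# Assumes GNU date (for -d); elapsed_hm is a hypothetical helper name.
elapsed_hm() (
  start=$(date -u -d "$1" +%s) || exit 1
  end=$(date -u -d "$2" +%s) || exit 1
  diff=$(( end - start ))
  printf '%d:%02d\n' "$(( diff / 3600 ))" "$(( diff % 3600 / 60 ))"
)

# first journal entry after boot, then the time uptime was run (both UTC)
elapsed_hm '2025-04-26 10:41:25' '2025-04-26 23:30:25'
```

# this prints 12:49, which matches the "up 12:49" reported by uptime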
# so it looks like we'll have ~2 minutes of downtime every day in the very early morning in the US. I can live with that.
# and grub clearly is fixed
# oh, also the RAID looks healthy
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
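# (sketch) a quick way to flag a degraded array without reading mdstat by eye. The helper name <code>degraded_arrays</code> is made up for illustration, and the sample input is copied from the output above rather than read from /proc/mdstat

```shell
# Print the name of every md array whose member map contains "_",
# which is how /proc/mdstat marks a missing or failed member.
# degraded_arrays is a hypothetical helper name; input is read from stdin.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/         { name = $1 }                   # remember current array
    $NF ~ /^\[[U_]+\]$/   { if ($NF ~ /_/) print name }   # eg [U_] or [_U]
  '
}

# Sample copied from the output above; on the server: degraded_arrays < /proc/mdstat
degraded_arrays <<'EOF'
md0 : active raid1 sda1[0] sdb1[2]
      33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk
EOF
```

# for the healthy sample this prints nothing; an array showing [U_] would have its name printed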
# I asked Tom for his GitHub account profile username, so I can grant him write access to our OSE ansible repo
# I added Tom's new ssh key to his authorized_keys file on hetzner2
# I sent Tom an email asking to confirm his access to hetzner2

=Fri Apr 25, 2025=
# I woke up this morning and discovered the wiki was offline
# I tried to ssh into the server; it's not responding
# I figured I'd log into the hetzner wui, but – uhh – the credentials are in keepass and live on the server
# I mitigated this by giving Marcin a copy of the keepass file on his veracrypt drive, but he has since changed the password (a month or two ago), and we don't have a new local copy
# I sent an email to Marcin asking him to login to hetzner wui and boot hetzner2. if it doesn't come-up, then I'll have to get the password from him so I can load it in the wui from a rescue disk
# oh, I did find the new hetzner password in my personal keepass
# I logged-in, and I found the server was listed as being on. But I can't ping it. I gave it an "automatic hardware reset" from the wui
# I'll give it a few minutes before trying the rescue system
# their rescue systems are much nicer for their cloud product than their dedicated server product
# it looks like I have two options
## rescue boot mode: where I'm given ssh access
## vnc
# the problem with the rescue boot is that – if this is a grub issue – I wouldn't be able to "see" the error
# I enabled VNC and gave the server a reboot
# I was able to connect via vnc, but it was the damn installation wizard for almalinux. I quit the installation, and the vnc session died.
# damn, I guess vnc won't let me see the boot process, after all
# instead I tried the "rescue system"
# that didn't work; I can't access ssh on either of the IP addresses
# the docs say to activate the rescue system and then reboot it; that's what I did https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system/
# this time I fully shut down the server, and then I enabled the rescue system (while it's off)
# I went back to the Reset tab, and it's still off. So I booted it
# somehow I was able to login from my ose vm using my personal ssh key, but with user root
<pre>
user@ose:~$ ssh -v root@138.201.84.223
OpenSSH_9.2p1 Debian-2+deb12u5, OpenSSL 3.0.15 3 Sep 2024
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 138.201.84.223 [138.201.84.223] port 22.
debug1: Connection established.
...
Linux rescue 6.12.19 #1 SMP Fri Mar 14 05:34:52 UTC 2025 x86_64

--------------------

Welcome to the Hetzner Rescue System.

This Rescue System is based on Debian GNU/Linux 12 (bookworm) with a custom kernel.
You can install software like you would in a normal system.

To install a new operating system from one of our prebuilt images, run 'installimage' and follow the instructions.

Important note: Any data that was not written to the disks will be lost during a reboot.

For additional information, check the following resources:
Rescue System: https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system
Installimage: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installimage
Install custom software: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installing-custom-images
other articles: https://docs.hetzner.com/robot

--------------------

Rescue System (via Legacy/CSM) up since 2025-04-25 17:24 +02:00

Hardware data:

CPU1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Cores 8)
Memory: 64153 MB (Non-ECC)
Disk /dev/sda: 250 GB (=> 232 GiB)
Disk /dev/sdb: 512 GB (=> 476 GiB)
Total capacity 709 GiB with 2 Disks

Network data:
eth0 LINK: yes
MAC: 90:1b:0e:94:07:c4
IP: 138.201.84.223
IPv6: 2a01:4f8:172:209e::2/64
Intel(R) PRO/1000 Network Driver

root@rescue ~ #
</pre>
# I was able to mount the root drive
<pre>
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@rescue ~ # mount /dev/md2 /mnt
root@rescue ~ # ls /mnt
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # ls /mnt/home
b2user crupp hart lberezhny marcin stagingsync wp
cmota Flipo jthomas maltfield not-apache tgriffing
root@rescue ~ #
</pre>
# I don't know what the point of this is; I can't fix it if I can't watch it boot and see what's breaking
# ok, at the bottom of the docs, hetzner lists another option = vKVM Rescue System https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/
# it specifically says that's for debugging boot issues
# last thing before I try that: I downloaded a local copy of the keepass files from hetzner2
<pre>
user@ose:~/tmp/hetzner2$ rsync -av --progress root@138.201.84.223:/mnt/etc/keepass ./etc-keepass-20250525
receiving incremental file list
created directory ./etc-keepass-20250525
keepass/
keepass/passwords.kdbx
46,142 100% 44.00MB/s 0:00:00 (xfr#1, to-chk=6/8)
keepass/passwords.kdbx.20170728.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#2, to-chk=5/8)
keepass/passwords.kdbx.20170804.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#3, to-chk=4/8)
keepass/passwords.kdbx.20190820.bak
33,726 100% 143.20kB/s 0:00:00 (xfr#4, to-chk=3/8)
keepass/passwords.kdbx.20190909.bak
34,238 100% 71.75kB/s 0:00:00 (xfr#5, to-chk=2/8)
keepass/passwords.kdbx.20250316.bak
45,406 100% 94.55kB/s 0:00:00 (xfr#6, to-chk=1/8)
keepass/passwords.kdbxs.20180525.bak
27,102 100% 56.31kB/s 0:00:00 (xfr#7, to-chk=0/8)

sent 161 bytes received 196,407 bytes 35,739.64 bytes/sec
total size is 195,794 speedup is 1.00
user@ose:~/tmp/hetzner2$

user@ose:~/tmp/hetzner2$ du -sh etc-keepass-20250525/keepass/*
48K etc-keepass-20250525/keepass/passwords.kdbx
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170728.bak
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170804.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190820.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190909.bak
48K etc-keepass-20250525/keepass/passwords.kdbx.20250316.bak
28K etc-keepass-20250525/keepass/passwords.kdbxs.20180525.bak
user@ose:~/tmp/hetzner2$
</pre>
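# a minimal sketch for catching a stale local copy of the keepass file before the next outage, assuming a <code>find</code> that supports <code>-mtime</code>; the function name <code>is_stale</code> and the 30-day threshold are illustrative assumptions, not existing OSE tooling

```shell
# Succeed (exit 0) if FILE was last modified more than DAYS whole days ago.
# A missing file is reported as "not stale", so check existence separately.
# Usage: is_stale FILE DAYS
is_stale() {
    [ -n "$(find "$1" -mtime +"$2" -print 2>/dev/null)" ]
}

# Hypothetical usage against the copy pulled above:
# if is_stale ./etc-keepass-20250525/keepass/passwords.kdbx 30; then
#     echo "local keepass copy is over 30 days old; pull a fresh one" >&2
# fi
```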
# so this time was the same as the rescue system, except I chose "vKVM" instead of "Linux" in the "Operating System" dropdown
# strange, it gave me an error
<pre>
Public key authentication is not available for the selected operating system.
</pre>
# I unselected my ssh key, and chose "no key" instead
# it gave me a URL and a password. I booted the server, but the URL didn't load ("Unable to connect" error)
# ok, it took a few minutes and had a self-signed cert
# I bypassed the cert error, and entered the username and password into the basic auth popup. It failed! Could I really have been MITM'd?
# I immediately shut down the server from the wui, and I tried again.
# this time I was able to login – both from ssh and in the wui.
# as soon as it opened, I saw the error
<pre>
No more network devices

Booting from Hard Disk...
.
error: symbol 'grub_calloc' not found.
Entering rescue mode...
grub rescue>
</pre>
# I wonder if this is grub or grub2. I didn't have a binary "grub-install" before. I assumed it was an error with the hetzner docs when I did "grub2-install" instead, which said it worked (there was a warning that the docs said was safe to ignore)
# curiously, the opposite is true for the ssh session in vkvm: I have grub-install but not grub2-install
<pre>
root@vKVM-rescue ~ # which grub-install
/usr/sbin/grub-install
root@vKVM-rescue ~ #
root@vKVM-rescue ~ # which grub2-install
root@vKVM-rescue ~ #
</pre>
# here's the docs in question https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# I don't want to fuck with the grub without first taking a backup of these disks. But, uh, it looks like I can't access the RAID from inside this vkvm setup
# yeah, that's one of the limitations listed for VKVM https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/#raid-controllers
<pre>
Configured units are passed through as SCSI devices to the VM. However it is not possible to access the controller. Please use the regular Hetzner Rescue System for this purpose.
</pre>
# I shutdown VKVM and booted it into the regular rescue mode
# it took a few minutes to get back into the old rescue system, but here I can use the raid
<pre>
root@rescue ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 3.4G 1 loop
sda 8:0 0 476.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
root@rescue ~ # mkdir /mnt/md1
root@rescue ~ # mkdir /mnt/md2
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # mount /dev/md2 /mnt/md2
root@rescue ~ #
</pre>
# I created a dir for these backups
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ #

root@rescue ~ # mkdir /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chown root:root /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chmod 0700 /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ #
</pre>
# first I made a backup from the raid
<pre>
root@rescue ~ # rsync -av --progress /mnt/md1 /mnt/md2/var/tmp/20250425-grub-fail/md1.$(date "+%Y%m%d_%H%M%S")
...
md1/grub2/locale/zh_TW.mo
30,882 100% 31.38kB/s 0:00:00 (xfr#345, to-chk=0/355)
md1/lost+found/

sent 399,450,301 bytes received 6,709 bytes 159,782,804.00 bytes/sec
total size is 399,330,989 speedup is 1.00
root@rescue ~ #
</pre>
# then I figured I'd make a backup of the two disk partitions directly, but I couldn't even mount them
<pre>
root@rescue ~ # umount /mnt/md1
root@rescue ~ # mkdir /mnt/sda2
root@rescue ~ # mkdir /mnt/sdb2
root@rescue ~ # mount /dev/sda2 /mnt/sda2
mount: /mnt/sda2: unknown filesystem type 'linux_raid_member'.
dmesg(1) may have more information after failed mount system call.
root@rescue ~ #
</pre>
# I tried this command (from the docs), which I skipped before because it said that the next command (grub-install) was enough; sure enough, it didn't work https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
<pre>
root@rescue ~ # grub-mkdevicemap -n
grub-mkdevicemap: error: cannot open /boot/grub/device.map.
root@rescue ~ #
</pre>
# I investigated this before, and I thought I decided we're using grub2, not grub1
<pre>
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # ls /mnt/md1/
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue ~ #
</pre>
# oh, shit, even the grub-install command is v2 https://askubuntu.com/questions/107486/how-to-know-the-version-of-grub
<pre>
root@rescue ~ # grub-install --version
grub-install (GRUB) 2.06-13+deb12u1
root@rescue ~ #
</pre>
# ok, this indicates we're not using lilo https://askubuntu.com/questions/24459/how-do-i-find-out-which-boot-loader-i-have
<pre>
root@rescue ~ # ls /mnt/md2/etc/ | grep lilo
root@rescue ~ #
</pre>
# we can dd straight from the disk to read the MBR. And, yeah, it appears we are using grub via MBR .. and this info is stored on the disks, not the raid
<pre>
root@rescue ~ # dd if=/dev/md1 bs=512 count=1 2>/dev/null | strings
root@rescue ~ #

root@rescue ~ # dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
214fb5736d1e5ad63e515dc2fffe44bd928cd8dab2c019dc11fb9fcaef5ea90dbf51f1ac507ab1cfbbe74ff
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
DA/jjF
root@rescue ~ #

root@rescue ~ # dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
root@rescue ~ #
</pre>
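# the same <code>dd</code> check can be wrapped in a small helper so it is easy to repeat on every disk; this is a sketch only, and the function name <code>mbr_has_grub</code> is an assumption made for illustration

```shell
# Succeed if the first 512 bytes of DEVICE_OR_FILE contain the string "GRUB".
# This only shows that GRUB boot code is present in the MBR; it does not
# prove that the code matches the modules under /boot.
mbr_has_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}

# Hypothetical usage (needs root to read the raw disks):
# for disk in /dev/sda /dev/sdb; do
#     if mbr_has_grub "$disk"; then
#         echo "$disk: GRUB found in MBR"
#     else
#         echo "$disk: no GRUB in MBR"
#     fi
# done
```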
# idk what to do; I tried the grub-install again, but it gives me this error
<pre>
root@rescue ~ # grub-install /dev/sda
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #

root@rescue ~ # grub-install /dev/sdb
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #
</pre>
# I tried creating a chroot of our real raid disks first
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # umount /mnt/md1
root@rescue ~ # chroot-prepare /mnt/md2
root@rescue ~ # chroot /mnt/md2
root@rescue / # ls /boot
root@rescue / # mount /dev/md1 /boot
root@rescue / # ls /boot
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue / #
</pre>
# I then tried the grub install again
<pre>
root@rescue / # grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #

root@rescue / # grub2-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #
</pre>
# I exited the chroot and shutdown the rescue system
# I activated the VKVM rescue system, and booted it again
# when I connected to the KVM wui, I was shown a password prompt. So I think booting works!
# I rebooted it from the ssh
# and now I can ssh into the real system
<pre>
user@personal:~$ autossh opensourceecology.org
Last login: Thu Apr 24 23:12:44 2025 from 146.70.199.15
[maltfield@opensourceecology ~]$
</pre>
# and now the wiki loads too
# I did another reboot test
<pre>
[maltfield@opensourceecology ~]$ sudo su -
[sudo] password for maltfield:
Last login: Thu Apr 24 16:25:15 UTC 2025 on pts/0
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
ssh: connect to host opensourceecology.org port 32415: Connection refused
Last login: Fri Apr 25 16:29:21 2025 from 185.204.1.184
[maltfield@opensourceecology ~]$
</pre>
# idk, my takeaway is that at least one of these assumptions is correct
## grub-install needs to be run *after* the RAID sync is finished
## grub-install needs to be run on *both* the new *and* the old disk
## grub-install needs to be run inside a chroot on the rescue system
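# a sketch that satisfies all three assumptions at once (none of which is confirmed here): wait until <code>/proc/mdstat</code> no longer reports a resync or recovery, then install grub on both disks from inside the chroot; the helper name <code>mdstat_sync_in_progress</code> and the 60-second poll are assumptions

```shell
# Succeed while any md array reports a resync or recovery in its status
# line (including a DELAYED or PENDING one). Reads /proc/mdstat by default.
mdstat_sync_in_progress() {
    grep -Eq '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Hypothetical usage, inside the chroot of the real system:
# while mdstat_sync_in_progress; do sleep 60; done
# grub2-install /dev/sda
# grub2-install /dev/sdb
```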
# anyway, we're stable again
# I got an email from Marcin saying Tom could help with the migrations. I sent him some wiki articles to get caught-up
<pre>
Hey Tom,

I'll try to get you ssh access on hetzner2 soon. In the meantime, please read the following articles:

* https://wiki.opensourceecology.org/wiki/Hetzner2

* https://wiki.opensourceecology.org/wiki/Hetzner3

I've started preparing draft "change tickets" for migrating each of the websites from hetzner2 to hetzner3. Note that some of these are not fully tested, so you'll want to execute them manually and make corrections as-needed.

Please also read-through these:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_store_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_microfactory_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_fef

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_oswh

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_phplist_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_wiki_to_hetzner3

(There's also one CHG for the forum that I think needs to be made)

The next item TODO is to finish the migration plan for these websites:

1. www.opensourceecology.org (osemain)
2. www.openbuildinginstiture.org (obi)

We decided that there would be 2 simultaneous versions of obi:

1. A static site scraped with curl on hetzner3
2. The (broken) dynamic wordpress site on hetzner3

And we decided that there would be 3 simultaneous versions of osemain:

1. The live/current site on hetzner2
2. A static site scraped with curl on hetzner3
3. The (broken) dynamic wordpress site on hetzner3

To have multiple sites with the same domain on the same server, we bought a second IPv4 address (FeF isn't setup with IPv6). This week I just finished updating the hetzer3 server to persist this new IPv4 address.

The next item for you would be to update our ansible to push out new vhosts (in nginx, varnish, and apache) for the static sites that are bound to the second IPv4 address using the same hostname.

Please read-through the ansible playbook and roles (most importantly for nginx, varnish, and apache) to understand how they're provisioned

* https://github.com/OpenSourceEcology/ansible

Since you have access to hetzner3, you can also poke around (read-only please) the configs for these three web services to understand how ansible provisions them.

Once you've updated and pushed-out the new vhosts with ansible, you'll need to update the migration plan

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_obi_to_hetzner3
* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

And then you'll want to go-through each migration plan to create a temp "snapshot" of all the sites on hetzner3, where Marcin & Catarina can do a thorough verification of each site (by updating /etc/hosts) before we do the *real* migration -- which is nearly the same as the "snapshot" except we actually migrate DNS.

Please let me know when you've finished reading the above articles.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/24/25 22:16, REDACTED@tutanota.com wrote:
> Michael;
>
> I need to reset my ssh key on hetzner2. Can you use the same as on 3 or best to generate a new one?
>
> I spoke with Marcin and I think I can help with the admin, as I have time available.
>
> Can you give a run-down of its status and what needs to be done for completing the migration to hetzner3?
> --
> Tom Griffing
</pre>

=Thr Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner 2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
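# the "RAID looks healthy" check can also be scripted rather than eyeballed; a minimal sketch, assuming the <code>[UU]</code> / <code>[U_]</code> status field shown above (the function name <code>mdstat_degraded</code> is made up for this example)

```shell
# Print the name of every md array whose member-status field (e.g. [U_])
# shows a missing member. Reads /proc/mdstat by default.
mdstat_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}

# Hypothetical usage: abort a change if anything is already degraded.
# [ -z "$(mdstat_degraded)" ] || { echo "RAID is degraded; stopping" >&2; exit 1; }
```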
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
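# since the CHG had a copy-paste mismatch between arrays and partitions, one option is to generate the command list from a single array-to-partition mapping and paste the output into the CHG; this sketch only prints the commands and never runs <code>mdadm</code>, and the function name and <code>ARRAY:PARTNUM</code> argument format are assumptions

```shell
# Print (do not run) the mdadm commands that mark each partition of a disk
# as failed and then remove it from its array.
# Usage: print_raid_remove_cmds DISK ARRAY:PARTNUM...
print_raid_remove_cmds() {
    disk=$1
    shift
    for pair in "$@"; do
        printf 'mdadm --manage /dev/%s --fail /dev/%s%s\n' \
            "${pair%%:*}" "$disk" "${pair##*:}"
    done
    for pair in "$@"; do
        printf 'mdadm /dev/%s -r /dev/%s%s\n' \
            "${pair%%:*}" "$disk" "${pair##*:}"
    done
}

# Example for the layout on hetzner2:
# print_raid_remove_cmds sdb md0:1 md1:2 md2:3
```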
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a=
wrong physical label on it.

If you have any further questions or problems, feel free to contact us agai=
n.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shutdown, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
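# because a hotswap can shuffle the sdX names, it may be safer to look drives up by serial number; a minimal sketch, assuming udev's usual <code>/dev/disk/by-id</code> symlinks whose names end in the serial (the function name <code>dev_by_serial</code> is hypothetical)

```shell
# Resolve a drive serial number to its current kernel device node by looking
# for a matching symlink under /dev/disk/by-id. Links for partitions end in
# "-partN", so they do not match the pattern and are skipped.
# Usage: dev_by_serial SERIAL [BY_ID_DIR]
dev_by_serial() {
    serial=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*"$serial"; do
        [ -e "$link" ] || continue
        readlink -f "$link"
        return 0
    done
    return 1
}

# Hypothetical usage:
# dev_by_serial 154410FA336C
```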
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
# anyway, I tried to backup the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
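# a read-only remount like this can be detected up front from the mount options instead of from a failing <code>mkdir</code>; a sketch, assuming the standard <code>/proc/mounts</code> column layout (the function name <code>is_readonly</code> is an assumption)

```shell
# Succeed if MOUNTPOINT is currently mounted read-only, judging by the
# options column of /proc/mounts (or of a file in the same format).
# If a mountpoint is listed more than once, the last entry wins.
# Usage: is_readonly MOUNTPOINT [MOUNTS_FILE]
is_readonly() {
    awk -v mp="$1" '
        $2 == mp { ro = ($4 ~ /(^|,)ro(,|$)/) }
        END { exit !ro }
    ' "${2:-/proc/mounts}"
}

# Hypothetical usage:
# if is_readonly /; then echo "root filesystem is read-only" >&2; fi
```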
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shutdown all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
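# once a manual backup run finishes, the earlier <code>rclone ls | grep</code> check can be turned into a pass/fail test; a sketch only, assuming the "SIZE NAME" listing format and the <code>_YYYYMMDD_</code> naming seen in the earlier listing (the function name <code>has_backup_for_date</code> is made up here, and names containing spaces are not handled)

```shell
# Succeed if a listing in "SIZE NAME" form (as printed by `rclone ls`)
# contains a non-empty file whose name includes _YYYYMMDD_.
# Usage: rclone ls REMOTE | has_backup_for_date YYYYMMDD
has_backup_for_date() {
    awk -v d="_${1}_" '
        $1 > 0 && index($2, d) { found = 1 }
        END { exit !found }
    '
}

# Hypothetical usage, with the variables from /root/backups/backup.settings:
# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | has_backup_for_date "$(date "+%Y%m%d")"
```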
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
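# a minimal sketch for double-checking that copy: dump both partition tables, replace the device name with a placeholder, and diff the results. The helper name <code>normalize_dump</code> and the file names <code>a.txt</code>/<code>b.txt</code> are made up for illustration, and I'm assuming the only expected difference between the two dumps is the device name.

```shell
# Replace the device name in an `sfdisk -d` dump (read from stdin) with a
# placeholder so that dumps from two different disks can be compared.
# usage: sfdisk -d /dev/sdX | normalize_dump sdX
normalize_dump() {
  sed "s#/dev/$1#DISK#g"
}

# On the server (as root) the comparison would then be:
#   sfdisk -d /dev/sda | normalize_dump sda > a.txt
#   sfdisk -d /dev/sdb | normalize_dump sdb > b.txt
#   diff a.txt b.txt && echo "partition tables match"
```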
# that looked good, other than the complaint about not being able to boot from this disk; I'll check later what LILO is and whether this will matter for grub on the RAID
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
# I tried to go ahead and install grub on the new disk, but I got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
# looks like it should be 'grub2-install'; I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
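# the <code>finish=</code> figure in these mdstat samples is roughly (total blocks - blocks done) / speed. A small sketch of that arithmetic, assuming the blocks in /proc/mdstat are 1 KiB each and the speed is in KiB per second; the function name <code>resync_eta_min</code> is mine, and the result is truncated to whole minutes, so it only approximates what the kernel prints.

```shell
# Estimate the remaining resync time in whole minutes.
# usage: resync_eta_min DONE_BLOCKS TOTAL_BLOCKS SPEED_K_PER_SEC
# blocks / (K/sec) gives seconds; dividing by 60 gives minutes.
resync_eta_min() {
  echo $(( ($2 - $1) / $3 / 60 ))
}

resync_eta_min 19712 33521664 1159      # first md0 sample (mdstat said 481.1min)
resync_eta_min 4168832 33521664 59264   # md0 at 12.4% (mdstat said 8.2min)
```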
# calculations at that speed would be 250*1024/58 = 4,413.793103448 seconds / 60 = 73 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're 9% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
# Shame. I was hoping for at least something <50%. Well, I wonder how long that remaining 4% will last us :/
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
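# a small sketch for turning the Power_On_Hours raw values from the smartctl output above into years. Both helper names are made up; I'm assuming 24-hour days and 365-day years, and that RAW_VALUE is the 10th whitespace-separated column, as it is in the tables above.

```shell
# Pull one attribute's RAW_VALUE (10th column) out of `smartctl -A` output
# read from stdin.
# usage: smartctl -A /dev/sda | smart_raw Power_On_Hours
smart_raw() {
  awk -v name="$1" '$2 == name { print $10 }'
}

# Convert power-on hours to years, rounded to two decimals.
hours_to_years() {
  awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 78516   # old /dev/sda
hours_to_years 52083   # replacement /dev/sdb
```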
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle count is crazy; our old disk was only power-cycled 50 times and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended that we setup a daily reboot of hetzner2 at 10:40 (a good buffer after the backups finish uploading)
# I asked Marcin if he'd like me to setup a daily reboot at 10:40
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
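# a rough way to test that claim from a logged-out client is to look at the <code>Age</code> response header: a value above zero means the response came out of a cache rather than straight from the backend. This is only a sketch. The function name <code>cache_status</code> is mine, and I'm assuming our varnish config passes the <code>Age</code> header through to clients, which I haven't verified.

```shell
# Reads HTTP response headers on stdin and prints "hit" if an Age header
# greater than zero is present, otherwise "miss".
cache_status() {
  awk -F': *' '
    tolower($1) == "age" { if ($2 + 0 > 0) hit = 1 }   # "+ 0" ignores a trailing CR
    END { print (hit ? "hit" : "miss") }
  '
}

# From a machine with no wiki login cookies:
#   curl -sI https://wiki.opensourceecology.org/ | cache_status
```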
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
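# instead of eyeballing these samples, a short sketch that lists any array whose member map still contains a <code>_</code> (a missing or rebuilding member). The function name <code>degraded_arrays</code> is made up, and it assumes the mdstat layout shown above, where the member map (e.g. <code>[U_]</code>) is the last field on the "blocks" line.

```shell
# Print the name of every md array that is not fully in sync.
# Reads mdstat-format text on stdin.
# usage: degraded_arrays < /proc/mdstat
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                   # remember the current array
    /\[[U_]+\]/   { if ($NF ~ /_/) print name }   # "[U_]" means a member is missing
  '
}
```

# no output means every array shows all members up (e.g. <code>[UU]</code>)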
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to setup the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
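# a sketch of that next-morning check: take the boot time in the format printed by <code>uptime -s</code> and test whether it falls within a few minutes after the scheduled time. The function name <code>booted_near</code> and the default 10-minute window are my own choices, and I'm assuming the timestamp passed in is in the same timezone as the cron schedule.

```shell
# Succeeds if a boot timestamp falls within WINDOW minutes after HH:MM.
# usage: booted_near "YYYY-MM-DD HH:MM:SS" HH:MM [WINDOW_MIN]
booted_near() {
  awk -v ts="$1" -v want="$2" -v win="${3:-10}" 'BEGIN {
    split(ts, dt, " ")          # dt[2] is "HH:MM:SS"
    split(dt[2], t, ":")
    split(want, w, ":")
    got    = t[1] * 60 + t[2]   # minutes since midnight of the actual boot
    target = w[1] * 60 + w[2]   # minutes since midnight of the cron entry
    exit !(got >= target && got <= target + win)
  }'
}

# On the server:
#   booted_near "$(TZ=UTC uptime -s)" 10:40 && echo "rebooted on schedule"
```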
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never setup for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add the --verbose flag. but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in `ip a`
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
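# sketch (not part of the original session): <code>RTNETLINK answers: File exists</code> generally means the kernel already holds the address that <code>ip addr add</code> is trying to create. A minimal pre-check, assuming the one-line-per-address format of <code>ip -o -4 addr show</code>; <code>has_ipv4</code> is a hypothetical helper name and the sample input is abridged from the output above.

```shell
# has_ipv4 ADDR: succeed if ADDR appears in `ip -o -4 addr show` output on stdin.
# (has_ipv4 is a hypothetical helper name, not part of iproute2.)
has_ipv4() {
    awk -v want="$1" '
        $3 == "inet" { split($4, cidr, "/"); if (cidr[1] == want) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Sample input, abridged to the fields that matter (index, device, family, CIDR).
sample='2: enp0s31f6    inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
2: enp0s31f6    inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6'

if printf '%s\n' "$sample" | has_ipv4 144.76.164.195; then
    echo "144.76.164.195 is already assigned; adding it again may fail with 'File exists'"
fi
```

On a live host the same function could be fed with <code>ip -o -4 addr show dev enp0s31f6 | has_ipv4 144.76.164.195</code>.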
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find this hetzner-specific docs on adding a second IP; it's totally different than what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
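# sketch (not part of the original session): the up/down hook pair added to the <code>interfaces</code> file above can be generated per address, which may help if more additional IPs are ever needed. <code>extra_ip_hooks</code> is a hypothetical helper; it only prints text and does not touch the network configuration.

```shell
# extra_ip_hooks IFACE ADDR: print the up/down hook pair for one additional /32 address.
# (extra_ip_hooks is a hypothetical helper; it changes nothing on the system.)
extra_ip_hooks() {
    iface=$1
    addr=$2
    printf '  up ip addr add %s/32 dev %s\n' "$addr" "$iface"
    printf '  down ip addr del %s/32 dev %s\n' "$addr" "$iface"
}

# Values taken from the interfaces file shown above.
extra_ip_hooks enp0s31f6 144.76.164.195
```

The printed lines belong inside the existing <code>iface enp0s31f6 inet static</code> stanza, as in the file above, so they run when that interface goes up or down.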
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)
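# sketch (not part of the original session): before picking a port for a netcat test like the one above, the list of listeners can be checked for a free port. This assumes the column layout of <code>ss -H -ltn</code> (local address in the fourth column); <code>port_in_use</code> is a hypothetical helper and the sample listeners are made up for illustration.

```shell
# port_in_use PORT: succeed if any listener in `ss -H -ltn` output (stdin) uses PORT.
# (port_in_use is a hypothetical helper name.)
port_in_use() {
    awk -v port="$1" '
        { n = split($4, parts, ":"); if (parts[n] == port) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Made-up sample in the State/Recv-Q/Send-Q/Local/Peer layout of `ss -H -ltn`.
sample='LISTEN 0 511 0.0.0.0:443 0.0.0.0:*
LISTEN 0 511 0.0.0.0:80 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*'

for p in 443 4443; do
    if printf '%s\n' "$sample" | port_in_use "$p"; then
        echo "port $p: already has a listener"
    else
        echo "port $p: free for a netcat test"
    fi
done
```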

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I setup email alerts for the RAID, but not for SMART.
## on hetzner2, we had no errors of the RAID, but we did have SMART errors. I guess eventually if it failed enough that RAID replication was breaking, we would have gotten alerts. But it would be good if we could get alerts *before* that happened..
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the copy on hetzner3
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being interpreted as shell instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
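# sketch (not part of the original session): a shebang only takes effect when <code>#!</code> is the very first thing in the file; otherwise the script typically falls back to being run by the shell, which matches the errors above. A minimal check that could be run against templated plugins after a deploy; <code>shebang_first</code> is a hypothetical helper and the demo files are throwaway stand-ins.

```shell
# shebang_first FILE: succeed only if line 1 of FILE starts with "#!".
# (shebang_first is a hypothetical helper name.)
shebang_first() {
    first_line=
    IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
    case $first_line in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Throwaway demo files: one with the comment after the shebang, one before it.
demo_dir=$(mktemp -d)
printf '#!/usr/bin/perl\n# Ansible managed\n' > "$demo_dir/good"
printf '# Ansible managed\n\n#!/usr/bin/perl\n' > "$demo_dir/bad"

for f in good bad; do
    if shebang_first "$demo_dir/$f"; then
        echo "$f: shebang is on line 1"
    else
        echo "$f: shebang missing from line 1"
    fi
done
rm -rf "$demo_dir"
```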
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
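# sketch (not part of the original session): the per-disk symlinks created by hand above can be made in a loop. <code>link_smart_plugins</code> is a hypothetical helper; the demo runs against a temporary directory rather than the real <code>/etc/munin/plugins</code>.

```shell
# link_smart_plugins SRC DEST_DIR DEVICE...: create DEST_DIR/smart_DEVICE -> SRC
# (link_smart_plugins is a hypothetical helper name.)
link_smart_plugins() {
    src=$1
    dest_dir=$2
    shift 2
    for dev in "$@"; do
        ln -sf "$src" "$dest_dir/smart_$dev"
    done
}

# Demo in a scratch directory so nothing on the real system is modified.
demo=$(mktemp -d)
touch "$demo/smart_"     # stand-in for /usr/share/munin/plugins/smart_
mkdir "$demo/plugins"    # stand-in for /etc/munin/plugins
link_smart_plugins "$demo/smart_" "$demo/plugins" sda sdb
ls -l "$demo/plugins"
rm -rf "$demo"
```

On the real host the call would presumably be <code>link_smart_plugins /usr/share/munin/plugins/smart_ /etc/munin/plugins sda sdb</code>, followed by a restart of munin-node.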
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits without an error, but returns just a U; no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we setup for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the files
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even if I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
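# sketch (not part of the original session): when a log like the one above scrolls by quickly, tallying the <code>status=</code> field gives a faster picture of how much mail is being sent versus deferred. <code>delivery_status_counts</code> is a hypothetical helper; the sample lines are abridged from the log above.

```shell
# delivery_status_counts: read postfix log lines on stdin, print "COUNT STATUS" pairs.
# (delivery_status_counts is a hypothetical helper name.)
delivery_status_counts() {
    sed -n 's/.* status=\([a-z]*\).*/\1/p' | sort | uniq -c | awk '{ print $1, $2 }'
}

# Two delivery lines abridged from the postfix log above.
sample='Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, dsn=2.0.0, status=sent (250 2.0.0 OK)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, dsn=4.3.0, status=deferred (bounce or trace service failure)'

printf '%s\n' "$sample" | delivery_status_counts
```

Against the live journal this could be run as <code>journalctl -u postfix@- --since today | delivery_status_counts</code>.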
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
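# sketch (not part of the original session): the swap-recipient-and-test cycle above can be made less error-prone by rendering the <code>DEVICESCAN</code> line from a recipient and an optional test flag. <code>smartd_line</code> is a hypothetical helper; it only prints the directive, and the address below is a placeholder.

```shell
# smartd_line RECIPIENT [test]: print a DEVICESCAN directive like the one used above.
# With "test" as the second argument, "-M test" is added so that smartd sends
# a test mail when it starts. (smartd_line is a hypothetical helper name.)
smartd_line() {
    line="DEVICESCAN -d removable -n standby -m $1"
    if [ "${2:-}" = "test" ]; then
        line="$line -M test"
    fi
    printf '%s -M exec /usr/share/smartmontools/smartd-runner\n' "$line"
}

# Placeholder recipient; substitute the real alert address.
smartd_line alerts@example.org test
smartd_line alerts@example.org
```

The first form is meant for a one-off verification; the second is the line to leave in place once the test mail has arrived.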
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if smart info was now being charted
## on hetzner2, there's no changes. I don't see any charts related to SMART
## on hetzner3, there's two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; it only has 1 value (smartctl_exit_status), and it's "nan" for all time charts. This is expected, since it can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2. where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
# Marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# woah, I'm tailing the mail log and I just saw probably hundreds or thousands of emails being attempted. phpList is *supposed* to send in small batches, but I wonder if, once a message fails and gets added to the queue, it'll do the re-send without batching it..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
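# for reference, the send-time arithmetic above as a small sketch. It assumes the pause is the only cost per message (real sends also take some processing time), and the function names are made up for illustration.

```python
def emails_per_hour(throttle_seconds):
    """Messages sent per hour when pausing throttle_seconds between sends."""
    return 3600 / throttle_seconds


def campaign_hours(recipients, throttle_seconds):
    """Hours needed to send one message per recipient at that pause."""
    return recipients * throttle_seconds / 3600


print(emails_per_hour(1))                  # 3600.0
print(round(campaign_hours(15000, 1), 2))  # 4.17
print(round(campaign_hours(14111, 1), 2))  # 3.92
```

So at one email per second, the 14,111-recipient campaign would take just under 4 hours, and a 15,000-person list a little over 4.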
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as some layered security for ssh (to force keys), but that kinda breaks sudo, which requires the password. I could make sudo NOPASSWD, but in general I think it's safer to set a user password (and keep password logins disabled in ssh) than to set sudoers to NOPASSWD
# disabled passwords are marked with a '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
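# for reference, a minimal sketch of how the second field of a shadow entry can be classified. Per shadow(5), a field such as '!' or '*' cannot match any password (the account is locked for password auth), while an empty field means no password is required, although some applications refuse access in that case. The log doesn't record whether a password was set afterwards. The function name and the "hashed" sample line are made up for illustration.

```python
def password_state(shadow_line):
    """Classify the password field (2nd colon-separated field) of a shadow entry."""
    fields = shadow_line.rstrip("\n").split(":")
    name, password = fields[0], fields[1]
    if password == "":
        state = "empty"    # no password required to authenticate
    elif password.startswith(("!", "*")):
        state = "locked"   # cannot match any password
    else:
        state = "hashed"   # a crypt(3)-style hash is present
    return name, state


entries = [
    "marcin:!:20133:0:99999:7:::",
    "tgriffing::20133:0:99999:7:::",
    "example:$6$salt$hash:20133:0:99999:7:::",  # hypothetical entry
]

for entry in entries:
    print(password_state(entry))
```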
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but better to do this early in Germany time..
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>
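# for reference, a small sketch that reproduces the timezone conversion in the email. It uses fixed UTC offsets that hold for late April 2025 rather than a tz database lookup, and it assumes FeF is on US Central Daylight Time, which matches the 05:00 stated above.

```python
from datetime import datetime, timedelta, timezone

# Proposed maintenance window from the email above
window_utc = datetime(2025, 4, 24, 10, 0, tzinfo=timezone.utc)

# Fixed offsets for late April 2025 (assumptions, not read from tzdata)
offsets = {
    "Germany (CEST)": timedelta(hours=2),
    "Ecuador (ECT)": timedelta(hours=-5),
    "FeF (CDT)": timedelta(hours=-5),
}


def local_times(when_utc, offsets):
    """Return the wall-clock time at each place for one UTC instant."""
    return {
        place: when_utc.astimezone(timezone(delta)).strftime("%Y-%m-%d %H:%M")
        for place, delta in offsets.items()
    }


for place, stamp in local_times(window_utc, offsets).items():
    print(place, stamp)
```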

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode we can try before it becomes destructive = 3
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
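# for reference, the <code>innodb_force_recovery</code> levels as named in the MySQL guide linked above, as a small lookup sketch. The guide describes values of 3 or less as relatively safe and warns that 4 or greater can permanently corrupt data files. The helper name is made up for illustration.

```python
# Level names as listed in the MySQL "Forcing InnoDB Recovery" guide
FORCE_RECOVERY_LEVELS = {
    1: "SRV_FORCE_IGNORE_CORRUPT",    # keep running past a corrupt page
    2: "SRV_FORCE_NO_BACKGROUND",     # don't run master and purge threads
    3: "SRV_FORCE_NO_TRX_UNDO",       # skip transaction rollbacks after recovery
    4: "SRV_FORCE_NO_IBUF_MERGE",     # skip insert buffer merge operations
    5: "SRV_FORCE_NO_UNDO_LOG_SCAN",  # don't look at undo logs on startup
    6: "SRV_FORCE_NO_LOG_REDO",       # skip the redo log roll-forward
}


def is_relatively_safe(level):
    """True for the levels the guide calls relatively safe (1 through 3)."""
    return 1 <= level <= 3


for level, name in FORCE_RECOVERY_LEVELS.items():
    print(level, name, "safe" if is_relatively_safe(level) else "risky")
```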
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set the recovery back down to 2 and added the purge threads to 0 line; I'll try that when it's not blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
# the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
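# for reference, a minimal sketch of parsing that mysqlcheck log into per-table messages. One caveat with <code>grep -vi OK</code> is that it also hides any line that merely contains "ok" somewhere (for example in a table name); the sketch only treats a line as healthy when its last field is exactly <code>OK</code>. The line layout is assumed from the output above, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Message lines look like "note : ..." or "warning : ..." in the log above
MESSAGE = re.compile(r"^(note|warning|error|info|status)\s*:\s*(.*)$")


def non_ok_tables(lines):
    """Map 'db.table' -> list of (level, message) for tables not reported OK."""
    problems = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = MESSAGE.match(line)
        if match:
            if current is not None:
                problems[current].append((match.group(1), match.group(2)))
            continue
        parts = line.split()
        if len(parts) == 2 and parts[1] == "OK":
            current = None      # healthy table, nothing to record
        else:
            current = parts[0]  # table name on its own line; messages follow
    return dict(problems)


def databases_with(problems, levels=("warning", "error")):
    """Sorted database names that have at least one message at the given levels."""
    return sorted({
        table.split(".", 1)[0]
        for table, messages in problems.items()
        if any(level in levels for level, _ in messages)
    })


sample = [
    "3dp_db.wp_posts OK",
    "cacti_db.poller_output",
    "note : The storage engine for the table doesn't support check",
    "osemain_db.wp_options",
    "warning : 1 client is using or hasn't closed the table properly",
]

found = non_ok_tables(sample)
print(found)
print(databases_with(found))
```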
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_s_db |
| osewiki_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time just as 1. This time it did start, but this loop gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
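# note for future reference: systemd reporting "active (running)" only tells us that mysqld_safe is up; mysqld itself can still be crash-looping behind it, which is why the socket refuses connections. A minimal sketch (the helper name <code>wait_until</code> is hypothetical, and this was not run during the incident) that polls a readiness command before running anything that needs the socket:

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...
# Run CMD until it succeeds, at most TRIES times, sleeping DELAY seconds
# after each failed attempt. Returns 0 on success, 1 if it gives up.
# (hypothetical helper; name and arguments are illustrative)
wait_until() {
  tries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# intended usage (assumes mysqladmin is installed and $mysqlPass is set):
#   wait_until 30 1 mysqladmin -uroot -p"$mysqlPass" ping \
#     && mysqlcheck --all-databases -uroot -p"$mysqlPass"
```

If the ping never succeeds, the next place to look is still /var/log/mariadb/mariadb.log.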
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
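# note for future reference: <code>grep -vi OK</code> also hides any line that merely contains the letters "ok" (for example a table called "bookmarks"), so an empty result is weaker evidence than it looks. A minimal sketch of a stricter filter, assuming mysqlcheck prints the status as the last field of each table line (the function name is hypothetical):

```shell
#!/bin/sh
# not_ok_lines LOGFILE
# Print every non-empty line of a mysqlcheck log whose last
# whitespace-separated field is not exactly "OK".
# (hypothetical helper; assumes the "db.table   OK" line layout)
not_ok_lines() {
  awk 'NF > 0 && $NF != "OK"' "$1"
}

# intended usage:
#   not_ok_lines mysqlcheck.20250418_201559.log
```

No output from this filter means every table line ended in OK.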
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
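# note for future reference: on this old 5.5-era InnoDB, any non-zero <code>innodb_force_recovery</code> appears to block writes, so it is worth checking the configured level before attempting a restore. A minimal sketch, assuming the setting lives in a plain ini-style file such as /etc/my.cnf (the path and the function name are assumptions):

```shell
#!/bin/sh
# force_recovery_level CNF_FILE
# Print the last uncommented innodb_force_recovery value in CNF_FILE,
# or 0 if the option is not set there.
# (hypothetical helper; does not follow !include directives)
force_recovery_level() {
  awk -F= '
    /^[ \t]*innodb_force_recovery[ \t]*=/ {
      gsub(/[ \t]/, "", $2)
      level = $2
    }
    END { print (level == "" ? 0 : level) }
  ' "$1"
}

# intended usage (path is an assumption):
#   if [ "$(force_recovery_level /etc/my.cnf)" != "0" ]; then
#     echo "recovery mode is on; restores will fail" >&2
#   fi
```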
# well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, information_schema and performance_schema are missing from the dump (mysql is in there, though)
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
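# note for future reference: as far as I know, <code>mysqldump --all-databases</code> skips information_schema and performance_schema by design, so their absence from a dump is expected. A minimal sketch for comparing a dump against a list of databases we expect it to contain (both function names are hypothetical):

```shell
#!/bin/sh
# dump_databases DUMP_GZ
# Print the database names from the CREATE DATABASE lines of a gzipped dump.
# Anchoring on the start of the line skips row data that merely mentions
# "CREATE DATABASE" (hypothetical helper).
dump_databases() {
  gzip -dc "$1" | sed -n 's/^CREATE DATABASE .*`\([^`]*\)`.*$/\1/p'
}

# missing_from_dump DUMP_GZ NAME...
# Print each NAME that has no CREATE DATABASE line in the dump.
missing_from_dump() {
  dump=$1; shift
  present=$(dump_databases "$dump")
  for db in "$@"; do
    printf '%s\n' "$present" | grep -qxF -- "$db" || printf '%s\n' "$db"
  done
}

# intended usage:
#   missing_from_dump mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz \
#     mysql information_schema performance_schema
```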
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# mysql looks good
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't login
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
</pre>
# now I can see our three databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot` – that would be better for him than logging into the hetzner wui and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin *just* access to the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
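# note for future reference: since /etc/sudoers already reads drop-in files from /etc/sudoers.d, the same rule could live in its own file and be syntax-checked before it is installed. A minimal sketch (the function name and the file names are hypothetical; this is not what was done above):

```shell
#!/bin/sh
# reboot_rule USER
# Print a sudoers line that lets USER run only /sbin/reboot, without a password.
# (hypothetical helper; mirrors the rule added with visudo)
reboot_rule() {
  printf '%s ALL = NOPASSWD: /sbin/reboot\n' "$1"
}

# intended usage as root (file names are hypothetical):
#   reboot_rule marcin > /root/marcin-reboot.sudoers
#   visudo -cf /root/marcin-reboot.sudoers \
#     && install -m 0440 /root/marcin-reboot.sudoers /etc/sudoers.d/marcin-reboot
#   sudo -l -U marcin   # list what marcin may run, without needing his password
```

The <code>sudo -l -U</code> listing only shows what the policy allows; it does not replace an actual reboot test.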
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here are our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
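# note for future reference: when we tell hetzner which physical disk to pull, the serial number is what matters, since the sda/sdb names can change. A minimal sketch for extracting just the short serial from the udevadm output (the function name is hypothetical):

```shell
#!/bin/sh
# short_serial
# Read `udevadm info --query=property` output on stdin and print the
# value of ID_SERIAL_SHORT (hypothetical helper).
short_serial() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# intended usage on the server:
#   for d in sda sdb; do
#     printf '%s %s\n' "$d" "$(udevadm info --query=property --name "$d" | short_serial)"
#   done
```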
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to set up grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
# right, so if I understand this correctly: we're not updating grub. We're using 'grub-install' to copy our grub config *to* the drive. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>
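# a minimal sketch for keeping an eye on how full the disk gets while those files sit in /var/tmp/. The function name and the default threshold of 80 are my own assumptions; it only reads `df`-style text on stdin

```shell
# Print "<mount point> <Use%>" for every filesystem at or above a threshold.
# Reads `df -h` / `df -P` style output on stdin and skips the header line.
# fs_over_threshold and the default of 80 are illustrative assumptions.
fs_over_threshold() {
  threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Use% column
    if (use + 0 >= t + 0) print $6 " " $5
  }'
}
```

# usage would be something like `df -P | fs_over_threshold 80`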

=Thr Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I haven't touched OSE infra in 6 days
# the wiki is still on hetzner2, which is on EOL Cent, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
# disk is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
# there's also no new logs in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but there's a bunch to the actual log file that the journal log mentions (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from about 13 hours ago and one from about 37 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
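# a minimal sketch for working out how old a backup is from the timestamp in its filename. It assumes GNU `date`, that the `daily_<host>_YYYYMMDD_HHMMSS.tar.gpg` naming seen above holds, and that the stamp is in UTC (like the server clock); the function name is made up

```shell
# Print the age of a backup in whole hours, based on the timestamp in its name.
#   $1 = backup filename (e.g. daily_hetzner2_20250417_072001.tar.gpg)
#   $2 = "now" as epoch seconds (optional; defaults to the current time)
# Assumes GNU date (-d) and UTC timestamps; backup_age_hours is a hypothetical helper.
backup_age_hours() (
  name=$(basename "$1")
  # turn ..._YYYYMMDD_HHMMSS.tar.gpg into "YYYY-MM-DD HH:MM:SS"
  iso=$(printf '%s\n' "$name" | sed -n 's/.*_\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)_\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\.tar\.gpg$/\1-\2-\3 \4:\5:\6/p')
  [ -n "$iso" ] || exit 1
  then_s=$(date -u -d "$iso" +%s) || exit 1
  now_s="${2:-$(date -u +%s)}"
  echo $(( (now_s - then_s) / 3600 ))
)
```

# usage would be something like `backup_age_hours /home/b2user/sync/daily_hetzner2_20250417_072001.tar.gpg`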
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it has kept breaking for some months now (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
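# a rough sketch of that "keep increasing innodb_force_recovery until it starts" idea, capped at 3 per the warning above. The function name, the config file path and the start command are all my own assumptions; this is not something I've run on hetzner2

```shell
# Try innodb_force_recovery = 1, 2, ... until the given start command succeeds.
#   $1  = config file to (over)write with the [mysqld] setting (path is an assumption)
#   $2  = highest level to try
#   $3+ = command that starts the db and returns 0 on success
# Prints the level that worked. Refuses to go above 3, since anything >3 may
# corrupt the data further. try_force_recovery is a hypothetical helper.
try_force_recovery() (
  conf_file="$1"
  max_level="$2"
  shift 2
  level=1
  while [ "$level" -le "$max_level" ]; do
    if [ "$level" -gt 3 ]; then
      echo "refusing level $level: anything >3 may corrupt data further" >&2
      exit 2
    fi
    printf '[mysqld]\ninnodb_force_recovery = %s\n' "$level" > "$conf_file"
    if "$@"; then
      echo "$level"
      exit 0
    fi
    level=$((level + 1))
  done
  exit 1
)
```

# usage might look like `try_force_recovery /etc/my.cnf.d/force-recovery.cnf 3 systemctl start mariadb` (assuming my.cnf includes that directory). The file has to be removed again afterwards to get back to normal mode (0)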
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running MariaDB 5.5.68 (per the log above), so the relevant MySQL doc should be something like this (5.6) https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1; apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that dumps from recovery modes 1, 2, or 3 are relatively safe, and that only data on the corrupt individual pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given the option to hot swap while the system is on or to shut it down first. I'm going to say shut down. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
# I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I have them replace? Can I see which one is more fucked than the other?
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
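# a minimal sketch (my own helper, not from the hetzner docs) for pulling just the flagged attributes out of `smartctl -A` output. It assumes the ATA table layout shown above, where WHEN_FAILED is the 9th column and `-` means "never failed"

```shell
# Print "ID NAME WHEN_FAILED" for every SMART attribute whose WHEN_FAILED
# column is not "-". Reads `smartctl -A` output for an ATA/SATA drive on stdin.
# smart_failed_attrs is a hypothetical helper name.
smart_failed_attrs() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}
```

# usage would be something like `smartctl -A /dev/sda | smart_failed_attrs`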
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner2 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
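# a minimal sketch (not something that was run on either server) of pulling the "Percentage Used" wear value out of <code>smartctl -A</code> NVMe output like the above. It assumes the field label is printed exactly as shown; the function name <code>nvme_percentage_used</code> and the trimmed <code>SAMPLE</code> text are hypothetical and only for illustration.

```python
import re


def nvme_percentage_used(smartctl_output):
    """Return the integer 'Percentage Used' value, or None if the field is absent."""
    # Anchor on the start of a line so 'Available Spare Threshold: 10%' is not matched.
    match = re.search(r"^Percentage Used:\s*(\d+)%", smartctl_output, re.MULTILINE)
    return int(match.group(1)) if match else None


# Trimmed copy of the hetzner3 nvme0n1 output shown in this log.
SAMPLE = """\
Critical Warning: 0x00
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
"""

if __name__ == "__main__":
    print(f"wear: {nvme_percentage_used(SAMPLE)}%")
```

In practice the text would come from running <code>smartctl -A /dev/nvme0n1</code> as root; the sketch only parses text that was already captured.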
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
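# the <code>[UU]</code> vs <code>[U_]</code> check described above could be automated roughly like this. It's a sketch under the assumption that <code>/proc/mdstat</code> keeps the layout shown in this log (array name line followed by a status line ending in the member map); <code>degraded_arrays</code> and the <code>DEGRADED</code> sample are hypothetical.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map (e.g. [U_]) shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map is the last bracketed token on the status line.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Copied from the hetzner2 output above (all members present).
HEALTHY = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk
"""

# Hypothetical example of what a degraded md2 would look like.
DEGRADED = HEALTHY.replace("209984640 blocks super 1.2 [2/2] [UU]",
                           "209984640 blocks super 1.2 [2/1] [U_]")

if __name__ == "__main__":
    print(degraded_arrays(HEALTHY), degraded_arrays(DEGRADED))
```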
# ah crap, the process to bring the new drive back into the RAID is non-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back up.
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
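# to compare these fields across filesystems without eyeballing the grep output, the <code>tune2fs -l</code> text can be turned into a dictionary. This is only a sketch: it assumes every field is a single "Name: value" line as in the output above, and <code>parse_tune2fs</code> is a hypothetical helper name.

```python
def parse_tune2fs(output):
    """Map each 'Name: value' line of tune2fs -l output to a dict entry."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps like 17:39:00 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Trimmed copy of the /dev/md2 output shown in this log.
SAMPLE = """\
tune2fs 1.42.9 (28-Dec-2013)
Last mounted on: /
Filesystem state: clean
Errors behavior: Continue
Mount count: 1
Last checked: Thu Apr 17 17:39:00 2025
"""

if __name__ == "__main__":
    info = parse_tune2fs(SAMPLE)
    print(info["Filesystem state"], "|", info["Last checked"])
```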
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2 (<code>innodb_force_recovery = 2</code>); this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously documented scattered articles about backups, luks, veracrypt, pgp, cybersec general, etc in a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be more easily able to do this in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there's too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
# so 13107200 / 1024 / 1024 = 12.5 MB.
# jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
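# the byte arithmetic above, as a small sketch. One thing worth noting: the 413 log line earlier reports a configured limit of 1000000, which is not the 13107200 found for <code>SecRequestBodyLimit</code>. One possible explanation (an assumption, not verified on the server) is that a different directive such as <code>SecRequestBodyNoFilesLimit</code> is the one being hit; a grep for 'BodyLimit' would not match that name. <code>bytes_to_mib</code> is a hypothetical helper.

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / 1024 / 1024


# Value found in conf.d/mod_security.conf for SecRequestBodyLimit.
SEC_REQUEST_BODY_LIMIT = 13107200
# Limit quoted in the ModSecurity 413 error message in the apache log.
LOGGED_LIMIT = 1000000

if __name__ == "__main__":
    print(f"SecRequestBodyLimit: {bytes_to_mib(SEC_REQUEST_BODY_LIMIT)} MiB")
    print(f"limit in the 413 log line: {bytes_to_mib(LOGGED_LIMIT):.2f} MiB")
```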
# ...
# so Marcin is stressing urgency to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
# I didn't want to do this site before the other less-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3
# oh, crap, I'm going to remove

Maltfield Log/2025 Q2

2025-04-27T21:59:44Z

Maltfield: Apr 25

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Fri Apr 25, 2025=
# I woke up this morning and discovered the wiki was offline
# I tried to ssh into the server; it's not responding
# I figured I'd log into the hetzner wui, but – uhh – the credentials are in keepass and live on the server
# I mitigated this by giving Marcin a copy of the keepass file on his veracrypt drive, but he has since changed the password (a month or two ago), and we don't have a new local copy
# I sent an email to Marcin asking him to login to hetzner wui and boot hetzner2. if it doesn't come-up, then I'll have to get the password from him so I can load it in the wui from a rescue disk
# oh, I did find the new hetzner password in my personal keepass
# I logged-in, and I found the server was listed as being on. But I can't ping it. I gave it an "automatic hardware reset" from the wui
# I'll give it a few minutes before trying the rescue system
# their rescue systems are much nicer for their cloud product than their dedicated server product
# it looks like I have two options
## rescue boot mode: where I'm given ssh access
## vnc
# the problem with the rescue boot is that – if this is a grub issue – I wouldn't be able to "see" the error
# I enabled VNC and gave the server a reboot
# I was able to connect via vnc, but it was the damn installation wizard for almalinux. I quit the installation, and the vnc session died.
# damn, I guess vnc won't let me see the boot process, after all
# instead I tried the "rescue system"
# that didn't work; I can't access ssh on either of the IP addresses
# the docs say to activate the rescue system and then reboot it; that's what I did https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system/
# this time I fully shut down the server, and then I enabled the rescue system (while it's off)
# I went back to the Reset tab, and it's still off. So I booted it
# somehow I was able to login from my ose vm using my personal ssh key, but with user root
<pre>
user@ose:~$ ssh -v root@138.201.84.223
OpenSSH_9.2p1 Debian-2+deb12u5, OpenSSL 3.0.15 3 Sep 2024
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 138.201.84.223 [138.201.84.223] port 22.
debug1: Connection established.
...
Linux rescue 6.12.19 #1 SMP Fri Mar 14 05:34:52 UTC 2025 x86_64

--------------------

Welcome to the Hetzner Rescue System.

This Rescue System is based on Debian GNU/Linux 12 (bookworm) with a custom kernel.
You can install software like you would in a normal system.

To install a new operating system from one of our prebuilt images, run 'installimage' and follow the instructions.

Important note: Any data that was not written to the disks will be lost during a reboot.

For additional information, check the following resources:
Rescue System: https://docs.hetzner.com/robot/dedicated-server/troubleshooting/hetzner-rescue-system
Installimage: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installimage
Install custom software: https://docs.hetzner.com/robot/dedicated-server/operating-systems/installing-custom-images
other articles: https://docs.hetzner.com/robot

--------------------

Rescue System (via Legacy/CSM) up since 2025-04-25 17:24 +02:00

Hardware data:

CPU1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Cores 8)
Memory: 64153 MB (Non-ECC)
Disk /dev/sda: 250 GB (=> 232 GiB)
Disk /dev/sdb: 512 GB (=> 476 GiB)
Total capacity 709 GiB with 2 Disks

Network data:
eth0 LINK: yes
MAC: 90:1b:0e:94:07:c4
IP: 138.201.84.223
IPv6: 2a01:4f8:172:209e::2/64
Intel(R) PRO/1000 Network Driver

root@rescue ~ #
</pre>
# I was able to mount the root drive
<pre>
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@rescue ~ # mount /dev/md2 /mnt
root@rescue ~ # ls /mnt
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # ls /mnt/home
b2user crupp hart lberezhny marcin stagingsync wp
cmota Flipo jthomas maltfield not-apache tgriffing
root@rescue ~ #
</pre>
# I don't know what the point of this is; I can't fix it if I can't watch it boot and see what's breaking
# ok, at the bottom of the docs, hetzner lists another option = vKVM Rescue System https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/
# it specifically says that's for debugging boot issues
# last thing before I try that: I downloaded a local copy of the keepass files from hetzner2
<pre>
user@ose:~/tmp/hetzner2$ rsync -av --progress root@138.201.84.223:/mnt/etc/keepass ./etc-keepass-20250525
receiving incremental file list
created directory ./etc-keepass-20250525
keepass/
keepass/passwords.kdbx
46,142 100% 44.00MB/s 0:00:00 (xfr#1, to-chk=6/8)
keepass/passwords.kdbx.20170728.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#2, to-chk=5/8)
keepass/passwords.kdbx.20170804.bak
4,590 100% 4.38MB/s 0:00:00 (xfr#3, to-chk=4/8)
keepass/passwords.kdbx.20190820.bak
33,726 100% 143.20kB/s 0:00:00 (xfr#4, to-chk=3/8)
keepass/passwords.kdbx.20190909.bak
34,238 100% 71.75kB/s 0:00:00 (xfr#5, to-chk=2/8)
keepass/passwords.kdbx.20250316.bak
45,406 100% 94.55kB/s 0:00:00 (xfr#6, to-chk=1/8)
keepass/passwords.kdbxs.20180525.bak
27,102 100% 56.31kB/s 0:00:00 (xfr#7, to-chk=0/8)

sent 161 bytes received 196,407 bytes 35,739.64 bytes/sec
total size is 195,794 speedup is 1.00
user@ose:~/tmp/hetzner2$

user@ose:~/tmp/hetzner2$ du -sh etc-keepass-20250525/keepass/*
48K etc-keepass-20250525/keepass/passwords.kdbx
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170728.bak
8.0K etc-keepass-20250525/keepass/passwords.kdbx.20170804.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190820.bak
36K etc-keepass-20250525/keepass/passwords.kdbx.20190909.bak
48K etc-keepass-20250525/keepass/passwords.kdbx.20250316.bak
28K etc-keepass-20250525/keepass/passwords.kdbxs.20180525.bak
user@ose:~/tmp/hetzner2$
</pre>
# so this time was the same as the rescue system, except I chose "vKVM" instead of "Linux" in the "Operating System" dropdown
# strange, it gave me an error
<pre>
Public key authentication is not available for the selected operating system.
</pre>
# I unselected my ssh key, and chose "no key" instead
# it gave me a URL and a password. I booted the server, but the URL didn't load ("Unable to connect" error)
# ok, it took a few minutes and had a self-signed cert
# I bypassed the cert error, and entered the username and password into the basic auth popup. It failed! Could I really have been MITM'd?
# I immediately shut down the server from the wui, and I tried again.
# this time I was able to login – both from ssh and in the wui.
# as soon as it opened, I saw the error
<pre>
No more network devices

Booting from Hard Disk...
.
error: symbol 'grub_calloc' not found.
Entering rescue mode...
grub rescue>
</pre>
# I wonder if this is grub or grub2. I didn't have a binary "grub-install" before. I assumed it was an error with the hetzner docs when I did "grub2-install" instead, which said it worked (there was a warning that the docs said was safe to ignore)
# curiously, the opposite is true for the ssh session in vkvm: I have grub-install but not grub2-install
<pre>
root@vKVM-rescue ~ # which grub-install
/usr/sbin/grub-install
root@vKVM-rescue ~ #
root@vKVM-rescue ~ # which grub2-install
root@vKVM-rescue ~ #
</pre>
# here's the docs in question https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# I don't want to fuck with the grub without first taking a backup of these disks. But, uh, it looks like I can't access the RAID from inside this vkvm setup
# yeah, that's one of the limitations listed for VKVM https://docs.hetzner.com/robot/dedicated-server/virtualization/vkvm/#raid-controllers
<pre>
Configured units are passed through as SCSI devices to the VM. However it is not possible to access the controller. Please use the regular Hetzner Rescue System for this purpose.
</pre>
# I shut down vKVM and booted it into the regular rescue mode
# it took a few minutes to get back into the old rescue system, but here I can use the raid
<pre>
root@rescue ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 3.4G 1 loop
sda 8:0 0 476.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1
root@rescue ~ # mkdir /mnt/md1
root@rescue ~ # mkdir /mnt/md2
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # mount /dev/md2 /mnt/md2
root@rescue ~ #
</pre>
# I created a dir for these backups
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ #

root@rescue ~ # mkdir /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chown root:root /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ # chmod 0700 /mnt/md2/var/tmp/20250425-grub-fail
root@rescue ~ #
</pre>
# first I made a backup from the raid
<pre>
root@rescue ~ # rsync -av --progress /mnt/md1 /mnt/md2/var/tmp/20250425-grub-fail/md1.$(date "+%Y%m%d_%H%M%S")
...
md1/grub2/locale/zh_TW.mo
30,882 100% 31.38kB/s 0:00:00 (xfr#345, to-chk=0/355)
md1/lost+found/

sent 399,450,301 bytes received 6,709 bytes 159,782,804.00 bytes/sec
total size is 399,330,989 speedup is 1.00
root@rescue ~ #
</pre>
# then I figured I'd make a backup of the two disk partitions directly, but I couldn't even mount it
<pre>
root@rescue ~ # umount /mnt/md1
root@rescue ~ # mkdir /mnt/sda2
root@rescue ~ # mkdir /mnt/sdb2
root@rescue ~ # mount /dev/sda2 /mnt/sda2
mount: /mnt/sda2: unknown filesystem type 'linux_raid_member'.
dmesg(1) may have more information after failed mount system call.
root@rescue ~ #
</pre>
# I tried this command (from the docs), which I skipped before because it said that the next command (grub-install) was enough; sure enough, it didn't work https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
<pre>
root@rescue ~ # grub-mkdevicemap -n
grub-mkdevicemap: error: cannot open /boot/grub/device.map.
root@rescue ~ #
</pre>
# I investigated this before, and I thought I decided we're using grub2, not grub1
<pre>
root@rescue ~ # mount /dev/md1 /mnt/md1
root@rescue ~ # ls /mnt/md1/
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue ~ #
</pre>
# oh, shit, even the grub-install command is v2 https://askubuntu.com/questions/107486/how-to-know-the-version-of-grub
<pre>
root@rescue ~ # grub-install --version
grub-install (GRUB) 2.06-13+deb12u1
root@rescue ~ #
</pre>
# ok, this indicates we're not using lilo https://askubuntu.com/questions/24459/how-do-i-find-out-which-boot-loader-i-have
<pre>
root@rescue ~ # ls /mnt/md2/etc/ | grep lilo
root@rescue ~ #
</pre>
# we can dd straight from the disk to read the MBR. And, yeah, it appears we are using grub via MBR .. and this info is stored on the disks, not the raid
<pre>
root@rescue ~ # dd if=/dev/md1 bs=512 count=1 2>/dev/null | strings
root@rescue ~ #

root@rescue ~ # dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
214fb5736d1e5ad63e515dc2fffe44bd928cd8dab2c019dc11fb9fcaef5ea90dbf51f1ac507ab1cfbbe74ff
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
DA/jjF
root@rescue ~ #

root@rescue ~ # dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings
ZRr=
`|f
\|f1
GRUB
Geom
Hard Disk
Read
Error
root@rescue ~ #
</pre>
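# note to self: the MBR check above can be wrapped in a small helper. This is only a sketch; the function name <code>mbr_has_grub</code> is hypothetical, and it uses the same heuristic as the <code>dd | strings</code> commands above (it just looks for the literal string "GRUB" in the first 512-byte sector)

```shell
#!/bin/sh
# mbr_has_grub DEVICE_OR_FILE
# Hypothetical helper: succeeds if the first 512-byte sector of the given
# device (or image file) contains the string "GRUB".
mbr_has_grub() {
  dd if="$1" bs=512 count=1 2>/dev/null | grep -a -q 'GRUB'
}

# Example usage (needs root to read raw disks):
#   for d in /dev/sda /dev/sdb /dev/md1; do
#     if mbr_has_grub "$d"; then echo "$d: GRUB found"; else echo "$d: no GRUB"; fi
#   done
```

Running it against each member disk and against the md device should reproduce the result above: GRUB is on the disks themselves, not on the RAID device.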
# idk what to do; I tried the grub-install again, but it gives me this error
<pre>
root@rescue ~ # grub-install /dev/sda
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #

root@rescue ~ # grub-install /dev/sdb
grub-install: error: /usr/lib/grub/i386-pc/modinfo.sh doesn't exist. Please specify --target or --directory.
root@rescue ~ #
</pre>
# I tried creating a chroot of our real raid disks first
<pre>
root@rescue ~ # ls /mnt/md2
bin etc installimage.debug lost+found old root srv usr
boot home lib media opt run sys var
dev installimage.conf lib64 mnt proc sbin tmp
root@rescue ~ # umount /mnt/md1
root@rescue ~ # chroot-prepare /mnt/md2
root@rescue ~ # chroot /mnt/md2
root@rescue / # ls /boot
root@rescue / # mount /dev/md1 /boot
root@rescue / # ls /boot
config-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64
efi
grub
grub2
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img
initramfs-3.10.0-1127.el7.x86_64.img
initramfs-3.10.0-1127.el7.x86_64kdump.img
initramfs-3.10.0-1160.119.1.el7.x86_64.img
initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img
initramfs-3.10.0-327.18.2.el7.x86_64.img
initramfs-3.10.0-514.26.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64.img
initramfs-3.10.0-693.2.2.el7.x86_64kdump.img
initrd-plymouth.img
lost+found
symvers-3.10.0-1127.el7.x86_64.gz
symvers-3.10.0-1160.119.1.el7.x86_64.gz
symvers-3.10.0-327.18.2.el7.x86_64.gz
symvers-3.10.0-514.26.2.el7.x86_64.gz
symvers-3.10.0-693.2.2.el7.x86_64.gz
System.map-3.10.0-1127.el7.x86_64
System.map-3.10.0-1160.119.1.el7.x86_64
System.map-3.10.0-327.18.2.el7.x86_64
System.map-3.10.0-514.26.2.el7.x86_64
System.map-3.10.0-693.2.2.el7.x86_64
vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
vmlinuz-3.10.0-1127.el7.x86_64
vmlinuz-3.10.0-1160.119.1.el7.x86_64
vmlinuz-3.10.0-327.18.2.el7.x86_64
vmlinuz-3.10.0-514.26.2.el7.x86_64
vmlinuz-3.10.0-693.2.2.el7.x86_64
root@rescue / #
</pre>
# I then tried the grub install again
<pre>
root@rescue / # grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #

root@rescue / # grub2-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue / #
</pre>
# I exited the chroot and shut down the rescue system
# I activated the VKVM rescue system, and booted it again
# when I connected to the KVM wui, I was shown a password prompt. So I think booting works!
# I rebooted it from the ssh
# and now I can ssh into the real system
<pre>
user@personal:~$ autossh opensourceecology.org
Last login: Thu Apr 24 23:12:44 2025 from 146.70.199.15
[maltfield@opensourceecology ~]$
</pre>
# and now the wiki loads too
# I did another reboot test
<pre>
[maltfield@opensourceecology ~]$ sudo su -
[sudo] password for maltfield:
Last login: Thu Apr 24 16:25:15 UTC 2025 on pts/0
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
ssh: connect to host opensourceecology.org port 32415: Connection refused
Last login: Fri Apr 25 16:29:21 2025 from 185.204.1.184
[maltfield@opensourceecology ~]$
</pre>
# idk, my takeaway is that one or more of these assumptions is correct
## grub-install needs to be run *after* the RAID sync is finished
## grub-install needs to be run on *both* the new *and* the old disk
## grub-install needs to be run inside a chroot on the rescue system
# anyway, we're stable again
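# for next time, the sequence that finally worked can be condensed into one script. This is a sketch, not a tested procedure: it assumes the Hetzner rescue system (where <code>chroot-prepare</code> is available, as used above), that the root filesystem is already mounted at the given path, and that <code>/dev/md1</code> is <code>/boot</code>. The <code>DRY_RUN</code> switch and the <code>run</code> and <code>reinstall_grub</code> names are my own hypothetical additions; by default it only prints the commands

```shell
#!/bin/sh
# Sketch: reinstall GRUB2 on every RAID1 member disk from inside a chroot.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# reinstall_grub ROOT_MOUNTPOINT DISK...
reinstall_grub() {
  root_mnt="$1"; shift              # e.g. /mnt/md2 (root fs, already mounted)
  run chroot-prepare "$root_mnt"    # Hetzner rescue helper, as used above
  run chroot "$root_mnt" mount /dev/md1 /boot
  for disk in "$@"; do              # install to the old *and* the new disk
    run chroot "$root_mnt" grub2-install "$disk"
  done
}

# Example: reinstall_grub /mnt/md2 /dev/sda /dev/sdb
```

It covers all three assumptions at once (run after the sync, on both disks, inside the chroot), so it doesn't tell us which one actually mattered.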
# I got an email from Marcin saying Tom could help with the migrations. I sent him some wiki articles to get caught-up
<pre>
Hey Tom,

I'll try to get you ssh access on hetzner2 soon. In the meantime, please read the following articles:

* https://wiki.opensourceecology.org/wiki/Hetzner2

* https://wiki.opensourceecology.org/wiki/Hetzner3

I've started preparing draft "change tickets" for migrating each of the websites from hetzner2 to hetzner3. Note that some of these are not fully tested, so you'll want to execute them manually and make corrections as-needed.

Please also read-through these:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_store_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_microfactory_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_fef

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_deprecate_oswh

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_phplist_to_hetzner3

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_wiki_to_hetzner3

(There's also one CHG for the forum that I think needs to be made)

The next item TODO is to finish the migration plan for these websites:

1. www.opensourceecology.org (osemain)
2. www.openbuildinginstiture.org (obi)

We decided that there would be 2 simultaneous versions of obi:

1. A static site scraped with curl on hetzner3
2. The (broken) dynamic wordpress site on hetzner3

And we decided that there would be 3 simultaneous versions of osemain:

1. The live/current site on hetzner2
2. A static site scraped with curl on hetzner3
3. The (broken) dynamic wordpress site on hetzner3

To have multiple sites with the same domain on the same server, we bought a second IPv4 address (FeF isn't setup with IPv6). This week I just finished updating the hetzer3 server to persist this new IPv4 address.

The next item for you would be to update our ansible to push out new vhosts (in nginx, varnish, and apache) for the static sites that are bound to the second IPv4 address using the same hostname.

Please read-through the ansible playbook and roles (most importantly for nginx, varnish, and apache) to understand how they're provisioned

* https://github.com/OpenSourceEcology/ansible

Since you have access to hetzner3, you can also poke around (read-only please) the configs for these three web services to understand how ansible provisions them.

Once you've updated and pushed-out the new vhosts with ansible, you'll need to update the migration plan

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_obi_to_hetzner3
* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3

And then you'll want to go-through each migration plan to create a temp "snapshot" of all the sites on hetzner3, where Marcin & Catarina can do a thorough verification of each site (by updating /etc/hosts) before we do the *real* migration -- which is nearly the same as the "snapshot" except we actually migrate DNS.

Please let me know when you've finished reading the above articles.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/24/25 22:16, REDACTED@tutanota.com wrote:
> Michael;
>
> I need to reset my ssh key on hetzner2. Can you use the same as on 3 or best to generate a new one?
>
> I spoke with Marcin and I think I can help with the admin, as I have time available.
>
> Can you give a run-down of its status and what needs to be done for completing the migration to hetzner3?
> --
> Tom Griffing
</pre>

=Thr Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
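# the "is the RAID healthy?" check above is just eyeballing <code>[UU]</code> vs <code>[U_]</code>. A minimal sketch of automating it, assuming the mdstat layout shown above; the function name <code>degraded_arrays</code> is hypothetical

```shell
#!/bin/sh
# degraded_arrays: read /proc/mdstat-style text on stdin and print the name
# of every md array whose member status (e.g. [U_]) contains an underscore.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                  # remember the current array
    /\[[U_]+\]/ && name != "" {
      if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_"))
        print name                               # at least one member missing
      name = ""                                  # report each array only once
    }
  '
}

# Example: degraded_arrays < /proc/mdstat
```

Empty output means every array has all of its members; any printed name is an array running without redundancy.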
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
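# the fail-then-remove steps above could be looped so the array/partition pairs can't get mixed up like they did in my CHG. This is a sketch under assumptions: <code>DRY_RUN</code>, <code>run</code> and <code>detach_disk</code> are hypothetical names, and unlike the transcript above it fails and removes each partition in turn instead of failing all three first. It only prints the commands unless <code>DRY_RUN=0</code>

```shell
#!/bin/sh
# Sketch: mark each partition of a disk as failed, then hot-remove it.
# DRY_RUN=1 (default) prints the mdadm commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# detach_disk DISK ARRAY:PARTNUM...
# e.g. detach_disk /dev/sdb md0:1 md1:2 md2:3
detach_disk() {
  disk="$1"; shift
  for pair in "$@"; do
    array="/dev/${pair%%:*}"                     # md0:1 -> /dev/md0
    part="${disk}${pair##*:}"                    # md0:1 -> /dev/sdb1
    run mdadm --manage "$array" --fail "$part"   # force the faulty state first
    run mdadm "$array" -r "$part"                # then the hot remove is allowed
  done
}
```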
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shut down, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
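# lesson learned: kernel names like sda/sdc can shift after a hotswap, so it's safer to identify disks by serial. A minimal sketch of the lookup I did by hand above; <code>serial_of</code> is a hypothetical helper that just filters <code>udevadm</code> property output (the <code>/dev/disk/by-id/</code> symlinks should give the same mapping)

```shell
#!/bin/sh
# serial_of: read `udevadm info --query=property` output on stdin and print
# the value of ID_SERIAL (prints nothing if the property is absent).
serial_of() {
  sed -n 's/^ID_SERIAL=//p'
}

# Example usage (prints "device serial" for each whole disk):
#   for dev in /dev/sd?; do
#     printf '%s %s\n' "$dev" "$(udevadm info --query=property --name "$dev" | serial_of)"
#   done
```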
# anyway, I tried to backup the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
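# the failed <code>mkdir</code> above is how I found out the filesystem had gone read-only. A sketch of checking for that up-front instead, assuming the usual <code>/proc/mounts</code> field layout (device, mountpoint, type, options); <code>is_readonly</code> is a hypothetical name

```shell
#!/bin/sh
# is_readonly MOUNTPOINT
# Reads /proc/mounts-style text on stdin; succeeds if the last entry for
# MOUNTPOINT lists "ro" among its mount options.
is_readonly() {
  awk -v mp="$1" '
    $2 == mp {
      ro = 0                                   # a later entry overrides earlier ones
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
    }
    END { exit (ro ? 0 : 1) }
  '
}

# Example:
#   if is_readonly / < /proc/mounts; then echo "/ is read-only"; fi
```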
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shut down all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
# that looked good, other than the complaint about not being able to boot from this disk; I'll check later what is LILO and if this will matter for raid grub
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
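# the back-of-the-envelope math above can be redone with the exact figures from <code>/proc/mdstat</code>. A rough sketch, assuming the block counts are in KiB and the speed stays constant (it clearly doesn't); <code>eta_minutes</code> is a hypothetical helper

```shell
#!/bin/sh
# eta_minutes TOTAL_KIB DONE_KIB SPEED_KIB_PER_SEC
# Remaining sync time in whole minutes (integer division, rounds down).
eta_minutes() {
  echo $(( ($1 - $2) / $3 / 60 ))
}

# Figures taken from the /proc/mdstat output above:
eta_minutes 33521664 19712 1159     # md0 (swap) at the initial speed -> 481
eta_minutes 209984640 0 1159        # md2 (/) if that speed held      -> 3019
```

The first result is in line with the <code>finish=481.1min</code> that mdstat reported for md0; the second works out to roughly 50 hours for md2 alone at that initial speed.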
# I tried to go ahead and install grub on the new disk, but i got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
# looks like it should be 'grub2-install' I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# calculations at that speed would be 250*1024/58 = 4,413.793103448 seconds / 60 = 73 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're 9% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
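The recovery line in <code>/proc/mdstat</code> carries enough information to recompute the progress and the remaining time. The following is a rough sketch, assuming the block counts are in 1 KiB units and the speed is in KiB per second; the regex and function name are illustrative, not part of any tool.

```python
import re

RECOVERY_RE = re.compile(
    r"recovery\s*=\s*(?P<pct>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s*"
    r"speed=(?P<speed>\d+)K/sec"
)


def parse_recovery(line: str) -> dict:
    """Extract progress from an mdstat recovery line and recompute the ETA."""
    m = RECOVERY_RE.search(line)
    if m is None:
        raise ValueError("no recovery progress found")
    done, total, speed = int(m["done"]), int(m["total"]), int(m["speed"])
    return {
        "percent": 100.0 * done / total,
        "eta_min": (total - done) / speed / 60.0,
        "reported_finish_min": float(m["finish"]),
    }


# Line copied from the md2 status above
sample = ("[====>................] recovery = 23.9% (50200448/209984640) "
          "finish=101.1min speed=26325K/sec")
info = parse_recovery(sample)
print(f"{info['percent']:.1f}% done, about {info['eta_min']:.0f} min left")
```

The recomputed ETA agrees with the kernel's own <code>finish=</code> estimate to within a fraction of a minute.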
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
# Shame. I was hoping for a disk with less than 50% of its life used. Well, I wonder how long that remaining 4% will last us :/
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle counts are surprisingly low; our old disk was only power-cycled 50 times, and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
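The Power_On_Hours arithmetic above can be reproduced from the raw <code>smartctl -A</code> rows. This is a small sketch that assumes the ten-column layout shown in the output earlier (ID, name, flag, value, worst, thresh, type, updated, when-failed, raw); <code>SmartAttr</code> and the helper names are hypothetical.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attr(line: str) -> SmartAttr:
    """Split one attribute row; the raw value keeps any trailing text."""
    f = line.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[8], f[9])


def power_on_years(hours: int) -> float:
    """Same conversion as in the notes: hours / 24 / 365."""
    return hours / 24 / 365


# Rows copied from the smartctl output above
old = parse_attr("9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516")
new = parse_attr("9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083")
life = parse_attr(
    "202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100"
)

print(f"old disk: {power_on_years(int(old.raw)):.2f} years")
print(f"new disk: {power_on_years(int(new.raw)):.2f} years")
print(f"{life.name}: failing={life.when_failed != '-'}")
```

Because the attribute sets differ between vendors, a parser like this should look attributes up by name and tolerate missing ones rather than rely on row order.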
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended a daily reboot of hetzner2 at 10:40 UTC (a good buffer after the backups finish uploading), and I asked Marcin if he'd like me to set that up
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
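The behaviour described in that email (anonymous readers are served from a 24-hour cache, logged-in users always hit the backend) can be modelled with a toy cache. This is only an illustration of the reasoning and not how varnish is implemented or configured on the server; the class, the URL and the return strings are all made up, and it does not model the caching of error pages.

```python
from dataclasses import dataclass, field

TTL_SECONDS = 24 * 3600  # the 24-hour cache lifetime described above


@dataclass
class ToyCache:
    store: dict = field(default_factory=dict)  # url -> (body, stored_at)

    def serve(self, url: str, now: float, logged_in: bool, backend_up: bool) -> str:
        # Logged-in requests carry cookies and skip the cache entirely
        if logged_in:
            return f"fresh:{url}" if backend_up else "error"
        hit = self.store.get(url)
        if hit is not None and now - hit[1] < TTL_SECONDS:
            return hit[0]  # still within the TTL, backend not needed
        if not backend_up:
            return "error"
        body = f"fresh:{url}"
        self.store[url] = (body, now)
        return body


cache = ToyCache()
# An anonymous visitor warms the cache while the backend is healthy
cache.serve("/wiki/Apprenticeship", now=0, logged_in=False, backend_up=True)
# One hour later the backend is down
print(cache.serve("/wiki/Apprenticeship", now=3600, logged_in=False, backend_up=False))
print(cache.serve("/wiki/Apprenticeship", now=3600, logged_in=True, backend_up=False))
```

In this model the anonymous visitor still gets the page while the logged-in user gets an error, which is the asymmetry the email describes.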
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
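Spotting a degraded array in that output comes down to looking for an underscore in the member map (<code>[U_]</code> versus <code>[UU]</code>). Here is a minimal sketch of that check, assuming the mdstat layout shown above; the function name is hypothetical.

```python
import re


def degraded_arrays(mdstat: str) -> list:
    """Return names of md arrays whose member map contains '_' (a missing member)."""
    degraded = []
    current = None
    for line in mdstat.splitlines():
        head = re.match(r"^(md\d+)\s*:", line)
        if head:
            current = head.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            degraded.append(current)
    return degraded


# Excerpts copied from the two snapshots above
during = """md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]
"""
after = during.replace("[2/1] [U_]", "[2/2] [UU]")

print(degraded_arrays(during))
print(degraded_arrays(after))
```

On a live system the input would come from reading <code>/proc/mdstat</code> instead of the pasted strings.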
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to set up the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
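The planned next-morning check can be sketched as a small calculation: derive the boot time from the uptime, then see whether it falls shortly after the scheduled 10:40 UTC. The helper names, the 10-minute tolerance and the sample timestamps are assumptions for illustration; on the real server the uptime would come from the first field of <code>/proc/uptime</code>.

```python
from datetime import datetime, timedelta, timezone


def boot_time(now: datetime, uptime_seconds: float) -> datetime:
    """Boot time is simply 'now' minus the uptime."""
    return now - timedelta(seconds=uptime_seconds)


def booted_near(boot: datetime, hour: int = 10, minute: int = 40,
                tolerance: timedelta = timedelta(minutes=10)) -> bool:
    """True if `boot` falls within `tolerance` after the scheduled UTC time."""
    boot = boot.astimezone(timezone.utc)
    scheduled = boot.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return timedelta(0) <= boot - scheduled <= tolerance


# Hypothetical reading: checked at 12:00 UTC with 78 minutes of uptime
now = datetime(2025, 4, 25, 12, 0, tzinfo=timezone.utc)
boot = boot_time(now, uptime_seconds=78 * 60)
print(boot.isoformat(), booted_near(boot))
```

A boot time of 10:42 UTC passes the check, while a machine that has been up for several days would not.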
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never set up for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add a verbose flag. but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in <code>ip a</code>
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
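For reference, the addressing in the two attempted stanzas can be checked with Python's standard <code>ipaddress</code> module. The sketch below only confirms that both addresses and the gateway sit in the same /27; it does not diagnose the "RTNETLINK answers: File exists" error, and any link between the duplicated netmask and gateway settings and that error is an unconfirmed guess.

```python
import ipaddress

# Values copied from the interfaces stanzas above
primary = ipaddress.ip_interface("144.76.164.201/255.255.255.224")
extra = ipaddress.ip_address("144.76.164.195")
gateway = ipaddress.ip_address("144.76.164.193")

net = primary.network
print(net)                    # 144.76.164.192/27
print(net.broadcast_address)  # 144.76.164.223
print(extra in net, gateway in net)
```

The computed network and broadcast address match the <code>/27</code> and <code>brd 144.76.164.223</code> values that <code>ip a</code> reports.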
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find these hetzner-specific docs on adding a second IP; they're totally different from what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
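The <code>up</code>/<code>down</code> pair added to the interfaces file follows a fixed pattern, so it can be generated for any further addresses. This is a minimal sketch; the function name is hypothetical, and it only formats the two lines without touching any files.

```python
import ipaddress


def extra_ipv4_lines(address: str, dev: str) -> list:
    """Build the up/down lines for one additional /32 address on `dev`."""
    ip = ipaddress.IPv4Address(address)  # raises ValueError on a malformed address
    return [
        f"up ip addr add {ip}/32 dev {dev}",
        f"down ip addr del {ip}/32 dev {dev}",
    ]


for line in extra_ipv4_lines("144.76.164.195", "enp0s31f6"):
    print(line)
```

Validating the address before writing it into the file avoids discovering a typo only after the networking service fails to restart.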
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I setup email alerts for the RAID, but not for SMART.
## on hetzner2, we had no RAID errors, but we did have SMART errors. I guess if the disk eventually failed badly enough to break RAID replication, we would have gotten alerts. But it would be good if we could get alerts *before* that happened.
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the hetzner3 server
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being interpreted as shell instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
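A shebang only counts when <code>#!</code> are the first two bytes of the file; otherwise the caller typically falls back to running the script with <code>/bin/sh</code>, which matches the "not found" errors above. A minimal sketch of a check for that, where <code>has_leading_shebang</code> and the two throwaway files are hypothetical names for illustration (they are not part of munin or ansible):

```shell
#!/bin/sh
# has_leading_shebang FILE
# Succeeds only if the first two bytes of FILE are "#!".
has_leading_shebang() {
    [ "$(head -c 2 -- "$1")" = '#!' ]
}

# Demo on two throwaway files (hypothetical stand-ins for the plugin).
demo_dir=$(mktemp -d)
printf '#!/usr/bin/perl\n# Ansible managed\n' > "$demo_dir/good"
printf '# Ansible managed\n\n#!/usr/bin/perl\n' > "$demo_dir/bad"

for f in "$demo_dir/good" "$demo_dir/bad"; do
    if has_leading_shebang "$f"; then
        echo "$(basename "$f"): shebang is on line 1"
    else
        echo "$(basename "$f"): no shebang on line 1"
    fi
done
```

A check like this could run against templated scripts before they are pushed out, so that a header comment placed above the shebang is caught early.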
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
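The <code>smartctl_exit_status.value 8</code> line above is worth decoding. Assuming the plugin reports the raw return code of <code>smartctl</code>, that code is a bitmask (documented in the smartctl man page), and 8 means bit 3 is set. A small sketch of a decoder follows; <code>decode_smartctl_status</code> is a hypothetical helper name and the descriptions are abbreviated from the man page:

```shell
#!/bin/sh
# decode_smartctl_status STATUS
# Prints one line per bit that is set in a smartctl exit status (0-255).
decode_smartctl_status() {
    status=$1
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed, or checksum error" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "device self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# The value munin reported for sdb on hetzner2
decode_smartctl_status 8
```

Bit 3 is consistent with the "FAILED SMART self-check" warnings that smartd prints for these disks.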
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits without an error, but it only returns a U (munin's "unknown" value) and no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we setup for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the files
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even if I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
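When scanning postfix logs like the one above, it can help to tally the <code>status=</code> field of each delivery attempt, since a single log often mixes sent and deferred messages. A rough sketch, assuming the log lines arrive on stdin; <code>count_delivery_status</code> is a hypothetical helper name:

```shell
#!/bin/sh
# count_delivery_status
# Reads postfix log lines on stdin and prints "status=<value> <count>" per value.
count_delivery_status() {
    grep -o 'status=[a-z]*' | sort | uniq -c | awk '{ print $2 " " $1 }'
}

# Example with two abbreviated lines in the same shape as the log above
printf '%s\n' \
    'postfix/smtp[1]: AAA: to=<x@example.org>, dsn=2.0.0, status=sent (250 2.0.0 OK)' \
    'postfix/smtp[2]: BBB: to=<x@example.org>, dsn=4.3.0, status=deferred (bounce or trace service failure)' \
    | count_delivery_status
```

On a live system the same function could be fed from the journal, for example <code>journalctl -u postfix@- | count_delivery_status</code>.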
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if smart info was now being charted
## on hetzner2, there's no changes. I don't see any charts related to SMART
## on hetzner3, there's two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; it only has 1 value (smartctl_exit_status), and it's "nan" for all time charts. This is expected, since it can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2, where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
# Marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# woah, I'm tailing the mail log and I just saw probably hundreds or thousands of emails attempt to send. phpList is *supposed* to send in small batches, but I wonder if, once a send fails and gets added to the queue, it'll re-send without batching..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
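The arithmetic behind the throttle can be sketched as follows. This assumes <code>MAILQUEUE_THROTTLE</code> is the only delay (it ignores the time each send itself takes), so the result is a lower bound; <code>campaign_minutes</code> is a hypothetical helper name:

```shell
#!/bin/sh
# campaign_minutes RECIPIENTS THROTTLE_SECONDS
# Prints the minimum whole minutes needed to send one message per recipient
# when pausing THROTTLE_SECONDS between messages (rounded up).
campaign_minutes() {
    echo $(( ($1 * $2 + 59) / 60 ))
}

campaign_minutes 14111 1   # today's campaign at 1 email per second
campaign_minutes 15000 1   # a round 15,000-person list
```

At one email per second, the 14,111-recipient campaign needs at least 236 minutes (just under 4 hours), and a 15,000-recipient list needs 250 minutes.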
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
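To sanity-check what the new postfix setting implies, a sketch like the one below pulls one parameter out of <code>postconf</code>-style output and converts a delay into a rough per-destination ceiling. Both function names are hypothetical, the delay is assumed to use the <code>s</code> suffix, and the ceiling assumes one message per delivery:

```shell
#!/bin/sh
# get_postconf_value NAME
# Reads "name = value" lines on stdin and prints the value for NAME.
get_postconf_value() {
    awk -F ' = ' -v name="$1" '$1 == name { print $2 }'
}

# max_per_hour DELAY
# DELAY is a value such as "2s"; prints deliveries per hour to one destination.
max_per_hour() {
    secs=${1%s}
    if [ "$secs" -eq 0 ]; then
        echo unlimited
    else
        echo $((3600 / secs))
    fi
}

# Example using two lines in the same shape as the postconf output above
delay=$(printf '%s\n' \
    'anvil_rate_time_unit = 60s' \
    'default_destination_rate_delay = 2s' \
    | get_postconf_value default_destination_rate_delay)
max_per_hour "$delay"
```

With a 2s delay this prints 1800, so mail to any single domain (such as Google's) should stay under roughly 1,800 deliveries per hour. On a live host the input would come from <code>postconf</code> itself.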
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as some layered security for ssh (to force keys), but that kinda breaks sudo, which requires the password. I could make sudo NOPASSWD but, in general, I think it's safer to have a user password set (and still have ssh password logins disabled) than to set sudoers to NOPASSWD
# disabled passwords are set with the '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
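The second field of each <code>/etc/shadow</code> line can be classified mechanically. The sketch below treats a leading <code>!</code> or <code>*</code> as locked, an empty field as empty, and anything else as a set password hash. Per shadow(5), an empty field generally means no password is required, though some applications refuse access in that case. <code>shadow_password_state</code> is a hypothetical helper, and the <code>example</code> line with a hash is made up for illustration:

```shell
#!/bin/sh
# shadow_password_state LINE
# LINE is a single /etc/shadow entry; prints locked, empty, or set.
shadow_password_state() {
    field=$(printf '%s\n' "$1" | cut -d: -f2)
    case $field in
        '')        echo empty ;;
        '!'*|'*'*) echo locked ;;
        *)         echo set ;;
    esac
}

shadow_password_state 'marcin:!:20133:0:99999:7:::'
shadow_password_state 'tgriffing::20133:0:99999:7:::'
shadow_password_state 'example:$6$salt$hash:20133:0:99999:7:::'   # made-up entry
```

For the lines above this prints locked, empty, and set, which makes it easy to spot accounts that still need a password set with <code>passwd</code>.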
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
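The label check can be scripted so it does not depend on reading the output by eye. This sketch reads <code>fdisk -l</code> output on stdin and maps the label to a scheme name; it accepts both the older "Disk label type:" wording shown above and the "Disklabel type:" wording that newer fdisk versions print. <code>partition_scheme</code> is a hypothetical helper name:

```shell
#!/bin/sh
# partition_scheme
# Reads `fdisk -l` output on stdin; prints MBR, GPT, unknown, or the raw label.
partition_scheme() {
    label=$(sed -n 's/^Disk *label type: *//p' | head -n 1)
    case $label in
        dos) echo MBR ;;
        gpt) echo GPT ;;
        '')  echo unknown ;;
        *)   echo "$label" ;;
    esac
}

# Example using two lines from the /dev/sda output above
printf '%s\n' \
    'Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors' \
    'Disk label type: dos' \
    | partition_scheme
```

On a live host the input would be something like <code>fdisk -l /dev/sda | partition_scheme</code>, run as root.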
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but better to do this early in Germany time..
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>
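The local times quoted in the email can be double-checked with simple offset arithmetic. This sketch assumes whole-hour offsets that are valid for late April: UTC+2 for Germany (CEST), UTC-5 for Ecuador, and UTC-5 for FeF (US Central Daylight Time). <code>utc_hour_to_local</code> is a hypothetical helper name:

```shell
#!/bin/sh
# utc_hour_to_local UTC_HOUR OFFSET
# UTC_HOUR is 0-23, OFFSET is a whole-hour UTC offset such as 2 or -5.
utc_hour_to_local() {
    printf '%02d:00\n' $(( ($1 + $2 + 24) % 24 ))
}

utc_hour_to_local 10 2    # Germany (assumed UTC+2)
utc_hour_to_local 10 -5   # Ecuador (assumed UTC-5)
utc_hour_to_local 10 -5   # FeF (assumed UTC-5)
```

This prints 12:00, 05:00, and 05:00, matching the times in the email. It deliberately ignores daylight-saving rules, so the offsets need to be rechecked for other dates.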

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and I am just putting out the announcement and recruiting.
>> I can switch to more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode we can try before it becomes destructive = 3
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set innodb_force_recovery back down to 2 and added an innodb_purge_threads = 0 line; I'll try that when the restart is no longer blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
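# for the record, the my.cnf change that got it to start (recovery level 2 plus purge threads 0) isn't pasted above. A minimal sketch of applying it, not the exact edit that was made: it assumes both settings go directly under a literal <code>[mysqld]</code> header, and <code>add_recovery_opts</code> is a made-up helper name

```shell
# Hypothetical helper: print a copy of a my.cnf with the two recovery
# settings inserted directly under the [mysqld] section header.
# Assumption: the file contains a literal "[mysqld]" line.
add_recovery_opts() {
  # $1 = path to my.cnf, $2 = innodb_force_recovery level
  awk -v level="$2" '
    { print }
    /^\[mysqld\][ \t]*$/ {
      print "innodb_force_recovery = " level
      print "innodb_purge_threads = 0"
    }
  ' "$1"
}

# Usage: write to a new file and diff it before replacing the real config
#   add_recovery_opts /etc/my.cnf 2 > /etc/my.cnf.recovery
#   diff /etc/my.cnf /etc/my.cnf.recovery
```

# printing to stdout instead of editing in place keeps the original config untouched until the diff has been reviewed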
# the last 5 lines below are being spammed to the logs over and over; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
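# the <code>grep -vi OK</code> output above splits each problem across two lines (table name, then message). A rough sketch that joins them so the message type (note, warning, error) sits next to the table it belongs to. <code>summarize_mysqlcheck</code> is a hypothetical name, and the parsing assumes the two-line layout shown in the log above

```shell
# Hypothetical helper: list every table mysqlcheck did not report as OK,
# one line per message, as "<db.table><TAB><message>".
summarize_mysqlcheck() {
  # $1 = path to a log written by `mysqlcheck --all-databases`
  awk '
    # "db.table   OK" means the table passed; nothing to report
    /[ \t]OK[ \t]*$/ { next }
    # message lines start with a type such as "note :" or "warning :"
    /^[ \t]*(note|warning|error|status|info)[ \t]*:/ {
      sub(/^[ \t]+/, "")
      print table "\t" $0
      next
    }
    # any other non-empty line names the table the next messages belong to
    NF { table = $1 }
  ' "$1"
}

# Usage:
#   summarize_mysqlcheck mysqlcheck.20250418.log | cut -f1 | cut -d. -f1 | sort -u
# prints just the database names that have at least one non-OK table
```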
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_s_db |
| osewiki_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
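# to restore only the dropped databases, each one first has to be cut out of an <code>--all-databases</code> dump. A minimal sketch, assuming the dump contains the <code>-- Current Database: `name`</code> comment that mysqldump normally writes before each database section (it would be absent if comments were disabled). <code>extract_db_from_dump</code> is a hypothetical helper, not something that was run here

```shell
# Hypothetical helper: print only one database's section of a gzipped
# `mysqldump --all-databases` file.
extract_db_from_dump() {
  # $1 = gzipped dump, $2 = database name
  gzip -dc "$1" | awk -v db="$2" '
    # each database section starts with this marker comment;
    # start keeping lines at our marker, stop at the next one
    /^-- Current Database: `/ {
      keep = ($0 == "-- Current Database: `" db "`")
    }
    keep { print }
  '
}

# Usage (the backup file name is a placeholder):
#   extract_db_from_dump some_daily_backup.sql.gz osemain_db > osemain_db.sql
```

# note that this skips the session settings at the very top of the dump (character set and so on), and the last database's section also carries the dump footer, so the extracted file should be reviewed before it is piped into mysql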
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time at level 1. This time it did start, but it then crashes and restarts in a loop, which gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
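# to confirm it's the same assertion tripping on every restart, rather than reading each stack trace by eye, something like this could count them per source location. Just a sketch: <code>count_assertions</code> is a made-up name, and the pattern assumes the "Assertion failure in thread ... in file ... line ..." wording seen in the log above

```shell
# Hypothetical helper: count InnoDB assertion failures in an error log,
# grouped by "<source file>:<line>".
count_assertions() {
  # $1 = path to the MariaDB error log
  sed -n 's/.*InnoDB: Assertion failure in thread [0-9][0-9]* in file \([^ ]*\) line \([0-9][0-9]*\).*/\1:\2/p' "$1" \
    | sort | uniq -c
}

# Usage:
#   count_assertions /var/log/mariadb/mariadb.log
```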
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
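# lesson here: <code>systemctl status</code> saying "active (running)" only reflects the mysqld_safe wrapper, not whether mysqld stays up. A small sketch of a stricter check that probes several times in a row and fails on the first miss. <code>stays_reachable</code> is a hypothetical helper; the <code>mysqladmin ping</code> probe in the usage comment is an assumption about what to probe with, it wasn't used in this session

```shell
# Hypothetical helper: succeed only if the probe command succeeds on
# every one of N consecutive attempts (so a crash-looping server that
# is up for a moment does not count as healthy).
stays_reachable() {
  # $1 = number of probes, $2 = seconds between probes, rest = probe command
  local probes="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$probes" ]; do
    "$@" >/dev/null 2>&1 || return 1
    i=$((i + 1))
    sleep "$delay"
  done
  return 0
}

# Usage:
#   stays_reachable 10 1 mysqladmin -uroot -p"$mysqlPass" ping \
#     && echo "db is really up" || echo "db is flapping or down"
```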
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
# well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, two of them are missing: the dump includes mysql, but not information_schema or performance_schema
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
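# rather than eyeballing that list, the database names in a dump can be diffed against an expected list. A rough sketch with made-up helper names (<code>dbs_in_dump</code>, <code>missing_from_dump</code>); it assumes the <code>CREATE DATABASE /*!32312 IF NOT EXISTS*/</code> line format shown above. As far as I know, mysqldump skips information_schema and performance_schema with <code>--all-databases</code> by design, so those two showing up as missing would be expected

```shell
# Hypothetical helper: print the database names found in a gzipped dump,
# one per line, sorted.
dbs_in_dump() {
  # $1 = gzipped mysqldump file
  gzip -dc "$1" \
    | sed -n 's/^CREATE DATABASE .*IF NOT EXISTS\*\/ `\([^`]*\)`.*/\1/p' \
    | sort -u
}

# Hypothetical helper: print the expected databases that the dump lacks.
missing_from_dump() {
  # $1 = gzipped dump, $2 = file with expected database names, one per line
  tmp=$(mktemp) || return 1
  dbs_in_dump "$1" > "$tmp"
  # comm -23 keeps lines that appear only in the first (expected) list
  sort -u "$2" | comm -23 - "$tmp"
  rm -f "$tmp"
}

# Usage (expected.txt could come from `mysql -N -e 'SHOW DATABASES'`):
#   missing_from_dump mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz expected.txt
```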
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to back up
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# the mysql dump looks good (716K); the other two are just 4.0K stubs left behind by the failed dumps
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
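# side note: because each mysqldump above is piped into gzip, the shell only sees gzip's exit status, so a failed dump still leaves a small .sql.gz behind. A minimal sketch of a wrapper that catches this, assuming bash (for <code>pipefail</code>); <code>dump_db</code> and <code>DUMP_CMD</code> are hypothetical names that are not part of our backup scripts, and credentials are assumed to come from <code>DUMP_CMD</code> or a client config file

```shell
#!/usr/bin/env bash
# Sketch only: dump one database to a gzip file and fail loudly if the dump
# itself fails, instead of leaving a near-empty archive behind.
# DUMP_CMD is a hypothetical override hook; by default it calls mysqldump.
DUMP_CMD="${DUMP_CMD:-mysqldump}"

dump_db() {
  local db="$1" out="$2"
  # Run the pipeline in a subshell so pipefail does not leak into the caller.
  # DUMP_CMD is intentionally unquoted so it may carry extra arguments.
  if ( set -o pipefail; $DUMP_CMD "$db" | gzip -c > "$out" ); then
    echo "OK: $db -> $out"
  else
    echo "FAILED: $db (removing $out)" >&2
    rm -f "$out"
    return 1
  fi
}
```

# with something like this, the information_schema and performance_schema attempts would have returned non-zero and left no file, rather than producing archives that look like backups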
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
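# for future reference, the move-aside steps above can be sketched as one function. This is a minimal sketch, assuming GNU coreutils (for <code>--reference</code>) and that the service has already been stopped; <code>set_aside_dir</code> is a hypothetical helper name

```shell
#!/usr/bin/env bash
# Sketch only: move a directory aside and create an empty replacement with
# the same owner, group and mode as the original.
# Requires GNU chown/chmod for the --reference option.
set_aside_dir() {
  local dir="${1%/}" suffix="$2"      # strip a trailing slash, if any
  local backup="${dir}.${suffix}"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  [ ! -e "$backup" ] || { echo "already exists: $backup" >&2; return 1; }
  mv "$dir" "$backup" || return 1
  mkdir "$dir" || return 1
  # copy ownership and permissions from the directory we just moved aside
  chown --reference="$backup" "$dir" || return 1
  chmod --reference="$backup" "$dir" || return 1
  echo "$backup"
}
```

# usage would look like <code>systemctl stop mariadb && set_aside_dir /var/lib/mysql 20250418</code>; copying owner and mode from the old directory avoids hard-coding mysql:mysql and 0755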
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't login
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
</pre>
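# the SQL part of the reset above can be generated rather than typed by hand. A minimal sketch, assuming the MariaDB 5.5 grant table layout used above (a <code>Password</code> column in <code>mysql.user</code>) and default backslash escaping in string literals; <code>root_reset_sql</code> is a hypothetical helper name

```shell
#!/usr/bin/env bash
# Sketch only: print the SQL that resets the root password on a server that
# was started with --skip-grant-tables. Backslashes and single quotes in the
# password are escaped so they survive inside the SQL string literal.
root_reset_sql() {
  local pass
  pass="$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g")"
  printf "UPDATE mysql.user SET Password=PASSWORD('%s') WHERE User='root';\n" "$pass"
  printf "FLUSH PRIVILEGES;\n"
}
```

# the output would be piped into the client while mysqld_safe is running with grants skipped, e.g. <code>root_reset_sql "$mysqlPass" | mysql -u root</code>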
# now I can see the three system databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot`; that would be better for him than logging into the Hetzner WUI and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin access to *just* the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
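# an alternative to editing /etc/sudoers directly would be a drop-in file, since the <code>#includedir /etc/sudoers.d</code> line is already there. A minimal sketch of generating the same rule; <code>sudoers_rule</code> is a hypothetical helper name, and the drop-in filename in the usage line below is also made up

```shell
#!/usr/bin/env bash
# Sketch only: emit a sudoers rule that lets one user run exactly one
# command as root without a password prompt.
sudoers_rule() {
  local user="$1" cmd="$2"
  case "$cmd" in
    /*) ;;   # sudoers rules should name commands by absolute path
    *) echo "command must be an absolute path: $cmd" >&2; return 1 ;;
  esac
  printf '%s ALL = NOPASSWD: %s\n' "$user" "$cmd"
}
```

# e.g. <code>sudoers_rule marcin /sbin/reboot > /etc/sudoers.d/marcin_reboot</code>, followed by <code>visudo -cf /etc/sudoers.d/marcin_reboot</code> to syntax-check the file before relying on it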
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here are our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
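# the serial lookup above can be wrapped up so it's easy to repeat after a drive is swapped. A minimal sketch; <code>serial_short</code> and <code>list_serials</code> are hypothetical helper names

```shell
#!/usr/bin/env bash
# Sketch only: pull the short serial number out of
# `udevadm info --query=property` output read from stdin.
serial_short() {
  awk -F= '$1 == "ID_SERIAL_SHORT" { print $2 }'
}

# Print "<device> <serial>" for each device name given, e.g. sda sdb.
list_serials() {
  local dev
  for dev in "$@"; do
    printf '%s %s\n' "$dev" \
      "$(udevadm info --query=property --name "$dev" | serial_short)"
  done
}
```

# e.g. <code>list_serials sda sdb</code> before and after the swap, to confirm which physical drive was actually replaced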
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to set up grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
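# to check the whole grub.cfg rather than eyeballing one menuentry, the RAID references can be listed with grep. A minimal sketch; <code>grub_md_refs</code> is a hypothetical helper name

```shell
#!/usr/bin/env bash
# Sketch only: print each distinct mduuid/... reference found in a
# grub.cfg-style file, one per line.
grub_md_refs() {
  grep -o "mduuid/[0-9a-f]*" "$1" | sort -u
}
```

# e.g. <code>grub_md_refs /boot/grub2/grub.cfg</code>; if only mduuid entries come back (and the greps for the drive serials find nothing), the config is tied to the array rather than to either physical disk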
# right, so if I understand this correctly: we're not updating the grub config (that lives in /boot, which is on the RAID). We're using 'grub-install' (named grub2-install on CentOS 7) to write the grub boot code *to* the new drive. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>
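# since the disk is filling up, a quick threshold check on the df output is handy. A minimal sketch, assuming POSIX-format <code>df -P</code> output and mount points without spaces; <code>mounts_over</code> is a hypothetical helper name

```shell
#!/usr/bin/env bash
# Sketch only: read `df -P` output on stdin and print every mount point whose
# usage is at or above the given percentage, as "<mount> <use>%".
mounts_over() {
  local limit="$1"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "80%" -> "80"
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}
```

# e.g. <code>df -P | mounts_over 80</code>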

=Thu Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I hadn't touched ose infra since 6 days ago
# the wiki is still on hetzner2, which is on EOL CentOS, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
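# for a quick up/down check, only the final status code after the redirect matters. A minimal sketch that pulls it out of <code>curl -iL</code> output; <code>last_http_status</code> is a hypothetical helper name, and it assumes the response body has no line starting with "HTTP/"

```shell
#!/usr/bin/env bash
# Sketch only: read `curl -iL` output on stdin and print the status code of
# the last response in the redirect chain.
last_http_status() {
  awk '/^HTTP\// { code = $2 } END { print code }'
}
```

# e.g. <code>curl -siL wiki.opensourceecology.org | last_http_status</code>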
# disk space is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
# there's also no new logs in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but there's a bunch written to the actual log file that the journal log mentions (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from 13 hours ago and one from 27 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it keeps breaking since some months ago (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
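A minimal sketch for confirming that the local copies match the originals before touching the database. It assumes GNU <code>sha256sum</code> is installed; <code>verify_copy</code> is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: compare a copied file against its source by SHA-256.
# verify_copy is a hypothetical helper name (not part of rsync or coreutils).
verify_copy() {
  local src="$1" dst="$2" src_sum dst_sum
  if [ ! -f "$src" ] || [ ! -f "$dst" ]; then
    echo "missing file: $src or $dst" >&2
    return 2
  fi
  # Read via stdin so only the hash (not the file name) is compared
  src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
  dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1)
  if [ "$src_sum" = "$dst_sum" ]; then
    echo "OK $dst"
  else
    echo "MISMATCH $dst" >&2
    return 1
  fi
}
```

It could be run over both backups with something like <code>for f in /home/b2user/sync*/*.gpg; do verify_copy "$f" "/var/tmp/$(basename "$f")"; done</code>. Hashing every file twice is slow on 20G+ archives, but it rules out a silently bad copy.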
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
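Following the advice above that a forced-recovery instance should only be used for <code>mysqldump</code>, here is a rough sketch of dumping one database per file, so that a failure in one database is recorded without throwing away the dumps that already succeeded. It assumes the server is already up with <code>innodb_force_recovery</code> set and that credentials come from <code>~/.my.cnf</code>. The name <code>dump_each_db</code> and its optional second argument (a replacement dump command, handy for dry runs) are made up for illustration.

```shell
#!/bin/sh
# Sketch: dump each database named on stdin into its own .sql file.
# dump_each_db is a hypothetical helper; $2 optionally replaces mysqldump.
dump_each_db() {
  local outdir="$1" dumper="${2:-mysqldump}" db failed=0
  mkdir -p "$outdir" || return 2
  while IFS= read -r db; do
    case "$db" in
      # Skip blank lines and the virtual schemas that hold no user data
      ''|information_schema|performance_schema) continue ;;
    esac
    if "$dumper" --databases "$db" > "$outdir/$db.sql"; then
      echo "ok: $db"
    else
      echo "FAILED: $db"
      failed=1
    fi
  done
  # Non-zero if any database could not be dumped
  return "$failed"
}
```

A possible invocation: <code>mysql -N -e 'SHOW DATABASES' | dump_each_db /var/tmp/dbFail.20250417/dumps</code>. If mysqld itself crashes mid-dump, every later database will show as FAILED until it is restarted.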
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running MariaDB 5.5.68 (roughly the MySQL 5.5/5.6 era), so the matching doc should be this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1 – apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up, put it on a separate machine, and do the recovery *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that recovery modes 1, 2, and 3 are relatively safe: at worst, only some data on corrupt individual pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given the option to hot swap while the system is on or to have it shut down first. I'm going to say shutdown. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
# I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I ask them to replace? Can I see which one is more fucked than the other?
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
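Since <code>smartctl -H</code> printed "No failed Attributes found", it helps to pull out just the rows whose WHEN_FAILED column is not <code>-</code>. A small sketch, assuming the ten-column ATA attribute table shown above; <code>failing_smart_attrs</code> is a hypothetical name.

```shell
#!/bin/sh
# Sketch: filter `smartctl -A` output (read from stdin) down to attributes
# whose WHEN_FAILED column (9th field) is set. Hypothetical helper name.
failing_smart_attrs() {
  # Attribute rows start with a numeric ID; the header row starts with "ID#"
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}
```

Usage would be <code>smartctl -A /dev/sda | failing_smart_attrs</code>. This only applies to the ATA-style table; the NVMe output has a different layout.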
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner3 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
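The <code>[UU]</code> versus <code>[U_]</code> check can be scripted. A minimal sketch, assuming the usual <code>/proc/mdstat</code> layout shown above; <code>md_status</code> is a hypothetical helper name, and the optional argument (an alternate file to read) exists only so it can be tried on saved output.

```shell
#!/bin/sh
# Sketch: report each md array as ok or DEGRADED based on its [UU]-style flags.
# md_status is a hypothetical helper; $1 defaults to /proc/mdstat.
md_status() {
  awk '
    /^md[0-9]+ :/ { name = $1 }            # remember which array we are in
    match($0, /\[[U_]+\]/) {               # the member-state flags, e.g. [UU]
      flags = substr($0, RSTART, RLENGTH)
      if (index(flags, "_")) { print name, "DEGRADED", flags; bad = 1 }
      else                   { print name, "ok", flags }
    }
    END { exit (bad ? 1 : 0) }             # non-zero if any array is degraded
  ' "${1:-/proc/mdstat}"
}
```

Run as <code>md_status</code> on the server; the exit code is non-zero when any array has a missing member, which makes it easy to hook into an alert.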
# ah crap, the process to bring the new drive back into the RAID is not-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back-up.
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
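For the swap request it is handy to have device and serial side by side. A small sketch built on the same <code>udevadm</code> call as above; <code>serial_from_props</code> and <code>disk_serials</code> are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: print "device<TAB>short serial" for each disk given as an argument.
# Both function names are hypothetical helpers.

# Extract ID_SERIAL_SHORT from `udevadm info --query=property` output on stdin
serial_from_props() {
  awk -F= '$1 == "ID_SERIAL_SHORT" { print $2 }'
}

disk_serials() {
  local dev serial
  for dev in "$@"; do
    serial=$(udevadm info --query=property --name "$dev" | serial_from_props)
    printf '%s\t%s\n' "$dev" "${serial:-unknown}"
  done
}
```

For example, <code>disk_serials sda sdb</code> should list both drives, which makes it harder to paste the wrong serial into the ticket.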
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
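Since the recovery level may need to be stepped up one value at a time, the edit above can be scripted. This is only a sketch: it assumes the config has a <code>[mysqld]</code> section header on its own line, and <code>set_force_recovery</code> plus the <code>ALLOW_RISKY</code> variable are invented here as a guard against going above 3 by accident.

```shell
#!/bin/sh
# Sketch: set innodb_force_recovery in a my.cnf-style file.
# set_force_recovery and ALLOW_RISKY are hypothetical, not MariaDB features.
set_force_recovery() {
  local cnf="$1" level="$2" bak tmp rc
  case "$level" in
    [0-6]) ;;
    *) echo "level must be 0-6" >&2; return 2 ;;
  esac
  # Levels above 3 can corrupt data further, so require an explicit opt-in
  if [ "$level" -gt 3 ] && [ "${ALLOW_RISKY:-0}" != "1" ]; then
    echo "refusing level $level (>3) without ALLOW_RISKY=1" >&2
    return 1
  fi
  grep -q '^\[mysqld\]' "$cnf" || { echo "no [mysqld] section in $cnf" >&2; return 4; }
  # Keep the first pristine copy of the day, like my.cnf.20250417 above
  bak="$cnf.$(date +%Y%m%d)"
  [ -e "$bak" ] || cp -p "$cnf" "$bak" || return 3
  tmp=$(mktemp) || return 3
  awk -v lvl="$level" '
    /^[ \t]*innodb_force_recovery[ \t]*=/ { next }   # drop any previous value
    { print }
    /^\[mysqld\][ \t]*$/ { print "innodb_force_recovery = " lvl }
  ' "$cnf" > "$tmp" && cat "$tmp" > "$cnf"           # cat keeps owner and mode
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

Something like <code>set_force_recovery /etc/my.cnf 2 && systemctl restart mariadb</code> would then move to the next level, and level 0 puts the setting back to normal once the dumps are done.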
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2; this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously scattered his documentation about backups, luks, veracrypt, pgp, general cybersec, etc across a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be able to find them more easily in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there's too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
# so 13107200 / 1024 / 1024 = 12.5 MB.
# jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
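# a quick sketch of that arithmetic, plus a wider grep. Note that the 413 message above quotes a limit of 1000000 and says "no files", which doesn't match the 13107200 set by SecRequestBodyLimit. One possibility (not confirmed anywhere in this log) is that a separate SecRequestBodyNoFilesLimit directive is the one being hit; a grep for 'BodyLimit' can't match that name. The helper names bytes_to_mib and list_body_limits are made up for illustration.
```shell
# convert a byte count to MiB, two decimal places
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024 / 1024 }'
}

# list every ModSecurity request-body limit directive under a config dir;
# the pattern also matches SecRequestBodyNoFilesLimit and
# SecRequestBodyInMemoryLimit, which a grep for 'BodyLimit' misses
list_body_limits() {
  grep -riE 'SecRequestBody[A-Za-z]*Limit' "$1"
}

bytes_to_mib 13107200   # the SecRequestBodyLimit value found above -> 12.50
bytes_to_mib 1000000    # the limit quoted in the 413 error -> 0.95
```
# running list_body_limits against the httpd config dir (and any vhost includes) would show whether a second, smaller limit is configured somewhere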
# ...
# so Marcin is stressing urgency to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
# I didn't want to do this site before the other less-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3
# oh, crap, I'm going to remove

Maltfield Log/2025 Q2

2025-04-27T21:57:09Z

Maltfield: Apr 24

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

__TOC__

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]

=Thu Apr 24, 2025=
# it's 05:00; I tried to login to the wiki, but I got an error
<pre>
There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Go back to the previous page, reload that page and then try again.
</pre>
# oh, under that it says I'm already logged-in?
<pre>
You are already logged in as Maltfield. Use the form below to log in as another user.
</pre>
# anyway, let's start the CHG to replace the failing disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb
# I confirmed that the RAID looks healthy, and our daily backups finished a few hours ago
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
523712 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
33521664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# source /root/backups/backup.settings
[root@opensourceecology ~]# ${RCLONE} ls "b2:${B2_BUCKET_NAME}" | grep $(date "+%Y%m%d")
20144027578 daily_hetzner3_20250424_074924.tar.gpg
[root@opensourceecology ~]#

[root@opensourceecology ~]# date -u
Thu Apr 24 10:06:52 UTC 2025
[root@opensourceecology ~]#
</pre>
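# for future CHGs, the "is the RAID healthy?" check above could be scripted. A minimal sketch, assuming the /proc/mdstat layout shown above; degraded_arrays is a hypothetical helper name, and all it does is look for an underscore in the [UU] status field
```shell
# print the name of every md array whose member-status field (e.g. [U_])
# contains an underscore, meaning at least one member is missing
degraded_arrays() {
  awk '
    /^md[0-9]+ :/       { name = $1 }
    /\[[U_]*_[U_]*\]/   { print name }
  ' "${1:-/proc/mdstat}"
}
```
# with no argument it reads /proc/mdstat; an empty result means every array reports all members up, and anything else is a list of degraded arrays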
# I tried to remove the first partition from the RAID, but it said I can't?
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#
</pre>
# apparently the docs say that if the RAID is healthy, you have to force it with '--fail' https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
# crap, I realized I have an issue in my CHG (we need two sysadmins for peer review *sigh*)
## I listed this
<pre>
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3
</pre>
## but it should be this
<pre>
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md1 -r /dev/sdb2
mdadm /dev/md2 -r /dev/sdb3
</pre>
# anyway, it looks like I first need to execute this, to force the RAID into a failure state
<pre>
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
</pre>
# ok, I was able to remove it
<pre>
[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1](F)
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1](F)
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1](F)
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
[root@opensourceecology ~]#

[root@opensourceecology ~]# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
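# the CHG typo above (md1 paired with sdb3) is the kind of thing a tiny generator avoids. A sketch only: print_detach_cmds is a hypothetical helper, the md0/md1/md2 to partition 1/2/3 mapping is copied from the /proc/mdstat output above, and it only prints the commands so they can be reviewed before anything is run
```shell
# print (not run) the fail + remove commands for every partition of one disk
print_detach_cmds() {
  disk="$1"
  for pair in md0:1 md1:2 md2:3; do
    md="${pair%%:*}"     # array name, e.g. md0
    part="${pair##*:}"   # partition number, e.g. 1
    echo "mdadm --manage /dev/${md} --fail /dev/${disk}${part}"
    echo "mdadm /dev/${md} -r /dev/${disk}${part}"
  done
}

print_detach_cmds sdb
```
# this interleaves fail and remove per array, where I ran all three fails first; either order matches the hetzner doc's requirement that a member be marked faulty before it's removed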
# by 10:32 UTC, I submitted the request to hetzner to replace /dev/sdb = "Crucial_CT250MX200SSD1_154410FA4520"
# it says they should do it within 2-4 hours
# meanwhile, I updated https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# at 08:00 my time, I checked and saw that we had an email come from hetzner at 06:36 (my time)
<pre>
Dear Client,

we've replaced the drive via hotswap as wished.

The second drive was unfortunately also briefly disconnected as there was a wrong physical label on it.

If you have any further questions or problems, feel free to contact us again.
</pre>
# well, crap. I tried to load the wiki CHG article, but there's an error
<pre>
Sorry! This site is experiencing technical difficulties.

Try waiting a few minutes and reloading.

(Cannot access the database)
</pre>
# the server wasn't shut down, and my screen session is still intact, but dmesg is being flooded with RAID and io errors
<pre>
...
[11136.011313] md: super_written gets error=-5, uptodate=0
[11136.011372] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11136.319267] md: super_written gets error=-5, uptodate=0
[11136.319322] md: super_written gets error=-5, uptodate=0
[11138.827642] EXT4-fs error: 5 callbacks suppressed
[11138.827693] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
[11138.827793] EXT4-fs: 5 callbacks suppressed
[11138.827841] EXT4-fs (md2): previous I/O error to superblock detected
[11138.835255] md: super_written gets error=-5, uptodate=0
[11138.835311] md: super_written gets error=-5, uptodate=0
[11138.835367] Buffer I/O error on dev md2, logical block 0, lost sync page write
[11138.835472] EXT4-fs error (device md2): ext4_find_entry:1318: inode #6819864: comm postdrop: reading directory lblock 0
...
</pre>
# well anyway, I'll see if I can at least restart the RAID sync and install grub on the new disk
# son of a bitch, they removed the wrong drive!
<pre>
[root@opensourceecology ~]# date -u
Thu Apr 24 13:05:32 UTC 2025
[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 32G 0 part
├─sdc2 8:34 0 512M 0 part
└─sdc3 8:35 0 200.4G 0 part
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
device node not found
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
[root@opensourceecology ~]#
</pre>
# it shows a new drive (sdc) and an old drive (sdb)
# ugh, so now we have nothing in the raid?
# here's the new drive
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sdc | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
</pre>
# christ, so this new disk is half the size of our actual disk? what did they do?!?
# and now we have a prod server online with no redundancy. I can't tell them to put back-in the *correct* disk, or we'll have data loss
# I'm going to stop all the web services before this disaster gets any worse
# great; io errors. this is a damn disaster
<pre>
[root@opensourceecology ~]# systemctl stop nginx
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop varnish
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
[root@opensourceecology ~]#
[root@opensourceecology ~]# systemctl stop apache2
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error
Failed to stop apache2.service: Unit apache2.service not loaded.
[root@opensourceecology ~]#
</pre>
# I went ahead and made partition backups, anyway
# wait, actually, it said that /dev/sdc = Crucial_CT250MX200SSD1_154410FA336C. That's our old /dev/sda
# so they *did* remove the right drive, but the re-insertion of the wrong drive pushed /dev/sda to /dev/sdc. That kinda breaks our ability to map the RAID, but let's at-least partition this new drive
# but this new drive isn't the right size. it's 512G while our old disk was 250G. I guess it's better to have too-big of a disk than too-small of a disk, but we won't be able to use that extra disk space. I'm going to assume that they just didn't have 250G disks in-stock anymore.
# anyway, I tried to back up the partitions, but that wouldn't work since we're read-only
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
mkdir: cannot create directory ‘/var/tmp/chg.20250424_132010’: Read-only file system
[root@opensourceecology ~]# chown root:root $chg_dir
chown: cannot access ‘/var/tmp/chg.20250424_132010’: No such file or directory
[root@opensourceecology ~]#
</pre>
# I don't know what to do besides giving it a reboot, but that scares me
# I'd like to take a backup, but I can't if I get read-only errors :(
# well, I guess that's why we made a backup before this. I don't think I have any option other than to reboot. and pray that grub is intact to bring it back.
# I gave it a reboot. If it doesn't come back, I'll try to boot to the rescue CD from within the hetzner wui
<pre>
[root@opensourceecology ~]# date && reboot
Thu Apr 24 13:24:18 UTC 2025
/usr/bin/pkttyagent: error while loading shared libraries: /lib64/libpolkit-agent-1.so.0: cannot read file data: Input/output error

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!

Failed to start reboot.target: Unit is not loaded properly: Input/output error.
See system logs and 'systemctl status reboot.target' for details.

Broadcast message from maltfield@opensourceecology.org on pts/4 (Thu 2025-04-24 13:24:18 UTC):

The system is going down for reboot NOW!
</pre>
# wtf, it can't even reboot it's so broken.
# I triggered a reset on the hetzner wui
# the server came back, and I immediately shut down all services again
<pre>
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop apache2
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop mariadb
[root@opensourceecology ~]#
</pre>
# I went ahead and triggered backups
<pre>
[root@opensourceecology ~]# cat /etc/cron.d/backup_to_backblaze
20 07 * * * root time /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log
20 04 03 * * root time /bin/nice /root/backups/backupReport.sh
[root@opensourceecology ~]#

[root@opensourceecology ~]# time /root/backups/backup.sh &>> /var/log/backups/backup.log
</pre>
# ok, sdc is gone. we have sda and sdb again, and sda is our original sda – as we wanted
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Micron_1100_MTFDDAK512TBN_171416BD4379
ID_SERIAL_SHORT=171416BD4379
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
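# since the kernel names moved around during the hotswap (the old sda showed up as sdc until the reboot), it's safer to identify disks by serial than by sdX name. A rough sketch using the same udevadm query as above; serial_from_props and list_disk_serials are hypothetical helper names
```shell
# pull the ID_SERIAL value out of `udevadm info --query=property` output
# read from stdin (ID_SERIAL_SHORT is deliberately not matched)
serial_from_props() {
  sed -n 's/^ID_SERIAL=//p'
}

# print "device serial" for every whole disk that currently exists
list_disk_serials() {
  for dev in /dev/sd[a-z]; do
    [ -b "$dev" ] || continue
    echo "$dev $(udevadm info --query=property --name "$dev" | serial_from_props)"
  done
}
```
# calling list_disk_serials before and after a hotswap makes it obvious which physical disk each sdX name points at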
# I made a backup of the partitions; it's not surprising the sdb file is empty
<pre>
[root@opensourceecology ~]# stamp=$(date "+%Y%m%d_%H%M%S")
[root@opensourceecology ~]# chg_dir=/var/tmp/chg.$stamp
[root@opensourceecology ~]# mkdir $chg_dir
[root@opensourceecology ~]# chown root:root $chg_dir
[root@opensourceecology ~]# chmod 0700 $chg_dir
[root@opensourceecology ~]# pushd $chg_dir
/var/tmp/chg.20250424_133230 ~
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sda > ${chg_dir}/sda_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]# sfdisk --dump /dev/sdb > ${chg_dir}/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
[root@opensourceecology chg.20250424_133230]# du -sh ${chg_dir}/*
4.0K /var/tmp/chg.20250424_133230/sda_parttable_mbr.bak
0 /var/tmp/chg.20250424_133230/sdb_parttable_mbr.bak
[root@opensourceecology chg.20250424_133230]#
</pre>
# I copied the partition table from sda to sdb
<pre>
[root@opensourceecology chg.20250424_133230]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 62260 cylinders, 255 heads, 63 sectors/track
sfdisk: /dev/sdb: unrecognized partition table type

Old situation:
sfdisk: No partitions found

New situation:
Units: sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdb1 2048 67110912 67108865 fd Linux raid autodetect
/dev/sdb2 67112960 68161536 1048577 fd Linux raid autodetect
/dev/sdb3 68163584 488395120 420231537 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
[root@opensourceecology chg.20250424_133230]#
</pre>
# that looked good, other than the complaint about not being able to boot from this disk; I'll check later what is LILO and if this will matter for raid grub
# I reloaded the partition table for this disk
<pre>
[root@opensourceecology chg.20250424_133230]# blockdev --rereadpt /dev/sdb
[root@opensourceecology chg.20250424_133230]#
</pre>
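# a possible sanity check after copying the partition table: dump both disks again and compare the layouts with the device name masked out. This is only a sketch; normalize_dump and same_layout are hypothetical helpers, and they assume the only expected difference between the two dumps is the /dev/sdX name
```shell
# mask the kernel device name in an `sfdisk -d` dump read from stdin
normalize_dump() {
  sed 's#/dev/sd[a-z]#/dev/sdX#g'
}

# succeed only if the two dump files describe the same layout
same_layout() {
  a=$(normalize_dump < "$1")
  b=$(normalize_dump < "$2")
  [ "$a" = "$b" ]
}
```
# usage would be something like: sfdisk -d /dev/sda > sda.dump; sfdisk -d /dev/sdb > sdb.dump; same_layout sda.dump sdb.dump && echo "layouts match"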
# I added the new disk to the RAID, and it shows that it's starting to sync now. excellent
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda1[0]
33521664 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md0 -a /dev/sdb1
mdadm: added /dev/sdb1
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md1 -a /dev/sdb2
mdadm: added /dev/sdb2
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# mdadm /dev/md2 -a /dev/sdb3
mdadm: added /dev/sdb3
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.0% (19712/33521664) finish=481.1min speed=1159K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, it looks like it's not syncing each partition of the RAID at the same time. it's doing md0 now and then it'll do the others after, I guess
# md0 is partition 1 (sda1/sdb1). That's *sigh* swap. It's 32GB.
# I kinda wish we'd sync'd /boot first. I don't think I can install grub until that's sync'd. maybe?
# it says it's moving about 1024K/s. That's 1 MB per sec. 32G*1024 = 32,768 MB. That's 32,768 seconds / 60 = 546 minutes / 60 = 9 hours. Just for swap!
# assuming we have the same speed for the rest of the disk, that's 250 G * 1024 = 256,000 MB / 1 MB/s = 256,000 seconds. 256,000 seconds / 60 = 4,266.67 minutes / 60 = 71.11 hours. I guess we just have to accept the risk and hope that old /dev/sda with all our data doesn't fail within the next 3 days.
# I tried to go ahead and install grub on the new disk, but i got a command not found error
<pre>
[root@opensourceecology chg.20250424_133230]# grub-install /dev/sdb
-bash: grub-install: command not found
[root@opensourceecology chg.20250424_133230]#

[root@opensourceecology chg.20250424_133230]# grub
grub2-bios-setup grub2-glue-efi grub2-mkconfig grub2-mkpasswd-pbkdf2 grub2-probe grub2-set-default
grub2-editenv grub2-install grub2-mkfont grub2-mkrelpath grub2-reboot grub2-setpassword
grub2-file grub2-kbdcomp grub2-mkimage grub2-mkrescue grub2-render-label grub2-sparc64-setup
grub2-fstest grub2-macbless grub2-mklayout grub2-mkstandalone grub2-rpm-sort grub2-syslinux2cfg
grub2-get-kernel-settings grub2-menulst2cfg grub2-mknetdir grub2-ofpathname grub2-script-check grubby
[root@opensourceecology chg.20250424_133230]#
</pre>
# looks like it should be 'grub2-install' I tried that
<pre>
[root@opensourceecology chg.20250424_133230]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology chg.20250424_133230]#
</pre>
# well, that's two warnings but no errors; I'll take it.
# we're up to 12.4% on the RAID sync of swap. It's now going >50x faster than it was before; good news
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[==>..................] recovery = 12.4% (4168832/33521664) finish=8.2min speed=59264K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
# calculations at that speed would be 250*1024/58 = 4,413.793103448 seconds / 60 = 73 minutes. Oh, that's just over an hour.
# and now we're at 42.7%
<pre>
[root@opensourceecology chg.20250424_133230]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
resync=DELAYED
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 42.7% (14334208/33521664) finish=6.6min speed=47845K/sec

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/1] [U_]
resync=DELAYED

unused devices: <none>
[root@opensourceecology chg.20250424_133230]#
</pre>
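# the back-of-envelope resync estimates above can be reproduced straight from the mdstat numbers. A minimal sketch, assuming the "blocks" figure in /proc/mdstat is in 1 KiB units and the speed is in KiB per second; eta_minutes is a hypothetical helper name
```shell
# minutes needed to copy SIZE_KIB at SPEED_KIB_PER_SEC
eta_minutes() {
  awk -v size="$1" -v speed="$2" 'BEGIN { printf "%.1f\n", size / speed / 60 }'
}

eta_minutes 33521664 1159     # md0 (swap) at the initial speed
eta_minutes 209984640 59264   # md2 (/) at the faster speed seen later
```
# the first call comes out at roughly 482 minutes, in line with the finish=481.1min that mdstat itself reported; mdstat's own figure only counts the blocks still remaining, so it will always be a bit lower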
# backups are still running; I'll let them finish before starting-up the webservers again
# I wrote a status email to Marcin
# the backups still aren't finished
# I checked on the raid replication, and it shows md0 (swap) and md1 (boot) are both done. Hooray! Now we just need to finish root (/), which is 9.8% done and going at 60 MB/s. Great!
<pre>
Thu Apr 24 14:05:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[=>...................] recovery = 9.8% (20767872/209984640) finish=50.5min speed=62429K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# I gave the grub install a double-tap now that it's synced with the first disk; the output was the same
<pre>
[root@opensourceecology ~]# grub2-install /dev/sdb
Installing for i386-pc platform.
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub2-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
[root@opensourceecology ~]#
</pre>
# the output of lsblk looks much nicer now, too
<pre>
[root@opensourceecology ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 32G 0 part
│ └─md0 9:0 0 32G 0 raid1 [SWAP]
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 200.4G 0 part
└─md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# backups say they're 9% uploaded
<pre>
[root@opensourceecology ~]# tail -f /var/log/backups/backup.log
...
2025/04/24 14:13:48 INFO :
Transferred: 2.210G / 20.472 GBytes, 11%, 2.904 MBytes/s, ETA 1h47m20s
Transferred: 0 / 1, 0%
Elapsed time: 13m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 10% /20.472G, 2.997M/s, 1h43m59s
</pre>
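The ETA in the rclone output above follows from the remaining bytes and the transfer rate. A minimal sketch, assuming rclone's "G" and "M" are binary units (1 G = 1024 M); the helper names <code>eta_seconds</code> and <code>format_eta</code> are hypothetical.

```python
def eta_seconds(total_gib, done_gib, rate_mib_per_s):
    """Seconds left to upload the remainder at the current rate."""
    remaining_mib = (total_gib - done_gib) * 1024
    return remaining_mib / rate_mib_per_s


def format_eta(seconds):
    """Render a duration in the same h/m/s style that rclone prints."""
    hours, rest = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes}m{secs}s"


if __name__ == "__main__":
    # Numbers copied from the 14:13:48 log entry above
    print(format_eta(eta_seconds(20.472, 2.210, 2.904)))
```

With the logged figures this comes out within a second or two of the 1h47m20s that rclone printed; the small gap is expected because the logged values are rounded.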
# I decided to just kill the backup script and manually upload it without the bwlimit, so it'll go-out faster
<pre>
[root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20250424_133017.tar.gpg b2:ose-server-backups
2025/04/24 14:15:20 INFO :
Transferred: 116.500M / 20.472 GBytes, 1%, 1.958 MBytes/s, ETA 2h57m25s
Transferred: 0 / 1, 0%
Elapsed time: 1m0.5s
Transferring:
* daily_hetzner2_20250424_133017.tar.gpg: 0% /20.472G, 5.065M/s, 1h8m35s
</pre>
# meanwhile we're at 24% on the RAID sync
<pre>
Thu Apr 24 14:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.9% (50200448/209984640) finish=101.1min speed=26325K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
# oh, important to note: our new disk doesn't say that it's failing :D
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@opensourceecology ~]#
</pre>
# while the old disk says it's reached 100% of its lifecycle, the new disk says it's at – uhh – 96% of its life? That doesn't sound very good :(
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3445
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 47
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 060 046 000 Old_age Always - 40 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 407132499909
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12839097351
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26313144762

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 3
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 3
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 004 004 000 Old_age Always - 1449
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 061 049 000 Old_age Always - 39 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 600236629947
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 18860233219
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 11828985935
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2470
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 12

[root@opensourceecology ~]#
</pre>
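The attribute rows above can be compared more easily once parsed. A minimal sketch, assuming the whitespace-separated column layout shown in the <code>smartctl -A</code> output; the function names are hypothetical, and treating a THRESH of 000 as "no threshold" is an assumption based on attribute 180 above (VALUE 000, THRESH 000, no failure flag).

```python
def parse_attribute(line):
    """Split one `smartctl -A` attribute row into named fields."""
    # RAW_VALUE may contain spaces (e.g. the temperature row), so cap the split.
    fields = line.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "when_failed": fields[8],
        "raw": fields[9],
    }


def power_on_years(attr):
    """Convert a Power_On_Hours raw value into years of 365 days."""
    return int(attr["raw"]) / 24 / 365


def at_or_below_threshold(attr):
    """True when the normalized VALUE has dropped to THRESH or lower."""
    # Assumption: a THRESH of 000 means the attribute has no threshold.
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


# Rows copied from the output above (sda = old disk, sdb = replacement)
SDA_HOURS = parse_attribute("9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78516")
SDB_HOURS = parse_attribute("9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 52083")
SDA_LIFE = parse_attribute("202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100")
SDB_LIFE = parse_attribute("202 Percent_Lifetime_Remain 0x0030 004 004 001 Old_age Offline - 96")

if __name__ == "__main__":
    for label, hours, life in (("sda", SDA_HOURS, SDA_LIFE), ("sdb", SDB_HOURS, SDB_LIFE)):
        print(label, f"{power_on_years(hours):.2f} years", "failing" if at_or_below_threshold(life) else "ok")
```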
# Shame. I was hoping for at least something <50%. Well, I wonder how long that remaining 4% will last us :/
# ok, backups just finished; let's start the web services
<pre>
[root@opensourceecology ~]# systemctl start mariadb
[root@opensourceecology ~]# systemctl start httpd
[root@opensourceecology ~]# systemctl start varnish
[root@opensourceecology ~]# systemctl start nginx
[root@opensourceecology ~]#
</pre>
# I updated the wiki CHG with a status https://wiki.opensourceecology.org/wiki/Category:CHGs
# And I sent an email to Marcin recommending that he replace /dev/sda with an actual new drive
<pre>
Hey Marcin,

Would you authorize spending €41.18 on a new disk for your server?

Update: Your websites are back online. The RAID is still syncing.

I was a bit disappointed to learn that hetzner replaced a disk with 0% "life left" with a disk with 4% "life left". That's what we get for choosing the free disk replacement..

The "free" option said it would replace it with a "Replacement drive nearly new or used and tested; depends on what is in stock." Obviously they didn't give us a "nearly new" drive..

Your other disk is also at 0% "life left". I was already planning on replacing that one next week too, but I would recommend that you pay for a new drive for this one. The cost listed on the website is €41.18.

Do you authorize me selecting €41.18 for the replacement of /dev/sda on hetzner2?
</pre>
# from the output above, our old drive said it had "Power_On_Hours" of 78516/24/365 = 8.96 years
# and our new drive says Power_On_Hours = 52083/24/365 = 5.95 years. Well that's better, I guess.
# oh wow, the power cycle counts are crazy low; our old disk was only power-cycled 50 times and the new one only 33 times.
# also the SMART data for both of these drives has different keys (not just values). apparently it's very vendor-specific, so some of these comparisons are apples-to-oranges
# right, we're at 69.7% replication on root. I'm going to go make breakfast and check-in again after
# ...
# over lunch, I realized that Marcin's last email was possibly hyperbolic panic
# he's worried that he just kicked-off a marketing campaign (for the apprenticeship), which now links to information on a broken website – where potential applicants can't read the info
# but I think the content actually *is* accessible, just not to Marcin
# when you're logged-into the wiki, the cookies bypass the cache. So, regrettably, when hetzner2's backend is offline, Marcin sees an error
# but I'd bet that the frontpage of all the websites and the recently-published apprenticeship info page that he's published & promoted are still online when he sees that error – for users who are *not* logged-into the site
# but if the backend site is broken for >24 hours, then the cache will cache the errors (not the content)
# as a short-term hack, I recommended that we set up a daily reboot of hetzner2 at 10:40 UTC (a good buffer after the backups finish uploading)
# I asked Marcin if he'd like me to set that up
<pre>
Hey Marcin,

I don't think the situation is as bad as you think.

> We are missing opportunity,
> the announcement is posted, and our servers are down.

Of course I agree it's not good, and we should migrate away from hetzner2 asap. And I do wish I had more bandwidth to finish the migration faster for you.

But you have a varnish cache that caches pages for 24 hours. Even if your backend webserver and database are down, popular pages (like the frontpage of your wiki or a recent article that you've recently promoted) should still load for users.

The big issue isn't marketing and read-only content. The big issue is editing. That's what is breaking.

When you're logged into the wiki, it bypasses the varnish cache. So, even if the wiki appears down to you, the contents of (most) articles viewed in the past 24 hours will be still visible to potential apprenticeship applicants.

The next time you see the websites are down, try loading it from another device where you're not logged-in. You'll probably see that the apprenticeship info is still accessible, even though the backend for the site is down.

As a short-term hack, I recommend setting-up a daily reboot of the server. Backups typically finish before 10:10 UTC. I recommend we add a cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.

The server seems to function for some time after a fresh reboot, and it caches pages for 24 hours. So the first time someone loads a page in the wiki after that reboot, it'll be cached for the entire time that the server is online until its next reboot. I think this will ensure higher availability of your read-only content (eg information about the apprenticeship).

Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
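The caching behaviour described in the email above can be illustrated with a toy model. This is only a sketch of the reasoning, not the server's actual Varnish configuration: the class <code>ToyCache</code>, its <code>fetch</code> method, and the "503" string are all hypothetical, and the model simply returns an error after the 24-hour TTL expires rather than caching that error.

```python
class ToyCache:
    """Toy model of a caching proxy that logged-in sessions bypass."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self.store = {}  # url -> (time stored, body)

    def fetch(self, url, now, logged_in, backend_up):
        # Anonymous requests may be answered from the cache while it is fresh.
        if not logged_in:
            hit = self.store.get(url)
            if hit is not None and now - hit[0] < self.ttl:
                return hit[1]
        # Everything else has to reach the backend.
        if not backend_up:
            return "503 backend down"
        body = f"content of {url}"
        if not logged_in:
            self.store[url] = (now, body)
        return body


if __name__ == "__main__":
    cache = ToyCache()
    page = "/wiki/Apprenticeship"  # hypothetical URL
    cache.fetch(page, now=0, logged_in=False, backend_up=True)         # warms the cache
    print(cache.fetch(page, now=3600, logged_in=False, backend_up=False))  # anonymous visitor
    print(cache.fetch(page, now=3600, logged_in=True, backend_up=False))   # logged-in editor
```

In this model the anonymous visitor still gets the page an hour into an outage while the logged-in editor gets the error, which is the asymmetry the email describes.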
# ...
# I checked-in on the RAID replication status; it's finished
<pre>

Thu Apr 24 15:15:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Thu Apr 24 15:20:59 UTC 2025
Personalities : [raid1]
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
</pre>
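Checking for <code>[2/2] [UU]</code> on every array by eye is easy to get wrong, so here is a minimal sketch that lists the arrays still missing a member. It assumes the <code>/proc/mdstat</code> layout shown above; the function name <code>degraded_arrays</code> and both regexes are my own.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+) : ")
STATE_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that are missing an active member."""
    degraded = []
    current = None
    for raw_line in mdstat_text.splitlines():
        line = raw_line.strip()
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        state = STATE_RE.search(line)
        if state and current:
            expected, active = int(state.group(1)), int(state.group(2))
            if active < expected or "_" in state.group(3):
                degraded.append(current)
            current = None  # only the first status line per array counts
    return degraded


# Trimmed copies of the two snapshots above
AT_1515 = """\
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/1] [U_]
[===================>.] recovery = 96.5% (202794752/209984640) finish=2.5min speed=46324K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]
"""

AT_1520 = """\
md2 : active raid1 sdb3[2] sda3[0]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]
"""

if __name__ == "__main__":
    print(degraded_arrays(AT_1515), degraded_arrays(AT_1520))
```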
# so it looks like I started it just after 13:32 and it finished just before 15:20. So it took just under 2 hours. Great!
# I updated the article with status updates, marking the CHG as completed successfully https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb#2025-04-24_16:18_UTC
# And I sent an email to Marcin & Catarana to let them know it was successful, and asked again about buying a new drive for replacing /dev/sda
<pre>
Update: your new (used) disk is now fully synced with the old (failing) disk.

* https://wiki.opensourceecology.org/wiki/CHG-2025-04-24_replace_hetzner2_sdb

According to SMART data, you now have one failing disk and one not-failing disk.

Your hetzner2 RAID is now healthy, and you have redundancy spread across two mirrored disks again.

Next week I'd like to replace the other failing disk. Please let me know if you approve the purchase of a new disk for its replacement.
</pre>
# Marcin got back to me, approving the purchase of the new disk; I updated the ticket https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
# Note that the price is listed as "at cost" and it says
<pre>
Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
</pre>
# 1,000 hours is fine. That's compared to the 78,516 hours of /dev/sda and 52,083 hours of our "new" /dev/sdb
# but it's a bit concerning that it says it might not be in-stock. I'm going to message them and ask if they can set one aside for us for next week
<pre>
Hi Support,

Can you set-aside a replacement disk for this server?

Our disks' SMART logs indicated that both disks should be replaced. Today we replaced one of the two disks, but the disk that you replaced it with has 4% of its life left, according to SMART data (it has 52,083 hours of operation).

Next week we would like to replace the other disk, and this time we'd like your "at cost" option, to get a disk with <1,000 hours of operation.

But I was a bit concerned when I read this next to the WUI option for "at cost" on your website

> Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.

Specifically what worries me is the "may not be in stock".

Can you please tell us if you have stock now? And if you do, can you please reserve one disk for us for next week?

We don't want to remove a disk from our RAID and plan for downtime, only to discover that you don't have a disk available for us..

Please let us know if you can reserve 1 disk for us for next week.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# I asked Marcin if Wed next week at 11:00 UTC is ok for replacing hetzner2's sda
<pre>
Hey Marcin,

When would be a good time to replace the second disk on hetzner2?

If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next week on Wednesday 2025-04-30 11:00 UTC, which is:

* 13:00 in Germany (where the server lives)
* 06:00 here in Ecuador, and
* 06:00 at FeF

For details about what this change entails, and expected downtime,
please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda

Please let me know if you approve this change, if the suggested time is
agreeable to you, and if you have any questions.

Thank you,
</pre>
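The local times quoted in the email above can be double-checked with fixed UTC offsets. A minimal sketch: the offsets are assumptions inferred from the emails themselves (Germany at UTC+2, Ecuador and FeF both at UTC-5 in late April 2025), and fixed offsets ignore daylight-saving changes, so this only holds for dates near the proposed window.

```python
from datetime import datetime, timedelta, timezone

# Assumed UTC offsets (hours) for late April 2025, inferred from the emails
OFFSETS = {
    "UTC": 0,
    "Germany": 2,
    "Ecuador": -5,
    "FeF": -5,
}


def local_times(utc_dt):
    """Map each place to the HH:MM wall-clock time for a UTC datetime."""
    return {
        place: utc_dt.astimezone(timezone(timedelta(hours=hours))).strftime("%H:%M")
        for place, hours in OFFSETS.items()
    }


CHANGE_WINDOW = datetime(2025, 4, 30, 11, 0, tzinfo=timezone.utc)
DAILY_REBOOT = datetime(2025, 4, 30, 10, 40, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(local_times(CHANGE_WINDOW))
    print(local_times(DAILY_REBOOT))
```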
# Marcin returned the email confirming the time
<pre>
Yes, time is perfect at 6 am. Any day.

On Thu, Apr 24, 2025, 12:38 PM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> When would be a good time to replace the second disk on hetzner2?
>
> If we enable daily reboots on hetzner2 at 10:40 UTC, then I propose next
> week on Wednesday 2025-04-30 11:00 UTC, which is:
>
> * 13:00 in Germany (where the server lives)
> * 06:00 here in Ecuador, and
> * 06:00 at FeF
>
> For details about what this change entails, and expected downtime,
> please see the change ticket:
>
> *
> https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
>
> Please let me know if you approve this change, if the suggested time is
> agreeable to you, and if you have any questions.
>
>
> Thank you,
>
>
> Michael Altfield
> https://www.michaelaltfield.net
> PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
>
> Note: If you cannot reach me via email, please check to see if I have
> changed my email address by visiting my website at
> https://email.michaelaltfield.net
</pre>
# ...
# Marcin got back to me and told me to set up the daily reboot cron on hetzner2
<pre>
Yes, please set up reboot. That is decent for now

On Thu, Apr 24, 2025, 11:08 AM Michael Altfield <REDACTED@disroot.org> wrote:

> Hey Marcin,
>
> I don't think the situation is as bad as you think.
>
> > We are missing opportunity,
> > the announcement is posted, and our servers are down.
>
> Of course I agree it's not good, and we should migrate away from
> hetzner2 asap. And I do wish I had more bandwidth to finish the
> migration faster for you.
>
> But you have a varnish cache that caches pages for 24 hours. Even if
> your backend webserver and database are down, popular pages (like the
> frontpage of your wiki or a recent article that you've recently
> promoted) should still load for users.
>
> The big issue isn't marketing and read-only content. The big issue is
> editing. That's what is breaking.
>
> When you're logged into the wiki, it bypasses the varnish cache. So,
> even if the wiki appears down to you, the contents of (most) articles
> viewed in the past 24 hours will be still visible to potential
> apprenticeship applicants.
>
> The next time you see the websites are down, try loading it from another
> device where you're not logged-in. You'll probably see that the
> apprenticeship info is still accessible, even though the backend for the
> site is down.
>
> As a short-term hack, I recommend setting-up a daily reboot of the
> server. Backups typically finish before 10:10 UTC. I recommend we add a
> cron to hetzner2 to reboot itself every day at 10:40 UTC = 05:40 FeF time.
>
> The server seems to function for some time after a fresh reboot, and it
> caches pages for 24 hours. So the first time someone loads a page in the
> wiki after that reboot, it'll be cached for the entire time that the
> server is online until its next reboot. I think this will ensure higher
> availability of your read-only content (eg information about the
> apprenticeship).
>
> Would you like me to setup a daily reboot at 10:40 UTC on hetzner2?
</pre>
# we don't have ansible for hetzner2; I did this manually
<pre>
[root@opensourceecology cron.d]# pwd
/etc/cron.d
[root@opensourceecology cron.d]# ls -lah
total 52K
drwxr-xr-x. 2 root root 4.0K Apr 24 17:56 .
drwxr-xr-x. 105 root root 12K Apr 18 21:52 ..
-rw-r--r-- 1 root root 128 May 16 2023 0hourly
-rw-r--r-- 1 root root 1.3K Apr 9 2019 awstats_generate_static_files
-rw-r--r-- 1 root root 151 Apr 24 17:52 backup_to_backblaze
-rw-r--r-- 1 root root 78 May 31 2024 cacti
-rw-r--r-- 1 root root 125 Dec 11 00:16 letsencrypt
-rw-r--r-- 1 root root 506 Mar 18 2019 phplist
-rw-r--r-- 1 root root 108 Jan 7 2022 raid-check
-rw-r--r-- 1 root root 118 Apr 24 17:56 reboot
-rw------- 1 root root 235 Dec 15 2022 sysstat
[root@opensourceecology cron.d]# cat reboot
# 2025-04-24: temp hack for unstable hetzner2 while we build-out hetzner3 to replace it
40 10 * * * root /sbin/reboot
[root@opensourceecology cron.d]#
</pre>
# tomorrow morning I should check on the uptime and journalctl to make sure it rebooted sometime around 10:40 UTC
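To know when to look for the first automatic reboot, the next firing time of the <code>40 10 * * *</code> entry can be computed. A minimal sketch, assuming the server's clock runs on UTC (as the timestamps in this log suggest); <code>next_daily_run</code> is a hypothetical helper that only handles this fixed daily schedule, not general cron syntax.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=10, minute=40):
    """Return the next time a fixed daily HH:MM schedule fires after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # The cron file above was written at 17:56 UTC on 2025-04-24
    created = datetime(2025, 4, 24, 17, 56, tzinfo=timezone.utc)
    print(next_daily_run(created).isoformat())
```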
# ...
# ok, back to hetzner3: we bought a second IPv4 address for the static sites, but the server's networking was never set up for it; let's add that
<pre>
root@hetzner3 /etc/network # cp interfaces interfaces.20250424
root@hetzner3 /etc/network # vim interfaces
...
</pre>
# well, that failed.
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
You have mail in /var/mail/root
root@hetzner3 ~ #
</pre>
# I restored the backup file, and it still failed. The journal and status aren't helpful
<pre>
root@hetzner3 ~ # systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-24 17:18:55 UTC; 52s ago
Duration: 2month 1w 20h 39min 50.765s
Docs: man:interfaces(5)
Process: 3259336 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 3259371 ExecStopPost=/usr/bin/touch /run/network/restart-hotplug (code=exited, status=0/SUCCESS)
Main PID: 3259336 (code=exited, status=1/FAILURE)
CPU: 29ms

Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
root@hetzner3 ~ # journalctl -u networking | tail
Apr 24 17:16:36 hetzner3 ifup[3258504]: ifup: failed to bring up enp0s31f6
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:16:36 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:16:36 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
Apr 24 17:18:55 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:18:55 hetzner3 ifup[3259347]: RTNETLINK answers: File exists
Apr 24 17:18:55 hetzner3 ifup[3259336]: ifup: failed to bring up enp0s31f6
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 17:18:55 hetzner3 systemd[1]: networking.service: Failed with result 'exit-code'.
Apr 24 17:18:55 hetzner3 systemd[1]: Failed to start networking.service - Raise network interfaces.
root@hetzner3 ~ #
</pre>
# if I run the ExecStart command manually, I can add a verbose tag. but that's not especially helpful, either
<pre>
root@hetzner3 ~ # ifup --verbose -a --read-environment
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool

ifup: configuring interface enp0s31f6=enp0s31f6 (inet)
run-parts --exit-on-error --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/ethtool
ip addr add 144.76.164.201/255.255.255.224 broadcast 144.76.164.223 dev enp0s31f6 label enp0s31f6
RTNETLINK answers: File exists
ifup: failed to bring up enp0s31f6
run-parts --exit-on-error --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/000resolvconf
run-parts: executing /etc/network/if-up.d/ethtool
run-parts: executing /etc/network/if-up.d/postfix
run-parts: executing /etc/network/if-up.d/resolved
root@hetzner3 ~ #
</pre>
# curiously, though, the new IPv4 address is listed in `ip a`
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# I'm just going to give this server a reboot before proceeding, to make sure the IP config is sticky
# when it came-up, it lost the new IP :(
<pre>
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# well, at least it's restarting now without errors; I can work with that
<pre>
root@hetzner3 /etc/network # systemctl restart networking
You have new mail in /var/mail/root
root@hetzner3 /etc/network # systemctlstatus networking
-bash: systemctlstatus: command not found
root@hetzner3 /etc/network # systemctl status networking
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: active (exited) since Thu 2025-04-24 17:33:40 UTC; 15s ago
Docs: man:interfaces(5)
Process: 8598 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Process: 9022 ExecStart=/bin/sh -c if [ -f /run/network/restart-hotplug ]; then /sbin/ifup -a --read-environment --allow=hotplug; fi (code=exited, status=0/SUCCESS)
Main PID: 9022 (code=exited, status=0/SUCCESS)
CPU: 357ms

Apr 24 17:33:34 hetzner3 systemd[1]: Starting networking.service - Raise network interfaces...
Apr 24 17:33:39 hetzner3 ifup[8663]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 ifup[8907]: Waiting for DAD... Done
Apr 24 17:33:40 hetzner3 systemd[1]: Finished networking.service - Raise network interfaces.
root@hetzner3 /etc/network #
</pre>
# let's try to add it now
<pre>
root@hetzner3 /etc/network # diff interfaces interfaces.20250424
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # vim interfaces
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,23
> iface enp0s31f6 inet static
> address 144.76.164.195
> netmask 255.255.255.224
> gateway 144.76.164.193
> # route 144.76.164.192/27 via 144.76.164.193
> #up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6
>
root@hetzner3 /etc/network #
</pre>
# I gave it a restart, but I have errors again
# curiously, it *did* add the new IP address; wtf
<pre>
root@hetzner3 ~ # systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xeu networking.service" for details.
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/27 brd 144.76.164.223 scope global secondary enp0s31f6
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# the internet isn't very helpful because it seems the damn format has changed so many times over the years; lots of outdated info
# lots of people say they fixed this by deleting everything in interfaces.d/, but we don't have anything in that folder
# I did find these hetzner-specific docs on adding a second IP; they're totally different from what I've read elsewhere https://docs.hetzner.com/robot/dedicated-server/network/net-config-debian-ubuntu
<pre>
up ip addr add 10.4.2.1/32 dev eth0
down ip addr del 10.4.2.1/32 dev eth0
</pre>
# I tried this, and gave the server a reboot
<pre>
root@hetzner3 /etc/network # diff interfaces.20250424 interfaces
16a17,20
> # 2025-04-24: add second IPv4 address
> up ip addr add 144.76.164.195/32 dev enp0s31f6
> down ip addr del 144.76.164.195/32 dev enp0s31f6
>
root@hetzner3 /etc/network #

root@hetzner3 /etc/network # cat interfaces
### Hetzner Online GmbH installimage

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback
iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet static
address 144.76.164.201
netmask 255.255.255.224
gateway 144.76.164.193
# route 144.76.164.192/27 via 144.76.164.193
up route add -net 144.76.164.192 netmask 255.255.255.224 gw 144.76.164.193 dev enp0s31f6

# 2025-04-24: add second IPv4 address
up ip addr add 144.76.164.195/32 dev enp0s31f6
down ip addr del 144.76.164.195/32 dev enp0s31f6

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::2
netmask 64
gateway fe80::1

iface enp0s31f6 inet6 static
address 2a01:4f8:200:40d7::3
netmask 64
gateway fe80::1
root@hetzner3 /etc/network #
</pre>
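The netmask and prefix notation in the config above can be cross-checked with Python's standard <code>ipaddress</code> module. This sketch only restates the addresses already shown; it confirms that the dotted netmask is a /27, that the gateway and the second address fall inside the primary address's subnet, and that a /32 covers exactly one host.

```python
import ipaddress

# Primary address and netmask as written in /etc/network/interfaces
primary = ipaddress.ip_interface("144.76.164.201/255.255.255.224")
subnet = primary.network

gateway = ipaddress.ip_address("144.76.164.193")
second_ip = ipaddress.ip_address("144.76.164.195")

# The second address as added by the `up ip addr add .../32` line
second_as_host = ipaddress.ip_network("144.76.164.195/32")

if __name__ == "__main__":
    print("subnet:", subnet)
    print("broadcast:", subnet.broadcast_address)
    print("gateway in subnet:", gateway in subnet)
    print("second IP in subnet:", second_ip in subnet)
    print("addresses covered by the /32:", second_as_host.num_addresses)
```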
# the system came-up with the IP I want. Cool!
<pre>
root@hetzner3 /etc/network # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 /etc/network #
</pre>
# and I'm able to restart the service without it yelling at me (or breaking the IP config)
<pre>
root@hetzner3 ~ # systemctl restart networking
root@hetzner3 ~ #
You have new mail in /var/mail/root
root@hetzner3 ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 90:1b:0e:c4:28:b4 brd ff:ff:ff:ff:ff:ff
inet 144.76.164.201/27 brd 144.76.164.223 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet 144.76.164.195/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::3/64 scope global
valid_lft forever preferred_lft forever
inet6 2a01:4f8:200:40d7::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::921b:eff:fec4:28b4/64 scope link
valid_lft forever preferred_lft forever
root@hetzner3 ~ #
</pre>
# I'm also able to ping the server on both IPs, which is a good sign
<pre>
user@disp9871:~$ ping 144.76.164.201
PING 144.76.164.201 (144.76.164.201) 56(84) bytes of data.
64 bytes from 144.76.164.201: icmp_seq=1 ttl=50 time=490 ms
64 bytes from 144.76.164.201: icmp_seq=2 ttl=50 time=490 ms
^C
--- 144.76.164.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 489.558/489.676/489.795/0.118 ms
user@disp9871:~$
user@disp9871:~$ ping 144.76.164.195
PING 144.76.164.195 (144.76.164.195) 56(84) bytes of data.
64 bytes from 144.76.164.195: icmp_seq=1 ttl=50 time=493 ms
64 bytes from 144.76.164.195: icmp_seq=2 ttl=50 time=512 ms
^C
--- 144.76.164.195 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 492.853/502.518/512.184/9.665 ms
user@disp9871:~$
</pre>
# I used netcat to test it. Most ports are closed, and I found that nginx is listening on most of the other ports on all IPs – except 4443
<pre>
root@hetzner3 ~ # nc -s 144.76.164.195 -l -p 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# and this was how it looked on my laptop's side
<pre>
user@disp9871:~$ nc 144.76.164.195 4443
I am typing this on my laptop computer's local terminal; it should show-up on the server's terminal
</pre>
# ok, so the server's new IPv4 address is configured (and persistent between reboots)
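# as a cross-check of the <code>ip a</code> output above, here's a minimal Python sketch (stdlib <code>ipaddress</code> only; I didn't run this on the server) showing that the /27 on the primary address gives the broadcast address that <code>ip a</code> printed, and that the second address sits inside the same subnet

```python
import ipaddress

# Both addresses are copied from the `ip a` output above.
primary = ipaddress.ip_interface("144.76.164.201/27")
secondary = ipaddress.ip_interface("144.76.164.195/32")

network = primary.network

print(network)                    # the subnet the primary address belongs to
print(network.broadcast_address)  # should match the "brd" value from `ip a`
print(secondary.ip in network)    # is the second IP inside the same /27?
```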

=Sun Apr 20, 2025=
# Marcin replied to my email authorizing the replacement of the /dev/sdb disk on hetzner2 at 2025-04-24 10:00 UTC https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
## I updated the article with the defined date & time
# ...
# I also checked hetzner3. I see that I set up email alerts for the RAID, but not for SMART.
## on hetzner2, we had no errors from the RAID, but we did have SMART errors. I guess eventually, if it failed badly enough that RAID replication was breaking, we would have gotten alerts. But it would be good if we could get alerts *before* that happened.
# I checked munin on hetzner2 to see what data it collects for monitoring disks @ /disk-day.html
## looks like we have latency, throughput, usage, utilization, i/o, and inode usage. There's nothing about "SMART errors"
# looks like there *is* a smart module for munin https://gallery.munin-monitoring.org/plugins/munin/smart_/
# it's already there on hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah | grep -i smart
-rwxr-xr-x 1 root root 11K Mar 21 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Mar 21 2023 smart_
You have new mail in /var/mail/root
root@hetzner3 /usr/share/munin/plugins #
</pre>
# hetzner2 has it too
<pre>
[root@opensourceecology munin]# ls -lah /usr/share/munin/plugins | grep -i smart
-rwxr-xr-x 1 root root 11K Nov 6 2023 hddtemp_smartctl
-rwxr-xr-x 1 root root 26K Nov 6 2023 smart_
[root@opensourceecology munin]#
</pre>
# crap, I just checked hetzner3's munin, and I realized that varnish is missing :(
# it looks like ansible *has* pushed-out the script and plugins
<pre>
root@hetzner3 /usr/share/munin/plugins # ls -lah /usr/share/munin/plugins/ | grep -i varnish
-rwxr-xr-x 1 root root 26K Mar 21 2023 varnish_
-rwxr-xr-x 1 root root 28K Feb 12 00:14 varnish5_
-rwxr-xr-x 1 root root 28K Sep 28 2024 varnish5_.175431.2025-02-12@00:16:02~
-rwxr-xr-x 1 root root 28K Sep 25 2024 varnish5_.20240928.orig
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # ls -lah /etc/munin/plugins/ | grep -i varnish
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_backend_traffic -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_bad -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_expunge -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_hit_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_main_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_memory_usage -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Dec 13 00:03 varnish_mgt_uptime -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_objects -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_request_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_threads -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Sep 25 2024 varnish_transfer_rate -> /usr/share/munin/plugins/varnish5_
lrwxrwxrwx 1 root root 34 Feb 12 00:16 varnish_uptime -> /usr/share/munin/plugins/varnish5_
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I did a diff of the varnish5_ script from my server and ose's server, and I found 2 new lines at the top of the file on hetzner3
## my server
<pre>
maltfield@mail:~$ head /usr/share/munin/plugins/varnish5_
#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
#
# This program is free software; you can redistribute it and/or modify
maltfield@mail:~$
</pre>
## ose's hetzner3
<pre>
maltfield@hetzner3:~$ head /usr/share/munin/plugins/varnish5_
# Ansible managed

#!/usr/bin/perl
# -*- perl -*-
#
# varnish5_ - Munin plugin to for Varnish 5.x and 6.x
# Copyright (C) 2009,2018 Redpill Linpro AS
#
# Author: Kristian Lyngstøl <kristian@bohemians.org>
# Pål-Eivind Johnsen <pej@redpill-linpro.com>
maltfield@hetzner3:~$
</pre>
# so basically the issue appears to be that my "ansible managed" comment comes before the shebang, so the plugin is being run as a shell script instead of perl
# we can see the result of all these syntax errors with a test run too
## my server
<pre>
root@mail:/etc/munin# munin-run varnish_hit_rate
cache_hitpass.value 0
client_req.value 704255
cache_miss.value 202581
cache_hitmiss.value 2181
cache_hit.value 499493
root@mail:/etc/munin#
</pre>
## ose's hetzner3
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
/etc/munin/plugins/varnish_hit_rate: 26: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 28: varnish5_: not found
/etc/munin/plugins/varnish_hit_rate: 30: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 32: Varnish: not found
/etc/munin/plugins/varnish_hit_rate: 34: =head1: not found
/etc/munin/plugins/varnish_hit_rate: 36: The: not found
/etc/munin/plugins/varnish_hit_rate: 38: The: not found
/etc/munin/plugins/varnish_hit_rate: 39: [varnish5_*]: not found
/etc/munin/plugins/varnish_hit_rate: 40: group: not found
/etc/munin/plugins/varnish_hit_rate: 41: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 42: env.name: not found
/etc/munin/plugins/varnish_hit_rate: 44: env.varnishstat: not found
/etc/munin/plugins/varnish_hit_rate: 108: my: not found
/etc/munin/plugins/varnish_hit_rate: 111: my: not found
/etc/munin/plugins/varnish_hit_rate: 114: my: not found
/etc/munin/plugins/varnish_hit_rate: 117: my: not found
/etc/munin/plugins/varnish_hit_rate: 119: my: not found
/etc/munin/plugins/varnish_hit_rate: 123: Syntax error: "(" unexpected
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# I moved the "ansible managed" comment below the shebang in ansible, and pushed it out; now it works
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run varnish_hit_rate
client_req.value 10714
cache_hitmiss.value 9
cache_hit.value 6478
cache_hitpass.value 0
cache_miss.value 4227
root@hetzner3 /usr/share/munin/plugins #
</pre>
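# to catch this kind of mistake earlier, here's a minimal Python sketch of the check: a script's <code>#!</code> only counts if it is the very first two bytes of the file. The function names are made up for illustration, and the "fixed" header layout is my assumption of what the file looks like after moving the comment

```python
def has_leading_shebang(data: bytes) -> bool:
    """Return True only if the very first two bytes are '#!'."""
    return data[:2] == b"#!"


def plugin_has_shebang(path: str) -> bool:
    """Same check for a file on disk (path is whichever plugin you want to test)."""
    with open(path, "rb") as handle:
        return has_leading_shebang(handle.read(2))


# Header as seen in the `head` output from hetzner3 above.
broken = b"# Ansible managed\n\n#!/usr/bin/perl\n# -*- perl -*-\n"
# Assumed layout after moving the comment below the shebang.
fixed = b"#!/usr/bin/perl\n# Ansible managed\n# -*- perl -*-\n"

print(has_leading_shebang(broken))
print(has_leading_shebang(fixed))
```

# something like <code>plugin_has_shebang("/usr/share/munin/plugins/varnish5_")</code> could be run after each ansible push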
# I also pushed-out smart at the same time, but it's not working
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_ suggest
root@hetzner3 /usr/share/munin/plugins #

root@hetzner3 /usr/share/munin/plugins # munin-run smart_
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# the docs page for the smart_ munin plugin says that we need this section at-minimum in the munin config file, so I added it to hetzner2 https://gallery.munin-monitoring.org/plugins/munin/smart_/
<pre>
[root@opensourceecology plugin-conf.d]# tail -n4 zzz-ose

[smart_*]
user root
group disk
[root@opensourceecology plugin-conf.d]#
</pre>
# and I manually created the symlinks for sda & sdb
<pre>
[root@opensourceecology ~]# cd /etc/munin/plugins
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sda
[root@opensourceecology plugins]# ln -s /usr/share/munin/plugins/smart_ /etc/munin/plugins/smart_sdb
[root@opensourceecology plugins]#
</pre>
# sweet, that worked
<pre>
[root@opensourceecology plugins]# munin-run smart_sdb
Program_Fail_Count.value 100
Reallocated_Event_Count.value 100
Ave_Block_Erase_Count.value 001
Reallocate_NAND_Blk_Cnt.value 100
Erase_Fail_Count.value 100
Reported_Uncorrect.value 100
SATA_Interfac_Downshift.value 100
Offline_Uncorrectable.value 100
smartctl_exit_status.value 8
Write_Error_Rate.value 100
FTL_Program_Page_Count.value 100
Current_Pending_Sector.value 100
Success_RAIN_Recov_Cnt.value 100
UDMA_CRC_Error_Count.value 100
Error_Correction_Count.value 100
Temperature_Celsius.value 064
Raw_Read_Error_Rate.value 100
Total_Host_Sector_Write.value 100
Power_Cycle_Count.value 100
Power_On_Hours.value 100
Host_Program_Page_Count.value 100
Unused_Reserve_NAND_Blk.value 000
Percent_Lifetime_Remain.value 000
Unexpect_Power_Loss_Ct.value 100
[root@opensourceecology plugins]#
</pre>
# Unfortunately, I'm not getting the same results on hetzner3. I wonder if this munin plugin doesn't support nvme drives?
# oh, it looks like I'm actually not updating that file anymore in ansible, because it has a backup. I'm going to make a note in ansible so I don't make that mistake again.
# meanwhile, I manually updated the config file on hetzner3 too
<pre>
root@hetzner3 /etc/munin # cd plugin-conf.d/
root@hetzner3 /etc/munin/plugin-conf.d # ls
dhcpd3 munin-node README spamstats zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # touch /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chown root:root /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # chmod 0600 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # cp zzz-myconf /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # ls -lah /var/tmp/munin-zzz-myconf.20250420
-rw------- 1 root root 1,7K Apr 20 17:29 /var/tmp/munin-zzz-myconf.20250420
root@hetzner3 /etc/munin/plugin-conf.d # vim zzz-myconf
root@hetzner3 /etc/munin/plugin-conf.d #

root@hetzner3 /etc/munin/plugin-conf.d # diff /var/tmp/munin-zzz-myconf.20250420 /etc/munin/plugin-conf.d/zzz-myconf
3c3
< # Version: 0.2
---
> # Version: 0.3
9c9
< # Updated: 2024-12-12
---
> # Updated: 2025-04-20
31a32,35
>
> [smart_*]
> user root
> group disk
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# that still fails
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
Warning: the execution of 'munin-run' via 'systemd-run' returned an error. This may either be caused by a problem with the plugin to be executed or a failure of the 'systemd-run' wrapper. Details of the latter can be found via 'journalctl'.
root@hetzner3 /usr/share/munin/plugins #
</pre>
# but, if I restart the service first and then run it, it – uhh – kinda works
<pre>
root@hetzner3 /etc/munin/plugin-conf.d # service munin-node restart
root@hetzner3 /etc/munin/plugin-conf.d #
</pre>
# so it exits without an error, but reports just a U (munin's "unknown" value) and no further stats. huh.
<pre>
root@hetzner3 /usr/share/munin/plugins # munin-run smart_nvme0n1
smartctl_exit_status.value U
root@hetzner3 /usr/share/munin/plugins #
</pre>
# yeah, it looks like the smart_ plugin doesn't work for nvme drives :(
## https://github.com/munin-monitoring/munin/issues/790
## https://github.com/aranemac/munin-smart-nvme
# I'm not looking to compile some binary. I think we've reached the point of diminishing returns here
# while historical smart charts would be great, what I really want to achieve is some email alerts from SMART, like we set up for the RAID
# I found a few guides about this
## https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email
## https://serverfault.com/questions/426761/is-smartd-properly-configured-to-send-alerts-by-email
## https://unix.stackexchange.com/questions/662633/best-practices-to-enable-smart-disk-notifications-on-a-linux-workstation
# I replaced the files
<pre>
root@hetzner3 /etc # mv /etc/smartd.conf /etc/smartd.conf.$(date "+%Y%m%d_%H%M%S").orig
root@hetzner3 /etc #

root@hetzner3 /etc # echo "DEVICESCAN -d removable -n standby -m REDACTED@opensourceecology.org -M exec /usr/share/smartmontools/smartd-runner" > /etc/smartd.conf
root@hetzner3 /etc #
</pre>
# but that didn't work; no email came when I restarted the service (even after I added -M test)
# I checked the status in systemd, and it says that it did try to send the mail
<pre>
root@hetzner3 /etc # systemctl status smartd
● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-04-20 20:58:57 UTC; 3min 22s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1466569 (smartd)
Status: "Next check of 2 devices will start at 21:28:57"
Tasks: 1 (limit: 76834)
Memory: 1.2M
CPU: 66ms
CGroup: /system.slice/smartmontools.service
└─1466569 /usr/sbin/smartd -n

Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, is SMART capable. Adding to "monitor" list.
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Executing test of <mail> to REDACTED@opensourceecology.org ...
Apr 20 20:58:57 hetzner3 smartd[1466569]: Test of <mail> to REDACTED@opensourceecology.org: successful
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme0n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NX0M104566-n1.nvme.state
Apr 20 20:58:57 hetzner3 smartd[1466569]: Device: /dev/nvme1n1, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVLB512HAJQ_00000-S3W8NA0M345614-n1.nvme.state
Apr 20 20:58:57 hetzner3 systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
root@hetzner3 /etc #
</pre>
# so I checked the postfix logs, and it looks like google is rejecting our mail?!?
<pre>
root@hetzner3 ~ # journalctl -fu postfix@-
...
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: Untrusted TLS connection established to aspmx.l.google.com[108.177.15.27]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:34 hetzner3 postfix/smtp[1468111]: CB6E5B94BB2: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[108.177.15.27]:25, delay=1.2, delays=0.01/0.01/0.86/0.27, dsn=2.0.0, status=sent (250 2.0.0 OK 1745183017 ffacd0b85a97d-39efa5a45b6si4251829f8f.798 - gsmtp)
Apr 20 21:04:34 hetzner3 postfix/qmgr[4510]: CB6E5B94BB2: removed
Apr 20 21:04:36 hetzner3 postfix/smtp[1468114]: Untrusted TLS connection established to aspmx.l.google.com[2404:6800:4003:c02::1b]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/bounce socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/bounce socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: unexpected protocol delivery_request_protocol from private/defer socket (expected: delivery_status_protocol)
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: read private/defer socket: Application error
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: warning: D13CAB94BB3: defer service failure
Apr 20 21:04:38 hetzner3 postfix/smtp[1468114]: D13CAB94BB3: to=<REDACTED@opensourceecology.org>, relay=aspmx.l.google.com[2404:6800:4003:c02::1b]:25, delay=4.5, delays=0.01/0.01/3.5/1, dsn=4.3.0, status=deferred (bounce or trace service failure)
...
</pre>
# I changed it to my personal email, restarted, and I got two emails
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme1

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NA0M345614, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# and
<pre>

This message was generated by the smartd daemon running on:

host name: hetzner3
DNS domain: opensourceecology.org

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/nvme0

Device info:
SAMSUNG MZVLB512HAJQ-00000, S/N:S3W8NX0M104566, FW:EXA7301Q, 512 GB

For details see host's SYSLOG.
</pre>
# so I changed it back to the google groups email list email address, and I updated the wiki https://wiki.opensourceecology.org/wiki/Hetzner3
# after lunch, I refreshed munin on hetzner2 and hetzner3, to see if smart info was now being charted
## on hetzner2, there's no changes. I don't see any charts related to SMART
## on hetzner3, there's two new charts (S.M.A.R.T values for drive nvme0n1 & S.M.A.R.T values for drive nvme1n1), but they're both empty; it only has 1 value (smartctl_exit_status), and it's "nan" for all time charts. This is expected, since it can't read the nvme smartctl output format.
# I think maybe I forgot to restart munin on hetzner2, so I gave that a try
<pre>
[root@opensourceecology ~]# service munin-node restart
Redirecting to /bin/systemctl restart munin-node.service
[root@opensourceecology ~]#

[root@opensourceecology ~]# sudo -u munin /usr/bin/munin-cron
2025/04/20 21:29:38 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 55.
closedir() attempted on invalid dirhandle $DIR at /usr/share/munin/munin-update line 56.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:51 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
2025/04/20 21:29:52 [Warning] Could not open includedir directory /etc/munin/conf.d: No such file or directory
readdir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 983.
closedir() attempted on invalid dirhandle $DIR at /usr/share/perl5/vendor_perl/Munin/Master/Utils.pm line 984.
[root@opensourceecology ~]#
</pre>
# whatever; I guess no munin logs on SMART for this dying server
# I also confirmed that varnish logs are now visible in munin
# I committed my ansible changes https://github.com/OpenSourceEcology/ansible/commit/2fb906fd62cf0773d84f50f1cf113ddfe66910ec
# anyway, I also updated smartd.conf on hetzner2
<pre>
[root@opensourceecology smartmontools]# cp smartd.conf smartd.conf.20250420.bak
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# vim smartd.conf
[root@opensourceecology smartmontools]#

[root@opensourceecology smartmontools]# diff smartd.conf.20250420.bak smartd.conf
23c23,24
< DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
---
> #DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
> DEVICESCAN -H -m REDACTED@opensourceecology.org -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
[root@opensourceecology smartmontools]#
[root@opensourceecology smartmontools]# systemctl restart smartd
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
SMART Disk monitor:
Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
[root@opensourceecology smartmontools]#
</pre>
# oh wow, that screaming about the disks failing wasn't just printed to my tty; it got printed to every tty on my screen session. It really is angry..
# but, alas, no email was sent – even from hetzner2, where email should *definitely* be working
# this time the postfix logs on hetzner2 gave us an error from gmail saying why they're blocking us
<pre>
Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42a931si4417083f8f.167 - gsmtp (in reply to end of DATA command)
Apr 20 21:40:27 opensourceecology postfix/smtp[21094]: 3CBF7684804: host aspmx.l.google.com[142.251.168.27] said: 421-4.7.28 Gmail has detected an unusual rate of unsolicited mail. To protect 421-4.7.28 our users from spam, mail has been temporarily rate limited. For 421-4.7.28 more information, go to 421-4.7.28 https://support.google.com/mail/?p=UnsolicitedRateLimitError to 421 4.7.28 review our Bulk Email Senders Guidelines. ffacd0b85a97d-39efa42967csi4306047f8f.165 - gsmtp (in reply to end of DATA command)
</pre>
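# for picking these rejections out of a long mail log, here's a minimal Python sketch that pulls the SMTP reply code and enhanced status code out of a postfix "said:" line. The regex and function name are my own (not part of postfix); a 4xx reply is a temporary failure, so postfix keeps the message queued and retries it later

```python
import re

# Matches e.g. "host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 ..."
SAID_RE = re.compile(
    r"host (?P<host>\S+?)\[(?P<ip>[^\]]+)\] said: "
    r"(?P<code>[245]\d\d)[- ](?P<enhanced>\d\.\d{1,3}\.\d{1,3})"
)


def parse_smtp_reply(log_line: str):
    """Return the remote host, reply code and enhanced code, or None if absent."""
    match = SAID_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "host": match.group("host"),
        "ip": match.group("ip"),
        "code": code,
        "enhanced": match.group("enhanced"),
        "temporary": 400 <= code < 500,  # 4xx means "try again later"
    }


# Shortened copy of the first log line above.
sample = (
    "Apr 20 21:40:27 opensourceecology postfix/smtp[21221]: 297716847E6: "
    "host aspmx.l.google.com[64.233.167.27] said: 421-4.7.28 Gmail has "
    "detected an unusual rate of unsolicited mail."
)
print(parse_smtp_reply(sample))
```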
# Marcin sent an email campaign today with phpList. If that didn't make it out due to this, that's kind of a problem.
# I see in the log that we're kinda spamming phplist_bounces@opensourceecology.org
# that's basically where phplist is supposed to let our admins know that it failed to deliver to some people on the mailing list
## I confirmed that this account *does* exist in the gsuite admin wui user list
# yeah, crap, it's blocking other mail sent to my personal account from apache.
# woah, I'm tailing the mail log and I just saw probably hundreds or thousands of attempted sends. phpList is *supposed* to send in small batches, but I wonder if, once a message fails and gets added to the queue, it gets re-sent without batching..
# I checked phpList wui settings and config.php, and I don't see anything about rate-limiting
# here's the docs on it https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
# it says it should be set in config.php. By default, I think it's 5,000 emails per hour
# Marcin's campaign today was sent to 14,111 people
# I checked the event log page, and I see a lot of these "Maximum time for queue processing: 99999" – which I guess means we need to break these up into batches https://phplist.opensourceecology.org/lists/admin/?page=eventlog
# looks like the easiest thing to do is to add a pause with MAILQUEUE_THROTTLE https://discuss.phplist.org/t/some-advice-for-correct-configuration-of-sending-rate/429
# if we send one per second, then we'll send 3,600 per hour.
## If we have 15,000 people on our list, then at that rate we'd need 4-5 hours to send a campaign. That sounds like a good idea.
# I updated the phpList config file to send only one email per second
<pre>
[root@opensourceecology phplist.opensourceecology.org]# diff config.20250420.php config.php
83a84,87
> // only send 1 email per second
> // * https://www.phplist.org/manual/books/phplist-manual/page/setting-the-send-speed-%28rate%29
> define('MAILQUEUE_THROTTLE',1);
>
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
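# the arithmetic behind that choice, as a minimal Python sketch. It assumes the throttle pause dominates (the time to actually hand each message off is ignored), so real campaigns would take somewhat longer; the function name is just for illustration

```python
def campaign_hours(recipients: int, seconds_per_email: float) -> float:
    """Hours needed to send one campaign with a fixed pause per email."""
    return recipients * seconds_per_email / 3600


def emails_per_hour(seconds_per_email: float) -> float:
    """Send rate implied by a fixed pause per email."""
    return 3600 / seconds_per_email


print(emails_per_hour(1))                   # rate with MAILQUEUE_THROTTLE = 1
print(round(campaign_hours(15000, 1), 2))   # a 15,000-person list
print(round(campaign_hours(14111, 1), 2))   # today's 14,111-person campaign
```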
# we should also probably throttle postfix https://serverfault.com/questions/110919/postfix-throttling-for-outgoing-messages
# looks like for both hetzner2 and hetzner3, this is set to no delay
<pre>
[root@opensourceecology phplist.opensourceecology.org]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 0s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology phplist.opensourceecology.org]#
</pre>
# I set this on hetzner2
<pre>
[root@opensourceecology postfix]# diff main.cf.20250420 main.cf
683a684,686
>
> # limit emails to the same-destination-domain to one-email-per-2-seconds
> default_destination_rate_delay = 2s
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# systemctl restart postfix
[root@opensourceecology postfix]#
[root@opensourceecology postfix]# postconf | grep -i _rate_
anvil_rate_time_unit = 60s
default_destination_rate_delay = 2s
error_destination_rate_delay = $default_destination_rate_delay
lmtp_destination_rate_delay = $default_destination_rate_delay
local_destination_rate_delay = $default_destination_rate_delay
relay_destination_rate_delay = $default_destination_rate_delay
retry_destination_rate_delay = $default_destination_rate_delay
smtp_destination_rate_delay = $default_destination_rate_delay
smtpd_client_connection_rate_limit = 0
smtpd_client_message_rate_limit = 0
smtpd_client_new_tls_session_rate_limit = 0
smtpd_client_recipient_rate_limit = 0
virtual_destination_rate_delay = $default_destination_rate_delay
[root@opensourceecology postfix]#
</pre>
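# since this delay applies per destination (not globally), the time to drain a queue depends on how many messages go to each domain. Here's a rough Python sketch of that estimate; the addresses are hypothetical, and it assumes one recipient per message and that postfix treats each recipient domain as its own destination (no relayhost)

```python
from collections import Counter


def drain_seconds_by_domain(recipients, rate_delay_s):
    """Rough per-domain drain time: message count times the rate delay."""
    counts = Counter(addr.rsplit("@", 1)[1].lower() for addr in recipients)
    return {domain: count * rate_delay_s for domain, count in counts.items()}


# Hypothetical queue contents, for illustration only.
queue = ["a@example.org", "b@example.org", "c@example.com"]

# 2 matches default_destination_rate_delay = 2s from main.cf above.
print(drain_seconds_by_domain(queue, 2))
```

# so with a 2s delay, a single domain that receives thousands of messages is what determines how long the whole queue takes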
# and I also added this to ansible and pushed it out to the server on hetzner3 https://github.com/OpenSourceEcology/ansible/commit/7ed339cad055a9a0c5b04f26d32c9416daf3a2c7

=Sat Apr 19, 2025=

# I responded to Tom's email about ssh
# Tom wasn't able to reset their account's password
# I think I created these accounts with `--disabled-password`, probably as some layered security for ssh (to force keys), but that kinda breaks sudo, which requires the password. I could make sudo NOPASSWD, but in general I think it's safer to have a user password set (and still have password logins disabled in ssh) than to set sudoers to NOPASSWD
# disabled passwords are set with the '!' in the second field of /etc/shadow
<pre>
root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing:!:20133:0:99999:7:::
root@hetzner3 ~ #
</pre>
# I just manually edited /etc/shadow with vim to remove the exclamation point
<pre>
root@hetzner3 ~ # vim /etc/shadow
root@hetzner3 ~ #

root@hetzner3 ~ # tail /etc/shadow
varnish:!:19990::::::
vcache:!:19990::::::
varnishlog:!:19990::::::
mysql:!:19991::::::
munin:!:19991::::::
wp:!:19994:0:99999:7:::
not-apache:!:19995:0:99999:7:::
marcin:!:20133:0:99999:7:::
cmota:!:20133:0:99999:7:::
tgriffing::20133:0:99999:7:::
</pre>
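# for reference, a minimal Python sketch of how the second field of a shadow line can be read. The function name is made up; the three states follow shadow(5): a field starting with '!' or '*' can't be used to log in with a password, an empty field means no password is required, and anything else is treated here as a password hash

```python
def password_state(shadow_line: str) -> str:
    """Classify the password field (2nd colon-separated field) of a shadow line."""
    fields = shadow_line.rstrip("\n").split(":")
    if len(fields) < 2:
        raise ValueError("not a shadow entry: " + repr(shadow_line))
    password = fields[1]
    if password == "":
        return "empty"    # no password is required for this account
    if password[0] in "!*":
        return "locked"   # password login disabled
    return "set"          # assumed to be a crypt(3) hash


# Lines copied from the `tail /etc/shadow` output above (before and after the edit).
print(password_state("tgriffing:!:20133:0:99999:7:::"))
print(password_state("tgriffing::20133:0:99999:7:::"))
```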
# Tom replied, saying he can become root on hetzner3 now.
# ...
# I returned to work on the plan for replacing the disks on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb#Change_Steps
# I confirmed that the disks (on both hetzner2 and hetzner3) are MBR partition scheme (not GPT) – indicated by "Disk label type: dos"
<pre>
[root@opensourceecology ~]# fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266

Device Boot Start End Blocks Id System
/dev/sda1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sda2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sda3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0xd904fc05

Device Boot Start End Blocks Id System
/dev/sdb1 2048 67110912 33554432+ fd Linux raid autodetect
/dev/sdb2 67112960 68161536 524288+ fd Linux raid autodetect
/dev/sdb3 68163584 488395120 210115768+ fd Linux raid autodetect
[root@opensourceecology ~]#
</pre>
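# a minimal Python sketch of the same check, in case we want to script it: it reads the label line out of <code>fdisk -l</code> output and maps it to MBR or GPT. The function names are my own; the regex also accepts the "Disklabel type:" spelling that newer fdisk versions print

```python
import re


def label_type(fdisk_output: str):
    """Return the disk label reported by `fdisk -l` (e.g. 'dos'), or None."""
    match = re.search(r"^Disk\s?label type:\s*(\S+)", fdisk_output, re.MULTILINE)
    return match.group(1) if match else None


def partition_scheme(label) -> str:
    """Map fdisk's label name to the usual partition-table name."""
    return {"dos": "MBR", "gpt": "GPT"}.get(label, "unknown")


# Header lines copied from the /dev/sda output above.
sda = """Disk /dev/sda: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk label type: dos
Disk identifier: 0x9b8e1266
"""

print(partition_scheme(label_type(sda)))
print(488397168 * 512 == 250059350016)  # sector count times 512 matches the byte size
```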
# A quick spot-check shows that our backups usually finish at 09:55 – one time as late as 10:07. That's UTC.
# 10:00 UTC is 05:00 my time and 12:00 in Berlin. God that's early, but better to do this early in Germany time..
# I sent an email to Marcin asking if Thu 2025-04-24 @ 10:00 UTC (~05:00 FeF) would be a good time to do this
<pre>
Hey Marcin,

When would be a good time to replace the first disk on hetzner2?

Our backups finish daily at 10:00 UTC, which is:

* 12:00 in Germany (where the server lives)
* 05:00 here in Ecuador, and
* 05:00 at FeF

I propose next week on Thursday 2025-04-24 10:00 UTC.

For details about what this change entails, and expected downtime, please see the change ticket:

* https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb

Please let me know if you approve this change, if the suggested time is agreeable to you, and if you have any questions.
</pre>
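# the time conversions in that email can be double-checked with a minimal Python sketch. The fixed UTC offsets are my assumptions for late April (CEST for Germany, no DST in Ecuador, US Central Daylight Time for FeF), hard-coded so it doesn't depend on a timezone database

```python
from datetime import datetime, timedelta, timezone

# Assumed UTC offsets (in hours) on 2025-04-24.
OFFSETS = {
    "Germany": 2,   # CEST
    "Ecuador": -5,  # ECT, no daylight saving
    "FeF": -5,      # assumed US Central Daylight Time
}


def local_times(start_utc: datetime) -> dict:
    """Format the proposed start time as HH:MM in each location."""
    return {
        place: start_utc.astimezone(timezone(timedelta(hours=hours))).strftime("%H:%M")
        for place, hours in OFFSETS.items()
    }


proposed = datetime(2025, 4, 24, 10, 0, tzinfo=timezone.utc)
print(local_times(proposed))
```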

=Fri Apr 18, 2025=
# Marcin sent another email this morning asking why osemain is down too now, and I responded
<pre>
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

* https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.

Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing? Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly. This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin
</pre>
# ok, so back to trying to figure out the corruption of the mariadb
# looks like the attempt to start it in recovery mode 2 fails after 10 minutes
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 10m0.435s
user 0m0.011s
sys 0m0.012s
[root@opensourceecology etc]#
</pre>
# and the tail of the db log
<pre>
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00 InnoDB: Waiting for the background threads to start
250417 23:06:01 InnoDB: Waiting for the background threads to start
250417 23:06:02 InnoDB: Waiting for the background threads to start
250417 23:06:03 InnoDB: Waiting for the background threads to start
250417 23:06:04 InnoDB: Waiting for the background threads to start
250417 23:06:05 InnoDB: Waiting for the background threads to start
250417 23:06:06 InnoDB: Waiting for the background threads to start
250417 23:06:07 InnoDB: Waiting for the background threads to start
250417 23:06:08 InnoDB: Waiting for the background threads to start
250417 23:06:09 InnoDB: Waiting for the background threads to start
</pre>
# so we have one more recovery mode we can try before it becomes destructive = 3
<pre>
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
>
[root@opensourceecology etc]#
</pre>
# and gave it a restart
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
...
</pre>
# damn, looks like it's stuck on the same thing
<pre>
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18 InnoDB: Waiting for the background threads to start
250418 19:33:19 InnoDB: Waiting for the background threads to start
250418 19:33:20 InnoDB: Waiting for the background threads to start
...
</pre>
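# rather than eyeballing the tail of the log, a rough sketch like this could count how many "Waiting for the background threads to start" lines sit at the very end of the log, which is a hint that startup is looping. waiting_count is a hypothetical helper name; the log path is the one used above.

```shell
# Count the consecutive "Waiting for the background threads to start"
# lines at the end of a MariaDB error log. A large number suggests the
# server is stuck in the startup loop seen above.
# waiting_count is a hypothetical helper, not part of MariaDB.
waiting_count() {
  # $1 = path to the error log
  awk '
    /InnoDB: Waiting for the background threads to start/ { n++; next }
    { n = 0 }              # any other line resets the streak
    END { print n + 0 }    # "+ 0" prints 0 for an empty log
  ' "$1"
}

# usage (path as used on this server):
#   waiting_count /var/log/mariadb/mariadb.log
```

# InnoDB logs that line about once per second in the transcripts above, so the count is roughly the number of seconds spent waiting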
# the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
## https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
## https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
# anyway, I set innodb_force_recovery back down to 2 and added an innodb_purge_threads = 0 line; I'll try that when it's not blocked
# meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
# oh shit, that worked
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m2.102s
user 0m0.010s
sys 0m0.007s
[root@opensourceecology etc]#
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 22468 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]#
</pre>
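# for next time: the my.cnf change that got it to start (innodb_force_recovery = 2 plus innodb_purge_threads = 0) could be scripted roughly as below. This is only a sketch: it assumes GNU sed and that the options belong directly under a [mysqld] header, and set_mysqld_opt is a made-up helper name. The demo runs against a scratch file, not /etc/my.cnf.

```shell
# Set (or replace) a key under [mysqld] in a my.cnf-style file.
# set_mysqld_opt is a hypothetical helper; GNU sed is assumed.
# If the file has no [mysqld] header and the key is absent, nothing is added.
set_mysqld_opt() {
  # $1 = config file, $2 = option name, $3 = value
  file="$1"; key="$2"; value="$3"
  if grep -qE "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
    # option already present: rewrite its value in place
    sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
  else
    # option missing: append it right after the [mysqld] header
    sed -i "/^\[mysqld\]/a ${key} = ${value}" "$file"
  fi
}

# Demo on a scratch copy (contents are illustrative)
cnf=$(mktemp)
printf '[mysqld]\ndatadir=/var/lib/mysql\n' > "$cnf"
set_mysqld_opt "$cnf" innodb_force_recovery 2
set_mysqld_opt "$cnf" innodb_purge_threads 0
cat "$cnf"
```

# keeping a dated copy of the original first (like my.cnf.20250417 above) makes it easy to diff and roll back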
# the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
<pre>
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28 InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28 InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
</pre>
# oh, the spam stopped. maybe just some startup thing.
# I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
# this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
# but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
# this took about a minute to run
<pre>
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]#
</pre>
# good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
<pre>
[root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log
3dp_db.wp_commentmeta OK
3dp_db.wp_comments OK
3dp_db.wp_links OK
3dp_db.wp_masterslider_options OK
3dp_db.wp_masterslider_sliders OK
3dp_db.wp_options OK
3dp_db.wp_postmeta OK
3dp_db.wp_posts OK
3dp_db.wp_revslider_css OK
3dp_db.wp_revslider_layer_animations OK
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log
cacti_db.automation_ips
note : The storage engine for the table doesn't support check
cacti_db.automation_processes
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note : The storage engine for the table doesn't support check
cacti_db.poller_output
note : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note : The storage engine for the table doesn't support check
osemain_db.wp_options
warning : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]#
</pre>
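# the grep -vi OK output above mixes "note" lines with "warning" lines. A small sketch for pairing each message with its table and severity is below. It assumes the two-line layout shown in the log above (table name on one line, message on the next); mysqlcheck_flagged is a hypothetical helper name.

```shell
# Pair every non-OK mysqlcheck message with the table it belongs to.
# Output: one "<severity><TAB><db.table>" line per message.
# mysqlcheck_flagged is a hypothetical helper, not part of MariaDB.
mysqlcheck_flagged() {
  # $1 = path to a saved mysqlcheck log
  awk '
    /^[ \t]*(note|warning|error)[ \t]*:/ {
      sev = $1
      sub(/:$/, "", sev)        # tolerate "note:" without a space
      print sev "\t" last
      next
    }
    { last = $1 }               # remember the most recent table name
  ' "$1"
}

# usage:
#   mysqlcheck_flagged mysqlcheck.20250418.log | sort | uniq -c
```

# filtering on the first column then separates the "storage engine doesn't support check" notes from the "hasn't closed the table properly" warnings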
# let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
## cacti_db
## osemain_db
## oswh_db
# I sent Marcin a status update email
<pre>
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
</pre>
# ok, I made the post-corruption mysqldump backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real 2m48.845s
user 3m19.170s
sys 0m2.023s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
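# since the next step is destructive, it would be worth sanity-checking the dump first. A minimal sketch: test the gzip stream and confirm the last line is mysqldump's "-- Dump completed" trailer (which mysqldump writes unless comments are disabled). verify_dump is a hypothetical helper name, and passing this check does not prove the data inside is uncorrupted.

```shell
# Basic sanity check for a gzip-compressed mysqldump file:
#  1. the gzip stream decompresses cleanly
#  2. the last line is the "-- Dump completed" trailer
# verify_dump is a hypothetical helper, not an existing tool.
verify_dump() {
  # $1 = path to the .sql.gz file
  gzip -t "$1" || return 1
  gzip -dc "$1" | tail -n 1 | grep -q '^-- Dump completed'
}

# usage:
#   verify_dump mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz \
#     && echo "dump looks complete"
```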
# now let's drop those three databases.
<pre>
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_s_db |
| osewiki_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# that looked good
<pre>
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]#
</pre>
# recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
# nah, it failed
<pre>
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m2.805s
user 0m0.006s
sys 0m0.010s
[root@opensourceecology etc]#
</pre>
# logs are the same, I think?
<pre>
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04 InnoDB: Waiting for the background threads to start
250418 20:10:04 InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
# I re-enabled recovery mode, but this time just as 1. This time it did start, but this loop gets spammed to the logs
<pre>
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
250418 20:11:42 InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42 InnoDB: Waiting for the background threads to start
</pre>
# well, even though it *says* it's started
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m5.156s
user 0m0.008s
sys 0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 24458 (mysqld_safe)
CGroup: /system.slice/mariadb.service
├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
└─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real 0m0.012s
user 0m0.001s
sys 0m0.007s
[root@opensourceecology etc]#
</pre>
# we can't connect to it with mysqlcheck
<pre>
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log
real 0m0.010s
user 0m0.002s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]#
</pre>
# so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
# huh, all lines say OK
<pre>
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real 0m11.597s
user 0m0.010s
sys 0m0.009s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log
[root@opensourceecology dbFail.20250417]#
</pre>
# well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
# I'm going to restore from the backup and then see if I can do that
# oh, right, we can't INSERT in recovery mode
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]#
</pre>
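# if the plan ends up being to restore or repair one database at a time somewhere else, a single database can be cut out of the --all-databases dump. A rough sketch, relying on the "-- Current Database: `name`" comment lines that mysqldump writes before each database; extract_db is a hypothetical helper name. Note the extract skips the SET statements at the top of the dump, and for the last database in the file it also includes the dump's trailer lines.

```shell
# Print only one database's section from a gzip-compressed
# "mysqldump --all-databases" file, using the
# "-- Current Database: `name`" marker comments as boundaries.
# extract_db is a hypothetical helper, not an existing tool.
extract_db() {
  # $1 = path to the .sql.gz dump, $2 = database name
  gzip -dc "$1" | awk -v db="$2" '
    /^-- Current Database: `/ {
      # start keeping lines at our marker, stop at any other marker
      keep = ($0 == ("-- Current Database: `" db "`"))
    }
    keep { print }
  '
}

# usage:
#   extract_db mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz osewiki_db > osewiki_db.sql
```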
# well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
# we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
<pre>
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

</pre>
# even after that, it still won't start :'(
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real 0m4.863s
user 0m0.009s
sys 0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real 0m0.010s
user 0m0.002s
sys 0m0.005s
[root@opensourceecology etc]#
</pre>
# before I purge those three system-level DBs, I want to confirm they're in our backups
# as I feared, two of them (information_schema and performance_schema) are missing from the dump; only mysql is in there
<pre>
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]#
</pre>
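# a minimal sketch of the same check as a reusable function: it lists the databases that a gzipped mysqldump will create, then reports which expected names are absent. The function names are hypothetical (not part of our backup scripts), and the sed pattern assumes the backtick-quoted <code>CREATE DATABASE</code> lines shown above

```shell
#!/usr/bin/env bash
# List database names from the CREATE DATABASE lines of a gzipped mysqldump.
# Assumes each name is wrapped in backticks, as in the zgrep output above.
dump_databases() {
  gzip -dc "$1" | sed -n 's/^CREATE DATABASE .*`\([^`]*\)`.*$/\1/p' | sort -u
}

# Print every expected database (args 2..n) that is absent from the dump (arg 1).
missing_from_dump() {
  local dump="$1"; shift
  local present db
  present="$(dump_databases "$dump")"
  for db in "$@"; do
    printf '%s\n' "$present" | grep -qxF -- "$db" || echo "$db"
  done
}

# Example (hypothetical file name):
# missing_from_dump dump.sql.gz information_schema mysql performance_schema
```

# anchoring the pattern on <code>^CREATE DATABASE</code> means the huge INSERT lines are skipped without needing a line-length filter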
# according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to lose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
# I'm just going to manually dump these three anyway. Or try to
# well, I was able to get one of the three to backup
<pre>
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real 0m0.010s
user 0m0.006s
sys 0m0.008s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real 0m0.142s
user 0m0.155s
sys 0m0.010s
[root@opensourceecology dbFail.20250417]#
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real 0m0.009s
user 0m0.009s
sys 0m0.005s
[root@opensourceecology dbFail.20250417]#
</pre>
# mysql looks good
<pre>
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]#
</pre>
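# note that the two 4.0K files are from the dumps that errored out: in a plain <code>mysqldump | gzip</code> pipeline, the exit status is gzip's, so a script wouldn't notice the failure. A minimal sketch of a guard, assuming bash and assuming mysqldump's usual trailing <code>-- Dump completed</code> comment hasn't been disabled; both function names are hypothetical

```shell
#!/usr/bin/env bash
# Run a dump command piped through gzip, failing if EITHER side fails.
# Usage: dump_to_gz OUTFILE COMMAND [ARGS...]
dump_to_gz() {
  local out="$1"; shift
  # pipefail is set in a subshell so it does not leak into the caller
  ( set -o pipefail; "$@" | gzip -c > "$out" )
}

# A complete mysqldump normally ends with a "-- Dump completed" comment
# (assumption: comments were not disabled, e.g. with --skip-comments).
dump_is_complete() {
  gzip -t "$1" 2>/dev/null && gzip -dc "$1" | tail -n 1 | grep -q '^-- Dump completed'
}

# Example (hypothetical output name):
# dump_to_gz mysql.sql.gz mysqldump -uroot -p"$mysqlPass" mysql && dump_is_complete mysql.sql.gz
```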
# I'm just going to move this whole db dir out of the way and see if we can start it fresh
<pre>
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x 4 mysql mysql 4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]#
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]#
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]#
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x 2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
[root@opensourceecology lib]#
</pre>
# ok, it's started outside recovery mode now
<pre>
[root@opensourceecology etc]# time systemctl restart mariadb

real 0m3.550s
user 0m0.007s
sys 0m0.012s
[root@opensourceecology etc]#

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23 InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server
</pre>
# it created all these files
<pre>
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x 5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root root 4.0K Apr 18 20:55 ..
-rw-rw---- 1 mysql mysql 16K Apr 18 20:56 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Apr 18 20:56 aria_log_control
-rw-rw---- 1 mysql mysql 18M Apr 18 20:56 ibdata1
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw---- 1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx 1 mysql mysql 0 Apr 18 20:56 mysql.sock
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------ 2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]#
</pre>
# that also would have killed the mysql password; I can't login
<pre>
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]#
</pre>
# I hacked my way in and set the root password
<pre>
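# start mysqld without privilege checks and without listening on the network
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
# at the MariaDB prompt:
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
# back in the shell: find the mysqld_safe job
jobs -l
# kill mysqld_safe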
</pre>
# now I can see our three databases, plus one named test
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]#
</pre>
# first let's just restore the 'mysql' database
<pre>
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]#
</pre>
# that appears to have worked; our users are present now
<pre>
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User |
+------------------+
| oseforum_user |
| cacti_user |
| 3dp_user |
| cacti_user |
| d3d_user |
| fef_user |
| microfactory_usr |
| munin_user |
| obi2_user |
| obi3_user |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]>
</pre>
# I gave it a restart, and ensured it's still working. Great.
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# now let's restore the rest – including even our corrupt databases – and see if it works or breaks
# that took about 11.5 minutes to import ~6.8G of data
<pre>
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real 11m36.530s
user 1m52.944s
sys 0m3.593s
[root@opensourceecology dbFail.20250417]#

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G /var/lib/mysql
[root@opensourceecology dbFail.20250417]#

</pre>
# I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
<pre>
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# woah, I gave it a restart, and it came back fine
<pre>
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 3dp_db |
| cacti_db |
| d3d_db |
| fef_db |
| microfactory_db |
| mysql |
| obi_db |
| obi_staging_db |
| oseforum_db |
| osemain_db |
| osemain_s_db |
| osewiki_db |
| oswh_db |
| performance_schema |
| phplist_db |
| seedhome_db |
| store_db |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]>
</pre>
# I guess we fixed it with no data loss?
# let's bring up the web servers
<pre>
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]#
</pre>
# the wiki loads now
# so does osemain
# I'd say we're back in business
# I sent an email to Marcin
<pre>
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues.
</pre>
# now that Marcin has ssh access on the server again, I wonder if he has permission to execute `reboot` – that would be better for him than logging into the hetzner WUI and doing hard resets, which likely caused this corruption
# at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
# looks like it won't let him reboot if other users are logged-in
<pre>
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
1. maltfield
2. crupp
3. Tom Griffing (tgriffing)
4. jthomas
Choose identity to authenticate as (1-4):
</pre>
# I updated the sudoers file to give marcin *just* access to the reboot command
<pre>
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]#
</pre>
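# for reference, a minimal sketch of staging the same rule in its own file, so it can be syntax-checked before it goes anywhere near <code>/etc/sudoers.d</code> (which the <code>#includedir</code> line above already reads). The function name and the staging path are hypothetical

```shell
#!/usr/bin/env bash
# Write a one-line sudoers rule granting USER passwordless /sbin/reboot to FILE.
# Usage: write_reboot_rule USER FILE
write_reboot_rule() {
  local user="$1" file="$2"
  printf '%s ALL = NOPASSWD: /sbin/reboot\n' "$user" > "$file"
  chmod 0440 "$file"   # sudoers files are conventionally mode 0440
}

# Example (hypothetical staging path), run as root:
# write_reboot_rule marcin /root/marcin-reboot.sudoers
# visudo -cf /root/marcin-reboot.sudoers   # syntax check only; changes nothing
# sudo -l -U marcin                        # list what marcin may run via sudo
```

# <code>sudo -l -U marcin</code> (as root) should show what the rule grants without needing his password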
# I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
# it's debian, but sudoers syntax should (hopefully) be the same
<pre>
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~#

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password:
Retype new password:
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin:
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
</pre>
# yeah, that worked. Perfect.
# I tested it on hetzner2; it worked too.
<pre>
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
</pre>
# I sent Marcin a reply asking him to test reboots via ssh
<pre>
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need?
</pre>
# I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
## I wonder if I can figure out which one grub uses and replace that one second..
# from my log yesterday, here are our two drives' serial numbers
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
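# since the serial is what the hetzner techs will need to pull the right physical disk, here's a minimal sketch that boils the udevadm output down to one value per device. The function name is hypothetical

```shell
#!/usr/bin/env bash
# Extract the ID_SERIAL_SHORT value from `udevadm info --query=property`
# output supplied on stdin.
serial_short() {
  sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Example on the real host (prints one "device serial" line per disk):
# for d in sda sdb; do
#   echo "$d $(udevadm info --query=property --name "$d" | serial_short)"
# done
```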
# fuck; looks like neither is referenced in /boot/
<pre>
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]#
</pre>
# the steps to setup grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
# it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
<pre>
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64 initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64 initramfs-3.10.0-327.18.2.el7.x86_64.img System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64 initramfs-3.10.0-514.26.2.el7.x86_64.img System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64.img System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64 initramfs-3.10.0-693.2.2.el7.x86_64kdump.img System.map-3.10.0-693.2.2.el7.x86_64
efi initrd-plymouth.img vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub lost+found vmlinuz-3.10.0-1127.el7.x86_64
grub2 symvers-3.10.0-1127.el7.x86_64.gz vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img symvers-3.10.0-1160.119.1.el7.x86_64.gz vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img symvers-3.10.0-327.18.2.el7.x86_64.gz vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img symvers-3.10.0-514.26.2.el7.x86_64.gz vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]#
</pre>
# I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
# oh, the efi dir is empty, so I'm thinking we're using grub2
<pre>
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]#
</pre>
# yeah, the grub dir just has one file in it?
<pre>
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11 2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
-rw-r--r-- 1 root root 1.4K Nov 15 2011 splash.xpm.gz
[root@opensourceecology boot]#
</pre>
# grub2 looks most sane
<pre>
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26 2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26 2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15 2015 fonts
-rw-r--r-- 1 root root 7.8K Jul 26 2024 grub.cfg
-rw-r--r-- 1 root root 5.3K Jun 1 2016 grub.cfg.1499616907.rpmsave
-rw-r--r-- 1 root root 6.1K Jul 9 2017 grub.cfg.1506097734.rpmsave
-rw-r--r-- 1 root root 7.0K Sep 22 2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26 2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31 2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31 2016 locale
[root@opensourceecology boot]#
</pre>
# it looks like it's referencing the raid, not the drive
<pre>
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a' 9f6b5264-da8c-406d-a444-45e3fb3aeb26
else
search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
fi
linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
</pre>
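# a dry-run sketch of what the post-swap steps might look like; it only *prints* commands, it doesn't run them. Loud assumptions: an MBR partition table (the <code>part_msdos</code> lines above suggest this), arrays md0/md1/md2 built from partitions 1/2/3 (unverified; check <code>/proc/mdstat</code> first), and <code>grub2-install</code> as the CentOS 7 name for the installer that the hetzner docs call <code>grub-install</code>. The function name is hypothetical

```shell
#!/usr/bin/env bash
# Print (do NOT execute) the commands to rebuild a 2-disk software RAID1
# after one disk was replaced.
# Usage: plan_rebuild GOOD_DISK NEW_DISK
# ASSUMPTIONS: MBR partition table; /dev/mdN is built from partition N+1.
plan_rebuild() {
  local good="$1" new="$2" n
  # copy the partition table from the surviving disk to the new one
  echo "sfdisk -d $good | sfdisk $new"
  # add each new partition back into its array
  for n in 1 2 3; do
    echo "mdadm --manage /dev/md$((n - 1)) --add ${new}${n}"
  done
  # write the boot loader to the new disk so either disk can boot
  echo "grub2-install $new"
}

# Example, if sdb is replaced first:
# plan_rebuild /dev/sda /dev/sdb
# cat /proc/mdstat    # watch the resync progress afterwards
```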
# right, so if I understand this correctly: we're not updating grub. We're using 'grub-install' to copy our grub config *to* the drive. that's easier and less concerning than I thought.
# well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
# that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
# Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
# the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 150G 38G 80% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#
</pre>
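# to keep an eye on that while the extra copies sit in /var/tmp/, here's a minimal sketch that flags any filesystem at or above a chosen Use%. The function name and the threshold of 80 are just examples

```shell
#!/usr/bin/env bash
# Read `df -P` (or `df -h`) output on stdin and print the mount point and
# Use% of every filesystem at or above the given threshold.
# Usage: df -P | over_threshold 80
over_threshold() {
  awk -v t="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t + 0) print $6, $5 }'
}
```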

=Thu Apr 17, 2025=
# Marcin sent me an email last night (and again this morning) asking why the wiki is down
# I hadn't touched ose infra since 6 days ago
# the wiki is still on hetzner2, which is on EOL CentOS, so I'm not terribly surprised it's falling apart.
# I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
# anyway, let's check what happened to the wiki on hetzner2
# it's a 500 error complaining about the db
<pre>
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
<input type="hidden" name="num" value="50" />
<input type="hidden" name="ie" value="UTF-8" />
<input type="hidden" name="oe" value="UTF-8" />
<input type="text" name="q" size="31" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<p>
<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
<label><input type="radio" name="sitesearch" value="" />WWW</label>
</p>
user@disp9871:~$
</pre>
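# for a quick up/down check that doesn't need a human to read the headers, here's a minimal sketch that reduces <code>curl -i -L</code> output to the status code of the final response in the redirect chain. The function name is hypothetical

```shell
#!/usr/bin/env bash
# Read `curl -i -L` output on stdin and print the status code of the LAST
# HTTP response in the redirect chain.
final_status() {
  awk '/^HTTP\/[0-9.]+ [0-9]+/ { code = $2 + 0 } END { print code }'
}

# Example:
# curl -siL wiki.opensourceecology.org | final_status
```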
# disk is fine
<pre>
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 96G 92G 52% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
# there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
# there's also no new logs in the mariadb error log when I hit the site in real-time
# well, the db isn't running
<pre>
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]#
</pre>
# error logs aren't very helpful
<pre>
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
</pre>
# if I try to restart it manually, nothing gets put in the journal logs, but there's a bunch written to the actual log file that the journal log mentions (damn systemd)
<pre>
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
</pre>
# here's the log that pops-up when we try a restart
<pre>
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31 InnoDB: Waiting for the background threads to start
250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
</pre>
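# to pull just the assertion out of that wall of text, something like this works. It's only a sketch; the function name is made up, and the log path is whatever mysqld_safe reports (here /var/log/mariadb/mariadb.log)

```shell
# Print each InnoDB assertion failure plus the line after it
# (which names the failed condition) from a MariaDB error log.
# The log path is passed in by the caller; nothing is hard-coded.
innodb_assertions() {
    local logfile="$1"
    grep -A1 'InnoDB: Assertion failure' "$logfile"
}
```

Usage would be <code>innodb_assertions /var/log/mariadb/mariadb.log</code>.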
# google points to this https://bugs.mysql.com/bug.php?id=61516
## they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
# reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
# before reading any more, I'm going to immediately make a local copy of our most-recent backups
# looks like we have a backup from 13 hours ago and one from 37 hours ago
<pre>
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x 2 root root 4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x 2 root root 4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r-- 1 b2user root 22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
</pre>
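# rather than doing the date math in my head, the backup age can be computed from the filename. A minimal sketch, assuming GNU date and assuming the timestamp in the filename is the UTC time the backup started (the function name is hypothetical)

```shell
# Print the age, in whole hours, of a backup named like
# daily_hetzner2_YYYYMMDD_HHMMSS.tar.gpg, relative to a given UTC time.
# Assumes GNU date (for -d), as shipped on CentOS and Debian.
backup_age_hours() {
    local file="$1" now="$2"
    local stamp then_s now_s
    # turn "..._20250417_072001.tar.gpg" into "2025-04-17 07:20:01"
    stamp=$(basename "$file" | sed -E \
        's/.*_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})\..*/\1-\2-\3 \4:\5:\6/')
    then_s=$(date -u -d "$stamp" +%s)
    now_s=$(date -u -d "$now" +%s)
    echo $(( (now_s - then_s) / 3600 ))
}
```

Usage would be <code>backup_age_hours /home/b2user/sync/daily_hetzner2_20250417_072001.tar.gpg "$(date -u '+%Y-%m-%d %H:%M:%S')"</code>.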
# this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
## it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
## other warnings suggest solving the underlying issue: why did the data become corrupt?
## well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it keeps breaking since some months ago (it's EOL and not worth debugging)
## but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
# I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
<pre>
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

* https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry.
</pre>
# the backups of the backups finished
<pre>
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
22,975,631,986 100% 139.63MB/s 0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
21,566,407,634 100% 103.43MB/s 0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes received 54 bytes 125,324,653.70 bytes/sec
total size is 44,542,039,620 speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 17M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 197G 138G 50G 74% /
/dev/md1 488M 386M 77M 84% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1005
[root@opensourceecology ~]#
</pre>
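# rsync reported no errors, but for a copy this important it's cheap to compare checksums of the original and the copy too. A sketch with a made-up function name; it only assumes sha256sum from coreutils

```shell
# Succeed (exit 0) only if both files exist and have the same SHA-256.
# Exit 1 if they differ, exit 2 if either file can't be read.
same_checksum() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    local a b
    # read via stdin so the output has no filename in it
    a=$(sha256sum < "$1") || return 2
    b=$(sha256sum < "$2") || return 2
    [ "$a" = "$b" ]
}
```

Usage would be <code>same_checksum /home/b2user/sync/daily_hetzner2_20250417_072001.tar.gpg /var/tmp/daily_hetzner2_20250417_072001.tar.gpg && echo ok</code>. Expect it to take a few minutes on 20G files.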
# I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
<pre>
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
</pre>
# I should also make a backup of /var/lib/mysql
# I'm going to create a dir for all of this
<pre>
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]#

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]#

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

* https://bugs.mysql.com/bug.php?id=61516

Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]#
</pre>
# aaaand I added a copy of /var/lib/mysql/
<pre>
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec
total size is 7,383,062,830 speedup is 1.00
[root@opensourceecology ~]#
</pre>
# another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
<pre>
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
</pre>
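# if the db does start in recovery mode, the plan is one dump per database. Here's a dry-run sketch that only prints the mysqldump commands so they can be reviewed first. The function name and output dir are my own assumptions; <code>--databases</code> and <code>--result-file</code> are standard mysqldump options

```shell
# Print (do NOT run) one mysqldump command per database, so the list can
# be reviewed before anything touches the recovering server.
# The output directory and database names are supplied by the caller.
print_dump_commands() {
    local outdir="$1" db
    shift
    for db in "$@"; do
        printf 'mysqldump --databases %s --result-file=%s/%s.sql\n' \
            "$db" "$outdir" "$db"
    done
}
```

Usage would be <code>print_dump_commands /var/tmp/dbFail.20250417/dumps store_db</code>, with one argument per database after the directory.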
# Unfortunately, a lot of the links for how to fix this are now dead
## https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
## https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## https://forums.mysql.com/read.php?22,603093,604631#msg-604631
## https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
# we're running MariaDB 5.5.68, so the closest docs should be something like this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
## but note that redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
## ah, so does 1.1 – apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
# this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
## it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
## good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
### but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
## it says that if we can dump with a recovery mode of 1, 2, or 3, the dump is relatively safe and only some data on corrupt individual pages is lost
### here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
<pre>
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
</pre>
# I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
# ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
# but first let's check SMART
# oh, fuck, my notes on this are on the wiki. Of course.
# arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
# fail
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
</pre>
# luckily the yum servers for this EOL OS are still online, and I could install it
<pre>
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1
Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1

Installed:
smartmontools.x86_64 1:7.0-2.el7

Complete!
[root@opensourceecology ~]#
</pre>
# better
<pre>
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
</pre>
# well this is terrifying; it says both our disks are gonna fail within 24 hours
<pre>
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
</pre>
# compare that to hetzner3, which says all is good
<pre>
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
</pre>
# I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
# oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
## well, it lists two options
### Free Replacement drive nearly new or used and tested; depends on what is in stock.
### At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
## we were given the option to hot swap while the system is on or to swap while it's shut down. I'm going to say shutdown. That'll be simpler from the OS side, I think
## dang, it says they'll swap the drive within 2-4 hours.
# I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I ask them to replace? Can I see which one is more fucked than the other?
# hetzner provides 4 docs for assistance on this
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
## https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
## https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
# that first doc says to run the command we just ran
# hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
# ok, the docs say we can get more info with -A
<pre>
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78355
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2599
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 046 000 Old_age Always - 36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 405734134966
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12794981941
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26207531685

[root@opensourceecology ~]#

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 78354
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 43
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 3742
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 40
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2585
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 000 Old_age Always - 35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 406209116828
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 12809824998
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 42504271864

[root@opensourceecology ~]#
</pre>
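# to pick out only the attributes the drive itself flags, filter on the WHEN_FAILED column. A small sketch (function name is made up) that reads <code>smartctl -A</code> output on stdin and assumes the 10-column ATA table layout shown above

```shell
# Read `smartctl -A` output on stdin and print "NAME WHEN_FAILED" for
# every attribute whose WHEN_FAILED column (field 9) is not "-".
# Assumes the ATA attribute table layout: ID first, raw value last.
failing_smart_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $2, $9 }'
}
```

Usage would be <code>smartctl -A /dev/sda | failing_smart_attributes</code>.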
# so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
# well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
# yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
# in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
# the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner3 (even if I can't update the wiki right now with these docs)
# wow, the output for smartctl looks very different for the NVMe drives on Debian (hetzner3) than it does for the SATA SSDs on CentOS (hetzner2)
<pre>
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 152.358.379 [78,0 TB]
Data Units Written: 52.125.092 [26,6 TB]
Host Read Commands: 6.873.372.480
Host Write Commands: 1.362.559.127
Controller Busy Time: 22.226
Power Cycles: 28
Power On Hours: 17.245
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 159
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 7%
Data Units Read: 140.811.605 [72,0 TB]
Data Units Written: 56.604.901 [28,9 TB]
Host Read Commands: 1.304.073.899
Host Write Commands: 1.364.668.115
Controller Busy Time: 21.180
Power Cycles: 23
Power On Hours: 15.565
Unsafe Shutdowns: 5
Media and Data Integrity Errors: 0
Error Information Log Entries: 149
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 45 Celsius

root@hetzner3 ~ #
</pre>
# that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
# the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
# this indicates that our raid is fine. two Us (eg <code>[UU]</code>) is fine. Bad would be a U and a missing U (eg <code>[U_]</code>)
<pre>
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
209984640 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
</pre>
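# that check is easy to script. A minimal sketch (the function name is my own) that reads mdstat-style text on stdin and prints any array whose status has a missing member

```shell
# Read /proc/mdstat-style text on stdin and print the name of every md
# array whose member status (e.g. [UU]) contains "_", meaning a member
# is missing. Prints nothing when all arrays are healthy.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}
```

Usage would be <code>degraded_arrays < /proc/mdstat</code>; empty output means every array still has all of its members.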
# ah crap, the process to bring the new drive back into the RAID is non-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
## first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back-up.
# the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
<pre>
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
</pre>
# I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
<pre>
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
</pre>
# I also kicked-off a long test, which I can check tomorrow
<pre>
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]#
</pre>
# ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
<pre>
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/md2 197G 145G 43G 78% /
[root@opensourceecology ~]#

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]#

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
md2 9:2 0 200.3G 0 raid1 /
[root@opensourceecology ~]#
</pre>
# it won't let me check the filesystem while it's mounted
<pre>
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
</pre>
# it probably should be happening on-boot, but I couldn't find it in dmesg
<pre>
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
</pre>
# ok, instead we can just use tune2fs to get the info on the last check that was run
# looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
</pre>
# both of the filesystems (/ and /boot) look fine
<pre>
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
</pre>
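# a minimal sketch of how the same check could be scripted for both arrays. The <code>fs_summary</code> helper name is hypothetical (it is not installed on the server); it only filters <code>tune2fs -l</code> output read from stdin, so it can be tried against pasted output first

```shell
#!/bin/sh
# fs_summary (hypothetical helper): read `tune2fs -l` output on stdin and
# print only the fields that were compared by hand above.
fs_summary() {
  grep -E '^(Filesystem state|Mount count|Last checked):'
}

# Sample input copied from the /dev/md1 output in this log.
sample='Last mounted on: /boot
Filesystem state: clean
Errors behavior: Continue
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016'

# Prints the state, mount count and last-checked lines only.
printf '%s\n' "$sample" | fs_summary
```

# on the server itself this would presumably be run as root, once per device, as <code>tune2fs -l /dev/md1 | fs_summary</code>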
# well, so far I couldn't find any signs of corruption on the disk/fs level
# back to the db, I set the recovery option in the my.cnf file
<pre>
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
</pre>
# it didn't come-up
<pre>
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
</pre>
# I tried changing it to recovery level 2 (<code>innodb_force_recovery = 2</code>); this time it got stuck "waiting for the background threads"
<pre>
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
</pre>
# it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

=Fri Apr 11, 2025=

# let's get Catarina that broken staging site for osemain on hetzner3
# Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
# so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
# I logged-into hetzner https://robot.hetzner.com/server
# I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
# I clicked on the server for "hetzner3" and the tab "IPs".
## Then I clicked on "Order additional IPs / Nets"
## I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
## it required me to enter a reason (IPv4 is scarce) to which I wrote:
<pre>
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
</pre>
## and I clicked "Apply for IP/subnet in obligation"
## I got a message; looks like it needs human approval
<pre>
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
</pre>
# I typed an email to Marcin and Catarina to notify them of this order
<pre>
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.

Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
</pre>
# before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
# I refreshed the hetzner wui, and now I see the new IP
# ...
# following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
## I sent them both emails asking them to confirm access
# I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
# update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
# he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
# I found that he had previously scattered his documentation about backups, luks, veracrypt, pgp, general cybersec, etc across a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes that he will be able to do this more easily in the future
# I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
# Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
# ...
# I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
# gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
# I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
# there are too many layers on this server; I checked the logs
<pre>
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
</pre>
# ok, so it's modsecurity?
# gah, that's a lot of files to review
<pre>
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
</pre>
# looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
<pre>
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
</pre>
# it's 13107200
<pre>
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
</pre>
# docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
# so 13107200 / 1024 / 1024 = 12.5 MB.
# jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
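# a minimal sketch of the arithmetic above, using a hypothetical <code>bytes_to_mib</code> helper. Note that the 413 message quotes a limit of 1000000, not 13107200; the "no files data length" wording looks like it belongs to the separate <code>SecRequestBodyNoFilesLimit</code> directive rather than <code>SecRequestBodyLimit</code>, but that is an assumption and was not verified in this log

```shell
#!/bin/sh
# bytes_to_mib (hypothetical helper): convert a byte count to MiB,
# printed with one decimal place (awk does the floating-point math).
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1024 / 1024 }'
}

# SecRequestBodyLimit value found in conf.d/mod_security.conf
bytes_to_mib 13107200

# limit quoted in the ModSecurity 413 error message (just under 1 MiB)
bytes_to_mib 1000000
```

# if that assumption holds, a case-insensitive grep for <code>NoFilesLimit</code> in the same httpd config directory would show whether that directive is set explicitly; the earlier grep for <code>BodyLimit</code> cannot match it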
# ...
# so Marcin is stressing the urgency of getting Catarina a sandbox on hetzner3, so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc
# I didn't want to do this site before the other lower-priority ones, but it's just a sandbox
# I realized I never made a CHG file for osemain
# looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
# ugh, I just said I was "following the same guide as with the other sites"
## I was hoping to know which CHG to copy from
## I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
# ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3
# oh, crap, I'm going to remove

Maltfield Log/2025 Q2

2025-04-27T21:48:36Z

Maltfield: Apr 20