Maltfield Log/2025 Q2

From Open Source Ecology
< Maltfield Log
Revision as of 21:41, 27 April 2025 by Maltfield (talk | contribs) (Apr 18)
Jump to navigation Jump to search

My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

See Also

  1. Maltfield_Log
  2. User:Maltfield
  3. Special:Contributions/Maltfield

Fri Apr 18, 2025

  1. Marcin sent another email this morning asking why osemain is down too now, and I responded
Hey Marcin,

> It seems that the ose main website was up when I wrote the
> last message

Your whole database service was down, and it won't start. You have a varnish cache that stores a subset of pages in-memory for 24 hours. That's probably what you saw.

I took webservers down yesterday to prevent the possibility of them corrupting the database worse, if it manages to start in recovery mode.

>> go straight to migration to Hetzner 3.

If you want high uptime, I don't recommend migrating to hetzner3 at this time. It's still not fully provisioned, and I actively work on it like a dev server. Which means I'll be restarting it and its services. It's not a safe place for production. That's why the wiki is the *last* service to migrate.

Status update: yesterday I investigated to see if your underlying storage (disk, filesystem, or RAID) are failing, which might cause corruption. The filesystems were fine. RAID didn't have errors. The SMART logs on the disk said both of your two mirrored drives are failing and should be replaced within 24 hours. But I don't think that's evidence of corruption; I think it's just a timer that's alerting us to the possibility that the disks will fail soon. afaict, disk replacement is free (from Hetzner) but not trivial and high-risk. I'll postpone until after restoring the database.

Likely not all of your database is corrupt. We *could* restore from backup, but I don't recommend that -- as you only have daily backups, and likely you'll have data loss.

Yesterday I put the database in two recovery modes and was unable to get it to start. My plan is to continue to follow this guide, to see if I can find out which databases/tables/pages are corrupt and which are not. That way we can restore only the data we need from backups and minimize data loss

 * https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

I have to go to the hospital today. If I have time, I will try to continue later tonight. And I plan to work on this over the weekend. I hope to have your sites back online early next week.


Cheers,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972  644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 02:58, Marcin Jakubowski wrote:
> Michael,
>
> It seems that the ose main website was up when I wrote the last message -
> but now I'm trying to post the blog posts and the main site appears to be
> down. Is our whole backend crashing?  Or is that something you are doing on
> your end?
>
> Marcin
>
> On Thu, Apr 17, 2025 at 6:41 PM Marcin Jakubowski <
> REDACTED@opensourceecology.org> wrote:
>
>> Can we prioritize the wiki at this point to migrate the wiki right over to
>> Hetzner 3 with the  current up to date software, using the wiki backup from
>> 2 days ago, which is before the crash?
>>
>> The wiki was working at least the first part of yesterday, and I noticed
>> the crash at about 11 PM CST yesterday. Thus taking the backup from 4/15/25
>> should solve this? Ie, forget about trying to fix on Hetzner 2, go straight
>> to migration to Hetzner 3. Is that consistent with a possible shift in your
>> plans, or does that throw off the entire process of migration? OSE stands
>> stuck without it, I will have to do everything in Google docs if I don't
>> have wiki access, and i am justvputtingvout the announcent and recruiting.
>> I can switcj ro more publishing on the website, assuming that all works.
>> Please tell me what would be your proposed solution and how quickly you
>> think we can get back up to a functioning wiki, based on your schedule of
>> availability to work on this, so I can plan accordingly.  This is a much
>> higher priority than doing any of the main website migration.
>>
>> Thanks,
>> Marcin 
  1. ok, so back to trying to figure out the corruption of the mariadb
  2. looks like the attempt to start it in recovery mode 2 fails after 10 minutes
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because a fatal signal was delivered to the control process. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real    10m0.435s
user    0m0.011s
sys     0m0.012s
[root@opensourceecology etc]#
  1. and the tail of the db log
[root@opensourceecology ~]# tail -f /var/log/mariadb/mariadb.log
250417 23:06:00  InnoDB: Waiting for the background threads to start
250417 23:06:01  InnoDB: Waiting for the background threads to start
250417 23:06:02  InnoDB: Waiting for the background threads to start
250417 23:06:03  InnoDB: Waiting for the background threads to start
250417 23:06:04  InnoDB: Waiting for the background threads to start
250417 23:06:05  InnoDB: Waiting for the background threads to start
250417 23:06:06  InnoDB: Waiting for the background threads to start
250417 23:06:07  InnoDB: Waiting for the background threads to start
250417 23:06:08  InnoDB: Waiting for the background threads to start
250417 23:06:09  InnoDB: Waiting for the background threads to start
  1. so we have one more recovery mode we can try before it becomes destructive = 3
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
> 
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 3
> 
[root@opensourceecology etc]# 
  1. and gave it a restart
[root@opensourceecology etc]# time systemctl restart mariadb
...
  1. damn, looks like it's stuck on the same thing
250418 19:33:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:33:17 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 20076 ...
250418 19:33:17 InnoDB: The InnoDB memory heap is disabled
250418 19:33:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:33:17 InnoDB: Compressed tables use zlib 1.2.7
250418 19:33:17 InnoDB: Using Linux native AIO
250418 19:33:17 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:33:17 InnoDB: Completed initialization of buffer pool
250418 19:33:17 InnoDB: highest supported file format is Barracuda.
250418 19:33:17  InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:33:17  InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:33:18  InnoDB: Waiting for the background threads to start
250418 19:33:19  InnoDB: Waiting for the background threads to start
250418 19:33:20  InnoDB: Waiting for the background threads to start
...
  1. the internet suggests this infinite loop is caused by the default of innodb_purge_threads=1, and it says we should set this to 0
    1. https://serverfault.com/questions/851342/mysql-crashed-and-not-starting-even-after-adding-innodb-force-recovery
    2. https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
  2. I tried to cut off the systemctl restart early, but it's just stuck. I guess I just have to wait 10 minutes.
  3. anyway, I set the recovery back down to 2 and added the purge threads to 0 line; I'll try that when it's not blocked
  4. meanwhile, I read up on innodb_purge_threads, which is documented here https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html
  5. oh shit, that worked
[root@opensourceecology etc]# time systemctl restart mariadb

real    0m2.102s
user    0m0.010s
sys     0m0.007s
[root@opensourceecology etc]# 
[root@opensourceecology etc]# systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2025-04-18 19:44:30 UTC; 19s ago
  Process: 22469 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
  Process: 22433 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 22468 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
		   ├─22468 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
		   └─22693 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-...

Apr 18 19:44:28 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 19:44:28 opensourceecology.org mariadb-prepare-db-dir[22433]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 19:44:28 opensourceecology.org mysqld_safe[22468]: 250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 19:44:30 opensourceecology.org systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology etc]# 
  1. the logs are being spammed with these last 5 lines a bunch; I guess something is still trying to access the db?
250418 19:44:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 19:44:28 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 22693 ...
250418 19:44:28 InnoDB: The InnoDB memory heap is disabled
250418 19:44:28 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 19:44:28 InnoDB: Compressed tables use zlib 1.2.7
250418 19:44:28 InnoDB: Using Linux native AIO
250418 19:44:28 InnoDB: Initializing buffer pool, size = 128.0M
250418 19:44:28 InnoDB: Completed initialization of buffer pool
250418 19:44:28 InnoDB: highest supported file format is Barracuda.
250418 19:44:28  InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250418 19:44:28  InnoDB: Starting final batch to recover 11 pages from redo log
250418 19:44:28  InnoDB: Waiting for the background threads to start
250418 19:44:29 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883505166
250418 19:44:29 InnoDB: !!! innodb_force_recovery is set to 2 !!!
250418 19:44:29 [Note] Plugin 'FEEDBACK' is disabled.
250418 19:44:29 [Note] Event Scheduler: Loaded 0 events
250418 19:44:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
  1. oh, the spam stopped. maybe just some startup thing.
  2. I was hoping at startup it would tell us which DBs/tables/pages were corrupt; I guess we have to initiate a scan or something.
  3. this guide doesn't say anything about that https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
  4. but this one recommends running `mysqlcheck` https://community.spiceworks.com/t/how-to-recover-crashed-innodb-tables-on-mysql-database-server/1013051
  5. this took about a minute to run
[root@opensourceecology dbFail.20250417]# mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# 
  1. good news; looks like the wiki isn't fucked. it's just osemain, oswh, and cacti. restoring those from backups is probably not going to cause any data loss
root@opensourceecology dbFail.20250417]# head mysqlcheck.20250418.log 
3dp_db.wp_commentmeta                              OK
3dp_db.wp_comments                                 OK
3dp_db.wp_links                                    OK
3dp_db.wp_masterslider_options                     OK
3dp_db.wp_masterslider_sliders                     OK
3dp_db.wp_options                                  OK
3dp_db.wp_postmeta                                 OK
3dp_db.wp_posts                                    OK
3dp_db.wp_revslider_css                            OK
3dp_db.wp_revslider_layer_animations               OK
[root@opensourceecology dbFail.20250417]# 
[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418.log 
cacti_db.automation_ips
note     : The storage engine for the table doesn't support check
cacti_db.automation_processes
note     : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_cache
note     : The storage engine for the table doesn't support check
cacti_db.data_source_stats_hourly_last
note     : The storage engine for the table doesn't support check
cacti_db.poller_output
note     : The storage engine for the table doesn't support check
cacti_db.poller_output_boost_processes
note     : The storage engine for the table doesn't support check
osemain_db.wp_options
warning  : 1 client is using or hasn't closed the table properly
osemain_s_db.wp_options
warning  : 1 client is using or hasn't closed the table properly
oswh_db.wp_options
warning  : 1 client is using or hasn't closed the table properly
[root@opensourceecology dbFail.20250417]# 
  1. let's go ahead and take a mysqldump now, including the corrupt data. then I'll drop these three databases and restore from backups
    1. cacti_db
    2. osemain_db
    3. oswh_db
  2. I sent Marcin a status update email
Hey Marcin,

I was able to start your database in recovery mode, and I see the following databases have corrupt tables:

1. osemain
2. cacti
3. oswh

Good news that the wiki isn't in that list. And that those particular corrupt DBs don't change much, so recovering just those databases from backups should result in an acceptable data loss, if any.

I'll keep you updated.
  1. ok, I made the post-corruption mysqldump backup
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass --all-databases | gzip -c > mysqldump-after-corruption-while-in-recovery-mode.$(date "+%Y%m%d_%H%M%S").sql.gz

real    2m48.845s
user    3m19.170s
sys     0m2.023s
[root@opensourceecology dbFail.20250417]# 

[root@opensourceecology dbFail.20250417]# ls mysqldump*
mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]# 

[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz 
1.3G    mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
[root@opensourceecology dbFail.20250417]# 
  1. now let's drop those three databases.
[root@opensourceecology dbFail.20250417]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 14
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| 3dp_db             |
| cacti_db           |
| d3d_db             |
| fef_db             |
| microfactory_db    |
| mysql              |
| obi_db             |
| obi_staging_db     |
| oseforum_db        |
| osemain_db         |
| osemain_s_db       |
| osewiki_db         |
| oswh_db            |
| performance_schema |
| phplist_db         |
| seedhome_db        |
| store_db           |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE cacti_db;
Query OK, 108 rows affected (0.38 sec)

MariaDB [(none)]> DROP DATABASE osemain_db;
Query OK, 22 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE oswh_db;
Query OK, 12 rows affected (0.03 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
+--------------------+
| information_schema |
+====================+
| 3dp_db             |
+--------------------+
| d3d_db             |
+--------------------+
| fef_db             |
+--------------------+
| microfactory_db    |
+--------------------+
| mysql              |
+--------------------+
| obi_db             |
+--------------------+
| obi_staging_db     |
+--------------------+
| oseforum_db        |
+--------------------+
| osemain_s_db       |
+--------------------+
| osewiki_db         |
+--------------------+
| performance_schema |
+--------------------+
| phplist_db         |
+--------------------+
| seedhome_db        |
+--------------------+
| store_db           |
+--------------------+
+--------------------+
15 rows in set (0.00 sec)

MariaDB [(none)]> 
  1. that looked good
MariaDB [(none)]> exit
Bye
[root@opensourceecology dbFail.20250417]# 
  1. recovery mode isn't going to let us INSERT to recover data from backups, so let's take it out of recovery mode and see if the db will start
  2. nah, it failed
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]# 

[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real    0m2.805s
user    0m0.006s
sys     0m0.010s
[root@opensourceecology etc]# 
  1. logs are the same, I think?
250418 20:10:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:10:04 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 24305 ...
250418 20:10:04 InnoDB: The InnoDB memory heap is disabled
250418 20:10:04 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:10:04 InnoDB: Compressed tables use zlib 1.2.7
250418 20:10:04 InnoDB: Using Linux native AIO
250418 20:10:04 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:10:04 InnoDB: Completed initialization of buffer pool
250418 20:10:04 InnoDB: highest supported file format is Barracuda.
250418 20:10:04  InnoDB: Waiting for the background threads to start
250418 20:10:04  InnoDB: Assertion failure in thread 140076605044480 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:10:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x560180c61cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x560180875975]
sigaction.c:0(__restore_rt)[0x7f664031f630]
:0(__GI_raise)[0x7f663ea46387]
:0(__GI_abort)[0x7f663ea47a78]
/usr/libexec/mysqld(+0x63845f)[0x560180a0a45f]
/usr/libexec/mysqld(+0x638fa4)[0x560180a0afa4]
/usr/libexec/mysqld(+0x73b504)[0x560180b0d504]
/usr/libexec/mysqld(+0x730487)[0x560180b02487]
/usr/libexec/mysqld(+0x63b17d)[0x560180a0d17d]
/usr/libexec/mysqld(+0x62f0f6)[0x560180a010f6]
pthread_create.c:0(start_thread)[0x7f6640317ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f663eb0eb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:10:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
  1. I re-enabled recovery mode, but this time just as 1. This time it did start, but this loop gets spammed to the logs
250418 20:11:42 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 625883708456
250418 20:11:42 InnoDB: !!! innodb_force_recovery is set to 1 !!!
250418 20:11:42 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:11:42 [Note] Event Scheduler: Loaded 0 events
250418 20:11:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
250418 20:11:42  InnoDB: Assertion failure in thread 140282494781184 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250418 20:11:42 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e2d6dbbcad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e2d69cf975]
sigaction.c:0(__restore_rt)[0x7f962fbdc630]
:0(__GI_raise)[0x7f962e303387]
:0(__GI_abort)[0x7f962e304a78]
/usr/libexec/mysqld(+0x63845f)[0x55e2d6b6445f]
/usr/libexec/mysqld(+0x638fa4)[0x55e2d6b64fa4]
/usr/libexec/mysqld(+0x73b504)[0x55e2d6c67504]
/usr/libexec/mysqld(+0x730487)[0x55e2d6c5c487]
/usr/libexec/mysqld(+0x63b17d)[0x55e2d6b6717d]
/usr/libexec/mysqld(+0x62e83c)[0x55e2d6b5a83c]
pthread_create.c:0(start_thread)[0x7f962fbd4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f962e3cbb0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250418 20:11:42 mysqld_safe Number of processes running now: 0
250418 20:11:42 mysqld_safe mysqld restarted
250418 20:11:42 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 27371 ...
250418 20:11:42 InnoDB: The InnoDB memory heap is disabled
250418 20:11:42 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:11:42 InnoDB: Compressed tables use zlib 1.2.7
250418 20:11:42 InnoDB: Using Linux native AIO
250418 20:11:42 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:11:42 InnoDB: Completed initialization of buffer pool
250418 20:11:42 InnoDB: highest supported file format is Barracuda.
250418 20:11:42  InnoDB: Waiting for the background threads to start
  1. well, even though it *says* it's started
[root@opensourceecology etc]# time systemctl restart mariadb

real    0m5.156s
user    0m0.008s
sys     0m0.010s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2025-04-18 20:11:07 UTC; 13s ago
  Process: 24459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
  Process: 24423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 24458 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
		   ├─24458 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
		   └─25620 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/v...

Apr 18 20:11:02 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:11:02 opensourceecology.org mariadb-prepare-db-dir[24423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:11:02 opensourceecology.org mysqld_safe[24458]: 250418 20:11:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:11:07 opensourceecology.org systemd[1]: Started MariaDB database server.

real    0m0.012s
user    0m0.001s
sys     0m0.007s
[root@opensourceecology etc]# 
  1. we can't connect to it with mysqlcheck
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log                              
real    0m0.010s
user    0m0.002s
sys     0m0.008s
[root@opensourceecology dbFail.20250417]# 

[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# 
  1. so I set it back to recovery mode 2, restarted, and tried the mysqlcheck again
  2. huh, all lines say OK
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418
mysqlcheck.20250418_201348.log  mysqlcheck.20250418.log
[root@opensourceecology dbFail.20250417]# less mysqlcheck.20250418_201348.log
mysqlcheck: Got error: 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) when trying to connect
[root@opensourceecology dbFail.20250417]# time mysqlcheck --all-databases -u root -p$mysqlPass &> mysqlcheck.$(date "+%Y%m%d_%H%M%S").log

real    0m11.597s
user    0m0.010s
sys     0m0.009s
[root@opensourceecology dbFail.20250417]# 

[root@opensourceecology dbFail.20250417]# grep -vi OK mysqlcheck.20250418_201559.log 
[root@opensourceecology dbFail.20250417]# 
  1. well now I'm wondering if I should have run CHECK TABLE and REPAIR TABLE rather than just DROP them https://dev.mysql.com/doc/refman/8.4/en/myisam-table-close.html
  2. I'm going to restore from the backup and then see if I can do that
  3. oh, right, we can't INSERT in recovery mode
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass
ERROR 1030 (HY000) at line 91: Got error -1 from storage engine
[root@opensourceecology dbFail.20250417]# 
  1. well, fuck, now I don't know why it won't start. And it doesn't tell me why. The good news is that I was able to get a db dump. maybe I can copy this huge dump over to some other server for repair and then copy it back?
  2. we should have backups. I'm going to just purge all the non-system databases and see if we can get this thing started at all
MariaDB [(none)]> DROP DATABASE 3dp_db d3ddb;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'd3ddb' at line 1
MariaDB [(none)]> DROP DATABASE 3dp_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE d3d_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE fef_db;
Query OK, 12 rows affected (0.06 sec)

MariaDB [(none)]> DROP DATABASE microfactory_db;
Query OK, 20 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_db;
Query OK, 21 rows affected (0.09 sec)

MariaDB [(none)]> DROP DATABASE obi_stabing_db;
ERROR 1008 (HY000): Can't drop database 'obi_stabing_db'; database doesn't exist
MariaDB [(none)]> DROP DATABASE oseforum_db;
Query OK, 35 rows affected (0.08 sec)

MariaDB [(none)]> DROP DATABASE osemain_s_db;
Query OK, 20 rows affected (0.04 sec)

MariaDB [(none)]> DROP DATABASE osewiki_db;
Query OK, 59 rows affected (0.31 sec)

MariaDB [(none)]> DROP DATABASE phplist_db;
Query OK, 42 rows affected (0.16 sec)

MariaDB [(none)]> DROP DATABASE seedhome_db;
Query OK, 12 rows affected (0.05 sec)

MariaDB [(none)]> DROP DATABASE store_db;
Query OK, 36 rows affected (0.11 sec)

MariaDB [(none)]> DROP DATABASE obi_staging_db;
Query OK, 21 rows affected (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]> 

  1. even after that, it still won't start :'(
[root@opensourceecology etc]# time systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

real    0m4.863s
user    0m0.009s
sys     0m0.007s
[root@opensourceecology etc]# time systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2025-04-18 20:34:47 UTC; 14s ago
  Process: 18459 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
  Process: 18458 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 18423 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 18458 (code=exited, status=0/SUCCESS)

Apr 18 20:34:46 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 18 20:34:46 opensourceecology.org mariadb-prepare-db-dir[18423]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 18 20:34:46 opensourceecology.org mysqld_safe[18458]: 250418 20:34:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 18 20:34:47 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 18 20:34:47 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 18 20:34:47 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

real    0m0.010s
user    0m0.002s
sys     0m0.005s
[root@opensourceecology etc]# 
  1. before I purge those three system-level DBs, I want to confirm they're in our backups
  2. as I feared, it looks like they're missing
[root@opensourceecology dbFail.20250417]# zgrep -E 'CREATE DATABASE' mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | grep 'IF NOT EXISTS' | grep -E '^.{,100}$'
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `3dp_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `cacti_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `d3d_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `fef_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `microfactory_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `mysql` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `obi_staging_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oseforum_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osemain_s_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `osewiki_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `oswh_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phplist_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `seedhome_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `store_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
[root@opensourceecology dbFail.20250417]# 
  1. according to this, information_schema is essentially a cache that gets created & destroyed every time mysql is restarted, so we should be ok to loose that https://stackoverflow.com/questions/15306132/information-schema-error-when-restoring-database-dump
  2. I'm just going to manually dump these three anyway. Or try to
  3. well, I was able to get one of the three to backup
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass information_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_information_schema.$(date "+%Y%m%d_%H%M%S").sql.gz 
mysqldump: Got error: 1044: "Access denied for user 'root'@'localhost' to database 'information_schema'" when using LOCK TABLES

real    0m0.010s
user    0m0.006s
sys     0m0.008s
[root@opensourceecology dbFail.20250417]# 
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass mysql | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_mysql.$(date "+%Y%m%d_%H%M%S").sql.gz

real    0m0.142s
user    0m0.155s
sys     0m0.010s
[root@opensourceecology dbFail.20250417]# 
[root@opensourceecology dbFail.20250417]# time mysqldump -uroot -p$mysqlPass performance_schema | gzip -c > mysqldump-after-corruption-while-in-recovery-mode_performance_schema.$(date "+%Y%m%d_%H%M%S").sql.gz
mysqldump: Got error: 1142: "SELECT,LOCK TABL command denied to user 'root'@'localhost' for table 'cond_instances'" when using LOCK TABLES

real    0m0.009s
user    0m0.009s
sys     0m0.005s
[root@opensourceecology dbFail.20250417]# 
  1. mysql looks good
[root@opensourceecology dbFail.20250417]# du -sh mysqldump-after-corruption-while-in-recovery-mode*
1.3G    mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz
4.0K    mysqldump-after-corruption-while-in-recovery-mode_information_schema.20250418_205054.sql.gz
716K    mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz
4.0K    mysqldump-after-corruption-while-in-recovery-mode_performance_schema.20250418_205157.sql.gz
[root@opensourceecology dbFail.20250417]# 
  1. I'm just going to move this whole db dir out of the way and see if we can start it fresh
[root@opensourceecology ~]# cd /var/lib
[root@opensourceecology lib]# du -sh mysql/
6.5G    mysql/
[root@opensourceecology lib]# ls -lah | grep -i mysql
drwxr-xr-x   4 mysql   mysql   4.0K Apr 18 20:50 mysql
[root@opensourceecology lib]# 
[root@opensourceecology lib]# systemctl stop mariadb
[root@opensourceecology lib]# 
[root@opensourceecology lib]# mv mysql mysql.20250418
[root@opensourceecology lib]# 
[root@opensourceecology lib]# mkdir mysql
[root@opensourceecology lib]# chown mysql:mysql mysql
[root@opensourceecology lib]# chmod 0755 mysql
[root@opensourceecology lib]# 
[root@opensourceecology lib]# ls -lah mysql
total 8.0K
drwxr-xr-x   2 mysql mysql 4.0K Apr 18 20:55 .
drwxr-xr-x. 42 root  root  4.0K Apr 18 20:55 ..
[root@opensourceecology lib]# 
  1. ok, it's started outside recovery mode now
[root@opensourceecology etc]# time systemctl restart mariadb

real    0m3.550s
user    0m0.007s
sys     0m0.012s
[root@opensourceecology etc]# 

250418 20:55:06 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
250418 20:56:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250418 20:56:23 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 21252 ...
250418 20:56:23 InnoDB: The InnoDB memory heap is disabled
250418 20:56:23 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250418 20:56:23 InnoDB: Compressed tables use zlib 1.2.7
250418 20:56:23 InnoDB: Using Linux native AIO
250418 20:56:23 InnoDB: Initializing buffer pool, size = 128.0M
250418 20:56:23 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
250418 20:56:23  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
250418 20:56:23  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
250418 20:56:23  InnoDB: Waiting for the background threads to start
250418 20:56:24 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 0
250418 20:56:24 [Note] Plugin 'FEEDBACK' is disabled.
250418 20:56:24 [Note] Event Scheduler: Loaded 0 events
250418 20:56:24 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
  1. it created all these files
[root@opensourceecology lib]# ls -lah mysql
total 29M
drwxr-xr-x   5 mysql mysql 4.0K Apr 18 20:56 .
drwxr-xr-x. 42 root  root  4.0K Apr 18 20:55 ..
-rw-rw----   1 mysql mysql  16K Apr 18 20:56 aria_log.00000001
-rw-rw----   1 mysql mysql   52 Apr 18 20:56 aria_log_control
-rw-rw----   1 mysql mysql  18M Apr 18 20:56 ibdata1
-rw-rw----   1 mysql mysql 5.0M Apr 18 20:56 ib_logfile0
-rw-rw----   1 mysql mysql 5.0M Apr 18 20:56 ib_logfile1
drwx------   2 mysql mysql 4.0K Apr 18 20:56 mysql
srwxrwxrwx   1 mysql mysql    0 Apr 18 20:56 mysql.sock
drwx------   2 mysql mysql 4.0K Apr 18 20:56 performance_schema
drwx------   2 mysql mysql 4.0K Apr 18 20:56 test
[root@opensourceecology lib]# 
  1. that also would have killed the mysql password; I can't login
[root@opensourceecology lib]# source /root/backups/backup.settings
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
[root@opensourceecology lib]# 
  1. I hacked my way in and set the root password
mysqld_safe --skip-grant-tables --skip-networking &
mysql -u root
use mysql;
update user set password=PASSWORD("new-password") where User='root';
flush privileges;
exit
jobs -l
# kill mysqld_safe
  1. now I can see our three databases, plus one named test
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> 
  1. usually this is where I'd run the mysql hardening script, but let's just drop test manually and restore from backup
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
+--------------------+
| information_schema |
+====================+
| mysql              |
+--------------------+
| performance_schema |
+--------------------+
| test               |
+--------------------+
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> DROP DATABASE test;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye
[root@opensourceecology lib]# 
  1. first let's just restore the 'mysql' database
[root@opensourceecology dbFail.20250417]# zcat mysqldump-after-corruption-while-in-recovery-mode_mysql.20250418_205149.sql.gz | mysql -uroot -p$mysqlPass mysql
[root@opensourceecology dbFail.20250417]# 
  1. that appears to have worked; our users are present now
MariaDB [mysql]> select User from user limit 10;
+------------------+
| User             |
+------------------+
| oseforum_user    |
| cacti_user       |
| 3dp_user         |
| cacti_user       |
| d3d_user         |
| fef_user         |
| microfactory_usr |
| munin_user       |
| obi2_user        |
| obi3_user        |
+------------------+
10 rows in set (0.00 sec)

MariaDB [mysql]> 
  1. I gave it a restart, and ensured it's still working. Great.
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]> 
  1. now let's restore the rest – including even our corrupt databases – and see if it works or breaks
  2. that took about 11.5 minutes to import ~6.8G of data
[root@opensourceecology dbFail.20250417]# time zcat mysqldump-after-corruption-while-in-recovery-mode.20250418_200122.sql.gz | mysql -uroot -p$mysqlPass mysql

real    11m36.530s
user    1m52.944s
sys     0m3.593s
[root@opensourceecology dbFail.20250417]# 

[root@opensourceecology dbFail.20250417]# du -sh /var/lib/mysql
6.8G    /var/lib/mysql
[root@opensourceecology dbFail.20250417]# 

  1. I'm still able to connect, and now I see all our DBs – including the ones it said were corrupt
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 6
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| 3dp_db             |
| cacti_db           |
| d3d_db             |
| fef_db             |
| microfactory_db    |
| mysql              |
| obi_db             |
| obi_staging_db     |
| oseforum_db        |
| osemain_db         |
| osemain_s_db       |
| osewiki_db         |
| oswh_db            |
| performance_schema |
| phplist_db         |
| seedhome_db        |
| store_db           |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> 
  1. woah, I gave it a restart, and it came back fine
[root@opensourceecology lib]# systemctl restart mariadb
[root@opensourceecology lib]# mysql -uroot -p$mysqlPass
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| 3dp_db             |
| cacti_db           |
| d3d_db             |
| fef_db             |
| microfactory_db    |
| mysql              |
| obi_db             |
| obi_staging_db     |
| oseforum_db        |
| osemain_db         |
| osemain_s_db       |
| osewiki_db         |
| oswh_db            |
| performance_schema |
| phplist_db         |
| seedhome_db        |
| store_db           |
+--------------------+
18 rows in set (0.00 sec)

MariaDB [(none)]> 
  1. I guess we fixed it with no data loss?
  2. let's bring up the web servers
[root@opensourceecology lib]# systemctl start httpd
[root@opensourceecology lib]# systemctl start varnish
[root@opensourceecology lib]# systemctl start nginx
[root@opensourceecology lib]# 
  1. the wiki loads now
  2. so does osemain
  3. I'd say we're back in business
  4. I sent an email to Marcin
Hey Marcin,

I think all your sites are back now.

I was able to restore all of your databases from a dump of the database in recovery mode. So nothing needed to be restored from backups.

Please let me know if you see any issues. 
  1. now that Marcin has ssh access on the server again, I wonder if he has permission to execute `restart` – that would be better for him than logging into the hetzner wui and doing hard resets, which likely caused this corruption
  2. at the risk of taking everything down after I just told Marcin that everything is up, I'm going to try it
  3. looks like it won't let him reboot if other users are logged-in
[marcin@opensourceecology ~]$ reboot
User maltfield is logged in on sshd.
User maltfield is logged in on sshd.
Please retry operation after closing inhibitors and logging out other users.
Alternatively, ignore inhibitors and users with 'systemctl reboot -i'.
[marcin@opensourceecology ~]$ systemctl reboot -i
==== AUTHENTICATING FOR org.freedesktop.login1.reboot-multiple-sessions ===
Authentication is required for rebooting the system while other users are logged in.
Multiple identities can be used for authentication:
 1.  maltfield
 2.  crupp
 3.  Tom Griffing (tgriffing)
 4.  jthomas
Choose identity to authenticate as (1-4):
  1. I updated the sudoers command to give marcin *just* access to the reboot command
[root@opensourceecology lib]# visudo
[root@opensourceecology lib]#

[root@opensourceecology lib]# tail /etc/sudoers
# %users  ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users  localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# let marcin reboot the machine gracefully
marcin ALL = NOPASSWD: /sbin/reboot
[root@opensourceecology lib]# 
  1. I couldn't test this on the server without changing marcin's password, so I spun-up a quick DispVM to ensure it *only* gives him access to reboot
  2. it's debian, but sudoers syntax should (hopefully) be the same
user@debian-12-dvm:~$ sudo su -
root@debian-12-dvm:~# adduser marcin --disabled-password --gecos ''
Adding user `marcin' ...
Adding new group `marcin' (1001) ...
Adding new user `marcin' (1001) with group `marcin (1001)' ...
Creating home directory `/home/marcin' ...
Copying files from `/etc/skel' ...
Adding new user `marcin' to supplemental / extra groups `users' ...
Adding user `marcin' to group `users' ...
root@debian-12-dvm:~# 

root@debian-12-dvm:~# visudo
root@debian-12-dvm:~#

root@debian-12-dvm:~# passwd marcin
New password: 
Retype new password: 
passwd: password updated successfully
root@debian-12-dvm:~# sudo su - marcin
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo su -
[sudo] password for marcin: 
Sorry, user marcin is not allowed to execute '/usr/bin/su -' as root on localhost.
marcin@debian-12-dvm:~$

marcin@debian-12-dvm:~$ sudo echo hi
[sudo] password for marcin: 
Sorry, user marcin is not allowed to execute '/usr/bin/echo hi' as root on localhost.
marcin@debian-12-dvm:~$ 

marcin@debian-12-dvm:~$ reboot
-bash: reboot: command not found
marcin@debian-12-dvm:~$ sudo reboot
  1. yeah, that worked. Perfect.
  2. I tested it on hetzner2; it worked too.
[marcin@opensourceecology ~]$ sudo reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
ssh: connect to host opensourceecology.org port 32415: Connection refused
...
  1. I sent Marcin a reply ask him to test reboots via ssh
Sorry the server just went down; that was me testing to make sure your 'marcin' user now has permission to do a proper & safer `sudo reboot` of hetzner2. It does.

> Do things look stable or are the
> risks of recurrence in the near future significant, such that
> I should plan on potential breakage at any time?

Great question. There's a couple things I'd like to implement to prevent this from happening again:

1. Replace both of your disks on hetzner2

2. Give you reboot permission on hetzner2

My best-guess is that the corruption happened because you abruptly shutdown the server. As you know, that's generally not a good idea as it can cause data loss.

But filesystems use journals and databases use pages. They *should* be able to recover from abrupt shutdowns. They wouldn't be very useful if they were so frail as to not be able to recover from something like that...

But in this case, I think it was a "perfect storm" that you caused corruption and it wasn't able to recover from it due to a bug in mariadb. And, because your OS is EOL, we can't update to a newer version of mariadb that *is* able to recover from such a unlucky combination of events.

So, in the meantime, instead of you logging into hetzner's WUI to trigger reboots, I'd prefer if you would ssh into the hetzner2 server and execute

  sudo reboot

Please test this on your computer now to make sure you're setup for it. To ssh into hetzner2, execute this command on your computer:

  ssh -p 32415 marcin@opensourceecology.org

And then at the prompt, execute this command (make sure you type this *after* you've logged into hetzner, or you'll end-up rebooting your own laptop!)

  sudo reboot

The second thing I'd like to do is replace both of your disks on hetzner2. I don't think they caused corruption in this case, but I did discover that they're both screaming that they're going to die soon and asking to be replaced, so I would be a fool not to heed that warning.

Hetzner shouldn't charge us to replace a failing disk, but I'll schedule some downtime for remote hetzner hands to shutdown the machine, then I'll need to format the new drive, add it to the RAID (the mirror of two redundant disks), and update your grub boot partition.

There's some risk in doing this, because you'll be running on one non-redundant disk (a disk which is screaming at us saying it's going to die within 24 hours) while the RAID is re-building. But, of course, there's risk in not doing it..

Please confirm that you can now reboot hetzner2 via ssh.


Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972  644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net

On 4/18/25 16:39, Marcin Jakubowski wrote:
> Thats excellent, thabk you, looks good. Do things look stable or are the
> risks of recurrence in the near future significant, such that I should plan
> on potential breakage at any time? Regarding the full migration, how many
> more hours/days of provisioning do tou still expwct to need? 
  1. I created an article for the CHG to replace the first disk on hetzner2 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sda
    1. I wonder if I can figure out which one grub uses and replace that one second..
  2. from my log yesterday, here's our two drive's serial numbers
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]# 

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]# 
  1. fuck; looks like neither is referenced in /boot/
[root@opensourceecology grub2]# grep -irl '154410FA4520' /boot
[root@opensourceecology grub2]# grep -irl '154410FA336C' /boot
[root@opensourceecology grub2]# 
  1. the steps to setup grub are actually quite simple, according to the hetzner docs https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
    1. it says if we're doing it on the booted system, then we just need to run `grub-install /dev/sdX`
  2. it has additional instructions for grub1. And, uh, looks like we have grub1, grub2, *and* an efi dir in /boot
[root@opensourceecology grub2]# ls /boot
config-3.10.0-1127.el7.x86_64                            initramfs-3.10.0-1160.119.1.el7.x86_64kdump.img  System.map-3.10.0-1127.el7.x86_64
config-3.10.0-1160.119.1.el7.x86_64                      initramfs-3.10.0-327.18.2.el7.x86_64.img         System.map-3.10.0-1160.119.1.el7.x86_64
config-3.10.0-327.18.2.el7.x86_64                        initramfs-3.10.0-514.26.2.el7.x86_64.img         System.map-3.10.0-327.18.2.el7.x86_64
config-3.10.0-514.26.2.el7.x86_64                        initramfs-3.10.0-693.2.2.el7.x86_64.img          System.map-3.10.0-514.26.2.el7.x86_64
config-3.10.0-693.2.2.el7.x86_64                         initramfs-3.10.0-693.2.2.el7.x86_64kdump.img     System.map-3.10.0-693.2.2.el7.x86_64
efi                                                      initrd-plymouth.img                              vmlinuz-0-rescue-34946d7b5edb0946bfb52c0f6cae67af
grub                                                     lost+found                                       vmlinuz-3.10.0-1127.el7.x86_64
grub2                                                    symvers-3.10.0-1127.el7.x86_64.gz                vmlinuz-3.10.0-1160.119.1.el7.x86_64
initramfs-0-rescue-34946d7b5edb0946bfb52c0f6cae67af.img  symvers-3.10.0-1160.119.1.el7.x86_64.gz          vmlinuz-3.10.0-327.18.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64.img                     symvers-3.10.0-327.18.2.el7.x86_64.gz            vmlinuz-3.10.0-514.26.2.el7.x86_64
initramfs-3.10.0-1127.el7.x86_64kdump.img                symvers-3.10.0-514.26.2.el7.x86_64.gz            vmlinuz-3.10.0-693.2.2.el7.x86_64
initramfs-3.10.0-1160.119.1.el7.x86_64.img               symvers-3.10.0-693.2.2.el7.x86_64.gz
[root@opensourceecology grub2]# 
  1. I'm thinking we should actually just tell hetzner to do a hot swap while the system is on, so we can do this "easy install" of grub without risking the system not coming-up after they removed the drive
  2. oh, the efi dir is empty, so I'm thinking we're using grub2
[root@opensourceecology boot]# find efi
efi
efi/EFI
efi/EFI/centos
[root@opensourceecology boot]# 
  1. yeah, the grub dir just has one file in it?
[root@opensourceecology boot]# ls -lah grub
total 10K
drwxr-xr-x. 2 root root 1.0K Apr 11  2016 .
dr-xr-xr-x. 6 root root 5.0K Jul 26  2024 ..
-rw-r--r--  1 root root 1.4K Nov 15  2011 splash.xpm.gz
[root@opensourceecology boot]#
  1. grub2 looks most sane
[root@opensourceecology boot]# ls -lah grub2
total 52K
drwx------. 5 root root 1.0K Jul 26  2024 .
dr-xr-xr-x. 6 root root 5.0K Jul 26  2024 ..
drwxr-xr-x. 2 root root 1.0K Dec 15  2015 fonts
-rw-r--r--  1 root root 7.8K Jul 26  2024 grub.cfg
-rw-r--r--  1 root root 5.3K Jun  1  2016 grub.cfg.1499616907.rpmsave
-rw-r--r--  1 root root 6.1K Jul  9  2017 grub.cfg.1506097734.rpmsave
-rw-r--r--  1 root root 7.0K Sep 22  2017 grub.cfg.1588589453.rpmsave
-rw-r--r--. 1 root root 1.0K Jul 26  2024 grubenv
drwxr-xr-x. 2 root root 9.0K May 31  2016 i386-pc
drwxr-xr-x. 2 root root 1.0K May 31  2016 locale
[root@opensourceecology boot]# 
  1. it looks like it's referencing the raid, not the drive
### BEGIN /etc/grub.d/10_linux ###
menuentry 'CentOS Linux (3.10.0-1160.119.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-327.13.1.el7.x86_64-advanced-af18bd25-f715-4003-b055-170a07591c60' {
		load_video
		set gfxpayload=keep
		insmod gzio
		insmod part_msdos
		insmod part_msdos
		insmod diskfilter
		insmod mdraid1x
		insmod ext2
		set root='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'
		if [ x$feature_platform_search_hint = xy ]; then
		  search --no-floppy --fs-uuid --set=root --hint='mduuid/7141f546f6e3f5962a80bdc64c4f6d4a'  9f6b5264-da8c-406d-a444-45e3fb3aeb26
		else
		  search --no-floppy --fs-uuid --set=root 9f6b5264-da8c-406d-a444-45e3fb3aeb26
		fi
		linux16 /vmlinuz-3.10.0-1160.119.1.el7.x86_64 root=/dev/md/2 ro nomodeset rd.auto=1 crashkernel=auto LANG=en_US.UTF-8
		initrd16 /initramfs-3.10.0-1160.119.1.el7.x86_64.img
}
  1. right, so if I understand this correctly: we're not updating grub. We're using 'grub-install' to copy our grub config *to* the drive. that's easier and less concerning than I thought.
  2. well, since I can't see any good reason to pick one drive or the other to replace first, I'm going to have them replace /dev/sdb first. Just because 'sda' seems like it would be primary. I know it's probably not, but, anyway..
  3. that means we'll replace Crucial_CT250MX200SSD1_154410FA4520 first; I created another wiki entry for that https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_replace_hetzner2_sdb
  4. Marcin sent me an email confirming that he's able to restart hetzner2 with `sudo reboot`. I asked him to use this in the future if he needs to reboot it again.
  5. the disk is getting pretty full, but I'm going to leave these files in /var/tmp/ for at least a few days, to make sure we don't actually need to restore from a backup again
[root@opensourceecology ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  150G   38G  80% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mv /var/lib/mysql.20250418 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# 

Thr Apr 17, 2025

  1. Marcin sent me an email last night (and again this morning) asking why the wiki is down
  2. I hadn't touched ose infra since 6 days ago
  3. the wiki is still on hetzner2, which is on EOL Cent, so I'm not terribly surprised it's falling apart.
  4. I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
  5. anyway, let's check what happened to the wiki on hetzner2
  6. it's a 500 error complaining about the db
user@disp9871:~$ curl -iL wiki.opensourceecology.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 17 Apr 2025 20:17:52 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://wiki.opensourceecology.org/
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Thu, 17 Apr 2025 20:17:54 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 976
Connection: keep-alive
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Varnish: 434054
Age: 0
Via: 1.1 varnish-v4

<h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br />
<small>Note that their indexes of our content may be out of date.</small>
</div>
<form method="get" action="//www.google.com/search" id="googlesearch">
	<input type="hidden" name="domains" value="https://wiki.opensourceecology.org" />
	<input type="hidden" name="num" value="50" />
	<input type="hidden" name="ie" value="UTF-8" />
	<input type="hidden" name="oe" value="UTF-8" />
	<input type="text" name="q" size="31" maxlength="255" value="" />
	<input type="submit" name="btnG" value="Search" />
	<p>
		<label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label>
		<label><input type="radio" name="sitesearch" value="" />WWW</label>
	</p>
user@disp9871:~$ 
  1. disk is fine
[root@opensourceecology ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G   96G   92G  52% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology ~]# 
  1. there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
  2. there's also no new logs in the mariadb error log when I hit the site in real-time
  3. well, the db isn't running
[root@opensourceecology ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago
  Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
  Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 1226 (code=exited, status=0/SUCCESS)

Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology ~]# 
  1. error logs aren't very helpful
[root@opensourceecology log]# journalctl -fu mariadb
-- Logs begin at Thu 2025-04-17 17:38:59 UTC. --
Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server...
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server.
Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state.
Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
  1. if I try to restart it manually, nothing gets put in the journal logs, but there's a bunch to the actual log file that the journal log mentions (damn systemd)
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]# 
  1. here's the log that pops-up when we try a restart
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ...
250417 20:24:31 InnoDB: The InnoDB memory heap is disabled
250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7
250417 20:24:31 InnoDB: Using Linux native AIO
250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M
250417 20:24:31 InnoDB: Completed initialization of buffer pool
250417 20:24:31 InnoDB: highest supported file format is Barracuda.
250417 20:24:31  InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 20:24:31  InnoDB: Starting final batch to recover 11 pages from redo log
250417 20:24:31  InnoDB: Waiting for the background threads to start
250417 20:24:31  InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822
InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
250417 20:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.68-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975]
sigaction.c:0(__restore_rt)[0x7f6a294c9630]
:0(__GI_raise)[0x7f6a27bf0387]
:0(__GI_abort)[0x7f6a27bf1a78]
/usr/libexec/mysqld(+0x63845f)[0x563a1beae45f]
/usr/libexec/mysqld(+0x638f69)[0x563a1beaef69]
/usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504]
/usr/libexec/mysqld(+0x730487)[0x563a1bfa6487]
/usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d]
/usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6]
pthread_create.c:0(start_thread)[0x7f6a294c1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
  1. google points to this https://bugs.mysql.com/bug.php?id=61516
    1. they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
  2. reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
  3. before reading any more, I'm going to immediately make a local copy of our most-recent backups
  4. looks like we have a backup from 13 hours ago and one from 27 hours ago
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$ 

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x  2 root   root   4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r--  1 b2user root    21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x  2 root   root   4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r--  1 b2user root    22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
  1. this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
    1. it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
    2. other warnings suggest solving the underlying issue: why did the data become corrupt?
    3. well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week because it keeps breaking since some months ago (it's EOL and not worth debugging)
    4. but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
  2. I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
Hey Marcin,

Your database is corrupt and won't start.

Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8.

 * https://bugs.mysql.com/bug.php?id=61516

I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup.

This is not going to be an easy or fast fix, sorry. 
  1. the backups of the backups finished
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
 22,975,631,986 100%  139.63MB/s    0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
 21,566,407,634 100%  103.43MB/s    0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes  received 54 bytes  125,324,653.70 bytes/sec
total size is 44,542,039,620  speedup is 1.00
[root@opensourceecology ~]# 
[root@opensourceecology ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  138G   50G  74% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology ~]# 
  1. I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]# 
  1. I should also make a backup of /var/lib/mysql
  2. I'm going to create a dif for all of this
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417
[root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# 

[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041
[root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/
[root@opensourceecology ~]# 

[root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt
[root@opensourceecology ~]# 

[root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt 
2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version

		 * https://bugs.mysql.com/bug.php?id=61516

		Anyway, I'm creating this folder to store some backups before we make things worse.
[root@opensourceecology ~]# 
  1. aaaand I added a copy of /var/lib/mysql/
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d")
sending incremental file list
created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417
mysql/
mysql/aria_log.00000001
		 16,384 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=707/709)
...
mysql/store_db/wp_woocommerce_tax_rate_locations.frm
		  8,714 100%    9.26kB/s    0:00:00 (xfr#689, to-chk=1/709)
mysql/store_db/wp_woocommerce_tax_rates.frm
		 13,128 100%   13.95kB/s    0:00:00 (xfr#690, to-chk=0/709)

sent 7,384,914,964 bytes  received 13,343 bytes  114,495,012.51 bytes/sec
total size is 7,383,062,830  speedup is 1.00
[root@opensourceecology ~]# 
  1. another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
  1. Unfortunately, a lot of the links for how to fix this are now dead
    1. https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
    2. https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
    3. https://forums.mysql.com/read.php?22,603093,604631#msg-604631
    4. https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
  2. we're running 5.6, so it should be this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
    1. but note that redirects to 8.6 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
    2. ah, so does 1.1 – apparently anything it doesn't like just reidrects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
  3. this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
    1. it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
    2. good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
      1. but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
    3. it says that dumps from recover mode of 1 or 2 or 3 are safe, and only the page is corrupt
      1. here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.

One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.

When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.

All the InnoDB disk data structures within a MySQL instance share the same page size.

See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
  1. I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
  2. ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
  3. but first let's check SMART
  4. oh, fuck, my notes on this are on the wiki. Of course.
  5. arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
  6. fail
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]# 
  1. luckily the yum servers for this EOL OS are still online, and I could install it
[root@opensourceecology ~]# yum install smartmontools
...
Total download size: 546 k
Installed size: 2.0 M
Is this ok [y/d/N]: y
Downloading packages:
smartmontools-7.0-2.el7.x86_64.rpm                                                                                                              | 546 kB  00:00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : 1:smartmontools-7.0-2.el7.x86_64                                                                                                                    1/1 
  Verifying  : 1:smartmontools-7.0-2.el7.x86_64                                                                                                                    1/1 

Installed:
  smartmontools.x86_64 1:7.0-2.el7                                                                                                                                     

Complete!
[root@opensourceecology ~]# 
  1. better
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]# 
  1. well this is terrifying; it says both our disks are gonna fail within 24 hours
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]# 
  1. compare that to hetnzer3, which says all is good
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # 
  1. I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
  2. oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
    1. well, it lists two options
      1. Free Replacement drive nearly new or used and tested; depends on what is in stock.
      2. At cost Replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
    2. we were given an option if we should hot swap while the system is on or shutdown. I'm going to say shutdown. That'll be simpler from the OS side, I think
    3. dang, it says they'll swap the drive within 2-4 hours.
  3. I've never done this before, but it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I choose them to replace? Can I see which one is more fucked than the other?
  4. hetzner provides 4 docs for assistance on this
    1. https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
    2. https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
    3. https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
    4. https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
  5. that first doc says to run the command we just ran
  6. hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
  7. ok, the docs say we can get more info with -A
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

START OF READ SMART DATA SECTION
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]# 

[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]# 
  1. so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
  2. well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
  3. yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
  4. in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
  5. the second hetnzer docs refer to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzer2 (even if I can't update the wiki right now with this docs)
  6. wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

START OF SMART DATA SECTION
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    6%
Data Units Read:                    152.358.379 [78,0 TB]
Data Units Written:                 52.125.092 [26,6 TB]
Host Read Commands:                 6.873.372.480
Host Write Commands:                1.362.559.127
Controller Busy Time:               22.226
Power Cycles:                       28
Power On Hours:                     17.245
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      159
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

START OF SMART DATA SECTION
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    7%
Data Units Read:                    140.811.605 [72,0 TB]
Data Units Written:                 56.604.901 [28,9 TB]
Host Read Commands:                 1.304.073.899
Host Write Commands:                1.364.668.115
Controller Busy Time:               21.180
Power Cycles:                       23
Power On Hours:                     15.565
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      149
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               45 Celsius

root@hetzner3 ~ # 
  1. that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
  2. the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
  3. this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
[root@opensourceecology ~]# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
	  523712 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda3[0] sdb3[1]
	  209984640 blocks super 1.2 [2/2] [UU]
	  bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
	  33521664 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
[root@opensourceecology ~]# 
  1. ah crap, the process to bring the new drive back into the RAID is not-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
    1. first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we loose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existance. Sadly, I think I'm going to postpone this until after we get the sites back-up.
  2. the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]# 

[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]# 
  1. I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025

Use smartctl -X to abort test.
  1. I also kicked-off a long test, which I can check tomorrow
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]# 
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025

Use smartctl -X to abort test.
[root@opensourceecology ~]# 
  1. ok, then we have the filesystem. it looks like /var/lib/msyql/ lives on '/' which is /dev/md2
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        197G  145G   43G  78% /
[root@opensourceecology ~]# 

[root@opensourceecology ~]# fdisk -l /dev/md2

Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

[root@opensourceecology ~]# 

[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
md2    9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]# 
  1. it won't let me check the filesystem while it's mounted
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]# 
  1. it probably should be happening on-boot, but I couldn't find it in dmesg
[root@opensourceecology ~]# dmesg | grep -i check
[    0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]# 
  1. ok, instead we can just use tune2fs to get the info on the last check that was run
  2. looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              13131776
Block count:              52496160
Reserved block count:     2624808
Free blocks:              26575102
Free inodes:              12417672
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1011
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue May 31 06:01:12 2016
Last mount time:          Thu Apr 17 17:39:11 2025
Last write time:          Thu Apr 17 17:39:00 2025
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Apr 17 17:39:00 2025
Check interval:           0 (<none>)
Lifetime writes:          124 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup:           inode blocks
[root@opensourceecology ~]# 
  1. both of the filesystems (/ and /boot) look fine
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on:          /boot
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Last mount time:          Thu Apr 17 17:39:11 2025
Mount count:              46
Maximum mount count:      -1
Last checked:             Tue May 31 06:01:07 2016
[root@opensourceecology ~]#

[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on:          /
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Last mount time:          Thu Apr 17 17:39:11 2025
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
  1. well, so far I couldn't find any signs of corruption on the disk/fs level
  2. back to the db, I set the recovery option in the my.cnf file
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]# 

[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]# 

[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
> 
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
> 
[root@opensourceecology etc]# 
  1. it didn't come-up
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]# 
  1. I tried changing it to restore level 2; this time it got stuck "waiting for the background threads"
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49  InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49  InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49  InnoDB: Waiting for the background threads to start
250417 22:32:50  InnoDB: Waiting for the background threads to start
250417 22:32:51  InnoDB: Waiting for the background threads to start
250417 22:32:52  InnoDB: Waiting for the background threads to start
250417 22:32:53  InnoDB: Waiting for the background threads to start
250417 22:32:54  InnoDB: Waiting for the background threads to start
250417 22:32:55  InnoDB: Waiting for the background threads to start
250417 22:32:56  InnoDB: Waiting for the background threads to start
250417 22:32:57  InnoDB: Waiting for the background threads to start
250417 22:32:58  InnoDB: Waiting for the background threads to start
...
  1. it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.

Sun Apr 11, 2025

  1. let's get Catarina that broken staging site for osemain on hetzner3
  2. Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
  3. so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
  4. I logged-into hetzner https://robot.hetzner.com/server
  5. I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
  6. I clicked on the server for "hetzner3" and the tab "IPs".
    1. Then I clicked on "Order additional IPs / Nets"
    2. I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
    3. it required me to enter a reason (IPv4 is scarce) to which I wrote:
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
    1. and I clicked "Apply for IP/subnet in obligation"
    2. I got a message; looks like it needs human approval
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
  1. I typed an email to Marcin and Catarina to notify them of this order
Hey Marcin,

As authorized on our last call, I ordered an additional IPv4 address for your hetzner account.

IPv4 addresses are scarce, and it appears that they need to approve it manually.

The cost is €1.70 per month + € 4.90 once-off setup.

This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.

Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.


Thank you,

Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972  644C FE1B 8449 4E64 0D41

Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
  1. before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
  2. I refreshed the hetzner wui, and now I see the new IP
  3. ...
  4. following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
    1. I sent them both emails asking them to confirm access
  5. I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
  6. update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
  7. he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
  8. I found that he had previously documented scattered articles about backups, luks, veracrypt, pgp, cybersec general, etc in a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be more easily able to do this in the future
  9. I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
  10. Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
  11. ...
  12. I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
  13. gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error lgo
  14. I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
  15. there's too many layers on this server; I checked the logs
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
  1. ok, so it's modsecurity?
  2. gah, that's a lot of files to review
[root@opensourceecology httpd]# find .  |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]# 
  1. looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]# 
  1. it's 13107200
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf:    SecRequestBodyLimit 13107200
conf.d/mod_security.conf:    SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]# 
  1. docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
  2. so 13107200 / 1024 / 1024 = 12.5 MB.
  3. jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
  4. ...
  5. so Marcin is stressing urgancy to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
  6. I didn't want to do this site before the other less-priority ones, but it's just a sandbox
  7. I realized I never made a CHG file for osemain
  8. looks like I first did a snapshot Jan 31https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
  9. ugh, I just said I was "following the same guide as with the other sites"
    1. I was hoping to know which one to CHG to copy-from
    2. I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
  10. ok, I made a first draft of our osemain CHG to migrate to hetnzer3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3
  11. oh, crap, I'm going to remove