Maltfield Log/2025 Q2
My work log from the second quarter of the year 2025. I intentionally made this verbose to make future admin's work easier when troubleshooting. The more keywords, error messages, etc that are listed in this log, the more helpful it will be for the future OSE Sysadmin.
See Also
Thu Apr 17, 2025
- Marcin sent me an email last night (and again this morning) asking why the wiki is down
- I hadn't touched OSE infra in 6 days
- the wiki is still on hetzner2, which is on EOL Cent, so I'm not terribly surprised it's falling apart.
- I first warned Marcin about this many years ago, and hopefully the migration to hetzner3 will be finished before the end of this year
- anyway, let's check what happened to the wiki on hetzner2
- it's a 500 error complaining about the db
user@disp9871:~$ curl -iL wiki.opensourceecology.org HTTP/1.1 301 Moved Permanently Server: nginx Date: Thu, 17 Apr 2025 20:17:52 GMT Content-Type: text/html Content-Length: 162 Connection: keep-alive Location: https://wiki.opensourceecology.org/ X-Frame-Options: SAMEORIGIN X-XSS-Protection: 1; mode=block HTTP/1.1 500 Internal Server Error Server: nginx Date: Thu, 17 Apr 2025 20:17:54 GMT Content-Type: text/html; charset=UTF-8 Content-Length: 976 Connection: keep-alive X-Content-Type-Options: nosniff X-XSS-Protection: 1; mode=block X-Varnish: 434054 Age: 0 Via: 1.1 varnish-v4 <h1>Sorry! This site is experiencing technical difficulties.</h1><p>Try waiting a few minutes and reloading.</p><p><small>(Cannot access the database)</small></p><hr /><div style="margin: 1.5em">You can try searching via Google in the meantime.<br /> <small>Note that their indexes of our content may be out of date.</small> </div> <form method="get" action="//www.google.com/search" id="googlesearch"> <input type="hidden" name="domains" value="https://wiki.opensourceecology.org" /> <input type="hidden" name="num" value="50" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="hidden" name="oe" value="UTF-8" /> <input type="text" name="q" size="31" maxlength="255" value="" /> <input type="submit" name="btnG" value="Search" /> <p> <label><input type="radio" name="sitesearch" value="https://wiki.opensourceecology.org" checked="checked" />Open Source Ecology</label> <label><input type="radio" name="sitesearch" value="" />WWW</label> </p> user@disp9871:~$
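When the response body is this noisy, the status chain on its own is usually enough to see where the redirect ends up. A minimal sketch, assuming `curl` and `awk` are available; `http_status_chain` is a hypothetical helper name, not part of any OSE tooling:

```shell
# Hypothetical helper: print only the status code of every response in a
# redirect chain, reading raw `curl -i` / `curl -I` output on stdin.
http_status_chain() {
  # Status lines look like "HTTP/1.1 301 Moved Permanently"; field 2 is the code.
  awk '/^HTTP\// { print $2 }'
}

# Usage (not run here):
#   curl -sIL wiki.opensourceecology.org | http_status_chain
```

For the request above this would print 301 and then 500, one per line.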
- disk is fine
[root@opensourceecology ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G   96G   92G  52% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology ~]#
- there's no new logs in the apache error log when I hit the site in real-time (bypassing the cache)
- there's also no new logs in the mariadb error log when I hit the site in real-time
- well, the db isn't running
[root@opensourceecology ~]# systemctl status mariadb ● mariadb.service - MariaDB database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Thu 2025-04-17 17:39:24 UTC; 2h 42min ago Process: 1227 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE) Process: 1226 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS) Process: 1103 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS) Main PID: 1226 (code=exited, status=0/SUCCESS) Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server... Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done. Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-p...db-dir. Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'. Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1 Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server. Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state. Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed. Hint: Some lines were ellipsized, use -l to show in full. [root@opensourceecology ~]#
- error logs aren't very helpful
[root@opensourceecology log]# journalctl -fu mariadb -- Logs begin at Thu 2025-04-17 17:38:59 UTC. -- Apr 17 17:39:22 opensourceecology.org systemd[1]: Starting MariaDB database server... Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done. Apr 17 17:39:22 opensourceecology.org mariadb-prepare-db-dir[1103]: If this is not the case, make sure the /var/lib/mysql is empty before running mariadb-prepare-db-dir. Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'. Apr 17 17:39:22 opensourceecology.org mysqld_safe[1226]: 250417 17:39:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service: control process exited, code=exited status=1 Apr 17 17:39:24 opensourceecology.org systemd[1]: Failed to start MariaDB database server. Apr 17 17:39:24 opensourceecology.org systemd[1]: Unit mariadb.service entered failed state. Apr 17 17:39:24 opensourceecology.org systemd[1]: mariadb.service failed.
- if I try to restart it manually, nothing gets put in the journal logs, but a bunch gets written to the actual log file that the journal log mentions (damn systemd)
[root@opensourceecology ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology ~]#
- here's the log that pops-up when we try a restart
250417 20:24:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 250417 20:24:31 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 10583 ... 250417 20:24:31 InnoDB: The InnoDB memory heap is disabled 250417 20:24:31 InnoDB: Mutexes and rw_locks use GCC atomic builtins 250417 20:24:31 InnoDB: Compressed tables use zlib 1.2.7 250417 20:24:31 InnoDB: Using Linux native AIO 250417 20:24:31 InnoDB: Initializing buffer pool, size = 128.0M 250417 20:24:31 InnoDB: Completed initialization of buffer pool 250417 20:24:31 InnoDB: highest supported file format is Barracuda. 250417 20:24:31 InnoDB: Starting crash recovery from checkpoint LSN=625883462907 InnoDB: Restoring possible half-written data pages from the doublewrite buffer... 250417 20:24:31 InnoDB: Starting final batch to recover 11 pages from redo log 250417 20:24:31 InnoDB: Waiting for the background threads to start 250417 20:24:31 InnoDB: Assertion failure in thread 140093400303360 in file trx0purge.c line 822 InnoDB: Failing assertion: purge_sys->purge_trx_no <= purge_sys->rseg->last_trx_no InnoDB: We intentionally generate a memory trap. InnoDB: Submit a detailed bug report to https://jira.mariadb.org/ InnoDB: If you get repeated assertion failures or crashes, even InnoDB: immediately after the mysqld startup, there may be InnoDB: corruption in the InnoDB tablespace. Please refer to InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html InnoDB: about forcing recovery. 250417 20:24:31 [ERROR] mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. 
To report this bug, see http://kb.askmonty.org/en/reporting-bugs We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. Server version: 5.5.68-MariaDB key_buffer_size=134217728 read_buffer_size=131072 max_used_connections=0 max_threads=153 thread_count=0 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 466719 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Thread pointer: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... stack_bottom = 0x0 thread_stack 0x48000 /usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x563a1c105cad] /usr/libexec/mysqld(handle_fatal_signal+0x515)[0x563a1bd19975] sigaction.c:0(__restore_rt)[0x7f6a294c9630] :0(__GI_raise)[0x7f6a27bf0387] :0(__GI_abort)[0x7f6a27bf1a78] /usr/libexec/mysqld(+0x63845f)[0x563a1beae45f] /usr/libexec/mysqld(+0x638f69)[0x563a1beaef69] /usr/libexec/mysqld(+0x73b504)[0x563a1bfb1504] /usr/libexec/mysqld(+0x730487)[0x563a1bfa6487] /usr/libexec/mysqld(+0x63b17d)[0x563a1beb117d] /usr/libexec/mysqld(+0x62f0f6)[0x563a1bea50f6] pthread_create.c:0(start_thread)[0x7f6a294c1ea5] /lib64/libc.so.6(clone+0x6d)[0x7f6a27cb8b0d] The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains information that should help you find out what is causing the crash. 250417 20:24:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
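Most of that log is generic crash boilerplate; the two InnoDB assertion lines are the part worth searching for. A small sketch for pulling them out, assuming the log path that mysqld_safe reported above; `innodb_assertions` is a hypothetical helper name:

```shell
# Hypothetical helper: keep only the InnoDB assertion lines from a MariaDB
# error log read on stdin, so they can be pasted into a search or bug report.
innodb_assertions() {
  grep -E 'InnoDB: (Assertion failure|Failing assertion)'
}

# Usage (not run here):
#   innodb_assertions < /var/log/mariadb/mariadb.log | tail -n 2
```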
- google points to this https://bugs.mysql.com/bug.php?id=61516
- they say it could be a bug that might be fixed in v5.7. We're using 5.5.68. hetzner3 uses 5.8.
- reddit says we're fucked and should restore from backup https://old.reddit.com/r/mysql/comments/d3nkc7/innodb_assertion_failure_in_thread_4560_in_file/
- before reading any more, I'm going to immediately make a local copy of our most-recent backups
- looks like we have a backup from 13 hours ago and one from 37 hours ago
[maltfield@opensourceecology ~]$ date
Thu Apr 17 20:36:56 UTC 2025
[maltfield@opensourceecology ~]$

[root@opensourceecology ~]# ls -lah /home/b2user/sync
total 21G
drwxr-xr-x  2 root   root   4.0K Apr 17 07:49 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r--  1 b2user root    21G Apr 17 07:48 daily_hetzner2_20250417_072001.tar.gpg
[root@opensourceecology ~]# ls -lah /home/b2user/sync.old/
total 22G
drwxr-xr-x  2 root   root   4.0K Apr 16 07:52 .
drwx------ 10 b2user b2user 4.0K Apr 17 07:20 ..
-rw-r--r--  1 b2user root    22G Apr 16 07:52 daily_hetzner2_20250416_072001.tar.gpg
[root@opensourceecology ~]#
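The age of each backup can be worked out from the stamp in its file name rather than by eye. A minimal sketch, assuming GNU `date` (for `-d`) and assuming the `YYYYMMDD_HHMMSS` stamp in the name is UTC and marks when the backup started; `backup_age_hours` is a hypothetical helper:

```shell
# Hypothetical helper: whole hours between the stamp embedded in a backup
# file name (..._YYYYMMDD_HHMMSS.tar.gpg, assumed UTC) and a reference time.
# $1 = backup file name or path, $2 = reference time (default: now).
backup_age_hours() (
  ref="${2:-now}"
  # 20250416_072001 -> "2025-04-16 07:20:01"
  iso="$(basename "$1" | sed -E 's/^.*_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})\..*$/\1-\2-\3 \4:\5:\6/')"
  taken_s="$(date -u -d "$iso" +%s)" || exit 1
  ref_s="$(date -u -d "$ref" +%s)" || exit 1
  echo $(( (ref_s - taken_s) / 3600 ))
)

# Usage (not run here):
#   backup_age_hours /home/b2user/sync.old/daily_hetzner2_20250416_072001.tar.gpg
```

Against the `date` output above, the two files come out at 13 and 37 hours.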
- this SE answer is helpful https://serverfault.com/questions/592793/mysql-crashed-and-wont-start-up
- it says we can force the db to start (in "recovery mode") and then try to figure out which table is corrupted. Then we might be able to backup more-recent data from the not-corrupt tables and only recover the fucked table
- other warnings suggest solving the underlying issue: why did the data become corrupt?
- well, we know Marcin has been hard-resetting the server (via the hetzner wui) about every week, because it has kept breaking for some months now (it's EOL and not worth debugging)
- but it's also possible we have a worse issue, like a disk failing. We do have RAID1 tho, so idk. Still, it would be wise to check the SMART data and RAID logs and filesystem for corruption
- I sent a quick status update to Marcin so he knows the severity of the issue and that this isn't going to be fixed soon
Hey Marcin, Your database is corrupt and won't start. Quick internet search for the error messages suggests this could be a bug that's been fixed in mariadb 5.7. You're using 5.6 and can't upgrade because your OS is EOL. hetnzer3 is running 5.8. * https://bugs.mysql.com/bug.php?id=61516 I'm looking into seeing what is corrupt, what isn't corrupt, and if we can restore from backup. This is not going to be an easy or fast fix, sorry.
- the backups of the backups finished
[root@opensourceecology ~]# rsync -av --progress /home/b2user/sync*/* /var/tmp/
sending incremental file list
daily_hetzner2_20250416_072001.tar.gpg
 22,975,631,986 100%  139.63MB/s    0:02:36 (xfr#1, to-chk=1/2)
daily_hetzner2_20250417_072001.tar.gpg
 21,566,407,634 100%  103.43MB/s    0:03:18 (xfr#2, to-chk=0/2)

sent 44,552,914,338 bytes  received 54 bytes  125,324,653.70 bytes/sec
total size is 44,542,039,620  speedup is 1.00
[root@opensourceecology ~]#
[root@opensourceecology ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  138G   50G  74% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology ~]#
- I'm also going to take down the webservers, so that they can't fuck-up the database worse, if we do start it in some recovery mode
[root@opensourceecology ~]# systemctl stop httpd
[root@opensourceecology ~]# systemctl stop varnish
[root@opensourceecology ~]# systemctl stop nginx
[root@opensourceecology ~]#
- I should also make a backup of /var/lib/mysql
- I'm going to create a dir for all of this
[root@opensourceecology ~]# mkdir /var/tmp/dbFail.20250417 [root@opensourceecology ~]# chown root:root /var/tmp/dbFail.20250417/ [root@opensourceecology ~]# chmod 0700 /var/tmp/dbFail.20250417/ [root@opensourceecology ~]# [root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041 [root@opensourceecology ~]# mv /var/tmp/daily_hetzner2_2025041* /var/tmp/dbFail.20250417/ [root@opensourceecology ~]# [root@opensourceecology ~]# vim /var/tmp/dbFail.20250417/info.txt [root@opensourceecology ~]# [root@opensourceecology ~]# cat /var/tmp/dbFail.20250417/info.txt 2025-04-17: Marcin emailed me last night saying the wiki was down with a db error. Today I tried to start it, but it refues to come-up. Looks like it's preventing itself from starting because it realizes something is corrupt and starting it would make things worse. Internet says maybe this was fixed in a newer version; we can't upgrade because Cent is EOL. Hetzner3 has the newer version * https://bugs.mysql.com/bug.php?id=61516 Anyway, I'm creating this folder to store some backups before we make things worse. [root@opensourceecology ~]#
- aaaand I added a copy of /var/lib/mysql/
[root@opensourceecology ~]# rsync -av --progress /var/lib/mysql /var/tmp/dbFail.20250417/var-lib-mysql.$(date "+%Y%m%d") sending incremental file list created directory /var/tmp/dbFail.20250417/var-lib-mysql.20250417 mysql/ mysql/aria_log.00000001 16,384 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=707/709) ... mysql/store_db/wp_woocommerce_tax_rate_locations.frm 8,714 100% 9.26kB/s 0:00:00 (xfr#689, to-chk=1/709) mysql/store_db/wp_woocommerce_tax_rates.frm 13,128 100% 13.95kB/s 0:00:00 (xfr#690, to-chk=0/709) sent 7,384,914,964 bytes received 13,343 bytes 114,495,012.51 bytes/sec total size is 7,383,062,830 speedup is 1.00 [root@opensourceecology ~]#
- another important note: apparently we can keep increasing the value of innodb_force_recovery until it starts, but anything >3 could corrupt the data worse https://dba.stackexchange.com/q/241714
from Marko, MariaDB Innodb lead: MDEV-15370 was a bug when ugprading to 10.3, caused by MDEV-12288. Actually upgrades can still fail (MDEV-15912) if a slow shutdown of the old server was not made. Because the scenario does not involve upgrading to 10.3 or later, I am afraid that the user witnessed some kind of undo log corruption. Starting up with innodb_force_recovery=3 might allow dumping all data. If that crashes, then try innodb_force_recovery=5, but be aware that anything >3 may corrupt the database further, and therefore you should not use the database for anything else than mysqldump
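To keep myself from fat-fingering a dangerous value, the setting could be generated by a helper that refuses anything above 3. This is only a sketch: `force_recovery_cnf` is a hypothetical name, and the drop-in path in the usage comment is an assumption (check the `!includedir` line in /etc/my.cnf first).

```shell
# Hypothetical helper: emit a my.cnf fragment for a given
# innodb_force_recovery level, refusing anything above 3 because
# higher levels may corrupt the data further.
force_recovery_cnf() {
  case "$1" in
    1|2|3)
      printf '[mysqld]\ninnodb_force_recovery = %s\n' "$1"
      ;;
    *)
      echo "refusing innodb_force_recovery=$1 on the production host" >&2
      return 1
      ;;
  esac
}

# Usage sketch (not run here; the drop-in path is an assumption):
#   force_recovery_cnf 1 > /etc/my.cnf.d/zz-force-recovery.cnf
#   systemctl start mariadb
# Raise the level one step at a time only if the start still fails, and
# delete the drop-in again once the dumps are done.
```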
- Unfortunately, a lot of the links for how to fix this are now dead
- https://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
- https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
- https://forums.mysql.com/read.php?22,603093,604631#msg-604631
- https://support.plesk.com/hc/en-us/articles/12377798484375-Plesk-is-not-accessible-ERROR-Zend-Db-Adapter-Exception-SQLSTATE-HY000-2002-No-such-file-or-directory
- we're running MariaDB 5.5.68, so the relevant page should be something like this https://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
- but note that it redirects to 8.4 for some reason? https://dev.mysql.com/doc/refman/8.4/en/forcing-innodb-recovery.html
- ah, so does 1.1; apparently anything it doesn't like just redirects to the latest version https://dev.mysql.com/doc/refman/1.1/en/forcing-innodb-recovery.html
- this suggests that, if we're going to use innodb_force_recovery 4 or greater, we only do it on another machine. So basically take the data I just backed-up, put it on a separate machine, and do the fucker *there* instead https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
- it also says that dumps of 4 or greater could still render corrupt data, so they shouldn't be trusted, anyway
- good news: it says the db blocks all INSERT, UPDATE, and DELETE commands when any recovery mode is enabled
- but we *can* run DROP. so the idea is to dump everything in recovery mode and drop what is corrupt. then restart with the recovery value set to 0 and restore.
- it says that if we can dump our tables with a recovery mode of 1, 2, or 3, then we're relatively safe: only some data on corrupt individual pages is lost
- here's the definition of a page https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_page
A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page. One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns. When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time. All the InnoDB disk data structures within a MySQL instance share the same page size. See Also buffer pool, compact row format, compressed row format, data files, extent, page size, row.
- I guess that just means data that hasn't been written to disk yet. So I *think* it should be OK to trust data that only has corrupt pages?
- ok, I think I have enough to proceed – at least for recovery modes 1, 2, and 3.
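The "dump everything, then see what is corrupt" idea can be sketched as one dump per database, so that the first failure points at the database holding the damage. Assumptions: mariadb has already been started in recovery mode 1 to 3, `dump_each_db` and the `DUMPER` variable are hypothetical names, and the mysqldump flags are a starting point rather than a vetted set.

```shell
# DUMPER is the dump command. Override it for a dry run.
DUMPER="${DUMPER:-mysqldump --single-transaction --routines --databases}"

# Hypothetical helper: read database names on stdin and dump each one into
# its own file under $1, reporting OK or FAIL per database.
dump_each_db() (
  outdir="$1"
  mkdir -p "$outdir" || exit 1
  while read -r db; do
    case "$db" in
      ""|information_schema|performance_schema) continue ;;
    esac
    # $DUMPER is intentionally unquoted so that its flags split into words.
    if $DUMPER "$db" > "$outdir/$db.sql" 2> "$outdir/$db.err" < /dev/null; then
      echo "OK   $db"
    else
      echo "FAIL $db"
    fi
  done
)

# Usage sketch (not run here):
#   mysql -N -B -e 'SHOW DATABASES' | dump_each_db /var/tmp/dbFail.20250417/dumps
```

If mysqld crashes partway through, every later database will also report FAIL, so the first FAIL is the informative one.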
- but first let's check SMART
- oh, fuck, my notes on this are on the wiki. Of course.
- arch wiki to the rescue https://wiki.archlinux.org/title/S.M.A.R.T.
- fail
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
-bash: smartctl: command not found
[root@opensourceecology ~]#
- luckily the yum servers for this EOL OS are still online, and I could install it
[root@opensourceecology ~]# yum install smartmontools ... Total download size: 546 k Installed size: 2.0 M Is this ok [y/d/N]: y Downloading packages: smartmontools-7.0-2.el7.x86_64.rpm | 546 kB 00:00:00 Running transaction check Running transaction test Transaction test succeeded Running transaction Installing : 1:smartmontools-7.0-2.el7.x86_64 1/1 Verifying : 1:smartmontools-7.0-2.el7.x86_64 1/1 Installed: smartmontools.x86_64 1:7.0-2.el7 Complete! [root@opensourceecology ~]#
- better
[root@opensourceecology ~]# smartctl --info /dev/sda | grep 'SMART support is:'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[root@opensourceecology ~]#
- well this is terrifying; it says both our disks are gonna fail within 24 hours
[root@opensourceecology ~]# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

[root@opensourceecology ~]#
- compare that to hetzner3, which says all is good
root@hetzner3 ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ # smartctl -H /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@hetzner3 ~ #
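For a quick per-disk summary on either server, the verdict line is the only part of `smartctl -H` that matters. A minimal sketch; `smart_verdict` is a hypothetical helper, and the device names in the usage comment are the ones seen on hetzner2:

```shell
# Hypothetical helper: reduce `smartctl -H` output (read on stdin) to the
# overall-health verdict, e.g. "PASSED" or "FAILED!".
smart_verdict() {
  awk -F': ' '/overall-health self-assessment test result/ { print $2 }'
}

# Usage sketch (not run here):
#   for d in /dev/sda /dev/sdb; do
#     printf '%s: %s\n' "$d" "$(smartctl -H "$d" | smart_verdict)"
#   done
```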
- I'm not 100% convinced that this is true. I still want to initiate a test on the drives, but I'm going to go ahead and pass this to hetzner support asap and ask them if there's a fee for them to replace our drives.
- oh, interesting. they have a walkthrough that says it's free via Server -> Technical -> Disk Failure https://robot.hetzner.com/support/index
- well, it lists two options
- Free: replacement drive nearly new or used and tested; depends on what is in stock.
- At cost: replacement drive guaranteed to be nearly new (less than 1000 hours of operation); one-time fee € 41.18 (excl. VAT); may not be in stock.
- we were given the option of hot-swapping while the system is on or shutting it down first. I'm going to say shutdown. That'll be simpler from the OS side, I think
- dang, it says they'll swap the drive within 2-4 hours.
- I've never done this before, but I believe it's a hardware raid. My understanding is that as soon as it comes-up, it'll begin copying the data from one disk to the other disk. But, christ, if both disks are fucked then which disk should I ask them to replace? Can I see which one is more fucked than the other?
- hetzner provides 4 docs for assistance on this
- https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#information-on-defective-drives
- https://docs.hetzner.com/robot/dedicated-server/maintainance/nvme/#show-serial-number-of-a-specific-nvme-ssd
- https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
- https://docs.hetzner.com/robot/dedicated-server/troubleshooting/serial-numbers-and-information-on-defective-hard-drives/#creating-a-complete-smart-log
- that first doc says to run the command we just ran
- hmm..it says for more info we should look at the "Failed Attributes" – but we have none for either disk
- ok, the docs say we can get more info with -A
[root@opensourceecology ~]# smartctl -A /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78355
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3433
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2599
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 24/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       405734134966
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12794981941
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       26207531685

[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -A /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       78354
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       3742
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2585
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 24/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       406209116828
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12809824998
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       42504271864

[root@opensourceecology ~]#
- so both say "Percent_Lifetime_Remain" is an issue. does that mean it's not *actually* writing corrupt data, but it's literally just a timer that hit and said "yeah you should probably replace the disk??"
- well, "Percent_Lifetime_Remain" doesn't appear in the docs table. nor in the source wikipedia table https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes
- yeah, reddit suggests that means the drive "should be replaced soon" but not that it's actually detected as failing now https://www.reddit.com/r/homelab/comments/kaaqma/percent_lifetime_remain_failing_now/
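Rather than eyeballing 23 rows per disk, the attribute table can be filtered down to the rows whose WHEN_FAILED column is set. A minimal sketch, assuming the ten-column ATA layout shown in the `smartctl -A` output above; `smart_failing_attrs` is a hypothetical helper:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print only the
# attributes whose WHEN_FAILED column (field 9) is not "-".
smart_failing_attrs() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" {
         print $2, $9, "value=" $4, "thresh=" $6
       }'
}

# Usage sketch (not run here):
#   smartctl -A /dev/sda | smart_failing_attrs
```

On both hetzner2 disks this leaves a single row, Percent_Lifetime_Remain, whose normalized value (000) has dropped below its threshold (001).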
- in that case, I guess it doesn't matter which disk we replace. But let's go ahead and get one replaced. I don't think this was the cause of the db corruption (I still think it's "shutting down the computer abruptly + a bug in old mariadb that prevents it from recovering"), but I would be stupid not to take a free replacement of a RAID1-mirrored disk that's alerting us that it's too old to be in prod.
- the second hetzner doc refers to nvme. that's relevant on hetzner3 but not hetzner2. anyway, I do want to know how to check this on hetzner3 (even if I can't update the wiki right now with these docs)
- wow, the output for smartctl looks very different for NVMEs on Debian than it does on CentOS
root@hetzner3 ~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    6%
Data Units Read:                    152.358.379 [78,0 TB]
Data Units Written:                 52.125.092 [26,6 TB]
Host Read Commands:                 6.873.372.480
Host Write Commands:                1.362.559.127
Controller Busy Time:               22.226
Power Cycles:                       28
Power On Hours:                     17.245
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      159
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               48 Celsius

root@hetzner3 ~ # smartctl -A /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    7%
Data Units Read:                    140.811.605 [72,0 TB]
Data Units Written:                 56.604.901 [28,9 TB]
Host Read Commands:                 1.304.073.899
Host Write Commands:                1.364.668.115
Controller Busy Time:               21.180
Power Cycles:                       23
Power On Hours:                     15.565
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      149
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               45 Celsius

root@hetzner3 ~ #
- that shows we're at 6% and 7% usage on hetzner3, whereas I guess we're at 100% on hetzner2
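To keep an eye on that wear figure on hetzner3, the one relevant line can be extracted from the NVMe health output. A minimal sketch, assuming the "Percentage Used:" label exactly as smartctl printed it above; `nvme_percentage_used` is a hypothetical helper:

```shell
# Hypothetical helper: read `smartctl -A` output for an NVMe device on stdin
# and print the "Percentage Used" figure as a bare number.
nvme_percentage_used() {
  awk -F: '/^Percentage Used:/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Usage sketch (not run here):
#   smartctl -A /dev/nvme0n1 | nvme_percentage_used
```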
- the third hetzner doc refers to a software raid. actually, I thought we were using a hardware raid, but now I'm not sure
- this indicates that our raid is fine. two UUs (eg `[UU]`) is fine. Bad would be a U and a missing U (eg `[U_]`)
[root@opensourceecology ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      523712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      209984640 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      33521664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
[root@opensourceecology ~]#
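The `[UU]` versus `[U_]` check is easy to script, which would be handy while a rebuild is running. A minimal sketch that only parses text in the /proc/mdstat format shown above; `mdstat_health` is a hypothetical helper:

```shell
# Hypothetical helper: read /proc/mdstat on stdin, print one line per array
# with its member status, and exit non-zero if any array is missing a member.
mdstat_health() {
  awk '
    /^md[0-9]+ *:/ { name = $1 }
    name != "" && match($0, /\[[U_]+\]/) {
      status = substr($0, RSTART, RLENGTH)
      if (index(status, "_") > 0) { print name, status, "DEGRADED"; bad = 1 }
      else                        { print name, status, "ok" }
      name = ""
    }
    END { exit bad + 0 }
  '
}

# Usage sketch (not run here):
#   mdstat_health < /proc/mdstat || echo "at least one array is degraded"
```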
- ah crap, the process to bring the new drive back into the RAID is non-trivial https://docs.hetzner.com/robot/dedicated-server/raid/exchanging-hard-disks-in-a-software-raid/
- first we have to format the new drive exactly as the old drive, then add each partition into the RAID array, then update grub. And, of course, meanwhile we'll be running on one disk. So if we fuck-up any of those steps, we lose everything. This could take me a few days (or weeks), and meanwhile the sites are all offline and our daily backups on backblaze are being deleted/rotated out of existence. Sadly, I think I'm going to postpone this until after we get the sites back-up.
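For when I do get to it, those three steps can be sketched as a dry run that only prints the commands for review. Everything here is an assumption to verify against the hetzner doc before running anything: MBR partition tables (a GPT disk would need sgdisk instead of sfdisk), GRUB 2 as shipped with CentOS 7, and the md0/md1/md2 to partition 1/2/3 mapping from /proc/mdstat. `print_rebuild_plan` is a hypothetical helper, and which disk gets swapped is not decided yet.

```shell
# Dry run only: prints the commands, runs nothing.
# $1 = surviving disk (e.g. /dev/sda), $2 = freshly swapped, empty disk.
print_rebuild_plan() (
  good="$1"
  new="$2"
  # 1. copy the partition layout from the surviving disk to the new one
  echo "sfdisk -d $good | sfdisk $new"
  # 2. add each new partition back into its array (mdN <-> partition N+1)
  for n in 1 2 3; do
    echo "mdadm --manage /dev/md$((n - 1)) --add ${new}${n}"
  done
  echo "cat /proc/mdstat   # wait for every array to return to [UU]"
  # 3. make the new disk bootable
  echo "grub2-install $new"
)

# Usage sketch:
#   print_rebuild_plan /dev/sda /dev/sdb
```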
- the last hetzner doc shows us how to get the serial number of our disks (which hetzner will ask-for when we tell them to swap it)
[root@opensourceecology ~]# udevadm info --query=property --name sda | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA336C
ID_SERIAL_SHORT=154410FA336C
[root@opensourceecology ~]#
[root@opensourceecology ~]# udevadm info --query=property --name sdb | grep ID_SER
ID_SERIAL=Crucial_CT250MX200SSD1_154410FA4520
ID_SERIAL_SHORT=154410FA4520
[root@opensourceecology ~]#
- I went ahead and ran a SMART test; it says it'll take just 2 minutes to run
[root@opensourceecology ~]# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:07:55 2025
Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Apr 17 22:08:18 2025
Use smartctl -X to abort test.
- I also kicked-off a long test, which I can check tomorrow
[root@opensourceecology ~]# smartctl -t long /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:12 2025
Use smartctl -X to abort test.
[root@opensourceecology ~]#
[root@opensourceecology ~]# smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 5 minutes for test to complete.
Test will complete after Thu Apr 17 22:15:14 2025
Use smartctl -X to abort test.
[root@opensourceecology ~]#
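The self-tests started above only report their result later, in the drive's self-test log (`smartctl -l selftest <device>`). A minimal sketch for reading the newest entry from that log follows. The function names are hypothetical, and it assumes the usual ATA table layout where the most recent test is the row starting with `# 1` and a pass reads "Completed without error".

```shell
#!/usr/bin/env bash
# Print the newest entry of a SMART self-test log (`smartctl -l selftest` on stdin),
# with the leading "# 1" row marker removed.
selftest_latest() {
  awk '/^# *1 / { sub(/^# *1 +/, ""); print; exit }'
}

# Succeed only if the newest self-test entry reports a clean pass.
selftest_latest_ok() {
  selftest_latest | grep -q 'Completed without error'
}
```

Possible usage: `smartctl -l selftest /dev/sda | selftest_latest_ok && echo "sda passed"`.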
- ok, then we have the filesystem. it looks like /var/lib/mysql/ lives on '/' which is /dev/md2
[root@opensourceecology ~]# df -h /var/lib/mysql
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        197G  145G   43G  78% /
[root@opensourceecology ~]#
[root@opensourceecology ~]# fdisk -l /dev/md2
Disk /dev/md2: 215.0 GB, 215024271360 bytes, 419969280 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
[root@opensourceecology ~]#
[root@opensourceecology ~]# lsblk /dev/md2
NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
md2    9:2    0 200.3G  0 raid1 /
[root@opensourceecology ~]#
- it won't let me check the filesystem while it's mounted
[root@opensourceecology ~]# fsck /dev/md2
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md2 is mounted.
e2fsck: Cannot continue, aborting.
[root@opensourceecology ~]#
- it probably should be happening on-boot, but I couldn't find it in dmesg
[root@opensourceecology ~]# dmesg | grep -i check
[ 0.000000] Early table checksum verification disabled
[root@opensourceecology ~]# dmesg | grep -i fsck
[root@opensourceecology ~]#
- ok, instead we can just use tune2fs to get the info on the last check that was run
- looks like it ran today; probably when Marcin rebooted it https://unix.stackexchange.com/questions/400851/what-should-i-do-to-force-the-root-filesystem-check-and-optionally-a-fix-at-bo
[root@opensourceecology ~]# tune2fs -l /dev/md2
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: af18bd25-f715-4003-b055-170a07591c60
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 13131776
Block count: 52496160
Reserved block count: 2624808
Free blocks: 26575102
Free inodes: 12417672
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1011
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue May 31 06:01:12 2016
Last mount time: Thu Apr 17 17:39:11 2025
Last write time: Thu Apr 17 17:39:00 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
Check interval: 0 (<none>)
Lifetime writes: 124 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b9456d9f-1608-4444-99c2-02e6f327e42d
Journal backup: inode blocks
[root@opensourceecology ~]#
- both of the filesystems (/ and /boot) look fine
[root@opensourceecology ~]# tune2fs -l /dev/md1 | grep -iE 'state|error|mount|checked'
Last mounted on: /boot
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 46
Maximum mount count: -1
Last checked: Tue May 31 06:01:07 2016
[root@opensourceecology ~]#
[root@opensourceecology ~]# tune2fs -l /dev/md2 | grep -iE 'state|error|mount|checked'
Last mounted on: /
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Last mount time: Thu Apr 17 17:39:11 2025
Mount count: 1
Maximum mount count: -1
Last checked: Thu Apr 17 17:39:00 2025
[root@opensourceecology ~]#
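The two fields that mattered in the tune2fs checks above are "Filesystem state" and "Last checked". A minimal sketch that reduces `tune2fs -l` output to one summary line follows; the function name and the output format are made up for illustration.

```shell
#!/usr/bin/env bash
# Summarise `tune2fs -l` output (stdin) as: state=<state> last_checked=<date>
# The field separator is a colon followed by spaces, so the colons inside the
# timestamp (17:39:00) are left alone.
fs_check_summary() {
  awk -F': +' '
    /^Filesystem state:/ { state = $2 }
    /^Last checked:/     { checked = $2 }
    END { printf "state=%s last_checked=%s\n", state, checked }
  '
}
```

Possible usage: `tune2fs -l /dev/md2 | fs_check_summary`.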
- well, so far I couldn't find any signs of corruption on the disk/fs level
- back to the db, I set the recovery option in the my.cnf file
[root@opensourceecology etc]# cp my.cnf my.cnf.20250417
[root@opensourceecology etc]#
[root@opensourceecology etc]# vim my.cnf
[root@opensourceecology etc]#
[root@opensourceecology etc]# diff my.cnf.20250417 my.cnf
1a2,5
>
> # attempt to recover corrupt db https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
> innodb_force_recovery = 1
>
[root@opensourceecology etc]#
- it didn't come-up
[root@opensourceecology etc]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@opensourceecology etc]#
- I tried changing it to recovery level 2 (innodb_force_recovery = 2); this time it got stuck "waiting for the background threads"
250417 22:32:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
250417 22:32:49 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 14901 ...
250417 22:32:49 InnoDB: The InnoDB memory heap is disabled
250417 22:32:49 InnoDB: Mutexes and rw_locks use GCC atomic builtins
250417 22:32:49 InnoDB: Compressed tables use zlib 1.2.7
250417 22:32:49 InnoDB: Using Linux native AIO
250417 22:32:49 InnoDB: Initializing buffer pool, size = 128.0M
250417 22:32:49 InnoDB: Completed initialization of buffer pool
250417 22:32:49 InnoDB: highest supported file format is Barracuda.
250417 22:32:49 InnoDB: Starting crash recovery from checkpoint LSN=625883462907
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
250417 22:32:49 InnoDB: Starting final batch to recover 11 pages from redo log
250417 22:32:49 InnoDB: Waiting for the background threads to start
250417 22:32:50 InnoDB: Waiting for the background threads to start
250417 22:32:51 InnoDB: Waiting for the background threads to start
250417 22:32:52 InnoDB: Waiting for the background threads to start
250417 22:32:53 InnoDB: Waiting for the background threads to start
250417 22:32:54 InnoDB: Waiting for the background threads to start
250417 22:32:55 InnoDB: Waiting for the background threads to start
250417 22:32:56 InnoDB: Waiting for the background threads to start
250417 22:32:57 InnoDB: Waiting for the background threads to start
250417 22:32:58 InnoDB: Waiting for the background threads to start
...
- it seems infinite. I don't know if it's going to time-out, but I'm just going to leave it and come-back tomorrow.
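Stepping through recovery levels by hand-editing my.cnf each time is error-prone, so here is a minimal sketch of a helper that sets `innodb_force_recovery` in a config file. The function name is hypothetical, and it assumes the setting belongs directly under a `[mysqld]` header (the diff above suggests it was added right after line 1, which is presumably that header). Per the MySQL doc linked above, valid levels are 0 to 6, and levels 4 and up can permanently corrupt data, so they should be a last resort.

```shell
#!/usr/bin/env bash
# Set innodb_force_recovery to LEVEL in the given my.cnf-style file.
# Rewrites an existing setting, otherwise adds one after the first [mysqld] header.
# Take a backup of the file first (e.g. cp my.cnf my.cnf.YYYYMMDD).
set_force_recovery() {
  local cnf="$1" level="$2" tmp
  case "$level" in
    [0-6]) ;;
    *) echo "innodb_force_recovery must be 0-6" >&2; return 1 ;;
  esac
  tmp="$(mktemp)" || return 1
  if grep -qE '^[[:space:]]*innodb_force_recovery[[:space:]]*=' "$cnf"; then
    # Rewrite the existing setting in place.
    awk -v lvl="$level" '
      /^[ \t]*innodb_force_recovery[ \t]*=/ { print "innodb_force_recovery = " lvl; next }
      { print }
    ' "$cnf" > "$tmp"
  elif grep -q '^\[mysqld\]' "$cnf"; then
    # Add the setting right after the first [mysqld] header.
    awk -v lvl="$level" '
      { print }
      /^\[mysqld\]/ && !done { print "innodb_force_recovery = " lvl; done = 1 }
    ' "$cnf" > "$tmp"
  else
    echo "no [mysqld] section found in $cnf" >&2
    rm -f "$tmp"
    return 1
  fi
  # cat (rather than mv) keeps the original file's owner and permissions.
  cat "$tmp" > "$cnf" && rm -f "$tmp"
}
```

Possible usage: `set_force_recovery /etc/my.cnf 2 && systemctl restart mariadb`, then set it back to 0 (or remove the line) once the data has been dumped.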
Fri Apr 11, 2025
- let's get Catarina that broken staging site for osemain on hetzner3
- Marcin still hasn't regained access to his ssh key (so he can update the ose keepass), but he did finally send me the password to our hetzner account
- so now I can order a second IPv4 address, as needed for obi & osemain to have two distinct sites on hetzner3
- I logged-into hetzner https://robot.hetzner.com/server
- I also typed a "name" into the blank "name" fields for our two servers. one is now called "hetzner2" and the new one "hetzner3"
- I clicked on the server for "hetzner3" and the tab "IPs".
- Then I clicked on "Order additional IPs / Nets"
- I selected "One additional IP with costs (€ 1.70 max. per month / € 0.0027 per hour + € 4.90 once-off setup)"
- it required me to enter a reason (IPv4 is scarce) to which I wrote:
we need to run two websites with the same domain name that are already running on our primary IPv4 address, and a client doesn't have IPv6 working at their office
- and I clicked "Apply for IP/subnet in obligation"
- I got a message; looks like it needs human approval
Your request for additional IPs/subnets was successfully sent. We will send you an email as soon as your IP/subnet is ready.
- I typed an email to Marcin and Catarina to notify them of this order
Hey Marcin,
As authorized on our last call, I ordered an additional IPv4 address for your hetzner account. IPv4 addresses are scarce, and it appears that they need to approve it manually. The cost is €1.70 per month + € 4.90 once-off setup.
This will allow us to run more than one website with the same domain off the same server. That will be needed for osemain and obi.
Once you finish rebuilding those websites on hetzner3 to use a new not-broken theme, we can cancel this second IP address.
Thank you,
Michael Altfield
https://www.michaelaltfield.net
PGP Fingerprint: 0465 E42F 7120 6785 E972 644C FE1B 8449 4E64 0D41
Note: If you cannot reach me via email, please check to see if I have changed my email address by visiting my website at https://email.michaelaltfield.net
- before I finished typing ^ that email, I got an email from hetzner indicating that we have a new IP
- I refreshed the hetzner wui, and now I see the new IP
- ...
- following-up on the bus factor, I added Catarina & Tom's ssh keys to their authorized_keys files on hetzner3
- I sent them both emails asking them to confirm access
- I also emailed Marcin asking if he installed zulucrypt yet to try to recover his old ssh key
- update: within a few hours, Marcin had successfully decrypted and mounted his old veracrypt volume using zuluCrypt
- he created this article on the wiki https://wiki.opensourceecology.org/wiki/Zulucrypt
- I found that he had previously documented scattered articles about backups, luks, veracrypt, pgp, cybersec general, etc in a ton of different articles. So I spent some time adding categories and "see also" sections to those articles, in hopes he will be more easily able to do this in the future
- I also asked him to please document what he needed for himself 5 years from now into a README file next to the 'ose-veracrypt' volume on his usb drive.
- Marcin confirmed that he was able to restore his ssh keys and ssh into hetzner3. awesome.
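For the authorized_keys step earlier in this entry, a minimal sketch of adding a public key without creating duplicates is shown below. The function name, the example path and the key are placeholders, not the real values used on hetzner3.

```shell
#!/usr/bin/env bash
# Append a public key to an authorized_keys file unless that exact line is
# already present. Creates the directory and file with ssh-friendly permissions.
add_authorized_key() {
  local file="$1" key="$2" dir
  if [ -z "$file" ] || [ -z "$key" ]; then
    echo "usage: add_authorized_key <authorized_keys-file> <public-key-line>" >&2
    return 1
  fi
  dir="$(dirname "$file")"
  mkdir -p "$dir" || return 1
  chmod 700 "$dir"
  touch "$file" || return 1
  chmod 600 "$file"
  # Make sure the existing content ends with a newline before appending.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    echo >> "$file"
  fi
  grep -qxF -- "$key" "$file" || printf '%s\n' "$key" >> "$file"
}
```

Possible usage, run as the target user: `add_authorized_key ~/.ssh/authorized_keys "ssh-ed25519 AAAA... someone@example"`.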
- ...
- I logged all my hours and sent an invoice to OSE for last month (Mar 2025)
- gah, I had obliterated half my 2025Q1 log. when I tried to restore it, I got a 413 error
- I checked php and nginx; it's 10M. How did I write >10 MB of text in one quarter?
- there's too many layers on this server; I checked the logs
[Fri Apr 11 22:18:20.306872 2025] [:error] [pid 13182] [client 127.0.0.1:56606] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"], referer: https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit
HTTP/1.1 413 Request Entity Too Large
Message: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413)
Apache-Error: [file "apache2_util.c"] [line 271] [level 3] [client 127.0.0.1] ModSecurity: Request body no files data length is larger than the configured limit (1000000).. Deny with code (413) [hostname "wiki.opensourceecology.org"] [uri "/index.php"] [unique_id "Z-mVLLwDarHC@6u2-5xhBgAAAAg"]
127.0.0.1 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.0" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0"
146.70.199.124 - - [11/Apr/2025:22:18:20 +0000] "POST /index.php?title=Maltfield_Log/2025_Q1&action=submit HTTP/1.1" 413 338 "https://wiki.opensourceecology.org/index.php?title=Maltfield_Log/2025_Q1&action=edit" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "-"
- ok, so it's modsecurity?
- gah, that's a lot of files to review
[root@opensourceecology httpd]# find . |grep -i security
./conf.d/mod_security.wordpress.include
./conf.d/mod_security.conf
./conf.modules.d/10-mod_security.conf
./modsecurity.d
./modsecurity.d/activated_rules
./modsecurity.d/activated_rules/modsecurity_crs_42_tight_security.conf
./modsecurity.d/activated_rules/modsecurity_crs_35_bad_robots.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound.data
./modsecurity.d/activated_rules/modsecurity_crs_45_trojans.conf
./modsecurity.d/activated_rules/modsecurity_crs_48_local_exceptions.conf.example
./modsecurity.d/activated_rules/modsecurity_35_bad_robots.data
./modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_sql_injection_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_49_inbound_blocking.conf
./modsecurity.d/activated_rules/modsecurity_crs_60_correlation.conf
./modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf
./modsecurity.d/activated_rules/modsecurity_crs_40_generic_attacks.conf
./modsecurity.d/activated_rules/modsecurity_50_outbound_malware.data
./modsecurity.d/activated_rules/modsecurity_35_scanners.data
./modsecurity.d/activated_rules/modsecurity_40_generic_attacks.data
./modsecurity.d/activated_rules/modsecurity_crs_50_outbound.conf
./modsecurity.d/activated_rules/modsecurity_crs_47_common_exceptions.conf
./modsecurity.d/activated_rules/modsecurity_crs_30_http_policy.conf
./modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf
./modsecurity.d/activated_rules/modsecurity_crs_41_xss_attacks.conf
./modsecurity.d/activated_rules/modsecurity_crs_59_outbound_blocking.conf
./modsecurity.d/modsecurity_crs_10_config.conf.20181024.orig
./modsecurity.d/modsecurity_crs_10_config.conf
./modsecurity.d/do_not_log_passwords.conf
[root@opensourceecology httpd]#
- looks like it's SecRequestBodyLimit http://stackoverflow.com/questions/13887812/ddg#14690797
[root@opensourceecology httpd]# grep -irl 'BodyLimit' *
conf.d/mod_security.conf
modules/mod_security2.so
[root@opensourceecology httpd]#
- it's 13107200
[root@opensourceecology httpd]# grep -ir 'BodyLimit' *
conf.d/mod_security.conf: SecRequestBodyLimit 13107200
conf.d/mod_security.conf: SecRequestBodyLimitAction Reject
Binary file modules/mod_security2.so matches
[root@opensourceecology httpd]#
- docs say it's in bytes https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyLimit
- so 13107200 / 1024 / 1024 = 12.5 MB.
- jesus that's a lot of data; I'm not gonna increase that in 4 places (nginx, apache, mod_security, php); let's just split it into two articles :(
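A quick sanity check of the numbers above, as a minimal sketch with hypothetical function names. One thing a future admin may want to double-check: the 413 entry in the apache log reports a configured limit of 1000000 bytes, which does not match the 13107200 found for SecRequestBodyLimit. The "no files" wording in that message suggests the limit that actually fired may be ModSecurity's separate SecRequestBodyNoFilesLimit directive. That is an assumption, since a grep for 'BodyLimit' would not match that directive name and it was not inspected here.

```shell
#!/usr/bin/env bash
# Convert a byte count to MiB with two decimals (LC_ALL=C keeps "." as the
# decimal separator regardless of the system locale).
bytes_to_mib() {
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024 / 1024 }'
}

# Extract the first "configured limit (N)" value from ModSecurity log text on stdin.
modsec_limit_from_log() {
  grep -oE 'configured limit \([0-9]+\)' | grep -oE '[0-9]+' | head -n 1
}
```

With these, `bytes_to_mib 13107200` gives 12.50 and `bytes_to_mib 1000000` gives 0.95, so the limit reported in the log is just under 1 MiB. Running `grep -ir 'SecRequestBody' *` from the same httpd directory would list every request-body directive that is explicitly set, if any others are.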
- ...
- so Marcin is stressing urgency to get Catarina a sandbox so she can rebuild osemain using some new theme that's not broken on the latest version of wordpress, php, etc on hetzner3
- I didn't want to do this site before the other less-priority ones, but it's just a sandbox
- I realized I never made a CHG file for osemain
- looks like I first did a snapshot Jan 31 https://wiki.opensourceecology.org/wiki/Maltfield_Log/2025_Q1#Fri_Jan_31.2C_2025
- ugh, I just said I was "following the same guide as with the other sites"
- I was hoping to know which CHG to copy from
- I guess it makes the most sense to copy from obi, which already has both a static and dynamic site setup (untested)
- ok, I made a first draft of our osemain CHG to migrate to hetzner3 https://wiki.opensourceecology.org/wiki/CHG-2025-XX-XX_migrate_osemain_to_hetzner3
- oh, crap, I'm going to remove