Maltfield Log/2024 H1: Difference between revisions
</pre>
# I'm considering the change successful
# looks like my tmp change dir pushed the disk to 86% capacity; let's clean that up
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  161G   27G  86% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology CHG-2024-07-26_yum_update]# ls
after.log      before.log            needs-reboot.log      update.log
backup_before  moved_from_etc_httpd  needs-restarting.log
[root@opensourceecology CHG-2024-07-26_yum_update]# du -sh *
28K     after.log
70G     backup_before
28K     before.log
28K     moved_from_etc_httpd
4.0K    needs-reboot.log
4.0K    needs-restarting.log
216K    update.log
[root@opensourceecology CHG-2024-07-26_yum_update]# ls before.log
before.log
[root@opensourceecology CHG-2024-07-26_yum_update]#
[root@opensourceecology CHG-2024-07-26_yum_update]# ls backup_before/
daily_hetzner2_20240726_160837.tar  root
[root@opensourceecology CHG-2024-07-26_yum_update]#
[root@opensourceecology CHG-2024-07-26_yum_update]# du -sh backup_before/*
20G     backup_before/daily_hetzner2_20240726_160837.tar
51G     backup_before/root
[root@opensourceecology CHG-2024-07-26_yum_update]#
[root@opensourceecology CHG-2024-07-26_yum_update]# ls backup_before/root
backups
[root@opensourceecology CHG-2024-07-26_yum_update]#
[root@opensourceecology CHG-2024-07-26_yum_update]# rm -rf backup_before/root
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
# great, now we're down to 59%
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  110G   78G  59% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
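For future cleanups like this, a quick way to rank directories by size is `du` piped through `sort -h`. A generic sketch with demo paths (not from the server; on the real host you'd point it at `/` with `-x` to stay on one filesystem):

```shell
# Build a tiny demo tree, then rank its directories by size.
# On the real host: du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 15
mkdir -p /tmp/duscan/big /tmp/duscan/small
head -c 1048576 /dev/zero > /tmp/duscan/big/blob      # 1 MiB file
echo hi > /tmp/duscan/small/note.txt
du -h --max-depth=1 /tmp/duscan | sort -h | tail -n 3
```

The `sort -h` flag sorts human-readable sizes (so 20G ranks above 500M), which plain `sort -n` gets wrong.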
=Wed July 24, 2024=
Revision as of 19:17, 26 July 2024
My work log from the year 2024. I intentionally made this verbose to make future admins' work easier when troubleshooting. The more keywords, error messages, etc. that are listed in this log, the more helpful it will be for the future OSE Sysadmin.
=Fri July 26, 2024=
- I started the CHG-2024-07-26_yum_update today at 11:00
- pre-state proof shows we have lots of outdated system packages, as expected
[root@opensourceecology ~]# yum list updates ... xz-libs.x86_64 5.2.2-2.el7_9 updates yum.noarch 3.4.3-168.el7.centos base yum-cron.noarch 3.4.3-168.el7.centos base yum-plugin-fastestmirror.noarch 1.1.31-54.el7_8 base yum-utils.noarch 1.1.31-54.el7_8 base zlib.x86_64 1.2.7-21.el7_9 updates [root@opensourceecology ~]#
- I tried to check the backups log, but it was empty :/
<pre>
[root@opensourceecology ~]# cat /var/log/backups/backup.log
[root@opensourceecology ~]#
</pre>
- ok, looks like it rotated already; this file shows a 20.424G backup file successfully uploaded to backblaze with rclone
[root@opensourceecology ~]# ls /var/log/backups/ backup.lo backup.log-20240628.gz backup.log-20240714.gz backup.log backup.log-20240629.gz backup.log-20240715.gz backup.log-20240615.gz backup.log-20240701.gz backup.log-20240716.gz backup.log-20240616.gz backup.log-20240702.gz backup.log-20240718.gz backup.log-20240617.gz backup.log-20240704.gz backup.log-20240719.gz backup.log-20240619.gz backup.log-20240706.gz backup.log-20240721.gz backup.log-20240621.gz backup.log-20240707.gz backup.log-20240722.gz backup.log-20240622.gz backup.log-20240708.gz backup.log-20240724.gz backup.log-20240623.gz backup.log-20240709.gz backup.log-20240725.gz backup.log-20240625.gz backup.log-20240711.gz backup.log-20240726 backup.log-20240626.gz backup.log-20240712.gz backup.log-20240627.gz backup.log-20240713.gz [root@opensourceecology ~]# [root@opensourceecology ~]# tail -n20 /var/log/backups/backup.log-20240726 * daily_hetzner2_20240726_072001.tar.gpg:100% /20.424G, 2.935M/s, - 2024/07/26 09:50:31 INFO : daily_hetzner2_20240726_072001.tar.gpg: Copied (new) 2024/07/26 09:50:31 INFO : Transferred: 20.424G / 20.424 GBytes, 100%, 2.979 MBytes/s, ETA 0s Transferred: 1 / 1, 100% Elapsed time: 1h57m0.8s real 117m1.219s user 4m20.240s sys 2m9.432s + echo ================================================================================ ================================================================================ ++ date -u +%Y%m%d_%H%M%S + echo 'INFO: Finished Backup Run at 20240726_095031' INFO: Finished Backup Run at 20240726_095031 + echo ================================================================================ ================================================================================ + exit 0 [root@opensourceecology ~]#
- the query of b2 backup files also looks good
[root@opensourceecology ~]# sudo -u b2user /home/b2user/virtualenv/bin/b2 ls ose-server-backups | grep `date "+%Y%m%d"` daily_hetzner2_20240726_072001.tar.gpg [root@opensourceecology ~]# date -u Fri Jul 26 16:03:55 UTC 2024 [root@opensourceecology ~]# sudo -u b2user /home/b2user/virtualenv/bin/b2 ls ose-server-backups daily_hetzner2_20240724_072001.tar.gpg daily_hetzner2_20240725_072001.tar.gpg daily_hetzner2_20240726_072001.tar.gpg monthly_hetzner2_20230801_072001.tar.gpg monthly_hetzner2_20230901_072001.tar.gpg monthly_hetzner2_20231001_072001.tar.gpg monthly_hetzner2_20231101_072001.tar.gpg monthly_hetzner2_20231201_072001.tar.gpg monthly_hetzner2_20240201_072001.tar.gpg monthly_hetzner2_20240301_072001.tar.gpg monthly_hetzner2_20240401_072001.tar.gpg monthly_hetzner2_20240501_072001.tar.gpg monthly_hetzner2_20240601_072001.tar.gpg monthly_hetzner2_20240701_072001.tar.gpg weekly_hetzner2_20240708_072001.tar.gpg weekly_hetzner2_20240715_072001.tar.gpg weekly_hetzner2_20240722_072001.tar.gpg yearly_hetzner2_20190101_111520.tar.gpg yearly_hetzner2_20200101_072001.tar.gpg yearly_hetzner2_20210101_072001.tar.gpg yearly_hetzner2_20230101_072001.tar.gpg yearly_hetzner2_20240101_072001.tar.gpg [root@opensourceecology ~]#
- that backup is already 8 hours old; so let's bring down the webserver + stop the databases and take a real fresh backup before we do anything
- stopped nginx
[root@opensourceecology ~]# # create dir for logging the change [root@opensourceecology ~]# tmpDir="/var/tmp/CHG-2024-07-26_yum_update" [root@opensourceecology ~]# mkdir -p $tmpDir [root@opensourceecology ~]# [root@opensourceecology ~]# # begin to gracefully shutdown nginx in the background [root@opensourceecology ~]# time nice /sbin/nginx -s quit nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... 
ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11 real 0m0.078s user 0m0.038s sys 0m0.021s [root@opensourceecology ~]# [root@opensourceecology ~]# date -u Fri Jul 26 16:06:37 UTC 2024 [root@opensourceecology ~]#
- stopped DBs
[root@opensourceecology ~]# systemctl status mariadb ● mariadb.service - MariaDB database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2024-07-22 18:55:28 UTC; 3 days ago Process: 1230 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS) Process: 1099 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS) Main PID: 1229 (mysqld_safe) CGroup: /system.slice/mariadb.service ├─1229 /bin/sh /usr/bin/mysqld_safe --basedir=/usr └─1704 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql ... Jul 22 18:55:25 opensourceecology.org systemd[1]: Starting MariaDB database s.... Jul 22 18:55:26 opensourceecology.org mariadb-prepare-db-dir[1099]: Database M... Jul 22 18:55:26 opensourceecology.org mariadb-prepare-db-dir[1099]: If this is... Jul 22 18:55:26 opensourceecology.org mysqld_safe[1229]: 240722 18:55:26 mysql... Jul 22 18:55:26 opensourceecology.org mysqld_safe[1229]: 240722 18:55:26 mysql... Jul 22 18:55:28 opensourceecology.org systemd[1]: Started MariaDB database se.... Hint: Some lines were ellipsized, use -l to show in full. [root@opensourceecology ~]# systemctl stop mariadb [root@opensourceecology ~]# systemctl status mariadb ● mariadb.service - MariaDB database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled) Active: inactive (dead) since Fri 2024-07-26 16:07:43 UTC; 3s ago Process: 1230 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS) Process: 1229 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS) Process: 1099 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS) Main PID: 1229 (code=exited, status=0/SUCCESS) Jul 22 18:55:25 opensourceecology.org systemd[1]: Starting MariaDB database s.... 
Jul 22 18:55:26 opensourceecology.org mariadb-prepare-db-dir[1099]: Database M... Jul 22 18:55:26 opensourceecology.org mariadb-prepare-db-dir[1099]: If this is... Jul 22 18:55:26 opensourceecology.org mysqld_safe[1229]: 240722 18:55:26 mysql... Jul 22 18:55:26 opensourceecology.org mysqld_safe[1229]: 240722 18:55:26 mysql... Jul 22 18:55:28 opensourceecology.org systemd[1]: Started MariaDB database se.... Jul 26 16:07:40 opensourceecology.org systemd[1]: Stopping MariaDB database s.... Jul 26 16:07:43 opensourceecology.org systemd[1]: Stopped MariaDB database se.... Hint: Some lines were ellipsized, use -l to show in full. [root@opensourceecology ~]#
- the backup is taking a long time. while I wait, I checked `top`, and I see `gzip` is using 80%-100%
- so it seems that gzip is bound by a single core. it could go much faster if it could be split across multiple cores (parallel processing)
- quick googling while I wait suggests that we could use `pigz` as a replacement to `gzip` to get this (admittedly low priority) performance boost https://stackoverflow.com/questions/12313242/utilizing-multi-core-for-targzip-bzip-compression-decompression
- there are other options too. apparently xz has had native multi-threaded support since v5.2.0 https://askubuntu.com/a/858828
- there's also pbzip2 for bzip2, and many others https://askubuntu.com/a/258228
- the other two commands that get stuck on one core are `tar` and `gpg2`
- it looks like gpg also attempts to compress with xz. That gives us no benefits in our case because we're encrypting a tarball that just contains a bunch of already-compressed tarballs. So we could probably get some performance improvements by telling gpg to skip the xz compression with `--compression-algo none` https://stackoverflow.com/questions/46261024/how-to-do-large-file-parallel-encryption-using-gnupg-and-gnu-parallel
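The parallelization ideas above can be sketched as follows (demo paths only; the pigz fallback is hypothetical since pigz isn't installed on this server yet):

```shell
# Prefer pigz (parallel gzip) if installed; fall back to stock gzip.
compressor="$(command -v pigz || command -v gzip)"
mkdir -p /tmp/ptest && echo 'hello' > /tmp/ptest/file.txt
tar -cf - -C /tmp ptest | "$compressor" -c > /tmp/ptest.tar.gz
tar -tzf /tmp/ptest.tar.gz
# other knobs mentioned above:
#   xz -T0 file.tar              # xz >= 5.2.0 threads natively
#   gpg --compress-algo none ... # skip gpg's redundant compression pass
```

With GNU tar, pigz is also a drop-in via `tar -I pigz -cf archive.tar.gz dir/`.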
- finally (after ~30 min to generate the encrypted backup file), rclone is using >100% of CPU to upload it, so that's good. Our script does limit upload to 3 MB/s. I guess one improvement would be some argument to bypass that throttle
- it said the upload was going to take just under 2 hours, so I canceled it and manually ran the upload command (minus the throttle)
- upload speeds are now ~27-32 MB/s (so ~10x faster). It says it'll finish in just over 10 minutes.
- upload is done
[root@opensourceecology ~]# time sudo /bin/nice /root/backups/backup.sh &>> /var/log/backups/backup.log ^C real 33m47.250s user 23m56.551s sys 2m2.866s [root@opensourceecology ~]# [root@opensourceecology ~]# /bin/sudo -u b2user /bin/rclone -v copy /home/b2user/sync/daily_hetzner2_20240726_160837.tar.gpg b2:ose-server-backups ... 2024/07/26 16:56:38 INFO : Transferred: 18.440G / 19.206 GBytes, 96%, 22.492 MBytes/s, ETA 34s Transferred: 0 / 1, 0% Elapsed time: 14m0.5s Transferring: * daily_hetzner2_20240726_160837.tar.gpg: 96% /19.206G, 21.268M/s, 36s 2024/07/26 16:57:36 INFO : daily_hetzner2_20240726_160837.tar.gpg: Copied (new) 2024/07/26 16:57:36 INFO : Transferred: 19.206G / 19.206 GBytes, 100%, 21.910 MBytes/s, ETA 0s Transferred: 1 / 1, 100% Elapsed time: 14m58.6s [root@opensourceecology ~]#
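For reference, the throttle corresponds to rclone's `--bwlimit` flag (a real rclone option; the exact line inside backup.sh is assumed here). The flag also accepts a day-schedule, which would let off-hours manual runs go full speed without editing the script. Commands are echoed rather than executed, since they need the server's b2 remote and credentials:

```shell
# Hypothetical invocations (the b2 remote and credentials live server-side):
src="/home/b2user/sync/daily_hetzner2_20240726_160837.tar.gpg"  # path from the log
echo "throttled:  rclone copy --bwlimit 3M $src b2:ose-server-backups"
echo "scheduled:  rclone copy --bwlimit '08:00,3M 23:00,off' $src b2:ose-server-backups"
```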
- ok, this very durable backup is uploaded; let's proceed
[root@opensourceecology ~]# sudo -u b2user /home/b2user/virtualenv/bin/b2 ls ose-server-backups | grep `date "+%Y%m%d"` daily_hetzner2_20240726_072001.tar.gpg daily_hetzner2_20240726_160837.tar.gpg [root@opensourceecology ~]# date -u Fri Jul 26 16:58:11 UTC 2024 [root@opensourceecology ~]# sudo -u b2user /home/b2user/virtualenv/bin/b2 ls ose-server-backups daily_hetzner2_20240724_072001.tar.gpg daily_hetzner2_20240725_072001.tar.gpg daily_hetzner2_20240726_072001.tar.gpg daily_hetzner2_20240726_160837.tar.gpg monthly_hetzner2_20230801_072001.tar.gpg monthly_hetzner2_20230901_072001.tar.gpg monthly_hetzner2_20231001_072001.tar.gpg monthly_hetzner2_20231101_072001.tar.gpg monthly_hetzner2_20231201_072001.tar.gpg monthly_hetzner2_20240201_072001.tar.gpg monthly_hetzner2_20240301_072001.tar.gpg monthly_hetzner2_20240401_072001.tar.gpg monthly_hetzner2_20240501_072001.tar.gpg monthly_hetzner2_20240601_072001.tar.gpg monthly_hetzner2_20240701_072001.tar.gpg weekly_hetzner2_20240708_072001.tar.gpg weekly_hetzner2_20240715_072001.tar.gpg weekly_hetzner2_20240722_072001.tar.gpg yearly_hetzner2_20190101_111520.tar.gpg yearly_hetzner2_20200101_072001.tar.gpg yearly_hetzner2_20210101_072001.tar.gpg yearly_hetzner2_20230101_072001.tar.gpg yearly_hetzner2_20240101_072001.tar.gpg [root@opensourceecology ~]#
- we have a snapshot of the current state of packages
<pre>
[root@opensourceecology ~]# time nice rpm -qa &> "${tmpDir}/before.log"

real    0m0.716s
user    0m0.678s
sys     0m0.037s
[root@opensourceecology ~]#
[root@opensourceecology ~]# echo $tmpDir
/var/tmp/CHG-2024-07-26_yum_update
[root@opensourceecology ~]#
[root@opensourceecology ~]# tail /var/tmp/CHG-2024-07-26_yum_update/before.log
libdb-utils-5.3.21-25.el7.x86_64
libuser-0.60-9.el7.x86_64
python-lxml-3.2.1-4.el7.x86_64
net-snmp-agent-libs-5.7.2-48.el7_8.x86_64
epel-release-7-14.noarch
perl-parent-0.225-244.el7.noarch
libstdc++-devel-4.8.5-39.el7.x86_64
libsodium13-1.0.5-1.el7.x86_64
ncurses-5.9-14.20130511.el7_4.x86_64
e2fsprogs-libs-1.42.9-17.el7.x86_64
[root@opensourceecology ~]#
</pre>
- I kicked off the updates. I got a bit of a fright at first when we got "404 Not Found" errors from 484 mirrors, but eventually `yum` found a server. I'm glad we did the updates now, before all the mirrors shut down (CentOS 7 stopped getting full updates years ago, and stopped getting maintenance updates entirely a few weeks ago)
<pre>
[root@opensourceecology ~]# grep "Error 404" /var/tmp/CHG-2024-07-26_yum_update/update.log | wc -l
484
[root@opensourceecology ~]#
[root@opensourceecology ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
[root@opensourceecology ~]#
</pre>
- actually, it says it's updating 434 packages total. So I guess some dependencies got added to the 200-odd count before
- ok, the update command finished in just under 4 minutes of wall time
<pre>
...
real    3m56.410s
user    2m1.833s
sys     0m44.510s
[root@opensourceecology ~]#
</pre>
- post update info
[root@opensourceecology ~]# # log the post-state packages and versions [root@opensourceecology ~]# time nice rpm -qa &> "${tmpDir}/after.log" real 0m0.805s user 0m0.769s sys 0m0.036s [root@opensourceecology ~]# [root@opensourceecology ~]# time nice needs-restarting &> "${tmpDir}/needs-restarting.log" real 0m8.156s user 0m6.956s sys 0m0.652s [root@opensourceecology ~]# time nice needs-restarting -r &> "${tmpDir}/needs-reboot.log" real 0m0.155s user 0m0.104s sys 0m0.051s [root@opensourceecology ~]# [root@opensourceecology ~]# cat /var/tmp/CHG-2024-07-26_yum_update/needs-reboot.log Core libraries or services have been updated: systemd -> 219-78.el7_9.9 dbus -> 1:1.10.24-15.el7 openssl-libs -> 1:1.0.2k-26.el7_9 linux-firmware -> 20200421-83.git78c0348.el7_9 kernel -> 3.10.0-1160.119.1.el7 glibc -> 2.17-326.el7_9.3 Reboot is required to ensure that your system benefits from these updates. More information: https://access.redhat.com/solutions/27943 [root@opensourceecology ~]# [root@opensourceecology ~]# cat /var/tmp/CHG-2024-07-26_yum_update/needs-restarting.log 30842 : /usr/lib/systemd/systemd-udevd 13696 : sshd: maltfield@pts/0 27401 : /bin/bash 744 : /sbin/auditd 19086 : /bin/bash 13692 : sshd: maltfield [priv] 30672 : smtpd -n smtp -t inet -u 13699 : -bash 18035 : su - 27436 : less /root/backups/backup.sh 18036 : -bash 18030 : sudo su - 1484 : /var/ossec/bin/ossec-analysisd 24493 : /bin/bash 21581 : su - 21580 : sudo su - 21582 : -bash 797 : /usr/lib/systemd/systemd-logind 24476 : /bin/bash 1830 : qmgr -l -t unix -u 30673 : proxymap -t unix -u 19119 : sudo su - 24511 : /bin/bash 29833 : local -t unix 27417 : sudo su - 19130 : -bash 1 : /usr/lib/systemd/systemd --system --deserialize 23 29830 : cleanup -z -t unix -u 1500 : /var/ossec/bin/ossec-logcollector 24475 : SCREEN -S upgrade 2150 : /usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a 127.0.0.1:6081 -T 127.0.0.1:6082 -S /etc/varnish/secret -u varnish -g varnish -s malloc,40G 2152 : 
/usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a 127.0.0.1:6081 -T 127.0.0.1:6082 -S /etc/varnish/secret -u varnish -g varnish -s malloc,40G 29835 : bounce -z -t unix -u 775 : /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation 27419 : -bash 585 : /usr/lib/systemd/systemd-journald 771 : /usr/sbin/irqbalance --foreground 770 : /usr/sbin/acpid 1170 : /sbin/agetty --noclear tty1 linux 30690 : smtp -t unix -u 778 : /usr/sbin/chronyd 8695 : gpg-agent --daemon --use-standard-socket 24529 : /bin/bash 2121 : /var/ossec/bin/ossec-syscheckd 1806 : /usr/libexec/postfix/master -w 19129 : su - 19065 : /bin/bash 2124 : /var/ossec/bin/ossec-monitord 29832 : trivial-rewrite -n rewrite -t unix -u 19044 : /bin/bash 30693 : smtp -t unix -u 30692 : smtp -t unix -u 30691 : cleanup -z -t unix -u 27418 : su - 1475 : /var/ossec/bin/ossec-execd 19025 : /bin/bash 19024 : SCREEN -S CHG-2024-07-26_yum_update 1458 : /var/ossec/bin/ossec-maild 19023 : screen -S CHG-2024-07-26_yum_update [root@opensourceecology ~]#
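With `before.log` and `after.log` captured, the package delta falls out of a sorted diff. A stand-in sketch (fake snapshot files; on the server the inputs would be `${tmpDir}/before.log` and `${tmpDir}/after.log`):

```shell
# Fake rpm -qa snapshots: one version bump between them.
printf 'yum-3.4.3-167.el7.centos.noarch\nzlib-1.2.7-19.el7.x86_64\n' > /tmp/before.log
printf 'yum-3.4.3-168.el7.centos.noarch\nzlib-1.2.7-21.el7_9.x86_64\n' > /tmp/after.log
# '<' lines are pre-update versions, '>' lines are post-update versions.
# sort first, since rpm -qa output order is not stable across runs.
diff <(sort /tmp/before.log) <(sort /tmp/after.log) || true
```

The `|| true` swallows diff's exit status of 1, which it returns whenever the files differ.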
- alright, time to reboot
<pre>
[root@opensourceecology ~]# reboot
Connection to opensourceecology.org closed by remote host.
Connection to opensourceecology.org closed.
user@personal:~$
</pre>
- system came back in about 1 minute
- first attempt to load the wiki resulted in a 503 "Error 503 Backend fetch failed" from varnish
- it's not just warming up; apache didn't come up on boot
[root@opensourceecology ~]# systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Fri 2024-07-26 17:09:47 UTC; 2min 7s ago Docs: man:httpd(8) man:apachectl(8) Process: 1094 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=exited, status=1/FAILURE) Main PID: 1094 (code=exited, status=1/FAILURE) Jul 26 17:09:47 opensourceecology.org systemd[1]: Starting The Apache HTTP Se.... Jul 26 17:09:47 opensourceecology.org httpd[1094]: (98)Address already in use:... Jul 26 17:09:47 opensourceecology.org httpd[1094]: (98)Address already in use:... Jul 26 17:09:47 opensourceecology.org httpd[1094]: no listening sockets availa... Jul 26 17:09:47 opensourceecology.org httpd[1094]: AH00015: Unable to open logs Jul 26 17:09:47 opensourceecology.org systemd[1]: httpd.service: main process...E Jul 26 17:09:47 opensourceecology.org systemd[1]: Failed to start The Apache .... Jul 26 17:09:47 opensourceecology.org systemd[1]: Unit httpd.service entered .... Jul 26 17:09:47 opensourceecology.org systemd[1]: httpd.service failed. Hint: Some lines were ellipsized, use -l to show in full. [root@opensourceecology ~]#
- it says that the port is already in use
<pre>
[root@opensourceecology ~]# journalctl -u httpd --no-pager
-- Logs begin at Fri 2024-07-26 17:09:34 UTC, end at Fri 2024-07-26 17:15:26 UTC. --
Jul 26 17:09:47 opensourceecology.org systemd[1]: Starting The Apache HTTP Server...
Jul 26 17:09:47 opensourceecology.org httpd[1094]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:443
Jul 26 17:09:47 opensourceecology.org httpd[1094]: (98)Address already in use: AH00072: make_sock: could not bind to address 0.0.0.0:443
Jul 26 17:09:47 opensourceecology.org httpd[1094]: no listening sockets available, shutting down
Jul 26 17:09:47 opensourceecology.org httpd[1094]: AH00015: Unable to open logs
Jul 26 17:09:47 opensourceecology.org systemd[1]: httpd.service: main process exited, code=exited, status=1/FAILURE
Jul 26 17:09:47 opensourceecology.org systemd[1]: Failed to start The Apache HTTP Server.
Jul 26 17:09:47 opensourceecology.org systemd[1]: Unit httpd.service entered failed state.
Jul 26 17:09:47 opensourceecology.org systemd[1]: httpd.service failed.
[root@opensourceecology ~]#
</pre>
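A bind failure on :443 means some other process grabbed the port first (in this stack, presumably nginx, which terminates TLS in front of varnish/apache). A generic way to see who owns a port, sketched with a throwaway listener on :8443 standing in for :443 (assumes `python3` and `ss` from iproute2 are available):

```shell
# Start a disposable listener, then ask ss which process holds the port.
python3 -m http.server 8443 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1
ss -tlnp | grep ':8443'   # -p shows the pid/program bound to the port
kill "$srv"
```

On the real host, `ss -tlnp | grep ':443 '` (or `lsof -i :443`) would have named the conflicting process directly.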
- before I start making changes, I'm going to initiate another backup (and wait at least 30 minutes for the tar to finish)
- I'm going to want to diff the apache configs, so I made a copy of the backup that I made just before the updates and copied it into the temp CHG dir
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# mkdir backup_before
[root@opensourceecology CHG-2024-07-26_yum_update]# rsync -av --progress /home/b2user/sync.old/daily_hetzner2_20240726_160837.tar.gpg backup_before/
sending incremental file list
daily_hetzner2_20240726_160837.tar.gpg
 20,622,312,871 100%  127.14MB/s    0:02:34 (xfr#1, to-chk=0/1)

sent 20,627,347,744 bytes  received 35 bytes  133,510,341.61 bytes/sec
total size is 20,622,312,871  speedup is 1.00
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
- well, unfortunately the wiki being down means I can't reference our docs on how to restore backups, but I managed to figure it out
[root@opensourceecology backup_before]# gpg --batch --passphrase-file /root/backups/ose-backups-cron.key --decrypt daily_hetzner2_20240726_160837.tar.gpg > daily_hetzner2_20240726_160837.tar gpg: AES256 encrypted data gpg: encrypted with 1 passphrase [root@opensourceecology backup_before]# [root@opensourceecology backup_before]# du -sh * 20G daily_hetzner2_20240726_160837.tar 20G daily_hetzner2_20240726_160837.tar.gpg [root@opensourceecology backup_before]# [root@opensourceecology backup_before]# du -sh * 20G daily_hetzner2_20240726_160837.tar 20G daily_hetzner2_20240726_160837.tar.gpg [root@opensourceecology backup_before]# [root@opensourceecology backup_before]# tar -xf daily_hetzner2_20240726_160837.tar [root@opensourceecology backup_before]# [root@opensourceecology backup_before]# du -sh * 20G daily_hetzner2_20240726_160837.tar 20G daily_hetzner2_20240726_160837.tar.gpg 20G root [root@opensourceecology backup_before]# rm -f daily_hetzner2_20240726_160837.tar.gpg [root@opensourceecology backup_before]# [root@opensourceecology backup_before]# du -sh * 20G daily_hetzner2_20240726_160837.tar 20G root [root@opensourceecology backup_before]#
- to make this easier for the next person, I created a README directly in the backups dir
[root@opensourceecology backups]# cat /root/backups/README.txt 2024-07-26 ========== The process to restore from backups is documented on the wiki * https://wiki.opensourceecology.org/wiki/Backblaze#Restore_from_backups Oh, the wiki is down and you need to restore from backups to restore the wiki? Don't worry, I got you. All backups are stored on Backblaze B2. You can download them with rclone or just by logging into the Backblaze B2 WUI. First decrypt the main wrapper tar with `gpg` gpg --batch --passphrase-file <path-to-symmetric-encrypton-private-key> --decrypt <path-to-encrypted-tarball> > <path-to-decrypted-tarball> For example: gpg --batch --passphrase-file /root/backups/ose-backups-cron.key --decrypt daily_hetzner2_20240726_160837.tar.gpg > daily_hetzner2_20240726_160837.tar Then you can untar the wrapper tarball and the compressed tarball inside of that. For example: tar -xf daily_hetzner2_20240726_160837.tar cd root/backups/sync/daily_hetzner2_20240726_160837/www/ tar -xf www.20240726_160837.tar.gz head var/www/html/www.opensourceecology.org/htdocs/index.php --Michael Altfield <https://michaelaltfield.net.> [root@opensourceecology backups]#
- and I was able to extract the www files from the backups prior to the update
<pre>
[root@opensourceecology backup_before]# cd root/backups/sync/daily_hetzner2_20240726_160837/www/
[root@opensourceecology www]#
[root@opensourceecology www]# ls
www.20240726_160837.tar.gz
[root@opensourceecology www]#
[root@opensourceecology www]# tar -xf www.20240726_160837.tar.gz
[root@opensourceecology www]#
[root@opensourceecology www]# du -sh *
32G     var
19G     www.20240726_160837.tar.gz
[root@opensourceecology www]#
</pre>
- oh, actually I want the /etc/ config files
<pre>
[root@opensourceecology www]# cd ../etc
[root@opensourceecology etc]#
[root@opensourceecology etc]# tar -xf etc.20240726_160837.tar.gz
[root@opensourceecology etc]#
[root@opensourceecology etc]# du -sh *
46M     etc
13M     etc.20240726_160837.tar.gz
[root@opensourceecology etc]#
</pre>
- a diff of the pre-update configs and the current configs shows 4x new files
<pre>
[root@opensourceecology etc]# diff -ril etc/httpd /etc/httpd
diff: etc/httpd/logs: No such file or directory
diff: etc/httpd/modules: No such file or directory
Only in /etc/httpd/conf.d: autoindex.conf
Only in /etc/httpd/conf.d: ssl.conf
Only in /etc/httpd/conf.d: userdir.conf
Only in /etc/httpd/conf.d: welcome.conf
[root@opensourceecology etc]#
</pre>
- I just moved these 4x files out (into our tmp change dir), and tried a restart; it came up
[root@opensourceecology CHG-2024-07-26_yum_update]# mkdir moved_from_etc_httpd [root@opensourceecology CHG-2024-07-26_yum_update]# mv /etc/httpd/conf.d/autoindex.conf moved_from_etc_httpd/ [root@opensourceecology CHG-2024-07-26_yum_update]# mv /etc/httpd/conf.d/ssl.conf moved_from_etc_httpd/ [root@opensourceecology CHG-2024-07-26_yum_update]# mv /etc/httpd/conf.d/userdir.conf moved_from_etc_httpd/ [root@opensourceecology CHG-2024-07-26_yum_update]# mv /etc/httpd/conf.d/welcome.conf moved_from_etc_httpd/ [root@opensourceecology CHG-2024-07-26_yum_update]# [root@opensourceecology CHG-2024-07-26_yum_update]# systemctl restart httpd [root@opensourceecology CHG-2024-07-26_yum_update]# [root@opensourceecology CHG-2024-07-26_yum_update]# systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2024-07-26 17:59:36 UTC; 4s ago Docs: man:httpd(8) man:apachectl(8) Main PID: 15908 (httpd) Status: "Processing requests..." CGroup: /system.slice/httpd.service ├─15908 /usr/sbin/httpd -DFOREGROUND ├─15910 /usr/sbin/httpd -DFOREGROUND ├─15911 /usr/sbin/httpd -DFOREGROUND ├─15912 /usr/sbin/httpd -DFOREGROUND ├─15913 /usr/sbin/httpd -DFOREGROUND ├─15914 /usr/sbin/httpd -DFOREGROUND ├─15921 /usr/sbin/httpd -DFOREGROUND ├─15927 /usr/sbin/httpd -DFOREGROUND ├─15928 /usr/sbin/httpd -DFOREGROUND ├─15936 /usr/sbin/httpd -DFOREGROUND ├─15937 /usr/sbin/httpd -DFOREGROUND ├─15938 /usr/sbin/httpd -DFOREGROUND └─15939 /usr/sbin/httpd -DFOREGROUND Jul 26 17:59:36 opensourceecology.org systemd[1]: Starting The Apache HTTP Se.... Jul 26 17:59:36 opensourceecology.org systemd[1]: Started The Apache HTTP Server. Hint: Some lines were ellipsized, use -l to show in full. [root@opensourceecology CHG-2024-07-26_yum_update]#
- I was able to load and edit the wiki; I spent some time adding some updates to the CHG article https://wiki.opensourceecology.org/wiki/CHG-2020-05-04_yum_update
- for some reason my browser keeps locking-up when all I'm trying to do is edit the text in the textarea for ^ this wiki article. I don't use the wysiwyg editor. I'm literally just editing text in a textarea; that shouldn't require any processing
- It took me ~20 minutes just to make a few changes to one wiki article because the page on firefox kept locking-up, sometimes displaying a spinning circle over the page
- I launched a new DispVM with *only* firefox running and *only* one tab open in firefox. The issue persisted, and the VM with the (idle) firefox on the edit page was taxed with 20-60% CPU usage; something is definitely wrong, but it's unclear if the bug is on our mediawiki server, my firefox client, or both
- anyway, I'm continuing with the validation steps
- I was successfully able to load the frontpage of all the 9x websites
- the logo at the top (and bottom) of https://oswh.opensourceecology.org/ was missing, but I'm not sure if that was the case before the updates or not
- I simply get a 404 on the image http://www.opensourcewarehouse.org/wp-content/uploads/2013/02/headfooter-logonew.png
- I guess the domain is wrong; we don't appear to use opensourcewarehouse.org anymore, so I guess this was an issue that predates our updates now
- everything else looked good
- I logged into the munin. It loads fine
- I do see some gaps in the mysql charts where everything drops to 0 for a few hours, which I guess is when/why Marcin did reboots again. My job isn't to investigate that now, but I'm just making a note here
- otherwise munin is working; validated.
- I logged into awstats. it loads fine
- I just quickly scanned the main pages for www.opensourceecology.org and wiki.opensourceecology.org; they look fine
- I already tested edits on wiki.opensourceecology.org; they're working (setting aside the client-side lag)
- I was successfully able to make a trivial change to the main wordpress site, and then revert that change https://www.opensourceecology.org/offline-wiki-zim-kiwix/
- the only thing left is the backups, which have been running in the background since shortly after the reboot
- the backups finished being created successfully
- the backups are currently being uploaded at the rate-limited 3 MB/s. they're at 39% now, and estimated to finish uploading in 1h10m from now.
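As a quick sanity check on that estimate (using the figures above: 61% remaining, 3 MB/s, ~70 minutes left), the implied total upload size works out to roughly 20 GB, which matches the ~20G backup tarball:

```shell
# Rough check of the upload ETA: 61% remaining at 3 MB/s for ~70 minutes
remaining_mb=$((3 * 60 * 70))          # MB still to upload = 12600
total_mb=$((remaining_mb * 100 / 61))  # implied total size in MB (~20.6 GB)
echo "$total_mb MB"
```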
- the upload is the last step; that's good enough for me to consider the backups functional
- that completes our validation; I think it's safe to mark this change as successful
- I sent an update email to Marcin & Catarina
<pre>
Hey Marcin & Catarina,

I've finished updating the system packages on the hetzner2 server.

It's a very good thing that we did this, because your server tried and failed to download its updates from 484 mirrors before it finally found a server that it could download its updates from.

As I mentioned in Nov 2022, your server runs CentOS 7, which stopped receiving "Full Updates" by Red Hat in Aug 2020. As of Jun 2024, it is no longer going to be updated in any way (security, maintenance, etc). At some point in the future, I guess all of their update servers will go down too. We're lucky at least one was still online.

 * https://wiki.opensourceecology.org/wiki/Maltfield_Log/2022#Wed_November_02.2C_2022

Today I was successfully able to update 434 system packages on hetzner2. I did some quick validation of a subset of your websites, and I only found a couple of minor errors:

 1. The header & footer images of oswh don't load https://oswh.opensourceecology.org/
 2. Editing the wiki sometimes causes my browser to lock-up; it's not clear if this is a pre-existing issue, or if the issue is caused by your server or my client

I did not update your server's applications that cannot be updated by the package manager (eg wordpress, mediawiki, etc). If you don't detect any issues with your server, then I would recommend that we do the application upgrade simultaneously with a migration to a new server running Debian.

I'd like to stress again the urgency of the need to migrate off of CentOS 7. Besides the obvious security risks of running a server that is no longer receiving security patches, at some point in the likely-not-too-distant future, your server is going to break and it will be extremely non-trivial to fix it. The deadline for migrating was in 2020. I highly recommend prioritizing a project to migrate your server to a new Debian server ASAP.

Please spend some time testing your various websites, and let me know if you experience any issues.

Thank you,

Michael Altfield
Senior Technology Advisor
PGP Fingerprint: 8A4B 0AF8 162F 3B6A 79B7 70D2 AA3E DF71 60E2 D97B

Open Source Ecology
www.opensourceecology.org
</pre>
- I confirmed the list of updates on the server is now empty
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# yum list updates
Loaded plugins: fastestmirror, replace
Loading mirror speeds from cached hostfile
 * base: ftp.plusline.net
 * epel: mirrors.n-ix.net
 * extras: ftp.plusline.net
 * updates: mirror.checkdomain.de
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
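For future changes, this verification could be scripted instead of eyeballing `yum list updates`: `yum check-update` exits 100 when updates are pending and 0 when the system is current. A small sketch (the function name is mine):

```shell
# Map 'yum check-update' exit codes to a human-readable verdict.
# yum exits 0 = no updates pending, 100 = updates available, 1 = error.
interpret_check_update() {
  case "$1" in
    0)   echo "up-to-date" ;;
    100) echo "updates-pending" ;;
    *)   echo "error" ;;
  esac
}

# usage (on the server): yum -q check-update >/dev/null; interpret_check_update $?
```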
- I'm considering the change successful
- looks like my tmp change dir pushed the disk to 86% capacity; let's clean that up
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  161G   27G  86% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology CHG-2024-07-26_yum_update]# ls
after.log      before.log            needs-reboot.log      update.log
backup_before  moved_from_etc_httpd  needs-restarting.log
[root@opensourceecology CHG-2024-07-26_yum_update]# du -sh *
28K     after.log
70G     backup_before
28K     before.log
28K     moved_from_etc_httpd
4.0K    needs-reboot.log
4.0K    needs-restarting.log
216K    update.log
[root@opensourceecology CHG-2024-07-26_yum_update]# ls before.log
before.log
[root@opensourceecology CHG-2024-07-26_yum_update]# 
[root@opensourceecology CHG-2024-07-26_yum_update]# ls backup_before/
daily_hetzner2_20240726_160837.tar  root
[root@opensourceecology CHG-2024-07-26_yum_update]# 
[root@opensourceecology CHG-2024-07-26_yum_update]# du -sh backup_before/*
20G     backup_before/daily_hetzner2_20240726_160837.tar
51G     backup_before/root
[root@opensourceecology CHG-2024-07-26_yum_update]# 
[root@opensourceecology CHG-2024-07-26_yum_update]# ls backup_before/root
backups
[root@opensourceecology CHG-2024-07-26_yum_update]# 
[root@opensourceecology CHG-2024-07-26_yum_update]# rm -rf backup_before/root
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
- great, now we're down to 59%
<pre>
[root@opensourceecology CHG-2024-07-26_yum_update]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G   17M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        197G  110G   78G  59% /
/dev/md1        488M  386M   77M  84% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology CHG-2024-07-26_yum_update]#
</pre>
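In case the disk fills up again, a helper like this finds the biggest offenders quickly (the function name is mine; `-x` keeps du from crossing into other filesystems):

```shell
# Print the N largest entries directly under DIR, largest first.
# du -xsk gives sortable sizes in KiB and stays on one filesystem.
find_largest() {
  dir="$1"; n="${2:-5}"
  du -xsk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```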
Wed July 24, 2024
- Marcin contacted me a few days ago saying that the server needs reboots again
- I found that the last time we did a system packages update was in 2020, over 4 years ago. I strongly recommended that we update the system packages, and probably the web applications as well
- here's the link to the last time I updated the system packages in May 2020 https://wiki.opensourceecology.org/wiki/CHG-2020-05-04_yum_update
- I also noted that CentOS 7 is now fully EOL: "Full Updates" ended back in 2020, and as of June 2024 it no longer receives even maintenance or security updates.
- I warned Marcin about this approaching deadline in Nov 2022, and urged him to migrate to a new OS before 2024.
<pre>
In my prior work at OSE, I've done my best to design your systems to be robust and "well oiled" so that they would run for as long as possible with as little maintenance as possible. However, code rots over time, and there's only so long you can hold-off before things fall apart.

Python 2.7.5 was End-of-Life'd on 2020-01-01, and it no longer receives any updates.

 * https://en.wikipedia.org/wiki/History_of_Python

CentOS 7.7 was released 2019-09-17. "Full Updates" stopped 2020-08-06, and it will no longer receive any maintenance updates after 2024-06-30.

 * https://wiki.centos.org/About/Product

At some point, you're going to want to migrate to a new server with a new OS. I strongly recommend initiating this project before 2024.
</pre>
- Here's the log entry https://wiki.opensourceecology.org/wiki/Maltfield_Log/2022#Wed_November_02.2C_2022
- I told Marcin to budget for ~$10,000 to migrate to a new server, as it's going to be a massive project that will likely require more than a month of full-time work to complete the migration
- Marcin said I should go ahead and prepare a CHG "ticket" article for the upgrade and schedule a time to do it
- I prepared a change ticket for updating the system packages on Friday https://wiki.opensourceecology.org/wiki/CHG-2024-07-26_yum_update
- I also noticed that I kept getting de-auth'd every few minutes on the wiki. That's annoying. Hopefully updates will help this (and other) issues go away.
- If we did a migration to debian, then we'd need to migrate to a new server
- previously when we migrated from hetzner1 to hetzner2, we got a 15x increase in RAM (from 4GB to 64GB). And the price of both servers was the same!
- I was expecting the next jump would have similar results: we'd migrate to a new server that costs the same for much better specs, but that's not looking like it's going to be the case :(
- Here's the currently-offered dedicated servers at hetzner https://www.hetzner.com/dedicated-rootserver/
- Currently we have 8-cores, 64G RAM, and two 250G disks in a RAID-1 software array. We pay 39 EUR/mo
- The cheapest dedicated server (EX44) currently is 46.41 EUR/month and comes with 14-cores, 64G RAM, and 2x 512G disks. That should meet our requirements https://www.hetzner.com/dedicated-rootserver/ex44/configurator/#/
- oh crap, we'd be downgrading the proc from the i7 (Intel® Core™ i7-6700) to an i5 (Intel® Core™ i5-13500)
- I'd have to check the munin charts, but I would be surprised if we ever break a load of 2, so that's still probably fine.
- I met with Marcin tonight to discuss [a] the system-level package upgrades, [b] the application (eg wordpress, mediawiki, etc) upgrades, and [c] the server migration
- I recommended that Marcin do the updates on staging, and described the risk of not doing it
- the problem is that the current staging environment is down, and it may take a few days to restore it
- the risk is maybe a few days of downtime instead of a smaller change window during the update
- we agreed that I'll do the system-level package upgrades direct-to-production; Marcin accepted the risk of a few days of downtime
- Marcin also mentioned that Hetzner has a "server auction" page, which has some more servers that meet our needs at a slightly discounted price https://www.hetzner.com/sb/
- actually many of these are 37.72 EUR/mo, so they're actually *cheaper* than our current 39 EUR/mo. Great!
- there's >3 pages of servers for this 37.72 EUR/mo price. One of them has 2x 4TB drives (though it looks like spinning disks). This is a server graveyard built-to-spec for previous customers, it seems. We should be able to find one that meets our needs, so that means we'll easily double our disk and save ~15 EUR per year. Cool :)
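Checking that savings estimate (39.00 EUR/mo now vs 37.72 EUR/mo on the auction page):

```shell
# annual savings from switching to a 37.72 EUR/mo auction server
awk 'BEGIN { printf "%.2f EUR/year\n", (39.00 - 37.72) * 12 }'
```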
Fri March 08, 2024
- It's been 1 week since I fixed the VPN cert & ossec & email alert spam loop on the server, so today I'm checking-in to see if that fixed the issue with the apache ram spike & mysql backend outage that was causing varnish errors on the wiki and other sites
- Since the last change on March 01, I got two emails from Hetzner about the server being reset
- 2024-03-03 11:47 and
- 2024-03-04 16:28
- I emailed Marcin, and he confirmed that the server had the same issues, so he initiated the reboots
- I logged into Munin
- postfix mailqueue is lower, but it never really was very high. now it's about 1-3 items in the queue. before my change it was constantly about 4. The peak was back in August when it shot up to 140. Perhaps email campaigns. Anyway, though the emails were very annoying, I don't think that was the issue.
- I see another 2 gaps in the mysql charts for the last week. The X-axis isn't dated, so it's hard to pin-point exactly when, but they likely preceded Marcin's two reboots.
- corresponding to those two gaps are a massive spike in the "number of threads" chart — from about 480 threads to ~850 threads — in both instances
- again, we see httpd rss (ram) usage spike at the same time as the outages
- I checked to see if there were any OOM errors killing the mysql server, but I didn't find anything
<pre>
[root@opensourceecology log]# journalctl --no-pager | grep -iA300 'invoked oom-killer'
[root@opensourceecology log]# 
[root@opensourceecology log]# journalctl --no-pager | grep -iA300 'oom-killer'
[root@opensourceecology log]# 
[root@opensourceecology log]# dmesg -T | grep -A300 -i 'invoked oom-killer' | grep -E "\[\s*[0-9]+\]|\[ pid \]" | egrep '[0-9]{6,7}'
[root@opensourceecology log]#
</pre>
# indeed, it looks like the system would kill mysql second. But varnish first. Anyway, probably httpd should have a higher score than our DB. But this system has 64 GB of RAM; it's hard to imagine it running-out. Anyway, logs above show no OOM event; unless it doesn't go far back enough
<pre>
[root@opensourceecology log]# printf 'PID\tOOM Score\tOOM Adj\tCommand\n' && while read -r pid comm; do [ -f /proc/$pid/oom_score ] && [ $(cat /proc/$pid/oom_score) != 0 ] && printf '%d\t%d\t\t%d\t%s\n' "$pid" "$(cat /proc/$pid/oom_score)" "$(cat /proc/$pid/oom_score_adj)" "$comm"; done < <(ps -e -o pid= -o comm=) | sort -k 2nr | head -n 10
PID     OOM Score       OOM Adj Command
2115    220             0       varnishd
1650    5               0       mysqld
11658   1               0       awstats.pl
13681   1               0       httpd
13761   1               0       httpd
14152   1               0       httpd
[root@opensourceecology log]# 
[root@opensourceecology log]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          22          34           0           6          39
Swap:            31           0          31
[root@opensourceecology log]#
</pre>
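That one-liner is a lot to re-type; a readable equivalent for spot-checking a single process (the function name is mine; it reads the same /proc files):

```shell
# Print a PID's current OOM badness score and its adjustment value,
# tab-separated: PID, oom_score, oom_score_adj
show_oom() {
  printf '%s\t%s\t%s\n' "$1" "$(cat /proc/$1/oom_score)" "$(cat /proc/$1/oom_score_adj)"
}
```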
- I checked the apache logs for March 3 & 4. There's some errors but nothing jumped-out as an issue, except that the logs were surprisingly short
- I checked the mysql logs
- preceding the line
mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
(which would have happened after Marcin's reboot), I see some errors indicating the DB crashed and needs to be repaired
- curiously, I don't see a trend of this from last week. I only see two of these (from this week only)
- Here's the log from 2024-03-03
- preceding the line
<pre>
240229 22:32:47 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
240229 22:32:47 [Note] /usr/libexec/mysqld (mysqld 5.5.65-MariaDB) starting as process 1673 ...
240229 22:32:47 InnoDB: The InnoDB memory heap is disabled
240229 22:32:47 InnoDB: Mutexes and rw_locks use GCC atomic builtins
240229 22:32:47 InnoDB: Compressed tables use zlib 1.2.7
240229 22:32:47 InnoDB: Using Linux native AIO
240229 22:32:47 InnoDB: Initializing buffer pool, size = 128.0M
240229 22:32:47 InnoDB: Completed initialization of buffer pool
240229 22:32:47 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
240229 22:32:48 InnoDB: Waiting for the background threads to start
240229 22:32:49 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 587581798074
240229 22:32:49 [Note] Plugin 'FEEDBACK' is disabled.
240229 22:32:49 [Note] Event Scheduler: Loaded 0 events
240229 22:32:49 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.65-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
240229 22:32:49 [ERROR] mysqld: Table './osemain_db/wp_options' is marked as crashed and should be repaired
240229 22:32:49 [Warning] Checking table:   './osemain_db/wp_options'
240229 22:33:37 [ERROR] mysqld: Table './osemain_s_db/wp_options' is marked as crashed and should be repaired
240229 22:33:37 [Warning] Checking table:   './osemain_s_db/wp_options'
240229 22:37:57 [ERROR] mysqld: Table './osewiki_db/wiki_searchindex' is marked as crashed and should be repaired
240229 22:37:57 [Warning] Checking table:   './osewiki_db/wiki_searchindex'
240229 23:48:30 [ERROR] mysqld: Table './oswh_db/wp_options' is marked as crashed and should be repaired
240229 23:48:30 [Warning] Checking table:   './oswh_db/wp_options'
240303 16:49:08 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
</pre>
- And here's the log from 2024-03-04
<pre>
240303 16:49:08 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
240303 16:49:08 [Note] /usr/libexec/mysqld (mysqld 5.5.65-MariaDB) starting as process 1689 ...
240303 16:49:08 InnoDB: The InnoDB memory heap is disabled
240303 16:49:08 InnoDB: Mutexes and rw_locks use GCC atomic builtins
240303 16:49:08 InnoDB: Compressed tables use zlib 1.2.7
240303 16:49:08 InnoDB: Using Linux native AIO
240303 16:49:08 InnoDB: Initializing buffer pool, size = 128.0M
240303 16:49:08 InnoDB: Completed initialization of buffer pool
240303 16:49:08 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
240303 16:49:09 InnoDB: Waiting for the background threads to start
240303 16:49:10 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 587767716547
240303 16:49:10 [Note] Plugin 'FEEDBACK' is disabled.
240303 16:49:10 [Note] Event Scheduler: Loaded 0 events
240303 16:49:10 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.65-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
240304 21:29:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
240304 21:29:24 [Note] /usr/libexec/mysqld (mysqld 5.5.65-MariaDB) starting as process 1650 ...
240304 21:29:24 InnoDB: The InnoDB memory heap is disabled
240304 21:29:24 InnoDB: Mutexes and rw_locks use GCC atomic builtins
240304 21:29:24 InnoDB: Compressed tables use zlib 1.2.7
240304 21:29:24 InnoDB: Using Linux native AIO
240304 21:29:24 InnoDB: Initializing buffer pool, size = 128.0M
240304 21:29:24 InnoDB: Completed initialization of buffer pool
240304 21:29:24 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
240304 21:29:24 InnoDB: Waiting for the background threads to start
240304 21:29:25 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 587833491019
240304 21:29:25 [Note] Plugin 'FEEDBACK' is disabled.
240304 21:29:25 [Note] Event Scheduler: Loaded 0 events
240304 21:29:25 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.65-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server
240304 21:29:27 [ERROR] mysqld: Table './osemain_db/wp_options' is marked as crashed and should be repaired
240304 21:29:27 [Warning] Checking table:   './osemain_db/wp_options'
240304 21:42:21 [ERROR] mysqld: Table './osemain_s_db/wp_options' is marked as crashed and should be repaired
240304 21:42:21 [Warning] Checking table:   './osemain_s_db/wp_options'
</pre>
- Actually, those errors occurred immediately after startup. So likely they're a symptom of the previous abrupt shutdown and not the cause of the shutdown.
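If any of those tables are still flagged, the standard fix would be mysqlcheck's auto-repair. A sketch that just composes the invocation (auth options are omitted; the db/table names come from the errors above):

```shell
# Compose a mysqlcheck repair invocation for one flagged table.
build_repair_cmd() {
  printf 'mysqlcheck --auto-repair --check %s %s\n' "$1" "$2"
}

build_repair_cmd osewiki_db wiki_searchindex
```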
- anyway, we learned that the server came back online at 2024-03-03 16:49:08 and 2024-03-04 21:29:24 UTC
- also, if I check the charts, the outages appear to be ~12 hours before Marcin rebooted the server in both cases
- curiously, the last message in the httpd error log on 2024-03-03 is 08:27:41
- for some reason the Mar 03 logs spill over from error_log-20240303.gz to error_log-20240304.gz
- Here's the httpd error log's lines preceding the reboot at Mar 03 16:49
<pre>
[Sat Mar 02 07:33:18.383448 2024] [lbmethod_heartbeat:notice] [pid 1098] AH02282: No slotmem from mod_heartmonitor
[Sat Mar 02 07:33:18.383819 2024] [ssl:warn] [pid 1098] AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]
[Sat Mar 02 07:33:18.384664 2024] [ssl:warn] [pid 1098] AH02292: Init: Name-based SSL virtual hosts only work for clients with TLS server name indication support (RFC 4366)
[Sat Mar 02 07:33:18.442973 2024] [mpm_prefork:notice] [pid 1098] AH00163: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips configured -- resuming normal operations
[Sat Mar 02 07:33:18.442983 2024] [core:notice] [pid 1098] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Sat Mar 02 09:28:33.102614 2024] [:error] [pid 7188] [client 127.0.0.1:37960] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "awstats.opensourceecology.org"] [uri "/wp-login.php"] [unique_id "ZeLxQepLYu-sFhwtgY8GhQAAAA4"], referer: http://awstats.opensourceecology.org/wp-login.php
[Sat Mar 02 09:28:33.254850 2024] [:error] [pid 7188] [client 127.0.0.1:37960] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "munin.opensourceecology.org"] [uri "/wp-login.php"] [unique_id "ZeLxQepLYu-sFhwtgY8GhgAAAA4"], referer: http://munin.opensourceecology.org/wp-login.php
[Sat Mar 02 21:04:15.333825 2024] [:error] [pid 23728] [client 127.0.0.1:47218] script '/var/www/html/certbot/htdocs/wp-login.php' not found or unable to stat
[Sun Mar 03 01:31:46.528392 2024] [:error] [pid 2068] [client 127.0.0.1:33370] script '/var/www/html/certbot/htdocs/timthumb.php' not found or unable to stat
[Sun Mar 03 01:31:54.395564 2024] [:error] [pid 2689] [client 127.0.0.1:33488] script '/var/www/html/certbot/htdocs/thumb.php' not found or unable to stat
[Sun Mar 03 01:31:59.971621 2024] [:error] [pid 3775] [client 127.0.0.1:33512] script '/var/www/html/certbot/htdocs/timthumb.php' not found or unable to stat
[Sun Mar 03 01:32:03.104518 2024] [:error] [pid 3775] [client 127.0.0.1:33512] script '/var/www/html/certbot/htdocs/thumb.php' not found or unable to stat
[Sun Mar 03 01:32:03.742281 2024] [access_compat:error] [pid 4096] [client 127.0.0.1:33596] AH01797: client denied by server configuration: /var/www/html/certbot/htdocs/.git
[Sun Mar 03 01:32:24.865011 2024] [access_compat:error] [pid 3775] [client 127.0.0.1:33876] AH01797: client denied by server configuration: /var/www/html/certbot/htdocs/.git
[Sun Mar 03 05:38:12.408871 2024] [mpm_prefork:error] [pid 1098] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[Sun Mar 03 08:27:41.722042 2024] [mpm_prefork:notice] [pid 1098] AH00171: Graceful restart requested, doing restart
[Sun Mar 03 08:27:41.803267 2024] [lbmethod_heartbeat:notice] [pid 1098] AH02282: No slotmem from mod_heartmonitor
[Sun Mar 03 08:27:41.806580 2024] [ssl:warn] [pid 1098] AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]
[Sun Mar 03 08:27:41.810343 2024] [ssl:warn] [pid 1098] AH02292: Init: Name-based SSL virtual hosts only work for clients with TLS server name indication support (RFC 4366)
[Sun Mar 03 08:27:41.844035 2024] [mpm_prefork:notice] [pid 1098] AH00163: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips configured -- resuming normal operations
[Sun Mar 03 08:27:41.844044 2024] [core:notice] [pid 1098] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Sun Mar 03 16:49:08.706487 2024] [suexec:notice] [pid 1104] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
</pre>
- Here's the httpd error log's lines preceding the reboot at Mar 04 21:29
<pre>
[Mon Mar 04 01:48:24.693378 2024] [:error] [pid 26475] [client 127.0.0.1:36712] script '/var/www/html/certbot/htdocs/about.php' not found or unable to stat
[Mon Mar 04 04:50:27.939387 2024] [:error] [pid 4300] [client 127.0.0.1:54342] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "munin.opensourceecology.org"] [uri "/favicon.ico"] [unique_id "ZeVTExLnCbFmyBDlR2p-RAAAAAU"]
[Mon Mar 04 04:53:28.722540 2024] [:error] [pid 7925] [client 127.0.0.1:56620] script '/var/www/html/certbot/htdocs/wp-login.php' not found or unable to stat
[Mon Mar 04 05:22:50.379507 2024] [mpm_prefork:notice] [pid 1104] AH00171: Graceful restart requested, doing restart
[Mon Mar 04 05:22:50.471872 2024] [lbmethod_heartbeat:notice] [pid 1104] AH02282: No slotmem from mod_heartmonitor
[Mon Mar 04 05:22:50.472237 2024] [ssl:warn] [pid 1104] AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]
[Mon Mar 04 05:22:50.473094 2024] [ssl:warn] [pid 1104] AH02292: Init: Name-based SSL virtual hosts only work for clients with TLS server name indication support (RFC 4366)
[Mon Mar 04 05:22:50.507737 2024] [mpm_prefork:notice] [pid 1104] AH00163: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips configured -- resuming normal operations
[Mon Mar 04 05:22:50.507745 2024] [core:notice] [pid 1104] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Mon Mar 04 05:27:16.544129 2024] [:error] [pid 21743] [client 127.0.0.1:51880] script '/var/www/html/certbot/htdocs/xmlrpc.php' not found or unable to stat
[Mon Mar 04 07:27:16.824773 2024] [:error] [pid 2759] [client 127.0.0.1:41932] script '/var/www/html/certbot/htdocs/xmlrpc.php' not found or unable to stat
[Mon Mar 04 08:06:12.898889 2024] [:error] [pid 16264] [client 127.0.0.1:38670] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "munin.opensourceecology.org"] [uri "/thoms.php"] [unique_id "ZeWA9CMy-toJ1WTJh4QWygAAAAo"]
... (the above line repeats 357 times) ...
[Mon Mar 04 09:30:27.312228 2024] [mpm_prefork:error] [pid 1104] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[Mon Mar 04 21:29:24.550406 2024] [suexec:notice] [pid 1107] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
</pre>
- from above, it looks like apache kicked-off child workers until this error was thrown
server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
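For reference, if we did decide to raise the limit, a prefork tuning fragment would look something like the sketch below. The values are illustrative only (not a recommendation), and raising the cap without first understanding the httpd RAM spikes could make the memory problem worse, since each prefork worker is a full process:

```apache
<IfModule mpm_prefork_module>
    # ServerLimit must be >= MaxRequestWorkers under prefork
    ServerLimit           512
    MaxRequestWorkers     512
</IfModule>
```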
- there's probably some delay between the root cause and the MaxRequestWorkers error, but we're getting closer: the errors popped-up at Mar 03 05:38:12 and Mar 04 09:30:27 UTC
- curiously, the very last line in access_log-20240303.gz occurred at this exact minute
<pre>
127.0.0.1 - - [03/Mar/2024:05:37:41 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
127.0.0.1 - - [03/Mar/2024:05:38:02 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
</pre>
- that "internal dummy connection" appears to hit the apache server 1-8-ish times per second
- there is no request from 09:30 on Mar 04, but there is one a minute earlier at 09:29. And the next one after that is after the reboot.
<pre>
127.0.0.1 - - [04/Mar/2024:09:27:31 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
127.0.0.1 - - [04/Mar/2024:09:29:29 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
127.0.0.1 - - [04/Mar/2024:21:29:57 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
</pre>
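To put a number on the per-second request rate, we can bucket the access log by second (a sketch; the function name is mine, and the awk field-splitting assumes the bracketed combined-log timestamp format shown above — for the rotated .gz files, pipe through zcat first):

```shell
# Tally hits per second in an Apache access log, busiest seconds first.
# Splits on [ and ] to isolate the timestamp, then drops the timezone.
count_per_second() {
  awk -F'[][]' '{print $2}' "$1" | cut -d' ' -f1 | sort | uniq -c | sort -rn | head
}
```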
- apache docs say that these "internal dummy connection" queries are initiated when it's managing child processes. so again, it's a symptom — not the cause https://cwiki.apache.org/confluence/display/httpd/InternalDummyConnection
- at Mar 04 08:06 there are a ton of requests from what looks like some bot, all returning 403 errors. But we should be able to handle this
<pre>
...
127.0.0.1 - - [04/Mar/2024:08:05:59 +0000] "GET /cjfuns.php HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:05:59 +0000] "GET /wp-content/pm.php HTTP/1.0" 403 219 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:05:59 +0000] "GET /gawean.PhP7 HTTP/1.0" 403 213 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:00 +0000] "GET /404.php HTTP/1.0" 403 209 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:00 +0000] "GET /wp.php HTTP/1.0" 403 208 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:00 +0000] "GET /radio.php HTTP/1.0" 403 211 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:01 +0000] "GET /chosen.php HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:01 +0000] "GET /simple.php HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:01 +0000] "GET /class.api.php HTTP/1.0" 403 215 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:02 +0000] "GET /wp-head.php HTTP/1.0" 403 213 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:02 +0000] "GET /file.php HTTP/1.0" 403 210 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:02 +0000] "GET /wp-content/plugins/ccx/index.php HTTP/1.0" 403 234 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:02 +0000] "GET /wp-includes/wp-class.php HTTP/1.0" 403 226 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:03 +0000] "GET /assets/css/403.php HTTP/1.0" 403 220 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:03 +0000] "GET /css/403.php HTTP/1.0" 403 213 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:03 +0000] "GET /403.php HTTP/1.0" 403 209 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:04 +0000] "GET /wp-content/plugins/instabuilder2/cache/plugins/moon.php HTTP/1.0" 403 257 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:04 +0000] "GET /wp-content/plugins/WordPressCore/include.php HTTP/1.0" 403 246 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:04 +0000] "GET /wp-content/plugins/core-plugin/include.php HTTP/1.0" 403 244 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:05 +0000] "GET /wp-content/plugins/core/include.php HTTP/1.0" 403 237 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:05 +0000] "GET /wp-content/plugins/envato-market/inc/class-envato-market-api.php HTTP/1.0" 403 266 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:05 +0000] "GET /worm0.PhP7 HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:05 +0000] "GET /wp-content/plugins/index.php HTTP/1.0" 403 230 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:06 +0000] "GET /wp-config-sample.php HTTP/1.0" 403 222 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:06 +0000] "GET /.well-known/admin.php HTTP/1.0" 403 223 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:06 +0000] "GET /mah.php HTTP/1.0" 403 209 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:07 +0000] "GET /cong.php HTTP/1.0" 403 210 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:07 +0000] "GET /.tmb/class_api.php HTTP/1.0" 403 220 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:07 +0000] "GET /wp-admin/css/colors/blue/blue.php?wall= HTTP/1.0" 403 235 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:08 +0000] "GET /wp-atom.php HTTP/1.0" 403 213 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:08 +0000] "GET /wp-content/plugins/yyobang/mar.php HTTP/1.0" 403 236 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:08 +0000] "GET /wp-admin/network/class_api.php HTTP/1.0" 403 232 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:08 +0000] "GET /wp-content/plugins/press/wp-class.php HTTP/1.0" 403 239 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:09 +0000] "GET /wp-apxupx.php?apx=upx HTTP/1.0" 403 215 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:09 +0000] "GET /wp-admin/js/widgets/class_api.php HTTP/1.0" 403 235 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:09 +0000] "GET /wp-content/themes/twentyfive/include.php HTTP/1.0" 403 242 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:10 +0000] "GET /wp-admin/user/class_api.php HTTP/1.0" 403 229 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:10 +0000] "GET /wp-admin/includes/class_api.php HTTP/1.0" 403 233 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:10 +0000] "GET /wp-seo.php HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:11 +0000] "GET /wp-uploads-config.php HTTP/1.0" 403 223 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:11 +0000] "GET /wwr.php HTTP/1.0" 403 209 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:11 +0000] "GET /wp-mail.php HTTP/1.0" 403 213 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:11 +0000] "GET /filefuns.php HTTP/1.0" 403 214 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:12 +0000] "GET /options-reading.php HTTP/1.0" 403 221 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:12 +0000] "GET /postnews.php HTTP/1.0" 403 214 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:12 +0000] "GET /thoms.php HTTP/1.0" 403 211 "-" "-"
127.0.0.1 - - [04/Mar/2024:08:06:17 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips (internal dummy connection)"
</pre>
- there are remarkably few requests hitting the server that are *not* these "internal dummy connection" lines
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240301.gz | wc -l
19
[root@opensourceecology httpd]#
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240302.gz | wc -l
29
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240303.gz | wc -l
138
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240304.gz | wc -l
43
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240305.gz | wc -l
398
[root@opensourceecology httpd]# zgrep -vi 'internal dummy' access_log-20240306.gz | wc -l
462
[root@opensourceecology httpd]#
- ah, it's individual vhost files; the wiki and phpList are by far the most active
[root@opensourceecology httpd]# for dir in $(find . -type d); do echo $dir; echo -ne '\t'; zgrep -vi 'internal dummy' $dir/access_log-20240302.gz | wc -l; done
.
	29
./wiki.opensourceecology.org
	110949
./www.openbuildinginstitute.org
	1021
./phplist.opensourceecology.org
	67288
./oswh.opensourceecology.org
	48
./forum.opensourceecology.org
	7704
./microfactory.opensourceecology.org
	77
./seedhome.openbuildinginstitute.org
	9
./staging.openbuildinginstitute.org
gzip: ./staging.openbuildinginstitute.org/access_log-20240302.gz: No such file or directory
	0
./store.opensourceecology.org
	172
./fef.opensourceecology.org
	526
./www.opensourceecology.org
	2499
./staging.opensourceecology.org
	165
./munin.opensourceecology.org
	745
[root@opensourceecology httpd]#
[root@opensourceecology httpd]# for dir in $(find . -type d); do echo $dir; echo -ne '\t'; zgrep -vi 'internal dummy' $dir/access_log-20240303.gz | wc -l; done
.
	138
./wiki.opensourceecology.org
	103831
./www.openbuildinginstitute.org
	444
./phplist.opensourceecology.org
	82226
./oswh.opensourceecology.org
	111
./forum.opensourceecology.org
	2914
./microfactory.opensourceecology.org
	112
./seedhome.openbuildinginstitute.org
	21
./staging.openbuildinginstitute.org
gzip: ./staging.openbuildinginstitute.org/access_log-20240303.gz: No such file or directory
	0
./store.opensourceecology.org
	81
./fef.opensourceecology.org
	641
./www.opensourceecology.org
	2066
./staging.opensourceecology.org
	1696
./munin.opensourceecology.org
gzip: ./munin.opensourceecology.org/access_log-20240303.gz: No such file or directory
	0
[root@opensourceecology httpd]#
[root@opensourceecology httpd]# for dir in $(find . -type d); do echo $dir; echo -ne '\t'; zgrep -vi 'internal dummy' $dir/access_log-20240304.gz | wc -l; done
.
	43
./wiki.opensourceecology.org
	56407
./www.openbuildinginstitute.org
	426
./phplist.opensourceecology.org
	30369
./oswh.opensourceecology.org
	69
./forum.opensourceecology.org
	1345
./microfactory.opensourceecology.org
	29
./seedhome.openbuildinginstitute.org
	5
./staging.openbuildinginstitute.org
gzip: ./staging.openbuildinginstitute.org/access_log-20240304.gz: No such file or directory
	0
./store.opensourceecology.org
	487
./fef.opensourceecology.org
	238
./www.opensourceecology.org
	2284
./staging.opensourceecology.org
	108
./munin.opensourceecology.org
gzip: ./munin.opensourceecology.org/access_log-20240304.gz: No such file or directory
	0
[root@opensourceecology httpd]#
[root@opensourceecology httpd]# for dir in $(find . -type d); do echo $dir; echo -ne '\t'; zgrep -vi 'internal dummy' $dir/access_log-20240305.gz | wc -l; done
.
	398
./wiki.opensourceecology.org
	50107
./www.openbuildinginstitute.org
	447
./phplist.opensourceecology.org
	23000
./oswh.opensourceecology.org
	273
./forum.opensourceecology.org
	1614
./microfactory.opensourceecology.org
	324
./seedhome.openbuildinginstitute.org
	1
./staging.openbuildinginstitute.org
gzip: ./staging.openbuildinginstitute.org/access_log-20240305.gz: No such file or directory
	0
./store.opensourceecology.org
	139
./fef.opensourceecology.org
	933
./www.opensourceecology.org
	4151
./staging.opensourceecology.org
	185
./munin.opensourceecology.org
gzip: ./munin.opensourceecology.org/access_log-20240305.gz: No such file or directory
	0
[root@opensourceecology httpd]#
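For future runs, the find/zgrep loop above could be wrapped into a small function that skips vhost directories with no log for that day (silencing the gzip "No such file or directory" noise). A sketch only; it's demonstrated here against a throwaway fixture (the `wiki.example.org` name is invented), not the real /var/log/httpd tree:

```shell
# count non-"internal dummy" requests per vhost for a given date
count_requests() {   # $1 = log root, $2 = date suffix (YYYYMMDD)
  find "$1" -mindepth 1 -type d | while read -r dir; do
    f="$dir/access_log-$2.gz"
    [ -f "$f" ] || continue   # skip vhosts with no log for that day
    n=$(gzip -dc "$f" | grep -civ 'internal dummy' || true)
    printf '%s\t%s\n' "$n" "$dir"
  done
}

# throwaway fixture standing in for /var/log/httpd
root=$(mktemp -d)
mkdir "$root/wiki.example.org"
printf '"GET / HTTP/1.0" 200\nOPTIONS * (internal dummy connection)\n' \
  | gzip > "$root/wiki.example.org/access_log-20240304.gz"
count_requests "$root" 20240304
```

The `|| true` keeps the pipeline from failing when a day has zero real requests (grep -c exits non-zero on no matches).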
- I wouldn't be surprised if this was a DoS attack hitting some vulnerability in the wiki or phpList installs
- unfortunately the munin graphs are empty for nginx and apache, so I can't use munin to see if there was a spike of DoS requests
- I installed goaccess so I could do some analysis on the apache logs
[root@opensourceecology ~]# yum install goaccess
Loaded plugins: fastestmirror, replace
Loading mirror speeds from cached hostfile
 * base: centosmirror.netcup.net
 * epel: mirror.de.leaseweb.net
 * extras: centosmirror.netcup.net
 * updates: de.mirrors.clouvider.net
Resolving Dependencies
--> Running transaction check
---> Package goaccess.x86_64 0:1.3-1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=========================================================================================================================
 Package                  Arch                    Version                      Repository             Size
=========================================================================================================================
Installing:
 goaccess                 x86_64                  1.3-1.el7                    epel                  240 k

Transaction Summary
=========================================================================================================================
Install  1 Package

Total download size: 240 k
Installed size: 713 k
Is this ok [y/d/N]: y
Downloading packages:
goaccess-1.3-1.el7.x86_64.rpm                                                        | 240 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : goaccess-1.3-1.el7.x86_64                                                                 1/1
  Verifying  : goaccess-1.3-1.el7.x86_64                                                                 1/1

Installed:
  goaccess.x86_64 0:1.3-1.el7

Complete!
[root@opensourceecology ~]#
- a quick analysis of today's log shows that by far the most-requested page is phpList's
/lists/admin/?page=logout&err=1
- this page has 62,388 requests so far today, and is climbing by about 1 request per second
- the second-most requested URL is /feed (this one is normal) with 3,165 requests for the day
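For a quick sanity check without goaccess, the same "top requested URLs" tally can be done in plain awk, assuming combined log format. The three-line sample below is a stand-in fixture, not real log data:

```shell
# tally request URLs, most-requested first
top_urls() {
  awk -F'"' '{ split($2, r, " "); n[r[2]]++ }
             END { for (u in n) print n[u], u }' "$@" | sort -rn
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1 - - [08/Mar/2024:00:00:01 +0000] "GET /lists/admin/?page=logout&err=1 HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [08/Mar/2024:00:00:02 +0000] "GET /lists/admin/?page=logout&err=1 HTTP/1.0" 403 212 "-" "-"
127.0.0.1 - - [08/Mar/2024:00:00:03 +0000] "GET /feed HTTP/1.0" 200 100 "-" "-"
EOF
top_urls "$sample"
```

Splitting on `"` isolates the request line, so the URL survives even when it contains spaces-adjacent punctuation like `?page=logout&err=1`.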
- I used goaccess to analyze the apache & nginx logs from Mar 02 - Mar 05. I thought maybe I'd see more requests on the two days when the reboots occurred, but that wasn't the case. Reminder: nginx sees all traffic, while apache sees only the traffic that made it to the backend (past varnish)
- here's nginx; I can't explain the few requests from 2018
9643    100.00%  6208    100.00%  1.47 MiB   08/Mar/2024 |
39418   100.00%  16007   100.00%  5.98 MiB   07/Mar/2024 ||||||
56942   100.00%  19855   100.00%  2.00 GiB   05/Mar/2024 |||||||||
293795  100.00%  90522   100.00%  7.46 GiB   04/Mar/2024 |||||||||||||||||||||||||||||||||||||||||||||||||
322677  100.00%  98723   100.00%  7.30 GiB   03/Mar/2024 ||||||||||||||||||||||||||||||||||||||||||||||||||||||
393755  100.00%  116410  100.00%  13.33 GiB  02/Mar/2024 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
254643  64.67%   80963   69.55%   9.70 GiB   01/Mar/2024 |||||||||||||||||||||||||||||||||||||||||||
526     0.13%    139     0.12%    52.81 MiB  09/Feb/2018 |
- here's apache; again, lots of weird requests logged as 2017
1290    100.00%  2      100.00%  1.52 KiB    08/Mar/2024 |
6243    100.00%  8      100.00%  23.94 KiB   07/Mar/2024 ||
29892   100.00%  14565  100.00%  1.76 GiB    05/Mar/2024 |||||||||
93555   100.00%  43535  100.00%  5.91 GiB    04/Mar/2024 |||||||||||||||||||||||||||||
108797  100.00%  49958  100.00%  6.08 GiB    03/Mar/2024 ||||||||||||||||||||||||||||||||||
212145  100.00%  93738  100.00%  11.60 GiB   02/Mar/2024 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
134768  63.53%   63350  67.58%   8.50 GiB    01/Mar/2024 |||||||||||||||||||||||||||||||||||||||||||
1856    0.87%    9      0.01%    43.03 MiB   23/Nov/2017 |
547     0.26%    4      0.00%    490.01 KiB  22/Nov/2017 |
500     0.24%    6      0.01%    310.14 KiB  21/Nov/2017 |
606     0.29%    4      0.00%    1.62 MiB    20/Nov/2017 |
818     0.39%    4      0.00%    7.27 MiB    19/Nov/2017 |
519     0.24%    6      0.01%    2.74 MiB    18/Nov/2017 |
651     0.31%    5      0.01%    1.85 MiB    17/Nov/2017 |
696     0.33%    4      0.00%    2.35 MiB    16/Nov/2017 |
623     0.29%    5      0.01%    561.22 KiB  15/Nov/2017 |
706     0.33%    3      0.00%    2.15 MiB    14/Nov/2017 |
1378    0.65%    7      0.01%    32.53 MiB   13/Nov/2017 |
924     0.44%    4      0.00%    16.13 MiB   12/Nov/2017 |
866     0.41%    6      0.01%    9.95 MiB    11/Nov/2017 |
575     0.27%    3      0.00%    179.88 KiB  10/Nov/2017 |
618     0.29%    2      0.00%    90.79 KiB   09/Nov/2017 |
531     0.25%    3      0.00%    89.94 KiB   08/Nov/2017 |
763     0.36%    6      0.01%    334.93 KiB  07/Nov/2017 |
685     0.32%    3      0.00%    89.92 KiB   06/Nov/2017 |
509     0.24%    3      0.00%    89.96 KiB   05/Nov/2017 |
692     0.33%    6      0.01%    245.12 KiB  04/Nov/2017 |
- I checked the error logs of the phpList server, but I didn't find anything that stood out around Mar 03 05:38:12 and Mar 04 09:30:27 UTC
- I checked the error logs of the wiki server, and I did see some APC errors on Mar 03 05:38
[Sat Mar 02 17:05:35.753507 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  26. apc_store() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/APCBagOStuff.php:95
[Sat Mar 02 17:05:38.633099 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP Warning:  apc_store(): Unable to allocate memory for pool. in /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/APCBagOStuff.php on line 95
[Sat Mar 02 17:05:38.633115 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP Stack trace:
[Sat Mar 02 17:05:38.633121 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   1. {main}() /var/www/html/wiki.opensourceecology.org/htdocs/index.php:0
[Sat Mar 02 17:05:38.633124 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   2. MediaWiki->run() /var/www/html/wiki.opensourceecology.org/htdocs/index.php:43
[Sat Mar 02 17:05:38.633128 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   3. MediaWiki->main() /var/www/html/wiki.opensourceecology.org/htdocs/includes/MediaWiki.php:523
[Sat Mar 02 17:05:38.633131 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   4. MediaWiki->getAction() /var/www/html/wiki.opensourceecology.org/htdocs/includes/MediaWiki.php:768
[Sat Mar 02 17:05:38.633135 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   5. Action::getActionName() /var/www/html/wiki.opensourceecology.org/htdocs/includes/MediaWiki.php:155
[Sat Mar 02 17:05:38.633138 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   6. Action::factory() /var/www/html/wiki.opensourceecology.org/htdocs/includes/actions/Action.php:154
[Sat Mar 02 17:05:38.633141 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   7. WikiPage->getActionOverrides() /var/www/html/wiki.opensourceecology.org/htdocs/includes/actions/Action.php:96
[Sat Mar 02 17:05:38.633145 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   8. WikiPage->getContentHandler() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:219
[Sat Mar 02 17:05:38.633149 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP   9. WikiPage->getContentModel() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:232
[Sat Mar 02 17:05:38.633152 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  10. WANObjectCache->getWithSetCallback() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:528
[Sat Mar 02 17:05:38.633156 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  11. WANObjectCache->doGetWithSetCallback() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:914
[Sat Mar 02 17:05:38.633159 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  12. call_user_func_array:{/var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:1012}() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:1012
[Sat Mar 02 17:05:38.633163 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  13. WikiPage->{closure:/var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:517-527}() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:1012
[Sat Mar 02 17:05:38.633166 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  14. WikiPage->getRevision() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:518
[Sat Mar 02 17:05:38.633169 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  15. WikiPage->loadLastEdit() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:646
[Sat Mar 02 17:05:38.633172 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  16. Revision::newKnownCurrent() /var/www/html/wiki.opensourceecology.org/htdocs/includes/page/WikiPage.php:624
[Sat Mar 02 17:05:38.633175 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  17. WANObjectCache->getWithSetCallback() /var/www/html/wiki.opensourceecology.org/htdocs/includes/Revision.php:1958
[Sat Mar 02 17:05:38.633184 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  18. WANObjectCache->doGetWithSetCallback() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:914
[Sat Mar 02 17:05:38.633187 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  19. WANObjectCache->set() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:1038
[Sat Mar 02 17:05:38.633190 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  20. BagOStuff->merge() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/WANObjectCache.php:483
[Sat Mar 02 17:05:38.633193 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  21. BagOStuff->mergeViaLock() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/BagOStuff.php:277
[Sat Mar 02 17:05:38.633196 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  22. APCBagOStuff->set() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/BagOStuff.php:366
[Sat Mar 02 17:05:38.633199 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP  23. apc_store() /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/APCBagOStuff.php:95
[Sat Mar 02 17:05:38.635723 2024] [:error] [pid 24406] [client 127.0.0.1:42158] PHP Warning:  apc_store(): Unable to allocate memory for pool. in /var/www/html/wiki.opensourceecology.org/htdocs/includes/libs/objectcache/APCBagOStuff.php on line 95
- however, ^ that error is spammed throughout the logs constantly
- it does, however, appear to be worse on the days with the reboots. maybe
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240228.gz | wc -l
19050
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240229.gz | wc -l
74375
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240301.gz | wc -l
23898
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240302.gz | wc -l
241868
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240303.gz | wc -l
19607
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240304.gz | wc -l
46298
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240305.gz | wc -l
1374
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240306.gz | wc -l
0
[root@opensourceecology log]# zgrep 'apc_store(): Unable to allocate' httpd/wiki.opensourceecology.org/error_log-20240307.gz | wc -l
4157
[root@opensourceecology log]#
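The repeated per-day zgrep calls above can be folded into one loop. A sketch, run here against throwaway fixture logs so it works anywhere (on the server the directory would be /var/log/httpd/wiki.opensourceecology.org):

```shell
# fixture: two fake rotated error logs, one with the APC warning, one without
logdir=$(mktemp -d)
printf 'PHP Warning: apc_store(): Unable to allocate memory for pool.\n' \
  | gzip > "$logdir/error_log-20240302.gz"
printf 'something unrelated\n' | gzip > "$logdir/error_log-20240303.gz"

# count the APC allocation warnings per day
for day in 20240302 20240303; do
  n=$(gzip -dc "$logdir/error_log-$day.gz" | grep -c 'apc_store(): Unable to allocate' || true)
  echo "$day $n"
done
```

This prints one `<day> <count>` line per rotated log, which is easier to eyeball against the reboot dates than nine separate commands.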
- google says these errors can be eliminated by dropping the TTL of the APC cache to 0
[root@opensourceecology log]# grep -ir apc /etc/php* | grep ttl
/etc/php.d/apcu.ini:apc.ttl=7200
/etc/php.d/apcu.ini:apc.gc_ttl=3600
/etc/php.d/apcu.ini:apc.rfc1867_ttl=3600
/etc/php.d/apcu.ini.20180316_232434.bak:apc.ttl=7200
/etc/php.d/apcu.ini.20180316_232434.bak:apc.gc_ttl=3600
/etc/php.d/apcu.ini.20180316_232434.bak:apc.rfc1867_ttl=3600
/etc/php-zts.d/apcu.ini:apc.ttl=7200
/etc/php-zts.d/apcu.ini:apc.gc_ttl=3600
/etc/php-zts.d/apcu.ini:apc.rfc1867_ttl=3600
[root@opensourceecology log]#
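If we did decide to apply that advice, a sketch of the change (keeping a backup, like the existing *.bak files above) might look like this. It's demonstrated against a scratch copy so nothing real is touched; on the server the target would be /etc/php.d/apcu.ini (and its /etc/php-zts.d twin), and this is not being applied yet since it's unclear whether APC is the root cause:

```shell
# scratch copy standing in for /etc/php.d/apcu.ini
ini=$(mktemp)
printf 'apc.ttl=7200\napc.gc_ttl=3600\n' > "$ini"

cp -a "$ini" "$ini.bak"                     # keep a backup first
sed -i 's/^apc\.ttl=.*/apc.ttl=0/' "$ini"   # drop the cache TTL to 0
grep '^apc\.ttl=' "$ini"                    # confirm the new value
```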
- but, again, I have no idea if this is actually the root cause or a red herring
- this says that switching from mod_php/prefork to php-fpm could prevent apache from spawning so many workers when a php script misbehaves (but I want to find out *which* script causes this) https://serverfault.com/questions/823121/why-is-apache-spawning-so-many-processes
- it appears that we're using mod_php, not cgi or fpm (unfortunately I can't run phpinfo() as it's disabled and I don't want to edit the php config & restart the webserver) https://www.devguide.at/en/backend/mod_php-vs-fastcgi-vs-fpm/
[root@opensourceecology log]# grep -ir php /etc/httpd* | grep -i module
/etc/httpd/conf.modules.d/10-php.conf:# PHP is an HTML-embedded scripting language which attempts to make it
/etc/httpd/conf.modules.d/10-php.conf:    LoadModule php5_module modules/libphp5.so
/etc/httpd/conf.modules.d/10-php.conf:    LoadModule php5_module modules/libphp5-zts.so
[root@opensourceecology log]# grep -ir php /etc/httpd* | grep -i handler
/etc/httpd/conf.d/php.conf:AddHandler php5-script .php
/etc/httpd/conf.d/php.conf:php_value session.save_handler "files"
[root@opensourceecology log]# grep -ir fpm /etc/httpd*
[root@opensourceecology log]#
- mod_evasive says that it's blacklisted 127.0.0.1. Hopefully it's also on a whitelist?
[root@opensourceecology log]# apachectl status
* httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2024-03-04 21:29:24 UTC; 4 days ago
     Docs: man:httpd(8)
           man:apachectl(8)
  Process: 19584 ExecReload=/usr/sbin/httpd $OPTIONS -k graceful (code=exited, status=0/SUCCESS)
 Main PID: 1107 (httpd)
   Status: "Total requests: 0; Current requests/sec: 0; Current traffic: 0 B/sec"
   CGroup: /system.slice/httpd.service
           |- 1107 /usr/sbin/httpd -DFOREGROUND
           |-15605 /usr/sbin/httpd -DFOREGROUND
           |-15820 /usr/sbin/httpd -DFOREGROUND
           |-17510 /usr/sbin/httpd -DFOREGROUND
           |-17565 /usr/sbin/httpd -DFOREGROUND
           |-18685 /usr/sbin/httpd -DFOREGROUND
           |-18693 /usr/sbin/httpd -DFOREGROUND
           |-18694 /usr/sbin/httpd -DFOREGROUND
           |-18971 /usr/sbin/httpd -DFOREGROUND
           |-18992 /usr/sbin/httpd -DFOREGROUND
           `-18993 /usr/sbin/httpd -DFOREGROUND

Mar 04 21:29:24 opensourceecology.org systemd[1]: Started The Apache HTTP Server.
Mar 04 23:09:48 opensourceecology.org mod_evasive[9328]: Blacklisting address 127.0.0.1: possible DoS attack.
Mar 05 04:15:56 opensourceecology.org systemd[1]: Reloading The Apache HTTP Server.
Mar 05 04:15:56 opensourceecology.org systemd[1]: Reloaded The Apache HTTP Server.
Mar 06 05:52:06 opensourceecology.org systemd[1]: Reloading The Apache HTTP Server.
Mar 06 05:52:06 opensourceecology.org systemd[1]: Reloaded The Apache HTTP Server.
Mar 07 07:18:43 opensourceecology.org systemd[1]: Reloading The Apache HTTP Server.
Mar 07 07:18:44 opensourceecology.org systemd[1]: Reloaded The Apache HTTP Server.
Mar 08 04:09:05 opensourceecology.org systemd[1]: Reloading The Apache HTTP Server.
Mar 08 04:09:05 opensourceecology.org systemd[1]: Reloaded The Apache HTTP Server.
[root@opensourceecology log]#
- I couldn't find where the MaxRequestWorkers setting is configured.
- I'm not sure if that means it's some default or infinity. Currently we have fewer than 20 workers running. The delta on the munin chart from ~460 to ~850 tells me that apache spawned ~390 child workers before the system crashed
- maybe we should just try updating the apache config to a max of 100 workers and see what that does. If apache crashes but mysql does not, I'd consider this progress. Then we just need to isolate the issue: perhaps with php profiling and/or switching from mod_php to php-fpm
[root@opensourceecology log]# apachectl -M | grep mpm
 mpm_prefork_module (shared)
[root@opensourceecology log]#
[root@opensourceecology log]# apachectl -t -D DUMP_RUN_CFG
Passing arguments to httpd using apachectl is no longer supported.
You can only start/stop/restart httpd using this script.
If you want to pass extra arguments to httpd, edit the /etc/sysconfig/httpd config file.
ServerRoot: "/etc/httpd"
Main DocumentRoot: "/etc/httpd/htdocs"
Main ErrorLog: "/etc/httpd/logs/error_log"
Mutex ssl-stapling: using_defaults
Mutex proxy: using_defaults
Mutex authn-socache: using_defaults
Mutex ssl-cache: using_defaults
Mutex default: dir="/run/httpd/" mechanism=default
Mutex mpm-accept: using_defaults
Mutex authdigest-opaque: using_defaults
Mutex proxy-balancer-shm: using_defaults
Mutex rewrite-map: using_defaults
Mutex authdigest-client: using_defaults
PidFile: "/run/httpd/httpd.pid"
Define: _RH_HAS_HTTPPROTOCOLOPTIONS
Define: DUMP_RUN_CFG
Define: MODSEC_2.5
Define: MODSEC_2.9
User: name="apache" id=48
Group: name="apache" id=48
[root@opensourceecology log]#
- I created a new apache config in /etc/
[root@opensourceecology conf.d]# pwd
/etc/httpd/conf.d
[root@opensourceecology conf.d]#
[root@opensourceecology conf.d]# ls
000-www.opensourceecology.org.conf          mod_evasive.conf
00-fef.opensourceecology.org.conf           mod_security.conf
00-forum.opensourceecology.org.conf         mod_security.wordpress.include
00-microfactory.opensourceecology.org.conf  munin.20180304.bak
00-oswh.opensourceecology.org.conf          munin.opensourceecology.org.conf
00-phplist.opensourceecology.org.conf       php.conf
00-seedhome.openbuildinginstitute.org.conf  README
00-store.opensourceecology.org.conf         ssl.conf.20200504_105511.disabled
00-wiki.opensourceecology.org.conf          ssl.conf.disabled
00-www.openbuildinginstitute.org.conf       ssl.conf.orig
autoindex.conf.20200504_105511.disabled     ssl.openbuildinginstitute.org
awstats.conf                                ssl.opensourceecology.org
awstats.conf.20200504_105511.disabled       staging.openbuildinginstitute.org.conf.bak
awstats.openbuildinginstitute.org.conf      staging.opensourceecology.org.conf
awstats.opensourceecology.org.conf          userdir.conf.20200504_105511.disabled
cacti.conf                                  varnishTest.conf.disabled
certbot.conf                                welcome.conf.20200504_105511.disabled
[root@opensourceecology conf.d]#
[root@opensourceecology conf.d]# vim mpm_prefork.conf
[root@opensourceecology conf.d]#
[root@opensourceecology conf.d]# cat mpm_prefork.conf
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      100
    MaxConnectionsPerChild   0
</IfModule>
[root@opensourceecology conf.d]#
- and I reloaded the apache config
[root@opensourceecology conf.d]# apachectl -t
Syntax OK
[root@opensourceecology conf.d]# systemctl reload httpd
[root@opensourceecology conf.d]#
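Since it's not trivial to confirm the new cap took effect, one rough ongoing check is to compare the number of running prefork children against the configured limit. A sketch (the `limit` value mirrors the mpm_prefork.conf above; the pgrep pattern matches the `-DFOREGROUND` children seen in the apachectl status output, and prints 0 where no httpd is running):

```shell
limit=100                                             # value set in mpm_prefork.conf
workers=$(pgrep -c -f 'httpd -DFOREGROUND' || true)   # count prefork children
workers=${workers:-0}                                 # default to 0 if pgrep found nothing
echo "workers=$workers limit=$limit headroom=$((limit - workers))"
```

If `headroom` ever approaches 0, apache is again spawning workers up to the cap and the underlying misbehaving script is still there.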
- I'm not sure how to proceed. I have a lot of ideas, but I need input from Marcin. I sent him this email
Hey Marcin,

Do you have a budget for fixing this?

I spent a few more hours today investigating the issue. It does appear that your backend apache web server is suddenly spawning child workers like crazy, but it's not clear exactly what is causing this (though I do have some guesses after reviewing your logs).

For now I've just limited the maximum number of workers to 100 (you're running fine with about 10 now, and it spawned ~400 in the two crashes last week). But it's not trivial to see if that change applied, and I'm not sure it will prevent your server from crashing again.

For more detailed information of my investigations today, see my log:

 * https://wiki.opensourceecology.org/wiki/Maltfield_Log

If I had days or weeks to investigate, I could definitely get to the bottom of it, but I don't know what is your budget on me fixing this. Here's some things I could do:

1. Upgrade all the packages on the system, including apache. This would take at-best a few hours and at-worst a few days.
2. Upgrade phpList and the wiki, which are the two sites getting the most traffic. This would take at-best half a day or at-worst a week (you're very far behind in versions, and that doesn't always mean smooth upgrades).
3. Run some PHP profiling with XHProf to determine exactly which websites & PHP scripts are consuming resources and causing apache to spawn so many children. This would take minimum a couple days, maybe a week.
4. Switch from mod_php to php-fpm, which should help PHP scale better without needing to spawn so many apache child processes. This would take a day or two.
5. Write some hacky script that runs every 5 minutes and checks to see if apache has >99 child workers. If it does, just restart apache. This would take half a day to 1 day to write & test.

Please let me know how much time you'd like me to invest into fixing this, and how you'd like me to proceed.

Thank you,

Michael Altfield
Senior Technology Advisor
PGP Fingerprint: 8A4B 0AF8 162F 3B6A 79B7 70D2 AA3E DF71 60E2 D97B

Open Source Ecology
www.opensourceecology.org
Fri March 01, 2024
- I checked my email and only got 1-3 ossec alerts per hour overnight (as opposed to 12 every hour)
- currently we get an error every hour complaining about sshd trying to bind to an occupied port; not sure if it's related
OSSEC HIDS Notification.
2024 Mar 01 17:13:36

Received From: opensourceecology->/var/log/secure
Rule: 1002 fired (level 2) -> "Unknown problem somewhere in the system."
Portion of the log(s):

Mar 1 17:13:36 opensourceecology sshd[22147]: error: bind: Address already in use

--END OF NOTIFICATION

OSSEC HIDS Notification.
2024 Mar 01 17:13:36

Received From: opensourceecology->/var/log/secure
Rule: 1002 fired (level 2) -> "Unknown problem somewhere in the system."
Portion of the log(s):

Mar 1 17:13:36 opensourceecology sshd[22147]: error: channel_setup_fwd_listener_tcpip: cannot listen to port: 54261

--END OF NOTIFICATION
- I couldn't find anything on the wiki when searching for port 54261
- anyway let's just wait until 2024-03-08. If the server runs fine for a week, then I'll consider this resolved.
- I just need to keep checking munin; the indication of the same error is a massive spike of apache rss usage corresponding to a gap (drop to zero) in mysql queries
Thu February 29, 2024
- Marcin alerted me on Feb 24 that the wiki was down
- I did a quick test and the frontpage loaded fine. The status page also didn't indicate any downtime
- however, when I tried to do a search on the wiki, it stalled for a while and eventually I got a varnish error saying it couldn't reach the backend
- I tried to check munin, but I got the same varnish error
- I told Marcin that he could try to just give it a reboot from the hetzner WUI, which he did
- after it came back, I checked Munin and saw that the db was down. I also saw a huge spike in apache memory usage
- Over the next few days, I noticed that OSSEC has been spamming me with emails once every 5 minutes
OSSEC HIDS Notification.
2024 Feb 27 23:12:07

Received From: opensourceecology->/var/log/messages
Rule: 1002 fired (level 2) -> "Unknown problem somewhere in the system."
Portion of the log(s):

Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483

--END OF NOTIFICATION
- I'm not certain, but the email might be caused by some OpenVPN cron job used to connect the prod & staging servers. It's possible that its certificate expired and that the expiration is causing some retry loop that's bogging down the server somehow
- Every time Marcin reboots the server, I get an email alert from hetzner
- I noticed that Marcin rebooted it again today (Feb 29) at 10:03 (EST). I sent him an email and he confirmed this
- I also see he did another reboot today at 17:58
- I ssh'd into the server and checked /var/log/messages. Here's one of the snippets
Mar 1 03:38:24 opensourceecology connect.sh: TCP/UDP: Preserving recently used remote address: [AF_INET]195.201.233.113:1194
Mar 1 03:38:24 opensourceecology connect.sh: Socket Buffers: R=[212992->212992] S=[212992->212992]
Mar 1 03:38:24 opensourceecology connect.sh: UDP link local: (not bound)
Mar 1 03:38:24 opensourceecology connect.sh: UDP link remote: [AF_INET]195.201.233.113:1194
Mar 1 03:38:24 opensourceecology connect.sh: TLS: Initial packet from [AF_INET]195.201.233.113:1194, sid=852b79d5 cd86dfc4
Mar 1 03:38:24 opensourceecology connect.sh: VERIFY OK: depth=1, CN=osedev1
Mar 1 03:38:24 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483
Mar 1 03:38:24 opensourceecology connect.sh: OpenSSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed
Mar 1 03:38:24 opensourceecology connect.sh: TLS_ERROR: BIO read tls_read_plaintext error
Mar 1 03:38:24 opensourceecology connect.sh: TLS Error: TLS object -> incoming plaintext read error
Mar 1 03:38:24 opensourceecology connect.sh: TLS Error: TLS handshake failed
Mar 1 03:38:24 opensourceecology connect.sh: SIGUSR1[soft,tls-error] received, process restarting
Mar 1 03:38:24 opensourceecology connect.sh: Restart pause, 300 second(s)
- indeed, the line above what was emailed from OSSEC says osedev, so this appears to be related to the staging/dev setup
- running `systemctl` shows that the openvpn-client.service unit is in the failed state
[root@opensourceecology cron.d]# systemctl
...
● openvpn-client.service        loaded failed failed openvpn-client.service
- I found the config in /etc/systemd/system
[root@opensourceecology ~]# cd /etc/systemd/system
[root@opensourceecology system]# ls
basic.target.wants    getty.target.wants       nginx.service.d         sysinit.target.wants
default.target        local-fs.target.wants    openvpn-client.service  system-update.target.wants
default.target.wants  multi-user.target.wants  sockets.target.wants
[root@opensourceecology system]# cat openvpn-client.service
[Unit]
Description=OpenVPN tunnel for %I
After=syslog.target network-online.target
Wants=network-online.target
Documentation=man:openvpn(8)
Documentation=https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
Documentation=https://community.openvpn.net/openvpn/wiki/HOWTO

[Service]
User=root
Type=notify
PrivateTmp=true
WorkingDirectory=/etc/openvpn/client
#WorkingDirectory=/root/openvpn
#ExecStart=/usr/sbin/openvpn --suppress-timestamps --nobind --config client.conf
ExecStart=/etc/openvpn/client/connect.sh
CapabilityBoundingSet=CAP_IPC_LOCK CAP_NET_ADMIN CAP_NET_RAW CAP_SETGID CAP_SETUID CAP_SYS_CHROOT CAP_DAC_OVERRIDE
#LimitNPROC=10
LimitNPROC=infinity
DeviceAllow=/dev/null rw
DeviceAllow=/dev/net/tun rw
ProtectSystem=true
ProtectHome=true
KillMode=process

[Install]
WantedBy=multi-user.target
[root@opensourceecology system]#
- from above, it looks like the path to the connect.sh script is /etc/openvpn/client/connect.sh
[root@opensourceecology system]# cd /etc/openvpn/client/
[root@opensourceecology client]# ls
auth.txt  ca.crt  client.conf  connect.sh  hetzner2.crt  hetzner2.key  ta.key
[root@opensourceecology client]# cat connect.sh
#!/bin/bash

# yes, storing 2fa secret keys here doesn't add security; we're doing it only
# because we can't exclude 2fa on a client-by-client basis in the openvpn
# free server. 2fa is important for humans with bad passwords, not so much for
# our server accounts. If someone can read this file, we've bigger problems.
TOTP_SECRET=OBFUSCATED
USERNAME=OBFUSCATED

token=`oathtool --base32 --totp ${TOTP_SECRET}`
echo -e "${USERNAME}\n${token}" > auth.txt

/usr/sbin/openvpn --suppress-timestamps --nobind --config client.conf
[root@opensourceecology client]#
- If I manually run the openvpn command from the above script, I get the error
[root@opensourceecology client]# /usr/sbin/openvpn --suppress-timestamps --nobind --config client.conf
OpenVPN 2.4.12 x86_64-redhat-linux-gnu [Fedora EPEL patched] [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Mar 17 2022
library versions: OpenSSL 1.0.2k-fips 26 Jan 2017, LZO 2.06
WARNING: Your certificate has expired!
Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
TCP/UDP: Preserving recently used remote address: [AF_INET]195.201.233.113:1194
Socket Buffers: R=[212992->212992] S=[212992->212992]
UDP link local: (not bound)
UDP link remote: [AF_INET]195.201.233.113:1194
TLS: Initial packet from [AF_INET]195.201.233.113:1194, sid=11bab8ae 02ebcb41
VERIFY OK: depth=1, CN=osedev1
VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483
OpenSSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed
TLS_ERROR: BIO read tls_read_plaintext error
TLS Error: TLS object -> incoming plaintext read error
TLS Error: TLS handshake failed
SIGUSR1[soft,tls-error] received, process restarting
Restart pause, 5 second(s)
^CSIGINT[hard,init_instance] received, process exiting
[root@opensourceecology client]#
- that's enlightening, but I'm not sure I understand why this would crash the system
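- a quick way to read a cert's expiry straight from the file is `openssl x509 -enddate` / `-checkend`. This is a sketch against a throwaway self-signed cert so it's self-contained; on the server you'd point `-in` at the cert files under /etc/openvpn/client/ instead

```shell
# generate a throwaway cert so the example is self-contained; on the real host,
# point -in at e.g. ca.crt or hetzner2.crt from /etc/openvpn/client/ instead
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/throwaway.key \
  -out /tmp/throwaway.crt -days 1 -subj "/CN=server" 2>/dev/null

# print the expiry date, then ask "does it expire within 0 seconds?"
openssl x509 -in /tmp/throwaway.crt -noout -enddate
openssl x509 -in /tmp/throwaway.crt -noout -checkend 0 \
  && echo "still valid" || echo "expired"
```

`-checkend 0` exits non-zero once the cert is past its notAfter date, which makes it handy in cron for catching this kind of silent expiry before it loops a client.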
- I checked munin
- again, the db queries suddenly drop to 0 in 3 places, and we can clearly see when the server is broken
- corresponding with all 3 gaps in the db chart, there's a sawtooth vertical climb in httpd rss usage. interesting.
- there's also a huge vertical spike in number of processes at these 3 times
- uptime was 1,400 days before we rebooted :whistles:
- but that's the point: this was a very well-oiled machine. why is running a "connect" script over-and-over bringing the whole thing to its knees? could that really be it?
- there is a huge spike in the postfix mail queue. so it's possible that openvpn is triggering ossec, which is triggering postfix, and that's what's causing issues
- honestly, nobody is using osedev or osestaging. I could renew the certs, but I'm not sure that's warranted atm
- instead, I'll do two things: [1] I'll add an ossec rule not to email alerts for this and [2] I'll disable the openvpn-client unit in systemd
- first we test the current behaviour of ossec with this log entry
[root@opensourceecology ~]# /var/ossec/bin/ossec-logtest
2024/03/01 04:58:05 ossec-testrule: INFO: Reading local decoder file.
2024/03/01 04:58:05 ossec-testrule: INFO: Started (pid: 12113).
ossec-testrule: Type one log per line.

Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483


**Phase 1: Completed pre-decoding.
       full event: 'Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'
       hostname: 'opensourceecology'
       program_name: 'connect.sh'
       log: 'VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'

**Phase 2: Completed decoding.
       No decoder matched.

**Phase 3: Completed filtering (rules).
       Rule id: '1002'
       Level: '2'
       Description: 'Unknown problem somewhere in the system.'
**Alert to be generated.
- next we make a backup of the local rules file
[root@opensourceecology client]# cd /var/ossec/rules/
[root@opensourceecology rules]# ls
apache_rules.xml     hordeimp_rules.xml     nginx_rules.xml      rules_config.xml       trend-osce_rules.xml
apparmor_rules.xml   ids_rules.xml          openbsd_rules.xml    sendmail_rules.xml     unbound_rules.xml
arpwatch_rules.xml   imapd_rules.xml        opensmtpd_rules.xml  smbd_rules.xml         vmpop3d_rules.xml
asterisk_rules.xml   local_rules.xml        ossec_rules.xml      solaris_bsm_rules.xml  vmware_rules.xml
attack_rules.xml     mailscanner_rules.xml  pam_rules.xml        sonicwall_rules.xml    vpn_concentrator_rules.xml
cimserver_rules.xml  mcafee_av_rules.xml    php_rules.xml        spamd_rules.xml        vpopmail_rules.xml
cisco-ios_rules.xml  msauth_rules.xml       pix_rules.xml        squid_rules.xml        vsftpd_rules.xml
clam_av_rules.xml    ms_dhcp_rules.xml      policy_rules.xml     sshd_rules.xml         web_appsec_rules.xml
courier_rules.xml    ms-exchange_rules.xml  postfix_rules.xml    symantec-av_rules.xml  web_rules.xml
dovecot_rules.xml    ms_ftpd_rules.xml      postgresql_rules.xml symantec-ws_rules.xml  wordpress_rules.xml
dropbear_rules.xml   ms-se_rules.xml        proftpd_rules.xml    syslog_rules.xml       zeus_rules.xml
firewalld_rules.xml  mysql_rules.xml        pure-ftpd_rules.xml  sysmon_rules.xml
firewall_rules.xml   named_rules.xml        racoon_rules.xml     systemd_rules.xml
ftpd_rules.xml       netscreenfw_rules.xml  roundcube_rules.xml  telnetd_rules.xml
[root@opensourceecology rules]# cp local_rules.xml local_rules.xml.20240229
- I added a new rule to catch these messages and stop emailing about them
[root@opensourceecology rules]# diff local_rules.xml.20240229 local_rules.xml
93a94,99
> <rule id="100057" level="2">
> <if_sid>1002</if_sid>
> <match>error=certificate has expired</match>
> <options>no_email_alert</options>
> <description>2024-02-29: Ignore OpenVPN cert errors that spam us every 5 minutes and crash the server</description>
> </rule>
[root@opensourceecology rules]#
- I restarted ossec and confirmed the log now matches the new rule. For some reason it still says "Alert to be generated", which I don't understand; rule 100057 clearly has "no_email_alert" set
[root@opensourceecology ~]# systemctl restart ossec
[root@opensourceecology ~]# /var/ossec/bin/ossec-logtest
2024/03/01 05:07:33 ossec-testrule: INFO: Reading local decoder file.
2024/03/01 05:07:33 ossec-testrule: INFO: Started (pid: 15844).
ossec-testrule: Type one log per line.

Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483


**Phase 1: Completed pre-decoding.
       full event: 'Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'
       hostname: 'opensourceecology'
       program_name: 'connect.sh'
       log: 'VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'

**Phase 2: Completed decoding.
       No decoder matched.

**Phase 3: Completed filtering (rules).
       Rule id: '100057'
       Level: '2'
       Description: '2024-02-29: Ignore OpenVPN cert errors that spam us every 5 minutes and crash the server'
**Alert to be generated.
- ah crap, I just realized that there's tons of different lines that are triggering these emails
Feb 29 08:18:02 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=1780983422210014075884459813273922948
Feb 29 08:18:02 opensourceecology connect.sh: OpenSSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed
Feb 29 08:18:02 opensourceecology connect.sh: TLS_ERROR: BIO read tls_read_plaintext error
Feb 29 08:18:02 opensourceecology connect.sh: TLS Error: TLS object -> incoming plaintext read error
Feb 29 08:18:02 opensourceecology connect.sh: TLS Error: TLS handshake failed
Feb 29 08:18:02 opensourceecology connect.sh: SIGUSR1[soft,tls-error] received, process restarting
- well, the one thing all these lines have in common is that they contain "opensourceecology connect.sh: ", so I'll match on that instead
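- a quick way to confirm one pattern really catches every variant is to grep a scratch copy of the offending lines and check the match count equals the line count (sample file path is made up for the sketch)

```shell
# three of the message variants, saved to a scratch file
cat > /tmp/connect-sh-sample.log <<'EOF'
Feb 29 08:18:02 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server
Feb 29 08:18:02 opensourceecology connect.sh: TLS Error: TLS handshake failed
Feb 29 08:18:02 opensourceecology connect.sh: SIGUSR1[soft,tls-error] received, process restarting
EOF

# the candidate pattern should match every line; here the count should be 3
grep -c 'opensourceecology connect.sh: ' /tmp/connect-sh-sample.log
```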
[root@opensourceecology rules]# diff local_rules.xml.20240229 local_rules.xml
93a94,99
> <rule id="100057" level="2">
> <if_sid>1002</if_sid>
> <match>opensourceecology connect.sh: </match>
> <options>no_email_alert</options>
> <description>2024-02-29: Ignore OpenVPN cert errors that spam us every 5 minutes and crash the server</description>
> </rule>
[root@opensourceecology rules]#
- and a test
[root@opensourceecology ~]# systemctl restart ossec
[root@opensourceecology ~]# /var/ossec/bin/ossec-logtest
2024/03/01 05:14:57 ossec-testrule: INFO: Reading local decoder file.
2024/03/01 05:14:57 ossec-testrule: INFO: Started (pid: 17861).
ossec-testrule: Type one log per line.

Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483


**Phase 1: Completed pre-decoding.
       full event: 'Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'
       hostname: 'opensourceecology'
       program_name: 'connect.sh'
       log: 'VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'

**Phase 2: Completed decoding.
       No decoder matched.

**Phase 3: Completed filtering (rules).
       Rule id: '1002'
       Level: '2'
       Description: 'Unknown problem somewhere in the system.'
**Alert to be generated.
- shit, that didn't work. probably regex syntax on one of the special characters
- ah, the decoder extracted the two fields as "hostname" and "program_name", and it appears that <match> only applies to the "log" portion of the message. this works
[root@opensourceecology rules]# diff local_rules.xml.20240229 local_rules.xml
93a94,100
> <rule id="100057" level="2">
> <if_sid>1002</if_sid>
> <hostname>opensourceecology</hostname>
> <program_name>connect.sh</program_name>
> <options>no_email_alert</options>
> <description>2024-02-29: Ignore OpenVPN cert errors that spam us every 5 minutes and crash the server</description>
> </rule>
[root@opensourceecology rules]#
- but for some reason it still says "Alert to be generated" (presumably because no_email_alert only suppresses the email; the alert itself is still written to alerts.log)
[root@opensourceecology ~]# systemctl restart ossec
[root@opensourceecology ~]# /var/ossec/bin/ossec-logtest
2024/03/01 05:24:57 ossec-testrule: INFO: Reading local decoder file.
2024/03/01 05:24:57 ossec-testrule: INFO: Started (pid: 22484).
ossec-testrule: Type one log per line.

Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483


**Phase 1: Completed pre-decoding.
       full event: 'Feb 27 23:12:07 opensourceecology connect.sh: VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'
       hostname: 'opensourceecology'
       program_name: 'connect.sh'
       log: 'VERIFY ERROR: depth=0, error=certificate has expired: CN=server, serial=17809834222100140758844598132739229483'

**Phase 2: Completed decoding.
       No decoder matched.

**Phase 3: Completed filtering (rules).
       Rule id: '100057'
       Level: '2'
       Description: '2024-02-29: Ignore OpenVPN cert errors that spam us every 5 minutes and crash the server'
**Alert to be generated.
- I also went ahead and stopped & disabled the openvpn-client service in systemd
[root@opensourceecology rules]# systemctl status openvpn-client
● openvpn-client.service
   Loaded: loaded (/etc/systemd/system/openvpn-client.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Thu 2024-02-29 22:34:17 UTC; 6h ago
     Docs: man:openvpn(8)
           https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
           https://community.openvpn.net/openvpn/wiki/HOWTO
  Process: 1110 ExecStart=/etc/openvpn/client/connect.sh (code=killed, signal=TERM)
 Main PID: 1110 (code=killed, signal=TERM)
   CGroup: /system.slice/openvpn-client.service
           └─1171 /usr/sbin/openvpn --suppress-timestamps --nobind --config client.conf

Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: UDP link remote: [AF_INET]195.201.233.113:1194
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS: Initial packet from [AF_INET]195.201.233.113:1194...453
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: VERIFY OK: depth=1, CN=osedev1
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: VERIFY ERROR: depth=0, error=certificate has expired: ...483
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: OpenSSL: error:14090086:SSL routines:ssl3_get_server_c...led
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS_ERROR: BIO read tls_read_plaintext error
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS Error: TLS object -> incoming plaintext read error
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS Error: TLS handshake failed
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: SIGUSR1[soft,tls-error] received, process restarting
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: Restart pause, 300 second(s)
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology rules]# systemctl stop openvpn-client
[root@opensourceecology rules]# systemctl disable openvpn-client
Removed symlink /etc/systemd/system/multi-user.target.wants/openvpn-client.service.
[root@opensourceecology rules]# systemctl status openvpn-client
● openvpn-client.service
   Loaded: loaded (/etc/systemd/system/openvpn-client.service; disabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Thu 2024-02-29 22:34:17 UTC; 6h ago
     Docs: man:openvpn(8)
           https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
           https://community.openvpn.net/openvpn/wiki/HOWTO
 Main PID: 1110 (code=killed, signal=TERM)
   CGroup: /system.slice/openvpn-client.service
           └─1171 /usr/sbin/openvpn --suppress-timestamps --nobind --config client.conf

Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS: Initial packet from [AF_INET]195.201.233.113:1194...453
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: VERIFY OK: depth=1, CN=osedev1
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: VERIFY ERROR: depth=0, error=certificate has expired: ...483
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: OpenSSL: error:14090086:SSL routines:ssl3_get_server_c...led
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS_ERROR: BIO read tls_read_plaintext error
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS Error: TLS object -> incoming plaintext read error
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: TLS Error: TLS handshake failed
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: SIGUSR1[soft,tls-error] received, process restarting
Mar 01 05:28:26 opensourceecology.org connect.sh[1110]: Restart pause, 300 second(s)
Mar 01 05:29:33 opensourceecology.org systemd[1]: [/etc/systemd/system/openvpn-client.service:2] Failed to re...cess
Hint: Some lines were ellipsized, use -l to show in full.
[root@opensourceecology rules]#
- I sent an email to Marcin, and I'll let that sit overnight and see what the postfix charts look like in the morning
I logged into the server and disabled the openvpn-client service (which was responsible for establishing the connection between your osedev, osestaging, and oseprod servers) and added a rule to your ossec config to stop it from sending an email every time this connect.sh script has issues. More details here:

 * https://wiki.opensourceecology.org/wiki/Maltfield_Log/2024

I'll check it in the morning and if the server doesn't have issues for a week, I guess we can call it fixed.

PS: Your server had been running without a reboot for ~1,400 days before you rebooted it. That's a well-oiled machine.


Cheers,

Michael Altfield
Senior Technology Advisor
PGP Fingerprint: 8A4B 0AF8 162F 3B6A 79B7  70D2 AA3E DF71 60E2 D97B

Open Source Ecology
www.opensourceecology.org
Sat January 20, 2024
- Marcin emailed me yesterday (2024-01-19) asking about cert issues with www.opensourceecology.org
- I confirmed that wiki.opensourceecology.org is fine, but www.opensourceecology.org displays a cert issue in firefox
- checking the cert details in firefox, I see that the following SANs are available. Note that the 'www' subdomain is absent (the naked domain is present)
awstats.opensourceecology.org, fef.opensourceecology.org, forum.opensourceecology.org, microfactory.opensourceecology.org, munin.opensourceecology.org, opensourceecology.org, oswh.opensourceecology.org, phplist.opensourceecology.org, staging.opensourceecology.org, store.opensourceecology.org, wiki.opensourceecology.org
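- the same SAN list can be dumped from the command line with openssl instead of firefox's cert viewer. A self-contained sketch against a throwaway cert (the -addext/-ext flags assume openssl 1.1.1+, so this is for a workstation, not the CentOS 7 host itself)

```shell
# make a throwaway cert carrying two SANs, then read them back out;
# /tmp paths and example.org names are stand-ins for the sketch
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/san-test.key \
  -out /tmp/san-test.crt -days 1 -subj "/CN=example.org" \
  -addext "subjectAltName=DNS:example.org,DNS:www.example.org" 2>/dev/null

# prints the X509v3 Subject Alternative Name extension
openssl x509 -in /tmp/san-test.crt -noout -ext subjectAltName

# against the live site, the cert can be fetched and piped through the same check:
#   echo | openssl s_client -connect www.opensourceecology.org:443 \
#     -servername www.opensourceecology.org 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
```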
- looks like there's a cron named 'letsencrypt'
[root@opensourceecology cron.d]# ls
0hourly  awstats_generate_static_files  backup_to_backblaze  cacti  letsencrypt  munin  phplist  raid-check  sysstat
[root@opensourceecology cron.d]#
- the log file is /var/log/letsEncryptRenew.log
# once a month, update our letsencrypt cert
20 4 13 * * root /root/bin/letsencrypt/renew.sh &>> /var/log/letsEncryptRenew.log
[root@opensourceecology cron.d]#
- the very bottom of the log file contains this message
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: www.opensourceecology.org
   Type:   dns
   Detail: DNS problem: networking error looking up CAA for www.opensourceecology.org
Redirecting to /bin/systemctl reload nginx.service
- I attempted a run now
[root@opensourceecology cron.d]# certbot renew
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert not yet due for renewal

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert not yet due for renewal

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The following certs are not due for renewal yet:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem expires on 2024-04-12 (skipped)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem expires on 2024-04-12 (skipped)
No renewals were attempted.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology cron.d]#
- checking the existing cert info, it shows the naked domain in the list, but somehow the 'www' subdomain is missing
[root@opensourceecology cron.d]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-12 03:24:05+00:00 (VALID: 82 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Domains: opensourceecology.org awstats.opensourceecology.org fef.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org
    Expiry Date: 2024-04-12 03:24:18+00:00 (VALID: 82 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology cron.d]
- I see the 'www' in the config file
[root@opensourceecology cron.d]# cd /etc/letsencrypt
[root@opensourceecology cron.d]#
[root@opensourceecology letsencrypt]# grep -irl 'wiki.opensourceecology.org' *
renewal/opensourceecology.org.conf
[root@opensourceecology letsencrypt]# cat renewal/opensourceecology.org.conf
# renew_before_expiry = 30 days
version = 1.3.0
archive_dir = /etc/letsencrypt/archive/opensourceecology.org
cert = /etc/letsencrypt/live/opensourceecology.org/cert.pem
privkey = /etc/letsencrypt/live/opensourceecology.org/privkey.pem
chain = /etc/letsencrypt/live/opensourceecology.org/chain.pem
fullchain = /etc/letsencrypt/live/opensourceecology.org/fullchain.pem

# Options used in the renewal process
[renewalparams]
authenticator = webroot
account = OBFUSCATED
server = OBFUSCATED
allow_subset_of_names = True
[[webroot_map]]
store.opensourceecology.org = /var/www/html/store.opensourceecology.org/htdocs
phplist.opensourceecology.org = /var/www/html/phplist.opensourceecology.org/public_html
www.opensourceecology.org = /var/www/html/www.opensourceecology.org/htdocs
munin.opensourceecology.org = /var/www/html/certbot/htdocs
microfactory.opensourceecology.org = /var/www/html/microfactory.opensourceecology.org/htdocs
opensourceecology.org = /var/www/html/www.opensourceecology.org/htdocs
wiki.opensourceecology.org = /var/www/html/wiki.opensourceecology.org/htdocs
fef.opensourceecology.org = /var/www/html/fef.opensourceecology.org/htdocs
awstats.opensourceecology.org = /var/www/html/certbot/htdocs
forum.opensourceecology.org = /var/www/html/forum.opensourceecology.org/htdocs
staging.opensourceecology.org = /var/www/html/staging.opensourceecology.org/htdocs
oswh.opensourceecology.org = /var/www/html/oswh.opensourceecology.org/htdocs
[root@opensourceecology letsencrypt]#
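- for reference, the webroot_map entries tell certbot where to drop http-01 challenge tokens. A self-contained sketch of what that authenticator does (a scratch directory stands in for the real webroot; the commented curl line is what I'd run from an outside host against the real path)

```shell
# stand-in webroot so the sketch is runnable anywhere; on the server, set this
# to the path mapped for www.opensourceecology.org in the renewal config
webroot="$(mktemp -d)"

# certbot's webroot authenticator writes the challenge token here:
mkdir -p "${webroot}/.well-known/acme-challenge"
echo "test-token-contents" > "${webroot}/.well-known/acme-challenge/test-token"
cat "${webroot}/.well-known/acme-challenge/test-token"

# on the real host, verify end-to-end from outside:
#   curl http://www.opensourceecology.org/.well-known/acme-challenge/test-token
```

note this wouldn't have caught the failure above anyway, since the reported error was on Let's Encrypt's side (a DNS/CAA lookup), not in the webroot.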
- I triggered a manual force-renewal
[root@opensourceecology letsencrypt]# certbot renew --force-renewal
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Congratulations, all renewals succeeded. The following certs have been renewed:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem (success)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- still the 'www' subdomain is absent
[root@opensourceecology letsencrypt]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-20 00:47:41+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Domains: opensourceecology.org awstats.opensourceecology.org fef.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org
    Expiry Date: 2024-04-20 00:47:47+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- I reloaded the site in the browser; the issue persists
- I checked the cert; it says it was issued today (actually 2024-01-21, which is "today" in UTC but "tomorrow" in my timezone). Anyway, that shows that the cert did, in fact, renew. But it's still missing the subdomain that it is *supposed* to fetch as a SAN, according to the config above. Huh.
- I sent an email to Marcin asking if any changes were made recently
- I checked crt.sh (the cert observatory) to see the history of our certs for the domain opensourceecology.org https://crt.sh/?q=opensourceecology.org
- It looks like we've been using the 'www.opensourceecology.org' SAN in our cert since 2018-03-01 (prior to that it was a cloudflare wildcard), and it suddenly disappeared in the 2024-01-13 renewal. It was present in the one before that, issued 2023-11-13.
- therefore, this issue occurred sometime between 2023-11-13 and 2024-01-13.
- unfortunately, the letsencrypt logs are not timestamped
- the only place a year appears is in the certs' expiry dates
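- one workaround for the missing timestamps: the filesystem keeps them. The mtime of the log (and of any rotated copies) bounds when the last entry was written. A self-contained sketch with a scratch file; on the server the same stat would be pointed at /var/log/letsEncryptRenew.log and /var/log/letsencrypt/letsencrypt.log*

```shell
# scratch log standing in for /var/log/letsEncryptRenew.log
log="$(mktemp)"
echo "Congratulations, all renewals succeeded." >> "$log"

# the file's mtime is when that last line was appended (GNU stat syntax)
stat -c '%y  %n' "$log"
```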
# here's the log from the last renewal (that worked; note the line that lists www.opensourceecology.org)
<pre>
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert is due for renewal, auto-renewing...
Non-interactive renewal: random delay of 118.029492405 seconds
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for awstats.openbuildinginstitute.org
http-01 challenge for openbuildinginstitute.org
http-01 challenge for seedhome.openbuildinginstitute.org
http-01 challenge for www.openbuildinginstitute.org
Waiting for verification...
Cleaning up challenges
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert is due for renewal, auto-renewing...
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for awstats.opensourceecology.org
http-01 challenge for fef.opensourceecology.org
http-01 challenge for forum.opensourceecology.org
http-01 challenge for microfactory.opensourceecology.org
http-01 challenge for munin.opensourceecology.org
http-01 challenge for opensourceecology.org
http-01 challenge for oswh.opensourceecology.org
http-01 challenge for phplist.opensourceecology.org
http-01 challenge for staging.opensourceecology.org
http-01 challenge for store.opensourceecology.org
http-01 challenge for wiki.opensourceecology.org
http-01 challenge for www.opensourceecology.org
Waiting for verification...
Cleaning up challenges
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Congratulations, all renewals succeeded. The following certs have been renewed:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem (success)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Redirecting to /bin/systemctl reload nginx.service
- there's one entry after that where it said the certs aren't up for renewal; I'll skip that
- the entry after that is where it renewed, but didn't fetch www.opensourceecology.org due to some (temp?) DNS network issue. Note that the renewal config sets allow_subset_of_names = True, which is presumably why certbot just dropped 'www' from the cert and reported success instead of failing the renewal
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert is due for renewal, auto-renewing...
Non-interactive renewal: random delay of 236.7908044 seconds
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for awstats.openbuildinginstitute.org
http-01 challenge for openbuildinginstitute.org
http-01 challenge for seedhome.openbuildinginstitute.org
http-01 challenge for www.openbuildinginstitute.org
Waiting for verification...
Cleaning up challenges
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert is due for renewal, auto-renewing...
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for awstats.opensourceecology.org
http-01 challenge for fef.opensourceecology.org
http-01 challenge for forum.opensourceecology.org
http-01 challenge for microfactory.opensourceecology.org
http-01 challenge for munin.opensourceecology.org
http-01 challenge for opensourceecology.org
http-01 challenge for oswh.opensourceecology.org
http-01 challenge for phplist.opensourceecology.org
http-01 challenge for staging.opensourceecology.org
http-01 challenge for store.opensourceecology.org
http-01 challenge for wiki.opensourceecology.org
http-01 challenge for www.opensourceecology.org
Waiting for verification...
Challenge failed for domain www.opensourceecology.org
http-01 challenge for www.opensourceecology.org
Cleaning up challenges
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Congratulations, all renewals succeeded. The following certs have been renewed:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem (success)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: www.opensourceecology.org
   Type:   dns
   Detail: DNS problem: networking error looking up CAA for www.opensourceecology.org
Redirecting to /bin/systemctl reload nginx.service
- there's a recent thread with this error message on the Let's Encrypt forums. It was asked 2023-12-20, which is right in our timeline of when this issue hit OSE https://community.letsencrypt.org/t/networking-error-when-letsencrypt-is-looking-up-caa/210483
- there's also a recent bug report filed last week (2024-01-15) https://github.com/cert-manager/cert-manager/issues/6640
- neither of the two threads above was helpful. I'm just going to try to upgrade certbot; maybe there's some old bug that was fixed
- currently we're running certbot v1.3.0-1
[root@opensourceecology letsencrypt]# rpm -qa | grep -i certbot
python2-certbot-1.3.0-1.el7.noarch
certbot-1.3.0-1.el7.noarch
[root@opensourceecology letsencrypt]#
- I upgraded to v1.11.0-2
[root@opensourceecology letsencrypt]# yum install certbot
Loaded plugins: fastestmirror, replace
Loading mirror speeds from cached hostfile
 * base: mirror.fra1.de.leaseweb.net
 * epel: mirror.de.leaseweb.net
 * extras: mirror.checkdomain.de
 * updates: mirror.checkdomain.de
Resolving Dependencies
--> Running transaction check
---> Package certbot.noarch 0:1.3.0-1.el7 will be updated
---> Package certbot.noarch 0:1.11.0-2.el7 will be an update
--> Processing Dependency: python2-certbot = 1.11.0-2.el7 for package: certbot-1.11.0-2.el7.noarch
--> Running transaction check
---> Package python2-certbot.noarch 0:1.3.0-1.el7 will be updated
---> Package python2-certbot.noarch 0:1.11.0-2.el7 will be an update
--> Processing Dependency: python2-acme >= 1.8.0 for package: python2-certbot-1.11.0-2.el7.noarch
--> Running transaction check
---> Package python2-acme.noarch 0:1.3.0-1.el7 will be updated
---> Package python2-acme.noarch 0:1.11.0-1.el7 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================================================================================
 Package                  Arch            Version                 Repository     Size
==================================================================================================================================
Updating:
 certbot                  noarch          1.11.0-2.el7            epel           47 k
Updating for dependencies:
 python2-acme             noarch          1.11.0-1.el7            epel           83 k
 python2-certbot          noarch          1.11.0-2.el7            epel          386 k

Transaction Summary
==================================================================================================================================
Upgrade  1 Package (+2 Dependent packages)

Total download size: 515 k
Is this ok [y/d/N]: y
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
(1/3): certbot-1.11.0-2.el7.noarch.rpm                          |  47 kB  00:00:00
(2/3): python2-acme-1.11.0-1.el7.noarch.rpm                     |  83 kB  00:00:00
(3/3): python2-certbot-1.11.0-2.el7.noarch.rpm                  | 386 kB  00:00:00
--------------------
Total                                              1.9 MB/s | 515 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Updating   : python2-acme-1.11.0-1.el7.noarch                  1/6
  Updating   : python2-certbot-1.11.0-2.el7.noarch               2/6
  Updating   : certbot-1.11.0-2.el7.noarch                       3/6
  Cleanup    : certbot-1.3.0-1.el7.noarch                        4/6
  Cleanup    : python2-certbot-1.3.0-1.el7.noarch                5/6
  Cleanup    : python2-acme-1.3.0-1.el7.noarch                   6/6
  Verifying  : certbot-1.11.0-2.el7.noarch                       1/6
  Verifying  : python2-acme-1.11.0-1.el7.noarch                  2/6
  Verifying  : python2-certbot-1.11.0-2.el7.noarch               3/6
  Verifying  : python2-certbot-1.3.0-1.el7.noarch                4/6
  Verifying  : certbot-1.3.0-1.el7.noarch                        5/6
  Verifying  : python2-acme-1.3.0-1.el7.noarch                   6/6

Updated:
  certbot.noarch 0:1.11.0-2.el7

Dependency Updated:
  python2-acme.noarch 0:1.11.0-1.el7          python2-certbot.noarch 0:1.11.0-2.el7

Complete!
[root@opensourceecology letsencrypt]#
- I did another forced renewal
[root@opensourceecology letsencrypt]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Serial Number: 47c22e806946a20e6a574394ef507d0aff2
    Key Type: RSA
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-20 00:47:41+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Serial Number: 4c27188b056dad08bff1b624995a3774a61
    Key Type: RSA
    Domains: opensourceecology.org awstats.opensourceecology.org fef.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org
    Expiry Date: 2024-04-20 00:47:47+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
[root@opensourceecology letsencrypt]# certbot renew --force-renew
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate for www.openbuildinginstitute.org and 3 more domains

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is
/etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate for opensourceecology.org and 10 more domains

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is
/etc/letsencrypt/live/opensourceecology.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Congratulations, all renewals succeeded:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem (success)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- sadly, there's still no 'www' subdomain in the cert
[root@opensourceecology letsencrypt]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Serial Number: 3a18ce304580629bd0d83fe28d9285b492f
    Key Type: RSA
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-20 01:17:10+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Serial Number: 4c559a2d15c6bea75c5e7bb65170fbe2d9c
    Key Type: RSA
    Domains: opensourceecology.org awstats.opensourceecology.org fef.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org
    Expiry Date: 2024-04-20 01:17:15+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
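- as a cross-check that doesn't depend on certbot's own bookkeeping, the SAN list can also be read straight off the live cert file with openssl. A minimal sketch (the cert path is the one reported by `certbot certificates` above):

```shell
# Print the Subject Alternative Names (the domains) embedded in the cert
# itself; 'www' should appear here once the renewal config is fixed.
openssl x509 -noout -text \
  -in /etc/letsencrypt/live/opensourceecology.org/fullchain.pem \
  | grep -A1 'Subject Alternative Name'
```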
- I think I should try explicitly telling certbot to update its config. The process for this is documented in the wiki here: https://wiki.opensourceecology.org/wiki/Web_server_configuration#https
- here's the command as stated in the wiki (it's suffering from doc rot; it's missing the apex domain and several newer subdomains)
certbot -nv --expand --cert-name opensourceecology.org certonly -v --webroot -w /var/www/html/fef.opensourceecology.org/htdocs/ -d fef.opensourceecology.org -w /var/www/html/www.opensourceecology.org/htdocs -d www.opensourceecology.org -w /var/www/html/oswh.opensourceecology.org/htdocs/ -d oswh.opensourceecology.org -w /var/www/html/forum.opensourceecology.org/htdocs -d forum.opensourceecology.org
/bin/chmod 0400 /etc/letsencrypt/archive/*/pri*
nginx -t && service nginx reload
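- rather than rebuilding the `-w`/`-d` pairs from memory, the domain-to-webroot mapping that certbot currently tracks can be dumped from its renewal config. A hedged sketch (assumes the `[[webroot_map]]` section layout certbot uses in `/etc/letsencrypt/renewal/*.conf`):

```shell
# Print the "domain = webroot" lines certbot has on record, so the
# certonly command can be reconstructed from fact rather than memory.
awk '/\[\[webroot_map\]\]/ {in_map=1; next}
     /^\[/                 {in_map=0}
     in_map && NF          {print}' \
  /etc/letsencrypt/renewal/opensourceecology.org.conf
```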
- here's the updated command
certbot -nv --expand --cert-name opensourceecology.org certonly -v --webroot -w /var/www/html/fef.opensourceecology.org/htdocs/ -d fef.opensourceecology.org -w /var/www/html/www.opensourceecology.org/htdocs -d www.opensourceecology.org -d opensourceecology.org -w /var/www/html/oswh.opensourceecology.org/htdocs/ -d oswh.opensourceecology.org -w /var/www/html/forum.opensourceecology.org/htdocs -d forum.opensourceecology.org -w /var/www/html/store.opensourceecology.org/htdocs -d store.opensourceecology.org -w /var/www/html/phplist.opensourceecology.org/public_html -d phplist.opensourceecology.org -w /var/www/html/certbot/htdocs -d munin.opensourceecology.org -d awstats.opensourceecology.org -w /var/www/html/microfactory.opensourceecology.org/htdocs -d microfactory.opensourceecology.org -w /var/www/html/wiki.opensourceecology.org/htdocs -d wiki.opensourceecology.org -w /var/www/html/staging.opensourceecology.org/htdocs -d staging.opensourceecology.org
- gave it a run
[root@opensourceecology letsencrypt]# certbot -nv --expand --cert-name opensourceecology.org certonly -v --webroot -w /var/www/html/fef.opensourceecology.org/htdocs/ -d fef.opensourceecology.org -w /var/www/html/www.opensourceecology.org/htdocs -d www.opensourceecology.org -d opensourceecology.org -w /var/www/html/oswh.opensourceecology.org/htdocs/ -d oswh.opensourceecology.org -w /var/www/html/forum.opensourceecology.org/htdocs -d forum.opensourceecology.org -w /var/www/html/store.opensourceecology.org/htdocs -d store.opensourceecology.org -w /var/www/html/phplist.opensourceecology.org/public_html -d phplist.opensourceecology.org -w /var/www/html/certbot/htdocs -d munin.opensourceecology.org -d awstats.opensourceecology.org -w /var/www/html/microfactory.opensourceecology.org/htdocs -d microfactory.opensourceecology.org -w /var/www/html/wiki.opensourceecology.org/htdocs -d wiki.opensourceecology.org -w /var/www/html/staging.opensourceecology.org/htdocs -d staging.opensourceecology.org
...
Writing new config /etc/letsencrypt/renewal/opensourceecology.org.conf.new.
Reporting to user: Congratulations! Your certificate and chain have been saved at:
/etc/letsencrypt/live/opensourceecology.org/fullchain.pem
Your key file has been saved at:
/etc/letsencrypt/live/opensourceecology.org/privkey.pem
Your certificate will expire on 2024-04-20. To obtain a new or tweaked version of this certificate in the future, simply run certbot again. To non-interactively renew *all* of your certificates, run "certbot renew"
Reporting to user: If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
Donating to EFF:                    https://eff.org/donate-le

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/opensourceecology.org/privkey.pem
   Your certificate will expire on 2024-04-20. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot again.
   To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

[root@opensourceecology letsencrypt]#
- finally, the 'www' subdomain is in the list now
[root@opensourceecology letsencrypt]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Serial Number: 3a18ce304580629bd0d83fe28d9285b492f
    Key Type: RSA
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-20 01:17:10+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Serial Number: 30ec9a5c7a9a9603d3f2c5bd943be6d8d75
    Key Type: RSA
    Domains: fef.opensourceecology.org awstats.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org www.opensourceecology.org
    Expiry Date: 2024-04-20 01:28:35+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- refreshing the page still shows the cert error; let's try the next two documented commands (fix the key permissions and reload nginx)
[root@opensourceecology letsencrypt]# /bin/chmod 0400 /etc/letsencrypt/archive/*/pri*
[root@opensourceecology letsencrypt]# nginx -t && service nginx reload
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.openbuildinginstitute.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: [warn] the "ssl" directive is deprecated, use the "listen ... ssl" directive instead in /etc/nginx/conf.d/ssl.opensourceecology.org.include:11
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Redirecting to /bin/systemctl reload nginx.service
[root@opensourceecology letsencrypt]#
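- the blanket `chmod 0400` above can be double-checked with a quick audit; this sketch prints any private key under the archive dir that is *not* locked down to 0400 (it prints nothing when all is well):

```shell
# List Let's Encrypt private keys whose mode is anything other than 0400.
find /etc/letsencrypt/archive -name 'privkey*.pem' ! -perm 0400
```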
- well, it looks like we're using the deprecated 'ssl' directive in our nginx configs, but otherwise that looked good
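- for reference, silencing that warning is just a matter of moving the flag onto the `listen` directive. A sketch (the real server blocks live in the ssl.*.include files named in the warnings, so this is illustrative only):

```nginx
# Deprecated form (triggers the [warn] at line 11 of the includes):
#   listen 443;
#   ssl on;
# Modern equivalent:
server {
    listen 443 ssl;
    # ...rest of the vhost config unchanged...
}
```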
- I refreshed the web browser, and now the cert errors are gone :D
- I tested all of the other subdomains; I did not get any errors
- for good measure (to make sure it won't break again on the next renewal), I forced one more renewal
[root@opensourceecology letsencrypt]# certbot renew --force-renew
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/openbuildinginstitute.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate for www.openbuildinginstitute.org and 3 more domains

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is
/etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/opensourceecology.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Renewing an existing certificate for fef.opensourceecology.org and 11 more domains

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
new certificate deployed without reload, fullchain is
/etc/letsencrypt/live/opensourceecology.org/fullchain.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Congratulations, all renewals succeeded:
  /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem (success)
  /etc/letsencrypt/live/opensourceecology.org/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- I confirmed that the 'www' subdomain is, in fact, still present. Success!
[root@opensourceecology letsencrypt]# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: openbuildinginstitute.org
    Serial Number: 30f44a6ec177f0c7c24b8d450aa9e57ef36
    Key Type: RSA
    Domains: www.openbuildinginstitute.org awstats.openbuildinginstitute.org openbuildinginstitute.org seedhome.openbuildinginstitute.org
    Expiry Date: 2024-04-20 01:35:00+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/openbuildinginstitute.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/openbuildinginstitute.org/privkey.pem
  Certificate Name: opensourceecology.org
    Serial Number: 360f3ea6ad9910ecaff89dc9725efd8f1a9
    Key Type: RSA
    Domains: fef.opensourceecology.org awstats.opensourceecology.org forum.opensourceecology.org microfactory.opensourceecology.org munin.opensourceecology.org opensourceecology.org oswh.opensourceecology.org phplist.opensourceecology.org staging.opensourceecology.org store.opensourceecology.org wiki.opensourceecology.org www.opensourceecology.org
    Expiry Date: 2024-04-20 01:35:05+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/opensourceecology.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/opensourceecology.org/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[root@opensourceecology letsencrypt]#
- In conclusion, it appears that there's a recent & known bug in certbot where DNS errors cause renewal failures with some very small probability. This broke renewal for just one of our subdomains sometime between 2023-11-13 and 2024-01-13 https://github.com/cert-manager/cert-manager/issues/6640
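- the DNS-failure theory can be sanity-checked against certbot's own logs. A hedged sketch (the log path is the one certbot prints above; the rotated-log glob and the exact error strings are assumptions):

```shell
# Scan current + rotated certbot logs for DNS-related renewal errors
# around the window when the subdomain was silently dropped.
grep -ihE 'dns|NXDOMAIN|SERVFAIL' /var/log/letsencrypt/letsencrypt.log* \
  | grep -iE 'error|problem' \
  | head -n 20
```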
- I think I may have also discovered another, unreported bug where the above failure causes certbot to drop a defined subdomain from subsequent renewals. But I don't have enough evidence of this yet to submit a proper bug report
- the fix was to manually reconstruct and re-run the `certbot -nv --expand --cert-name opensourceecology.org certonly -v --webroot -w ...` command, then chmod the private keys and reload nginx (see above)
- I sent an email to Marcin about this root cause & resolution
Hey Marcin,

I fixed the certbot config.

It looks like there's some sort of bug (reported to the Let's Encrypt GitHub last week) that causes some domain renewals to fail due to DNS issues at some highly improbable likelihood. This happened for the 'www.opensourceecology.org' subdomain sometime between 2023-11-13 and 2024-01-13. And, for some reason, this prevented subsequent renewals from attempting to renew that subdomain.

I fixed it by refreshing the config for the opensourceecology.org cert. I tested that subsequent renewals will work as well.

For more information, see my log entry:

 * https://wiki.opensourceecology.org/wiki/Maltfield_Log/2024#Sat_January_10.2C_2024

Please let me know if you have any further questions.

Thank you,

Michael Altfield
Senior Technology Advisor
PGP Fingerprint: 8A4B 0AF8 162F 3B6A 79B7 70D2 AA3E DF71 60E2 D97B

Open Source Ecology
www.opensourceecology.org