Maltfield Log/2019 Q3
My work log from the year 2019 Quarter 3. I intentionally made this verbose to make future admins' work easier when troubleshooting. The more keywords, error messages, etc. that are listed in this log, the more helpful it will be for the future OSE Sysadmin.
See Also
Sat Aug 17, 2019
- I found this shitty Help Desk article on BackBlaze B2's non-payment procedure to determine at what point they delete all our precious backup data if we accidentally don't pay again. Answer: After 1.5 months. In this case, we discovered the issue after 1.25 months; that was close! https://help.backblaze.com/hc/en-us/articles/219361957-B2-Non-payment-procedures
- but the above article says that all the dates are subject to change, so who the fuck knows *shrug*
- I recommended to Marcin that he set up email forwards and filters from our backblaze b2 google account so that he can be notified sooner within the 1-month grace period. Personally, I can't fucking login to that account anymore due to google "security features" even though, yeah, I'm the G Suite Admin *facepalm*
...
- I had some emails with Chris about the wiki archival process, which is also an important component of backups and OSE's mission in general
- same as I did back in 2018-05, I created a new snapshot for him since he lost the old version https://wiki.opensourceecology.org/wiki/Maltfield_Log/2018_Q2#Sat_May_26.2C_2018
# DECLARE VARS
snapshotDestDir='/var/tmp/snapshotOfWikiForChris.20190818'
wikiDbName='osewiki_db'
wikiDbUser='osewiki_user'
wikiDbPass='CHANGEME'
stamp=`date +%Y%m%d_%T`

pushd "${snapshotDestDir}"
time nice mysqldump --single-transaction -u"${wikiDbUser}" -p"${wikiDbPass}" --databases "${wikiDbName}" | gzip -c > "${wikiDbName}.${stamp}.sql.gz"
time nice tar -czvf "${snapshotDestDir}/wiki.opensourceecology.org.vhost.${stamp}.tar.gz" /var/www/html/wiki.opensourceecology.org/*
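- before handing a snapshot off, a quick integrity check doesn't hurt; a minimal sketch, reusing the variables from the script above:
# test the gzip stream of the DB dump end-to-end (non-zero exit on corruption)
gzip --test "${snapshotDestDir}/${wikiDbName}.${stamp}.sql.gz"
# list the tarball's contents to /dev/null; tar exits non-zero if it's truncated
tar -tzf "${snapshotDestDir}/wiki.opensourceecology.org.vhost.${stamp}.tar.gz" > /dev/null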
Fri Aug 16, 2019
- added an ossec local rule to prevent email alerts from being triggered when mod_security rejects queries, as they're too numerous and hide more important alerts
- added an ossec local rule to prevent email alerts from being triggered on 500 errors, as they're too numerous and hide more important alerts (a rough sketch of both rules is below)
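- for reference, overrides like these live in OSSEC's local_rules.xml; a sketch of the general shape (the <if_sid> values below are placeholders -- replace them with the rule IDs from the actual noisy alerts):
# a sketch, assuming OSSEC's default install path; the if_sid values are
# placeholders, not the exact IDs I used
cat >> /var/ossec/rules/local_rules.xml <<'EOF'
<group name="local,">

  <!-- keep logging mod_security rejections, but don't email about them -->
  <rule id="100100" level="5">
    <if_sid>30410</if_sid>
    <options>no_email_alert</options>
    <description>mod_security rejected a query (no email; too noisy).</description>
  </rule>

  <!-- likewise for HTTP 500 errors -->
  <rule id="100101" level="5">
    <if_sid>31120</if_sid>
    <options>no_email_alert</options>
    <description>Web server 500 error (no email; too noisy).</description>
  </rule>

</group>
EOF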
Thu Aug 15, 2019
- confirmed that ose backups are working again. we're missing the first-of-the-month, but the past few days look good
[root@opensourceecology ~]# sudo su - b2user
Last login: Sat Aug  3 05:57:40 UTC 2019 on pts/0
[b2user@opensourceecology ~]$ ~/virtualenv/bin/b2 ls ose-server-backups
daily_hetzner2_20190813_072001.tar.gpg
daily_hetzner2_20190814_072001.tar.gpg
monthly_hetzner2_20181001_091809.tar.gpg
monthly_hetzner2_20181101_091810.tar.gpg
monthly_hetzner2_20181201_091759.tar.gpg
monthly_hetzner2_20190201_072001.tar.gpg
monthly_hetzner2_20190301_072001.tar.gpg
monthly_hetzner2_20190401_072001.tar.gpg
monthly_hetzner2_20190501_072001.tar.gpg
monthly_hetzner2_20190601_072001.tar.gpg
monthly_hetzner2_20190701_072001.tar.gpg
weekly_hetzner2_20190812_072001.tar.gpg
yearly_hetzner2_20190101_111520.tar.gpg
[b2user@opensourceecology ~]$
- I also documented these commands on the wiki for future, easy reference https://wiki.opensourceecology.org/wiki/Backblaze
- re-ran backup report
- fixed error in backup report
- re-ran backup report, looks good
[root@opensourceecology backups]# ./backupReport.sh
INFO: email body below

ATTENTION: BACKUPS MISSING!

WARNING: First of this month's backup (20190801) is missing!

See below for the contents of the backblaze b2 bucket = ose-server-backups

daily_hetzner2_20190813_072001.tar.gpg
daily_hetzner2_20190814_072001.tar.gpg
monthly_hetzner2_20181001_091809.tar.gpg
monthly_hetzner2_20181101_091810.tar.gpg
monthly_hetzner2_20181201_091759.tar.gpg
monthly_hetzner2_20190201_072001.tar.gpg
monthly_hetzner2_20190301_072001.tar.gpg
monthly_hetzner2_20190401_072001.tar.gpg
monthly_hetzner2_20190501_072001.tar.gpg
monthly_hetzner2_20190601_072001.tar.gpg
monthly_hetzner2_20190701_072001.tar.gpg
weekly_hetzner2_20190812_072001.tar.gpg
yearly_hetzner2_20190101_111520.tar.gpg

---
Note: This report was generated on 20190815_084159 UTC by script '/root/backups/backupReport.sh'
This script was triggered by '/etc/cron.d/backup_to_backblaze'

For more information about OSE backups, please see the relevant documentation pages on the wiki:

 * https://wiki.opensourceecology.org/wiki/Backblaze
 * https://wiki.opensourceecology.org/wiki/OSE_Server#Backups
[root@opensourceecology backups]#
- confirmed that our accrued bill of $2.57 was paid after Marcin's updates. backups are stable again!
- I emailed Chris asking about the status of the wiki archival process -> archive.org
- I did some fixes to the ossec email alerts
Sat Aug 03, 2019
- we just got an email from the server stating that there were errors with the backups
ATTENTION: BACKUPS MISSING!

WARNING: First of this month's backup (20190801) is missing!
WARNING: First of last month's backup (20190701) is missing!
WARNING: Yesterday's backup (20190802) is missing!
WARNING: The day before yesterday's backup (20190801) is missing!

See below for the contents of the backblaze b2 bucket = ose-server-backups
- note that there were no contents under the "See below for the contents of the backblaze b2 bucket = ose-server-backups" line
- this error was generated by the cron job /etc/cron.d/backup_to_backblaze and the script /root/backups/backupReport.sh. This is the first time I've seen it return a critical failure like this.
- the fact that the output is totally empty and that it states we're missing all the backups, even though this is the first time we've received this alert, suggests it's a false positive
- I logged into the server, switched to the 'b2user' user, and ran the command to get a listing of the contents of the bucket, and--sure enough--I got an error
[b2user@opensourceecology ~]$ ~/virtualenv/bin/b2 ls ose-server-backups
ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.
[b2user@opensourceecology ~]$
- Per the error message, I logged into the b2 website. As soon as I authenticated, I saw this pop-up
B2 Access Denied

Your access to B2 has been suspended because your account has not been in good standing and your grace period has now ended. Please review your account and update your payment method at payment history, or contact tech support for assistance.

B2 API Error

Error Detail: Account trouble. Please log into your b2 account at www.backblaze.com.

B2 API errors happen for a variety of reasons including failures to connect to the B2 servers, unexpectedly high B2 server load and general networking problems. Please see our documentation for more information about specific errors returned for each API call. You should also investigate our easy-to-use command line tool here: https://www.backblaze.com/b2/docs/quick_command_line.html
- I'm sure they sent us alerts to our account email (backblaze at opensourceecology dot org), but I can't fucking check because gmail demands 2fa via an sms number that isn't tied to the account. ugh.
- I made some improvements to the backupReport.sh script; a rough sketch follows the list below.
- it now redirects STDERR to STDOUT, so that any errors are captured & sent with the email where the backup files usually appear
- it now has a footer that includes the timestamp of when the script was executed
- it now lists the path of the script itself, to help future admins debug issues
- it now lists the path of the cron that executes the script, to help future admins debug issues
- it now prints links to two relevant documentation pages on the wiki, to help future admins debug issues
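- a rough sketch of the relevant changes (the variable names and the exact b2 invocation here are my reconstruction, not a copy of the script itself):
# capture the bucket listing with STDERR redirected to STDOUT ('2>&1'), so
# any b2 errors land in the email body where the backup file list usually goes
bucketContents="$(/home/b2user/virtualenv/bin/b2 ls ose-server-backups 2>&1)"

# footer appended to every report, to help future admins debug issues
footer="---
Note: This report was generated on $(date -u +%Y%m%d_%H%M%S) UTC by script '${0}'
This script was triggered by '/etc/cron.d/backup_to_backblaze'

For more information about OSE backups, please see the relevant documentation pages on the wiki:

 * https://wiki.opensourceecology.org/wiki/Backblaze
 * https://wiki.opensourceecology.org/wiki/OSE_Server#Backups"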
- The new email looks like this
ATTENTION: BACKUPS MISSING!

ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.

WARNING: First of this month's backup (20190801) is missing!
WARNING: First of last month's backup (20190701) is missing!
WARNING: Yesterday's backup (20190802) is missing!
WARNING: The day before yesterday's backup (20190801) is missing!

See below for the contents of the backblaze b2 bucket = ose-server-backups

ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.

---
Note: This report was generated on 20190803_071847 UTC by script '/root/backups/backupReport.sh'
This script was triggered by '/etc/cron.d/backup_to_backblaze'

For more information about OSE backups, please see the relevant documentation pages on the wiki:

 * https://wiki.opensourceecology.org/wiki/Backblaze
 * https://wiki.opensourceecology.org/wiki/OSE_Server#Backups
Thu Aug 01, 2019
- Discussion with Tom on DR & bus factor contingency planning
- Marcin asked if our server could handle thousands of concurrent editors on the wiki for the upcoming cordless drill microfactory contest
- hetzner2 is basically idle. I'm not sure where its limits are, but we're nowhere near them. With varnish in place, writes are much more costly than concurrent reads. I explained to Marcin that scaling hetzner2 would mean dividing it up into parts (add 1 or more DB servers, 1+ memcache (db cache) servers, 1+ apache backend servers, 1+ nginx ssl terminator servers, 1+ haproxy load balancing servers, 1+ mail servers, 1+ varnish frontend caching servers, etc)
- I went to check munin, but the graphs were bare! Looks like our server rebooted, and munin wasn't enabled to start at system boot. I fixed that.
[root@opensourceecology init.d]# systemctl enable munin-node
Created symlink from /etc/systemd/system/multi-user.target.wants/munin-node.service to /usr/lib/systemd/system/munin-node.service.
[root@opensourceecology init.d]# systemctl status munin-node
● munin-node.service - Munin Node Server.
   Loaded: loaded (/usr/lib/systemd/system/munin-node.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:munin-node
[root@opensourceecology init.d]# systemctl start munin-node
[root@opensourceecology init.d]# systemctl status munin-node
● munin-node.service - Munin Node Server.
   Loaded: loaded (/usr/lib/systemd/system/munin-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-08-01 10:17:09 UTC; 2s ago
     Docs: man:munin-node
  Process: 20015 ExecStart=/usr/sbin/munin-node (code=exited, status=0/SUCCESS)
 Main PID: 20016 (munin-node)
   CGroup: /system.slice/munin-node.service
           └─20016 /usr/bin/perl -wT /usr/sbin/munin-node

Aug 01 10:17:09 opensourceecology.org systemd[1]: Starting Munin Node Server....
Aug 01 10:17:09 opensourceecology.org systemd[1]: Started Munin Node Server..
[root@opensourceecology init.d]#
- yearly graphs are available, showing the data cutting off sometime in June
- Marcin said Discourse is no replacement for Askbot, so we should go with both.
- Marcin approved my request for $100/yr for a dev server in the hetzner cloud. I'll provision a CX11 w/ 50G block storage when I get back from my upcoming vacation
Wed Jul 31, 2019
- Discussion with Tom on DR & bus factor contingency planning
- Wiki changes
Tue Jul 30, 2019
1. Stack Exchange & Askbot research
2. I told Marcin that I think Discourse is the best option, but the dependencies may break our prod server, and I asked for a budget for a dev server
3. the dev server wouldn't need to be very powerful, but it does need to have the same setup & disk as prod.
4. I checked the current disk on our prod server, and it has 145G used
  a. 34G are in /home/b2user = redundant backup data.
  b. Wow, there's also 72G in /tmp/systemd-private-2311ab4052754ae68f4a114aefa85295-httpd.service-LqLH0q/tmp/
    a. so this appears to be caused by the "PrivateTmp" feature of systemd, because many apps like httpd will create files in the 777'd /tmp dir. At OSE, I hardened php so that it writes temp files *not* in this dir, anyway. I found several guides on how to disable PrivateTmp, but preventing apache from writing to a 777 dir doesn't sound so bad. https://gryzli.info/2015/06/21/centos-7-missing-phpapache-temporary-files-in-tmp-systemd-private-temp/
    b. better question: how do I just clean up this shit? I tried `systemd-tmpfiles --clean` & `systemd-tmpfiles --remove` to no avail
[root@opensourceecology tmp]# systemd-tmpfiles --clean
[root@opensourceecology tmp]# du -sh /tmp
72G     /tmp
[root@opensourceecology tmp]# systemd-tmpfiles --remove
[root@opensourceecology tmp]# du -sh /tmp
72G     /tmp
[root@opensourceecology tmp]#
6. I also confirmed that the above cleanup *should* already be run every day, anyway https://unix.stackexchange.com/questions/489940/linux-files-folders-cleanup-under-tmp
[root@opensourceecology tmp]# systemctl list-timers
NEXT                         LEFT     LAST                         PASSED              UNIT                          ACTIVATES
n/a                          n/a      Sat 2019-06-22 03:11:54 UTC  1 months 7 days ago systemd-readahead-done.timer  systemd-readahead-done
Wed 2019-07-31 03:29:18 UTC  21h left Tue 2019-07-30 03:29:18 UTC  2h 25min ago        systemd-tmpfiles-clean.timer  systemd-tmpfiles-clean

2 timers listed.
Pass --all to see loaded but inactive timers, too.
[root@opensourceecology tmp]#
8. to make matters worse, it does appear that we have everything on one partition
[root@opensourceecology tmp]# cat /etc/fstab
proc /proc proc defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/md/0 none swap sw 0 0
/dev/md/1 /boot ext3 defaults 0 0
/dev/md/2 / ext4 defaults 0 0
[root@opensourceecology tmp]# mount
sysfs on /sys type sysfs (rw,relatime)
proc on /proc type proc (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32792068k,nr_inodes=8198017,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/md2 on / type ext4 (rw,relatime,data=ordered)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10157)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
/dev/md1 on /boot type ext3 (rw,relatime,stripe=4,data=ordered)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6563484k,mode=700)
tmpfs on /run/user/1005 type tmpfs (rw,nosuid,nodev,relatime,size=6563484k,mode=700,uid=1005,gid=1005)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
[root@opensourceecology tmp]#
10. It appears there's just a ton of cachegrind files here: 444,670 files to be exact (all <1M)
[root@opensourceecology tmp]# ls -lah | grep -vi cachegrind
total 72G
drwxrwxrwt 2 root   root    22M Jul 30 06:02 .
drwx------ 3 root   root   4.0K Jun 22 03:11 ..
-rw-r--r-- 1 apache apache    5 Jun 22 03:12 dos-127.0.0.1
-rw-r--r-- 1 apache apache 112M Jul 30 06:02 xdebug.log
[root@opensourceecology tmp]# ls -lah | grep "M"
drwxrwxrwt 2 root   root    22M Jul 30 06:02 .
-rw-r--r-- 1 apache apache 112M Jul 30 06:02 xdebug.log
[root@opensourceecology tmp]# ls -lah | grep "G"
total 72G
[root@opensourceecology tmp]# ls -lah | grep 'cachegrind.out' | wc -l
444670
[root@opensourceecology tmp]# pwd
/tmp/systemd-private-2311ab4052754ae68f4a114aefa85295-httpd.service-LqLH0q/tmp
[root@opensourceecology tmp]# date
Tue Jul 30 06:04:26 UTC 2019
[root@opensourceecology tmp]#
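Side note: if I just wanted to nuke these files manually instead of fixing the source, something like this should do it (a sketch only; it assumes the PrivateTmp path shown above):
# delete only xdebug's cachegrind profiler output from httpd's private tmp,
# leaving xdebug.log and everything else in place
find /tmp/systemd-private-*-httpd.service-*/tmp -maxdepth 1 \
  -type f -name 'cachegrind.out.*' -delete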
13. These files should be deleted after 30 days, and that appears to be the case https://bugzilla.redhat.com/show_bug.cgi?id=1183684#c4
14. A quick search for xdebug shows that I enabled it for phplist; that's probably what's generating these cachegrind files. I commented out the lines enabling xdebug in the phplist apache vhost config file and gave httpd a restart. That cleared the tmp files. Now the disk usage is down to 73G used and 11M in /tmp
[root@opensourceecology tmp]# date
Tue Jul 30 06:10:18 UTC 2019
[root@opensourceecology tmp]# du -sh /tmp
11M     /tmp
[root@opensourceecology tmp]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        197G   73G  115G  40% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G  865M   31G   3% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md1        488M  289M  174M  63% /boot
tmpfs           6.3G     0  6.3G   0% /run/user/0
tmpfs           6.3G     0  6.3G   0% /run/user/1005
[root@opensourceecology tmp]#
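For context: those cachegrind.out.* files come from xdebug's profiler. The vhost lines I commented out would have looked roughly like this (the exact directive values are from memory, not copied from the config):
# in the phplist apache vhost config; commenting these out stops xdebug from
# profiling every request and dumping cachegrind.out.* files into /tmp
#php_admin_value xdebug.profiler_enable 1
#php_admin_value xdebug.profiler_output_dir "/tmp"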
16. Ok, so that's 73 - 34 = 39G of disk usage. 39*1.3 = 51G for good measure.
17. I found this guide for using rsync and a few touch-up commands to migrate a hetzner vServer to their cloud service https://wiki.hetzner.de/index.php/How_to_migrate_vServers_to_Cloud/en
18. the cheapest hetzner cloud node with >51G is their CX31 w/ 80G disk @ 8.90 EUR/mo = 106.8 EUR/yr = $119/yr
19. ...but they also have block volume storage (where we could, for example, mount /var = 37G). Then we'd only need a 51-37 = 14G root, and we could get hetzner's cheapest cloud node = CX11 w/ 20G disk @ 2.49 EUR/mo = 29.88 EUR/yr = $33.29/yr + a 50G block volume for 2 EUR/mo = 24 EUR/yr = $26.74/yr. That's a total of 33.29+26.74 = $60.03/yr for a dev node (arithmetic sanity-checked below)
20. I asked Marcin if he could approve spending $100/yr for a dev node in the hetzner cloud.
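Quick sanity check of that arithmetic (the EUR->USD rate of ~1.114 is just the assumption baked into the dollar figures above):
# CX31 (8.90 EUR/mo): yearly cost in EUR, then USD
echo 'scale=2; 8.90*12'           | bc   # 106.80 EUR/yr
echo 'scale=2; 8.90*12*1.114'     | bc   # ~119 USD/yr
# CX11 (2.49 EUR/mo) + 50G block volume (2 EUR/mo): yearly in USD
echo 'scale=2; (2.49*12)*1.114'   | bc   # ~33 USD/yr
echo 'scale=2; (2*12)*1.114'      | bc   # ~27 USD/yr
echo 'scale=2; (2.49+2)*12*1.114' | bc   # ~60 USD/yr total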
Sun Jul 28, 2019
1. Stack Exchange research
2. WebGL research for 3d models on OSE Store Product Pages (i.e. the 3d printer)
Thu Jul 18, 2019
1. Marcin asked if there's any way to log activity for the Confirm Accounts extension https://www.mediawiki.org/wiki/Extension:ConfirmAccount
2. I didn't find anything in the documentation about logs, but scanning through the code showed some calls to wfDebug()
3. we already have a debug file defined, but apparently mediawiki stopped writing to it after it reached 2.1G. I created a new file where we can monitor any future issues
4. in the long-term, we should probably set up this file to logrotate and compress after, say, 1G (see the sketch below)
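A minimal sketch of such a logrotate config (the log path below is a placeholder; the real one is whatever $wgDebugLogFile points at in LocalSettings.php):
# hypothetical drop-in at /etc/logrotate.d/mediawiki-debug; rotates the
# mediawiki debug log once it exceeds 1G and compresses old copies
cat > /etc/logrotate.d/mediawiki-debug <<'EOF'
/var/log/mediawiki/debug.log {
    size 1G
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
EOF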
Tue Jul 02, 2019
1. Marcin mentioned that some users are unable to request wiki accounts.
2. I was only given the info for one specific user. I manually queried the DB by their email address. I found 0 entries in the 'wiki_user' table and 0 entries in the 'wiki_account_requests' table (query sketch below)
3. I was able to request an account using their email address, and I confirmed that it appeared in the Special:ConfirmAccounts WUI. I deleted the row, and confirmed that it disappeared from the WUI. I re-registered (to confirm that they could as well), and deleted the row again.
4. So I can't reproduce this.
5. I emailed Marcin telling him to tell users, as a short fix, to try again using a different Username and Password. As a long fix, tell us:
  a. The "Username" they requested
  b. The "Email address" used for the request
  c. The day, time, and timezone when they submitted the request, and
  d. Any relevant error messages that they were given (bonus points for screenshots)
6. ...so that I can research this further
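For the record, the manual DB check was along these lines (a sketch; the email address is a placeholder, and the credential variables are the same ones used by the snapshot script earlier in this log):
# look for the user in the core user table and in ConfirmAccount's
# pending-requests table, matching on the email address they gave us
mysql -u"${wikiDbUser}" -p"${wikiDbPass}" "${wikiDbName}" <<'EOF'
SELECT user_id, user_name, user_email FROM wiki_user
  WHERE user_email = 'someuser@example.com';
SELECT acr_id, acr_name, acr_email FROM wiki_account_requests
  WHERE acr_email = 'someuser@example.com';
EOF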