Maltfield Log/2019 Q3

My work log from the year 2019 Quarter 3. I intentionally made this verbose to make future admins' work easier when troubleshooting. The more keywords, error messages, etc. that are listed in this log, the more helpful it will be for the future OSE Sysadmin.

=See Also=
# [[Maltfield_Log]]
# [[User:Maltfield]]
# [[Special:Contributions/Maltfield]]
=Sat Aug 03, 2019=
# we just got an email from the server stating that there were errors with the backups
<pre>
ATTENTION: BACKUPS MISSING!
WARNING: First of this month's backup (20190801) is missing!
WARNING: First of last month's backup (20190701) is missing!
WARNING: Yesterday's backup (20190802) is missing!
WARNING: The day before yesterday's backup (20190801) is missing!
See below for the contents of the backblaze b2 bucket = ose-server-backups
</pre>
# note that there were no contents listed under the "See below for the contents of the backblaze b2 bucket = ose-server-backups" line
# this error was generated by the cron job /etc/cron.d/backup_to_backblaze and the script /root/backups/backupReport.sh. This is the first time I've seen it return a critical failure like this.
# the fact that the output is totally empty and it states that we're missing all the backups, even though this is the first time we've received this, suggests it's a false-positive
# I logged into the server, changed to the 'b2user' user, and ran the command to get a listing of the contents of the bucket, and--sure enough--I got an error
<pre>
[b2user@opensourceecology ~]$ ~/virtualenv/bin/b2 ls ose-server-backups
ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.
[b2user@opensourceecology ~]$
</pre>
# Per the error message, I logged into the b2 website. As soon as I authenticated, I saw this pop-up
<pre>
B2 Access Denied
Your access to B2 has been suspended because your account has not been in good standing and your grace period has now ended. Please review your account and update your payment method at payment history, or contact tech support for assistance.
B2 API Error
Error Detail:
Account trouble. Please log into your b2 account at www.backblaze.com.
B2 API errors happen for a variety of reasons including failures to connect to the B2 servers, unexpectedly high B2 server load and general networking problems. Please see our documentation for more information about specific errors returned for each API call.
You should also investigate our easy-to-use command line tool here: https://www.backblaze.com/b2/docs/quick_command_line.html
</pre>
# I'm sure they sent us alerts to our account email (backblaze at opensourceecology dot org), but I can't fucking check it because gmail demands 2fa via sms that isn't tied to the account. Ugh.
# I made some improvements to the backupReport.sh script (a sketch of the changes follows the example email below).
## it now redirects STDERR to STDOUT, so any errors are captured & sent with the email where the backup file listing usually appears
## it now has a footer that includes the timestamp of when the script was executed
## it now lists the path of the script itself, to help future admins debug issues
## it now lists the path of the cron job that executes the script, to help future admins debug issues
## it now prints links to two relevant documentation pages on the wiki, to help future admins debug issues
# The new email looks like this
<pre>
ATTENTION: BACKUPS MISSING!
ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.
WARNING: First of this month's backup (20190801) is missing!
WARNING: First of last month's backup (20190701) is missing!
WARNING: Yesterday's backup (20190802) is missing!
WARNING: The day before yesterday's backup (20190801) is missing!
See below for the contents of the backblaze b2 bucket = ose-server-backups
ERROR: Unknown error: 403 account_trouble Account trouble. Please log into your b2 account at www.backblaze.com.
---
Note: This report was generated on 20190803_071847 UTC by script '/root/backups/backupReport.sh'
  This script was triggered by '/etc/cron.d/backup_to_backblaze'
  For more information about OSE backups, please see the relevant documentation pages on the wiki:
  * https://wiki.opensourceecology.org/wiki/Backblaze
  * https://wiki.opensourceecology.org/wiki/OSE_Server#Backups
</pre>
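For reference, the new pieces fit together roughly like this. This is a minimal sketch, not the actual script; the b2 binary path is inferred from the b2user transcript above, and the variable name is hypothetical.
<pre>
#!/bin/bash
# sketch of the updated backupReport.sh logic (illustrative only)

# 2>&1 folds STDERR into STDOUT so API errors (like the 403 above)
# land in the email body where the file listing usually appears
bucketContents=$(sudo -u b2user /home/b2user/virtualenv/bin/b2 ls ose-server-backups 2>&1)

# ... the existing checks for the first-of-month & recent backups go here ...

# new footer to help future admins debug issues
echo "---"
echo "Note: This report was generated on $(date -u +%Y%m%d_%H%M%S) UTC by script '/root/backups/backupReport.sh'"
echo "  This script was triggered by '/etc/cron.d/backup_to_backblaze'"
echo "  For more information about OSE backups, please see the relevant documentation pages on the wiki:"
echo "  * https://wiki.opensourceecology.org/wiki/Backblaze"
echo "  * https://wiki.opensourceecology.org/wiki/OSE_Server#Backups"
</pre>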
=Thu Aug 01, 2019=
# Discussion with Tom on DR & bus factor contingency planning
# Marcin asked if our server could handle thousands of concurrent editors on the wiki for the upcoming cordless drill microfactory contest
# hetzner2 is basically idle. I'm not sure where its limits are, but we're nowhere near them. With varnish in place, writes are much more costly than concurrent reads. I explained to Marcin that scaling hetzner2 would mean dividing it up into parts (1+ DB servers, 1+ memcache (db cache) servers, 1+ apache backend servers, 1+ nginx ssl terminator servers, 1+ haproxy load balancing servers, 1+ mail servers, 1+ varnish frontend caching servers, etc.)
# I went to check munin, but the graphs were bare! Looks like our server rebooted, and munin wasn't enabled to start at system boot. I fixed that.
<pre>
[root@opensourceecology init.d]# systemctl enable munin-node
Created symlink from /etc/systemd/system/multi-user.target.wants/munin-node.service to /usr/lib/systemd/system/munin-node.service.
[root@opensourceecology init.d]# systemctl status munin-node
● munin-node.service - Munin Node Server.
  Loaded: loaded (/usr/lib/systemd/system/munin-node.service; enabled; vendor preset: disabled)
  Active: inactive (dead)
Docs: man:munin-node
[root@opensourceecology init.d]# systemctl start munin-node
[root@opensourceecology init.d]# systemctl status munin-node
● munin-node.service - Munin Node Server.
  Loaded: loaded (/usr/lib/systemd/system/munin-node.service; enabled; vendor preset: disabled)
  Active: active (running) since Thu 2019-08-01 10:17:09 UTC; 2s ago
Docs: man:munin-node
  Process: 20015 ExecStart=/usr/sbin/munin-node (code=exited, status=0/SUCCESS)
Main PID: 20016 (munin-node)
  CGroup: /system.slice/munin-node.service
  └─20016 /usr/bin/perl -wT /usr/sbin/munin-node
Aug 01 10:17:09 opensourceecology.org systemd[1]: Starting Munin Node Server....
Aug 01 10:17:09 opensourceecology.org systemd[1]: Started Munin Node Server..
[root@opensourceecology init.d]#
</pre>
# yearly graphs are available, showing the data cutting off sometime in June
# Marcin said Discourse is no replacement for Askbot, so we should go with both.
# Marcin approved my request for $100/yr for a dev server in the hetzner cloud. I'll provision a CX11 w/ 50G block storage when I get back from my upcoming vacation (see the provisioning sketch below)
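For when I get to it, provisioning could look something like this. It assumes Hetzner's `hcloud` CLI and its flag names; treat it as a sketch to verify against their docs rather than a tested procedure.
<pre>
# hypothetical provisioning of the dev node via the hcloud CLI
hcloud server create --name dev1 --type cx11 --image centos-7
hcloud volume create --name dev1-data --size 50 --server dev1
</pre>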
=Wed Jul 31, 2019=
# Discussion with Tom on DR & bus factor contingency planning
# Wiki changes
=Tue Jul 30, 2019=
# Stack Exchange & Askbot research
# I told Marcin that I think Discourse is the best option, but the dependencies may break our prod server, and I asked for a budget for a dev server
# the dev server wouldn't need to be very powerful, but it does need to have the same setup & disk as prod.
# I checked the current disk on our prod server, and it has 145G used
## 34G are in /home/b2user = redundant backup data.
## Wow, there's also 72G in /tmp/systemd-private-2311ab4052754ae68f4a114aefa85295-httpd.service-LqLH0q/tmp/
## this appears to be caused by the "PrivateTmp" feature of systemd: many apps like httpd will create files in the 777'd /tmp dir, so systemd gives each service its own private one. At OSE, I hardened php so that it writes temp files *not* in this dir, anyway. I found several guides on how to disable PrivateTmp (see the drop-in sketch after the transcript below), but preventing apache from writing to a 777 dir doesn't sound so bad. https://gryzli.info/2015/06/21/centos-7-missing-phpapache-temporary-files-in-tmp-systemd-private-temp/
## better question: how do I just clean up this shit? I tried `systemd-tmpfiles --clean` & `systemd-tmpfiles --remove` to no avail
<pre>
[root@opensourceecology tmp]# systemd-tmpfiles --clean
[root@opensourceecology tmp]# du -sh /tmp
72G /tmp
[root@opensourceecology tmp]# systemd-tmpfiles --remove
[root@opensourceecology tmp]# du -sh /tmp
72G /tmp
[root@opensourceecology tmp]#
</pre>
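For the record, disabling PrivateTmp would look something like the drop-in below. This is a minimal sketch of the standard systemd override mechanism; we did *not* apply it, since keeping apache out of the shared /tmp seems fine.
<pre>
# sketch: disable PrivateTmp for httpd via a systemd drop-in (not applied)
mkdir -p /etc/systemd/system/httpd.service.d
cat > /etc/systemd/system/httpd.service.d/privatetmp.conf <<'EOF'
[Service]
PrivateTmp=false
EOF
systemctl daemon-reload
systemctl restart httpd
</pre>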
# I also confirmed that systemd-tmpfiles *should* be run every day anyway, via a systemd timer https://unix.stackexchange.com/questions/489940/linux-files-folders-cleanup-under-tmp (the inspection commands after the listing below show where its rules live)
<pre>
[root@opensourceecology tmp]# systemctl list-timers
NEXT                        LEFT    LAST                        PASSED              UNIT                        ACTIVATES
n/a                          n/a      Sat 2019-06-22 03:11:54 UTC  1 months 7 days ago systemd-readahead-done.timer systemd-readahead-done
Wed 2019-07-31 03:29:18 UTC  21h left Tue 2019-07-30 03:29:18 UTC  2h 25min ago        systemd-tmpfiles-clean.timer systemd-tmpfiles-clean
2 timers listed.
Pass --all to see loaded but inactive timers, too.
[root@opensourceecology tmp]#
</pre>
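So the timer fires daily but the files survive; the age rules (and any exclusions for the systemd-private-* dirs) are the likely reason. Hypothetical inspection commands, assuming the stock CentOS 7 layout:
<pre>
# inspect the age rules that systemd-tmpfiles-clean applies to /tmp
cat /usr/lib/tmpfiles.d/tmp.conf
# and check for any local overrides
ls -la /etc/tmpfiles.d/
</pre>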
# to make matters worse, it does appear that we have everything on one partition
<pre>
[root@opensourceecology tmp]# cat /etc/fstab
proc /proc proc defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/md/0 none swap sw 0 0
/dev/md/1 /boot ext3 defaults 0 0
/dev/md/2 / ext4 defaults 0 0
[root@opensourceecology tmp]# mount
sysfs on /sys type sysfs (rw,relatime)
proc on /proc type proc (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32792068k,nr_inodes=8198017,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/md2 on / type ext4 (rw,relatime,data=ordered)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10157)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
/dev/md1 on /boot type ext3 (rw,relatime,stripe=4,data=ordered)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6563484k,mode=700)
tmpfs on /run/user/1005 type tmpfs (rw,nosuid,nodev,relatime,size=6563484k,mode=700,uid=1005,gid=1005)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
[root@opensourceecology tmp]#
</pre>
# It appears there's just a ton of cachegrind files here: 444,670 files, to be exact (all <1M)
<pre>
[root@opensourceecology tmp]# ls -lah | grep -vi cachegrind
total 72G
drwxrwxrwt 2 root  root    22M Jul 30 06:02 .
drwx------ 3 root  root  4.0K Jun 22 03:11 ..
-rw-r--r-- 1 apache apache    5 Jun 22 03:12 dos-127.0.0.1
-rw-r--r-- 1 apache apache 112M Jul 30 06:02 xdebug.log
[root@opensourceecology tmp]# ls -lah | grep "M"
drwxrwxrwt 2 root  root    22M Jul 30 06:02 .
-rw-r--r-- 1 apache apache 112M Jul 30 06:02 xdebug.log
[root@opensourceecology tmp]# ls -lah | grep "G"
total 72G
[root@opensourceecology tmp]# ls -lah | grep 'cachegrind.out' | wc -l
444670
[root@opensourceecology tmp]# pwd
/tmp/systemd-private-2311ab4052754ae68f4a114aefa85295-httpd.service-LqLH0q/tmp
[root@opensourceecology tmp]# date
Tue Jul 30 06:04:26 UTC 2019
[root@opensourceecology tmp]#
</pre>
# These files should be deleted after 30 days, and that appears to be the case https://bugzilla.redhat.com/show_bug.cgi?id=1183684#c4
# A quick search for xdebug shows that I enabled it for phplist; that's probably what's generating these cachegrind files. I commented-out the lines enabling xdebug in the phplist apache vhost config file and gave httpd a restart. That cleared the tmp files. Now the disk usage is down to 73G used and 11M in /tmp (if a restart ever fails to clear them, see the manual cleanup sketch below)
<pre>
[root@opensourceecology tmp]# date
Tue Jul 30 06:10:18 UTC 2019
[root@opensourceecology tmp]# du -sh /tmp
11M /tmp
[root@opensourceecology tmp]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        197G  73G  115G  40% /
devtmpfs        32G    0  32G  0% /dev
tmpfs            32G    0  32G  0% /dev/shm
tmpfs            32G  865M  31G  3% /run
tmpfs            32G    0  32G  0% /sys/fs/cgroup
/dev/md1        488M  289M  174M  63% /boot
tmpfs          6.3G    0  6.3G  0% /run/user/0
tmpfs          6.3G    0  6.3G  0% /run/user/1005
[root@opensourceecology tmp]#
</pre>
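Note that restarting httpd clears these because systemd removes the service's private tmp directory when the service stops. If we ever need to purge old profiles without bouncing apache, something like this hypothetical find invocation should work (the glob matches the private tmp naming seen above):
<pre>
# manual cleanup sketch (untested here): delete cachegrind profiles >30 days old
find /tmp/systemd-private-*-httpd.service-*/tmp -name 'cachegrind.out.*' -mtime +30 -delete
</pre>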
# Ok, so that's 73 - 34 = 39G of disk usage. 39 * 1.3 ≈ 51G for good measure.
# I found this guide for using rsync and a few touch-up commands to migrate a hetzner vServer to their cloud service https://wiki.hetzner.de/index.php/How_to_migrate_vServers_to_Cloud/en
# the cheapest hetzner cloud node with >51G of disk is their CX31 w/ 80G disk @ 8.90 EUR/mo = 106.80 EUR/yr ≈ $119/yr
# ...but they also have block volume storage (where we could, for example, mount /var = 37G). Then we'd only need a 51 - 37 = 14G root, and we could get hetzner's cheapest cloud node = CX11 w/ 20G disk @ 2.49 EUR/mo = 29.88 EUR/yr = $33.29/yr, plus a 50G block volume for 2 EUR/mo = 24 EUR/yr = $26.74/yr. That's a total of 33.29 + 26.74 = $60.03/yr for a dev node (the math is worked through below)
# I asked Marcin if he could approve spending $100/yr for a dev node in the hetzner cloud.
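Spelling out the arithmetic (the EUR→USD rate of ~1.114 is inferred from the figures above, not an official quote):
<pre>
# worked sizing & pricing math for the dev node
echo $(( 73 - 34 ))                    # 39G of real usage
python -c 'print(39 * 1.3)'            # 50.7 -> ~51G target incl. headroom
python -c 'print(2.49 * 12 * 1.114)'   # CX11: 29.88 EUR/yr ~= $33.29/yr
python -c 'print(2.00 * 12 * 1.114)'   # 50G volume: 24 EUR/yr ~= $26.74/yr
python -c 'print(33.29 + 26.74)'       # total ~= $60.03/yr
</pre>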


=Tue Jul 28, 2019=
# Stack Exchange research
# WebGL research for 3d models on OSE Store Product Pages (ie: 3d printer)

=Tue Jul 18, 2019=
# Marcin asked if there's any way to log activity for the ConfirmAccount extension https://www.mediawiki.org/wiki/Extension:ConfirmAccount
# I didn't find anything in the documentation about logs, but scanning through the code showed some calls to wfDebug()
# we already have a debug file defined, but apparently mediawiki stopped writing after it reached 2.1G. I created a new file where we can monitor any future issues
# in the long-term, we should probably set up this file to logrotate and compress after, say, 1G (a sketch follows below)
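Something like this could work when we get to it. A sketch only; the log path is a placeholder for whatever debug file I actually created, and the thresholds are arbitrary.
<pre>
# hypothetical /etc/logrotate.d/ config for the wiki debug log (not deployed)
cat > /etc/logrotate.d/mediawiki-debug <<'EOF'
/var/log/mediawiki/debug.log {
    size 1G
    rotate 3
    compress
    missingok
    notifempty
}
EOF
</pre>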

=Tue Jul 02, 2019=
# Marcin mentioned that some users are unable to request wiki accounts.
# I was only given the info for one specific user. I manually queried the DB by their email address. I found 0 entries in the 'wiki_user' table and 0 entries in the 'wiki_account_requests' table (the queries looked roughly like the sketch at the end of this entry)
# I was able to request an account using their email address, and I confirmed that it appeared in the Special:ConfirmAccounts WUI. I deleted the row, and confirmed that it disappeared from the WUI. I re-registered (to confirm that they could as well), and deleted the row again.
# So I can't reproduce this.
# I emailed Marcin telling him, as a short fix, to tell users to try again using a different Username and Password. As a long fix, tell us: the "Username" they requested, the "Email address" used for the request, the day, time, and timezone when they submitted the request, and any relevant error messages that they were given (bonus points for screenshots)
# ...so that I can research this further
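For future reference, the lookups were roughly as follows. The column names are assumptions based on the stock MediaWiki core and ConfirmAccount schemas, and the database name & email address are placeholders:
<pre>
# hypothetical reconstruction of the DB checks
mysql -u root -p osewiki_db -e \
  "SELECT user_id, user_name FROM wiki_user WHERE user_email = 'user@example.com';"
mysql -u root -p osewiki_db -e \
  "SELECT acr_id, acr_name FROM wiki_account_requests WHERE acr_email = 'user@example.com';"
</pre>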