Amazon Glacier
OSE briefly used Amazon Glacier to store some old backups of our data after Dreamhost notified us on 2018-03-20 that storing backups on their servers violated their unlimited storage policy.
In 2019, we left Amazon Glacier for Backblaze for the following reasons:
- Backblaze is cheaper once Glacier's minimum archive retention requirements (buried in the fine print) are taken into account
- Backblaze is way, way easier to use
Actual Storage Quotas and Costs
Account
- ops at opensourceecology.org
Restore from Glacier
We use Amazon Glacier for cheap long-term backups. Glacier is one of the cheapest options for storing about a TB of data, but it can be very difficult to use, and data retrieval costs are high.
Glacier has no notion of files & dirs. Archives are uploaded to Glacier into vaults, and each archive is identified by a long UID & a description. At OSE, we use the tool 'glacier-cli' to simplify large uploads; this tool uses the description field as a file name. For each tarball, I uploaded a corresponding metadata text file that lists all of the files in the tarball (this should save costs if someone doesn't know which archive to download, since the metadata file is significantly smaller than the tarball archive itself).
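As a hedged illustration of this convention (the file and directory names below are hypothetical, not taken from the real backup scripts), the metadata file for a tarball can be generated from the tarball itself before both are encrypted and uploaded:

```shell
# create a stand-in directory to tar up, for demonstration only
mkdir -p demo/www && echo 'hello' > demo/www/index.html

# illustrative names; the real backup scripts may differ
stamp=20170901-052001
tar -czf hetzner1_${stamp}.tar.gz demo/

# list every file in the tarball and compress the listing; this small
# "fileList" is uploaded alongside the (much larger) tarball itself
tar -tzf hetzner1_${stamp}.tar.gz | bzip2 > hetzner1_${stamp}.fileList.txt.bz2
```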
Archives >4G require splitting into multiple parts & providing the API with a tree checksum of the parts. This is a very nontrivial process, and most of our backups are >4G. Therefore, we use the tool glacier-cli, which does most of this tedious work for you.
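To sketch what the low-level API expects (glacier-cli handles all of this automatically; the file name is hypothetical and the sizes are shrunk to 1M parts for demonstration):

```shell
# stand-in for a large backup archive (3 MB here instead of >4 GB)
dd if=/dev/urandom of=backup.tar bs=1M count=3 2>/dev/null

# the raw Glacier API requires uploading fixed-size parts...
split --bytes=1M --numeric-suffixes backup.tar backup.tar.part_
ls backup.tar.part_*

# ...the parts must reassemble to the original byte-for-byte, and a
# SHA-256 "tree hash" over the parts must be supplied when completing
# the multipart upload; glacier-cli computes that hash for you
cat backup.tar.part_* | cmp - backup.tar && echo "parts match original"
```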
Install glacier-cli
If you don't already have glacier-cli installed (try executing `glacier.py`), you can install it as follows:
# install glacier-cli prereqs
yum install python-boto python2-iso8601 python-sqlalchemy

# install glacier-cli
mkdir -p /root/sandbox
cd /root/sandbox
git clone git://github.com/basak/glacier-cli.git
cd glacier-cli
chmod +x glacier.py
./glacier.py -h

# create symlink in $PATH
mkdir -p /root/bin
cd /root/bin
ln -s /root/sandbox/glacier-cli/glacier.py
Sync Vault Contents
The AWS console will show you your vaults, the number of archives in each, and the total size in bytes. It does *not* show you the archives within a vault (ie: their IDs, descriptions, & individual sizes). To get this, you have to pay (and wait ~4 hours) for an inventory job. glacier-cli keeps a local copy of this inventory data, but--if you haven't updated it recently--you should probably refresh it anyway. Here's how:
# set creds (check keepass for 'ose-backups-cron')
export AWS_ACCESS_KEY_ID='CHANGEME'
export AWS_SECRET_ACCESS_KEY='CHANGEME'

# query glacier to get an up-to-date inventory of the given vault (this will take ~4 hours to complete)
# note: to determine the vault name, it's best to check the aws console
glacier.py --region us-west-2 vault sync --max-age=0 --wait <vaultName>

# now list the contents of the vault
glacier.py --region us-west-2 archive list <vaultName>
Restore Archives
The glacier-cli tool uses the archive description as the file name. You cannot restore by the archive id using glacier-cli. Here's how to restore by the "name" of the archive:
# create tmp dir (make sure not to download big files into dirs that are themselves being backed-up daily!)
stamp=`date +%Y%m%d_%T`
tmpDir=/var/tmp/glacierRestore.$stamp
mkdir $tmpDir
chown root:root $tmpDir
chmod 0700 $tmpDir
pushd $tmpDir

# download the encrypted archive
time glacier.py --region us-west-2 archive retrieve --wait <vaultName> <archive1> <archive2> ...
The above command will take many hours to complete. When it does, the file(s) will be present in your cwd.
Decrypt Archive Contents
OSE's backup data holds very sensitive content (ie: passwords, logs, etc.), so it is encrypted before being uploaded to 3rd parties.
Use gpg and the 4K 'ose-backups-cron.key' keyfile (which can be found in keepass) to decrypt this data as follows:
Note: Depending on the version of `gpg` installed, you may need to omit the '--batch' option.
[root@hetzner2 glacierRestore]# gpg --batch --passphrase-file /root/backups/ose-backups-cron.key --output hetzner1_20170901-052001.fileList.txt.bz2 --decrypt hetzner1_20170901-052001.fileList.txt.bz2.gpg
gpg: AES encrypted data
gpg: encrypted with 1 passphrase
[root@hetzner2 glacierRestore]#
There should now be a decrypted file. You can extract it to view the contents using `tar`.
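For example (the archive name here is hypothetical, and a stand-in tarball is created so the commands can be demonstrated), a decrypted tarball can be listed and then extracted into a dedicated directory:

```shell
# make a stand-in "decrypted" tarball for demonstration
mkdir -p example/etc && echo 'config' > example/etc/app.conf
tar -czf hetzner1_restore.tar.gz example/

# list the contents without extracting
tar -tzf hetzner1_restore.tar.gz

# extract into a dedicated directory to avoid clobbering live files
mkdir -p restore
tar -xzf hetzner1_restore.tar.gz -C restore
```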
Delete from Glacier
It is unbelievably, jaw-droppingly difficult to delete files from Amazon Glacier.
It is not possible to delete a vault from Amazon Glacier using the AWS Console WUI. To delete something from Amazon Glacier, you must:
- Create API keys
- Install the AWS CLI tool
- Initiate an "inventory job" for the vault you'd like to delete using the AWS API
- Wait several days for the "inventory job" to complete (but don't wait too long or the job's output will be deleted)
- Download the inventory job's output using the AWS API
- Iterate through all the ArchivesIds in the inventory job's output and delete each one using the AWS API
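The last two steps might look something like the sketch below. This is unverified against a live account: the vault name and the `inventory.json` file are assumptions for illustration, a tiny stand-in inventory is generated locally, and the `aws` commands are echoed as a dry run rather than executed.

```shell
# stand-in for the inventory job's output, as saved from 'aws glacier get-job-output'
cat > inventory.json <<'EOF'
{"VaultARN": "arn:aws:glacier:us-west-2:123456789012:vaults/deleteMeIn2020",
 "ArchiveList": [
   {"ArchiveId": "example-archive-id-1", "Size": 1024},
   {"ArchiveId": "example-archive-id-2", "Size": 2048}
 ]}
EOF

vault='deleteMeIn2020'

# extract each ArchiveId with grep/cut (jq is cleaner, if installed:
#   jq -r '.ArchiveList[].ArchiveId' inventory.json)
for archiveId in $(grep -o '"ArchiveId": *"[^"]*"' inventory.json | cut -d'"' -f4); do
  # remove the leading 'echo' to actually issue the deletion
  echo aws glacier delete-archive --account-id - --vault-name "$vault" --archive-id "$archiveId"
done

# once Glacier's next inventory shows the vault empty, the vault itself can be deleted
echo aws glacier delete-vault --account-id - --vault-name "$vault"
```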
For more information, see:
- https://docs.aws.amazon.com/amazonglacier/latest/dev/using-aws-sdk.html
- https://gist.github.com/veuncent/ac21ae8131f24d3971a621fac0d95be5
deleteMeIn2020
In 2024, Marcin noticed that OSE was being charged $1.03 per month by Amazon Glacier. It took a long time digging through the AWS Console just to figure out that there was an old vault (named 'deleteMeIn2020'). The 'deleteMeIn2020' vault contained a backup of hetzner1 files. It was created just before we terminated the hetzner1 server's contract in 2018. The intention was for us to delete the 'deleteMeIn2020' vault in 2020 (2 years after it was created) if we hadn't needed to recover any files from it in the meantime. The vault held 285.3 GB (as of its last inventory).
For more logs documenting the herculean effort to delete this vault, see the following articles:
- CHG-2018-07-06_hetzner1_deprecation
- Maltfield_Log/2018_Q1#Sat_Mar_31.2C_2018
- Maltfield_Log/2024_Q4#Wed_Oct_02.2C_2024
- Maltfield_Log/2024_Q4#Fri_Oct_04.2C_2024
- Maltfield_Log/2024_Q4#Sun_Oct_06.2C_2024
Hetzner 1
On 2018-07-06, we deprecated our managed hosting hetzner1 server, replacing it with hetzner2, a dedicated server with root access that had more resources _and_ cost less per month.
All of the files from hetzner1 were uploaded to Glacier for safe long-term storage, in case they ever need to be recovered.