My approach to backups - Matt Marshall's Personal Site

Jack Baty and Derek Sivers each posted about how they handle backups and it prompted me to write down where I’m at with backups. I’ve been reflecting on my backup regime for a while now, with a view to making my system more robust and manageable.

Backups are important to me because I like to keep the majority of my files as close to the machine I’m using as possible. I don’t subscribe to, nor do I trust, various cloud service providers like Google Drive and Dropbox to keep my data intact or to keep it safe. This means that my live copies of documents, photos, home video, media, and literature all live on my local machines, devices, and external drives that I keep around me.

My approach to backup is that it should be based off of some sound basic principles and these should be applied to your individual situation and needs. There isn’t a one-size fits all product or solution, but there are some boxes to tick and some wisdom to apply to ensure you’re covering your bases.

The tl;dr of this post is that I currently lean on UNIX coreutils and programs such as rsync to the majority of my backups to external drives onsite, and I have plans to improve the system based on the principles outlined in the rest of the post and using some CLI tools one might not often think of for backups to achieve some more robustness.

Table of Contents

Some basic principles

While I work in technology and have Computing Science degrees, I have never worked as a sysadmin or managed a data centre. I’ve never been trained in industry standard backup practices. I mention this because there’s a real risk that the things I’m about to reference may be naive or outdated from a best practice point of view. If that’s the case, please get in touch! I’m very happy to learn (but also be aware that my backup needs are relatively humble).

The two most influential resources on my backup regime have been The Tao Of Backup and the 3-2-1 Backup Rule.

The Tao of Backup
3-2-1 Backup rule (there are lots of references to this across the web but this one is clear and came up first)

I first read the Tao of Backup when I was a teenager. I’m not sure how long the site has been in its current form but it’s not changed since then and the copyright notice is dated 1997. Ultimately, the site is selling some software known as Veracity, but the lessons in it are great. It’s a little cheesy, as the main content of the site is presented like a daoist parable about a novice sysadmin learning at the feet of an enlightened master. There are seven lessons in the Tao of Backup:

Coverage – backup all of your data (incl system and application files)
Frequency – backup at an appropriate frequency (usually daily as you work for most files)
Separation – take some backups offsite
History – keep some of your old backups
Testing – test your backups to ensure they’re fit for purpose
Security – keep your backups secure
Integrity – perform integrity checking on files and backups

That’s quite a lot to handle (and I definitely don’t do all of it) but the Tao of Backup is a resource I find myself returning to each time that I want to tweak or make my backup regime stronger. It’s definitely geared towards enterprise environments with lots of data — and is definitely shilling the author’s software solution — so likely overkill for my humble needs but I believe that the underlying principles are sound.

The 3-2-1 backup rule overlaps with the Tao a bit, and doesn’t go as far. It’s much more suited to home users or even those who are taking personal responsibility for backing up their work files. The rule goes as follows:

Maintain three copies of your data (original and two copies)
Use two different types of storage medium to enhance redundancy
Keep at least one copy offsite to enhance safety

I feel that the 3-2-1 rule is a great basis to begin a backup regime, and then start tweaking it and building on it to suit your needs.

Another resource I have found useful in thinking about my backup needs is Tony Florida’s blog post about creating automated daily, weekly, and monthly backups of data using some common tools available on *NIX platforms. Again, there is definitely overlap with the other philosophies — a good thing since it reinforces their messages — but doesn’t go much beyond creating the three copies of the data. It’s heavy on practice and lighter on theory, but it forms a great starting point to get a bare minimal backup system running.

Tony’s blog post on creating Daily, Weekly, and Monthly backups

Finally, while in an unrelated keyboard-related rabbit hole, I stumbled across Ben Vallack’s video on how he backs up his home data. It’s a bit Mac-centric and relies heavily on Time Machine to handle the History aspect, but introduces the concept of having an Archive Drive separate to the primary working drive which holds data that you’re keeping but not actively working on, and having bootable clones of your drives which you swap around between onsite and offsite locations on a weekly basis. Ben also gives a fairly accessible overview of the various risks which lead to the need for a robust backup regime.

Ben’s video on backups (Youtube) (alternative Invidious link)

So, this is where I’m coming from in terms seeking a good backup regime. I’m not looking to synthesise the above into a single unified approach but rather these resources are what I consider good places to start thinking about backups or analysing existing practices to spot holes or vulnerabilities.

What I’m backing up

I use two computers — a primary desktop machine and a laptop — and a smartphone which has some things like contacts and photos which need backing up.

My laptop doesn’t ever contain the the main or only copy of any important data, and in fact is fairly empty and is mostly used for writing things which exist in git repos across multiple machines, and watching media. At any given time, my smartphone contains my contacts and some photos I’ve taken but am yet to sync to my main machine but that’s about it. I also have a separate USB drive dedicated to media storage but I’m not considering backups to that just yet.

This leaves my main desktop computer as the chief concern for backups with some additional light considerations for my smartphone necessary to cover it. I am the sole user of my computer and I store everything in my user’s Home Folder. This means I’m backing up the contents of a single folder which is nice and convenient.

As of drafting this blog post my home folder clocks in at 239GiB. Around 35GiB of that can be blamed on my downloads folder, the bulk of which is split between large sets of open data I download and work with for my job and also youtube-dl which I use to watch YouTube videos locally on my machine and later archive on a USB drive. I’m content that I don’t particularly want or need to back up my downloads, so that leaves just over 200GiB of files left over in my home drive.

For the curious, my home drive is so big largely to the combination of local collections of audio (music, but podcast and audiobook archives take up the most space) and literature (mostly comic books by file size but I keep ebooks and academic papers here too). Photos and home video occupy a few GiB, but the bulk of the weight is made up by audio and comic book files. I could relegate these to storage on the media drive and in fact they’re backed up there too, but I like having them close to hand in my home folder so I can access them quickly and percolate them through my backups.

Not covered in the above is the fact that I take special care to back up my PGP keys, since I use them every day for accessing various files and passwords. PGP keys will also feature in one of my backup strategies, so warrant their own dedicated backup system.

In the future I will also be in charge of backing up the household’s data (we are expecting a deluge of baby photos to be the household’s main product soon), but for now I am mostly working on getting a good handle on my personal machines so that I can expand out sustainably.

What I’ve got in place

This section goes into details about what I’ve got in place to back up the things covered in the previous section.

Phone (Contacts and Photos)

For my phone contacts I have a DAV server set up via a Baïkal install at my web hosting provider which handles sync and backup for my partner and I. This was originally set up so that I could live with an Un-Googled phone and has worked a treat as a low-effort offsite backup for some niche data. I use the DAVx5 app on Android to handle the syncing, which it does pretty much by itself. For key contacts we also have a paper address book in our home.

Baïkal
DAVx5 (F-Droid link)

I don’t take a lot of photos on my phone that I want to keep, although with a baby due any moment now I imagine that will change soon. About once or twice a quarter I connect my phone to my machine via USB cable and use adb to pull everything off into a working directory. I then use a jhead command¹ to read the EXIF data on the image and move it to the appropriate folder under ~/images/photos based on the date. The same goes for any videos on the phone, although they are much rarer.

Once these are on my main machine, they’re integrated into the main backup routine, such as it stands. I definitely need to be better about grabbing the media from my phone as I have been bitten by that multiple times in the past during my PhD. At the time that caused research data loss which was bad enough, but I’d hate to lost a quarter’s worth of family photos because I was too silly to grab them on a Friday afternoon.

Computer Home Folder

At the moment I have a very simple but a relatively fragile backup regime in place for my local machine. After all the elaborating on principles I did earlier in the post, I am basically doing the bare minimum. I am looking to make this much more robust over the coming year, however.

My backup regime currently consists of having a 2.5 inch, 2TB capacity, USB hard drive permanently plugged into my computer to store daily, weekly, and monthly backups. It is LUKS-encrypted and I am prompted for the passphrase by udiskie whenever I log in and start my graphical environment. Once it is successfully unlocked, I can run my backup scripts.

I invoke the scripts via a dmenu script which allows me to run the backups manually. I would like to automate them via cron a la Tony Florida’s approach, however I have some limitations:

While I am on my computer almost daily it can be at inconsistent times, and sometimes I am not on it for days
I like to use notify-send to tell me when a backup has started or finished and whether it was successful, and cron can’t do that
I might type my LUKS key wrong, and the drive will fail to unlock. Therefore the script would fail, without notifications. I am likely to forget to check logs, so these don’t help.

I settled on dmenu as a tool which was very light, easy to integrate into scripts, and a pleasure to use. As Anna Havron often says; use tools you love. It’s a pleasure to open my graphical environment, hit a keybinding, and then select which type of backup I want to run. It makes it much more likely that I’ll keep on top of it, although it is a little bit fragile.

My daily backup script basically runs an rsync command² to sync my home folder to the “daily” folder on the backup drive. I exclude the downloads folder as well as .cache, since these are not really critical at all, take time to sync, and take up space on the drive. The first time I ran the daily backup it took a little while with over 200GiB of files to sync across but subsequent runs are quite fast. They take around 2 or 3 minutes unless I’ve added some home video files.

The weekly backup script then uses rsync again, this time to sync the daily folder with the “weekly” one on the backup drive. I don’t need to add the exclusions this time since they’ve already been handled by the daily script. I run this script on a Friday morning usually. This means that I have a week’s leeway of having deleted an important file to catch it and restore it before it gets removed from the weekly copy.

The monthly backup script uses tar and gzip to create a compressed tarball of the daily folder, names it for the date, and dumps it into the “monthly” folder. Prior to having this invoked via dmenu, I was doing this manually on another drive on an intermittent basis. I actually need to consolidate these at some point, because my 2TB drive won’t store a very large archive of 200GiB tarballs. Side note: my compressed tarballs aren’t much smaller than my home folder, I assume that is because a lot of the content is audio and images (comic books) so they don’t compress well with gzip but if I could be doing something else or better to compress them, please reach out!

That’s it, really. I really need to make the system more robust since at the moment I am depending on 2TB of co-located spinning rust to take care of each daily, weekly, and monthly backups. I’m happy with it as a quick-and-dirty home backup solution to provide onsite copies of my Home folder if my internal drive fails but it won’t stand up to disk failure or physical theft. Thankfully, LUKS offers a modicum of security protection against data theft but I would still have lost the backup copies.

PGP and SSH keys

These are both mission-critical and sensitive, so I take extra care to store these securely.

I keep copies of these on USB sticks (keys, pens, thumb drives etc.). The first copy is stored offsite at a non-technical family member’s house and the entire USB drive is encrypted with LUKS. The second copy is stored on a LUKS-encrypted partition on a USB stick I carry with me attached to my physical keys. There is also a third copy on another LUKS-encrypted drive, stored in a document safe next to instructions for how my family members can deal with my computers if I die (the passphrase to this specific drive is stored separately with another family member).

Misc offsite backups via git

I am a heavy user of git for projects and for personal use. I have an account at Gitlab.com where I have all of my public repos as well as some private ones. Currently, that means that Gitlab.com serves as an additional offsite backups of some personal materials. This includes, but is not limited to:

My configuration file (“dotfiles”)
My passwords (encrypted via PGP)
My writing (creative, household, and academic)
My Plaintext Planner

I am growing a bit more skeptical of Gitlab, and Gitlab.com, but this serves for now. All of my repos are also stored on my personal machine so they are integrated into the backup routine and it’s nice to have an offsite backup for them which I’m not personally maintaining.

Roadmap

My current practices are not perfect. I do not have total coverage, I only back up my personal files and in the event of my main machine melting I would need to spend time setting up my OS and environment again. Luckily, my dotfiles (backed up offsite) ameliorate this a little but not conveniently.

I currently don’t do enough to back up my phone which is odd since I have been bitten multiple times by phones failing. I also over-rely on USB spinning rust drives not failing, as they contain each the daily and weekly copies of my home directory on a single drive. I don’t have any offsite backups for these.

This section continues by proposing some interventions and things I can start doing to improve my backup system.

More Offsite backups

I want to establish some better offsite backups. I plan to address this in two phases: first establish an emergency quick-and-dirty remote backup for ease; secondly establish more long-term offsite archival. Once both are established, I will be closer to having achieved the Separation aspect of the Tao of Backup.

Revisiting the 3-2-1 methodology, it recommends storing the backups across two different storage mediums. Drives fail, that’s just a fact. I’m not sure whether nVme is classed as different medium to spinning rust USB drives but I’m not going to take the chance relying on them for everything; and especially not for long-term archival of data.

For that reason my quick-and-dirty approach will be to take advantage of my web hosting provider as they provide unlimited file storage for sites. I used to run a Nextcloud instance to take advantage of this but I fell out with Nextcloud and I don’t enjoy maintaining it. Instead, I will be uploading encrypted copies of my monthly archives to a non-public area of my hosted space. I expect that over time, Dreamhost may warn me or get me on fair use for this but I think it will be a valid short-term solution to achieve a remote backup of important files.

I am envisioning chaining together some coreutils to create some PGP-encrypted archives which are then split into 100MiB chunks, which can then be uploaded via SFTP to my hosting provider. I will work out the exact details soon, but off the top of my head I was thinking something like the following would get me towards where I want to be:

tar -czvf - $HOME | gpg -r $MY_EMAIL --encrypt | split --bytes 100M --numeric-suffixes - $(date +%Y-%m-%d).tar.gz.gpg.³

The idea is that the PGP encryption offers me some protection against snooping on private files by my web host. There’s nothing untoward in my files, but I don’t trust corps at all and they own the servers I am putting my files on. When it’s personal data, I want it to be encrypted to be protected against theft. I had the idea of using split to split the archive up into pieces so that it was more straightforward to upload via SFTP – if I am interrupted then I can resume later from where I left off rather than have a 200GiB archive partially uploaded.

If this approach doesn’t work, or if it becomes ridiculously cumbersome, I will look to pay for some dedicated remote storage to store copies of my monthly tarballs. Jack Baty noted that he spun up a 5TB storage box on a cloud storage provider which costs around £10 per month. £120/year is more than we spend on a lot of things and with a baby due any moment, this will be something we have to decide upon together, but it’s not bank-breaking for us (a lucky position to be in we know).

For a more long-term archival solution, I have been looking into Blu-Ray. Optical is decent enough to get me what I need from archival, without the need to look into archival quality media. I fantasise about having a tape drive but they’re noisy and very slow so I don’t think that they’re appropriate for my needs. Unfortunately, with a 200GiB Home folder and each Blu-Ray disc storing only around 25GiB (25GB? I forget how optical is labelled) I’m going to need between 8 and 10 discs per monthly backup. That’ll mount up over time, so maybe I need to manage my expectations there. In any case, I’ve got a provisional “Yepp that’s ok” from a relative about storing these discs at their house for an offsite solution.

In the case of the optical backups, I imagine my tarball to split pipeline might be useful for spreading a single encrypted archive over multiple discs. If it works out, that is. I imagine restoring from these backups might not be fun if I have to cat 200GiB of tarball together and then decrypt it all.

Backing up contacts from my phone

I want to start percolating my contacts from my phone through my backup system, even in the short-term. This will involve taking explicit backups of the Baïkal server as well as having plaintext backups of the data to store elsewhere in the system. The AOSP Contacts app which I use on LineageOS can export contacts in a VCard format, however I have read that it is very idiosyncratic and thus is only really useful for restoring to another instance of the same app.

I’ll see if there’s any other solutions, but that may have to be the approach in the short term.

Easier photo/video backups from phone

I need to take photos and videos from my phone more often than I do. I’m still working out the details of this but I figured that I could collect the adb and jhead commands into a script that I can call quickly from dmenu. There may be some thing on the side of my phone I can do, via Termux and rsync, however the fewer moving parts the better for me as then there’s less to maintain. USB cables haven’t failed me thus far!

Better onsite archival storage

There’s only so far that my little 2.5 inch USB drives can get me in terms of storing my monthly backups. In an ideal world I’d have some form of expandable storage, and the obvious solution is to buy and set up a NAS with some enterprise drives.

This would be an incredibly fun project and I’d love putting it together however I have two concerns: it’s another piece of infrastructure to maintain; and it’s more kWh demand for my bills.

The first concern is that I have never maintained such infrastructure before, even on small home scale. There’s RAID to consider, ZFS is a thing people seemingly love but I have no idea how it works, and drives need mirroring and replacing. The other concern is that with Britain in the thrall of a pro-capitalist and anti-worker government, every kWh costs an extortionate amount money (in context with the profits of these companies) and I don’t want to drastically increase my fuel bills and line the pockets of capitalists (although I will concede my provider is not too voracious).

I think my initial attempt at a middle ground is that I might try out a 3.5 inch USB dock. It looks like these usually hold around two drives, so I reckon I could reasonably expect to get two 16TB enterprise drives and have these set up to be mirrored to protect against failure of a single drive. This 16TB would then act as my daily, weekly, and monthly archival storage taking the place of the little 2TB drive I use now. There are also 5 bay devices which look more robust, but that means more drives, more complexity, and more kWh used.

Backup system drive

I also need to back up my system data so that I can recover from disaster more easily. My dotfiles and install scripts get me some of the way, but it’d be much better to just be able to boot from secondary drive which was up to date.

This is a tough one for me, as I’m not sure where my line is drawn. Part of me just wants to get a second nVme drive for my machine and dd my main drive to it every day, but I think that would chew up resources and issue way too many writes to the drive to keep it healthy for long.

Once I have my larger-capacity archival drives set up for data backup, I could dd to the 2TB USB drive I’m using for my daily backups at the moment. Again, that chews up resources. I could theoretically do it overnight but I don’t like leaving my machine on between uses since it eats kWh and it feels wasteful. I’ll ponder this. It might be worth doing a a monthly system image backup and then using rsync in the other direction to restore my files from the daily backup, if it becomes necessary.

Integrity checking

The above gets me nicely over the threshold of the 3-2-1 rule and covers the first parts of the Tao of Backup. If I do the above, I will have backed up all of my data including system files and spread it across several media with several offsite copies.

The thing that is missing is integrity checking. This is a blank spot for me to be honest as I am unsure as to what tools I’d be using and how to go about starting this and maintaining it. It’s something for the far-off horizon once I’ve actually managed to achieve what I’ve set out to here.

Summary

Inspired by other tech bloggy folk online, I have reflected on what I believe makes a good backup system, analysed my current practices and elaborated on tools I use to achieve the backups, as well as outlined some thoughts for improving on areas where I’m lacking.

It looks like my pocket money is going towards storage for some time…

jhead -n$HOME/images/photos/%Y/%m/%d/%f *.jpg ↩︎
rsync -av --delete --exclude='downloads' --exclude='.cache' "$HOME/" "$BACKUP_DRIVE/daily/ ↩︎
Update 2024-10: I’ve modified this command to remove a plaintext instance of my email address, as a SPAM-preventative ↩︎