It was World Backup Day a few days ago, and for once Slashdot managed to warn us about it ahead of time. Well done to them, and I hope it’s a policy they apply for future events. (For a moment I entertained the smug fantasy that it was in response to my previous rant, but given that that post’s only had 11 hits in the last year, it seems unlikely. :p) This time, it’s me who’s posting about it late.
Backups protect against a variety of calamities. From the most common to the most catastrophic, some of them are:
- Human error resulting in a deleted or incorrectly modified file. Simply keeping a copy of important files in a different folder, or using a local version control system, is an easy and effective way to mitigate this.
- Physical hard drive failure, or partition/file system failure. Just keep copies on another drive, or image the whole drive, or use RAID.
- The loss of the whole computer; for instance, by a serious power spike, theft, or misplacement. Regularly copy the files to removable media. This can be automated if you have a local network.
- Destruction of the immediate physical environment, such as one’s house burning down. Keeping backups on removable media on the other side of the room is not sufficient to protect against this. One simple plan is to keep a backup at work or at a friend’s house.
- The loss of a regional location, which could be caused by a tsunami, asteroid strike, or more commonly, network disconnectivity. Global distribution protects against this, and is not that difficult using some of the online providers.
- One’s planet blowing up. As yet, there is no practical protection against this, but NASA are working on it.
How is my personal backup situation?
For most of my files, I’m reasonably well protected against the first three. But that was my level of data safety in November 2010, so I am acutely aware of its inadequacy.
My source code is hosted on Google Code, and my email is on Gmail. Most of my writing output is on this blog, but I have a few personal documents in Subversion, which is not externally hosted anywhere.
Dropbox (or a similar service) is a simple system that covers smaller items for most people, with 2 GB of free storage. I’ve set it up for my father and it works seamlessly. In my case I have a little too much in the way of smaller items, so I’ve not used it myself. I’m also averse to running background tasks on my machine, and prefer to have a bit more control over the backup process.
For larger items, such as games, music, my PostgreSQL databases, and Subversion backups, I have a removable HDD. I wrote a Python program to copy specified folders onto it. For each backup of a target folder, a new directory is made, named after the current date; it completely reflects the file structure on the computer. New files are copied in, but files that are unchanged from a previous backup are reused (by means of NTFS hard links) to save space. (The program is a bit of a mess, but it mostly works, and does use a few interesting optimisations such as hard links, directory symlinks, and the NTFS journal for change tracking. I’ve been working on a couple of posts about those, but they’ve been in draft for over a year! Further procrastination is likely, though perhaps mentioning it here will impel me to do something about them.)
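The actual program isn’t shown here, but the dated-directory-with-hard-links idea can be sketched in a few lines of Python. This is a minimal illustration, not the real tool: it assumes that matching size and modification time mean a file is unchanged, and it ignores the symlink and NTFS-journal optimisations mentioned above.

```python
import os
import shutil
from datetime import date
from pathlib import Path

def backup(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new dated directory under `backup_root`,
    hard-linking files that are unchanged since the previous backup."""
    # The most recent backup, if any, sorted by dated directory name.
    previous = max((d for d in backup_root.iterdir() if d.is_dir()),
                   default=None)
    dest = backup_root / date.today().isoformat()
    dest.mkdir(parents=True)
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (dest / rel).mkdir(exist_ok=True)
        for name in filenames:
            src_file = Path(dirpath) / name
            new_file = dest / rel / name
            old_file = previous / rel / name if previous else None
            if (old_file is not None and old_file.is_file()
                    and old_file.stat().st_size == src_file.stat().st_size
                    and old_file.stat().st_mtime == src_file.stat().st_mtime):
                # Unchanged: reuse the previous copy via a hard link,
                # so the file's data is stored only once on the drive.
                os.link(old_file, new_file)
            else:
                shutil.copy2(src_file, new_file)  # copy2 preserves mtime
    return dest
```

Because each dated directory reflects the full file tree, any backup can be browsed or restored directly, while unchanged files cost almost no extra space.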
This is a general improvement, but it is not yet enough.
- It’s neither regular nor automated. Important files should be copied daily. Currently I have to remember to find my HDD, plug it in, and run the script. In the medium term I’m hoping to address that by building a separate server.
- It doesn’t cover everything. My system partition, with various unsorted files, work in progress, program settings, etc., is not backed up. I find the Windows 7 backup to be irritatingly opaque and annoyingly slow.
- It’s not remotely hosted. I would like to achieve level 4 on the above hierarchy, at a minimum.
By next World Backup Day, I should have some kind of NAS server to address points 1 and 2. The amount of data to back up, versus the expense of network traffic, means that point 3 is likely to be addressed by a combination of network transfers for small or important items, and physical shipping (such as by keeping a few external HDDs) for larger items. Some large items are less than crucial (usually because they can be re-obtained with some effort), which can reduce the burden of backing them up.
But we need an organised regime for classifying files into crucial data that needs to be backed up often and remotely, less crucial data that should be backed up occasionally, and data that can be ignored. And policies for versioning and file organisation: in general I like the backup to reflect a snapshot of the file system at the time, but things like version control systems and databases have their own backup formats, which should be used (PostgreSQL databases are not generally portable between architectures and versions, whereas their dump formats are).
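Such a classification regime could start as nothing more than a mapping from top-level folders to backup tiers. The folder names and tier labels below are hypothetical placeholders for illustration, not my actual layout:

```python
from pathlib import PurePath

# Hypothetical tiers and folder assignments, purely illustrative.
POLICIES = {
    "crucial":    ["documents", "source"],   # backed up often and remotely
    "occasional": ["music", "games"],        # backed up occasionally, locally
    "ignore":     ["downloads", "cache"],    # not backed up at all
}

def classify(path: str) -> str:
    """Return the backup tier for a path, based on its top-level folder."""
    parts = PurePath(path).parts
    top = parts[0].lower() if parts else ""
    for tier, folders in POLICIES.items():
        if top in folders:
            return tier
    # Unknown folders default to being backed up, but not aggressively.
    return "occasional"
```

A backup script can then consult this mapping to decide how often each folder is copied and whether it is shipped off-site.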
There is one more thing needed to complete the backup approach: the backups should be checked for restorability. In practice, that check usually happens only when a file is inadvertently deleted or a house is inadvertently burned down. The backup system could at least periodically run a checksum on the data. What if it doesn’t match, and there is no copy on the original system? Well, then it’s time to start thinking about redundancy in the backup system.
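A minimal sketch of such a periodic check: write a manifest of SHA-256 checksums alongside each backup, and later verify every file against it. The manifest filename is my invention, not part of any existing tool.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical name for the checksum file

def write_manifest(backup_dir: Path) -> Path:
    """Record a SHA-256 checksum for every file in the backup."""
    sums = {}
    for f in sorted(backup_dir.rglob("*")):
        if f.is_file() and f.name != MANIFEST:
            rel = str(f.relative_to(backup_dir))
            sums[rel] = hashlib.sha256(f.read_bytes()).hexdigest()
    manifest = backup_dir / MANIFEST
    manifest.write_text(json.dumps(sums, indent=2))
    return manifest

def verify(backup_dir: Path) -> list:
    """Return the relative paths of files that are missing or corrupted."""
    sums = json.loads((backup_dir / MANIFEST).read_text())
    bad = []
    for rel, expected in sums.items():
        f = backup_dir / rel
        if not f.is_file() or hashlib.sha256(f.read_bytes()).hexdigest() != expected:
            bad.append(rel)
    return bad
```

If `verify` reports corrupted files and the originals are gone too, that is exactly the point at which a second, redundant backup earns its keep.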