Simple but efficient backups

Backups are perennially on my To-do list. Just having them would be good, but more than that I want to automate their creation and maintenance.

My problem, of course, is I prefer to make my own system for doing this rather than use someone else’s. This doesn’t always work out. One of my early attempts was coming along great, and I decided to test it — on its own source directory. It was soon apparent that I had transposed the from and to arguments to the copy function, as the backup program’s source files had all been truncated to zero length. Karmic neatness aside, there is a lesson there: always make a good backup before testing a backup program for the first time.

After talking with Kris about his script for backups in Unix, I started investigating whether a similar job could be done on Windows with NTFS. The reason is not so much that I want to learn the Windows API, though there is no harm in that (in moderation, at least ;-)). The real problem is that I use Linux and Windows approximately equally at home, with files scattered over both OSs and three or more machines. I want a backup system that consistently looks after all my files.

My requirements are pretty basic:

  1. Run from the command line.
  2. Make simple copies of files, reflecting the original directory structure.
  3. Try to reuse files from earlier backups if they are unchanged.

So over the last 18 months I’ve been gradually completing the Python program that I’d salvaged from email. At this point it basically meets the above requirements (notwithstanding bugs, design flaws and the occasional cosmic ray). I originally wrote it for Windows, using the win32 extensions to read the USN Journal and create links to existing items. It now works on Unix too (except for the Journal functionality).

Links in Unix are easy, because the system calls are in the standard library (link and symlink) and there is a well-known and easy to use command line program (ln) for creating them.
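As a small illustration of how little code this takes from Python on Unix, here is a sketch that creates one hard link and one symlink to the same file. The function name make_links and the file names are made up for the example; os.link and os.symlink are the standard library wrappers around the system calls mentioned above.

```python
import os
import tempfile


def make_links(directory):
    """Create a file, a hard link to it and a symlink to it.

    Returns the three paths. The names are arbitrary, chosen for the demo.
    """
    original = os.path.join(directory, "original.txt")
    hard = os.path.join(directory, "hard.txt")
    soft = os.path.join(directory, "soft.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, hard)     # a second directory entry for the same data
    os.symlink(original, soft)  # a separate entry that only names the target
    return original, hard, soft


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        original, hard, soft = make_links(d)
        print("same file:", os.path.samefile(original, hard))
        print("is symlink:", os.path.islink(soft))
        print("link count:", os.stat(original).st_nlink)
```

The hard link shares the original's inode, so the data survives until the last name is removed; the symlink simply dangles if the original goes away.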

Links in Windows have been struggling to crawl out of the primordial soup, from non-existent, to a hack, to an afterthought, to an experts-only feature, to something that’s slightly tricky to use but mostly works like it does on Unix except not quite. NTFS is a curiously powerful file system and supports a lot of things which Windows itself has not always exposed. Hard links are one example. Junction points are another. An entry in a directory does not have to point to a file or other directory; it can contain special instructions to tell NTFS to do something else instead when it is accessed. In practice the most obvious junction point action has been to operate as a symlink.

Junction points could be used in Windows XP at least, and actual symlinks themselves were added more recently. As of Windows 7, there is a mklink command which can create hardlinks, symlinks, and directory junction points. Historically there was a serious danger with the use of junction points as symlinks, because the rest of Windows transparently treated them as directories: this is great when you want to access their contents, but not so great when you delete a symlink and find that Windows Explorer has dutifully deleted everything in the directory that the link pointed to.

I have run a quick test and discovered that neither junction points nor symlinks are recursively deleted by Windows Explorer, so link support in Windows is now almost complete. The one outstanding weakness is likely to be that Windows programs may assume that every directory entry is a real directory, and hence perform recursive operations on symlinked directories without allowing for the same contents being reachable by more than one path.
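A program can protect itself from that assumption by checking each entry before descending into it. The following is a minimal sketch, not taken from my backup program: walk_real_entries is a hypothetical helper that reports symlinks without ever following them. One caveat I am fairly sure of: Python does not report Windows junction points as symlinks, and a separate os.path.isjunction check only arrived in Python 3.12, so junctions would need extra handling.

```python
import os
import tempfile


def walk_real_entries(root):
    """Yield (path, kind) for every entry under root without entering links.

    kind is "file", "dir" or "link". A symlinked directory is reported as a
    link and is never descended into, so its contents are visited only once,
    via their real location.
    """
    with os.scandir(root) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_symlink():
                yield entry.path, "link"
            elif entry.is_dir(follow_symlinks=False):
                yield entry.path, "dir"
                yield from walk_real_entries(entry.path)
            else:
                yield entry.path, "file"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "real"))
        open(os.path.join(root, "real", "file.txt"), "w").close()
        os.symlink(os.path.join(root, "real"), os.path.join(root, "alias"))
        for path, kind in walk_real_entries(root):
            print(kind, os.path.relpath(path, root))
```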

My backup program is still very unpolished, but it can be used like this:

backup.py C:\games H:\backup\games 20120804

This will create a copy of C:\games at H:\backup\games\20120804. If other backups created by the program exist in H:\backup\games, their data will be reused where possible.

The algorithm is simple:

  1. Read from the command line a Source directory, a Target directory, and a Name for the backup within the target (typically the current date).
  2. Optionally collect a list of changed and affected files and directories from the Journal.
  3. Back up the Source item into Target/Name:
    1. If the item is a file, and is unchanged from last time or has the same MD5 as a previously backed up file, create a hardlink to the previous one.
    2. If the item is a directory and is unchanged, make a symlink to the previously backed up directory.
    3. Otherwise, copy the file or back up the directory, recursively from step 3.
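To make the steps concrete, here is a stripped-down sketch of the core loop. It is not the actual source of my program: the names md5_of, index_backup and backup_item are invented for the example, the Journal step is left out, unchanged directories are always recursed into rather than symlinked, and symlinks in the source are simply followed.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_backup(target):
    """Map MD5 -> path for every regular file already under target."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(target):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                index.setdefault(md5_of(path), path)
    return index


def backup_item(source, dest, index):
    """Back up source to dest, hard-linking files whose MD5 is in index."""
    if os.path.isdir(source):
        os.makedirs(dest, exist_ok=True)
        for name in sorted(os.listdir(source)):
            backup_item(os.path.join(source, name),
                        os.path.join(dest, name), index)
        return
    digest = md5_of(source)
    previous = index.get(digest)
    if previous is not None:
        os.link(previous, dest)      # reuse the data from an earlier backup
    else:
        shutil.copy2(source, dest)   # new or changed file: real copy
        index[digest] = dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src")
        os.mkdir(src)
        with open(os.path.join(src, "a.txt"), "w") as f:
            f.write("alpha")
        target = os.path.join(d, "target")
        for name in ("20120804", "20120805"):
            backup_item(src, os.path.join(target, name), index_backup(target))
        print(os.stat(os.path.join(target, "20120804", "a.txt")).st_nlink)
```

Building the index by hashing every old backup on each run is the slow, obvious approach; a real tool would want to persist the index or lean on the Journal instead.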

Backing up a file is actually implemented as a check that an old backup can be safely reused, which falls back to a simple copy if it can’t. This step could be optimised.
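In outline, that check-then-fall-back shape looks something like the sketch below. Again this is an illustration rather than my real code: reuse_or_copy is a hypothetical name, and the byte-for-byte comparison with filecmp stands in for whatever test (Journal, MD5) decides that the old copy is still good.

```python
import filecmp
import os
import shutil
import tempfile


def reuse_or_copy(source, previous, dest):
    """Hard-link dest to previous if that is safe, otherwise copy source.

    previous is the candidate file from an earlier backup, or None.
    Returns "linked" or "copied" so the caller can log what happened.
    """
    if previous is not None and os.path.isfile(previous):
        # Assumption for this sketch: "safe" means identical contents.
        if filecmp.cmp(source, previous, shallow=False):
            try:
                os.link(previous, dest)
                return "linked"
            except OSError:
                # For example a different volume, or no hard link support.
                pass
    shutil.copy2(source, dest)
    return "copied"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        source = os.path.join(d, "source.txt")
        old = os.path.join(d, "old.txt")
        with open(source, "w") as f:
            f.write("data")
        shutil.copy2(source, old)
        print(reuse_or_copy(source, old, os.path.join(d, "new.txt")))
```

Comparing whole files is exactly the kind of work the optimisation would try to avoid, which is why the timestamp idea further down is tempting.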

My other problem, when it comes to projects, is that I’m better at starting them than finishing them! But I can at least throw out some ideas on what still needs to be done:

  1. Make the command line more powerful, supporting options to enable the Journal, directory symlinks, verbosity, etc.
  2. Use the exclusions file.
  3. Improve robustness: warn if an item can’t be copied.
  4. Use volume shadow copies in Windows — which means more Windows system calls to learn.
  5. Log more consistently, and optionally keep the log in the backup.
  6. Preserve timestamps and other meta-data.
  7. Optionally use timestamps to detect that a file is unchanged (rather than having to painstakingly compute the MD5 for it).
  8. Add the option to compress the backup (one idea is that some source directories would be marked as compressible, and all the items in them go into a single ZIP file).
  9. Don’t back up the contents of symlinks.
  10. Implement a GUI mode for the program.
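Items 6 and 7 go together: if the backup preserves modification times, a later run can compare size and timestamp instead of hashing. A rough sketch of that check follows; probably_unchanged and its two-second tolerance are my own assumptions for illustration (the tolerance allows for file systems with coarse timestamps), and shutil.copy2 is what keeps the timestamp on the copy.

```python
import os
import shutil
import tempfile


def probably_unchanged(source, previous, tolerance=2.0):
    """Cheap test: same size and (nearly) the same modification time.

    Returns False if either file cannot be examined. This can be fooled by
    an edit that keeps both size and timestamp, which MD5 would catch.
    """
    try:
        a = os.stat(source)
        b = os.stat(previous)
    except OSError:
        return False
    return a.st_size == b.st_size and abs(a.st_mtime - b.st_mtime) <= tolerance


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        source = os.path.join(d, "source.txt")
        backup = os.path.join(d, "backup.txt")
        with open(source, "w") as f:
            f.write("data")
        shutil.copy2(source, backup)  # copy2 carries the mtime across
        print(probably_unchanged(source, backup))
```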

And of course, set it up to run regularly, instead of on the sporadic occasions when I remember to. This will likely be a goal of my personal NAS project, due to be commenced any day now. I seem to have promised to have one by next World Backup Day.

Anyway, the source is on Google Code.
