Date: 2008-08-30 21:53:00
backup solution ideas

Inspired by Git's content-indexed repository storage format, I've designed a new backup program that solves a few problems that have been bugging me. First, let me rewind a bit.

I still vividly remember the first time I accidentally overwrote an important source file. I was about 14 years old, writing an Apple ][ word processor for visually impaired people in 6502 assembly language. The program was too big to edit as a single source file, so I split it into two halves, and one day overwrote my only copy of the second half with the first half! I managed to mostly recover it by using a disassembler, but I learned an important lesson that day. It didn't properly sink in though, because I learned that very same lesson no fewer than three more times after that (maybe I'll tell those stories some day too).

I've tried many different backup solutions over the years. I've owned at least three different types of tape drives (for which I may still have tapes, but not the drives), stacks of burned CDs, several external hard drives, etc. More recently, I've tried rsync, rdiff-backup, duplicity, and my own implementation of a program that mirrors files to Amazon S3.

Each of these solutions, while sophisticated and efficient in their own ways, has one important disadvantage for me: the same file that happens to reside on more than one computer gets stored more than once in the backups. If I move stuff around, between computers or even between directories, this causes a fair bit of noise in the backup volume. The set of bits is the same; why store them more than once?

My current idea is to store file data based on the SHA-1 hash of the file contents. There is an area called the "blob storage" whose only purpose is to store piles of bits named after their hash value. For example, the file named 0dc6832636f48e24a1423d02175e6023d76810bd might contain a valuable source file. The other storage area contains associations between filenames and blobs, for example:
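A references file under this scheme might look something like the following (the exact layout is hypothetical, and the second hash is made up for illustration):

```
src/hello.c  0dc6832636f48e24a1423d02175e6023d76810bd
src/hello.h  a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
```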


With this arrangement, if I moved the hello.* files from src/hello.* to src/hello/hello.*, the blobs would not need to change, only the references. Obviously this is of greater benefit for larger files. And, if I have the same files on a different computer, and I backed up that other computer to the same storage area, the blobs would still remain the same and the other computer's references file would point to them too. A couple of auxiliary commands are needed to (a) check the integrity of the whole storage area, to ensure that all referenced blobs exist, and (b) to clean out blobs that are no longer referenced.
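The core mechanic is tiny. Here is a minimal sketch of content-addressed blob storage in Python, assuming a local blob directory (shaback itself targets S3, so its actual code will differ):

```python
import hashlib
import os

def store_blob(data, blob_dir):
    """Store data under its SHA-1 name; return the blob name.

    A sketch of the content-addressed idea: identical content is
    stored exactly once, no matter how many filenames reference it.
    """
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(blob_dir, name)
    if not os.path.exists(path):  # already stored? then this is a no-op
        with open(path, "wb") as f:
            f.write(data)
    return name
```

Storing the same bytes a second time is a no-op, which is exactly what makes moves and cross-machine duplicates free: only the references change.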

I'm sure this isn't a new idea, for example I know that corporate email systems will generally store only one copy of an attachment in their data store, no matter how many recipients there are. When the last recipient deletes the message, only then is the attachment data deleted. This is just that same idea applied to backups.

There is room for improvement, too. Currently I compress all the data before it's put in the blob storage, and I'm going to add an encryption option. A significant improvement might come from the ability to create a new type of blob, which doesn't contain the whole file but only the differences from some other blob. Not sure whether I've got all the details sorted for that yet though. Come to think of it, it might be possible to use the librsync algorithm in combination with an Amazon EC2 instance to efficiently compute the differences.
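The compression step can sit in front of the blob store. A sketch with zlib, assuming blobs are named by the hash of the *uncompressed* content so that recompression never changes a blob's identity (an assumption; shaback might hash the compressed bytes instead):

```python
import hashlib
import zlib

def pack_blob(data):
    # Name by the hash of the original bytes; store the compressed form.
    return hashlib.sha1(data).hexdigest(), zlib.compress(data)

def unpack_blob(packed):
    # Restore requires nothing but decompression -- no delta application.
    return zlib.decompress(packed)
```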

I have started to implement this idea in a program called shaback, whose name could represent either "SHAred BACKup" or "SHA-1 BACKup". It is implemented in Python, depends on my s3lib library, and uses Amazon S3 as the storage backend. I've published what I've got so far on github - please clone and hack on it if you feel so inclined.

That's pretty cool! My information isn't quite that important though.
You could make those sha1s accessible on a fileserver, and then write a small CGI to upload and download them. Properly secured by an authtoken that is some kind of hash combination of user credentials and token expiration time. Or something....

(note for other readers who are puzzled: this is an inside joke)
I'm sure I don't know what you're talking about!

Actually, I think I had forgotten how that all worked. You will too. :)
I claim amnesia.

This is cool enough to motivate me enough to finally learn Python.

One thing you could do is to divide the files into blocks, and store the blocks in the blobs instead of the whole file. It would take a little more work, but if you've got a lot of files with redundant content, it could save some space, depending on how you slice them up. The rsync algorithm would be a good place to start, but I haven't done more than peek at that.

Are you planning on handling hash collisions? Or do you think that's too low a probability to worry about?
Dividing up files into blocks is also a good idea, one could do that by having the reference point to a special type of blob that contains a list of other blobs that make up the file. These blobs could point to other blob lists or to a blob diff as I described, or to raw blob content. The trick, of course, is using such a structure most effectively!
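The block-splitting idea could start as simply as fixed-size chunks, with the file's reference pointing at the resulting list of blob names (rsync-style rolling boundaries would dedupe shifted content better; this is just a sketch, and the chunk size is arbitrary):

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # arbitrary choice for illustration

def split_into_chunks(data, size=CHUNK_SIZE):
    """Return the SHA-1 names of fixed-size chunks of data.

    The file's reference would point at a "list blob" containing
    these names, rather than at one big blob for the whole file.
    """
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]
```

Two files that share a leading run of identical blocks automatically share those chunk blobs in storage.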

I have decided to ignore hash collisions, because the probability of a collision is less than that of a meteor directly striking my server. But, there's no reason why this approach is tied to SHA-1, and the more paranoid among us could use something larger like SHA-256 or SHA-512.
Or use two different kinds of hashes (say, SHA-1 plus MD5) and presume that the chance of a simultaneous collision in two different hashes will be much lower.
SHA-256; just cause! SHA-1 is being deprecated where possible due to breaks but that won't really affect your usage of course. I finished writing a nice optimized/standalone C version (with no includes) of it a few weeks back and tested it against the government standards and 25,000 randomized file sizes/contents vs OpenSSL. Surprisingly enjoyable; porting to assembly soon.
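Since the scheme only cares about having a stable hex digest, the hash really is pluggable; hashlib makes the algorithm a parameter (a sketch, not shaback's actual code):

```python
import hashlib

def blob_name(data, algorithm="sha1"):
    # The scheme isn't tied to SHA-1; pass "sha256" or "sha512"
    # for a larger safety margin against collisions.
    return hashlib.new(algorithm, data).hexdigest()
```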

Awesome -- I've started writing several backup tools to do almost exactly this, but haven't managed to build up enough momentum to finish one. Now I can just use yours instead ;)

I think bradfitz's "brackup" attempts something like this. I never worked up the confidence to actually use it for backing up my own stuff, though.

One handy feature would be the ability to tar the files up into chunks so that you don't need to have one file in S3 for every tiny file on the original disk. As files get updated, you'll end up with chunks containing some good files and some out of date ones (with newer versions stored in later chunks), but if this got to be a problem you could spin up an EC2 instance and compact the archive, recompressing/encrypting chunks to clear out the old files.
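Packing small files into tar chunks could look roughly like this, with the chunk itself named by its own hash (a sketch under those assumptions; choosing which files share a chunk, and compacting stale ones, are the hard parts):

```python
import hashlib
import io
import tarfile

def pack_chunk(files):
    """Pack several small files (a dict of name -> bytes) into one tar
    "chunk", so S3 holds one object per chunk instead of one per file.
    Returns (chunk name, chunk bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):  # sorted for determinism
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    payload = buf.getvalue()
    return hashlib.sha1(payload).hexdigest(), payload
```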
Oh, I'd forgotten about brackup! A very quick look at its code seems to indicate that it does something similar, but I didn't see an actual explanation of what it does internally.

I suppose there's no reason why this same structure couldn't support combined chunks too. Good idea.
[info]myelin : Downloading s3lib?
Do you have a git URL for s3lib? I managed to extract a tarball from but it would be nice to be able to follow it with git (perhaps as a submodule of shaback).
[info]ghewgill : Re: Downloading s3lib?
Yeah, a submodule is a good idea. Currently, the only reason that s3c exists on at all is because it's the first project I picked to do a test conversion from svn to git! It's not really "official" yet and might not even contain all the latest fixes. I'll fix that up, put a copy on github, and make a submodule.

I just counted, I have 83 svn repositories. How did that many happen?
[info]ghewgill : Re: Downloading s3lib?
Ok, I created and added a submodule reference from shaback to it. I think it works, it seems to be okay on a test checkout.

Way too much effort. :-)

  1. Use rdiff-backup.

  2. Walk the resulting tree. For each file:

    1. Calculate checksum.
    2. Use it as the name for a hardlink you create in $bakvolume/sha1.
    3. … unless that link already exists.
    4. … unless the inode numbers differ.
    5. In which case you replace the existing filename with a link to the one in the $bakvolume/sha1 directory.
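The steps above could be sketched in Python (assuming a local $bakvolume with a sha1/ subdirectory, as described; a real pass would walk the whole snapshot):

```python
import hashlib
import os

def dedupe_file(bakvolume, filepath):
    """Replace filepath's data with a hardlink to a content-addressed copy."""
    with open(filepath, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    os.makedirs(os.path.join(bakvolume, "sha1"), exist_ok=True)
    canonical = os.path.join(bakvolume, "sha1", digest)
    if not os.path.exists(canonical):
        os.link(filepath, canonical)   # first copy of this content
    elif os.stat(canonical).st_ino != os.stat(filepath).st_ino:
        os.remove(filepath)            # duplicate: relink to the canonical copy
        os.link(canonical, filepath)
```

Run over every file in a snapshot, this leaves one inode per unique content, with every filename hardlinked to it.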

That's a clever idea. I have two requirements though: (a) the backup media is Amazon S3 or something offline (no hard links to or from there), and (b) dead-simple restore algorithm (no delta-application code required).

Restore is as simple as it ever is with rdiff-backup: you can browse the snapshots like any other part of the filesystem and use normal filesystem tools to grab files out of them. This is after all merely a compacting postprocessing pass over an rdiff-backup snapshot.

The S3 requirement rules out the idea, though.

Greg Hewgill <>