Tuesday, April 24, 2007

How to Use JungleDisk, Amazon S3, and rsync to Back Up Your OS X Home Directory

In January, my iBook died; a week ago, my friend Iqram's MacBook died after an OS X Security Update; two days ago my friend Dave's iBook hard drive failed. Iqram and I managed to recover our data; Dave did not. At the moment, I don't have the greatest faith in Apple products, so I decided to back up all my data.

I used to use an external USB hard drive for backups, but my 160 GB external has been acting up of late and sometimes needs to be restarted 2 or 3 times before my computer recognizes it, so about 2 months ago I started playing with Amazon S3 and JungleDisk.

Amazon S3 (Simple Storage Service) is basically an infinite hard drive you can buy on a pay-per-usage basis, and JungleDisk is a utility that allows you to mount S3 as a hard drive on any OS. JungleDisk has a backup tool built in, but right now the tool does not provide mirroring--i.e., if you back up a folder on your computer and then delete some stuff locally, that stuff will not be deleted remotely. So if you start moving stuff around, you'll end up with duplicate copies of your data. Redundant data annoys me, so I decided to finally learn how to use the rsync command line utility to mirror folders. Here's how I set up a backup of my home directory using Amazon S3, JungleDisk, and rsync.

Get Amazon S3

Sign up for Amazon Simple Storage Service. Here's how the pricing works, according to Amazon's website:

  • Pay only for what you use. There is no minimum fee, and no start-up cost.
  • $0.15 per GB-Month of storage used.
  • $0.20 per GB of data transferred.
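
To get a feel for what that means in practice, here is a rough sketch that plugs a backup size into the prices quoted above. The helper name estimate_s3_cost and the 20 GB figure are just assumptions for illustration, and Amazon's prices may have changed since this was written.

```shell
#!/bin/sh
# Rough S3 cost estimate using the prices quoted in this post.
# estimate_s3_cost is a hypothetical helper, not part of any Amazon tool.
estimate_s3_cost() {
    gb="$1"
    awk -v gb="$gb" 'BEGIN {
        printf "storage per month: $%.2f\n", gb * 0.15   # $0.15 per GB-month
        printf "one-time upload:   $%.2f\n", gb * 0.20   # $0.20 per GB transferred
    }'
}

# Example: a 20 GB home directory.
estimate_s3_cost 20
# storage per month: $3.00
# one-time upload:   $4.00
```

After the first full upload, the transfer cost should only apply to whatever changes between backups.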

Download JungleDisk

JungleDisk is free and available for Linux, Windows, and OS X.

Set Up JungleDisk

JungleDisk is pretty easy to set up on OS X. You just download the JungleDisk dmg, open it, and drag the .app file to your Applications folder. Then you open it and enter your Amazon access keys, which you can find by going to this page on Amazon. You will also need to choose a bucket name that you want JungleDisk to use on your S3 account. I used kortina.

When you configure JungleDisk, you can choose to have it auto-mount as a Volume on your computer whenever you start the JungleDisk app. (This is the default setting.) If you need to mount JungleDisk manually, open Finder, hit ⌘K, and enter http://localhost:2667/ as the host name. You can find more instructions on configuring JungleDisk on this page on their official site.

Run rsync to Mirror Your Home Directory to Jungle Disk

rsync \
-avvz --size-only --delete \
--exclude .svn --exclude .Trash --exclude Library/Caches --exclude "*.log" \
/Users/kortina /Volumes/JungleDisk

How this rsync Script Works

The rsync command takes a bunch of options and then a source directory and destination directory as arguments. In this case the source is /Users/kortina, my home directory, and the destination is the mounted JungleDisk drive at /Volumes/JungleDisk. Since there is no trailing / after the source, /Users/kortina, a directory named kortina will be created on the JungleDisk drive. To copy the contents of your home directory directly to JungleDisk without enclosing them in a folder with your username, simply add a trailing slash to the source: /Users/kortina/.
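
In rsync it is the trailing slash on the source path that decides whether the directory itself or only its contents get copied. Here is a minimal sketch you can run to see the difference; it only touches throwaway temp directories (the names under $demo are made up for the demo), not your real home directory or JungleDisk.

```shell
#!/bin/sh
# Shows how a trailing slash on the SOURCE changes where files land.
# Everything lives in a temp directory created just for this demo.
demo=$(mktemp -d)
mkdir -p "$demo/kortina" "$demo/no_slash" "$demo/with_slash"
echo "hello" > "$demo/kortina/notes.txt"

# No trailing slash on the source: the directory itself is copied.
rsync -a "$demo/kortina" "$demo/no_slash"

# Trailing slash on the source: only the directory's contents are copied.
rsync -a "$demo/kortina/" "$demo/with_slash"

ls "$demo/no_slash"     # lists the kortina directory
ls "$demo/with_slash"   # lists notes.txt directly
```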

Here's what the options I've used do:

  • -a: This runs rsync in archive mode, which is equivalent to running it with -rlptgoD. The important part here is -r, which recurses into directories; many of the other options bundled in -a are irrelevant because of the way S3 treats file metadata.
  • -v or -vv: Run in verbose or very verbose mode. Verbose mode will print the name of each file copied or deleted, and very verbose mode will additionally print the names of files that are skipped. I like to run in -vv because I can see progress more easily.
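  • -z: this option will compress file data during the transfer, which can make things a bit faster over a network link. Since the destination here is a locally mounted volume rather than a remote rsync, it probably won't reduce what JungleDisk actually uploads to S3.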
  • --size-only: usually, rsync will compare the size and last modified date of each file to determine whether it is out of date and needs to be copied. Because of the way S3 handles file metadata, however, the last-modified-date of each file you upload to S3 will always be the date it was uploaded. This will screw up rsync, so you need to use the --size-only option to make this backup script work.
  • --delete: (delete files that don't exist on sender) this is another important part of my backup script. The reason I didn't want to use JungleDisk's built-in backup utility was that it didn't support mirroring. --delete is the option that makes rsync do a mirror instead of a simple copy.
  • --exclude: (exclude files matching PATTERN) this option allows you to ignore file patterns like *.log or directory names like Movies.
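
Because --delete removes files from the destination, it can be worth previewing a run first. rsync has a -n (--dry-run) option that reports what it would do without changing anything. A small sketch, again using throwaway temp directories rather than a real backup:

```shell
#!/bin/sh
# Dry-run sketch: -n (--dry-run) reports what rsync would do without doing it.
# The directories are temp dirs created just for this demo.
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dst"
echo "keep" > "$work/src/keep.txt"
echo "stale" > "$work/dst/stale.txt"   # exists only on the destination

# A real run with --delete would remove stale.txt; with -n it is only reported.
rsync -avn --size-only --delete "$work/src/" "$work/dst/"

# Nothing was changed: stale.txt is still there and keep.txt was not copied.
ls "$work/dst"
```

Once the preview looks right, drop the -n to run it for real.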

Once you set up S3 and JungleDisk, if you want to keep the same options I use, just copy and paste the code below, substitute your_username for kortina, open a terminal and run the command.

rsync -avvz --size-only --delete --exclude .svn --exclude .Trash --exclude Library/Caches --exclude "*.log" /Users/kortina /Volumes/JungleDisk
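
If you want to run this from a script, it may be safer to check a couple of things first, since mirroring an empty or missing source with --delete would wipe the backup. The sketch below is one way to do that. safe_to_mirror is a hypothetical helper name, and checking that the destination directory exists is only an approximation of "JungleDisk is mounted" (I'm assuming entries under /Volumes normally appear only while the volume is mounted).

```shell
#!/bin/sh
# safe_to_mirror is a hypothetical helper: it succeeds only when the source
# is a non-empty directory and the destination directory exists.
safe_to_mirror() {
    src="$1"; dst="$2"
    [ -d "$src" ] || { echo "source missing: $src" >&2; return 1; }
    [ -d "$dst" ] || { echo "destination not found: $dst" >&2; return 1; }
    # ls -A includes dotfiles; empty output means the directory is empty.
    [ -n "$(ls -A "$src")" ] || { echo "source is empty: $src" >&2; return 1; }
}

SRC="/Users/your_username"   # substitute your own username
DST="/Volumes/JungleDisk"

if safe_to_mirror "$SRC" "$DST"; then
    rsync -avvz --size-only --delete \
        --exclude .svn --exclude .Trash --exclude Library/Caches --exclude "*.log" \
        "$SRC" "$DST"
else
    echo "skipping backup" >&2
fi
```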

8 comments:

Anonymous said...

Thanks, man! Awesome!

Dans said...

> -z: this option will compress file
> data and make things a bit faster.

Adding -z will not compress the data between your computer and S3. -z is effective only if the remote side is an rsync server.

Given that S3 is all or nothing, i.e. if you want to change a file you have to PUT the whole thing again, every time a local file changes it will upload the whole new file, not just what has changed. That means that if you are using rsync for regular backups you are wasting bandwidth.

You can bypass this limitation using 3rd party gateway like: http://www.s3rsync.com/

Anonymous said...

Not completely true, jungledisk plus can do block change uploads, in which case you only upload changed blocks even if the file is >1gb

Ron said...

also, if you're using dropbox, might want to exclude the actual dropbox folder (since it syncs itself) and .dropbox. the caches in that thing are huge.

Aaron said...

This is a great post about rsync and what it's capable of, but there's a key point you should note...

The disk failures you describe at the beginning of your article should not be attributed to Apple build quality. Apple buys internal hard drives from third party manufacturers like Seagate and Samsung. When those drives fail, it's more a reflection on the drive manufacturer than the computer builder/assembler.

Otherwise, great post!

Anonymous said...

dude, you forgot something, Jungledisk is NOT free.

Norm Wessley said...

Very nice post.

Has anybody tried www.secobackup.com? I have used both Jungle Disk and Secobackup and am leaning towards Secobackup.

Both seem to be very comparable in my mind. In general, I think you can use any tool I guess even some freeware to ship to Amazon S3.

Hoakz said...

Nice post!

However, I had my source drive fail on Linux, which made the mount point empty (works the same for Mac I think) which in turn made rsync mirror an empty dir.

In a matter of an hour my whole mirror was gone.

Simply mirroring will also mirror any erroneous file deletions or changes made on the source drive.

In order to prevent this I added the following switches --backup, --backup-dir, and --suffix

--backup makes rsync make a backup of any file that it changes or deletes from the mirror.

--backup-dir tells rsync where to put the backup (I use --backup-dir=$BACKUP_DIR/`date +%Y/%m` -- which gives me a dir per year and month in my "$BACKUP_DIR")

--suffix adds a suffix to backed up files. You can go with ".bak", in which case rsync keeps one backup and overwrites older ones; however, since I might not notice right away what I removed or what got broken, I use --suffix=.`date +%Y-%m-%d`, which stamps each backup with YYYY-MM-DD and so keeps a backup for each day (for all changed files).

Note: If you set up your system like this you need to keep a keen eye on your S3 account or it will grow and cost more each month -- cleaning out old backups is recommended. Most of my settings (the ones involving dates) will only work for Linux (but I think Mac OS X will make it too... Windows most certainly won't). If you run Windows you can simply use "--backup-dir=/jungledisk/backup" (or wherever you have your JungleDisk mounted) and "--suffix=.bak", which way you'll be a bit safer than just blindly mirroring whatever has happened to your source drive.

This page has a manual entry for rsync http://www.ss64.com/bash/rsync.html. If you have a Linux (or Mac?) box you could do "man rsync" in a terminal window. Reading up on switches is a great way to fine tune your script!