Automatic Backups with rsync and Anacron

Originally published in the LinuxGazette.net, July 2004, Issue 104.

1. Introduction

The thing about backups is that they can just be a pain. Everyone knows just how important they are, but very few people actually take the time to perform proper backups, even after they have felt the pain of losing all those important files.

In this article I am going to show you how to quickly set up your computer for simple, hassle-free, and transparent backups using only rsync and cron (or Anacron). The premise is simple: every night your computer will make an automatic mirror of all the files you wish to back up, and at chosen intervals these mirrors will be archived and kept for a specified period of time.

Before we get our hands dirty on the actual implementation, you need to design your own backup policy. In section 3 I discuss what a backup policy should and should not be. I will then introduce the necessary background information on rsync and cron separately. Finally, I will put it all together, leaving you with a simple but effective backup regime.

2. Intended Audience

This article and the presented backup procedure are intended for anyone wishing to keep an effective backup of their important data. They are definitely not intended for large organisations or businesses with mission critical data. I would imagine the ideal candidates would include: home users, home office/small office users, students/postgraduates, and researchers.

3. Backup Policies

A common misconception among many people is that a good backup policy is as simple as making a regular copy of your data ("mirroring your data"), always overwriting the previous copy. This, although more effort than most might make, is almost as bad as doing nothing.

Consider, for example, if one of your files becomes corrupt over time. It takes you a week or two before you use it again. In that time, you have made two "backups". You open your file to find your data destroyed. "But", you think to yourself, "that's alright, I'll just turn to my backup". You open your backup to find the exact same corrupted file. You realise only too late how useless your backup policy was.

Most of us have hundreds, if not thousands, of important files in our home directories: address books, e-mails, letters, work related data, programs we have been working on, etc. Some of these files we might use every week, while others might not be looked at for months or even years.

A good backup policy is one which takes "snapshots" of your data and keeps them for a specified period of time. It is up to each individual to decide just how many snapshots to keep and at what intervals. Often this will be decided for you by storage limitations. Where possible, data that changes regularly will benefit from snapshots at shorter intervals, while data that rarely changes requires fewer snapshots. The following table demonstrates my own backup procedure:

Data           Change Freq.  Size    Daily   Weekly    Monthly             Space
                                     Mirror  Snapshots Snapshots           Required
                                             1  2  3   1  2  3  4  6  12
E-mails        Daily         100Mb   Y       Y  Y  Y   Y  N  Y  N  Y  Y    500Mb
MySQL Data     Daily         30Mb    Y       Y  N  N   Y  N  Y  N  Y  N    70Mb
Website        Monthly       900Mb   Y       Y  N  N   Y  N  N  Y  N  N    3,200Mb
/etc           2-3 Weeks     28Mb    Y       Y  Y  Y   Y  Y  Y  Y  Y  Y    200Mb
Thesis         Daily         25Mb    Y       Y  Y  Y   Y  Y  Y  Y  Y  N    190Mb
Research Code  Rarely        60Mb    Y       Y  N  N   Y  N  Y  N  Y  Y    200Mb

Total space required: 4,360Mb

Each of the snapshots is compressed to reduce space. The largest data in my policy is the website. This rarely changes so I keep only a few snapshots. My system's /etc directory also changes rarely, but as it is only 28Mb I have chosen to keep all possible snapshots. You should now make a similar table and decide which data you want to back up and how often.
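To fill in the Size column of your own table, you can measure each candidate directory with du. The following is a minimal sketch; the function name size_mb is made up for this example and the directory in the usage note is only illustrative.

```shell
#!/bin/sh
# Print the size of a directory in whole megabytes, rounded up.
# "du -sk" reports the total in kilobytes on any POSIX system.
size_mb() {
    kb=$(du -sk "$1" | awk '{print $1}')
    echo $(( ($kb + 1023) / 1024 ))
}
```

For example, "size_mb /etc" prints a single number that can be copied into the table.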

The next consideration is where to store your backups. Again, this is your own choice. Some locations simply won't be available to you while others might just seem like overkill. The following list gives the various options listed from the best to the worst:

1. On a networked computer in a remote location (such as backing up from home to your office computer or vice-versa)
This is clearly the ideal situation. Your data will be protected from fire, theft, electrical outages, power surges, water/sprinkler damage, etc.

2. On a networked computer in another room of your home or office
Not as good as (1) above as both original and backup computers could be caught by fire, power surges, etc.

3. On a second disk drive in your computer
Far better than having no backups at all, and you will be protected if one of the hard drives fails, but you will still be vulnerable to fire, theft, etc. I would certainly recommend a power surge protector if your PSU (power supply unit) doesn't have one built in.

4. On a separate partition on the same disk drive

5. On the same partition on the same disk drive
The last two on my list are by far the worst, and neither could really be considered a backup policy of merit. Any hard drive problem, a virus, an accidental mistake, etc. could ruin you. You are really only insuring yourself against accidental deletion or a similar operation on the original files.

If you do not have access to a remote computer yourself, then consider joining up with a friend, each of you backing up to the other's computer. Security concerns can be addressed by encrypting the data before/during transfer and only placing the encrypted versions on the remote computer.

4. rsync - The Fast, Flexible File Transfer Utility

rsync is a very fast and flexible file transfer utility. It uses its own "remote update" protocol to transfer just the differences between two sets of files. It can operate locally or across a network link using rsh, ssh or its own daemon. rsync is included with most standard Linux distributions by default, or it can be downloaded from its website (http://rsync.samba.org).

We are going to use rsync to mirror our files every night. rsync is the ideal choice as it will only transfer new files, the differences between existing files that have changed, and remove old files, minimising the bandwidth usage for dial-up/broadband customers.

The mirrors are easiest to implement when we take entire directories and their sub-directories. Let's take the case where you are mirroring all your e-mail files from your home computer to your office computer. We would use rsync as follows:

rsync -a -e ssh --delete /home/username/mail \
    username@mycomputer.mycompany.com:/backups/mail
where:
-a
Instructs rsync to copy all files and directories recursively while preserving symbolic links, special devices, time stamps, owner and group IDs and permissions.
-e ssh
Tells rsync to use the ssh remote shell. More about this below.
--delete
Instructs rsync to delete files on the receiving side which do not exist on the sending side.
/home/username/mail
The directory we are mirroring.
username@mycomputer.mycompany.com:/backups/mail
Log in as user username on mycomputer.mycompany.com and create/update the mirror in /backups/mail

This will create a mirror of /home/username/mail on mycomputer.mycompany.com under the directory /backups/mail/mail. This is what we want. If you wanted the reverse (backing up from mycomputer.mycompany.com to your home computer) you would simply switch the source and destination:

rsync -a -e ssh --delete \
   username@mycomputer.mycompany.com:/home/username/mail /backups/mail

I recommend that you use the ssh protocol to ensure the secrecy of your data while it is being transferred. If you are performing this backup on a closed network, feel free to use the older rsh protocol or rsync's own daemon. Using networked backups creates one more problem: we want this to be automatic, with no user interaction, but using rsh or ssh generally requires a password to be entered. We will overcome this by using public/private keys without passphrases.

4.1 Setting up password-less authentication for ssh

This article is not intended as a tutorial on ssh so I will only provide a brief instruction on setting up private/public key authentication using ssh. Please refer to the ssh documentation for a more thorough discussion.

The following two commands will set up password-less authentication from your computer to mycomputer.mycompany.com:

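$ ssh-keygen -b 1024 -t rsa -f /home/username/.ssh/id_rsa-backup
      (do not enter a pass-phrase - leave it blank)
$ scp /home/username/.ssh/id_rsa-backup.pub \
   username@mycomputer.mycompany.com:/home/username/.ssh/authorized_keys
      (this replaces any existing authorized_keys file on the remote machine)

Because the key pair has a non-default name, ssh must be told to use it, for example by replacing "-e ssh" in the rsync commands with -e "ssh -i /home/username/.ssh/id_rsa-backup".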

Usually any problems encountered are down to the permissions of the various key files. Use ssh in verbose mode (ssh -v) and check the ssh daemon logs on both machines (usually /var/log/secure).

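In using this method it is important for you to be aware of the security concerns that arise. The ssh-keygen command produced two files: a private key, which stays on your computer, and a public key (the .pub file), which was copied to the remote machine.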

You should ensure the permissions of the private key are -rw------- (i.e., only readable by the owner). This file is the equivalent of having a text file containing your login password to your account at mycomputer.mycompany.com; anyone who gets their hands on this file will be able to log into that account without knowing your password. That said, this method can still be relatively safe as any potential hacker must first gain access to your home computer in order to get at this file.
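The permissions can be set and checked in one step. This is a small sketch; the function name lock_private_key is hypothetical, and the path in the usage note is assumed from the name of the public key copied earlier.

```shell
#!/bin/sh
# Restrict a private key so that only its owner can read or write it,
# then print the resulting mode string so it can be checked by eye.
lock_private_key() {
    chmod 600 "$1" && ls -l "$1" | cut -c1-10
}
```

Running "lock_private_key /home/username/.ssh/id_rsa-backup" should print -rw------- if the change succeeded.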

If you use this method you should also consider the following security measures:
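One measure of this kind, noted in the change log, is to restrict the passphrase-less key so that it can only run one specific command. OpenSSH supports this through a command="..." option placed in front of the key in authorized_keys; the forced command then finds the command the client asked for in the SSH_ORIGINAL_COMMAND environment variable. The following is a minimal sketch of such a check and is not the script from the original article; the wrapper path, the function name and the accepted pattern are all assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper meant to be used as a forced command, e.g. in
# authorized_keys:  command="/home/username/bin/rsync-only" ssh-rsa AAAA...
# sshd places the command requested by the client in SSH_ORIGINAL_COMMAND.

# Succeed only if the requested command is rsync's server mode and
# contains no shell metacharacters that could chain further commands.
is_allowed_command() {
    case "$1" in
        *";"*|*"&"*|*"|"*|*"<"*|*">"*|*'`'*|*'$'*|*"("*|*")"*)
            return 1 ;;                  # no chaining, redirection or expansion
        "rsync --server "*)
            return 0 ;;                  # what "rsync -e ssh" runs on the remote side
        *)
            return 1 ;;
    esac
}

# In the wrapper itself the check would be followed by something like:
#   is_allowed_command "$SSH_ORIGINAL_COMMAND" && exec $SSH_ORIGINAL_COMMAND
```

With this in place, a stolen key could still be used to run rsync against the account, but not to open an interactive shell.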

5. cron - Daemon to Execute Scheduled Commands

cron is an integral part of most Linux distributions. It is used to execute commands at specific times according to a schedule you set. We will use it to set up a nightly mirror of all the files we wish to back up, and to create the snapshots at the intervals we determined in section 3.

Each user on a Linux system has their own cron table ("crontab") which contains the schedule of commands. This can be listed using 'crontab -l', removed with 'crontab -r' and edited with 'crontab -e'. Let's add the daily mirror command so that it occurs at 2am every day by placing the following in our crontab:

# crontab entries must be on a single line; cron does not support backslash continuation
00 02 * * * rsync -a -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail
where the five fields (00 02 * * *) are (respectively):

Field          Allowed Values
minute         0-59
hour           0-23
day of month   1-31
month          1-12
day of week    0-7*

(*0 or 7 is Sunday)
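As a quick check of the field order, an entry can be split into its five schedule fields and its command with the shell's read builtin. This only illustrates how an entry is laid out; cron does its own parsing, and the function name is made up for this example.

```shell
#!/bin/sh
# Split a crontab entry into its five schedule fields and the command.
explain_cron_entry() {
    # read splits on whitespace; the last variable receives the rest of the line
    printf '%s\n' "$1" | {
        read -r minute hour dom month dow command
        echo "minute=$minute hour=$hour day-of-month=$dom month=$month day-of-week=$dow"
        echo "command=$command"
    }
}

explain_cron_entry '00 02 * * * rsync -a -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail'
```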

So, in our case, we will mirror the contents of /home/username/mail at 02:00 on every day of every month. We can place similar entries for all the other directories we wish to mirror. Alternatively, we could create a script containing all the entries and use cron to execute that script.
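A script of that kind might look like the following sketch. The directory list, the DRY_RUN switch and the function names are assumptions made for illustration; the rsync options and the destination are the ones used in the example above.

```shell
#!/bin/sh
# Mirror several directories to the backup machine in one go.
DEST="username@mycomputer.mycompany.com:/backups"   # destination from the example above
DIRS="/home/username/mail /home/username/thesis"    # hypothetical list; adjust to taste

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mirror every directory in DIRS; return non-zero if any transfer failed.
mirror_all() {
    status=0
    for dir in $DIRS; do
        # No trailing slash on the source, so each directory is recreated
        # under its own name below the destination, as in the example above.
        run rsync -a -e ssh --delete "$dir" "$DEST/$(basename "$dir")" || status=1
    done
    return $status
}
```

Saved to a file with a final line that calls mirror_all, the script can replace the individual rsync entries in the crontab. Setting DRY_RUN=1 first prints the commands without running them, which is a cheap way to check the paths.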

There are two useful environment variables you can also set when editing the crontab to override the defaults:
SHELL=/bin/sh
MAILTO=username

The MAILTO setting is important as error messages are only ever sent by e-mail, so this is how you will be notified if your backups are failing. Refer to the crontab man page for more information and examples.

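When choosing your cron times, be mindful of possible problems if you are using NTP to automatically update your system time: if the clock is stepped forwards or backwards across a scheduled time, a job may be skipped or run twice, depending on your cron implementation.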

6. Putting It All Together

Now that we have the basics of rsync and cron, all we have left to do is to put them together to create our backup policy. Let's continue with the example where your home computer is sending its daily mirror to your office computer. Your office computer will now be responsible for the remainder of the backup policy: the snapshots at the predefined intervals. We will use another crontab on the office machine to accomplish this and I will demonstrate using the schedule for my thesis from section 3.

The method is quite simple. For example, every Sunday we will move the 2 week old snapshot to the 3 week old, the 1 week old to the 2 week old, and archive the mirror to the 1 week old; on the 1st of each month the 3 week old snapshot is moved to the 1 month old snapshot. So, depending on the time of the week, the 3 week old snapshot could be as young as 2 weeks or as old as 3 weeks.

My schedule requires snapshots that are 1, 2, and 3 weeks old and 1, 2, 3, 4, and 6 months old. We will work from the oldest down (as otherwise we would only be propagating the new snapshot):

# Back up thesis files with snapshots of 6,4,3,2,1 months and 3,2,1 weeks
# Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w
# (each crontab entry must be on a single line; cron does not support
# backslash continuation)

# At 3am on the 1st of Jan,Mar,May,Jul,Sep,Nov copy the 4m to the 6m
00 03 1 1,3,5,7,9,11 * cp -f /backups/thesis/backup/4month.tar.gz /backups/thesis/backup/6month.tar.gz

# At 3.02am on the 1st of every month copy the 3m to the 4m
# (and continue for the other months)
02 03 1 * * cp -f /backups/thesis/backup/3month.tar.gz /backups/thesis/backup/4month.tar.gz
04 03 1 * * cp -f /backups/thesis/backup/2month.tar.gz /backups/thesis/backup/3month.tar.gz
06 03 1 * * cp -f /backups/thesis/backup/1month.tar.gz /backups/thesis/backup/2month.tar.gz
08 03 1 * * cp -f /backups/thesis/backup/3week.tar.gz /backups/thesis/backup/1month.tar.gz

# And then every Sunday take care of the weekly snapshots and the archiving
# of the mirror
10 03 * * 0 cp -f /backups/thesis/backup/2week.tar.gz /backups/thesis/backup/3week.tar.gz
12 03 * * 0 cp -f /backups/thesis/backup/1week.tar.gz /backups/thesis/backup/2week.tar.gz
14 03 * * 0 rm -f /backups/thesis/backup/1week.tar.gz
16 03 * * 0 tar zcf /backups/thesis/backup/1week.tar.gz /backups/thesis/thesis/*
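The same rotation can also be gathered into one script, so that the ordering no longer depends on the two-minute gaps between cron entries. The sketch below is one possible arrangement rather than the method used in this article; the function names are assumptions, and snapshots that do not exist yet are simply skipped.

```shell
#!/bin/sh
# Rotate the thesis snapshots, oldest first, so that nothing is overwritten
# before it has been copied on.
BACKUP_DIR="${BACKUP_DIR:-/backups/thesis/backup}"   # snapshot archives
MIRROR_DIR="${MIRROR_DIR:-/backups/thesis/thesis}"   # nightly rsync mirror

# Copy snapshot $1 over snapshot $2, skipping snapshots that do not exist yet.
shift_snapshot() {
    if [ -f "$BACKUP_DIR/$1.tar.gz" ]; then
        cp -f "$BACKUP_DIR/$1.tar.gz" "$BACKUP_DIR/$2.tar.gz"
    fi
}

# Every second month: 4m->6m
rotate_bimonthly() {
    shift_snapshot 4month 6month
}

# 1st of every month: 3m->4m, 2m->3m, 1m->2m, 3w->1m
rotate_monthly() {
    shift_snapshot 3month 4month
    shift_snapshot 2month 3month
    shift_snapshot 1month 2month
    shift_snapshot 3week  1month
}

# Every Sunday: 2w->3w, 1w->2w, mirror->1w
rotate_weekly() {
    shift_snapshot 2week 3week
    shift_snapshot 1week 2week
    # Archive the mirror as the new 1 week old snapshot. Using -C stores
    # paths relative to the mirror directory and includes hidden files.
    tar zcf "$BACKUP_DIR/1week.tar.gz" -C "$MIRROR_DIR" .
}
```

A small wrapper that calls the right function for the day could then be run from three crontab entries with the same schedules as above.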

And that, my friends, is your automatic, hassle-free, and effective backup system.

A few points on the above:

7. Anacron vs. cron

Anacron is a periodic command scheduler similar to some uses of cron, but it does not assume that the system is running continuously. It can therefore be used for our backup policy on systems that don't run 24 hours a day. Just like rsync and cron, Anacron is now part of most standard Linux distributions.

Every time Anacron is run, it reads a configuration file that specifies the jobs Anacron controls, and their periods in days. If a job wasn't executed in the last n days, where n is the period of that job, Anacron executes it. The configuration file is usually /etc/anacrontab.
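The decision Anacron makes for each job can be pictured as a comparison between the date of the last run and today's date. The sketch below illustrates that rule only and is not Anacron's own code; the function names are hypothetical, and dates are passed as YYYYMMDD strings.

```shell
#!/bin/sh
# Convert a YYYYMMDD date into a day count (Julian day number) so that
# two dates can be subtracted with plain integer arithmetic.
day_number() {
    y=$(echo "$1" | cut -c1-4)
    m=$(echo "$1" | cut -c5-6)
    d=$(echo "$1" | cut -c7-8)
    m=${m#0}; d=${d#0}          # strip a leading zero so 08 and 09 are not read as octal
    a=$(( (14 - $m) / 12 ))
    yy=$(( $y + 4800 - $a ))
    mm=$(( $m + 12 * $a - 3 ))
    echo $(( $d + (153 * $mm + 2) / 5 + 365 * $yy + $yy / 4 - $yy / 100 + $yy / 400 - 32045 ))
}

# Succeed if a job with the given period (in days) is due.
# Usage: job_is_due LAST_RUN TODAY PERIOD
job_is_due() {
    elapsed=$(( $(day_number "$2") - $(day_number "$1") ))
    [ "$elapsed" -ge "$3" ]
}
```

For instance, "job_is_due 20040701 20040708 7" succeeds because seven days have passed, whereas the same call with 20040707 as today's date fails because only six have.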

For the daily mirroring we could add a line to this configuration file such as:

1   20  mirror  rsync -a -e ssh --delete /home/username/thesis \
    username@mycomputer.mycompany.com:/backups/thesis
where the fields mean:
1
the period in days indicating how often this command should be executed
20
the delay in minutes after Anacron begins before it should execute this command
mirror
a unique identifier for this job so Anacron can keep track of when it was last executed
rsync...
the command to execute

And similarly on the backup machine we would place the following in the Anacron configuration file:

# Back up thesis files with snapshots of 6,4,3,2,1 months and 3,2,1 weeks
# Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w

# Every 60 days (2 months)
60 20 bup1 cp -f /backups/thesis/backup/4month.tar.gz \
    /backups/thesis/backup/6month.tar.gz

# Every 30 days (1 month)
30 22 bup2 cp -f /backups/thesis/backup/3month.tar.gz \
    /backups/thesis/backup/4month.tar.gz
30 24 bup3 cp -f /backups/thesis/backup/2month.tar.gz \
    /backups/thesis/backup/3month.tar.gz
30 26 bup4 cp -f /backups/thesis/backup/1month.tar.gz \
    /backups/thesis/backup/2month.tar.gz
30 28 bup5 cp -f /backups/thesis/backup/3week.tar.gz \
    /backups/thesis/backup/1month.tar.gz

# And every 7 days
7 30 bup6 cp -f /backups/thesis/backup/2week.tar.gz \
    /backups/thesis/backup/3week.tar.gz
7 32 bup7 cp -f /backups/thesis/backup/1week.tar.gz \
    /backups/thesis/backup/2week.tar.gz
7 34 bup8 rm -f /backups/thesis/backup/1week.tar.gz
7 36 bup9 tar zcf /backups/thesis/backup/1week.tar.gz \
    /backups/thesis/thesis/*
A few notes on this:

8. Resources

For a more professional backup solution:

Get advance notification before your hard disk fails:

9. Change Log

April 24, 2005
Added warning about NTP time servers and possible problems with cron settings. Tip from Thomas Adam of the LinuxGazette.net.
May 4, 2005
Updated the ssh keys section to increase the security of the system by using a specific public/private key pair for the passphrase-less keys. I also added a working script/example which allows one to specify a specific command that can only be run with a particular ssh key for maximum security. Thanks for the idea goes to Colm MacCarthaigh of ILUG.
May 6, 2005
Changed rsync's -r switch to the much more appropriate -a.

Copyright © 2004, 2005, Barry O'Donovan. Released under the Open Publication license.