Automatic Backups with rsync and Anacron
Originally published in the LinuxGazette.net, July 2004, Issue 104.
1. Introduction
The thing about backups is that they can just be a pain. Everyone knows how important they are, but very few people actually take the time to perform proper backups, even after they have felt the pain of losing all those important files.
In this article I am going to show you how to quickly set up your computer for simple, hassle-free, and transparent backups using only rsync and cron (or Anacron). The premise is simple: every night your computer will make an automatic mirror of all the files you wish to backup, and at chosen intervals these mirrors will be archived and kept for a specified period of time.
Before we get our hands dirty on the actual implementation, you need to design your own backup policy. In section 3 I discuss what a backup policy should and should not be. I will then introduce the necessary background information on rsync and cron separately. Finally, I will put it all together, leaving you with a simple but effective backup regime.
2. Intended Audience
This article and the presented backup procedure are intended for anyone wishing to keep an effective backup of their important data. They are definitely not intended for large organisations or businesses with mission critical data. I would imagine the ideal candidates would include: home users, home office/small office users, students/postgraduates, and researchers.
3. Backup Policies
A common misconception among many people is that a good backup policy is as simple as making a regular copy of your data ("mirroring your data"), always overwriting the previous copy. This, although more effort than most might make, is almost as bad as doing nothing.
Consider, for example, if one of your files becomes corrupt over time. It takes you a week or two before you use it again. In that time, you have made two "backups". You open your file to find your data destroyed. "But", you think to yourself, "that's alright, I'll just turn to my backup". You open your backup to find the exact same corrupted file. You realise only too late how useless your backup policy was.
Most of us have hundreds, if not thousands, of important files in our home directories; address books, e-mails, letters, work related data, programs we have been working on, etc. Some of these files we might use every week, while others might not be looked at for months or even years.
A good backup policy is one which takes "snapshots" of your data and keeps them for a specified period of time. It is up to each individual to decide just how many snapshots to keep and at what intervals. Often this will be decided for you by storage limitations. Where possible, data that changes regularly will benefit from snapshots at shorter intervals, while data that rarely changes needs fewer snapshots. The following table demonstrates my own backup procedure (the numbered Weekly and Monthly columns are the ages of the snapshots kept):
Data | Change Freq. | Size | Daily Mirror | Weekly 1 | Weekly 2 | Weekly 3 | Monthly 1 | Monthly 2 | Monthly 3 | Monthly 4 | Monthly 6 | Monthly 12 | Space Required
---|---|---|---|---|---|---|---|---|---|---|---|---|---
E-mails | Daily | 100Mb | Y | Y | Y | Y | Y | N | Y | N | Y | Y | 500Mb
MySQL Data | Daily | 30Mb | Y | Y | N | N | Y | N | Y | N | Y | N | 70Mb
Website | Monthly | 900Mb | Y | Y | N | N | Y | N | N | Y | N | N | 3,200Mb
/etc | 2-3 Weeks | 28Mb | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | 200Mb
Thesis | Daily | 25Mb | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | 190Mb
Research Code | Rarely | 60Mb | Y | Y | N | N | Y | N | Y | N | Y | Y | 200Mb
Total space required: | | | | | | | | | | | | | 4,360Mb
Each of the snapshots is compressed to reduce space. The largest data in my policy is the website. This rarely changes so I keep only a few snapshots. My system's /etc directory also changes rarely, but as it is only 28Mb I have chosen to keep all possible snapshots. You should now make a similar table and decide which data you want to back up and how often.
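The total at the bottom of the table is simply the sum of the "Space Required" column. A minimal shell sketch of that arithmetic, using the figures from the table above (substitute the values from your own table):

```shell
#!/bin/sh
# Space required per data set in Mb, in table order:
# e-mails, MySQL data, website, /etc, thesis, research code.
total=0
for mb in 500 70 3200 200 190 200; do
    total=$((total + mb))
done
echo "Total space required: ${total}Mb"
```

Comparing this figure with the free space on your chosen backup location is a quick way to see whether the policy needs trimming.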
The next consideration is where to store your backups. Again, this is your own choice. Some locations simply won't be available to you while others might just seem like overkill. The following list gives the various options listed from the best to the worst:
- 1. On a networked computer in a remote location (such as backing up from home to your office computer or vice-versa)
- This is clearly the ideal situation. Your data will be protected from fire, theft, electrical outages, power surges, water/sprinkler damage, etc.
- 2. On a networked computer in another room of your home or office
- Not as good as (1) above as both original and backup computers could be caught by fire, power surges, etc.
- 3. On a second disk drive in your computer
- Far better than having no backups at all, and you will be protected if one of the hard drives fails, but you will still be vulnerable to fire, theft, etc. I would certainly recommend a power surge protector if your PSU (power supply unit) doesn't have one built in.
- 4. On a separate partition on the same disk drive
- 5. On the same partition on the same disk drive
- The last two on my list are by far the worst. Neither could really be considered a backup policy of merit. Any hard drive problem, a virus, an accidental mistake, etc. could ruin you. You are really only insuring yourself against accidental deletion or a similar operation on the original files.
If you do not have access to a remote computer yourself, then consider joining up with a friend, each of you backing up to the other's computer. Security concerns can be addressed by encrypting the data before/during transfer and only placing the encrypted versions on the remote computer.
4. rsync - The Fast, Flexible File Transfer Utility
rsync is a very fast and flexible file transfer utility. It uses its own "remote update" protocol to transfer just the differences between two sets of files. It can operate locally or across a network link using rsh, ssh or its own daemon. rsync is included with most standard Linux distributions by default, or it can be downloaded from its website (http://rsync.samba.org).
We are going to use rsync to mirror our files every night. rsync is the ideal choice as it will only transfer new files, the differences between existing files that have changed, and remove old files, minimising the bandwidth usage for dial-up/broadband customers.
The mirrors are easiest to implement when we take entire directories and their sub-directories. Let's take the case where you are mirroring all your e-mail files from your home computer to your office computer. We would use rsync as follows:

    rsync -a -e ssh --delete /home/username/mail \
        username@mycomputer.mycompany.com:/backups/mail

where:
-a
- Instructs rsync to copy all files and directories recursively while preserving symbolic links, special devices, time stamps, owner and group IDs and permissions.
-e ssh
- Tells rsync to use the ssh remote shell. More about this below.
--delete
- Instructs rsync to delete files on the receiving side which do not exist on the sending side.
/home/username/mail
- The directory we are mirroring.
username@mycomputer.mycompany.com:/backups/mail
- Log in as user username on mycomputer.mycompany.com and create/update the mirror in /backups/mail
This will create a mirror of /home/username/mail on mycomputer.mycompany.com under the directory /backups/mail/mail. This is what we want. If you wanted the reverse (backing up from mycomputer.mycompany.com to your home computer) you would simply switch the source and destination:

    rsync -a -e ssh --delete \
        username@mycomputer.mycompany.com:/home/username/mail /backups/mail

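Both the nesting caused by leaving the trailing slash off the source directory and the effect of --delete can be checked safely on scratch directories before rsync is pointed at real data. The following is a minimal sketch, assuming rsync is installed locally; the temporary directory layout is purely illustrative and no network transfer takes place:

```shell
#!/bin/sh
# Scratch area standing in for the home and backup machines (illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo/home/mail" "$demo/backups/mail/mail"
echo "hello" > "$demo/home/mail/inbox"
# A file that exists only in the mirror; --delete should remove it.
echo "old" > "$demo/backups/mail/mail/stale"

if command -v rsync >/dev/null 2>&1; then
    # No trailing slash on the source, so the mirror lands in backups/mail/mail.
    rsync -a --delete "$demo/home/mail" "$demo/backups/mail"
    ls "$demo/backups/mail/mail"
else
    echo "rsync not found; skipping demonstration"
fi
```

After the run the mirror directory should contain inbox only, as the stale file no longer exists on the sending side.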
I recommend that you use the ssh protocol to ensure the secrecy of your data while it is being transferred. If you are performing this backup on a closed network, feel free to use the older rsh protocol or rsync's own daemon. Using networked backups creates one more problem: we want this to be automatic, with no user interaction, but using rsh or ssh generally requires a password to be entered. We will overcome this by using public/private keys without passphrases.
4.1 Setting up password-less authentication for ssh
This article is not intended as a tutorial on ssh so I will only provide a brief instruction on setting up private/public key authentication using ssh. Please refer to the ssh documentation for a more thorough discussion.
The following two commands will set up password-less authentication from your computer to mycomputer.mycompany.com:

    $ ssh-keygen -b 1024 -t rsa -f /home/username/.ssh/id_rsa
    (do not enter a pass-phrase - leave it blank)
    $ scp /home/username/.ssh/id_rsa.pub \
        username@mycomputer.mycompany.com:/home/username/.ssh/authorized_keys

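Note that copying the public key over authorized_keys with scp replaces any keys already stored in that file. Where the account on the backup machine already holds other keys, appending is the safer variation. The sketch below shows the idea on local scratch files only; the add_key function, the file names and the key strings are hypothetical placeholders, and on a real system the destination would be the authorized_keys file on the backup machine:

```shell
#!/bin/sh
# add_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key unless an identical line is already present.
add_key() {
    pubkey_file=$1
    auth_file=$2
    touch "$auth_file"
    chmod 600 "$auth_file"
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Demonstration with placeholder (not real) keys in a scratch directory.
demo=$(mktemp -d)
echo "ssh-rsa AAAAexistingkey someone@elsewhere" > "$demo/authorized_keys"
echo "ssh-rsa AAAAbackupkey username@home" > "$demo/id_rsa.pub"
add_key "$demo/id_rsa.pub" "$demo/authorized_keys"
add_key "$demo/id_rsa.pub" "$demo/authorized_keys"   # second call changes nothing
cat "$demo/authorized_keys"
```

The existing key is kept and the new key is added once, however many times the function is called.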
Usually any problems encountered are down to the permissions of the various key files. Use ssh in verbose mode (ssh -v) and check the ssh daemon logs on both machines (usually /var/log/secure).
In using this method it is important for you to be aware of the security concerns that arise. The ssh-keygen command produced two files:
- /home/username/.ssh/id_rsa: the private key
- /home/username/.ssh/id_rsa.pub: the public key
You should ensure the permissions of the private key are -rw------- (i.e., only readable by the owner). This file is the equivalent of having a text file containing your login password to your account at mycomputer.mycompany.com; anyone who gets their hands on this file will be able to log into that account without knowing your password. That said, this method can still be relatively safe as any potential hacker must first gain access to your home computer in order to get at this file.
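The permission check on the private key is easy to script. The following is a minimal sketch, run here against an empty placeholder file in a scratch directory rather than a real key; the key_is_private function name is hypothetical:

```shell
#!/bin/sh
# key_is_private FILE
# Succeeds only if FILE has mode -rw------- (no access for group or others).
key_is_private() {
    perms=$(ls -ld "$1" | cut -c1-10)
    [ "$perms" = "-rw-------" ]
}

demo=$(mktemp -d)
: > "$demo/id_rsa"              # empty placeholder, not a real key
chmod 644 "$demo/id_rsa"
key_is_private "$demo/id_rsa" || echo "too permissive, tightening"
chmod 600 "$demo/id_rsa"
key_is_private "$demo/id_rsa" && echo "private key permissions OK"
```

On a real system the same chmod 600 would be applied to /home/username/.ssh/id_rsa.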
If you use this method you should also consider the following security measures:
- Ensure both machines have an effective firewall configured (see my article in last month's edition). You can use some of the features of iptables such as specifying the exact IPs that are allowed access to the system.
- Set up a new user account on the backup machine and use that account for backups only and do not assign a password to this user. If you already have assigned a password, you can lock it by executing passwd -l username as root.
- Use a separate public/private key for the backup. This can be done by using the -f option of ssh-keygen as demonstrated above (where I actually used the default key names) and then the -i option of ssh (see man ssh).
- Supply options to the passphrase-less key in the authorized_keys file to restrict the options normally available through ssh and to ensure that only one command is executed (it is essential that all of the following is part of the same line - i.e. do not use the return key):

      from="192.168.0.1",command="/home/username/bin/secure-rsync",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa [key] [user@host]

  Most of these options are explained in the man page for sshd. The from field is the hostname or IP address of the machine(s) that will be initiating the backup via rsync; multiple hosts can be specified and wildcards are allowed.

  When rsync connects to the destination machine via ssh, it tries to execute the command rsync --server [arguments]. We want to ensure that this is the only command that can be executed via our passphrase-less ssh key. The command option allows us to specify the command that is executed whenever this key is used for authentication, and any other command supplied by the user (in this case the rsync process) is ignored (it is actually passed to $SSH_ORIGINAL_COMMAND as we will see). The secure-rsync command specified above should be:

      #!/bin/sh
      # Run only the rsync server command requested by the client; refuse anything else.
      case "$SSH_ORIGINAL_COMMAND" in
          # Refuse anything containing the shell operators &, ; or |
          *\&* | *\;* | *\|*)
              echo "Access denied"
              ;;
          # Allow the rsync server invocation
          rsync\ --server*)
              $SSH_ORIGINAL_COMMAND
              ;;
          *)
              echo "Access denied"
              ;;
      esac

  This simple script ensures that only the rsync server can be executed on the destination machine through the passphrase-less ssh key. It also ensures that a hostile agent does not try to add additional commands after the call to rsync by using one of the shell control operators (&& or ||) or the termination operators (& or ;).
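Before relying on the wrapper it is worth seeing how its patterns treat a few sample commands. The following sketch applies the same case patterns as secure-rsync but only reports the decision instead of executing anything; the check_command function and the sample command strings are made up for illustration:

```shell
#!/bin/sh
# check_command COMMAND_STRING
# Same patterns as secure-rsync, but prints the decision rather than running it.
check_command() {
    case "$1" in
        *\&* | *\;* | *\|*) echo "denied" ;;
        rsync\ --server*)   echo "allowed" ;;
        *)                  echo "denied" ;;
    esac
}

check_command "rsync --server --delete . /backups/mail"
check_command "rsync --server . /backups/mail; rm -rf /backups"
check_command "cat /etc/passwd"
```

Only the first command is allowed; the second is rejected because of the ; and the third because it is not an rsync server invocation.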
5. cron - Daemon to Execute Scheduled Commands
cron is an integral part of most Linux distributions. It is used to execute commands at specific times according to a schedule you set. We will use it to set up a nightly mirror of all the files we wish to back up, and to create the snapshots at the intervals we determined in section 3.
Each user on a Linux system has their own cron table ("crontab") which contains the schedule of commands. This can be listed using 'crontab -l', removed with 'crontab -r' and edited with 'crontab -e'. Let's add the daily mirror command so that it occurs at 2am every day by placing the following in our crontab:

    00 02 * * * rsync -a -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail

The entry must be on a single line, as cron has no way of continuing a command onto the next line. The five fields (00 02 * * *) are (respectively):

Field | Allowed Values
---|---
minute | 0-59
hour | 0-23
day of month | 1-31
month | 1-12
day of week | 0-7 (0 or 7 is Sunday)

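To make the field order concrete, the following small sketch splits a schedule into its five fields and labels each one. It is only an illustration of how to read an entry, not part of cron itself, and the describe_schedule function name is hypothetical:

```shell
#!/bin/sh
# describe_schedule "MINUTE HOUR DAY-OF-MONTH MONTH DAY-OF-WEEK"
describe_schedule() {
    set -f            # stop the shell expanding * into file names
    set -- $1         # split the schedule into its five fields
    set +f
    echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

describe_schedule "00 02 * * *"
describe_schedule "14 03 * * 0"
```

The first call describes the nightly mirror above; the second describes a job that runs at 03:14 every Sunday.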
So, in our case, we will mirror the contents of /home/username/mail at 02:00 on every day of every month. We can place similar entries for all other directories we wish to mirror. Alternatively, we could create a script containing all the entries and use cron to execute that script.
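A wrapper script of that kind could look like the following minimal sketch. The script name mirror-all, the DRY_RUN switch and the list of directories are assumptions made for illustration; by default it only prints the rsync commands so it can be tried without touching the remote machine:

```shell
#!/bin/sh
# mirror-all: hypothetical wrapper to be called from a single crontab entry.
REMOTE="username@mycomputer.mycompany.com"
DIRS="/home/username/mail /home/username/thesis"
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands; 0 = really run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

mirror_all() {
    status=0
    for dir in $DIRS; do
        # Same rsync options as the single-directory example.
        run rsync -a -e ssh --delete "$dir" \
            "$REMOTE:/backups/$(basename "$dir")" || status=1
    done
    return $status
}

mirror_all
```

Once the printed commands look right, setting DRY_RUN=0 and pointing one crontab entry at the script replaces the individual rsync entries.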
There are two useful environment variables you can also set when editing the crontab to override the defaults:

    SHELL=/bin/sh
    MAILTO=username

The MAILTO variable is important as all error messages will only be sent by e-mail, and so you will be notified if your backups are failing. Refer to the crontab man page for more information and examples.
When choosing your cron times, be mindful of possible problems if you are using NTP to automatically update your system time: if the clock is adjusted across a scheduled time, a job could be skipped or run twice.
6. Putting It All Together
Now that we have the basics of rsync and cron, all we have left to do is to put them together to create our backup policy. Let's continue with the example where your home computer is sending its daily mirror to your office computer. Your office computer will now be responsible for the remainder of the backup policy: the snapshots at the predefined intervals. We will use another crontab on the office machine to accomplish this and I will demonstrate using the schedule for my thesis from section 3.
The method is quite simple. For example, every Sunday we will copy the 2 week old snapshot to the 3 week old snapshot and the 1 week old to the 2 week old, and then archive the mirror as the new 1 week old snapshot; on the first of each month the 3 week old snapshot is copied to the 1 month old snapshot. So, depending on the time of the week, the 3 week old snapshot could be as young as 2 weeks or as old as 3 weeks.
My schedule requires snapshots that are 1, 2, and 3 weeks old and 1, 2, 3, 4, and 6 months old. We will work from the oldest down (as otherwise we would only be propagating the new snapshot):

    # Back up thesis files with snapshots of 6,4,3,2,1 months and 3,2,1 weeks
    # Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w
    # (each crontab entry must be on a single line)

    # At 3am on the 1st of Jan,Mar,May,Jul,Sep,Nov copy the 4m to the 6m
    00 03 1 1,3,5,7,9,11 * cp -f /backups/thesis/backup/4month.tar.gz /backups/thesis/backup/6month.tar.gz

    # At 3.02am on the 1st of every month copy the 3m to the 4m
    # (and continue for other months)
    02 03 1 * * cp -f /backups/thesis/backup/3month.tar.gz /backups/thesis/backup/4month.tar.gz
    04 03 1 * * cp -f /backups/thesis/backup/2month.tar.gz /backups/thesis/backup/3month.tar.gz
    06 03 1 * * cp -f /backups/thesis/backup/1month.tar.gz /backups/thesis/backup/2month.tar.gz
    08 03 1 * * cp -f /backups/thesis/backup/3week.tar.gz /backups/thesis/backup/1month.tar.gz

    # And then every Sunday take care of the weekly snapshots and the archiving
    # of the mirror
    10 03 * * 0 cp -f /backups/thesis/backup/2week.tar.gz /backups/thesis/backup/3week.tar.gz
    12 03 * * 0 cp -f /backups/thesis/backup/1week.tar.gz /backups/thesis/backup/2week.tar.gz
    14 03 * * 0 rm -f /backups/thesis/backup/1week.tar.gz
    16 03 * * 0 tar zcf /backups/thesis/backup/1week.tar.gz /backups/thesis/thesis/*

And that, my friends, is your automatic, hassle-free, and effective backup system.
A few points on the above:
- I have placed each command 2 minutes apart to allow the previous one to complete. Adjust this depending on your own file sizes, system load, hard disk speed, etc.
- In the example in section 5 for the automatic mirroring I set the mirror time to 2 a.m. Ensure, as I have done here, that the snapshots get created after the mirror (i.e., allow enough time for the mirroring to complete)
- Before the first run you should ensure all directories are created, archive the existing mirror, and copy it to all the required files (copy 1week.tar.gz to 2week.tar.gz, 3week.tar.gz, etc) to prevent unnecessary error messages
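The two-minute spacing between crontab entries is only a guess at how long each copy takes. One alternative, which is not the method used above, is to put the weekly steps into a single script so that each step starts only when the previous one has finished. The following is a minimal sketch under that assumption; the rotate_weekly function and the scratch directory layout are hypothetical, and it archives with tar -C so that the archive holds relative paths:

```shell
#!/bin/sh
# rotate_weekly BACKUP_DIR MIRROR_DIR
rotate_weekly() {
    backup_dir=$1
    mirror_dir=$2
    # Work from the oldest snapshot down, so the newest is not propagated.
    if [ -f "$backup_dir/2week.tar.gz" ]; then
        cp -f "$backup_dir/2week.tar.gz" "$backup_dir/3week.tar.gz" || return 1
    fi
    if [ -f "$backup_dir/1week.tar.gz" ]; then
        cp -f "$backup_dir/1week.tar.gz" "$backup_dir/2week.tar.gz" || return 1
    fi
    # Archive the current mirror as the new 1 week old snapshot.
    tar -czf "$backup_dir/1week.tar.gz" -C "$mirror_dir" .
}

# Demonstration in a scratch directory standing in for /backups/thesis.
demo=$(mktemp -d)
mkdir -p "$demo/backup" "$demo/thesis"
echo "chapter one" > "$demo/thesis/ch1.tex"
rotate_weekly "$demo/backup" "$demo/thesis"   # creates 1week
rotate_weekly "$demo/backup" "$demo/thesis"   # 1week -> 2week
rotate_weekly "$demo/backup" "$demo/thesis"   # 2week -> 3week
ls "$demo/backup"
```

A single crontab entry on the backup machine could then call such a script once the nightly mirror has had time to complete. Because missing snapshots are simply skipped, the first few runs build up the set without error messages.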
7. Anacron vs. cron
Anacron is a periodic command scheduler similar to some uses of cron, but it does not assume that the system is running continuously. It can therefore be used for our backup policy on systems that don't run 24 hours a day. Just like rsync and cron, Anacron is now part of most standard Linux distributions.
Every time Anacron is run, it reads a configuration file that specifies the jobs Anacron controls, and their periods in days. If a job wasn't executed in the last n days, where n is the period of that job, Anacron executes it. The configuration file is usually /etc/anacrontab.
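The rule Anacron applies can be pictured with a few lines of shell. This is only an illustration of the rule as described above, not Anacron's actual implementation; the job_is_due function is hypothetical and days are given as plain day counts:

```shell
#!/bin/sh
# job_is_due LAST_RUN_DAY TODAY PERIOD
# Succeeds if at least PERIOD days have passed since the job last ran.
job_is_due() {
    [ $(( $2 - $1 )) -ge "$3" ]
}

# A weekly job last run on day 100, checked on day 103 and again on day 108.
if job_is_due 100 103 7; then echo "day 103: run"; else echo "day 103: skip"; fi
if job_is_due 100 108 7; then echo "day 108: run"; else echo "day 108: skip"; fi
```

The job is skipped on day 103 and run on day 108, which is why a machine that was switched off on the scheduled day still catches up the next time Anacron runs.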
For the daily mirroring we could add a line to this configuration file such as:

    1 20 mirror rsync -a -e ssh --delete /home/username/thesis \
        username@mycomputer.mycompany.com:/backups/thesis

where the fields mean:
1
- the period in days indicating how often this command should be executed
20
- the delay in minutes after Anacron begins before it should execute this command
mirror
- a unique identifier for this job so Anacron can keep track of when it was last executed
rsync...
- the command to execute
And similarly on the backup machine we would place the following in the Anacron configuration file:

    # Back up thesis files with snapshots of 6,4,3,2,1 months and 3,2,1 weeks
    # Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w

    # Every 60 days (2 months)
    60 20 bup1 cp -f /backups/thesis/backup/4month.tar.gz \
        /backups/thesis/backup/6month.tar.gz

    # Every 30 days (1 month)
    30 22 bup2 cp -f /backups/thesis/backup/3month.tar.gz \
        /backups/thesis/backup/4month.tar.gz
    30 24 bup3 cp -f /backups/thesis/backup/2month.tar.gz \
        /backups/thesis/backup/3month.tar.gz
    30 26 bup4 cp -f /backups/thesis/backup/1month.tar.gz \
        /backups/thesis/backup/2month.tar.gz
    30 28 bup5 cp -f /backups/thesis/backup/3week.tar.gz \
        /backups/thesis/backup/1month.tar.gz

    # And every 7 days (each job identifier must be unique)
    7 30 bup6 cp -f /backups/thesis/backup/2week.tar.gz \
        /backups/thesis/backup/3week.tar.gz
    7 32 bup7 cp -f /backups/thesis/backup/1week.tar.gz \
        /backups/thesis/backup/2week.tar.gz
    7 34 bup8 rm -f /backups/thesis/backup/1week.tar.gz
    7 36 bup9 tar zcf /backups/thesis/backup/1week.tar.gz \
        /backups/thesis/thesis/*

A few notes on this:
- You really need to plan well if using Anacron. What if the office machine is regularly off while the home machine is trying to rsync? Anacron can work best in this situation if it is the source machine that is not always running; it can perform the rsync and then take care of the snapshots.
- Ensure you make proper use of the delay time to ensure one job has finished before the other starts.
- Anacron is also ideal for laptop users.
8. Resources
For a more professional backup solution:
- Amanda, The Advanced Maryland Automatic Network Disk Archiver, http://www.amanda.org/
Get advance notification before your hard disk fails:
- smartmontools Home Page - http://smartmontools.sourceforge.net/
9. Change Log
- April 24, 2005
- Added warning about NTP time servers and possible problems with cron settings. Tip from Thomas Adam of the LinuxGazette.net.
- May 4, 2005
- Updated the ssh keys section to increase the security of the system by using a specific public/private key pair for the passphrase-less keys. I also added a working script/example which allows one to specify a specific command that can only be run with a particular ssh key for maximum security. Thanks for the idea goes to Colm MacCarthaigh of ILUG.
- May 6, 2005
- Changed rsync's -r switch to the much more appropriate -a.
Copyright © 2004, 2005, Barry O'Donovan. Released under the Open Publication license.