rapidscaleclusters.com

Linux Server Point in Time Snapshots

(a.k.a. Roll Back, System Restore)


 

Overview

Often, a server availability problem is not caused by actual hardware failure but by a configuration mistake or an update that did not work as it was supposed to.  In these cases, the ability to easily "roll back" a server to a previous point in time is very important. We provide this capability for both master and slave servers in a ClusterMaker high availability pair.

Two executables provide the snapshot and restore capability. They are:
/cluster/bin/system_snapshot
/cluster/bin/system_restore


To take a snapshot, run the system_snapshot executable. The first run has to build a full replica of your filesystem tree in the /root/snapshots directory, so it takes much longer than subsequent runs and uses about as much space as your existing filesystem. Subsequent runs store only the changes, as compressed diffs, so in general a large number of additional restore points can be kept without using much additional storage.

The system_snapshot executable also backs up the Master Boot Record and the files in the boot partition. This enables you to boot from a CD and easily roll back a failed kernel upgrade or some other issue that may have rendered the system unbootable.

Typical output for system_snapshot looks like this:

root@sqlmaster ~ # /cluster/bin/system_snapshot
Current backup dir size is 2 GB

Checking to make sure snapshot directory does not exceed 1.5 times
the size of the primary filesystem, as specified in
/cluster/system_snapshots.conf configuration file.

Old snapshots will be removed as needed to create space. However, at
least 2 snapshots will always be retained and snapshots newer
than 86400 seconds will never be automatically removed.

Excluding databases found in /cluster/database_locations.
These are backed up separately by /cluster/bin/db_snapshot

Taking snapshot at Dec-11-2009--15hrs.59min
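
The retention rules printed in this output can be read as a small decision procedure. The following is only a sketch of that reading, not the actual system_snapshot code: the function name may_prune and its arguments are hypothetical, and the 1.5 factor is hard-coded here although the real tool reads it from /cluster/system_snapshots.conf.

```shell
# Hypothetical helper: decides whether the oldest snapshot may be pruned.
# It mirrors the rules in the sample output; it is not the real tool's logic.
# Args: snapshot_dir_kb  primary_fs_kb  snapshot_count  oldest_age_seconds
may_prune() {
    dir_kb=$1; fs_kb=$2; count=$3; age=$4
    min_keep=2       # at least 2 snapshots are always retained
    min_age=86400    # snapshots newer than this are never removed
    # Limit is 1.5 times the primary filesystem size (integer math: fs * 3 / 2)
    limit_kb=$(( fs_kb * 3 / 2 ))
    [ "$dir_kb" -gt "$limit_kb" ] || return 1   # under the limit: nothing to do
    [ "$count" -gt "$min_keep" ] || return 1    # would drop below 2 snapshots
    [ "$age" -ge "$min_age" ] || return 1       # oldest snapshot is too recent
    return 0
}
```

For example, may_prune 4000000 2000000 5 172800 succeeds (the directory is over the limit, five snapshots exist, and the oldest is two days old), while the same call with a count of 2 fails.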



Exclusions


The system_snapshot tool does not include common dynamic directories in snapshots. For example, /dev, /proc, /sys, and /mnt are not included. Therefore, changes to excluded file types or directories cannot be undone by restoring your system to a restore point. However, changes to these directories can usually be undone by rebooting your system. The full list of excluded files and file types is:

*/mnt/
var/lib/heartbeat/*
/var/lock/subsys/*
/dev
/sys
/proc
/tmp
*.pid
*.*$
*.swp
Note: Directories listed in /cluster/database_locations are also excluded. These are handled separately by the database snapshot tools.


 

Under normal circumstances, you will never need to recover any of these file types. However, note the list carefully so that you can avoid accidentally storing important files in any of those locations. Also note that the partition table is not backed up. This is intentional, so that changes to your partition structure (such as adding a disk) do not invalidate your pre-existing snapshots.
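
To check whether a file you care about sits in an excluded location, the list can be approximated with shell patterns. This is a rough sketch only: is_excluded is a hypothetical helper, plain shell globbing may not match the way system_snapshot interprets its patterns, the heartbeat entry is assumed to mean /var/lib/heartbeat, and the "*.*$" entry is left out because its exact meaning is not clear from the list. Directories listed in /cluster/database_locations are not covered either.

```shell
# Hypothetical helper: reports whether a path appears to fall under the
# exclusion list. Approximation only; see the assumptions stated above.
is_excluded() {
    case "$1" in
        /dev|/dev/*|/sys|/sys/*|/proc|/proc/*|/tmp|/tmp/*) return 0 ;;
        */mnt/*)                return 0 ;;
        /var/lib/heartbeat/*)   return 0 ;;   # leading slash is an assumption
        /var/lock/subsys/*)     return 0 ;;
        *.pid|*.swp)            return 0 ;;
    esac
    return 1
}
```

A call such as is_excluded /etc/my.cnf || echo "included in snapshots" gives a quick, if approximate, answer before you decide where to store a file.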

Automatically Creating Restore Points


System restore points can be created automatically by scheduling snapshots in a cron job. For example, one could implement this by running:
crontab -e
and adding a line like this:
1 1 * * * /cluster/bin/system_snapshot
In this example, cron runs system_snapshot at 1:01 AM every day. After a month, the system administrator would have 30 restore points to "roll back" to if needed. The restore points contain all files that have since been modified, including deleted files. Each restore point consists of a set of differences that are combined and compressed, which makes it very space efficient. Typically, 30 restore points will not consume much more space than a single restore point, but this varies with how quickly your data turns over.
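
If you would rather script the cron entry than edit it by hand, something like the following might work. It is a sketch, not part of the ClusterMaker tools: add_snapshot_entry is a hypothetical helper that reads an existing crontab on standard input and appends the snapshot line only when it is missing, so running it twice does not create a duplicate entry.

```shell
# The schedule line from the example above.
SNAPSHOT_ENTRY='1 1 * * * /cluster/bin/system_snapshot'

# Hypothetical helper: copies a crontab from stdin to stdout and appends
# the snapshot entry only if an identical line is not already present.
add_snapshot_entry() {
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! printf '%s\n' "$current" | grep -Fxq "$SNAPSHOT_ENTRY"; then
        printf '%s\n' "$SNAPSHOT_ENTRY"
    fi
}

# Example use (commented out so that sourcing this file changes nothing):
# crontab -l 2>/dev/null | add_snapshot_entry | crontab -
```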

 

Rolling back to a Restore Point


To roll back to a restore point, use the separate tool /cluster/bin/system_restore. Running it lists the available restore points and asks which, if any, you would like to revert to. Pressing Control-C exits without reverting to any restore point. After choosing a restore point, you must type 'yes' to confirm that you want to restore your server to an earlier time:
root@sqlmaster ~ # /cluster/bin/system_restore

Backups currently available:
Restore point 1: 12-11-2009 : at 15 hours, 25 min, and 29 seconds. ( 0 days, 0 hrs, 40 minutes ago)
Restore point 2: 12-11-2009 : at 15 hours, 59 min, and 38 seconds. ( 0 days, 0 hrs, 6 minutes ago)

The current server time is Fri Dec 11 16:05:58 EST 2009

   Any restore operation will exclude these databases:
   /var/lib/mysql
   /var/lib/mysql_recovery

   These are backed up separately by /cluster/bin/db_snapshot and
   restoring a system snapshot with this tool WILL NOT touch the data,
   but any configuration files in /etc/ (like /etc/my.cnf ) will be
   restored to their prior state.

 Enter the number of the restore point to recover to (or press CTRL-C to exit): 2

 ARE YOU SURE? THIS WILL REVERT THE FILESYSTEM TO THIS POINT IN TIME!

 Changes since then will be lost, unless a more recent restore point
 exists (if so, you can still roll forward).

To continue and revert the system to a prior point in time, type 'yes' :
Typing "yes" and pressing Enter then reverts your system to that point in time. This is an excellent way to recover from unwanted system changes.
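
Because the confirmation prompt notes that you can roll forward only if a more recent restore point exists, one cautious habit is to take a fresh snapshot immediately before restoring. The wrapper below is a sketch of that habit: safe_restore and the two override variables are hypothetical, and it assumes system_snapshot returns a non-zero exit status on failure, which this document does not confirm.

```shell
# Paths come from this document; the variables exist only so the wrapper
# can be tried with stand-in commands.
SNAPSHOT_BIN=${SNAPSHOT_BIN:-/cluster/bin/system_snapshot}
RESTORE_BIN=${RESTORE_BIN:-/cluster/bin/system_restore}

# Hypothetical wrapper: take a fresh restore point, then start the
# interactive restore only if the snapshot command succeeded.
safe_restore() {
    if ! "$SNAPSHOT_BIN"; then
        echo "snapshot failed; not starting restore" >&2
        return 1
    fi
    "$RESTORE_BIN"
}
```

Run safe_restore as root in place of calling system_restore directly; the restore prompt itself is unchanged.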
 


Copyright 2010    RapidScale Clusters, LLC