Data Replication with ClusterMaker
Overview
ClusterMaker has two ways to replicate data between servers in a high availability pair. The first way is with the Watcher service, our custom replication module that is designed to keep the master server and the backup server in sync.
Watcher is simple, powerful, and customizable so that we can handle whatever kind of Linux services you need to make highly available. It "watches" any specified directories for changes and replicates them immediately and efficiently to the other server. By default, it monitors almost the entire filesystem and is intelligent enough to copy all updates without overwriting critical system files that are unique to the destination server.
The second way we replicate data is with a synchronous real time mirror underneath the file system. This is appropriate for large binary files--like databases--where replicating the differences would otherwise be impractical or impossible with traditional utilities like rsync. It is similar in concept to DRBD (Distributed Replicated Block Device) but it works at the file level, not the block level.
One benefit from this approach is that if the filesystem on the block device becomes corrupted on one server, the corruption will not flow to the other server, as it would with DRBD. Another is that a dedicated partition does not have to be created for the files you wish to replicate. Simply tell the configuration wizard which directories should be included and they will start replicating immediately. More on the similarities and differences will be presented in the "Database Replication" section, but overall our design decision was to emphasize total flexibility and reliability.
The Watcher service
The Watcher service is very simple to configure. First, any directories that should be replicated in near real time to the other server are listed in /cluster/watch_list, one path per line. By default, most of the server is already listed here:
/cluster
/etc
/lib
/bin
/sbin
/opt
/share
/home
/var
/usr
/root
The effect of this is that anything that happens to the primary server is immediately replicated to the standby server, including system updates, program installations, etc. The integration is very tight.
You might want to modify the list if your application uses a custom path in the root of the filesystem, say /mydata. To include this directory in synchronization, add the path to /cluster/watch_list on a new line and restart the Watcher service with 'service watcher restart'. You can also add subdirectories to this file, in the form /mydata/foo/subdir.
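The edit described above can be scripted. The following is a minimal sketch, not part of ClusterMaker: the helper name add_watch_path is hypothetical, and it only assumes what the text states, namely that the watch list holds one path per line. It does not restart the Watcher service; that step still has to be run separately.

```python
from pathlib import Path


def add_watch_path(watch_list: Path, new_path: str) -> bool:
    """Append new_path to the watch list unless it is already listed.

    Hypothetical helper: assumes only the one-path-per-line format.
    Returns True if the file was changed, False if the path was present.
    """
    # Normalise a trailing slash so "/mydata/" and "/mydata" compare equal.
    new_path = new_path.rstrip("/") or "/"

    text = watch_list.read_text() if watch_list.exists() else ""
    existing = [line.strip() for line in text.splitlines()]
    if new_path in existing:
        return False

    # Make sure the new entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    watch_list.write_text(text + new_path + "\n")
    return True
```

On a real server the first argument would be Path("/cluster/watch_list"), followed by a restart of the Watcher service.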
Exclusions
Some types of files should not be replicated by the Watcher service, and these files can be excluded in two ways.
1. If they don't need to be or should not be replicated, they can be listed in an appropriate exclusion file. Several of these exist in /cluster, and they are all hidden files. They are:
.master2_sync_excludes
.master2_sync_etc_excludes
.master2_sync_root_excludes
.master2_sync_var_excludes
The file name refers to the directory the exclusions apply to. For instance, .master2_sync_root_excludes contains a list of directories and/or files that are excluded when Watcher replicates /root from serverA to serverB. Note that .master2_sync_excludes is applied to any synchronization that does not already use the exclude list for /root, /etc, or /var.
2. Some files, notably databases, are desirable to replicate but are large and constantly changing. These files should be replicated with our Database Replication tools and must be excluded from Watcher synchronizations. To do this, simply make sure the path to these files is listed in /cluster/database_locations. The Watcher service will never replicate a path listed in this file.
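The two exclusion rules can be summarized in a short sketch. This is an illustration of the rules as described here, not Watcher's actual code: the function names are hypothetical, and it assumes that subdirectories of /root, /etc, and /var use the same exclusion file as their parent.

```python
from pathlib import PurePosixPath

# Exclusion files named in the text, keyed by the directory they cover.
SPECIFIC_EXCLUDES = {
    "/root": "/cluster/.master2_sync_root_excludes",
    "/etc": "/cluster/.master2_sync_etc_excludes",
    "/var": "/cluster/.master2_sync_var_excludes",
}
# Used for every synchronization without a dedicated list.
DEFAULT_EXCLUDES = "/cluster/.master2_sync_excludes"


def _is_within(path: str, parent: str) -> bool:
    """True if path equals parent or lies somewhere beneath it."""
    p, q = PurePosixPath(path), PurePosixPath(parent)
    return p == q or q in p.parents


def exclude_file_for(watched_dir: str) -> str:
    """Pick the exclusion file that applies to a watched directory."""
    for top, exclude_file in SPECIFIC_EXCLUDES.items():
        if _is_within(watched_dir, top):  # assumption: subdirectories inherit
            return exclude_file
    return DEFAULT_EXCLUDES


def handled_by_database_replication(path: str, database_locations: list) -> bool:
    """True if path falls under an entry from /cluster/database_locations."""
    return any(_is_within(path, loc) for loc in database_locations)
```

For example, a file under /var/lib/mysql would be skipped by Watcher when /var/lib/mysql is listed in /cluster/database_locations.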
Summary
The Watcher service is ideal for monitoring large directory trees for small or infrequent changes. It can easily monitor an entire server and replicate updated files in near real time (within a second or two of when the original change is flushed to disk). It is the preferred method for keeping many small files in sync.
Database Replication Tools
The Watcher service is an excellent tool for small files, but it does not cope well with large, constantly changing binary files. These are most often encountered in a database. They are challenging to replicate effectively, and most databases have their own unique replication mechanisms. Some of them work well and some not so well; our tools are designed to bridge the gap and make database replication simpler and more universal.
Before setting up database replication, Point in Time Database Snapshots (PTDS) should be configured. PTDS does not require a high availability pair of servers and can be enabled immediately after installing ClusterMaker. This will allow the administrator to take live snapshots of the database at any time, and to roll back the data to a selection of restore points if needed.
Database Replication is installed by default when the administrator runs /cluster/bin/master-maker, which prepares a "backup master" by creating a clone of the original "master" server. Master-maker will ask a few questions, gather information, prepare a boot image for the backup server, and then ask if you would like to enable Database Replication. If you choose yes, it will ask for the location of the database files:
...
By default, MySQL data resides at /var/lib/mysql.
To replicate this location, type 'mysql' at the prompt.
If mysql is in a non-standard location or if you use
another type of database, type the directory path
that should be replicated.
[ contact support or see documentation on website ]
[ for details on changing this configuration later ]
Enter [ mysql | /some/other/path ]:
After entering a valid path, master-maker will attempt to reconfigure this portion of your filesystem so that any writes to the database files will be mirrored in real time to both the master server and the backup master. This is completely independent of and outside the database itself, and does not require any changes to the database configuration.
All data is written synchronously to the mirror: while both servers are online, a write is committed to both at the same time and does not succeed until it has completed on both. If the backup master is not available, the write still succeeds but is committed only to the online server. When the backup master next comes online, its out-of-date files are updated on first access.
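The write behaviour described above can be modelled with a toy example. This is a conceptual sketch only: the class MirrorSketch and its attributes are hypothetical, plain dictionaries stand in for the two servers' disks, and it does not reflect how the mirror is implemented internally.

```python
class MirrorSketch:
    """Toy model of a two-node synchronous file mirror (illustration only)."""

    def __init__(self):
        self.master = {}           # file name -> contents on the master
        self.backup = {}           # file name -> contents on the backup master
        self.backup_online = True
        self.stale = set()         # files the backup missed while offline

    def write(self, name, data):
        """Commit a write; it returns only after every online copy is updated."""
        self.master[name] = data
        if self.backup_online:
            self.backup[name] = data
            self.stale.discard(name)
        else:
            # Backup unavailable: the write still succeeds on the online server.
            self.stale.add(name)

    def read_on_backup(self, name):
        """Read from the backup, refreshing an out-of-date file on first access."""
        if name in self.stale:
            self.backup[name] = self.master[name]
            self.stale.discard(name)
        return self.backup[name]
```

In this model a write made while the backup is offline leaves the old contents on the backup until the file is next accessed there, at which point the current contents are copied over.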
This method contrasts with traditional asynchronous MySQL master-slave replication mechanisms which accept writes on the master server and then send a log of changes to the slave, which then updates its own copy of the database. That kind of asynchronous replication allows faster write performance, but can become problematic unless it is paired with additional monitoring tools and a significant amount of manual intervention. There are many errors that can stop the replication process, and replication itself does not actually ensure that the master and slaves have identical copies of the data.
Note: MySQL Table Types
MySQL versions before 5.5 use MyISAM tables by default. They are fast, but are not as resilient as InnoDB tables. Because they are more easily corrupted and more time consuming to repair, we strongly recommend the use of InnoDB or other transactional tables.
MyISAM tables under a heavy write load will frequently become corrupted in the event of a server crash. If this happens, even having the data running immediately on another server can be of little use if that data is corrupted! However, we understand that some customers prefer MyISAM tables even though InnoDB tables are designed to better survive a crash, or that your application may specifically require MyISAM tables.
For this reason we implement a heartbeat check that is activated on failover. It is a script called database_check that runs after MySQL is started on the new node. The script checks all databases for MyISAM tables and checks each of those tables for corruption. If it finds any corruption, it automatically attempts a repair. In a large database, these checks may take some time. If this extra downtime and possible data loss is unacceptable, we encourage you to switch to InnoDB tables, which are more reliable and faster to recover.
We conducted testing on a small 3.8 GB database consisting of 7 tables, or an average table size of about 543 MB, and a total of about 22 million rows. Using the sysbench load testing utility, we imposed a heavy write load on one particular 811 MB table and then crashed the server. On restart, the table was indeed corrupted, but the database_check utility restored the database to a consistent state in about 90 seconds. The average time to repair in our testing was about 2.5 GB per minute for MyISAM tables on a dual Xeon 2.8 GHz server with 2 GB memory.
The same test scenario using InnoDB tables resulted in just 9 seconds of additional start time for the MySQL daemon (25.3 GB per minute), which automatically checks and repairs InnoDB tables on startup. The bottom line is that InnoDB was 10x faster to recover and guarantees that transactions either complete 100% or not at all in failover scenarios. If you use MyISAM, make sure that you have good backups and use our db_snapshot tool frequently (hint: Cron!) because you may need one of these to recover after a hard failover.
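The recovery figures above can be reproduced with simple arithmetic. The sketch below assumes that both rates are measured over the whole 3.8 GB database rather than the single 811 MB table, and that 1 GB is counted as 1000 MB; these assumptions are inferred from the numbers and are not stated in the test description.

```python
def rate_gb_per_min(database_gb: float, seconds: float) -> float:
    """Recovery throughput in GB per minute."""
    return database_gb / seconds * 60


DATABASE_GB = 3.8
TABLES = 7

average_table_mb = DATABASE_GB * 1000 / TABLES   # assumes 1 GB = 1000 MB
myisam_rate = rate_gb_per_min(DATABASE_GB, 90)   # database_check took ~90 s
innodb_rate = rate_gb_per_min(DATABASE_GB, 9)    # InnoDB added ~9 s at startup

print(round(average_table_mb))           # 543
print(round(myisam_rate, 1))             # 2.5
print(round(innodb_rate, 1))             # 25.3
print(round(innodb_rate / myisam_rate))  # 10
```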
Configuration Files
Database Replication is essentially controlled by the GlusterFS configuration. The ClusterMaker setup routines will automatically create the required files and there is nothing to configure after this.
However, a more in-depth look at mirroring starts with the /etc/glusterfs/glusterfsd.vol configuration file. It exports two mounts to glusterfs clients: the root of the filesystem and the /cluster/shadow/db directory. The latter is where our setup routine moves your database files. From there they are exported by glusterfs and then remounted at their old location. In this way, any access to the database runs through the glusterfs layer first. Writes are synchronously replicated to both localhost and the other server, while reads are served from localhost only.
The next part of the configuration is a single line in /etc/fstab, where the mountpoint for the databases is listed. This causes the glusterfs client to mount the database location automatically at every system start. Of course, sometimes things go wrong and the mount does not work correctly. It may need to be remounted or fixed in some other way.
For this we have a custom Mon setup that checks all glusterfs mounts every 15 to 30 seconds. The configuration file is /cluster/mon/mon.cf; if desired, it can be changed to alter the timing of the checks, the number of notifications, and the recipients. Just restart Mon afterwards with 'service mon restart'.
All mount points are read dynamically from /etc/fstab, so Mon will always be current. If the regular Mon check determines that any mount is not working properly, a series of automated steps is taken that should result in a working mountpoint within a few seconds. By default Mon will also notify the cluster administrator, up to three consecutive times, with an email like this:
Summary output : GlusterFS monitor is reporting failures
Group : localhost
Service : gluster
Time noticed : Fri Mar 5 10:32:26 2010
Secs until next alert :
Members : localhost
Detailed text (if any) follows:
-------------------------------
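The way mount points can be discovered from /etc/fstab, as described above, is sketched below. This is an illustration rather than the actual Mon check script, which is not shown here: the function name is hypothetical, and it assumes the standard fstab field order (device, mount point, filesystem type, options) with glusterfs as the filesystem type.

```python
def glusterfs_mount_points(fstab_text: str) -> list:
    """Return the mount points of all glusterfs entries in fstab-style text."""
    mounts = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # fstab fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "glusterfs":
            mounts.append(fields[1])
    return mounts
```

A monitor built this way picks up new database mounts automatically, because the list is rebuilt from /etc/fstab on every check instead of being kept in a separate configuration file.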
The last configurable part of database replication is in the /etc/ha.d/haresources file. Some situations call for certain services to be started or stopped before other services, and a database is a typical example. If you are running an application that requires a MySQL back end, you will certainly want to start MySQL before the application. Make sure that the applications are listed in the correct order in the haresources file. Services start from left to right, and stop from right to left. A typical haresources file looks like this:
server1 IPaddr::192.168.85.20/24/eth0 MailTo::<admin-email-address>::Cluster_Failover_Event syslog watcher mon mysqld
In this example, when the backup server takes over from a failed master server it will first start syslog, then watcher, then mon, then mysqld.
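The ordering rule can be made concrete with a short sketch. It is an illustration only, not how Heartbeat parses the file: the function name is hypothetical, the e-mail address in the usage note is a placeholder, and the sketch assumes a single-line entry whose first field is the node name.

```python
def failover_order(haresources_line: str):
    """Split a haresources entry into node name, start order, and stop order."""
    node, *resources = haresources_line.split()
    start_order = resources                   # started left to right
    stop_order = list(reversed(resources))    # stopped right to left
    return node, start_order, stop_order
```

Given an entry such as "server1 IPaddr::192.168.85.20/24/eth0 MailTo::admin@example.com::Cluster_Failover_Event syslog watcher mon mysqld", the start order ends with syslog, watcher, mon, mysqld, and the stop order begins with mysqld, mon, watcher, syslog. An application that depends on MySQL would therefore be listed to the right of mysqld.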
DRBD
Many system administrators are familiar with DRBD, which is an excellent project and a time-honored favorite in Linux high availability. For this reason, it might be helpful to briefly review the similarities and differences between their approach and ours.
Synchronous writes are guaranteed in both systems, and they have much in common as far as a Sysadmin running a database is concerned. At failover time, the backup server will start MySQL and present its mirror copy of the data. And as previously stated, DRBD operates at the block level and our software operates at the file level using GlusterFS. One nice side effect of this is that you don't need to create a separate partition or logical volume just for your highly available data. Just point our setup utility at your database directory and it can remain a part of your original filesystem.
But where we differ most is that DRBD is highly specialized for mirroring data. On the other hand, we are highly specialized at mirroring servers and creating a cluster environment. ClusterMaker is essentially a complex wrapper around many subsystems that combine to provide several Linux high availability technologies all at once. We include Point in Time Snapshots (for system state and data) with rollback, data mirroring, IP failover, IP load balancing, real time cluster monitoring, and rapid point and click scale out onto additional hardware for applications like Apache and Tomcat.
So if you want a proven, low-overhead method for mirroring data, DRBD is a great choice and more appropriate than ClusterMaker. But if you want the whole server to be mirrored, such that changes to *any* files get replicated, then DRBD will generally not do this for you. Some system files cannot be replicated without breaking the destination server, and deciding which ones to replicate and when to replicate them is complicated. We also distinguish between the master server, the backup server, and diskless PXE boot worker nodes, which have no local data at all but boot directly into the shared root cluster filesystem on the master to access applications.
As such, ClusterMaker is intended to be more of a holistic platform for general purpose Linux HA clusters than a specific technology which is focused tightly on one task. It cannot be compared to DRBD on a point for point basis; they are very different in purpose.