Heartbeat / Failover Configuration
Ping Nodes
The first thing we can adjust is the ping node and how it is used by the ipfail program. Initially, the software installer uses the machine's default gateway as the "ping node" for Heartbeat. This means that connectivity to this node determines which server is treated as alive when some kind of connectivity has been lost but both servers still "think" they are online. Let's look at a scenario to explain more fully.
ServerA has the floating IP address 10.0.0.99 in a high availability mysql server failover setup and is currently the master node. Say that ServerA's switch port or network card now dies, but ServerA is otherwise fully functional. ServerA will soon realize that it cannot communicate with ServerB over the ethernet broadcast channel (the exact time depends on the heartbeat keepalive and deadtime intervals). Likewise, ServerB will realize that it cannot communicate with ServerA over its ethernet broadcast channel.
Each node will now attempt to determine if the other node is alive and healthy by:
- Trying to reach the ping node
- Asking the other node if it was able to reach the ping node
To continue our scenario, ServerA will not be able to reach the ping node (the default gateway) because its switch port has failed. ServerB will reach the ping node with no problem. ServerB will then ask ServerA (via the serial port channel or ethernet crossover cable, or other secondary channel) if it too can reach the ping node. ServerA will respond that it cannot. In this case, ServerB will correctly deduce that ServerA is alive but has lost connectivity, and it will initiate a failover and take the IP address and the mysql daemon.
In this way, ipfail ensures that in the event of a network card or switch port failure, the two nodes can correctly deduce which node should have the IP address and services and fail them over. Without it, Heartbeat only watches for nodes that are alive or dead.
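The reasoning in the scenario above can be sketched as a small decision function. This is a toy model, not ipfail's actual code: the function name and its parameters are hypothetical, and the behaviour when both nodes see the same thing (no change of ownership) is an assumption made for illustration.

```python
def should_hold_resources(local_reaches_ping: bool,
                          peer_reaches_ping: bool,
                          currently_active: bool) -> bool:
    """Toy model of the connectivity comparison described in the text.

    NOT ipfail's source code; it only restates the scenario's reasoning.
    """
    if local_reaches_ping and not peer_reaches_ping:
        return True   # we have connectivity and the peer does not
    if peer_reaches_ping and not local_reaches_ping:
        return False  # the peer is better connected, so give up the resources
    # Assumption: when both nodes see the same thing, nothing changes hands.
    return currently_active


# The scenario from the text: ServerA is active and loses its switch port.
server_a = should_hold_resources(False, True, currently_active=True)
server_b = should_hold_resources(True, False, currently_active=False)
print(server_a, server_b)  # False True
```

ServerA gives up the floating IP address and ServerB takes it, which matches the outcome described above.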
We can improve on the default configuration by changing the ping node directive to a ping_group directive, and by specifying additional ping nodes in the group. There are two reasons for this. First, the default gateway may not be the most reliable ping node. If it is unreliable or very busy and occasionally fails to respond to pings, we may have unwanted cluster failovers as a result. Second, if the default gateway fails, it will cause ServerA and ServerB to think that they might be failing and they will chatter heavily trying to figure out what is going on.
Edit ha.cf configuration file
To change this, open /etc/ha.d/ha.cf and change one line. Find the line that says:
ping 192.168.1.1
That is the line that refers to the ping node, which is probably set to your default gateway. Choose two or three IP addresses that are reliable and always available in your environment, and change the line like this (substituting your IP addresses for the ones in the example):
ping_group always_up_nodes 192.168.1.1 192.168.1.5 192.168.1.9
Each IP address is separated by a space. This causes ipfail to try contacting each of these nodes when connectivity is in doubt, and if any of them is reached, the node is considered to have connectivity.
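A malformed address in the ping_group line is easy to miss, so it can help to check the line before reloading Heartbeat. The sketch below is a hypothetical helper, not part of Heartbeat: it assumes every group member is a literal IP address, and it models the "any member reached" rule with a stand-in ping function instead of sending real pings.

```python
import ipaddress
from typing import Callable, List, Tuple


def parse_ping_group(line: str) -> Tuple[str, List[str]]:
    """Split a ping_group line into (group name, member addresses)."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "ping_group":
        raise ValueError("expected: ping_group <name> <ip> [<ip> ...]")
    name, members = fields[1], fields[2:]
    for member in members:
        # Raises ValueError for an incomplete address such as "192.168.5".
        ipaddress.ip_address(member)
    return name, members


def group_reachable(members: List[str], ping: Callable[[str], bool]) -> bool:
    """The group counts as reachable if any single member answers."""
    return any(ping(ip) for ip in members)


name, members = parse_ping_group(
    "ping_group always_up_nodes 192.168.1.1 192.168.1.5 192.168.1.9")
# Hypothetical situation: only 192.168.1.9 answers.
print(name, group_reachable(members, lambda ip: ip == "192.168.1.9"))
```

With one member answering, the group is still considered reachable, which is why a single flaky ping node no longer triggers a failover.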
Reload Heartbeat
Issue this command and Heartbeat will re-read its configuration file:
service heartbeat reload
Now, a single ping node becoming unreliable for a brief period of time will not cause ServerA or ServerB to think they have lost network connectivity and consider initiating a failover event.
Check the other server
On the other server, make sure that /etc/ha.d/ha.cf is identical to the new one on the first server. By the time you can log in and see, the Watcher service should have already synchronized the ha.cf file and they should be identical. Also make sure that Heartbeat gets reloaded after the change is made.
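One way to compare the two files is to look only at the directives and ignore comment lines and spacing. The following is a minimal sketch under the assumption that comment lines in ha.cf start with "#"; the function name is hypothetical, and fetching the other server's file (for example by copying it over) is left out.

```python
def effective_directives(config_text: str) -> list:
    """Return the non-comment, non-blank lines with whitespace normalised."""
    directives = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        directives.append(" ".join(line.split()))
    return directives


# Hypothetical file contents from the two servers.
server1_cf = "# edited today\nping_group always_up_nodes 192.168.1.1 192.168.1.9\n"
server2_cf = "ping_group  always_up_nodes 192.168.1.1   192.168.1.9\n\n"
print(effective_directives(server1_cf) == effective_directives(server2_cf))
```

If the comparison is false, the Watcher service has probably not synchronized the file yet.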
Broadcast Device
Each server must have at least one redundant path to the other server. One path is generally ethernet, and the second is either ethernet or a serial cable. By default, ClusterMaker configures Heartbeat to use a serial cable. To use an ethernet crossover cable instead, configure the additional card on each server with an address on a new LAN.
For instance, if the primary (eth0) address is 192.168.75.10 and your network consists of nodes in the range 192.168.75.1 - 254, perhaps the additional card could have an address of 10.0.0.1 and the second card on the other server could have an address of 10.0.0.2. These cards will only talk to each other so the actual network is not very important as long as it's unique.
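Whether a candidate network for the crossover link is really "unique" can be checked with Python's standard ipaddress module. This is a small sketch; the /24 prefix lengths are assumptions, since the text gives only address ranges, and the function name is hypothetical.

```python
import ipaddress
from typing import Iterable


def is_unique_lan(candidate: str, existing: Iterable[str]) -> bool:
    """True if the candidate network overlaps none of the existing ones."""
    new_net = ipaddress.ip_network(candidate)
    return not any(new_net.overlaps(ipaddress.ip_network(net))
                   for net in existing)


# Example from the text, with assumed /24 masks.
print(is_unique_lan("10.0.0.0/24", ["192.168.75.0/24"]))         # True
print(is_unique_lan("192.168.75.128/25", ["192.168.75.0/24"]))   # False
```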
Additional steps that may be needed to add or enable a network card are outside the scope of this heartbeat reconfiguration howto, but once the card is configured make sure you can ping from one node to the other on the new interface:
[root@server1 /]# ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.110 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.133 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.093 ms
Reconfiguration
If pings succeed then heartbeat is ready to be reconfigured. On the master node, open /etc/ha.d/ha.cf. Instead of a broadcast address and a serial port, we will switch to multiple unicast directives. First, remove the lines that refer to the serial port and the line that says 'bcast eth0'.
Second, add lines to reference each ethernet address that heartbeat will use for communication, like this:
ucast eth0 192.168.175.91
ucast eth0 192.168.175.92
ucast eth1 10.0.0.1
ucast eth1 10.0.0.2
Of course, make sure the IP addresses are the ones in use on your pair of servers. Now issue this command on both servers and Heartbeat will re-read its configuration file:
service heartbeat reload
Heartbeat is now configured to use two ethernet communication channels instead of one ethernet plus a serial cable.
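The two editing steps above can also be expressed as a small text transformation, which is handy when the same change has to be made on both servers. This is a sketch, not a ClusterMaker tool: it assumes the serial port is configured through the serial and baud directives, it removes every bcast line rather than only 'bcast eth0', and the sample input values are hypothetical.

```python
from typing import Iterable, Tuple


def switch_to_ucast(config_text: str,
                    ucast_targets: Iterable[Tuple[str, str]]) -> str:
    """Drop serial/bcast lines and append one ucast line per target."""
    removed = ("serial", "baud", "bcast")  # assumed directive names
    kept = []
    for line in config_text.splitlines():
        fields = line.split()
        if fields and fields[0] in removed:
            continue  # step one: remove serial port and broadcast lines
        kept.append(line)
    # Step two: add a ucast line for each interface/address pair.
    kept.extend(f"ucast {iface} {ip}" for iface, ip in ucast_targets)
    return "\n".join(kept) + "\n"


# Hypothetical starting configuration.
old = "keepalive 2\nserial /dev/ttyS0\nbaud 19200\nbcast eth0\nping 192.168.1.1\n"
new = switch_to_ucast(old, [("eth0", "192.168.175.91"),
                            ("eth0", "192.168.175.92"),
                            ("eth1", "10.0.0.1"),
                            ("eth1", "10.0.0.2")])
print(new, end="")
```

Review the result before writing it back to /etc/ha.d/ha.cf, then reload Heartbeat on both servers as described above.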
The two channels should not share a single point of failure. For example, if the primary cards (eth0) are connected to a switch, then the secondary cards (eth1) should be connected via crossover cable. It is possible to connect via another switch, but then that switch should have a separate power source from the first so that they cannot reasonably be expected to both go down while the servers remain up. Caution: An incorrect setup may allow data loss and may not provide high availability.
Split Brain
"Split Brain" is the situation where both nodes are alive and believe the other is dead. In general it is to be avoided at all costs, and the occurrence of split brain is a plague amongst cluster systems. The result is usually that each server will write to the data concurrently, which then either corrupts it or causes it to diverge from one coherent set into two (possibly corrupted) different sets.
In shared disk architectures (including DRBD setups) using Red Hat Cluster Suite's GFS and Oracle's OCFS2, a split brain situation is the most dangerous. Each server writes directly to the disk, which can immediately corrupt the filesystem and could result in a total loss of data if some kind of fencing doesn't kick in fast enough.
For ClusterMaker this is less of an issue because of the "shared nothing" design. Each of the two servers maintains its own complete copy of the data, as opposed to sharing a single copy. Therefore it is architecturally impossible for both servers to write to the same disk at the same time. Furthermore, filesystem corruption cannot propagate from one node to another, as is the case with DRBD. However, it is still possible that a split brain could cause the two servers to have differing copies of the data. ClusterMaker tries to avoid this in three ways:
1. The active / passive nature of ClusterMaker instead of active / active: While it is possible to write to both servers concurrently (if the application running on ClusterMaker allows it), we choose not to by default. If all clients connect to one IP address and that IP address cannot exist on two nodes simultaneously, the risk of the two data sets diverging is greatly reduced.
2. Multiple heartbeat communication channels: In addition to the primary ethernet interface, ClusterMaker uses heartbeat over a serial or additional ethernet link, preferably a crossover cable. This ensures that a network failure cannot cause a split brain, because the nodes will still be able to communicate with each other and determine the health of the other node.
3. GlusterFS on top of ext3: GlusterFS is designed to write concurrently to multiple server nodes. Synchronous writes ensure that each server either receives and acknowledges the write, or is down. When the failed node comes back, GlusterFS can tell from the metadata on each file whether both nodes have identical copies. If they don't, the out-of-date copy on the previously failed node is automatically updated prior to the file being used. In addition, even if GlusterFS becomes inoperable, the data can be read directly from the underlying ext3 filesystem without any special tools.
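The recovery behaviour in the third point can be illustrated with a toy model. This is not how GlusterFS is implemented: a single version counter stands in for the per-file metadata mentioned above, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    version: int = 0
    data: str = ""
    up: bool = True


def write(replicas, data):
    """Apply a write to every replica that is up and bump its version."""
    for replica in replicas:
        if replica.up:
            replica.version += 1
            replica.data = data


def heal(replicas):
    """Copy the newest version over any replica that is behind."""
    newest = max(replicas, key=lambda r: r.version)
    for replica in replicas:
        if replica.version < newest.version:
            replica.version, replica.data = newest.version, newest.data


a, b = Replica(), Replica()
write([a, b], "v1")   # both nodes receive and acknowledge the write
b.up = False
write([a, b], "v2")   # node b is down and misses this write
b.up = True
heal([a, b])          # runs before the file is used again
print(a.data, b.data, a.version == b.version)  # v2 v2 True
```

The model covers only the case from the text, where one node was down. It does not model both nodes accepting different writes during a split brain, which is the situation the first two points are meant to prevent.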
This is how ClusterMaker provides the very highest levels of availability and recoverability.