Managing Node Failures

Node failures are an unfortunate reality of any computer system; in a Scyld ClusterWare cluster they are inevitable, though hopefully rare. Various strategies and techniques are available to lessen their impact.

Protecting an Application from Node Failure

There is only one good solution for protecting your application from node failure, and that is checkpointing. With checkpointing, the application periodically writes its progress to disk; at startup it reads that file, if present, and resumes from the point at which the file was last written.

The checkpointing approach that gives the best chance of recovery is to send the checkpoint data back to the master node and write it there, and also to make regular backups of the checkpoint files on the master node.

When setting up checkpointing, think carefully about how often you want to checkpoint. Jobs with little data to save can checkpoint as often as every 5 minutes, whereas jobs with large data sets may be better served by checkpointing every hour, day, week, or even less frequently. The right interval depends heavily on your application. If a checkpoint involves a lot of data, checkpointing too frequently will drastically increase your run time. On the other hand, if you checkpoint only once every two days, make sure you can live with losing two days' worth of work if there is ever a problem.
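
As a minimal sketch of the checkpoint-and-resume pattern, the following shell script records the last completed step in a file and resumes from that point after a restart. The file name, step count, and run_one_step function are hypothetical stand-ins for your application's real state and work:

    #!/bin/bash
    # Minimal checkpoint/restart sketch: work is a numbered series of steps,
    # and the checkpoint file records how many steps have completed.
    CHECKPOINT=checkpoint.dat      # ideally on a file system backed up on the master node
    TOTAL_STEPS=1000

    run_one_step() {
        # Placeholder for the application's real work on step $1.
        sleep 1
    }

    # On startup, resume from the last recorded step, or start from step 0.
    if [ -f "$CHECKPOINT" ]; then
        start=$(cat "$CHECKPOINT")
    else
        start=0
    fi

    for (( step=start; step<TOTAL_STEPS; step++ )); do
        run_one_step "$step"
        if (( (step + 1) % 100 == 0 )); then
            # Write the checkpoint atomically so a crash mid-write cannot corrupt it.
            echo $((step + 1)) > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
        fi
    done

With this pattern, a node failure costs at most the work done since the last checkpoint (here, up to 100 steps), which is simply redone on restart.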

Compute Node Failure

A compute node can fail for any of a variety of reasons, e.g., broken node hardware, a broken network, software bugs, or inadequate hardware resources. A common example of the latter is a condition known as Out Of Memory, or OOM, which occurs when one or more applications on the node have consumed all available RAM and no swap space remains. The Linux kernel detects an OOM condition, attempts to report what is happening to the cluster's syslog server, and begins to kill processes on the node in an attempt to eliminate the process that is triggering the problem. While this kernel response may occasionally be successful, more commonly it will kill one or more processes that are important for proper node behavior (e.g., a job manager daemon, the crucial Scyld bpslave daemon, or even a daemon that is required for the kernel's syslog messages to reach the cluster's syslog server). When that happens, the node may remain up in a technical sense, but it is useless and must be rebooted.
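
If a node dies or hangs and you suspect OOM, any OOM-killer messages that reached the master before the node failed can be found in the master's syslog. For example (the log path shown is the usual RHEL/CentOS default and may differ on your system):

    # Look for kernel OOM-killer activity forwarded from compute nodes.
    grep -i "out of memory" /var/log/messages
    grep -i "oom-killer" /var/log/messages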

When Compute Nodes Fail

When a compute node fails, all jobs running on that node will fail. If an MPI job was using that node, the entire job will fail on all of the nodes on which the MPI program was running.

Although the jobs running on the failed node are lost, jobs running on other nodes that were not communicating with jobs on the failed node will continue to run without a problem.

If the problem with the node is easily fixed and you want to bring the node back into the cluster, then you can try to reboot it using bpctl -S nodenumber -R. If the compute node has failed in a more catastrophic way, then such a graceful reboot will not work, and you will need to powercycle or manually reset the hardware. When the node returns to the up state, new jobs can be spawned that will use it.
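
For example, to ask node 5 (an arbitrary node number) for a graceful reboot and then watch for it to return to the up state:

    # Request a graceful reboot of node 5.
    bpctl -S 5 -R

    # Watch the node state table until node 5 reports "up" again.
    watch bpstat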

If you wish to switch out the node for a new physical machine, then you must replace the broken node's MAC addresses with the new machine's MAC addresses. When you boot the new machine, it either appears as a new cluster node that is appended to the end of the list of nodes (if the config file says nodeassign append and there is room for new nodes), or else the node's MAC addresses get written to the /var/beowulf/unknown_addresses file. Alternatively, manually edit the config to change the MAC addresses of the broken node to the MAC addresses of the new machine, followed by the command systemctl reload clusterware. Reboot this node, or use IPMI to powercycle it, and the new machine reboots in the correct node order.
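
For example, assuming the broken node is node 5, the replacement machine's MAC address is 00:25:90:aa:bb:cc, and its BMC answers at 10.1.1.105 (all hypothetical values, and the config path shown is the conventional ClusterWare location), the replacement workflow might look like this:

    # Edit the cluster config file and replace node 5's old MAC address(es)
    # with the new machine's MAC address(es).
    vi /etc/beowulf/config

    # Tell ClusterWare to re-read the config.
    systemctl reload clusterware

    # Power-cycle the new machine, here via IPMI.
    ipmitool -H 10.1.1.105 -U admin -P password chassis power cycle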

Compute Node Data

What happens to data on a compute node after the node goes down depends on how you have set up the file system on the node. If you are using only a RAMdisk on your compute nodes, then all data stored on a compute node is lost when that node goes down.

If you are using the hard drives on your compute nodes, there are a few more variables to take into account. If your cluster is configured to run mke2fs on every compute node boot, then all data that was stored on ext2 file systems on the compute nodes will be destroyed. If mke2fs does not execute, then fsck will try to recover the ext2 file systems; however, there is no guarantee that the file systems will be recoverable.

Note that even if fsck is able to recover the file system, files that were being written at the moment of node failure may be left in a corrupt or inconsistent state.
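
To see which category your compute node data falls into, inspect the cluster-wide fstab for the compute nodes: entries of type tmpfs live only in RAM and are lost on failure, local ext2 partitions may survive subject to the mke2fs and fsck caveats above, and NFS mounts live on the file server rather than on the node.

    # Review how compute nodes mount their file systems.
    cat /etc/beowulf/fstab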

Master Node Failure

A master node can fail for the same reasons a compute node can fail, i.e., hardware faults or software faults. An Out-Of-Memory condition is rarer on a master node because the master node is typically configured with more physical RAM and more swap space, and is less commonly a participant in user application execution than a compute node is. However, in a Scyld ClusterWare cluster the master node plays an important role in the centralized management of the cluster, so the loss of a master node for any reason has more severe consequences than the loss of a single compute node. One common strategy for reducing the impact of a master node failure is to employ multiple master nodes in the cluster. See Managing Multiple Master Nodes for details.

Another mitigating strategy is to enable Run-to-Completion. If the bpslave daemon that runs on each compute node detects that its master node has become unresponsive, then the compute node becomes an orphan. What happens next depends upon whether or not the compute nodes have been configured for Run-to-Completion.

When Master Nodes Fail - Without Run-to-Completion

The default behavior of an orphaned bpslave is to initiate a reboot. All currently executing jobs on the compute node will therefore fail. The reboot generates a new DHCP request and a PXEboot. If multiple master nodes are available, then eventually one master node will respond. The compute node reconnects to this master - perhaps the same master that failed and has itself restarted, or perhaps a different master - and the compute node will be available to accept new jobs.

Currently, Scyld only offers Cold Re-parenting of a compute node, in which a compute node must perform a full reboot in order to "fail-over" and reconnect to a master. See Managing Multiple Master Nodes for details.

When Master Nodes Fail - With Run-to-Completion

You can enable Run-to-Completion by enabling the ClusterWare script with beochkconfig 85run2complete on. When enabled, if the compute node becomes orphaned because its bpslave daemon has lost contact with its master node's bpmaster daemon, then the compute node does not immediately reboot. Instead, the bpslave daemon keeps the node up and running as best it can without the cooperation of an active master node. Ideally, most or all jobs running on that compute node will continue to execute until they complete, or until they require some external resource that causes them to hang indefinitely.
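
For example, to turn the feature on from the master node (the script, like other ClusterWare node scripts, takes effect on a compute node when that node boots):

    # Enable the Run-to-Completion node script.
    beochkconfig 85run2complete on

    # Reboot the compute nodes when convenient so that the script takes effect.
    bpctl -S all -R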

Run-to-Completion enjoys greatest success when the private cluster network uses file server(s) that require no involvement of any compute node's active master node. In particular, this means not using the master node as an NFS server, and not using a file server that is accessed using IP-forwarding through the master node. Otherwise, an unresponsive master also means an unresponsive file server, and that circumstance is often fatal to a job. Keep in mind that the default /etc/beowulf/fstab uses $MASTER as the NFS server. You should edit /etc/beowulf/fstab to change $MASTER to the IP address of the dedicated (and hopefully long-lived) non-master NFS server.
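
For example, a one-line edit can repoint the NFS mounts at a dedicated server; the 10.54.0.20 address and the mount options shown in the comments are hypothetical, and you should review the resulting file before rebooting the compute nodes:

    # Replace $MASTER with a dedicated NFS server's IP address in the node fstab.
    # A line such as:
    #     $MASTER:/home  /home  nfs  nolock  0 0
    # becomes:
    #     10.54.0.20:/home  /home  nfs  nolock  0 0
    sed -i 's/\$MASTER/10.54.0.20/g' /etc/beowulf/fstab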

Stopping or restarting the clusterware service, or simply rebooting the compute nodes with bpctl -S all -R, will not put the compute nodes into an orphan state. These actions instruct each compute node to perform an immediate graceful shutdown and to restart with a PXEboot request to its active master node. Similarly, rebooting the master node stops the service with systemctl stop clusterware as part of the master shutdown, and the compute nodes immediately reboot and attempt to PXEboot before the master node has fully rebooted and is ready to service them. This is a problem unless another master node on the private cluster network will respond to the PXEboot request, unless the nodes' BIOS has been configured to perpetually retry the PXEboot, or unless you explicitly force all the compute nodes to become orphans prior to rebooting the master with bpctl -S all -O, thereby delaying the nodes' reboots until the master has had time to reboot.
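
For example, to take the master down for maintenance without forcing the compute nodes to reboot out from under their running jobs:

    # Explicitly orphan all compute nodes; with Run-to-Completion enabled they
    # keep running their jobs rather than rebooting immediately.
    bpctl -S all -O

    # Now the master can be rebooted.
    reboot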

Once a compute node has become orphaned, it can only rejoin the cluster by rebooting, i.e., a so-called Cold Re-parenting. There are two modes that bpslave can employ:

  1. No automatic reboot. The cluster administrator must reboot each orphaned node using IPMI or by manually powercycling the server(s).
  2. Reboot the node after being "effectively idle" for a span of N seconds. This is the default mode. The default N is 300 seconds, and the default "effectively idle" is cpu usage below 1% of one cpu's available cpu cycles.

Edit the 85run2complete script to change the defaults. Alternatively, the bpctl command can set (or reset) the run-to-completion modes and values. See man bpctl.

The term "effectively idle" means a condition wherein the cpu usage on the compute node is so small as to be interpreted as insignificant, e.g., attributed to various daemons such as bpslave, sendstats, and pbs_mom, which periodically awaken, check fruitlessly for pending work, and quickly go back to sleep. An orphaned node's bpslave periodically computes cpu usage across short time intervals. If the cpu usage is below a threshold percentage P (default 1%) of one cpu's total available cpu cycles, then the node is deemed "effectively idle" across that short time interval. If and when the "effectively idle" condition persists for the full N seconds time span (default 300 seconds), then the node reboots. If the cpu usage exceeds that threshold percentage during any one of those short time intervals, then the time-until-reboot is reset back to the full N seconds.
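
The following shell sketch illustrates the same idle-detection logic. It is not the actual bpslave implementation, just an illustration of the algorithm using the default N and P described above, and it assumes the common USER_HZ value of 100 jiffies per second:

    #!/bin/bash
    # Illustration only: detect when cpu usage stays below P% of one cpu
    # for N consecutive seconds, the point at which an orphaned bpslave
    # would reboot the node.
    N=300          # required span of "effectively idle" time, in seconds
    P=1            # idle threshold, as a percent of one cpu's cycles
    INTERVAL=5     # length of each sampling interval, in seconds
    idle_time=0

    read_busy() {
        # Sum the non-idle jiffies (user, nice, system, irq, softirq)
        # from the aggregate "cpu" line of /proc/stat.
        awk '/^cpu /{print $2+$3+$4+$7+$8}' /proc/stat
    }

    while true; do
        before=$(read_busy)
        sleep "$INTERVAL"
        after=$(read_busy)
        # With 100 jiffies per second, busy jiffies / seconds = percent of one cpu.
        busy_pct=$(( (after - before) / INTERVAL ))
        if [ "$busy_pct" -lt "$P" ]; then
            idle_time=$(( idle_time + INTERVAL ))   # still effectively idle
        else
            idle_time=0                             # any real activity resets the countdown
        fi
        if [ "$idle_time" -ge "$N" ]; then
            echo "effectively idle for $N seconds; bpslave would reboot the node now"
            break
        fi
    done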

If the cluster uses TORQUE as a job manager, Run-to-Completion works best if TORQUE is configured for High Availability. Refer to PBS TORQUE documentation for details.

RHEL/CentOS 7 has deprecated the earlier Heartbeat software in favor of Corosync.