High Availability clusters
At the operating system level, two cluster architectures exist: High Performance clusters and High Availability clusters. This article describes High Availability clusters.
High availability clusters are groups of computers (the nodes in the cluster) that can fail over applications when one of the computers fails.
Cluster software
Special clustering software is needed to set up a high availability cluster. The most popular choices per operating system are:
- Microsoft Cluster Services (Windows)
- HP's Serviceguard (HP-UX and Linux)
- IBM's HACMP (AIX)
- Sun Cluster (Sun Solaris)
- Veritas Cluster Server
- OpenVMS cluster functionality
This software lets applications running on one node in the cluster fail over to another node as fast as possible. The software periodically (for instance every minute) checks whether the application on a node still works as expected. If the application fails, a failover is initiated: the application is stopped on the failed node (if that is still possible) and restarted on another node in the cluster.
The intention is to have minimal interruptions for the end-users, so they can continue to work as if nothing happened.
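The check-and-failover loop described above can be sketched in a few lines of Python. This is a minimal illustration only; all function and parameter names (check_application, stop_application, and so on) are assumptions made for the example, not part of any real cluster product.

```python
import time

def monitor(nodes, check_application, stop_application, start_application,
            interval=60, cycles=1):
    """Run the application on the first node; fail it over on error.

    check_application(node)  -> True if the application is healthy there
    stop_application(node)   -> best-effort stop on the failed node
    start_application(node)  -> restart on the standby node
    All four names are hypothetical, for illustration only.
    """
    active = nodes[0]
    for _ in range(cycles):
        if not check_application(active):
            # Failover: stop on the failed node if still possible...
            try:
                stop_application(active)
            except Exception:
                pass  # the failed node may be completely unreachable
            # ...then restart on another node in the cluster.
            standby = next(n for n in nodes if n != active)
            start_application(standby)
            active = standby
        time.sleep(interval)  # periodic check, e.g. every minute
    return active
```

Real cluster software adds much more (heartbeats between nodes, quorum, shared storage fencing), but the basic control loop follows this shape.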
Cluster-aware applications
The description above applies to cluster-unaware applications: the applications don't know they are running on a cluster. There are also cluster-aware applications.
An example of a cluster-aware application is Oracle RAC (Real Application Clusters). With RAC, the Oracle database runs on multiple nodes at the same time and can cope with node failures. End users will not notice that a node failed (though they might experience some reduced performance).
Testing
It is crucial to test high availability clusters regularly.
I have experience with a 2-node HP-UX Serviceguard cluster that was set up correctly once, but never tested afterwards. Everyone assumed the cluster would perform a correct failover in case of a node failure. But when, after some years, a node actually failed, the cluster did not function. The result was a considerable amount of downtime.
This could have been prevented if the cluster had been tested a few times per year.
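A failover drill boils down to deliberately failing the active node and verifying that the application comes back on the standby within an acceptable time. A minimal sketch of that check, with hypothetical hook functions (fail_node, where_is_app_running) standing in for whatever commands your cluster software provides:

```python
import time

def failover_drill(fail_node, where_is_app_running, expected_standby,
                   timeout=300, poll=5):
    """Fail the active node, then poll until the application is running
    on the expected standby node. Returns True if failover completed
    within `timeout` seconds. Hook functions are illustrative only.
    """
    fail_node()  # e.g. halt the node or kill the application process
    deadline = time.time() + timeout
    while time.time() < deadline:
        if where_is_app_running() == expected_standby:
            return True  # failover succeeded within the time budget
        time.sleep(poll)
    return False  # failover did not complete in time: investigate
```

Running a drill like this a few times per year, and treating a False result as an incident, catches broken cluster configurations before a real node failure does.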
This entry was posted on Friday 06 April 2007