Grids and Clusters Part 2: Clusters and their Applications

Would you like to...

Zvi Grauer

In Part 1 we reviewed the types and applications of computer grids, dynamic swarms of loosely interconnected computers that are used for problem solving. Now we focus our attention on computer clusters, collections of tightly integrated systems, set in close proximity to each other for fastest data transfer and communications. Clusters are usually deployed to improve performance over an individual server in a cost effective way.

What is a Cluster?

A computer cluster is a group of individual computers (nodes) that appear as a single entity and act cooperatively to improve performance. Each node is a separate system, with its own hardware, software and operating system, and while the cluster operates as one entity, (e.g., a cluster of web servers that appears as one large web server), each computer within the cluster can be accessed individually. The cluster components can be connected to each other through fast local area networks, or through the internet. The type of connection will depend on the amount of required communication between the components, since slow communication can lead to latency and lower overall speed and responsiveness.

Clusters improve performance and/or availability over a single computer more cost effectively than an upgrade in hardware to a mid-size server or a mainframe computer.

There are several different types of computer clusters, each with its own purpose and unique characteristics.

Cluster Types

1. Load Balancing Clusters

Load balancing clusters consist of a director - a server with an IP address and port - which is bound to several physical servers. Each Client request is received by the director, which selects a physical server and directs the request to it. In case a server fails, the load balancer continues to perform load balancing across the remaining servers that are up. In case of failure of all the servers, requests may be sent to a backup virtual server (if configured) or optionally redirected to a configured URL. For example, a page on a local or remote server which provides information on site maintenance or outages.

With load balancing, each node processes the requests it has been given by the cluster manager. The cluster manager attempts to distribute the workload evenly among all the systems. There is no communication between the nodes.

2. Fail-over Clusters

Fail-over is similar to load balancing. However, instead of requests being distributed among all the cluster nodes, the director sends all requests to one server for processing. Only when that server goes down will one of the other systems in the cluster take over.

3. High Availability Clusters

High availability (HA) clusters aim at improving the availability of services provided by members of the cluster. High availability can be achieved using either load balancing or fail-over. Typically, HA clusters contain a cluster manager and two or more redundant servers. The cluster manager monitors the availability of each server, accepts incoming requests for service, and redirects the requests to most available server. When a server is overloaded with work, the manager will forward requests to the other server(s). If a server fails, the manager will take action to revive the server (including alerting tech support). HA clusters usually have multiple redundancies built in, such as multiple cluster managers, redundant network gear and connections, and redundant storage systems.

High availability clusters are usually interconnected through a high speed local area network. They are not usually used for computations, grid or parallel computing. Their main objective is higher uptime.

Availability is often measured in percentage of uptime. A typical server may be up 99% of the time, whereas a system designed for high availability may be up 99.99% of the time. This is often referred to as "four nines" availability.

4. Beowulf Clusters

A Beowulf cluster is a group of (usually identical) PC computers, networked into a small TCP/IP LAN, and have libraries and programs installed which allow processing to be shared among them. Software must be revised to take advantage of the cluster. Specifically, it must perform multiple independent parallel operations that can be distributed among the available processors. Parallel processing libraries, such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machines) can be used to divide a task among a group of networked computers, and recollect the results of processing, leading to faster processing compared to an individual computer. Beowulf systems are used for scientific computing worldwide.

Clustering software, such as OpenMosix, promises that, once installed, "nodes in the cluster will start talking to one another and the cluster will adapt itself to the workload. Processes originating from any one node, if that node is too busy compared to others, can migrate to any other node...continuously attempts to optimize the resource allocation...creating a reliable, fast and cost-efficient clustering platform that is linearly scalable and adaptive. With Auto Discovery, a new node can be added while the cluster is running and the cluster will automatically begin to use the new resources".

Real Life Application: SQL Clusters

SQL Clusters exemplify the application of clustering technology to bring high availability, scalability and non-disruptive maintenance to the mission critical realm of database operations.

The benefits of SQL clustering include:

  • maintaining performance levels during peak loads, scheduled maintenance and system failures, through dynamic resource allocation
  • maximizing hardware utilization during periods of lower demand
  • reducing infrastructure costs by replacing expensive, high end equipment with low-cost servers while maintaining performance levels

Shared Everything vs. Shared Nothing Architectures

Two competing architectures co-exist in the field of multi-processor clusters: Shared Everything (SE) and Shared Nothing (SN). SE systems include shared memory (multiple processors share a common central memory), shared disk (multiple processors, each with private memory, share a common array of disks), or a combination. With SN, neither memory nor peripheral storage is shared among processors. These architectures apply to both multi-core computer systems, and clusters of separate servers. In the latter case, disk sharing is more common than memory sharing.

In SE SQL clustering, a single database is deployed on a shared disk array, and every node accesses it. With an SN system, each node has its own disk and database, each node controls part of the data, and database queries and commits require communication between nodes to locate the server controlling the relevant data. As evident from this comparison, shared nothing clustering is more complex. Its main advantage is the almost unlimited scalability.

Shared Nothing products include data warehousing applications from IBM (Balanced Configuration Unit), Greenplum (Greenplum Database), Netezza (Performance Server appliance), and Teradata (formerly NCR). Google is also known to be using a Shared Nothing architecture.

Shared Everything products include Adaptive Server® Enterprise (ASE) Cluster Edition, IBM DB/2 for z/OS, Oracle Real Application Cluster (RAC), and Microsoft SQL Server Failover Cluster. Since shared everything systems are more typical for small and medium businesses, we'll take a closer look at a typical SE systems.

Basic SE SQL Failover Clusters

The Active/Passive system is the basic SQL failover cluster. In the Active/Passive system, two or more servers create a cluster. All servers share a single disk array, containing the database. One server is designated as the primary server, and is actively connected to the database. The other server(s) wait for a failure in the primary server, and when such a failure is observed, takes over the database, and continues where the downed server stopped. The system is fault tolerant and highly available, but does not scale well - only one server at a time runs the database.

Nodes in the cluster usually share two separate communication systems. A private network (heartbeat) is used to monitor at regular intervals which nodes are up, and a public network lets the outside world access the cluster. The heartbeat constantly checks the physical server at the operating system level, and verifies that the SQL server is running properly. A failure to respond to the checks in a timely manner triggers the failover mechanism, which includes both an update of the database from the latest backup and the transactions log, and the transfer of database queries to the new primary server.

A more sophisticated failover system is the Active/Active cluster. A two-node Active/Active cluster contains two active/passive nodes, each containing two separate databases. One database is active on node A and passive on node B, and the other database is active on node B and passive on node A. Each node acts as a failover to the other.

Advanced SQL Failover Clusters

So far, our clusters consisted of at most two active servers. Larger clusters can have several topologies, each with its own costs and benefits characteristics.

When a cluster has many active nodes, there are several possible ways to provide failover. One topology, known as N to 1, sets one node as a failover to all the other nodes. So, if we have nodes A, B, C and D, D can act as failover for every one of the remaining nodes. If node A fails, node D will take over its place, ditto for node B or C. Once the failed node is revived, it becomes active again, and node D returns to standby mode. Clearly, this is a cost effective way of providing high availability, but performance suffers if multiple nodes fail at the same time. In addition, the failover node needs enough hardware capabilities to support the load of all N servers simultaneously.

In a similar topology, known as N+1, once the failed node is revived, it becomes the new standby (failover) node. In this case, if node D took over the activity of failed node A, D remains active, and A becomes the new passive (standby) node. This is also an Active/Passive system with several active nodes and one passive node.

An improved topology is the N+M cluster, which is still cost effective, but provides a higher level of performance in case of multiple node failures. In this topology, N active nodes use M failover nodes. Usually N is 5-6, and M is 2-3, or an eight node cluster with 5-6 active nodes and 2-3 standby nodes.

Another topology is the N to N failover configuration, which is used with Active/Active clusters having multiple nodes. If a node fails, its users and databases are distributed between the remaining active nodes. Oracle RAC provides such a topology.

Uses of Clusters

Clusters are common in datacenter environments. Many ISPs offer fail-over and high-availability server clusters to demanding customers. These clusters allow faster and easier scale-up of server performance, and improve the system availability. Corporate datacenters deploy clusters to cost-effectively achieve high computational speeds and reliability for high-load systems. Some of the world's fastest computers, especially in universities, consist of racks of networked computers running clustering software, for use in numerical modeling of complex systems. SQL clusters bring high availability to the realm of database operations.

Conclusion

Clusters and Grids are technologies that can be used to cost effectively improve computational speed and availability, to tap into capabilities offered by large companies such as Microsoft, IBM, Amazon and Google, or to offer web based applications using web services. It can also be used to divide the load of distributing large amounts of data. All of these technologies point into a future where computing becomes a distributed and cooperative process to the benefit of both consumers and providers.

mongoDB Tivoli
    Let's Chat