High-Availability Clustering in Java Applications
If It Can Go Wrong...
Because most developers are firm believers in Murphy's law, it behooves you to examine ways of making applications as "failure-proof" as possible. This leads to better "availability," in market-speak—in other words, it means an application that is up and operational as much of the time as is possible.
Much has been written and said about hardware clustering, and how to ensure that the underlying machinery is as difficult to kill as possible. RAID, redundant power supplies, backup routers—this is a topic worthy of several books unto itself.
So how do we go about making our applications as bullet-proof as possible? While we're at it, is there a way to make our whole cluster more failure resistant without having to throw too much money at specialized failure-resistant hardware?
Let's examine the ways in which a complex client-server application can fail, and see what can be done about them.
Let Me Count the Ways
Applications can fail by means of a fatal error on some input or event. In addition to trying to code around such things (for example, filtering input strings from a Web application to prevent malicious code from being entered as input), you can also take the simple precaution of having an application that auto-restarts. In other words, if the application should die entirely for some reason, an operating-system level script can simply fire it back up again immediately. Hopefully, the reason for the failure was a one-time thing, and you've subjected your users to little more than slight inconvenience and delay. Simple, but often effective.
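The restart logic itself is usually an OS-level script or service manager, but the idea can be sketched in-process. Below is a minimal Java supervisor loop (the `Supervisor` class and its failing task are hypothetical, for illustration only) that relaunches a task whenever it dies, giving up after too many consecutive failures:

```java
// Minimal in-process sketch of the auto-restart idea: a supervisor
// loop that relaunches a task whenever it dies. In production this
// role is usually played by an OS-level script or service manager.
public class Supervisor {
    public static int runWithRestarts(Runnable app, int maxRestarts) {
        int restarts = 0;
        while (true) {
            try {
                app.run();
                return restarts;            // clean exit: no restart needed
            } catch (RuntimeException e) {
                if (++restarts > maxRestarts) {
                    throw e;                // persistent failure: give up
                }
                System.err.println("App died (" + e.getMessage()
                        + "); restarting, attempt " + restarts);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical app that fails twice, then runs normally --
        // the "one-time thing" scenario described above.
        final int[] attempts = {0};
        int restarts = runWithRestarts(() -> {
            if (attempts[0]++ < 2) throw new RuntimeException("transient fault");
            System.out.println("App running normally");
        }, 5);
        System.out.println("Restarts needed: " + restarts);
    }
}
```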
An application also can fail as a result of losing access to some essential resource, such as a filesystem mounted on a network, a database, or other applications with which it communicates. In some cases, this resource is lost due to hardware failure. Smoke issues from the memory board, and our application goes down—not much we can do about it, at least on that one machine.
Spreading the Load
Of course, an application can be spread across multiple machines. This eliminates the single machine as a point of failure, right? Not quite. Most real-world applications must preserve state in some way, sometimes just for a login session, but often for longer. The most common way of doing this involves a database. If you have but one database machine in the cluster, you're back to having a single point of failure—the database itself.
Think for a minute then about how to eliminate these single failure points in a complex, real-world application. Take the example of a sophisticated Web application distributed in a cluster of inexpensive Unix-type servers. Ignore hardware for the moment, and assume that reasonable precautions have been taken to avoid the whole cluster being disabled at a stroke.
First, there's where the rubber meets the road: the Web server. Assume Tomcat for the sake of discussion. For scalability, say your Web application communicates via JMS to a back-end server process that does all the real "work" of the application, whatever that is. That server process in turn talks to a database. This serves to make failure at the Web server less likely, because the complex processing of the application is not done within the same process, but "farmed out" to another process running elsewhere on your cluster.
So far you've got a Web server, a database, a server process (call it your "application server," not to be confused with a J2EE application server such as JBoss or WebSphere), and a JMS broker in the mix.
If you only have one of any of these pieces, you have created a single point of failure. So, one by one, make them all clustered and distributed. First, the Web server: Tomcat can fairly easily be set up for clustered operation, where the user is routed to whichever server is available from a group of servers (detailed instructions on how to do this are available online from a number of sites). Well and good, but what about state? If you are using the session from the Web server to maintain state as requests go from the user to the back-end server process (via your JMS link, remember?), you must be sure that all servers in the cluster are aware of any changes to state. As it happens, JMS has a mechanism that is ideal for this: A message can be "published" to all interested "listeners," associating a certain user with a session, or with a record in a database, as needed. This way, when a user authenticates (for example) with one server in the cluster, then makes their next request to a different server in the cluster, they are already "recognized," and processing continues normally, the user being none the wiser about the switch.
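The publish/subscribe idea behind this session replication can be sketched without a real broker. In the hypothetical `SessionBus` below, a plain listener list stands in for a JMS topic: when one server learns something about a session, the change fans out to every server's copy of the session state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of publish/subscribe session replication. A real deployment
// would publish to a JMS topic; here a plain list of subscribers
// stands in for the broker.
public class SessionBus {
    private final List<Map<String, String>> servers = new ArrayList<>();

    // Each "server" in the cluster keeps its own copy of session state.
    public Map<String, String> addServer() {
        Map<String, String> sessions = new HashMap<>();
        servers.add(sessions);
        return sessions;
    }

    // Publishing a session change delivers it to every subscribed server.
    public void publish(String sessionId, String state) {
        for (Map<String, String> sessions : servers) {
            sessions.put(sessionId, state);
        }
    }
}
```

A user who authenticates through server A is then already "recognized" when the load balancer routes their next request to server B, because B received the same published update.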
Open-source solutions such as C-JDBC and HA-JDBC can take your single database and turn it into a cluster of databases. This eliminates the database as a single point of failure, and improves your application's scalability to boot. Both C-JDBC and HA-JDBC distribute "read" operations among the cluster of databases, directing the next read to the least-loaded database in the cluster (since every database holds the same data, any of them can answer). Writes, of course, are applied to every database, ensuring they remain synchronized.
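The read/write split these tools perform can be illustrated with a small sketch. The `ReadWriteRouter` below is hypothetical and uses an in-memory stand-in for a database; C-JDBC and HA-JDBC do this transparently behind the JDBC driver interface, but the routing rule is the same: reads go to the least-loaded replica, writes go to all of them.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the read/write split performed by database clustering
// middleware: reads are routed to one (least-loaded) replica, writes
// are applied to every replica so they stay identical.
public class ReadWriteRouter {
    static class Database {
        final List<String> rows = new ArrayList<>();
        int pendingReads = 0;               // crude stand-in for a load metric
    }

    private final List<Database> replicas;

    public ReadWriteRouter(List<Database> replicas) {
        this.replicas = replicas;
    }

    // Writes go to all replicas, keeping them synchronized.
    public void write(String row) {
        for (Database db : replicas) {
            db.rows.add(row);
        }
    }

    // Reads are directed to the least-loaded replica.
    public List<String> read() {
        Database least = replicas.get(0);
        for (Database db : replicas) {
            if (db.pendingReads < least.pendingReads) least = db;
        }
        least.pendingReads++;
        return least.rows;
    }
}
```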
Application Server Clustering
Because you've already decided on JMS as part of the mix, you also can load-share and fail-safe your application servers quite easily. When the Web application has a service request for your back-end application, it places a message on a JMS queue. Each of your application server processes listens on that same queue, and the first server to pick up the request processes it. While that server is processing, another request comes along—the next available server will pick it up—and so on until all of the servers in your cluster have a request to chew on. Then the first server gets its second request—an automatic load-balancing, courtesy of JMS. At the same time, each application server is subscribed to a topic that notifies it of any session changes for a specific client, keeping all servers "in the know" as far as client sessions, thus ensuring that it doesn't matter which server a particular request is processed by. What if one of your application servers should fail? No effect, other than a small reduction in the amount of load you can handle, because the JMS queue is simply serviced by the remaining servers. Need to increase scalability for more users? Put some additional servers on the queue—the number of requests you can handle goes up.
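This "competing consumers" pattern is easy to see in miniature. In the sketch below, a `BlockingQueue` stands in for the JMS queue and plain threads stand in for application server processes: the first free server always takes the next request, and adding threads is exactly the "put some additional servers on the queue" step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of JMS-style load balancing: several "server" threads compete
// to take requests off one shared queue, so whichever server is free
// picks up the next request.
public class CompetingConsumers {
    public static List<String> process(List<String> requests, int servers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(requests);
        List<String> results = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(servers);
        for (int i = 0; i < servers; i++) {
            final int id = i;
            pool.execute(() -> {
                String request;
                // Each server keeps taking work until the queue is empty.
                while ((request = queue.poll()) != null) {
                    results.add("server-" + id + " handled " + request);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return results;
    }
}
```

Note what happens if a "server" dies here: the remaining threads simply keep draining the queue, which is precisely the failure behavior described above.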
So, now you're in a situation where the database is no longer a single point of failure, the Web servers are clustered, and the application server process is clustered and fail-safe, not to mention highly scalable. Are you bulletproof yet? Not quite—what about the JMS broker? Only one broker can be active at a time, or there would be no way to ensure that session updates are distributed to all application servers. But there's nothing stopping you from setting up failovers for your JMS brokers—have multiple brokers running, and give each client (the Web server in your case) and each application server a list of all of them. If one fails, everybody switches to the next one on the list. This gives an administrator time to diagnose and restart the first broker, turning it into the "hot backup" for the cluster.
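The failover scheme amounts to walking a broker list until a connection succeeds. In the hypothetical sketch below, a set of live broker addresses stands in for the network; a real client would attempt an actual connection at each step (some brokers even build this in, such as ActiveMQ's failover: transport URI).

```java
import java.util.List;
import java.util.Set;

// Sketch of client-side broker failover: each client holds a list of
// broker addresses and tries them in order until one responds.
public class BrokerFailover {
    public static String connect(List<String> brokers, Set<String> live) {
        for (String broker : brokers) {
            if (live.contains(broker)) {   // stand-in for a successful connect
                return broker;
            }
            System.err.println(broker + " unreachable, trying next");
        }
        throw new IllegalStateException("no broker available");
    }
}
```

Once the administrator restarts the failed broker, it simply rejoins the list as the next fallback—the "hot backup" role described above.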
This is just a quick bird's-eye view of what a failure-resistant software cluster might look like in a certain deployment. All of the above can be done entirely with open-source software, but the same techniques can also be applied with commercial components. There are of course many variations and different approaches. The example you saw here is the approach taken by the Keel service framework, and has been deployed in a number of high-reliability environments. Hopefully, it will serve as food for thought as you create your own bullet-proof applications!