Distributed Web-Applications using the Keel Meta-Framework, Page 2
So far, however, we're still talking about a deployment where every node (every server and every process) is a single point of failure. Any one of these servers going down takes our entire application with it - in fact, increasing the number of servers as we have so far actually makes the system slightly more fragile, since each additional piece of hardware and each new network connection is another potential point of failure.
Let's continue a little further, then, and see what happens if we increase the user load a bit more and start thinking about reliability and failover. We'll assume that breaking out the database to a separate server did part of the job for performance, but that we're now looking at a bottleneck in processing the actual business logic - the Keel business-logic (Model) classes are taking a lot of time, and we need to add processing power.
Our next step would be to separate the web application and the business-logic processing. The Keel server always interacts with the user-interface framework (in this case, Struts) via a communications layer. In a single-VM deployment - which is what we've been using so far - that layer is a simple in-memory queue channel. Several other options are available, though, including JMS, which is what we'll choose in our next example, shown below:
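The idea of a swappable communications layer can be sketched in a few lines. This is an illustrative sketch, not Keel's actual API - the MessageChannel interface and class names are invented here - but it shows how the UI tier can stay ignorant of whether its requests travel over an in-memory queue or JMS:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch (names invented, not Keel's API): the UI framework
// talks only to a channel interface; whether that channel is an in-memory
// queue or a JMS destination is decided by configuration, not code.
interface MessageChannel {
    void send(String request) throws InterruptedException;
    String receive() throws InterruptedException;
}

// The single-VM case: a simple in-memory queue channel.
class InMemoryChannel implements MessageChannel {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);
    public void send(String request) throws InterruptedException { queue.put(request); }
    public String receive() throws InterruptedException { return queue.take(); }
}

class ChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        // A JMS-backed implementation of MessageChannel could be substituted
        // here by configuration, with no change to the calling code.
        MessageChannel channel = new InMemoryChannel();
        channel.send("model.execute:listUsers");
        System.out.println(channel.receive());
    }
}
```

Because the caller holds only a MessageChannel reference, moving from the single-VM deployment to the JMS deployment is confined to whichever factory or configuration file constructs the channel.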
Second Breakout: Increase Business-logic Processing
In this deployment, we add one more component: an instance of the OpenJMS broker. We choose to run the broker on the same system as the Keel server, with its application logic and services, but it can in fact be anywhere on the network. We could have chosen several other communications layers here, including web services, depending on the needs of the deployment. Switching from one communications layer to another is purely a configuration-level change: no code is modified.
We can introduce a second business-logic layer server easily. As we see in our next diagram, the web application server now has the option to communicate with an OpenJMS broker on two different systems. (Again, JMS is just one way to do it). It will use one normally, but fail over to the second one in the event it loses the connection to the first. The same is true of the business logic layer, the Keel Servers. In our diagram you will see that we are now running two Keel server processes on each of two separate machines.
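The broker failover just described can be sketched as follows. The Broker interface and connector class below are hypothetical stand-ins for illustration, not Keel or JMS API; the point is simply that the connector walks its configured broker list in order and uses the first one it can reach:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of broker failover (not Keel or JMS API): try each
// configured broker in order and use the first reachable one.
interface Broker {
    String name();
    boolean isReachable();
}

class FailoverConnector {
    private final List<Broker> brokers;
    FailoverConnector(List<Broker> brokers) { this.brokers = brokers; }

    // In a real deployment this would re-run whenever the active JMS
    // connection is lost, moving traffic to the next configured broker.
    Broker connect() {
        for (Broker b : brokers) {
            if (b.isReachable()) {
                return b;
            }
        }
        throw new IllegalStateException("no JMS broker reachable");
    }
}
```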
Third Breakout: More Application Performance and Load Balancing
Each will respond to requests it finds on the JMS queue - if any one Keel server is busy handling requests, the next will pick up the request, and so forth. The queuing nature of JMS gives us a certain amount of built-in load balancing, without any additional effort, but we can go much further. Each Keel server also provides some failure protection - if we shut one of them down, or even both on one machine, the other servers simply take up the request load, and no failure is seen by the users (just perhaps a slight increase in response time). We can tune our Keel servers to be aware of their own load factors - the time they spend handling requests as a percentage of their total elapsed running time - and to "self-tune" as needed. A server that is getting very busy can be set to ignore new requests for a brief time, as opposed to spawning an ever-increasing number of new threads to handle additional simultaneous requests. This delay allows other servers listening to the same queue to pick up the slack. Let's say that a given server receives a particularly complex request that eats up a lot of processing time. Its load factor goes up, so it ignores new requests for a few seconds, allowing them to remain on the queue. Another server in the group picks up those new requests until the complex request is complete, the load factor drops, and the original server gets back in the game.
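The load-factor arithmetic is simple enough to sketch. The class below and its threshold value are illustrative, not Keel's actual tuning API: load factor is busy time divided by elapsed running time, and a server above its threshold declines new requests for a while:

```java
// Illustrative sketch of the load-factor calculation (not Keel's actual
// tuning API): busy time as a fraction of total elapsed running time.
class LoadMonitor {
    private final long startedAtMillis;
    private final double threshold; // e.g. 0.8 = back off above 80% busy
    private long busyMillis = 0;

    LoadMonitor(long nowMillis, double threshold) {
        this.startedAtMillis = nowMillis;
        this.threshold = threshold;
    }

    // Called after each request, with the time spent handling it.
    void recordWork(long millisSpent) { busyMillis += millisSpent; }

    double loadFactor(long nowMillis) {
        long elapsed = Math.max(1, nowMillis - startedAtMillis);
        return (double) busyMillis / elapsed;
    }

    // A busy server briefly ignores new requests, leaving them on the
    // JMS queue for the other Keel servers listening to it.
    boolean shouldAcceptRequests(long nowMillis) {
        return loadFactor(nowMillis) < threshold;
    }
}
```

With a threshold of 0.8, a server that has spent 900ms of its first second handling requests reports a load factor of 0.9 and stops taking new work until the ratio falls back below the threshold.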
Keel applications are usually written with application logic (model) classes that are single-threaded (like a stateless session bean). It is possible, however, that a user's session may require state to be maintained - at the very least their login identity. Keel, through Avalon, provides the concept of a per-user Context, roughly the server-side equivalent of the servlet Session. A separate dedicated JMS queue is used for Keel server instances to communicate with each other, sharing any changes made to the per-user context. As a new server comes online, it requests an initial set of context information from the queue, and one of the running servers provides it. As long as at least one server is always running, context is never lost. This allows such conveniences as upgrading a running application, by stopping one Keel server at a time, then restarting it with a new configuration and set of class files. We inch closer to our fully clustered fail-safe system, but we're not there yet.
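The context-sharing scheme can be sketched roughly as below. This is an invented illustration, not the Avalon Context API or Keel's actual replication code: each server holds a local copy of every user's context, applies change messages its peers publish on the dedicated queue, and a newly started server bootstraps itself with a snapshot from a running peer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Invented illustration (not the Avalon Context API): a local replica of
// all per-user contexts, kept in sync via change messages from peers.
class ContextReplica {
    private final Map<String, Map<String, String>> contexts = new ConcurrentHashMap<>();

    // Applied when a change message arrives on the context-sharing queue.
    void applyChange(String userId, String key, String value) {
        contexts.computeIfAbsent(userId, id -> new ConcurrentHashMap<>()).put(key, value);
    }

    // A newly started server requests a full snapshot from a running peer,
    // so context survives as long as at least one server is running.
    void initialiseFrom(ContextReplica peer) {
        peer.contexts.forEach((user, ctx) -> contexts.put(user, new ConcurrentHashMap<>(ctx)));
    }

    String get(String userId, String key) {
        Map<String, String> ctx = contexts.get(userId);
        return ctx == null ? null : ctx.get(key);
    }
}
```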
We still have two single points of failure, though: the front-end web server, running our Struts UI, and the PostgreSQL database server. What can we do about this? There are several solutions. First, we can replicate the PostgreSQL database, providing a "live" backup to the operational database that we can switch to quickly - although perhaps not automatically. This reduces the huge dependence on the single database server somewhat. For the web application server, we can use any of a number of load-balancing/failover systems to cluster its operation as well: the Balance project (www.inlab.de/balance.html) is one such solution, which will automatically distribute incoming requests among two or more web application servers. At this point we would also want to configure web application session sharing between the Tomcat servers, so that users who may get a different server from one request to the next will not lose any context being stored at the web application layer.
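For concreteness, a Balance front end of the kind just described might be started along the following lines. This is a configuration fragment, not something we've covered in detail: the hostnames and ports are placeholders, and the exact options should be checked against the Balance documentation for the installed version.

```shell
# Listen on port 80 and distribute incoming connections between two
# Tomcat instances (hostnames and ports are placeholders).
balance 80 tomcat1:8080 tomcat2:8080
```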
Fourth Breakout: UI Tier - Failover and Load Balancing
Now we've eliminated the single point of failure (although we will likely want at least two Balance servers, so that we don't simply move the single point of failure from Tomcat to Balance). Note that Balance might in fact be running on the same physical system as Tomcat and Struts - we show them separately just to illustrate the architecture more clearly. There are also, of course, other options to provide the IP address load sharing, including dedicated hardware solutions - these would work equally well in this scenario. Although the diagram shows an increasing number of communication lines between servers, not all of these are active at once, and network traffic between the servers is not excessive. Each UI connector is configured to be aware of all available JMS brokers, and to fail over from one to the next as needed - at any given moment, only one JMS broker is active.
We've still got that nagging single point of failure in the database itself. We can replicate, so that we have effectively a "hot backup" ready to go in the event the primary fails, but this does little to increase performance, and it is still not a fail-safe solution.