By Roland Kuhn and Jamie Allen, Reactive Design Patterns.
Imagine a startup that needs to raise money in the venture capital (VC) world. It can be very easy to build a prototype by using frameworks that support rapid development, and it may also be very easy to find developers skilled in creating applications with such tools. The company builds the prototype and presents it to VC firms in the hope of raising money. If they are successful, they need to go into production, so they quickly turn their prototype application into a production deployment by leveraging a cloud “Platform as a Service” (PaaS) provider. Then they begin to market their capabilities so that customers can find them.
However, not all frameworks are created equal from a scalability perspective, and some may have limitations that make it difficult to handle many concurrent connections. There may be some workarounds, but in the end, they may not be enough for an application that will scale as the new company’s user base grows.
As a result, they are forced to add new virtual machines on their hosting service to handle all of the customer requests that are now flowing in. The scalability might be theoretically linear (where capacity grows in proportion to the number of servers; for example, 2 servers double your capacity, 3 servers triple it, and so on). However, this assumes that the hosting provider routes requests to server instances in a fashion that distributes load equally across all servers, and not all cloud platforms will do that for you.
Some providers may distribute load in a random fashion instead of a round robin strategy (where each server receives a request in order before starting at the first server again), which means that merely adding new servers does not mean that they will assist in handling the additional load. As a result, this new company is forced to add even more virtual machines to host the server application than would be necessary if distribution were even, based on the probability of load distribution using the request routing algorithm of the cloud provider.
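To make the difference between the two routing strategies concrete, here is a minimal Python sketch. It is a toy simulation, not the routing algorithm of any particular cloud provider; the function names, the seed, and the figures of 10 servers and 100 requests are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def round_robin(num_servers, num_requests):
    """Assign request i to server (i mod num_servers), in strict rotation."""
    return Counter(i % num_servers for i in range(num_requests))


def random_routing(num_servers, num_requests, seed=0):
    """Assign each request to a uniformly random server (seeded for repeatability)."""
    rng = random.Random(seed)
    return Counter(rng.randrange(num_servers) for _ in range(num_requests))


def busiest(load, num_servers):
    """Largest number of requests that landed on any single server."""
    return max(load.get(server, 0) for server in range(num_servers))


if __name__ == "__main__":
    servers, requests = 10, 100
    print("round robin, busiest server:", busiest(round_robin(servers, requests), servers))
    print("random,      busiest server:", busiest(random_routing(servers, requests), servers))
```

With round robin, every server in this sketch receives exactly 10 of the 100 requests. With random routing, the busiest server can never receive fewer than the average of 10, and in practice it usually receives more while other servers sit partly idle. If each server can only handle a small number of concurrent requests, it is the busiest server, not the average one, that determines how many machines you need.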
At this point, the new company finds itself running hundreds of virtual machines on the cloud platform. Their theoretical maximum number of concurrent connections that can be handled by this armada of servers is limited to a small multiple of the number of servers due to the limitations of the implementing framework, and may be even less given the variability of the concurrency factors of the tooling and the routing distribution strategy mandated by the cloud provider.
The cost of running so many machines is non-trivial. In today’s dollars, an individual virtual machine instance on such a Platform as a Service hosting provider may cost as much as US$100 per month. For 250 such virtual machines, the cost is US$25,000 per month, or US$300,000 per year! That is a very heavy “burn” rate (the rate at which a startup spends its precious initial investment money) for any startup to withstand.
Such an application might also not be fault tolerant. There may be no way through this platform to coordinate servers without using an external tool, nor does the framework provide constructs that help you manage failure within the logic of the application.
So, this new startup is paying too much money while attempting to scale an application in the face of comparatively low traffic while utilizing an architecture that does not have core constructs that support resilience.
Worse, what if the language and framework used to implement the service are also very slow at runtime? One contributing factor may be a dynamic type system (where types of values are not declared in the source code and the runtime must determine the type of each value when it is evaluated), which is unlikely to execute as fast as a statically typed language whose compiler has already proven the existence of methods and type compatibility. There are other hidden costs to consider, such as the environmental impact of this approach: you are using more virtual machines to handle a relatively small amount of traffic, which requires more power in the data center.
The startup discussed earlier did not choose tooling that allowed them to build a Reactive application, and they pay the costs associated with that decision every single day. Now that we have an understanding of the impact of choosing a specific tool with which to build our application, we can begin to think about the programming constructs that will be most supportive to building Reactive systems. In our book, Reactive Design Patterns, several technologies are presented to describe tools and strategies that can alleviate the costs that this startup now faces. For now, it’s important to look back at the history of Reactive solutions.
Early Reactive Solutions
The startup example above is not contrived; there are companies suffering through these very issues right now. For such a company to lower its costs and be more responsive to its users, it needs to migrate its application away from the inefficient language and framework to ones that support those goals, much as Twitter did when re-architecting its application away from Ruby on Rails to a more scalable platform.
Over the past 30 years, many tools and paradigms have been defined to help us build applications that fit the Reactive paradigm. One of the oldest and most notable is the Erlang language, created by Joe Armstrong and his team at Ericsson in the mid-1980s. Erlang was the first language that leveraged Actors to gain mainstream popularity.
Joe and his team faced a daunting challenge – to build a language that would support the creation of applications that would be deployed in a distributed environment and be incredibly resistant to failure. The language evolved in the Ericsson laboratory over time, culminating with the usage of Erlang to build the AXD301 switch in the late 1990s, which reportedly achieved “nine 9s” of uptime. Consider exactly what that means. “Nine 9s” of uptime is the equivalent to saying that an application will be available 99.9999999% of the time. For a single application running on a single machine, there would be roughly 3 seconds of downtime in 100 years!
100 years * 365 days/year * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 3,153,600,000 seconds
3,153,600,000 seconds * 0.000000001 expected downtime = 3.1536 seconds of downtime in 100 years
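The same arithmetic can be sketched in a few lines of Python, which also makes it easy to compare other availability levels. The function name and parameters are hypothetical, and the calculation ignores leap years, just as the figures above do. “Nine 9s” (99.9999999%) corresponds to an unavailable fraction of one in a billion, that is, 1/10^9.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_seconds(nines, years):
    """Expected downtime in seconds for an availability with `nines` nines.

    For example, nines=3 means 99.9% availability, so the service is
    unavailable for 1/10**3 of the total time.
    """
    total_seconds = years * SECONDS_PER_YEAR
    return total_seconds / 10 ** nines


if __name__ == "__main__":
    for nines in (3, 5, 9):
        print(nines, "nines over 100 years:", downtime_seconds(nines, 100), "seconds")
```

For nine 9s over 100 years this gives about 3.15 seconds, matching the calculation above, whereas three 9s over the same period gives 3,153,600 seconds, or roughly 36.5 days.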
Of course, such uptime of an application running on a single box is purely theoretical; as of the writing of this article, no application could possibly have been running continuously on a machine longer than modern computers have existed. The actual study upon which this claim was based was performed by British Telecom from 2002 through 2003 and involved 14 nodes and a calculation via five node-years of study. Such approximations of application downtime depend as much on the hardware as on the application itself, as even the most resilient application will not be fault tolerant if it is running on unreliable computers. But such theoretical uptime of the application and the resulting effect on the fault tolerance dimension of Reactive is highly desirable. Amazingly, no other language or platform has made similar claims since the release of this product.
At the same time, Erlang was created as a language with dynamic types, and it copies message data for every message it passes between actors. The data has to be copied because there is no shared heap space between two actor processes in the BEAM VM. This means that data sent between actors must be copied into a new memory space on the receiving actor process’s heap as part of sending the message, to guarantee isolation between the actors and prevent concurrent access to the data being passed.
Figure 1: Illustration of data in the sending Erlang actor’s heap being transferred via message to the heap of a second Erlang actor who will receive the message
While these features provide additional safety, where any Erlang actor can receive any message and no data can be shared, they have the effect of lowering the potential throughput per instance of an application built with it. This means that Erlang deployments have to use more servers than applications built with other tools and platforms in order to handle the same load.
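The copy-on-send isolation described above can be illustrated with a small Python sketch. This is a toy model only: the Actor class, its methods, and the sample message are hypothetical, and it does not reflect how Erlang or the BEAM VM is actually implemented. It simply shows why copying a message protects the receiver from later changes made by the sender, and why that safety has a cost for every message.

```python
import copy
from collections import deque


class Actor:
    """Toy actor with a private mailbox (illustrative only, not Erlang's design)."""

    def __init__(self, name):
        self.name = name
        self._mailbox = deque()

    def send(self, message):
        # Deep-copy the message so the receiver's version shares no mutable
        # state with the sender's. This copy is the per-message cost.
        self._mailbox.append(copy.deepcopy(message))

    def receive(self):
        # Return the oldest message in the mailbox.
        return self._mailbox.popleft()


if __name__ == "__main__":
    receiver = Actor("receiver")
    order = {"id": 1, "items": ["book"]}
    receiver.send(order)
    order["items"].append("pen")  # the sender mutates its own data after sending
    print(receiver.receive())     # the receiver still sees the original contents
```

Because the receiver holds its own copy, the sender’s later mutation is invisible to it and no locking is needed. The trade-off is that every message, however large, is duplicated, which is the throughput cost discussed above.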
# # #
This article is excerpted from Reactive Design Patterns. Save 39% on Reactive Design Patterns with code 15dzamia at manning.com.
For source code, sample chapters, the Online Author Forum, and other resources, go to http://www.manning.com/kuhn/.