All too frequently, Web sites do not meet the expectations of those who want to use them. Sometimes, a site is completely unreachable. At other times, although reachable, its performance is poor. Occasionally, users have other unexpected experiences: for example, after they connect to a web site almost instantly, they aren’t authenticated even though they have valid login identification.
A web site's owner suffers economic consequences when users do not have a satisfactory experience, whether or not a service level agreement (SLA) obligates the owner to deliver a given level of service.
Service Level Agreements
The terms of service for a web server (or any other kind of server, for that matter) are sometimes spelled out in a document called an SLA, for service level agreement. This document details how much uptime the service guarantees, how quickly and effectively the provider will respond to an outage, and whether the provider will compensate the user or reduce its fees if it breaches the guarantee. An SLA can also specify other attributes, such as quality of service (QoS).
In short, an SLA is a contract that dictates what level of service the provider is obligated to provide and what credits/remuneration, if any, is required when the terms of the SLA are not met. SLAs can be between different businesses or between different segments of the same organization.
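An uptime guarantee translates directly into a downtime budget, and a breach into a credit. The following sketch shows the arithmetic; the credit tiers and percentages are purely illustrative, not taken from any real SLA:

```python
# Convert an SLA uptime guarantee into an allowed-downtime budget, and
# look up a (hypothetical) service credit when the guarantee is missed.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_budget_minutes(uptime_pct):
    """Minutes of downtime per 30-day month allowed by an uptime guarantee."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100.0)

def service_credit(uptime_pct_guaranteed, uptime_pct_actual):
    """Illustrative credit schedule: the further below the guarantee,
    the larger the fraction of the monthly fee credited back."""
    shortfall = uptime_pct_guaranteed - uptime_pct_actual
    if shortfall <= 0:
        return 0.0   # guarantee met: no credit owed
    elif shortfall < 0.5:
        return 0.10  # minor breach: 10% credit
    elif shortfall < 1.0:
        return 0.25  # moderate breach: 25% credit
    else:
        return 0.50  # major breach: 50% credit

print(round(downtime_budget_minutes(99.9), 1))  # "three nines" allows about 43.2 min/month
print(service_credit(99.9, 99.0))               # 0.9-point shortfall earns a 25% credit
```

A guarantee of "three nines" (99.9%) sounds strict, yet it still permits roughly three-quarters of an hour of downtime every month before any credit is owed.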
You can read an example of an SLA at http://www.amazon.com/gp/browse.html?node=379654011. Notice that it indemnifies Amazon against both outages and less severe failures caused by a Force Majeure.
Even the most technology-savvy enterprises fail, from time to time, to deliver the on-line services that users expect. For example, at one time or another over the past year or so, the
- RIM (BlackBerry),
- IBM, and
- Beijing Olympic Games
web sites, to name just a few of the most well-known cases, have disappointed their users. As you can imagine, these are not isolated instances. So, take a look at a few of the scenarios that many end users encounter.
But, first, it is important to remember that lost connectivity doesn't mean lost data, just lack of access to the data. The data is still there; you simply can't get to it right now. For some businesses, that alone can disrupt operations. In other cases, it means that backups or disk mirroring are suspended, so you have only your local copies of data until connectivity is restored.
Fortunately, full-blown outages are far less common than other situations where performance is just somewhat below acceptable or contractually set limits.
In either case, man-made or natural causes can be to blame. This article focuses on these two situations (full outages and degraded performance), both of which can cause your system to perform out of spec; that is, where the penalties of an SLA (if you have one) kick in.
When your site is experiencing difficulty, you should first determine whether the problem is caused by events within your organization or events in the external network(s). Naturally, if it's not the one, it's the other. Fortunately, there are sometimes simple steps you can take and tools you can use to investigate this question.
The Internet Traffic Report (ITR) shown in Figures 1, 2, and 3 monitors the flow of Internet data around the world. It then displays a value between 0 and 100; higher values indicate faster and more reliable connections. The higher the packet loss percentage, the slower the connection will perform because, in most instances, the same piece of information has to be sent several times.
Note: This free resource for the Internet community is updated every 5 minutes. You can request that a router on your network be added to the ITR list.
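ITR does not publish its exact formula, but a score in the same spirit, one that rewards low packet loss and low latency on a 0-100 scale, can be sketched as follows. The weights and caps here are purely illustrative assumptions:

```python
# Hypothetical connection-quality index in the spirit of ITR's 0-100 score:
# penalize packet loss heavily (lost packets must be retransmitted) and
# round-trip latency more gently. The weights below are illustrative only.

def traffic_index(packet_loss_pct, avg_rtt_ms):
    """Return a 0-100 score; higher = faster, more reliable connection."""
    loss_penalty = packet_loss_pct * 2.0          # each % of loss costs 2 points
    latency_penalty = min(avg_rtt_ms / 10.0, 50)  # cap latency's contribution at 50
    score = 100 - loss_penalty - latency_penalty
    return max(0, min(100, round(score)))

print(traffic_index(0.0, 80))    # healthy router: no loss, 80 ms round trip
print(traffic_index(40.0, 900))  # badly degraded router: heavy loss, huge latency
```

Run against the two sample routers above, the healthy one scores 92 and the degraded one bottoms out at 0, which matches the intuition behind the Current Index values discussed next.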
Internet connectivity may be smooth today, but perhaps users won't be able to reach a few web sites in Europe tomorrow. ITR will tell you whether those regions of the Internet are currently slowed down. So, by checking ITR, you may be able to determine whether your problems are global or local.
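That manual check can be automated in spirit: probe a handful of well-known outside hosts and, separately, your own servers; if outside hosts are reachable but yours are not, the problem is probably local. A minimal sketch, in which the probe and the classification logic are assumptions of this illustration, not part of ITR:

```python
import socket

def reachable(host, port=443, timeout=3.0):
    """Crude reachability probe: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def diagnose(own_results, outside_results):
    """Classify a problem as local or global from lists of True/False probes."""
    own_ok = any(own_results)
    outside_ok = any(outside_results)
    if own_ok and outside_ok:
        return "no obvious problem"
    if outside_ok and not own_ok:
        return "problem is likely local (your site/network)"
    if own_ok and not outside_ok:
        return "problem is likely in the external network"
    return "total loss of connectivity"

# Example with canned probe results instead of live probes:
print(diagnose([False, False], [True, True]))
```

In practice you would feed `diagnose` the results of `reachable` calls against your own servers and several independent outside hosts; a single failed probe proves little, which is why the sketch accepts lists.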
Figure 1: Global overview of Internet traffic
Figure 2: Performance of North American routers
Figure 3: Performance of one router located in Elkhorn, Wisconsin, U.S.A.
Figure 3 shows a momentary period of slightly sluggish performance in one router shortly before 2:00 p.m. This was not a serious problem. However, serious problems did exist elsewhere: for one of the two Vancouver, Canada routers shown in Figure 2, the Current Index was 0.
Global companies with disaster recovery plans in place are often able to failover their entire systems to servers based in other regions of the world. But, even global companies are vulnerable.
For all the power of modern computing and satellites, most of the world’s communications still rely on submarine cables to cross oceans. The Web site http://www.telegeography.com/products/map_cable/index.php contains a good deal of useful information on underwater cables and other network resources.
When two cables in the Mediterranean were severed earlier this year, it was put down to a mishap with a stray anchor. Since then, a third cable has been cut, this time near Dubai. That, along with new evidence that ships’ anchors were not to blame, has sparked theories about more sinister forces that could be at work.
And an even more recent report states that a fourth cable has been cut, in a different location from the others.
When India initially lost as much as half of its Internet capacity earlier this year, traffic was quickly rerouted, but a day or two elapsed before the country was reported to have regained 90% of its usual capacity. The outage also revealed that the effects of such failures are anything but neutral; they vary widely depending on the size and resources of the user.
In addition to the kinds of problems discussed above, another warrants mention: loss of connectivity that is software related. Users recently experienced problems with Windows Genuine Advantage (WGA) authentication. Users of both Windows XP and Windows Vista could not validate their installations using WGA.
This was a major WGA server outage affecting users across the globe. The WGA support forum exploded with complaints. Microsoft told users who were affected that they should “try again” later.
So, given the unpredictable nature of in-house computer systems and our heavy reliance on a less-than-100%-reliable Internet for moving data among systems, how do you promise a level of service to your users (that is, sign an SLA) or, where an SLA is not required, deliver a good user experience that contributes positively to your bottom line?
One, albeit legalistic, answer to this troubling question is to place a Force Majeure clause (as was seen in the Amazon SLA cited above) in your SLA. But, this will protect you only so far, because, remember, unhappy customers “walk.”
You can lose revenue even if you're not bound by an SLA to deliver a certain level of service, because customers often don't return to a web site whose performance dissatisfies them. A common cause of user dissatisfaction is a server, or the network users need to reach that server, with insufficient resources to handle the number of users trying to connect. The shortfall can arise during a period of peak usage, such as when a news article creates a sudden surge of interest, or after the gradual growth in a site's popularity, and it shows up, for example, when QoS problems degrade voice or video streams.
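Elementary queueing theory makes that "insufficient resources" effect concrete. In the simplest single-server (M/M/1) model, expected response time is R = S / (1 - rho), where S is the service time per request and rho is server utilization. Response time grows mildly until utilization nears 100%, then explodes, which is exactly the traffic-spike scenario. A sketch with made-up numbers:

```python
# M/M/1 queue: expected response time R = S / (1 - rho), where
# S = service time per request and rho = utilization (0 <= rho < 1).

def response_time(service_time_s, utilization):
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

# A request that takes 50 ms to serve on an otherwise idle server:
for rho in (0.5, 0.9, 0.99):
    ms = response_time(0.05, rho) * 1000
    print(f"utilization {rho:.0%}: response time {ms:.0f} ms")
```

At 50% utilization the request takes 100 ms; at 99% it takes 5 seconds. The model is a simplification, but the shape of the curve is why a site that is fine on an average day can become unusable during a spike.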
Shifts Away from Target Performance
As detailed above, there are many sources of variation in overall system behavior. It is common to classify them into two types: common causes and special causes. Common causes are sources of variation within a process that have a stable and repeatable distribution over time. This random variation is inherent in the process and found everywhere; it cannot easily be removed unless the process itself is redesigned. If only common causes of variation are present and do not change, the output of a process is predictable, as shown at the top of Figure 4. When this is the case, it becomes possible to sign an SLA with some confidence in its financial consequences.
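The standard way to separate the two kinds of variation is a control chart: measure the process while it is behaving normally, then flag any later observation more than three standard deviations from the baseline mean as a special cause. A minimal sketch, using hypothetical page-response-time samples:

```python
import statistics

def control_limits(baseline):
    """Mean +/- 3 sigma limits computed from in-control baseline measurements."""
    mean = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return mean - 3 * sigma, mean + 3 * sigma

def special_causes(samples, limits):
    """Indices of samples that fall outside the control limits."""
    lo, hi = limits
    return [i for i, x in enumerate(samples) if x < lo or x > hi]

# Hypothetical page response times (ms) while the site behaves normally:
baseline = [200, 210, 195, 205, 198, 202, 207, 193, 201, 199]
limits = control_limits(baseline)

# New measurements; the 2,000 ms spike is a special cause (say, a cut cable):
print(special_causes([204, 198, 2000, 203], limits))
```

The points inside the limits are common-cause noise you live with (or design away); the flagged point is the kind of special cause, like a severed cable, that no SLA arithmetic based on the baseline could have predicted.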
However, a system whose performance shifts away from target (that is, the performance users of a Web site expect) and whose probability distribution changes over time makes prediction difficult, as shown at the bottom of Figure 4. Recall the unexplained cuts in trans-oceanic cables. Yet it is precisely predictability that is needed to write a realistic SLA.
Figure 4: Common (in other words, predictable) vs. special (in other words, unpredictable) variations
There are many factors to consider when choosing a forecast method. Two major considerations are the cost of implementing capacity and disaster recovery plans versus the possible lost revenue (in other words, lost benefit) if you don't. The widely used graph in Figure 5 displays the relationship between the amount of money you invest in failover servers, cloud computing, redundant communication channels, and the like versus the resulting benefit.
Figure 5: Cost vs Benefit transition point
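The transition point in Figure 5 can be reasoned about numerically: keep investing in redundancy as long as the next tier reduces expected outage loss by more than it costs. Every figure in the sketch below is an illustrative assumption:

```python
# Illustrative cost-vs-benefit calculation for redundancy investment.
# Each tier cuts expected annual downtime; the best tier is the one
# with the lowest total of investment plus expected outage loss.

REVENUE_LOSS_PER_HOUR = 10_000  # hypothetical cost of one hour of downtime

# (annual cost of tier, expected downtime hours/year with that tier)
tiers = [
    (0,       80),  # no redundancy
    (50_000,  20),  # failover server
    (120_000,  8),  # + redundant network links
    (400_000,  6),  # + second data center
]

def best_tier(tiers, loss_per_hour):
    """Index of the tier minimizing investment + expected outage loss."""
    totals = [cost + hours * loss_per_hour for cost, hours in tiers]
    return min(range(len(tiers)), key=totals.__getitem__)

print(best_tier(tiers, REVENUE_LOSS_PER_HOUR))
```

With these numbers the redundant-links tier wins: the second data center would shave only two more hours of expected downtime, worth far less than its extra cost. Change the loss-per-hour figure and the transition point moves, which is Figure 5's message in miniature.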
“Prediction is very difficult, especially if it’s about the future.”
—Niels Bohr, Nobel laureate in Physics
This quote advises that using a forecasting model based only on data collected in the past can be risky. It is often easy to find a model that fits the past data well. It is quite another matter to find a model that correctly identifies those features of the past data that will be replicated in the future.
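A tiny illustration of that pitfall: a model that memorizes past traffic exactly has zero error on the past and nothing to say about the future, while even a simple trend line can extrapolate. The monthly page-view data below is made up for the example:

```python
# Made-up monthly page-view counts (thousands): an upward trend plus noise.
past = [(1, 12), (2, 15), (3, 13), (4, 18), (5, 17), (6, 21)]

# "Perfect" model: memorize the past. Zero error on old months,
# but it has no prediction at all for a month it hasn't seen.
memorized = dict(past)

# Simple model: least-squares line y = a + b*x (closed form, pure Python).
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line(past)
print(memorized.get(7))     # month 7: the memorizing model returns None
print(round(a + b * 7, 1))  # the trend line at least produces a forecast
```

Neither model is a capacity plan; the point is only that fitting the past well (here, perfectly) says nothing about identifying the features of the past that will persist into the future.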
So, do not try to create a model that is an exact replica of today's reality. Instead, create a model that is quick and easy to build and that acknowledges the complexity and unreliability of predictions.
In today's competitive Web environment, it is critical to look for ways to improve the performance of your web application. Your goal in doing so should be threefold: reduce costs, improve customer satisfaction, and increase revenue, thereby increasing profits.
As part of achieving these goals, every business should have a business continuity plan to protect against disaster, whether man-made or natural. If you don't have one, or haven't evaluated yours in a while, perhaps now is a good time to do so. You should also re-examine whether the resources of your Web site are adequate for the current and anticipated load and type of user connection.
But understand that no amount of redundancy or fault tolerance can protect you completely from the causes of poor user experience described above. As those examples illustrate, no matter how carefully a project is planned, something beyond your control may still go wrong.
Bottom Line: Our increasing dependence on the first “w” in “www” can undermine “the best laid plans of mice and men,” to borrow a line that Robert Burns wrote in 1785.
About the Author
Marcia Gulesian is an IT strategist, hands-on practitioner, and advocate for business-driven architectures. She has served as software developer, project manager, CTO, and CIO. Marcia is the author of more than 100 feature articles on IT, its economics, and its management.