Capacity and Disaster Recovery Planning for an Internet Connection that Can Become Unsatisfactory
This was a major WGA server outage affecting users across the globe. The WGA support forum exploded with complaints. Microsoft told users who were affected that they should "try again" later.
So, given the unpredictable nature of in-house computer systems and our heavy reliance on a less than 100% reliable Internet to move data among systems, how do you promise your users a level of service (that is, sign an SLA with them) or, where an SLA is not required, run a site that delivers a good user experience and thereby contributes positively to your bottom line?
One answer to this troubling question, albeit a legalistic one, is to place a Force Majeure clause (as seen in the Amazon SLA cited above) in your SLA. But this will protect you only so far because, remember, unhappy customers "walk."
You can lose revenue even if you're not bound by an SLA to deliver a certain level of service, because customers often don't return to a Web site whose performance dissatisfies them. A common cause of user dissatisfaction is a server, or a network that users need in order to reach the server, with insufficient resources to handle the number of users trying to connect. The shortfall can appear during a period of peak usage, such as when a news article creates a sudden increase in interest, or after gradual growth in the site's popularity. One visible symptom is poor voice or video streams caused by QoS issues.
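As a rough illustration of the "insufficient resources" problem, the sketch below applies the Utilization Law from capacity planning (utilization = arrival rate x service time) to estimate how many identical servers a given load needs. The function name, the 70% utilization cap, and the traffic figures are all hypothetical values chosen for illustration, not measurements from any real site.

```python
import math

def servers_needed(requests_per_sec, service_time_sec, max_utilization=0.7):
    """Estimate how many identical servers keep per-server utilization under a cap.

    Utilization Law: total demand (busy server-seconds per second) equals
    arrival rate times average service time. The 0.7 cap is an assumed
    safety margin, not a standard value.
    """
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    demand = requests_per_sec * service_time_sec
    return max(1, math.ceil(demand / max_utilization))

# Hypothetical site: 50 requests/s on a normal day, 200 requests/s after a
# news article, each request taking about 50 ms of server time.
normal = servers_needed(50, 0.05)
peak = servers_needed(200, 0.05)
print(f"normal load: {normal} servers, peak load: {peak} servers")
```

A site sized only for the normal figure would be well short of what the peak figure calls for, which is the situation that drives users away.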
Shifts Away from Target Performance
As detailed above, there are many sources of variation in overall system characteristics. It is common to classify them into two types: common causes and special causes. Common causes refer to the sources of variation within a process that have a stable and repeatable distribution over time. The random variation, which is inherent in the process, is not easily removable unless we change the very design of the process, and is a common cause found everywhere. If only common causes of variation are present and do not change, the output of a process is predictable as shown at the top of Figure 4. When this is the case, signing an SLA with a degree of confidence in the financial consequences becomes possible.
However, a system with shifts away from target performance (that is, the performance expected by users of a Web site) and with changes in its probability distribution over time makes prediction difficult, as shown at the bottom of Figure 4. Remember the unaccounted-for split in a trans-oceanic cable. Yet it's predictability that is needed to write a realistic SLA.
Figure 4: Common (in other words, predictable) vs. special (in other words, unpredictable) variations
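One simple way to separate the two kinds of variation is a control chart: derive limits from a period known to be stable, then flag later measurements that fall outside them. The sketch below is a minimal version that assumes response times in milliseconds, a three-sigma rule, and the sample standard deviation as the spread estimate (production control charts often use moving ranges instead). The function names and data are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits from measurements taken while the
    process showed only common-cause variation."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread

def special_cause_points(samples, limits):
    """Return the indexes of samples outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]

# Hypothetical response times (ms) from a stable week, then new readings.
baseline = [200, 210, 190, 205, 195, 202, 198, 207, 193, 200]
new_readings = [204, 197, 450, 201, 215]

limits = control_limits(baseline)
print("limits:", limits)
print("special-cause indexes:", special_cause_points(new_readings, limits))
```

Readings inside the limits are treated as ordinary noise; a reading outside them suggests something changed and is worth investigating before it affects an SLA.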
There are many factors to consider when choosing a forecast method. Two major considerations are the cost of implementing capacity and disaster recovery plans and the possibility of lost revenue (in other words, lost benefit) if you don't implement them. The widely used graph in Figure 5 displays the relationship between the amount of money you invest in failover servers, cloud computing, redundant communication channels, and the like versus the resulting benefit.
Figure 5: Cost vs Benefit transition point
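One common way to reason about such a transition point is to assume diminishing returns and stop investing once the next increment of spending returns less than it costs. The sketch below works under that assumption. The investment tiers and benefit figures are invented for illustration, and the function name is hypothetical.

```python
def transition_point(options):
    """Pick the last option whose extra benefit still exceeds its extra cost.

    options: list of (cost, expected_benefit) tuples sorted by rising cost.
    """
    best = options[0]
    for prev, cur in zip(options, options[1:]):
        extra_cost = cur[0] - prev[0]
        extra_benefit = cur[1] - prev[1]
        if extra_benefit <= extra_cost:
            break  # further spending no longer pays for itself
        best = cur
    return best

# Hypothetical tiers: (annual cost, expected revenue protected per year).
tiers = [
    (0, 0),              # do nothing
    (10_000, 60_000),    # nightly off-site backups
    (25_000, 90_000),    # plus a failover server
    (50_000, 105_000),   # plus a second network provider
    (100_000, 110_000),  # plus a fully mirrored data center
]

cost, benefit = transition_point(tiers)
print(f"invest up to {cost}, protecting about {benefit}")
```

With these made-up numbers the third tier is the last one that pays for itself. Real benefit estimates are themselves forecasts and carry the same uncertainty as any other.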
"Prediction is very difficult, especially if it's about the future."
-- Niels Bohr, Nobel laureate in Physics
This quote advises that using a forecasting model based only on data collected in the past can be risky. It is often easy to find a model that fits the past data well. It is quite another matter to find a model that correctly identifies those features of the past data that will be replicated in the future.
So, one should not try to create a model that is an exact replica of today's reality. Instead, create a model that is quick and easy to build and that makes the complexity and unreliability of its predictions explicit.
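The gap between fitting the past and predicting the future can be made concrete by holding back the most recent observations, fitting a model to the rest, and comparing the two errors. The sketch below does this with a straight-line trend fitted by least squares. The monthly peak-user figures are hypothetical and were chosen so that a shift occurs in the held-back months.

```python
def fit_line(ys):
    """Least-squares line through points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def holdout_errors(history, holdout):
    """Fit on all but the last `holdout` points; return (fit error, forecast error)."""
    train, test = history[:-holdout], history[-holdout:]
    slope, intercept = fit_line(train)
    fitted = [intercept + slope * x for x in range(len(train))]
    forecast = [intercept + slope * x for x in range(len(train), len(history))]
    return mean_abs_error(train, fitted), mean_abs_error(test, forecast)

# Hypothetical monthly peak concurrent users: steady growth, then a jump.
peak_users = [100, 110, 120, 130, 140, 150, 240, 260, 280]
fit_err, forecast_err = holdout_errors(peak_users, holdout=3)
print(f"error on past data: {fit_err:.1f}, error on held-back data: {forecast_err:.1f}")
```

A model can match the training months almost perfectly and still miss the held-back months badly, so the forecast error, not the fit error, is the figure to weigh when sizing capacity.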
In today's competitive Web environment, it is critical to look for ways to improve the performance of your web application. Your goal in doing this should be to achieve three things: the first is to reduce costs, the second is to improve customer satisfaction, and the third is to increase revenue, thereby increasing profits.
As part of achieving these goals, every business should have a business continuity plan to protect against disaster, whether or not it's man-made. If you don't have one, or haven't evaluated it in a while, perhaps now is a good time to do that. You should also re-examine whether the resources of your Web site are adequate for the current and anticipated load and type of user connection.
But understand that no amount of redundancy or fault tolerance can protect you completely from many of the causes of poor user experience described above. As illustrated there, no matter how carefully a project is planned, something beyond your control may still go wrong with it.
Bottom Line: Our increasing dependence on the first "w" in "www" can undermine "the best laid plans of mice and men," to borrow a line that Robert Burns wrote in 1785.
About the Author
Marcia Gulesian is an IT strategist, hands-on practitioner, and advocate for business-driven architectures. She has served as software developer, project manager, CTO, and CIO. Marcia is author of well more than 100 feature articles on IT, its economics, and its management.