Performance Improvement: Bigger and Better
In this four-part series on performance we've reviewed the fundamentals of assessing performance, including using the tools built into Windows to make those assessments. We've covered the considerations for session state, and we've walked through the benefits and problems of caching. However, we've not covered in detail what to do once you've assessed performance, or how to leverage what you've learned about session state and caching to solve real-world problems. In this article we'll focus on isolating problems into solvable units, and on what to do when you believe that things just can't be fixed.
Tracking Down Problems
| In most cases, the process of improving performance is about 90% finding the problem and 10% fixing the problem. |
Break It Down
If you've got multiple roles in your architecture, for instance a database and a front end, on the same server, you can split them onto separate servers. This will allow you to determine which of the roles is causing the load. Anything that your architecture allows you to split should be split, even if it's only temporary, so that the performance can be better understood.
Changing the Scenery
Another way to locate the root cause of a performance problem is to change the circumstances of the problem, either to attempt to make the problem go away or to make it worse. On the surface, making a problem worse doesn't seem like a good idea, and generally speaking, in production, it isn't. However, what if you suspect that a particular aspect of the program, or a particular aspect of the data, may be causing issues? It may be the right thing to create an unrealistic set of circumstances in a development or testing environment to try to make the problem easier to find.
Whether the change of scenery makes the problem better or worse, you'll be closer to finding your performance issues simply by knowing which environmental factor is contributing to the problem.
It's not always possible to do this; however, it is possible in more cases than one would expect. You can take all of the production data and place it on a development system that is less powerful than the production system, so that even a much smaller load will cause the performance problems to crop up.
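One way to deliberately make a suspected problem worse in a test environment is to replay the same request at increasing concurrency and watch how response times degrade. Here's a minimal sketch using only Python's standard library; the URL is a hypothetical endpoint in your own test environment:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://test-server/login"  # hypothetical test-environment endpoint

def one_request():
    """Time a single round trip to the endpoint."""
    start = time.perf_counter()
    urlopen(URL).read()
    return time.perf_counter() - start

# Ramp up concurrency and watch whether latency degrades gradually or falls
# off a cliff -- the cliff is usually where the bottleneck lives.
for workers in (1, 5, 25, 125):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        times = list(pool.map(lambda _: one_request(), range(workers * 4)))
    print(f"{workers:3d} concurrent: worst {max(times):.3f}s")
```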
Loops and Latency
Sometimes, however, the problems will not rear their ugly heads. They don't look like a bottleneck on the server; instead, everything appears to be working correctly. And yet the site isn't performing the way you expect or want it to. If you can't locate a bottleneck, then you're probably on the hunt for a loop and some latency.
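When none of the server counters show stress, wall-clock timing around each remote call is often the fastest way to expose hidden latency. Here's a minimal sketch; call_active_directory is a hypothetical stand-in for whatever remote call your code actually makes:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print the wall-clock time spent inside the block; cheap enough to
    # wrap around every remote call while you hunt for latency.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

def call_active_directory():
    # Hypothetical stand-in for a remote call; here it just sleeps 40 ms.
    time.sleep(0.040)

with timed("group lookup"):
    for _ in range(10):  # a loop of remote calls is exactly where latency hides
        call_active_directory()
```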
I clearly remember a portal I was working on that was taking over four seconds to log users in. We weren't seeing processing spikes. We weren't seeing database servers being hammered. In fact, the whole system seemed to be just humming along, but we couldn't deny that something was wrong, because users really were taking several seconds to get logged in.
We isolated the issue when we used a network monitoring tool to see what the network traffic from the portal server looked like. Imagine our surprise when we found that, of the four seconds to log in, nearly 3.8 seconds was the portal server talking to Active Directory. As it turns out, the design called for building a list of the groups that the user belonged to, either directly or indirectly. The problem with this in Active Directory is that, in order to determine the entire list of groups someone belongs to, you have to issue a query to get the membership of each group that they're a member of. The process is recursive, so each group required evaluating other groups, and so on.
Because of the nature of the group memberships, the system was walking through nearly 1,000 groups per user. Of course, this is bad. However, that doesn't by itself explain 3.8 seconds. When you add a slight misconfiguration in Active Directory that caused the requests to be routed to an Active Directory server nearly 1,000 miles away, with a wire-speed latency of 40 ms per request, it becomes easier to see how 3.8 seconds can get chewed up without showing up anywhere as a bottleneck. It wasn't a bottleneck. It was simple latency multiplied by a large number.
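To see how latency multiplied by a loop adds up, here's a minimal sketch of recursive group expansion against a toy in-memory directory; in the real system, each lookup was a round trip to Active Directory. The 40 ms figure comes from the story above; the directory contents are purely illustrative:

```python
# Toy directory: each lookup returns a principal's direct group memberships.
DIRECTORY = {
    "alice": ["sales", "portal-users"],
    "sales": ["all-staff", "crm-users"],
    "portal-users": ["all-staff"],
    "crm-users": [],
    "all-staff": [],
}

LATENCY_S = 0.040  # 40 ms wire-speed latency per round trip

def expand_groups(principal, seen=None):
    # Recursively expand nested memberships; every call is one round trip.
    if seen is None:
        seen = set()
    round_trips = 1
    for group in DIRECTORY.get(principal, []):
        if group not in seen:
            seen.add(group)
            round_trips += expand_groups(group, seen)
    return round_trips

trips = expand_groups("alice")
print(f"{trips} round trips = {trips * LATENCY_S:.2f}s at 40 ms each")
# At 40 ms apiece, just 95 sequential round trips account for 3.8 seconds.
```

None of those individual 40 ms round trips is slow enough to register on any counter; only the sum is visible, and only from the user's point of view.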
In this case the short-term solution was easy: fix the Active Directory misconfiguration. The longer-term fix meant rearchitecting group memberships and changing the code to walk only one level of group nesting instead of an arbitrary number. With all of the changes in place, logon times for the application dropped to just over one second.
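Against the toy directory above, a one-level walk might look something like the sketch below; it trades the completeness of arbitrarily deep nesting for a small, predictable number of round trips:

```python
def groups_one_level(principal):
    # Walk exactly one level of nesting: the user's direct groups plus each
    # of those groups' direct parents. Round trips are now 1 + len(direct)
    # instead of growing with the size of the whole nesting graph.
    direct = DIRECTORY.get(principal, [])        # 1 round trip
    nested = set()
    for group in direct:                         # 1 round trip per direct group
        nested.update(DIRECTORY.get(group, []))
    return sorted(set(direct) | nested)

print(groups_one_level("alice"))
# ['all-staff', 'crm-users', 'portal-users', 'sales'] -- 3 round trips
```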
Many other performance issues I've seen have been caused by Active Directory or DNS misconfigurations, but I'm careful not to jump to that conclusion, particularly since the problem is generally not one-sided, as the story above illustrates.
Art and Science
Performance improvement is one part science: the mechanics of capturing details, making observations, and so on. It is also one part art: knowing where to look for the cause of a particular problem. Obviously the coarse-grained performance numbers can help you look for disk issues on the SQL Server or CPU problems on the front-end web application, but what if you're seeing both? Therein lies the art. Knowing what to look for, and knowing what is most likely not the problem, takes experience and guesswork. While it's possible to teach the mechanics (as we've done in this series), it's not necessarily possible to teach the art of performance improvement.
The challenge of anything that requires both art and science is knowing when you need the art and when you need the science. In performance improvement, the art is knowing what to test. The science is doing the testing and making the observations. Be clear that it's the testing and observations that drive the next round of problem identification, or the quest to create a solution. Performance issues are rarely solved by guesswork without the testing to support and validate that the guess is correct.
New Bottlenecks
| Removing any one problem or any one bottleneck won't necessarily resolve your performance issues completely. |
Sometimes the next bottleneck is easy to spot, but more often than not it's impossible to see while the first bottleneck is controlling the scalability of the system. The trick here is to realize that removing any one problem or any one bottleneck won't necessarily resolve your performance issues completely; it will just make them better. If you don't get to the performance or scalability level you want, you'll go through the whole process again, isolating the parts of the solution that are creating the next bottleneck, until you've solved enough problems that the system performs adequately.
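The process is inherently iterative. As a toy model, imagine response time dominated by the worst component, so that fixing one bottleneck simply reveals the next; all numbers here are illustrative:

```python
# Toy model: each component contributes some cost, and only the dominant one
# is visible as "the" bottleneck at any given time.
component_cost_s = {"auth": 3.8, "database": 0.9, "rendering": 0.4}
TARGET_S = 0.5

while max(component_cost_s.values()) > TARGET_S:
    bottleneck = max(component_cost_s, key=component_cost_s.get)
    print(f"dominant bottleneck: {bottleneck} ({component_cost_s[bottleneck]:.1f}s)")
    component_cost_s[bottleneck] = TARGET_S  # 'fix' it and the next one appears
```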
