You probably noticed that the focus is at the enterprise level, not necessarily on the individual parties involved. The goal is to maximize the utility function of the enterprise, which may or may not mean that the two parties are completely satisfied. The enterprise serves as yet another entity in this triangle. This entity is usually represented by a committee such as the Center of Excellence (COE) for a particular technology; in this case, virtualization technologies1.
Figure 1: COE Triangle
The Service Provider is the IT entity that provides the services required by the Service Consumer. The Service Consumers are essentially end users with specific business requirements. In the financial market, users could be traders or risk analysts. In the biotechnology or pharmaceutical markets, users are researchers and scientists. The Service Consumers provide both functional and non-functional requirements. Functional requirements will be implemented as the business logic by the Service Provider. The non-functional requirements are things such as turn-around time, response time, and other quality measurements that will be satisfied by the COE. These non-functional requirements will most likely be captured in an informal document agreed upon by all three parties at the outset. Remember that the COE manages the infrastructure, even though the infrastructure requirement is provided by the Service Providers (the group that writes the code).
This explanation has come full circle, but how does it relate to achieving SLAs at the enterprise level? Clearly, the COE is responsible for a number of these Service Provider-Service Consumer combinations. The COE is also responsible for providing the infrastructure, a topic that you will learn about later in this article. You can imagine how sharing can allow multiple such pairs to meet their SLA requirements without any of these groups ever needing to over-provision their infrastructure. Because the COE is managing the infrastructure, provisions can be made at the enterprise level, where the overall usage profile is available.
If you recall from the previous article in this series, I mentioned that one of the many goals of virtualization is to maximize utilization of resources. In the scenario just described, how can the COE provision its infrastructure with more resources as needed? Cycle stealing is a term that you have heard over the past decade in reference to Grid and distributed computing as a means to increase computing power on the fly. The idea is to find unused resources across the enterprise and add them to your Grid. When the original owners need those resources back, return what you borrowed without any side effects from your usage.
Even though cycle stealing adds additional computing capacity to the Grid, a number of steps must be taken beforehand to ensure proper planning:
- Examine the resources and their OS build
- Determine the usage work profile of the target CPUs
- Broker a shared grid relationship between the grid application and resource provider
These preliminary steps are required because, in most cases, the homogeneity of the target resources will be the key factor in how the application scales. For example, a .NET-based application can only utilize resources that are Windows based and primed with a specific version of the .NET Framework.
When a heterogeneous set of resources is present, cycle stealing adds to the overall available compute resource count only for resources that share the same OS version, processor architecture, and so forth. In the .NET example, neither a Linux-based resource nor one with a different version of the .NET Framework helps the application add to the overall available resource count. The same applies to C++-based applications. A homogeneous build environment is required in this case to borrow unused CPUs for a C++-based application. Multiple versions of the C++ compiler (GNU GCC or Visual Studio) may need to be configured on a given resource for that resource to be part of the available pool.
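As a minimal sketch of this homogeneity check, a scavenged pool might be filtered down to the resources that actually add to the count. The record fields (`os`, `arch`, `runtimes`) and node names here are hypothetical, not taken from any particular Grid product:

```python
# Hypothetical sketch: filter a mixed pool of scavenged resources down to
# those matching an application's build requirements.

def usable_resources(pool, required_os, required_arch, required_runtime):
    """Return only the resources that add to the available count."""
    return [
        r for r in pool
        if r["os"] == required_os
        and r["arch"] == required_arch
        and required_runtime in r["runtimes"]
    ]

pool = [
    {"name": "node1", "os": "Windows", "arch": "x86", "runtimes": [".NET 2.0"]},
    {"name": "node2", "os": "Linux",   "arch": "x86", "runtimes": ["gcc-4.1"]},
    {"name": "node3", "os": "Windows", "arch": "x86", "runtimes": [".NET 1.1"]},
]

# Only node1 helps a .NET 2.0 application; node2 (wrong OS) and
# node3 (wrong Framework version) do not add to the resource count.
matches = usable_resources(pool, "Windows", "x86", ".NET 2.0")
```

The same filter covers the C++ case: the required runtime simply becomes a particular compiler build instead of a Framework version.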
Complementary workloads enable scheduling of resources in such a way that utilization is maximized. For example, application groups can share resources between different regions during off hours. A New York-based user requiring resources during normal business hours can harness the computing resources sitting idle in London overnight. As shown in Figure 2, resources from across the pond are migrated over as needed to meet the SLA requirement of users in NY. This is a clear-cut case where you know for a fact that after 6PM GMT, a number of desktops in London are left unused.
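The off-hours test behind this scenario reduces to a simple time-window check. A sketch, with the 6PM-to-7AM GMT window and the function name as illustrative assumptions:

```python
# Illustrative sketch: decide whether idle London desktops can be lent out.
# London off hours are assumed to run 18:00-07:00 GMT (after 6PM).

def london_desktops_available(hour_gmt):
    """True when London is in its overnight off-hours window."""
    return hour_gmt >= 18 or hour_gmt < 7

# 3PM in New York is 20:00 GMT, well inside London's off hours.
ny_afternoon = london_desktops_available(20)

# 11:00 GMT is mid-day in London, so its desktops stay local.
london_midday = london_desktops_available(11)
```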
Figure 2: The COE Controls how resources can be reused to achieve desired QoS
A finer-grained scenario would allow you to be more dynamic in your resource provisioning. Let me make the following statement:
Rule 1: “The total maximum resource requirement of all users must be less than or equal to the total available resources at any given time to meet the SLA for all users at all times.”
What does this mean? This is resource sharing amongst complementary workloads. Imagine that you have a total of three users with the following resource needs:
| | Time Block 1 | Time Block 2 | Time Block 3 |
|---|---|---|---|
| User A | 50 CPU/HR | 100 CPU/HR | 400 CPU/HR |
| User B | 200 CPU/HR | 50 CPU/HR | 100 CPU/HR |
| User C | 100 CPU/HR | 300 CPU/HR | 50 CPU/HR |
Table 1: Depicting Usage Profile and Resource Requirement for three users
Clearly, you need to have at least 550 CPU/HR available at all times. What is interesting is that if no sharing were available, you would need a total of 900 CPU/HR to cover each user's peak period (400, 200, and 300 CPU/HR for Users A, B, and C, respectively).
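The arithmetic from Table 1 can be checked directly: with sharing, you provision for the busiest time block; without sharing, you provision for the sum of each user's individual peak:

```python
# CPU/HR per time block for each user, taken from Table 1.
usage = {
    "User A": [50, 100, 400],
    "User B": [200, 50, 100],
    "User C": [100, 300, 50],
}

# With sharing: the largest column sum (Time Block 3 here: 400+100+50).
shared_need = max(sum(col) for col in zip(*usage.values()))

# Without sharing: the sum of each user's own peak (400+200+300).
dedicated_need = sum(max(blocks) for blocks in usage.values())
```

Sharing across complementary workloads brings the requirement from 900 down to 550 CPU/HR, a saving of roughly 39 percent.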
Static Resource Provisioning

What do I mean by "static?" These are dedicated resources. These are the servers, blades, and other compute resources that are sitting in your data center waiting to be used by the Grid infrastructure. As mentioned before, they are needed for critical and time-sensitive applications to meet the minimum resource requirement. You will learn about how you can get the minimum resources required in the next section. For now, you need to know that, for all intents and purposes, all the applications that you learn about are critical and somewhat time sensitive. What this means is that there are Quality of Service requirements, even if some of these requirements are not that stringent.
You learned about virtualization in the previous article, but let me talk a bit about virtualizing static resources. For a Grid-type infrastructure, you really do not care where the resources are because the Grid manager takes care of that information. Access to the resources is virtualized in the sense that the users are unaware of the location, type, or details of service execution on those resources. The users simply desire some resources to get the job done. The details are up to the virtualizer.
Dynamic Resource Provisioning
Although the above scenario significantly increased utilization, you must take proper steps to build a solid relationship between the application and the resource provider to ensure proper planning and cut-over time of resources. Unused CPU cycles from desktops (dynamic resources) add a certain level of unpredictability to the picture, requiring you to manage these resources more closely.
Once a given resource has been identified for repurposing, an agent (the Grid software agent) is installed on that node and policies are used to manage the resource. Node policies check for scenarios such as:
- Event-based entitlement: Whether the mouse or keyboard has been idle for a specified period of time
- Utilization-based entitlement: Whether processor utilization has been below a certain threshold for a specified period of time
- Time-based entitlement: Whether a node is allowed to be part of the grid for a given period of time
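These three entitlement checks can be sketched as simple predicates, combined into the hybrid policy the text describes. The thresholds, field names, and overnight window below are illustrative assumptions, not settings from any specific Grid agent:

```python
# Hypothetical node-repurposing policy checks.

def event_based(idle_seconds, min_idle=600):
    """Mouse/keyboard idle for at least min_idle seconds?"""
    return idle_seconds >= min_idle

def utilization_based(cpu_percent, max_util=20.0):
    """Processor utilization below the threshold?"""
    return cpu_percent < max_util

def time_based(hour, window=(18, 7)):
    """Within the hours the node may join the grid (overnight window)?"""
    start, end = window
    return hour >= start or hour < end

def node_eligible(idle_seconds, cpu_percent, hour):
    # A hybrid policy: all three entitlements must hold before the
    # node is handed to the Grid.
    return (event_based(idle_seconds)
            and utilization_based(cpu_percent)
            and time_based(hour))

# An idle, quiet desktop at 10PM qualifies; the same desktop at noon
# stays out of the grid because its time-based entitlement fails.
evening_ok = node_eligible(900, 5.0, 22)
midday_ok = node_eligible(900, 5.0, 12)
```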
Each environment can use a hybrid combination of these repurposing policies with the support of the Grid. Once in production, further steps are taken to ensure proper usage and administration of the resources. Managing the risk for the resource provider is the key step at this stage. The resource provider must be assured that its resources will be available once needed. This could mean taking steps to clean up the resource after usage and ensuring that the resource is back to its proper state before it leaves the Grid.
This is mainly the job of the COE team. The COE also must ensure that these dynamic resources add value to the overall infrastructure. For example, in the scenario depicted in Table 1, the COE needs resources during Time Block 3 more than any other time block. If resources are available only during Time Block 1, they might not be as useful. It is always a challenge to determine what percentage of your resources must be dedicated (static) and what percentage must be scavenged (dynamic). I like to use the following rules of thumb:
Rule 2: “Have 10% more static resources than the minimum total resources required by all the users for a given period of time.”
Rule 3: “Assume that only 10% of your dynamic resources are available to your overall pool of resources at any given point of time.”
Why 10%? That is where my comfort level lies. This low water mark might be higher for you or your organization. Putting all three rules together, one comes to the following conclusion:
Total Resource Shortage = Rule 1 – (Rule 2 + Rule 3)
I call the above number “the Predicament Number.” This number represents the amount of resource that will be idle during your off-peak hours. This is a number that you want minimized, either by altering the resource requirement profile (Rule 1), or ideally increasing your dynamic resources (Rule 3). Once you are in production, you can adjust the low water marks that you used in Rule 3 to better model your environment at various times during a given day. I would not suggest altering Rule 2 because it represents the predictable part of your infrastructure. As mentioned, dynamic resources add a certain level of unpredictability to your infrastructure and thus need proper management.
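Putting the three rules together with the Table 1 numbers, the Predicament Number can be computed as below. The size of the dynamic desktop pool is an assumed figure for illustration:

```python
# Sketch of Rules 1-3 combined. The 10% factors are the author's
# rule-of-thumb low water marks; dynamic_pool is an assumed figure.

peak_demand = 550      # Rule 1: busiest time block from Table 1
min_demand = 350       # quietest time block (Time Block 1: 50+200+100)
dynamic_pool = 1000    # total scavengeable desktop CPUs (assumption)

static_supply = min_demand * 1.10      # Rule 2: 10% above the minimum
dynamic_supply = dynamic_pool * 0.10   # Rule 3: assume only 10% show up

# Total Resource Shortage, the "Predicament Number":
shortage = peak_demand - (static_supply + dynamic_supply)
```

With these assumed inputs the shortage comes to 65 CPU/HR; in practice you would tune the Rule 3 low water mark against observed desktop availability rather than this flat 10%.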
This article has only scratched the surface of achieving desired QoS through virtualization. It is important to remember that virtualization can be achieved both statically and dynamically, as described here. Because the cost of building data centers climbs as you increase your static resource count, you need to explore ways of decreasing that cost, and using dynamic resources is certainly one of them. Auditing, tracking, and profiling of resources were not covered in this article, but they are coming in the near future. Stay tuned.
1 I am referring to virtualization as an abstract notion that encompasses a number of different technologies such as Web services, Grid and Utility Computing, and the like. The focus will narrow mainly to Grid computing as we delve deeper into the topic at hand.
About the Author
Mr. Sedighi is currently the CTO and founder of SoftModule. SoftModule is a startup company with engineering and development offices in Boston and Tel-Aviv and a sales and management office in New York. He is also the Chief Architect for SoftModule’s Grid Appliance product, which arose from a current market need to manage excessive demands for computing power at lower cost.
Before SoftModule, Mr. Sedighi held a Senior Consulting Engineer position at DataSynapse, where he designed and implemented Grid and Distributed Computing fabrics for the Fortune 50. Before DataSynapse, Mr. Sedighi spent a number of years at TIBCO Software, where he implemented high-speed messaging solutions for organizations such as the New York Stock Exchange, UBS, Credit Suisse, the US Department of Energy, the US Department of Defense, and many others. Mr. Sedighi received his BS in Electrical Engineering and MS in Computer Science, both from Rensselaer Polytechnic Institute.