A Note Before You Begin
You covered some of the basics of virtualization and grid computing in my past two articles, Service Virtualization: The Road to Simplification and Achieving Service Level Agreement by Virtualization. I feel that this article is more like a bridge between what you have covered and what will be covered in future articles. I tried to tie up some loose ends from what you have covered thus far, but I have left a number of questions unanswered. As you move forward, these questions will be answered one by one as you delve deeper into this topic.
How do you go about provisioning your unused resources? Is it a safe assumption that your resources will be idle after 5 PM? How about 6 PM? Do you use a feedback model, or do you simply use a heuristically modeled policy and modify it as needed? The fact of the matter is that there is no one concrete answer when it comes to resource provisioning. In my previous article, I gave you a rule of thumb, but not only did I not explain where that rule of thumb came from, I did not even tell you how you should go about finding that low water mark. I will delve into some of the details of resource provisioning and utilization in this article and try to close some loose ends I left in the previous article.
Before I go any further, I need to cover some basic ground. There are two primary classes of scheduling:
- Static
- Adaptive or dynamic
Static scheduling has been around for many years. Its concept goes back to the mainframe days when you needed to “call ahead” and reserve resources. You knew how many CPUs you had beforehand and scheduled access accordingly. Resources were reserved for a given application based on a user’s priority and the type of application. On a good day, the number of CPUs did not change; in other words, no systems crashed, and no system was overloaded. The systems were composed of large, expensive SMPs (Symmetric Multi-Processors), and communication between these large systems was a hassle. Most of the scheduling was done at the Operating System level, with the scheduler simply taking care of the macro-schedule across many systems.
Adaptive scheduling is what many, if not all, commercial vendors support today. The basic idea is that the scheduler can handle a dynamic set of resources. Servers or desktops can be added to and removed from the overall infrastructure, and the scheduler handles this change in real time. Little or no administration is required to provision systems, and cheap off-the-shelf blades make up the Grid, with very high-speed communication links connecting the resources and the users. The resources are sometimes even distributed globally, and the scheduler relies on policies to make decisions for a given user. Scheduling is more difficult in this case because the number of clients or requests could spike at any moment, and the scheduler must not fall apart when these boundary conditions arise. I will spend more time on the scheduler in future articles, but I just wanted to give you some background before you move to the next section.
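To make the idea of a dynamic set of resources a little more concrete, here is a minimal sketch in Python. It is not how any particular vendor's scheduler works; the class name `AdaptivePool`, the host names, and the "least-loaded resource wins" policy are all assumptions I made up for illustration.

```python
class AdaptivePool:
    """Toy resource pool whose membership can change while work is dispatched."""

    def __init__(self):
        self._load = {}  # resource name -> number of tasks currently assigned

    def add(self, name):
        # A server or desktop joins the grid; it starts with no work.
        self._load.setdefault(name, 0)

    def remove(self, name):
        # A resource leaves (crash, owner reclaims it). Return its task count
        # so the caller knows how many tasks need to be rescheduled.
        return self._load.pop(name, 0)

    def dispatch(self):
        # Assumed policy: send the next task to the least-loaded resource,
        # breaking ties by name so the result is predictable.
        if not self._load:
            raise RuntimeError("no resources available")
        name = min(sorted(self._load), key=self._load.get)
        self._load[name] += 1
        return name


pool = AdaptivePool()
pool.add("blade-1")
pool.add("blade-2")
first = pool.dispatch()             # goes to blade-1 (tie broken by name)
second = pool.dispatch()            # goes to blade-2
pool.add("desktop-7")               # a desktop joins at run time
third = pool.dispatch()             # goes to desktop-7, which has no work yet
orphaned = pool.remove("blade-1")   # blade-1 leaves with one task assigned
```

A static scheduler would fix the set of names up front; here the set changes between calls to `dispatch`, and the caller has to deal with the orphaned work.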
Types of Resources
There are two types of resources that I want to emphasize for this article:
- Dedicated resources
- Dynamic resources
As its name suggests, a dedicated resource is one that is part of your Grid all the time. You are the sole sponsor and owner of that resource. You dictate its location, its configuration, OS type, amount of memory, vendor, and so forth. I will not get into the details of the Center of Excellence (COE) concept that was previously discussed. Even with a COE involved, the ownership is simply transferred to the COE; everything else stays the same. The good thing about a dedicated resource is that, well, it’s dedicated! Little or no planning is required from your perspective to realize your overall compute capabilities. Your dedicated resources represent the minimum resources you have at your disposal. They are only the minimum because, within the enterprise, there are unused resources that you could take advantage of when you are allowed to do so.
If you recall the SETI@Home project back in the late 90s, you know what I am referring to when I talk about dynamic resources. One could think of dynamic resources as resources that do not belong to you, but that you are granted access to when they are not being used for anything more important. I won’t discuss the details of “importance” because everyone’s perspective is different regarding this matter. Desktops that are reprovisioned after hours can be thought of as dynamic resources. Now, the goal is not to set aside a block of hours for a given resource to be used for something else, but to have a finer, more granular method by which idleness is measured. As you delve more into the overall enterprise grid model, you will realize why and how this fine-grained reporting paves the way for a scalable architecture.
There are a number of questions that arise when you start talking about resource provisioning. The basic question that is normally asked is, “How do I find out that the resource is really idle?” The usual answer to this eternal question is to install a software agent on the resource in question so that it can tell you when the resource is free. One subject that has always been touchy is who owns a resource at any given point in time. Who has priority over a given resource once it is in use? What exceptions apply to this priority transfer, and under what boundary conditions?
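As a rough illustration of what such an agent might do, the Python sketch below declares a resource idle only after several consecutive quiet samples, rather than by the clock. The class name `IdleDetector`, the 5% threshold, and the three-sample window are assumptions I chose for the example, not recommended values.

```python
from collections import deque


class IdleDetector:
    """Reports a resource as idle only after a run of consecutive quiet samples."""

    def __init__(self, threshold=0.05, window=3):
        self.threshold = threshold           # assumed: below 5% utilization counts as quiet
        self.samples = deque(maxlen=window)  # keeps only the most recent samples

    def report(self, utilization):
        # Record one utilization sample in the range 0.0 to 1.0.
        self.samples.append(utilization)

    def is_idle(self):
        # Idle means the window is full and every sample in it is quiet.
        full = len(self.samples) == self.samples.maxlen
        return full and all(u < self.threshold for u in self.samples)


detector = IdleDetector(threshold=0.05, window=3)
states = []
for sample in [0.40, 0.02, 0.01, 0.03, 0.60, 0.01]:
    detector.report(sample)
    states.append(detector.is_idle())
```

With these samples, the detector reports idle only after the fourth reading and withdraws that verdict as soon as the 60% reading arrives. A shorter window reacts faster but hands out the machine more eagerly; that trade-off is exactly the ownership question raised above.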
It is best to start with the same basic rule that was discussed last time:
“Assume that only 10% of your dynamic resources are available to your overall pool of resources at any given point of time.”
At first, this seems like a very low number, and it is. This number should be revised once more data is available and usage profiles for your resources have been gathered. Profiling of your dynamic resources needs to be done before any planning is done at the enterprise level. There are two scenarios to consider, however:
- Overall usage of a given resource: this is more like the average taken over a period of time, like a day, or one month
- Transient spikes: the overall average usage may be 10%, but the average alone does not tell you whether that load arrives in short bursts or as a steady 10% all the time
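A small Python sketch may help with both the rule of thumb and the two scenarios. The function names `planned_capacity` and `summarize` are hypothetical, and the figures (50 dedicated machines, 1,000 desktops, ten samples per machine) are invented purely for illustration.

```python
def planned_capacity(dedicated, dynamic, available_percent=10):
    # Dedicated resources always count; dynamic ones count only at the
    # assumed availability (10% until profiling data says otherwise).
    return dedicated + dynamic * available_percent // 100


def summarize(samples):
    # Return (average, peak) utilization for samples in the range 0.0 to 1.0.
    return sum(samples) / len(samples), max(samples)


capacity = planned_capacity(dedicated=50, dynamic=1000)  # 50 + 100 = 150

steady = [0.10] * 10         # always about 10% busy, never fully idle
bursty = [0.0] * 9 + [1.0]   # idle most of the time, then fully busy

steady_avg, steady_peak = summarize(steady)
bursty_avg, bursty_peak = summarize(bursty)
```

Both machines average roughly 10%, yet the first is never completely free and the second is completely free nine times out of ten. An average over a day or a month cannot tell them apart; you need the peak (or the full profile) as well.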
The reasoning behind the first scenario is simple and rather trivial. The second one needs more explanation. If a server is being considered as a resource, and that server is utilized 10% of the time, is that server “idle” at any point? Your first response would clearly be, “No, I would say that this server is 90% idle, and I would like to use up that 90% when I need to.” The point is that idleness comes in many shapes and colors, and each scenario must be treated accordingly. You have covered two options for resource provisioning thus far, and I want to add another to this list:
- Low utilization on average
- Low utilization during off-peak hours
- Random utilization throughout the day
The combined policy that you apply to the first two applies to the third scenario, but this option still needs to be mentioned for the reasons that will become clear in future articles.
Please go back to the question that I raised at the beginning of this section, “Who is the owner of a resource at any given point in time?” This question is more important than you think. Let me run a scenario by you first. A user (User A) has submitted a job that takes about 10 minutes to run. This job is running on the desktop of another user (User B), who has gone to lunch. It is now 12:55 PM, and User B is coming back at 1:00 PM; the job still has another 5 minutes of run time to complete. Who is the owner of this box for those 5 minutes? If User A is the owner, User B would be very unhappy about the fact that someone has taken over his machine. If User B is the owner, User A would be very unhappy about the fact that a job that was already about half complete was terminated.
This is a very simple scenario, with only two users involved. What would your reaction be if you were talking about thousands of desktops, thousands of requests, and a rather unpredictable workload? The truth of the matter is that there is no “right” answer to any of the questions posed here. Decisions must be made that weigh the pros and cons of each action.
In the scenario that I just described, I would do the following:
- Remove the name of this machine from the available list of resources.
- Decrease the priority of the current running job to a much lower priority. This way, the user has full control of his box.
- Start the same job with high priority on some other resource. This way, you short-circuit the scenario where you do not know when the job will finish. One of these two resources will finish the job first, and you can go ahead and cancel the other copy.
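The three steps above can be sketched in Python as follows. This is only one possible reading of the policy; the `Job` and `Grid` classes, the priority values, the host and job names, and the choice of "first remaining host by name" for the duplicate are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    host: str
    priority: int


@dataclass
class Grid:
    pool: set = field(default_factory=set)       # hosts the scheduler may use
    running: list = field(default_factory=list)  # jobs currently executing

    def owner_returned(self, host, low_priority=1, high_priority=10):
        # Step 1: remove the machine from the list of available resources.
        self.pool.discard(host)
        copies = []
        for job in self.running:
            if job.host != host:
                continue
            # Step 2: demote the job so the owner keeps control of the box.
            job.priority = low_priority
            # Step 3: start a high-priority copy on another host, if any is left.
            if self.pool:
                other = min(self.pool)  # assumed choice: first host by name
                copies.append(Job(job.name, other, high_priority))
        self.running.extend(copies)
        return copies

    def finished(self, job):
        # The first copy to finish wins; every other copy of the job is cancelled.
        cancelled = [j for j in self.running if j.name == job.name and j is not job]
        self.running = [j for j in self.running if j.name != job.name]
        return cancelled


grid = Grid(pool={"desktop-b", "server-1"})
original = Job("nightly-report", "desktop-b", priority=5)
grid.running.append(original)

copies = grid.owner_returned("desktop-b")  # User B is back at his desk
cancelled = grid.finished(copies[0])       # here the copy on server-1 finishes first
```

The price of this approach is duplicated work: for a while, two machines run the same job. Whether that is acceptable depends on how long your jobs run and how much spare capacity you have.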
This was just one of many scenarios that need to be considered when provisioning resources. Ownership of a resource is always one of the more interesting questions because both parties involved struggle for power especially at some of these boundary conditions.
Obviously, as with any enterprise-wide deployment, the question of security comes into play. This is even more so when you start talking about an external entity accessing a resource. Security is left for another discussion, but what is important here is that the agent cleans up after itself and does not “poke around” while it is turned on. In this little scenario, User A wants full control over the box to get the proper results. User B does not share the same sentiment; he is uncomfortable about someone else perhaps gaining access to confidential documents. As you can see, security and the question of ownership go hand-in-hand.
I am not sure if I raised more questions this time or answered a few. When I started writing this article, I wanted to put some directions and answers on paper. I soon realized that these questions are too specific to answer. They depend on the application, the environment, the number of users, the usage profile, and many other variables that are very difficult to nail down in four pages. I would like to tell you that the answers are coming in future articles, but that would not be the case. You will cover these topics in more detail and you will be able to find the answers, but they will not be clear-cut. A workload profile of short-running tasks and one of long-running tasks have different requirements and deployment configurations. You will learn how to tailor your environment to suit your needs as you go through this process.
Until next time…
About the Author
Mr. Sedighi is currently the CTO and founder of SoftModule. SoftModule is a startup company with engineering and development offices in Boston and Tel-Aviv and a Sales and Management office in New York. He is also the Chief Architect for SoftModule’s Grid Appliance product that has risen from a current need in the market to manage excessive demands of computing power at lower cost.
Before SoftModule, Mr. Sedighi held a Senior Consulting Engineer position at DataSynapse, where he designed and implemented Grid and Distributed Computing fabrics for the Fortune 50. Before DataSynapse, Mr. Sedighi spent a number of years at TIBCO Software, where he implemented high-speed messaging solutions for organizations such as the New York Stock Exchange, UBS, Credit Suisse, US Department of Energy, US Department of Defense, and many others. Mr. Sedighi received his BS in Electrical Engineering and MS in Computer Science, both from Rensselaer Polytechnic Institute.