
The Five Stages of Highly Effective Datacenters

The Process

It is no longer enough to simply buy servers. Why? Not because of the servers, blades, or workstations themselves, but because what you really need, and can never have enough of, is computing power! If you don't agree with this statement, you should stop reading the rest of this article and go do something more valuable with your time!

Management of compute resources has become exponentially more challenging as the sheer number of resources increases and the resources themselves become less transparent. Resources are growing more complex as accelerator technologies such as GPUs and FPGAs become more prevalent, and multi-core processors such as the IBM Cell throw developers a curveball with the concept of the heterogeneous multi-core processor.

From the business side of things, requirements keep coming in and demands grow more complex: new Quality of Service (QoS) requirements, new response-time requirements, and so forth. How did we get here? Where are we going from here, and how can we get there? One thing is for sure: there is no "magic." It's a process, and we have to go through every step of this five-step process in order to come out on top. Assuming that "on top" of your game is where you want to be!

The Five Stages

What are these five stages that I am referring to?

  1. Chaos: Bunch of servers
  2. Organized Chaos: Management of servers; Grid Computing
  3. Uniformity: Focusing on performance; Cluster Computing
  4. Understanding: Focusing on scalability and business needs; Utility Computing
  5. Order: Seamlessness and adaptation; Cloud and Adaptive Computing

Figure 1 further illustrates these five stages, and the rest of this article will be focused on explaining these five stages.

Figure 1: The five stages of datacenter management

First: There Was Chaos

The first goal for the IT part of any organization is "availability." Most of the time, this translates directly into buying two of everything and having a backup for every major system. Not much is done in terms of management or automation. When a server fails, the backup takes over. The problem with this approach is that one can achieve at most 50% utilization (one system at 100% utilization, one system at 0% utilization waiting for a failure). The other flaw with this "server farm" approach is that the IT organization is mostly playing "catch up;" in other words, waiting for something to go wrong and then trying to fix the problem. This is a reactive approach to the problem, not a proactive one. What makes sense here is to be more proactive and add a layer of management to the infrastructure.
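The 50% ceiling follows from simple arithmetic; a minimal sketch (the function name and figures are illustrative, not from any real deployment):

```python
# Active-passive failover: for every active server, an idle standby.
# Best-case utilization = active / (active + standby), since the
# standbys contribute nothing until a failure occurs.

def max_utilization(active: int, standby: int) -> float:
    """Utilization ceiling when standbys sit idle awaiting failure."""
    return active / (active + standby)

# "Buy two of everything": one standby per active server.
print(max_utilization(active=10, standby=10))  # 0.5, i.e., at most 50%
```

Note that the ceiling only drops further if you keep more than one standby per system, which is exactly the inefficiency that Grid sets out to recover.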

Grid: Chaos. Organized.

Why is Grid chaotic? The whole purpose of Grid is to increase utilization of resources, which is achieved through effective management of the infrastructure. This comes at a great cost: availability! For an organization to choose Grid over a server farm, it must evaluate the trade-off between availability and utilization.

A Grid infrastructure promises higher utilization of resources. These "resources" are the Disaster Recovery (DR) resources purchased in case of a primary node failure. The Grid starts utilizing those resources (albeit to achieve a better Quality of Service), but this means those resources are no longer immediately available in case of failure. The key word here is "immediately." Obviously, the proper[1] Grid infrastructure can increase the utilization of the existing infrastructure, but here we are focusing on the utilization of resources within a datacenter, and of specific resources.

Grid does add a layer of provisioning and management to the infrastructure. The infrastructure is more proactive in that it has some capability to seek out the proper resources for a given job. In heterogeneous environments, this "seek and you shall find" mentality is very useful. This model can result in lower network efficiency, but having more resources available at your fingertips increases the quality of service (QoS).
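The "seek and you shall find" step can be sketched as a scheduler scanning a heterogeneous pool for a resource that satisfies a job's requirements. The classes and fields below are purely illustrative assumptions, not the API of any real Grid product:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Resource:
    name: str
    cores: int
    mem_gb: int
    has_gpu: bool
    busy: bool = False

@dataclass
class Job:
    cores: int
    mem_gb: int
    needs_gpu: bool = False

def find_resource(job: Job, pool: List[Resource]) -> Optional[Resource]:
    """Return the first idle resource that meets the job's needs."""
    for r in pool:
        if (not r.busy and r.cores >= job.cores
                and r.mem_gb >= job.mem_gb
                and (r.has_gpu or not job.needs_gpu)):
            return r
    return None  # nothing suitable; job waits in the queue

pool = [Resource("blade-1", cores=4, mem_gb=8, has_gpu=False),
        Resource("gpu-node", cores=8, mem_gb=32, has_gpu=True)]
match = find_resource(Job(cores=2, mem_gb=4, needs_gpu=True), pool)
print(match.name)  # gpu-node
```

A real Grid scheduler would rank candidates rather than take the first fit, but the matching idea is the same: the job states what it needs, and the infrastructure hunts through a mixed pool on its behalf.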

Clusters: Uniformity

Wants and wishes turn into requirements after a Grid solution has been deployed. These requirements usually carry tighter response times and better-defined QoS. It turns out that although Grid's ability to tie together heterogeneous environments and make the best use of underlying resources is desirable, its inability to use the network efficiently becomes problematic as response-time requirements tighten.

A cluster environment is usually composed of homogeneous machines in close proximity, such as within a datacenter. This tighter coupling of resources allows for better response times, and the close proximity of resources allows for a more reasonable recovery in case of failure. You are, however, running the risk of lower utilization because you are excluding some of your resources from your cluster for performance reasons. The obvious next step is to mix the two environments and make one large compute backbone for the enterprise.

Utilitarianism: Key to Understanding

Sharing is key! The ability to share resources of different types among users is a powerful yet mostly underutilized concept. As discussed in the previous section, the goal here is to combine all resources: slow and fast, desktops and blades, and so on. A concept known as the single-system image (SSI) allows both environments (Grid and Cluster) to work side by side: geographically disjoint, yet still accessible and able to meet QoS requirements.
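One simple way to picture the two environments working side by side is a routing rule that sends latency-sensitive work to the tightly coupled cluster and everything else to the wider Grid pool. This is a hypothetical sketch; the threshold and pool names are assumptions for illustration:

```python
def route(job_latency_ms: float, tight_threshold_ms: float = 100.0) -> str:
    """Send latency-sensitive work to the tightly coupled cluster;
    everything else can tolerate the wider, geographically
    dispersed grid pool."""
    return "cluster" if job_latency_ms <= tight_threshold_ms else "grid"

print(route(50))    # cluster: tight response-time requirement
print(route(5000))  # grid: batch work, utilization matters more
```

Under an SSI, the user never makes this choice; the combined backbone presents itself as one system and applies a policy like this behind the scenes.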

This is a difficult stage to reach because there are two ends of the spectrum to satisfy: on one end, you need to meet tight response-time and stringent QoS requirements; on the other, you need to increase utilization of your resources while still avoiding ad-hoc acquisition across a heterogeneous pool.

Adaptation: Cloud and Beyond

What’s next for your datacenters? You need to go beyond planning and provisioning. I am not suggesting foregoing datacenters, but rather making them more adaptive. This is the promise of Cloud computing: the ability to adapt to users’ needs without intervention or much planning.

What is added here is the ability to foresee changes: an added layer of self-monitoring where changes are anticipated, and where sharing of resources goes beyond job placement to the ability to tailor the underlying resources to meet the needs of your jobs. This is not as futuristic as you might think.

The xFactor Factor

Recall what you seek at each stage:

  • Stage 1: Availability
  • Stage 2: Utilization
  • Stage 3: Performance
  • Stage 4: Scalability
  • Stage 5: Seamlessness

What you need is a platform that will guide you through these stages of enlightenment. It does not make sense for a large organization to jump straight to stage 5. Smaller organizations can skip a couple of steps, but a large organization needs to go through the initial planning stages, because all large organizations have legacy hardware and software applications that need to be dealt with. It's that whole migration and integration step that prevents you from jumping right into a Cloud, so to speak.

Most commercial platforms are able to add order to chaos; some are able to achieve performance at the cost of scalability. xFactor's ability to scale and integrate with externally available cloud vendors makes it ideal for a growing organization. Keep in mind that scalability and seamlessness must go hand in hand; in other words, avoid the scenario where the next resource added to the datacenter is "the straw that broke the camel's back."

xFactor provides enterprise-class solutions in the High Performance Computing (HPC) domain, with the goal of optimizing clients' datacenter investments. The SoftModule xFactor Grid management software package provides organizations with a way to optimally run and manage compute-intensive applications across thousands of CPUs. Leveraging xFactor's distributed architecture and dynamic resource-allocation techniques, clients can achieve dramatic improvements in application performance and resource utilization.

To achieve linear scalability, xFactor creates logical clouds that allow on-demand access to neighboring clouds while remaining decoupled when that access is not needed. This on-demand feature allows organizations to access and configure externally available resources (Clouds) to meet SLAs without the advance planning such a move requires today.
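The placement logic behind such on-demand bursting can be pictured as a simple decision: prefer the local logical cloud, and reach out to a neighbor only when local capacity cannot meet the SLA. This is a hedged sketch of the general idea, not xFactor's actual algorithm; all names and numbers are illustrative:

```python
def place(job_cores: int, local_free: int, neighbor_free: int) -> str:
    """Prefer the local logical cloud; burst to a neighboring cloud
    on demand only when local capacity cannot satisfy the job."""
    if job_cores <= local_free:
        return "local"
    if job_cores <= neighbor_free:
        return "neighbor-cloud"
    return "queue"  # no capacity anywhere; wait for resources to free up

print(place(8, local_free=16, neighbor_free=64))   # local
print(place(32, local_free=16, neighbor_free=64))  # neighbor-cloud
```

The decoupling the article describes is the first branch: as long as local capacity suffices, the neighboring cloud is never touched.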


Conclusion

In this article, you explored the challenges of datacenter investment and evolution. It is neither an easy task nor a problem that can be solved overnight. I believe that most organizations have to go through the process I outlined in this article. The goal is to know what you are up against, and plan.

About the Author

Art Sedighi is the CTO and founder of SoftModule, a startup company with engineering and development offices in Boston and Tel-Aviv and a sales and management office in New York. He is also the Chief Architect of SoftModule's xFactor product, which arose from a current market need to manage excessive demands for computing power at a lower cost.

Before SoftModule, Mr. Sedighi held a Senior Consulting Engineer position at DataSynapse, where he designed and implemented Grid and Distributed Computing fabrics for the Fortune 500. Before DataSynapse, Mr. Sedighi spent a number of years at TIBCO Software, where he implemented high-speed messaging solutions for organizations such as the New York Stock Exchange, UBS, Credit Suisse, US Department of Energy, US Department of Defense, and many others. Mr. Sedighi received his BS in Electrical Engineering and MS in Computer Science, both from Rensselaer Polytechnic Institute.

[1] I will cover this “proper”-ness in another article.
