Data is huge! Literally speaking, that is. Management of data is complex, for a single application or for the enterprise. It’s interesting to know that the approach to manage large chunks of data has changed very little over the years; technology has improved, but the method by which you apply the technology is still the same. This article will focus on data management; not the specifics, but a more general overview of available options.
There are two basic scenarios:
- The data is larger than the available memory.
- The data is smaller than the available memory.
It comes down to one or the other, but realistically, the second is the only viable option. If the data is larger than the available memory, you will need to break the large data into smaller chunks that you can process one at a time. You can further classify this scenario into input and output data sizes:
- Input data is small, and output data is small
- Input data is small, and output data is large
- Input data is large, and output data is small
- Input data is large, and output data is large
Figure 1 depicts your options (the plus sign is an “or”). You are showing part of the ultimate solution in the picture in that the “small” data is saved and retrieved locally, and “large” data is in some external location.
Figure 1: The Depiction of Data
The assumption you are making is that the data is either too large or too small to fit in the available memory of the process[or]. It does not matter if you have 4MB of RAM or 16GB of RAM; the assumption is that the data that you want to crunch is larger than the space you have available.
The interesting part is that this problem never goes away! Years from now, when a server comes with 4TB of RAM, you will likely be struggling to find out how to crunch your data, which is of the order of 40TB, for example. The more interesting part is that you will likely solve this problem using the same technique you utilize today—paging!
What is paging? Paging is basically the swapping of data between main memory and secondary storage; to get the data you need into a faster media (in other words, local memory and eventually the processor cache), and swap “old” data from main memory back to the disk to create space for the “new” data. The cleverness of any design is and will be how the “old” and “new” data pages are accessed, copied, and restored onto disk and moved into the processor cache (the fastest accessible large memory) to be processed. So, the basic idea is simple, but implementation is very challenging; that’s why new and clever algorithms are being introduced by processor vendors, memory vendors, drive vendors, operating system vendors, and so forth. Paging is an effective method of managing data, either logically or physically—logically as in a database caching the latest data retrieved even if the data being cached is larger than the available memory or physically when a processor interfaces with local RAM and the storage drives. Obviously, in normal circumstances you will use a heuristic algorithm to find out what to page-in and page-out.
Basically, you want to remove the page that you no longer need. The difficulty arises when you try to find that section of memory which you no longer need. The problem becomes even more challenging when you are trying to predict the next section of data that you will need to avoid waiting for data.
The performance of any system drops to a screeching halt if and when you have to wait for data. Local memory is much faster than any attached storage device. The moment you have go to a storage device to retrieve data, you might as well use an i386 processor! Regardless, there are many instances when you do have to wait, and ultimately determine or guess what the “old” and “new” data pages are.
Or Not to Wait
If there was a way to look into the future and know when and what data piece you might need, you could account for and have the data ready exactly when it is needed. Wouldn’t that be nice? Well, I suggest that you let the physicists at CERN figure out the time travel and “looking into the future” thing, and we mere mortals can focus on simply controlling the future.
Wouldn’t it be nice to have a system that could think one step ahead to optimize and prepare itself for what is about to come? You are not looking to control every aspect and not control too much into the “future,” but only our next step or at most, a couple of steps. Why? If you look too much into the future and predict, you go back to the heuristic scenario that U spoke about. If you don’t, you cannot optimize. The idea is hit a note somewhere in between one that is in tune with our application, users’ requests, data profile, and job type!
With any prediction unit, you run the risk of stale data or stale control information (in other words, a job was canceled, but you got the cancel message late). So, your system must compensate for that. Your system can control its next step (n+1), because it inherently knows what that next step is (“know thyself”). Your system, however, cannot control its n+2, n+3, … too much control could lead to too many wasted resources.
[That Is] the Question
Predicting the next task for a given job is interesting, but trying to expand this notion to a number of jobs is challenging. The paging logic becomes complex when dealing with a multi-job environment, but benefits can only be realized when dealing with a multi-job environment. The logic in not applicable in the micro-sense (a single job), but I will show that it is in fact feasible in the macro-sense (a number of jobs).
All resource managers or schedulers work in the same way: A task comes that is part of an overall job or session; the scheduler looks at the available set of resources and compares them with the requirement of the task[s] queued; the task is assigned to a resource. Done!
Now, assume that you have a number of jobs with each having a number of tasks. Each task has three stages:
- S: Task scheduled
- D: Data for task received by resource
- E: Task starts execution
Normally, a task goes through S→D→E (See Figure 2). Abstractly speaking, you want to exchange S and D! You want the data to be received by the resource, before the task is even scheduled!
Figure 2: Normal Data Flow thru Scheduler[s]
The xFactor Factor
In the previous sections, you examined the relevant attributes of architectural models as they pertain to managing data across a Grid. You have examined a hypothetical (Optimal Scenario) model that combines the best components, thus forming a hybrid environment. Unfortunately, this model is strictly hypothetical; in other words, there is no commercially available offering—until now. The xFactor software offering has taken the hypothetical model and turned it into reality.
xFactor provides enterprise-class solutions to the High Performance Computing (HPC) domain, with the goal of optimizing clients’ datacenter investments. The SoftModule xFactor Grid management software package provides organizations with a method to optimally run and manage compute-intensive applications across thousands of CPUs. By leveraging xFactor’s distributed architecture and dynamic resource allocation techniques, clients can achieve dramatic improvement in application performance and resource utilization.
As I mentioned in the previous section, the goal is that the data be received by the resource, before the task is even scheduled. But, how can you go about placing the data at the resource before the task is scheduled? xFactor is an intelligent system; it learns from previous events that occurred in the system and it adjusts its next action based on the previous events. Obviously, the “previous” event is the result from the first task that was scheduled on that resource (See Figure 3), and can be used by the scheduler as the means to determine its next step[s].
Figure 3: Event learning in xFactor
Figure 4 shows the subsequent steps. Although this article provides a cursory review, it is intended to give you a good general understanding of how xFactor manages data to achieve efficiency in the Grid. In the subsequent task, the task arrives at the scheduler, and the data is sent to the resource. The scheduler, realizing that the resource has already been primed with the proper set of data, places (instead of saying ‘schedules’) the next task on the compute resource that is ready to process the task. Even though this model reduces wait-time, it has flaws as well. You need to realize that the time it takes a client to transmit the data to its destination has not changed (physical constraint); you simply are smarter about how you schedule that wait. In the xFactor case, the scheduler is smart enough to realize that the resource is free (or better yet, it is receiving data for the next task coming down the pipe), and it can handle a task from another job.
If the system is composed of only one job, this method obviously does not help. In systems (Grids) where there are a number of jobs, the benefit can be easily realized because the resource is still considered free during Step 6 (see Figure 4), and it can still process tasks while the scheduler is making its next decision for Step 7.
Figure 4: Data and Task flow thru the xFactor
In this article, you explored the challenges of managing data in a Grid environment. The techniques reviewed within this article help one realize the complexity of the problem and physical constraints surrounding it: network bandwidth, memory bandwidth, processing speed, I/O speed, and the like; the list goes on and on. Ultimately, the optimal solution for your environment will greatly depend on the flexibility in the architecture and the willingness of your vendor(s) to work closely in adapting to your strategy.
About the Author
Art Sedighi is the CTO and founder of SoftModule. SoftModule is a startup company with engineering and development offices in Boston and Tel-Aviv and a sales and management office in New York. He is also the Chief Architect for SoftModule’s xFactor product that has risen from a current need in the market to manage excessive demands of computing power at a lower cost.
Before SoftModule, Mr. Sedighi held a Senior Consulting Engineer position at DataSynapse, where he designed and implemented Grid and Distributed Computing fabrics for the Fortune 500. Before DataSynapse, Mr. Sedighi spent a number of years at TIBCO Software, where he implemented high-speed messaging solutions for organizations such as the New York Stock Exchange, UBS, Credit Suisse, US Department of Energy, US Department of Defense, and many others. Mr. Sedighi received his BS in Electrical Engineering and MS in Computer Science, both from Rensselaer Polytechnic Institute.