MapReduce is a programming model introduced by Google for solving a class of Big Data problems with large clusters of inexpensive machines. Hadoop builds this model into the core of its processing engine. This article gives an introductory overview of the MapReduce model as Hadoop uses it to tackle Big Data problems.
Overview
A typical Big Data application deals with a large and growing set of data. Using a single database to store and retrieve it can become a major processing bottleneck. This is particularly true of a monolithic store, such as a relational database used as a single repository. That approach does not scale, especially when we have to deal with large datasets in a distributed environment.
Google devised the MapReduce algorithm to address this situation. The idea is to divide a large task into smaller, manageable parts and distribute them across computers in the network for processing; the partial results are then integrated to form the final dataset. This idea became the basis of Doug Cutting's Hadoop project. Hadoop uses the algorithm to process data in parallel across machines and deliver a complete analysis of large datasets. Broadly, therefore, Hadoop can be divided into two parts:
- Processing: Leveraged by the MapReduce algorithm
- Storage: Leveraged by HDFS
Hadoop MapReduce is thus an implementation of the algorithm developed and maintained by the Apache Hadoop project. It works like an engine: we provide input, and it transforms that input into output quickly and efficiently, processing it through multiple stages. This overly simplistic picture is elaborated in the sections that follow.
MapReduce
MapReduce is a parallel programming model used for fast data processing in a distributed application environment. It works on datasets (multiple terabytes of data) distributed across clusters (thousands of nodes) of commodity hardware. MapReduce programs run on Hadoop and can be written in multiple languages, including Java, C++, Python, and Ruby. The principal characteristic of a MapReduce program is that parallelism is built into it by design. This makes it ideal for large-scale data analysis, squeezing results out of existing infrastructure more efficiently and quickly.
How It Works
Hadoop MapReduce divides a job into multiple stages, each with its own set of functions for extracting the desired result from the Big Data. It runs on nodes in a cluster hosted on a collection of commodity servers. The process begins with a user request that starts the MapReduce engine and ends with the result being stored back in HDFS.
We can initiate a MapReduce job by invoking the JobClient.runJob(conf) method. This convenience method creates a new JobClient instance, which in turn calls submitJob(), then polls the job's progress every second and reports back to the console whenever anything has changed since the last report. The submission has a ripple effect and triggers a set of operations behind the scenes. The first step is to locate and read the input file containing the raw data. The file format is arbitrary and must be converted into a form fit for processing. This is the job of the InputFormat and the RecordReader (RR). The InputFormat divides the file into smaller parts called InputSplits, and the RecordReader then transforms each split's raw data into key-value records and makes them available for processing by map.
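For concreteness, here is a minimal driver sketch for the classic word-count job, using the old org.apache.hadoop.mapred API and the JobClient.runJob(conf) call mentioned above. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative placeholders, not part of Hadoop itself.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word-count");

        // Types of the final (reduce) output key/value pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Plug in the map and reduce implementations (sketched in later sections).
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // The InputFormat splits the input; a RecordReader turns each split into records.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job and polls its progress until completion, as described above.
        JobClient.runJob(conf);
    }
}
```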
Mapping
Once the data is in a form acceptable to map, the map function is invoked for each input (key, value) pair and begins processing. The output the map function produces is not written directly to disk; instead, it is stored in a memory buffer for some presorting. Each map task maintains a circular buffer to which it redirects its output; when the buffer exceeds a threshold size, its contents are spilled to disk. The output is also divided into partitions, one for each reducer to which the data is directed next. All of this work takes place simultaneously on multiple nodes in the Hadoop cluster. After the map tasks complete, the intermediate results accumulated in each partition are shuffled and sorted to prepare them as input for reduce.
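A minimal sketch of the map side of the word-count example, assuming the hypothetical WordCountMapper named in the driver above: the framework calls map() once for every record delivered by the RecordReader, and each emitted pair goes into the in-memory buffer described above before being spilled and partitioned.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // With TextInputFormat, the key is the line's byte offset and the value is the line.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE); // emit (word, 1) for the shuffle and sort phase
        }
    }
}
```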
Reduce and Merge
What reduce receives is also a set of (key, value) pairs, and it acts in a fashion similar to map. It gathers the map output from the various map tasks across the cluster and begins processing only after mapping is complete. A set of copier threads fetches the map output and spills it to disk, and as copies accumulate there, a background thread merges them into larger, sorted files. Reduce also produces its output as (key, value) pairs, which may need to be reformatted by the OutputFormat before the application can accept them. The OutputFormat finally takes the (key, value) pairs and writes the processed data back to HDFS. Here the RecordWriter plays the major role, much like the RecordReader does while reading from HDFS at the beginning.
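To complete the sketch, here is the matching reduce implementation for the hypothetical WordCountReducer referenced in the driver. After the shuffle and merge, reduce() receives each key together with an iterator over all the values emitted for it, and the OutputFormat/RecordWriter write the collected pairs back to HDFS.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up the counts emitted by the map tasks
        }
        output.collect(key, new IntWritable(sum)); // final (word, total) pair written via the OutputFormat
    }
}
```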
Conclusion
This is just the tip of the iceberg; many intricate details and much more go on behind the scenes. In short, Hadoop MapReduce provides the ability to break Big Data into smaller, manageable parts, process them in parallel on a distributed cluster, and finally make the data available for consumption or further processing. Hadoop has since grown into a larger ecosystem of tools and technologies for solving cutting-edge Big Data problems and is evolving quickly to refine its features.