Big Data/Hadoop – An Introduction
In today’s technology world, Big Data is a hot IT buzzword. In short, “Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” To mitigate the complexity of processing such large volumes of data, the Apache Software Foundation developed Hadoop – a reliable, scalable and distributed computing framework.
Hadoop – 10 Facts
Processing massive data is one of the biggest challenges, and Hadoop parallelizes data processing across many cluster nodes. Hadoop is especially well suited to large data processing tasks, and it can leverage its distributed file system to cheaply and reliably replicate chunks of data to nodes in the cluster, making data available locally on the machine that is processing it.
In this article, I’ll share 10 facts about how Hadoop tackles problems related to large volumes of structured and unstructured data.
1. Import/Export Data to and from HDFS
In the Hadoop world, data can be imported into the Hadoop Distributed File System (HDFS) from various diversified sources. After importing data into HDFS, the required level of processing can happen on the data using MapReduce or tools like Hive, Pig, etc. The Hadoop system not only gives you the flexibility to process huge volumes of data; the processed data – filtered, aggregated and transformed – can also be exported to external databases using Sqoop.
Exporting data to other databases like MySQL, SQL Server or MongoDB is a powerful feature, which can be leveraged to gain better control over your data.
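As a minimal sketch of getting data into HDFS from application code, the snippet below copies a local file into the cluster using the Hadoop FileSystem API; the NameNode address and both file paths are hypothetical placeholders, and a Sqoop export itself is normally driven from the command line rather than from Java.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImportExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: the NameNode address is an illustrative placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; both paths are hypothetical.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"),
                             new Path("/data/incoming/sales.csv"));

        fs.close();
    }
}
```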
2. Data Compression in HDFS
Hadoop stores data in HDFS and supports data compression/decompression. Data compression can be achieved using compression algorithms like bzip2, gzip, LZO, etc. Different algorithms can be used in different scenarios based on their capabilities; for example, compression/decompression speed or the ability to split files.
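The sketch below shows one way such compression can be applied when writing a file to HDFS, assuming the gzip codec and a hypothetical output path; other codecs such as bzip2 or LZO would be wired in the same way.

```java
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class HdfsCompressionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the gzip codec via ReflectionUtils so it picks up the configuration.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap the HDFS output stream with the codec; the path is a hypothetical example.
        Path out = new Path("/data/compressed/events.gz");
        OutputStream stream = codec.createOutputStream(fs.create(out));
        stream.write("sample event record\n".getBytes("UTF-8"));
        stream.close();
        fs.close();
    }
}
```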
3. Transformation in Hadoop
Hadoop is an ideal environment for extracting and transforming huge volumes of data. Also, Hadoop provides a scalable, reliable and distributed processing environment. There are multiple methods to extract and transform data, using MapReduce, Hive, Pig, etc.
Once input data is imported into or placed in HDFS, the Hadoop cluster can be used to transform large datasets in parallel. As mentioned, transformation can be achieved using the available tools; for example, if you want to transform data into a tab-separated file, MapReduce is one of the best tools for the job. In the same vein, Hive and Python can be leveraged to clean and transform geographical event data.
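As a rough illustration of such a transformation, here is a map-only MapReduce sketch that converts comma-separated input into tab-separated output; the input layout (an ID followed by the rest of the record) is an assumption made purely for the example.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job body: turns comma-separated input into tab-separated output.
// TextOutputFormat separates key and value with a tab by default, so emitting
// the first field as the key and the rest as the value yields a TSV file.
public class CsvToTsvMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", 2);
        if (fields.length == 2) {
            // Remaining commas in the record become tabs as well.
            context.write(new Text(fields[0]), new Text(fields[1].replace(',', '\t')));
        }
    }
}
```

Configured as a map-only job (job.setNumReduceTasks(0)), this writes the transformed records straight back to HDFS, with TextOutputFormat inserting the tab between key and value.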
4. Achieving Common Tasks
There are a number of common tasks that need to be done during the daily processing of data, and they occur very frequently. The available languages and tools like Hive, Pig and MapReduce are very useful for achieving these tasks and make your life easy.
Sometimes one task can be achieved in multiple ways; in such situations a developer or an architect has to make the right decision to implement the right solution. For example, Hive and Pig provide an abstraction layer between the data flow and queries, and the MapReduce workflows they compile to. The power of MapReduce can be leveraged for scalable queries. Hive can be used to build analytics and manage data using HiveQL (an SQL-like declarative language), and Pig can be utilized by writing operations in Pig Latin.
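As one illustration of driving HiveQL from application code, the sketch below runs a simple aggregation over JDBC against HiveServer2; the connection URL, credentials and the web_logs table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database, table and credentials are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```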
5. Combining Large Volumes of Data
In general, to get the final results, data needs to be processed and joined with multiple datasets. In Hadoop there are many ways to join multiple datasets. MapReduce provides map-side and reduce-side joins; these joins are non-trivial in nature and can be expensive operations. Pig and Hive are equally capable of joining multiple datasets: Pig provides replicated join, merge join and skewed join, while Hive offers map-side joins and full outer joins to analyze the data.
The important fact is that data can be combined using various tools like MapReduce, Pig and Hive, chosen based on their built-in capabilities and the actual requirements.
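For a concrete picture of what a reduce-side join looks like, here is a condensed MapReduce sketch that joins a users dataset with an orders dataset on a shared key; the file layouts (comma-separated, join key in the first field) and the dataset names are assumptions for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Tags each user record (userId,name) with a "U" marker, keyed by userId.
    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);
            if (f.length == 2) ctx.write(new Text(f[0]), new Text("U" + f[1]));
        }
    }

    // Tags each order record (userId,amount) with an "O" marker, keyed by userId.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);
            if (f.length == 2) ctx.write(new Text(f[0]), new Text("O" + f[1]));
        }
    }

    // Buffers both sides for a key and emits their cross product (the joined rows).
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> users = new ArrayList<String>();
            List<String> orders = new ArrayList<String>();
            for (Text t : values) {
                String s = t.toString();
                if (s.startsWith("U")) users.add(s.substring(1));
                else orders.add(s.substring(1));
            }
            for (String u : users)
                for (String o : orders)
                    ctx.write(key, new Text(u + "\t" + o));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```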
6. Ways to Analyze High Volume Data
Often in the Big Data/Hadoop world, a problem is not complicated and the solution may be straightforward, but the challenge is the volume of data. In such circumstances the problem-solving approach needs to be different. Some typical analytical tasks are counting the distinct IDs in a log file, transforming stored data for a specific date range, page rank, etc. All these tasks can be solved with various tools and techniques in Hadoop, like MapReduce, Hive, Pig, Giraph, and Mahout. These tools provide flexibility to extend their capability with the help of custom routines.
For example, graph and machine learning problems can be solved using the Giraph framework instead of a MapReduce job, to avoid writing a complex algorithm. The Giraph framework is more useful than a MapReduce job for such problems because they often require applying iterative steps, and Giraph is designed for that kind of iterative processing.
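As a small example of an analytical task at volume, the sketch below counts distinct IDs in log files with MapReduce: the mapper emits each ID as a key, and the reducer increments a counter once per distinct key. The log layout (the ID as the first whitespace-separated field) is an assumption for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctIdCount {

    // Emits each ID once per input record.
    public static class IdMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        protected void map(LongWritable k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String id = line.toString().split("\\s+")[0];
            ctx.write(new Text(id), NullWritable.get());
        }
    }

    // Each reduce call sees exactly one distinct ID, so incrementing a counter per
    // call yields the number of distinct IDs in the job's counters.
    public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        protected void reduce(Text id, Iterable<NullWritable> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.getCounter("Analysis", "DISTINCT_IDS").increment(1);
            ctx.write(id, NullWritable.get());
        }
    }
}
```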
7. Debugging in Hadoop World
Debugging is always an important part of any development process. The need for debugging in the Hadoop environment is as big as Hadoop itself. There is a saying that malformed and unexpected input is common and can cause everything to break at scale, which is an unfortunate downside of working with large amounts of unstructured data.
Although individual tasks are isolated and given different sets of input, tracking various events requires an understanding of the state of each individual task. This can be achieved with the numerous tools and techniques available to support the process of debugging Hadoop jobs. For example, to avoid job failures there is a mechanism to skip bad records, and counters in a MapReduce job can be used to track bad records, etc.
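The pattern below is a minimal sketch of that idea: the mapper catches malformed records, counts them with a custom counter and moves on instead of failing the whole job. The expected record layout (two comma-separated fields with a numeric second field) is assumed only for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Defensive mapper: malformed lines are counted and skipped rather than failing the task.
public class DefensiveMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    enum Quality { GOOD_RECORDS, BAD_RECORDS }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        try {
            String[] fields = line.toString().split(",");
            long value = Long.parseLong(fields[1].trim());
            context.write(new Text(fields[0]), new LongWritable(value));
            context.getCounter(Quality.GOOD_RECORDS).increment(1);
        } catch (Exception e) {
            // Track the offending input instead of letting one bad line kill the job.
            context.getCounter(Quality.BAD_RECORDS).increment(1);
        }
    }
}
```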
8. Easy Control of the Hadoop System
Product development is an important activity, and system maintenance is equally important; it helps decide the future of a product. In Hadoop, setting up the environment, maintaining and monitoring it, and handling and tuning MapReduce jobs are all required to benefit from the Hadoop system; for that, Hadoop provides a lot of flexibility to control the whole system. Hadoop can be configured in three different modes: standalone mode, pseudo-distributed mode and fully-distributed mode. With the help of the Ganglia framework the whole system can be monitored and the health of nodes tracked. Additionally, parameter configuration functionality provides control over MapReduce jobs.
The Hadoop system offers a good amount of flexibility, making overall system-level control easy.
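As a small example of that control, the sketch below sets a few illustrative MapReduce tuning parameters programmatically; the property names follow the newer mapreduce.* naming (older releases use mapred.* equivalents), and the values are placeholders rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Illustrative tuning knobs (values are placeholders, not recommendations):
        conf.set("mapreduce.map.memory.mb", "2048");             // memory per map task
        conf.set("mapreduce.task.io.sort.mb", "256");            // map-side sort buffer
        conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate data

        Job job = Job.getInstance(conf, "tuned job");
        job.setNumReduceTasks(8); // parallelism on the reduce side
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```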
9. Scalable Persistence
There are many options available to handle huge volumes of structured and unstructured data, but scalability in storing massive data is still one of the major concerns in the data world. The Hadoop ecosystem offers Accumulo to mitigate such issues. Accumulo is inspired by Google's BigTable design; it is built on top of Hadoop, ZooKeeper and Thrift, and offers scalable, distributed, cell-based persistence of data backed by Hadoop.
Accumulo comes with a few improvements on the BigTable design, in the form of cell-based access control and a server-side programming mechanism that can help modify key/value pairs at various points in the data management process.
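The sketch below uses the Accumulo 1.x-style Java client to write a single cell with a visibility label, which is the cell-based access control mentioned above; the instance name, ZooKeeper quorum, credentials and table are hypothetical placeholders.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class AccumuloWriteExample {
    public static void main(String[] args) throws Exception {
        // Instance name, ZooKeeper quorum, credentials and table are placeholders.
        ZooKeeperInstance instance = new ZooKeeperInstance("dev", "zk1:2181,zk2:2181");
        Connector connector = instance.getConnector("root", new PasswordToken("secret"));

        BatchWriter writer = connector.createBatchWriter("events", new BatchWriterConfig());

        // One cell with a visibility label; only users holding the
        // "analyst" authorization will be able to read it back.
        Mutation m = new Mutation(new Text("row-0001"));
        m.put(new Text("metrics"), new Text("clicks"),
              new ColumnVisibility("analyst"), new Value("42".getBytes("UTF-8")));
        writer.addMutation(m);
        writer.close();
    }
}
```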
10. Data Read and Write in Hadoop
In Hadoop, data reads and writes happen on HDFS. HDFS stands for Hadoop Distributed File System, a fault-tolerant distributed file system. It is optimized for streaming reads of large files, where I/O throughput is favored over low latency.
There are many ways to read data from and write data to HDFS efficiently, like the FileSystem API, MapReduce, and advanced serialization libraries.
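As a minimal sketch of the FileSystem API route, the snippet below writes a small file to HDFS and streams it back; the path is a hypothetical example, and the configuration is assumed to point at the cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/notes.txt"); // hypothetical path

        // Write: create() returns a stream backed by HDFS block replication.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read: open() gives a streaming input optimized for sequential access.
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}
```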
Summary
In this article we explored what can be achieved in the Big Data world using the Hadoop framework. In a future article I’ll discuss each area in more detail, which will help you with real-time implementations and with getting the best out of the Hadoop system.
Stay tuned!
Resources
http://en.wikipedia.org/wiki/Big_data
http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
http://www.ibm.com/developerworks/library/wa-introhdfs/
http://developer.yahoo.com/hadoop/tutorial/module1.html
http://www.cs.brandeis.edu/~rshaull/cs147a-fall-2008/hadoop-intro/