Getting Familiarized with the Hadoop Distribution File System
In this article, we'll discuss more about the Hadoop Distribution File System (HDFS). It is one of the distributed file systems that is designed to adhere to the advantages of a traditional distributed file system (DFS) and Google File system (GFS). The Apache Hadoop framework comes up with its own file system, known as the Hadoop Distributed File System (HDFS).
HDFS is a self-healing, fault-tolerant file system developed in Java to store a huge amount of data (in terms of petabytes or terabytes), irrespective of schema and format, providing high scalability, throughput, and reliability while running on large clusters of a commodity machine.
- Massive Data Storage: HDFS supports petabytes or zeta bytes, or more data.
- Single Writer/Multiple Reader model: HDFS follows the Write Once-Read Many basic principle to resolve data coherency issues, such as data will always be stored in a serialized form. HDFS is designed for batch processing instead of interactive use.
- Use of Commodity Machine: HDFS doesn't require a highly expensive machine. It runs on easily available commodity hardware and manages data in such a way that it addresses all issues without interrupting the client.
- Fault Tolerance: To handle failure of hardware and smooth & quick processing of large data sets, HDFS provides replication of data. In HDFS, files are divided into multiple blocks of data and stored on multiple nodes. Storing of data blocks depends on network topology, and if one of node fails while accessing data, HDFS internally accesses data from another node.
- Manage File System: As with traditional file systems, HDFS also supports a file management system. It includes read, write, delete, rename, modify, and relocate files or directories. HDFS doesn't support hardlink and softlink.
By default, the block size is HDFS is 64 MB, and the replication factor is 3. Both of these properties are configurable in the configuration file.
In this section, we'll discuss some of the key terminologies for HDFS. These terminologies are important to understand HDFS functionality.
- DataNode:DataNode is a commodity machine (less expensive) to store a large amount of data. It is termed, too, as the workhorse of thee file system.
DataNode executes all commands driven by NameNode, such as physically creation, deletion, and replication of a block and also does low-level operation for I/O requests served for the HDFS client.
DataNode is a slave by nature; it sends a heartbeat to NameNode in every three seconds, reporting the health of the HDFS cluster and a block report to NameNode. These block reports contain information regarding which block belongs to which file.
DataNode enables pipelining of data and can be used to forward data to another data node that exists in the same cluster.
- NameNode:NameNode stores metadata information of the file system, such as location, size, permission, hierarchy, and the like. It is a highly reliable machine with a lot of RAM for quick access and to support persistence.
NameNode is a single point of failure and manages the file system by using a transactional log; in other words, it edits the log. By nature, NameNode is the Master and commands DataNode to execute I/O operations. NameNode receives a regular heartbeat as well as block reports from all data nodes that exist in the cluster to monitor the health of the HDFS cluster and assure that all are working fine.
- Blocks:Blocks are the smallest writable units on the disk or file system. The file system does all operation on blocks. The concept of blocks is useful to store large-size petabyte or zeta byte files on HDFS. The default block size is 64 MB; this supports storing data of petabytes or zeta byte on a large commodity machine in a cluster as well as providing high throughput for data access.
Storing of large file in blocks on HDFS cluster brings lot of benefits; a few of them are as follows:
- A file larger than any single storage disk available in the network can be stored on a HDFS cluster.
- Supports Easy Manageable Storage Subsystem: Only Data is being stored in the form of block and don't store metadata information in the block as it's being handled by other systems separately.
- Fault Tolerance and Availability: In HDFS, data blocks are replicated and stored on multiple data nodes. If any node fails or a block gets corrupted, a copy of the block can be read from another location because it's transparent to the HDFS client. Also, if failure of DataNode occurs or block gets corrupted or missed, the failed DataNode's data or block is replicated to a live, light-weighted data node.
- Secondary NameNode:This node also known as the CheckPointNode or HelperNode. It's a separate, highly reliable machine with lots of CPU power and RAM. In NameNode, is the FSImage File, that has an image of File System which is useful during startup of NameNode. It edits the Log/Transaction Log: have logs of series of updation applied after NameNode started like deletion, creation, modification, and so forth of files/directories.
Secondary NameNode merges the FSImage file of NameNode with Edits Log and creates a new FSImage file. It sends the newly created FSImage file back to NameNode, which is being used whenever NameNode starts or any failure occurs.
HDFS has a Master-Slave architecture. A HDFS cluster has a single NameNode and a single or multiple DataNode(s). The NameNode is the Master Node; it manages the file system namespace, providing regulatory access to files by a client. It executes all operations of file/directory given by the client, such as reading, writing, renaming, updating, and so forth. It maps files into a set of blocks that are being stored in the DataNode or Slave nodes. DataNode not only stores data but also is responsible for creating/deleting/updating a block. Because HDFS is being built up in the Java language, it can be deployed on a wide range of machines.
Figure 1: HDFS Architecture
In the rest of this section, we'll discuss how HDFS works.
In the beginning, the first NameNode starts and loads a FSImage file into memory and updates the FSImage File by using the Edits log file. After creating a new FSImage file and empty log file, NameNode starts listening for a RPC & HTTP request from DataNode as well as HDFS clients. Now, all data nodes send their heartbeat and block reports to register into NameNode. This whole process is known as Safe Mode.
NameNode exits from safe mode when the minimum replication condition is met and, by extension, safe mode time reached its configured time in conf-site.xml. Once NameNode exits from safe mode, it finds data blocks that are not replicated to a configured number and replicates them to other DataNodes.
HDFS uses the following block management methodologies:
- Block Placement policy:The HDFS default block placement policy minimizes write cost and maximizes data reliability and reading data. Following are the block placement steps:
- First replica places on a local node if it is the HDFS client part of Hadoop clusters; otherwise, any random node.
- Second replica places on a different rack other than that of the rack where the first replica exists.
- Third replica places on a different node but share racks with first replica.
- Replication Management:The NameNode ensures that blocks are properly replicated (must meet configured replication factor) on DataNodes. There are five possibilities for blocks:
- OverReplicated Blocks: If some blocks get over replicated, NameNode deletes over replicas.
- UnderReplicated Blocks: NameNode creates new replicas until they meet the target replication factor.
- MisReplicated Blocks: These blocks doesn't satisfy the default block placement policy; in other words, all configured replicas are stored in the same DataNode on the same rack in a multi-rack cluster where the replication factor is configured. NameNode then re-replicates these blocks, ensuring the default block placement policy.
- Corrupt blocks: If any block gets corrupted and has at least one non-corrupted replica, NameNode creates a new replica, thus achieving the replication factor.
- Missed blocks: Blocks that have no replica in the Hadoop cluster are known as missed blocks.
In this article, we discussed HDFS in detail. We understand key terminologies of HDFS and what the HDFS architecture looks like.
HDFS is a highly reliable and available data storage and has an upper edge on a traditional distributed file system. HDFS principals support faster read and I/O. There are too many things to discuss about HDFS in this article. I'll cover them in future articles.
About the Author
Anoop worked for Microsoft for almost six and half years and has 12+ years of IT experience. Currently, he is working as a DW\BI Architect in one of the top Fortune Companies. He has worked on end-to-end delivery of enterprise-scale DW\BI projects. He has strong knowledge of database, data warehouse, and business intelligence application design and development and Hadoop/Big Data. Also, He worked extensively on SQL Server, designing of ETLs using SSIS, SSAS, SSRS, and SQL Azure.
Disclaimer: I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.