
Manage a Hadoop Distributed File System

Introduction

In the last article, we discussed HDFS’s key terminology and basic architecture. We saw that HDFS was inspired by the Google File System (GFS) and comes with its own advantages over traditional distributed file systems. We also talked about the basic characteristics of HDFS.

In today’s article, we’ll move one step ahead and discuss how we can manage HDFS. We’ll cover topics such as configuring HDFS, adding and deleting nodes in an HDFS cluster, node balancing, rack awareness, managing data integrity, and miscellaneous HDFS tools such as the File System Check (fsck) tool, the DFS Admin tool, and so on.

Configuring HDFS

Hadoop comes with a set of configuration files. These are available in the conf directory after installing Hadoop. The following files are important for configuring HDFS:

  • hadoop-env.sh: An executable shell script, used to configure Hadoop’s environment variables on the cluster (a sample sketch follows this list).
  • core-site.xml: A Hadoop XML configuration file, used to configure core Hadoop settings such as I/O, the NameNode address, and settings for rack awareness, data integrity, and so on.
  • hdfs-site.xml: A Hadoop configuration file, used to maintain HDFS-specific settings such as the replication factor, SecondaryNameNode address, DataNode addresses, temporary locations, and the like.
  • masters: A plain text file containing the address of the node that runs the SecondaryNameNode.
  • slaves: A plain text file containing the addresses of the DataNodes in the cluster.
  • log4j.properties: The logging configuration file, which controls how activities across the Hadoop ecosystem are logged.
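
Because hadoop-env.sh is itself a shell script, a few environment settings are usually all it contains. The following is a minimal sketch; the specific values (Java path, heap size, log directory) are assumptions and will differ in your environment.

   # hadoop-env.sh (sketch)
   export JAVA_HOME=/usr/lib/jvm/default-java     # path to the JDK; adjust for your system
   export HADOOP_HEAPSIZE=1000                    # daemon heap size, in MB
   export HADOOP_LOG_DIR=/var/log/hadoop          # where the daemons write their logs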

There are three control scripts responsible for starting and stopping a Hadoop cluster. All of them live in the bin directory.

  • start-all.sh: An executable shell script. It starts both HDFS and MapReduce in the Hadoop cluster.
  • start-dfs.sh: An executable shell script. It starts only HDFS.
  • start-mapred.sh: An executable shell script. It starts only MapReduce in the Hadoop cluster.

With the preceding set of configurations and control scripts, we can control the execution of HDFS in the Hadoop cluster.
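
As a rough illustration, a typical first start of a cluster looks like the following; the NameNode format step is only for a brand-new cluster, and the script names assume a Hadoop 1.x layout.

   bin/hadoop namenode -format    # only once, on a brand-new cluster
   bin/start-dfs.sh               # starts NameNode, SecondaryNameNode, and DataNodes
   bin/start-mapred.sh            # starts JobTracker and TaskTrackers
   jps                            # lists the running Hadoop daemons on this host
   bin/stop-all.sh                # stops both HDFS and MapReduce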

Adding or Deleting a Node in HDFS

Adding or deleting a DataNode is one of the important activities in managing an HDFS cluster. While managing an HDFS cluster, you may come across circumstances where one or more DataNodes malfunction and need to be removed from the cluster.

In another scenario, you may need to scale out an HDFS cluster. This can be achieved by adding new DataNodes to the cluster.

Following are the steps for adding and deleting DataNodes in an HDFS cluster:

Adding/Commissioning of a New DataNode in an HDFS Cluster

  • Add the network addresses of the new nodes to the include file referenced by the dfs.hosts and mapred.hosts properties.
  • Update the NameNode and JobTracker by running the following commands (a command sketch follows this list):
    • %hadoop dfsadmin -refreshNodes (NameNode)
    • %hadoop mradmin -refreshNodes (JobTracker)
  • Update the slaves file with the new nodes so that they are picked up by the Hadoop control scripts.
  • Start the new DataNode and TaskTracker.
  • Check that the new DataNode and TaskTracker appear in the web UI.
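
The following sketch shows the commissioning steps as shell commands. The include file path (/etc/hadoop/conf/dfs.include) and the host name are assumptions; use whatever your dfs.hosts and mapred.hosts properties actually point to.

   echo "newnode.example.com" >> /etc/hadoop/conf/dfs.include   # hypothetical include file
   bin/hadoop dfsadmin -refreshNodes                            # NameNode picks up the change
   bin/hadoop mradmin -refreshNodes                             # JobTracker picks up the change
   echo "newnode.example.com" >> conf/slaves                    # so the control scripts know the node
   # On the new node itself:
   bin/hadoop-daemon.sh start datanode
   bin/hadoop-daemon.sh start tasktracker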

Deleting/Decommissioning Existing Nodes in an HDFS Cluster

  • Add the network addresses of the nodes to be removed to the exclude file referenced by the dfs.hosts.exclude and mapred.hosts.exclude properties.
  • Update the NameNode and JobTracker by running the following commands (a command sketch follows this list):
    • %hadoop dfsadmin -refreshNodes (NameNode)
    • %hadoop mradmin -refreshNodes (JobTracker)
  • Open the web UI and check whether the admin state has changed to “Decommission In Progress”.
    • In this state, all blocks that exist on the decommissioning DataNodes are copied to other DataNodes.
  • Shut down the decommissioned DataNodes once the state changes to “Decommissioned”.
  • Delete the decommissioned nodes’ entries from the include file and run the following commands again:
    • %hadoop dfsadmin -refreshNodes (NameNode)
    • %hadoop mradmin -refreshNodes (JobTracker)
  • Delete the decommissioned DataNodes’ entries from the slaves file.
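
A corresponding decommissioning sketch follows. Again, the exclude file path and host name are assumptions; use the file referenced by dfs.hosts.exclude and mapred.hosts.exclude in your cluster.

   echo "oldnode.example.com" >> /etc/hadoop/conf/dfs.exclude   # hypothetical exclude file
   bin/hadoop dfsadmin -refreshNodes
   bin/hadoop mradmin -refreshNodes
   # Watch the admin state move from "Decommission In Progress" to "Decommissioned":
   bin/hadoop dfsadmin -report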

Node Balancing in the HDFS Cluster

Over time, the distribution of blocks across DataNodes can become unbalanced; this hurts MapReduce performance and puts extra load on the highly utilized DataNodes. To avoid this, Hadoop comes with a balancer daemon program that redistributes blocks across DataNodes; in other words, it moves blocks from highly utilized to underutilized DataNodes while honoring the block placement policy.

Following are the commands for node balancing:

To Start

bin/start-balancer.sh [-threshold <threshold>]

Examples:

bin/start-balancer.sh

Starts the balancer with a default threshold of 10%.

bin/start-balancer.sh -threshold 20

Starts the balancer with a threshold of 20%.

To Stop

bin/stop-balancer.sh

One important fact about the balancer program is that only one instance of it may run in an HDFS cluster at a time. If an administrator tries to start a second balancer while another instance is already running on the cluster, the later one will refuse to run and exit.

The balancer keeps running until any one of the following conditions is met:

  • The cluster is balanced.
  • There are no more blocks to move.
  • Another balancer is already running.
  • An IOException occurs while communicating with the NameNode.
  • No blocks have been moved for five consecutive iterations.

Upon exiting, the balancer program writes its output to a log file containing information such as the timestamp and the time taken by the balancer, along with one of the following messages:

  • When the cluster is balanced, the message will be “The cluster is balanced. Exiting…”.
  • When no block can be moved, the message will be “No block can be moved. Exiting…”.
  • When an IOException occurs, the message will be “Received an IOException. <failure reason>. Exiting…”.
  • When another balancer is running, the message will be “Another balancer is running. Exiting…”.

Rack Awareness in HDFS

To get maximum performance from a Hadoop cluster, it’s important to configure Hadoop so that it knows its network topology. For a multi-rack system, Hadoop lets the administrator specify which rack a node belongs to through the net.topology.script.file.name configuration property. When this script is configured, the NameNode runs it to map each node’s address to a rack ID.

By default, it’s assumed that all nodes belong to the same rack. A rack ID in Hadoop is hierarchical and looks like a path name. In general, network locations (nodes and racks) are represented as a tree that reflects the distance between locations. The NameNode uses a node’s network location when placing block replicas, and the MapReduce scheduler uses the nearest replica as input to a map task.

Following are the steps for configuring Rack Awareness in a Hadoop system:

    • Create a rack-topology script: The rack topology script must be an executable file and must be copied to the /etc/hadoop/conf directory on all cluster nodes (a sample script follows Figure 1).
    • Run rack-topology.sh on each cluster node to ensure that it returns the correct rack information.

Figure 1: Running rack-topology.sh
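
Because the original figure is not reproduced here, the following is a minimal sketch of what such a rack-topology script might look like. It assumes a data file named rack-topology.data in the configuration directory, with one "<ip-or-hostname> <rack-id>" pair per line, and falls back to /default-rack for unknown hosts.

   #!/bin/bash
   # rack-topology.sh (sketch): map each address passed in by Hadoop to a rack ID
   HADOOP_CONF=/etc/hadoop/conf
   while [ $# -gt 0 ]; do
     node=$1
     shift
     rack=$(awk -v host="$node" '$1 == host {print $2}' ${HADOOP_CONF}/rack-topology.data)
     if [ -z "$rack" ]; then
       echo -n "/default-rack "     # unknown hosts go to the default rack
     else
       echo -n "$rack "
     fi
   done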

    • Create a data file: The rack topology data file must be available in the /etc/hadoop/conf directory on all cluster nodes. This data file maps each DataNode’s IP address to a rack (a sample appears after Figure 2).

rack-topology.data:

Figure 2: Creating the data file
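
The original figure showing the data file is not reproduced here; a hypothetical example of its contents (matching the format assumed by the script sketch above) might look like this:

   192.168.1.11 /rack01
   192.168.1.12 /rack01
   192.168.2.11 /rack02
   192.168.2.12 /rack02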

    • Modify core-site.xml:
      • Stop HDFS.
      • Set the following property’s value to the path of the script file.
<property>
   <name>net.topology.script.file.name</name>
   <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>
    • Restart HDFS and MapReduce.
    • Validate/verify rack awareness: The following checks confirm that rack awareness is in effect (see the sketch after this list):
      • Check the NameNode log, available at /var/log/hadoop/hdfs, where there should be an entry such as “Adding a new node: /rack01/<ipaddress>”.
      • Run the dfsadmin -report command to get a report that includes the rack name next to each machine.
      • Run the fsck tool; it also reports the rack associated with each machine.
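
For example, the following commands give a quick confirmation that rack information is being picked up; the log location is taken from the note above and may vary by distribution.

   grep "Adding a new node" /var/log/hadoop/hdfs/*.log   # NameNode log entries with rack IDs
   bin/hadoop dfsadmin -report | grep Rack               # one "Rack:" line per DataNode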

Handling Data Integrity in HDFS

Data integrity issues may occur at the client end when data received from a DataNode is corrupted. This corruption may happen for various reasons, such as problems in storage or data getting corrupted while being transmitted over the network.

To detect such issues, HDFS performs checksumming at the client side (via its checksummed local file system). During the creation of a file on the client side, not only is the file itself created, but also a hidden .crc file in the same directory; this hidden file contains a checksum for each chunk of the file. Each chunk is 512 bytes by default, controlled by the io.bytes.per.checksum property, and the chunk size is stored in the .crc file as metadata alongside each checksum. The checksum method used is the error-detecting code CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.
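
A quick way to see this checksum mechanism at work is to copy a file out of HDFS with its checksum: the -crc option of -get writes the hidden .crc sidecar next to the local copy. The file names below are hypothetical.

   bin/hadoop fs -get -crc /user/anoop/data.txt /tmp/
   ls -a /tmp/        # shows data.txt along with the hidden .data.txt.crc checksum file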

Checksums are also stored on the DataNodes. Before storing data and its checksum, a DataNode verifies that the data received from the client, or from other DataNodes during replication, matches its checksum.

Each DataNode also runs a DataBlockScanner in a background thread that periodically verifies the checksums of all blocks stored on that DataNode; if a corrupted block is found, it is healed by copying one of the good replicas to produce a new block. Likewise, if a client detects an error while reading data from a DataNode, it reports the bad block, and the DataNode it was read from, to the NameNode. The NameNode immediately marks that block as corrupted and schedules a copy of a good replica to another DataNode to replace the corrupted block, so that the replication factor is maintained.

To investigate a corrupted file before it is deleted, we have to disable checksum verification. There are two ways to achieve this:

Using Java Code

Call setVerifyChecksum(false) on the FileSystem object before opening the file.

Using the Shell Command/CLI

Pass the -ignoreCrc option to the -get or -copyToLocal command.
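
For example (the HDFS path is hypothetical):

   bin/hadoop fs -get -ignoreCrc /user/anoop/suspect-file.txt /tmp/
   # or, equivalently
   bin/hadoop fs -copyToLocal -ignoreCrc /user/anoop/suspect-file.txt /tmp/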

Miscellaneous HDFS Tools

There are many miscellaneous utilities/tools available to manage the Hadoop ecosystem. Here we’ll discuss two of them, fsck and dfsadmin. Both tools are important for any developer or administrator managing HDFS and related activities.

1. fsck tool: File system check (fsck) is a utility for checking the health of the file system. It reports blocks that are missing, under-replicated, or over-replicated on the DataNodes. It checks the filesystem namespace under the given path. It doesn’t correct errors, it isn’t a shell command, and it can be run on the whole filesystem or on a subset of it. Following is the command to run fsck (examples of both tools follow this list):

   bin/hadoop fsck <path> [options]

   There are many options available that you can refer to in the manual.

2. dfsadmin tool: This tool allows you to perform administrative operations as well as find information about the health of HDFS. Like fsck, it isn’t a shell command. Because it’s an administrative command, it requires superuser/admin privileges to execute operations on HDFS.

   Following is the command to run it:

   bin/hadoop dfsadmin [options]

   There are many options available that you can refer to in the manual.
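
As a brief illustration, the following invocations show commonly used options of both tools; the paths and options shown are just examples, not an exhaustive list.

   bin/hadoop fsck / -files -blocks -locations -racks   # detailed block report for the whole filesystem
   bin/hadoop dfsadmin -report                          # capacity and status of every DataNode
   bin/hadoop dfsadmin -safemode get                    # check whether the NameNode is in safe mode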

Summary

In this article, we discussed various ways to manage HDFS in different scenarios. We can configure HDFS during a first-time installation, or it can be managed after a full-fledged implementation.

There is a rich set of commands available to manage HDFS, giving you the flexibility to handle day-to-day challenges.

References

http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/admin_configure_rack_awareness.html

About the Author

Anoop worked for Microsoft for almost six and a half years and has 12+ years of IT experience. Currently, he is working as a DWBI Architect at a top Fortune company. He has worked on end-to-end delivery of enterprise-scale DWBI projects. He has strong knowledge of database, data warehouse, and business intelligence application design and development, as well as Hadoop/Big Data. He has also worked extensively with SQL Server, designing ETLs using SSIS, SSAS, SSRS, and SQL Azure.

Disclaimer: I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.
