
Big Data Tools: Hive and Pig

  • January 19, 2015
  • By Anoop Kumar

Overview

In the last article, we discussed Map-Reduce and how easily any Java developer can come into the 'Big Data' world and develop a program to analyze the data; earlier, that was not the case. Today, I'll discuss Hive and Pig and explain how developers from the query and scripting language communities can leverage their knowledge to be part of the Big Data world and analyze data.

After reading this article, you will know the pre-requisites for Hive and Pig, and you will see both tools applied to the same problem we solved using Map-Reduce in the last article.

Understanding Hive

Hive is a data warehousing package built on top of Hadoop, designed for data summarization, ad-hoc querying, and analysis of large volumes of data. Be aware that Hive is not designed for online transaction processing; it doesn't offer real-time queries or row-level updates.

Hive is used for data analysis and is targeted towards users comfortable with SQL. Hive's query language, called HiveQL, is similar to SQL. You need not know Java or the Hadoop APIs to use Hive and HiveQL.

The key property of Hive is "schema on read": Hive doesn't verify data when it is loaded; verification happens when a query is issued. This makes the initial load very fast, because loading is just a file copy or move operation and the data doesn't have to be read, parsed, and serialized to disk in the database's internal format.
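To make "schema on read" concrete, here is a hedged HiveQL sketch (the table and column names are illustrative, not from this article): CREATE TABLE only records the schema in the metastore, LOAD DATA simply moves the file, and any rows that don't match the schema surface as NULLs only when queried.

```sql
-- Illustrative only: the schema is recorded, but the file is not validated here.
CREATE TABLE events (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- A fast file move into the warehouse directory; still no validation.
LOAD DATA INPATH '/user/demo/events.csv' INTO TABLE events;

-- Only now is the data parsed; mismatched fields come back as NULL.
SELECT id, payload FROM events LIMIT 10;
```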

Components in Hive Architecture

Figure 1 can help you understand the Hive components.

Hive01
Figure 1: The Hive components

Pre-requisite to a Hive Program

To solve the sample problem (the same one we solved with Map-Reduce in the last article), certain things should be available and configured properly to get the desired output. I'll show you what tools should be installed and what configuration should be in place before you write your first Hive program.

Tools

  • Hadoop 1.2.1
  • Hive

Hive Installation Steps

    1. Get Hive from http://hive.apache.org.
    2. Untar or unzip the Hive folder and install.
    3. Create a new system variable named "Hive_INSTALL" and set Hive_INSTALL=<Installation-path>/hive-0.11.0-bin.
    4. Add the Hive_INSTALL path to the existing system PATH variable: PATH=%PATH%;%Hive_INSTALL%/bin.
    5. Configure Hive by using hive-site.xml, which is present in the <Hive-INSTALLED-DIR>/conf folder.

Hive02
Figure 2: Contents of the hive-site.xml file

    6. You need to create the /tmp and /user/hive/warehouse directories and set appropriate permissions on them in HDFS so that Hive can store its data for further processing; following are the steps.

Run the HDFS commands as follows:

    1. cd <Installation-path>/hadoop-1.2.1
    2. <Installation-path>/hadoop-1.2.1$ bin/start-all.sh
    3. <Installation-path>/hadoop-1.2.1$ hadoop fs -mkdir /tmp
    4. <Installation-path>/hadoop-1.2.1$ hadoop fs -mkdir /user/hive/warehouse
    5. <Installation-path>/hadoop-1.2.1$ hadoop fs -chmod a+w /tmp
    6. <Installation-path>/hadoop-1.2.1$ hadoop fs -chmod a+w /user/hive/warehouse

You need to follow the next steps to confirm that Hive is installed and configured properly:

    1. Go to the Hive installation directory:

cd $Hive_INSTALL (variable created in Step 3)

    2. Start the metastore service by running the following command:
bin/hive --service metastore &
    3. Open a new terminal to start working with Hive:
cd $Hive_INSTALL
bin/hive

The preceding commands will help you reach the Hive shell; this is required to start data processing in the Big Data world.

Sample Problem

Today, we'll learn to write a Hive program to solve one problem:

Problem: How many people belong to each state?

For sample purposes, I have prepared a users.txt file with five columns. Following is the file structure with sample data populated:

<UserId>,<Username>,<city>,<state>,<country>

1,John,Montgomery,Alabama,US

2,David,Phoenix,Arizona,US

3,Sarah,Sacramento,California,US

4,Anoop,Montgomery,Alabama,US

5,Marinda,Phoenix,Arizona,US

6,Maria,Sacramento,California,US

7,Jony,Phoenix,Arizona,US

8,Wilson,Montgomery,Alabama,US

9,Jina,Lincoln,Nebraska,US

10,James,Columbus,Ohio,US
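Before running Hive, it helps to know the answer we expect. This short Python sketch (illustrative only; it is not part of the Hive workflow) computes the state-wise count directly from the ten sample rows above:

```python
from collections import Counter

# The ten sample rows from users.txt, exactly as shown above.
rows = """1,John,Montgomery,Alabama,US
2,David,Phoenix,Arizona,US
3,Sarah,Sacramento,California,US
4,Anoop,Montgomery,Alabama,US
5,Marinda,Phoenix,Arizona,US
6,Maria,Sacramento,California,US
7,Jony,Phoenix,Arizona,US
8,Wilson,Montgomery,Alabama,US
9,Jina,Lincoln,Nebraska,US
10,James,Columbus,Ohio,US""".splitlines()

# The fourth comma-separated field (index 3) is the state; count users per state.
state_counts = Counter(line.split(",")[3] for line in rows)
print(dict(state_counts))
# Alabama and Arizona each have 3 users, California 2, Nebraska and Ohio 1 each.
```

This is the result our Hive (and later Pig) programs should reproduce.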

Hive Program

Once we are ready with the pre-requisites, we'll start writing the first Hive program to solve the above problem.

    1. Browse to <installed-dir>/hadoop_1.2.1 by running the following command:
cd <installed-dir>/hadoop_1.2.1
    2. Run dfs by running the following command:
<installed-dir>/hadoop_1.2.1$ bin/start-dfs.sh
    3. Create the users directory on HDFS by using the following command:
<installed-dir>/hadoop_1.2.1$ bin/hadoop fs -mkdir /users
    4. Put users.txt into the HDFS users directory from the local file system:
<installed-dir>/hadoop_1.2.1$ bin/hadoop fs
   -put <localdstpath>/users.txt users
    5. Start the Hive shell using the steps explained in the previous section.
    6. Run the following commands on the Hive shell to solve the problem. First, we'll create the users table in the Hive metastore to map data from users.txt.

Hive03
Figure 3: Creating the User table

The following command maps the users.txt data to the users table by loading the data from users.txt:
hive> load data inpath '/user/<user-name>/users/users.txt'
   OVERWRITE INTO TABLE users;

Now, the final command will give the desired output.

Hive04
Hive05
Figures 4 and 5: Output of the preceding code

The preceding output is the desired result: a state-wise user count, produced on the Hive shell by the Hive program.
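The query behind Figures 4 and 5 is not shown in the text; a state-wise user count in HiveQL would typically look like the following sketch:

```sql
-- Hedged reconstruction of the final query shown in the figures.
SELECT state, COUNT(*) AS user_count
FROM users
GROUP BY state;
```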

After getting the desired output, quit the Hive shell by using the following command:

hive> quit;

With the preceding set of steps and commands, we understand how Hive can be used to retrieve the data. In this example, the time taken is quite high; you can ignore that for now. The objective here was to show how to configure Hive and write a sequence of commands to retrieve the data, rather than to highlight performance.

Understanding Pig

Pig is a high-level data flow scripting language that abstracts the Hadoop system completely from users and uses existing code/libraries for complex and non-regular algorithms. Pig uses its own scripting language, known as PigLatin, to express data flows. PigLatin can be executed in two modes: (a) local mode and (b) distributed/MapReduce mode.

The Pig framework applies a series of transformations (specific to PigLatin constructs) on input data to produce the desired output. These transformations express data flows. Internally, Pig converts the transformations into MapReduce jobs, so the developer can focus mainly on data scripting instead of putting effort into writing a complex set of MR programs. The Pig framework runs on top of HDFS.
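As a flavor of what such a data flow looks like, here is a minimal, hedged PigLatin sketch (the file name and aliases are illustrative): each statement declares one transformation, and Pig compiles the whole chain into MapReduce jobs only when output is requested.

```pig
-- Illustrative data flow; nothing runs until DUMP/STORE is reached.
raw     = LOAD 'input.csv' USING PigStorage(',') AS (id:int, value:chararray);
trimmed = FILTER raw BY value IS NOT NULL;
grouped = GROUP trimmed BY value;
counts  = FOREACH grouped GENERATE group, COUNT(trimmed);
DUMP counts;   -- triggers compilation to MapReduce and execution
```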

Components in PIG Architecture

Figure 6 can help you understand the Pig sequence of operations.

Hive06
Figure 6: The Pig sequence of operations

After the preceding sequence of operations, Pig creates a job JAR that is submitted to the Hadoop cluster.

Pre-requisite to a Pig Program

To solve the same sample problem, certain things should be available and configured properly to get the desired output. I'll show you what tools should be installed and what configuration should be in place before you write your first Pig program.

Tools

  • Hadoop 1.2.1
  • Pig

Pig Installation Steps

    1. Get Pig from http://pig.apache.org.
    2. Untar or unzip the Pig folder and install.
    3. Create a new system variable named "PIG_INSTALL" and set PIG_INSTALL=<Installation-path>/pig-0.14.0.
    4. Add the PIG_INSTALL path to the existing system PATH variable: PATH=%PATH%;%PIG_INSTALL%/bin.
    5. There are two modes to run Pig; these can be configured in the pig.properties file available in the conf directory of the Pig installation.
      1. Local mode, using the following commands:
cd $PIG_INSTALL
./bin/pig -x local
    

This command will start the Grunt shell, where you can start writing PigLatin scripts:

grunt>

      2. Distributed/MapReduce mode: add the following entries to the pig.properties file:

fs.default.name=hdfs://localhost:9090 (value of port where hdfs is running)
mapred.job.tracker=localhost:8021 (value of port where MR job is running)

After adding the previous two entries, run the following commands to start Pig in distributed/MapReduce mode:

cd $PIG_INSTALL
./bin/pig -x mapreduce
    

The Pig Program

Once we are ready with the pre-requisites of Pig, we'll start writing the first Pig program to solve the preceding sample problem.

    1. Browse to <installed-dir>/hadoop_1.2.1 by running the following command:
cd <installed-dir>/hadoop_1.2.1
    2. Run dfs by using the following command:
<installed-dir>/hadoop_1.2.1$ bin/start-dfs.sh
    3. Create a users directory on HDFS by using the following command:
<installed-dir>/hadoop_1.2.1$ bin/hadoop fs -mkdir /users
    4. Put users.txt into the HDFS users directory from the local file system:
<installed-dir>/hadoop_1.2.1$ bin/hadoop
   fs -put <localdstpath>/users.txt users
    5. Start the Pig Grunt shell by using the steps explained in the previous section.
    6. Run the following commands on the Grunt shell to solve the problem.

Hive07
Figure 7: Running commands on Pig Grunt to solve the problem

The preceding statement creates a relation named users in Pig to map data from users.txt and populates the data, too.

Now, the final command will give the desired output, which groups the records by state:

Hive08
Figure 8: Grouping records by state

Hive09
Figure 9: Outputting the records

Hive10
Hive11
Figures 10 and 11: Viewing the final output
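Figures 7 through 11 are images; pieced together, the Grunt commands they show would plausibly be along these lines (the alias names are my assumption, and the path mirrors the one used in the Hive section):

```pig
-- Map users.txt into a relation with the five columns described earlier.
users = LOAD '/user/<user-name>/users/users.txt' USING PigStorage(',')
        AS (user_id:int, username:chararray, city:chararray,
            state:chararray, country:chararray);

-- Group by state and count the users in each group.
by_state = GROUP users BY state;
counts   = FOREACH by_state GENERATE group AS state, COUNT(users) AS user_count;

DUMP counts;  -- prints the state-wise user counts
```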

Now we understand how to solve the same problem using the different available Big Data tools and get the desired results.

Summary

In this article, we talked about two Big Data tools, Hive and Pig. These tools are useful in data analysis, and we discussed the different components of Hive and Pig.

The availability of different Big Data tools has provided an immense opportunity for developer communities to enter the data and analysis world.

We wrote sample Hive and Pig programs to solve the sample problem, to understand the end-to-end flow of Hive and Pig and their step-by-step execution.


About the Author

Anoop worked for Microsoft for almost six and a half years and has 12+ years of IT experience. Currently, he works as a DW\BI Architect at one of the top Fortune companies. He has worked on end-to-end delivery of enterprise-scale DW\BI projects. He has strong knowledge of database, data warehouse, and business intelligence application design and development, as well as Hadoop/Big Data. He has also worked extensively on SQL Server, designing ETLs using SSIS, SSAS, SSRS, and SQL Azure.

Disclaimer: I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.


Tags: Java, SQL, scripting languages, Pig, Hive, query language, HDFS, HiveQL, big data architecture, PigLatin, Grunt



