Using Hive in HDInsight to Analyze Data

Introduction

Azure HDInsight is the cloud-hosted version of Hadoop. HDInsight clusters can be provisioned on both Linux and Windows systems; in this article, we will focus on HDInsight on Windows, which runs a Hadoop cluster on Windows servers. Hadoop is used to store and analyze big data, and it provides a robust, low-cost data storage system that is compatible with Azure. Here, we will use Hive in HDInsight to analyze data.

Overview

An HDInsight cluster has one head node and a number of worker nodes. In HDInsight, data is stored in Azure Blob storage and accessed through the WASB (Windows Azure Storage Blob) driver. Every node also has a DFS (distributed file system) configured.

Figure 1: Hadoop clusters in HDInsight access and store big data in cost-effective, scalable, Hadoop-compatible Azure Blob storage in the cloud.

The file system on every node can be accessed by using the following URL:

hdfs://<namenodehost>/<path>

and the blob storage can be accessed by using the following URL:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
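For example, with the container 'hdhive' used later in this article and a hypothetical storage account named 'mystorageaccount', a blob uploaded as 'hiveCodeSample/data1' would be addressed as:

wasb://hdhive@mystorageaccount.blob.core.windows.net/hiveCodeSample/data1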

Hadoop includes a data warehouse system, named Hive, that allows you to query and analyze data. Hive provides a SQL-like query language named ‘HiveQL’. In this example, we will first create/upload some data to blob storage, then expose that blob storage data through Hive tables. Finally, you can use the table data in Excel.

Hive in HDInsight

First, you need an Azure subscription to create an HDInsight cluster. To access Azure by using PowerShell commands, install ‘Azure PowerShell’ on your system. After ‘Azure PowerShell’ is installed, you need to import your publish settings to connect to Azure from PowerShell. Navigate to https://manage.windowsazure.com/publishsettings/ to download the publish settings file. The file will have the following naming convention:

<subscription name>-<date>-credentials.publishsettings

Now, import this publish settings file into PowerShell by using the following command:

Import-AzurePublishSettingsFile <file path>
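For example (the file path below is hypothetical):

Import-AzurePublishSettingsFile "C:\Azure\MySubscription-credentials.publishsettings"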

To see Hive in action, we will first walk through the steps by using the Azure UI, and then achieve the same result by using PowerShell scripts.

Uploading a File to Blob Storage

In this case, I am using the “Azure Storage Explorer” tool. Use “Add Account” to establish a connection to the storage account and work with its blob containers.

Figure 2: Establishing a connection

Click “New” to create a new blob, as shown in Figure 3:

Figure 3: Creating a new blob

Specify a blob name, such as ‘hiveCodeSample/data1’. You can also enter some comma (,) or space ( ) delimited data in the ‘Blob text content’ area. Data can also be entered later, after saving the blob.

Double-click the blob name. This brings up a new popup screen; navigate to the ‘Content’ tab. In the ‘Select a content view’ dropdown, select ‘text’ and click ‘Edit’.

Figure 4: Selecting a text area

In the text area, add some delimited data, with multiple records on separate rows.
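For example, the following made-up, space-delimited log records have nine fields each, matching the nine-column table we will define later:

2015-01-05 10:12:01 INFO web01 GET /index.html 200 1532 0.120
2015-01-05 10:12:05 WARN web02 GET /missing.html 404 310 0.045
2015-01-05 10:12:09 ERROR web01 POST /api/login 500 87 1.310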

Running HiveQL Scripts

Up to this point, we have created blob data in Azure, spread across multiple files. In real-world scenarios, these files could be the output files of a MapReduce job.

Now, our next task is to read these files in Hive by using HiveQL scripts. To run Hive scripts, navigate to the HDInsight container as shown in Figure 5:

Figure 5: Navigating to the HDInsight container

Click ‘Query Console’, as shown on the screen; this brings up a new web site. On that web site, click ‘Hive Editor’, as shown in Figure 6:

Figure 6: Selecting ‘Hive Editor’

Enter a SQL-like command to create a table and read the data from blob storage. In this example, we created an External table; that means the table stores only the table definition (the column information), while the data itself remains in blob storage in its original format.
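The statement entered in the Hive Editor is the same query that the PowerShell script later in this article builds up:

CREATE EXTERNAL TABLE sampleCodeLogs
   (t1 string, t2 string, t3 string, t4 string, t5 string,
    t6 string, t7 string, t8 string, t9 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION 'wasb:///hiveCodeSample/';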

When creating a table, if the ‘EXTERNAL’ keyword is not used, Hive creates a normal (internal) table and the data is stored within the Hive data warehouse.
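For instance, omitting the keyword in a statement such as the following (hypothetical table name) creates an internal table; Hive then manages the data itself, and dropping the table also deletes the data:

CREATE TABLE sampleCodeLogsManaged (t1 string, t2 string)
STORED AS TEXTFILE;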

After executing the preceding command, you can use a normal SQL statement to retrieve the data:

SELECT * FROM sampleCodeLogs

Uploading Files to Blob Storage by Using PowerShell

So far, we have used UI tools to get the desired output. Let’s look at the PowerShell scripts to achieve the same result:

1. $subscriptionName = "subscription name"
2. $storageAccountName = "storage account name"
3. $containerName = "hdhive"
4. $fileName = "C:\sampleLog2.txt"
5. $blobName = "hiveCodeSample/data1"
6. Select-AzureSubscription $subscriptionName
7. $storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
8. $destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
9. Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext

Line 4 holds the path of the file that contains the data, and Line 5’s blob name can be any namespace you want to give. In Line 6, we select the subscription to work with, and then retrieve the storage account key in Line 7. Using the account name and the key, we get the storage context in Line 8, and in Line 9 we upload the file under the blob name mentioned previously.
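To verify the upload (an optional step added here, not part of the original walkthrough), you can list the blobs in the container with the same storage context:

# List all blobs in the container to confirm the upload succeeded
Get-AzureStorageBlob -Container $containerName -Context $destContext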

Executing a HiveQL Script with PowerShell

To execute the HiveQL script from a PowerShell script, run the following commands:

1. $containerName = "hdhive"
2. $hiveQL = "CREATE EXTERNAL TABLE sampleCodeLogs"
3. $hiveQL += " (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string, t8 string, t9 string)"
4. $hiveQL += " ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '"
5. $hiveQL += " STORED AS TEXTFILE LOCATION 'wasb:///hiveCodeSample/';"
6. $jobDef = New-AzureHDInsightHiveJobDefinition -Query $hiveQL
7. $hiveJob = Start-AzureHDInsightJob -Cluster $containerName -JobDefinition $jobDef

Line 1 stores the container name, which is also passed as the cluster name to Start-AzureHDInsightJob. Lines 2-5 build the script that creates the external table over the data in the storage location. Line 6 creates the job definition, and Line 7 starts the job.
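The job runs asynchronously, so the script returns immediately. As a small addition to the original script (a sketch using two further cmdlets from the same classic Azure PowerShell module), you can wait for the job to finish and print its output:

# Block until the Hive job completes (or the timeout expires)
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
# Fetch the job's standard output (the query results)
Get-AzureHDInsightJobOutput -Cluster $containerName -JobId $hiveJob.JobId -StandardOutput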

Once the data is available in Hive, it can be used in Excel for display and analysis. See Reference 3 to integrate with Excel.

Summary

In this article, we saw how to use Hive in HDInsight, both from the UI and through PowerShell scripts. External or internal tables created this way can be integrated with other BI tools for analysis and reporting. Because HDInsight is hosted on Azure, configuring a Hadoop cluster is simpler than a manual setup, and blob storage provides efficient, Hadoop-compatible data storage. Another advantage of using Hive with external tables is that, even if the data is stored in multiple files, it can be aggregated and viewed as a single entity.

References

  1. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-hive/
  2. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
  3. http://azure.microsoft.com/en-in/documentation/articles/hdinsight-connect-excel-hive-ODBC-driver/