October 24, 2014

Getting Started with Azure HDInsight

  • May 28, 2014
  • By Vipul Patel

Introduction

Data is everywhere; in fact, we now live in the era of “big data”, where computing systems must handle immense amounts of data to detect behavior patterns. Apache Hadoop is one framework designed to handle data at this scale. As more and more organizations need access to big data, cloud service providers have recognized the business need for cloud-based frameworks that can handle it.

That is why Microsoft provides Hadoop as one of its cloud computing offerings. To get the most out of Hadoop, developers need to be able to manage Hadoop clusters easily. To help with that, Microsoft offers HDInsight, a service that deploys and provisions Hadoop clusters in the cloud.

By making Hadoop available as a service, HDInsight helps manage, analyze and report on big data. Apache Hadoop uses the Hadoop Distributed File System (HDFS) to provide reliable data storage, and the MapReduce programming model to process and analyze the data in parallel.
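To make the MapReduce model concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code and does not run on a cluster; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. It only shows the three steps the model is built on: map each record to key/value pairs, group the values by key, and reduce each group to a result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Example usage with two short input lines.
counts = word_count(["big data", "big clusters"])
print(counts)
```

On a real cluster, the map and reduce calls run in parallel on different nodes, because each map call depends on only one record and each reduce call on only one key.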

HDInsight provides a simple, scalable and cost-efficient environment. Unlike classic Hadoop deployments, when HDInsight deploys a cluster it adds a second head node to increase the availability of the service.

The Hadoop/HDInsight ecosystem is visualized below.

Hadoop/HDInsight Ecosystem

How HDInsight Manages and Stores Data

HDInsight uses Azure Blob storage as the default file system. Hadoop clusters are optimized for running MapReduce computational tasks, and because the data lives in Blob storage rather than on the cluster nodes, a cluster can be dropped once its tasks are completed.
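As an illustration of how data in Blob storage is addressed from Hadoop, the sketch below builds a `wasb://` URI. The `wasb` scheme is not mentioned in this article; it is the scheme Hadoop's Azure storage support uses, and the helper name `wasb_uri` and the container and account names are hypothetical.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a Blob storage URI of the form
    wasb[s]://<container>@<account>.blob.core.windows.net/<path>.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip leading slashes so the path is not doubled after the host name.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Example usage with made-up container and storage account names.
print(wasb_uri("mycontainer", "myaccount", "/example/data/input.txt"))
```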

Hadoop jobs on HDInsight can be managed with Azure PowerShell.

How to Get Started with Using HDInsight

To begin using HDInsight, visit the Azure Management Portal at https://manage.windowsazure.com and sign in.

After you are signed in, you will be presented with the home page of your account.

Azure Management Portal Homepage

Click on the HDInsight link on the left.

HDInsight Link

Click on the link to “Create an HDInsight cluster”.

Create an HDInsight Cluster

Provide a cluster name, specify the password for the “admin” user, and click “Create HDInsight Cluster”.

Create HDInsight Cluster

Once you submit the information, the Hadoop cluster creation process begins, as shown below.

Hadoop Cluster Creation Process

Provisioning can take up to 10 minutes. Once it completes, the dashboard looks like the following.

Provisioning Complete

Click on the arrow next to the cluster name and you will be redirected to the HDInsight dashboard.

HDInsight Dashboard

Here we can track how our Hadoop cluster is performing. In this example, the cluster is using 24 of a possible 170 HDInsight cores.

After the HDInsight cluster has been provisioned, we can schedule our MapReduce jobs. A MapReduce job needs a MapReduce program (a .jar file) and, if applicable, input data.

Azure PowerShell can be used to run jobs.
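The two ingredients of a job described above, a .jar program and optional inputs, can be sketched as a small data structure. This is a hypothetical illustration in Python, not an Azure PowerShell or HDInsight API; the class name `MapReduceJob`, its fields, and the file names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MapReduceJob:
    """Hypothetical description of a job; not an HDInsight or Azure type."""

    jar_path: str                                     # the MapReduce program (.jar file)
    inputs: List[str] = field(default_factory=list)   # input paths, if applicable

    def validate(self) -> None:
        """Raise ValueError if the program is not a .jar file."""
        if not self.jar_path.lower().endswith(".jar"):
            raise ValueError(f"expected a .jar file, got {self.jar_path!r}")


# Example usage with made-up file names.
job = MapReduceJob(jar_path="wordcount.jar", inputs=["input.txt"])
job.validate()
print(job)
```

Keeping the inputs optional mirrors the "if applicable" in the text: some jobs generate their own data and take no input at all.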

Summary

In this article, we gave an overview of HDInsight.

About the author

Vipul Patel is a Program Manager currently working at Amazon Corporation. He formerly worked at Microsoft in the Lync team and in the .NET team (in the Base Class Libraries and the Debugging and Profiling team). He can be reached at vipul.patel@hotmail.com.


Tags: Hadoop, Cloud, Microsoft, Azure, HDInsight



