Getting Started with Azure HDInsight
Data is everywhere. We now live in the era of “big data”, where computing systems must handle immense amounts of data to detect behavior patterns. Apache Hadoop is one framework designed to handle data at that scale. As more organizations need access to big data, cloud service providers have recognized the business need for cloud-based frameworks that can handle it.
That is why Microsoft provides Hadoop as one of its cloud computing offerings. To get the best out of Hadoop, developers need to be able to manage Hadoop clusters easily. To help with that, Microsoft offers HDInsight, a service that provisions and deploys Hadoop clusters in the cloud.
HDInsight provides a simple, scalable and cost-efficient environment. Unlike classic Hadoop deployments, when HDInsight deploys a cluster it adds a second head node to increase the availability of the service.
The Hadoop/HDInsight ecosystem is visualized below.
How HDInsight Manages and Stores Data
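HDInsight uses Azure Blob storage as the default file system. Hadoop clusters are optimized for running MapReduce computational tasks, and because the data lives in Blob storage rather than on the cluster nodes, a cluster can be deleted once its tasks are completed without losing the data.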
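Because Blob storage acts as the file system, files are typically addressed by a storage URI rather than a local HDFS path. The following is a minimal sketch of how such a URI is composed, assuming the wasb:// scheme used by Hadoop's Azure storage driver. The helper name wasb_uri and the container, account, and path values are hypothetical and chosen only for illustration.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Compose a Blob storage URI of the form
    wasb[s]://<container>@<account>.blob.core.windows.net/<path>.

    All argument values are supplied by the caller; nothing is looked up in Azure.
    """
    scheme = "wasbs" if secure else "wasb"   # wasbs is the TLS-encrypted variant
    clean_path = path.lstrip("/")            # avoid a double slash after the host
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Hypothetical container, storage account, and file path.
    print(wasb_uri("mycontainer", "mystorageaccount", "example/data/input.txt"))
    # → wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data/input.txt
```

A URI built this way could be passed as the input or output location of a job, which is what lets the data outlive any single cluster.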
To manage Hadoop jobs, HDInsight uses Azure PowerShell.
How to Get Started with HDInsight
To begin using HDInsight, visit the Azure Management Portal at https://manage.windowsazure.com and sign in.
After you are signed in, you will be presented with the home page of your account.
Azure Management Portal Homepage
Click on the HDInsight link on the left.
Click on the link to “Create an HDInsight cluster”.
Create an HDInsight Cluster
Provide a cluster name, specify the password for the “admin” user, and click “Create HDInsight Cluster”.
Create HDInsight Cluster
Once you submit the information, the Hadoop cluster creation process begins, as shown below.
Hadoop Cluster Creation Process
Provisioning can take up to 10 minutes. Once it is complete, the dashboard looks as shown below.
Click on the arrow next to the cluster name and you will be redirected to the HDInsight dashboard.
Here we can track how our Hadoop cluster is performing. In this example, the cluster is using 24 of a possible 170 HDInsight cores.
After the HDInsight cluster has been provisioned, we can schedule our MapReduce jobs. A MapReduce job needs a MapReduce program (a .jar file) and, if applicable, input data.
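To make the idea of a MapReduce program more concrete, here is a small in-memory sketch of the map, shuffle, and reduce phases for a word count. It is an illustration only: a real HDInsight job would be packaged as a .jar file and run across the cluster, and the function names and sample lines below are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "clusters process data"]
    print(word_count(sample))
    # → {'big': 2, 'data': 2, 'needs': 1, 'clusters': 2, 'process': 1}
```

On a cluster, the map and reduce steps run in parallel on different nodes and the framework handles the shuffle; the sketch runs them sequentially to keep the data flow visible.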
Azure PowerShell can be used to run jobs.
In this article, we gave an overview of HDInsight.
About the author
Vipul Patel is a Program Manager currently working at Amazon Corporation. He formerly worked at Microsoft on the Lync team and on the .NET team (in the Base Class Libraries and the Debugging and Profiling teams). He can be reached at email@example.com.