Azure Data Lake is Microsoft's cloud offering for storage and analytics. In other words, it is a cloud-based data warehouse tool capable of analyzing both structured and unstructured data. In this article, we will discuss what Data Lake is and the new services included under the Data Lake umbrella.
What is Data Lake?
As defined above, it’s a cloud offering by Microsoft that is cost effective and scalable. Data Lake consists of three main components: HDInsight and two new services, Data Lake Store and Data Lake Analytics. Data Lake Analytics and HDInsight are grouped together as the analytics offerings. Data Lake Store and Data Lake Analytics are currently available in preview.
Data Lake integrates with Visual Studio, providing some nice UI features. One of them helps visualize how the code was executed by providing a 30-second playback, regardless of the actual time taken by the code. It displays the various queries executed in a job and the time taken by each of them. This helps in fine-tuning the specific query that is taking the most time.
It is cost effective because there are no upfront costs; charges are based on the resources provisioned. When an analysis is run, the cost is based on the number of nodes provisioned and the time taken to compute. The number of nodes used to execute the code (the degree of parallelism) is selected by the user. The charge depends on the nodes and the execution time, regardless of whether all the nodes were fully utilized. Visual Studio also provides detailed information on how the computation was performed and how efficiently the nodes were utilized. Based on the last execution details, the user can increase or decrease the number of nodes in the UI, which in turn recalculates the estimated time to complete the query. This helps in optimizing the cost by choosing the appropriate number of nodes for executing the query effectively.
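The billing rule described above can be sketched in a few lines of Python. This is only an illustration of the "nodes multiplied by time" idea: the function name and the rate of 2.0 per node-hour are hypothetical and are not actual Azure pricing.

```python
def estimate_job_cost(nodes: int, runtime_hours: float, rate_per_node_hour: float) -> float:
    """Estimate the charge for a job billed on provisioned nodes and elapsed time.

    The charge does not depend on how busy the nodes were, so idle
    nodes cost as much as fully utilized ones.
    """
    if nodes < 1 or runtime_hours < 0 or rate_per_node_hour < 0:
        raise ValueError("nodes must be >= 1; time and rate must be non-negative")
    return nodes * runtime_hours * rate_per_node_hour


# Hypothetical rate, for illustration only (not real Azure pricing).
RATE = 2.0

# Doubling the nodes rarely halves the runtime, so the faster run can cost more.
cost_10_nodes = estimate_job_cost(nodes=10, runtime_hours=0.5, rate_per_node_hour=RATE)
cost_20_nodes = estimate_job_cost(nodes=20, runtime_hours=0.375, rate_per_node_hour=RATE)
```

With these made-up numbers, the 20-node run finishes sooner but costs more than the 10-node run, which is the trade-off the node slider in the UI lets the user explore.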
Architecture and Advantages
As discussed earlier, Azure Data Lake comprises Data Lake Analytics, HDInsight, and Data Lake Store. It uses Apache YARN to distribute query execution. Communication with the Data Lake Store is done using WebHDFS, which provides REST APIs to retrieve data from the data store. The Data Lake Store supports both structured and unstructured data. Figure 1 shows the different components of the Data Lake.
Figure 1: Components of the Data Lake
Data Lake Store
The Data Lake Store is the storage layer that is accessible by HDInsight and Data Lake Analytics. Data is accessed through WebHDFS REST APIs. It provides petabyte-scale storage with no fixed limit. It distributes large files across multiple storage servers, which improves read performance when data is read in parallel. It integrates with Azure AD for authentication and uses the available Azure AD features, such as role-based access and multi-factor authentication.
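To make the WebHDFS access pattern concrete, here is a minimal Python sketch that builds a request URL. It assumes the store is exposed at a host of the form `<account>.azuredatalakestore.net` under the standard `/webhdfs/v1/` prefix; the account name `myaccount` and the helper name `webhdfs_url` are hypothetical. A real request would also need an Azure AD bearer token, which is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str = "OPEN", **params: str) -> str:
    """Build a WebHDFS-style REST URL for a file or folder in the store.

    Assumption: the host pattern below is how the store is addressed;
    adjust it if your endpoint differs.
    """
    host = f"{account}.azuredatalakestore.net"
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and escapes characters such as spaces.
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# OPEN reads a file; LISTSTATUS lists a folder (both are standard WebHDFS operations).
read_url = webhdfs_url("myaccount", "/TestData/SalesTrans.txt")
list_url = webhdfs_url("myaccount", "/TestData", op="LISTSTATUS")
```

Keeping URL construction in one small function makes it easy to swap the host pattern without touching the calling code.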
The UI is very easy to use. After creating the Data Lake account, click “Data Explorer” to navigate to the UI, where a user can create folders and upload files. Once the files are uploaded, it provides a preview of the data in the file and the URL of the file. This URL can then be used to perform the analytics.
Data Lake Analytics
With Data Lake Analytics, Microsoft introduces a new language, U-SQL. Data Lake Analytics is still in preview. It supports both Azure Storage and the new Data Lake Store. Security is managed through Azure AD using role-based access. A new analytics job can be created in the Azure portal by using the UI. While creating the job, you can also specify the parallelism value, which determines the number of compute nodes spun up to execute the query.
U-SQL is a new Microsoft big data query language that uses a mix of SQL and C#. The WHERE conditions in U-SQL follow C# syntax; for example, equality is written a == b, not a = b as in T-SQL. U-SQL types also follow C#: object (reference) types such as string are implicitly nullable, whereas non-object (value) types such as int are not nullable unless declared with a trailing ‘?’.
The U-SQL query has three sections:
- Get data from the source. This can be a file, a U-SQL table, or another data source such as an Azure SQL database.
- Apply transformations to the rows retrieved.
- Save the result to a file or a U-SQL table.
    DECLARE @in string = "/TestData/SalesTrans.txt";
    DECLARE @out string = "/result/results.tsv";

    @salesTrans =
        EXTRACT SalesId int,
                Region string,
                Units int?,
                EmployeeId int // type missing in the original listing; int is assumed
        FROM @in
        USING Extractors.Tsv();

    OUTPUT @salesTrans
    TO @out
    USING Outputters.Tsv();
The DECLARE statements are similar to T-SQL, except that U-SQL keywords are case sensitive and must be written in upper case. The EXTRACT clause reads the data from the file. The schema marks a column as nullable by using ‘?’, as in C#. This example applies no transformation, so the OUTPUT statement simply copies the extracted rows to the output file.
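For readers more familiar with general-purpose languages, the same extract, transform, and output shape can be mimicked in plain Python. This is an analogy only, not how Data Lake Analytics executes a job. The filter on region and non-null units is a hypothetical transformation added for illustration, and EmployeeId is kept as text here because the listing above does not state its type.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class SalesTran(NamedTuple):
    sales_id: int
    region: str
    units: Optional[int]  # mirrors the nullable "Units int?" column
    employee_id: str      # type not stated in the source; kept as text


def extract(tsv_text: str) -> Iterator[SalesTran]:
    """Step 1: read tab-separated rows and apply the schema."""
    for sales_id, region, units, employee_id in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        yield SalesTran(int(sales_id), region, int(units) if units else None, employee_id)


def transform(rows: Iterable[SalesTran], region: str) -> Iterator[SalesTran]:
    """Step 2 (hypothetical): keep one region and drop rows with null units."""
    return (r for r in rows if r.region == region and r.units is not None)


def output(rows: Iterable[SalesTran]) -> str:
    """Step 3: write the remaining rows back out as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


sample = "1\tEast\t5\tE1\n2\tWest\t\tE2\n3\tEast\t\tE3\n4\tEast\t7\tE4\n"
result = output(transform(extract(sample), region="East"))
```

Note the `==` comparison and the explicit null check in `transform`; a U-SQL WHERE clause would express the same conditions in C# syntax.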
The best part of the Data Lake is its UI features, such as the playback video that helps identify how a query performs. The interactive UI also helps scale the parallelism value up or down, which changes the number of nodes used for execution and therefore the cost. Because Data Lake Analytics and Data Lake Store are still in preview, we will have to see how they mature as products.