A Comparison Between Amazon Redshift and Azure Data Lake

Introduction

This article compares Amazon Redshift and Azure SQL Data Warehouse. Redshift is Amazon's cloud data warehouse offering, and Azure SQL Data Warehouse is Microsoft's. We will look at the features of each in detail.

Description

Redshift and SQL Data Warehouse both support petabyte-scale systems. Both have a leader (or master) node and compute nodes. The leader node is responsible for distributing work to the compute nodes and aggregating the data they return. Both use columnar storage to enable parallel processing.
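
The leader and compute node split described above can be sketched in a few lines of Python. This is a toy model only, not how either product is implemented: the function names are hypothetical, threads stand in for compute nodes, and rows are distributed round-robin as an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_sum(rows):
    """Work done on one compute node: aggregate only its local slice."""
    return sum(rows)


def leader_sum(all_rows, node_count):
    """Leader node: split the rows, fan out to compute nodes, combine results."""
    # Round-robin distribution of rows across the compute nodes (an assumption).
    slices = [all_rows[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_sum, slices))
    # The final aggregation of the partial results happens on the leader.
    return sum(partials)


total = leader_sum(list(range(1, 101)), node_count=4)
print(total)
```

Each compute node sees only its own slice of the data, and the leader only ever combines the small partial results, which is the reason this layout scales out.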

Scalability

In Redshift, when the cluster is modified through the management console or an API, the changes are applied immediately. While the new cluster is being provisioned, the current cluster is available in read-only mode; in other words, during this process the data is available only for read operations. After the new cluster is provisioned, the data is copied over. This operation can take from a few hours to days, depending on the amount of data in each node. Storage and compute are bundled together in a node definition; for example, a 160GB DC1.Large or a 16TB DS2.8XLarge node.

In SQL Data Warehouse, clusters can be scaled in minutes. Compute and storage can be scaled out independently. SQL Data Warehouse also supports pausing compute. No compute cost is charged while the compute nodes are paused; only storage is billed.

Indexes

Azure SQL Data Warehouse supports familiar SQL concepts such as indexes, stored procedures, and user-defined functions.

Redshift supports two kinds of sort keys: compound and interleaved. A compound sort key is a combination of multiple columns: one primary column and one or more secondary columns. A compound sort key helps with joins and WHERE conditions; however, performance drops when a query filters only on the secondary columns and not on the primary column. Compound is the default sort type. In an interleaved sort, each column is given equal weight. Both compound and interleaved sort keys require a re-index to keep query performance high.
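
To see why a compound sort key favors its primary column, consider a simplified model in which sorted rows are stored in fixed-size blocks and each block records the minimum and maximum value of every column. The sketch below is a toy illustration under those assumptions, not Redshift's storage engine; the block size, column meanings, and function names are all hypothetical.

```python
from itertools import product

BLOCK_SIZE = 100  # rows per block (illustrative value)


def build_blocks(rows, block_size=BLOCK_SIZE):
    """Split already-sorted rows into blocks and record (min, max) per column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        blocks.append([(min(col), max(col)) for col in zip(*chunk)])
    return blocks


def blocks_scanned(blocks, column, value):
    """Count blocks whose (min, max) range for `column` could contain `value`."""
    return sum(1 for zone in blocks if zone[column][0] <= value <= zone[column][1])


# 10 regions x 100 days = 1000 rows, sorted by the compound key (region, day).
rows = sorted(product(range(10), range(100)))
blocks = build_blocks(rows)

primary_scan = blocks_scanned(blocks, column=0, value=3)     # filter on region
secondary_scan = blocks_scanned(blocks, column=1, value=42)  # filter on day only
print(primary_scan, secondary_scan)
```

In this model, a filter on the primary column can skip every block but one, while a filter on the secondary column alone has to read all of them, because each block spans the full range of days.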

Data Sources

Data can be loaded into Redshift from Amazon S3 storage, Amazon EMR (Elastic MapReduce), the NoSQL data store DynamoDB, or remote hosts over SSH. To integrate an on-premises database with Redshift, export the data from the database to a file and upload the file to S3. The file can then be loaded into Redshift. Re-indexing is required afterward to get better query performance.
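
As a rough sketch of the final load step, the helper below builds a Redshift COPY statement for a CSV file that has already been uploaded to S3. The table name, bucket path, and IAM role ARN are placeholders invented for illustration, and the helper does no escaping, so it should only be fed trusted values.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads a CSV file from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"  # assumes the exported file has one header row
    )


# All three values below are hypothetical placeholders.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/exports/sales.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(copy_sql)
```

The resulting statement would be run against the cluster with any SQL client connected to the leader node.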

Azure SQL Data Warehouse is integrated with Azure Blob storage. It uses an approach similar to Redshift's to import data from SQL Server. The SQL Server data is exported to a text file and then copied to Azure Blob storage. Once the file is in Azure Blob storage, it can be imported into SQL Data Warehouse by using the PolyBase ‘CREATE EXTERNAL TABLE’ command, followed by the ‘CREATE TABLE…AS SELECT’ command. Once the data is imported, re-create the indexes and statistics (for example, with ‘CREATE STATISTICS’) to get optimal query performance.
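
The three steps above (external table, CREATE TABLE AS SELECT, statistics) can be sketched as a small statement builder. Everything named here is an assumption for illustration: the `ext` and `dbo` schemas, the round-robin distribution, and the external data source and file format objects, which are assumed to have been created beforehand.

```python
def polybase_load_statements(table, columns, blob_folder,
                             data_source, file_format, stats_column):
    """Return the T-SQL statements for a PolyBase load, in execution order."""
    column_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    # Step 1: external table that points at the files in Azure Blob storage.
    external = (
        f"CREATE EXTERNAL TABLE ext.{table} (\n    {column_defs}\n)\n"
        f"WITH (LOCATION = '{blob_folder}', "
        f"DATA_SOURCE = {data_source}, FILE_FORMAT = {file_format});"
    )
    # Step 2: CTAS copies the external data into a regular warehouse table.
    ctas = (
        f"CREATE TABLE dbo.{table}\n"
        "WITH (DISTRIBUTION = ROUND_ROBIN)\n"
        f"AS SELECT * FROM ext.{table};"
    )
    # Step 3: statistics on a column the queries are expected to filter on.
    stats = (
        f"CREATE STATISTICS stat_{table}_{stats_column} "
        f"ON dbo.{table} ({stats_column});"
    )
    return [external, ctas, stats]


# Hypothetical table, folder, and object names.
statements = polybase_load_statements(
    table="sales",
    columns=[("sale_id", "INT"), ("sale_date", "DATE")],
    blob_folder="/exports/sales/",
    data_source="ExampleBlobSource",
    file_format="ExampleCsvFormat",
    stats_column="sale_date",
)
print("\n\n".join(statements))
```

Keeping the statements in a list preserves the order in which they have to run: the CTAS step depends on the external table, and the statistics depend on the loaded table.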

Backup and Security

Redshift can be hosted in a virtual private cloud. Data can be encrypted both at rest and in transit to client tools. Over the wire, data is encrypted using SSL; at rest, every block is encrypted using hardware-accelerated AES-256. The compute nodes cannot be accessed directly; they can be accessed only through the leader node. The data is backed up automatically as snapshots in S3 storage, which are retained for one day by default. The retention period can be extended up to 35 days.

Azure SQL Data Warehouse currently doesn’t support hosting on a virtual network. It does support encryption at rest and over the wire. Over the wire, data is encrypted by using SSL; at rest, it is encrypted by using the ‘ALTER DATABASE <> SET ENCRYPTION ON’ command. The data is backed up automatically to Azure Blob storage as snapshots every eight hours, and each backup is retained for seven days.

Client BI Tools

Redshift integrates with many popular BI tools, such as Tableau. It also allows connections through JDBC and ODBC drivers.

Azure SQL Data Warehouse also supports integration with popular BI tools such as Tableau and Power BI.

Summary

Both Redshift and Azure SQL Data Warehouse look promising. Azure SQL Data Warehouse leads in some areas, such as scalability and the decoupling of storage from compute. On the other hand, Redshift leads in security because it can be hosted in a VPC.
