A Comparison Between Amazon Redshift and Azure Data Lake
In this article, we will do a comparison study of Amazon Redshift and Azure SQL Data Warehouse. Redshift is a data warehouse offering in the cloud offered by Amazon and Azure SQL Data Warehouse is a data warehouse offering in the cloud offered by Microsoft. We will analyze the features offered by both in detail.
Redshift and SQL Data Warehouse both support petabyte scale systems. Both of them have leader or master nodes and compute nodes. A leader node is responsible for distributing the work to the compute nodes and aggregate the data returned by the compute nodes. Both of them use columnar storage to enable parallel processing.
In SQL Data Warehouse, the scaling of the clusters can happen in minutes. The scale out can be done for compute and storage units independently. SQL Data Warehouse also supports pausing a compute operation. There is no cost applied when the compute nodes are in pause state; only a storage cost is charged.
Azure SQL Data Warehouse supports all the SQL concepts, such as indexes, stored procedures, and user defined functions.
Redshift supports two kinds of sort keys: compound and interleaved. A compound sort key a combination of multiple columns, one primary column and one or more secondary columns. A compound sort key helps with joins and where conditions; however, the performance drops when the query is only on secondary columns without the primary column. A compound sort key is the default sort type. In interleaved an sort, each column is given an equal weight. Both compound and interleaved require a re-index to keep the query performance level high.
Data can be integrated with Redshift from Amazon S3 storage, elastic map reduce, No SQL data source DynamoDB, or SSH. If there is an on-premises database to be integrated with Redshift, export the data from the database to a file and then import the file to S3. This file can now be integrated with Redshift. Re-indexing is required to get a better query performance.
Azure SQL Data Warehouse is integrated with Azure Blob storage. It uses a similar approach to as Redshift to import the data from SQL server. The SQL server data is exported to a text file and then copied across to Azure Blob storage. Once the file is in Azure blob storage, it can be imported to Data Warehouse using the Polybase create 'CREATE EXTERNAL TABLE' command, followed by the 'CREATE TABLE...AS SELECT' command. Once the data is imported, re-create the indexes; in other words, use 'create statistics' to get the optimal query performance.
Backup and Security
Redshift can be hosted in a virtual private cloud. The data can be encrypted at rest and also when accessed from the client tools. Data is encrypted over the wire using SSL and, at rest, every block is encrypted using hardware-accelerated AES-256. The compute nodes cannot be accessed directly. they can be accessed only through the leader node. The data is backed up as snapshots in S3 storage automatically and by default they are retained for a day. The retention can be extended up to 35 days.
Azure SQL Data Warehouse currently doesn't support hosting on a virtual network. It also supports encryption at rest and over the wire. Over the wire, it can be encrypted by using SSL and at rest it can be encrypted by using the 'ALTER DATABASE <> SET ENCRYPTION ON' command. The data is backed up automatically to Azure blob storage as snapshots every eight hours and the backup is retained for seven days.
Client BI Tools
Redshift integrates with many popular BI tools, like Tableau. In addition, it also allows connecting using JDBC and ODBC drivers.
Azure SQL Data Warehouse also supports integration with popular BI tools such as Tableau and Power BI.
Both Redshift and Azure Data Warehouse look promising. Azure SQL Data Warehouse leads in some areas, such as the scalability and decoupling the store from compute. On the other hand, Redshift leads in security by enabling it to be hosted in a VPC.