Amazon RedShift is a data warehouse offering in the cloud by Amazon AWS. It’s offered under the umbrella of AWS database services. Amazon RedShift is a relational, fast, and petabyte scale data warehouse. It’s a low cost, high performant data warehouse that integrates with most of the available BI tools. In this article, we will discuss the RedShift architecture and its benefits.
What is Amazon RedShift? As defined earlier, it’s a data warehouse offering in the cloud which is cost effective, scalable, fully managed and highly performant. It is cost effective because there are no upfront costs and is charged based on the resources provisioned.
It is highly scalable and easy to configure with the help of an UI interface or scripts (APIs). When the new servers are being provisioned, the existing cluster is marked as read only. Only read operations can be performed on the existing cluster. After the provisioning of new cluster is complete, the data is copied over to the new servers and the DNS is switched to the newly provisioned servers. The new servers are now available for read/write operations.
It’s a fully managed data warehouse in that system provisioning, OS upgrade, and software patches are all done by Amazon RedShift. It constantly monitors the health of the system and replaces failed nodes when detected. All the data is backed up to the Amazon S3 storage.
The data is secured over the wire by enabling SSL for communication and the data stored on disk can be secured by enabling encryption. When data is enabled for encryption, the data in the nodes and the backed up data both are encrypted.
Architecture & Advantages
The main component of Amazon RedShift is the cluster. One or more compute nodes form a cluster. If a cluster has more than one compute nodes, a ‘leader node’ also is provisioned.
Figure 1: The Amazon RedShift cluster
If a leader node exists, it’s the leader node that communicates with the external systems. The client BI tools interact only with the Leader node. The leader node serves as an endpoint to which the client communicates. The leader node generates the query plan when it receives a query request. It then distributes the request among the compute nodes and co-ordinates the query execution.
Each compute node has its own CPU, memory, and disk. They execute the query assigned to them by the Leader node and send the result back to the Leader node. The Leader node consolidates the results from all the compute nodes. The query is distributed and executed in parallel in the compute nodes. Each compute node is further divided into slices that have some portion of the memory and the CPU allocated. The number of slices in a node is generally determined by the number of CPUs allotted to the node. For example, if a dense compute node dc1.large has 2 CPUs, the slice per node is 2. Similarly, if dc1.8xlarge has 32 CPUs, the slices per node is 32.
RedShift stores columnar data instead of row data. The column data is stored in blocks with min and max values. In a table, we define sort keys to enable faster retrieval of data. There are multiple types to sort keys. For example, say there is a table named employee without any order. The employee number is not sorted; when a query operation is performed, it has to scan the entire table. Let’s assume the Employee table is now sorted on Employee number. Now, RedShift arranges the Employee number in blocks. One block in RedShift is 1MB in size. The employee numbers are stored in a block, like 1..100 in block 1, 101..200 in block 2. Every block has a min and max value defined. There is no overlapping of data between blocks. When there is a query request to search an employee with employee number 150, the query engine can skip all other blocks and search only in block 2. This enhances the query performance.
The sort keys are of two types: compound or an interleaved sort key. A compound sort key is a combination of multiple columns, one primary column and also secondary columns. A compound sort key helps with joins and where conditions; however, the performance drops when the query is only on secondary columns without the primary column. A compound sort key is the default sort type.
An interleaved sort key is helpful when multiple queries use different columns as filters (in a where condition) against the same data. An interleaved sort key is beneficial for large tables. In an interleaved sort, each column is given equal weight. Both compound and interleaved require a re-index to keep the query performance level high.
Import Data to Amazon RedShift
To easiest way to import data from SQL Server to RedShift is to first create the required tables in RedShift. Extract the data from SQL in a text file with some delimiter and upload to the AWS S3 storage. Once the data is stored in S3, use the copy command to import the data in Redshift. The syntax is:
copy <table name> from <S3 file path> credentials 'aws_access_key_id=<key id>; aws_secret_access_key=<access key>' delimiter '<Delimiter Character>'
Redshift with Client BI Tools
Tableau has an option to connect to RedShift directly from its left hand pane. After keying in the connection information—in other words, the server name, database name, and the credentials—it connects to the RedShift live data. The dashboards can be generated by using the RedShift table schema and published to the Tableau server. The published report can be viewed from different browsers.
In this article, we saw the advantages of using RedShift; these include scalability, performance, cost, and its integration with the client BI tools. It’s a completely managed solution where the application patches and OS upgrade are all managed by Amazon. We also saw how the architecture of Leader nodes and compute nodes help in faster computing. The advantages of interleaved key and possible scenarios is to configure them.