SRV405: Deep Dive on Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Tony Gibbs, Data Warehousing Solutions Architect

April 19, 2017

Deep Dive on Amazon Redshift Query Lifecycle and Storage Subsystem

Deep Dive Overview

• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• Storage deep dive
• Query lifecycle
• New & upcoming features
• Open Q&A

Amazon Redshift History & Development

[Diagram: Amazon Redshift combines PostgreSQL with columnar storage, MPP, and OLAP, and builds on AWS Identity and Access Management (IAM), Amazon VPC, Amazon SWF, Amazon S3, AWS KMS, Amazon Route 53, Amazon CloudWatch, and Amazon EC2.]

February 2013 to April 2017:
• > 100 Significant Patches
• > 150 Significant Features

Amazon Redshift Cluster Architecture

Massively parallel, shared nothing

Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing

Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore

[Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC. The leader node and the compute nodes (each labeled 128GB RAM, 16TB disk, 16 cores) are connected by 10 GigE (HPC). Ingestion, backup, and restore run between the compute nodes and S3 / EMR / DynamoDB / SSH.]

Compute & Leader Node Components

[Diagram: one leader node and three compute nodes, each labeled 128GB RAM, 16TB disk, 16 cores.]

Leader Node
• Parser & rewriter
• Planner & optimizer
• Code generator
  • Input: optimized plan
  • Output: >=1 C++ functions
• Compiler
• Task scheduler
• WLM
  • Admission
  • Scheduling
• PostgreSQL catalog tables

Compute Node
• Query execution processes
• Backup & restore processes
• Replication processes
• Local Storage
  • Disks
  • Slices
  • Tables
    • Columns
    • Blocks

Concepts & Terminology

Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
);

aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14

• Accessing dt with row storage:
  o Need to read everything
  o Unnecessary I/O
• Accessing dt with columnar storage:
  o Only scan blocks for relevant column

Designed for I/O Reduction: Data Compression

CREATE TABLE deep_dive (
   aid INT     ENCODE LZO
  ,loc CHAR(3) ENCODE BYTEDICT
  ,dt  DATE    ENCODE RUNLENGTH
);

• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O

Designed for I/O Reduction: Zone Maps

• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which cannot contain data for a given query
• Eliminates unnecessary I/O

Zone Maps

SELECT COUNT(*) FROM deep_dive WHERE dt = '09-JUNE-2013'

Per-block zone maps (MIN to MAX) for dt:

Block  Unsorted Table                 Sorted By Date
1      01-JUNE-2013 to 20-JUNE-2013   01-JUNE-2013 to 06-JUNE-2013
2      08-JUNE-2013 to 30-JUNE-2013   07-JUNE-2013 to 12-JUNE-2013
3      12-JUNE-2013 to 20-JUNE-2013   13-JUNE-2013 to 18-JUNE-2013
4      02-JUNE-2013 to 25-JUNE-2013   19-JUNE-2013 to 24-JUNE-2013
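The pruning effect of the zone maps on this slide can be checked with a short sketch. It is a simplified model, not Redshift's implementation: the MIN/MAX pairs are copied from the slide, and the helper name blocks_to_scan is hypothetical.

```python
from datetime import date

# Per-block (MIN, MAX) zone maps for the dt column, copied from the slide.
UNSORTED = [
    (date(2013, 6, 1), date(2013, 6, 20)),
    (date(2013, 6, 8), date(2013, 6, 30)),
    (date(2013, 6, 12), date(2013, 6, 20)),
    (date(2013, 6, 2), date(2013, 6, 25)),
]
SORTED_BY_DATE = [
    (date(2013, 6, 1), date(2013, 6, 6)),
    (date(2013, 6, 7), date(2013, 6, 12)),
    (date(2013, 6, 13), date(2013, 6, 18)),
    (date(2013, 6, 19), date(2013, 6, 24)),
]


def blocks_to_scan(zone_maps, value):
    """Return indexes of blocks whose [MIN, MAX] range could contain value."""
    return [i for i, (lo, hi) in enumerate(zone_maps) if lo <= value <= hi]


target = date(2013, 6, 9)  # WHERE dt = '09-JUNE-2013'
print(blocks_to_scan(UNSORTED, target))        # -> [0, 1, 3]
print(blocks_to_scan(SORTED_BY_DATE, target))  # -> [1]
```

With the unsorted table, three of the four blocks have to be read; once the data is sorted by date, only one block can contain the value.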

Terminology and Concepts: Data Sorting

• Goals:
  • Physically order rows of table data based on certain column(s)
  • Optimize effectiveness of zone maps
  • Enable MERGE JOIN operations
• Impact:
  • Enables range-restricted scans (rrscans) to prune blocks by leveraging zone maps
  • Overall reduction in block I/O
• Achieved with the table property SORTKEY defined over one or more columns
• Optimal SORTKEY is dependent on:
  • Query patterns
  • Data profile
  • Business requirements

Terminology and Concepts: Slices

A slice can be thought of like a “virtual compute node”
• Unit of data partitioning
• Parallel query processing

Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data

Data Distribution

• Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster:
  • KEY: Value is hashed, same value goes to same location (slice)
  • ALL: Full table data goes to first slice of every node
  • EVEN: Round robin
• Goals:
  • Distribute data evenly for parallel processing
  • Minimize data movement during query processing
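The three distribution styles can be modeled in a few lines. This is a minimal sketch under stated assumptions: a 2-node cluster with 2 slices per node, the four sample rows from the deep_dive table, and zlib.crc32 as a stand-in hash (the deck does not say which hash function Redshift uses). The function name distribute is hypothetical.

```python
import zlib
from itertools import cycle

NODES = 2
SLICES_PER_NODE = 2
NUM_SLICES = NODES * SLICES_PER_NODE

ROWS = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]


def distribute(rows, style, key_index=None):
    """Return {slice_id: [rows]} for DISTSTYLE EVEN, KEY, or ALL."""
    slices = {s: [] for s in range(NUM_SLICES)}
    if style == "EVEN":
        # Round robin across all slices.
        for row, s in zip(rows, cycle(range(NUM_SLICES))):
            slices[s].append(row)
    elif style == "KEY":
        # Hash the key value; equal values always map to the same slice.
        # crc32 is a stand-in, not Redshift's real hash function.
        for row in rows:
            h = zlib.crc32(str(row[key_index]).encode("utf-8"))
            slices[h % NUM_SLICES].append(row)
    elif style == "ALL":
        # Full copy of the table on the first slice of every node.
        for node in range(NODES):
            slices[node * SLICES_PER_NODE].extend(rows)
    else:
        raise ValueError(f"unknown distribution style: {style}")
    return slices


for style, key in (("EVEN", None), ("KEY", 1), ("ALL", None)):
    placement = distribute(ROWS, style, key_index=key)
    print(style, {s: len(r) for s, r in placement.items()})
```

Which slice a given key lands on depends on the hash, so only the property "same value, same slice" carries over to a real cluster.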

[Diagram: KEY, ALL, and EVEN distribution, each shown across Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4).]

Data Distribution: Example

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
) DISTSTYLE (EVEN|KEY|ALL);

[Diagram: compute node CN1 holds Slice 0 and Slice 1; CN2 holds Slice 2 and Slice 3. Table deep_dive has user columns aid, loc, dt and system columns ins, del, row.]

Data Distribution: EVEN Example

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
) DISTSTYLE EVEN;

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

[Diagram: CN1 (Slice 0, Slice 1) and CN2 (Slice 2, Slice 3). On each slice, table deep_dive has user columns aid, loc, dt and system columns ins, del, row, with a per-slice row counter starting at Rows: 0.]

(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)


Data Distribution: KEY Example #1

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
) DISTSTYLE KEY DISTKEY (loc);

[Diagram: CN1 (Slice 0, Slice 1) and CN2 (Slice 2, Slice 3) with per-slice row counters; rows that share a loc value are placed on the same slice.]

Data Distribution: KEY Example #2

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
) DISTSTYLE KEY DISTKEY (aid);

[Diagram: CN1 (Slice 0, Slice 1) and CN2 (Slice 2, Slice 3).]

Data Distribution: ALL Example

CREATE TABLE deep_dive (
   aid INT      --audience_id
  ,loc CHAR(3)  --location
  ,dt  DATE     --date
) DISTSTYLE ALL;

[Diagram: CN1 (Slice 0, Slice 1) and CN2 (Slice 2, Slice 3). The row counter on the first slice of each node counts up from Rows: 0 to Rows: 4; the other slices stay at Rows: 0.]

(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
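The block counts quoted in the EVEN and ALL examples follow from one rule implied by the slide's formula: every column, user or system, needs at least one 1 MB block on each slice that holds rows. A small sketch of that arithmetic, with the hypothetical helper minimum_blocks:

```python
BLOCK_SIZE_MB = 1  # column data is persisted to 1 MB blocks


def minimum_blocks(user_columns, system_columns, populated_slices):
    """At least one block per column on every slice that holds rows."""
    return (user_columns + system_columns) * populated_slices


even_blocks = minimum_blocks(3, 3, populated_slices=4)  # DISTSTYLE EVEN
all_blocks = minimum_blocks(3, 3, populated_slices=2)   # DISTSTYLE ALL

print(even_blocks, "blocks,", even_blocks * BLOCK_SIZE_MB, "MB")  # -> 24 blocks, 24 MB
print(all_blocks, "blocks,", all_blocks * BLOCK_SIZE_MB, "MB")    # -> 12 blocks, 12 MB
```

The practical consequence is that even a four-row table occupies tens of megabytes, so very small tables have a fixed minimum footprint that grows with slice count.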

Terminology and Concepts: Data Distribution

KEY
• The key creates an even distribution of data
• Joins are performed between large fact/dimension tables
• Optimizing merge joins and group by

ALL
• Small and medium size dimension tables (< 2-3M)

EVEN
• When key cannot produce an even distribution

Storage Deep Dive

Storage Deep Dive: Disks

• Amazon Redshift utilizes locally attached storage devices
• Compute nodes have 2.5-3x the advertised storage capacity
• 1, 3, 8, or 24 disks depending on node type
• Each disk is split into two partitions
  • Local data storage, accessed by local CN
  • Mirrored data, accessed by remote CN
• Partitions are raw devices
• Local storage devices are ephemeral in nature
• Tolerant to multiple disk failures on a single node

Storage Deep Dive: Blocks

• Column data is persisted to 1 MB immutable blocks
• Each block contains in-memory metadata:
  • Zone Maps (MIN/MAX value)
  • Location of previous/next block
• Blocks are individually compressed with 1 of 11 encodings
• A full block contains between 16 and 8.4 million values
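One reading that reproduces both ends of that range is simple capacity arithmetic on a 1 MB block. The value widths below are assumptions chosen to match the slide's numbers (64 KB for the widest value, 1 bit for the most compressed value); the deck itself does not state them.

```python
BLOCK_BYTES = 1024 * 1024  # 1 MB block


def values_per_block(bits_per_value):
    """How many fixed-width values of the given size fit in one full block."""
    return (BLOCK_BYTES * 8) // bits_per_value


widest = values_per_block(64 * 1024 * 8)  # assumed 64 KB per value
narrowest = values_per_block(1)           # assumed 1 bit per value

print(widest)     # -> 16
print(narrowest)  # -> 8388608, about 8.4 million
```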

Storage Deep Dive: Columns

• Column: Logical structure accessible via SQL
• Physical structure is a doubly linked list of blocks
• These block chains exist on each slice for each column
• All sorted & unsorted block chains compose a column
• Column properties include:
  • Distribution Key
  • Sort Key
  • Compression Encoding
• Columns shrink and grow independently, 1 block at a time
• Three system columns per table, per slice, for MVCC

Block Properties: Design Considerations

• Small writes:
  • Batch processing system, optimized for processing massive data
  • Blocks are 1 MB and immutable, so blocks are cloned on write to avoid fragmentation
  • Small write (~1-10 rows) has similar cost to a larger write (~100 K rows)
• UPDATE and DELETE:
  • Immutable blocks mean that rows are only logically deleted on UPDATE or DELETE
  • Must VACUUM or DEEP COPY to remove ghost rows from table
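The logical-delete behavior described above can be illustrated with a toy model. Everything here is hypothetical (class and field names, and modeling UPDATE as a logical delete followed by an insert); it only shows why ghost rows pile up until a vacuum-style rewrite, and does not reflect Redshift's storage code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    value: str
    deleted_by: Optional[int] = None  # id of the transaction that deleted the row


class ToyTable:
    """Toy model: rows are only marked deleted until vacuum() rewrites storage."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def insert(self, value: str) -> None:
        self.rows.append(Row(value))

    def delete(self, value: str, txn: int) -> None:
        # Logical delete: mark the row, keep it in storage.
        for row in self.rows:
            if row.value == value and row.deleted_by is None:
                row.deleted_by = txn

    def update(self, old: str, new: str, txn: int) -> None:
        # Modeled as a logical delete of the old row plus an insert of the new one.
        self.delete(old, txn)
        self.insert(new)

    def visible(self) -> List[str]:
        return [r.value for r in self.rows if r.deleted_by is None]

    def ghost_rows(self) -> int:
        return sum(1 for r in self.rows if r.deleted_by is not None)

    def vacuum(self) -> None:
        # Rewrite storage without the ghost rows.
        self.rows = [r for r in self.rows if r.deleted_by is None]


table = ToyTable()
table.insert("SFO")
table.insert("JFK")
table.update("SFO", "SEA", txn=1)
table.delete("JFK", txn=2)
print(table.visible(), table.ghost_rows(), len(table.rows))  # -> ['SEA'] 2 3
table.vacuum()
print(table.visible(), table.ghost_rows(), len(table.rows))  # -> ['SEA'] 0 1
```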

Column Properties: Design Considerations

• Compression:
  • COPY automatically analyzes and compresses data when loading into empty tables
  • ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column
  • Changing column encoding requires a table rebuild
• DISTKEY and SORTKEY influence performance (orders of magnitude)
• Distribution keys:
  • A poor DISTKEY can introduce data skew and an unbalanced workload
  • A query completes only as fast as the slowest slice completes
• Sort keys:
  • A sort key is only as effective as the data profile allows it to be
  • Selectivity needs to be considered
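Because a query finishes only when its slowest slice finishes, one rough way to reason about a candidate DISTKEY is the ratio between the fullest slice and the average slice. The sketch below uses made-up row counts and a hypothetical helper name; it is an illustration of the idea, not a Redshift metric.

```python
def skew_ratio(rows_per_slice):
    """Ratio of the fullest slice's row count to the mean; 1.0 means balanced."""
    if not rows_per_slice or sum(rows_per_slice) == 0:
        raise ValueError("need at least one slice holding rows")
    mean = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / mean


balanced = [250, 250, 250, 250]  # hypothetical: key spreads rows evenly
skewed = [700, 100, 100, 100]    # hypothetical: one hot key value

print(skew_ratio(balanced))  # -> 1.0
print(skew_ratio(skewed))    # -> 2.8
```

In the skewed case one slice holds 2.8 times the average share, so the other slices sit idle while it finishes its part of the work.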

Query Lifecycle


[Diagram: one leader node and three compute nodes, each labeled 128GB RAM, 16TB disk, 16 cores.]

Leader Node
• Parser & Rewriter
• Planner & Optimizer
• Code Generator
  • Input: Optimized plan
  • Output: >=1 C++ functions
• Compiler
• Task Scheduler
• WLM
  • Admission
  • Scheduling
• PostgreSQL Catalog Tables
• Amazon Redshift System Tables (STV)

Query Execution Terminology

Step:
• An individual operation needed during query execution
• Steps are combined to allow compute nodes to perform a join
• Examples: scan, sort, hash, aggr

Segment:
• A combination of several steps that can be done by a single process
• Smallest compilation unit executable by a slice
• Segments within a stream run in parallel

Stream:
• A collection of combined segments
• Output to the next stream or SQL client

Visualizing Streams, Segments, and Steps

Stream 0
  Segment 0: Step 0, Step 1, Step 2
  Segment 1: Step 0, Step 1, Step 2, Step 3, Step 4
  Segment 2: Step 0, Step 1, Step 2, Step 3
  Segment 3: Step 0, Step 1, Step 2, Step 3, Step 4, Step 5
Stream 1
  Segment 4: Step 0, Step 1, Step 2, Step 3
  Segment 5: Step 0, Step 1, Step 2
  Segment 6: Step 0, Step 1, Step 2, Step 3, Step 4
Stream 2
  Segment 7: Step 0, Step 1

(Horizontal axis: Time)

Query Lifecycle

[Diagram: the client connects over JDBC/ODBC to the Leader Node, which runs the Parser, the Query Planner (producing explain plans), and the Code Generator (generating code for all segments of one stream). Each Compute Node receives the compiled code, runs it, and returns results to the leader. The Leader Node performs final computations and returns results to the client.]

Segments in a stream are executed concurrently. Each step in a segment is executed serially.
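The execution rule above can be sketched as a small scheduling model: streams run one after another, the segments of a stream run concurrently, and the steps of a segment run in order. The plan shape is taken from the "Visualizing Streams, Segments, and Steps" slide; threads stand in for slice processes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# stream id -> {segment id: number of steps}, as drawn on the slide.
PLAN = {
    0: {0: 3, 1: 5, 2: 4, 3: 6},
    1: {4: 4, 5: 3, 6: 5},
    2: {7: 2},
}


def run_segment(stream, segment, step_count):
    """Steps inside one segment run serially, in order."""
    return [(stream, segment, step) for step in range(step_count)]


def run_query(plan):
    """Streams run in sequence; segments within a stream run concurrently."""
    log = []
    for stream in sorted(plan):
        segments = plan[stream]
        # Leaving the with-block waits for every segment of this stream,
        # so the next stream starts only after this one has finished.
        with ThreadPoolExecutor(max_workers=len(segments)) as pool:
            futures = [
                pool.submit(run_segment, stream, seg, steps)
                for seg, steps in sorted(segments.items())
            ]
            for future in futures:
                log.extend(future.result())
    return log


executed = run_query(PLAN)
print(len(executed))              # -> 32 steps in total
print(executed[0], executed[-1])  # -> (0, 0, 0) (2, 7, 1)
```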

Query Execution Deep Dive: Leader Node

• Leader node receives a query and parses the SQL
• Parser produces a logical representation of original query
• This query tree is input into the query optimizer (volt)
• Volt rewrites the query to maximize its efficiency
• Sometimes a single query is rewritten as several dependent statements in the background
• Rewritten query is sent to the planner which generates 1+ query plans for the execution with the best estimated performance
• Query plan is sent to execution engine, where it’s translated into steps, segments, and streams
• Translated plan is sent to the code generator, which generates a C++ function for each segment
• Generated C++ is compiled with gcc to a .o file and distributed to the compute nodes

Query Execution Deep Dive: Compute Nodes

• Slices execute query segments in parallel
• Executable segments are created for one stream at a time in sequence
• When the compute nodes are done, they return the query results to leader node for final processing
• Leader node merges data into a single result set and addresses any needed sorting or aggregation
• Leader node then returns results to the client

[Animated diagram: Query Execution. The streams, segments, and steps from the earlier visualization are shown running over time across Slices 0 to 3.]

Design considerations for Amazon Redshift slices

DS2.8XL Compute Node (slices 0 to 15)

Ingestion Throughput:
• Each slice’s query processors can load one file at a time:
  • Streaming decompression
  • Parse
  • Distribute
  • Write
• With a single input file, only 1 of the 16 slices (6.25%) is active, so the node is only partially used

16 Input Files:
• Use at least as many input files as there are slices in the cluster
• With 16 input files, all slices are working so you maximize throughput
• COPY continues to scale linearly as you add nodes
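The utilization figures on these two slides come from one rule: each slice loads at most one file at a time. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def active_slice_fraction(input_files, slices):
    """Fraction of slices busy when each slice loads at most one file at a time."""
    if slices <= 0:
        raise ValueError("slices must be positive")
    if input_files < 0:
        raise ValueError("input_files must not be negative")
    return min(input_files, slices) / slices


# One DS2.8XL compute node has 16 slices.
print(active_slice_fraction(1, 16))   # -> 0.0625, i.e. 6.25% of slices
print(active_slice_fraction(16, 16))  # -> 1.0, every slice is loading
```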

Data Preparation for Redshift COPY

Export Data from Source System
• CSV recommended (simple delimiter ',' or '|')
  • Be aware of UTF-8 varchar columns (a UTF-8 character can take up to 4 bytes)
  • Be aware of your NULL character (\N)
• GZIP compress files
• Split files (1 MB to 1 GB after gzip compression)

Useful COPY Options for PoC Data
• MAXERROR
• ACCEPTINVCHARS
• NULL AS
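The split and gzip steps can be done with the Python standard library alone. The sketch below is one possible approach, not a tool from the deck: the function name and the output file naming are hypothetical, lines are dealt round robin, and it assumes no field contains an embedded newline.

```python
import gzip
from pathlib import Path


def split_and_gzip(source, out_dir, parts):
    """Split a delimited text file into `parts` gzip files, round robin by line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(source).stem
    paths = [out_dir / f"{stem}.part{i:03d}.gz" for i in range(parts)]
    writers = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        with open(source, "r", encoding="utf-8", newline="") as src:
            for line_no, line in enumerate(src):
                # Assumes one record per line (no embedded newlines in fields).
                writers[line_no % parts].write(line)
    finally:
        for writer in writers:
            writer.close()
    return paths


# Example call (paths are placeholders); 16 parts matches one DS2.8XL node:
# split_and_gzip("export/deep_dive.csv", "export/parts", parts=16)
```

Choosing `parts` as the number of slices in the cluster, or more, keeps every slice busy during COPY, provided each compressed part stays within the 1 MB to 1 GB range above.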

Open source tools

https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs

Admin scripts
• Collection of utilities for running diagnostics on your cluster

Admin views
• Collection of utilities for managing your cluster, generating schema DDL, etc.

ColumnEncodingUtility
• Gives you the ability to apply optimal column encoding to an established schema with data already loaded

Amazon Redshift Engineering’s Advanced Table Design Playbook
https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/

New & Upcoming Features

Recently Released Features

New Data Type: TIMESTAMPTZ
• Support for Timestamp with Time zone

Multi-byte Object Names
• Support for Multi-byte (UTF-8) characters for tables, columns, and other database object names

User Connection Limits
• You can now set a limit on the number of database connections a user is permitted to have open concurrently

Automatic Data Compression for CTAS
• All newly created tables will leverage default encoding

New Column Encoding: ZSTD

Approximate Percentile Functions

Recently Released Features

Performance Enhancements
• Vacuum (10x faster for deletes)
• Snapshot Restore (2x faster)
• Queries (Up to 5x faster)

Copy Can Extend Sorted Region on Single Sort Key
• No need to vacuum when loading in sorted order

Enhanced VPC Routing
• Restrict S3 Bucket Access

Schema Conversion Tool - One-Time Data Exports
• Oracle
• Teradata

Schema Conversion Tool
• Vertica
• SQL Server

Coming Soon: Query Monitoring Rules

• Monitor and control cluster resources consumed by a query
• Get notified, abort and reprioritize long-running / bad queries
• Pre-defined templates for common use cases

Coming Soon: IAM Authentication

[Diagram: Single Sign-On. On the client side, BI tools, SQL clients, and analytics tools connect through the Amazon Redshift ODBC/JDBC drivers. Identity providers shown: ADFS, Corporate Active Directory, and IAM. User groups and individual users map to Redshift in AWS.]

New Redshift ODBC/JDBC drivers. Grab the ticket (userid) and get a SAML assertion.

Coming Soon: Lots More …

Automatic and Incremental Background VACUUM
• Reclaims space and sorts when Redshift clusters are idle
• Vacuum is initiated when performance can be enhanced
• Improves ETL and query performance

Amazon Redshift Spectrum

Run SQL queries directly against data in S3
• High Concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Redshift SQL Support

Check out the coming session on Amazon Redshift Spectrum

Thank you! Tony Gibbs
