Redshift overview

Sao Paulo

Upload: amazon-web-services-latin-america

Post on 25-Jul-2015


TRANSCRIPT

Page 1: Redshift overview

Sao Paulo

Page 2: Redshift overview

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Building Your Data Warehouse with Amazon Redshift

Eric Ferreira, AWS (ericfe@)

Page 3: Redshift overview

Selected Amazon Redshift Customers

Page 4: Redshift overview

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift

Page 5: Redshift overview

Amazon Redshift

[Architecture diagram: SQL Clients/BI Tools connect via JDBC/ODBC to a Leader Node, which coordinates Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each) over 10 GigE (HPC); ingestion, backup, and restore via S3 / EMR / DynamoDB / SSH]

• Fast

• Simple

• Cheap

Page 6: Redshift overview

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 7: Redshift overview

Amazon Redshift has security built-in

• Load encrypted from S3

• SSL to secure data in transit; ECDHE for perfect forward secrecy

• Encryption to secure data at rest

– All blocks on disks & in Amazon S3 encrypted

– Block key, cluster key, master key (AES-256)

– On-premises HSM & AWS CloudHSM support

• Audit logging & AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI-DSS Level 1, FedRAMP

[Architecture diagram: SQL Clients/BI Tools connect via JDBC/ODBC into the customer VPC; the Leader Node and Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each) run in an internal VPC over 10 GigE (HPC); ingestion, backup, and restore via Amazon S3 / Amazon DynamoDB]

Page 8: Redshift overview

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

• Easily enable backups to a second region for disaster recovery

Amazon Redshift continuously backs up your data

Page 9: Redshift overview

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

• With row storage you do unnecessary I/O

• To get total amount, you have to read everything
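The difference can be sketched in a few lines of Python (illustrative only; a real column store works on disk blocks, not Python lists):

```python
# Illustrative sketch: why columnar layout reduces I/O for aggregates.
# Row storage: each record is stored (and must be read) as a whole.
rows = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]
# Totaling the amounts touches all 16 values (4 rows x 4 fields).
total_row = sum(r["amount"] for r in rows)

# Column storage: each column is stored contiguously on its own.
columns = {
    "id": [123, 345, 678, 957],
    "age": [20, 25, 40, 37],
    "state": ["CA", "WA", "FL", "WA"],
    "amount": [500, 250, 125, 375],
}
# The same aggregate now touches only the 4 values of one column.
total_col = sum(columns["amount"])

assert total_row == total_col == 1250
```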

Page 10: Redshift overview

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

• With column storage, you only read the data you need

Page 11: Redshift overview

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Columnar compression saves space & reduces I/O

• Amazon Redshift analyzes and compresses your data

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
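You can also pin the suggested encodings explicitly in the DDL. A sketch matching the output above; the column types are assumptions, since the slide shows only names and encodings:

```sql
-- Sketch: declaring the encodings ANALYZE COMPRESSION suggested.
-- Column types are assumed, not shown on the slide.
CREATE TABLE listing (
    listid         INTEGER      ENCODE delta,
    sellerid       INTEGER      ENCODE delta32k,
    eventid        INTEGER      ENCODE delta32k,
    dateid         SMALLINT     ENCODE bytedict,
    numtickets     SMALLINT     ENCODE bytedict,
    priceperticket DECIMAL(8,2) ENCODE delta32k,
    totalprice     DECIMAL(8,2) ENCODE mostly32,
    listtime       TIMESTAMP    ENCODE raw
);
```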

Page 12: Redshift overview

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Tracks the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O
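A zone map is essentially a per-block (min, max) pair consulted before reading the block; a toy sketch:

```python
# Toy sketch of zone maps: a per-block min/max lets a scan skip blocks.
blocks = [
    [5, 12, 9, 30],      # block 0
    [31, 44, 60, 58],    # block 1
    [61, 90, 75, 100],   # block 2
]
# Zone map: (min, max) per block, maintained as data is written.
zone_map = [(min(b), max(b)) for b in blocks]

def scan_equals(target):
    """Read only blocks whose [min, max] range could contain target."""
    read, hits = 0, []
    for (lo, hi), block in zip(zone_map, blocks):
        if lo <= target <= hi:
            read += 1
            hits += [v for v in block if v == target]
    return read, hits

read, hits = scan_equals(44)
assert read == 1 and hits == [44]   # 2 of 3 blocks skipped
```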

Page 13: Redshift overview

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you

Page 14: Redshift overview

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Use Sort and Distribution keys

Analyze and Vacuum
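A fact table might be declared like this; table and column names are invented for illustration:

```sql
-- Sketch: distribution and sort keys declared in the DDL.
CREATE TABLE sales (
    saleid   INTEGER,
    custid   INTEGER,
    saledate DATE,
    amount   DECIMAL(8,2)
)
DISTKEY (custid)        -- co-locate rows for joins on custid
SORTKEY (saledate);     -- range-restricted scans can skip blocks

-- Refresh planner statistics and reclaim space after churn.
ANALYZE sales;
VACUUM sales;
```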

Page 15: Redshift overview

Load in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection

Data automatically distributed and sorted according to DDL

Scales linearly with number of nodes
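A typical parallel load is a single COPY command; the bucket, prefix, and IAM role ARN below are placeholders:

```sql
-- Sketch: COPY splits the load across all compute node slices.
-- Bucket, prefix, and role ARN are placeholders.
COPY sales
FROM 's3://my-bucket/sales/2015/07/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';
```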

[Diagram: data flows in parallel from Amazon S3 / DynamoDB / SSH into all Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each)]


Page 16: Redshift overview

Automatic, continuous, incremental backups

Configurable backup retention period

Cross region backups for disaster recovery

Stream data to resume querying faster

[Diagram: Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each) streaming backups to Amazon S3]


Page 17: Redshift overview

Simple operation via Console or API

Provision a new cluster in the background

Copy data in parallel from node to node

Only charged for source cluster

Automatic SQL endpoint switchover via DNS

[Diagram: SQL Clients/BI Tools reconnect via DNS switchover; the source cluster (Leader Node + 3 Compute Nodes) copies data in parallel, node to node, to a larger target cluster (Leader Node + 4 Compute Nodes); each node: 128 GB RAM, 48 TB disk, 16 cores]


Page 18: Redshift overview

Clickstream Analysis for Amazon.com

• Redshift runs web log analysis for Amazon.com

– 100-node Redshift cluster (DW1.8XL)

– Over one petabyte workload

– Largest table: 400 TB

– 2 TB of data per day

• Understand customer behavior

– Who is browsing but not buying

– Which products / features are winners

– What sequence led to higher customer conversion

Page 19: Redshift overview

Redshift Performance Realized

• Scan 15 months of data: 14 minutes

– 2.25 trillion rows

• Load one day's worth of data: 10 minutes

– 5 billion rows

• Backfill one month of data: 9.75 hours

– 150 billion rows

• Pig → Amazon Redshift: 2 days to 1 hr

– 10B-row join with 700M rows

• Oracle → Amazon Redshift: 90 hours to 8 hrs

– Reduced number of SQLs by a factor of 3

Page 20: Redshift overview

Redshift Performance Realized

• Managed service: 20% of one DBA's time

– Backup

– Restore

– Resizing

• 2 PB cluster

– 100-node dw1.8xl (3-yr RI)

– $180/hr

Page 21: Redshift overview

Expanding Amazon Redshift’s Functionality

Page 22: Redshift overview

Custom ODBC and JDBC Drivers

• Up to 35% higher performance than open source drivers

• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau

• Will continue to support PostgreSQL open source drivers

• Download drivers from the console

Page 23: Redshift overview

Explain Plan Visualization

Page 24: Redshift overview

User Defined Functions

• We're enabling User Defined Functions (UDFs) so you can add your own

– Scalar and aggregate functions supported

• You'll be able to write UDFs using Python 2.7

– Syntax is largely identical to PostgreSQL UDF syntax

– System and network calls within UDFs are prohibited

• Comes with Pandas, NumPy, and SciPy pre-installed

– You'll also be able to import your own libraries for even more flexibility

Page 25: Redshift overview

Scalar UDF example – URL parsing

CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS VARCHAR
IMMUTABLE AS $$
  import urlparse
  return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

Rather than using complex REGEX expressions, you can import standard Python URL parsing libraries and use them in your SQL
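The same logic in present-day Python 3, where the `urlparse` module moved to `urllib.parse`:

```python
# Python 3 equivalent of the UDF body: urlparse now lives in urllib.parse.
from urllib.parse import urlparse

def f_hostname(url):
    """Return the hostname portion of a URL, like the scalar UDF above."""
    return urlparse(url).hostname

assert f_hostname("https://www.amazon.com/gp/product/123") == "www.amazon.com"
```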

Page 26: Redshift overview

Sorting by Multiple Columns

• Currently support compound sort keys

– Optimized for applications that filter data by one leading column

• Adding support for interleaved sort keys

– Optimized for filtering data by up to eight columns

– No storage overhead, unlike an index

– Lower maintenance penalty compared to indexes

Page 27: Redshift overview

Compound Sort Keys Illustrated

• Records in Redshift are stored in blocks.

• For this illustration, let’s assume that four records fill a block

• Records with a given cust_id are all in one block

• However, records with a given prod_id are spread across four blocks

[Illustration: a table with columns cust_id, prod_id (plus other columns), sorted by the compound key (cust_id, prod_id) into blocks; grid of blocks, one cell [cust_id, prod_id] per record:

            prod_id
             1      2      3      4
cust_id 1  [1,1]  [1,2]  [1,3]  [1,4]
        2  [2,1]  [2,2]  [2,3]  [2,4]
        3  [3,1]  [3,2]  [3,3]  [3,4]
        4  [4,1]  [4,2]  [4,3]  [4,4]]

Page 28: Redshift overview

Interleaved Sort Keys Illustrated

• Records with a given cust_id are spread across two blocks

• Records with a given prod_id are also spread across two blocks

• Data is sorted in equal measures for both keys

[Illustration: the same block grid, now interleave-sorted so that each cust_id value and each prod_id value touches only two of the four blocks:

            prod_id
             1      2      3      4
cust_id 1  [1,1]  [1,2]  [1,3]  [1,4]
        2  [2,1]  [2,2]  [2,3]  [2,4]
        3  [3,1]  [3,2]  [3,3]  [3,4]
        4  [4,1]  [4,2]  [4,3]  [4,4]]
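Interleaved sorting is commonly illustrated with a Z-order (Morton) curve, which gives each key equal weight. A sketch; this illustrates the idea, it is not claimed to be Redshift's exact internal layout:

```python
# Sketch: interleaved ordering illustrated with a Z-order (Morton) curve.
# This gives both keys equal weight; it is an illustration, not
# Redshift's exact on-disk layout.
def morton(x, y, bits=2):
    """Interleave the bits of x and y (x in the even bit positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# 4x4 grid of (cust_id, prod_id) pairs, ids 1..4 mapped to 0..3.
records = [(c, p) for c in range(1, 5) for p in range(1, 5)]
records.sort(key=lambda r: morton(r[0] - 1, r[1] - 1))

# Four records per block => each cust_id and each prod_id value
# lands in exactly two of the four blocks.
blocks = [records[i:i + 4] for i in range(0, 16, 4)]
for key in range(1, 5):
    assert sum(any(c == key for c, _ in b) for b in blocks) == 2
    assert sum(any(p == key for _, p in b) for b in blocks) == 2
```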

Page 29: Redshift overview

How to use the feature

• New keyword 'INTERLEAVED' when defining sort keys

– Existing syntax will still work and behavior is unchanged

– You can choose up to 8 columns to include and can query with any or all of them

• No change needed to queries

• Benefits are significant

[[ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) ]
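Using the grammar above, the illustrated table could be declared as follows; the table and column names are assumed from the illustration:

```sql
-- Sketch: interleaved sort key over the two columns from the illustration.
CREATE TABLE orders (
    cust_id INTEGER,
    prod_id INTEGER,
    amount  DECIMAL(8,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);
```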

Page 30: Redshift overview

New Dense Storage Instance

DS2, based on EC2's D2, has twice the memory and CPU of DS1 (formerly DW1)

Instance     vCPU  ECU  Memory (GiB)  Network             Storage       I/O        Price/TB On-Demand  Price/TB 3-yr RI
ds1.xlarge    2    4.4       15       Moderate            2 TB HDD      0.30 GB/s  $3,330              $999
ds1.8xlarge  16    35       120       10 Gbps             16 TB HDD     2.40 GB/s  $3,330              $999
ds2.xlarge    4    —         31       Enhanced            2 TB HDD      0.50 GB/s  TBD                 TBD
ds2.8xlarge  36    —        244       Enhanced - 10 Gbps  16 TB HDD     4.00 GB/s  TBD                 TBD
dc1.large     2    7         15       Enhanced            0.16 TB SSD   0.20 GB/s  $18,327             $5,498
dc1.8xlarge  32    104      244       Enhanced - 10 Gbps  2.56 TB SSD   3.70 GB/s  $18,327             $5,498

DS1 - Dense Storage (formerly DW1)

DS2 - Dense Storage Gen 2

DC1 - Dense Compute (formerly DW2)

Migrate from DS1 to DS2 by restoring from snapshot. We will help you migrate your RIs.

Page 31: Redshift overview

Open Source Tools

• https://github.com/awslabs/amazon-redshift-utils

• Admin Scripts

– Collection of utilities for running diagnostics on your cluster

• Admin Views

– Collection of utilities for managing your cluster, generating schema DDL, etc.

• Column Encoding Utility

– Applies optimal column encoding to an established schema with data already loaded

Page 32: Redshift overview

Q&A

Page 33: Redshift overview