amazon redshift

71
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jesper Söderlund, Solutions Architect, AWS May 4th 2016 Getting Started with Amazon Redshift

Upload: amazon-web-services

Post on 16-Apr-2017

1.927 views

Category:

Business


0 download

TRANSCRIPT

Page 1: Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jesper Söderlund, Solutions Architect, AWS

May 4th 2016

Getting Started with

Amazon Redshift

Page 2: Amazon Redshift

Agenda

• Introduction

• Benefits

• Use cases

• Getting started

• Q&A

Page 3: Amazon Redshift

AnalyzeStore

Amazon

Glacier

Amazon S3

Amazon

DynamoDB

Amazon RDS,

Amazon Aurora

AWS big data portfolio

AWS Data Pipeline

Amazon

CloudSearch

Amazon EMR Amazon EC2

Amazon

Redshift

Amazon

Machine

Learning

Amazon

Elasticsearch

Service

AWS Database

Migration Service

Amazon

QuickSight

Amazon

Kinesis

Firehose

AWS Import/Export

AWS Direct

Connect

Collect

Amazon Kinesis

Streams

Page 4: Amazon Redshift

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon

Redshift

a lot faster

a lot simpler

a lot cheaper

Page 5: Amazon Redshift

The Amazon Redshift view of data warehousing

10x cheaper

Easy to provision

Higher DBA productivity

10x faster

No programming

Easily leverage BI tools,

Hadoop, machine learning,

streaming

Analysis inline with process

flows

Pay as you go, grow as you

need

Managed availability and

disaster recovery

Enterprise Big data SaaS

Page 6: Amazon Redshift

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical

representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any

vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

Forrester Wave™ Enterprise Data Warehouse Q4 ’15

Page 7: Amazon Redshift

Selected Amazon Redshift customers

Page 8: Amazon Redshift

Amazon Redshift architecture

Leader node

Simple SQL endpoint

Stores metadata

Optimizes query plan

Coordinates query execution

Compute nodes

Local columnar storage

Parallel/distributed execution of all queries, loads,

backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

Ingestion/Backup

Backup

Restore

JDBC/ODBC

10 GigE

(HPC)

Page 9: Amazon Redshift

Benefit #1: Amazon Redshift is fast

Dramatically less I/O

Column storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw

10 | 13 | 14 | 26 |…

… | 100 | 245 | 324

375 | 393 | 417…

… 512 | 549 | 623

637 | 712 | 809 …

… | 834 | 921 | 959

10

324

375

623

637

959

Page 10: Amazon Redshift

Benefit #1: Amazon Redshift is fast

Parallel and distributed

Query

Load

Export

Backup

Restore

Resize

Page 11: Amazon Redshift

Benefit #1: Amazon Redshift is fast

Hardware optimized for I/O intensive workloads, 4 GB/sec/node

Enhanced networking, over 1 million packets/sec/node

Choice of storage type, instance size

Regular cadence of autopatched improvements

Page 12: Amazon Redshift

Benefit #1: Amazon Redshift is fast

New Dense Storage (HDD) instance type

Improved memory 2x, compute 2x, disk throughput 1.5x

Cost: Same as our prior generation!

Performance improvement: 50%

Enhanced I/O and commit improvements (Jan ’16)

Reduce amount of time to commit data

Performance improvement: 35%

Page 13: Amazon Redshift

Benefit #2: Amazon Redshift is inexpensive

Ds2 (HDD)Price per hour for

DW1.XL single node

Effective annual

price per TB compressed

On-demand $ 0.850 $ 3,725

1 year reservation $ 0.500 $ 2,190

3 year reservation $ 0.228 $ 999

Dc1 (SSD)Price per hour for

DW2.L single node

Effective annual

price per TB compressed

On-demand $ 0.250 $ 13,690

1 year reservation $ 0.161 $ 8,795

3 year reservation $ 0.100 $ 5,500

Pricing is simple

Number of nodes x price/hour

No charge for leader node

No upfront costs

Pay as you go

Page 14: Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups

to Amazon S3

Continuous and incremental backups

across regions

Streaming restore

Amazon S3

Amazon S3

Region 1

Region 2

Page 15: Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Amazon S3

Amazon S3

Region 1

Region 2

Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/region level disasters

Page 16: Amazon Redshift

Benefit #4: Security is built-in

• Load encrypted from S3

• SSL to secure data in transit

• ECDHE perfect forward security

• Amazon VPC for network isolation

• Encryption to secure data at rest

• All blocks on disks and in S3 encrypted

• Block key, cluster key, master key (AES-256)

• On-premises HSM & AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

10 GigE

(HPC)

Ingestion

Backup

Restore

Customer VPC

Internal

VPC

JDBC/ODBC

Page 17: Amazon Redshift

Benefit #5: We innovate quickly

Well over 100 new features added since launch

Release every two weeks

Automatic patching

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

DUB (4/25)

SOC1/2/3 (5/8)

Unload Encrypted Files

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

SHA1 Builtin (7/15)

4 byte UTF-8 (7/18)

Sharing snapshots (7/18)

Statement Timeout (7/22)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

Resource Level IAM (8/9)

PCI (8/22)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit

Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS

Alerts, Cross Region Backup (11/13)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500

(12/13)

EIP Support for VPC Clusters (12/28)

New query monitoring system tables and diststyle all (1/13)

Redshift on DW2 (SSD) Nodes (1/23)

Compression for COPY from SSH, Fetch size support for single node clusters, new

system tables with commit stats, row_number(), strotol() and query

termination (2/13)

Resize progress indicator & Cluster Version (3/21)

Regex_Substr, COPY from JSON (3/25)

50 slots, COPY from EMR, ECDHE ciphers (4/22)

3 new regex features, Unload to single file, FedRAMP(5/6)

Rename Cluster (6/2)

Copy from multiple regions, percentile_cont, percentile_disc (6/30)

Free Trial (7/1)

pg_last_unload_count (9/15)

AES-128 S3 encryption (9/29)

UTF-16 support (9/29)

Page 18: Amazon Redshift

Benefit #6: Amazon Redshift is powerful

• Approximate functions

• User defined functions

• Machine learning

• Data science

Page 19: Amazon Redshift

Benefit #7: Amazon Redshift has a large ecosystem

Data integration Systems integratorsBusiness intelligence

Page 20: Amazon Redshift

Benefit #8: Service oriented architecture

DynamoDB

EMR

S3

EC2/SSH

RDS/Aurora

Amazon

Redshift

Amazon Kinesis

Amazon ML

Data Pipeline

CloudSearch

Amazon

Mobile

Analytics

Page 21: Amazon Redshift

Use cases

Page 22: Amazon Redshift

NTT Docomo: Japan’s largest mobile service

provider

68 million customers

Tens of TBs per day of data across a

mobile network

6 PB of total data (uncompressed)

Data science for marketing

operations, logistics, and so on

Greenplum on-premises

Scaling challenges

Performance issues

Need same level of security

Need for a hybrid environment

Page 23: Amazon Redshift

Nasdaq: powering 100 marketplaces in 50

countries

Orders, quotes, trade executions,

market “tick” data from 7 exchanges

7 billion rows/day

Analyze market share, client activity,

surveillance, billing, and so on

Microsoft SQL Server on-premises

Expensive legacy DW

($1.16 M/yr.)

Limited capacity (1 yr. of data

online)

Needed lower TCO

Must satisfy multiple security

and regulatory requirements

Similar performance

Page 24: Amazon Redshift

Getting started

Page 25: Amazon Redshift

Provisioning

Page 26: Amazon Redshift

Enter cluster details

Page 27: Amazon Redshift

Select node configuration

Page 28: Amazon Redshift

Select security settings and provision

Page 29: Amazon Redshift

Point-and-click resize

Page 30: Amazon Redshift

Resize

• Resize while remaining online

• Provision a new cluster in the

background

• Copy data in parallel from node to

node

• Only charged for source cluster

Page 31: Amazon Redshift

Data modeling

Page 32: Amazon Redshift

SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’

MIN: 01-JUNE-2013

MAX: 20-JUNE-2013

MIN: 08-JUNE-2013

MAX: 30-JUNE-2013

MIN: 12-JUNE-2013

MAX: 20-JUNE-2013

MIN: 02-JUNE-2013

MAX: 25-JUNE-2013

Unsorted table

MIN: 01-JUNE-2013

MAX: 06-JUNE-2013

MIN: 07-JUNE-2013

MAX: 12-JUNE-2013

MIN: 13-JUNE-2013

MAX: 18-JUNE-2013

MIN: 19-JUNE-2013

MAX: 24-JUNE-2013

Sorted by date

Zone maps

Page 33: Amazon Redshift

• Single column

• Compound

• Interleaved

Page 34: Amazon Redshift

Single column• Table is sorted by 1 column

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

[ SORTKEY ( date ) ]

• Best for:

• Queries that use first column (that is, date) as primary filter

• Can speed up joins and group bys

• Quickest to VACUUM

Page 35: Amazon Redshift

Compound• Table is sorted by first column, then second column, and so on

Date Region Country

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Korea

2-JUN-2015 Asia Singapore

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

3-JUN-2015 Asia Korea

[ SORTKEY COMPOUND ( date, region, country) ]

• Best for:

• Queries that use first column as primary filter, then other columns

• Can speed up joins and group bys

• Slower to VACUUM

Page 36: Amazon Redshift

Interleaved

• Equal weight is given to each column

Date Region Country

2-JUN-2015 Africa Zaire

3-JUN-2015 Asia Singapore

2-JUN-2015 Asia Korea

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

2-JUN-2015 Asia Korea

[ SORTKEY INTERLEAVED ( date, region, country) ]

• Best for:

• Queries that use different columns in filter

• Queries get faster the more columns used in the filter

• Slowest to VACUUM

Page 37: Amazon Redshift

• KEY

• ALL

• EVEN

Page 38: Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

2

3

4

ID Gender Name

101 M John Smith

306 F Lisa Green

ID Gender Name

292 F Jane Jones

209 M James White

ID Gender Name

139 M Peter Black

164 M Brian Snail

ID Gender Name

446 M Pat Partridge

658 F Sarah Cyan

RoundRobin

DISTSTYLE EVEN

Page 39: Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

HashFunction

ID Gender Name

101 M John Smith

306 F Lisa Green

ID Gender Name

292 F Jane Jones

209 M James White

ID Gender Name

139 M Peter Black

164 M Brian Snail

ID Gender Name

446 M Pat Partridge

658 F Sarah Cyan

DISTSTYLE KEY

Page 40: Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

HashFunction

ID Gender Name

101 M John Smith

139 M Peter Black

446 M Pat Partridge

164 M Brian Snail

209 M James White

ID Gender Name

292 F Jane Jones

658 F Sarah Cyan

306 F Lisa Green

DISTSTYLE KEY

Page 41: Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

All

DISTSTYLE ALL

Page 42: Amazon Redshift

• KEY

• Large fact tables

• Large dimension tables

• ALL

• Medium dimension tables (1K–2M)

• EVEN

• Tables with no joins or group bys

• Small dimension tables (<1000)

Page 43: Amazon Redshift

Loading data

Page 44: Amazon Redshift

AWS cloudCorporate data center

Amazon S3Amazon

Redshift

Flat files

Data loading options

Page 45: Amazon Redshift

AWS cloudCorporate data center

ETL

Source DBs

Amazon

Redshift

Amazon

Redshift

Data loading options

Page 46: Amazon Redshift

AWS cloud

Amazon

Redshift

Amazon

Kinesis

Firehose

Data loading options

Page 47: Amazon Redshift

Querying

Page 48: Amazon Redshift

JDBC/ODBC

Amazon Redshift

Amazon Redshift works with your existing

analysis tools

Page 49: Amazon Redshift

ODBC/JDBC

BI clientsAmazon

Redshift

Page 50: Amazon Redshift

ODBC/JDBC

BI serverAmazon

RedshiftClients

Page 51: Amazon Redshift

Monitor query performance

Page 52: Amazon Redshift
Page 53: Amazon Redshift

https://www.flickr.com/photos/wandering_angel/834854901

Getting started with

RedshiftExperiences from Yle

Page 54: Amazon Redshift
Page 55: Amazon Redshift
Page 56: Amazon Redshift

Vision 2018

The best personal

user experience

Page 57: Amazon Redshift

https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs-fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE-

5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-kwvpGe-ddn5G7-5BK5qi-ddmZbF-egsmCb-75dpDG-

4mo6sG-4bkfHF-dL8cys-84Xbr2-iuSuaX

Big Data =

Page 58: Amazon Redshift

Big Data =

Big Dada

Page 59: Amazon Redshift

Big Data =

Big Dada

Yle News

Government

One journalist

Page 60: Amazon Redshift

https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs-fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE-

5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-kwvpGe-ddn5G7-5BK5qi-ddmZbF-egsmCb-75dpDG-

4mo6sG-4bkfHF-dL8cys-84Xbr2-iuSuaX

Big Data =

Small data

Page 61: Amazon Redshift

https://www.flickr.com/photos/rh2ox/9990016123/in/photolist-gdMrKi-6GiYkV-ax8z4B-9rnTdL-bf2wtK-22CjS2-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs-

fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE-5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-

kwvpGe-ddn5G7-5BK5qi-ddmZbF-egsmCb

Page 62: Amazon Redshift
Page 63: Amazon Redshift

Painpoint #1

Losing data

Page 64: Amazon Redshift

In production

Critical

Yle Analytics Pipeline – First idea

Elastic

Beanstalk

Analytics

Collector

Web events

~50 mill./day

RedShift

Recommendation

service

Page 65: Amazon Redshift

Painpoint #2Cloud doesn’t

scale?

Page 66: Amazon Redshift

Painpoint #3

It’s down

again?

Page 67: Amazon Redshift

Painpoint #4

Where are the

skills?

Page 68: Amazon Redshift

Pilot

In production

Critical

Responsible: JO

Testing

Yle Analytics Pipeline - Current situation

Elastic

BeanstalkKinesis Streams

Analytics

Collector

Web events

~50 mill./day

Kinesis FirehoseRedShift

Lambda S3 EMR

Daily

archiving

Areena

Recommendation

Delay: 10-15 minutes

Note: Altogether tens of source application types with in total 120 different labels (most of them common). Labels may change.

Small batch sizes

Compression and

larger file sizes.

~60 Gb per day

CloudWatch

Dashboards

Small sized files

Delay: minute?

Page 69: Amazon Redshift

Yle Analytics Pipeline - Future scenario

In production

Critical

Responsible: JO

Elastic

Beanstalk Kinesis Streams

Analytics

Collector

RedShift

Lambda S3

Delay:

15 minutes to hours

Kinesis Firehose

Aggregated

analysis

(hours)

Ad hoc

analysis

Validation

Parsing

Aggregating?CloudWatch

Web events

~50 mill./day

Delay: minutes?

Delay: seconds DynamoDBEMR

Specific

real time

analysis

(seconds)

Selected analysis

EMR

Ad hoc

analysis w/

random

tools

Batch

analysis

Raw data

queries

AP

Is, b

uffe

ring

Additional data

Daily/weekly

division into

paths?

ElasticSearch?

Lambda

EC2 Container

Service?

What kind of architecture?

Data vault?

Schema versioning

Remove old information

Page 70: Amazon Redshift

Redshift is still

being explored and

looks promising!

Page 71: Amazon Redshift

[email protected]

@AlekRossi

Thanks!