Leveraging Amazon Redshift for Your Data Warehouse


Post on 14-Aug-2015


Page 1: Leveraging Amazon Redshift for your Data Warehouse

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Leveraging Amazon Redshift for Your Data Warehouse

John Loughlin, Solutions Architect @ AWS

Kyle Hubert, Principal Data Architect @ Simulmedia

Page 2: Leveraging Amazon Redshift for your Data Warehouse

Petabyte scale

Massively parallel

Relational data warehouse

Fully managed; zero admin

Amazon Redshift

a lot faster

a lot cheaper

a whole lot simpler

Page 3: Leveraging Amazon Redshift for your Data Warehouse

[Diagram: AWS services across the Collect, Store, and Analyze stages: AWS Direct Connect, Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon Glacier, AWS Data Pipeline, Amazon Redshift, Amazon EMR, Amazon EC2]

Page 4: Leveraging Amazon Redshift for your Data Warehouse

Common customer use cases

Traditional enterprise DW

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business

Companies with big data

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS companies

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW and SW costs by an order of magnitude

Page 5: Leveraging Amazon Redshift for your Data Warehouse

Amazon.com enterprise data warehouse

• Generates weblogs @ 2 terabytes/day, growing 67% YoY

• Oracle RAC legacy system

• Scan rate: 1 week of data/hour

• Hit RAC node limit of 32 nodes

• More data => Slower queries

• Migrated to Amazon Redshift

• Scan rate: 15 months of data (2.25 trillion rows) in 14 minutes

• More than 10x performance with a 100-node cluster

• 21 billion rows joined with 10 billion rows in under 2 hours, down from days
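The two scan rates quoted on this slide can be put on a common footing with a little arithmetic. The sketch below uses only the slide's figures, plus the assumption that 15 months is about 65 weeks; it compares the quoted scan rates and is not a benchmark.

```python
# Figures quoted on the slide.
ROWS_SCANNED = 2.25e12          # rows in 15 months of data
SCAN_MINUTES = 14               # time Amazon Redshift took to scan them
LEGACY_WEEKS_PER_HOUR = 1       # legacy system: 1 week of data per hour

# Assumption (not on the slide): 15 months is roughly 65 weeks.
WEEKS_IN_15_MONTHS = 15 * 52 / 12


def rows_per_second(rows: float, minutes: float) -> float:
    """Convert a row count scanned in `minutes` into rows per second."""
    return rows / (minutes * 60)


def weeks_per_hour(weeks: float, minutes: float) -> float:
    """Convert weeks of data scanned in `minutes` into weeks per hour."""
    return weeks / (minutes / 60)


redshift_rows_per_sec = rows_per_second(ROWS_SCANNED, SCAN_MINUTES)
redshift_weeks_per_hour = weeks_per_hour(WEEKS_IN_15_MONTHS, SCAN_MINUTES)
scan_rate_ratio = redshift_weeks_per_hour / LEGACY_WEEKS_PER_HOUR

print(f"{redshift_rows_per_sec:.2e} rows/sec")
print(f"{redshift_weeks_per_hour:.0f} weeks of data per hour")
print(f"{scan_rate_ratio:.0f}x the legacy scan rate")
```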

Page 6: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift architecture

• Leader node

– SQL endpoint, JDBC/ODBC

– Stores metadata

– Coordinates query execution

• Compute nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3

– Load from Amazon DynamoDB or SSH

• Two hardware platforms

– Optimized for data processing

– DS2: HDD; scale from 2TB to 2PB

– DC1: SSD; scale from 160 GB to 326 TB

[Architecture diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
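The division of labor between the leader node and the compute nodes can be illustrated with a toy scatter/gather aggregation. This is only a conceptual sketch: the round-robin distribution, the thread pool, and the function names are assumptions made for illustration and say nothing about how Amazon Redshift is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_sum(local_rows):
    """A 'compute node' aggregates only the rows stored locally."""
    return sum(local_rows)


def leader_total(rows, node_count):
    """A 'leader' fans the work out, then combines the partial results."""
    # Assumption: rows are spread round-robin across the nodes.
    slices = [rows[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_sums = list(pool.map(compute_node_sum, slices))
    return sum(partial_sums)


amounts = [500, 250, 125, 375]
total = leader_total(amounts, node_count=2)
print(total)  # 1250
```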

Page 7: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift node types

• Optimized for I/O intensive workloads

• High disk density

• On demand at $0.85/hour

• As low as $1,000/TB/year

• Scale from 2 TB to 2 PB

DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan

DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan

• High performance at smaller storage size

• High compute and memory density

• On demand at $0.25/hour

• As low as $5,500/TB/year

• Scale from 160 GB to 326 TB

DC1.L: 16 GB RAM, 2 cores, 160 GB of compressed SSD storage

DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB of compressed SSD storage

Page 8: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift lets you analyze all your data

Price is nodes times hourly cost

No charge for leader node

3x data compression on average

Price includes 3 copies of data

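DS2 (HDD)          | Price per hour for smallest single node | Effective annual price per TB compressed
-------------------+-----------------------------------------+------------------------------------------
On-Demand          | $0.850                                  | $3,725
1 Year Reservation | $0.500                                  | $2,190
3 Year Reservation | $0.228                                  | $999

DC1 (SSD)          | Price per hour for smallest single node | Effective annual price per TB compressed
-------------------+-----------------------------------------+------------------------------------------
On-Demand          | $0.250                                  | $13,690
1 Year Reservation | $0.161                                  | $8,795
3 Year Reservation | $0.100                                  | $5,500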
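The "nodes times hourly cost" rule and the effective annual price per TB can be checked with a short calculation. This sketch assumes 8,760 hours per year and the smallest-node storage sizes from the node types slide (2 TB for DS2.XL, 160 GB for DC1.L). It lands close to the slide's figures, which appear to be rounded, and the reserved-price rows may differ slightly.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumption: a non-leap year, always on


def cluster_hourly_cost(node_count, price_per_node_hour):
    """Price is nodes times hourly cost; the leader node is not charged."""
    return node_count * price_per_node_hour


def annual_price_per_tb(price_per_node_hour, tb_per_node):
    """Effective annual price per TB of compressed storage on one node."""
    return price_per_node_hour * HOURS_PER_YEAR / tb_per_node


ds2_on_demand = annual_price_per_tb(0.850, 2.0)    # slide: $3,725
ds2_one_year = annual_price_per_tb(0.500, 2.0)     # slide: $2,190
dc1_on_demand = annual_price_per_tb(0.250, 0.16)   # slide: $13,690

print(f"DS2 on-demand: ${ds2_on_demand:,.0f} per TB per year")
print(f"DS2 1-year:    ${ds2_one_year:,.0f} per TB per year")
print(f"DC1 on-demand: ${dc1_on_demand:,.0f} per TB per year")
print(f"4 DS2.XL nodes: ${cluster_hourly_cost(4, 0.850):.2f} per hour")
```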

Page 9: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift works with your analysis tools

JDBC/ODBC

Amazon Redshift

Page 10: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point-and-click resize

• Automatic backup

• Built-in security

Page 11: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

• Easily enable backups to a second region for disaster recovery

Page 12: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift has security built-in

• Load encrypted from S3

• SSL to secure data in transit; ECDHE perfect forward secrecy

• Encryption to secure data at rest

– All blocks on disks and in S3 encrypted

– Block key, cluster key, master key (AES-256)

– On-premises HSM and AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI-DSS Level 1, FedRAMP

[Architecture diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Page 13: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• With row storage you do unnecessary I/O

• To get the total amount, you have to read everything

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

Page 14: Leveraging Amazon Redshift for your Data Warehouse

• With column storage, you only read the data you need

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375
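The I/O difference between the two layouts can be made concrete with the four rows shown on the slide. The sketch below counts how many stored values each layout touches to total the Amount column. The in-memory dictionaries are a stand-in chosen for illustration, not Amazon Redshift's storage format.

```python
rows = [
    {"ID": 123, "Age": 20, "State": "CA", "Amount": 500},
    {"ID": 345, "Age": 25, "State": "WA", "Amount": 250},
    {"ID": 678, "Age": 40, "State": "FL", "Amount": 125},
    {"ID": 957, "Age": 37, "State": "WA", "Amount": 375},
]

# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(rows):
    """Row storage: every field of every row is read to total one column."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)
        total += row["Amount"]
    return total, values_read


def total_amount_column_store(columns):
    """Column storage: only the Amount column is read."""
    amounts = columns["Amount"]
    return sum(amounts), len(amounts)


print(total_amount_row_store(rows))        # (1250, 16)
print(total_amount_column_store(columns))  # (1250, 4)
```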

Page 15: Leveraging Amazon Redshift for your Data Warehouse

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• COPY compresses automatically

• You can analyze and override

• More performance, less cost
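To give a feel for why an encoding such as delta suits a column like listid, here is a minimal sketch of delta encoding: keep the first value, then store only the difference from the previous value. The sample IDs are hypothetical, and the code illustrates the general idea rather than Amazon Redshift's on-disk format.

```python
def delta_encode(values):
    """Keep the first value as-is, then store successive differences."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    values = []
    running = 0
    for delta in encoded:
        running += delta
        values.append(running)
    return values


# Hypothetical, nearly sequential IDs: the differences are tiny
# compared with the values themselves, so they need fewer bytes.
list_ids = [1000001, 1000002, 1000004, 1000007]
encoded = delta_encode(list_ids)

print(encoded)                # [1000001, 1, 2, 3]
print(delta_decode(encoded))  # [1000001, 1000002, 1000004, 1000007]
```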

Page 16: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)

Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)

Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain relevant data
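The block-skipping idea can be sketched with the three blocks from the slide. Only the values actually shown are used; the elided values are left out. The `scan_range` helper is a hypothetical name for illustration, and the real mechanism works on disk blocks rather than Python lists.

```python
# Values visible on the slide, one list per block.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# The zone map records the minimum and maximum value of each block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_range(low, high):
    """Return values in [low, high] and how many blocks had to be read."""
    matches = []
    blocks_read = 0
    for (block_min, block_max), block in zip(zone_map, blocks):
        if block_max < low or block_min > high:
            continue  # the zone map proves this block has nothing relevant
        blocks_read += 1
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


print(zone_map)              # [(10, 324), (375, 623), (637, 959)]
print(scan_range(400, 600))  # ([417, 512, 549], 1)
```

A range that falls between two blocks' bounds, such as 325 to 374, reads no blocks at all.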

Page 17: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Use local storage for performance

• Maximize scan rates

• Automatic replication and continuous backup

• HDD and SSD platforms

Page 18: Leveraging Amazon Redshift for your Data Warehouse

Amazon Redshift @ Simulmedia

Page 19: Leveraging Amazon Redshift for your Data Warehouse

“Half the money I spend on advertising is wasted; the trouble is I don't know which half.”

- John Wanamaker

Page 20: Leveraging Amazon Redshift for your Data Warehouse

A data-centric approach to TV advertising

Page 21: Leveraging Amazon Redshift for your Data Warehouse

Targeted TV advertising that reaches 110 million households

Page 22: Leveraging Amazon Redshift for your Data Warehouse

Anonymous viewing data from millions of set-top boxes and smart TVs overlaid with 3rd party viewing data

Page 23: Leveraging Amazon Redshift for your Data Warehouse

Reinvested in our platform with Amazon Redshift

Page 24: Leveraging Amazon Redshift for your Data Warehouse

10–100x improvement in performance

Decreased time to release

Proliferation of experiments on the data

Page 25: Leveraging Amazon Redshift for your Data Warehouse

Business opportunity/capacity has increased exponentially; headcount for the team has remained stable

Page 26: Leveraging Amazon Redshift for your Data Warehouse

On-premises Hadoop/Hive cluster with >80 nodes storing 150 TB of data

Page 27: Leveraging Amazon Redshift for your Data Warehouse

HDFS -> S3

Freedom from replication factor

Separate archives and active data set

Scalable performance

Page 28: Leveraging Amazon Redshift for your Data Warehouse

Production data was optimal for MPP

Page 29: Leveraging Amazon Redshift for your Data Warehouse

[Bar chart: MPP cost per TB per year, on an axis from $0 to $175,000, comparing Amazon Redshift (HDD and SSD) with solution A, solution B, solution C, and solution D]

Page 30: Leveraging Amazon Redshift for your Data Warehouse

Managed service

Continual upgrades

Automatic snapshotting

Page 31: Leveraging Amazon Redshift for your Data Warehouse

<1 sec to query 2 years of historical viewing data

N.B.: skinny fact table

Page 32: Leveraging Amazon Redshift for your Data Warehouse

Flexible data discovery period

Better understanding of data

Tuned facts and distributed dimensions

Page 33: Leveraging Amazon Redshift for your Data Warehouse

Production Amazon Redshift cluster with 3 nodes storing ~1.4 TB

Non-production Amazon Redshift cluster with 2 nodes storing ~8 TB

Page 34: Leveraging Amazon Redshift for your Data Warehouse

S3 data lake

Minor transformations during ingestion

Idempotent audit tables in Amazon Redshift

Star schema design
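The slide does not say how the idempotent audit tables work. One common reading is that each ingested file is recorded in an audit table, so re-running an ingestion job loads nothing twice. The sketch below illustrates that reading using SQLite from the Python standard library as a stand-in for Amazon Redshift. The table names, columns, and file key are all hypothetical.

```python
import sqlite3


def load_file_once(conn, file_key, rows):
    """Load `rows` for `file_key` unless the audit table already lists it."""
    already_loaded = conn.execute(
        "SELECT 1 FROM load_audit WHERE file_key = ?", (file_key,)
    ).fetchone()
    if already_loaded:
        return False
    with conn:  # one transaction: fact rows and audit row commit together
        conn.executemany(
            "INSERT INTO viewing_fact (household_id, seconds_viewed) "
            "VALUES (?, ?)",
            rows,
        )
        conn.execute(
            "INSERT INTO load_audit (file_key, row_count) VALUES (?, ?)",
            (file_key, len(rows)),
        )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE viewing_fact (household_id INTEGER, seconds_viewed INTEGER)"
)
conn.execute(
    "CREATE TABLE load_audit (file_key TEXT PRIMARY KEY, row_count INTEGER)"
)

file_key = "s3://example-bucket/viewing/2015-08-01.gz"  # hypothetical key
first_run = load_file_once(conn, file_key, [(1, 30), (2, 45)])
second_run = load_file_once(conn, file_key, [(1, 30), (2, 45)])
fact_rows = conn.execute("SELECT COUNT(*) FROM viewing_fact").fetchone()[0]

print(first_run, second_run, fact_rows)  # True False 2
```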

Page 35: Leveraging Amazon Redshift for your Data Warehouse

Decreased our infrastructure costs

Cleaned up our architecture

Operational complexity removed

Capacity planning eased

Page 36: Leveraging Amazon Redshift for your Data Warehouse

Demographics/Targeting/Forecasting

From ~1 hour to ~10 seconds

Page 37: Leveraging Amazon Redshift for your Data Warehouse

Measurement

From ~7–10 hours to ~5 minutes

Page 38: Leveraging Amazon Redshift for your Data Warehouse

SQL everywhere

Page 39: Leveraging Amazon Redshift for your Data Warehouse

Data science:

Improve forecasting

Improve optimizations

Improve measurement

Page 40: Leveraging Amazon Redshift for your Data Warehouse

Analytics:

Build new reports

Discover more about effective spots

Page 41: Leveraging Amazon Redshift for your Data Warehouse

Best practices

Page 42: Leveraging Amazon Redshift for your Data Warehouse

Learn the Amazon Redshift Management Console:

Set up queueing

Set up alerts

Track CPU utilization when debugging

Page 43: Leveraging Amazon Redshift for your Data Warehouse

Low concurrency (1–3 queries)

Alerts on disk usage

Query execution details

Page 44: Leveraging Amazon Redshift for your Data Warehouse

COPY/UNLOAD

Remember to analyze tables for planner

Take advantage of compression analysis
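The three points above might translate into a load routine along these lines: COPY the data in, refresh planner statistics with ANALYZE, and ask for encoding recommendations with ANALYZE COMPRESSION. The sketch only builds the SQL text and does not connect to a cluster. The bucket, IAM role, file format options, and helper names are assumptions, while the `listing` table comes from the compression slide.

```python
def load_statements(table, s3_prefix, iam_role):
    """Build the statements for one bulk load followed by analysis.

    Plain string formatting is used for readability; values that come
    from users should be validated before being placed in SQL text.
    """
    return [
        # Bulk load from S3 (assumes gzip-compressed, pipe-delimited files).
        f"COPY {table} FROM '{s3_prefix}' IAM_ROLE '{iam_role}' "
        "GZIP DELIMITER '|';",
        # Refresh the statistics the query planner relies on.
        f"ANALYZE {table};",
        # Report suggested column encodings for the table.
        f"ANALYZE COMPRESSION {table};",
    ]


def unload_statement(query, s3_prefix, iam_role):
    """Build an UNLOAD that writes a query result back to S3."""
    return f"UNLOAD ('{query}') TO '{s3_prefix}' IAM_ROLE '{iam_role}' GZIP;"


# Hypothetical bucket and role, for illustration only.
role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
statements = load_statements("listing", "s3://example-bucket/listing/", role)
export = unload_statement(
    "SELECT * FROM listing", "s3://example-bucket/export/listing_", role
)

for statement in statements + [export]:
    print(statement)
```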

Page 45: Leveraging Amazon Redshift for your Data Warehouse

Use timestamp/date data types

(Add timezone to column name)

Use varchar

Page 46: Leveraging Amazon Redshift for your Data Warehouse

Your feedback is important to AWS

Please complete the session evaluation. Tell us what you think!

Page 47: Leveraging Amazon Redshift for your Data Warehouse

NEW YORK