©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Leveraging Amazon Redshift for Your Data Warehouse
John Loughlin, Solutions Architect @ AWS
Kyle Hubert, Principal Data Architect @ Simulmedia
Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin
Amazon Redshift
a lot faster
a lot cheaper
a whole lot simpler
[Diagram: AWS services across the Collect, Store, and Analyze stages: AWS Direct Connect, Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon Glacier, AWS Data Pipeline, Amazon Redshift, Amazon EMR, Amazon EC2]
Common customer use cases
Traditional enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
Companies with big data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW and SW costs by an order of magnitude
Amazon.com enterprise data warehouse
• Generates weblogs @ 2 terabytes/day, growing 67% YoY
• Oracle RAC legacy system
• Scan rate: 1 week of data/hour
• Hit RAC node limit of 32 nodes
• More data => Slower queries
• Migrated to Amazon Redshift
• Scan rate: 15 months of data (2.25 trillion rows) in 14 minutes
• More than 10x performance with a 100-node cluster
• 21 billion rows joined with 10 billion rows in under 2 hours, down from days
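As a rough check, the scan figures above can be put on a common scale. This is only a sketch: it assumes 15 months is about 65 weeks (an approximation, not a number from the slide), and the function and variable names are illustrative. Scan rate is one measure among several, so the result is not directly comparable to the overall performance figure quoted above.

```python
# Figures from the slide, except WEEKS_IN_15_MONTHS, which is an assumption.
ROWS_SCANNED = 2.25e12        # 2.25 trillion rows
SCAN_MINUTES = 14             # time to scan 15 months of data
WEEKS_IN_15_MONTHS = 65       # assumption: 15 months is roughly 65 weeks
LEGACY_WEEKS_PER_HOUR = 1     # legacy system: 1 week of data per hour


def rows_per_second(rows: float, minutes: float) -> float:
    """Average scan throughput in rows per second."""
    return rows / (minutes * 60)


def weeks_per_hour(weeks: float, minutes: float) -> float:
    """How many weeks of data are scanned per hour at this pace."""
    return weeks / (minutes / 60)


redshift_rows_per_sec = rows_per_second(ROWS_SCANNED, SCAN_MINUTES)
redshift_weeks_per_hour = weeks_per_hour(WEEKS_IN_15_MONTHS, SCAN_MINUTES)

print(f"{redshift_rows_per_sec:.2e} rows/sec")
print(f"{redshift_weeks_per_hour:.0f} weeks of data/hour "
      f"(legacy: {LEGACY_WEEKS_PER_HOUR})")
```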
Amazon Redshift architecture
• Leader node
– SQL endpoint, JDBC/ODBC
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DS2: HDD; scale from 2 TB to 2 PB
– DC1: SSD; scale from 160 GB to 326 TB
[Architecture diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
Amazon Redshift node types
DS2 (HDD)
• Optimized for I/O intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 2 PB
DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan
DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed, 4 GB/sec scan
DC1 (SSD)
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 326 TB
DC1.L: 16 GB RAM, 2 cores, 160 GB of compressed SSD storage
DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB of compressed SSD storage
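The per-node storage figures above lend themselves to a back-of-the-envelope sizing estimate. The sketch below is a rough illustration, not an official sizing method: it assumes the 3x average compression the deck cites, ignores free-space headroom and minimum cluster sizes, and uses a hypothetical 30 TB workload. The function name `nodes_needed` is made up for this example.

```python
import math

# Compressed storage per node in TB, as listed on the slide.
NODE_STORAGE_TB = {
    "DS2.XL": 2.0,
    "DS2.8XL": 16.0,
    "DC1.L": 0.16,
    "DC1.8XL": 2.56,
}


def nodes_needed(raw_tb: float, node_type: str,
                 compression_ratio: float = 3.0) -> int:
    """Estimate compute nodes needed to hold raw_tb of uncompressed data.

    compression_ratio defaults to 3.0, an assumption based on the deck's
    "3x on average" figure; real ratios depend on the data.
    """
    compressed_tb = raw_tb / compression_ratio
    return max(1, math.ceil(compressed_tb / NODE_STORAGE_TB[node_type]))


# Hypothetical workload: 30 TB of raw data.
for node_type in NODE_STORAGE_TB:
    print(node_type, nodes_needed(30, node_type))
```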
Amazon Redshift lets you analyze all your data
Price is nodes times hourly cost
No charge for leader node
3x data compression on average
Price includes 3 copies of data
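DS2 (HDD)
Pricing            | Price per hour for smallest single node | Effective annual price per TB compressed
-------------------+-----------------------------------------+-----------------------------------------
On-Demand          | $0.850                                  | $3,725
1 Year Reservation | $0.500                                  | $2,190
3 Year Reservation | $0.228                                  | $999

DC1 (SSD)
Pricing            | Price per hour for smallest single node | Effective annual price per TB compressed
-------------------+-----------------------------------------+-----------------------------------------
On-Demand          | $0.250                                  | $13,690
1 Year Reservation | $0.161                                  | $8,795
3 Year Reservation | $0.100                                  | $5,500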
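The effective annual prices appear to follow from the hourly price, the hours in a year, and the compressed storage of the smallest node (2 TB for DS2.XL, 160 GB for DC1.L). The deck does not state its formula, so the sketch below is an assumption that reproduces the published figures only approximately; the slide values look rounded. Function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def effective_annual_price_per_tb(hourly_price: float,
                                  node_storage_tb: float) -> float:
    """Annual cost of one node divided by its compressed storage in TB."""
    return hourly_price * HOURS_PER_YEAR / node_storage_tb


def cluster_hourly_cost(compute_nodes: int, hourly_price: float) -> float:
    """Price is nodes times hourly cost; the leader node is not charged."""
    return compute_nodes * hourly_price


# Smallest nodes per the deck: DS2.XL (2 TB) and DC1.L (0.16 TB).
ds2_on_demand = effective_annual_price_per_tb(0.850, 2.0)
ds2_one_year = effective_annual_price_per_tb(0.500, 2.0)
ds2_three_year = effective_annual_price_per_tb(0.228, 2.0)
dc1_on_demand = effective_annual_price_per_tb(0.250, 0.16)

print(round(ds2_on_demand), round(ds2_one_year),
      round(ds2_three_year), round(dc1_on_demand))
print(cluster_hourly_cost(3, 0.850))  # hypothetical 3-node DS2.XL cluster
```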
Amazon Redshift works with your analysis tools
JDBC/ODBC
Amazon Redshift
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Automatic backup
• Built-in security
Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery
Amazon Redshift has security built-in
• Load encrypted from S3
• SSL to secure data in transit; ECDHE perfect forward secrecy
• Encryption to secure data at rest
– All blocks on disks and in S3 encrypted
– Block key, cluster key, master key (AES-256)
– On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP
[Architecture diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything
ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
• With column storage, you only read the data you need
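The difference between the two layouts can be illustrated with the four-row table above. This is a toy in-memory sketch, not how Redshift stores data on disk; it only counts how many values each layout has to touch to total the Amount column. All function names are illustrative.

```python
# The four-row table from the slide.
ROWS = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]


def total_from_rows(rows):
    """Row layout: every column of every row is read to reach 'amount'."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)
        total += row["amount"]
    return total, values_read


def to_columns(rows):
    """Pivot the row-oriented table into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def total_from_columns(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = total_from_rows(ROWS)
col_total, col_reads = total_from_columns(to_columns(ROWS))
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same total; the column layout touches a quarter of the values in this four-column table.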
Amazon Redshift dramatically reduces I/O
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
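Several columns in the output above are assigned delta-style encodings. As a general illustration of the idea (not Redshift's on-disk format), delta encoding stores the first value and then only the difference between consecutive values, which stay small for sorted or slowly changing columns. The sample IDs below are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store differences between neighbors."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values


# Hypothetical, mostly sequential IDs: the deltas fit in far fewer bits
# than the original values would.
list_ids = [100001, 100002, 100003, 100005, 100008]
encoded = delta_encode(list_ids)
print(encoded)
print(delta_decode(encoded) == list_ids)
```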
Amazon Redshift dramatically reduces I/O
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
Amazon Redshift dramatically reduces I/O
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
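The block-skipping idea can be sketched with the three blocks shown above. This is a simplified illustration under stated assumptions: the values elided by "…" on the slide are left out, blocks are plain Python lists, and the function names are invented for the example.

```python
# Values visible on the slide; entries hidden behind "…" are omitted.
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]


def build_zone_map(blocks):
    """Record the (min, max) of each block."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Return indexes of blocks whose range overlaps [low, high]."""
    return [
        index
        for index, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


zone_map = build_zone_map(BLOCKS)
print(zone_map)
print(blocks_to_scan(zone_map, 400, 450))  # only the middle block overlaps
print(blocks_to_scan(zone_map, 330, 370))  # no block overlaps; nothing read
```

A range predicate that falls between blocks reads nothing at all, which is where sorted data benefits most.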
Amazon Redshift dramatically reduces I/O
• Use local storage for performance
• Maximize scan rates
• Automatic replication and continuous backup
• HDD and SSD platforms
Amazon Redshift @ Simulmedia
“Half the money I spend on advertising is wasted; the trouble is I don't know which half.”
- John Wanamaker
A data-centric approach to TV advertising
Targeted TV advertising that reaches 110 million households
Anonymous viewing data from millions of set-top boxes and smart TVs overlaid with third-party viewing data
Reinvested in our platform with Amazon Redshift
10–100x improvement in performance
Decreased time to release
Proliferation of experiments on the data
Business opportunity/capacity has increased exponentially; headcount for the team has remained stable
On-premises Hadoop/Hive cluster with >80 nodes storing 150 TB of data
HDFS -> S3
Freedom from replication factor
Separate archives and active data set
Scalable performance
Production data was optimal for MPP
[Chart: MPP cost per TB per year, on an axis from $0 to $175,000, comparing Amazon Redshift (HDD and SSD) with solutions A, B, C, and D]
Managed service
Continual upgrades
Automatic snapshotting
<1 sec to query 2 years of historical viewing data
N.B.: skinny fact table
Flexible data discovery period
Better understanding of data
Tuned facts and distributed dimensions
Production Amazon Redshift cluster with 3 nodes storing ~1.4 TB
Non-production Amazon Redshift cluster with 2 nodes storing ~8 TB
S3 data lake
Minor transformations during ingestion
Idempotent audit tables in Amazon Redshift
Star schema design
Decreased our infrastructure costs
Cleaned up our architecture
Operational complexity removed
Capacity planning eased
Demographics/Targeting/Forecasting: from ~1 hour to ~10 seconds
Measurement: from ~7–10 hours to ~5 minutes
SQL everywhere
Data science:
Improve forecasting
Improve optimizations
Improve measurement
Analytics:
Build new reports
Discover more about effective spots
Best practices
Learn the Amazon Redshift Management Console:
Set up queueing
Set up alerts
Track CPU utilization when debugging
Low concurrency (1–3 queries)
Alerts on disk usage
Query execution details
COPY/UNLOAD
Remember to analyze tables for planner
Take advantage of compression analysis
Use timestamp/date data types (add timezone to column name)
Use varchar
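A few of the practices above (COPY/UNLOAD, analyzing tables for the planner, compression analysis) map onto SQL statements that can be issued from any client. The helper below only builds the statement text as a minimal sketch; the bucket, IAM role ARN, and function names are hypothetical placeholders, the `listing` table is borrowed from the earlier compression example, and no quoting or escaping of inputs is attempted.

```python
def copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Bulk-load a table from files under an S3 prefix."""
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';"


def unload_statement(query: str, s3_path: str, iam_role: str) -> str:
    """Export a query result to S3 (the query must not contain quotes here)."""
    return f"UNLOAD ('{query}') TO '{s3_path}' IAM_ROLE '{iam_role}';"


def maintenance_statements(table: str) -> list:
    """Refresh planner statistics and report suggested column encodings."""
    return [f"ANALYZE {table};", f"ANALYZE COMPRESSION {table};"]


# Hypothetical placeholders, not values from the talk.
ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
BUCKET = "s3://example-bucket/listing/"

statements = [copy_statement("listing", BUCKET, ROLE)]
statements += maintenance_statements("listing")
statements.append(unload_statement("SELECT * FROM listing",
                                   "s3://example-bucket/export/listing_", ROLE))
for statement in statements:
    print(statement)
```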
Your Feedback is Important to AWS
Please complete the session evaluation. Tell us what you think!
NEW YORK