Amazon Redshift
Jeff Patti

DESCRIPTION

Amazon Redshift, and how it fits into Monetate's Architecture

TRANSCRIPT

Page 1: Amazon Redshift

Amazon Redshift
Jeff Patti

Page 2: Amazon Redshift

What is Redshift?

“Redshift is a fast, fully managed, petabyte-scale data warehouse service” - Amazon

With Redshift, Monetate is able to generate all of our analytics data for a day in ~2 hours, a process that consumes billions of rows and yields millions.

Page 3: Amazon Redshift

What isn’t Redshift?

It isn’t an OLTP database: a single-row insert takes several seconds, while an analytic scan over the same table comes back in well under a second.

warehouse=# insert into fact_page_view values
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4);
INSERT 0 1
Time: 4600.094 ms

warehouse=# select fact_time from fact_page_view
warehouse-# where fact_date = '2014-10-02';
      fact_time
---------------------
 2014-10-02 18:30:00
(1 row)
Time: 618.303 ms

Page 4: Amazon Redshift

Who am I?

Jeff Patti, [email protected]
Engineer at Monetate

Monetate was in Redshift’s beta in late 2012 and has been actively developing on it since. We’re hiring - monetate.com/jobs/

Page 5: Amazon Redshift

Leaving Hive For Redshift

Hive:
● Unusual failure modes
● Slower and pricier than Redshift, at least in our configuration
● Custom query language
○ Didn’t play nicely with our SQL libraries

Redshift:
● Fully Managed
● Performant & Scalable
● Excellent integration with other AWS offerings
● PostgreSQL interface
○ command line interface
○ libraries for PostgreSQL work against Redshift

Page 6: Amazon Redshift

Fully Managed
● Easy to deploy
● Easy to scale out
● Software updates - handled
● Hardware failures - taken care of
● Automatic backups - baked in

Page 12: Amazon Redshift

Automatic Backups
● Periodically taken as delta from prior backup
● Easy to create new cluster from backup, or overwrite existing cluster
● Queryable during recovery, after short delay
○ Preferentially recovers needed blocks to perform commands
● This is how Monetate keeps our development cluster in sync with production

Page 14: Amazon Redshift

Maintenance Window
● Required half hour window once a week for routine maintenance, such as software updates
● During this time the cluster is unresponsive
● You pick when it happens

Page 15: Amazon Redshift

Scaling Out

You: Change cluster size through the AWS console
AWS:
1. Existing cluster put into read-only state
2. New cluster caught up with existing cluster
3. Swapped during maintenance window, unless specified as immediate
a. Immediate swap causes temporary unavailability during canonical name record swap (a few minutes)

Page 16: Amazon Redshift

Monetate
● Core products are merchandising, web & email personalization, testing
● A/B & Multivariate testing to determine impact of experiments
● Involved with >20% of US ecommerce spend each holiday season for the past 3 years running

Page 17: Amazon Redshift

Monetate Data Collection

To compute analytics and reports on our clients’ experiments, we collect a lot of data:
● Billions of page views a week
● Billions of experiment views a week
● Millions of purchases a week
● etc.
This is where Redshift comes in handy.

Page 18: Amazon Redshift

Redshift In Monetate

(Diagram: Monetate is multi-region & multi-AZ in AWS; app servers send data to Amazon S3, which loads into Amazon Redshift for data warehousing, powering analytics & reporting for our clients.)

Page 19: Amazon Redshift

Under The Covers
● Fork of PostgreSQL 8.0.2, so we get nice things like
○ Common Table Expressions
○ Window Functions (example below)
● Column oriented database
● Clusters can have many machines
○ Each machine has many slices
○ Queries run in parallel on all slices
● Concurrent query support & memory limiting
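
For a flavor of both features, here is a minimal sketch (not from the deck) against the fact_url table defined on Page 22: a CTE computes daily per-account view counts, and a window function turns them into a running total.

-- CTE: one row per (account, day)
with daily_counts as (
    select account_id, fact_date, count(*) as views
    from fact_url
    group by account_id, fact_date
)
-- window function: running total per account; Redshift requires an
-- explicit frame clause for running aggregates
select account_id,
       fact_date,
       views,
       sum(views) over (partition by account_id
                        order by fact_date
                        rows unbounded preceding) as running_views
from daily_counts
order by account_id, fact_date;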

Page 20: Amazon Redshift

Instance Types

Page 21: Amazon Redshift

Query Concurrency

Page 22: Amazon Redshift

Example Redshift Table

CREATE TABLE fact_url (
    fact_date   DATE NOT NULL ENCODE lzo,
    account_id  INT NOT NULL ENCODE lzo,
    fact_time   TIMESTAMP NOT NULL ENCODE lzo,
    mid         BIGINT NOT NULL ENCODE lzo,
    uri         VARCHAR(2048) ENCODE lzo,
    referer_uri VARCHAR(2048) ENCODE lzo,
    PRIMARY KEY (account_id, fact_time, mid)
)
DISTKEY (mid)
SORTKEY (fact_date, account_id, fact_time, mid);

Page 23: Amazon Redshift

Per Column Compression
● Used to fit more rows in each 1MB block
● Trade off between CPU and IO
● Allows Redshift to read rows from disk faster
● Has to use more CPU to decompress data
● Our Redshift queries are IO bound
○ We use compression extensively (example below)
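
If you are unsure which encoding to pick per column, Redshift can suggest them: ANALYZE COMPRESSION samples an existing table and reports a recommended encoding for each column, without changing the table.

-- sample fact_url and report a suggested encoding per column
analyze compression fact_url;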

Page 24: Amazon Redshift

Constraints

“Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift.”

However, “If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.” [emphasis added]
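
To make that concrete, here is a hypothetical psql session against the fact_url table from Page 22: both rows share the declared primary key (account_id, fact_time, mid), and Redshift accepts the duplicate without complaint.

warehouse=# insert into fact_url values
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, '/home', null);
INSERT 0 1
warehouse=# insert into fact_url values
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, '/home', null);
INSERT 0 1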

Page 25: Amazon Redshift

Distribution Style

Controls how Redshift distributes rows
● Styles
○ Even - round robin rows (default)
○ Key - data with the same key goes to same slice
■ Based on a single column from the table
○ All - data is copied to all slices
■ Good for small tables (sketch below)
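
As a sketch (dim_account is a hypothetical table, not from the deck): a small dimension table is a good candidate for the All style, since every node holds a full copy and joins against it never need redistribution, while a large fact table like fact_url on Page 22 uses DISTKEY instead.

-- small dimension table, copied in full to every node
create table dim_account (
    account_id   int not null,
    account_name varchar(256)
)
diststyle all;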

Page 26: Amazon Redshift

DISTKEY impacts Joins

DS_DIST_NONE: No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.

DS_DIST_ALL_NONE: No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.

These two are very performant.

DS_DIST_INNER: The inner table is redistributed.

DS_BCAST_INNER: A copy of the entire inner table is broadcast to all the compute nodes.

DS_DIST_ALL_INNER: The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.

DS_DIST_BOTH: Both tables are redistributed.

Page 27: Amazon Redshift

Query Plan From Explain

-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
   Hash Cond: ("outer".venueid = "inner".venueid)
   -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
      Hash Cond: ("outer".eventid = "inner".eventid)
      -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
         Merge Cond: ("outer".listid = "inner".listid)
         -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
         -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)

Page 28: Amazon Redshift

Sort Key
● Data is stored on disk in sorted order
○ After being inserted into an empty table, or vacuumed
● Sort Key impacts vacuum performance
● Columnar data stored in 1MB blocks
○ min/max data stored as metadata
● Metadata used to improve query performance
○ Allows Redshift to skip unnecessary blocks (example below)
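
As a sketch of that block skipping: with fact_url's SORTKEY leading on fact_date (Page 22), a date-bounded query only reads blocks whose min/max fact_date metadata overlaps the predicate, leaving the rest of the table untouched.

-- only blocks whose fact_date range overlaps this week are scanned
select count(*)
from fact_url
where fact_date between '2014-10-01' and '2014-10-07';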

Page 29: Amazon Redshift

Sort Key Take 1

SORTKEY (account_id, fact_time, mid)
● As we added new facts, bad things started happening
● Resorting rows for vacuuming had to reorder almost all the rows :(
● This made vacuuming unreasonably slow, affecting how often we could vacuum and therefore query performance

(Diagram: on-disk blocks run account 1 time-ordered, account 2 time-ordered, ... account n time-ordered, with new facts for all accounts appended at the end.)

Page 30: Amazon Redshift

Sort Key Take 2

SORTKEY (fact_time, account_id, mid)
● Now our table is like an append-only log, but had poor query performance
● For many of our queries, we only look at one account at a time
● Redshift blocks are 1MB each; each spanned many accounts
● When querying a single account, had to read from disk and ignore many rows from other accounts

(Diagram: on-disk blocks run 00:00 account-ordered, 00:01 account-ordered, ... now account-ordered.)

Page 31: Amazon Redshift

Sort Key Take 3

SORTKEY (fact_date, account_id, fact_time, mid)
● Append-only log ✓
○ Cheap vacuuming ✓
● Single or few accounts per block ✓
○ Significantly improved query performance ✓

(Diagram: on-disk blocks run Jan 1st account-ordered, Jan 2nd account-ordered, ... today account-ordered.)

Page 32: Amazon Redshift

Redshift ⇔ S3

Redshift & S3 have excellent integration
● Unload from Redshift to S3 via UNLOAD
○ Each slice unloads separately to S3
○ We unload into a CSV format
● Load into Redshift from S3 via COPY
○ Applies all rows as inserts
○ Primary keys aren’t enforced by Redshift
■ Use a staging table to detect duplicate keys (see the sketch below)
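
A minimal sketch of that staging-table pattern (stage_venue is a hypothetical name; venue is the table from the UNLOAD/COPY examples on the following pages):

-- load into a temp staging table instead of the target
create temp table stage_venue (like venue);

copy stage_venue
from 's3://mybucket/tickit/venue/reload_manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';

-- insert only rows whose key is not already in venue
insert into venue
select s.*
from stage_venue s
left join venue v on v.venueid = s.venueid
where v.venueid is null;

drop table stage_venue;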

Page 33: Amazon Redshift

Redshift UNLOAD

unload ('select * from venue order by venueid')
to 's3://mybucket/tickit/venue/reload_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';

Page 34: Amazon Redshift

Redshift UNLOAD Tip

unload ('select * from venue order by venueid')
● The query passed to unload is single-quoted, which wreaks havoc with quotes around dates, e.g. fact_time <= '2014-10-02'
● Instead of escaping the quotes around the datetimes, use dollar quoting:
○ unload ($$ select * from venue order by venueid $$)

Page 35: Amazon Redshift

Redshift COPY

copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';

Page 37: Amazon Redshift

Questions?