Best Practices for Supercharging Cloud Analytics on Amazon Redshift
DESCRIPTION
In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, your methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, along with tips and tricks on designing the right information architecture, data models, and other tactical optimizations. To learn more, visit: http://www.snaplogic.com/redshift-trial
TRANSCRIPT
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic
May 2014
Featured Speakers
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift and RDS
• Cervello: Implementation Best Practices
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and compute nodes are linked by 10 GigE (HPC); each node is shown with 128GB RAM, 16TB disk, and 16 cores; ingestion, backup, and restore go through Amazon S3 / DynamoDB / SSH]
Amazon Redshift is priced to let you analyze all your data
• Number of nodes x cost per hr
• No charge for leader node
• No upfront costs
• Pay as you go
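DW1 (HDD)          | Price Per Hour for DW1.XL Single Node | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.850                                | $3,723
1 Year Reservation | $0.500                                | $2,190
3 Year Reservation | $0.228                                | $999

DW2 (SSD)          | Price Per Hour for DW2.L Single Node  | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.250                                | $13,688
1 Year Reservation | $0.161                                | $8,794
3 Year Reservation | $0.100                                | $5,498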
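The "effective annual price per TB" column can be sanity-checked with a few lines of arithmetic. The sketch below assumes a 8,760-hour year and single-node storage of 2 TB for DW1.XL and 160 GB for DW2.L (the minimum cluster sizes quoted on the architecture slide); the function name is illustrative. The on-demand rows and the DW1 reserved rows come out as shown on the slide, while the DW2 reserved rows differ slightly, presumably because the hourly rates shown are rounded.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years (assumption)

# Assumed single-node storage, taken from the minimum cluster sizes on the slide.
DW1_XL_TB = 2.0   # DW1.XL: 2 TB of HDD storage
DW2_L_TB = 0.16   # DW2.L: 160 GB of SSD storage


def effective_annual_price_per_tb(price_per_hour, node_storage_tb):
    """Annual cost of running one node, divided by the storage that node provides."""
    return price_per_hour * HOURS_PER_YEAR / node_storage_tb


# Hourly rates copied from the pricing table.
for label, rate in [("On-Demand", 0.850), ("1 Year", 0.500), ("3 Year", 0.228)]:
    print("DW1.XL", label, round(effective_annual_price_per_tb(rate, DW1_XL_TB)))

print("DW2.L On-Demand", effective_annual_price_per_tb(0.250, DW2_L_TB))
```

For a multi-node cluster the hourly bill is simply the node count times the per-node rate, since the leader node is not charged.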
Amazon Redshift Feature Delivery
Improved Concurrency
Before: 15
After: 50
COPY from JSON
{ "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON AS 's3://mybucket/venue_jsonpaths.json';
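To see how the jsonpaths file lines up with table columns, the sketch below resolves each path expression against one sample record and produces the row that would be loaded, in column order. It is an illustration only, not Redshift's implementation: the sample record, the `resolve` and `to_row` helpers, and the tiny bracket-notation parser are all assumptions made for the example.

```python
import re

# The jsonpaths document from the slide: one expression per target column, in order.
JSONPATHS = {
    "jsonpaths": [
        "$['id']",
        "$['name']",
        "$['location'][0]",
        "$['location'][1]",
        "$['seats']",
    ]
}

# Matches one step of bracket notation: ['key'] or [index].
_STEP = re.compile(r"\['([^']*)'\]|\[(\d+)\]")


def resolve(path, record):
    """Walk a bracket-notation path such as $['location'][0] through a parsed record."""
    if not path.startswith("$"):
        raise ValueError(f"unsupported path: {path!r}")
    value = record
    for key, index in _STEP.findall(path[1:]):
        value = value[int(index)] if index else value[key]
    return value


def to_row(record, jsonpaths=JSONPATHS):
    """Return the column values for one record, in jsonpaths order."""
    return [resolve(path, record) for path in jsonpaths["jsonpaths"]]


# Hypothetical venue record; the real venue.json is not shown in the slides.
sample_record = {
    "id": 1,
    "name": "Example Arena",
    "location": ["Springfield", "IL"],
    "seats": 5000,
}
print(to_row(sample_record))
```

The point of the mapping is that a nested array such as `location` can be flattened into two separate columns at load time.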
COPY from Amazon Elastic MapReduce
COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
[Diagram: Amazon EMR → Amazon Redshift]
REGEXP_SUBSTR()
select email, regexp_substr(email,'@[^.]*') from users limit 5;
email             | regexp_substr
------------------+---------------------
[email protected] | @nonnisiAenean
[email protected] | @lacusUtnec
[email protected] | @semperpretiumneque
[email protected] | @tristiquealiquet
[email protected] | @sodalesat
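The pattern `@[^.]*` matches the `@` sign and everything after it up to, but not including, the first dot. A rough Python equivalent is sketched below; the helper name and the sample addresses are made up for illustration, and the helper returns an empty string when nothing matches.

```python
import re

# Same pattern as in the query: '@' followed by any run of non-dot characters.
DOMAIN_PREFIX = re.compile(r"@[^.]*")


def regexp_substr(text, pattern=DOMAIN_PREFIX):
    """Return the first substring matching the pattern, or '' if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else ""


# Hypothetical addresses, not the ones from the slide.
for email in ["alice@example.com", "bob@mail.example.org", "no-at-sign"]:
    print(email, "|", regexp_substr(email))
```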
Resize Progress
• Progress indicator in console
• New API call
ECDHE cipher suites for perfect forward secrecy over SSL
ECDHE-RSA & ECDHE-ECDSA cipher suites supported
Amazon Redshift integrates with multiple data sources
Amazon S3 Amazon EMR
Amazon Redshift
DynamoDB
Amazon RDS
Corporate Datacenter
The SnapLogic Platform for Elastic Integration Powering Analytics, Apps and APIs
Data Applications APIs
Why SnapLogic?
Multi-Point Orchestration
• SnapStore: 160+ Prebuilt Snaps
• Orchestration & Workflow
Modern Platform
• Elastic, Scale-out Architecture
• Hybrid: Cloud to Cloud and Cloud to Ground Use Cases
Faster Integration
• Easily Design, Monitor, Manage
• Deploy in Days not Months
Multi-Point: Comprehensive Connectivity
Snap your Apps: 160+ pre-built integrations
Software-defined Integration
Metadata
Data
• Streams: No data is stored/cached
• Secure: 100% standards-based
• Elastic: Scales out & handles data, app, API integration use cases
Hybrid Scale-out Architecture Respects Data Gravity
International Hotel Chain Reservation Data Mgmt.
PAST
• 126 TB of hotel reservation data
• Prohibitive cost-per-query for analytics
• Unacceptable performance
PRESENT
• FedEx’ed 126 TB of data to load into Amazon Redshift
• Now run a daily sync of data changes (100-150GB) between on-premises and cloud with SnapLogic
• Enrich analytics with Twitter and Travelocity data
• Improved cost-per-query and performance
Mid-sized Pharma Creates Cloud Data Mart
Cloud to On-prem Snaplex
REST
Cloud to Cloud Snaplex
Metadata
Data
• Consolidate DBs (Customer, Address, and Order) and SFDC (Contact and Account) into Redshift
• MicroStrategy is the visualization layer
DEMO
Cervello Helps Clients Win With Data
Advise, Implement, Support
Enterprise Performance Management (Finance)
Customer Relationship Management (Sales & Marketing)
Data Management
Custom Development
Business Intelligence & Analytics (IT)
• We have offices in Boston, New York, Dallas and the UK
• Offshore development and support teams in Russia and India
• We partner with the leading on-premises and cloud technology companies
Implementation Case Study
• Hospitality industry analytics
– Detailed transactional data
– Weekly / monthly / yearly trend analysis
– Began with single-node cluster, adding nodes as data volumes grow
[Diagram: Source Data → ETL → Redshift → Analytics]
Best Practice #1: Choose The Right Pattern
Requirements
• Collect external data loads before merging with existing data
• Maintain history of cleansed and standardized source data
• Use data structures optimized for analytics
– Dimension and fact tables for analytics
– Aggregate tables
Design
• Staging tables
• History tables
• Star schema data warehouse
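One way to picture the staging, history, and warehouse layers working together is the delete-then-insert merge sketched below. It uses SQLite in memory purely as a stand-in so the example runs anywhere; the table and column names are hypothetical, and the SQL a real Redshift load would use may differ.

```python
import sqlite3


def merge_staging_into_fact(conn):
    """Archive the staged batch, then replace matching fact rows and add new ones."""
    with conn:  # single transaction: commits on success, rolls back on error
        # Keep a history of every staged row before it is merged.
        conn.execute(
            "INSERT INTO reservation_history (reservation_id, amount) "
            "SELECT reservation_id, amount FROM reservation_staging"
        )
        # Remove fact rows that the batch supersedes.
        conn.execute(
            "DELETE FROM reservation_fact WHERE reservation_id IN "
            "(SELECT reservation_id FROM reservation_staging)"
        )
        # Insert the staged rows, both updated and brand-new ones.
        conn.execute(
            "INSERT INTO reservation_fact (reservation_id, amount) "
            "SELECT reservation_id, amount FROM reservation_staging"
        )
        # Empty the staging table for the next load.
        conn.execute("DELETE FROM reservation_staging")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reservation_staging (reservation_id INTEGER, amount REAL);
    CREATE TABLE reservation_history (reservation_id INTEGER, amount REAL);
    CREATE TABLE reservation_fact (reservation_id INTEGER, amount REAL);
    INSERT INTO reservation_fact VALUES (1, 100.0), (2, 200.0);
    INSERT INTO reservation_staging VALUES (2, 250.0), (3, 300.0);
    """
)
merge_staging_into_fact(conn)
fact_rows = conn.execute(
    "SELECT reservation_id, amount FROM reservation_fact ORDER BY reservation_id"
).fetchall()
print(fact_rows)
```

Landing external loads in a staging table first means a bad batch can be inspected or discarded before it touches the tables that analysts query.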
Best Practice #2: Select the Right Node Type
Early Stages
• Performance was good with initial volumes and small data sets on single node
• Evaluated dense storage (dw1) and dense compute (dw2) nodes
• More opportunity to optimize design as volumes grew
Mature Stage
• Increased nodes to handle larger volumes
– Solution leverages dense storage (dw1) nodes
– Expected to stabilize between 10-20TB
• Have also seen smaller volumes that work really well in dense compute (dw2) nodes
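A back-of-the-envelope node count for the 10-20TB range can be sketched as follows. It assumes 2 TB of storage per DW1.XL node (the minimum DW1 cluster size quoted earlier) and counts raw capacity only, ignoring compression, working space, and growth headroom, so treat the result as a lower bound rather than a sizing recommendation. The function name is illustrative.

```python
DW1_XL_GB = 2000  # assumed storage per DW1.XL node (2 TB)


def nodes_needed(data_gb, node_storage_gb):
    """Smallest whole number of nodes whose combined storage holds data_gb."""
    if data_gb <= 0 or node_storage_gb <= 0:
        raise ValueError("sizes must be positive")
    return -(-data_gb // node_storage_gb)  # ceiling division on integers


for volume_tb in (10, 20):
    print(volume_tb, "TB ->", nodes_needed(volume_tb * 1000, DW1_XL_GB), "DW1.XL nodes")
```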
Best Practice #3: Leverage MPP
Goals
• Spread data evenly across nodes while also optimizing join performance
• Distribution key and sort keys are primary considerations
Approach
• Initial fact table distribution key caused skewed data
• Changed to dimension foreign key with better distribution for 40%+ improvement in query times
• Surrogate keys on dimension tables
– Primary key
– Sort key and distribution key OR distribute to all nodes
– Sort on foreign keys in fact tables
[Diagram: Leader Node connected to Compute Node 1, Compute Node 2, Compute Node 3, ... Compute Node n]
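The effect of a poorly chosen distribution key can be illustrated with a small simulation. The sketch below hashes key values onto a fixed number of slices and reports the ratio of the fullest slice to the average; SHA-256 is only a stand-in for whatever hash Redshift uses internally, and the slice count and both key populations are invented for the example.

```python
import hashlib
from collections import Counter

NUM_SLICES = 4  # hypothetical number of slices in the cluster


def slice_for(key, num_slices):
    """Map a key value to a slice using a stand-in hash function."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def skew_ratio(keys, num_slices):
    """Rows on the fullest slice divided by the average rows per slice (1.0 is ideal)."""
    counts = Counter(slice_for(key, num_slices) for key in keys)
    rows_per_slice = [counts.get(s, 0) for s in range(num_slices)]
    return max(rows_per_slice) / (sum(rows_per_slice) / num_slices)


# A key where 90% of the rows share one value, versus a high-cardinality key.
skewed_keys = [0] * 9000 + list(range(1, 1001))
even_keys = list(range(10000))

print("skewed key:", round(skew_ratio(skewed_keys, NUM_SLICES), 2))
print("even key:", round(skew_ratio(even_keys, NUM_SLICES), 2))
```

A ratio well above 1.0 means one slice does most of the work while the others sit idle, which is the kind of skew the initial fact table key produced.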
Best Practice #4: Use Columnar Compression
Goals
• Reduce I/O workload by minimizing size of data stored on disk
Approach
• Started with compression settings based on general data types
– VARCHAR to TEXT255, INTEGER to MOSTLY16, etc.
– Iterate using ANALYZE COMPRESSION
• Redshift applies automatic compression during COPY
– Staging tables
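Whether an encoding like MOSTLY16 suits an INTEGER column depends on how many values actually fit in 16 bits. The sketch below profiles a sample of values and gives a rough size estimate, assuming values that fit are stored in 2 bytes and the rest in their raw 4-byte form. It ignores block and metadata overhead and is no substitute for running ANALYZE COMPRESSION; the helper names and sample data are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def count_fitting_int16(values):
    """Number of values that fit in a signed 16-bit integer."""
    return sum(1 for v in values if INT16_MIN <= v <= INT16_MAX)


def fraction_fitting_int16(values):
    """Share of values that a 16-bit 'mostly' encoding could store compactly."""
    if not values:
        raise ValueError("need at least one value")
    return count_fitting_int16(values) / len(values)


def estimated_bytes_mostly16(values, raw_bytes=4):
    """Rough size: 2 bytes per fitting value, raw_bytes for each exception."""
    fits = count_fitting_int16(values)
    return 2 * fits + raw_bytes * (len(values) - fits)


# Hypothetical column sample: mostly small numbers with a few large outliers.
sample = list(range(950)) + [1_000_000] * 50
print(fraction_fitting_int16(sample))
print(estimated_bytes_mostly16(sample), "bytes vs", 4 * len(sample), "raw")
```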
Best Practice #5: Load and Manage Data
• ETL and ELT
– ETL: First set of processes prepares data for analytics: business logic, standardization, validation
– ELT: Second set of processes loads data into Redshift and transforms it into analytical structures
• Data management
– Enforce constraints within ETL processes
– Analyze after loads to update statistics
– Vacuum after large loads to existing tables, updates and deletes
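Because Redshift treats primary key and foreign key constraints as informational and does not enforce them, uniqueness has to be checked before the load. A minimal pre-load check might look like the sketch below; the batch contents, column name, and helper name are assumptions for illustration.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the key values that appear more than once in a batch, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical batch about to be loaded.
batch = [
    {"reservation_id": 101, "amount": 120.0},
    {"reservation_id": 102, "amount": 80.0},
    {"reservation_id": 101, "amount": 125.0},
]

duplicates = find_duplicate_keys(batch, "reservation_id")
if duplicates:
    print("rejecting batch, duplicate keys:", duplicates)
```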
Bringing it All Together
• Analytic queries
– Minimize number of query columns to improve performance
– Most queries use SUM or COUNT
– Leveraging aggregate tables for monthly dashboards
• Explain long running queries to help optimize design
– Sorting / merging within nodes and merging at leader node
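The idea behind an aggregate table is to compute the SUM and COUNT per month once, so that dashboards read a handful of rows instead of scanning every transaction. A small sketch of that rollup, with made-up transactions and hypothetical field names, is shown below; in the warehouse itself this would be a table populated by a GROUP BY query.

```python
from collections import defaultdict


def monthly_aggregates(transactions):
    """Roll (ISO date, amount) pairs up to one SUM and COUNT per YYYY-MM month."""
    totals = defaultdict(lambda: {"total_amount": 0.0, "transaction_count": 0})
    for date_str, amount in transactions:
        month = date_str[:7]  # 'YYYY-MM' prefix of an ISO date
        totals[month]["total_amount"] += amount
        totals[month]["transaction_count"] += 1
    return dict(totals)


# Hypothetical detail rows.
transactions = [
    ("2014-04-03", 100.0),
    ("2014-04-17", 50.0),
    ("2014-05-01", 75.0),
]
for month, stats in sorted(monthly_aggregates(transactions).items()):
    print(month, stats)
```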
Learn more…
1. Try out the SnapLogic Free Trial for Amazon Redshift: http://snaplogic.com/redshift-trial
2. Learn more about Amazon Redshift at: http://aws.amazon.com/redshift
3. Learn more about Cervello at: http://mycervello.com/