Benefits and Best Practices in Using Amazon Redshift
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daniel Bento, Arquiteto de Soluções, AWS
Novembro 2016 | São Paulo, Brasil
Workshop: Amazon Redshift
Agenda
• Introduction
• Architecture and Use Cases
• Data Loading and Migration
• Table and Schema Design
• Operations
We start with the customer… and innovate
Managing databases is painful & difficult
SQL DBs do not perform well at scale
Hadoop is difficult to deploy and manage
DWs are complex, costly, and slow
Commercial DBs are punitive & expensive
Streaming data is difficult to capture & analyze
BI Tools are expensive and hard to manage
✓ Amazon RDS
✓ Amazon DynamoDB
✓ Amazon EMR
✓ Amazon Redshift
✓ Amazon Aurora
✓ Amazon Kinesis
✓ Amazon QuickSight
Customers told us… We created…
AWS Big Data Portfolio
Collect: Kinesis | Kinesis Firehose (new) | Database Migration (new) | Import Export | Direct Connect
Store: S3 | Glacier | DynamoDB | RDS, Aurora
Analyze: EMR | EC2 | Redshift | Machine Learning | ElasticSearch | CloudSearch | Data Pipeline | QuickSight (new) | SQL over Streams (new)
Global Footprint
14 Regions; 33 Availability Zones; 54 Edge Locations
Redshift
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Free Tier – 2 months
Amazon Redshift
a lot faster | a lot simpler | a lot cheaper
The legacy view of data warehousing ...
Multi-year commitment
Multi-year deployments
Multi-million dollar deals
… Leads to dark data
This is a narrow view
Small companies also have big data
(mobile, social, gaming, adtech, IoT)
Long cycles, high costs, administrative complexity all stifle innovation
[Chart: total enterprise data vs. data actually in the warehouse]
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools, Hadoop, Machine Learning, Streaming
Analysis in-line with process flows
Pay as you go, grow as you need
Managed availability & DR
Enterprise Big Data SaaS
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4 2015
Selected Amazon Redshift Customers
NTT Docomo | Telecom FINRA | Financial Svcs Philips | Healthcare Yelp | Technology NASDAQ | Financial Svcs
The Weather Company | Media Nokia | Telecom Pinterest | Technology Foursquare | Technology Coursera | Education
Coinbase | Bitcoin Amazon | E-Commerce Etix | Entertainment Spuul | Entertainment Vivaki | Ad Tech
Z2 | Gaming Neustar | Ad Tech SoundCloud | Technology BeachMint | E-Commerce Civis | Technology
Amazon Redshift Security
Petabyte-Scale Data Warehousing Service
Amazon Redshift Architecture
Leader Node: simple SQL endpoint; stores metadata; optimizes the query plan; coordinates query execution
Compute Nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed). DC1: SSD; scale from 160 GB to 326 TB. DS1/DS2: HDD; scale from 2 TB to 2 PB.
Ingestion / Backup / Restore
JDBC/ODBC
10 GigE (HPC)
Benefit #1: Amazon Redshift is fast
Dramatically less I/O:
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
analyze compression listing;

Table   | Column         | Encoding
--------+----------------+---------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
[Diagram: column values stored in blocks (10, 13, 14, 26, … 959); a zone map records the MIN and MAX value of each block (10–324, 375–623, 637–959)]
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table (zone maps per block):
• MIN: 01-JUNE-2016, MAX: 20-JUNE-2016
• MIN: 08-JUNE-2016, MAX: 30-JUNE-2016
• MIN: 12-JUNE-2016, MAX: 20-JUNE-2016
• MIN: 02-JUNE-2016, MAX: 25-JUNE-2016
Sorted by date:
• MIN: 01-JUNE-2016, MAX: 06-JUNE-2016
• MIN: 07-JUNE-2016, MAX: 12-JUNE-2016
• MIN: 13-JUNE-2016, MAX: 18-JUNE-2016
• MIN: 19-JUNE-2016, MAX: 24-JUNE-2016
In the unsorted table, every block's range may contain 09-JUNE-2016 and must be scanned; in the sorted table, only one block qualifies.
Benefit #1: Amazon Redshift is fast — Sort Keys and Zone Maps
Benefit #1: Amazon Redshift is fast
Parallel and distributed: query, load, export, backup, restore, resize
ID | Name table (1 John Smith, 2 Jane Jones, 3 Peter Black, 4 Pat Partridge, 5 Sarah Cyan, 6 Brian Snail) distributed across slices:
• Slice 1: 1 John Smith, 4 Pat Partridge
• Slice 2: 2 Jane Jones, 5 Sarah Cyan
• Slice 3: 3 Peter Black, 6 Brian Snail
Benefit #1: Amazon Redshift is fast — Distribution Keys
Benefit #1: Amazon Redshift is fast
H/W optimized for I/O intensive workloads
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Example: our new Dense Storage (HDD) instance type improved memory 2x, compute 2x, and disk throughput 1.5x — at the same cost as our prior generation!
Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD) — price per hour for a DS2.XL single node | Effective annual price per TB (compressed)
• On-Demand: $0.850 | $3,725
• 1-Year Reservation: $0.500 | $2,190
• 3-Year Reservation: $0.228 | $999
DC1 (SSD) — price per hour for a DC1.L single node | Effective annual price per TB (compressed)
• On-Demand: $0.250 | $13,690
• 1-Year Reservation: $0.161 | $8,795
• 3-Year Reservation: $0.100 | $5,500
Pricing is simple:
• Number of nodes × price/hour
• No charge for the leader node
• No upfront costs
• Pay as you go
Amazon Redshift lets you start small and grow big
Dense Storage (DS2.XL): 2 TB HDD, 31 GB RAM, 2 slices/4 cores
• Single node (2 TB)
• Cluster: 2–32 nodes (4 TB – 64 TB)
Dense Storage (DS2.8XL): 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
• Cluster: 2–128 nodes (32 TB – 2 PB)
Note: Nodes not to scale
Benefit #2: Amazon Redshift is inexpensive
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to S3
Continuous and incremental backups across regions
Streaming restore
[Diagram: backups replicated to Amazon S3 in Region 1 and Region 2]
Benefit #3: Amazon Redshift is fully managed
[Diagram: backups replicated to Amazon S3 in Region 1 and Region 2]
Fault tolerance:
• Disk failures
• Node failures
• Network failures
• Availability Zone/Region-level disasters
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest
  • All blocks on disk & in Amazon S3 encrypted
  • Block key, cluster key, master key (AES-256)
  • On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
[Diagram: customer VPC and internal VPC; JDBC/ODBC access; 10 GigE (HPC) interconnect; ingestion/backup/restore]
Benefit #5: We innovate quickly
2013
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL functions: most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-region snapshot copy
• Audit features, cursor support, 500 concurrent client connections
• EIP address for VPC clusters
2014
• System tables for query tuning
• Dense Compute nodes
• Gzip & Lzop; JSON, RegEx, cursors
• EMR data loading & bootstrap action with the COPY command; WLM concurrency limit raised to 50; support for ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch alarm for disk usage
• AES 128-bit encryption; UTF-16; KMS integration
• EU (Frankfurt) and GovCloud Regions
• S3 server-side encryption support for UNLOAD
• Tagging support for cost allocation
2015
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; query visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage nodes (DS2) with better RAM and CPU
• New reserved-node options: No, Partial & All Upfront
• Cross-region backups for KMS-encrypted clusters
• Scalar UDFs in Python
• AVRO ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify cluster dynamically
• Tag-based permissions and BZIP2
2016
• WLM queue hopping for timed-out queries
• Append rows & export to BZIP2
• Lambda for clusters in VPC; data schema conversion support from the ML console
• US West (N. California) Region
100+ new features added since launch; a release every two weeks; automatic patching
Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning
• Data science
Amazon ML
Benefit #7: Amazon Redshift has a large ecosystem
Data Integration | Systems Integrators | Business Intelligence
Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon Redshift
Amazon Kinesis
Machine Learning
Data Pipeline
CloudSearch
Mobile Analytics
Use cases
Analyzing Twitter Firehose
Amazon Redshift: starts at $0.25/hour
EC2: starts at $0.02/hour
S3: $0.030/GB-month
Amazon Glacier: $0.010/GB-month
Amazon Kinesis: $0.015/shard-hour (1 MB/s in; 2 MB/s out); $0.028/million PUTs
Analyzing Twitter Firehose
500MM tweets/day ≈ 5,800 tweets/sec
At ~2 KB per tweet, that is ~12 MB/sec (~1 TB/day)
Kinesis pricing: $0.015/hour per shard, $0.028/million PUTs
Amazon Kinesis cost: 12 shards × $0.015 + ~21 million PUTs/hour × $0.028/million ≈ $0.765/hour
Amazon Redshift cost: $0.850/hour (for a 2 TB node)
S3 cost: $1.28/hour (no compression)
Total: $2.895/hour
Data warehouses can be
inexpensive and
powerful
Use only the services you need
Scale only the services you need
Pay for what you use
~40% discount with 1 year commitment
~70% discounts with 3 year commitment
Data warehouses can be
inexpensive and
powerful
Amazon.com – Weblog analysis
Web log analysis for Amazon.com: 1 PB+ workload, 2 TB/day, growing 67% YoY. Largest table: 400 TB.
Want to understand customer behavior
Query 15 months of data (1 PB) in 14 minutes
Load 5B rows in 10 minutes
21B rows joined with 10B rows: from 3 days (Hive) to 2 hours
Load pipeline: from 90 hours (Oracle) to 8 hours
64 clusters
800 total nodes
13PB provisioned storage
2 DBAs
Data warehouses can be
fast and
simple
Sushiro – Real-time streaming from IoT & analysis
Sushiro – Real-time streaming & analysis: real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift
380 stores stream live data from Sushi plates
Inventory information combined with consumption information near real-time
Forecast demand by store, minimize food waste, and improve efficiencies
Big data does not mean batch
Can be streamed in
Can be processed in near real time
Can be used to respond quickly to requests
You can mix and match
On premises and cloud
Custom development and managed services
Infrastructure with managed scaling, security
Data warehouses can support
real-time data
Amazon Redshift Data Loading and Migration
Petabyte-Scale Data Warehousing Service
Data Loading Process
Data Source → Extraction → Transformation → Loading → Amazon Redshift (target). This section focuses on the loading step.
Amazon Redshift Loading Data Overview
AWS CloudCorporate Data center
AmazonDynamoDB
Amazon S3
Data Volume
Amazon Elastic MapReduce
Amazon RDS
Amazon Redshift
Amazon Glacier
logs / files
Source DBs
VPN Connection
AWS Direct Connect
S3 Multipart Upload
AWS Import/ Export
EC2 or on-premises (using SSH)
Database Migration Service
Kinesis
AWS Lambda
AWS Data Pipeline
Uploading Files to Amazon S3
[Diagram: Client.txt in the corporate data center is split into Client.txt.1–Client.txt.4 and uploaded to the 'mydata' bucket in the same region as the Amazon Redshift cluster]
• Ensure that your data resides in the same region as your Redshift clusters
• Split the data into multiple files to facilitate parallel processing
• Files should be individually compressed using GZIP or LZOP
• Optionally, you can encrypt your data using Amazon S3 server-side or client-side encryption
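The guidance above can be sketched as a COPY over split, compressed files; this is a minimal sketch assuming a hypothetical customer table and that the four split files were gzipped as client.txt.1.gz … client.txt.4.gz (the credentials are placeholders):

```sql
-- Loads every object whose key begins with 'client.txt.' in parallel,
-- one file per slice; GZIP tells COPY the inputs are compressed.
copy customer
from 's3://mydata/client.txt.'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
gzip
delimiter '|';
```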
Loading Data From Amazon S3
• Preparing input data files
• Uploading files to Amazon S3
• Using COPY to load data from Amazon S3
Splitting Data Files
[Diagram: Client.txt.1–Client.txt.4 load in parallel, one file per slice, across 2 XL compute nodes (Node 0: slices 0–1; Node 1: slices 0–1)]
copy customer
from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';
• Use the COPY command
• Each slice can load one file at a time
• A single input file means only one slice is ingesting data
• Instead of 100 MB/s, you're only getting 6.25 MB/s
Loading – Use multiple input files to maximize throughput
• Use the COPY command
• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
Loading – Use multiple input files to maximize throughput
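When you need exact control over which files a load consumes, the MANIFEST option (mentioned in the 2013 feature list) pins the input set; a sketch assuming a hypothetical orders table and manifest key:

```sql
-- s3://mydata/orders.manifest is a JSON file such as:
-- {"entries": [
--   {"url": "s3://mydata/orders/part-01.gz", "mandatory": true},
--   {"url": "s3://mydata/orders/part-02.gz", "mandatory": true}
-- ]}
copy orders
from 's3://mydata/orders.manifest'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
gzip
delimiter '|'
manifest;
```

With mandatory set to true, the COPY fails rather than silently skipping a missing file.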
• Start your first migration in 10 minutes or less
• Keep your applications running during the migration
• Move data to the same database engine or to a different one
AWS Database Migration Service
CustomerPremises
Application Users
AWS
Internet
VPN
1. Start a replication instance
2. Connect to the source and target databases
3. Select tables, schemas, or databases
4. Let AWS Database Migration Service create tables, load data, and keep them in sync
5. Switch applications over to the target at your convenience
AWS DMS - Keep your apps running during the migration
AWS Database Migration Service
Loading Data from an Amazon DynamoDB Table
Differences between Amazon DynamoDB and Amazon Redshift:
Table names:
• DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash); case-sensitive
• Redshift: limited to 127 characters; can't contain dots or dashes; not case-sensitive; can't conflict with any Amazon Redshift reserved words
NULL:
• DynamoDB does not support the SQL concept of NULL
• You must specify how Amazon Redshift interprets empty or blank attribute values from Amazon DynamoDB
Data types:
• STRING and NUMBER data types are supported
• BINARY and SET data types are not supported
Loading Data from Amazon Elastic MapReduce
Load data from Amazon EMR in parallel using COPY. Specify the Amazon EMR cluster ID and the HDFS file path/name. The Amazon EMR cluster must keep running until the COPY completes.
copy sales
from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
Loading Data Using Lambda
The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
Blog post: https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader
GitHub: http://github.com/awslabs/aws-lambda-redshift-loader
Remote Loading Using SSH
The Redshift COPY command can reach out to remote hosts (EC2 or on-premises) to load data over a secure shell (SSH) connection.
To perform a remote load:
• Add the cluster's public key to the remote host's authorized-keys file
• Configure the remote host to accept connections from all cluster IP addresses
• Create a manifest file in JSON format and upload it to an S3 bucket
• Issue a COPY command that references the manifest file
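The final step above can be sketched as follows, assuming a hypothetical sales table and manifest key (the SSH manifest entries in the comment follow the documented endpoint/command/mandatory/username schema):

```sql
-- s3://mydata/ssh.manifest contains entries such as:
-- {"entries": [{"endpoint": "host.example.com",
--               "command": "cat /var/log/sales.txt",
--               "mandatory": true,
--               "username": "loader"}]}
copy sales
from 's3://mydata/ssh.manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
ssh;
```

The SSH keyword tells COPY to treat the manifest as a list of remote hosts and commands rather than S3 data files.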
Vacuuming Tables
Sample rows:
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
Stored column by column:
Column 0: 1,2,3,4
Column 1: RFK,JFK,LBJ,GWB
Column 2: 900 Columbus, 800 Washington, 700 Foxborough, 600 Kansas
Amazon Redshift serializes all of the values of a column together.
Vacuuming Tables
delete from customer where column_0 = 3;
After the delete, the row's values are only marked as deleted (x) in each column; the space is not yet reclaimed:
Column 0: 1,2,x,4
Column 1: RFK,JFK,x,GWB
Column 2: 900 Columbus, 800 Washington, x, 600 Kansas
Vacuuming Tables
VACUUM customer;
After VACUUM, the marked values are removed and the columns are re-packed:
Column 0: 1,2,4
Column 1: RFK,JFK,GWB
Column 2: 900 Columbus, 800 Washington, 600 Kansas
Redshift does not automatically reclaim and reuse the space that is freed when you delete or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance and increases available storage.
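A sketch of the common VACUUM variants, using the customer table from the example above (all three options exist in Redshift; FULL is the default behavior):

```sql
-- Reclaim deleted space and fully re-sort the table (the default).
VACUUM FULL customer;

-- Reclaim deleted space without re-sorting.
VACUUM DELETE ONLY customer;

-- Re-sort rows without reclaiming deleted space.
VACUUM SORT ONLY customer;
```

DELETE ONLY and SORT ONLY let you pay only for the half of the work your workload actually needs.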
Analyzing Tables
The ANALYZE command can target:
• The entire current database
• A single table
• One or more specific columns in a single table
The ANALYZE command obtains a sample of rows, performs some calculations, and saves the resulting column statistics.
You do not need to analyze all columns in all tables regularly. Analyze the columns that are frequently used in the following:
• Sorting and grouping operations
• Joins
• Query predicates
To maintain current statistics for tables:
• Run the ANALYZE command before running queries
• Run the ANALYZE command against the database routinely at the end of every regular load or update cycle
• Run the ANALYZE command against any new tables
Statistics
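The three ANALYZE scopes above can be sketched as follows (the table and column names are placeholders):

```sql
-- The entire current database
ANALYZE;

-- A single table
ANALYZE customer;

-- Specific columns in a single table (e.g., join and predicate columns)
ANALYZE customer (custkey, country);
```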
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE operations against the same table:
Session A (Transaction 1): copy customer from 's3://mydata/client.txt'…;
Session B (Transaction 2): copy customer from 's3://mydata/client.txt'…;
Transaction 1 takes the write lock on the CUSTOMER table; transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE operations against the same table. Both sessions run:
begin;
delete one row from CUSTOMER;
copy …;
select count(*) from CUSTOMER;
end;
Transaction 1 takes the write lock on the CUSTOMER table; transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Potential deadlock situation for concurrent write transactions:
Session A (Transaction 1):
begin;
delete 3000 rows from CUSTOMER;
copy …;
delete 5000 rows from PARTS;
end;
Session B (Transaction 2):
begin;
delete 3000 rows from PARTS;
copy …;
delete 5000 rows from CUSTOMER;
end;
Transaction 1 takes the write lock on CUSTOMER while transaction 2 takes the write lock on PARTS; each then waits for the other's lock, and neither can proceed.
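A common fix for the deadlock above is to touch tables in the same order in every transaction; a sketch assuming the CUSTOMER and PARTS tables from the example (the predicate, bucket keys, and credentials are placeholders):

```sql
-- Both sessions acquire locks in the same order (CUSTOMER first, then
-- PARTS), so the second session simply waits instead of deadlocking.
begin;
delete from customer where load_batch = 42;
copy customer from 's3://mydata/customer.txt'
  credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
  delimiter '|';
delete from parts where load_batch = 42;
copy parts from 's3://mydata/parts.txt'
  credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
  delimiter '|';
end;
```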
Amazon Redshift Table and Schema Design
Goals of distribution:
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations
Distribution styles:
• Distribution key: same key to the same location (rows with the same distribution-key value land on the same slice)
• All: full table data on the first slice of every node
• Even: round-robin distribution across all slices
Choosing a distribution style
Key:
• Large FACT tables
• Rapidly changing tables used in joins
• Localize columns used within aggregations
All:
• Slowly changing data
• Reasonable size (i.e., a few million rows, but not hundreds of millions)
• No common distribution key for frequent joins
• Typical use case: a joined dimension table without a common distribution key
Even:
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
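The three styles above can be sketched in DDL; the tables and columns are hypothetical examples, not part of the workshop's dataset:

```sql
-- KEY: large fact table, co-located with tables that join on customer_id
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: small, slowly changing dimension copied to every node
CREATE TABLE region (
  region_id INT,
  name      VARCHAR(64)
)
DISTSTYLE ALL;

-- EVEN: no good candidate key, so spread rows round-robin
CREATE TABLE events (
  event_id BIGINT,
  payload  VARCHAR(256)
)
DISTSTYLE EVEN;
```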
Data Distribution and Distribution Keys
[Diagram: CloudFront log records (uri = /games/g1.exe, /imgs/ad1.png, …; user_id = 1234, 2345, …) distributed by a poor key: Slice 1 holds 2M records, Slice 2 holds 5M, Slice 3 holds 1M, Slice 4 holds 4M]
Poor key choices lead to an uneven distribution of records…
Data Distribution and Distribution Keys
[Diagram: the same skewed layout — 2M, 5M, 1M, and 4M records per slice]
Unevenly distributed data causes processing imbalances!
Data Distribution and Distribution Keys
[Diagram: with a better key, each of the four slices holds 2M records]
Evenly distributed data improves query performance.
Sort Keys
• Single column
• Compound
• Interleaved
Goals of sorting:
• Physically sort data within blocks and throughout a table
• The optimal SORTKEY depends on:
  • Query patterns
  • Data profile
  • Business requirements
Choosing a SORTKEY
COMPOUND:
• Most common
• Well-defined filter criteria
• Time-series data
INTERLEAVED:
• Edge cases
• Large tables (> a billion rows)
• No common filter criteria
• Non-time-series data
Guidelines:
• Primarily choose a column used as a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose the same column as the distribution key for the most efficient joins (merge join)
Table is sorted by 1 column: [ SORTKEY ( date ) ]
Best for:
• Queries that use the 1st column (i.e., date) as the primary filter
• Can speed up joins and GROUP BYs
• Quickest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Single Column
• Table is sorted by the 1st column, then the 2nd column, etc.: [ SORTKEY COMPOUND ( date, region, country ) ]
• Best for:
  • Queries that use the 1st column as the primary filter, then other columns
  • Can speed up joins and GROUP BYs
  • Slower to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Compound
• Equal weight is given to each column: [ SORTKEY INTERLEAVED ( date, region, country ) ]
• Best for:
  • Queries that use different columns in the filter
  • Queries get faster the more columns are used in the filter (up to 8)
  • Slowest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Interleaved
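The three sort-key styles above can be sketched in DDL, using a hypothetical logs table shaped like the date/region/country sample:

```sql
-- Single-column sort key
CREATE TABLE logs_single (
  log_date DATE,
  region   VARCHAR(32),
  country  VARCHAR(32)
)
SORTKEY (log_date);

-- Compound: sorted by log_date, then region, then country
CREATE TABLE logs_compound (
  log_date DATE,
  region   VARCHAR(32),
  country  VARCHAR(32)
)
COMPOUND SORTKEY (log_date, region, country);

-- Interleaved: equal weight given to each column
CREATE TABLE logs_interleaved (
  log_date DATE,
  region   VARCHAR(32),
  country  VARCHAR(32)
)
INTERLEAVED SORTKEY (log_date, region, country);
```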
Compressing data
• COPY automatically analyzes and compresses data when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column
• Changing column encoding requires a table rebuild
Automatic compression is a good thing (mostly)
• From the zone maps we know:
  • Which blocks contain the range
  • Which row offsets to scan
• Highly compressed SORTKEYs mean:
  • Many rows per block
  • Large row offsets
• Skip compression on just the leading column of the compound SORTKEY
DDL – Make columns as narrow as possible
• During queries and ingestion, the system allocates buffers based on column width
• Wider-than-needed columns waste memory
• Fewer rows fit into memory, increasing the likelihood of queries spilling to disk
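A sketch combining the last two tips on a hypothetical clicks table; the ENCODE choices are illustrative assumptions (ANALYZE COMPRESSION would propose better ones from your actual data), and ENCODE RAW leaves the leading sort-key column uncompressed as recommended above:

```sql
CREATE TABLE clicks (
  click_date DATE         ENCODE RAW,   -- leading sort-key column: skip compression
  user_id    BIGINT       ENCODE DELTA,
  url        VARCHAR(300) ENCODE LZO    -- sized to the real data, not an oversized max
)
COMPOUND SORTKEY (click_date, user_id);
```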
Amazon Redshift Operations
Backup and Restore
Backups to Amazon S3 are automatic, continuous & incremental
Configurable system snapshot retention period
User-defined snapshots are on-demand
Streaming restores enable you to resume querying faster
[Diagram: a three-node Redshift cluster (128 GB RAM, 16 TB disk, 16 cores per node) streaming snapshots to Amazon S3]
Automated Snapshots
• Redshift enables automated snapshots by default
• Snapshots are incremental
• Redshift retains the incremental backup data required to restore the cluster from a manual or automated snapshot
• This applies only to clusters that haven't reached the snapshot retention period
• Amazon Redshift provides free snapshot storage equal to the storage capacity of your cluster
Manual Snapshots
• The system never deletes a manual snapshot
• You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts
• Shared access between accounts means you can move data between dev/test/prod without reloading
Cross-Region Snapshots
• Copy cluster snapshots to another region using the management console
• You can't modify the destination region without disabling cross-region snapshots and re-enabling them with a new destination region and retention period
Streaming Restore
• During a restore, your nodes are provisioned in less than two minutes
• You can query the provisioned cluster immediately while data is automatically streamed from the S3 snapshot
Resize – Phase 1
• The cluster is put into read-only mode
• A new cluster is provisioned according to the resizing needs
• Node-to-node parallel data copy
• You are only charged for the source cluster
Resize – Phase 2
• Automatic SQL endpoint switchover via DNS
• The source cluster is decommissioned
Resizing a Redshift Data Warehouse via the AWS Console
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
Resizing without Production Impact
While you resize, your cluster is read-only:
• You can query the cluster throughout the resize
• There will be a short outage at the end of the resize
Parameter Groups
Allow you to set a variety of configuration options:
• Date style, search path, and query timeout
• User activity logging
• Whether the leader node should accept only SSL-protected connections
They apply to every database in a cluster.
View events using the console, the SDK, or the CLI/API
Use SET to change some properties temporarily, for example:
set wlm_query_slot_count to 3;
Some changes to the parameter group may require a restart to take effect
Default Parameter Values
http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
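A common use of the temporary setting shown above is to give a single heavy statement more of the queue's memory for one session; the table name here is a placeholder:

```sql
-- Borrow 3 of the queue's slots for this session's next statement
set wlm_query_slot_count to 3;
vacuum customer;

-- Return to the default of 1 slot so other queries aren't starved
set wlm_query_slot_count to 1;
```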
Events
Event information is available for a period of several weeks.
Source types: cluster, parameter group, security group, snapshot
Categories: configuration, management, monitoring, security
Subscribe to events to receive SNS notifications.
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
Resource Tagging
Define custom tags for your cluster to:
• Provide metadata about a resource
• Categorize billing reports for cost allocation
• Activate tags in the billing and cost management service
A cluster resize, or a restore from a snapshot in the original region, preserves tags; cross-region snapshots do not.
http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
Concurrent Query Execution
Long-running queries may cause short-running queries to wait in a queue:
• As a result, users may experience unpredictable performance
By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max):
• 5 slots means 5 queries can execute concurrently, while additional queries queue until resources free up
Workload Management
Workload management is about creating queues for different workloads — for example, a short-running queue and a long-running queue mapped to a user group (User Group A) or to query groups (Short Query Group, Long Query Group).
Workload Management
Don't set concurrency to more than you need
set query_group to allqueries;
select avg(l.priceperticket * s.qtysold)
from listing l, sales s
where l.listid < 40000;
reset query_group;
Redshift Utils
Script | Purpose
top_queries.sql | Get the top 50 most time-consuming statements in the last 7 days.
perf_alerts.sql | Get the top occurrences of alerts; join with table scans.
filter_used.sql | Get the filters applied to tables on scans, to aid in choosing a sortkey.
commit_stats.sql | Get information on the consumption of cluster resources by COMMIT statements.
current_session_info.sql | Get information about sessions with currently running queries.
missing_table_stats.sql | Get EXPLAIN plans that flagged missing statistics on underlying tables.
queuing_queries.sql | Get queries that are waiting on a WLM query slot.
table_info.sql | Get table storage information (size, skew, etc.).
• Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting
• https://github.com/awslabs/amazon-redshift-utils
Redshift Utils
View | Purpose
v_check_data_distribution.sql | Get data distribution across slices.
v_constraint_dependency.sql | Get the foreign key constraints between tables.
v_generate_group_ddl.sql | Get the DDL for a group.
v_generate_schema_ddl.sql | Get the DDL for schemas.
v_generate_tbl_ddl.sql | Get the DDL for a table, including the distkey, sortkey, and constraints.
v_generate_unload_copy_cmd.sql | Generate unload and copy commands for an object.
v_generate_user_object_permissions.sql | Get the DDL for a user's permissions to tables and views.
v_generate_view_ddl.sql | Get the DDL for a view.
v_get_obj_priv_by_user.sql | Get the tables/views that a user has access to.
v_get_schema_priv_by_user.sql | Get the schemas that a user has access to.
v_get_tbl_priv_by_user.sql | Get the tables that a user has access to.
v_get_users_in_group.sql | Get all users in a group.
v_get_view_priv_by_user.sql | Get the views that a user has access to.
v_object_dependency.sql | Merge the different dependency views.
v_space_used_per_tbl.sql | Get the space used per table.
v_view_dependency.sql | Get the names of the views that depend on other tables/views.
• Redshift Admin Views provide information about user and group access, table constraints, object and view dependencies, data distribution across slices, and space used per table
Open source tools
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin scripts: a collection of utilities for running diagnostics on your cluster
Admin views: a collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility: applies optimal column encoding to an established schema with data already loaded
Q&A
PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime
"With AWS services we were able to pace our initial investments and project the costs of future expansions"
"Redshift allowed us to turn data into self-service information" – Wanderley Paiva, Database Specialist
The Challenge
• Scalability
• Availability
• Centralized data
• Reduced costs, preferably spread out over time
Solution
Solution v2