replicating in real-time from mysql to amazon redshift
DESCRIPTION
Continuent is delighted to announce an exciting new Continuent Tungsten feature for MySQL users: replication in real-time from MySQL into Amazon Redshift. In this webinar we'll showcase Continuent Tungsten capabilities for continuous and real-time data warehouse loading, then zero-in on practical details of setting up replication from MySQL into Redshift. We cover the following topics: - Introduction to real-time data loading from a relational DBMS into data warehouses - Continuent Tungsten data warehouse loading to Redshift, Hadoop, Vertica, and Oracle - What's new with Redshift data loading - Setting up replication from MySQL into Amazon Redshift - Initial provisioning of the data, followed by on-going and real-time replication - Adding Redshift data loading to existing Continuent Tungsten clusters. This webinar includes practical tips and a live demo of how to get your data warehouse loading projects off the ground quickly and efficiently. Please join us to hear about this great new feature of Continuent Tungsten!TRANSCRIPT
©Continuent 2014
Replicating in Real-Time from MySQL to Amazon Redshift
Featuring Continuent Tungsten
Robert Hodges, CEO Jeff Mace, Director of Services
©Continuent 2014
Presentation Topics
• Introductions
• Tungsten Replicator Overview
• Real-Time Data Warehouse Loading
• Adding Redshift to the Mix
• Replication from Tungsten Clusters to Redshift
• Wrap-up
2
©Continuent 2014
Introducing Continuent
3
• The leading provider of clustering and replication for open source DBMS
• Our Product: Continuent Tungsten
• Clustering - Commercial-grade HA, performance scaling and data management for MySQL
• Replication - Flexible, high-performance data movement
©Continuent 2014©Continuent 2014
Continuent Tungsten Customers
4
1
©Continuent 2014
Quick Continuent Facts
• Largest Tungsten installation by data volume processes over 800 million transactions per day on 225 terabytes of relational data
• Largest installation by transaction volume handles up to 8 billion transactions daily
• Wide variety of topologies including MySQL, Oracle, Vertica, and Hadoop are in production now
• Amazon Redshift is the newest data warehouse target
5
©Continuent 2014 6
Tungsten Replicator Overview
©Continuent 2014
What is Tungsten Replicator?
7
A real-time, high-performance,
open source database replication engine
!GPL V2 license - 100% open source
Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent
“Golden Gate without the Price Tag”®
©Continuent 2014
Tungsten Replicator Overview
8
Master
(Transactions + Metadata)
Slave
THL
DBMS Logs
Replicator
(Transactions + Metadata)
THLReplicator
Extract transactions
from log
Apply
©Continuent 2014
Replication Services and Parallel Apply
9
Extract Filter Apply
StageExtract Filter Apply
StageStage
Pipeline
Master DBMS
Transaction History Log
Parallel Queue
Slave DBMS
Extract Filter ApplyExtract Filter ApplyExtract Filter Apply
©Continuent 2014
HeterogeneousOracleMySQL Oracle MySQL MySQL
star
master-slave
fan-in slave all-masters
MySQLOracle Oracle
©Continuent 2014
Real-Time Data Warehouse Loading
11
©Continuent 2014
Why MySQL Needs A Data Warehouse
12
id cust_id prod_id ...
1 335301 532 ...
2 2378 6235 ...
3 ... ... ...
Sales Table
id sku type
532 C00135 consumer
533 S09957 specialty
... ...
Product Tableprod_id id
532 1
6235 2
... ...
Prod_ID Index
Row format makes table scans very
slow
Indexes slow OLTP
Low/no data compression
Limited index types
Limited join
types
Single-threaded query — no parallelization
©Continuent 2014
Current Data Warehouse Options
13
OLTP Data
MySQL Master
Scalable row store with star schema and materialized view support
CSV files stored on HDFS with access via Hive/MapReduce
Column storage with compression and built-in time series support
©Continuent 2014
Traditional ETL Approach
14
MySQL
Sales Table
LoadTransferExtract
Date columns = intrusive
Batch-oriented = not timely
Scan for changes = performance hit
Sales Table
Data Warehouse
©Continuent 2014
Real-Time Data Replication
15
MySQL
Sales Table
Fast propagation = timely
No SQL changes = transparent
Automatic change capture = low impact
DBMS Logs
Data Replication
Sales Table
Data Warehouse
©Continuent 2014
Loading to an Oracle Data Warehouse
16
MySQL Tungsten Master Replicator
oracle
MySQLExtractor Special Filters * Transform enum to string
binlog_format=row
Tungsten Slave Replicator
oracle
Special Filters * Ignore extra tables * Map names to upper case * Optimize updates to remove unchanged columns MySQL
Binlog
Data stored in rows like
MySQL
©Continuent 2014
How about Loading to Hadoop?
17
Replication
CSV FilesCSV FilesBuffered
TransactionsBinlog
Provisioning
Data stored as append-only files
©Continuent 2014
Typical Raw Hadoop Data Layout
!!!
!
556,MALTESE HOPE,4.99,127\n 557,MANCHURIAN CURTAIN,3.99,177\n 558,MANNEQUIN WORST,2.99,71\n 559,MARRIED GO,2.99,114\n
18
field separator
file partitioning
record separator
compression type conversions
(CSV file)
©Continuent 2014
Loading to a Hadoop Data Warehouse
19
Replicator
hadoopTransactions from master
CSV FilesCSV FilesCSV Files
Staging TablesStaging TablesStaging “Tables”
Base TablesBase TablesMaterialized Views
Javascript load script
e.g. hadoop.js
Write data to CSV
Merge using external
MapReduce job
(Generate Hive table definitions)
(Generate Hive table definitions)
Load using hadoop
command
©Continuent 2014
Hadoop Materialized View Generation
20
Transaction logs Snapshot
UNION ALL
Emit last row per key if not a delete
MAP
REDUCE
Materialized view including all updates
Sort by key(s), transaction orderSHUFFLE
©Continuent 2014
How about Loading to Vertica?
21
Replication
CSV FilesCSV FilesBuffered
TransactionsBinlog
Provisioning?
Data stored in column format
©Continuent 2014
Data Layout in a Column Store
22
Sales Table
cust_id
335301
2378
...
prod_id
532
6235
...
Fast scans on columns
Updates to single rows are hideously
slow
quantity
1
3
...
id
1
2
3
Every column is an index
Good compression
id
532
533
...
sku
C00135
S09957
...
type
consumer
specialty
...
Product Table
Fast joins with parallel
query
©Continuent 2014
Loading to a Vertica Data Warhouse
23
MySQL Tungsten Master Replicator
vertica
MySQLExtractor Special Filters * pkey - Fill in pkey info * colnames - Fill in names * replicate - Ignore tables
binlog_format=row
Tungsten Slave Replicator
vertica
MySQL Binlog
CSV FilesCSV FilesCSV FilesCSV FilesCSV Files
Large transaction batches to leverage load parallelization
©Continuent 2014
Vertica Batch Loading--The Details
24
Replicator
verticaTransactions from master
CSV FilesCSV FilesCSV Files
Staging TablesStaging TablesStaging Tables
Base Tables
Base Tables
Base Tables
Merge Script
(or) COPY
directly to base tables
COPY to stage tables SELECT to
base tables
No external merge required!
©Continuent 2014
Provisioning Using a Sandbox Server
25
OLTP Server
Temporary Sandbox Server
Vertica Cluster
1. Restore logical backup
2. Replicate restored transactions
3. Replicate normally after restore loads
©Continuent 2014 26
Adding Amazon Redshift to the Mix
©Continuent 2014
Tungsten Support for Amazon Redshift
• Tungsten Replicator 3.0 supports real-time replication to Amazon Redshift
• Loading builds on data warehouse support for Vertica and Hadoop
• Beta program in progress now
• Production ready in September 2014
27
©Continuent 2014
Redshift = Cloud Data Warehouse
28
©Continuent 2014
Redshift Capabilities
• On-demand column store in Amazon Web Services
• Looks like a PostgreSQL 8.x DBMS
• $1,000 per terabyte per year or $0.25/hour
• Automatic backup and cluster management
• Integrated into Amazon VPC
• Data loading from Amazon S3 using SQL COPY command
29
©Continuent 2014
Connecting to Amazon Redshift
$ psql -h redshift-webinar.cub79gczb9kq.us-east-1.redshift.amazonaws.com -U tungsten -d dev -p 5439 psql (9.2.6, server 8.0.2) WARNING: psql version 9.2, server version 8.0. Some psql features might not work. SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256) Type "help" for help. !dev=# select * from test.foo; id | data ----+-------------- 2 | salutations! 3 | bye! 1 | hello! (3 rows) !dev=# \q
30
©Continuent 2014
Tungsten Loading to Redshift
31
Tungsten Slave
redshiftTransactions
from Tungsten Master
CSV FilesCSV FilesCSV Files
Staging TablesStaging Tables
RedShift Staging Tables
Base TablesBase TablesRedShift Base Tables
Javascript load script
e.g. hadoop.js
Write data to CSV
3. Merge using SQL to base
2. COPY to Redshift1. Load to S3
using s3cmd
S3 Storage
©Continuent 2014
Prerequisites for Redshift Loading
32
MySQL Tungsten Master Replicator
redshift
binlog_format=row
Tungsten Slave Replicator
redshift
MySQL Binlog
Allow access from Tungsten host
Java 6+, Ruby 1.8.5+
Amazon RedshiftJava 6+,
Ruby 1.8.5+ s3cmd 1.0+ S3 credentials
S3 Storage
Bucket for CSV uploads
©Continuent 2014
Application Prerequisites
33
• Primary keys on all tables
• UTF-8 character set--or at least be consistent
• Use GMT timezone--or be very consistent about dates
Avoid “Cowboy” schema changes to MySQL master(s)
©Continuent 2014
Generate Schema Using ddlscan
34
•Data types? •Column lengths? •Naming conventions? •Staging tables?
MySQL Tables
ddlscanAmazon Redshift
©Continuent 2014
Staging Table Schema
35
$ bin/ddlscan -db test -template ddl-mysql-redshift-staging.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.stage_xxx_foo;!CREATE TABLE test.stage_xxx_foo!(! tungsten_opcode CHAR(2),! tungsten_seqno INT,! tungsten_row_id INT,! tungsten_commit_timestamp TIMESTAMP,! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)!);
©Continuent 2014
Base Table Schema
$ bin/ddlscan -db test -template ddl-mysql-redshift.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.foo;!CREATE TABLE test.foo!(! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (id)!);
36
©Continuent 2014
Sandbox Provisioning for Redshift
37
OLTP Server
Temporary Sandbox Server
Redshift Cluster
1. Restore logical backup
2. Replicate restored transactions
3. Replicate normally after restore loads
Amazon Redshift
©Continuent 2014
Demo
38
Setting up MySQL to Redshift Data Loading
©Continuent 2014 39
Replicating from Tungsten Clusters to Redshift
©Continuent 2014
60 Second Intro to Tungsten Clusters
40
Tungsten clusters combine off-the-shelf open source MySQL servers into data services with: !
• 24x7 data access • Scaling of load on replicas • Simple management commands !...without app changes or data migration
Amazon US West
apache /php
GonzoPortal.com
Connector Connector
©Continuent 2014
Replicating from a Cluster to Redshift
41
Tungsten Cluster1
master
slave
Tungsten 3.0 Replicator
cluster1 Amazon Redshift
Filters and Special Configuration * pkey - Fill in pkey info * colnames - Fill in names * fixmysqlstrings - Convert to UT8 * Replicate from both nodes for higher availability
©Continuent 2014
Fan-in from Multiple Tungsten Clusters
42
Redshift Cluster
cluster2
Tungsten Replicator
cluster1 Amazon Redshift
©Continuent 2014 43
Getting Started with MySQL to Redshift Loading
©Continuent 2014
Where Is Everything?
44
• Tungsten Replicator 3.0 builds are available on code.google.com http://code.google.com/p/tungsten-replicator/
(Visit the nightly builds page for latest features)
• Replicator documentation is available on Continuent website https://docs.continuent.com/tungsten-replicator-3.0/deployment-redshift.html
Contact Continuent for support
©Continuent 2014
In Conclusion…
• Tungsten Replicator 3.0 now supports realtime loading from MySQL to Amazon Redshift
• You can start using it yourself or join our beta program
• Software will be production-ready in September 2014
• Continuent has a wealth of data loading features on tap so stay tuned!
45
©Continuent 2014
www.continuent.com Follow us on Twitter @continuent
!
Tungsten Replicator: http://code.google.com/p/tungsten-replicator
Our Blogs: http://scale-out-blog.blogspot.com http://datacharmer.org/blog http://www.continuent.com/news/blogs http://flyingclusters.blogspot.com/
560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]