replicating in real-time from mysql to amazon redshift

©Continuent 2014

Replicating in Real-Time from MySQL to Amazon Redshift

Featuring Continuent Tungsten

Robert Hodges, CEO Jeff Mace, Director of Services

©Continuent 2014

Presentation Topics

• Introductions

• Tungsten Replicator Overview

• Real-Time Data Warehouse Loading

• Adding Redshift to the Mix

• Replication from Tungsten Clusters to Redshift

• Wrap-up

2

©Continuent 2014

Introducing Continuent

3

• The leading provider of clustering and replication for open source DBMS

• Our Product: Continuent Tungsten

• Clustering - Commercial-grade HA, performance scaling and data management for MySQL

• Replication - Flexible, high-performance data movement

©Continuent 2014©Continuent 2014

Continuent Tungsten Customers

4

1

©Continuent 2014

Quick Continuent Facts

• Largest Tungsten installation by data volume processes over 800 million transactions per day on 225 terabytes of relational data

• Largest installation by transaction volume handles up to 8 billion transactions daily

• Wide variety of topologies including MySQL, Oracle, Vertica, and Hadoop are in production now

• Amazon Redshift is the newest data warehouse target

5

©Continuent 2014 6

Tungsten Replicator Overview

©Continuent 2014

What is Tungsten Replicator?

7

A real-time, high-performance,

open source database replication engine

!GPL V2 license - 100% open source

Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent

“Golden Gate without the Price Tag”®

©Continuent 2014

Tungsten Replicator Overview

8

Master

(Transactions + Metadata)

Slave

THL

DBMS Logs

Replicator

(Transactions + Metadata)

THLReplicator

Extract transactions

from log

Apply

©Continuent 2014

Replication Services and Parallel Apply

9

Extract Filter Apply

StageExtract Filter Apply

StageStage

Pipeline

Master DBMS

Transaction History Log

Parallel Queue

Slave DBMS

Extract Filter ApplyExtract Filter ApplyExtract Filter Apply

©Continuent 2014

HeterogeneousOracleMySQL Oracle MySQL MySQL

star

master-slave

fan-in slave all-masters

MySQLOracle Oracle

©Continuent 2014

Real-Time Data Warehouse Loading

11

©Continuent 2014

Why MySQL Needs A Data Warehouse

12

id cust_id prod_id ...

1 335301 532 ...

2 2378 6235 ...

3 ... ... ...

Sales Table

id sku type

532 C00135 consumer

533 S09957 specialty

... ...

Product Tableprod_id id

532 1

6235 2

... ...

Prod_ID Index

Row format makes table scans very

slow

Indexes slow OLTP

Low/no data compression

Limited index types

Limited join

types

Single-threaded query — no parallelization

©Continuent 2014

Current Data Warehouse Options

13

OLTP Data

MySQL Master

Scalable row store with star schema and materialized view support

CSV files stored on HDFS with access via Hive/MapReduce

Column storage with compression and built-in time series support

©Continuent 2014

Traditional ETL Approach

14

MySQL

Sales Table

LoadTransferExtract

Date columns = intrusive

Batch-oriented = not timely

Scan for changes = performance hit

Sales Table

Data Warehouse

©Continuent 2014

Real-Time Data Replication

15

MySQL

Sales Table

Fast propagation = timely

No SQL changes = transparent

Automatic change capture = low impact

DBMS Logs

Data Replication

Sales Table

Data Warehouse

©Continuent 2014

Loading to an Oracle Data Warehouse

16

MySQL Tungsten Master Replicator

oracle

MySQLExtractor Special Filters * Transform enum to string

binlog_format=row

Tungsten Slave Replicator

oracle

Special Filters * Ignore extra tables * Map names to upper case * Optimize updates to remove unchanged columns MySQL

Binlog

Data stored in rows like

MySQL

©Continuent 2014

How about Loading to Hadoop?

17

Replication

CSV FilesCSV FilesBuffered

TransactionsBinlog

Provisioning

Data stored as append-only files

©Continuent 2014

Typical Raw Hadoop Data Layout

!!!

!

556,MALTESE HOPE,4.99,127\n 557,MANCHURIAN CURTAIN,3.99,177\n 558,MANNEQUIN WORST,2.99,71\n 559,MARRIED GO,2.99,114\n

18

field separator

file partitioning

record separator

compression type conversions

(CSV file)

©Continuent 2014

Loading to a Hadoop Data Warehouse

19

Replicator

hadoopTransactions from master

CSV FilesCSV FilesCSV Files

Staging TablesStaging TablesStaging “Tables”

Base TablesBase TablesMaterialized Views

Javascript load script

e.g. hadoop.js

Write data to CSV

Merge using external

MapReduce job

(Generate Hive table definitions)

(Generate Hive table definitions)

Load using hadoop

command

©Continuent 2014

Hadoop Materialized View Generation

20

Transaction logs Snapshot

UNION ALL

Emit last row per key if not a delete

MAP

REDUCE

Materialized view including all updates

Sort by key(s), transaction orderSHUFFLE

©Continuent 2014

How about Loading to Vertica?

21

Replication

CSV FilesCSV FilesBuffered

TransactionsBinlog

Provisioning?

Data stored in column format

©Continuent 2014

Data Layout in a Column Store

22

Sales Table

cust_id

335301

2378

...

prod_id

532

6235

...

Fast scans on columns

Updates to single rows are hideously

slow

quantity

1

3

...

id

1

2

3

Every column is an index

Good compression

id

532

533

...

sku

C00135

S09957

...

type

consumer

specialty

...

Product Table

Fast joins with parallel

query

©Continuent 2014

Loading to a Vertica Data Warhouse

23


vertica

MySQLExtractor Special Filters * pkey - Fill in pkey info * colnames - Fill in names * replicate - Ignore tables

binlog_format=row


vertica

MySQL Binlog

CSV FilesCSV FilesCSV FilesCSV FilesCSV Files

Large transaction batches to leverage load parallelization

©Continuent 2014

Vertica Batch Loading--The Details

24

Replicator

verticaTransactions from master


Staging TablesStaging TablesStaging Tables

Base Tables

Base Tables

Base Tables

Merge Script

(or) COPY

directly to base tables

COPY to stage tables SELECT to

base tables

No external merge required!

©Continuent 2014

Provisioning Using a Sandbox Server

25

OLTP Server

Temporary Sandbox Server

Vertica Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

©Continuent 2014 26

Adding Amazon Redshift to the Mix

©Continuent 2014

Tungsten Support for Amazon Redshift

• Tungsten Replicator 3.0 supports real-time replication to Amazon Redshift

• Loading builds on data warehouse support for Vertica and Hadoop

• Beta program in progress now

• Production ready in September 2014

27

©Continuent 2014

Redshift = Cloud Data Warehouse

28

©Continuent 2014

Redshift Capabilities

• On-demand column store in Amazon Web Services

• Looks like a PostgreSQL 8.x DBMS

• $1,000 per terabyte per year or $0.25/hour

• Automatic backup and cluster management

• Integrated into Amazon VPC

• Data loading from Amazon S3 using SQL COPY command

29

©Continuent 2014

Connecting to Amazon Redshift

$ psql -h redshift-webinar.cub79gczb9kq.us-east-1.redshift.amazonaws.com -U tungsten -d dev -p 5439 psql (9.2.6, server 8.0.2) WARNING: psql version 9.2, server version 8.0. Some psql features might not work. SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256) Type "help" for help. !dev=# select * from test.foo; id | data ----+-------------- 2 | salutations! 3 | bye! 1 | hello! (3 rows) !dev=# \q

30

©Continuent 2014

Tungsten Loading to Redshift

31

Tungsten Slave

redshiftTransactions

from Tungsten Master


Staging TablesStaging Tables

RedShift Staging Tables

Base TablesBase TablesRedShift Base Tables

Javascript load script

e.g. hadoop.js

Write data to CSV

3. Merge using SQL to base

2. COPY to Redshift1. Load to S3

using s3cmd

S3 Storage

©Continuent 2014

Prerequisites for Redshift Loading

32


redshift

binlog_format=row


redshift

MySQL Binlog

Allow access from Tungsten host

Java 6+, Ruby 1.8.5+

Amazon RedshiftJava 6+,

Ruby 1.8.5+ s3cmd 1.0+ S3 credentials

S3 Storage

Bucket for CSV uploads

©Continuent 2014

Application Prerequisites

33

• Primary keys on all tables

• UTF-8 character set--or at least be consistent

• Use GMT timezone--or be very consistent about dates

Avoid “Cowboy” schema changes to MySQL master(s)

©Continuent 2014

Generate Schema Using ddlscan

34

•Data types? •Column lengths? •Naming conventions? •Staging tables?

MySQL Tables

ddlscanAmazon Redshift

©Continuent 2014

Staging Table Schema

35

$ bin/ddlscan -db test -template ddl-mysql-redshift-staging.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.stage_xxx_foo;!CREATE TABLE test.stage_xxx_foo!(! tungsten_opcode CHAR(2),! tungsten_seqno INT,! tungsten_row_id INT,! tungsten_commit_timestamp TIMESTAMP,! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)!);

©Continuent 2014

Base Table Schema

$ bin/ddlscan -db test -template ddl-mysql-redshift.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.foo;!CREATE TABLE test.foo!(! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (id)!);

36

©Continuent 2014

Sandbox Provisioning for Redshift

37

OLTP Server

Temporary Sandbox Server

Redshift Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

Amazon Redshift


Replicating from Tungsten Clusters to Redshift

©Continuent 2014

60 Second Intro to Tungsten Clusters

40

Tungsten clusters combine off-the-shelf open source MySQL servers into data services with: !

• 24x7 data access • Scaling of load on replicas • Simple management commands !...without app changes or data migration

Amazon US West

apache /php

GonzoPortal.com

Connector Connector

©Continuent 2014

Replicating from a Cluster to Redshift

41

Tungsten Cluster1

master

slave

Tungsten 3.0 Replicator

cluster1 Amazon Redshift

Filters and Special Configuration * pkey - Fill in pkey info * colnames - Fill in names * fixmysqlstrings - Convert to UT8 * Replicate from both nodes for higher availability

©Continuent 2014

Fan-in from Multiple Tungsten Clusters

42

Redshift Cluster

cluster2

Tungsten Replicator

cluster1 Amazon Redshift


Getting Started with MySQL to Redshift Loading

©Continuent 2014

Where Is Everything?

44

• Tungsten Replicator 3.0 builds are available on code.google.com http://code.google.com/p/tungsten-replicator/

(Visit the nightly builds page for latest features)

• Replicator documentation is available on Continuent website https://docs.continuent.com/tungsten-replicator-3.0/deployment-redshift.html

Contact Continuent for support

©Continuent 2014

In Conclusion…

• Tungsten Replicator 3.0 now supports realtime loading from MySQL to Amazon Redshift

• You can start using it yourself or join our beta program

• Software will be production-ready in September 2014

• Continuent has a wealth of data loading features on tap so stay tuned!

45

©Continuent 2014

www.continuent.com Follow us on Twitter @continuent

!

Tungsten Replicator: http://code.google.com/p/tungsten-replicator

Our Blogs: http://scale-out-blog.blogspot.com http://datacharmer.org/blog http://www.continuent.com/news/blogs http://flyingclusters.blogspot.com/

560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]