replicating in real-time from mysql to amazon redshift

46
©Continuent 2014 Replicating in Real-Time from MySQL to Amazon Redshift Featuring Continuent Tungsten Robert Hodges, CEO JeMace, Director of Services

Upload: continuent

Post on 30-Nov-2014

317 views

Category:

Software


5 download

DESCRIPTION

Continuent is delighted to announce an exciting new Continuent Tungsten feature for MySQL users: replication in real-time from MySQL into Amazon Redshift. In this webinar we'll showcase Continuent Tungsten capabilities for continuous and real-time data warehouse loading, then zero-in on practical details of setting up replication from MySQL into Redshift. We cover the following topics: - Introduction to real-time data loading from a relational DBMS into data warehouses - Continuent Tungsten data warehouse loading to Redshift, Hadoop, Vertica, and Oracle - What's new with Redshift data loading - Setting up replication from MySQL into Amazon Redshift - Initial provisioning of the data, followed by on-going and real-time replication - Adding Redshift data loading to existing Continuent Tungsten clusters. This webinar includes practical tips and a live demo of how to get your data warehouse loading projects off the ground quickly and efficiently. Please join us to hear about this great new feature of Continuent Tungsten!

TRANSCRIPT

Page 1: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Replicating in Real-Time from MySQL to Amazon Redshift

Featuring Continuent Tungsten

Robert Hodges, CEO Jeff Mace, Director of Services

Page 2: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Presentation Topics

• Introductions

• Tungsten Replicator Overview

• Real-Time Data Warehouse Loading

• Adding Redshift to the Mix

• Replication from Tungsten Clusters to Redshift

• Wrap-up

2

Page 3: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Introducing Continuent

3

• The leading provider of clustering and replication for open source DBMS

• Our Product: Continuent Tungsten

• Clustering - Commercial-grade HA, performance scaling and data management for MySQL

• Replication - Flexible, high-performance data movement

Page 4: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014©Continuent 2014

Continuent Tungsten Customers

4

1

Page 5: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Quick Continuent Facts

• Largest Tungsten installation by data volume processes over 800 million transactions per day on 225 terabytes of relational data

• Largest installation by transaction volume handles up to 8 billion transactions daily

• Wide variety of topologies including MySQL, Oracle, Vertica, and Hadoop are in production now

• Amazon Redshift is the newest data warehouse target

5

Page 6: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014 6

Tungsten Replicator Overview

Page 7: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

What is Tungsten Replicator?

7

A real-time, high-performance,

open source database replication engine

!GPL V2 license - 100% open source

Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent

“Golden Gate without the Price Tag”®

Page 8: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Tungsten Replicator Overview

8

Master

(Transactions + Metadata)

Slave

THL

DBMS Logs

Replicator

(Transactions + Metadata)

THLReplicator

Extract transactions

from log

Apply

Page 9: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Replication Services and Parallel Apply

9

Extract Filter Apply

StageExtract Filter Apply

StageStage

Pipeline

Master DBMS

Transaction History Log

Parallel Queue

Slave DBMS

Extract Filter ApplyExtract Filter ApplyExtract Filter Apply

Page 10: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

HeterogeneousOracleMySQL Oracle MySQL MySQL

star

master-slave

fan-in slave all-masters

MySQLOracle Oracle

Page 11: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Real-Time Data Warehouse Loading

11

Page 12: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Why MySQL Needs A Data Warehouse

12

id cust_id prod_id ...

1 335301 532 ...

2 2378 6235 ...

3 ... ... ...

Sales Table

id sku type

532 C00135 consumer

533 S09957 specialty

... ...

Product Tableprod_id id

532 1

6235 2

... ...

Prod_ID Index

Row format makes table scans very

slow

Indexes slow OLTP

Low/no data compression

Limited index types

Limited join

types

Single-threaded query — no parallelization

Page 13: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Current Data Warehouse Options

13

OLTP Data

MySQL Master

Scalable row store with star schema and materialized view support

CSV files stored on HDFS with access via Hive/MapReduce

Column storage with compression and built-in time series support

Page 14: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Traditional ETL Approach

14

MySQL

Sales Table

LoadTransferExtract

Date columns = intrusive

Batch-oriented = not timely

Scan for changes = performance hit

Sales Table

Data Warehouse

Page 15: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Real-Time Data Replication

15

MySQL

Sales Table

Fast propagation = timely

No SQL changes = transparent

Automatic change capture = low impact

DBMS Logs

Data Replication

Sales Table

Data Warehouse

Page 16: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Loading to an Oracle Data Warehouse

16

MySQL Tungsten Master Replicator

oracle

MySQLExtractor Special Filters * Transform enum to string

binlog_format=row

Tungsten Slave Replicator

oracle

Special Filters * Ignore extra tables * Map names to upper case * Optimize updates to remove unchanged columns MySQL

Binlog

Data stored in rows like

MySQL

Page 17: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

How about Loading to Hadoop?

17

Replication

CSV FilesCSV FilesBuffered

TransactionsBinlog

Provisioning

Data stored as append-only files

Page 18: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Typical Raw Hadoop Data Layout

!!!

!

556,MALTESE HOPE,4.99,127\n 557,MANCHURIAN CURTAIN,3.99,177\n 558,MANNEQUIN WORST,2.99,71\n 559,MARRIED GO,2.99,114\n

18

field separator

file partitioning

record separator

compression type conversions

(CSV file)

Page 19: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Loading to a Hadoop Data Warehouse

19

Replicator

hadoopTransactions from master

CSV FilesCSV FilesCSV Files

Staging TablesStaging TablesStaging “Tables”

Base TablesBase TablesMaterialized Views

Javascript load script

e.g. hadoop.js

Write data to CSV

Merge using external

MapReduce job

(Generate Hive table definitions)

(Generate Hive table definitions)

Load using hadoop

command

Page 20: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Hadoop Materialized View Generation

20

Transaction logs Snapshot

UNION ALL

Emit last row per key if not a delete

MAP

REDUCE

Materialized view including all updates

Sort by key(s), transaction orderSHUFFLE

Page 21: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

How about Loading to Vertica?

21

Replication

CSV FilesCSV FilesBuffered

TransactionsBinlog

Provisioning?

Data stored in column format

Page 22: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Data Layout in a Column Store

22

Sales Table

cust_id

335301

2378

...

prod_id

532

6235

...

Fast scans on columns

Updates to single rows are hideously

slow

quantity

1

3

...

id

1

2

3

Every column is an index

Good compression

id

532

533

...

sku

C00135

S09957

...

type

consumer

specialty

...

Product Table

Fast joins with parallel

query

Page 23: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Loading to a Vertica Data Warhouse

23

MySQL Tungsten Master Replicator

vertica

MySQLExtractor Special Filters * pkey - Fill in pkey info * colnames - Fill in names * replicate - Ignore tables

binlog_format=row

Tungsten Slave Replicator

vertica

MySQL Binlog

CSV FilesCSV FilesCSV FilesCSV FilesCSV Files

Large transaction batches to leverage load parallelization

Page 24: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Vertica Batch Loading--The Details

24

Replicator

verticaTransactions from master

CSV FilesCSV FilesCSV Files

Staging TablesStaging TablesStaging Tables

Base Tables

Base Tables

Base Tables

Merge Script

(or) COPY

directly to base tables

COPY to stage tables SELECT to

base tables

No external merge required!

Page 25: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Provisioning Using a Sandbox Server

25

OLTP Server

Temporary Sandbox Server

Vertica Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

Page 26: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014 26

Adding Amazon Redshift to the Mix

Page 27: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Tungsten Support for Amazon Redshift

• Tungsten Replicator 3.0 supports real-time replication to Amazon Redshift

• Loading builds on data warehouse support for Vertica and Hadoop

• Beta program in progress now

• Production ready in September 2014

27

Page 28: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Redshift = Cloud Data Warehouse

28

Page 29: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Redshift Capabilities

• On-demand column store in Amazon Web Services

• Looks like a PostgreSQL 8.x DBMS

• $1,000 per terabyte per year or $0.25/hour

• Automatic backup and cluster management

• Integrated into Amazon VPC

• Data loading from Amazon S3 using SQL COPY command

29

Page 30: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Connecting to Amazon Redshift

$ psql -h redshift-webinar.cub79gczb9kq.us-east-1.redshift.amazonaws.com -U tungsten -d dev -p 5439 psql (9.2.6, server 8.0.2) WARNING: psql version 9.2, server version 8.0. Some psql features might not work. SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256) Type "help" for help. !dev=# select * from test.foo; id | data ----+-------------- 2 | salutations! 3 | bye! 1 | hello! (3 rows) !dev=# \q

30

Page 31: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Tungsten Loading to Redshift

31

Tungsten Slave

redshiftTransactions

from Tungsten Master

CSV FilesCSV FilesCSV Files

Staging TablesStaging Tables

RedShift Staging Tables

Base TablesBase TablesRedShift Base Tables

Javascript load script

e.g. hadoop.js

Write data to CSV

3. Merge using SQL to base

2. COPY to Redshift1. Load to S3

using s3cmd

S3 Storage

Page 32: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Prerequisites for Redshift Loading

32

MySQL Tungsten Master Replicator

redshift

binlog_format=row

Tungsten Slave Replicator

redshift

MySQL Binlog

Allow access from Tungsten host

Java 6+, Ruby 1.8.5+

Amazon RedshiftJava 6+,

Ruby 1.8.5+ s3cmd 1.0+ S3 credentials

S3 Storage

Bucket for CSV uploads

Page 33: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Application Prerequisites

33

• Primary keys on all tables

• UTF-8 character set--or at least be consistent

• Use GMT timezone--or be very consistent about dates

Avoid “Cowboy” schema changes to MySQL master(s)

Page 34: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Generate Schema Using ddlscan

34

•Data types? •Column lengths? •Naming conventions? •Staging tables?

MySQL Tables

ddlscanAmazon Redshift

Page 35: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Staging Table Schema

35

$ bin/ddlscan -db test -template ddl-mysql-redshift-staging.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.stage_xxx_foo;!CREATE TABLE test.stage_xxx_foo!(! tungsten_opcode CHAR(2),! tungsten_seqno INT,! tungsten_row_id INT,! tungsten_commit_timestamp TIMESTAMP,! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)!);

Page 36: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Base Table Schema

$ bin/ddlscan -db test -template ddl-mysql-redshift.vm!...!!CREATE SCHEMA test;!!DROP TABLE test.foo;!CREATE TABLE test.foo!(! id INT,! data VARCHAR(120) /* VARCHAR(30) */,! PRIMARY KEY (id)!);

36

Page 37: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Sandbox Provisioning for Redshift

37

OLTP Server

Temporary Sandbox Server

Redshift Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

Amazon Redshift

Page 38: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Demo

38

Setting up MySQL to Redshift Data Loading

Page 39: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014 39

Replicating from Tungsten Clusters to Redshift

Page 40: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

60 Second Intro to Tungsten Clusters

40

Tungsten clusters combine off-the-shelf open source MySQL servers into data services with: !

• 24x7 data access • Scaling of load on replicas • Simple management commands !...without app changes or data migration

Amazon US West

apache /php

GonzoPortal.com

Connector Connector

Page 41: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Replicating from a Cluster to Redshift

41

Tungsten Cluster1

master

slave

Tungsten 3.0 Replicator

cluster1 Amazon Redshift

Filters and Special Configuration * pkey - Fill in pkey info * colnames - Fill in names * fixmysqlstrings - Convert to UT8 * Replicate from both nodes for higher availability

Page 42: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Fan-in from Multiple Tungsten Clusters

42

Redshift Cluster

cluster2

Tungsten Replicator

cluster1 Amazon Redshift

Page 43: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014 43

Getting Started with MySQL to Redshift Loading

Page 44: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

Where Is Everything?

44

• Tungsten Replicator 3.0 builds are available on code.google.com http://code.google.com/p/tungsten-replicator/

(Visit the nightly builds page for latest features)

• Replicator documentation is available on Continuent website https://docs.continuent.com/tungsten-replicator-3.0/deployment-redshift.html

Contact Continuent for support

Page 45: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

In Conclusion…

• Tungsten Replicator 3.0 now supports realtime loading from MySQL to Amazon Redshift

• You can start using it yourself or join our beta program

• Software will be production-ready in September 2014

• Continuent has a wealth of data loading features on tap so stay tuned!

45

Page 46: Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014

www.continuent.com Follow us on Twitter @continuent

!

Tungsten Replicator: http://code.google.com/p/tungsten-replicator

Our Blogs: http://scale-out-blog.blogspot.com http://datacharmer.org/blog http://www.continuent.com/news/blogs http://flyingclusters.blogspot.com/

560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]