
Big Migrations: Moving elephant herds

Carlos Izquierdo
www.datatons.com

Motivation

● Everybody wants to jump into Big Data
● Everybody wants their new setup to be cheap
  – Cloud is an excellent option for this
● These environments generally start as a PoC
  – They should be re-implemented
  – Sometimes they are not


Motivation

● You may need to move your Hadoop cluster
  – To reduce costs
  – To get more performance
  – Because of corporate policy
  – For legal reasons
● But moving big data volumes is a problem!
  – Example: 20 TB over a gigabit link (~100 MB/s) takes roughly 2½ days (worked out below)
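A quick back-of-the-envelope check of that figure, as a minimal Python sketch (the ~100 MB/s rate is an assumption, roughly a saturated gigabit link):

    # Rough single-link transfer-time estimate.
    data_tb = 20                 # dataset size in TB
    rate_mb_s = 100              # assumed sustained throughput in MB/s (~1 Gbit/s)

    seconds = data_tb * 1_000_000 / rate_mb_s   # 1 TB = 1,000,000 MB (decimal units)
    print(f"{seconds / 86_400:.1f} days")       # prints about 2.3 days

At 10 MB/s the same 20 TB would take over three weeks, which is why a plain point-to-point copy is rarely an option.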


Initial idea

● Set up a second cluster in the new environment
● The new cluster is initially empty
● We need to populate it


Classic UNIX methods

● Well-known file transfer technologies:
  – (s)FTP
  – rsync
  – NFS + cp
● You need to set up a staging area
● This acts as an intermediate space between Hadoop and the classic UNIX world (see the sketch below)
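A minimal sketch of that staging flow, assuming made-up host names and paths: pull the export into a staging directory on the new cluster's edge node with rsync, then push it into HDFS.

    import os
    import subprocess

    # Hypothetical hosts and paths, for illustration only.
    SRC = "olduser@old-edge:/data/export/"   # export directory on the old edge node
    STAGING = "/staging/export"              # local staging area on the new edge node
    HDFS_PARENT = "/data"                    # parent directory in the new cluster's HDFS

    os.makedirs(STAGING, exist_ok=True)

    # 1. Classic UNIX transfer into the staging area (rsync over ssh).
    subprocess.run(["rsync", "-a", SRC, STAGING], check=True)

    # 2. Push the staged directory into HDFS; it lands as /data/export.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_PARENT], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", STAGING, HDFS_PARENT], check=True)

The staging directory has to hold everything in flight, which is exactly the "big staging area" drawback listed next.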


Classic UNIX methods

● Disadvantages:
  – Needs a big staging area
  – Transfer times are slow
  – Single nodes act as bottlenecks
  – Metadata needs to be copied separately
  – Everything must be stopped during the copy to avoid data loss
  – Total downtime: several hours or days (don't even try if your data is bigger)


Using Amazon S3

● AWS S3 storage is also an option for staging
● Cheaper than VM disks
● Available almost everywhere
● An access key is needed
  – Create a user with only S3 permissions
● Transfer is done using distcp (sketch below)
  – (We'll see more about this later)
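A minimal sketch of the S3 staging route with distcp (the bucket and credentials are placeholders; fs.s3a.* are the standard Hadoop S3A properties, but names can differ with older connectors):

    import subprocess

    BUCKET = "s3a://example-migration-bucket/staging"   # placeholder bucket
    CREDS = [
        "-Dfs.s3a.access.key=AKIAxxxxxxxx",   # S3-only IAM user's access key (placeholder)
        "-Dfs.s3a.secret.key=xxxxxxxx",       # and its secret key (placeholder)
    ]

    # Run on the old cluster: push HDFS data up to the S3 staging bucket.
    subprocess.run(["hadoop", "distcp", *CREDS, "hdfs:///data/warehouse", BUCKET],
                   check=True)

    # Run on the new cluster: pull the staged data down into its HDFS.
    subprocess.run(["hadoop", "distcp", *CREDS, BUCKET, "hdfs:///data/warehouse"],
                   check=True)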


Distcp

● Distcp copies data between two Hadoop clusters
● No staging area needed (Hadoop native)
● High throughput
● Metadata needs to be copied separately
● Clusters need to be connected (see the example below)
  – Via VPN for the hdfs protocol
  – NAT can be used when using webhdfs
● Kerberos complicates matters
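A minimal sketch of a direct cluster-to-cluster copy (NameNode addresses and ports are placeholders; running it from the destination cluster lets the map tasks write locally):

    import subprocess

    # Placeholder NameNode endpoints on the old cluster.
    OLD_HDFS = "hdfs://old-nn.example.com:8020/data/warehouse"         # needs routed/VPN connectivity
    OLD_WEBHDFS = "webhdfs://old-nn.example.com:50070/data/warehouse"  # HTTP, NAT-friendly (9870 on Hadoop 3)
    NEW = "hdfs:///data/warehouse"

    # Direct RPC copy over the VPN; -p preserves ownership, permissions and timestamps.
    subprocess.run(["hadoop", "distcp", "-p", OLD_HDFS, NEW], check=True)
    # Re-running with -update only copies files that changed, useful for a final sync.

    # Alternative when only HTTP(S) connectivity (e.g. through NAT) is available:
    # subprocess.run(["hadoop", "distcp", OLD_WEBHDFS, NEW], check=True)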


Remote cluster access

● As a side note, remote filesystems can also be used outside distcp
● For example, as the LOCATION for Hive tables (sketch below)
● While we're at it...
● We can transform data
  – For example, convert files to Parquet
● Is this the right time?
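A minimal sketch of that idea: HiveQL submitted through beeline that maps an external table onto data still sitting on the old cluster and rewrites it as Parquet on the new one (table name, schema, paths and the JDBC URL are all placeholders):

    import subprocess

    # Placeholder schema, paths and JDBC URL. The external table points at data that
    # still lives on the OLD cluster; the CTAS rewrites it as Parquet on the NEW one.
    HQL = """
    CREATE EXTERNAL TABLE staging_events (id BIGINT, payload STRING)
    LOCATION 'hdfs://old-nn.example.com:8020/data/events';

    CREATE TABLE events STORED AS PARQUET
    AS SELECT * FROM staging_events;
    """

    subprocess.run(
        ["beeline", "-u", "jdbc:hive2://new-hs2.example.com:10000/default", "-e", HQL],
        check=True,
    )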


Extending Hadoop

● Do like the caterpillar!
● We want to step onto the new platform while the old one continues working


Requirements

● Install servers in the new platform
  – Enough to hold ALL data
  – Same OS + config as the original platform
  – Config management tools are helpful for this
● Set up connectivity
  – VPN (private networking) is needed
● Rack-aware configuration: new nodes need to be on a new rack (topology sketch below)
● System times and time zones should be consistent
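Rack awareness is driven by the script configured in net.topology.script.file.name (core-site.xml); a minimal sketch of such a script, with invented subnets standing in for the old and new platforms:

    #!/usr/bin/env python3
    # Minimal rack-topology script for net.topology.script.file.name (core-site.xml).
    # Hadoop invokes it with one or more IPs/hostnames and expects one rack path per line.
    # The subnets are invented: 10.0.x = old platform, 10.8.x = new platform (over the VPN).
    import sys

    def rack_for(host):
        if host.startswith("10.0."):
            return "/old-platform/rack1"
        if host.startswith("10.8."):
            return "/new-platform/rack1"
        return "/default-rack"

    for arg in sys.argv[1:]:
        print(rack_for(arg))

With this in place, HDFS sees the new nodes as a separate rack, which is what drives the copy described next.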


Starting the copy

● New nodes will have a DataNode role
● No computing yet (YARN, Impala, etc.)
● DataNode roles will be stopped at first
● When started:
  – If there is only one rack in the original platform, the copy process will begin immediately (existing blocks now violate the multi-rack placement policy, so the NameNode re-replicates them onto the new rack)
  – If there is more than one rack in the original, manual intervention will be required
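Once the new DataNodes are started, the re-replication onto the new rack can be followed with standard HDFS tooling; a minimal sketch (read-only commands, safe to run repeatedly):

    import subprocess

    # Both commands report under-replicated / mis-replicated block counts,
    # which shrink as data spreads onto the new rack.
    subprocess.run(["hdfs", "dfsadmin", "-report"], check=True)   # per-DataNode capacity and usage
    subprocess.run(["hdfs", "fsck", "/"], check=True)             # summary incl. under-/mis-replicated blocks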


Transfer speed

● Two parameters affect the data transfer speed (example below):
  – dfs.datanode.balance.bandwidthPerSec
  – dfs.namenode.replication.work.multiplier.per.iteration
● No jobs are launched on the new nodes
  – Data flow is almost exclusively the copy
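A minimal sketch of how those two knobs might be raised for the duration of the copy (the values are examples only, not recommendations):

    import subprocess

    # dfs.datanode.balance.bandwidthPerSec can be raised at runtime,
    # in bytes per second (100 MB/s in this example).
    subprocess.run(
        ["hdfs", "dfsadmin", "-setBalancerBandwidth", str(100 * 1024 * 1024)],
        check=True,
    )

    # dfs.namenode.replication.work.multiplier.per.iteration is a NameNode-side
    # hdfs-site.xml property (default 2); raising it lets the NameNode schedule
    # more block replications per iteration:
    #
    #   <property>
    #     <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    #     <value>10</value>   <!-- example value -->
    #   </property>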


Moving master roles

● When possible, take advantage of HA:
  – ZooKeeper (just add two)
  – NameNode
  – ResourceManager
● Others need to be migrated manually (sketch below):
  – The Hive metastore DB needs to be copied
  – Having a DNS name for the DB helps
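A minimal sketch of the manual metastore move, assuming a MySQL-backed metastore (hosts, user, DB name and the DNS alias are placeholders; credentials are assumed to come from an option file such as ~/.my.cnf):

    import subprocess

    # Dump the Hive metastore DB on the old host and reload it on the new one.
    dump = subprocess.run(
        ["mysqldump", "-h", "old-db.example.com", "-u", "hive", "metastore"],
        capture_output=True, check=True,
    )
    subprocess.run(
        ["mysql", "-h", "new-db.example.com", "-u", "hive", "metastore"],
        input=dump.stdout, check=True,
    )

    # Repointing a DNS alias (e.g. metastore-db.example.com) at the new host means
    # javax.jdo.option.ConnectionURL in hive-site.xml does not have to change.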


Moving data I/O

● Once data is copied (fully or most of it), new computation roles will be deployed:
  – NodeManager
  – Impalad
● Roles will be stopped at first
● Auxiliary nodes (front-end, app nodes, etc.) need to be deployed in the new platform
● A planned intervention (at a low-usage time) needs to take place


During the intervention

● The cluster is stopped
● If necessary, client configuration is redeployed
● Services are started and tested in this order (scripted sketch below):
  – ZooKeeper
  – HDFS
  – YARN (only for the new platform)
  – Impala (only for the new platform)
● Auxiliary services in the new platform are tested
● Green light? Change the DNS for the entry points
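If the cluster is managed by Cloudera Manager, that ordered start can be scripted against its REST API; a minimal sketch with the requests library (CM host, credentials, cluster and service names are placeholders, and the API version prefix varies between CM releases):

    import requests

    CM = "http://cm.example.com:7180/api/v19"   # placeholder CM host and API version
    AUTH = ("admin", "admin")                   # placeholder credentials
    CLUSTER = "cluster1"                        # placeholder cluster name

    # Start order matters: coordination first, then storage, then compute.
    for service in ["zookeeper", "hdfs", "yarn", "impala"]:
        resp = requests.post(
            f"{CM}/clusters/{CLUSTER}/services/{service}/commands/start",
            auth=AUTH,
        )
        resp.raise_for_status()
        print(service, "start command issued; test it before moving on")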


Final picture

[diagram slide]

Conclusions and afterthoughts

● Minimal downtime, similar to non-Hadoop planned maintenance
● Data and service are never at risk
● Hadoop tools are used to solve a Hadoop problem
● No user impact: no change in data or access
● Kerberos is not an issue (same REALM + KDC)

Thank you!

Carlos Izquierdo
cizquierdo@datatons.com

www.datatons.com
