Big Migrations: Moving Elephant Herds by Carlos Izquierdo


Upload: big-data-spain

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Big Migrations: Moving elephant herds by Carlos Izquierdo
Page 2: Big Migrations: Moving elephant herds

Page 3: Motivation

● Everybody wants to jump into Big Data
● Everybody wants their new setup to be cheap

– Cloud is an excellent option for this

● These environments generally start as a PoC
– They should be re-implemented

– Sometimes they are not

Page 4: Motivation

● You may need to move your Hadoop cluster
– You want to reduce costs

– You need more performance

– Because of corporate policy

– For legal reasons

● But moving big data volumes is a problem!
– Example: 20 TB at 100 MB/s ≈ 2 ½ days
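The estimate is easy to reproduce. As a sketch, the helper below computes transfer time in days from a size in TB and a rate in MB/s (decimal units assumed):

```shell
# Transfer-time estimate: days = size / rate / 86400 (size converted to MB).
days() { awk -v tb="$1" -v mbps="$2" 'BEGIN { printf "%.1f", tb * 1e6 / mbps / 86400 }'; }

echo "20 TB at 100 MB/s: $(days 20 100) days"   # ~2.3 days
echo "20 TB at  10 MB/s: $(days 20 10) days"    # ~23.1 days
```

Even on a fully saturated gigabit link (~100 MB/s), 20 TB takes days; at 10 MB/s it takes weeks.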

Page 5: Initial idea

● Set up a second cluster in the new environment
● The new cluster is initially empty
● We need to populate it

Page 6: Classic UNIX methods

● Well-known file transfer technologies:
– (s)FTP

– Rsync

– NFS + cp

● You need to set up a staging area
● This acts as an intermediate space between Hadoop and the classic UNIX world
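The round trip through the staging area might look like this (paths and the edge-node hostname are placeholders, not from the talk):

```shell
# 1) Export from the old cluster's HDFS to local staging disk
hadoop fs -get /data/warehouse /staging/warehouse

# 2) Ship the files to the new environment
rsync -av /staging/warehouse/ new-edge-node:/staging/warehouse/

# 3) On the new cluster's edge node, import into HDFS
hadoop fs -put /staging/warehouse /data/warehouse
```

Note that the staging disk must hold a full extra copy of the data, and every byte crosses a single node twice.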

Page 7: Classic UNIX methods (diagram)

Page 8: Classic UNIX methods

● Disadvantages:
– Needs a big staging area

– Transfer times are slow

– Single nodes act as bottlenecks

– Metadata needs to be copied separately

– Everything must be stopped during the copy to avoid data loss

– Total downtime: several hours or days (don't even try if your data is bigger)

Page 9: Using Amazon S3

● AWS S3 storage is also an option for staging
● Cheaper than VM disks
● Available almost everywhere
● An access key is needed

– Create a user only with S3 permissions

● Transfer is done using distcp
– (We'll see more about this later)
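A sketch of the S3 round trip with distcp; the bucket name is a placeholder, and the credentials should belong to an IAM user that only has S3 permissions:

```shell
# Old cluster -> S3 staging bucket
hadoop distcp \
  -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  hdfs:///data/warehouse s3a://staging-bucket/warehouse

# S3 staging bucket -> new cluster (run from the new cluster)
hadoop distcp \
  -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  s3a://staging-bucket/warehouse hdfs:///data/warehouse
```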

Page 10: Using Amazon S3 (diagram)

Page 11: Distcp

● Distcp copies data between two Hadoop clusters
● No staging area needed (Hadoop native)
● High throughput
● Metadata needs to be copied separately
● Clusters need to be connected

– Via VPN for hdfs protocol

– NAT can be used when using webhdfs

● Kerberos complicates matters
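The two connectivity options map to two invocations (NameNode hostnames and ports are placeholders; the webhdfs port varies by Hadoop version):

```shell
# Over a VPN, the native HDFS RPC protocol can be used directly:
hadoop distcp hdfs://old-nn:8020/data hdfs://new-nn:8020/data

# Through NAT or firewalls, webhdfs (plain HTTP) is easier to route:
hadoop distcp webhdfs://old-nn:50070/data hdfs://new-nn:8020/data
```

Distcp runs as a MapReduce job, so every node in the source cluster participates in the copy rather than a single staging host.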

Page 12: Distcp (diagram)

Page 13: Remote cluster access

● As a side note, remote filesystems can also be used outside distcp

● For example, as LOCATION for Hive tables
● While we're at it...
● We can transform data

– For example, convert files to Parquet

● Is this the right time?
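As a sketch of the idea, a Hive table can point straight at the remote cluster's HDFS, and a CTAS statement can rewrite it as Parquet along the way (table name, schema, and paths are hypothetical):

```shell
hive -e "
  CREATE EXTERNAL TABLE events_remote (id BIGINT, payload STRING)
  LOCATION 'hdfs://new-nn:8020/data/events';

  CREATE TABLE events_parquet
  STORED AS PARQUET
  AS SELECT * FROM events_remote;
"
```

Whether a migration is the right moment for format conversions is a judgment call: it saves a second pass, but adds moving parts to an already delicate operation.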

Page 14: Extending Hadoop

● Do like the caterpillar!
● We want to step onto the new platform while the old one continues working

Page 15: Requirements

● Install servers in the new platform
– Enough to hold ALL data

– Same OS + config as original platform

– Config management tools are helpful for this

● Set up connectivity
– VPN (private networking) is needed

● Rack-aware configuration: new nodes need to be on a new rack

● System times and time zones should be consistent
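Rack awareness is driven by a topology script: HDFS invokes it with one or more IPs or hostnames and reads back one rack path per argument. A minimal sketch, assuming hypothetical subnets for the old and new sites:

```shell
#!/bin/sh
# Map each node to a rack path. Putting the new site on its own "rack"
# makes HDFS place one replica of every block on the new platform.
rack_of() {
  case "$1" in
    10.0.1.*) echo /old-site/rack1 ;;
    10.0.2.*) echo /new-site/rack1 ;;
    *)        echo /default-rack ;;
  esac
}

for node in "$@"; do
  rack_of "$node"
done
```

The script is wired in via the `net.topology.script.file.name` property in core-site.xml.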

Page 16: Requirements (diagram)

Page 17: Starting the copy

● New nodes will have a DataNode role
● No computing yet (YARN, Impala, etc.)
● DataNode roles will be stopped at first
● When started:

– If there is only one rack in the original platform, the copy process will begin immediately

– If there is more than one rack in the original, manual intervention will be required
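Once the new DataNodes are started, replication progress can be watched from the HDFS side (the path is illustrative):

```shell
# Per-block replica and rack placement for a directory tree
hdfs fsck /data -blocks -locations -racks

# Per-DataNode capacity and usage, old and new nodes side by side
hdfs dfsadmin -report
```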

Pages 18-21: Starting the copy (diagram sequence)

Page 22: Transfer speed

● Two parameters affect the data transfer speed:
– dfs.datanode.balance.bandwidthPerSec

– dfs.namenode.replication.work.multiplier.per.iteration

● No jobs are launched in the new nodes
– Data flow is almost exclusively the copy
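These two properties live in hdfs-site.xml; the values below are illustrative and should be tuned to the available network:

```xml
<!-- hdfs-site.xml: example values, not defaults -->
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <!-- 100 MB/s per DataNode -->
  <value>104857600</value>
</property>
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <!-- schedule more re-replication work per NameNode iteration -->
  <value>10</value>
</property>
```

The bandwidth cap can also be changed on a live cluster without a restart: `hdfs dfsadmin -setBalancerBandwidth 104857600`.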

Page 23: Transfer speed (diagram)

Page 24: Moving master roles

● When possible, take advantage of HA:
– Zookeeper (just add two)

– NameNode

– ResourceManager

● Others need to be migrated manually:
– Hive metastore DB needs to be copied

– Having a DNS name for the DB helps
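Assuming a MySQL-backed metastore (hostnames are placeholders), the copy is a plain dump and restore:

```shell
# Dump the metastore DB from the old environment...
mysqldump --single-transaction -h old-db-host -u hive -p metastore > metastore.sql

# ...and load it into the new one
mysql -h new-db-host -u hive -p metastore < metastore.sql
```

If hive-site.xml references the database by a DNS name (e.g. a hypothetical `metastore-db.internal`) rather than a host, only the DNS record needs to change after the restore.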

Page 25: Moving master roles (diagram)

Page 26: Moving data I/O

● Once data is copied (fully or most of it), new computation roles will be deployed:
– NodeManager
– Impalad

● Roles will be stopped at first
● Auxiliary nodes (front-end, app nodes, etc.) need to be deployed in the new platform
● A planned intervention (at a low-usage time) needs to take place

Page 27: Moving data I/O (diagram)

Page 28: During the intervention

● The cluster is stopped
● If necessary, client configuration is redeployed
● Services are started and tested in this order:

– Zookeeper

– HDFS

– YARN (only for the new platform)

– Impala (only for the new platform)

● Auxiliary services in the new platform are tested
● Green light? Change the DNS for the entry points

Page 29: Final picture (diagram)

Page 30: Conclusions and afterthoughts

● Minimal downtime, similar to non-Hadoop planned works

● Data and service are never at risk
● Hadoop tools are used to solve a Hadoop problem
● No user impact: no change in data or access
● Kerberos is not an issue (same REALM + KDC)

Page 31: Thank you!

Carlos Izquierdo ([email protected])

www.datatons.com