
Page 1

CREATE A HADOOP CLUSTER AND MIGRATE 39 PB DATA PLUS 150 000 JOBS/DAY

Stuart Pook ([email protected] @StuartPook)

Page 2

Anna Savarin, Anthony Rabier, Meriam Lachkar, Nicolas Fraison, Rémy Gayet, Rémy Saissy, Stuart Pook, Thierry Lefort & Yohan Bismuth

BROUGHT TO YOU BY LAKE

Page 3

2014, PROBLEM: A CLUSTER CRITICAL FOR BOTH STORAGE AND COMPUTE

still running CDH4 and CentOS 6
data centre is full

39 petabytes raw storage
13 404 CPUs (26 808 threads)
105 terabytes RAM
40 terabytes imported per day
> 100 000 jobs per day

Page 4

NO DISASTER PLAN :-(

Defining a disaster: what do we have to lose?

the data (“data is our blood”)
write access (needed for compute)
compute

72 hours of buffers
1st of month billing is sacred
prediction models updated every 6 hours
margin feedback loop

Page 5

BACKUPS?

Each data block is on 3 machines, but there is no protection against:

the loss of a data centre
the loss of two racks (the same afternoon?)
the loss of three machines (in the same hour?)
operator error (a “fat-finger”)

Page 6

BACKUP IN THE CLOUD?

backup too long
restore too long

Page 7

COMPUTE IN THE CLOUD?

where to find 27 000 threads?
reservation too expensive
no need for elasticity as the cluster runs at 100% load
growth too fast: 200 → 2600 (×12 in 2½ years)
requires reserved instances → same price
Hadoop needs network capacity for east-west traffic
Criteo already has datacentres
Criteo likes bare metal
in-house is cheaper and better

Page 8

A NEW CLUSTER TO:

protect the data
duplicate the compute
migrate to CDH5 and CentOS 7
new functionality → Spark
implement our capacity planning
increase the compute
increase storage capacity

Page 9

DIGRESSION: WHEN TO MIGRATE?

Migrating from CDH4 to CDH5 seems easy, but a migration can go wrong:

Huge fsimage (9 GB)
Downtime must be limited
We have a new cluster → use it for an offline upgrade

Do not do everything at once
Impact on users as python, mono etc. change
→ do the OS migration to CentOS 7 later

Page 10

HOW TO BACKUP: A SPARE CLUSTER?

No: two clusters, master/slave, whose roles can swap
Essential jobs run on the master
Other jobs are split between the two clusters
Copy output from essential jobs to the slave

Page 11

IMPORT NEW DATA

We have an L:
kafka feeds one cluster
which copies to the other

We could have used a Y (kafka feeding both clusters), but:
intercontinental links would be overloaded
different data could be imported on each side

→ two clusters, master/slave

Page 12

THE MASTER FAILS …

Turn the L
Move essential jobs
Stop non-essential jobs
Stop development & ad-hoc jobs
To be implemented

Page 13

AT CRITEO HADOOP GREW FROM POC TO PROD

AD-HOC FIRST CLUSTER

created as a POC
grew without a master plan
more users
more essential
now the datacentre is full

Page 14

THIS TIME: A MULTI-YEAR PLAN

The cluster will start big and it will grow …

Page 15

THE PLAN

Page 16

A NEW DATA-CENTRE WITH A NEW NETWORK

[network diagram: 4 groups of 2 to 8 spines; 4× 40G uplinks per ToR; 22 server racks, up to 96 servers per rack]

Page 17

BUILD A NEW DATA-CENTRE IN 9 MONTHS

it's impossible, but we didn't know, so we did it
a new manufacturer
chose one server model for everything
chose a bad RAID card
saturated the 10 Gb/s links

Page 18

CHANGE THE HARDWARE ...

call for tenders
three bidders
we bought three 10-node clusters

16 (or 12) 6 TB SAS disks
2× Xeon E5-2650L v3, 24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card

Page 19

TEST THE HARDWARE ...

we tested:
disk bandwidth
network bandwidth

by executing:
teragen, with zero extra replication to saturate the disks
teragen, with high replication (×9) to saturate the network
terasort
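For illustration, these benchmarks boil down to commands like the following (a sketch, assuming the standard hadoop-mapreduce-examples jar; row counts and paths are invented):

# 10^10 rows of 100 bytes = 1 TB; a single replica (no extra copies) stresses the disks
hadoop jar hadoop-mapreduce-examples.jar teragen -Ddfs.replication=1 10000000000 /bench/teragen-r1
# 9 replicas of each block stress the network instead
hadoop jar hadoop-mapreduce-examples.jar teragen -Ddfs.replication=9 10000000000 /bench/teragen-r9
# sort the generated data end to end
hadoop jar hadoop-mapreduce-examples.jar terasort /bench/teragen-r1 /bench/terasort-out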

Page 20

STRESS THE HARDWARE ...

load the hardware
similar performance

one manufacturer eliminated because:
4 disks delivered broken
other disks failed
20% electrical overconsumption

We chose the densest and least expensive → Huawei

Page 21

HARDWARE MIX ...

integrating several manufacturers is an investment

HP in the existing data-centres
Huawei and HP in the new data-centres
both in the old preprod cluster
Huawei in the new preprod cluster

→ preserves our freedom of choice at each order

Page 22

THE NEW MACHINES ARRIVE …

We rack
We configure using our Chef cookbooks
Infrastructure is code, in git
Automate everything to scale
System state in RAM (diskless)

Page 23

A DIGRESSION: WHY DISKLESS?

+ Disks for the data, not for the system
+ Chef convergence assured at each reboot
+ Manual operations are lost
+ Maximises storage density
− Longer reboots
− More infrastructure required for boot (Chef server, RPMs, etc.)
− 3 GB of memory used for the rootfs
− Nobody else at Criteo does it

We are going on-disk for master nodes (a boot sketch follows)
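Not from the talk: a hypothetical illustration of what a diskless boot entry can look like, assuming PXE with pxelinux and dracut's livenet module (kernel names and the image URL are invented, not Criteo's setup):

# pxelinux.cfg entry: fetch the root image over HTTP and keep it in RAM
LABEL centos-diskless
  KERNEL vmlinuz
  INITRD initrd.img
  APPEND root=live:http://boot.example.net/rootfs.squashfs rd.live.ram=1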

Page 24

DON’T MIX O/S AND DATA ON THE SAME DISK

we did and it didn’t work

HDFS is I/O bound
O/S traffic slows it down

HDFS traffic is sequential
O/S traffic is random
mixing leads to extra seeks

disk failures

Page 25

WE TEST HADOOP WITH 10 HOUR PETASORTS

one job that uses all the resources → increase all per-job limits
one user that uses all the resources → increase all per-user limits
250 GB application master
trade-off between size of containers & number of containers (example settings below)
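Raising "all per-job limits" means touching settings like the following (real property names, invented values; the 256 GB value matches the application master above):

<!-- yarn-site.xml: allow very large containers -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>262144</value>
</property>
<!-- mapred-site.xml: a ~250 GB application master for a petasort -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>262144</value>
</property>
<!-- capacity-scheduler.xml: let a single user take the whole queue -->
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>100</value>
</property>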

Page 26

IT CRASHES

Linux bug → namenode crashes
Need to update the kernel
On all 650 nodes
But we already have some users

Page 27

ROLLING REBOOT WITHOUT DOWNTIME

Good training
Confirms a reboot of a 650 node cluster is possible
Found configuration errors at first reboot after build

Rack by rack → OK: jobs killed once
Node by node → fail: some jobs killed 600 times

Applies to datanode and nodemanager restarts
Now we can restart a 1100 node cluster with no impact (a restart loop is sketched below)
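A minimal sketch of the rack-by-rack loop, reusing the parallel + ssh pattern that appears later in this deck (the per-rack host files, CDH service name and health check are assumptions):

# restart datanodes one rack at a time, waiting for HDFS to heal in between
for rack in racks/*; do
  parallel -a "$rack" ssh {} sudo service hadoop-hdfs-datanode restart
  # block until no blocks are under-replicated before touching the next rack
  while hdfs dfsadmin -report | grep -q 'Under replicated blocks: [^0]'; do
    sleep 60
  done
done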

Page 28

TUNE THE HADOOP CONFIGURATION

Lots of petasorts and petagens
Run as many jobs as possible
Adapt the configuration to the scale of our cluster

namenode has 152 GB RAM for 238 266 000 blocks
increase the bandwidth and time limits for checkpoints (see the settings below)
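The checkpoint limits live in hdfs-site.xml; these are real property names, but the values here are invented:

<!-- cap the fsimage transfer bandwidth (bytes/s) so checkpoints don't hurt the namenode -->
<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>104857600</value>
</property>
<!-- give a 9 GB fsimage time to transfer before the connection is aborted (ms) -->
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>3600000</value>
</property>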

Page 29

DISKS FAIL

After 9 months, 100 (1%) disks have an error
90% work again after removing and reinserting them
An acceptable failure rate
But too many unknowns
Collect statistics (sketched below)
Work with the manufacturer
Replace the disk controller on all machines!
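A sketch of the statistics collection, using smartmontools and the parallel + ssh pattern from this deck (the device list and bash on the nodes are assumptions; SAS disks behind a RAID card may need smartctl -d options):

# dump SMART health and attributes for all 16 data disks of every node
parallel -a prod_machines ssh {} 'for d in /dev/sd{a..p}; do sudo smartctl -H -A $d; done' > smart-report.txt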

Page 30

DON’T PANIC

Lots of disks and machines → problems expected
11 000 disks → even rare errors will happen
machines running at full load 24/7 → even rare errors will happen

Page 31

ADD A MANUFACTURER TO OUR FLEET

The manufacturers are motivated

We need to:
understand the problems with their hardware
find solutions with their help
update the firmware (rack by rack reboot)
train the DevOps for interventions
estimate the size of the stock of spare parts

Page 32

HADOOP IS ONLINE

We need to help Criteo use the cluster
The real problems start when the clients arrive
The clients test and then we need to go live

Page 33

MAKE THE NEW CLUSTER PROD READY

We copy 10 PB over a 20 Gb/s connection
> 48 days if all goes well
But we have problems on the network side
The copy blocks prod traffic
We have a router problem (for a month)
Workaround with Linux traffic control (tc) on each node (sketched below)
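For illustration: the copy itself is the kind of job DistCp does, and the workaround caps traffic with tc. Both commands below are sketches; the map count, per-map bandwidth, interface and rates are invented, not Criteo's values:

# inter-cluster copy: 200 maps, each limited to 50 MB/s
hadoop distcp -m 200 -bandwidth 50 hdfs://old-cluster/data hdfs://new-cluster/data

# per-node cap so the copy cannot starve prod traffic
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 5gbit ceil 9gbit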

Page 34

MONITOR EVERYTHING

HDFS
  space, blocks, files, quotas, probes
  namenode: missing blocks, GC, checkpoints, safemode, QPS, datanode lists
  datanodes: disks, read/write throughput, space

YARN
  job duration, queue length, memory & CPU usage
  ResourceManager: apps pending, QPS

zookeeper: availability, probes
network
hardware
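Most of the HDFS and YARN metrics above are exposed by the daemons' built-in /jmx endpoint; for example, assuming CDH5's default ports:

# namenode: blocks, missing blocks, safemode, checkpoint state ...
curl 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
# resourcemanager: pending apps, queue metrics ...
curl 'http://resourcemanager:8088/jmx'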

Page 35

MOVE JOBS TO THE NEW CLUSTER

Users are apprehensive
The first users are very happy

transfer everything → halt → upgrade → run in parallel

CDH4 → CDH5 OK
reliability ~ OK

Page 36

MASSIVE MIGRATION FAILURE

as they moved to the new cluster, users found spare capacity and thus used more resources
no more compute resources on the new cluster before all jobs had moved
old cluster still running jobs
some jobs moved back
unable to buy more machines fast enough

Page 37

Page 38

ADDING DATANODES KILLS THE NAMENODE

At ~170 million blocks (each block needs memory):

GC (garbage collection) too long
→ datanode reports are delayed
→ datanodes are lost
→ block replication starts
→ datanodes return
→ extra blocks are freed
→ more GCs
→ a vicious circle

Page 39

CHOOSE GC FOR NAMENODE

serial: not efficient on multi-thread
parallel: long pauses + high throughput
Concurrent Mark Sweep: short pauses + lower throughput
G1: divided GC time by ~5 (flags sketched below)
Azul: under test
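Switching the namenode's collector is a matter of JVM flags in hadoop-env.sh; a sketch with real flags and an invented pause target (the heap matches the 152 GB mentioned earlier):

# hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Xms152g -Xmx152g -XX:+UseG1GC -XX:MaxGCPauseMillis=400 $HADOOP_NAMENODE_OPTS"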

Page 40

Page 41

CURRENT SITUATION

Two different clusters in parallel
Where to run each job?
Which data to copy?
Open questions
Ad-hoc procedures
Was not meant to happen

Page 42

HOW REDUNDANT ARE WE?

A single incident must not impact both clusters
But the same Chef code is used on both clusters
Test >24 hours in pre-prod
Roll out carefully

Page 43

WHAT ABOUT OPERATOR ERROR?

A “fat-finger”:

parallel -a prod_machines ssh {} sudo rm -r /var/chef/cache/lsi

parallel -a prod_machines ssh {} sudo rm -r / var/chef/cache/lsi

The stray space after the first slash in the second command turns it into rm -r /, deleting the root filesystem of every machine in prod_machines.

Page 44

WE HAVE

5 prod clusters
3 pre-prod clusters
2 running CDH4
6 running CDH5
2395 datanodes
42 120 cores
135 PB disk space
692 TB RAM
200 000+ jobs/day

Page 45

[email protected] @StuartPook www.criteolabs.com