
Hadoop – The War Stories

Running Hadoop in large enterprise environment

Nikolai Grigoriev (ngrigoriev@gmail.com, @nikgrig), Principal Software Engineer, http://sociablelabs.com

Agenda

● Why Hadoop?

● Planning Hadoop deployment

● Hadoop and real hardware

● Understanding the software stack

● Tuning HDFS, MapReduce and HBase

● Troubleshooting examples

● Testing your applications

Disclaimer: this presentation is based on combined work experience from more than one company and represents the author's personal point of view on the problems discussed in it.

Why Hadoop (and why we decided to use it)?

● Need to store hundreds of TB of data

● Need to process it in parallel

● Desire to have both storage and processing horizontally scalable

● Having an open-source platform with commercial support

Our application

Application servers (many :) )

Log processors

“ETL process”

Our application in numbers

● Thousands of user sessions per second

● Average session log size: ~30 KB, 3-7 events per log

● Target retention period – at least ~90 days

● Redundancy and HA everywhere

● Pluggable “ETL” modules for additional data processing

Main problem

Team had no practical knowledge of Hadoop, HDFS and HBase…

...and there was nobody at the company to help

But we did not realize...

It was not THE ONLY problem wewere about to face!

First fight – capacity planning

● Tons of articles are written about Hadoop capacity planning

● Architects may spend months making educated guesses

● Capacity planning is really about finding the amount of $$$ to be spent on your cluster for the target workload
– If we had an infinite amount of $$$, why would we bother at all? ;)

Hadoop performance limiting factors

It is all about the balance

● Your Hadoop cluster and your apps use all these resources at different times

● Over-provisioning one resource usually leads to a shortage of another one - wasted $$$

What can we say about an app?

● It is going to store X TB of data
– Amount of storage (do not forget the RF!)
– Accommodate for growth and failures

● It is going to ingest data at Y MB/s
– Your network speed and number of nodes

● Latency
– More and faster HDDs
– More RAM
– More nodes
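
To make X and Y concrete, here is a rough back-of-the-envelope sketch in Java. The ~30 KB log size and 90-day retention come from the earlier slides; the session rate, replication factor, headroom and per-node disk figures are illustrative assumptions, not measurements from the real cluster.

// Back-of-the-envelope Hadoop capacity estimate: every input below is an assumption.
public class CapacitySketch {
    public static void main(String[] args) {
        double sessionsPerSec    = 1000;           // assumed ingest rate ("thousands per second")
        double avgLogKb          = 30;             // ~30 KB per session log (from the slides)
        int    retentionDays     = 90;             // target retention (from the slides)
        int    replicationFactor = 3;              // HDFS default RF
        double headroom          = 1.3;            // temp data, growth, failed nodes
        double usableTbPerNode   = 14 * 2 * 0.7;   // 14 disks x 2 TB, ~70% usable fill

        double ingestMbPerSec = sessionsPerSec * avgLogKb / 1024;
        double rawTb = sessionsPerSec * avgLogKb * 86400 * retentionDays
                       / 1024 / 1024 / 1024        // KB -> TB
                       * replicationFactor * headroom;

        System.out.printf("Ingest ~%.0f MB/s, raw storage ~%.0f TB, data nodes ~%.0f%n",
                ingestMbPerSec, rawTb, Math.ceil(rawTb / usableTbPerNode));
    }
}

Crude as it is, a sheet like this already tells you whether the bottleneck will be disks, network or node count, which is exactly the balance discussed above.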

We are a big enterprise... Geeky Hadoop developer vs. Old School Senior IT Guy

Geeky Hadoop developer:
- many “commodity+” hosts
- good but inexpensive networking
- more regular HDDs
- lots of RAM
- I also love cloud…
- my recent OS
- my software configuration
- simple network

Old School Senior IT Guy:
- SANs, RAIDs, SCSI, racks, blades, redundancy, Cisco, HP, fiber optics, 4-year-old rock-solid RHEL, SNMP monitoring…
- what? I am the Boss...

Hadoop cluster vs. old school application servers

● Mostly identical “commodity+” machines
– Probably with the exception of the NN and JT

● Better to have more, simpler machines than fewer monster ones

● No RAID, just JBOD!

● Ethernet: depending on the storage density, bonded 1 Gbit may be enough

● Hadoop achieves with software what used to be achievable with [expensive!] hardware

But still, your application is the driver, not the IT guy!

From Cloudera website – Hadoop machine configuration according to workload

Your job is:

● Educate your IT, get them on your side or at least earn their trust

● Try to build a capacity planning spreadsheet based on what you do know

● Apply common sense to guess what you do not know

● ...and plan a decent buffer

● Set reasonable performance targets for your application

Fight #2 – OMG, our application is slow!!!

● Main part of our application was the MR job merging the logs

● We had committed to deliver X logs/sec on a target test cluster with a sample workload

● We were delivering only ~30% of that

● ...weeks before release :)

● ...and we had run out of other excuses :(

● It was clearly our software and/or configuration

Wait a second – we have a support contract with a Hadoop vendor!

● I mean no disrespect to the vendors!

● But they do not know your application

● And they do not know your hardware

● And they do not know exactly your OS

● And they do not know your network equipment

● They can help you with some tuning, they can help you with bugs and crashes – but they won't be able (or sometimes simply won't be qualified) to do your job!

We are on our own :(

● We realized that our testing methods were not adequate for a Hadoop-based ETL process

● Testing the product end-to-end was too difficult, tracking changes was impossible

● Turnaround was too long; we could not try something quickly and revert

● Observing and monitoring the live system with dummy incoming data was not productive enough

Key to successful testing

● Representative data set

● Ability to repeat the same operation as many times as needed with quick turnaround

● Each engineer had to be able to run the tests and try something

● Establish the key metrics you monitor and try to improve

● A methodical approach – analyze, change, test, be ready to roll back

Our “reference runner”

● Large sample dataset

● “Reset” tool: recreates HBase tables (with predefined regions), cleans HDFS, etc.

● Runner tool: injects the test data, prepares the environment, launches the MR job like the real application, and allows us to quickly rebuild and redeploy the relevant part of the application

● Statistics: any improvements since the last run?

● Manager
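
As a hedged illustration of what the “reset” step can look like in code (not the actual tool): the sketch below assumes the HBase 0.9x-era client API, a hypothetical sessions table with a single cf family, hypothetical hex-prefix split keys and a made-up /staging directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

/** Drops and recreates the test HBase table with predefined regions and wipes the HDFS work dir. */
public class ResetTool {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // 1. Recreate the table with pre-split regions so every run starts from the same layout.
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (admin.tableExists("sessions")) {            // hypothetical table name
            admin.disableTable("sessions");
            admin.deleteTable("sessions");
        }
        HTableDescriptor desc = new HTableDescriptor("sessions");
        desc.addFamily(new HColumnDescriptor("cf"));
        // 16 predefined regions split on the first hex character of the row key (assumption).
        byte[][] splits = new byte[15][];
        String hex = "123456789abcdef";
        for (int i = 0; i < splits.length; i++) {
            splits[i] = Bytes.toBytes(hex.substring(i, i + 1));
        }
        admin.createTable(desc, splits);
        admin.close();

        // 2. Clean the HDFS staging/output area used by the MR job.
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/staging"), true);          // hypothetical path
        fs.mkdirs(new Path("/staging"));
    }
}

Recreating the table with the same predefined regions before every run keeps the region distribution constant, so changes in job runtime reflect the code and configuration, not HBase splitting noise.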

Tuning results

● In two weeks we had a job that ran about 3 times faster

● Tuning was done everywhere – from OS to Hadoop/HBase and our code

● We were confident that the software was ready to go to production

● Over the following 2 years we realized how bad our design was and how it should have been done ;)

Hadoop MapReduce DOs

● Think processes, not threads

● Reuse objects to lower GC overhead

● Snappy data compression is generally good

● Reasonable use of counters provides important information

● For frequently running jobs, the distributed cache helps a lot

● Minimize disk I/O (spills etc.), RAM is cheap

● Avoid unnecessary serialization/deserialization
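
A small sketch of the “reusable objects” and “counters” points, as a hypothetical mapper over session log lines (the tab-separated input format and class name are made up for the example):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Emits (sessionId, event) pairs; reuses Writable instances and counts malformed lines. */
public class SessionEventMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Reused across map() calls: no per-record allocation, less GC pressure.
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical format: sessionId<TAB>eventPayload
        String raw = line.toString();
        int tab = raw.indexOf('\t');
        if (tab <= 0) {
            // Counters are cheap and show up in the job UI/history: use them for data quality.
            context.getCounter("SessionLogs", "MALFORMED_LINES").increment(1);
            return;
        }
        outKey.set(raw.substring(0, tab));
        outValue.set(raw.substring(tab + 1));
        context.write(outKey, outValue);
        context.getCounter("SessionLogs", "EVENTS_EMITTED").increment(1);
    }
}

Snappy compression of the intermediate map output is then just a per-job configuration switch (in MRv1-era names, mapred.compress.map.output=true plus mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec).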

Hadoop MapReduce DONTs

● Small files in HDFS

● Multithreaded programming inside mapper/reducer

● Fat tasks using too much heap

● Any I/O in M-R other than HDFS, ZK or HBase

● Over-complicated code (simple things work better)
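
On the “small files in HDFS” DON'T, one common mitigation is to pack many small inputs into a single SequenceFile so the NameNode tracks far fewer objects and mappers get decent-sized splits; a hedged sketch with made-up /incoming and /packed paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs every small file under /incoming into one SequenceFile keyed by the original file name. */
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/packed/part-00000.seq"),    // hypothetical output
                Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(new Path("/incoming"))) {  // hypothetical input dir
                if (status.isDir()) continue;
                byte[] content = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(content);
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

CombineFileInputFormat on the read side is another option when repacking the data is not practical.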

Fight #3 – Going Production!

● Remember the slide about engineer vs. IT God preferences ;)

● Production hardware was slightly different from the test cluster

● The cluster had been deployed by people who did not know Hadoop

● The first attempt to run the software resulted in a major failure and the cluster was finally handed over to the developers for fixing ;)

Production hardware

● HP blade servers, 32 cores, 128 GB of RAM

● Emulex dual-port 10G Ethernet NICs

● 14 HDDs per machine

● OEL 6.3

● 10G switch modules

● Company hosting center with dedicated networking and operations staff


Step back – 10,000 ft look at Hadoop stack

Hardware

BIOS/Firmware(s)

BIOS/Firmware settings

OS (Linux)

Java (JVM)

Hadoop services

Your application(s)

Network

- Hadoop is not just a bunch of Java apps
- It is a data and application platform
- It can run well, just run, barely run or cause constant headache – depending on how much love it receives :)

Hadoop stack (continued)

● In Hadoop a small problem, sometimes even on a single node, can be a major pain

● Isolating and finding that small problem may be difficult

● Symptoms are often obvious only at high level (e.g. application)

● Complex hardware (like HP) adds more potential problems

Example of one of the problems we had initially

● Jobs were failing because of timeouts

● Numerous I/O errors observed in job and HDFS logs

● This simple test was failing:

$ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192
$ time hdfs dfs -copyFromLocal test8Gb.bin /
Zzz..zzz...zzz...5min...zzz…
real 4m10.002s
user 0m15.130s
sys 0m4.094s

● IT was clueless and did not really bother

● In fact, 8192 MB / (4 * 60 + 10 s) = ~32 MB/s (!?!?!)

● A 10 Gbit network transfers to HDFS at ~160 MB/s

Role of HDFS in Hadoop

● In Hadoop HDFS is the key layer that provides the distributed filesystem services for other components

● Health of HDFS directly (and drastically) affects the health of other components

[Diagram: Map-Reduce and HBase sit on top of HDFS, which holds the data]

So, clearly HDFS was the problem

● But what was the problem with HDFS??

● How exactly does HDFS writing work?

Chasing it down

● Due to node-to-node streaming it was difficult to understand who was responsible

● The theory of “one bad node in the pipeline” was ruled out, as results were consistently bad across the 14-node cluster

● Idea (isolating the problem is good):

$ time hdfs dfs -Ddfs.replication=1 -copyFromLocal test8Gb.bin /
real 0m42.002s
$ time hdfs dfs -Ddfs.replication=2 -copyFromLocal test8Gb.bin /
real 2m53.184s
$ time hdfs dfs -Ddfs.replication=3 -copyFromLocal test8Gb.bin /
real 3m41.072s

● 8192 MB / 42 s = ~195 MB/s – hmmm….

Discoveries

● To make an even longer story short...
– A bug in the “cubic” TCP congestion control algorithm in the Linux kernel

– NIC firmware was too old

– Kernel driver for Emulex 10G NICs was too old

– Only one out of 8 NIC RX queues was enabled on some hosts

– A number of network settings were not appropriate for 10G network

– The “irqbalance” process (due to a kernel bug) was locking NIC RX queues by “losing” NIC IRQ handlers

– ...

More discoveries

– Nodes were set up multi-homed, even though HDFS at that time did not support that

– Misconfigured DNS and reverse DNS

● On the disk I/O side:
– Bad filesystem parameters

– Read-ahead settings were wrong

– Disk controller firmware was old

HDFS “litmus” test - TestDFSIO

13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write

13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013

13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16

13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0

13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779

13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578

13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971

13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876

13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read

13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013

13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16

13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0

13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676

13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125

13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208

13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683

13/03/13 16:31:31 INFO fs.TestDFSIO:

Fight #4 – tuning Hadoop

● Why do people tune things (IT was not interested ;) )?

● With your own expensive hardware you want the maximum IOPS and CPU power for $$$ you have paid

● Not to mention that you simply want your apps to run faster

● Tuning is an endless process, but the 80/20 rule works perfectly

Even before you have something to tune….

● Pick reasonably good hardware but do not go high-end

● Same for network equipment

● Hadoop scales well and the redundancy is achieved by software

● Adding more nodes is almost always better than going for extra per-node power and/or storage space

● Simpler systems are easier to tune, maintain and troubleshoot

● Different machines for master nodes

Tuning the hardware and BIOS

● Updating BIOS and firmwares to recent versions

● Disabling dynamic CPU frequency scaling

● Tuning memory speed, power profile

● Disk controller, tune disk cache

OS Tuning

● Pick the filesystem (ext3, ext4, XFS...), its parameters (reserved blocks 0%) and mount options (noatime, nodiratime, barriers, etc.)

● I/O scheduler depending on your disks and tasks

● Read-ahead settings

● Disable swap!

● irqbalance for big machines

● Tune other parameters (number of FDs, sockets)

● Install the major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node

Network tuning

● Test your TCP performance with iperf, ttcp or any other tools you like

● Know your NICs well, install the right firmware and kernel modules

● Tune your TCP and IP parameters (work harder if you have expensive 10G network)

● If your NIC supports TCP offload and it works – use it

● txqueuelen, MTU 9000 (if appropriate), HDFS is chatty

● Learn ethtool and see what it can do for you

● Basic IP networking set-up (DNS etc) has to be 100% perfect

JVM tuning

● Hadoop allows you to set JVM options for all processes

● Your DataNodes, NameNode and HBase RegionServers are going to work hard and you need to help them deal with your workload

● If your MR code is well designed you will most likely NOT need to tune JVM for MR tasks

● Your main enemy will be GC – until you become at least allies, if not friends :)

Tuning Hadoop services

● NameNode deals with many connections and needs ~150 bytes per HDFS block

● NameNode and DataNode are highly concurrent; the latter needs many threads

● Use HDFS short-circuit reads if appropriate

● ZooKeeper needs to handle enough connections

● HBase uses LOTS of heap

● Reuse JVMs for MR jobs if appropriate
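
Two of the knobs above as a hedged sketch: the short-circuit read switch (Hadoop 2.x property names, normally set in hdfs-site.xml on both DataNodes and clients) and MRv1 task-JVM reuse; the socket path is a made-up example, so verify the names and defaults against your distribution.

import org.apache.hadoop.conf.Configuration;

/** Sketch of two of the knobs above; property names are Hadoop 2.x (short-circuit) and MRv1 (JVM reuse). */
public class ServiceTuningSketch {
    public static Configuration clientConf() {
        Configuration conf = new Configuration();

        // HDFS short-circuit reads: client-side switch; the DataNodes must expose the same
        // domain socket, and in practice both settings live in hdfs-site.xml cluster-wide.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        conf.set("dfs.domain.socket.path", "/var/run/hdfs/dn_socket");   // hypothetical path

        // Reuse task JVMs (MRv1-only knob): -1 means "reuse without limit", useful for jobs
        // with many short-lived tasks; keep the default of 1 if tasks leak memory or state.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        return conf;
    }
}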

Tuning MapReduce tasks (that means tuning for your code and data)

● If you run different MR jobs, consider tuning parameters for each of them, not once and for all of them

● Configure the job scheduler to enforce your SLAs

● Estimate the resources needed for each job

● Plan how you are going to run your jobs
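
Per-job (rather than cluster-wide) tuning usually just means overriding the relevant properties on each job's configuration; a hedged sketch with MRv1-era property names, a hypothetical “etl” queue and placeholder numbers:

import org.apache.hadoop.conf.Configuration;

/** Illustrative per-job overrides: the numbers are placeholders, the property names are MRv1-era. */
public class PerJobTuning {
    /** Returns a copy of the cluster config sized for the (hypothetical) log-merge job. */
    public static Configuration mergeJobConf(Configuration clusterConf) {
        Configuration conf = new Configuration(clusterConf);   // do not mutate the shared config
        conf.set("mapred.child.java.opts", "-Xmx1536m");   // task heap for this job only
        conf.setInt("io.sort.mb", 512);                    // bigger sort buffer to cut map-side spills
        conf.setInt("mapred.reduce.tasks", 48);            // sized from this job's own data volume
        conf.set("mapred.job.queue.name", "etl");          // scheduler queue backing the SLA (hypothetical)
        return conf;
    }
}

Keeping one such factory method per job makes the per-job settings visible in code review instead of being buried in a shared mapred-site.xml.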

Tuning your own code

● Test and profile your complex MR code outside of Hadoop (your savings will scale too!)

● Check for GC overhead

● Use reusable objects

● Avoid using expensive formats like JSON and XML

● Anything you waste is multiplied by the number of rows and the number of tasks!

● Evaluate the need for intermediate data compression
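
For the “test and profile outside of Hadoop” point, MRUnit (or plain JUnit around the mapper logic) gives sub-second turnaround; a hedged sketch reusing the hypothetical SessionEventMapper from the MapReduce DOs example:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

/** Unit-tests the (hypothetical) SessionEventMapper without starting a cluster. */
public class SessionEventMapperTest {
    private MapDriver<LongWritable, Text, Text, Text> driver;

    @Before
    public void setUp() {
        driver = MapDriver.newMapDriver(new SessionEventMapper());
    }

    @Test
    public void emitsSessionIdAndPayload() throws Exception {
        driver.withInput(new LongWritable(0), new Text("session-42\tpage_view"))
              .withOutput(new Text("session-42"), new Text("page_view"))
              .runTest();
    }

    @Test
    public void countsMalformedLines() throws Exception {
        driver.withInput(new LongWritable(0), new Text("no-tab-here"))
              .runTest();   // no output expected; the mapper only bumps a counter
    }
}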

Tuning HBase

● That requires a separate presentation

● You will need to fight hard to reduce GC pauses and overhead

● Pre-splitting regions may be a good idea to better balance the load

● Understand HBase compactions and deal with major compactions your way
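
On “deal with major compactions your way”: a common pattern (shown as a hedged sketch with the 0.9x-era HBase API and a hypothetical table name) is to disable the periodic major compaction with hbase.hregion.majorcompaction=0 in hbase-site.xml and trigger it explicitly during off-peak hours; pre-splitting itself was already sketched in the reset-tool example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

/**
 * Triggers a major compaction explicitly (e.g. from an off-peak cron job), assuming the
 * periodic one was disabled with hbase.hregion.majorcompaction=0 in hbase-site.xml.
 */
public class OffPeakMajorCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Asynchronous request: the region servers compact in the background.
            admin.majorCompact("sessions");   // hypothetical table name
        } finally {
            admin.close();
        }
    }
}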

Set up your monitoring (and alarming)

● You cannot improve what you cannot see!

● Monitor OS, Hadoop and your app metrics

● Ganglia, Graphite, LogStash, even Cloudera Manager are your friends

● Set the baseline, track your changes, observe the outcome

Fight #5 - Operations

● A real hand-over to the Operations people never actually happened

● Any problem was either ignored or escalated to the engineers within about a minute

● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps

● Monitoring was nearly non-existent

● Same for appropriate alarms

If you are serious...

● Send your Ops for Hadoop training (or buy them books and have them read those!)

● Have them automate everything

● Ops have to understand your applications, not just the platform they are running on

● Your Ops need to be decent Linux admins

● ...and it would be great if they are also OK programmers (scripting, Java…)

● Of course, motivation is key

Plan and train for disaster

● Train your Ops to help your system survive until Monday morning

● Decide what sort of loss you will tolerate (BigData is not always so precious)

● Design your system for resilience, async processing, queuing etc

Fight #6 - evolution

● Sooner or later you will need to increase your capacity
– Unless your business is stagnating

● Technically, you will either
– Run out of storage space

– Start hitting the wall on IOPS or CPU and fail to respect your SLAs (even if only internal ones)

– Not be able to deploy new applications

Understand your application - again

● Even if your apps run fine, you need to monitor the performance factors

● Build spreadsheets reflecting your current numbers

● Plan for business growth

● Translate this into the number of additional nodes and networking equipment

● Especially important if your hardware purchase cycle takes months

Conclusions

● Not all companies are ready for BigData – often because of conservative people in key positions

● Traditional IT/Ops/NOC organizations are often unable to support these platforms

● Engineers have to be given more power to control how the things they build are run (DevOps)

● Hadoop is a complex platform and has to be taken seriously for serious applications

● If you really depend on Hadoop you do need to build in-house expertise

Questions?

Thanks for listening!

Nikolai Grigoriev
ngrigoriev@gmail.com
