Hadoop at Datasift

Post on 13-May-2015

DESCRIPTION

Slides from the presentation at Hadoop UK User group meetup in London as part of BigDataWeek.

TRANSCRIPT

Hadoop At

Datasift

About me

Jairam Chandar
Big Data Engineer

Datasift

@jairamc

http://about.me/jairam

http://blog.jairam.me

Outline

What is Datasift?

Where do we use Hadoop?

– The Numbers
– The Use-cases
– The Lessons

!! Sales Pitch Alert !!

What is Datasift?

The Numbers

Machines

– 60 machines, each running:
● Datanode
● Tasktracker
● RegionServer

– 2 machines● Namenode

– 2 machines● HBase Master

– In the process of doubling our capacity

The Numbers

Machines

– 2 * Intel Xeon E5620 @ 2.40GHz (16 core total)

– 24GB RAM

– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

– 1 Gigabit network links

The Numbers

Data

– Avg load of 3500 interactions/second

– Peak load of 6000 interactions/second

– Highest during the Superbowl – 12000 interactions/second

– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)

– And that's not it!
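As a sanity check on the figures above (a back-of-the-envelope sketch, using the average rate; the slides' "2 TB a day" is a round number):

```python
# Back-of-the-envelope check of the daily data volume quoted above.
# All inputs come from the slides; RF is the HDFS replication factor.
AVG_RATE = 3500               # interactions per second (average)
SIZE_BYTES = 2 * 1024         # ~2 KB per interaction
SECONDS_PER_DAY = 86400
RF = 3

raw_per_day = AVG_RATE * SIZE_BYTES * SECONDS_PER_DAY  # bytes before replication
stored_per_day = raw_per_day * RF                      # bytes on disk with RF = 3

print(round(raw_per_day / 10**12, 2), "TB raw/day")        # 0.62
print(round(stored_per_day / 10**12, 2), "TB stored/day")  # 1.86, i.e. ~2 TB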

The Use Cases

HBase

– Recordings
– Archive/Ultrahose

Map/Reduce

– Exports
– Historics

The Use Cases

Recordings

– User defined streams

– Stored in HBase for later retrieval

– Export to multiple output formats and stores

– <recording-id><interaction-uuid>
● Recording-id is a SHA-1 hash
● Allows recordings to be distributed by their key without generating hot-spots.
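A minimal Python sketch of this key scheme (illustrative only, not the talk's actual code; the function and input names are made up):

```python
import hashlib
import uuid

def make_row_key(recording_name: str, interaction_uuid: str) -> bytes:
    """Illustrative row key: SHA-1(recording id) followed by the interaction UUID.

    Because SHA-1 output is effectively uniform, consecutive recordings land
    in different parts of the key space, so writes spread across region
    servers instead of hammering one hot region.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + uuid.UUID(interaction_uuid).bytes               # + 16 bytes

key = make_row_key("my-recording", "12345678-1234-5678-1234-567812345678")
print(len(key))  # 36: 20 bytes of SHA-1 + 16 bytes of UUID
```

Within one recording, all interactions still sort together (same 20-byte prefix), so scans over a single recording remain contiguous.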

The Use Cases

Recordings continued ...

The Use Cases

Exporter

– Export data from HBase for customers

– Export files of 5–10 GB or 3–6 million records

– MR over HBase using TableInputFormat

– But the data needs to be sorted
● TotalOrderPartitioner
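The idea behind TotalOrderPartitioner, sketched in Python (a toy model, not Hadoop's implementation): split points sampled from the key space route each key to a partition, so that after each reducer sorts its own partition, concatenating the outputs in partition order yields one globally sorted result.

```python
import bisect

# Hypothetical split points, standing in for the partition file that
# TotalOrderPartitioner reads (normally produced by sampling the input).
SPLIT_POINTS = ["g", "n", "t"]  # 4 partitions: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf)

def partition(key: str) -> int:
    # Every key in partition i sorts before every key in partition i+1,
    # which is what makes the concatenated reducer outputs globally sorted.
    return bisect.bisect_right(SPLIT_POINTS, key)

buckets = {}
for k in ["zebra", "apple", "mango", "giraffe", "horse"]:
    buckets.setdefault(partition(k), []).append(k)
print(dict(sorted(buckets.items())))
```

The default hash partitioner, by contrast, only guarantees sorted order *within* each reducer's output, which is why a total-order export needs this extra step.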

The Use Cases

Exporter Continued

!! Sales Pitch Alert !!

Historics

The Use Cases

Archive/Ultrahose

– Not just the Firehose but the Ultrahose

– Stored in HBase as well

– HBase architecture (BigTable) creates Hotspots with Time Series data

● Leading randomizing bit (see HBaseWD)
● Pre-split regions
● Concurrent writes
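The "leading randomizing bit" trick can be sketched as HBaseWD-style key salting (illustrative Python, not DataSift's or HBaseWD's code; the bucket count is an assumption):

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; typically chosen to match the pre-split regions

def salted_key(ts_key: bytes) -> bytes:
    """Prefix a time-ordered key with a deterministic salt byte.

    Monotonically increasing timestamps would otherwise all land in the
    last region (a hot-spot); the salt spreads writes over NUM_BUCKETS
    key ranges. Reads must fan out over all buckets and merge the results,
    which is the scan-side cost HBaseWD's distributed scanners absorb.
    """
    bucket = hashlib.md5(ts_key).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + ts_key

k = salted_key(b"20120501-000001")
print(k[0] < NUM_BUCKETS)  # True: the prefix always falls in one of the buckets
```

Because the salt is derived from the key itself, point reads can recompute the prefix; only range scans over time need the fan-out.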

The Use Cases

Archive continued …

2 years of Tweets

– 11 TB compressed

– <Number of tweets we got>

The Use Cases

Historics– Export archive data

– Slightly different from the Exporter
● Much larger time ranges (1–3 months)
● Unfiltered input data
● Therefore longer processing time
● Hence more optimizations required

The Use Cases

Historics continued ...

The Lessons - HBase

Tune Tune Tune (Default == BAD)

Based on the use case, tune:

– Heap
– Block size
– Memstore size
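For illustration, a hypothetical hbase-site.xml fragment touching two of these knobs (property names from the 0.9x-era HBase contemporary with this deck; values are placeholders, not recommendations — heap is set via HBASE_HEAPSIZE in hbase-env.sh, and block size per column family in the shell):

```xml
<!-- Illustrative only: check the property names against your HBase version. -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value> <!-- fraction of region server heap for all memstores -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value> <!-- fraction of heap for the read block cache -->
</property>
```

The write-heavy workloads described above generally favor memstore over block cache; read-heavy exports favor the reverse.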

Keep number of column families low

Be aware of hot-spotting issue when writing time-series data

Use compression (e.g. Snappy)

The Lessons - HBase

Ops need intimate understanding of system

Monitor metrics (GC, CPU, Compaction, I/O)

Don't be afraid to fiddle with the HBase code

Using a distribution is advisable

Questions?
