Hadoop At Datasift

Upload: jairam-chandar

Post on 13-May-2015


DESCRIPTION

Slides from a presentation at the Hadoop UK User Group meetup in London, held as part of BigDataWeek.

TRANSCRIPT

Page 1: Hadoop at datasift

Hadoop At Datasift

Page 2: Hadoop at datasift

About me

Jairam Chandar

Big Data Engineer

Datasift

@jairamc

http://about.me/jairam

http://blog.jairam.me

Page 3: Hadoop at datasift

Outline

What is Datasift?

Where do we use Hadoop?

– The Numbers
– The Use-cases
– The Lessons

Page 4: Hadoop at datasift

!! Sales Pitch Alert !!

Page 5: Hadoop at datasift

What is Datasift?

Page 15: Hadoop at datasift

The Numbers

Machines

– 60 machines
  ● Datanode
  ● Tasktracker
  ● RegionServer

– 2 machines
  ● Namenode

– 2 machines
  ● HBase Master

– In the process of doubling our capacity

Page 16: Hadoop at datasift

The Numbers

Machines

– 2 * Intel Xeon E5620 @ 2.40GHz (8 cores, 16 threads total)

– 24GB RAM

– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

– 1 Gigabit network links

Page 17: Hadoop at datasift

The Numbers

Data

– Avg load of 3500 interactions/second

– Peak load of 6000 interactions/second

– Highest during the Superbowl – 12000 interactions/second

– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)

– And that's not it!
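As a rough sanity check on the slide's figures, the daily volume follows directly from the stated averages (3500 interactions/s at ~2 KB each, replication factor 3). A minimal plain-Java sketch:

```java
public class DailyVolume {
    // Rough back-of-the-envelope check of the slide's numbers: average
    // interactions/second * average size (KB) * seconds/day * replication.
    static double terabytesPerDay(double perSecond, double sizeKB, int replication) {
        double kbPerDay = perSecond * sizeKB * 86400 * replication;
        return kbPerDay / (1024.0 * 1024 * 1024); // KB -> TB
    }

    public static void main(String[] args) {
        // ~1.7 TB/day, in line with the "2 TB a day" figure on the slide
        System.out.printf("%.2f TB/day%n", terabytesPerDay(3500, 2, 3));
    }
}
```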

Page 18: Hadoop at datasift

The Use Cases

HBase

– Recordings
– Archive/Ultrahose

Map/Reduce

– Exports
– Historics

Page 19: Hadoop at datasift

The Use Cases

Recordings
– User defined streams

– Stored in HBase for later retrieval

– Export to multiple output formats and stores

– Row key: <recording-id><interaction-uuid>
  ● Recording-id is a SHA-1 hash
  ● Allows recordings to be distributed by their key without generating hot-spots
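A minimal sketch of how such a row key might be built (plain Java, no HBase dependency; the recording name and UUID below are made-up illustrations, not DataSift's actual scheme):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RecordingKey {
    // SHA-1 the recording identifier so lexicographically adjacent recording
    // names land in different HBase regions instead of one hot region.
    static String recordingId(String recordingName) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(recordingName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString(); // 40 hex characters
    }

    // Row key = <recording-id><interaction-uuid>, as on the slide.
    static String rowKey(String recordingName, String interactionUuid) throws Exception {
        return recordingId(recordingName) + interactionUuid;
    }
}
```

Because SHA-1 output is effectively uniform, recordings scatter evenly across pre-split regions, while all interactions of one recording stay contiguous for efficient scans.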

Page 20: Hadoop at datasift

The Use Cases

Recordings continued ...

Page 21: Hadoop at datasift

The Use Cases

Exporter
– Export data from HBase for customers

– Export files of 5–10 GB or 3–6 million records

– MR over HBase using TableInputFormat

– But the data needs to be sorted
  ● TotalOrderPartitioner
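The idea behind TotalOrderPartitioner can be sketched without Hadoop at all: pick sorted split points (Hadoop samples these with InputSampler), then binary-search each key into its range, so that concatenating the reducers' outputs yields a globally sorted result. A toy illustration (the split points here are made up):

```java
import java.util.Arrays;

public class TotalOrderSketch {
    // Route a key to the partition whose key range contains it. With N split
    // points there are N+1 partitions, and partition i holds only keys that
    // sort before everything in partition i+1 — hence "total order".
    static int partition(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // exact match goes right of the split point; otherwise use insertion point
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"}; // 3 partitions: [..g), [g..p), [p..]
        System.out.println(partition("alpha", splits));  // 0
        System.out.println(partition("hadoop", splits)); // 1
        System.out.println(partition("zebra", splits));  // 2
    }
}
```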

Page 22: Hadoop at datasift

The Use Cases

Exporter Continued

Page 23: Hadoop at datasift

!! Sales Pitch Alert !!

Page 24: Hadoop at datasift

Historics

Page 25: Hadoop at datasift

The Use Cases

Archive/Ultrahose
– Not just the Firehose but the Ultrahose

– Stored in HBase as well

– HBase architecture (BigTable) creates hot-spots with time-series data

  ● Leading randomizing bit (see HBaseWD)
  ● Pre-split regions
  ● Concurrent writes

Page 26: Hadoop at datasift

The Use Cases

Archive continued …

2 years of Tweets

– 11 TB compressed

– <Number of tweets we got>

Page 27: Hadoop at datasift

The Use Cases

Historics
– Export archive data

– Slightly different from the Exporter
  ● Much larger timelines (1–3 months)
  ● Unfiltered input data
  ● Therefore longer processing time
  ● Hence more optimizations required

Page 28: Hadoop at datasift

The Use Cases

Historics continued ...

Page 29: Hadoop at datasift

The Lessons - HBase

Tune Tune Tune (Default == BAD)

Based on use case, tune:

– Heap
– Block size
– Memstore size

Keep number of column families low

Be aware of hot-spotting issue when writing time-series data

Use compression (e.g. Snappy)

Page 30: Hadoop at datasift

The Lessons - HBase

Ops need intimate understanding of system

Monitor metrics (GC, CPU, Compaction, I/O)

Don't be afraid to fiddle with HBase code

Using a distribution is advisable

Page 31: Hadoop at datasift

Questions?