Hadoop at Datasift

Post on 13-May-2015

DESCRIPTION

Slides from the presentation at Hadoop UK User group meetup in London as part of BigDataWeek.

TRANSCRIPT

Hadoop At

Datasift

About me

Jairam Chandar
Big Data Engineer

Datasift

@jairamc

http://about.me/jairam

http://blog.jairam.me

Outline

What is Datasift?

Where do we use Hadoop?

– The Numbers
– The Use-cases
– The Lessons

!! Sales Pitch Alert !!

What is Datasift?

The Numbers

Machines

– 60 machines, each running:
● Datanode
● Tasktracker
● RegionServer

– 2 machines● Namenode

– 2 machines● HBase Master

– In the process of doubling our capacity

The Numbers

Machines

– 2 * Intel Xeon E5620 @ 2.40GHz (16 core total)

– 24GB RAM

– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

– 1 Gigabit network links

The Numbers

Data

– Avg load of 3500 interactions/second

– Peak load of 6000 interactions/second

– Highest during the Superbowl – 12000 interactions/second

– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)

– And that's not it!
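As a sanity check on the figures above (a back-of-the-envelope sketch, using the average rate; the slides' "2 TB a day" is a round number):

```python
# Back-of-the-envelope check of the daily data volume quoted above.
# All inputs come from the slides; RF is the HDFS replication factor.
AVG_RATE = 3500               # interactions per second (average)
SIZE_BYTES = 2 * 1024         # ~2 KB per interaction
SECONDS_PER_DAY = 86400
RF = 3

raw_per_day = AVG_RATE * SIZE_BYTES * SECONDS_PER_DAY  # bytes before replication
stored_per_day = raw_per_day * RF                      # bytes on disk with RF = 3

print(round(raw_per_day / 10**12, 2), "TB raw/day")        # 0.62
print(round(stored_per_day / 10**12, 2), "TB stored/day")  # 1.86, i.e. ~2 TB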

The Use Cases

HBase

– Recordings
– Archive/Ultrahose

Map/Reduce

– Exports
– Historics

The Use Cases

Recordings

– User defined streams

– Stored in HBase for later retrieval

– Export to multiple output formats and stores

– <recording-id><interaction-uuid>
● Recording-id is a SHA-1 hash
● Allows recordings to be distributed by their key without generating hot-spots.
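A minimal Python sketch of this key scheme (illustrative only, not the talk's actual code; the function and input names are made up):

```python
import hashlib
import uuid

def make_row_key(recording_name: str, interaction_uuid: str) -> bytes:
    """Illustrative row key: SHA-1(recording id) followed by the interaction UUID.

    Because SHA-1 output is effectively uniform, consecutive recordings land
    in different parts of the key space, so writes spread across region
    servers instead of hammering one hot region.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + uuid.UUID(interaction_uuid).bytes               # + 16 bytes

key = make_row_key("my-recording", "12345678-1234-5678-1234-567812345678")
print(len(key))  # 36: 20 bytes of SHA-1 + 16 bytes of UUID
```

Within one recording, all interactions still sort together (same 20-byte prefix), so scans over a single recording remain contiguous.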

The Use Cases

Recordings continued ...

The Use Cases

Exporter

– Export data from HBase for customers

– Export files of 5–10 GB or 3–6 million records

– MR over HBase using TableInputFormat

– But the data needs to be sorted
● TotalOrderPartitioner
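The idea behind TotalOrderPartitioner, sketched in Python (a toy model, not Hadoop's implementation): split points sampled from the key space route each key to a partition, so that after each reducer sorts its own partition, concatenating the outputs in partition order yields one globally sorted result.

```python
import bisect

# Hypothetical split points, standing in for the partition file that
# TotalOrderPartitioner reads (normally produced by sampling the input).
SPLIT_POINTS = ["g", "n", "t"]  # 4 partitions: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf)

def partition(key: str) -> int:
    # Every key in partition i sorts before every key in partition i+1,
    # which is what makes the concatenated reducer outputs globally sorted.
    return bisect.bisect_right(SPLIT_POINTS, key)

buckets = {}
for k in ["zebra", "apple", "mango", "giraffe", "horse"]:
    buckets.setdefault(partition(k), []).append(k)
print(dict(sorted(buckets.items())))
```

The default hash partitioner, by contrast, only guarantees sorted order *within* each reducer's output, which is why a total-order export needs this extra step.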

The Use Cases

Exporter Continued

!! Sales Pitch Alert !!

Historics

The Use Cases

Archive/Ultrahose

– Not just the Firehose but the Ultrahose

– Stored in HBase as well

– HBase architecture (BigTable) creates Hotspots with Time Series data

● Leading randomizing bit (see HBaseWD)
● Pre-split regions
● Concurrent writes
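The "leading randomizing bit" trick can be sketched as HBaseWD-style key salting (illustrative Python, not DataSift's or HBaseWD's code; the bucket count is an assumption):

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; typically chosen to match the pre-split regions

def salted_key(ts_key: bytes) -> bytes:
    """Prefix a time-ordered key with a deterministic salt byte.

    Monotonically increasing timestamps would otherwise all land in the
    last region (a hot-spot); the salt spreads writes over NUM_BUCKETS
    key ranges. Reads must fan out over all buckets and merge the results,
    which is the scan-side cost HBaseWD's distributed scanners absorb.
    """
    bucket = hashlib.md5(ts_key).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + ts_key

k = salted_key(b"20120501-000001")
print(k[0] < NUM_BUCKETS)  # True: the prefix always falls in one of the buckets
```

Because the salt is derived from the key itself, point reads can recompute the prefix; only range scans over time need the fan-out.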

The Use Cases

Archive continued …

2 years of Tweets

– 11 TB compressed

– <Number of tweets we got>

The Use Cases

Historics– Export archive data

– Slightly different from the Exporter
● Much larger time ranges (1–3 months)
● Unfiltered input data
● Therefore longer processing time
● Hence more optimizations required

The Use Cases

Historics continued ...

The Lessons - HBase

Tune Tune Tune (Default == BAD)

Based on the use case, tune:

– Heap
– Block size
– Memstore size
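For illustration, a hypothetical hbase-site.xml fragment touching two of these knobs (property names from the 0.9x-era HBase contemporary with this deck; values are placeholders, not recommendations — heap is set via HBASE_HEAPSIZE in hbase-env.sh, and block size per column family in the shell):

```xml
<!-- Illustrative only: check the property names against your HBase version. -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value> <!-- fraction of region server heap for all memstores -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value> <!-- fraction of heap for the read block cache -->
</property>
```

The write-heavy workloads described above generally favor memstore over block cache; read-heavy exports favor the reverse.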

Keep number of column families low

Be aware of hot-spotting issue when writing time-series data

Use compression (e.g. Snappy)

The Lessons - HBase

Ops need intimate understanding of system

Monitor metrics (GC, CPU, Compaction, I/O)

Don't be afraid to fiddle with the HBase code

Using a distribution is advisable

Questions?
