Hadoop At Datasift

Upload: jairam-chandar

Post on 13-May-2015


DESCRIPTION

Slides from a presentation at the Hadoop UK User Group meetup in London, held as part of BigDataWeek.

TRANSCRIPT

Page 1: Hadoop at datasift

Hadoop At Datasift

Page 2: Hadoop at datasift

About me

Jairam Chandar

Big Data Engineer

Datasift

@jairamc

http://about.me/jairam

http://blog.jairam.me

Page 3: Hadoop at datasift

Outline

What is Datasift?

Where do we use Hadoop?

– The Numbers
– The Use-cases
– The Lessons

Page 4: Hadoop at datasift

!! Sales Pitch Alert !!

Page 5: Hadoop at datasift

What is Datasift?

Page 15: Hadoop at datasift

The Numbers

Machines

– 60 machines
  ● Datanode
  ● Tasktracker
  ● RegionServer

– 2 machines
  ● Namenode

– 2 machines
  ● HBase Master

– In the process of doubling our capacity

Page 16: Hadoop at datasift

The Numbers

Machines

– 2 * Intel Xeon E5620 @ 2.40GHz (8 cores, 16 threads total)

– 24GB RAM

– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

– 1 Gigabit network links

Page 17: Hadoop at datasift

The Numbers

Data

– Avg load of 3500 interactions/second

– Peak load of 6000 interactions/second

– Highest during the Superbowl – 12000 interactions/second

– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)

– And that's not it!
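As a rough sanity check on the slide's figures, the daily volume follows directly from the stated averages (3500 interactions/s at ~2 KB each, replication factor 3). A minimal plain-Java sketch:

```java
public class DailyVolume {
    // Rough back-of-the-envelope check of the slide's numbers: average
    // interactions/second * average size (KB) * seconds/day * replication.
    static double terabytesPerDay(double perSecond, double sizeKB, int replication) {
        double kbPerDay = perSecond * sizeKB * 86400 * replication;
        return kbPerDay / (1024.0 * 1024 * 1024); // KB -> TB
    }

    public static void main(String[] args) {
        // ~1.7 TB/day, in line with the "2 TB a day" figure on the slide
        System.out.printf("%.2f TB/day%n", terabytesPerDay(3500, 2, 3));
    }
}
```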

Page 18: Hadoop at datasift

The Use Cases

HBase

– Recordings
– Archive/Ultrahose

Map/Reduce

– Exports
– Historics

Page 19: Hadoop at datasift

The Use Cases

Recordings
– User defined streams

– Stored in HBase for later retrieval

– Export to multiple output formats and stores

– Row key: <recording-id><interaction-uuid>
  ● Recording-id is a SHA-1 hash
  ● Allows recordings to be distributed by their key without generating hot-spots
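A minimal sketch of how such a row key might be built (plain Java, no HBase dependency; the recording name and UUID below are made-up illustrations, not DataSift's actual scheme):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RecordingKey {
    // SHA-1 the recording identifier so lexicographically adjacent recording
    // names land in different HBase regions instead of one hot region.
    static String recordingId(String recordingName) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(recordingName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString(); // 40 hex characters
    }

    // Row key = <recording-id><interaction-uuid>, as on the slide.
    static String rowKey(String recordingName, String interactionUuid) throws Exception {
        return recordingId(recordingName) + interactionUuid;
    }
}
```

Because SHA-1 output is effectively uniform, recordings scatter evenly across pre-split regions, while all interactions of one recording stay contiguous for efficient scans.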

Page 20: Hadoop at datasift

The Use Cases

Recordings continued ...

Page 21: Hadoop at datasift

The Use Cases

Exporter
– Export data from HBase for customers

– Export files of 5–10 GB or 3–6 million records

– MR over HBase using TableInputFormat

– But the data needs to be sorted
  ● TotalOrderPartitioner
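The idea behind TotalOrderPartitioner can be sketched without Hadoop at all: pick sorted split points (Hadoop samples these with InputSampler), then binary-search each key into its range, so that concatenating the reducers' outputs yields a globally sorted result. A toy illustration (the split points here are made up):

```java
import java.util.Arrays;

public class TotalOrderSketch {
    // Route a key to the partition whose key range contains it. With N split
    // points there are N+1 partitions, and partition i holds only keys that
    // sort before everything in partition i+1 — hence "total order".
    static int partition(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // exact match goes right of the split point; otherwise use insertion point
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"}; // 3 partitions: [..g), [g..p), [p..]
        System.out.println(partition("alpha", splits));  // 0
        System.out.println(partition("hadoop", splits)); // 1
        System.out.println(partition("zebra", splits));  // 2
    }
}
```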

Page 22: Hadoop at datasift

The Use Cases

Exporter Continued

Page 23: Hadoop at datasift

!! Sales Pitch Alert !!

Page 24: Hadoop at datasift

Historics

Page 25: Hadoop at datasift

The Use Cases

Archive/Ultrahose
– Not just the Firehose but the Ultrahose

– Stored in HBase as well

– HBase architecture (BigTable) creates hot-spots with time-series data

  ● Leading randomizing bit (see HBaseWD)
  ● Pre-split regions
  ● Concurrent writes

Page 26: Hadoop at datasift

The Use Cases

Archive continued …

2 years of Tweets

– 11 TB compressed

– <Number of tweets we got>

Page 27: Hadoop at datasift

The Use Cases

Historics
– Export archive data

– Slightly different from the Exporter
  ● Much larger timelines (1–3 months)
  ● Unfiltered input data
  ● Therefore longer processing time
  ● Hence more optimizations required

Page 28: Hadoop at datasift

The Use Cases

Historics continued ...

Page 29: Hadoop at datasift

The Lessons - HBase

Tune Tune Tune (Default == BAD)

Based on use case, tune:

– Heap
– Block size
– Memstore size

Keep number of column families low

Be aware of hot-spotting issue when writing time-series data

Use compression (e.g. Snappy)

Page 30: Hadoop at datasift

The Lessons - HBase

Ops need intimate understanding of system

Monitor metrics (GC, CPU, Compaction, I/O)

Don't be afraid to fiddle with HBase code

Using a distribution is advisable

Page 31: Hadoop at datasift

Questions?