Intro to Big Data using Hadoop
DESCRIPTION
Introduction to Big Data and the Apache Hadoop project, with a MapReduce visualization.
TRANSCRIPT
![Page 1: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/1.jpg)
Intro to Big Data using Hadoop
Sergejus Barinovas
sergejus.blogas.lt
fb.com/ITishnikai
@sergejusb
![Page 2: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/2.jpg)
Information is powerful…
but it is how we use it that will define us.
![Page 3: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/3.jpg)
Data Explosion
picture from Big Data Integration
relational, text, audio, video, images
![Page 4: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/4.jpg)
Big Data (globally)
– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day
![Page 5: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/5.jpg)
Big Data (our example)
– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data
![Page 6: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/6.jpg)
4 Vs of Big Data
volumevelocityvarietyvariability
volumevelocityvarietyvariability
![Page 7: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/7.jpg)
Big Data Challenges
Sort 10 TB on 1 node = 2.5 days
Sort 10 TB on a 100-node cluster = 35 mins
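The cluster figure is consistent with near-linear scaling; a quick back-of-the-envelope check (idealized, ignoring shuffle and coordination overhead):

```javascript
// Idealized linear scaling: total work stays constant, nodes divide it.
const singleNodeMinutes = 2.5 * 24 * 60; // 2.5 days = 3600 minutes
const nodes = 100;
const clusterMinutes = singleNodeMinutes / nodes;

console.log(clusterMinutes); // 36 -- close to the quoted 35 mins
```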
![Page 8: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/8.jpg)
Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
![Page 9: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/9.jpg)
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines.
![Page 10: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/10.jpg)
MapReduce
to the rescue!
![Page 11: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/11.jpg)
MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …
![Page 12: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/12.jpg)
MapReduce
Who got it?
![Page 13: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/13.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Three Map tasks feed two Reduce tasks, which produce:
  the, 3; brown, 2; fox, 2; how, 1; now, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
![Page 14: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/14.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Each of the three Map tasks emits a (word, 1) pair per word:
  the, 1; quick, 1; brown, 1; fox, 1
  the, 1; fox, 1; ate, 1; the, 1; mouse, 1
  how, 1; now, 1; brown, 1; cow, 1
Shuffle & Sort partitions the pairs between the two Reduce tasks:
  the, 1; brown, 1; fox, 1; the, 1; fox, 1; the, 1; how, 1; now, 1; brown, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
![Page 15: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/15.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Shuffle & Sort groups the values for each key:
  the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]
  quick, [1]; ate, [1]; mouse, [1]; cow, [1]
Reduce sums each list:
  the, 3; brown, 2; fox, 2; how, 1; now, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
Ta da!
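The three stages above can be simulated in a few lines of JavaScript (a plain in-memory sketch, not the Hadoop API):

```javascript
const inputs = [
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow",
];

// Map: emit a (word, 1) pair for every word in every input split.
const pairs = inputs.flatMap(line => line.split(" ").map(word => [word, 1]));

// Shuffle & Sort: group the emitted values by key, e.g. "the" -> [1, 1, 1].
const groups = new Map();
for (const [word, one] of pairs) {
  if (!groups.has(word)) groups.set(word, []);
  groups.get(word).push(one);
}

// Reduce: sum each key's list of ones.
const counts = {};
for (const [word, ones] of groups) {
  counts[word] = ones.reduce((a, b) => a + b, 0);
}

console.log(counts.the, counts.brown, counts.fox); // 3 2 2
```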
![Page 16: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/16.jpg)
MapReduce philosophy
– hide complexity
– make it scalable
– make it cheap
![Page 17: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/17.jpg)
MapReduce popularized by the Apache Hadoop project
![Page 18: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/18.jpg)
Hadoop Overview
Open source implementation of
– the Google MapReduce paper
– the Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.
![Page 19: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/19.jpg)
Hadoop Core
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
![Page 20: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/20.jpg)
Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
• Name Node stores file metadata
• files are split into 64 MB blocks
• blocks are replicated across 3 Data Nodes
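The split-and-replicate rule can be sketched as follows (a hypothetical helper for illustration, not the HDFS client API):

```javascript
const BLOCK_SIZE = 64 * 1024 * 1024; // 64 MB, the classic HDFS default
const REPLICATION = 3;               // default replication factor

// Split a file of `size` bytes into block descriptors, each assigned to
// REPLICATION distinct data nodes (round-robin for simplicity -- real
// HDFS placement also considers racks).
function placeBlocks(size, dataNodes) {
  const blocks = [];
  for (let offset = 0, i = 0; offset < size; offset += BLOCK_SIZE, i++) {
    const replicas = [];
    for (let r = 0; r < REPLICATION; r++) {
      replicas.push(dataNodes[(i + r) % dataNodes.length]);
    }
    blocks.push({ offset, length: Math.min(BLOCK_SIZE, size - offset), replicas });
  }
  return blocks;
}

const blocks = placeBlocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]);
console.log(blocks.length); // 4 blocks for a 200 MB file
```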
![Page 21: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/21.jpg)
Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
![Page 22: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/22.jpg)
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
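The data-locality rule can be sketched in a few lines (hypothetical helper names, not the Hadoop scheduler API): prefer a task tracker that already holds a replica of the task's input block.

```javascript
// Map from input block id to the data nodes holding its replicas.
const replicas = {
  b1: ["node1", "node2", "node3"],
  b2: ["node2", "node4", "node5"],
};

// Pick a tracker for a task: prefer one co-located with the input block,
// otherwise fall back to any free tracker (and pay the network transfer).
function assignTask(blockId, freeTrackers) {
  const local = freeTrackers.find(t => replicas[blockId].includes(t));
  return local !== undefined ? local : freeTrackers[0];
}

console.log(assignTask("b1", ["node4", "node2"])); // "node2" -- data-local
console.log(assignTask("b2", ["node1", "node6"])); // "node1" -- no local replica free
```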
![Page 23: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/23.jpg)
Hadoop Core (MapReduce)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
Job Tracker Task Tracker
![Page 24: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/24.jpg)
Hadoop Core (Job submission)
Name Node Data Node
Job Tracker Task Tracker
Client
![Page 25: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/25.jpg)
Hadoop Ecosystem
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
HBase, Pig (ETL), Hive (BI), Sqoop (RDBMS), Avro, ZooKeeper
![Page 26: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/26.jpg)
JavaScript MapReduce

```javascript
// Map: split each input line into words and emit (word, 1) per word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the grouped counts for each word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
```
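The slide's map and reduce functions can be exercised locally with a mock `context` and values iterator (assumptions for illustration; the real streaming host supplies its own implementations):

```javascript
// Same functions as on the slide, repeated so this snippet runs standalone.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Mock context that records emitted (key, value) pairs.
var emitted = [];
var context = { write: function (k, v) { emitted.push([k, v]); } };

// Mock hasNext()/next() iterator over an array of values.
var iter = function (arr) {
    var i = 0;
    return {
        hasNext: function () { return i < arr.length; },
        next: function () { return arr[i++]; },
    };
};

map(null, "The quick brown fox", context); // emits 4 (word, 1) pairs
reduce("the", iter([1, 1, 1]), context);   // emits ["the", 3]
console.log(emitted.length);               // 5
```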
![Page 27: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/27.jpg)
Pig
```pig
words = LOAD '/example/count' AS (word: chararray, count: int);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
```
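For comparison, the ORDER/LIMIT part of the Pig script expressed in plain JavaScript (an illustrative equivalent with sample rows, not output generated by Pig):

```javascript
// (word, count) rows, as the Pig script would LOAD them.
const words = [
  ["the", 3], ["brown", 2], ["fox", 2], ["how", 1], ["now", 1],
  ["quick", 1], ["ate", 1], ["mouse", 1], ["cow", 1],
];

// ORDER words BY count DESC, then LIMIT popular_words 10.
const topPopularWords = [...words].sort((a, b) => b[1] - a[1]).slice(0, 10);

console.log(topPopularWords[0]); // ["the", 3]
```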
![Page 28: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/28.jpg)
Hive

```sql
CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
```
![Page 29: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/29.jpg)
Demo
Hadoop in the Cloud
Über Demo
![Page 30: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/30.jpg)
Thanks!
Questions?