Intro to Big Data using Hadoop
DESCRIPTION
Introduction to Big Data and the Apache Hadoop project, with a MapReduce visualization.
TRANSCRIPT
![Page 1: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/1.jpg)
Intro to Big Data using Hadoop
Sergejus Barinovas
sergejus.blogas.lt
fb.com/ITishnikai
@sergejusb
![Page 2: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/2.jpg)
Information is powerful…
but it is how we use it that will define us.
![Page 3: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/3.jpg)
Data Explosion
picture from Big Data Integration
relational, text, audio, video, images
![Page 4: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/4.jpg)
Big Data (globally)
– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day
![Page 5: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/5.jpg)
Big Data (our example)
– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data
![Page 6: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/6.jpg)
4 Vs of Big Data
volumevelocityvarietyvariability
volumevelocityvarietyvariability
![Page 7: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/7.jpg)
Big Data Challenges
Sort 10 TB on 1 node = 2.5 days
Sort 10 TB on a 100-node cluster = 35 mins
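The cluster figure is consistent with near-linear scaling; a quick back-of-the-envelope check (idealized, ignoring shuffle and coordination overhead):

```javascript
// Idealized linear scaling: total work stays constant, nodes divide it.
const singleNodeMinutes = 2.5 * 24 * 60; // 2.5 days = 3600 minutes
const nodes = 100;
const clusterMinutes = singleNodeMinutes / nodes;

console.log(clusterMinutes); // 36 -- close to the quoted 35 mins
```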
![Page 8: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/8.jpg)
Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
![Page 9: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/9.jpg)
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines.
![Page 10: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/10.jpg)
MapReduce
to the rescue!
![Page 11: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/11.jpg)
MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …
![Page 12: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/12.jpg)
MapReduce
Who got it?
![Page 13: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/13.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Three Map tasks feed two Reduce tasks, which produce:
  the, 3; brown, 2; fox, 2; how, 1; now, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
![Page 14: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/14.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Each of the three Map tasks emits a (word, 1) pair per word:
  the, 1; quick, 1; brown, 1; fox, 1
  the, 1; fox, 1; ate, 1; the, 1; mouse, 1
  how, 1; now, 1; brown, 1; cow, 1
Shuffle & Sort partitions the pairs between the two Reduce tasks:
  the, 1; brown, 1; fox, 1; the, 1; fox, 1; the, 1; how, 1; now, 1; brown, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
![Page 15: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/15.jpg)
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Shuffle & Sort groups the values for each key:
  the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]
  quick, [1]; ate, [1]; mouse, [1]; cow, [1]
Reduce sums each list:
  the, 3; brown, 2; fox, 2; how, 1; now, 1
  quick, 1; ate, 1; mouse, 1; cow, 1
Ta da!
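The three stages above can be simulated in a few lines of JavaScript (a plain in-memory sketch, not the Hadoop API):

```javascript
const inputs = [
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow",
];

// Map: emit a (word, 1) pair for every word in every input split.
const pairs = inputs.flatMap(line => line.split(" ").map(word => [word, 1]));

// Shuffle & Sort: group the emitted values by key, e.g. "the" -> [1, 1, 1].
const groups = new Map();
for (const [word, one] of pairs) {
  if (!groups.has(word)) groups.set(word, []);
  groups.get(word).push(one);
}

// Reduce: sum each key's list of ones.
const counts = {};
for (const [word, ones] of groups) {
  counts[word] = ones.reduce((a, b) => a + b, 0);
}

console.log(counts.the, counts.brown, counts.fox); // 3 2 2
```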
![Page 16: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/16.jpg)
MapReduce philosophy
– hide complexity
– make it scalable
– make it cheap
![Page 17: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/17.jpg)
MapReduce popularized by the Apache Hadoop project
![Page 18: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/18.jpg)
Hadoop Overview
Open source implementation of
– the Google MapReduce paper
– the Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.
![Page 19: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/19.jpg)
Hadoop Core
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
![Page 20: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/20.jpg)
Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
• Name Node stores file metadata
• files are split into 64 MB blocks
• blocks are replicated across 3 Data Nodes
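The split-and-replicate rule can be sketched as follows (a hypothetical helper for illustration, not the HDFS client API):

```javascript
const BLOCK_SIZE = 64 * 1024 * 1024; // 64 MB, the classic HDFS default
const REPLICATION = 3;               // default replication factor

// Split a file of `size` bytes into block descriptors, each assigned to
// REPLICATION distinct data nodes (round-robin for simplicity -- real
// HDFS placement also considers racks).
function placeBlocks(size, dataNodes) {
  const blocks = [];
  for (let offset = 0, i = 0; offset < size; offset += BLOCK_SIZE, i++) {
    const replicas = [];
    for (let r = 0; r < REPLICATION; r++) {
      replicas.push(dataNodes[(i + r) % dataNodes.length]);
    }
    blocks.push({ offset, length: Math.min(BLOCK_SIZE, size - offset), replicas });
  }
  return blocks;
}

const blocks = placeBlocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]);
console.log(blocks.length); // 4 blocks for a 200 MB file
```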
![Page 21: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/21.jpg)
Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
![Page 22: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/22.jpg)
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
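The data-locality rule can be sketched in a few lines (hypothetical helper names, not the Hadoop scheduler API): prefer a task tracker that already holds a replica of the task's input block.

```javascript
// Map from input block id to the data nodes holding its replicas.
const replicas = {
  b1: ["node1", "node2", "node3"],
  b2: ["node2", "node4", "node5"],
};

// Pick a tracker for a task: prefer one co-located with the input block,
// otherwise fall back to any free tracker (and pay the network transfer).
function assignTask(blockId, freeTrackers) {
  const local = freeTrackers.find(t => replicas[blockId].includes(t));
  return local !== undefined ? local : freeTrackers[0];
}

console.log(assignTask("b1", ["node4", "node2"])); // "node2" -- data-local
console.log(assignTask("b2", ["node1", "node6"])); // "node1" -- no local replica free
```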
![Page 23: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/23.jpg)
Hadoop Core (MapReduce)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node
Job Tracker Task Tracker
![Page 24: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/24.jpg)
Hadoop Core (Job submission)
Name Node Data Node
Job Tracker Task Tracker
Client
![Page 25: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/25.jpg)
Hadoop Ecosystem
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
HBase, Pig (ETL), Hive (BI), Sqoop (RDBMS), Avro, ZooKeeper
![Page 26: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/26.jpg)
JavaScript MapReduce

```javascript
// Map: split each input line into words and emit (word, 1) per word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the grouped counts for each word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
```
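The slide's map and reduce functions can be exercised locally with a mock `context` and values iterator (assumptions for illustration; the real streaming host supplies its own implementations):

```javascript
// Same functions as on the slide, repeated so this snippet runs standalone.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Mock context that records emitted (key, value) pairs.
var emitted = [];
var context = { write: function (k, v) { emitted.push([k, v]); } };

// Mock hasNext()/next() iterator over an array of values.
var iter = function (arr) {
    var i = 0;
    return {
        hasNext: function () { return i < arr.length; },
        next: function () { return arr[i++]; },
    };
};

map(null, "The quick brown fox", context); // emits 4 (word, 1) pairs
reduce("the", iter([1, 1, 1]), context);   // emits ["the", 3]
console.log(emitted.length);               // 5
```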
![Page 27: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/27.jpg)
Pig
```pig
words = LOAD '/example/count' AS (word: chararray, count: int);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
```
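For comparison, the ORDER/LIMIT part of the Pig script expressed in plain JavaScript (an illustrative equivalent with sample rows, not output generated by Pig):

```javascript
// (word, count) rows, as the Pig script would LOAD them.
const words = [
  ["the", 3], ["brown", 2], ["fox", 2], ["how", 1], ["now", 1],
  ["quick", 1], ["ate", 1], ["mouse", 1], ["cow", 1],
];

// ORDER words BY count DESC, then LIMIT popular_words 10.
const topPopularWords = [...words].sort((a, b) => b[1] - a[1]).slice(0, 10);

console.log(topPopularWords[0]); // ["the", 3]
```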
![Page 28: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/28.jpg)
Hive

```sql
CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
```
![Page 29: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/29.jpg)
Demo
Hadoop in the Cloud
Über Demo
![Page 30: Intro to Big Data using Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022051513/54620eecb4af9f621c8b45bc/html5/thumbnails/30.jpg)
Thanks!
Questions?