Introduction to Hadoop: Driven by Python
Jon Miller, [email protected], http://jonebird.com/


Page 1: Introduction to Hadoop, Driven by Python - jonEbird (jonebird.com/hadoop_intro.pdf)

Introduction to Hadoop
Driven by Python

Jon Miller
[email protected]
http://jonebird.com/


09/27/09 2

What is Hadoop?


What is Hadoop?

● Named after Doug Cutting's son's stuffed toy elephant
● Distributed MapReduce System
● Apache Project with multiple sub-projects: Core, HDFS, then HBase, Hive, Pig, ZooKeeper


Where is the Python?


Where is the Python?

● Hadoop Streaming
● Automatically copies your python script to nodes
● Uses STDIN / STDOUT to communicate
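As a rough sketch of the STDIN / STDOUT contract: Streaming hands each input record to the script as one line of text and, by default, treats everything up to the first tab of each output line as the key and the rest as the value. The helper names `emit` and `parse_record` below are made up for illustration and are not part of Hadoop; an in-memory buffer stands in for the real streams.

```python
import io

def emit(key, value, out):
    """Write one tab-separated key/value record, the default Streaming layout."""
    out.write('%s\t%s\n' % (key, value))

def parse_record(line):
    """Split a line at the first tab; a line without a tab is all key."""
    key, _, value = line.rstrip('\n').partition('\t')
    return key, value

# Round trip through an in-memory buffer instead of real STDIN / STDOUT.
buf = io.StringIO()
emit('http://example.com/', 1, buf)
record = parse_record(buf.getvalue())
print(record)
```

In an actual job the script would read from `sys.stdin` and write to `sys.stdout` in the same way.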


Hadoop Architecture


Hadoop Architecture

● Expect hardware failures
● Take the computing to the data, NOT pull data to compute
● Datanodes, Tasktrackers & Jobtracker


Web Analytics Example


Mapper

#!/usr/bin/env python
import sys

IGNORE_SITES = [
    'http://jonebird.com/',
    'http://www.jonebird.com/',
]

for line in sys.stdin:
    if line.count('"') == 6:
        # some entries I do not care about:
        # 1. Discard if referer is myself
        # 2. Discard if there is _no_ referer. i.e. "-"
        referer = line.split('"')[3]
        can_ignore = any(referer.startswith(site) for site in IGNORE_SITES)
        if referer != '-' and not can_ignore:
            print('%s\t%d' % (referer, 1))
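To see why index 3 of the quote-split is the referer, here is a small sketch that wraps the same filtering in a function and runs it on a few log lines. The function name `extract_referer` and the sample lines are invented for illustration; they assume Apache's combined log format, where the request, referer and user agent are the three quoted fields.

```python
IGNORE_SITES = ('http://jonebird.com/', 'http://www.jonebird.com/')

def extract_referer(line):
    """Return the referer of a combined-format log line, or None to skip it."""
    if line.count('"') != 6:
        return None  # not the three quoted fields we expect
    referer = line.split('"')[3]
    # str.startswith accepts a tuple of prefixes
    if referer == '-' or referer.startswith(IGNORE_SITES):
        return None
    return referer

# Hypothetical sample lines, not taken from a real log.
SAMPLE_LINES = [
    '10.0.0.1 - - [27/Sep/2009:10:00:00 -0400] "GET / HTTP/1.1" 200 512 "http://example.com/a" "Mozilla/5.0"',
    '10.0.0.2 - - [27/Sep/2009:10:00:01 -0400] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [27/Sep/2009:10:00:02 -0400] "GET /x HTTP/1.1" 200 512 "http://jonebird.com/about" "Mozilla/5.0"',
]
kept = [r for r in map(extract_referer, SAMPLE_LINES) if r is not None]
print(kept)
```

Only the first sample line survives: the second has no referer and the third refers back to the site itself.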


Reducer

#!/usr/bin/env python
import sys

referer_count = {}

# parse input from the mapping process
for line in sys.stdin:
    try:
        referer, count = line.strip().split('\t', 1)
        count = int(count)
        referer_count[referer] = referer_count.get(referer, 0) + count
    except ValueError:
        # ignoring odd failures
        pass

# Report our results
for referer, count in referer_count.items():
    print('%s\t%s' % (referer, count))
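Streaming sorts the mapper output by key before it reaches the reducer, so one possible variation (not the reducer from the slide) is to sum each run of equal keys with `itertools.groupby` and hold only one referer in memory at a time. The names `parse` and `reduce_sorted` and the sample input are assumptions made for this sketch.

```python
from itertools import groupby

def parse(lines):
    """Yield (referer, count) pairs, skipping lines that do not parse."""
    for line in lines:
        try:
            referer, count = line.rstrip('\n').split('\t', 1)
            yield referer, int(count)
        except ValueError:
            continue

def reduce_sorted(lines):
    """Sum counts per referer; assumes lines arrive grouped by key."""
    for referer, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield referer, sum(count for _, count in group)

# Hypothetical mapper output, already sorted by key.
sorted_input = ['http://a.example/\t1\n', 'http://a.example/\t1\n',
                'garbage line\n', 'http://b.example/\t1\n']
totals = list(reduce_sorted(sorted_input))
print(totals)
```

The trade-off: this version depends on the sorted order, while the dictionary version above works on input in any order at the cost of keeping every referer in memory.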


Invocation

# With $HADOOP_HOME
PATH=$PATH:${HADOOP_HOME}/bin

hadoop dfs -copyFromLocal /var/log/httpd/ apache_logs

export HSTREAM="${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/contrib/streaming/hadoop-${HADOOP_VERSION}-streaming.jar"

# Now run the following command to get a quick
# usage statement about using the streamer
$HSTREAM -info

$HSTREAM -D mapred.job.name='Apache Referer' \
    -input apache_logs/access_log* \
    -output apache_referer \
    -mapper $(pwd)/mapper.py \
    -reducer $(pwd)/reducer.py
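Before submitting a job, the same map, sort, reduce flow can be approximated on one machine. The sketch below is only an illustration of the data flow, not of Hadoop's distributed shuffle; `map_line`, `reduce_pairs`, `run_local` and the sample log are stand-ins invented here, and the mapper stand-in leaves out the IGNORE_SITES filter for brevity.

```python
from collections import Counter

def map_line(line):
    """Mimic the mapper: yield (referer, 1) for lines with three quoted fields."""
    if line.count('"') == 6:
        referer = line.split('"')[3]
        if referer != '-':
            yield referer, 1

def reduce_pairs(pairs):
    """Mimic the reducer: total the counts for each referer."""
    totals = Counter()
    for referer, count in pairs:
        totals[referer] += count
    return dict(totals)

def run_local(lines):
    """Map every line, sort the intermediate pairs as the shuffle would, reduce."""
    intermediate = sorted(pair for line in lines for pair in map_line(line))
    return reduce_pairs(intermediate)

# Hypothetical, heavily shortened log lines.
LOG = [
    'h - - [d] "GET / HTTP/1.1" 200 1 "http://a.example/" "UA"',
    'h - - [d] "GET / HTTP/1.1" 200 1 "http://b.example/" "UA"',
    'h - - [d] "GET / HTTP/1.1" 200 1 "http://a.example/" "UA"',
    'h - - [d] "GET / HTTP/1.1" 200 1 "-" "UA"',
]
result = run_local(LOG)
print(result)
```

If the local totals look right on a small sample, the real scripts are more likely to behave on the cluster.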


Results

# With $HADOOP_HOME
PATH=$PATH:${HADOOP_HOME}/bin

# View the resultant data sets in the HDFS
hadoop dfs -ls apache_referer

hadoop dfs -cat apache_referer/part*
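The part files hold one `referer<TAB>count` line per referer and are not ordered by count, so a short post-processing step helps to find the top referers. A minimal sketch, assuming the output of the `-cat` command has been captured as a list of lines; `top_referers` and the sample rows are hypothetical.

```python
def top_referers(lines, n=10):
    """Parse 'referer<TAB>count' lines and return the n largest by count."""
    rows = []
    for line in lines:
        referer, _, count = line.rstrip('\n').rpartition('\t')
        if referer and count.isdigit():
            rows.append((referer, int(count)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

# Hypothetical reducer output.
part_output = ['http://a.example/\t3\n',
               'http://b.example/\t12\n',
               'http://c.example/\t7\n']
ranked = top_referers(part_output, n=2)
print(ranked)
```

Lines without a tab or with a non-numeric count are skipped rather than raising an error.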


Why Should I Care?


Questions?

Creative Commons License v3.0


Interwebs
● http://hadoop.apache.org/
● http://cloudera.com/
● http://developer.yahoo.com/hadoop/tutorial/

Books
● Hadoop: The Definitive Guide by Tom White
● Pro Hadoop by Jason Venner

Videos
● Google MapReduce Lectures: http://www.youtube.com/watch?v=yjPBkvYh-ss
