Streaming Python on Hadoop

Upload: vivian-shangxuan-zhang

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Streaming Python on Hadoop

1

Page 2: Streaming Python on Hadoop

Meet-up: Tackling “Big Data” with Hadoop and Python

Sam Kamin, VP Data Engineering
NYC Data Science Academy
[email protected]

2

Page 3: Streaming Python on Hadoop

NYC Data Science Academy
● We’re a company that does training and consulting in the Data Science area.
● I’m Sam Kamin. I just joined NYCDSA as VP of Data Engineering (a new area for us). I was formerly a professor of Computer Science at the U. of Illinois and a Software Engineer at Google.

3

Page 4: Streaming Python on Hadoop

What this meet-up is about
● Wikipedia: “Data Science is the extraction of knowledge from large volumes of data.”
● My goal tonight: show you how to handle large volumes of data with simple Python programming, using the Hadoop streaming interface.

4

Page 5: Streaming Python on Hadoop

Outline of talk
● Brief overview of Hadoop
● Introduction to parallelism via MapReduce
● Examples of applying MapReduce
● Implementing MapReduce in Python

You can do some programming at the end if you want!

5

Page 6: Streaming Python on Hadoop

Big Data: What’s the problem?
Too much data!
o The web contains about 5 billion pages. According to Wikipedia, its total size is about 4 zettabytes (a zettabyte is 10^21 bytes) - that’s four thousand billion gigabytes.
o Google’s datacenters store about 15 exabytes (15 × 10^18 bytes).

6

Page 7: Streaming Python on Hadoop

Big Data: What’s the solution?
● Parallel computing: use multiple, cooperating computers.

7

Page 8: Streaming Python on Hadoop

Parallelism
● Parallelism = dividing up a problem so that multiple computers can all work on it:
o Break the data into pieces.
o Send the pieces to different computers for processing.
o Send the results back and process the combination to get the final result.

8

Page 9: Streaming Python on Hadoop

Cloud computing
● Amazon, Google, Microsoft, and many other companies operate huge clusters: racks of (basically) off-the-shelf computers with (basically) standard network connections.
● The computers in these clusters run Linux - you can use them like any other computer...

9

Page 10: Streaming Python on Hadoop

Cloud computing
● But getting them to work together is really hard:
o Management: machine/disk failure; efficient data placement; debugging, monitoring, logging, auditing.
o Algorithms: decomposing your problem so it can be solved in parallel can be hard.

That’s what Hadoop is here to help with.

10

Page 11: Streaming Python on Hadoop

Hadoop
● A collection of services in a cluster:
o A distributed, reliable file system (HDFS).
o A scheduler to run jobs in correct order, monitor them, restart on failure, etc.
o MapReduce, to help you decompose your problem for parallel execution.
o A variety of other components (mostly based on MapReduce), e.g. databases and application-focused libraries.

11

Page 12: Streaming Python on Hadoop

How to use Hadoop
● Hadoop is open source (free!)
● It is hosted by Apache: hadoop.apache.org
● Download it and run it standalone (for debugging).
● Buy a cluster, or rent time on one, e.g. AWS, GCE, Azure. (All offer some free time for new users.)

12

Page 13: Streaming Python on Hadoop

MapReduce
● The main, and original, parallel-processing system of Hadoop.
● Developed by Google to simplify parallel processing. Hadoop started as an open-source implementation of Google’s idea.
● With Hadoop’s streaming interface, it’s really easy to use MapReduce in Python.

13

Page 14: Streaming Python on Hadoop

MapReduce - The Big Idea
● Calculations on large data sets often have this form: start by aggregating the data (possibly in a different order from the “natural order”), then perform a summarizing calculation on the aggregated groups.
● The idea of MapReduce: if your calculation is explicitly structured like this, it can be automatically parallelized.

14

Page 15: Streaming Python on Hadoop

Computing with MapReduce
A MapReduce computation has three stages:
Map: A function called map is applied to each record in your input. It produces zero or more records as output, each with a key and a value. Keys may be repeated.
Shuffle: The output from step 1 is sorted and combined: all records with the same key are combined into one.
Reduce: A function called reduce is applied to each record (key + values) from step 2 to produce the final output.
As the programmer, you only write map and reduce.

15
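The three stages can be sketched in a few lines of plain Python - a single-process toy model, not how Hadoop actually runs, and `map_reduce` and its helper names are mine:

```python
from itertools import groupby

def map_reduce(records, map_fn, reduce_fn):
    # Map: apply map_fn to each input record; it yields (key, value) pairs.
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle: sort by key and group all values for the same key together.
    mapped.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=lambda kv: kv[0])]
    # Reduce: apply reduce_fn to each (key, list-of-values) record.
    return [reduce_fn(k, vs) for k, vs in grouped]

# The (key, value) records from the next slide's diagram:
records = [("A", 7), ("C", 5), ("B", 23), ("B", 12), ("A", 18)]
identity_map = lambda rec: [rec]    # pass each record through unchanged
collect = lambda k, vs: (k, vs)     # just show the grouped values
print(map_reduce(records, identity_map, collect))
```

The shuffle groups the records into (A, [7, 18]), (B, [23, 12]), (C, [5]); note that the order of values within a group is not something MapReduce guarantees.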

Page 16: Streaming Python on Hadoop

Computing with MapReduce

[Diagram: the input records (A,7), (C,5), (B,23), (B,12), (A,18) pass through map; shuffle combines them into (A,[18, 7]), (B,[23, 12]), (C,[5]); and reduce produces the output.]

Note: map is record-oriented, meaning the output of the map stage is strictly a combination of the outputs from each record. That allows us to calculate in parallel...

16

Page 17: Streaming Python on Hadoop

Parallelism via MapReduce

[Diagram: the input is split into chunks, each distributed to its own map task; combine/shuffle groups the mapped records into (A,[18, 7]), (B,[23, 12]), (C,[5]); and parallel reduce tasks each produce part of the output.]

Because map and reduce are record-oriented, MR can divide inputs into arbitrary chunks.

17

Page 18: Streaming Python on Hadoop

MapReduce example: Stock prices
● Input: list of daily opening and closing prices for thousands of stocks over thousands of days.
● Desired output: the biggest-ever one-day percentage price increase for each stock.
● Solution using MR:
o map: (stock, open, close) => (stock, (close - open) / open), emitted only if the change is positive.
o reduce: (stock, [%c0, %c1, …]) => (stock, max [%c0, %c1, …]).

18

Page 19: Streaming Python on Hadoop

MapReduce example - map

[Diagram: map turns the input records
Goog, 230, 240
Apple, 100, 98
MS, 300, 250
MS, 250, 260
MS, 270, 280
Goog, 220, 215
Goog, 300, 350
IBM, 80, 90
IBM, 90, 85
into
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%]

You supply map: output the stock with its % increase, or nothing on a decrease.

19

Page 20: Streaming Python on Hadoop

MapReduce example - shuffle/sort

[Diagram: shuffle/sort turns
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%
into
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]]

MapReduce supplies shuffle/sort: combine all records for each stock.

20

Page 21: Streaming Python on Hadoop

MapReduce example - reduce

[Diagram: reduce turns
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
into
Goog, 16.6%
IBM, 12.5%
MS, 4%]

You supply reduce: output the max of the percentages for each input record.

21

Page 22: Streaming Python on Hadoop

Wait, why did that help?
I could have just written a loop to read every line and put the percentages in a table!
● Suppose you have a terabyte of data, and 1000 computers in your cluster.
● MapReduce can automatically split the data into 1000 1GB chunks. You write two simple functions and get (up to) a 1000x speed-up!

22

Page 23: Streaming Python on Hadoop

Modelling problems using MR
● We’re going to look at a variety of problems and see how we can fit them into the MR structure.
● The question for each problem is: what are the types of map and reduce, and what do they do?

23

Page 24: Streaming Python on Hadoop

Example: Word count
Input: lines of text.
Desired output: the number of occurrences of each word (i.e. each sequence of non-space chars).

E.g. Input: Roses are red, violets are blue
Output:
are, 2
blue, 1
red, 1
etc.

24

Page 25: Streaming Python on Hadoop

Example: Word count
Solution:
● map: “w1 w2 … wk” → (w1, 1), (w2, 1), …, (wk, 1)
● reduce: (w, [1, 1, …]) → (w, n), where the input list contains n 1’s

25
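As a sketch of what this solution becomes in streaming style - `wc_map` and `wc_reduce` are hypothetical names, and real streaming scripts would read sys.stdin and print, as the stock-price scripts later do:

```python
from itertools import groupby

def wc_map(lines):
    # map: emit one "word<TAB>1" line for every whitespace-separated word
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word

def wc_reduce(sorted_lines):
    # reduce: after the shuffle's sort, lines with the same word are
    # consecutive, so sum the 1's in each run of identical words
    keyed = (line.split('\t') for line in sorted_lines)
    for word, pairs in groupby(keyed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (word, sum(int(c) for _, c in pairs))

lines = ["Roses are red,", "violets are blue"]
for out in wc_reduce(sorted(wc_map(lines))):
    print(out)
```

On the two example lines this prints `are` with count 2 and every other token with count 1 (note "red," keeps its comma, since a word here is any run of non-space characters).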

Page 26: Streaming Python on Hadoop

Example: Word count frequency
Input: the output of word count.
Desired output: for any number of occurrences c, the number of different words that occur c times.

E.g. Input: Roses are red, violets are blue
Output:
1, 4
2, 1

26

Page 27: Streaming Python on Hadoop

Example: Word count frequency
Solution:
● map: (w, c) → (c, 1)
● reduce: (c, [1, 1, …]) → (c, n), where the input list contains n 1’s

27
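This second job can be sketched the same way, consuming word count's tab-separated output as its input (`freq_map` and `freq_reduce` are my names):

```python
from itertools import groupby

def freq_map(wordcount_lines):
    # map: (word, count) -> (count, 1); the word itself is discarded
    for line in wordcount_lines:
        word, count = line.split('\t')
        yield '%s\t1' % count

def freq_reduce(sorted_lines):
    # reduce: (count, [1, 1, ...]) -> (count, n)
    keyed = (line.split('\t') for line in sorted_lines)
    for count, pairs in groupby(keyed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (count, sum(1 for _ in pairs))

wordcounts = ['Roses\t1', 'are\t2', 'blue\t1', 'red\t1', 'violets\t1']
for out in freq_reduce(sorted(freq_map(wordcounts))):
    print(out)
```

On the word-count output above this yields the slide's answer: four words occur once, one word occurs twice.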

Page 28: Streaming Python on Hadoop

Example: Page Rank
● Famous algorithm used by Google to rank pages. (It comes down to matrix-vector multiplication, as we’ll see…)
● Based on two ideas:
o The importance of a page depends on how many pages link to it.
o However, if a page has lots of links going out, the value of each link is reduced.

28

Page 29: Streaming Python on Hadoop

Example: Page Rank
With those two ideas, calculate the rank of a page:

pagerank(p) = Σ_{q→p} pagerank(q) / out-degree(q)

Note: because the web has cycles - page p can have a link to page q, which has a link to p - this formula requires an iterative solution.

29

Page 30: Streaming Python on Hadoop

Example: Page Rank
Consider pages and their links as a graph (page A has links to B, C, and D, etc.):

pr(A) = pr(B)/2 + pr(D)/2
pr(B) = pr(A)/3 + pr(D)/2
pr(C) = pr(A)/3 + pr(B)/2
pr(D) = pr(A)/3 + pr(C)

30

Page 31: Streaming Python on Hadoop

Example: Page Rank
● Represent the graph as a weighted adjacency matrix (columns = links from a page, rows = links to a page; pages ordered A, B, C, D):

         A    B    C    D
    A [  0   1/2   0   1/2 ]
M = B [ 1/3   0    0   1/2 ]
    C [ 1/3  1/2   0    0  ]
    D [ 1/3   0    1    0  ]

31

Page 32: Streaming Python on Hadoop

Example: Page Rank
● Now, if we put the page rank of each page in a vector v, then multiplying M by v calculates the pagerank formula for all nodes:

[  0   1/2   0   1/2 ]   [ pr(A) ]   [ pr(B)/2 + pr(D)/2 ]
[ 1/3   0    0   1/2 ] × [ pr(B) ] = [ pr(A)/3 + pr(D)/2 ]
[ 1/3  1/2   0    0  ]   [ pr(C) ]   [ pr(A)/3 + pr(B)/2 ]
[ 1/3   0    1    0  ]   [ pr(D) ]   [ pr(A)/3 + pr(C)   ]

32

Page 33: Streaming Python on Hadoop

Example: Page Rank
● So, to calculate page ranks, start with an initial guess of all page ranks and multiply.
● After one multiplication:

[  0   1/2   0   1/2 ]   [ 1/4 ]   [ 1/4  ]
[ 1/3   0    0   1/2 ] × [ 1/4 ] = [ 5/24 ]
[ 1/3  1/2   0    0  ]   [ 1/4 ]   [ 5/24 ]
[ 1/3   0    1    0  ]   [ 1/4 ]   [ 1/3  ]

33

Page 34: Streaming Python on Hadoop

Example: Page Rank
● After two multiplications:

[  0   1/2   0   1/2 ]   [ 1/4  ]   [ .271 ]
[ 1/3   0    0   1/2 ] × [ 5/24 ] = [ .25  ]
[ 1/3  1/2   0    0  ]   [ 5/24 ]   [ .188 ]
[ 1/3   0    1    0  ]   [ 1/3  ]   [ .292 ]

34
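The iteration is easy to check in a few lines of plain Python (a sketch; `mat_vec` is my name for the single PageRank step):

```python
def mat_vec(M, v):
    # one PageRank step: new_rank[p] = sum over links q->p of rank[q]/out-degree(q)
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# The weighted adjacency matrix from slide 31 (pages ordered A, B, C, D):
M = [[0,   1/2, 0, 1/2],
     [1/3, 0,   0, 1/2],
     [1/3, 1/2, 0, 0  ],
     [1/3, 0,   1, 0  ]]

v = [1/4] * 4        # initial guess: equal rank for every page
v = mat_vec(M, v)    # first multiplication:  [1/4, 5/24, 5/24, 1/3]
v = mat_vec(M, v)    # second multiplication: [13/48, 1/4, 3/16, 7/24]
print([round(x, 3) for x in v])
```

Note the second entry after two multiplications is exactly 1/4: pr(A)/3 + pr(D)/2 = (1/4)/3 + (1/3)/2.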

Page 35: Streaming Python on Hadoop

Example: Page Rank
● Thus, page rank = matrix-vector product.
● Can we express matrix-vector multiplication as a MapReduce?
o Assume v is copied (magically) to each node.
o M, being much bigger, needs to be partitioned, i.e. M is the main input file.
o How shall we represent M and define map and reduce?

35

Page 36: Streaming Python on Hadoop

Example: Page Rank
● A solution:
o Represent M using one record for each link: (p, q, out-degree(p)) for every link p→q.
o map: (p, q, d) ↦ (q, v[p]/d)
o reduce: (p, [c1, c2, …]) ↦ (p, c1+c2+…)

36
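Under the slide's assumption that v is available to every mapper, this solution can be sketched in plain Python (`mv_map` and `mv_reduce` are my names; a real streaming job would read the link records from stdin):

```python
from itertools import groupby

def mv_map(link_records, v):
    # map: (p, q, d) -> (q, v[p]/d), where d = out-degree(p)
    for p, q, d in link_records:
        yield q, v[p] / d

def mv_reduce(pairs):
    # shuffle groups the contributions by destination page; reduce sums them
    pairs = sorted(pairs)
    return {q: sum(c for _, c in grp)
            for q, grp in groupby(pairs, key=lambda kv: kv[0])}

# Link records (page, target, out-degree of page) for the graph on slide 30:
links = [('A', 'B', 3), ('A', 'C', 3), ('A', 'D', 3),
         ('B', 'A', 2), ('B', 'C', 2),
         ('C', 'D', 1),
         ('D', 'A', 2), ('D', 'B', 2)]
v = {'A': 1/4, 'B': 1/4, 'C': 1/4, 'D': 1/4}
print(mv_reduce(mv_map(links, v)))
```

One pass over these records with all ranks at 1/4 reproduces the vector from slide 33: A = 1/4, B = 5/24, C = 5/24, D = 1/3.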

Page 37: Streaming Python on Hadoop

MapReduce: Summary
● Nowadays, MapReduce powers the internet:
o Google, Amazon, and Facebook use it extensively for everything from page ranking to error-log analysis.
o The NIH uses it to analyze gene sequences.
o NASA uses it to analyze data from probes.
o etc., etc.
● Next question: how can we implement a MapReduce?

37

Page 38: Streaming Python on Hadoop

Writing map and reduce in Python
● Easy using the streaming interface:
o map and reduce: stdin → stdout. Each should iterate over stdin and output a result for each line.
o Inputs and outputs are text files. In map and reduce output, a tab character separates the key from the value.
o Shuffle just sorts the files on the key. Instead of a line with a key and a list of values, we get consecutive lines with the same key.

38

Page 39: Streaming Python on Hadoop

Example: stock prices
● Recall the output of the shuffle stage:

Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]

● The only difference is that in streaming this becomes:

Goog	4.3%
Goog	16.6%
IBM	12.5%
MS	3.7%
MS	4%

39

Page 40: Streaming Python on Hadoop

Example: stock prices
● On the next two slides, we show the map and reduce functions in Python.
● Both of them are just stand-alone programs that read stdin and write stdout.
● In fact, we can test our pipeline without using MapReduce:

cat input-file | ./map.py | sort | ./reduce.py

40

Page 41: Streaming Python on Hadoop

Example: stock prices - map.py

#!/usr/bin/env python
import sys

for line in sys.stdin:
    record = line.split(",")
    opening = float(record[1])
    closing = float(record[2])
    if closing > opening:
        change = (closing - opening) / opening
        print('%s\t%s' % (record[0], change))

41

Page 42: Streaming Python on Hadoop

Example: stock prices - reduce.py

#!/usr/bin/env python
import sys

stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.split('\t')
    increase = float(increase)
    if next_stock == stock:
        # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:
        # new stock; output result for previous stock
        if stock:  # only false until the first stock is seen
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the last stock
if stock:
    print("%s\t%f" % (stock, max_increase))

42

Page 43: Streaming Python on Hadoop

Invoking Hadoop
● Now we just have to run Hadoop. (Here we are running locally. To run in a cluster, you need to move the data into HDFS first.)

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt -output output \
    -mapper map.py -reducer reduce.py

If you want to run code on our servers, I’ll give instructions at the end of the talk.

43

Page 44: Streaming Python on Hadoop

Brief history of Hadoop
● 2004: Two engineers from Google published a paper on MapReduce.
o Doug Cutting was working on an open-source web crawler; he saw that MapReduce solved his biggest problem - coordinating lots of computers - and decided to implement an open-source version of MR.
o Yahoo hired Cutting and continued and expanded the Hadoop project.

44

Page 45: Streaming Python on Hadoop

Brief history of Hadoop (cont.)
● Today: the Hadoop ecosystem includes its own scheduler, lock mechanism, many database systems, MapReduce, a non-MapReduce parallelism system called Spark, and more.
● Demand for “data engineers” who can manage huge datasets using Hadoop keeps increasing.

45

Page 46: Streaming Python on Hadoop

Summary
● We discussed the easiest way (that I know) to use Hadoop to process large datasets.
● Hadoop provides MapReduce, which can exploit massive parallelism by automatically breaking up inputs and processing the pieces separately, as long as the user supplies map and reduce functions.

46

Page 47: Streaming Python on Hadoop

Summary (cont.)
● Your problem as a programmer is to figure out how to write map and reduce functions that will solve your problem. This is sometimes really easy.
● Using Python streaming, map and reduce are just Python scripts that read from stdin and write to stdout - no need to learn special Hadoop APIs or anything!

47

Page 48: Streaming Python on Hadoop

So is that all there is to MapReduce?
● If only! For more complex cases and for higher efficiency:
o Use Java for higher efficiency.
o Store data in the cluster, for capacity, reliability, and efficiency.
o Tune your application for higher efficiency, e.g. placing computations near data.
o Use some of the many Hadoop components that can make programs easier to write and more efficient.

48

Page 49: Streaming Python on Hadoop

Next steps
● If you want to learn more, there are many books and online tutorials.
o Hadoop: The Definitive Guide, by Tom White, is the definitive guide. (You’ll need to know Java.)
● We’ll be giving a five-Saturday lecture/lab class expanding on this meet-up starting this Saturday, and a twelve-evening class starting August 3.
● We’ll be giving a six-week, full-time bootcamp on Hadoop+Python starting in late August.

49

Page 50: Streaming Python on Hadoop

Running examples
● For those of you who want to run examples:
o Log in to the server per the given instructions.
o The directory streaming-examples has code for stock prices, wordcount, and word frequencies.
o In each directory, enter: source run-hadoop.sh
o The output in output/part-00000 should match the file expected-output.
o If you want to edit and re-run, you need to delete the output directories: rm -r output (and rm -r output0 in count-freq).

50

Page 51: Streaming Python on Hadoop

Running examples (cont.)
● Please let us know if you want to continue working on this tomorrow; we’ll leave the accounts live until Friday if you request it.
● Some suggestions:
o Word count variants:
  Ignore case.
  Ignore punctuation.
  Find the number of words of each length.
  Create a sorted list of the words of each length.

51

Page 52: Streaming Python on Hadoop

Running examples (cont.)
● Some suggestions:
o Stock prices: produce both the max and the min increases.
o Matrix-vector multiplication - you’ll be starting from scratch on this one. Implement the method we described. Suppose the input is in the form p, q1, q2, …, qn, i.e. a page and all of its outgoing links.

52

Page 53: Streaming Python on Hadoop

Combiners
● An obvious source of inefficiency in wordcount: suppose a word occurs twice on one line; we should output one line of ‘w, 2’ instead of two lines of ‘w, 1’.
● In fact, this applies to the entire file: instead of ‘w, 1’ for each occurrence of a word, output ‘w, n’ if w occurs n times.

53
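The same pre-aggregation can be sketched inside the mapper itself, using a dict to count before emitting (in-mapper combining; `combining_map` is my name). The downstream reduce must then sum the emitted counts rather than count lines - which the reduce that sums int values already does:

```python
from collections import Counter

def combining_map(lines):
    # count every word locally, then emit one "word<TAB>n" line per word
    # instead of n separate "word<TAB>1" lines
    counts = Counter(word for line in lines for word in line.split())
    for word, n in sorted(counts.items()):
        yield '%s\t%d' % (word, n)

for out in combining_map(["Roses are red,", "violets are blue"]):
    print(out)
```

Here "are" is emitted once as count 2; over a whole input split, this shrinks the data the shuffle must sort and move.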

Page 54: Streaming Python on Hadoop

Combiners
● Or, to put this differently: we should apply reduce to each file before the shuffle stage.
● We can do this by specifying a combiner function (which in this case is just reduce):

54

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt \
    -output output \
    -mapper map.py \
    -reducer reduce.py \
    -combiner reduce.py