Streaming Python on Hadoop

Upload: vivian-shangxuan-zhang

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Streaming Python on Hadoop

1

Page 2: Streaming Python on Hadoop

Meet-up: Tackling “Big Data” with Hadoop and Python

Sam Kamin, VP Data Engineering
NYC Data Science Academy
[email protected]

2

Page 3: Streaming Python on Hadoop

NYC Data Science Academy
● We’re a company that does training and consulting in the Data Science area.
● I’m Sam Kamin. I just joined NYCDSA as VP of Data Engineering (a new area for us). I was formerly a professor of Computer Science at the U. of Illinois and a Software Engineer at Google.

3

Page 4: Streaming Python on Hadoop

What this meet-up is about
● Wikipedia: “Data Science is the extraction of knowledge from large volumes of data.”
● My goal tonight: show you how to handle large volumes of data with simple Python programming, using the Hadoop streaming interface.

4

Page 5: Streaming Python on Hadoop

Outline of talk
● Brief overview of Hadoop
● Introduction to parallelism via MapReduce
● Examples of applying MapReduce
● Implementing MapReduce in Python

You can do some programming at the end if you want!

5

Page 6: Streaming Python on Hadoop

Big Data: What’s the problem?
Too much data!
o The web contains about 5 billion pages. According to Wikipedia, its total size is about 4 zettabytes (a zettabyte is 10^21 bytes) - that’s four thousand billion gigabytes.
o Google’s datacenters store about 15 exabytes (15 × 10^18 bytes).

6

Page 7: Streaming Python on Hadoop

Big Data: What’s the solution?
● Parallel computing: use multiple, cooperating computers.

7

Page 8: Streaming Python on Hadoop

Parallelism
● Parallelism = dividing up a problem so that multiple computers can all work on it:
o Break the data into pieces.
o Send the pieces to different computers for processing.
o Send the results back and process the combination to get the final result.

8

Page 9: Streaming Python on Hadoop

Cloud computing
● Amazon, Google, Microsoft, and many other companies operate huge clusters: racks of (basically) off-the-shelf computers with (basically) standard network connections.
● The computers in these clusters run Linux - you can use them like any other computer...

9

Page 10: Streaming Python on Hadoop

Cloud computing
● But getting them to work together is really hard:
o Management: machine/disk failure; efficient data placement; debugging, monitoring, logging, auditing.
o Algorithms: decomposing your problem so it can be solved in parallel can be hard.

That’s what Hadoop is here to help with.

10

Page 11: Streaming Python on Hadoop

Hadoop
● A collection of services in a cluster:
o A distributed, reliable file system (HDFS).
o A scheduler to run jobs in correct order, monitor them, restart on failure, etc.
o MapReduce, to help you decompose your problem for parallel execution.
o A variety of other components (mostly based on MapReduce), e.g. databases and application-focused libraries.

11

Page 12: Streaming Python on Hadoop

How to use Hadoop
● Hadoop is open source (free!)
● It is hosted by Apache: hadoop.apache.org
● Download it and run it standalone (for debugging).
● Buy a cluster, or rent time on one, e.g. AWS, GCE, Azure. (All offer some free time for new users.)

12

Page 13: Streaming Python on Hadoop

MapReduce
● The main, and original, parallel-processing system of Hadoop.
● Developed by Google to simplify parallel processing. Hadoop started as an open-source implementation of Google’s idea.
● With Hadoop’s streaming interface, it’s really easy to use MapReduce in Python.

13

Page 14: Streaming Python on Hadoop

MapReduce - The Big Idea
● Calculations on large data sets often have this form: start by aggregating the data (possibly in a different order from the “natural order”), then perform a summarizing calculation on the aggregated groups.
● The idea of MapReduce: if your calculation is explicitly structured like this, it can be automatically parallelized.

14

Page 15: Streaming Python on Hadoop

Computing with MapReduce
A MapReduce computation has three stages:
Map: A function called map is applied to each record in your input. It produces zero or more records as output, each with a key and a value. Keys may be repeated.
Shuffle: The output from step 1 is sorted and combined: all records with the same key are combined into one.
Reduce: A function called reduce is applied to each record (key + values) from step 2 to produce the final output.
As the programmer, you only write map and reduce.

15
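The three stages can be sketched in a few lines of plain Python - a single-process toy model, not how Hadoop actually runs, and `map_reduce` and its helper names are mine:

```python
from itertools import groupby

def map_reduce(records, map_fn, reduce_fn):
    # Map: apply map_fn to each input record; it yields (key, value) pairs.
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle: sort by key and group all values for the same key together.
    mapped.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=lambda kv: kv[0])]
    # Reduce: apply reduce_fn to each (key, list-of-values) record.
    return [reduce_fn(k, vs) for k, vs in grouped]

# The (key, value) records from the next slide's diagram:
records = [("A", 7), ("C", 5), ("B", 23), ("B", 12), ("A", 18)]
identity_map = lambda rec: [rec]    # pass each record through unchanged
collect = lambda k, vs: (k, vs)     # just show the grouped values
print(map_reduce(records, identity_map, collect))
```

The shuffle groups the records into (A, [7, 18]), (B, [23, 12]), (C, [5]); note that the order of values within a group is not something MapReduce guarantees.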

Page 16: Streaming Python on Hadoop

Computing with MapReduce

[Diagram: the input records (A,7), (C,5), (B,23), (B,12), (A,18) pass through map; shuffle combines them into (A,[18, 7]), (B,[23, 12]), (C,[5]); and reduce produces the output.]

Note: map is record-oriented, meaning the output of the map stage is strictly a combination of the outputs from each record. That allows us to calculate in parallel...

16

Page 17: Streaming Python on Hadoop

Parallelism via MapReduce

[Diagram: the input is split into chunks, each distributed to its own map task; combine/shuffle groups the mapped records into (A,[18, 7]), (B,[23, 12]), (C,[5]); and parallel reduce tasks each produce part of the output.]

Because map and reduce are record-oriented, MR can divide inputs into arbitrary chunks.

17

Page 18: Streaming Python on Hadoop

MapReduce example: Stock prices
● Input: list of daily opening and closing prices for thousands of stocks over thousands of days.
● Desired output: the biggest-ever one-day percentage price increase for each stock.
● Solution using MR:
o map: (stock, open, close) => (stock, (close - open) / open), emitted only if the change is positive.
o reduce: (stock, [%c0, %c1, …]) => (stock, max [%c0, %c1, …]).

18

Page 19: Streaming Python on Hadoop

MapReduce example - map

[Diagram: map turns the input records
Goog, 230, 240
Apple, 100, 98
MS, 300, 250
MS, 250, 260
MS, 270, 280
Goog, 220, 215
Goog, 300, 350
IBM, 80, 90
IBM, 90, 85
into
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%]

You supply map: output the stock with its % increase, or nothing on a decrease.

19

Page 20: Streaming Python on Hadoop

MapReduce example - shuffle/sort

[Diagram: shuffle/sort turns
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%
into
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]]

MapReduce supplies shuffle/sort: combine all records for each stock.

20

Page 21: Streaming Python on Hadoop

MapReduce example - reduce

[Diagram: reduce turns
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
into
Goog, 16.6%
IBM, 12.5%
MS, 4%]

You supply reduce: output the max of the percentages for each input record.

21

Page 22: Streaming Python on Hadoop

Wait, why did that help?
I could have just written a loop to read every line and put the percentages in a table!
● Suppose you have a terabyte of data, and 1000 computers in your cluster.
● MapReduce can automatically split the data into 1000 1GB chunks. You write two simple functions and get (up to) a 1000x speed-up!

22

Page 23: Streaming Python on Hadoop

Modelling problems using MR
● We’re going to look at a variety of problems and see how we can fit them into the MR structure.
● The question for each problem is: what are the types of map and reduce, and what do they do?

23

Page 24: Streaming Python on Hadoop

Example: Word count
Input: lines of text.
Desired output: the number of occurrences of each word (i.e. each sequence of non-space chars).

E.g. Input: Roses are red, violets are blue
Output:
are, 2
blue, 1
red, 1
etc.

24

Page 25: Streaming Python on Hadoop

Example: Word count
Solution:
● map: “w1 w2 … wk” → (w1, 1), (w2, 1), …, (wk, 1)
● reduce: (w, [1, 1, …]) → (w, n), where the input list contains n 1’s

25
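As a sketch of what this solution becomes in streaming style - `wc_map` and `wc_reduce` are hypothetical names, and real streaming scripts would read sys.stdin and print, as the stock-price scripts later do:

```python
from itertools import groupby

def wc_map(lines):
    # map: emit one "word<TAB>1" line for every whitespace-separated word
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word

def wc_reduce(sorted_lines):
    # reduce: after the shuffle's sort, lines with the same word are
    # consecutive, so sum the 1's in each run of identical words
    keyed = (line.split('\t') for line in sorted_lines)
    for word, pairs in groupby(keyed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (word, sum(int(c) for _, c in pairs))

lines = ["Roses are red,", "violets are blue"]
for out in wc_reduce(sorted(wc_map(lines))):
    print(out)
```

On the two example lines this prints `are` with count 2 and every other token with count 1 (note "red," keeps its comma, since a word here is any run of non-space characters).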

Page 26: Streaming Python on Hadoop

Example: Word count frequency
Input: the output of word count.
Desired output: for any number of occurrences c, the number of different words that occur c times.

E.g. Input: Roses are red, violets are blue
Output:
1, 4
2, 1

26

Page 27: Streaming Python on Hadoop

Example: Word count frequency
Solution:
● map: (w, c) → (c, 1)
● reduce: (c, [1, 1, …]) → (c, n), where the input list contains n 1’s

27
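This second job can be sketched the same way, consuming word count's tab-separated output as its input (`freq_map` and `freq_reduce` are my names):

```python
from itertools import groupby

def freq_map(wordcount_lines):
    # map: (word, count) -> (count, 1); the word itself is discarded
    for line in wordcount_lines:
        word, count = line.split('\t')
        yield '%s\t1' % count

def freq_reduce(sorted_lines):
    # reduce: (count, [1, 1, ...]) -> (count, n)
    keyed = (line.split('\t') for line in sorted_lines)
    for count, pairs in groupby(keyed, key=lambda kv: kv[0]):
        yield '%s\t%d' % (count, sum(1 for _ in pairs))

wordcounts = ['Roses\t1', 'are\t2', 'blue\t1', 'red\t1', 'violets\t1']
for out in freq_reduce(sorted(freq_map(wordcounts))):
    print(out)
```

On the word-count output above this yields the slide's answer: four words occur once, one word occurs twice.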

Page 28: Streaming Python on Hadoop

Example: Page Rank
● Famous algorithm used by Google to rank pages. (It comes down to matrix-vector multiplication, as we’ll see…)
● Based on two ideas:
o The importance of a page depends on how many pages link to it.
o However, if a page has lots of links going out, the value of each link is reduced.

28

Page 29: Streaming Python on Hadoop

Example: Page Rank
With those two ideas, calculate the rank of a page:

pagerank(p) = Σ_{q→p} pagerank(q) / out-degree(q)

Note: because the web has cycles - page p can have a link to page q, which has a link to p - this formula requires an iterative solution.

29

Page 30: Streaming Python on Hadoop

Example: Page Rank
Consider pages and their links as a graph (page A has links to B, C, and D, etc.):

pr(A) = pr(B)/2 + pr(D)/2
pr(B) = pr(A)/3 + pr(D)/2
pr(C) = pr(A)/3 + pr(B)/2
pr(D) = pr(A)/3 + pr(C)

30

Page 31: Streaming Python on Hadoop

Example: Page Rank
● Represent the graph as a weighted adjacency matrix (columns = links from a page, rows = links to a page; pages ordered A, B, C, D):

         A    B    C    D
    A [  0   1/2   0   1/2 ]
M = B [ 1/3   0    0   1/2 ]
    C [ 1/3  1/2   0    0  ]
    D [ 1/3   0    1    0  ]

31

Page 32: Streaming Python on Hadoop

Example: Page Rank
● Now, if we put the page rank of each page in a vector v, then multiplying M by v calculates the pagerank formula for all nodes:

[  0   1/2   0   1/2 ]   [ pr(A) ]   [ pr(B)/2 + pr(D)/2 ]
[ 1/3   0    0   1/2 ] × [ pr(B) ] = [ pr(A)/3 + pr(D)/2 ]
[ 1/3  1/2   0    0  ]   [ pr(C) ]   [ pr(A)/3 + pr(B)/2 ]
[ 1/3   0    1    0  ]   [ pr(D) ]   [ pr(A)/3 + pr(C)   ]

32

Page 33: Streaming Python on Hadoop

Example: Page Rank
● So, to calculate page ranks, start with an initial guess of all page ranks and multiply.
● After one multiplication:

[  0   1/2   0   1/2 ]   [ 1/4 ]   [ 1/4  ]
[ 1/3   0    0   1/2 ] × [ 1/4 ] = [ 5/24 ]
[ 1/3  1/2   0    0  ]   [ 1/4 ]   [ 5/24 ]
[ 1/3   0    1    0  ]   [ 1/4 ]   [ 1/3  ]

33

Page 34: Streaming Python on Hadoop

Example: Page Rank
● After two multiplications:

[  0   1/2   0   1/2 ]   [ 1/4  ]   [ .271 ]
[ 1/3   0    0   1/2 ] × [ 5/24 ] = [ .25  ]
[ 1/3  1/2   0    0  ]   [ 5/24 ]   [ .188 ]
[ 1/3   0    1    0  ]   [ 1/3  ]   [ .292 ]

34
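The iteration is easy to check in a few lines of plain Python (a sketch; `mat_vec` is my name for the single PageRank step):

```python
def mat_vec(M, v):
    # one PageRank step: new_rank[p] = sum over links q->p of rank[q]/out-degree(q)
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# The weighted adjacency matrix from slide 31 (pages ordered A, B, C, D):
M = [[0,   1/2, 0, 1/2],
     [1/3, 0,   0, 1/2],
     [1/3, 1/2, 0, 0  ],
     [1/3, 0,   1, 0  ]]

v = [1/4] * 4        # initial guess: equal rank for every page
v = mat_vec(M, v)    # first multiplication:  [1/4, 5/24, 5/24, 1/3]
v = mat_vec(M, v)    # second multiplication: [13/48, 1/4, 3/16, 7/24]
print([round(x, 3) for x in v])
```

Note the second entry after two multiplications is exactly 1/4: pr(A)/3 + pr(D)/2 = (1/4)/3 + (1/3)/2.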

Page 35: Streaming Python on Hadoop

Example: Page Rank
● Thus, page rank = matrix-vector product.
● Can we express matrix-vector multiplication as a MapReduce?
o Assume v is copied (magically) to each node.
o M, being much bigger, needs to be partitioned, i.e. M is the main input file.
o How shall we represent M and define map and reduce?

35

Page 36: Streaming Python on Hadoop

Example: Page Rank
● A solution:
o Represent M using one record for each link: (p, q, out-degree(p)) for every link p→q.
o map: (p, q, d) ↦ (q, v[p]/d)
o reduce: (p, [c1, c2, …]) ↦ (p, c1+c2+…)

36
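Under the slide's assumption that v is available to every mapper, this solution can be sketched in plain Python (`mv_map` and `mv_reduce` are my names; a real streaming job would read the link records from stdin):

```python
from itertools import groupby

def mv_map(link_records, v):
    # map: (p, q, d) -> (q, v[p]/d), where d = out-degree(p)
    for p, q, d in link_records:
        yield q, v[p] / d

def mv_reduce(pairs):
    # shuffle groups the contributions by destination page; reduce sums them
    pairs = sorted(pairs)
    return {q: sum(c for _, c in grp)
            for q, grp in groupby(pairs, key=lambda kv: kv[0])}

# Link records (page, target, out-degree of page) for the graph on slide 30:
links = [('A', 'B', 3), ('A', 'C', 3), ('A', 'D', 3),
         ('B', 'A', 2), ('B', 'C', 2),
         ('C', 'D', 1),
         ('D', 'A', 2), ('D', 'B', 2)]
v = {'A': 1/4, 'B': 1/4, 'C': 1/4, 'D': 1/4}
print(mv_reduce(mv_map(links, v)))
```

One pass over these records with all ranks at 1/4 reproduces the vector from slide 33: A = 1/4, B = 5/24, C = 5/24, D = 1/3.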

Page 37: Streaming Python on Hadoop

MapReduce: Summary
● Nowadays, MapReduce powers the internet:
o Google, Amazon, and Facebook use it extensively for everything from page ranking to error-log analysis.
o The NIH uses it to analyze gene sequences.
o NASA uses it to analyze data from probes.
o etc., etc.
● Next question: how can we implement a MapReduce?

37

Page 38: Streaming Python on Hadoop

Writing map and reduce in Python
● Easy using the streaming interface:
o map and reduce: stdin → stdout. Each should iterate over stdin and output a result for each line.
o Inputs and outputs are text files. In map and reduce output, a tab character separates the key from the value.
o Shuffle just sorts the files on the key. Instead of a line with a key and a list of values, we get consecutive lines with the same key.

38

Page 39: Streaming Python on Hadoop

Example: stock prices
● Recall the output of the shuffle stage:

Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]

● The only difference is that in streaming this becomes:

Goog	4.3%
Goog	16.6%
IBM	12.5%
MS	3.7%
MS	4%

39

Page 40: Streaming Python on Hadoop

Example: stock prices
● On the next two slides, we show the map and reduce functions in Python.
● Both of them are just stand-alone programs that read stdin and write stdout.
● In fact, we can test our pipeline without using MapReduce:

cat input-file | ./map.py | sort | ./reduce.py

40

Page 41: Streaming Python on Hadoop

Example: stock prices - map.py

#!/usr/bin/env python
import sys

for line in sys.stdin:
    record = line.split(",")
    opening = float(record[1])
    closing = float(record[2])
    if closing > opening:
        change = (closing - opening) / opening
        print('%s\t%s' % (record[0], change))

41

Page 42: Streaming Python on Hadoop

Example: stock prices - reduce.py

#!/usr/bin/env python
import sys

stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.split('\t')
    increase = float(increase)
    if next_stock == stock:
        # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:
        # new stock; output result for previous stock
        if stock:  # only false until the first stock is seen
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the last stock
if stock:
    print("%s\t%f" % (stock, max_increase))

42

Page 43: Streaming Python on Hadoop

Invoking Hadoop
● Now we just have to run Hadoop. (Here we are running locally. To run in a cluster, you need to move the data into HDFS first.)

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt -output output \
    -mapper map.py -reducer reduce.py

If you want to run code on our servers, I’ll give instructions at the end of the talk.

43

Page 44: Streaming Python on Hadoop

Brief history of Hadoop
● 2004: Two engineers from Google published a paper on MapReduce.
o Doug Cutting was working on an open-source web crawler; he saw that MapReduce solved his biggest problem - coordinating lots of computers - and decided to implement an open-source version of MR.
o Yahoo hired Cutting and continued and expanded the Hadoop project.

44

Page 45: Streaming Python on Hadoop

Brief history of Hadoop (cont.)
● Today: the Hadoop ecosystem includes its own scheduler, lock mechanism, many database systems, MapReduce, a non-MapReduce parallelism system called Spark, and more.
● Demand for “data engineers” who can manage huge datasets using Hadoop keeps increasing.

45

Page 46: Streaming Python on Hadoop

Summary
● We discussed the easiest way (that I know) to use Hadoop to process large datasets.
● Hadoop provides MapReduce, which can exploit massive parallelism by automatically breaking up inputs and processing the pieces separately, as long as the user supplies map and reduce functions.

46

Page 47: Streaming Python on Hadoop

Summary (cont.)
● Your problem as a programmer is to figure out how to write map and reduce functions that will solve your problem. This is sometimes really easy.
● Using Python streaming, map and reduce are just Python scripts that read from stdin and write to stdout - no need to learn special Hadoop APIs or anything!

47

Page 48: Streaming Python on Hadoop

So is that all there is to MapReduce?
● If only! For more complex cases and for higher efficiency:
o Use Java for higher efficiency.
o Store data in the cluster, for capacity, reliability, and efficiency.
o Tune your application for higher efficiency, e.g. placing computations near data.
o Use some of the many Hadoop components that can make programs easier to write and more efficient.

48

Page 49: Streaming Python on Hadoop

Next steps
● If you want to learn more, there are many books and online tutorials.
o Hadoop: The Definitive Guide, by Tom White, is the definitive guide. (You’ll need to know Java.)
● We’ll be giving a five-Saturday lecture/lab class expanding on this meet-up starting this Saturday, and a twelve-evening class starting August 3.
● We’ll be giving a six-week, full-time bootcamp on Hadoop+Python starting in late August.

49

Page 50: Streaming Python on Hadoop

Running examples
● For those of you who want to run examples:
o Log in to the server per the given instructions.
o The directory streaming-examples has code for stock prices, wordcount, and word frequencies.
o In each directory, enter: source run-hadoop.sh
o The output in output/part-00000 should match the file expected-output.
o If you want to edit and re-run, you need to delete the output directories: rm -r output (and rm -r output0 in count-freq).

50

Page 51: Streaming Python on Hadoop

Running examples (cont.)
● Please let us know if you want to continue working on this tomorrow; we’ll leave the accounts live until Friday if you request it.
● Some suggestions:
o Word count variants:
  Ignore case.
  Ignore punctuation.
  Find the number of words of each length.
  Create a sorted list of the words of each length.

51

Page 52: Streaming Python on Hadoop

Running examples (cont.)
● Some suggestions:
o Stock prices: produce both the max and the min increases.
o Matrix-vector multiplication - you’ll be starting from scratch on this one. Implement the method we described. Suppose the input is in the form p, q1, q2, …, qn, i.e. a page and all of its outgoing links.

52

Page 53: Streaming Python on Hadoop

Combiners
● An obvious source of inefficiency in wordcount: suppose a word occurs twice on one line; we should output one line of ‘w, 2’ instead of two lines of ‘w, 1’.
● In fact, this applies to the entire file: instead of ‘w, 1’ for each occurrence of a word, output ‘w, n’ if w occurs n times.

53
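The same pre-aggregation can be sketched inside the mapper itself, using a dict to count before emitting (in-mapper combining; `combining_map` is my name). The downstream reduce must then sum the emitted counts rather than count lines - which the reduce that sums int values already does:

```python
from collections import Counter

def combining_map(lines):
    # count every word locally, then emit one "word<TAB>n" line per word
    # instead of n separate "word<TAB>1" lines
    counts = Counter(word for line in lines for word in line.split())
    for word, n in sorted(counts.items()):
        yield '%s\t%d' % (word, n)

for out in combining_map(["Roses are red,", "violets are blue"]):
    print(out)
```

Here "are" is emitted once as count 2; over a whole input split, this shrinks the data the shuffle must sort and move.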

Page 54: Streaming Python on Hadoop

Combiners
● Or, to put this differently: we should apply reduce to each file before the shuffle stage.
● We can do this by specifying a combiner function (which in this case is just reduce):

54

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt \
    -output output \
    -mapper map.py \
    -reducer reduce.py \
    -combiner reduce.py