modeling social data, lecture 4: counting at scale

47
Counting at Scale APAM E4990 Modeling Social Data Jake Hofman Columbia University February 10, 2017 Jake Hofman (Columbia University) Counting at Scale February 10, 2017 1 / 29

Upload: jakehofman

Post on 22-Mar-2017

172 views

Category:

Education


1 download

TRANSCRIPT

Page 1: Modeling Social Data, Lecture 4: Counting at Scale

Counting at ScaleAPAM E4990

Modeling Social Data

Jake Hofman

Columbia University

February 10, 2017

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 1 / 29

Page 2: Modeling Social Data, Lecture 4: Counting at Scale

Previously

Claim:

Solving the counting problem at scale enables you to investigatemany interesting questions in the social sciences

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 2 / 29

Page 3: Modeling Social Data, Lecture 4: Counting at Scale

Learning to count

Last week:

Counting at small/medium scales on a single machine

This week:

Counting at large scales in parallel

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29

Page 4: Modeling Social Data, Lecture 4: Counting at Scale

Learning to count

Last week:

Counting at small/medium scales on a single machine

This week:

Counting at large scales in parallel

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29

Page 5: Modeling Social Data, Lecture 4: Counting at Scale

What?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 4 / 29

Page 6: Modeling Social Data, Lecture 4: Counting at Scale

What?

“... to create building blocks for programmers who justhappen to have lots of data to store, lots of data toanalyze, or lots of machines to coordinate, and whodon’t have the time, the skill, or the inclination tobecome distributed systems experts to build theinfrastructure to handle it.”

-Tom WhiteHadoop: The Definitive Guide

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 5 / 29

Page 7: Modeling Social Data, Lecture 4: Counting at Scale

What?

Hadoop contains many subprojects:

We’ll focus on distributed computation with MapReduce.

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 6 / 29

Page 8: Modeling Social Data, Lecture 4: Counting at Scale

Who/when?

An overly brief history

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29

Page 9: Modeling Social Data, Lecture 4: Counting at Scale

Who/when?

pre-2004Doug Cutting and Mike Cafarella develop open source projects for

web-scale indexing, crawling, and search

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29

Page 10: Modeling Social Data, Lecture 4: Counting at Scale

Who/when?

2004Dean and Ghemawat publish MapReduce programming model,

used internally at Google

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29

Page 11: Modeling Social Data, Lecture 4: Counting at Scale

Who/when?

2006Hadoop becomes official Apache project, Cutting joins Yahoo!,

Yahoo adopts Hadoop

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29

Page 12: Modeling Social Data, Lecture 4: Counting at Scale

Who/when?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29

Page 13: Modeling Social Data, Lecture 4: Counting at Scale

Where?

http://wiki.apache.org/hadoop/PoweredBy

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 8 / 29

Page 14: Modeling Social Data, Lecture 4: Counting at Scale

Why?

Why yet another solution?

(I already use too many languages/environments)

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29

Page 15: Modeling Social Data, Lecture 4: Counting at Scale

Why?

Why a distributed solution?

(My desktop has TBs of storage and GBs of memory)

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29

Page 16: Modeling Social Data, Lecture 4: Counting at Scale

Why?

Roughly how long to read 1TB from a commodity hard disk?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29

Page 17: Modeling Social Data, Lecture 4: Counting at Scale

Why?

Roughly how long to read 1TB from a commodity hard disk?

1

2

Gb

sec× 1

8

B

b× 3600

sec

hr≈ 225

GB

hr

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29

Page 18: Modeling Social Data, Lecture 4: Counting at Scale

Why?

Roughly how long to read 1TB from a commodity hard disk?1

≈ 4hrs

1SSDs provide a ∼10x speedupJake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29

Page 19: Modeling Social Data, Lecture 4: Counting at Scale

Why?

http://bit.ly/petabytesort

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 11 / 29

Page 20: Modeling Social Data, Lecture 4: Counting at Scale

Typical scenario

Store, parse, and analyze high-volume server logs,

e.g. how many search queries match “icwsm”?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 12 / 29

Page 21: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: 30k ft

Break large problem into smaller parts, solve in parallel, combineresults

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 13 / 29

Page 22: Modeling Social Data, Lecture 4: Counting at Scale

Typical scenario

“Embarassingly parallel”(or nearly so)

node 1local read filter

node 2local read filter

node 3local read filter

node 4local read filter

}collect results

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 14 / 29

Page 23: Modeling Social Data, Lecture 4: Counting at Scale

Typical scenario++

How many search queries match “icwsm”, grouped by month?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 15 / 29

Page 24: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: example

20091201,4.2.2.1,"icwsm 2010"20100523,2.4.1.2,"hadoop"20100101,9.7.6.5,"tutorial"20091125,2.4.6.1,"data"20090708,4.2.2.1,"open source"20100124,1.2.2.4,"washington dc"

20100522,2.4.1.2,"conference"20091008,4.2.2.1,"2009 icwsm"20090807,4.2.2.1,"apache.org"20100101,9.7.6.5,"mapreduce"20100123,1.2.2.4,"washington dc"20091121,2.4.6.1,"icwsm dates"

20090807,4.2.2.1,"distributed"20091225,4.2.2.1,"icwsm"20100522,2.4.1.2,"media"20100123,1.2.2.4,"social"20091114,2.4.6.1,"d.c."20100101,9.7.6.5,"new year's"

Mapmatching records to(YYYYMM, count=1)

200912, 1

200910, 1200911, 1

200912, 1

200910, 1...

200912, 1200912, 1

...200911, 1

200910, 1...200912, 2

...200911, 1

Shuffleto collect all recordsw/ same key (month)

Reduceresults by adding

count values for each key

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 16 / 29

Page 25: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: paradigm

Programmer specifies map and reduce functions

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29

Page 26: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: paradigm

Map: tranforms input record to intermediate (key, value) pair

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29

Page 27: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: paradigm

Shuffle: collects all intermediate records by key

Record assigned to reducers by hash(key) % num reducers

Reducers perform a merge sort to collect records with same key

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29

Page 28: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: paradigm

Reduce: transforms all records for given key to final output

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29

Page 29: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: paradigm

Distributed read, shuffle, and write are transparent to programmer

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29

Page 30: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: principles

• Move code to data (local computation)

• Allow programs to scale transparently w.r.t size of input

• Abstract away fault tolerance, synchronization, etc.

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 18 / 29

Page 31: Modeling Social Data, Lecture 4: Counting at Scale

MapReduce: strengths

• Batch, offline jobs

• Write-once, read-many across full data set

• Usually, though not always, simple computations

• I/O bound by disk/network bandwidth

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 19 / 29

Page 32: Modeling Social Data, Lecture 4: Counting at Scale

!MapReduce

What it’s not:

• High-performance parallel computing, e.g. MPI

• Low-latency random access relational database

• Always the right solution

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 20 / 29

Page 33: Modeling Social Data, Lecture 4: Counting at Scale

Word count

dog 2-- 1the 3brown 1fox 2jumped 1lazy 2jumps 1over 2quick 1that 1who 1? 1

the quick brown foxjumps over the lazy dogwho jumped over thatlazy dog -- the fox ?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29

Page 34: Modeling Social Data, Lecture 4: Counting at Scale

Word count

Map: for each line, output each word and count (of 1)

the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?

the 1quick 1brown 1fox 1---------jumps 1over 1the 1lazy 1dog 1---------who 1jumped 1over 1---------that 1lazy 1dog 1-- 1the 1fox 1? 1

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29

Page 35: Modeling Social Data, Lecture 4: Counting at Scale

Word count

Shuffle: collect all records for each word

the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?

-- 1---------? 1---------brown 1---------dog 1dog 1---------fox 1fox 1---------jumped 1---------jumps 1---------lazy 1lazy 1---------over 1over 1---------quick 1---------that 1---------the 1the 1the 1---------who 1

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29

Page 36: Modeling Social Data, Lecture 4: Counting at Scale

Word count

Reduce: add counts for each word

-- 1---------? 1---------brown 1---------dog 1dog 1---------fox 1fox 1---------jumped 1---------jumps 1---------lazy 1lazy 1---------over 1over 1---------quick 1---------that 1---------the 1the 1the 1---------who 1

-- 1? 1brown 1dog 2fox 2jumped 1jumps 1lazy 2over 2quick 1that 1the 3who 1

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29

Page 37: Modeling Social Data, Lecture 4: Counting at Scale

Word count

dog 1dog 1----------- 1---------the 1the 1the 1---------brown 1---------fox 1fox 1---------jumped 1---------lazy 1lazy 1---------jumps 1---------over 1over 1---------quick 1---------that 1---------? 1---------who 1

dog 2-- 1the 3brown 1fox 2jumped 1lazy 2jumps 1over 2quick 1that 1who 1? 1

the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29

Page 38: Modeling Social Data, Lecture 4: Counting at Scale

WordCount.java

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 22 / 29

Page 39: Modeling Social Data, Lecture 4: Counting at Scale

Hadoop streaming

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 23 / 29

Page 40: Modeling Social Data, Lecture 4: Counting at Scale

Hadoop streaming

MapReduce for *nix geeks2:

# cat data | map | sort | reduce

Where the sort is a hack to approximate a distributed group-by:

• Mapper reads input data from stdin

• Mapper writes output to stdout

• Reducer receives input, grouped by key, on stdin

• Reducer writes output to stdout

2http://bit.ly/michaelnollJake Hofman (Columbia University) Counting at Scale February 10, 2017 24 / 29

Page 41: Modeling Social Data, Lecture 4: Counting at Scale

wordcount.sh

Locally:

# cat data | tr " " "\n" | sort | uniq -c

Distributed:

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29

Page 42: Modeling Social Data, Lecture 4: Counting at Scale

wordcount.sh

Locally:

# cat data | tr " " "\n" | sort | uniq -c

Distributed:

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29

Page 43: Modeling Social Data, Lecture 4: Counting at Scale

Transparent scaling

Use the same code on MBs locally or TBs across thousandsof machines.

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 26 / 29

Page 44: Modeling Social Data, Lecture 4: Counting at Scale

wordcount.py

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 27 / 29

Page 45: Modeling Social Data, Lecture 4: Counting at Scale

Higher level abstractions

Higher level languages like Pig or Hive provide robustimplementations for many common MapReduce operations

(e.g., filter, sort, join, group by, etc.)

They also allow for user-defined map and reduce functions

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 28 / 29

Page 46: Modeling Social Data, Lecture 4: Counting at Scale

Higher level abstractions

Pig looks like SQL, but is more procedural

pig.apache.org

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29

Page 47: Modeling Social Data, Lecture 4: Counting at Scale

Higher level abstractions

Hive is closer to SQL, with nested declarative statements

hive.apache.org

Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29