EC2, MapReduce, and Distributed Processing

Jonathan Dahl (and Rail Spikes, Slantwise, Zencoder, etc.)

Post on 07-Nov-2014

DESCRIPTION

RailsConf Europe talk on MapReduce and EC2.

TRANSCRIPT

Page 1: EC2, MapReduce, and Distributed Processing

EC2, MapReduce, and Distributed Processing

Jonathan Dahl

(and Rail Spikes, Slantwise, Zencoder, etc.)

Page 2: EC2, MapReduce, and Distributed Processing

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Refers to any of a variety of computer systems that use more than one computer, or processor, to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various

Page 3: EC2, MapReduce, and Distributed Processing

asynchronous processing /a·syn·chro·nous prŏs'ěs'ĭz/ noun Computations that run independently of each other, without requiring constant synchronization. Each operation

Page 4: EC2, MapReduce, and Distributed Processing

parallel processing /par·al·lel prŏs'ěs'ĭz/ noun Simultaneous computation of a single problem or system running across separate CPU cores. Includes

Page 5: EC2, MapReduce, and Distributed Processing

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Just like parallel processing, but utilizing separate full systems, not just separate CPU cores.

Page 6: EC2, MapReduce, and Distributed Processing

You

Page 7: EC2, MapReduce, and Distributed Processing

Me

Page 9: EC2, MapReduce, and Distributed Processing

Map______...

Page 10: EC2, MapReduce, and Distributed Processing

[Diagram: Rails DB, Message Queue, Transcoder 1, Transcoder 2, Transcoder 3. Steps: 1. Poll Queue; 2. Get job; 3. Result]

Page 11: EC2, MapReduce, and Distributed Processing

Roadmap:
I. Functional Programming
II. MapReduce
III. EC2
IV. Distributed Processing

Page 12: EC2, MapReduce, and Distributed Processing

I. Functional Programming

Page 13: EC2, MapReduce, and Distributed Processing

ƒ(x) vs. i++;

Page 14: EC2, MapReduce, and Distributed Processing

ƒ(x) = 2x + 1

Page 15: EC2, MapReduce, and Distributed Processing

ƒ(person) = first name + last name

Page 16: EC2, MapReduce, and Distributed Processing

lambda {|x| x*2 + 1 }

Page 17: EC2, MapReduce, and Distributed Processing

lambda do |user|
  "#{user.firstname} #{user.lastname}"
end

Page 18: EC2, MapReduce, and Distributed Processing

ƒ(users) = ∑ of logins for each user

Page 19: EC2, MapReduce, and Distributed Processing

users.sum { |user| user.number_of_logins }

Page 20: EC2, MapReduce, and Distributed Processing

var total_logins = 0;

for (i = 0; i < users.size; i++) {
  total_logins += number_of_logins(users[i])
}

Page 21: EC2, MapReduce, and Distributed Processing

users.sum(&:number_of_logins)
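The contrast between the loop and the one-line sum can be tried in plain Ruby. This is a minimal sketch, assuming a hypothetical `User` struct with a `number_of_logins` field (the slides do not show the model) and Ruby 2.4 or later for `Array#sum`:

```ruby
# Hypothetical stand-in for the User model on the slides.
User = Struct.new(:name, :number_of_logins)

users = [User.new("david", 3), User.new("stanley", 5), User.new("anna", 2)]

# Imperative style: mutate a counter inside a loop.
total_logins = 0
users.each { |user| total_logins += user.number_of_logins }

# Functional style: describe the result as a function of the list.
functional_total = users.sum(&:number_of_logins)

puts total_logins      # 10
puts functional_total  # 10
```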

Page 23: EC2, MapReduce, and Distributed Processing

users.each {}

Page 24: EC2, MapReduce, and Distributed Processing

result = Array.new

users.each {|user| result << user.email }

result
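The build-a-result-inside-`each` pattern above is a fold in disguise. A small sketch, assuming a hypothetical `User` struct with only an `email` field and made-up addresses:

```ruby
# Hypothetical stand-in for the User model on the slides.
User = Struct.new(:email)
users = [User.new("david@example.com"), User.new("anna@example.com")]

# The slide's pattern: build the result by hand inside each.
result = Array.new
users.each { |user| result << user.email }

# The same accumulation written as a fold: the block returns the
# accumulator, which is then passed to the next step.
folded = users.inject([]) { |emails, user| emails << user.email }

puts result == folded  # true
```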

Page 25: EC2, MapReduce, and Distributed Processing

reduce

Page 26: EC2, MapReduce, and Distributed Processing

reduce == inject == fold

Page 27: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

Page 28: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

Page 29: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

ƒ(x,y) = x + y
ƒ(x,y) = x << y if y > 0
ƒ(x,y) = x << y.upcase

Page 30: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

lambda {|result, i| result + i}

lambda do |result, i|
  result << i if i > 0
  result
end

lambda {|r, i| r << i.upcase }

Page 31: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

0
[]
Hash.new("")
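The three ingredients (a list, a two-argument function, and an initial value) fit together as follows. This is a hand-rolled sketch that mirrors the `reduce(list, function, init)` signature from the slides; Ruby's built-in `Enumerable#reduce` does the same job, and the sample inputs are made up:

```ruby
# A hand-rolled reduce, written to mirror the slide's signature.
def reduce(list, function, init)
  result = init
  list.each { |item| result = function.call(result, item) }
  result
end

sum      = reduce(1..10, lambda { |result, i| result + i }, 0)
positive = reduce([3, -1, 4, 0], lambda { |result, i| i > 0 ? result << i : result }, [])
shouted  = reduce(%w[a b c d], lambda { |r, i| r << i.upcase }, [])

p sum       # 55
p positive  # [3, 4]
p shouted   # ["A", "B", "C", "D"]
```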

Page 32: EC2, MapReduce, and Distributed Processing

list.reduce(init) {}

Page 33: EC2, MapReduce, and Distributed Processing

(1..10).reduce(0) do |r, x|
  r + x
end

Page 41: EC2, MapReduce, and Distributed Processing

(1..10).reduce(0) do |r, x|
  r + x
end
# 55
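To see how the accumulator `r` moves through the list, each step can be recorded as it happens. A small sketch; the `steps` array is added here purely for illustration:

```ruby
steps = []
total = (1..10).reduce(0) do |r, x|
  steps << "#{r} + #{x} = #{r + x}"  # record this step
  r + x                              # becomes r in the next step
end

puts steps.first  # 0 + 1 = 1
puts steps.last   # 45 + 10 = 55
puts total        # 55
```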

Page 42: EC2, MapReduce, and Distributed Processing

reduce
inject
fold

Page 43: EC2, MapReduce, and Distributed Processing

list -> value

Page 45: EC2, MapReduce, and Distributed Processing

|result, x|

Page 57: EC2, MapReduce, and Distributed Processing

map(list, function)

Page 58: EC2, MapReduce, and Distributed Processing

map(list, function)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

Page 59: EC2, MapReduce, and Distributed Processing

map(list, function)

lambda {|x| x + 1 }
lambda {|x| x.upcase }
lambda {|x| x.nil? }

Page 60: EC2, MapReduce, and Distributed Processing

list.map {}

Page 61: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| x > 5 }

Page 67: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| x > 5 }
# [false, false, false, false, false, true, true, true, true, true]
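As with reduce, `map(list, function)` can be written out by hand to show that it applies the function to each element on its own, with no accumulator carried between steps. A sketch that mirrors the slide's signature (Ruby's built-in `Enumerable#map` is what you would normally use):

```ruby
# A hand-rolled map, written to mirror the slide's signature.
def map(list, function)
  result = []
  list.each { |item| result << function.call(item) }
  result
end

p map(1..3, lambda { |x| x + 1 })            # [2, 3, 4]
p map(%w[a b c d], lambda { |x| x.upcase })  # ["A", "B", "C", "D"]
p map([1, nil], lambda { |x| x.nil? })       # [false, true]
```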

Page 71: EC2, MapReduce, and Distributed Processing

["a", "b", "c"] => ["A", "B", "C"]

User.all => ["david", "stanley", "anna"]

Page 72: EC2, MapReduce, and Distributed Processing

(1..5).map {|x| x * x}

1 * 1
2 * 2
3 * 3
4 * 4
5 * 5

Page 73: EC2, MapReduce, and Distributed Processing

parallelizable!
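Because no step of a map reads another step's result, the steps can run in any order or all at once. A rough sketch using one thread per element; threads are only a local stand-in for separate machines, and on MRI they will not speed up CPU-bound work:

```ruby
# Each element is squared in its own thread. No step depends on
# any other, so the order of execution does not matter.
threads = (1..5).map { |x| Thread.new { x * x } }

# Thread#value waits for the thread and returns its block's result.
squares = threads.map(&:value)

p squares  # [1, 4, 9, 16, 25]
```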

Page 74: EC2, MapReduce, and Distributed Processing

(1..5).reduce(1) {|i,x| i * x}

Page 75: EC2, MapReduce, and Distributed Processing

map: parallelizable

reduce: not (?)

Page 76: EC2, MapReduce, and Distributed Processing

II. MapReduce

Page 77: EC2, MapReduce, and Distributed Processing

MapReduce != map + reduce

Page 79: EC2, MapReduce, and Distributed Processing

MAP a problem across several servers

Page 80: EC2, MapReduce, and Distributed Processing

REDUCE the results of each server to a single result set

Page 81: EC2, MapReduce, and Distributed Processing

list.map {|i| i.function }

results.reduce({}) {|final, i| final[i.key] = i.function; final }

Page 86: EC2, MapReduce, and Distributed Processing

list.map {|i| i.function }

(group)

results.reduce({}) {|final, i| final[i.key] = i.function; final }

Page 90: EC2, MapReduce, and Distributed Processing

key -> value

Page 91: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| }

1. Initial data

Page 92: EC2, MapReduce, and Distributed Processing

(1..10).map_with_index {|i, x| }

1. Initial data

Page 93: EC2, MapReduce, and Distributed Processing

1. Initial data

• GFS chunk identifier
• Book page number
• Web URL
• Arbitrary group ID

Page 94: EC2, MapReduce, and Distributed Processing

2. Intermediate data

Map server 1:
'key1' -> 6.8
'key2' -> 6.9
'key3' -> 8.1

Page 95: EC2, MapReduce, and Distributed Processing

2. Intermediate data

Map server 2:
'key1' -> 6.2
'key4' -> 5.5

Page 96: EC2, MapReduce, and Distributed Processing

3. Final data

Reduce results:
'key1' -> 6.5
'key2' -> 6.9
'key3' -> 8.1
'key4' -> 5.5
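The path from the two map servers' intermediate data to the final data can be reproduced in a few lines. This sketch assumes the reduce function is an average; the slides do not name the function, but an average is consistent with 'key1' going from 6.8 and 6.2 to 6.5. It needs Ruby 2.4 or later for `transform_values` and `Array#sum`:

```ruby
# Intermediate output from the two map servers on the slides.
map_server_1 = { "key1" => 6.8, "key2" => 6.9, "key3" => 8.1 }
map_server_2 = { "key1" => 6.2, "key4" => 5.5 }

# Group: collect every value emitted for a key into one list.
grouped = Hash.new { |hash, key| hash[key] = [] }
[map_server_1, map_server_2].each do |output|
  output.each { |key, value| grouped[key] << value }
end

# Reduce: collapse each key's list to a single value.
# Averaging is an assumption that happens to match the slide's numbers.
final = grouped.transform_values do |values|
  (values.sum / values.size).round(1)
end

final.each { |key, value| puts "#{key} -> #{value}" }
# key1 -> 6.5
# key2 -> 6.9
# key3 -> 8.1
# key4 -> 5.5
```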

Page 97: EC2, MapReduce, and Distributed Processing

another view

Page 111: EC2, MapReduce, and Distributed Processing

• Stage in between ‘map’ and ‘reduce’

Page 112: EC2, MapReduce, and Distributed Processing

• All mappers must finish before reduce

Page 113: EC2, MapReduce, and Distributed Processing

• Prepare intermediate results

Page 114: EC2, MapReduce, and Distributed Processing

• (Group results by key)

Page 116: EC2, MapReduce, and Distributed Processing

Parallel reduce?

Page 118: EC2, MapReduce, and Distributed Processing

ƒ(key1), ƒ(key3), ƒ(key4)

ƒ(key2), ƒ(key5)
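Once results are grouped by key, each key can be reduced without looking at any other key, so the keys can be split between reducers. A sketch with two threads standing in for two reduce servers; the values, the use of a sum as ƒ, and the way keys are assigned are all made up for illustration:

```ruby
# Grouped intermediate data: each key owns its list of values.
grouped = {
  "key1" => [1, 2], "key2" => [3], "key3" => [4, 5],
  "key4" => [6],    "key5" => [7, 8]
}

# Hypothetical assignment of keys to two reducers, as on the slide.
assignments = [%w[key1 key3 key4], %w[key2 key5]]

# Each reducer runs f(key) for its own keys only, so the reducers
# never need each other's data and can run at the same time.
reducers = assignments.map do |keys|
  Thread.new do
    keys.map { |key| [key, grouped[key].sum] }.to_h
  end
end

# Merge the partial results into one final result set.
final = reducers.map(&:value).reduce({}) { |all, part| all.merge(part) }

final.sort.each { |key, total| puts "#{key} -> #{total}" }
```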

Page 121: EC2, MapReduce, and Distributed Processing

Example

Page 122: EC2, MapReduce, and Distributed Processing

chunky: 12
bacon: 15

Page 124: EC2, MapReduce, and Distributed Processing

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

Page 125: EC2, MapReduce, and Distributed Processing

c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end

Page 126: EC2, MapReduce, and Distributed Processing

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]

Page 129: EC2, MapReduce, and Distributed Processing

+1 second

Page 130: EC2, MapReduce, and Distributed Processing

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

Page 131: EC2, MapReduce, and Distributed Processing

word_chunks = input_words.each_slice(200).to_a

Page 132: EC2, MapReduce, and Distributed Processing

mapped_words = word_chunks.map do |words|
  distributed_count(words)
end

Page 133: EC2, MapReduce, and Distributed Processing

def distributed_count(words)
  c = words.inject(Hash.new(0)) do |i, word|
    i[word.downcase.to_sym] += 1
    i
  end
  c.sort{|a,b| b[1]<=>a[1]}
end

Page 134: EC2, MapReduce, and Distributed Processing

grouped_words = group(mapped_words)

# :the => [1829, 887, 1523] ..
# :cat => [19, 7, 36, 132] ...

Page 135: EC2, MapReduce, and Distributed Processing

final_results = grouped_words.inject({}) do |result, words|
  result[words.first] = words.last.inject(0) {|r, i| r + i }
  result
end
words = final_results.sort{|a,b| b[1]<=>a[1]}

Page 136: EC2, MapReduce, and Distributed Processing

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]
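The pieces of the chunked word count can be assembled into one script that runs locally. This is a sketch under several assumptions: a short made-up word list replaces the book file, chunks hold 4 words instead of 200, `each_slice` does the chunking, and the body of `group` is a guess, since the slides call it without showing it. Everything runs in one process here; on a cluster each `distributed_count` call would go to a different machine.

```ruby
# Stand-in for the words of the book; the slides read them from a file.
input_words = %w[the cat saw the dog and the dog saw the cat]

# 1. Split the input into chunks (4 words here; the slides use 200).
word_chunks = input_words.each_slice(4).to_a

# 2. Map: count the words in each chunk independently.
def distributed_count(words)
  words.inject(Hash.new(0)) do |counts, word|
    counts[word.downcase.to_sym] += 1
    counts
  end
end
mapped_words = word_chunks.map { |words| distributed_count(words) }

# 3. Group: gather the per-chunk counts for each word.
#    Hypothetical implementation; not shown on the slides.
def group(mapped)
  mapped.each_with_object(Hash.new { |h, k| h[k] = [] }) do |counts, grouped|
    counts.each { |word, count| grouped[word] << count }
  end
end
grouped_words = group(mapped_words)

# 4. Reduce: add up the counts for each word, then sort by frequency.
final_results = grouped_words.inject({}) do |result, (word, counts)|
  result[word] = counts.inject(0) { |r, i| r + i }
  result
end
words = final_results.sort { |a, b| b[1] <=> a[1] }

p words.first          # [:the, 4]
p final_results[:dog]  # 2
```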

Page 139: EC2, MapReduce, and Distributed Processing

requirements

Page 140: EC2, MapReduce, and Distributed Processing

1. Fixed problem

Page 141: EC2, MapReduce, and Distributed Processing

2. Mappable problem

Page 142: EC2, MapReduce, and Distributed Processing

3. Distributed reduce

Page 143: EC2, MapReduce, and Distributed Processing

example uses

Page 144: EC2, MapReduce, and Distributed Processing

III. EC2

Page 147: EC2, MapReduce, and Distributed Processing

Why?

Page 156: EC2, MapReduce, and Distributed Processing

Example

Page 159: EC2, MapReduce, and Distributed Processing

1851-1922

Page 160: EC2, MapReduce, and Distributed Processing

4TB

Page 161: EC2, MapReduce, and Distributed Processing

Hadoop + EC2

Hadoop

Page 162: EC2, MapReduce, and Distributed Processing

100 instances

Page 163: EC2, MapReduce, and Distributed Processing

24 hours

Page 164: EC2, MapReduce, and Distributed Processing

$240

Page 165: EC2, MapReduce, and Distributed Processing

(€164)
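The numbers above can be checked with simple arithmetic. The hourly rate below is derived from the slide figures (100 instances, 24 hours, $240) and is not stated in the talk:

```ruby
instances  = 100
hours      = 24
total_cost = 240  # US dollars, from the slide

instance_hours = instances * hours
hourly_rate    = total_cost.to_f / instance_hours  # implied, not stated

puts instance_hours  # 2400
puts hourly_rate     # 0.1
```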

Page 166: EC2, MapReduce, and Distributed Processing

IV. Three Thoughts

Page 169: EC2, MapReduce, and Distributed Processing

[Diagram: Rails DB, Message Queue, Transcoder 1, Transcoder 2, Transcoder 3. Steps: 1. Poll Queue; 2. Get job; 3. Result]
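One way to read the diagram is as a set of workers that each poll a shared queue, take a job, and report a result. The sketch below imitates that loop inside a single process, with Ruby's thread-safe `Queue` standing in for the message queue and a second queue standing in for writes back to the Rails DB. The file names are made up, and a real deployment would use a networked queue and separate machines:

```ruby
require "thread"  # Queue; built in on current Rubies, harmless to require

jobs    = Queue.new  # stands in for the message queue
results = Queue.new  # stands in for writes back to the Rails DB

# Enqueue some hypothetical transcoding jobs.
%w[a.mov b.mov c.mov d.mov e.mov].each { |file| jobs << file }

# Three workers, like Transcoders 1-3 on the diagram.
transcoders = (1..3).map do |id|
  Thread.new do
    loop do
      begin
        job = jobs.pop(true)  # steps 1-2: poll the queue and get a job
      rescue ThreadError
        break                 # queue is empty: stop polling
      end
      results << [job, "done by transcoder #{id}"]  # step 3: result
    end
  end
end
transcoders.each(&:join)

puts results.size  # 5
```

Each job is taken by exactly one worker, so adding workers increases throughput without any coordination between them.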

Page 171: EC2, MapReduce, and Distributed Processing

Hadoop

Page 173: EC2, MapReduce, and Distributed Processing

Thanks!

Jonathan Dahl

Slides at Rail Spikes http://railspikes.com

Photo Credits

• Rofi: http://flickr.com/photos/rofi/

• Digital Slurp: http://flickr.com/photos/digitalslurp/

• Others stolen from Google Image search