megadata with python and hadoop

21
Processing Megadata With Python and Hadoop July 2010 TriHUG Ryan Cox www.asciiarmor .com

Upload: ryancox

Post on 05-Dec-2014

1.808 views

Category:

Documents


5 download

DESCRIPTION

July 2010 Triangle Hadoop Users Group presentation

TRANSCRIPT

Page 1: Megadata With Python and Hadoop

Processing MegadataWith Python and Hadoop

July 2010 TriHUGRyan Cox

www.asciiarmor.com

Page 2: Megadata With Python and Hadoop

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF1089919999999999999999990029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF1049919999999999999999990029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF1089919999999999999999990029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF1089919999999999999999990029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF1089919999999999999999990029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF1089919999999999999999990029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF1069919999999999999999990029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF1089919999999999999999990029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF1089919999999999999999990029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF1089919999999999999999990029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF1089919999999999999999990029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF1089919999999999999999990029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF1049919999999999999999990029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF1049919999999999999999990029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF1049919999999999999999990029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF1039919999999999999999990029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF1009919999999999999999990029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF1009919999999999999999990029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF1009919999999999999999990029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF1009919999999999999999990029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF1009919999999999999999990029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF1089919999999999999999990029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF1089919999999999999999990035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW17010029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF1089919999999999999999990029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF1089919999999999999999990029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF1089919999999999999999990029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF1009919999999999999999990029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF1009919999999999999999990029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF1009919999999999999999990029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF1009919999999999999999990029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF1009919999999999999999990029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF1009919999999999999999990029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF1049919999999999999999990029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF1039919999999999999999990029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF1009919999999999999999990029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF1009919999999999999999990029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF1009919999999999999999990029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF1009919999999999999999990029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF1009919999999999999999990029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF1009919999999999999999990029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF1009919999999999999999990029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF1089919999999999999999990029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF1089919999999999999999990035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721

~130 GB NCDC CLIMATE DATASET1901

19041907

19101913

19161919

19221925

19281931

19341937

19401943

19461949

19521955

19581961

19641967

19701973

19761979

19821985

19881991

19941997

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

Page 3: Megadata With Python and Hadoop

How can we make this scale?( and do more interesting things )

high_temp = 0 for line in open('1901'): line = line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93])

if (temp != "+9999" and quality in "01459"): high_temp = max(high_temp,float(temp))

print high_temp

Page 4: Megadata With Python and Hadoop
Page 5: Megadata With Python and Hadoop

JEFFREY DEAN – GOOGLE - 2004

“Our abstraction is in-spired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”

Page 6: Megadata With Python and Hadoop
Page 7: Megadata With Python and Hadoop

def mapper(line): line = line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if (temp != "+9999" and quality in "01459"): return float(temp) return None

output = map(mapper,open('1901')) print reduce(max,output)

MAPREDUCE IN PURE PYTHON

Page 8: Megadata With Python and Hadoop

HADOOP STREAMING

for line in sys.stdin: val = line.strip() (year, temp, q) = (val[15:19], val[87:92], val[92:93]) if (temp != "+9999" and re.match("[01459]", q)): print "%s\t%s" % (year, temp)

(last_key, max_val) = (None, 0)for line in sys.stdin: (key, val) = line.strip().split("\t") if last_key and last_key != key: print "%s\t%s" % (last_key, max_val) (last_key, max_val) = (key, int(val)) else: (last_key, max_val) = (key, max(max_val, int(val)))

if last_key: print "%s\t%s" % (last_key, max_val)

map

per.p

yre

duer

.py

cat dataFile | mapper.py | sort | reducer.py

Page 9: Megadata With Python and Hadoop

DUMBO

Page 10: Megadata With Python and Hadoop

DUMBO

def mapper(key,value): line = value.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if (temp != "+9999" and quality in "01459"): yield year, int(temp)

def reducer(key,values): yield key,max(values)

if __name__ == "__main__": import dumbo dumbo.run(mapper,reducer,reducer)

Page 11: Megadata With Python and Hadoop

DUMBO

• Ability to pass around Python objects• Job / Iteration Abstraction• Counter / Status Abstraction• Simplified Joining mechanism• Ability to use non-Java combiners• Built-in library of mappers / reducers• Excellent way to model MR algorithms

Page 12: Megadata With Python and Hadoop

ELASTIC MAP REDUCE

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

CLI – API – Web Console

Page 13: Megadata With Python and Hadoop

ELASTIC MAP REDUCE

Page 14: Megadata With Python and Hadoop

ELASTIC MAP REDUCE

Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’

Page 15: Megadata With Python and Hadoop

ELASTIC MAP REDUCECloudWatch metrics

Page 16: Megadata With Python and Hadoop

ELASTIC MAP REDUCECloudWatch metrics

Page 17: Megadata With Python and Hadoop

QUIZ: HOW WOULD YOU DO THIS?

Page 18: Megadata With Python and Hadoop

MAP REDUCE ALGORITHMS ARE DIFFERENT

BFS(G, s) // G is the graph and s is the starting node for each vertex u V [G] - {s}∈ do color[u] ← WHITE // color of vertex u d[u] ← ∞ // distance from source s to vertex u π[u] ← NIL // predecessor of u color[s] ← GRAY d[s] ← 0 π[s] ← NIL Q ← Ø // Q is a FIFO - queue ENQUEUE(Q, s) while Q ≠ Ø // iterates as long as there are gray vertices. do u ← DEQUEUE(Q) for each v Adj[u]∈ do if color[v] = WHITE // discover the undiscovered adjacent vertices then color[v] ← GRAY // enqueued whenever painted gray d[v] ← d[u] + 1 π[v] ← u ENQUEUE(Q, v) color[u] ← BLACK // painted black whenever dequeued

Page 19: Megadata With Python and Hadoop

MAP REDUCE ELSEWHERE

> m = function() { emit(this.user_id, 1); } > r = function(k,vals) { return 1; } > res = db.events.mapReduce(m, r, { query : {type:'sale'} }); > db[res.result].find().limit(2) { "_id" : 8321073716060 , "value" : 1 } { "_id" : 7921232311289 , "value" : 1 }

> {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},                             {<<"groceries">>, <<"yours">>}],                            [{'map', {'qfun', Count}, 'none', false},                             {'reduce', {'qfun', Merge}, 'none', true}]).

Mon

goD

BRi

ak

Page 20: Megadata With Python and Hadoop

LEARN MORE

Definitive Guide Hadoophttp://www.hadoopbook.com

Dumbohttp://dumbotics.com/http://github.com/klbostee/dumbo/

Elastic Map Reduce http://aws.amazon.com/

Botohttp://github.com/boto

Getting Started Slideshttp://www.slideshare.net/pacoid/getting-started-on-hadoop

Page 21: Megadata With Python and Hadoop

DEMO