MapReduce and Hadoop
Frankie Pike

2010: 1.2 zettabytes
- 1.2 trillion gigabytes
- DVDs past the moon
- 2-way = 6 newspapers every day
- ~58% growth per year

Why care?
- Google's capacity = 1 exabyte
- 24 hours of YouTube > Internet in 2000
- 4 years of video / day on YouTube
- 100 trillion words online

Common Architecture
Figure: http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg
- Single point of failure
- Space constraints
- Multi-tenancy difficulties
- Re-writing of programs or changes to network config

MapReduce

The Promise
- High reliability: any node can go down
- High scalability: easy to add nodes
- Multi-tenancy
- Cost reduction
- Cloud-friendly
- Java, C++, C#, Python, R
- Transparent parallelization

The Kryptonite
- Data set needs to be big enough
- Consistency mid-processing

Two Steps in MapReduce
- Map
- Reduce

Mapping
- Input K/V pairs -> Intermediate K/V pairs
- Input and intermediate can be different
- (Server Key, Blog Data) -> (Blog Key, Post Count)
- Sorted and partitioned for reduction
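The mapping bullets above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's API: the `servers` data, the `map_server` and `partition` names, and the toy byte-sum partitioner are all assumptions made for the example (real systems typically hash the key).

```python
from collections import defaultdict


def map_server(server_key, blogs):
    """Map one (server key, blog data) record to (blog key, post count) pairs."""
    for blog_key, posts in blogs.items():
        yield blog_key, len(posts)


def partition(pairs, num_reducers):
    """Sort intermediate pairs and send every pair with the same key to one bucket."""
    buckets = defaultdict(list)
    for key, value in sorted(pairs):
        # Toy deterministic partitioner (assumption): byte sum modulo reducer count.
        index = sum(key.encode()) % num_reducers
        buckets[index].append((key, value))
    return dict(buckets)


# Hypothetical input: two servers, each holding posts for some blogs.
servers = {
    "server-1": {"blog-a": ["p1", "p2"], "blog-b": ["p3"]},
    "server-2": {"blog-a": ["p4"]},
}

intermediate = [
    pair for server_key, blogs in servers.items() for pair in map_server(server_key, blogs)
]
buckets = partition(intermediate, num_reducers=2)
```

Note that the input key (a server) and the intermediate key (a blog) differ, and that both `blog-a` pairs land in the same bucket, which is what lets a single reducer consolidate them later.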

Number of maps depends on task and cluster
- 10 TB of data with a 128 MB block size = about 82,000 maps
- 10-100 maps per node ideal
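The map-count figure can be checked with a little arithmetic. The sketch below assumes one map task per block and binary units (1 TB = 1024^4 bytes); the function name `estimate_map_tasks` is hypothetical.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def estimate_map_tasks(input_bytes, block_bytes):
    """Estimate map tasks as one per block, rounding up for a partial last block."""
    return -(-input_bytes // block_bytes)  # integer ceiling division


maps = estimate_map_tasks(10 * TB, 128 * MB)

# Cluster sizes that would keep the job within 10-100 maps per node.
min_nodes = -(-maps // 100)
max_nodes = maps // 10
```

Under these assumptions `maps` is 81,920, which rounds to the 82,000 quoted above, and a cluster of roughly 820 to 8,192 nodes would put the job in the 10-100 maps-per-node range.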

Reducing
- Intermediate K/V -> Intermediate K/V (smaller)
- Matching keys consolidated
- (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6)
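A minimal sketch of the consolidation step, assuming values for matching keys are combined by summing them (which is what the (A, 15), (A, 3) -> (A, 18) example implies). The name `reduce_pairs` is hypothetical.

```python
from collections import defaultdict


def reduce_pairs(pairs):
    """Consolidate pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


result = reduce_pairs([("A", 15), ("B", 6), ("A", 3)])
```

Here `result` is `[("A", 18), ("B", 6)]`: three intermediate pairs go in and two come out.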

- Number of reductions >= 0
- Hopefully smaller dataset at each iteration
- Reduce as much as needed
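"Reduce as much as needed" can be illustrated by reducing in small batches and feeding the output back in until a single batch covers everything. This is a sketch under assumptions: values are combined by summing, the batch size doubles each round so the loop always terminates, and the names `reduce_pairs` and `reduce_in_rounds` are hypothetical.

```python
from collections import defaultdict


def reduce_pairs(pairs):
    """Consolidate pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_in_rounds(pairs, batch_size):
    """Reduce batch by batch, re-reducing the outputs until one batch remains."""
    rounds = 0
    while True:
        batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
        pairs = [pair for batch in batches for pair in reduce_pairs(batch)]
        rounds += 1
        if len(batches) <= 1:
            return pairs, rounds
        batch_size *= 2  # merge neighbouring batches in the next round


data = [("A", 15), ("B", 6), ("A", 3), ("B", 1), ("A", 2), ("C", 4)]
final, rounds = reduce_in_rounds(data, batch_size=2)
```

With this input the dataset goes from 6 pairs to 4 to 3 over three rounds, ending at `[("A", 20), ("B", 7), ("C", 4)]`, the same answer a single pass would give. Repeated reduction only works like this when the combining operation is associative and commutative, as addition is.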

An Example
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "...",
  "comments": [
    { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
  ]
}
Goal: count the comments for each blog
Source: http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 1: Map to final format
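As a rough sketch of Step 1, the post document can be mapped to just the fields the count needs. The output field name `comments_count` and the function name `map_post` are assumptions for illustration, not names taken from the source post.

```python
import json

# The example document from the slide, with elided fields left as "...".
post_json = """
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "...",
  "comments": [
    { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
  ]
}
"""


def map_post(post):
    """Project a post onto the final format: its blog and its comment count."""
    return {"blog_id": post["blog_id"], "comments_count": len(post["comments"])}


mapped = map_post(json.loads(post_json))
```

For this single post `mapped` is `{"blog_id": 1342, "comments_count": 1}`. Because the map output already has the same shape as the final answer, the reduce steps only need to add up `comments_count` for records that share a `blog_id`.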

Figure: http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 2: Reduce (Partition)
Figure: http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 3: Reduce (more)
Figure: http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 4: Reduce (most)
Figure: http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Single Node
Figure: http://bc.tech.coop/blog/070520.html

Dual Node
Figure: http://map-reduce.wikispaces.asu.edu/

N-Nodes

Figure: http://www.inventoland.net/img/blog/mapReduce.png

Dealing with Failure
- Workers: occasional check-in pings by masters
- Masters: data structures get periodic auto-saves and consistency checks; can restart from periodic saves
- Bandwidth: tasks attempt to pair with local storage
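The worker check-in idea can be sketched as a master that remembers when each worker last answered a ping and re-queues the tasks of any worker that has been silent too long. Everything here is hypothetical (the `Master` class, its method names, the timeout value); it is not how Hadoop or Google's implementation is written, and time is passed in explicitly to keep the example deterministic.

```python
class Master:
    """Toy master: tracks ping replies and re-queues tasks of silent workers."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}    # worker -> time of last ping reply
        self.assignments = {}  # worker -> tasks currently assigned
        self.pending = []      # tasks waiting to be rescheduled

    def assign(self, worker, task, now):
        self.assignments.setdefault(worker, []).append(task)
        self.last_seen.setdefault(worker, now)

    def record_ping_reply(self, worker, now):
        self.last_seen[worker] = now

    def check_workers(self, now):
        """Mark workers silent for longer than the timeout as failed."""
        failed = [w for w, seen in self.last_seen.items() if now - seen > self.timeout]
        for worker in failed:
            # Tasks of a failed worker go back on the queue for another node.
            self.pending.extend(self.assignments.pop(worker, []))
            del self.last_seen[worker]
        return failed


master = Master(timeout=10)
master.assign("w1", "map-0", now=0)
master.assign("w2", "map-1", now=0)
master.record_ping_reply("w1", now=8)
failed = master.check_workers(now=12)
```

In this run `w2` never answers, so `failed` is `["w2"]` and `map-1` returns to `master.pending`, while `w1` keeps `map-0`.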

Has it worked?
- Patented by Google
- Used to regenerate Google's web index

Apache Hadoop
Open-source software for reliable, scalable, distributed computing

- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Cassandra (multi-master database)
- HBase (scalable, distributed, structured database)
- Mahout (data mining and machine learning libs)
- ZooKeeper (coordination service)

Sources
- Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu
- Ayende Rahien, Map/Reduce A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation
- http://hadoop.apache.org/
- http://en.wikipedia.org/wiki/MapReduce/
