MapReduce and Hadoop
Frankie Pike
Why care?
2010: 1.2 zettabytes
1.2 trillion gigabytes
DVDs past the moon
2-way = 6 newspapers every day
~58% growth per year
Why care?
Google's capacity = 1 exabyte
24 hours of YouTube > Internet in 2000
4 years of video / day on YouTube
100 trillion words online
Common Architecture
Image: http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg
Single point of failure
Space constraints
Multi-tenancy difficulties
Re-writing of programs or changes to network config
MapReduce
The Promise
High reliability: any node can go down
High scalability: easy to add nodes
Multi-tenancy
Cost reduction
Cloud-friendly
Java, C++, C#, Python, R
Transparent parallelization
The Kryptonite
Data set needs to be big enough
Consistency mid-processing
Two Steps in MapReduce
Map
Reduce
Mapping
Input K/V pairs -> Intermediate K/V pairs
Input and intermediate can be different
(Server Key, Blog Data) -> (Blog Key, Post Count)
Sorted and partitioned for reduction
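The mapping step above can be sketched in a few lines of Python. This is only an illustration: the record layout (a dict with `blog_id` and `posts`), the names `map_blog` and `partition`, and the choice of hashing the key to pick a reducer are assumptions, not taken from the slides or from any particular framework.

```python
from collections import defaultdict

def map_blog(server_key, blog_data):
    """Map one input pair (server key, blog data) to intermediate
    pairs (blog key, post count). Input and output keys differ."""
    for blog in blog_data:
        yield blog["blog_id"], len(blog["posts"])

def partition(pairs, num_reducers):
    """Sort intermediate pairs by key, then place each key in one bucket
    so that all values for a key reach the same reducer (assumed scheme)."""
    buckets = defaultdict(list)
    for key, value in sorted(pairs):
        buckets[hash(key) % num_reducers].append((key, value))
    return dict(buckets)

# Hypothetical input: two servers, each holding some blogs.
inputs = {
    "server-1": [{"blog_id": 1342, "posts": ["a", "b"]}, {"blog_id": 7, "posts": ["c"]}],
    "server-2": [{"blog_id": 1342, "posts": ["d"]}],
}
intermediate = [pair for key, data in inputs.items() for pair in map_blog(key, data)]
buckets = partition(intermediate, num_reducers=2)
```

Both pairs for blog 1342 end up in the same bucket even though they came from different servers, which is what makes the later reduction possible.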
Number of maps depends on task and cluster
10 TB of data with a 128 MB block size = ~82,000 maps
10-100 maps per node ideal
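The map count on this slide is simple arithmetic, shown here as a short Python sketch. It assumes one map task per input block and binary units (1 TB = 1024^4 bytes); the helper name `estimate_map_tasks` is made up for illustration.

```python
def estimate_map_tasks(input_bytes, block_bytes):
    """Assume one map task per block, rounding up for a partial last block."""
    return (input_bytes + block_bytes - 1) // block_bytes

TB = 1024 ** 4
MB = 1024 ** 2

maps = estimate_map_tasks(10 * TB, 128 * MB)
print(maps)  # 81920, which the slide rounds to 82,000
```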
Reducing
Intermediate K/V -> Intermediate K/V (smaller)
Matching keys consolidated
(A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6)
Number of reductions >= 0
Hopefully smaller dataset at each iteration
Reduce as much as needed
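A minimal Python sketch of the reduction above, assuming the values are counts that are combined by addition. The function name `reduce_pairs` is hypothetical. Because its output has the same shape as its input, it can be applied again to merge partial results, which is the "reduce as much as needed" point.

```python
from collections import defaultdict

def reduce_pairs(pairs):
    """Consolidate pairs with matching keys by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

first = reduce_pairs([("A", 15), ("B", 6), ("A", 3)])

# A second pass merges the first result with a pair from elsewhere;
# this is safe because addition is associative and commutative.
second = reduce_pairs(first + [("B", 4)])
```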
An Example
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "...",
  "comments": [
    { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
  ]
}
Want: count of comments per blog
Step 1: Map to final format
Step 2: Reduce (Partition)
Step 3: Reduce (more)
Step 4: Reduce (most)
Figures: http://ayende.com/blog/4435/map-reduce-a-visual-explanation
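The four steps of this example can be followed end to end in a short Python sketch. The sample posts, the split into two partitions, and the names `map_post` and `reduce_counts` are assumptions for illustration; only the document fields (`blog_id`, `comments`) come from the slide.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical posts, trimmed to the fields the example needs.
posts = [
    {"blog_id": 1342, "post_id": 29293921, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "post_id": 29293922, "comments": [{"author": "a"}, {"author": "b"}]},
    {"blog_id": 99, "post_id": 5, "comments": []},
    {"blog_id": 99, "post_id": 6, "comments": [{"author": "c"}]},
]

def map_post(post):
    """Step 1: map each post to the final format (blog_id, comments_count)."""
    return {"blog_id": post["blog_id"], "comments_count": len(post["comments"])}

def reduce_counts(rows):
    """Group rows by blog_id and sum their counts; output has the input's shape."""
    by_blog = itemgetter("blog_id")
    return [
        {"blog_id": blog_id, "comments_count": sum(r["comments_count"] for r in group)}
        for blog_id, group in groupby(sorted(rows, key=by_blog), key=by_blog)
    ]

mapped = [map_post(p) for p in posts]

# Step 2: reduce each partition on its own (here, two arbitrary partitions).
partials = [reduce_counts(mapped[0::2]), reduce_counts(mapped[1::2])]

# Steps 3-4: reduce the partial results again until one row per blog remains.
final = reduce_counts(partials[0] + partials[1])
```

Each blog appears in both partitions, so the second reduction is what produces the correct totals.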
Single Node
Image: http://bc.tech.coop/blog/070520.html
Dual Node
Image: http://map-reduce.wikispaces.asu.edu/
N-Nodes
Image: http://www.inventoland.net/img/blog/mapReduce.png
Dealing with Failure
Workers: occasional check-in pings by masters
Masters: data structures get periodic auto-saves and consistency checks; can restart from periodic saves
Bandwidth: tasks attempt to pair with local storage
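The worker check-in idea can be sketched as a small Python class. This is a toy model under stated assumptions, not how any real scheduler is implemented: the class name `Master`, the explicit `now` timestamps, and the rule "no reply within `timeout` means failed" are all chosen for illustration.

```python
class Master:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # worker -> time of last successful ping
        self.assigned = {}    # worker -> list of task ids
        self.pending = []     # tasks waiting to be handed to a worker

    def assign(self, worker, task, now):
        """Give a task to a worker and start tracking it."""
        self.last_seen.setdefault(worker, now)
        self.assigned.setdefault(worker, []).append(task)

    def record_ping(self, worker, now):
        """Called when a worker answers a check-in ping."""
        self.last_seen[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and re-queue their tasks."""
        failed = [w for w, seen in self.last_seen.items() if now - seen > self.timeout]
        for worker in failed:
            self.pending.extend(self.assigned.pop(worker, []))
            del self.last_seen[worker]
        return failed

master = Master(timeout=10)
master.assign("w1", "t1", now=0)
master.assign("w2", "t2", now=0)
master.record_ping("w1", now=8)   # w1 answers, w2 stays silent
failed = master.check_workers(now=15)
```

After the check, `t2` sits in `master.pending` and can be handed to any healthy worker, which is the "any node can go down" promise in miniature.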
Has it worked?
Patented
Regenerated index
Apache Hadoop
Open-source software for reliable, scalable, distributed computing
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Cassandra (multi-master database)
HBase (scalable, distributed, structured database)
Mahout (data mining and machine learning libs)
ZooKeeper (coordination service)
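To give a feel for what a Hadoop MapReduce job looks like from Python, here is a word-count sketch written in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines and the reducer receives its input sorted by key. The functions operate on plain lists of lines so the sketch runs without a cluster; the names `mapper` and `reducer` and the sample text are assumptions, and `sorted()` stands in for the framework's shuffle.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

text = ["map reduce map", "Reduce hadoop"]
shuffled = sorted(mapper(text))   # local stand-in for the sort between map and reduce
result = list(reducer(shuffled))
```

In a real streaming job the two functions would live in separate scripts that read standard input and write standard output.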
Sources
Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu
Ayende Rahien, Map/Reduce - A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/MapReduce/