Hadoop MapReduce
SEMINAR ON
Android App Development
Trained by: Hewlett-Packard Education Services, Mumbai
Presented to-Mr. R.K. Banyal By-Mr. Hukum Chand Saini Urvashi Kataria
About HPES:
• American global IT company headquartered in Palo Alto, California, US.
• Provider of products, software, technologies, solutions and services to individuals as well as small and medium-sized businesses.
• Major operations include HP Software, HP Financial Services and Corporate Investments.
• Provides practical training in fields like Big Data, Android App Development, Embedded Systems, etc.
About Birthday Bash:
An Android application that lets you enjoy your own birthday as well as those of your dear ones.
Save the days, get reminded of them, capture moments on the day itself, get greeted by the app, and celebrate!!
The home screen:
Calculating age and further:
Saving name for specified date:
Happy Birthday!
Presentation on:
Hadoop MapReduce
(Map + Reduce)
Why MapReduce?
• Large-scale data processing was difficult!
Managing hundreds or thousands of processors
Managing parallelization and distribution
Reliable execution with easy data access
MapReduce provides all of these, easily!
What is Hadoop MapReduce?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Objects
How Map and Reduce Work Together
Hadoop MapReduce: A Closer Look
[Figure: MapReduce data flow on two nodes (Node 1, Node 2). On each node: files loaded from local HDFS store -> InputFormat -> Splits -> RecordReaders (RR) -> input (K, V) pairs -> Map -> intermediate (K, V) pairs -> Partitioner -> Sort -> Reduce -> final (K, V) pairs -> OutputFormat -> writeback to local HDFS store. Shuffling process: intermediate (K, V) pairs are exchanged by all nodes.]
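The data flow above can be imitated in a few lines of plain Python. This is only a single-process sketch, not Hadoop code: the function names (`partition`, `map_records`, `shuffle`, `reduce_sorted`), the sample splits, and the choice of one reducer per node are assumptions made for illustration. The partitioner here uses a CRC32 checksum modulo the reducer count, which mirrors the general idea of hash partitioning rather than Hadoop's actual implementation.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumption: one reducer per node, as in the two-node figure


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reducer for a key (deterministic hash modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_records(records):
    """Map stage: turn input (K, V) pairs into intermediate (K, V) pairs."""
    for _offset, line in records:
        for word in line.split():
            yield word, 1


def shuffle(per_node_intermediate):
    """Shuffle: each node sends every pair to the reducer chosen by partition()."""
    reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for pairs in per_node_intermediate:
        for key, value in pairs:
            reducer_inputs[partition(key)][key].append(value)
    return reducer_inputs


def reduce_sorted(grouped):
    """Sort the keys, then reduce each key's values to one final (K, V) pair."""
    return [(key, sum(grouped[key])) for key in sorted(grouped)]


# Each node reads its own split as (offset, line) records, like a RecordReader.
node_splits = [
    [(0, "big data big cluster")],  # Node 1 (hypothetical input)
    [(0, "data node data")],        # Node 2 (hypothetical input)
]
intermediate = [list(map_records(split)) for split in node_splits]
final_output = [reduce_sorted(grouped) for grouped in shuffle(intermediate)]
merged = dict(pair for part in final_output for pair in part)
print(merged)
```

Because the partitioner is deterministic, every occurrence of a word ends up at the same reducer regardless of which node emitted it, which is what lets each reducer produce a complete count for its keys.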
Algorithm
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

The very famous: Word Count Example
map(key=url, val=contents):
    for each word w in contents:
        emit(w, "1")

reduce(key=word, values=uniq_counts):
    // Sum all "1"s in values list
    emit result "(word, sum)"
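The word count pseudocode can be tried out locally with a minimal Python sketch. It runs in a single process and only imitates the map, group-by-key, and reduce steps; `map_fn`, `reduce_fn`, `word_count`, and the sample documents are hypothetical names and data, not part of any Hadoop API.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document text."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group values by key (standing in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_name, text in documents.items():
        for word, count in map_fn(doc_name, text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


docs = {  # hypothetical input: document name -> text of document
    "doc1.txt": "the quick brown fox",
    "doc2.txt": "the lazy dog and the fox",
}
counts = word_count(docs)
print(counts["the"], counts["fox"])  # prints: 3 2
```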
Ways to MapReduce
Libraries Languages
Note: Java is most common, but other languages can be used
Common Data Sources for MapReduce Jobs
Service Providers
• Open Source
  o Apache
• Commercial
  o Cloudera
  o Hortonworks
  o MapR
  o AWS MapReduce
  o Microsoft HDInsight (Beta)
Advancements: MRV1 & MRV2
MRV2 (MAPREDUCE VERSION 2)
• Splits the existing JobTracker's roles
  o Resource management
  o Job lifecycle management
• MapReduce 2.0 provides many benefits over the existing MapReduce framework:
  o Better scalability through distributed job lifecycle management
  o Support for multiple Hadoop MapReduce API versions in a single cluster
Better MapReduce - Optimizations
Advantages of MapReduce
• Distributed data and computation.
• Tasks are independent. Entire nodes can fail and restart.
• Linear scaling in the ideal case. It is designed to run on cheap, commodity hardware.
• Simple programming model. The end-user programmer only writes the map and reduce tasks.
Disadvantages / Cases where MR isn't a suitable choice:
• Real-time processing
• It is not always easy to implement everything as a MapReduce program
• When your intermediate processes need to talk to each other
• When your processing requires a lot of data to be shuffled over the network
• When you need to handle streaming data. MR is best suited to batch processing huge amounts of data which you already have
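The shuffle cost in the list above can be made concrete with a small sketch. It counts how many intermediate pairs would have to cross the network for a word count, with and without summing counts locally on each node first (the idea behind what Hadoop calls a combiner). This is plain Python, not Hadoop API; the function names and the sample lines are hypothetical.

```python
from collections import Counter


def map_words(lines):
    """Map: emit one (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Local pre-aggregation on one node: sum counts per word before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


# Hypothetical lines held by two nodes.
node_lines = [
    ["to be or not to be", "to do or not to do"],
    ["be be be", "do do"],
]

raw = [map_words(lines) for lines in node_lines]
combined = [combine(pairs) for pairs in raw]

shuffled_without = sum(len(pairs) for pairs in raw)
shuffled_with = sum(len(pairs) for pairs in combined)
print(shuffled_without, shuffled_with)  # prints: 17 7
```

Local pre-aggregation only helps when the reduce operation can be applied to partial results, as with sums. Jobs whose intermediate data cannot be shrunk this way still pay the full shuffle cost.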
Limitations of MapReduce
RDBMS vs. Hadoop
                      Traditional RDBMS          Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                Interactive and Batch      Batch, NOT Interactive
Updates               Read / Write many times    Write once, Read many times
Structure             Static Schema              Dynamic Schema
Integrity             High (ACID)                Low
Scaling               Nonlinear                  Linear
Query Response Time   Can be near immediate      Has latency (due to batch processing)
References
• J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
• S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google File System." OSDI 200?
• "Map/Reduce Tutorial". http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. Fetched January 21, 2010.
• Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
• http://developer.yahoo.com/hadoop/tutorial/module4.html
• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.
Thank You!!