TRANSCRIPT
Introduction 1 © 2012 MapR Technologies
Introduction: MapR and Hadoop
7/6/2012
Agenda
• Hadoop Overview
• MapReduce Overview
• Hadoop Ecosystem
• How is MapR Different?
• Summary
Objectives
At the end of this module you will be able to:
• Explain why Hadoop is an important technology for effectively working with Big Data
• Describe the phases of a MapReduce job
• Identify some of the tools used with Hadoop
• List the similarities and differences between MapR and other Hadoop distributions
Hadoop Overview
Data Volume Growing 44x
• 2010: 1.2 zettabytes
• 2020 (projected): 35.2 zettabytes
Data is Growing Faster than Moore’s Law
Business Analytics Requires a New Approach
Source: IDC Digital Universe Study, sponsored by EMC, May 2010; IDC Digital Universe Study 2011
Before Hadoop
Web crawling to power search engines
• Must be able to handle gigantic data
• Must be fast!
Problem: B-tree-based databases are not fast enough, and do not scale
Solution: Sort and Merge
• Eliminate the pesky seek time!
How to Scale?
Big Data has Big Problems
• Petabytes of data
• MTBF on 1000s of nodes is < 1 day
• Something is always broken
• There are limits to scaling Big Iron
• Sequential and random access just don’t scale
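To see why “something is always broken” at this scale, here is a back-of-the-envelope sketch in Python. The 2-year per-node MTBF is an illustrative assumption, not a figure from the slides, and failures are assumed independent:

```python
# Sketch: aggregate MTBF of a cluster, assuming independent,
# identically reliable nodes. The 2-year per-node MTBF below is an
# illustrative assumption.
def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Expected time until the first failure anywhere in the cluster:
    the per-node MTBF divided by the number of nodes."""
    return node_mtbf_hours / num_nodes

per_node = 2 * 365 * 24                      # assume 2-year MTBF per node
print(cluster_mtbf_hours(per_node, 1000))    # ~17.5 hours: less than a day
```

With 1000 such nodes, a failure is expected somewhere in the cluster well under once a day, which is why the software has to treat failure as the normal case.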
Example: Update 1% of 1TB
Data consists of 10 billion records, each 100 bytes
Task: Update 1% of these records
Approach 1: Just Do It
Each update involves a read, a modify, and a write
– t = 1 seek + 2 disk rotations = 20 ms
– 1% × 10^10 records × 20 ms = 2 megaseconds ≈ 23 days (552 hours)
Total time is dominated by seek and rotation times
Approach 2: The “Hard” Way
Copy the entire database 1 GB at a time
Update records sequentially as they stream past
– t = 2 × 1 GB / 100 MB/s + 20 ms ≈ 20 s per chunk
– 10^3 chunks × 20 s = 20,000 s ≈ 5.6 hours
100x faster to move 100x more data!
Moral: Read data sequentially even if you only want 1% of it
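The arithmetic behind the two approaches can be checked with a short Python sketch, using the slides’ figures (10^10 records, 20 ms per random seek, 100 MB/s sequential throughput):

```python
# Sketch of the arithmetic from Approaches 1 and 2, using the slides'
# figures: 10 billion 100-byte records (1 TB), 20 ms per random
# seek + rotation, 100 MB/s sequential throughput.
RECORDS = 10_000_000_000
SEEK_S = 0.020                  # 1 seek + 2 disk rotations
SEQ_BPS = 100 * 10**6           # 100 MB/s sequential transfer

# Approach 1: random read-modify-write of 1% of the records
random_s = 0.01 * RECORDS * SEEK_S               # 2,000,000 s

# Approach 2: stream the whole terabyte in 1 GB chunks
# (read + write back each chunk, plus one 20 ms seek per chunk)
chunks = 1000
seq_s = chunks * (2 * 10**9 / SEQ_BPS + SEEK_S)  # ~20,020 s

print(random_s / 3600)   # ~556 hours (about 23 days)
print(seq_s / 3600)      # ~5.6 hours
```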
Introducing Hadoop!
Now imagine you have thousands of disks on hundreds of machines with near linear scaling
– Commodity hardware – thousands of nodes!
– Handles Big Data – Petabytes and more!
– Sequential file access – all spindles at once!
– Sharding – data distributed evenly across cluster
– Reliability – self-healing, self-balancing
– Redundancy – data replicated across multiple hosts and disks
– MapReduce
• Parallel computing framework
• Moves the computation to the data
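As a small illustration of the redundancy bullet above, here is a hypothetical capacity calculation in Python. The cluster shape (100 nodes, 12 disks of 2 TB each) is invented for the example; the replication factor of 3 is Hadoop’s default:

```python
# Sketch: usable capacity under Hadoop-style replication.
# The cluster dimensions are illustrative assumptions; a replication
# factor of 3 is the Hadoop default.
def usable_tb(nodes: int, disks_per_node: int, tb_per_disk: float,
              replication: int = 3) -> float:
    """Every block is stored `replication` times, so usable capacity
    is raw capacity divided by the replication factor."""
    raw = nodes * disks_per_node * tb_per_disk
    return raw / replication

# 100 nodes x 12 disks x 2 TB = 2400 TB raw -> 800 TB usable
print(usable_tb(100, 12, 2))
```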
Hadoop Architecture
• MapReduce: parallel computing
  – Moves the computation to the data
  – Minimizes network utilization
• Distributed storage layer: keeps track of data and metadata
  – Data is sharded across the cluster
• Cluster management tools
• Applications and tools
What’s Driving Hadoop Adoption?
“Simple algorithms and lots of data trump complex models”
– Halevy, Norvig, and Pereira (Google), “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems
MapReduce Overview
MapReduce
• A programming model for processing very large data sets
― A framework for processing parallel problems across huge datasets using a large number of nodes
― Brute force parallel computing paradigm
• Phases
― Map
• Input is partitioned into “splits,” one map task per split
― Shuffle and sort
• Map output sent to reducer(s) using a hash
― Reduce
Inside Map-Reduce
Input → Map → Shuffle and sort → Reduce → Output

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax"

Map output: (the, 1), (time, 1), (has, 1), (come, 1), …
After shuffle and sort: come → [3, 2, 1]; has → [1, 5, 2]; the → [1, 2, 1]; time → [10, 1, 3]; …
Reduce output: (come, 6), (has, 8), (the, 4), (time, 14), …
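The word-count flow above can be sketched in-process in Python. This toy version (two “reducers,” keys routed with `hash(key) % num_reducers`) illustrates the idea of the three phases, not Hadoop’s actual implementation:

```python
# Toy in-process word count: map emits (word, 1) pairs, shuffle groups
# values by key and routes each key to a "reducer" partition via
# hash(key) % num_reducers (as a real partitioner would), and reduce
# sums each group.
from collections import defaultdict

def map_phase(line):
    for word in line.lower().split():
        yield word.strip('.,;:"\''), 1

def shuffle(pairs, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partitions):
    # Keys are disjoint across partitions, so a plain merge is safe.
    return {k: sum(v) for part in partitions for k, v in part.items()}

lines = ['the time has come the walrus said', 'the time is now']
pairs = (p for line in lines for p in map_phase(line))
counts = reduce_phase(shuffle(pairs))
print(counts['the'])   # 3
```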
JobTracker
• Sends out tasks
• Co-locates tasks with data
• Gets data location
• Manages TaskTrackers
TaskTracker
• Performs tasks (Map, Reduce)
• Slots determine number of concurrent tasks
• Notifies the JobTracker of completed tasks
• Heartbeats to the JobTracker
• Each task is a separate Java process
Hadoop Ecosystem
Hadoop Ecosystem
• PIG: It will eat anything
– High level language, set algebra, careful semantics
– Filter, transform, co-group, generate, flatten
– PIG generates and optimizes map-reduce programs
• Hive: Busy as a bee
– High level language, more ad hoc than PIG
– SQL-ish
– Has central meta-data service
– Loves external scripts
• HBase: NoSQL for your cluster
• Mahout: distributed/scalable machine learning algorithms
How is MapR Different?
Mostly, It’s Not!
API-compatible
– Move code over without modifications
– Use the familiar Hadoop Shell
Supports popular tools and applications
– Hive, Pig, HBase, and Flume if you want it
Very Different Where It Counts
No single point of failure
Faster shuffle, faster file creation
Read/write storage layer
NFS-mountable
Management tools: MCS, REST API, CLI
Data placement, protection, backup
HA at all layers (Naming, NFS, JobTracker, MCS)
Summary
Questions