10c introduction


TRANSCRIPT

Page 1: Introduction: MapR and Hadoop

7/6/2012

Page 2: Agenda

• Hadoop Overview

• MapReduce Overview

• Hadoop Ecosystem

• How is MapR Different?

• Summary

Page 3: Objectives

At the end of this module you will be able to:

• Explain why Hadoop is an important technology for effectively working with Big Data

• Describe the phases of a MapReduce job

• Identify some of the tools used with Hadoop

• List the similarities and differences between MapR and other Hadoop distributions

Page 4: Hadoop Overview

Page 5: Data Volume Growing 44x

2010: 1.2 zettabytes → 2020: 35.2 zettabytes

Data is Growing Faster than Moore’s Law

Business Analytics Requires a New Approach

Source: IDC Digital Universe Study, sponsored by EMC, May 2010; IDC Digital Universe Study 2011

Page 6: Before Hadoop

Web crawling to power search engines

• Must be able to handle gigantic data

• Must be fast!

Problem: databases (B-tree indexes) are not fast enough and do not scale

Solution: Sort and Merge

• Eliminate the pesky seek time!

Page 7: How to Scale?

Big Data has Big Problems

• Petabytes of data

• MTBF on 1000s of nodes is < 1 day

• Something is always broken

• There are limits to scaling Big Iron

• Sequential and random access just don’t scale

Page 8: Example: Update 1% of 1TB

Data consists of 10 billion records, each 100 bytes

Task: Update 1% of these records

Page 9: Approach 1: Just Do It

Each update involves read, modify and write

– t = 1 seek + 2 disk rotations = 20ms

– 1% × 10^10 records × 20 ms = 2 × 10^6 s ≈ 23 days (552 hours)

Total time dominated by seek and rotation times
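
To make the arithmetic concrete, here is a rough back-of-the-envelope check in plain Python (not from the deck); the 10 billion × 100-byte records and the 20 ms per random update are the slide's assumed figures:

```python
# Approach 1: seek to each record and rewrite it in place.
records = 10_000_000_000        # 10 billion records x 100 bytes = 1 TB
updates = records // 100        # update 1% of the records
seconds_per_update = 0.020      # 1 seek + 2 rotations ~ 20 ms (slide's figure)

total_s = updates * seconds_per_update
print(f"{total_s:,.0f} s = {total_s / 3600:,.0f} hours = {total_s / 86400:.0f} days")
# -> 2,000,000 s, about 556 hours or 23 days (the slide rounds to 552 hours)
```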

Page 10: Approach 2: The “Hard” Way

Copy the entire database 1GB at a time

Update records sequentially

– t = 2 × 1 GB / 100 MB/s + 20 ms ≈ 20 s

– 10^3 × 20 s = 20,000 s ≈ 5.6 hours

100x faster to move 100x more data!

Moral: Read data sequentially even if you only want 1% of it
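
The same kind of quick check for the sequential pass, again as illustrative Python rather than anything from the deck; the 100 MB/s streaming rate and one 20 ms seek per 1 GB chunk are the slide's figures:

```python
# Approach 2: stream the whole 1 TB through, 1 GB at a time,
# rewriting every record sequentially.
chunks = 1_000                        # 1 TB split into 1 GB chunks
chunk_bytes = 1_000_000_000
throughput_bytes_per_s = 100_000_000  # 100 MB/s sequential read or write
seek_s = 0.020                        # one seek per chunk

per_chunk_s = 2 * chunk_bytes / throughput_bytes_per_s + seek_s  # read + write + seek
total_s = chunks * per_chunk_s
print(f"{total_s:,.0f} s = {total_s / 3600:.1f} hours")
# -> ~20,020 s, about 5.6 hours, roughly 100x faster than Approach 1
```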

Page 11: Introducing Hadoop!

Now imagine you have thousands of disks on hundreds of machines with near linear scaling

– Commodity hardware – thousands of nodes!

– Handles Big Data – Petabytes and more!

– Sequential file access – all spindles at once!

– Sharding – data distributed evenly across cluster

– Reliability – self-healing, self-balancing

– Redundancy – data replicated across multiple hosts and disks

– MapReduce

• Parallel computing framework

• Moves the computation to the data

Page 12: Hadoop Architecture

• MapReduce: Parallel computing

– Move the computation to the data

– Minimizes network utilization

• Distributed storage layer: Keeping track of data and metadata

– Data is sharded across the cluster

• Cluster management tools

• Applications and tools

Page 13: What’s Driving Hadoop Adoption?

“Simple algorithms and lots of data trump complex models”

– Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Page 14: MapReduce Overview

Page 15: MapReduce

• A programming model for processing very large data sets

― A framework for processing parallel problems across huge datasets using a large number of nodes

― Brute force parallel computing paradigm

• Phases

― Map

• Job partitioned into “splits”

― Shuffle and sort

• Map output sent to reducer(s) using a hash

― Reduce

Page 16: Inside Map-Reduce

Input → Map → Shuffle and sort → Reduce → Output

Input:

"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-wax

Map output: (the, 1), (time, 1), (has, 1), (come, 1), …

Grouped for the reducers: come, [3, 2, 1]; has, [1, 5, 2]; the, [1, 2, 1]; time, [10, 1, 3]; …

Reduce output: come, 6; has, 8; the, 4; time, 14; …
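
To make the three phases concrete, here is a minimal in-memory sketch in plain Python (not Hadoop code, and not from the deck); an ordinary dictionary stands in for the framework's shuffle-and-sort, which in Hadoop hashes each key to a reducer:

```python
import re
from collections import defaultdict

# One input split: the line of the Walrus quote shown on the slide.
text = ('"The time has come," the Walrus said, '
        '"To talk of many things: Of shoes--and ships--and sealing-wax"')

# Map: tokenize the split and emit a (word, 1) pair for every word.
pairs = [(word, 1) for word in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())]

# Shuffle and sort: group all values by key. Hadoop does this by hashing
# each key to a reducer and sorting the keys within each reducer's input.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: sum the grouped counts for each word.
counts = {key: sum(values) for key, values in sorted(groups.items())}
print(counts["the"], counts["and"])   # 2 2 for this single split
# The slide's larger totals (the, 4; come, 6; ...) come from mapping many splits.
```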

Page 17: JobTracker

• Sends out tasks

• Co-locates tasks with data

• Gets data location

• Manages TaskTrackers

Page 18: TaskTracker

• Performs tasks (Map, Reduce)

• Slots determine number of concurrent tasks

• Notifies the JobTracker of completed tasks

• Heartbeats to the JobTracker

• Each task is a separate Java process

Page 19: Hadoop Ecosystem

Page 20: Hadoop Ecosystem

• PIG: It will eat anything

– High level language, set algebra, careful semantics

– Filter, transform, co-group, generate, flatten

– PIG generates and optimizes map-reduce programs

• Hive: Busy as a bee

– High level language, more ad hoc than PIG

– SQL-ish

– Has central meta-data service

– Loves external scripts

• HBase: NoSQL for your cluster

• Mahout: distributed/scalable machine learning algorithms

Page 21: How is MapR Different?

Page 22: Mostly, It’s Not!

API-compatible

– Move code over without modifications

– Use the familiar Hadoop Shell

Supports popular tools and applications

– Hive, Pig, HBase, and Flume if you want it

Page 23: Very Different Where It Counts

No single point of failure

Faster shuffle, faster file creation

Read/write storage layer

NFS-mountable

Management tools: MCS, REST API, CLI

Data placement, protection, backup

HA at all layers (Naming, NFS, JobTracker, MCS)

Page 24: Summary

Page 25: Questions