hadoop mapreduce fundamentals

Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 1 of 5

Upload: lynn-langit

Post on 26-Jan-2015


DESCRIPTION

Deck from my five-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce

TRANSCRIPT

Page 1: Hadoop MapReduce Fundamentals

Hadoop MapReduce Fundamentals

@LynnLangit

a five part series – Part 1 of 5

Page 2: Hadoop MapReduce Fundamentals

Course Outline

Page 3: Hadoop MapReduce Fundamentals

What is Hadoop?

Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google: GFS + MapReduce + BigTable

Current distributions, based on open-source and vendor work
  Apache Hadoop
  Cloudera – CDH4 w/ Impala
  Hortonworks
  MapR
  AWS
  Windows Azure HDInsight

Page 4: Hadoop MapReduce Fundamentals

Why Use Hadoop?

Cheaper: scales to petabytes or more
Faster: parallel data processing
Better: suited for particular types of BigData problems

Page 5: Hadoop MapReduce Fundamentals

What types of business problems for Hadoop?

Source: Cloudera “Ten Common Hadoopable Problems”

Page 6: Hadoop MapReduce Fundamentals

Companies Using Hadoop

Facebook

Yahoo

Amazon

eBay

American Airlines

The New York Times

Federal Reserve Board

IBM

Orbitz

Page 7: Hadoop MapReduce Fundamentals

Forecast growth of Hadoop Job Market

Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html

Page 8: Hadoop MapReduce Fundamentals

Hadoop is a set of Apache Frameworks and more…

Data storage (HDFS)
  Runs on commodity hardware (usually Linux)
  Horizontally scalable
Processing (MapReduce)
  Parallelized (scalable) processing
  Fault tolerant
Other Tools / Frameworks
  Data Access: HBase, Hive, Pig, Mahout
  Tools: Hue, Sqoop
  Monitoring: Greenplum, Cloudera

Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting

Page 9: Hadoop MapReduce Fundamentals

What are the core parts of a Hadoop distribution?

Page 10: Hadoop MapReduce Fundamentals

Hadoop Cluster HDFS (Physical) Storage

Page 11: Hadoop MapReduce Fundamentals

MapReduce Job – Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png

Page 12: Hadoop MapReduce Fundamentals

Hadoop Ecosystem

Page 13: Hadoop MapReduce Fundamentals
Page 14: Hadoop MapReduce Fundamentals

Common Hadoop Distributions

Open Source
  Apache
Commercial
  Cloudera
  Hortonworks
  MapR
  AWS MapReduce
  Microsoft HDInsight (Beta)

Page 15: Hadoop MapReduce Fundamentals

A View of Hadoop (from Hortonworks)

Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI

Page 16: Hadoop MapReduce Fundamentals

Setting up Hadoop Development

Page 17: Hadoop MapReduce Fundamentals

Demo – Setting up Cloudera Hadoop

Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs

Page 18: Hadoop MapReduce Fundamentals

Hadoop MapReduce Fundamentals

@LynnLangit

a five part series – Part 2 of 5

Page 19: Hadoop MapReduce Fundamentals

So, what’s the problem?

“I can just use some ‘SQL-like’ language to query Hadoop, right?”
“Yeah, SQL-on-Hadoop… that’s what I want.”
“I don’t want to learn a new query language and…”
“I want massive scale for my shiny, new BigData.”

Page 20: Hadoop MapReduce Fundamentals

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Page 21: Hadoop MapReduce Fundamentals

Demo – Using Hive QL on CDH4

Page 22: Hadoop MapReduce Fundamentals

What is Hive?

A data warehouse system for Hadoop that
  facilitates easy data summarization
  supports ad-hoc queries (still batch though…)
  was created by Facebook
A mechanism to project structure onto this data and query the data using a SQL-like language: HiveQL
  Interactive console –or– execute scripts
  Kicks off one or more MapReduce jobs in the background
An ability to use indexes, built-in functions and user-defined functions

Page 23: Hadoop MapReduce Fundamentals

Is HQL == ANSI SQL? – NO!

-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.* FROM a JOIN b ON (a.id <> b.id)

Note: Joins are quite different in MapReduce, more on that coming up…
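To see what the non-equality join on this slide returns in an ordinary SQL engine, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for an ANSI-style database here, not Hive, and the tables `a` and `b` with their sample ids are made up for illustration.

```python
import sqlite3

# In-memory database with two tiny tables, a and b (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# The non-equality join from the slide: every (a, b) pair whose ids differ.
rows = conn.execute(
    "SELECT a.id, b.id FROM a JOIN b ON (a.id <> b.id) ORDER BY a.id, b.id"
).fetchall()
print(rows)
conn.close()
```

Every row of `a` is paired with every row of `b` except where the ids match, which is why this kind of join is awkward to express as a keyed MapReduce job.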

Page 24: Hadoop MapReduce Fundamentals

Preparing for MapReduce

Page 25: Hadoop MapReduce Fundamentals

Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>

Tips
  ‘sudo’ means ‘run as administrator’ (super user)
  Some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail

Page 26: Hadoop MapReduce Fundamentals

Demo – Working with Files and HDFS

Page 27: Hadoop MapReduce Fundamentals

Thinking in MapReduce

Hint: “It’s Functional”

Page 28: Hadoop MapReduce Fundamentals

Understanding MapReduce – P1/3

Map>>
  (K1, V1): info in input split
  list (K2, V2): key / value out (intermediate values)
  One list per local node
  Can implement local Reducer (or Combiner)

Page 29: Hadoop MapReduce Fundamentals

Understanding MapReduce – P2/3

Map>>
  (K1, V1): info in input split
  list (K2, V2): key / value out (intermediate values)
  One list per local node
  Can implement local Reducer (or Combiner)
Shuffle/Sort>>

Page 30: Hadoop MapReduce Fundamentals

Understanding MapReduce – P3/3

Map>>
  (K1, V1): info in input split
  list (K2, V2): key / value out (intermediate values)
  One list per local node
  Can implement local Reducer (or Combiner)
Shuffle/Sort>>
Reduce>>
  (K2, list(V2)): Shuffle / Sort phase precedes Reduce phase and combines Map output into a list
  list (K3, V3): usually aggregates intermediate values

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
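The three phases can be tried out on one machine with plain Python. This is only a sketch of the data flow, not Hadoop code: the function names (`map_phase`, `combine`, `shuffle_sort`, `reduce_phase`) are made up, and each string in `splits` stands in for an input split holding a single record.

```python
from collections import defaultdict
from itertools import groupby

def map_phase(offset, line):
    """Map: (K1, V1) -> list(K2, V2). K1 is a line offset, V1 the line text."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combiner: a local reduce over one mapper's output, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def shuffle_sort(mapper_outputs):
    """Shuffle/Sort: merge all mapper outputs, sort by key, group values per key."""
    merged = sorted(pair for output in mapper_outputs for pair in output)
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=lambda kv: kv[0])]

def reduce_phase(key, values):
    """Reduce: (K2, list(V2)) -> list(K3, V3). Here it sums the counts."""
    return [(key, sum(values))]

# Two hypothetical input splits, each handled by its own "mapper".
splits = ["the quick brown fox", "the lazy dog the end"]
mapper_outputs = [combine(map_phase(i, line)) for i, line in enumerate(splits)]
grouped = shuffle_sort(mapper_outputs)
result = [kv for key, values in grouped for kv in reduce_phase(key, values)]
print(result)
```

Note how the combiner already collapses the two occurrences of "the" in the second split, so the reducer for that key receives the list [1, 2] rather than three separate 1s.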

Page 31: Hadoop MapReduce Fundamentals

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

MapReduce Example - WordCount

Page 32: Hadoop MapReduce Fundamentals

MapReduce Objects

Each daemon spawns a new JVM

Page 33: Hadoop MapReduce Fundamentals

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Page 34: Hadoop MapReduce Fundamentals

Demo – Running MapReduce WordCount

Page 35: Hadoop MapReduce Fundamentals

Hadoop MapReduce Fundamentals

@LynnLangit

a five part series – Part 3 of 5

Page 36: Hadoop MapReduce Fundamentals

Ways to run MapReduce Jobs

Configure JobConf options
From Development Environment (IDE)
From a GUI utility
  Cloudera – Hue
  Microsoft Azure – HDInsight console
From the command line
  hadoop jar <filename.jar> input output

Page 37: Hadoop MapReduce Fundamentals

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Page 38: Hadoop MapReduce Fundamentals

Setting up Hadoop On Windows Azure

About HDInsight

Page 39: Hadoop MapReduce Fundamentals

Demo – MapReduce in the Cloud

WordCount MapReduce using HDInsight

Page 40: Hadoop MapReduce Fundamentals

MapReduce (WordCount) with JavaScript

Note: JavaScript is part of the Azure Hadoop distribution

Page 41: Hadoop MapReduce Fundamentals

Common Data Sources for MapReduce Jobs

Page 42: Hadoop MapReduce Fundamentals

Where is your Data coming from?

On premises
  Local file system
  Local HDFS instance
Private Cloud
  Cloud storage
Public Cloud
  Input Storage buckets
  Script / Code buckets
  Output buckets

Page 43: Hadoop MapReduce Fundamentals

Common Data Jobs for MapReduce

Page 44: Hadoop MapReduce Fundamentals

Demo – Other Types of MapReduce

Tip: Review the Java MapReduce code in these samples as well.

Page 45: Hadoop MapReduce Fundamentals

Methods to write MapReduce Jobs

Typical – usually written in Java
  MapReduce 2.0 API
  MapReduce 1.0 API
Streaming
  Uses stdin and stdout
  Can use any language to write Map and Reduce functions: C#, Python, JavaScript, etc…
Pipes
  Often used with C++
Abstraction libraries
  Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
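As an illustration of the Streaming style, here is a small Python sketch of a WordCount mapper and reducer that exchange tab-separated text lines. The names `mapper` and `reducer` are hypothetical. In a real streaming job each function would read sys.stdin and print to stdout as its own script; here the shuffle is imitated with a plain `sorted()` so the whole thing runs locally.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for: input -> mapper -> sort -> reducer -> output.
sample_input = ["a b a", "b a"]
local_output = list(reducer(sorted(mapper(sample_input))))
print(local_output)
```

Because the contract is just lines of text on stdin and stdout, the same two steps could be written in C#, JavaScript or any other language that can read and write a stream.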

Page 46: Hadoop MapReduce Fundamentals

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Page 47: Hadoop MapReduce Fundamentals

Demo – MapReduce via C# & PowerShell

Page 48: Hadoop MapReduce Fundamentals

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Page 49: Hadoop MapReduce Fundamentals

Using AWS MapReduce

Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud

Page 50: Hadoop MapReduce Fundamentals

What is Pig?

ETL library for HDFS, developed at Yahoo
  Pig Runtime
  Pig Language
  Generates MapReduce jobs
ETL steps
  LOAD <file>
  FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  DUMP {to screen for testing}
  STORE <newFile>
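For readers new to Pig, the ETL steps listed here map onto a familiar load / filter / group / count pattern. The sketch below mimics that flow in plain Python, not Pig Latin. The sample log records and field layout (user, method, status) are invented for the example.

```python
import csv
import io
from collections import Counter

# LOAD: parse delimited records (an in-memory string stands in for a file in HDFS).
raw = "alice,GET,200\nbob,GET,404\nalice,POST,200\ncarol,GET,200\n"
records = list(csv.reader(io.StringIO(raw)))

# FILTER: keep only the successful requests.
ok = [record for record in records if record[2] == "200"]

# GROUP BY user, then FOREACH group GENERATE COUNT.
counts = Counter(user for user, _method, _status in ok)

# DUMP: print to screen for testing (STORE would write a new file instead).
for user, n in sorted(counts.items()):
    print(user, n)
```

In Pig each of these steps is a line of script, and the runtime turns the whole pipeline into one or more MapReduce jobs.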

Page 51: Hadoop MapReduce Fundamentals

MapReduce Python Sample

Remember that white space matters in Python!

Page 52: Hadoop MapReduce Fundamentals

Demo – Using AWS MapReduce with Pig

Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud

Page 53: Hadoop MapReduce Fundamentals

AWS Data Pipeline with HIVE

Page 54: Hadoop MapReduce Fundamentals

Hadoop MapReduce Fundamentals

@LynnLangit

a five part series – Part 4 of 5

Page 55: Hadoop MapReduce Fundamentals

Better MapReduce - Optimizations

Page 56: Hadoop MapReduce Fundamentals

Optimization BEFORE running a MapReduce Job

Page 57: Hadoop MapReduce Fundamentals

More about Input File Compression

From Cloudera… their version of LZO is ‘splittable’

Type   File      Size (GB)   Compress   Decompress
None   Log       8.0         -          -
Gzip   Log.gz    1.3         241        72
LZO    Log.lzo   2.0         55         35
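A quick bit of arithmetic on these figures shows the trade-off. The sketch below only reuses the numbers from the table. The slide does not state the time units for the Compress and Decompress columns, so the code compares them as ratios only.

```python
# Figures copied from the table above; the slide does not state the time units.
original_gb = 8.0
codecs = {
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}

for name, c in codecs.items():
    ratio = original_gb / c["size_gb"]        # how many times smaller the file gets
    saved = 1 - c["size_gb"] / original_gb    # fraction of space saved
    print(f"{name}: {ratio:.1f}x smaller, {saved:.0%} saved")

# Relative speed of LZO versus gzip on the same log file.
compress_speedup = codecs["gzip"]["compress"] / codecs["lzo"]["compress"]
decompress_speedup = codecs["gzip"]["decompress"] / codecs["lzo"]["decompress"]
print(f"LZO: {compress_speedup:.1f}x faster to compress, "
      f"{decompress_speedup:.1f}x faster to decompress")
```

Gzip gives the smaller file (roughly 6x versus 4x), while LZO is more than four times faster to compress and about twice as fast to decompress.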

Page 58: Hadoop MapReduce Fundamentals

Optimization WITHIN a MapReduce Job

Page 59: Hadoop MapReduce Fundamentals


Page 60: Hadoop MapReduce Fundamentals

Mapper Task Optimization

Page 61: Hadoop MapReduce Fundamentals

Data Types

Writable
  Text (String)
  IntWritable
  LongWritable
  FloatWritable
  BooleanWritable
WritableComparable for keys
Custom types supported – write RawComparator
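A rough illustration of why a raw comparator is worth writing: keys can be ordered straight from their serialized bytes instead of being turned back into objects first. The sketch assumes a 4-byte big-endian integer layout (the format Java's DataOutput.writeInt produces, which as far as I know is what IntWritable uses). `write_int` and `raw_compare` are hypothetical helper names, not Hadoop APIs.

```python
import struct
from functools import cmp_to_key

def write_int(value):
    """Serialize an int as 4 big-endian bytes (assumed IntWritable-style layout)."""
    return struct.pack(">i", value)

def raw_compare(b1, b2):
    """Compare two serialized keys by reading only the 4-byte int field."""
    (v1,) = struct.unpack(">i", b1[:4])
    (v2,) = struct.unpack(">i", b2[:4])
    return (v1 > v2) - (v1 < v2)

serialized = [write_int(v) for v in (7, -1, 42, 0)]

# Sorting with the raw comparator gives numeric order.
by_raw = [struct.unpack(">i", b)[0]
          for b in sorted(serialized, key=cmp_to_key(raw_compare))]

# Naive byte-wise ordering puts negative numbers last (the sign bit is set).
by_bytes = [struct.unpack(">i", b)[0] for b in sorted(serialized)]

print(by_raw)
print(by_bytes)
```

The second list shows why a comparator has to know the key's layout: comparing the bytes blindly is fast but sorts signed values incorrectly.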

Page 62: Hadoop MapReduce Fundamentals

Reducer Task Optimization

Page 63: Hadoop MapReduce Fundamentals

MapReduce Job Optimization

Page 64: Hadoop MapReduce Fundamentals

Demo – Unit Testing MapReduce

Using MRUnit + Asserts
Optionally using ApprovalTests
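MRUnit itself is a Java library, so the following is only an analogy in Python: the same idea of feeding a mapper and a reducer known input and asserting on the output, using the standard unittest module. The functions `word_count_map` and `word_count_reduce` are hypothetical stand-ins for the job's Map and Reduce logic.

```python
import unittest

def word_count_map(line):
    """Map step under test: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Reduce step under test: sum the counts for one word."""
    return (word, sum(counts))

class WordCountTest(unittest.TestCase):
    def test_map_emits_one_pair_per_word(self):
        self.assertEqual(word_count_map("cat dog cat"),
                         [("cat", 1), ("dog", 1), ("cat", 1)])

    def test_reduce_sums_counts(self):
        self.assertEqual(word_count_reduce("cat", [1, 1]), ("cat", 2))

# Run the tests in-process without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Testing Map and Reduce as plain functions like this needs no cluster, which is the main appeal of unit testing MapReduce code.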

Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png

Page 65: Hadoop MapReduce Fundamentals

A note about MapReduce 2.0

Splits the existing JobTracker’s roles
  resource management
  job lifecycle management
MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
  better scalability through distributed job lifecycle management
  support for multiple Hadoop MapReduce API versions in a single cluster

Page 66: Hadoop MapReduce Fundamentals

What is Mahout?

Library with common machine learning algorithms
Over 20 algorithms
  Recommendation (likelihood – Pandora)
  Classification (known data and new data – spam id)
  Clustering (new groups of similar data – Google news)

Can non-statisticians find value using this library?

Page 67: Hadoop MapReduce Fundamentals

Mahout Algorithms

Page 68: Hadoop MapReduce Fundamentals

Setting up Hadoop on Windows

For local development
  Install from binaries via Web Platform Installer
  Install .NET Azure SDK (for Azure BLOB storage)
  Install other tools
    Neudesic Azure Storage Viewer

Page 69: Hadoop MapReduce Fundamentals

Demo – Mahout

Using HDInsight

Page 70: Hadoop MapReduce Fundamentals

What about the output?

Page 71: Hadoop MapReduce Fundamentals

Clients (Visualizations) for HDFS

Many clients use Hive
  Often included in GUI console tools for Hadoop distributions as well
Microsoft includes clients in Office (Excel 2013)
  Direct Hive client
    Connect using ODBC
    PowerPivot – data mashups and presentation
    Data Explorer – connect, transform, mashup and filter
  Hadoop SDK on Codeplex
Other popular clients
  Qlikview
  Tableau
  Karmasphere

Page 72: Hadoop MapReduce Fundamentals

Demo – Executing Hive Queries

Page 73: Hadoop MapReduce Fundamentals

Demo – Using HDFS output in Excel 2013

To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803

Page 74: Hadoop MapReduce Fundamentals

About Visualization

Page 75: Hadoop MapReduce Fundamentals

Demo – New Visualizations – D3

Page 76: Hadoop MapReduce Fundamentals

Hadoop MapReduce Fundamentals

@LynnLangit

a five part series – Part 5 of 5

Page 77: Hadoop MapReduce Fundamentals

Limitations of MapReduce

Page 78: Hadoop MapReduce Fundamentals

Comparing: RDBMS vs. Hadoop

                      Traditional RDBMS         Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
Access                Interactive and Batch     Batch – NOT Interactive
Updates               Read / Write many times   Write once, Read many times
Structure             Static Schema             Dynamic Schema
Integrity             High (ACID)               Low
Scaling               Nonlinear                 Linear
Query Response Time   Can be near immediate     Has latency (due to batch processing)

Page 79: Hadoop MapReduce Fundamentals

Microsoft alternatives to MapReduce

Use existing relational system
  Scale via cloud or edition (i.e. Enterprise or PDW)
Use in-memory OLAP
  SQL Server Analysis Services Tabular Models
Use “productized” Dremel
  Microsoft Polybase – status = beta?

Page 80: Hadoop MapReduce Fundamentals

Looking Forward - Dremel or Apache Drill

Based on original research from Google

Page 81: Hadoop MapReduce Fundamentals

Apache Drill Architecture

Page 82: Hadoop MapReduce Fundamentals

In-market MapReduce Alternatives

Cloudera: Impala
Google: BigQuery

Page 83: Hadoop MapReduce Fundamentals

Demo – Google’s BigQuery: Dremel for the rest of us

Page 84: Hadoop MapReduce Fundamentals

Hadoop MapReduce Call to Action

Page 85: Hadoop MapReduce Fundamentals

More MapReduce Developer Resources

Based on the distribution – on premises
  Apache
    MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
  Cloudera
    Cloudera University - http://university.cloudera.com/
    Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
  Hortonworks
  MapR
Based on the distribution – cloud
  AWS MapReduce
    Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
  Windows Azure HDInsight
    Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
    More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/

Page 86: Hadoop MapReduce Fundamentals

The Changing Data Landscape