Intro to Big Data and Hadoop - UBC CS Lecture Series - G. Fawkes

© 2013 Geoff Fawkes. All Rights Reserved. 1 / 450

Introduction to Analytics and Big Data - Hadoop

The University of British Columbia
Computer Science Alumni/Industry Lecture Series

Geoff Fawkes
November, 2013

Who am I?

Director Engineering, Teradata

HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc.: various engineering roles

Technology executive, mentor, software engineer

B.Sc. Comp Sci (UBC), MBA Executive (SFU)

Interruptive (disruptive?) personality

Please ask questions of me / each other as we go along

I don’t have all the answers – you do!

Credits: Rob Peglar, SNIA Education, Storage Networking Industry Association, 2012

Who’s paying attention - 450 slides page count? Not that “big” - about 50

Big Data and Hadoop

History

Data Challenges

Why Hadoop?

Customer Challenges: The Data Deluge

Big Data is Different than Business Intelligence

Questions From Business Will Vary

Web 2.0 is “Data Driven”

The World of Data-Driven Applications

Attributes of Big Data

Top Ten Common Big Data Problems

Industries Are Embracing Big Data

Why Hadoop?

Storage and Memory Bandwidth Lagging CPU

Commodity Hardware Economics

What is Hadoop?

Hadoop Adoption

HDFS

MapReduce

Examples

Ecosystem Projects

Hadoop Adoption in the Industry

What is Hadoop?

HDFS 101 – The Data Set System

HDFS Organization and Replication
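
The slide's diagram is not part of this transcript. As a rough sketch of the idea in the title, the code below cuts a file into fixed-size blocks and assigns each block to three DataNodes. The function names, the node names, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy rather than round-robin.

```python
def split_into_blocks(file_size, block_size):
    """Return (block_id, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id, _length in blocks:
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # Hypothetical 200-byte file with 64-byte blocks on four DataNodes.
    blocks = split_into_blocks(200, 64)
    for block_id, holders in place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]).items():
        print(block_id, holders)
```

Losing any single node still leaves two copies of every block, which is the point of keeping three replicas.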

Hadoop Server Roles - Multiple

Hadoop Cluster

HDFS File Write Operation - Instance

HDFS File Read Operation - Instance

HDFS File Operation R/W Replication

MapReduce 101 – Functional Programming Meets Distributed Processing

What is MapReduce?

Key MapReduce Terminology

MapReduce Basic Concepts

Example 1: MapReduce Operation

Example 2: Sample Dataset

MapReduce Paradigm – UNIX Cmd
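
The slide body is missing from this transcript. The usual form of the analogy is a pipeline along the lines of `cat input | map | sort | reduce > output`, and the sketch below mirrors those stages in Python. The stage names and the "year,temperature" records are hypothetical and not taken from the deck.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(lines):
    """Like a filter command: turn each "year,temp" line into (year, temp)."""
    for line in lines:
        year, temp = line.strip().split(",")
        yield year, int(temp)


def sort_stage(pairs):
    """Like `sort`: bring records with the same key next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_stage(sorted_pairs):
    """Like an aggregating command: emit one record per key (here, the max)."""
    for year, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield year, max(temp for _, temp in group)


def pipeline(lines):
    """Chain the stages the way a shell pipe would."""
    return list(reduce_stage(sort_stage(map_stage(lines))))


if __name__ == "__main__":
    # Hypothetical input records.
    print(pipeline(["1950,0", "1949,111", "1950,22", "1949,78"]))
```

The sort in the middle plays the role of the shuffle: the reduce stage only works because all records for a key arrive together.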

Example 3: Count Words

Ex. 3: Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job
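
The code shown on the slide did not survive in this transcript. A minimal sketch of the word count example follows, with a map function, a reduce function, and a small driver that stands in for the framework. It runs in one Python process and is not Hadoop code; `map_fn`, `reduce_fn`, and `run_job` are names chosen for illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in map_fn(offset, line):
            groups[word].append(one)  # shuffle: collect values per key
    return dict(reduce_fn(word, counts) for word, counts in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog", "the fox"]))
```

In a real job the programmer supplies only the map and reduce functions; splitting the input, grouping by key, and scheduling tasks are handled by the framework.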

Ex. 3: Lifecycle of a MapReduce Job

Diagram labels: Input Splits, Map Wave 1, Reduce Wave 1, Map Wave 2, Reduce Wave 2, Time

How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
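
One way to reason about the splits and waves, as a simplified sketch: one map task runs per input split, the split size is roughly max(min size, min(max size, block size)), and the tasks run in waves when there are more tasks than free slots. The function names and the sample numbers below are assumptions, and the sketch ignores details such as the slack Hadoop allows on the last split.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Simplified split-size rule: clamp the block size between min and max."""
    return max(min_size, min(max_size, block_size))


def plan_job(file_size, block_size, map_slots):
    """Estimate the number of map tasks and how many waves they need."""
    size = split_size(block_size)
    map_tasks = ceil_div(file_size, size)     # one map task per split
    map_waves = ceil_div(map_tasks, map_slots)
    return {"split_size": size, "map_tasks": map_tasks, "map_waves": map_waves}


if __name__ == "__main__":
    MB = 1024 * 1024
    # Hypothetical job: 1 GiB input, 64 MiB blocks, 8 map slots in the cluster.
    print(plan_job(1024 * MB, 64 * MB, 8))
```

With these sample numbers the input yields 16 map tasks, which need two waves on 8 slots. The number of reduce tasks does not follow from the input size; it comes from the job configuration.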

MapReduce Job Configuration Parameters

190+ parameters in Hadoop

Set manually, or defaults are used
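
The "set manually or defaults are used" behaviour can be pictured as a lookup that checks job-level settings first and falls back to defaults. The sketch below is plain Python, not the Hadoop configuration API. The property names follow Hadoop 1.x naming and the default values are the ones believed to ship with Hadoop 1.x; treat both as assumptions and check your distribution.

```python
from collections import ChainMap

# Assumed Hadoop 1.x-style defaults (verify against your own *-default.xml files).
DEFAULTS = {
    "mapred.reduce.tasks": "1",
    "io.sort.mb": "100",
    "io.sort.factor": "10",
    "dfs.replication": "3",
}


def effective_config(overrides):
    """Return a view where manual settings shadow the defaults."""
    return ChainMap(overrides, DEFAULTS)


if __name__ == "__main__":
    # Hypothetical job that overrides only the number of reduce tasks.
    config = effective_config({"mapred.reduce.tasks": "8"})
    for name in sorted(config):
        print(name, "=", config[name])
```

Any parameter the job leaves alone silently keeps its default, which is why tuning a job with this many parameters is hard.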

Putting it all Together: MapReduce + HDFS

Hadoop Ecosystem Projects

- Interactive SQL Query & Modeling

- Data flow for tedious MapReduce Jobs

- Columnar NoSQL Store

Compare: Hadoop, SQL, Massively Parallel Processing (MPP)

Compare: RDBMS and MapReduce

Hadoop Use Cases

Set Top Cable TV Boxes

Pay Per View Advertising

Bank Risk Modelling

Product Sentiment Analysis

Example 1: Set Top Cable TV Boxes

Example 2: Pay Per View Advertising

Example 3: Bank Risk Modelling

Example 4: Product Sentiment Analysis

More Reading?

World Economic Forum: “Personal Data: The Emergence of a New Asset Class” 2011

McKinsey Global Institute: Big Data: The next frontier for innovation, competition, and productivity

Big Data: Harnessing a game-changing asset

IDC: 2011 Digital Universe Study: Extracting Value from Chaos

The Economist: Data, Data Everywhere

Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field

O’Reilly – What is Data Science?

O’Reilly – Building Data Science Teams

O’Reilly – Data for the public good

Obama Administration “Big Data Research and Development Initiative.”

Introduction to Analytics and Big Data – Hadoop

Q&A

Geoff Fawkes
http://www.linkedin.com/pub/geoff-fawkes/1/269/202
@gfawkes

November, 2013