Hadoop in the Wild CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Upload: sheena-richard

Post on 17-Dec-2015

Hadoop in the Wild

CMSC 491
Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Agenda

• Check out some use cases
• Discuss some architectures

USE CASES

Common Use Cases

• Log Processing
• Image Identification
• Extract Transform Load
• Recommendation Engines
• Time-Series Storage and Processing
• Building Search Indexes
• Long-Term Archive
• Audit Logging

Non-Use Cases

• Data processing handled by one large server
• ACID Transactions

A Bank

• Problem
  – Need to analyze customer activity across multiple products to predict credit risk
  – Acquired a number of banks
• Solution
  – Set up a single Hadoop cluster with data from multiple EDWs
  – Bank added new sources of customer service data to get a clear picture of a customer’s financial situation

A Mobile Carrier

• Problem
  – Why are our customers terminating their service contracts?
• Solution
  – Combined transactional and event data with social network data
  – Combined coverage maps with account data

An Online Dating Service

• Problem
  – Surveys, demographic, and web activity to build a picture
  – Customers wanted better recommendations
  – Algorithms improved and number of users grew
• Solution
  – Moved data and analysis to Hadoop
  – Able to size system to meet needs of customers

Ad Targeting

• Problem
  – Advertising is a special kind of recommendation
  – Need to select best ad for a particular visitor, but each advertiser is paying to have its ad seen
• Solution
  – Collect stream of user activity with continuous analysis
  – Build sophisticated models of user behavior

POS Transaction Analysis

• Problem
  – Retailers are able to collect much more data in stores and online
  – EDWs do not generally support the sophisticated analysis needed for better forecasting
• Solution
  – Loaded 20 years of sales transactions and used Hive to do the same analysis as before
  – Now able to use new algorithms with new data sets

Sensor Data

• Problem
  – Volume of sensor data from every generator across multiple grids is enormous
  – Clear picture depends on real-time and forensic analysis
• Solution
  – Capture and store all streaming sensor data
  – Built continuous analysis system to watch performance of generators

Threat Analysis

• Problem
  – How do we detect threats and fraudulent activity in an online world?
• Solution
  – Use of HBase to store virus signatures
  – Use of MapReduce to compare spam or malware
• Lambda Architecture

Trade Surveillance

• Problem
  – Difficult to monitor trades for compliance, and impossible to catch rogue traders
• Solution
  – Store trade data and trading party data
  – Continuously monitor activity and build connections
  – Provides cheap storage for law-required auditing

Search

• Problem
  – Indexing stuff is pretty easy, until we went and had to index the Internet
  – User preferences make it harder
• Solution
  – MapReduce was designed for indexing
  – Online retailers depend on search for users finding and buying products
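To make "MapReduce was designed for indexing" concrete, here is a minimal single-process sketch of building an inverted index in the map/shuffle/reduce style. It is plain Python, not Hadoop API code; the function names (`map_document`, `shuffle`, `reduce_postings`, `build_index`) and the sample documents are hypothetical and only illustrate the shape of the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_document(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map phase: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle phase: group all emitted doc ids by term (the framework does this on a cluster)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_postings(term: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce phase: turn the grouped doc ids into a sorted, de-duplicated posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Run map, shuffle, and reduce in sequence over an in-memory corpus."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_document(doc_id, text))
    return dict(reduce_postings(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical two-document corpus.
docs = {"d1": "hadoop stores data", "d2": "hadoop processes data fast"}
index = build_index(docs)
```

On a real cluster the map and reduce functions run in parallel across many machines and the shuffle moves data between them; the sketch only shows why the indexing problem splits so naturally into those phases.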

Data Sandbox

• Problem
  – ???
• Solution
  – Simple storage mechanism with diverse tools for data analysis and exploration

ARCHITECTURES

Building your Data Lake

Lambda Architecture

[Figure: Lambda Architecture diagram. Labels: New Data Stream; BATCH LAYER (Hadoop): All Data, Precompute Views, Batch recompute; SPEED LAYER (Storm): Process Stream, Increment Views, Real-Time Increment; SERVING LAYER: Batch views, Real-time views, QFD 1, QFD 2, QFD N; Query. Other labels: (Apache HBase), (HDFS/SQL).]
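The three layers can be sketched in a few lines, assuming a simple event-counting workload. This is a toy in-memory illustration, not Hadoop or Storm code; the class `LambdaCounter` and its method names are hypothetical. The batch view is recomputed from all data, the real-time view is incremented per event, and a query merges the two.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy Lambda Architecture for counting events per key (in-memory assumption)."""

    def __init__(self) -> None:
        self.master: List[str] = []     # batch layer: append-only record of all data
        self.batch_view = Counter()     # serving layer: view precomputed from all data
        self.realtime_view = Counter()  # speed layer: increments not yet in the batch view

    def ingest(self, key: str) -> None:
        """New data goes to both the master data set and the speed layer."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def recompute_batch(self) -> None:
        """Batch recompute: rebuild the batch view from scratch, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key: str) -> int:
        """A query merges the batch view with the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


counter = LambdaCounter()
for event in ["a", "a", "b"]:
    counter.ingest(event)
before_batch = counter.query("a")   # served entirely from the real-time view
counter.recompute_batch()
counter.ingest("a")                 # arrives after the batch run
after_batch = counter.query("a")    # batch view plus one real-time increment
```

Recomputing from the full master data set is what lets the batch layer correct any errors that accumulate in the incremental real-time view; the speed layer only has to cover the gap since the last batch run.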

Facebook

• EDW (Oracle) was unable to scale and perform
• Investigated small Hadoop system
• Engineers loved it
• Began developing Hive

Facebook

• Time-series summaries
• Ad hoc jobs over historical data
• Long-term archival store for logs
• Look up log events by specific attributes
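As a rough illustration of what a time-series summary over logs looks like, the sketch below rolls raw events up into hourly counts per event type. It is plain Python on made-up data and does not describe Facebook's actual jobs; the function name `hourly_counts` and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def hourly_counts(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count events per (hour, event_type); timestamps are ISO 8601 strings."""
    counts: Counter = Counter()
    for timestamp, event_type in events:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        counts[(hour.isoformat(), event_type)] += 1
    return counts


# Hypothetical log events: (timestamp, event type).
events = [
    ("2015-03-01T10:05:00", "login"),
    ("2015-03-01T10:40:12", "login"),
    ("2015-03-01T10:59:59", "post"),
    ("2015-03-01T11:01:00", "login"),
]
summary = hourly_counts(events)
```

In a Hadoop setting the same grouping would typically be expressed as a Hive query or a MapReduce job keyed on the hour bucket, with the summary written back out for dashboards.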

Facebook Architecture

Facebook Messaging

• Needed a short set of temporal data
• A growing set of data that is rarely accessed
• HBase fit their needs more than other open-source technologies

Twitter Architecture

LinkedIn Architecture

LinkedIn Applications

LinkedIn Future

• MapReduce is not suited for large graph processing

• Batch-oriented nature is not suited for “breaking news”

References

• Hadoop: The Definitive Guide, Chapter 16.2
• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853
• http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
• http://www.forbes.com/sites/edddumbill/2014/01/14/the-data-lake-dream/
• http://www.slideshare.net/brocknoland/common-and-unique-use-cases-for-apache-hadoop
• http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf