how salesforce.com uses hadoop webinar

Upload: salesforce-developers

Post on 06-May-2015

Category:

Technology


DESCRIPTION

Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, you will learn about an internal use case and a product use case:

– Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).

– Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.

TRANSCRIPT

Page 1: How salesforce.com Uses Hadoop Webinar

Follow us @forcedotcom

How Salesforce.com uses Hadoop

Narayan Bharadwaj, Data Science, @nadubharadwaj

Jed Crosby, Data Science, @JedCrosby

#forcewebinar

Page 2: How salesforce.com Uses Hadoop Webinar

Safe Harbor

Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.

Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Page 3: How salesforce.com Uses Hadoop Webinar

Agenda

Hadoop use cases

Use case 1 - Product Metrics*

Technology

Use case 2 - Collaborative Filtering*

Q&A

*Every time you see the elephant, we will attempt to explain a Hadoop-related concept.

Page 4: How salesforce.com Uses Hadoop Webinar

Got “Cloud Data”?

780 million transactions/day

Terabytes/day

130k customers

Millions of users

Page 5: How salesforce.com Uses Hadoop Webinar

Hadoop Overview

Started by Doug Cutting at Yahoo!

Based on two Google papers

– Google File System (GFS): http://research.google.com/archive/gfs.html

– Google MapReduce: http://research.google.com/archive/mapreduce.html

Hadoop is an open source Apache project

– Hadoop Distributed File System (HDFS)

– Distributed Processing Framework (MapReduce)

Several related projects

– HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog

Page 6: How salesforce.com Uses Hadoop Webinar

Hadoop use cases

Product Metrics

User behavior analysis

Capacity planning

Monitoring intelligence

Performance analysis

Security

Ad-hoc log searches

Collaborative Filtering

Search Relevancy

Page 7: How salesforce.com Uses Hadoop Webinar

Product Metrics

Page 8: How salesforce.com Uses Hadoop Webinar

Product Metrics – Problem Statement

Track feature usage/adoption across 130k+ customers

– E.g.: Accounts, Contacts, Visualforce, Apex, …

Track standard metrics across all features

– E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …

Track features and metrics across all channels

– API, UI, Mobile

Primary audience: Executives, Product Managers

Page 9: How salesforce.com Uses Hadoop Webinar

Data Pipeline

[Diagram. Stage labels: Feature (What?); Feature Metadata (Instrumentation); Storage & Processing; Crunch it (How?); Daily Summary (Output); Fancy UI (Visualize); Collaborate & Iterate]

Page 10: How salesforce.com Uses Hadoop Webinar

Product Metrics Pipeline

[Diagram. Components: Feature Metrics (Custom Object); Trend Metrics (Custom Object); Client Machine; Java Program; Pig script generator; Hadoop; Log Files; User Input (Page Layout); Reports, Dashboards; Collaboration (Chatter). Connector labels: Log Pull; API; API; Workflow; Workflow; Formula Fields]

Page 11: How salesforce.com Uses Hadoop Webinar

Feature Metrics (Custom Object)

Id     Feature Name    PM     Instrumentation  Metric1    Metric2    Metric3     Metric4  Status
F0001  Accounts        John   /001             #requests  #UniqOrgs  #UniqUsers  AvgRT    Dev
F0002  Contacts        Nancy  /003             #requests  #UniqOrgs  #UniqUsers  AvgRT    Review
F0003  API             Eric   A                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
F0004  Visualforce     Roger  V                #requests  #UniqOrgs  #UniqUsers  AvgRT    Decom
F0005  Apex            Kim    axapx            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
F0006  Custom Objects  Chun   /aXX             #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
F0008  Chatter         Jed    chcmd            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
F0009  Reports         Steve  R                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed

Page 12: How salesforce.com Uses Hadoop Webinar

Feature Metrics (Custom Object)

Page 13: How salesforce.com Uses Hadoop Webinar

User Input (Page Layout)

Formula Field

Workflow Rule

Page 14: How salesforce.com Uses Hadoop Webinar

User Input (Child Custom Object)

Child Objects

Page 15: How salesforce.com Uses Hadoop Webinar

Apache Pig

Page 16: How salesforce.com Uses Hadoop Webinar

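Basic Pig script construct

-- Define UDFs (the path argument is a placeholder, as on the slide)
DEFINE GFV GetFieldValue('/path/to/udf/file');
-- Load data
A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
-- Filter data
B = FILTER A BY GFV(*, 'logRecordType') == 'U';
-- Extract fields (further fields are elided on the slide)
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……;
-- Group (the grouping key is elided on the slide)
G = GROUP C BY ……;
-- Compute output metrics
O = FOREACH G {
    orgs = C.orgId;
    uniqueOrgs = DISTINCT orgs;
    -- A nested FOREACH must end with GENERATE; this line is an assumed completion
    GENERATE group, COUNT(uniqueOrgs);
}
-- Store or dump results
STORE O INTO '/path/to/user/output';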
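For readers without a Pig cluster at hand, the same filter, group, and distinct-count flow can be sketched in plain Python. This is only an illustration: the record layout (dicts with `logRecordType`, `feature`, `orgId`, `userId`, and `responseTimeMs` keys), the sample values, and the function name `daily_metrics` are assumptions, since the webinar does not show the real log schema.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records; all field names are assumptions.
records = [
    {"logRecordType": "U", "feature": "F0001", "orgId": "org1", "userId": "u1", "responseTimeMs": 120},
    {"logRecordType": "U", "feature": "F0001", "orgId": "org1", "userId": "u2", "responseTimeMs": 80},
    {"logRecordType": "U", "feature": "F0001", "orgId": "org2", "userId": "u3", "responseTimeMs": 100},
    {"logRecordType": "A", "feature": "F0003", "orgId": "org2", "userId": "u3", "responseTimeMs": 40},
    {"logRecordType": "U", "feature": "F0002", "orgId": "org2", "userId": "u3", "responseTimeMs": 60},
]

def daily_metrics(rows, record_type="U"):
    """Filter by record type, group by feature, and compute the four standard metrics."""
    groups = defaultdict(list)
    for row in rows:
        if row["logRecordType"] == record_type:   # FILTER step
            groups[row["feature"]].append(row)    # GROUP step
    summary = {}
    for feature, feature_rows in groups.items():
        summary[feature] = {
            "requests": len(feature_rows),
            "uniqueOrgs": len({r["orgId"] for r in feature_rows}),    # DISTINCT, then count
            "uniqueUsers": len({r["userId"] for r in feature_rows}),
            "avgResponseTime": sum(r["responseTimeMs"] for r in feature_rows) / len(feature_rows),
        }
    return summary

metrics = daily_metrics(records)
```

Each entry of `metrics` corresponds to one row of a daily summary: request count, unique orgs, unique users, and average response time for a feature.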

Page 17: How salesforce.com Uses Hadoop Webinar

Java Pig Script Generator (Client)

Page 18: How salesforce.com Uses Hadoop Webinar

Trend Metrics (Custom Object)

Id     Date        #Requests  #Unique Orgs  #Unique Users  Avg ResponseTime
F0001  06/01/2012  <big>      <big>         <big>          <little>
F0002  06/01/2012  <big>      <big>         <big>          <little>
F0003  06/01/2012  <big>      <big>         <big>          <little>
F0001  06/02/2012  <big>      <big>         <big>          <little>
F0002  06/02/2012  <big>      <big>         <big>          <little>
F0003  06/03/2012  <big>      <big>         <big>          <little>

Page 19: How salesforce.com Uses Hadoop Webinar

Upload to Trend Metrics (Custom Object)

Page 20: How salesforce.com Uses Hadoop Webinar

Visualization (Reports & Dashboards)

Page 21: How salesforce.com Uses Hadoop Webinar

Visualization (Reports & Dashboards)

Page 22: How salesforce.com Uses Hadoop Webinar

Collaborate, Iterate (Chatter)

Page 23: How salesforce.com Uses Hadoop Webinar

Recap

[Diagram: the Product Metrics Pipeline, shown again. Components: Feature Metrics (Custom Object); Trend Metrics (Custom Object); Client Machine; Java Program; Pig script generator; Hadoop; Log Files; User Input (Page Layout); Reports, Dashboards; Collaboration (Chatter). Connector labels: Log Pull; API; API; Workflow; Workflow; Formula Fields]

Page 24: How salesforce.com Uses Hadoop Webinar

Technology

Page 25: How salesforce.com Uses Hadoop Webinar

Apache Hadoop

Version=0.20.2

Hadoop ecosystem

Page 26: How salesforce.com Uses Hadoop Webinar

Contributions

Prashant Kommireddi (@pRaShAnT1784)

Lars Hofhansl

Ian Varley (@thefutureian)

Page 27: How salesforce.com Uses Hadoop Webinar

Apache Pig

Version=0.9.1

Data Science tools ecosystem

Page 28: How salesforce.com Uses Hadoop Webinar

Collaborative Filtering

Page 29: How salesforce.com Uses Hadoop Webinar

Show similar files within an organization

– Content-based approach

– Community-based approach

Collaborative Filtering – Problem Statement

Page 30: How salesforce.com Uses Hadoop Webinar

Popular File

Page 31: How salesforce.com Uses Hadoop Webinar

Related File

Page 32: How salesforce.com Uses Hadoop Webinar

We found this relationship using item-to-item collaborative filtering

Amazon published this algorithm in 2003.

– Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.

At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.

Page 33: How salesforce.com Uses Hadoop Webinar

Example: CF on 5 files

Annual Report

Vision Statement

Dilbert Comic

Darth Vader Cartoon

Disk Usage Report

Page 34: How salesforce.com Uses Hadoop Webinar

View History Table

               Annual Report  Vision Statement  Dilbert Cartoon  Darth Vader Cartoon  Disk Usage Report
Miranda (CEO)  1              1                 1                0                    0
Bob (CFO)      1              1                 1                0                    0
Susan (Sales)  0              1                 1                1                    0
Chun (Sales)   0              0                 1                1                    0
Alice (IT)     0              0                 1                1                    1

Page 35: How salesforce.com Uses Hadoop Webinar

Relationships between the files

[Graph with one node per file: Annual Report; Vision Statement; Dilbert Cartoon; Darth Vader Cartoon; Disk Usage Report]

Page 36: How salesforce.com Uses Hadoop Webinar

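Relationships between the files

[Graph with one node per file; each edge is labeled with the number of users who viewed both files.]

                     Annual Rpt.  Vision Stmt.  Dilbert  Darth Vader  Disk Usage
Annual Report        -            2             2        0            0
Vision Statement     2            -             3        1            0
Dilbert Cartoon      2            3             -        3            1
Darth Vader Cartoon  0            1             3        -            1
Disk Usage Report    0            0             1        1            -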

Page 37: How salesforce.com Uses Hadoop Webinar

Sorted relationships for each file

Annual Report: Dilbert (2), Vision Stmt. (2)
Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report: Dilbert (1), Darth Vader (1)

The popularity problem: notice that Dilbert appears first in every other file's list. This is probably not what we want.

The solution: divide the relationship tallies by file popularities (in this example, by the square root of the product of the two files' popularities).
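The raw tallies above can be reproduced with a short Python sketch. The view history is taken from the View History Table slide; the helper name `related` and the alphabetical tie-break are assumptions made for illustration, and the production version described in the webinar runs as Java MapReduce rather than in a single process.

```python
from collections import Counter
from itertools import combinations

# View history from the "View History Table" slide: user -> files viewed.
views = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob":     ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan":   ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun":    ["Dilbert", "Darth Vader"],
    "Alice":   ["Dilbert", "Darth Vader", "Disk Usage"],
}

# One vote per unordered pair of files that the same user viewed.
# Sorting the file names keeps each pair in a single canonical order.
tallies = Counter()
for files in views.values():
    for a, b in combinations(sorted(files), 2):
        tallies[(a, b)] += 1

def related(file_name):
    """Other files sorted by raw tally, highest first (ties broken alphabetically)."""
    scores = {}
    for (a, b), count in tallies.items():
        if file_name == a:
            scores[b] = count
        elif file_name == b:
            scores[a] = count
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

annual_report_list = related("Annual Report")
vision_statement_list = related("Vision Statement")
```

Because every user viewed Dilbert, it collects a vote in almost every pair, which is why it dominates the unnormalized lists.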

Page 38: How salesforce.com Uses Hadoop Webinar

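Normalized relationships between the files

[Graph with one node per file; each edge is labeled with the normalized relationship score.]

                     Annual Rpt.  Vision Stmt.  Dilbert  Darth Vader  Disk Usage
Annual Report        -            .82           .63      0            0
Vision Statement     .82          -             .77      .33          0
Dilbert Cartoon      .63          .77           -        .77          .45
Darth Vader Cartoon  0            .33           .77      -            .58
Disk Usage Report    0            0             .45      .58          -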

Page 39: How salesforce.com Uses Hadoop Webinar

Sorted relationships for each file, normalized by file popularities

Annual Report: Vision Stmt. (.82), Dilbert (.63)
Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report: Darth Vader (.58), Dilbert (.45)

High relationship tallies AND similar popularity values now drive closeness.
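The normalized scores on these slides are consistent with dividing each tally by the square root of the product of the two files' popularities (the cosine measure named in the appendix). The sketch below checks that reading against the slide values; the formula is inferred from the numbers and from the sqrt(3/5) example later in the deck, and the function name `similarity` is an assumption.

```python
from math import sqrt

# Popularities (number of viewers) and raw tallies from the earlier slides.
popularity = {"Annual Report": 2, "Vision Statement": 3, "Dilbert": 5,
              "Darth Vader": 3, "Disk Usage": 1}
tallies = {
    ("Annual Report", "Vision Statement"): 2,
    ("Annual Report", "Dilbert"): 2,
    ("Vision Statement", "Dilbert"): 3,
    ("Vision Statement", "Darth Vader"): 1,
    ("Dilbert", "Darth Vader"): 3,
    ("Dilbert", "Disk Usage"): 1,
    ("Darth Vader", "Disk Usage"): 1,
}

def similarity(a, b):
    """Tally divided by the square root of the product of the two popularities."""
    count = tallies.get((a, b), tallies.get((b, a), 0))
    return count / sqrt(popularity[a] * popularity[b])

# Round to two decimals to compare with the slide.
scores = {pair: round(similarity(*pair), 2) for pair in tallies}
```

With this normalization a very popular file such as Dilbert no longer wins by volume alone: its tally of 2 with Annual Report becomes .63, below the .82 that Annual Report shares with Vision Statement.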

Page 40: How salesforce.com Uses Hadoop Webinar

The item-to-item CF algorithm

1) Compute file popularities

2) Compute relationship tallies and divide by file popularities

3) Sort and store the results

Page 41: How salesforce.com Uses Hadoop Webinar

MapReduce Overview

[Diagram of the three phases: Map, Shuffle, Reduce]

(adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
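To make the three phases concrete, here is a toy, single-process imitation of the map, shuffle, and reduce flow in Python. It is a teaching sketch only: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and nothing here uses or mirrors the Hadoop API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle (group by key), then reduce."""
    # Map: each input record may emit any number of (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: each key and its list of values may emit any number of results.
    return [result for key in sorted(groups) for result in reducer(key, groups[key])]

# Word count, the usual first example.
def wc_map(line):
    for word in line.split():
        yield word.lower(), 1

def wc_reduce(word, counts):
    yield word, sum(counts)

counts = dict(run_mapreduce(["Hadoop Pig", "pig Hadoop hadoop"], wc_map, wc_reduce))
```

On a real cluster the map and reduce calls run in parallel on many machines and the framework performs the shuffle; the data flow between the phases is the same.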

Page 42: How salesforce.com Uses Hadoop Webinar

<user, file>

Inverse identity map

<file, List<user>>

Reduce

<file, (user count)>

Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.

1. Compute File Popularities

Page 43: How salesforce.com Uses Hadoop Webinar

(Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)

Inverse identity map

<Dilbert, {Miranda, Bob, Susan, Chun, Alice}>

Reduce

(Dilbert, 5)

Example: File popularity for Dilbert
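The popularity step can be imitated in a few lines of Python. The Dilbert records come from the slide and the extra Disk Usage record from the view history table; the function names are hypothetical, and counting distinct users (so that a repeated view counts once) is an assumption.

```python
from itertools import groupby
from operator import itemgetter

# (user, file) view records.
view_records = [("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
                ("Chun", "Dilbert"), ("Alice", "Dilbert"), ("Alice", "Disk Usage")]

def inverse_identity_map(record):
    """Swap key and value so that the file becomes the key."""
    user, file_name = record
    return file_name, user

def count_reduce(file_name, users):
    """Popularity is the number of distinct users who viewed the file."""
    return file_name, len(set(users))

# Sorting by key stands in for the shuffle phase; groupby then yields one group per file.
mapped = sorted(map(inverse_identity_map, view_records), key=itemgetter(0))
popularity = dict(
    count_reduce(file_name, [user for _, user in group])
    for file_name, group in groupby(mapped, key=itemgetter(0))
)
```

The resulting `popularity` dictionary plays the role of the (file, popularity) table that the slides store in the Hadoop distributed cache.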

Page 44: How salesforce.com Uses Hadoop Webinar

<user, file>

Identity map

<user, List<file>>

Reduce

<(file1, file2), Integer(1)>,

<(file1, file3), Integer(1)>,

<(file(n-1), file(n)), Integer(1)>

Relationships have their file IDs in alphabetical order to avoid double counting.

2a. Compute relationship tallies - find all relationships in view history table

Page 45: How salesforce.com Uses Hadoop Webinar

(Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)

Identity map

<Miranda, {Annual Report, Vision Statement, Dilbert}>

Reduce

<(Annual Report, Dilbert), Integer(1)>,

<(Annual Report, Vision Statement), Integer(1)>,

<(Dilbert, Vision Statement), Integer(1)>

Example 2a: Miranda’s (CEO) file relationship votes
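The reducer for step 2a can be sketched in Python as follows. The function name `relationship_votes` is hypothetical; the sketch only illustrates the pair generation and the alphabetical ordering described on the slides.

```python
from itertools import combinations

def relationship_votes(user, files):
    """Reducer for step 2a: emit one vote per pair of files viewed by the same user."""
    # Sorting first guarantees each pair is emitted in alphabetical order, so
    # (Dilbert, Annual Report) and (Annual Report, Dilbert) never both appear.
    for pair in combinations(sorted(set(files)), 2):
        yield pair, 1

votes = list(relationship_votes("Miranda", ["Annual Report", "Vision Statement", "Dilbert"]))
```

A user who viewed n files produces n(n-1)/2 votes, which is where the quadratic worst-case cost of the algorithm comes from.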

Page 46: How salesforce.com Uses Hadoop Webinar

2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word

<(file1, file2), Integer(1)>

Identity map

<(file1, file2), List<Integer(1)>>

Reduce: count and divide by popularities

<file1, (file2, similarity score)>, <file2, (file1, similarity score)>

Note that we emit each result twice, once for each file that belongs to a relationship.

Page 47: How salesforce.com Uses Hadoop Webinar

Example 2b: the Dilbert/Darth Vader relationship

<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>

Identity map

<(Dilbert, Vader), {1, 1, 1}>

Reduce: count and divide by popularities

<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
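A Python sketch of the step 2b reducer is shown below. It assumes the score is the vote count divided by the square root of the product of the two popularities, which reproduces the sqrt(3/5) on the slide (3 votes, popularities 5 and 3); the `popularity` dictionary stands in for the distributed cache and `tally_reduce` is a hypothetical name.

```python
from math import sqrt

# File popularities from step 1; on Hadoop the slides keep these in the distributed cache.
popularity = {"Dilbert": 5, "Vader": 3}

def tally_reduce(pair, votes):
    """Count the votes for one relationship and normalize by the two popularities."""
    file1, file2 = pair
    score = sum(votes) / sqrt(popularity[file1] * popularity[file2])
    # Emit once per file so that step 3 can group by either side of the relationship.
    yield file1, (file2, score)
    yield file2, (file1, score)

results = list(tally_reduce(("Dilbert", "Vader"), [1, 1, 1]))
```

Here 3 / sqrt(5 * 3) simplifies to sqrt(3/5), roughly .77, matching the normalized value shown for the Dilbert and Darth Vader pair.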

Page 48: How salesforce.com Uses Hadoop Webinar

<file1, (file2, similarity score)>

Identity map

<file1, List<(file2, similarity score)>>

Reduce

<file1, {top n similar files}>

Store the results in your location of choice

3. Sort and store results

Page 49: How salesforce.com Uses Hadoop Webinar

<Dilbert, (Annual Report, .63)>,

<Dilbert, (Vision Statement, .77)>,

<Dilbert, (Disk Usage, .45)>,

<Dilbert, (Darth Vader, .77)>

Identity map

<Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>

Reduce

<Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)

Store results

Example 3: Sorting the results for Dilbert
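The final reducer only needs to keep the best n entries per file. A minimal Python sketch, with the hypothetical name `top_n_reduce` and the Dilbert scores from the slide, might look like this:

```python
import heapq

def top_n_reduce(file_name, scored_files, n=2):
    """Keep the n most similar files for one file, best first."""
    best = heapq.nlargest(n, scored_files, key=lambda item: item[1])
    return file_name, [name for name, _ in best]

dilbert_scores = [("Annual Report", 0.63), ("Vision Statement", 0.77),
                  ("Disk Usage", 0.45), ("Darth Vader", 0.77)]
result = top_n_reduce("Dilbert", dilbert_scores)
```

Darth Vader and Vision Statement tie at .77, so their relative order within the top two is not meaningful; a real job would need an explicit tie-break if a stable order matters.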

Page 50: How salesforce.com Uses Hadoop Webinar

Appendix

Cosine formula and normalization trick to avoid the distributed cache:

$$\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$$

Mahout has CF

Asymptotic order of the algorithm is O(M*N^2) in worst case, but is helped by sparsity.
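One possible reading of the normalization trick is sketched below in Python; the slide does not spell out the details, so treat this as an interpretation rather than the presenters' implementation. If each view is first weighted by 1/sqrt(popularity) of its file (that is, each file vector A is replaced by A/|A|), then the pair votes can carry the product of the two weights, and simply summing them yields the cosine score without a popularity lookup in the tally step.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt, isclose

# View history from the "View History Table" slide: user -> files viewed.
views = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob":     ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan":   ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun":    ["Dilbert", "Darth Vader"],
    "Alice":   ["Dilbert", "Darth Vader", "Disk Usage"],
}

# Job 1: group by file, then re-emit every view with weight 1/|A| = 1/sqrt(popularity).
viewers = defaultdict(set)
for user, files in views.items():
    for file_name in files:
        viewers[file_name].add(user)

weighted_views = defaultdict(dict)   # user -> {file: normalized weight}
for file_name, users in viewers.items():
    for user in users:
        weighted_views[user][file_name] = 1 / sqrt(len(users))

# Job 2: per user, vote for each pair with the product of the two weights.
# Summing the votes per pair gives the cosine score directly.
cosine = defaultdict(float)
for weights in weighted_views.values():
    for a, b in combinations(sorted(weights), 2):
        cosine[(a, b)] += weights[a] * weights[b]
```

The summed votes agree with the tally-then-divide approach, for example 3 / sqrt(15) for the Dilbert and Darth Vader pair.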

Page 51: How salesforce.com Uses Hadoop Webinar

Summary

Hadoop Cloud Data

Hadoop + Force.com = Recommendation algorithms

Page 52: How salesforce.com Uses Hadoop Webinar

@forcedotcom / #forcewebinar

Developer Force Group

facebook.com/forcedotcom

Developer Force – Force.com Community

Page 53: How salesforce.com Uses Hadoop Webinar

Upcoming Events

June 26 – Mobile CodeTalk

– http://bit.ly/mct-wr

June 27 – Painless Mobile App Development

– http://bit.ly/mobileapp-hp

http://bit.ly/mdc-hp

Page 54: How salesforce.com Uses Hadoop Webinar

Q&A

http://bit.ly/hadoopsurvey

Narayan Bharadwaj @nadubharadwaj

Jed Crosby @JedCrosby

Prashant Kommireddi @pRaShAnT1784

Santosh Rau @santoshrau

@SalesforceEng