The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia

Upload: volha-banadyseva

Post on 27-Aug-2014


DESCRIPTION

#BigDataBY

TRANSCRIPT

Page 1: Thomas Jensen. Machine Learning

The Impact of Big Data on Classic Machine Learning Algorithms

Thomas Jensen, Senior Business Analyst @ Expedia

Page 2: Thomas Jensen. Machine Learning

Who am I?

• Senior Business Analyst @ Expedia
• Working within the competitive intelligence unit
• Responsible for:
  • Algorithm that scores new hotels
  • Algorithm that predicts room nights sold on existing Expedia hotels
  • Scraping competitor sites
  • Other stuff…

Page 3: Thomas Jensen. Machine Learning

The Promise of Big Data

Real-time data

Data-driven decisions

More accurate and robust models

Granularity

Page 4: Thomas Jensen. Machine Learning

Big Data Challenges

Data Processing – not going to talk about this.

Speed at which to use data – how fast should we update algorithms?

How do we train algorithms on data sets that do not fit into memory?

Page 5: Thomas Jensen. Machine Learning

Big Data Challenges

Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 6: Thomas Jensen. Machine Learning

Classification - Logistic Regression

• One classic task in machine learning / statistics is to classify some objects/events/decisions correctly

• Examples are:
  • Customer churn
  • Click behavior
  • Purchase behavior
  • …

• One of the most popular algorithms to carry out these tasks is logistic regression

Page 7: Thomas Jensen. Machine Learning

What is logistic regression?

• Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other

• $\Pr(y \mid x) = \frac{1}{1 + e^{-x\beta}}$

• The challenge is to choose the optimal beta(s)

• To do that we minimize a cost function
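The formula and the cost-minimization step can be sketched in a few lines of plain Python. The slides do not name the cost function, so the log-loss (average negative log-likelihood) used here is an assumption, although it is the usual choice for logistic regression. The function names `predict_proba` and `log_loss` are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, beta):
    """Probability of the positive class for one feature vector x and coefficients beta."""
    return sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))


def log_loss(X, y, beta):
    """Average negative log-likelihood (assumed cost function) over all rows."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(xi, beta), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)
```

With all betas at zero every prediction is 0.5, so the cost equals log(2); training looks for betas that push the cost below that.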

Page 8: Thomas Jensen. Machine Learning

Why Use Logistic Regression?

• It is a simple and well-understood algorithm

• Outputs probabilities

• There are tried and tested methods for estimating the parameters

• It is flexible – can handle a number of different inputs, and feature transformations

Page 9: Thomas Jensen. Machine Learning

Usual Approaches

• Batch training (offline approach)
  • Get all the data and train the algorithm in one go
• Disadvantages when data is big
  • Requires all data to be loaded into memory
  • Periodic retraining is necessary
  • Very time consuming with big data!
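To make the memory and time cost concrete, here is a minimal sketch of batch training, assuming plain gradient descent on the log-loss (the slides do not say which batch optimizer was used). Every parameter update needs a full pass over all rows, which is why the whole data set has to sit in memory. The data is randomly generated and the function name is illustrative.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def batch_gradient_descent(X, y, alpha=0.5, n_iter=200):
    """Fit logistic regression; each iteration touches every row in X."""
    n_features = len(X[0])
    beta = [0.0] * n_features
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):  # full pass over ALL rows per update
            error = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += error * v
        beta = [b - alpha * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Synthetic data: an intercept column plus two random features.
random.seed(0)
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
y = [1 if 2.0 * row[1] - row[2] + random.gauss(0, 0.5) > 0 else 0 for row in X]
beta = batch_gradient_descent(X, y)
```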

Page 10: Thomas Jensen. Machine Learning

Batch Training

Page 11: Thomas Jensen. Machine Learning

Examples of Logistic Regression in Industry Settings – Real Time Bidding

• RTB
  • RTB algorithms are usually based on logistic regression
  • Whether or not to bid on a user is determined by the probability that the user will click on an ad
  • Each day billions of bids are processed
  • Each bid has to be processed within 80 milliseconds

Page 12: Thomas Jensen. Machine Learning

Examples of Logistic Regression in Industry Settings – Fraud Detection

Detecting Fraudulent Credit Card Transactions

• The probability that a transaction is using a stolen credit card is typically estimated with logistic regression

• Billions of transactions are analyzed each day

Page 13: Thomas Jensen. Machine Learning

How Slow is the Batch Version of Logistic Regression?

One target variable and two feature vectors. All randomly generated.

Page 14: Thomas Jensen. Machine Learning

A Real World Problem

Page 15: Thomas Jensen. Machine Learning

A Real World Problem

• Some stats on the training job in the pipeline:
  • Runs training jobs on a per-country basis
  • Longest running job lasts ~9 hours
  • Shortest running job lasts ~3 hours
  • There are often convergence failures
• What we need is an algorithm that:
  • Can reduce training time
  • Is robust towards convergence failures

Page 16: Thomas Jensen. Machine Learning

A Big Data Friendly Approach

Online Training

• Pass each data point sequentially through the algorithm

• Only requires one data point at a time in memory

• Allows for on-the-fly training of the algorithm
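Holding one data point at a time in memory can be sketched with a Python generator. This assumes a CSV layout with the label in the last column, which is a hypothetical format chosen for illustration; `stream_examples` is not from the slides.

```python
import csv
import io


def stream_examples(lines):
    """Yield (features, label) one row at a time; nothing else is kept in memory."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *features, label = row
        yield [float(v) for v in features], int(label)


# In practice `lines` would be an open file handle; a small in-memory
# string stands in for it here.
demo = io.StringIO("0.5,1.2,1\n-0.3,0.8,0\n")
n_seen = 0
for features, label in stream_examples(demo):
    n_seen += 1  # an online learner would update its weights here
```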

Page 17: Thomas Jensen. Machine Learning

Online Learning

• We want to learn a vector of weights

• Initialize all weights. Begin loop:
  1. Get training example
  2. Make a prediction for the target variable
  3. Learn the true value of the target
  4. Update the weights and go to 1

Page 18: Thomas Jensen. Machine Learning

Online Learning

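• Initialize all weights. Begin loop:

Repeat {
  For i = 1 to m {
    $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}

• $\frac{\partial}{\partial \theta_j}$: the partial derivative of the cost function
• $\mathrm{cost}(\theta, (x_i, y_i))$: the cost function – given theta and row i, i.e. how wrong are we?
• $\alpha$: the step size – how fast we should move down the gradient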
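The update rule can be sketched as follows, assuming the cost is the log-loss of logistic regression, for which the partial derivative with respect to theta_j works out to (prediction − y) · x_j. The function name `sgd_update`, the step size, and the simulated data stream are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_update(theta, x, y, alpha):
    """One online step using a single example (x, y).

    Applies theta_j = theta_j - alpha * d cost / d theta_j, where for the
    log-loss the derivative is (prediction - y) * x_j.
    """
    error = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
    return [t - alpha * error * v for t, v in zip(theta, x)]


# Simulated stream: each example is generated, used once, and discarded.
random.seed(0)
theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    x = [1.0, x1, x2]  # leading 1.0 is the intercept term
    y = 1 if 2.0 * x1 - x2 + random.gauss(0, 0.5) > 0 else 0
    theta = sgd_update(theta, x, y, alpha=0.1)
```

Because each step sees only one row, the weights keep moving with every new example instead of settling on a fixed value.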

Page 19: Thomas Jensen. Machine Learning

Online Learning

• Approaches the optimum of the cost function in a jumpy manner and never actually settles on it.

Page 20: Thomas Jensen. Machine Learning

Batch vs. Online Learning

Data size: 4.8 GB
Rows: 500,000
Columns: 5,000

[Bar chart: training time for Batch, SGDClassifier, and Sofia-ml; axis from 0 to 120]

*Times include reading data and training algorithm

Page 21: Thomas Jensen. Machine Learning

Online Learning Vs. Batch

Online Learning

• When we have a continuous stream of data

• When it is important to update the algorithm in real time – can hit a moving target

• When training speed is important

• Parameters are “jumpy” around the optimal values

Batch

• When it is very important to get the exact optimal values

• When data can fit in memory

• When training time is not of the essence

Page 22: Thomas Jensen. Machine Learning

Popular Online Learning Libraries

• Sofia-ml (C/C++)
  • Requires data in SVMlight format
  • Has implementations of SVM, neural networks and logistic regression
  • Supports classification and ranking

• Vowpal Wabbit (C/C++)
  • Requires data in its own VW format
  • Has implementations of the most popular loss functions
  • Supports classification, ranking and regression

• Pandas + scikit-learn (Python)
  • Pandas has a nice function for reading files in batches
  • Can handle sparse and non-sparse matrices
  • Scikit-learn has an SGD classifier that can fit the model in batches
  • Supports classification, ranking and regression

Page 23: Thomas Jensen. Machine Learning