sd forum 11 04-2010

Post on 15-Jan-2015

SD Forum November 4, 2010

Apache Mahout

Thursday, November 4, 2010

Apache Mahout: Now with extra whitening and classification powers!

• Mahout intro

• Scalability in general

• Supervised learning recap

• The new SGD classifiers

Mahout?

• Hebrew for “essence”

• Hindi for a guy who drives an elephant

Mahout!

• Scalable data-mining and recommendations

• Not all data-mining

• Not the fanciest data-mining

• Just some of the scalable stuff

• Not a competitor for R or Weka

General Areas

• Recommendations

• lots of support, lots of flexibility, production ready

• Unsupervised learning (clustering)

• lots of options, lots of flexibility, production ready (ish)

• Supervised learning (classification)

• multiple architectures, fair number of options, somewhat inter-operable

• production ready (for the right definition of production and ready)

• Large scale SVD

• larger scale coming, beware sharp edges

Scalable?

• Scalable means

• Time is proportional to problem size divided by resource size

• Does not imply Hadoop or parallel

$t \propto \frac{|P|}{|R|}$

[Figure: wall-clock time versus number of training examples for a scalable and a non-scalable algorithm. Region labels: "Traditional Datamining Works here" and "Scalable Solutions Required". The scalable algorithm is annotated "Mahout wins!".]

Scalable means ...

• One unit of work requires about a unit of time

• Not like the company store (bit.ly/22XVa4)

$t \propto \frac{|P|}{|R|}$

$|P| = O(1) \implies t = O(1)$
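
A short worked reading of the proportionality may help here. It assumes, beyond what the slide states, a fixed constant of proportionality $c$ for a given algorithm and machine type, and an arbitrary scale factor $k$:

```latex
t = c\,\frac{|P|}{|R|}
\qquad\Longrightarrow\qquad
t' = c\,\frac{k\,|P|}{k\,|R|} = t
\quad\text{for any scale factor } k > 0 .
```

Under that assumption, growing the problem and the resources by the same factor leaves wall-clock time unchanged, and with fixed resources time grows linearly with problem size.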

[Figure: wall-clock time versus number of training examples for a sequential and a parallel algorithm. Region labels: "Sequential Algorithm Preferred" and "Parallel Algorithm Preferred".]

Toy Example

Training Data Sample

Filled?   x coordinate   y coordinate   shape
yes       0.92           0.01           circle
no        0.30           0.41           square

Target variable: Filled?

Predictor variables: x coordinate, y coordinate, shape

What matters most?

SGD Classification

• Supervised learning of logistic regression

• Stochastic gradient descent, run sequentially, not in parallel

• Highly optimized for high dimensional sparse data, possibly with interactions

• Scalable, real dang fast to train
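
To make the bullets above concrete, here is a minimal sketch of logistic regression trained one example at a time by gradient descent. It is plain Java and not Mahout's implementation: the class name `TinySgdLogistic`, the dense `double[]` features, the fixed learning rate, and the L2 penalty are all assumptions chosen for brevity (the slide describes an implementation tuned for sparse, high-dimensional data).

```java
import java.util.Random;

/** Hypothetical minimal binary logistic regression trained by per-example gradient steps. */
final class TinySgdLogistic {
    private final double[] beta;        // one weight per feature
    private final double learningRate;  // fixed step size (assumption)
    private final double lambda;        // L2 regularization strength (assumption)

    TinySgdLogistic(int numFeatures, double learningRate, double lambda) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
        this.lambda = lambda;
    }

    /** Returns the estimated probability that the target is 1. */
    double classifyScalar(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** Takes one gradient step on a single example; actual must be 0 or 1. */
    void train(int actual, double[] features) {
        double error = actual - classifyScalar(features);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * (error * features[i] - lambda * beta[i]);
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        TinySgdLogistic model = new TinySgdLogistic(3, 0.1, 1e-4);
        for (int n = 0; n < 10_000; n++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            int filled = x > y ? 1 : 0;  // made-up labeling rule, for illustration only
            model.train(filled, new double[] {1.0, x, y});  // 1.0 is the bias term
        }
        System.out.println(model.classifyScalar(new double[] {1.0, 0.9, 0.1}));
        System.out.println(model.classifyScalar(new double[] {1.0, 0.1, 0.9}));
    }
}
```

Each call to `train` touches only the current example, which is why the work is sequential yet cheap per step.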

Supervised Learning

[Figure: labeled training examples (T, x1 ... xn) feed a learning algorithm, which produces a model; the model maps unlabeled examples (?, x1 ... xn) to predicted targets T. Annotations added across the slide builds: "Sequential but fast" and "Stateless, parallel".]

Small example

• On 20 newsgroups

• converges in < 10,000 training examples (less than one pass through the data)

• accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes

• learning rate, regularization set automagically on held-out data

System Structure

AdaptiveLogisticRegression
    EvolutionaryProcess ep
    void train(target, features)

1 AdaptiveLogisticRegression : 20 CrossFoldLearner

CrossFoldLearner
    OnlineLogisticRegression folds
    void train(target, tracking, features)
    double auc()

1 CrossFoldLearner : 5 OnlineLogisticRegression

OnlineLogisticRegression
    Matrix beta
    void train(target, features)
    double classifyScalar(features)
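
The middle layer of this structure can be sketched as follows: several online models share one training stream, and each example is scored by the one model that does not train on it. This is a plain-Java illustration under stated assumptions, not Mahout's `CrossFoldLearner`: the class `TinyCrossFold`, the modulo rule on the tracking key, and accuracy as the held-out metric are all hypothetical choices.

```java
/** Hypothetical sketch: k online logistic models, each holding out a different slice of the stream. */
final class TinyCrossFold {
    private final double[][] beta;      // beta[f] holds the weights of fold f's model
    private final double learningRate;
    private long heldOutSeen = 0;
    private long heldOutCorrect = 0;

    TinyCrossFold(int folds, int numFeatures, double learningRate) {
        this.beta = new double[folds][numFeatures];
        this.learningRate = learningRate;
    }

    private static double predict(double[] weights, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** The tracking key decides which fold scores this example instead of learning from it. */
    void train(long trackingKey, int actual, double[] x) {
        int heldOutFold = (int) Math.floorMod(trackingKey, (long) beta.length);
        for (int f = 0; f < beta.length; f++) {
            double p = predict(beta[f], x);
            if (f == heldOutFold) {
                // Held-out fold: record accuracy only, no weight update.
                heldOutSeen++;
                if ((p >= 0.5) == (actual == 1)) {
                    heldOutCorrect++;
                }
            } else {
                // Every other fold takes a gradient step on this example.
                for (int i = 0; i < beta[f].length; i++) {
                    beta[f][i] += learningRate * (actual - p) * x[i];
                }
            }
        }
    }

    /** Fraction of held-out predictions that were correct so far. */
    double percentCorrect() {
        return heldOutSeen == 0 ? 0.0 : (double) heldOutCorrect / heldOutSeen;
    }

    /** Averages the fold models' probabilities for a new example. */
    double classifyScalar(double[] x) {
        double sum = 0.0;
        for (double[] weights : beta) {
            sum += predict(weights, x);
        }
        return sum / beta.length;
    }
}
```

Because every score comes from a model that never trained on that example, the running accuracy is an out-of-sample estimate that costs no extra pass over the data.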

Training API

public interface OnlineLearner {
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  void close();
}

Classification API

public class AdaptiveLogisticRegression implements OnlineLearner {
  public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior);
  public void train(int actual, Vector instance);
  public void train(long trackingKey, int actual, Vector instance);
  public void train(long trackingKey, String groupKey, int actual, Vector instance);
  public void close();

  public double auc();
  public State<Wrapper> getBest();
}

CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();

double p = model.classifyScalar(features);

Speed?

• Encoding API for hashed feature vectors

• String, byte[] or double interfaces

• String allows simple parsing

• byte[] and double allow speed

• Abstract interactions supported
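
The idea behind hashed feature vectors can be sketched in a few lines: each feature name and value is hashed to a slot in a fixed-size vector, so no dictionary is needed. The code below is an illustration only and does not use Mahout's encoder classes; `HashedEncoder`, its method names, and the choice of `Arrays.hashCode` over UTF-8 bytes as the hash are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of the hashing trick for fixed-size feature vectors. */
final class HashedEncoder {
    private final int numFeatures;

    HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    /** Maps a (name, value) pair to a slot in [0, numFeatures). */
    private int slot(String name, String value) {
        byte[] bytes = (name + ":" + value).getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(bytes), numFeatures);
    }

    /** Categorical or word feature: add 1 to the slot for this name and value. */
    void addCategorical(String name, String value, double[] vector) {
        vector[slot(name, value)] += 1.0;
    }

    /** Continuous feature: the slot depends on the name only, the value is the weight. */
    void addContinuous(String name, double value, double[] vector) {
        vector[slot(name, "")] += value;
    }

    /** Interaction of two categorical features: hash the pair as one combined feature. */
    void addInteraction(String nameA, String valueA, String nameB, String valueB, double[] vector) {
        vector[slot(nameA + "*" + nameB, valueA + "*" + valueB)] += 1.0;
    }

    public static void main(String[] args) {
        HashedEncoder encoder = new HashedEncoder(16);
        double[] vector = new double[16];
        encoder.addContinuous("x coordinate", 0.92, vector);
        encoder.addContinuous("y coordinate", 0.01, vector);
        encoder.addCategorical("shape", "circle", vector);
        System.out.println(Arrays.toString(vector));
    }
}
```

Distinct features can collide in the same slot; a larger vector makes that rarer at the cost of memory.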

Speed!

• Parsing and encoding dominate single learner

• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core

• 20 million mixed text, categorical features with many interactions learned in ~ 1 hour

More Speed!

• Evolutionary optimization of learning parameters allows simple operation

• 20x threading allows high machine use

• 20 newsgroups test completes in less time on a single node with SGD than on Hadoop with Complementary Naive Bayes
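
One way to picture evolutionary tuning of a learning parameter is a pool of learners that differ only in learning rate, where the worst half is periodically replaced by mutated copies of the best half. The sketch below is single-threaded plain Java and not Mahout's `EvolutionaryProcess`: the class `TinyEvolution`, the log-normal mutation, the initial rate range, and scoring each example before training on it are all assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Hypothetical sketch: evolve the learning rate of a pool of online logistic models. */
final class TinyEvolution {
    /** One candidate: a learning rate plus the model trained with it. */
    static final class Candidate {
        final double learningRate;
        final double[] beta;
        double score;  // log-likelihood summed over examples, each scored before training on it

        Candidate(double learningRate, double[] beta) {
            this.learningRate = learningRate;
            this.beta = beta;
        }

        double predict(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < beta.length; i++) {
                dot += beta[i] * x[i];
            }
            return 1.0 / (1.0 + Math.exp(-dot));
        }

        void scoreThenTrain(int actual, double[] x) {
            double p = predict(x);
            double clamped = Math.min(Math.max(p, 1e-9), 1.0 - 1e-9);  // avoid log(0)
            score += actual == 1 ? Math.log(clamped) : Math.log(1.0 - clamped);
            for (int i = 0; i < beta.length; i++) {
                beta[i] += learningRate * (actual - p) * x[i];
            }
        }
    }

    private final Candidate[] pool;
    private final Random rnd;

    TinyEvolution(int poolSize, int numFeatures, long seed) {
        this.rnd = new Random(seed);
        this.pool = new Candidate[poolSize];
        for (int i = 0; i < poolSize; i++) {
            // Initial rates spread log-uniformly between 1e-4 and 10 (arbitrary range).
            double rate = Math.pow(10.0, -4.0 + 5.0 * rnd.nextDouble());
            pool[i] = new Candidate(rate, new double[numFeatures]);
        }
    }

    /** Every candidate scores the example, then learns from it. */
    void train(int actual, double[] x) {
        for (Candidate c : pool) {
            c.scoreThenTrain(actual, x);
        }
    }

    /** Ends an epoch: the worst half is replaced by mutated copies of the best half. */
    void evolve() {
        Arrays.sort(pool, Comparator.comparingDouble((Candidate c) -> -c.score));
        int half = pool.length / 2;
        for (int i = 0; i < half; i++) {
            Candidate parent = pool[i];
            double mutatedRate = parent.learningRate * Math.exp(0.5 * rnd.nextGaussian());
            pool[pool.length - 1 - i] = new Candidate(mutatedRate, parent.beta.clone());
        }
        for (Candidate c : pool) {
            c.score = 0.0;
        }
    }

    /** Best candidate as of the most recent call to evolve(). */
    Candidate getBest() {
        return pool[0];
    }
}
```

The caller never picks a learning rate by hand, which is the sense in which this kind of search allows simple operation. Running the candidates on separate threads, as the slide mentions, is left out of this sketch.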

Summary

• Mahout provides early production quality scalable data-mining

• New classification systems allow industrial scale classification

Contact Info

Ted Dunning

tdunning@maprtech.com

or tdunning@apache.com