Apache Mahout
SD Forum, Thursday, November 4, 2010

TRANSCRIPT

Page 1: SD Forum 11 04-2010

Apache Mahout

Thursday, November 4, 2010

Page 2:

Apache Mahout

Now with extra whitening and classification powers!

Page 3:

• Mahout intro

• Scalability in general

• Supervised learning recap

• The new SGD classifiers

Page 4:

Mahout?

• Hebrew for “essence”

• Hindi for a guy who drives an elephant

Page 7:

Mahout!

• Scalable data-mining and recommendations

• Not all data-mining

• Not the fanciest data-mining

• Just some of the scalable stuff

• Not a competitor for R or Weka

Page 8:

General Areas

• Recommendations

• lots of support, lots of flexibility, production ready

• Unsupervised learning (clustering)

• lots of options, lots of flexibility, production ready (ish)

Page 9:

General Areas

• Supervised learning (classification)

• multiple architectures, fair number of options, somewhat interoperable

• production ready (for the right definition of production and ready)

• Large scale SVD

• larger scale coming, beware sharp edges

Page 10:

Scalable?

• Scalable means

• Time is proportional to problem size divided by resource size

• Does not imply Hadoop or parallel

t ∝ |P| / |R|

Page 11:

[Figure: wall-clock time vs. number of training examples. The non-scalable algorithm's curve climbs steeply: traditional data mining works at small scale, but at large scale scalable solutions are required, and the scalable algorithm wins (Mahout wins!).]

Page 12:

Scalable means ...

• One unit of work requires about a unit of time

• Not like the company store (bit.ly/22XVa4)

t ∝ |P| / |R|

|P| = O(1) ⟹ t = O(1)

Page 13:

[Figure: wall-clock time vs. number of training examples. Below a crossover point the sequential algorithm is preferred; beyond it the parallel algorithm is preferred.]

Page 14:

Toy Example

Page 15:

Training Data Sample

  Filled?   x coordinate   y coordinate   shape
  no        0.92           0.01           circle
  yes       0.30           0.41           square

Filled? is the target variable; x coordinate, y coordinate, and shape are the predictor variables.

Page 16:

What matters most?


Page 17:

SGD Classification

• Supervised learning of logistic regression

• Sequential gradient descent, not parallel

• Highly optimized for high dimensional sparse data, possibly with interactions

• Scalable, real dang fast to train
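The core SGD step is simple enough to sketch. Below is a minimal, self-contained logistic regression learner trained by stochastic gradient descent, using plain double[] features, a fixed learning rate, and L2 regularization. Class and parameter names are illustrative only; Mahout's OnlineLogisticRegression additionally uses sparse vectors, learning-rate annealing, and lazy regularization to get its speed.

```java
import java.util.Random;

// Minimal SGD logistic regression: a sketch of the technique, not
// Mahout's OnlineLogisticRegression.
class SgdLogistic {
    private final double[] beta;   // model coefficients; beta[0] pairs with a bias input
    private final double rate;     // fixed learning rate (Mahout anneals this)
    private final double lambda;   // L2 regularization strength

    SgdLogistic(int numFeatures, double rate, double lambda) {
        this.beta = new double[numFeatures];
        this.rate = rate;
        this.lambda = lambda;
    }

    // p(target = 1 | x) under the current model
    double classify(double[] x) {
        double dot = 0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One gradient step on one example: beta += rate * ((y - p) x - lambda beta)
    void train(int actual, double[] x) {
        double error = actual - classify(x);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += rate * (error * x[i] - lambda * beta[i]);
        }
    }

    public static void main(String[] args) {
        // Learn the toy "filled?" problem: filled whenever x > 0.5.
        SgdLogistic model = new SgdLogistic(3, 0.5, 0.001);
        Random rnd = new Random(42);
        for (int i = 0; i < 10000; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            model.train(x > 0.5 ? 1 : 0, new double[] {1.0, x, y});  // 1.0 is the bias input
        }
        System.out.printf("p(filled | x = 0.9) = %.2f%n", model.classify(new double[] {1.0, 0.9, 0.5}));
        System.out.printf("p(filled | x = 0.1) = %.2f%n", model.classify(new double[] {1.0, 0.1, 0.5}));
    }
}
```

One pass over the examples is one update per example, which is why a single sequential learner can be this fast; Mahout's real trick is doing the same over hashed, high-dimensional sparse vectors.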

Page 20:

Supervised Learning

[Diagram: labeled training examples (target T, features x1 ... xn) feed the learning algorithm, which produces a model; the model then assigns targets to unlabeled examples (?, x1 ... xn). Annotations: the learning algorithm is sequential but fast; the model is stateless and runs in parallel.]

Page 21:

Small example

• On 20 newsgroups

• converges in < 10,000 training examples (less than one pass through the data)

• accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes

• learning rate, regularization set automagically on held-out data

Page 22:

System Structure

[Class diagram]

AdaptiveLogisticRegression
  EvolutionaryProcess ep
  void train(target, features)
  (1 AdaptiveLogisticRegression manages 20 CrossFoldLearners)

CrossFoldLearner
  OnlineLogisticRegression folds
  void train(target, tracking, features)
  double auc()
  (1 CrossFoldLearner holds 5 OnlineLogisticRegressions)

OnlineLogisticRegression
  Matrix beta
  void train(target, features)
  double classifyScalar(features)

Page 23:

Training API

public interface OnlineLearner {
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  void close();
}
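A toy sketch of how this contract is meant to behave, with double[] standing in for Mahout's Vector so it stands alone. The trivial learner below only tracks class frequencies, but it shows the intended shape: the less specific train() overloads funnel into the most specific one (trackingKey is meant to identify an example, groupKey to keep related examples together), and close() ends training. Names other than the interface's are illustrative.

```java
// double[] stands in for Mahout's Vector so this sketch is self-contained.
interface OnlineLearner {
    void train(int actual, double[] instance);
    void train(long trackingKey, int actual, double[] instance);
    void train(long trackingKey, String groupKey, int actual, double[] instance);
    void close();
}

// Trivial learner: tracks class frequencies and predicts the prior.
class PriorLearner implements OnlineLearner {
    private final int[] counts;   // examples seen per category
    private boolean closed;

    PriorLearner(int numCategories) {
        this.counts = new int[numCategories];
    }

    @Override public void train(int actual, double[] instance) {
        train(0L, actual, instance);                 // no tracking key known
    }

    @Override public void train(long trackingKey, int actual, double[] instance) {
        train(trackingKey, null, actual, instance);  // no group key known
    }

    @Override public void train(long trackingKey, String groupKey, int actual, double[] instance) {
        if (closed) throw new IllegalStateException("learner is closed");
        counts[actual]++;                            // a real learner updates its model here
    }

    @Override public void close() {
        closed = true;
    }

    // Most frequently seen category so far.
    int mostLikely() {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) best = i;
        }
        return best;
    }
}
```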

Page 24:

Classification API

public class AdaptiveLogisticRegression implements OnlineLearner {
  public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior);
  public void train(int actual, Vector instance);
  public void train(long trackingKey, int actual, Vector instance);
  public void train(long trackingKey, String groupKey, int actual, Vector instance);
  public void close();
  public double auc();
  public State<Wrapper> getBest();
}

CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);

Page 25:

Speed?

• Encoding API for hashed feature vectors

• String, byte[] or double interfaces

• String allows simple parsing

• byte[] and double allow speed

• Abstract interactions supported
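The idea behind the hashed encoding can be sketched in a few lines: feature names hash directly into slots of a fixed-size vector, so no dictionary is built and encoding cost is just hashing. The names below are illustrative, not Mahout's encoder classes (which also provide the byte[] and double fast paths and use several probes to soften collisions).

```java
// Minimal sketch of hashed feature encoding.
class HashedEncoder {
    private final int numFeatures;   // size of the output vector

    HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add a named feature with a weight by hashing its name to one slot.
    void addToVector(String name, double weight, double[] vector) {
        int slot = Math.floorMod(name.hashCode(), numFeatures);
        vector[slot] += weight;
    }

    // An interaction between two features is just a hash of the combined names.
    void addInteraction(String a, String b, double weight, double[] vector) {
        addToVector(a + "&" + b, weight, vector);
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(1000);
        double[] v = new double[1000];
        enc.addToVector("shape=circle", 1.0, v);         // categorical feature
        enc.addToVector("x", 0.92, v);                   // continuous feature
        enc.addInteraction("shape=circle", "x", 0.92, v);
        double sum = 0;
        for (double d : v) sum += d;
        System.out.println("total mass = " + sum);
    }
}
```

Because the vector size is fixed up front, adding interaction features costs only extra hashing, never a bigger dictionary.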

Page 26:

Speed!

• Parsing and encoding dominate single learner

• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core

• 20 million mixed text and categorical features with many interactions learned in ~1 hour

Page 27:

More Speed!

• Evolutionary optimization of learning parameters allows simple operation

• 20x threading allows high machine use

• 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes

Page 28:

Summary

• Mahout provides early production-quality scalable data-mining

• New classification systems allow industrial scale classification

Page 29:

Contact Info

Ted [email protected]
