sd forum 11 04-2010

Apache Mahout

Thursday, November 4, 2010

Apache Mahout: Now with extra whitening and classification powers!

• Mahout intro

• Scalability in general

• Supervised learning recap

• The new SGD classifiers

Mahout?

• Hebrew for “essence”

• Hindi for a guy who drives an elephant

Mahout!

• Scalable data-mining and recommendations

• Not all data-mining

• Not the fanciest data-mining

• Just some of the scalable stuff

• Not a competitor for R or Weka

General Areas

• Recommendations

• lots of support, lots of flexibility, production ready

• Unsupervised learning (clustering)

• lots of options, lots of flexibility, production ready (ish)

General Areas

• Supervised learning (classification)

• multiple architectures, fair number of options, somewhat inter-operable

• production ready (for the right definition of production and ready)

• Large scale SVD

• larger scale coming, beware sharp edges

Scalable?

• Scalable means

• Time is proportional to problem size divided by resource size

• Does not imply Hadoop or parallel

$t \propto \frac{|P|}{|R|}$

[Figure: wall-clock time versus number of training examples for a non-scalable algorithm and a scalable algorithm ("Mahout wins!"). Region labels: "Traditional data mining works here" and "Scalable solutions required".]

Scalable means ...

• One unit of work requires about a unit of time

• Not like the company store (bit.ly/22XVa4)

$t \propto \frac{|P|}{|R|}$

$|P| = O(1) \implies t = O(1)$

[Figure: wall-clock time versus number of training examples for a sequential algorithm and a parallel algorithm. Region labels: "Sequential algorithm preferred" and "Parallel algorithm preferred".]
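The crossover between the sequential and parallel regions can be illustrated with a very simple cost model. This is only a sketch under assumptions that are not in the slides: the sequential algorithm costs $a$ seconds per example, and the parallel one pays a fixed startup overhead $c$ and then divides the same per-example work across $|R| > 1$ resources.

```latex
\begin{align*}
t_{\text{seq}}(n) &= a\,n \\
t_{\text{par}}(n) &= c + \frac{a\,n}{|R|} \\
t_{\text{par}}(n) < t_{\text{seq}}(n) &\iff n > n^{*} = \frac{c}{a\left(1 - 1/|R|\right)}
\end{align*}
```

With purely illustrative numbers ($a = 1$ ms, $c = 60$ s, $|R| = 20$), $n^{*} \approx 63{,}000$ examples: below that size the sequential algorithm finishes first, above it the parallel one does.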

Toy Example

Training Data Sample

Filled?   x coordinate   y coordinate   shape
no        0.92           0.01           circle
          0.30           0.41           square

Column labels on the slide: predictor variables, target variable

What matters most?

SGD Classification

• Supervised learning of logistic regression

• Sequential gradient descent, not parallel

• Highly optimized for high dimensional sparse data, possibly with interactions

• Scalable, real dang fast to train
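As a rough illustration of what sequential SGD training of logistic regression looks like, here is a minimal sketch in plain Java. It is not Mahout's implementation: the class name SgdLogisticSketch, the dense double[] features, the fixed learning rate, and the toy "x > y" target are all assumptions made for the example. Mahout's learners work on sparse vectors and add regularization and learning-rate annealing.

```java
import java.util.Random;

/** Hypothetical, simplified binary logistic regression trained by SGD (not Mahout's class). */
class SgdLogisticSketch {
    private final double[] beta;        // one weight per feature
    private final double learningRate;  // fixed step size (an assumption of this sketch)

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Returns the estimated probability that the target is 1. */
    double classifyScalar(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step: move each weight along the log-likelihood gradient for this example. */
    void train(int actual, double[] features) {
        double error = actual - classifyScalar(features);
        for (int i = 0; i < beta.length; i++) {
            if (features[i] != 0.0) {  // zero entries contribute nothing to the update
                beta[i] += learningRate * error * features[i];
            }
        }
    }

    /** Trains on a toy stream where the target is 1 exactly when x > y. */
    static SgdLogisticSketch trainToy(int examples, long seed) {
        Random random = new Random(seed);
        SgdLogisticSketch model = new SgdLogisticSketch(3, 0.5);
        for (int n = 0; n < examples; n++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            // The first entry is a constant bias term.
            model.train(x > y ? 1 : 0, new double[] {1.0, x, y});
        }
        return model;
    }

    public static void main(String[] args) {
        SgdLogisticSketch model = trainToy(10_000, 42L);
        System.out.println("p(1 | x=0.9, y=0.1) = "
            + model.classifyScalar(new double[] {1.0, 0.9, 0.1}));
        System.out.println("p(1 | x=0.1, y=0.9) = "
            + model.classifyScalar(new double[] {1.0, 0.1, 0.9}));
    }
}
```

Each example is seen once and touches only the weights of its non-zero features, which is why this style of training stays cheap on high-dimensional sparse data.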

Supervised Learning

[Diagram: labeled training examples (T, x1 ... xn) feed a learning algorithm, which produces a model; the model fills in the unknown target (?) for new examples (x1 ... xn).]

• Training: sequential but fast

• Classification: stateless, parallel

Small example

• On 20 newsgroups

• converges in < 10,000 training examples (less than one pass through the data)

• accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes

• learning rate, regularization set automagically on held-out data

System Structure

AdaptiveLogisticRegression
    EvolutionaryProcess ep
    void train(target, features)

CrossFoldLearner
    OnlineLogisticRegression folds
    void train(target, tracking, features)
    double auc()

OnlineLogisticRegression
    Matrix beta
    void train(target, features)
    double classifyScalar(features)
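The middle layer of this structure can be sketched as follows: several online learners are kept side by side, and the tracking key decides which one holds an example out for scoring while the others train on it. This is a simplified illustration in plain Java, not Mahout's code. The names CrossFoldSketch, Fold and heldOutAccuracy are hypothetical, the per-fold learner is a bare-bones SGD logistic regression, and held-out accuracy stands in for the AUC that the real class reports.

```java
import java.util.Random;

/** Hypothetical sketch of cross-fold training; names are illustrative only. */
class CrossFoldSketch {
    /** Minimal SGD logistic learner standing in for one fold's model. */
    static final class Fold {
        final double[] beta;

        Fold(int numFeatures) {
            beta = new double[numFeatures];
        }

        double classifyScalar(double[] features) {
            double dot = 0.0;
            for (int i = 0; i < beta.length; i++) dot += beta[i] * features[i];
            return 1.0 / (1.0 + Math.exp(-dot));
        }

        void train(int actual, double[] features) {
            double error = actual - classifyScalar(features);
            for (int i = 0; i < beta.length; i++) beta[i] += 0.5 * error * features[i];
        }
    }

    private final Fold[] folds;
    private long heldOutSeen = 0;
    private long heldOutCorrect = 0;

    CrossFoldSketch(int numFolds, int numFeatures) {
        folds = new Fold[numFolds];
        for (int k = 0; k < numFolds; k++) folds[k] = new Fold(numFeatures);
    }

    /** The tracking key picks the one fold that scores this example; all others train on it. */
    void train(long trackingKey, int actual, double[] features) {
        int heldOut = (int) Math.floorMod(trackingKey, (long) folds.length);
        for (int k = 0; k < folds.length; k++) {
            if (k == heldOut) {
                int predicted = folds[k].classifyScalar(features) >= 0.5 ? 1 : 0;
                heldOutSeen++;
                if (predicted == actual) heldOutCorrect++;
            } else {
                folds[k].train(actual, features);
            }
        }
    }

    /** Fraction of held-out predictions that were correct, or 0.0 before any training. */
    double heldOutAccuracy() {
        return heldOutSeen == 0 ? 0.0 : (double) heldOutCorrect / heldOutSeen;
    }

    /** Feeds a toy stream (target is 1 exactly when x > y) and returns held-out accuracy. */
    static double runToy(int examples, long seed) {
        Random random = new Random(seed);
        CrossFoldSketch learner = new CrossFoldSketch(5, 3);
        for (long key = 0; key < examples; key++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            learner.train(key, x > y ? 1 : 0, new double[] {1.0, x, y});
        }
        return learner.heldOutAccuracy();
    }

    public static void main(String[] args) {
        System.out.println("held-out accuracy = " + runToy(10_000, 42L));
    }
}
```

Because every example is scored by a model that never trained on it, the quality estimate comes for free during training, which is what lets an outer layer compare learner settings without a separate evaluation pass.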

Training API

public interface OnlineLearner {
    void train(int actual, Vector instance);
    void train(long trackingKey, int actual, Vector instance);
    void train(long trackingKey, String groupKey, int actual, Vector instance);
    void close();
}

Classification API

public class AdaptiveLogisticRegression implements OnlineLearner {
    public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior);
    public void train(int actual, Vector instance);
    public void train(long trackingKey, int actual, Vector instance);
    public void train(long trackingKey, String groupKey, int actual, Vector instance);
    public void close();

    public double auc();
    public State<Wrapper> getBest();
}

CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();

double p = model.classifyScalar(features);

Speed?

• Encoding API for hashed feature vectors

• String, byte[] or double interfaces

• String allows simple parsing

• byte[] and double allow speed

• Abstract interactions supported
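The idea behind hashed feature vectors can be sketched in a few lines of plain Java. This is an illustration of the general hashing trick, not Mahout's encoder classes: the class HashedEncodingSketch, its method names, the FNV-1a hash, and the "name:value" key layout are all assumptions chosen for the example. The sample values come from the first row of the toy training data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of hashed feature encoding; not Mahout's encoder classes. */
class HashedEncodingSketch {
    /** 32-bit FNV-1a hash of the bytes, reduced to an index in [0, numFeatures). */
    static int hashToIndex(byte[] bytes, int numFeatures) {
        int hash = 0x811C9DC5;        // FNV offset basis
        for (byte b : bytes) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;       // FNV prime
        }
        return Math.floorMod(hash, numFeatures);
    }

    /** byte[] interface: the caller supplies raw bytes, so no string handling is needed. */
    static void addBytes(byte[] key, double weight, double[] vector) {
        vector[hashToIndex(key, vector.length)] += weight;
    }

    /** String interface: a categorical or text value adds 1 at the slot for "name:value". */
    static void addWord(String name, String value, double[] vector) {
        addBytes((name + ":" + value).getBytes(StandardCharsets.UTF_8), 1.0, vector);
    }

    /** double interface: a continuous value is added at the slot for the variable name. */
    static void addContinuous(String name, double value, double[] vector) {
        addBytes(name.getBytes(StandardCharsets.UTF_8), value, vector);
    }

    /** Interaction: the pair of values is hashed together as a single feature. */
    static void addInteraction(String nameA, String valueA,
                               String nameB, String valueB, double[] vector) {
        addWord(nameA + "*" + nameB, valueA + "*" + valueB, vector);
    }

    public static void main(String[] args) {
        double[] vector = new double[20];
        addWord("shape", "circle", vector);
        addWord("filled", "no", vector);
        addContinuous("x", 0.92, vector);
        addContinuous("y", 0.01, vector);
        addInteraction("shape", "circle", "filled", "no", vector);
        System.out.println(Arrays.toString(vector));
    }
}
```

No dictionary is built, so the vector size is fixed in advance and encoding needs no extra pass over the data. The price is that distinct features may collide in the same slot.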

Speed!

• Parsing and encoding dominate single learner

• Moderate optimization allows 1 million training examples with 200 features each to be encoded in 14 seconds on a single core

• 20 million mixed text, categorical features with many interactions learned in ~ 1 hour
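To put these figures in per-item terms, the arithmetic below works out the implied rates. It assumes that each of the 1 million examples carries 200 feature values and that the 20 million figure counts training examples; the slide does not state either explicitly.

```latex
\begin{align*}
10^{6}\ \text{examples} \times 200\ \text{features} &= 2 \times 10^{8}\ \text{feature values} \\
\frac{2 \times 10^{8}}{14\ \text{s}} &\approx 1.4 \times 10^{7}\ \text{feature values per second} \\
\frac{14\ \text{s}}{2 \times 10^{8}} &= 70\ \text{ns per feature value} \\
\frac{2 \times 10^{7}\ \text{examples}}{3600\ \text{s}} &\approx 5.6 \times 10^{3}\ \text{examples per second}
\end{align*}
```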

More Speed!

• Evolutionary optimization of learning parameters allows simple operation

• 20x threading allows high machine utilization

• 20 newsgroups test completes in less time on a single node with SGD than on Hadoop with Complementary Naive Bayes

Summary

• Mahout provides early production-quality, scalable data mining

• New classification systems allow industrial scale classification

Contact Info

Ted Dunning

tdunning@maprtech.com

or tdunning@apache.com
