OSCON Data 2011, Ted Dunning

Hands-on Classification


TRANSCRIPT

Page 1: Hands-on Classification

Page 2: Preliminaries

• Code is available from github:
  – git@github.com:tdunning/Chapter-16.git

• EC2 instances available
• Thumb drives also available
• Email to [email protected]
• Twitter @ted_dunning

Page 3: A Quick Review

• What is classification?
  – goes-ins: predictors
  – goes-outs: target variable

• What is classifiable data?
  – continuous, categorical, word-like, text-like
  – uniform schema

• How do we convert from classifiable data to feature vector?

Page 4: Data Flow

Not quite so simple

Page 5: Classifiable Data

• Continuous
  – A number that represents a quantity, not an id
  – Blood pressure, stock price, latitude, mass

• Categorical
  – One of a known, small set (color, shape)

• Word-like
  – One of a possibly unknown, possibly large set

• Text-like
  – Many word-like things, usually unordered
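
To make the conversion concrete, here is a minimal sketch (not from the talk's repo; the variable names, sizes, and values are all mine) of packing these kinds of data into a single Mahout feature vector with the hashed encoders:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodingSketch {
  public static void main(String[] args) {
    // One encoder per variable; the variable name seeds the hash function.
    ContinuousValueEncoder mass = new ContinuousValueEncoder("mass");
    StaticWordValueEncoder shape = new StaticWordValueEncoder("shape");
    StaticWordValueEncoder words = new StaticWordValueEncoder("text");

    Vector v = new RandomAccessSparseVector(1000); // cardinality chosen for the example

    mass.addToVector("42.5", v);   // continuous: the value is parsed and added at one location
    shape.addToVector("round", v); // categorical: one hashed location per category
    for (String w : "many word like things".split(" ")) {
      words.addToVector(w, v);     // text-like: one hashed location per word
    }
    System.out.println(v);
  }
}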

Page 6: But that isn’t quite there

• Learning algorithms need feature vectors
  – Have to convert from data to vector

• Can assign one location per feature
  – or category
  – or word

• Can assign one or more locations with hashing
  – scary
  – but safe on average
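
In Mahout's encoders, the "one or more locations" choice is the number of probes. A tiny sketch (the names and sizes are mine):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("word");
    encoder.setProbes(2);  // each word lands in 2 hashed locations instead of 1

    Vector v = new RandomAccessSparseVector(100);  // small on purpose
    encoder.addToVector("mahout", v);
    encoder.addToVector("hadoop", v);
    System.out.println(v);  // a few nonzero positions out of 100
  }
}

With two probes, a collision at one location still leaves the feature's other location intact, so the damage from any single collision averages out.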

Page 7: Data Flow

Page 8: (diagram only)

Page 9: Classifiable Data Vectors

Page 10: (diagram only)
Page 11: (diagram only)

Page 12: Hashed Encoding

Page 13: What about collisions?
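
The slide's figure is not in the transcript, but the standard back-of-the-envelope estimate (my numbers, not the slide's) says collisions are rare and mild: hashing k active features into d locations gives, by the birthday approximation,

\mathbb{E}[\text{colliding pairs}] \approx \binom{k}{2}\frac{1}{d} = \frac{k(k-1)}{2d}

So, for example, 100 active features hashed into 10,000 locations collide about 0.5 times on average, and with multiple probes a collision corrupts only one of a feature's several locations. That is the sense in which hashing is scary but safe on average.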

Page 14: Let’s write some code

(cue relaxing background music)
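
The live-coded examples themselves are in the Chapter-16 repo; as a stand-in, here is a minimal, self-contained sketch of the kind of Mahout SGD classifier the session builds (the synthetic data and every parameter choice are mine):

import java.util.Random;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2-class logistic regression on 3 features (bias + x + y), L1 prior.
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(2, 3, new L1())
            .learningRate(0.1)
            .lambda(1e-5);

    Random rand = new Random(42);
    for (int i = 0; i < 10000; i++) {
      double x = rand.nextGaussian();
      double y = rand.nextGaussian();
      int target = x + y > 0 ? 1 : 0;  // synthetic rule to learn
      Vector v = new DenseVector(new double[] {1, x, y});
      model.train(target, v);          // one SGD step per example
    }

    // Score a held-out point: classifyScalar returns P(class = 1).
    Vector probe = new DenseVector(new double[] {1, 1.0, 1.0});
    System.out.println(model.classifyScalar(probe));
  }
}

Note that training is strictly one example at a time, which is the sequential style the integration slides return to later.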

Page 15: Generating new features

• Sometimes the existing features are difficult to use

• Restating the geometry using new reference points may help

• Automatic reference points using k-means can be better than manual references
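
A hedged sketch of that idea (the method and variable names are mine): given centroids already produced by k-means, append one distance per centroid to the feature vector.

import java.util.List;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class ReferencePointFeatures {
  /**
   * Returns the original features extended with one distance
   * per k-means centroid (the "new reference points").
   */
  static Vector withClusterFeatures(Vector original, List<Vector> centroids) {
    Vector extended = new DenseVector(original.size() + centroids.size());
    for (int i = 0; i < original.size(); i++) {
      extended.set(i, original.get(i));
    }
    for (int j = 0; j < centroids.size(); j++) {
      // Euclidean distance to centroid j becomes a new feature.
      extended.set(original.size() + j, original.minus(centroids.get(j)).norm(2));
    }
    return extended;
  }
}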

Page 16: K-means using target

Page 17: K-means features

Page 18: More code!

(cue relaxing background music)

Page 19: Integration Issues

• Feature extraction is ideal for map-reduce
  – Side data adds some complexity (see the sketch after this list)

• Clustering works great with map-reduce
  – Cluster centroids to HDFS

• Model training works better sequentially
  – Need centroids in normal files

• Model deployment shouldn’t depend on HDFS
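
As a sketch of the side-data point above (the class and file names are mine; the API is the Hadoop 0.20-era distributed cache): ship the centroids with the job and read them from local disk in setup(), so map() never touches HDFS.

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical feature-extraction mapper: the centroids (side data) arrive
// on local disk via the distributed cache.
public class FeatureExtractionMapper extends Mapper<LongWritable, Text, Text, Text> {
  private File centroidFile;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Assumes exactly one file was placed in the cache when the job was set up.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    centroidFile = new File(cached[0].toString());  // a normal local file
    // ... parse centroid vectors from centroidFile here ...
  }
}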

Page 20: Parallel Stochastic Gradient Descent

(Diagram: the input is split; each split trains a sub-model, and the sub-models are averaged into the final model.)
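
A minimal sketch of the "average models" step, on plain coefficient arrays rather than any particular Mahout class (all naming is mine):

public class AverageModels {
  /** Average the coefficient vectors of sub-models trained on disjoint splits. */
  static double[] average(double[][] subModels) {
    double[] avg = new double[subModels[0].length];
    for (double[] beta : subModels) {
      for (int i = 0; i < beta.length; i++) {
        avg[i] += beta[i] / subModels.length;
      }
    }
    return avg;
  }

  public static void main(String[] args) {
    // Two hypothetical sub-models with three coefficients each.
    double[][] parts = {{1.0, 0.0, 2.0}, {3.0, 0.0, 0.0}};
    System.out.println(java.util.Arrays.toString(average(parts)));  // [2.0, 0.0, 1.0]
  }
}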

Page 21: Variational Dirichlet Assignment

(Diagram: sufficient statistics are gathered from the input, the model is updated from them, and the cycle repeats.)

Page 22: Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n) (see the merge sketch after this list)

• Output to HDFS

(Diagram annotations: the centroids are written by map-reduce to HDFS, read from HDFS to local disk by the distributed cache, and then read from local disk at run time.)
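
The (n, sum/n) trick keeps partial results as a count plus a running mean rather than raw sums. A small sketch of the merge the combiner and reducer perform (the class and method names are mine; Vector is Mahout's):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

/** Partial summary of one cluster: a point count and the mean of its points. */
public class ClusterSummary {
  final int n;
  final Vector mean;

  ClusterSummary(int n, Vector mean) { this.n = n; this.mean = mean; }

  /** Combiner/reducer step: merge two (n, sum/n) pairs for the same cluster id. */
  ClusterSummary merge(ClusterSummary other) {
    int total = n + other.n;
    // Weighted mean of the two means == (sum of all points) / total count.
    Vector merged = mean.times((double) n / total)
                        .plus(other.mean.times((double) other.n / total));
    return new ClusterSummary(total, merged);
  }

  public static void main(String[] args) {
    ClusterSummary a = new ClusterSummary(2, new DenseVector(new double[] {1, 1}));
    ClusterSummary b = new ClusterSummary(6, new DenseVector(new double[] {3, 5}));
    System.out.println(a.merge(b).mean);  // mean of all 8 points: {2.5, 4.0}
  }
}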

Page 23: Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)

• Output to MapR FS (in place of HDFS)

(Diagram annotation: with MapR FS, the centroids written by map-reduce are read back directly via NFS.)

Page 24: Modeling architecture

(Diagram: input flows through feature extraction and down-sampling, implemented in map-reduce; a data join brings in the side-data, now delivered via NFS; learning is sequential SGD.)