composing mahout clustering jobs

42
Composing Mahout clustering jobs Frank Scholten [email protected]

Upload: doannhan

Post on 13-Feb-2017

239 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Composing Mahout clustering jobs

Composing Mahout clustering jobs

Frank Scholten

[email protected]

Page 2: Composing Mahout clustering jobs

Bio

● Frank Scholten

● Developer at Amsterdam, NL

● Mahout user / contributor

● http://blog.jteam.nl/author/frank

Page 3: Composing Mahout clustering jobs

Agenda

What is clustering?

Introducing

Clustering

Page 4: Composing Mahout clustering jobs

What is clustering?

Page 5: Composing Mahout clustering jobs

Clustering - Google News

Page 6: Composing Mahout clustering jobs

Why clustering?

● Summarizing data

● Applications

Market analysis – identify customer groups

Biology – identify species

Image compression

many more applications!

Page 7: Composing Mahout clustering jobs

Definition

“Cluster analysis or clustering is the assignment of a set of observations into

subsets (called clusters) so that observations in the same cluster are similar in some

sense.”

Source: Wikipedia

Page 8: Composing Mahout clustering jobs

2-D Clustering Example

Inter-cluster distance

ClusterCenter

Intra-cluster distance

PointLegend Cluster

Page 9: Composing Mahout clustering jobs

Distance Measures

Euclidian distance measure

P

Q

d

Page 10: Composing Mahout clustering jobs

Vectorization

Vectorize data to measure distances

'The fox chased the dog'● [the => 2, fox => 1, chased => 1, dog => 1]

#0000CD → [wavelength => 475]

“Amsterdam” → [ lat => 52, long => 4]

Page 11: Composing Mahout clustering jobs

K-Means Algorithm

Select K random vectors

Specify distance measure + threshold

Every iteration● Add vector closest to cluster● Recompute center● Converged if no vectors within threshold

Page 12: Composing Mahout clustering jobs

Introducing

Page 13: Composing Mahout clustering jobs

Scalable machine learning

On top of Hadoop, for the most part

Started in 2008

Version 0.5 released last week!

Page 14: Composing Mahout clustering jobs

CollaborativeFiltering

Clustering

Classifcation

Is this SPAM?

And much more!

Page 15: Composing Mahout clustering jobs

Composing several jobs

$ mahout seqdirectory <options>

$ mahout seq2sparse <options>

$ mahout kmeans <options>

$ mahout clusterdump <options>

Page 16: Composing Mahout clustering jobs

bin/mahout

$ mahout seqdirectory --help

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop

HADOOP_CONF_DIR=/usr/local/hadoop/conf

Usage:

[--keyPrefix <keyPrefix> --chunkSize <chunkSize> --charset <charset> --output

<output> --fileFilterClass <fileFilterClass> --help --input <input>]

Page 17: Composing Mahout clustering jobs

bin/mahout

● bin/mahout calls MahoutDriver

● MahoutDriver

● Parses options

● Configures other Drivers

Page 18: Composing Mahout clustering jobs

KMeansDriver

String[] args = new String[] {

"--input", input,

"--output", output,

"--clusters", clusters,

"--clustering",

"--numClusters", “10”

};

ToolRunner.run(conf,new KmeansDriver(),args);

Page 19: Composing Mahout clustering jobs

KMeansDriver

V1V2V3

V4V5V6

KMeansMapper KMeansCombiner KMeansReducer

(C1,V1)(C2,V2)(C2,V3)

(C3,V4)(C3,V5)(C4,V6)

(C1,V1)(C2,[V2,V3])

Cluster closest to vector

Combine clusterobservations

(C3,[V4,V5)(C4,V6)

(C1, Centroid1)(C2, Centroid2)(C3, Centroid3)(C4, Centroid4)

Page 20: Composing Mahout clustering jobs

Clustering

Page 21: Composing Mahout clustering jobs

Clustering

● Publicly available monthly dumps

● Posts ~ 5.5 GB ~ 1.4 M questions

● Inspired by Mahout in Action book

Page 22: Composing Mahout clustering jobs

Goal - Tag cloud

(250 tags)

Deploying Mahout on a Hadoop cluster

(unknown #) Questions Datasets for Apache Mahout

Classifying data using Apache Mahout

...

Ruby

Unit TestingMahout

Hadoop IPhone

Java

Page 23: Composing Mahout clustering jobs

How to cluster?

● Stopwords influence clustering big time!

● Option?● Cluster on collocations, e.g “Unit Testing”

● MAHOUT-415

Lucene filter for collocations

Page 24: Composing Mahout clustering jobs

Preprocess

Cluster

Index

Join clustered points

Vectorize[ unit testing => 1, … ][ version control => 1, … ]

Clustering on collocations

Continuous Integration

Unit Testing

Version control

Posts

Findcollocations

Text

ClusterLabels

Page 25: Composing Mahout clustering jobs

Pre process

StackOverfow posts

Extractpost body

Interpret &Strip HTMLPlain text

<row Id="4234" PostTypeId="1" content=”...”><row Id="136" PostTypeId="2" content=”...”> <row Id="985" PostTypeId="1" content=”...”>

“&#xD;&#xA;”&lt;pHow to use Mahout&lt;/p&gt;

How to use Mahout

Use Mahout'sXMLInputFormat

Page 26: Composing Mahout clustering jobs

Find collocations

Unigrams

● “Java”, “Ruby”

Bigrams

● “Continuous Integration”, “Unit Testing”

Page 27: Composing Mahout clustering jobs

Find collocations

● Use Mahout's CollocDriver

● Compute LogLikelihood Ratio (LLR) (Dunning)

● Select bigrams with high LLR

Page 28: Composing Mahout clustering jobs

Find collocations

Save collocations in Bloom Filter

[1, 0, 1, 0, 1, 1, 0, 0, ...]

Generate khash values for“Unit testing”

Add collocation “Unit Testing”

0, 2, 5, 4

Set bits in bitset

BloomFilter

Page 29: Composing Mahout clustering jobs

Vectorize

Lucene analyzer emits collocations

[ Unit testing => 1, … ][ Version control => 1, … ]

[1, 0, 1, 0, 1, 1, 0, 0, ...]

Generate khash values for“Unit testing”

0, 2, 5, 4

Check bits in bitset

Is “Unit Testing”a siginifcant colloc? Bloom

FilterTrue Can be

false positive!

Page 30: Composing Mahout clustering jobs

Cluster

KMeans

Page 31: Composing Mahout clustering jobs

Join clustered points

PostsSequence fle

[ (id), (title, content) ]

Clustered points

( id = 23 , )

( id = 51 , )

( id = 78 , )

( id = 23 , )

( id = 34 , )

Map-side join

[ (id), (title, content, clusterId) ]

Page 32: Composing Mahout clustering jobs

Cluster labels

[0.56, 0.32, 0.98, ...]

0 => Unit testing1 => Version control2 => Continuous integration...

Dictionary fle

Term with highest weight

0.98

Label “Continuous integration”

Cluster centroid

Page 33: Composing Mahout clustering jobs

Index

Index

[id,title,content,clusterId,clusterLabel]

View with web app & Solr

Page 34: Composing Mahout clustering jobs

Running the job on

Mahout job jar Amazon instances

Launch via Whirr

Submit via Java or CLI

Page 35: Composing Mahout clustering jobs

Apache Whirr

● Tool for launching clusters

● Whirr property fle

whirr.provider=aws-ec2

whirr.instance-templates= 1 hadoop-jobtracker+hadoop-namenode, 10 hadoop-datanode+hadoop-tasktracker

whirr.identity=topsecret

Page 36: Composing Mahout clustering jobs

Apache Whirr

Launch!

$ whirr launch-cluster \

--config so-cluster.properties

$ export HADOOP_CONF_DIR=.whirr/so-cluster

Run!

$ hadoop fs -put posts.xml input

$ mahout seq2sparse ...

Page 37: Composing Mahout clustering jobs

Whirr

Launch!

Configuration prop = new PropertiesConfiguration(whirrConfigFile);

ClusterSpec spec = new ClusterSpec(prop);

Service service = new Service();

Cluster cluster = service.launchCluster(clusterSpec);

Page 38: Composing Mahout clustering jobs

Whirr

Submit!

Configuration configuration = new Configuration();

configuration.addResource(new Path( “/home/frank/.whirr/so/hadoop-site.xml"));

Job job = new Job(conf);

job.submit();

Page 39: Composing Mahout clustering jobs

Demo Time!

Page 40: Composing Mahout clustering jobs

Conclusions

Page 41: Composing Mahout clustering jobs

References

http://blog.jteam.nl/author/frank

Mahout mailinglist

Page 42: Composing Mahout clustering jobs

Q&A