mahout classification presentation

Classification on Mahout. Naoki Nakatani, San Jose State University, CS185C Spring 2014

Upload: nackjoe828

Post on 27-Jan-2015



DESCRIPTION

These slides were presented in class on April 7th, 2014.

TRANSCRIPT

Page 1: Mahout classification presentation

Classification on Mahout
Naoki Nakatani
San Jose State University
CS185C Spring 2014

Page 2: Mahout classification presentation

Agenda

● Classification Overview
● Mahout Overview
  ○ Classification on Mahout
● Case Study with Demo
  ○ Problem Description
  ○ Working Environment
  ○ Data Preparation
  ○ ML Model Generation

Page 3: Mahout classification presentation

Classification?

● Classifying examples into a given set of categories
● Supervised learning
  ○ Prepare data
  ○ Build classifier (train & test)
  ○ Apply classifier to new data

http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg

Page 4: Mahout classification presentation

Mahout?

● Scalable machine learning library = can handle Big Data
● Runs on Hadoop, with data stored in HDFS
● Classification, Clustering, Collaborative Filtering, etc.

http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png

Page 5: Mahout classification presentation

Classification on Mahout?

Classifying examples into a given set of categories

Scalable machine learning library that can handle big data

Classifying big data into a given set of categories

Page 6: Mahout classification presentation

Case Study & Demo

Given question with title and body, can we automatically generate tags for it?

Title: Where can I find the LaTeX3 manual?

Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?

Tags: Documentation, latex3, expl3

Page 7: Mahout classification presentation

Dataset

File:
● TrainSmall.tsv

Fields:
● id, title, body, tags

Characteristics:
● Each question contains only one tag

[Figure: sample rows of the file, each shown as four quoted, comma-separated fields]

Page 8: Mahout classification presentation

Working Environment

● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.

Page 9: Mahout classification presentation

Prerequisite (Where are you?)

● You have the input TSV file at result > output-topfivetags.
● You are in the “result” directory in Terminal.
● The “hadoop” and “mahout” commands are working.

Page 10: Mahout classification presentation

Prepare Data

1. Convert the TSV file to Hadoop sequence file format. Specify the tag as the category. (Run TSVToSeq.java)

The output-tsvtoseq folder and the chunk-0 file are created.
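TSVToSeq.java itself is not shown in the slides, so the following Python sketch only illustrates the record layout such a conversion would plausibly produce. It assumes the four fields listed on the Dataset slide, no header row, and a key of the form "/<tag>/<id>", which is the layout Mahout's Naive Bayes label extraction reads the category from. The function name tsv_to_records is hypothetical, and the sketch builds in-memory pairs rather than writing a real Hadoop SequenceFile.

```python
import csv
import io


def tsv_to_records(tsv_text):
    """Turn TSV rows (id, title, body, tag) into (key, value) pairs.

    The key is "/<tag>/<id>" so the category can later be read from the key;
    the value is the text to classify (title followed by body).
    Assumption: the file has no header row.
    """
    records = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not have exactly four fields
        doc_id, title, body, tag = (field.strip() for field in row)
        records.append((f"/{tag}/{doc_id}", f"{title} {body}"))
    return records


# Made-up sample row based on the question shown in the case study.
sample = ("1\tWhere can I find the LaTeX3 manual?"
          "\tDoes anyone have a link?\tlatex3\n")
records = tsv_to_records(sample)
print(records)
```

In the real pipeline each pair would be written as a Text key and Text value into the chunk-0 sequence file.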

Page 11: Mahout classification presentation

Prepare Data

1. Make a directory in HDFS and upload chunk-0 (the sequence file) to that directory.

Page 12: Mahout classification presentation

hadoop fs -mkdir <directory>

Page 13: Mahout classification presentation

hadoop fs -put <source> <destination>

Page 14: Mahout classification presentation

Prepare Data

2. Transform questions into vectors. (mahout seq2sparse)

Page 15: Mahout classification presentation

mahout seq2sparse -i <input directory> -o <output directory>
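To make "transform questions into vectors" concrete, here is a minimal Python sketch of TF-IDF weighting, the kind of sparse term weighting seq2sparse produces. It uses the textbook formula tf * log(N / df), which is an assumption for illustration and not Mahout's exact formula; the tokenizer and the function names are hypothetical as well.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """docs maps a document key to its text.

    Returns (dictionary, vectors): dictionary maps term -> column index,
    vectors maps key -> sparse vector {column index: weight}.
    """
    tokenized = {key: tokenize(text) for key, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    dictionary = {term: i for i, term in enumerate(sorted(df))}
    n_docs = len(docs)
    vectors = {}
    for key, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[key] = {
            dictionary[term]: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return dictionary, vectors


# Made-up two-document corpus keyed in the "/<tag>/<id>" style.
docs = {
    "/latex3/1": "latex3 manual pdf manual",
    "/expl3/2": "expl3 syntax manual",
}
dictionary, vectors = tfidf_vectors(docs)
print(dictionary)
print(vectors)
```

A term that appears in every document, such as "manual" here, gets weight 0 under this formula, which is why common words contribute little to the classification.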

Page 16: Mahout classification presentation
Page 17: Mahout classification presentation

Prepare Data

3. Split data into
  a. Train set: to train the model
  b. Test set: to test the model

Page 18: Mahout classification presentation

mahout split \
  -i <input directory> \
  --trainingOutput <output dir to train> \
  --testOutput <output dir to test> \
  --randomSelectionPct <integer> \
  --overwrite \
  --sequenceFiles \
  -xm sequential
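The effect of a random percentage split can be sketched in a few lines of Python. This is an illustration under assumptions, not Mahout's implementation: the function name random_split, the fixed seed, and the rounding rule are all hypothetical choices.

```python
import random


def random_split(keys, test_pct, seed=42):
    """Hold out roughly test_pct percent of the items as the test set.

    Returns (train, test); every item lands in exactly one of the two lists
    and the original order is preserved within each list.
    """
    keys = list(keys)
    n_test = round(len(keys) * test_pct / 100)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    test_idx = set(rng.sample(range(len(keys)), n_test))
    train = [k for i, k in enumerate(keys) if i not in test_idx]
    test = [k for i, k in enumerate(keys) if i in test_idx]
    return train, test


# Made-up document keys.
keys = [f"/tag/{i}" for i in range(10)]
train, test = random_split(keys, 20)
print(len(train), len(test))
```

Keeping the two sets disjoint is the point of the step: the test set must contain only examples the model never saw during training.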

Page 19: Mahout classification presentation
Page 20: Mahout classification presentation

Build Classifier

1. Choose algorithm to use for classification

Available algorithms:

○ Naive Bayes
  ■ trainnb, testnb
  ■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
  ■ baumwelch, hmmpredict
  ■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
  ■ trainlogistic, testlogistic
  ■ org.apache.mahout.classifier.sgd
○ Random Forest
  ■ ?
  ■ ?

Page 21: Mahout classification presentation

Build Classifier (Naive Bayes)

2. Train & test model using train set

Accuracy should be high here, because the model is tested on the same data it was trained on.

Page 22: Mahout classification presentation

mahout trainnb \
  -i <dir to train vectors> \
  -el \
  -li <dir to put label index> \
  -o <dir to put model> \
  -ow \
  -c
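For intuition about what training produces, here is a small Python sketch of a standard multinomial Naive Bayes classifier with add-one smoothing. It is a swapped-in textbook variant for illustration: the command above uses -c, which selects Mahout's complementary variant, and Mahout trains on TF-IDF vectors rather than raw token lists. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples, alpha=1.0):
    """examples is a list of (label, tokens) pairs. Returns a model dict."""
    label_counts = Counter()           # number of documents per label
    term_counts = defaultdict(Counter)  # term frequencies per label
    vocab = set()
    for label, tokens in examples:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"label_counts": label_counts, "term_counts": term_counts,
            "vocab": vocab, "alpha": alpha}


def classify_nb(model, tokens):
    """Return the label with the highest log-probability score."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["term_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n_docs / total_docs)  # log prior
        for term in tokens:
            if term in model["vocab"]:  # ignore terms never seen in training
                score += math.log((counts[term] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data.
train = [
    ("latex3", ["latex3", "manual", "syntax"]),
    ("latex3", ["latex3", "packages"]),
    ("fonts", ["font", "bold", "italic"]),
    ("fonts", ["font", "size"]),
]
model = train_nb(train)
print(classify_nb(model, ["latex3", "manual"]))
```

The trained "model" is just the per-category counts, which is why it can be stored in a directory and reused later without the training data.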

Page 23: Mahout classification presentation
Page 24: Mahout classification presentation

mahout testnb \
  -i <dir to train vectors> \
  -m <dir to model> \
  -l <dir to label index> \
  -ow \
  -o <output dir> \
  -c

Page 25: Mahout classification presentation
Page 26: Mahout classification presentation

Build Classifier (Naive Bayes)

3. Test model using test set

Check if the accuracy is satisfactory
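Checking accuracy comes down to comparing predicted categories with the true ones. The Python sketch below computes overall accuracy and a confusion matrix from two parallel lists; the function name evaluate and the sample labels are made up for illustration and are not output from the demo.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion) for two equally long label lists.

    confusion maps (actual label, predicted label) -> count.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    accuracy = correct / len(actual) if actual else 0.0
    return accuracy, confusion


# Made-up labels: one "latex3" question is misclassified as "fonts".
actual = ["latex3", "latex3", "fonts", "fonts", "tables"]
predicted = ["latex3", "fonts", "fonts", "fonts", "tables"]
accuracy, confusion = evaluate(actual, predicted)
print(accuracy)
print(confusion)
```

The off-diagonal entries of the confusion matrix show which categories get mixed up, which is often more useful than the single accuracy number when deciding whether the result is satisfactory.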

Page 27: Mahout classification presentation
Page 28: Mahout classification presentation

Apply Classifier

What do you have at this point?
● model
● label index

You can start classifying new data! (Check this example)
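The two artifacts play different roles: the model scores a new document against every category, and the label index translates the winning category id back into a tag name. The Python sketch below shows only that last translation step. The function name best_label, the label-to-id mapping, and the score values are all hypothetical and are not taken from the demo.

```python
def best_label(scores, label_index):
    """Pick the tag with the highest score.

    scores: list of per-category scores, where position = category id.
    label_index: dict mapping tag name -> category id.
    """
    id_to_label = {i: label for label, i in label_index.items()}
    best_id = max(range(len(scores)), key=lambda i: scores[i])
    return id_to_label[best_id]


# Hypothetical label index and made-up scores for one new question.
label_index = {"documentation": 0, "expl3": 1, "latex3": 2}
scores = [-42.7, -51.3, -39.9]
print(best_label(scores, label_index))
```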


Page 30: Mahout classification presentation

Happy Machine Learning!