
Reading Review

Shawn Tice
[email protected]

22 October 2012

Definitions

Classification is the task of assigning an observation to one of two or more categories in a manner consistent with a training set of observations whose categories are known.

An observation is characterized by a collection of features, which may be categorical, integer-valued, or real-valued.

For Example

Cap Shape   Cap Color   Class
Convex      Brown       Edible
Convex      Yellow      Poisonous
Bell        White       Edible

Classification is a supervised learning problem. The unsupervised equivalent is clustering.
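To make the representation concrete, here is a minimal Python sketch (illustrative only, not the planned PHP implementation) of the mushroom training set above as feature/label pairs, with a trivial majority-class baseline standing in for a real classifier:

    from collections import Counter

    # The training set from the example slide: each observation is a dict of
    # categorical features paired with its known class label.
    training_set = [
        ({"cap_shape": "Convex", "cap_color": "Brown"},  "Edible"),
        ({"cap_shape": "Convex", "cap_color": "Yellow"}, "Poisonous"),
        ({"cap_shape": "Bell",   "cap_color": "White"},  "Edible"),
    ]

    def majority_class(examples):
        """A trivial baseline that ignores the features entirely and always
        predicts the most common class in the training set."""
        labels = [label for _, label in examples]
        return Counter(labels).most_common(1)[0][0]

    print(majority_class(training_set))  # -> "Edible"

The sketches for the individual algorithms below all consume a training set in roughly this form.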

Applications

- Optical Character Recognition
- Speech Recognition
- Diagnosis
- Document Classification
- ...

Features of Classification Algorithms

Expressive: What kinds of decision boundaries can the classifier express? For example, linear versus polynomial.

Robust to Noise: Does noise in the training data significantly degrade performance?

Robust to Feature Choice: Does choosing irrelevant features degrade performance?

High Dimensionality: Does a large number of dimensions degrade performance?

Features of Classification Algorithms

Globally Optimal: Does the training method result in an optimal classifier for the training set?

Efficient Training: Is training the classifier on a large training set efficient?

Efficient Classification: Can the classifier efficiently classify an observation?

Decision Tree

Induce a tree of predicates to iteratively divide the feature space until a single class is reached.

Expressive: Rectilinear
Robust to Noise: Yes
Robust to Feature Choice: Yes
Many Features: No
Globally Optimal: No
Efficient Training: Yes
Efficient Classification: Yes
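A minimal Python sketch of the induction idea above, assuming categorical features and a simple misclassification-count impurity measure (real implementations typically use information gain or the Gini index):

    from collections import Counter

    def best_feature(examples, features):
        """Pick the feature whose value groups would be misclassified least
        often if each group simply predicted its own majority class."""
        def errors(feature):
            groups = {}
            for x, y in examples:
                groups.setdefault(x[feature], []).append(y)
            return sum(len(ys) - Counter(ys).most_common(1)[0][1]
                       for ys in groups.values())
        return min(features, key=errors)

    def build_tree(examples, features):
        labels = [y for _, y in examples]
        majority = Counter(labels).most_common(1)[0][0]
        # Stop when the node is pure or there is nothing left to split on.
        if len(set(labels)) == 1 or not features:
            return majority
        feature = best_feature(examples, features)
        remaining = [f for f in features if f != feature]
        branches = {value: build_tree([(x, y) for x, y in examples
                                       if x[feature] == value], remaining)
                    for value in set(x[feature] for x, _ in examples)}
        return (feature, branches, majority)

    def classify(tree, x):
        while isinstance(tree, tuple):
            feature, branches, majority = tree
            tree = branches.get(x[feature], majority)  # unseen value -> majority
        return tree

    training_set = [
        ({"cap_shape": "Convex", "cap_color": "Brown"},  "Edible"),
        ({"cap_shape": "Convex", "cap_color": "Yellow"}, "Poisonous"),
        ({"cap_shape": "Bell",   "cap_color": "White"},  "Edible"),
    ]
    tree = build_tree(training_set, ["cap_shape", "cap_color"])
    print(classify(tree, {"cap_shape": "Convex", "cap_color": "Brown"}))  # "Edible"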

Nearest-Neighbor

Assign the majority class of the k nearest neighbors in the training set according to some distance measure.

Expressive: Very
Robust to Noise: No
Robust to Feature Choice: No
Many Features: No
Globally Optimal: No
Efficient Training: Yes
Efficient Classification: No
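A minimal Python sketch of k-nearest-neighbor classification, assuming categorical features compared with Hamming distance (any distance measure could be substituted):

    from collections import Counter

    def hamming(x1, x2):
        """Distance for categorical features: the number of features that differ."""
        return sum(1 for f in x1 if x1[f] != x2[f])

    def knn_classify(training_set, x, k=3, distance=hamming):
        """Assign the majority class of the k nearest training observations."""
        neighbors = sorted(training_set, key=lambda ex: distance(ex[0], x))
        votes = Counter(label for _, label in neighbors[:k])
        return votes.most_common(1)[0][0]

    training_set = [
        ({"cap_shape": "Convex", "cap_color": "Brown"},  "Edible"),
        ({"cap_shape": "Convex", "cap_color": "Yellow"}, "Poisonous"),
        ({"cap_shape": "Bell",   "cap_color": "White"},  "Edible"),
    ]
    print(knn_classify(training_set, {"cap_shape": "Bell", "cap_color": "Brown"}, k=1))

Note that training is just storing the examples, which is why training is efficient but classification is not.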

Bayesian (Naïve)

Assign to an observation x the class y that maximizes the posterior probability P(y|x). Assume the features are conditionally independent given y.

Expressive: Yes
Robust to Noise: Yes
Robust to Feature Choice: Partially
Many Features: No
Globally Optimal: Yes
Efficient Training: Yes
Efficient Classification: Yes
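A minimal Python sketch of naïve Bayes for categorical features, with add-one (Laplace) smoothing so an unseen feature value doesn't zero out a class's posterior:

    import math
    from collections import Counter, defaultdict

    def train_naive_bayes(examples):
        """Estimate P(y) and P(feature = value | y) with add-one smoothing."""
        class_counts = Counter(y for _, y in examples)
        value_counts = defaultdict(Counter)   # (feature, class) -> value counts
        feature_values = defaultdict(set)     # feature -> set of observed values
        for x, y in examples:
            for f, v in x.items():
                value_counts[(f, y)][v] += 1
                feature_values[f].add(v)
        return class_counts, value_counts, feature_values, len(examples)

    def classify(model, x):
        class_counts, value_counts, feature_values, n = model
        best_class, best_score = None, float("-inf")
        for y, ny in class_counts.items():
            # Work in log space so many small probabilities don't underflow.
            score = math.log(ny / n)
            for f, v in x.items():
                count = value_counts[(f, y)][v]
                score += math.log((count + 1) / (ny + len(feature_values[f])))
            if score > best_score:
                best_class, best_score = y, score
        return best_class

    training_set = [
        ({"cap_shape": "Convex", "cap_color": "Brown"},  "Edible"),
        ({"cap_shape": "Convex", "cap_color": "Yellow"}, "Poisonous"),
        ({"cap_shape": "Bell",   "cap_color": "White"},  "Edible"),
    ]
    model = train_naive_bayes(training_set)
    print(classify(model, {"cap_shape": "Convex", "cap_color": "White"}))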

Artificial Neural Network

Construct a network of input, hidden, and output nodes, where hidden nodes weight, aggregate, and filter features from input nodes, and output nodes assign a class.

Expressive: Very
Robust to Noise: No
Robust to Feature Choice: Yes
Many Features: No
Globally Optimal: No
Efficient Training: No
Efficient Classification: Yes
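A minimal Python sketch of a forward pass through a network with one hidden layer. The weights here are set by hand purely for illustration; training them (typically by backpropagation) is the expensive part and is not shown:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, hidden_layer, output_node):
        """Each hidden node weights and aggregates the inputs and squashes the
        sum; the output node does the same over the hidden activations and
        yields a class score in (0, 1)."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
                  for weights, bias in hidden_layer]
        weights, bias = output_node
        return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)

    # Hand-set, hypothetical weights: two hidden nodes over two input features.
    hidden_layer = [([2.0, -1.0], 0.5), ([-1.5, 1.0], 0.0)]
    output_node = ([1.0, -1.0], 0.0)

    score = forward([0.3, 0.7], hidden_layer, output_node)
    print("class A" if score >= 0.5 else "class B", round(score, 3))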

Support Vector Machine

Induce a maximal-margin hyperplane that divides the observations in high-dimensional space. Observations on one side belong to one class, and observations on the other side belong to the other class.

Expressive: Yes
Robust to Noise: Yes
Robust to Feature Choice: Yes
Many Features: Yes
Globally Optimal: Yes
Efficient Training: Yes
Efficient Classification: Yes
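A simplified Python sketch that fits a linear separating hyperplane by stochastic subgradient descent on the hinge loss (a Pegasos-style stand-in for a real SVM solver; no kernels, and the toy data are hypothetical):

    import random

    def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
        """Learn w and b for the decision rule sign(w . x + b).
        Labels must be -1 or +1."""
        rng = random.Random(seed)
        w, b, t = [0.0] * len(X[0]), 0.0, 0
        for _ in range(epochs):
            for i in rng.sample(range(len(X)), len(X)):
                t += 1
                eta = 1.0 / (lam * t)                    # decaying step size
                margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
                w = [(1 - eta * lam) * wj for wj in w]   # regularization shrink
                if margin < 1:                           # margin violation
                    w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                    b += eta * y[i]
        return w, b

    def classify(w, b, x):
        return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

    X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]]
    y = [1, 1, -1, -1]
    w, b = train_linear_svm(X, y)
    print([classify(w, b, x) for x in X])  # should recover [1, 1, -1, -1]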

Multiple Classes

Some classifiers (e.g. SVMs) only handle binary classification.

One-vs-Rest: Train one binary classifier per class that decides between that class and all of the others. Run every classifier on the observation and pick the class whose classifier is most confident.

One-vs-One: Train one classifier per pair of classes; classify by running every pairwise classifier and assigning the class that wins the most pairwise votes.
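A minimal Python sketch of the one-vs-rest scheme described above; the centroid-based binary learner is just a hypothetical stand-in for whatever base classifier is actually used:

    def train_one_vs_rest(examples, train_binary):
        """Train one binary classifier per class: positives are that class,
        negatives are everything else."""
        classes = sorted(set(y for _, y in examples))
        return {c: train_binary([(x, 1 if y == c else -1) for x, y in examples])
                for c in classes}

    def classify_one_vs_rest(models, score, x):
        """Run every per-class classifier and pick the most confident one."""
        return max(models, key=lambda c: score(models[c], x))

    # Stand-in binary learner: score by closeness to the positive-class
    # centroid relative to the negative-class centroid.
    def train_centroid(examples):
        def centroid(label):
            xs = [x for x, y in examples if y == label]
            return [sum(col) / len(xs) for col in zip(*xs)]
        return centroid(1), centroid(-1)

    def centroid_score(model, x):
        pos, neg = model
        dist = lambda a: sum((ai - xi) ** 2 for ai, xi in zip(a, x))
        return dist(neg) - dist(pos)  # larger means closer to the positive centroid

    examples = [([0.0, 0.0], "a"), ([0.1, 0.2], "a"),
                ([5.0, 5.0], "b"), ([5.2, 4.8], "b"),
                ([0.0, 5.0], "c"), ([0.2, 5.1], "c")]
    models = train_one_vs_rest(examples, train_centroid)
    print(classify_one_vs_rest(models, centroid_score, [4.9, 5.1]))  # "b"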

Multiple Classes

Error-correcting output codes allow for binary classification errors. Assign a unique bit string (codeword) of length n to each possible class, and train n classifiers, one to predict each bit.

The predicted class is the one whose codeword has the smallest Hamming distance to the constructed bit string.

If the Hamming distance between any pair of codewords is at least d, then up to ⌊(d − 1)/2⌋ errors in the output code can be corrected.
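A minimal Python decoding sketch with hypothetical codewords; the minimum pairwise Hamming distance below is 3, so a single wrong bit is corrected:

    # Hypothetical 5-bit codewords; any codewords with a large pairwise
    # Hamming distance would do.
    codewords = {
        "sports":   [0, 0, 0, 0, 0],
        "politics": [1, 1, 1, 0, 0],
        "science":  [0, 0, 1, 1, 1],
    }

    def hamming(a, b):
        return sum(1 for ai, bi in zip(a, b) if ai != bi)

    def decode(predicted_bits):
        """Each bit comes from its own binary classifier; assign the class
        whose codeword is nearest in Hamming distance."""
        return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))

    # The five bit classifiers output 1, 1, 0, 0, 0 (one bit flipped from the
    # "politics" codeword); decoding still recovers "politics".
    print(decode([1, 1, 0, 0, 0]))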

Ensemble Methods

Improve classification accuracy by training multiple classifiers and aggregating their outputs.

- Use different parameterizations of the base classifier
- Manipulate the training set (bagging and boosting)
- Manipulate the class labels (error-correcting output coding)

Ensemble Methods

The most intuitive (to me) approach is to train different parameterizations of a base classifier on the entire training set, then run each classifier on an observation and take the weighted vote of all the classifiers.

A classifier's weight is determined by its predicted accuracy.
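A minimal Python sketch of the weighted vote; the classifiers and weights are hypothetical, with each weight standing in for a classifier's estimated accuracy:

    from collections import defaultdict

    def weighted_vote(classifiers, x):
        """Each (classify, weight) pair casts a weighted vote for a class."""
        totals = defaultdict(float)
        for classify, weight in classifiers:
            totals[classify(x)] += weight
        return max(totals, key=totals.get)

    # Three hypothetical classifiers that disagree on this observation; the
    # two lower-weight classifiers together outvote the single best one.
    classifiers = [
        (lambda x: "spam",     0.9),
        (lambda x: "not spam", 0.6),
        (lambda x: "not spam", 0.55),
    ]
    print(weighted_vote(classifiers, {"subject": "free money"}))  # "not spam"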

Ensemble Methods

Bagging trains a new classifier on each of several bootstrap samples of the training set, where each sample is drawn from the training set with replacement.

Boosting is similar, but each observation carries a sampling weight that increases or decreases in successive rounds, depending on whether the classifier trained in the current round classifies it correctly.
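A minimal Python bagging sketch using bootstrap samples and a trivial 1-nearest-neighbor base learner (boosting's round-by-round reweighting is not shown):

    import random
    from collections import Counter

    def bagging_train(examples, train, n_classifiers=10, seed=0):
        """Train each classifier on a bootstrap sample: len(examples)
        observations drawn from the training set with replacement."""
        rng = random.Random(seed)
        return [train([rng.choice(examples) for _ in examples])
                for _ in range(n_classifiers)]

    def bagging_classify(models, classify, x):
        votes = Counter(classify(m, x) for m in models)
        return votes.most_common(1)[0][0]

    # A tiny 1-nearest-neighbor base learner so the sketch runs end to end.
    def train_1nn(sample):
        return sample

    def classify_1nn(model, x):
        nearest = min(model, key=lambda ex: sum((a - b) ** 2
                                                for a, b in zip(ex[0], x)))
        return nearest[1]

    examples = [([0.0, 0.0], "a"), ([0.2, 0.1], "a"),
                ([1.0, 1.0], "b"), ([0.9, 1.1], "b")]
    models = bagging_train(examples, train_1nn)
    print(bagging_classify(models, classify_1nn, [0.1, 0.1]))  # almost surely "a"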

Fusion and Metalearning

Summary: Büttcher et al. show that when combining the results of several different learners it is usually possible to marginally improve on the performance of the best single learner.

Two approaches: Fusion and Stacking.

Fusion

Fusion combines multiple lists of results into a single list according to a scoring formula.

Reported improvements (p381) to P@10 and MAP are around ±0.03 (but more often +0.03).

Fusion is designed for improving search results, but could perhaps be applied to soft classification.
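A Python sketch of one common fusion formula, CombMNZ, in which a document's fused score is the sum of its normalized scores multiplied by the number of lists that returned it. This is a generic illustration rather than necessarily the formula behind the numbers reported on p381:

    def normalize(results):
        """Scale one run's scores to [0, 1] so different runs are comparable."""
        scores = list(results.values())
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in results.items()}

    def comb_mnz(result_lists):
        fused = {}
        for results in map(normalize, result_lists):
            for doc, score in results.items():
                total, hits = fused.get(doc, (0.0, 0))
                fused[doc] = (total + score, hits + 1)
        # Rank by (sum of normalized scores) * (number of runs returning the doc).
        return sorted(fused, key=lambda d: fused[d][0] * fused[d][1], reverse=True)

    # Two hypothetical runs: document id -> retrieval score.
    run_a = {"d1": 12.0, "d2": 9.5, "d3": 4.0}
    run_b = {"d2": 0.88, "d4": 0.71, "d1": 0.10}
    print(comb_mnz([run_a, run_b]))  # documents found by both runs rise to the top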

Stacking

Stacking combines the results of several classifiers by transforming their scores for training documents into log-odds, then applying a learning method (e.g. regression) to the transformed scores.

Reported improvements (p383) are substantial: as much as a +0.13 improvement over the best individual classifier on average precision when using cross-validation.
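A minimal Python stacking sketch: the base classifiers' scores are transformed into log-odds, and a logistic-regression combiner is fit over them by plain gradient descent (the scores and relevance labels are hypothetical):

    import math

    def log_odds(p, eps=1e-6):
        """Turn a probability-like score into log-odds, clamping away 0 and 1."""
        p = min(max(p, eps), 1 - eps)
        return math.log(p / (1 - p))

    def train_stacker(score_rows, labels, lr=0.1, epochs=2000):
        """Logistic regression over the base classifiers' log-odds scores.
        Labels are 0/1 relevance judgements for the training documents."""
        rows = [[log_odds(s) for s in row] for row in score_rows]
        w, b = [0.0] * len(rows[0]), 0.0
        for _ in range(epochs):
            for x, y in zip(rows, labels):
                p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
                err = p - y
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
                b -= lr * err
        return w, b

    def stacked_score(model, scores):
        w, b = model
        x = [log_odds(s) for s in scores]
        return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

    # Each row holds two base classifiers' scores for one training document.
    score_rows = [[0.9, 0.6], [0.8, 0.7], [0.3, 0.4], [0.2, 0.5], [0.7, 0.2], [0.1, 0.1]]
    labels     = [1,          1,          0,          0,          1,          0]
    model = train_stacker(score_rows, labels)
    print(round(stacked_score(model, [0.85, 0.55]), 3))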

Relevance Feedback

Relevance feedback attempts to improve an initial query by using documents marked as relevant by the user to expand the query.

We can use relevance feedback term selection techniques to build a query from an initial document in order to find other similar documents in an inverted index.

The two basic models are vector-based and probability-based.

Vector-Based Relevance Feedback

The method proposed by Ide (1971), called the Ide dec-hi formula, modifies a query based on relevance information:

    Q_1 = Q_0 + \sum_{i=1}^{n_r} r_i - s

where the r_i are the vectors of the n_r known relevant documents and s is the vector for "the first non-relevant document".
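A minimal Python sketch of the dec-hi update over sparse term-weight vectors; dropping negative weights at the end is a common convention rather than part of the formula itself:

    from collections import defaultdict

    def ide_dec_hi(query, relevant_docs, first_nonrelevant):
        """Add every relevant document vector to the query and subtract the
        highest-ranked non-relevant one. Vectors are sparse term -> weight dicts."""
        new_query = defaultdict(float, query)
        for doc in relevant_docs:
            for term, weight in doc.items():
                new_query[term] += weight
        for term, weight in first_nonrelevant.items():
            new_query[term] -= weight
        return {t: w for t, w in new_query.items() if w > 0}

    # Hypothetical sparse vectors.
    q0 = {"mushroom": 1.0, "edible": 1.0}
    relevant = [{"mushroom": 0.8, "cap": 0.5}, {"edible": 0.6, "forage": 0.4}]
    first_nonrelevant = {"mushroom": 0.2, "recipe": 0.9}
    print(ide_dec_hi(q0, relevant, first_nonrelevant))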

Probability-Based Relevance Feedback

Büttcher et al. develop a term weight related to IDF and the F4 measure:

    w_t = \log \frac{(n_{t,r} + 0.5)(N - N_t - n_r + n_{t,r} + 0.5)}{(n_r - n_{t,r} + 0.5)(N_t - n_{t,r} + 0.5)}

and use it to derive a term selection value:

    n_{t,r} \cdot w_t

where N is the total number of documents, N_t the number containing term t, n_r the number of known relevant documents, and n_{t,r} the number of relevant documents containing t.
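A minimal Python sketch that computes w_t and ranks candidate expansion terms by the selection value n_{t,r} * w_t; the document statistics are hypothetical:

    import math

    def term_weight(n_tr, n_r, N_t, N):
        """w_t as defined above: n_tr relevant documents contain the term,
        n_r relevant documents total, N_t documents contain the term,
        N documents overall."""
        return math.log(((n_tr + 0.5) * (N - N_t - n_r + n_tr + 0.5)) /
                        ((n_r - n_tr + 0.5) * (N_t - n_tr + 0.5)))

    def select_terms(term_stats, n_r, N, k=10):
        """Rank candidate terms by n_tr * w_t and keep the top k (both sources
        suggest roughly 10 terms)."""
        scored = [(n_tr * term_weight(n_tr, n_r, N_t, N), term)
                  for term, (n_tr, N_t) in term_stats.items()]
        return [term for value, term in sorted(scored, reverse=True)[:k]]

    # term -> (relevant documents containing it, documents containing it)
    term_stats = {"mycology": (8, 40), "edible": (6, 300), "recipe": (2, 5000)}
    print(select_terms(term_stats, n_r=10, N=100000, k=2))  # top two terms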

Relevance Feedback

Ruthven and Lalmas report that vector-based RF methods perform better, but don't actually provide numbers; Büttcher et al. only present the probabilistic approach.

Both sources suggest selecting around 10 terms, but don't provide much explanation.

Robertson and Walker (1999) present a method for estimating a cutoff threshold based on the probability of accepting a noise term.

Plans

Many machine learning algorithms have been successfully applied to text categorization.

In the literature, SVMs have consistently outperformed other techniques, but logistic regression works nearly as well and has much more efficient implementations.

Stacking and metalearning techniques can be used to combine several classifiers, and have been shown to improve on the best performance of any single constituent classifier.

Plans

Since we'll be implementing our algorithms in PHP without new C extensions, performance will be more important than achieving maximum accuracy.

So I'll experiment with algorithms that trade off some potential accuracy for more efficient implementations: Naive Bayes, logistic regression, and perhaps one other algorithm if time permits (e.g. a neural network with one or more hidden layers, or decision trees).

I’ll also experiment with boosting.

Plans

Because we'll be bootstrapping the training set, I suspect our learning algorithms may initially perform very poorly.

Thus I may need to investigate feature selection techniques for reducing the size of the feature set (terms).

Plans

For relevance feedback, both vector-based and probability-based techniques appear relatively easy to implement.

So I'll implement both and see which returns more relevant documents on average.

References

Büttcher, S., Clarke, C., and Cormack, G. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.

Joachims, T. Text categorization with support vector machines. Technical Report LS VIII Number 23, University of Dortmund, 1997.

Ruthven, I., and Lalmas, M. A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review, 18(2):95-145, 2003.

Tan, P.-N., Steinbach, M., and Kumar, V. Introduction to Data Mining. Addison Wesley, 2006.