pmuthoju_presentation.ppt

Posted on 11-Jun-2015


TRANSCRIPT

1

Automatic Document Categorization using Support Vector Machines

Prashanth Kumar Muthoju (pmuthoju@cs.odu.edu)

Advisor: Dr. Zubair

2

Overview

Introduction
Problem
Proposed Solution
Improvements
Results
Future Work
Conclusion
References

3

Introduction: What is Categorization?

Sorting a set of documents into categories from a predefined set.

Assigning a document to a category based on its contents.

4

Introduction (cont'd)

Types of Categorization:

Manual
Automatic (Machine Learning):
Probabilistic (e.g., Naïve Bayesian)
Decision Structures (e.g., Decision Trees)
Support Vector Machines (SVM)

5

Introduction (cont'd)

Why 'Automation'? Manual categorization:

needs a large number of human resources
is expensive
is time consuming

6

Introduction (cont'd)

Applications of Automatic Categorization:

Indexing of scientific articles
Spam filtering of e-mails
Authorship attribution

7

Problem

The DTIC document base has to be categorized into 25 fields (broad) and 251 groups (narrow).

Fields/Groups are listed at http://www.dtic.mil/trail/fieldgrp.html

8

Towards the solution ..

Strategy: Exploit an existing collection of already-categorized documents.

One portion is used as the training set; the other portion is used as the testing set.
Allow tuning of the classifier to yield maximum effectiveness.

9

Towards the solution ..

What is a Support Vector Machine?

A binary classifier. It finds the maximum-margin hyperplane separating the two classes, and subsequently classifies items based on which side of that dividing line they fall.
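The decision step just described can be sketched as follows. This is a minimal illustration, not the project's code: the weight vector, bias, and sample points are made-up assumptions, whereas a trained SVM would learn w and b from the training data.

```python
# Minimal sketch of the SVM decision step: once a separating
# hyperplane w.x + b = 0 has been learned, an item is classified
# by which side of that hyperplane it falls on.
# The weights, bias, and points below are illustrative assumptions.

def classify(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [1.0, 1.0]   # hypothetical learned weight vector
b = -1.0         # hypothetical learned bias

print(classify([2.0, 2.0], w, b))   # prints 1: falls on the positive side
print(classify([0.0, 0.0], w, b))   # prints -1: falls on the negative side
```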

10

Towards the solution ..

Why is SVM chosen for Automatic Categorization?

Prior studies have suggested good results with SVM.
Relatively immune to 'overfitting' (fitting to coincidental relations encountered during training).

11

Towards the solution ..

SVM Library: LibSVM 2.85 (Java interface)

12

Solution

Before we can train the SVM using LibSVM for a Field/Group, we have to prepare a dataset for that Field/Group.

Each file is represented as a sparse vector:

<label> <feature1>:<value1> <feature2>:<value2> ...

<label> is 1 for a positive file and 0 for a negative file.
Each <feature>:<value> pair is a <word>:<tfidf> pair.
(Common words are eliminated before preparing the data set.)
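The sparse representation above can be sketched like this. The tiny vocabulary, document frequencies, and the particular tf-idf variant (raw term count times ln(N/df)) are illustrative assumptions; the slides do not specify the exact weighting or stop-word list the project used.

```python
import math

# Sketch: turn a document's words into the LibSVM-style sparse line
#   <label> <feature>:<tfidf> <feature>:<tfidf> ...
# Feature ids and the tf-idf variant here are illustrative assumptions.

STOP_WORDS = {"the", "a", "of"}   # "common words are eliminated"

def tfidf_line(label, doc_words, df, n_docs, vocab):
    counts = {}
    for w in doc_words:
        if w not in STOP_WORDS:
            counts[w] = counts.get(w, 0) + 1
    parts = [str(label)]
    # LibSVM expects feature ids in increasing order
    for w in sorted(counts, key=lambda w: vocab[w]):
        tfidf = counts[w] * math.log(n_docs / df[w])
        parts.append(f"{vocab[w]}:{tfidf:.3f}")
    return " ".join(parts)

vocab = {"radar": 1, "antenna": 2}   # word -> feature id
df = {"radar": 5, "antenna": 20}     # document frequencies in the corpus
print(tfidf_line(1, ["the", "radar", "radar", "antenna"], df, 100, vocab))
```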

13

Solution

For each Field/Group K, the following procedure is repeated (training phase):

Download documents (PDF) via the Collection Model by Dr. Zeil
Convert PDF to text
Model the documents using TF and IDF
Build the positive training set and the negative training set for Field/Group K
Train the SVM for Field/Group K
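The per-category training loop can be sketched as below. The `corpus` structure and the `train_svm` callable are hypothetical stand-ins (the project called into LibSVM); the sketch only makes the positive/negative set construction concrete.

```python
# Sketch of the training phase: one binary SVM per field/group.
# `corpus` maps each field/group to its (already text-converted,
# tf-idf modeled) document vectors; `train_svm` is a hypothetical
# stand-in for a call into LibSVM.

def build_training_set(corpus, group):
    positives = [(1, vec) for vec in corpus[group]]
    negatives = [(0, vec)
                 for g, vecs in corpus.items() if g != group
                 for vec in vecs]
    return positives + negatives

def train_all(corpus, train_svm):
    """Return one trained model per field/group."""
    return {g: train_svm(build_training_set(corpus, g)) for g in corpus}

corpus = {"140200": [{1: 0.5}], "120200": [{2: 0.9}]}   # toy vectors
models = train_all(corpus, train_svm=lambda data: len(data))
```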

14

Solution

(Testing Phase)

Input test document (PDF)
Convert PDF to text
Model the document using TF and IDF
Run it through each trained SVM (Field/Group 1 .. K .. N)

Each trained SVM produces an estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document.
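One common way to turn an SVM's raw margin score into the 0-to-1 estimate described above is a sigmoid (Platt-style) mapping. Whether the project used LibSVM's built-in probability estimates or some other mapping is not stated, so the sigmoid below is an illustrative assumption, and the per-category decision functions are hypothetical.

```python
import math

# Sketch of the testing phase: a test document's vector is scored by
# every trained per-category SVM, and each raw margin score is squashed
# into [0, 1] with a sigmoid (an assumption; LibSVM can also produce
# probability estimates directly).

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def estimates(doc_vec, models):
    """models: field/group -> decision function returning a raw score."""
    return {group: sigmoid(f(doc_vec)) for group, f in models.items()}

models = {                      # hypothetical decision functions
    "140200": lambda x: 2.0,    # strongly positive margin
    "120200": lambda x: -2.0,   # strongly negative margin
}
print(estimates({}, models))    # high estimate for 140200, low for 120200
```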

15

Improving the results

Scaling the vectors in the datasets, so that the <value>s in the <feature>:<value> pairs fall between 0 and 1.
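The scaling step can be sketched as per-feature min-max scaling over the training set. The slides only say the values are made to fall between 0 and 1, so the exact formula below is an assumption (LibSVM's own svm-scale tool does the same kind of scaling, defaulting to a [-1, 1] range).

```python
# Sketch of scaling each feature's values into [0, 1] (min-max scaling,
# with ranges computed over the training set). The exact formula is an
# assumption; the slides only state the target range.

def fit_ranges(dataset):
    """dataset: list of {feature_id: value} sparse vectors."""
    lo, hi = {}, {}
    for vec in dataset:
        for f, v in vec.items():
            lo[f] = min(v, lo.get(f, v))
            hi[f] = max(v, hi.get(f, v))
    return lo, hi

def scale(vec, lo, hi):
    return {f: (v - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
            for f, v in vec.items()}

data = [{1: 2.0, 2: 10.0}, {1: 4.0, 2: 30.0}]
lo, hi = fit_ranges(data)
print([scale(v, lo, hi) for v in data])   # every value now in [0, 1]
```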

16

Experiment

Randomly selected 5 Field/Groups: 140200, 120200, 201300, 220200, 250400.

For each field/group, 70 PDF files were downloaded:

50 files were used as positive files for training
20 files were used for testing

An additional 50 files were taken randomly from all other field/groups as negative files for training.

17

Experiment

Metrics:

Recall = #Correct Answers / #Total Possible Answers
Precision = #Correct Answers / #Answers Produced

18

Results

Confusion matrix (rows: actual category; columns: predicted category; each row sums to the 20 test files):

          140200  120200  201300  220200  250400
140200        13       2       1       2       2
120200         1      16       0       3       0
201300         0       5      13       2       0
220200         1       0       2      17       0
250400         0       0       1       0      19
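Reading the matrix above with rows as actual categories and columns as predicted ones, the per-category precision and recall on the next slide follow directly (correct answers on the diagonal, answers produced as column sums, possible answers as row sums):

```python
# Recompute per-category precision and recall from the confusion
# matrix above (rows = actual category, columns = predicted).

cats = ["140200", "120200", "201300", "220200", "250400"]
M = [
    [13,  2,  1,  2,  2],
    [ 1, 16,  0,  3,  0],
    [ 0,  5, 13,  2,  0],
    [ 1,  0,  2, 17,  0],
    [ 0,  0,  1,  0, 19],
]

def precision_recall(M):
    out = {}
    for i, cat in enumerate(cats):
        correct = M[i][i]
        produced = sum(row[i] for row in M)   # column sum: answers produced
        possible = sum(M[i])                  # row sum: total possible answers
        out[cat] = (round(correct / produced, 2), round(correct / possible, 2))
    return out

print(precision_recall(M))   # matches the Precision/Recall table that follows
```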

19

Results (cont'd)

Category  Precision  Recall

140200 0.87 0.65

120200 0.70 0.80

201300 0.76 0.65

220200 0.71 0.85

250400 0.90 0.95

20

Future Work

Hierarchical Model

Example hierarchy:

150000
  150300
    150301
    150302
  150600
    150601
    150602

In the flat model, we consider each field/group independent.

In the hierarchical model, we consider all files under a branch as positive files for training.

21

Future Work

Multi-Label classification

In practice, each document may belong to multiple field/groups.

22

Conclusion

The classification results of DTIC documents based on Field/Groups were impressive.

Ways to improve the results have been identified.

A couple of suggestions were given for future work in this particular area.

References

Sebastiani, F. (2002). "Machine learning in automated text categorization." ACM Computing Surveys, 34(1), pp. 1-47.

Joachims, T. (1998). "Text categorization with support vector machines: learning with many relevant features." (http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf)

Kwok, J. T. (1998). "Automated text categorization using support vector machine." In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.

23
