1
Automatic Document Categorization using Support
Vector Machines
Prashanth Kumar Muthoju (pmuthoju@cs.odu.edu)
Advisor: Dr. Zubair
2
Overview
- Introduction
- Problem
- Proposed Solution
- Improvements
- Results
- Future Work
- Conclusion
- References
3
Introduction
What is Categorization?
- Sorting a set of documents into categories from a predefined set. [link]
- Assigning a document to a category based on its contents.
4
Introduction ... Cont'd
Types of Categorization:
- Manual
- Automatic (Machine Learning)
  - Probabilistic (e.g., Naïve Bayesian)
  - Decision Structures (e.g., Decision Trees)
  - Support Vector Machines (e.g., SVM)
5
Introduction ... Cont'd
Why 'Automation'?
Manual categorization:
- needs a large number of human resources
- is expensive
- is time consuming
6
Introduction ... Cont'd
Applications of Automatic Categorization:
- Indexing of scientific articles
- Spam filtering of e-mails
- Authorship attribution
7
Problem
The DTIC document base has to be categorized into 25 fields (broad) and 251 groups (narrow).
Fields/Groups are listed here:
http://www.dtic.mil/trail/fieldgrp.html
8
Towards the solution ..
Strategy:
- Exploit an existing collection with categorized documents
- A portion is used as the training set
- The other portion is used as the testing set
- Allow tuning of the classifier to yield maximum effectiveness
9
Towards the solution ..
What is a Support Vector Machine?
- A binary classifier
- Finds the maximum-margin hyperplane separating the two classes
- Subsequently classifies items based on which side of the hyperplane they fall
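The decision rule described above can be sketched as follows. This is a minimal illustration of how a trained linear classifier assigns a side of the hyperplane; the weights and bias here are hypothetical, not produced by any actual SVM training:

```java
// Minimal sketch of a linear decision rule: once trained, an SVM
// classifies a point by which side of the separating hyperplane it
// falls on. Weights w and bias b below stand in for learned values.
public class LinearDecision {
    // sign(w . x + b): +1 on one side of the hyperplane, -1 on the other
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0 ? 1 : -1;
    }
}
```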
10
Towards the solution ..
Why is SVM chosen for Automatic Categorization?
- Prior studies have suggested good results with SVM
- Relatively immune to 'overfitting' (fitting to coincidental relations encountered during training)
11
Towards the solution ..
SVM Library: LibSVM 2.85 (Java)
12
Solution
Before we can train the SVM using LibSVM for a Field/Group, we have to prepare a dataset for that Field/Group.
Each file is represented by:
<label> <feature1>:<value1> <feature2>:<value2> ...
(Sparse vector representation)
<label> is 1 for a positive file, 0 for a negative file.
Each <feature>:<value> pair is represented by <word>:<tfidf>.
(Common words are eliminated before preparing the data set.)
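The slide's sparse-vector format can be sketched as below. This is an illustrative sketch, not the project's actual code: the method name, the word-to-feature-id vocabulary, and the tf × ln(N/df) weighting are assumptions (TF-IDF has several common variants):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: turn a document's word counts into one LibSVM-format line
//   <label> <featureId>:<tfidf> ...
// Assumes a fixed word -> feature-id vocabulary and document-frequency
// counts from the collection; tf * ln(N/df) is one common TF-IDF variant.
public class SparseLine {
    static String toLine(int label, Map<String, Integer> tf,
                         Map<String, Integer> featureId,
                         Map<String, Integer> df, int numDocs) {
        StringBuilder sb = new StringBuilder().append(label);
        // LibSVM requires feature ids in ascending order, hence a TreeMap
        TreeMap<Integer, Double> feats = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer id = featureId.get(e.getKey());
            if (id == null) continue;   // word not in vocabulary (e.g., a common word)
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            feats.put(id, e.getValue() * idf);
        }
        for (Map.Entry<Integer, Double> e : feats.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```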
13
Solution
For each Field/Group K, the following procedure is repeated (Training phase):
- Download documents (PDF)
- Convert PDF to text
- Model documents using TF and IDF (Collection Model by Dr. Zeil)
- Build a positive training set and a negative training set for Field/Group K
- Train the SVM for Field/Group K
14
Solution
(Testing Phase)
- Input test document (PDF)
- Convert PDF to text
- Model documents using TF and IDF
- Score the document with each trained SVM (Field/Group 1 ... K ... N)
- Each SVM produces an estimate in the range 0 to 1 indicating how likely its Field/Group maps to the test document
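The testing phase runs one trained binary SVM per Field/Group and, for single-label classification, keeps the highest-scoring category. A minimal sketch of that final selection step, assuming the per-category estimates have already been computed:

```java
import java.util.Map;

// Sketch of the final step of the testing phase: each Field/Group's
// trained SVM has produced an estimate in [0, 1]; for single-label
// classification, the highest-scoring Field/Group is chosen.
public class OneVsRest {
    static String bestCategory(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```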
15
Improving the results
Scaling the vectors in datasets:
To make the <value>s in <feature>:<value> pairs fall between 0 and 1.
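The scaling above can be sketched as min-max scaling applied per feature across the dataset. The slide does not specify the exact scheme, so this particular formula is an assumption (LibSVM's own svm-scale tool does something similar):

```java
// Sketch of scaling feature values into [0, 1] (min-max scaling),
// applied to all values of one feature across the whole dataset.
public class Scale {
    static double[] minMax(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature carries no information; map it to 0
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }
}
```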
16
Experiment
- Randomly selected 5 Field/Groups: 140200, 120200, 201300, 220200, 250400.
- For each Field/Group, 70 PDF files were downloaded:
  - 50 files were used as positive files for training
  - 20 files were used for testing
- An additional 50 files were taken randomly from all other Field/Groups as negative files for training.
17
Experiment
Metrics:
Recall = #Correct Answers / #Total Possible Answers
Precision = #Correct Answers / #Answers Produced
18
Results
Actual \ Predicted | 140200 120200 201300 220200 250400
140200             |     13      2      1      2      2
120200             |      1     16      0      3      0
201300             |      0      5     13      2      0
220200             |      1      0      2     17      0
250400             |      0      0      1      0     19
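The per-category figures on the next slide follow directly from this confusion matrix (rows = actual category, columns = predicted): recall divides the diagonal entry by its row sum, precision by its column sum. A sketch of that computation:

```java
// Sketch: derive precision and recall for category k from a confusion
// matrix m where rows are actual categories and columns are predictions.
//   recall    = m[k][k] / (sum of row k)
//   precision = m[k][k] / (sum of column k)
public class Metrics {
    static double precision(int[][] m, int k) {
        int col = 0;
        for (int[] row : m) col += row[k];
        return (double) m[k][k] / col;
    }

    static double recall(int[][] m, int k) {
        int row = 0;
        for (int v : m[k]) row += v;
        return (double) m[k][k] / row;
    }
}
```

For example, category 250400 (index 4) gives precision 19/21 ≈ 0.90 and recall 19/20 = 0.95, matching the values reported on the next slide.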
19
Results ..Cont.d
Category Precision Recall
140200 0.87 0.65
120200 0.70 0.80
201300 0.76 0.65
220200 0.71 0.85
250400 0.90 0.95
20
Future Work
Hierarchical Model
150000
  150300
    150301
    150302
  150600
    150601
    150602
In the flat model, we consider each Field/Group independent.
In the hierarchical model, we consider all files under a branch as positive files for training.
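The hierarchical idea of collecting positives from a whole branch can be sketched as below. This is an illustrative assumption about how the codes nest (codes sharing a prefix, e.g. 1503xx, treated as one branch), not a description of the project's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the hierarchical model: when training a classifier for an
// inner node (say 150300), every document whose group code falls under
// that branch (150301, 150302, ...) counts as a positive example.
// Treating the first four digits as the branch prefix is an assumption.
public class Hierarchy {
    static List<String> positivesFor(String node, List<String> allCodes) {
        String branch = node.substring(0, 4);   // e.g. "1503" from "150300"
        List<String> out = new ArrayList<>();
        for (String code : allCodes) {
            if (code.startsWith(branch)) out.add(code);
        }
        return out;
    }
}
```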
21
Future Work
Multi-Label Classification
In practice, each document may belong to multiple Field/Groups.
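One natural way to extend the per-category SVM estimates to multi-label output is to keep every Field/Group whose score clears a threshold instead of only the top one. A sketch of that idea; the 0.5 threshold is an illustrative choice, not from the slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of multi-label classification: instead of keeping only the
// top-scoring Field/Group, keep every one whose SVM estimate meets a
// threshold, so a document can receive several labels.
public class MultiLabel {
    static List<String> labels(Map<String, Double> scores, double threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= threshold) out.add(e.getKey());
        }
        return out;
    }
}
```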
22
Conclusion
- The classification results of DTIC documents based on Field/Groups were impressive.
- Ways to improve the results have been identified.
- A couple of suggestions were given for future work in this particular area.
References
- Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
- Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. (http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf)
- Kwok, J. T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.
23