Concept Ontology for Text Classification
Bhumika Thakker
Outline
Introduction
Techniques Used
Shrinkage in a Hierarchy of Classes
Classification Using Very Few Words
Enhanced Word Clustering for Hierarchical Text Classification
Classification of Web Content
References
Introduction
There has been an explosion in the availability of online information, with millions of documents on every topic easily accessible via the internet
As the amount of available information grows, people become increasingly unable to assimilate and profitably use it
It therefore becomes important to organize this mass of information
Techniques for Classification
Shrinkage in a Hierarchy of Classes
Classification Using Very Few Words
Word Clustering for Hierarchical Text Classification
Shrinkage in a Hierarchy of Classes
The technique leverages commonly available topic hierarchies to significantly improve classification accuracy
It works well even when the hierarchy is large and the training data set is sparse
It also provides a method for exponentially reducing the amount of computation necessary for classification, sacrificing only a small amount of accuracy
Shrinkage in a Hierarchy of Classes
The approach is to apply a technique from statistics called shrinkage, which provides improved estimates of parameters
The technique exploits the hierarchy by "shrinking" parameter estimates in data-sparse children toward the estimates of their data-rich ancestors
It employs a simple form of shrinkage that creates new parameter estimates in a child by a linear interpolation of all hierarchy nodes on the path from the child to the root
Shrinkage in a Hierarchy of Classes: Shrinkage for Text Classification
Shrinkage is used to better estimate the probability of a word given a class, θ_{jt} = P(w_t | c_j)
For each node in the tree, a maximum likelihood (ML) estimate is constructed from the data associated with that node
An improved estimate for each leaf node is then derived by "shrinking" its ML estimate toward the ML estimates of all its ancestors
In other words, a unigram model is built for each node in the tree, and each leaf model is smoothed by linearly interpolating it with all the models found along the path to the root, as in the sketch below
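A minimal sketch of this interpolation, assuming word-count dictionaries for each node on the leaf-to-root path and a given weight vector; the Laplace smoothing and all names here are our own illustrative choices, not the paper's exact estimator:

    # Illustrative sketch: interpolate a leaf's unigram estimate with its
    # ancestors' estimates along the path to the root.
    # path_counts: word-count dicts from the leaf (index 0) up to the root
    # lambdas: interpolation weights, assumed to sum to 1
    def shrunk_estimate(word, path_counts, lambdas, vocab_size):
        theta = 0.0
        for lam, counts in zip(lambdas, path_counts):
            total = sum(counts.values())
            # Laplace-smoothed ML estimate of P(word | node)
            ml = (counts.get(word, 0) + 1) / (total + vocab_size)
            theta += lam * ml
        return theta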
Cont…
The estimates along the path from the leaf to the root represent a tradeoff between specificity and reliability
The estimate at the leaf is the most specific as it is based on the data from the topic alone
It is also the least reliable, as it is based on the least amount of data
The estimator at the root is the most reliable but the least specific
Cont…
Each child's data is subtracted from its parent's before the parent's ML estimate is calculated, ensuring that the ML estimates along a given path are independent
The parent's estimate is then based on the data belonging to all of the child's siblings, but not to the child itself
Thus, for any path from leaf to root, every datum in the tree is used in exactly one of the ML estimates, providing both independence among the estimates and efficient use of the training data
Determining the weights
Let \{\hat\theta_j^1, \ldots, \hat\theta_j^k\} be the k estimates, where \hat\theta_j^1 is the estimate at the leaf and k-1 is the depth of class c_j in the tree
The interpolation weights among the ancestors of class c_j are written \{\lambda_j^1, \ldots, \lambda_j^k\}, where \sum_{a=1}^{k} \lambda_j^a = 1
The new estimate of the class-conditioned word probabilities based on shrinkage is:
\check\theta_{jt} = \lambda_j^1 \hat\theta_{jt}^1 + \lambda_j^2 \hat\theta_{jt}^2 + \cdots + \lambda_j^k \hat\theta_{jt}^k
The likelihood of the data according to the mixture model is a convex function of the weights and attains a single global maximum
This maximum is computed for each leaf class c_j using the following procedure
The algorithm can be viewed as a particularly simple form of EM (Expectation maximization)
Each datum is assumed to have been generated by first choosing one of the tree nodes on the path to the root, and then using that node's estimate to generate the datum
EM then maximizes the total likelihood when the choices of estimates made for the various data are unknown
The first step in the iterative part is thus the E step and the second one is the M step
The method makes inefficient use of the available training data by carving off some of it to be used as a held-out set
To overcome this problem the algorithm is modified as follows:
All the available data is used both to construct the ML estimates and to optimize the weights
As each document is used in the above algorithm, the ML estimates are modified to exclude its data so as to make them independent of it
This method is known as "leave-one-out" or "jackknifing"
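A simplified EM sketch of the weight estimation for a single leaf, using a plain held-out word list rather than the leave-one-out refinement just described; all function and variable names are our own:

    # Illustrative EM for the interpolation weights of one leaf class.
    # path_estimates: for each node on the leaf-to-root path, a dict mapping
    #                 a word to its ML probability at that node
    # held_out: list of held-out word occurrences from the leaf's documents
    def estimate_lambdas(path_estimates, held_out, iters=50):
        k = len(path_estimates)
        lambdas = [1.0 / k] * k                      # start with uniform weights
        for _ in range(iters):
            expected = [0.0] * k
            for w in held_out:
                # E step: posterior probability that node a generated word w
                beta = [lambdas[a] * path_estimates[a].get(w, 1e-12)
                        for a in range(k)]
                z = sum(beta)
                for a in range(k):
                    expected[a] += beta[a] / z
            # M step: normalize the expected counts to get new weights
            total = sum(expected)
            lambdas = [e / total for e in expected]
        return lambdas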
Experimental Results
The following figure shows classification accuracy on the Industry Sector data set with 50-50 train-test splits while varying the vocabulary size
Larger vocabulary sizes generally perform better
Hierarchical feature selection somewhat improves the performance of flat naive Bayes in the mid range of feature selection, at about 5,000 words
Hierarchical feature selection reaches 64% accuracy, while shrinkage improves classification accuracy to 74%
Shrinkage helps more when training data is sparse
The following figure shows accuracy on the Newsgroups data set with the full vocabulary and varying amounts of training data
Accuracy in this domain is highest with no feature selection for both classifiers, even with small amounts of training data
Shrinkage provides more improvement when the amount of training data is small, and it reduces variance in the classifications
Estimates are improved by using shrinkage to smooth a class’s parameters with its ancestors
The following figure shows classification accuracy on the Science hierarchy as a function of vocabulary size
The best performance for the two classifiers is about equal
Among the 51 classes with fewer than 50 training documents, shrinkage provides a 6% improvement in accuracy, from 39% for flat to 45% for hierarchical
Pruning the Tree for Increased Computational Efficiency
The class hierarchy can be leveraged to improve computational efficiency
The classifier can avoid computing class membership probabilities for a majority of the classes by pruning the tree dynamically during the classification of each document, for example with the greedy descent sketched below
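One possible realization of such pruning is a greedy top-down descent that scores only the children of the current node, so most classes are never evaluated. This is a sketch under our own assumptions; node.children and the score function are placeholders, not the paper's exact pruning rule:

    # Greedy descent: evaluate only one branch of the hierarchy per document.
    # node.children is a list of child nodes (empty at a leaf);
    # score(doc_words, node) returns e.g. a log-probability of the document.
    def classify_greedy(doc_words, node, score):
        while node.children:
            node = max(node.children, key=lambda c: score(doc_words, c))
        return node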
Classification using very few words
The task is to categorize documents by topic, where topics are organized in a hierarchy of increasing specificity
The bottleneck in this classification task is the need for a person to read each document and decide on its appropriate place in the hierarchy
This is avoided by automatically classifying new documents using machine learning techniques
The approach is to divide the classification task into a set of smaller classification tasks
Each corresponds to some split in the classification hierarchy
The key insight is that each subtask is significantly simpler than the original task
The classifier at each node in the hierarchy only needs to distinguish between a small number of categories
This is possible using a small set of features
The ability to restrict to a very small feature set avoids many of the difficulties
Such models are more robust and less subject to overfitting
Thus they achieve better accuracy even for a very simple classifier such as Naive Bayes
The key point is not merely the use of feature selection but its integration with the hierarchical structure
Choosing a small set of features would not give good performance if used for classification over the flattened class space
In the hierarchical approach, any document only sees a small fraction of the features throughout the process
The features it does see are divided so as to focus the classifier's attention on the features relevant to the classification subtask at hand
Probabilistic Framework
The approach constructs a hierarchical set of classifiers, each based on its own set of relevant features
It uses two main subroutines (a per-node pipeline is sketched below):
A feature selection algorithm for deciding on the appropriate feature set at each decision point
A supervised learning algorithm for constructing a classifier for that decision
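A compact way to realize such a per-node pipeline, sketched here with scikit-learn (assumed available); the chi-squared selector and multinomial naive Bayes are stand-ins for the paper's cross-entropy feature selection and learner, not the authors' exact choices:

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def make_node_classifier(n_features=10):
        # Each decision point gets its own feature selector and classifier,
        # applied to a bag-of-words count matrix and integer class labels.
        return make_pipeline(SelectKBest(chi2, k=n_features), MultinomialNB())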
Bayesian Classifiers
A Bayesian network allows us to provide compact descriptions of complex distributions over a large number of random variables
It uses a directed acyclic graph to encode conditional independence assumptions about the domain
These independence assumptions allow the distribution to be described as a product of small local interaction models
A Bayesian classifier is simply a Bayesian network applied to a classification domain
It contains a node C for the class variable and a node X_i for each of the features
Given a specific instance x, the Bayesian network allows us to compute the probability P(C = c_k | x) for each possible class c_k
Bayes-optimal classification is achieved by simply selecting the class c_k for which this probability is maximized
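A minimal sketch of this decision rule for a naive Bayes structure (a class node with conditionally independent binary features); all inputs here are illustrative and the probabilities are assumed to be smoothed away from 0 and 1:

    import math

    # priors: {class: P(class)}
    # cond_probs: {class: [P(X_i = 1 | class) for each feature i]}
    # x: binary feature vector for the instance to classify
    def bayes_optimal_class(x, priors, cond_probs):
        best_class, best_score = None, float("-inf")
        for c, prior in priors.items():
            score = math.log(prior)
            for i, xi in enumerate(x):
                p = cond_probs[c][i]
                score += math.log(p if xi else 1.0 - p)
            if score > best_score:
                best_class, best_score = c, score
        return best_class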
Feature Selection
Feature selection employs Information Theoretic measures to determine a subset of the original domain features that seem to best capture the class distribution in the data
For each feature X_i the algorithm determines the expected cross entropy:
\delta(X_i) = \sum_{\mathbf{x}} P(\mathbf{x}) \, D_{KL}\big( P(C \mid \mathbf{x}) \,\|\, P(C \mid \mathbf{x}_{-i}) \big)
where \mathbf{x}_{-i} denotes the values of all domain features except X_i
It eliminates the feature X_i for which \delta(X_i) is minimized
Feature Selection
The feature eliminated least disrupts the original conditional class distribution
This process can be iterated to eliminate as many features as desired
In this respect the algorithm is very applicable to text domains with many features
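An illustrative rendering of one elimination step on a small discrete dataset, using empirical estimates of the class distributions; this is our own simplification of the criterion above, not the authors' code:

    import math
    from collections import Counter, defaultdict

    # X: list of tuples of binary feature values, y: list of class labels
    def feature_to_drop(X, y):
        n = len(X)

        def class_dist(keys):
            # Empirical class counts conditioned on each distinct key.
            dist = defaultdict(Counter)
            for key, label in zip(keys, y):
                dist[key][label] += 1
            return dist

        full = class_dist(X)
        best_i, best_delta = None, float("inf")
        for i in range(len(X[0])):
            reduced_keys = [x[:i] + x[i + 1:] for x in X]
            reduced = class_dist(reduced_keys)
            delta = 0.0
            for x, xr in zip(X, reduced_keys):       # expectation over the data
                zf = sum(full[x].values())
                zr = sum(reduced[xr].values())
                for c, cnt in full[x].items():
                    pf = cnt / zf                    # P(c | x)
                    pr = reduced[xr][c] / zr         # P(c | x without X_i)
                    delta += (pf * math.log(pf / pr)) / n
            if delta < best_delta:
                best_i, best_delta = i, delta
        return best_i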
Experimental Results
Both the hierarchical and flat classification schemes are run on the datasets without employing any probabilistic feature selection
The original number of features in each data set and the results are given in the following table
Two important phenomena are observed:
In the Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than the simple flat method
In the Hier2 dataset, the large number of features and the small dataset size allow the more expressive KDB algorithm to overfit the training data
These results provide an empirical motivation for the integration of feature selection
The table shows that the hierarchical method clearly outperforms the flat classification method in a direct comparison of the 10- and 20-feature runs
The hierarchical method produces 8-41% fewer errors than the flat methods for Hier1, with more modest relative gains for Hier2
The results show that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task
The table shows the set of 10 features found to be most discriminating at each level of the hierarchy learned for the Hier1 dataset
At the top level of the hierarchy, high-level terms from the various major topics are selected
At lower levels, more specific words distinguishing the subtopics appear
Enhanced Word Clustering for Hierarchical Text Classification
Distributional clustering of words/features is one of the ways to reduce dimensionality
Each word cluster can be treated as a single feature and thus dimensionality can be drastically reduced
Feature clustering is more effective than feature selection especially at lower number of features.
Feature clustering appears to preserve classification accuracy as compared to a full-feature classifier
In case of small training sets and noisy features, word clustering can actually increase classification accuracy.
Previously proposed algorithms are agglomerative in nature, yielding sub-optimal word clusters at a high computational cost
DISTRIBUTIONAL WORD CLUSTERING
Let C be a discrete random variable that takes on values from the set of classes
Let W be the random variable that ranges over the set of words
The joint distribution p(C, W) can be estimated from the training set
Now suppose we cluster the words into k clusters W_1, ..., W_k
Since we are interested in reducing the number of features and the model size, we only look at "hard" clustering, where each word belongs to exactly one word cluster
Let the random variable W^C range over the word clusters
The information about C captured by W can be measured by the mutual information I(C; W)
Ideally, the word clusters would exactly preserve this mutual information
However, clustering usually lowers mutual information
Thus it is essential to find a clustering that minimizes the decrease in mutual information, I(C; W) - I(C; W^C) (sketched below)
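A sketch of the quantities involved, assuming the joint distribution p(C, W) is given as a dictionary and the hard clustering as a word-to-cluster map; all names are ours:

    import math

    # joint: {(c, w): p(c, w)}, entries summing to 1
    def mutual_information(joint):
        pc, pw = {}, {}
        for (c, w), p in joint.items():
            pc[c] = pc.get(c, 0.0) + p
            pw[w] = pw.get(w, 0.0) + p
        return sum(p * math.log(p / (pc[c] * pw[w]))
                   for (c, w), p in joint.items() if p > 0)

    # Collapse words into their clusters: p(c, W_s) = sum of p(c, w) over w in W_s
    def clustered_joint(joint, word_to_cluster):
        out = {}
        for (c, w), p in joint.items():
            key = (c, word_to_cluster[w])
            out[key] = out.get(key, 0.0) + p
        return out

    # Information lost by a hard clustering: I(C;W) - I(C;W^C), always >= 0
    def information_loss(joint, word_to_cluster):
        return (mutual_information(joint)
                - mutual_information(clustered_joint(joint, word_to_cluster)))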
Classifying using word Clusters
The Naive Bayes method can be simply translated into using word clusters instead of words.
This is done by estimating new parameters p(W_s | c_i) for the word clusters, analogous to the word parameters p(w_t | c_i) for individual words
When the estimates of p(w_t | c_i) for individual words are relatively poor, the corresponding word-cluster parameters p(W_s | c_i) provide more robust estimates, resulting in higher classification scores
The Naive Bayes rule for classifying a test document d can then be rewritten in terms of word clusters:
c^*(d) = \arg\max_i \; p(c_i) \prod_{s=1}^{k} p(W_s \mid c_i)^{n(W_s, d)}
where n(W_s, d) is the number of occurrences in d of words belonging to cluster W_s
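A minimal sketch of this cluster-based classifier, assuming smoothed word parameters p(w_t | c_i) and a hard word-to-cluster assignment are already available; names and the fallback constant are ours:

    import math
    from collections import Counter

    # p(W_s | c) obtained by summing p(w | c) over the words in cluster W_s
    def cluster_params(word_probs, word_to_cluster):
        out = Counter()
        for w, p in word_probs.items():
            out[word_to_cluster[w]] += p
        return out

    def classify(doc_words, priors, word_probs_by_class, word_to_cluster):
        counts = Counter(word_to_cluster[w] for w in doc_words
                         if w in word_to_cluster)
        best, best_score = None, float("-inf")
        for c, prior in priors.items():
            p_ws = cluster_params(word_probs_by_class[c], word_to_cluster)
            score = math.log(prior) + sum(
                n * math.log(p_ws.get(s, 1e-12)) for s, n in counts.items())
            if score > best_score:
                best, best_score = c, score
        return best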
Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets.
It can be seen that less mutual information is lost with Divisive Clustering than with Agglomerative Clustering at all numbers of clusters, though the difference is more pronounced at smaller numbers of clusters
Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered.
The horizontal axis indicates the number of features/clusters used in the classification model while the vertical axis indicates the percentage of test documents that were classified correctly
Divisive Clustering (with SVM as well as NB) achieves significantly better results with fewer features than feature selection using Information Gain or Mutual Information
With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy
The largest gain occurs when the number of clusters equals the number of classes
In Figure 6, the classification accuracy is plotted on 20Ng data using Naive Bayes when the training data is sparse.
Only 2% of the available data, that is 20 documents per class, is used for training; testing is done on the remaining 98% of the documents
The results are averages of 5-10 trials
Divisive Clustering obtains better results than Information Gain at all numbers of features
It also achieves a significant 12% increase over the maximum accuracy achieved by Information Gain
This is in contrast to Figure 5, where Information Gain eventually catches up as the number of features increases
When the training data is small, the word-by-class frequency matrix contains many zero entries
By clustering words, more robust estimates of word-class probabilities are obtained, which lead to higher classification accuracies
Classification of Web Content
By utilizing the known hierarchical structure, the classification problem is decomposed into a set of smaller problems corresponding to the hierarchical splits in the tree
The approach first learns to distinguish among classes at the top level; lower-level distinctions are then learned only within the appropriate top-level subtree
Each of these sub-problems can be solved much more efficiently, and more accurately as well
Classification of Web Content
The use of a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation.
Each sub-problem is smaller than the original problem
It is sometimes possible to use a much smaller set of features for each sub-problem
The hierarchical structure can also be used to set the negative set for discriminative training, and at classification time to combine information from different levels
SVM for Web Content
SVMs have been found to be efficient and effective for text classification, but they had not previously been explored in the context of hierarchical classification
The efficiency of SVMs for both initial learning and real-time classification makes them applicable to large, dynamic collections like web content
Hierarchical Structure Used in Web Content
The hierarchical structure is used for two purposes:
To train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat, non-hierarchical case)
To combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second-level comparisons are made (two such rules are sketched below)
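Two such combination rules, rendered minimally; this is our own reading of the description above, and the 0.5 threshold is only a placeholder:

    # Multiplicative rule: second-level score weighted by the top-level score.
    def multiplicative(p_top, p_second):
        return p_top * p_second

    # Boolean rule: consult the second level only if the top level clears
    # a threshold (placeholder value).
    def boolean_gate(p_top, p_second, top_threshold=0.5):
        return p_second if p_top >= top_threshold else 0.0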
Classifying web search results
Classification techniques are used to automatically organize search results into existing hierarchical structures
Classification models are learned offline using a training set of human-labeled documents and web categories
Classification offers two advantages compared to clustering: run-time classification is very efficient, and the manually generated category names are easily understood
Constraints
To support the goal of automatically classifying web search results, two constraints are adopted:
Use only the short summaries returned by web search engines, since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and much noisier than other document surrogates, such as abstracts, that others have worked with.
Focus on the top levels of the hierarchy, since many search results can be usefully disambiguated at this level.
An interface that tightly couples search results with the category structure was found to have large preference and performance advantages for automatically classified search results
TEXT CLASSIFICATION USING SVMs
Text classification involves a training phase and a testing phase.
In the training phase, a large set of web pages with known category labels are used to train a classifier.
An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters.
During the testing or operational phase, the learned classifier is used to classify new web pages.
A support vector machine (SVM) algorithm was used as the classifier
A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin.
The margin is the distance from the hyperplane to the nearest of the positive and negative examples.
In the linearly separable case, maximizing the margin can be expressed as an optimization problem
In cases where points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary
For problems that are not linearly separable, kernel methods can also be used to transform the input space so that some non-linear problems can be learned
The simplest, linear form of the SVM is used because it provides good classification accuracy and is fast to learn and apply (see the sketch below)
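A minimal linear-SVM text classifier in this spirit, sketched with scikit-learn (assumed available) on toy data; this is not the paper's implementation or its feature set:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = ["sports page about football", "stock market news today"]
    train_labels = ["Sports", "Business"]            # toy labels, not real data

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)
    print(model.predict(["latest football scores"]))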
Feature Selection in Web Content
The feature space is reduced by eliminating words that appear in only a single document and then selecting the 1,000 words with the highest mutual information with each category
The mutual information MI(F, C) between a feature F and a category C is defined as:
MI(F, C) = \sum_{f \in \{0,1\}} \sum_{c \in \{0,1\}} P(F = f, C = c) \log \frac{P(F = f, C = c)}{P(F = f)\,P(C = c)}
Compute the mutual information between every pair of features and categories
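A sketch of how this quantity can be computed from document counts, treating both the feature and the category membership as binary events; this is our own minimal version:

    import math

    # n_fc: documents containing the feature AND belonging to the category
    # n_f:  documents containing the feature
    # n_c:  documents belonging to the category
    # n:    total number of documents
    def mutual_information(n_fc, n_f, n_c, n):
        mi = 0.0
        for f in (0, 1):
            for c in (0, 1):
                pf = n_f / n if f else 1 - n_f / n
                pc = n_c / n if c else 1 - n_c / n
                if f and c:
                    joint = n_fc / n
                elif f:
                    joint = (n_f - n_fc) / n
                elif c:
                    joint = (n_c - n_fc) / n
                else:
                    joint = (n - n_f - n_c + n_fc) / n
                if joint > 0:
                    mi += joint * math.log(joint / (pf * pc))
        return mi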
SVM Parameters
In addition to varying the number of features, SVM performance is governed by two parameters:
C, the penalty imposed on training examples that fall on the wrong side of the decision boundary
p, the decision threshold
The default C parameter value (0.01) is used
The decision threshold p can be set to control precision and recall for different tasks
Increasing p results in fewer test items meeting the criterion, which usually increases precision but decreases recall
Conversely, decreasing p typically decreases precision but increases recall
p is chosen to optimize performance on the F1 measure on a training validation set
Results
Decision thresholds were established on a training validation set
For each category, if a test item exceeds the decision threshold, it is judged to be in the category
A test item can be in zero, one, or more than one category
From this, precision (P) and recall (R) are computed
These are micro-averaged to weight the contribution of each category by the number of test examples in it (see the sketch below)
For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed
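A sketch of the thresholding and micro-averaging just described; variable names are ours, and real scores would come from the per-category SVMs:

    # scores[cat]     : list of classifier scores for the test items
    # truth[cat]      : list of booleans, True if the item is truly in the category
    # thresholds[cat] : decision threshold p for the category
    def micro_precision_recall(scores, truth, thresholds):
        tp = fp = fn = 0
        for cat, cat_scores in scores.items():
            for s, is_member in zip(cat_scores, truth[cat]):
                predicted = s >= thresholds[cat]
                if predicted and is_member:
                    tp += 1
                elif predicted:
                    fp += 1
                elif is_member:
                    fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall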
References
Improving Text Classification by Shrinkage in a Hierarchy of Classes. Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, Andrew Y. Ng. http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz
Hierarchically Classifying Documents Using Very Few Words. Daphne Koller and Mehran Sahami. http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1997-75&format=pdf&compression=&name=1997-75.pdf
Hierarchical Classification of Web Content. Susan Dumais and Hao Chen. http://research.microsoft.com/~sdumais/sigir00.pdf
Enhanced Word Clustering for Hierarchical Text Classification. Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf
THANK YOU