Extracting Key-Substring-Group Features for Text Classification
KDD 2006
Dell Zhang: Univ of London
Wee Sun Lee: Nat Univ of Singapore
Presented by:
Payam Refaeilzadeh
Motivation
Treating a text document as a string of characters rather than a bag of words may provide a better feature representation for classification purposes
– Sub-word features are captured, e.g. morphological variants: {work, worker, works, worked}
– Super-word features are captured, e.g. phrasal effects such as noun phrases: "affected cells"
– Word boundary detection problems can be avoided (particularly useful for eastern languages)
Motivation continued
String based classification can be achieved using generative classifiers (e.g. Markov-based classifiers)… But
Discriminative classifiers (e.g. SVM) have proven to be superior … But
For discriminative classifiers we need to represent documents as a bag of features where the features are string-based rather than word-based
Challenges
Naïve approach: bag of all possible substrings
– Very high-dimensional: O(n^2) substrings per document, where n = |d| is the document length
– Redundant features
Better approach:
– Group all substrings that have the same distribution and treat each group as a single feature
– Throw out groups that are not statistically significant
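To make the O(n^2) blow-up concrete, here is a minimal Python sketch that enumerates every substring of a short string. The function name `all_substrings` and the example string are illustrative assumptions, not part of the paper.

```python
def all_substrings(d):
    """Return every substring of d, one entry per (start, end) pair."""
    n = len(d)
    return [d[i:j] for i in range(n) for j in range(i + 1, n + 1)]


doc = "banana"
n = len(doc)
subs = all_substrings(doc)

total = len(subs)          # n * (n + 1) / 2 candidate features
distinct = len(set(subs))  # many candidates are repeats of each other

print(total, distinct)
```

Even for a six-character string there are 21 substring occurrences, and the count grows quadratically with document length, which is why grouping and pruning are needed.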
Approach
Use a generalized suffix-tree to capture all substrings of a corpus.
Efficiently compute frequency statistics on the substrings and create substring-groups.
Extract key-substring-groups by eliminating groups that are
– Too frequent or not frequent enough
– Context dependent
– Redundant (based on mutual information)
Suffix Tree
– A directed tree with exactly n numbered leaves and at most n internal nodes, where n = |S|
– The path from the root to leaf i spells out the suffix of S that starts at position i
– If S contains a substring P, at least one suffix begins with P, so the existence of P can be checked by searching the tree from the root
– The frequency of a substring can be calculated by counting the leaves in the subtree rooted at the child node of the edge where the substring search ended
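The search and frequency properties above can be illustrated with a small Python sketch. As a simplifying assumption it builds an uncompressed suffix trie in O(n^2) time rather than a true suffix tree, and it stores the leaf count at each node while building; the names `TrieNode`, `build_suffix_trie` and `frequency` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.leaf_count = 0  # number of suffixes (leaves) at or below this node


def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a character trie (naive, O(n^2))."""
    root = TrieNode()
    text = s + "$"  # terminator so that every suffix ends at its own leaf
    for i in range(len(text)):
        node = root
        node.leaf_count += 1
        for ch in text[i:]:
            node = node.children.setdefault(ch, TrieNode())
            node.leaf_count += 1
    return root


def frequency(root, p):
    """Walk down from the root along p; return how many suffixes start with p."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return 0  # p is not a substring
    return node.leaf_count


root = build_suffix_trie("banana")
print(frequency(root, "ana"), frequency(root, "nab"))
```

A real suffix tree compresses chains of single-child nodes into edges, which is what brings the node count down to O(n).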
Suffix Tree continued
– Each internal node v has a path string spelled by the path r->v
– If the path string of a node v is the path string of another node u with its first character removed (its longest proper suffix), there is a suffix link from u to v
– The suffix tree (including suffix links) for a corpus of documents with a total of n characters can be built in O(n) time using Ukkonen’s algorithm
– All substrings whose path ends in the edge above the same node have identical distribution and can be treated as a substring-group
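The grouping idea can be sketched without building a tree at all. Substrings that end on the same edge sit above the same set of leaves, so they share the same set of start positions. The Python sketch below groups the substrings of a single string by that set; it is a brute-force illustration under that assumption (the paper works over a whole corpus with a generalized suffix tree), and `substring_groups` is a hypothetical name.

```python
from collections import defaultdict


def substring_groups(s):
    """Group the substrings of s by the set of positions where they start."""
    occurrences = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].add(i)

    groups = defaultdict(list)
    for sub, starts in occurrences.items():
        groups[frozenset(starts)].append(sub)

    # Order each group from shortest to longest member for readability.
    return {starts: sorted(subs, key=len) for starts, subs in groups.items()}


groups = substring_groups("banana")
for starts, members in groups.items():
    print(sorted(starts), members)
```

For "banana" the 15 distinct substrings collapse into 6 groups; for example "an" and "ana" always occur together, so they carry the same information as features.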
Feature Selection
Compute the leaf frequency for each internal node
Mark out the nodes whose frequency is too low or too high
Mark out the nodes that have too few children (contextual independence)
Mutual information
– Mark out the nodes for which freq(node)/freq(parent) is too large
– Mark out the nodes for which freq(node)/freq(suffix) is too large
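The selection criteria above amount to a handful of threshold tests per node. The Python sketch below assumes the per-node statistics have already been computed; the `NodeStats` fields, the threshold parameter names and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    freq: int          # leaf frequency of this node
    num_children: int  # number of child edges
    parent_freq: int   # leaf frequency of the parent node
    suffix_freq: int   # leaf frequency of the node reached by the suffix link


def is_key_node(node, min_freq, max_freq, min_children,
                max_parent_ratio, max_suffix_ratio):
    """Return True if the node survives all of the filters (assumed thresholds)."""
    if node.freq < min_freq or node.freq > max_freq:
        return False  # too rare or too common
    if node.num_children < min_children:
        return False  # too few distinct continuations
    if node.freq / node.parent_freq > max_parent_ratio:
        return False  # nearly as frequent as its parent: redundant
    if node.freq / node.suffix_freq > max_suffix_ratio:
        return False  # nearly as frequent as its suffix: redundant
    return True


thresholds = dict(min_freq=80, max_freq=8000, min_children=2,
                  max_parent_ratio=0.8, max_suffix_ratio=0.8)

kept = is_key_node(NodeStats(100, 5, 400, 300), **thresholds)
dropped = is_key_node(NodeStats(100, 5, 110, 300), **thresholds)
print(kept, dropped)
```

In practice the thresholds would be tuned, for example by cross-validation.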
Feature Extraction
Each possible substring is the start (a prefix) of a suffix, and that suffix is the path string of some node.
1. Accumulate the key-substring-groups for each node by traversing the suffix tree and collecting anything that wasn’t thrown out
2. For each document start with the node that represents the entire document and follow the suffix links - extracting the feature set for each node
Experiments
In experiments on English, Chinese and Greek text, the approach outperformed the other methods it was compared against.
Parameters optimized using cross-validation
Comments
The good
– A creative use of an existing algorithm / structure (suffix-tree) to do efficient string-based feature extraction and selection for text data
The bad
– Did not run own experiments. Results compared to published results of other researchers.
– Did not compare to word-based + feature selection
– Did not experiment with spam classification