extracting key-substring-group features for text classification kdd 2006 dell zhang: univ of london...

Extracting Key-Substring-Group Features for Text Classification

KDD 2006

Dell Zhang: Univ of London

Wee Sun Lee: Nat Univ of Singapore

Presented by:

Payam Refaeilzadeh

Motivation

Treating text documents as a string of characters rather than a bag of words may provide a better feature representation of the document for classification purposes

– Sub-word features are captured, e.g. morphological variants: {work, worker, works, worked}

– Super-word features are captured, e.g. phrasal effects, such as noun phrases: affected cells

– Word boundary detection problems can be avoided (particularly useful for eastern languages)

Motivation continued

String-based classification can be achieved using generative classifiers (e.g. Markov-based classifiers) … But

Discriminative classifiers (e.g. SVM) have proven to be superior … But

For discriminative classifiers we need to represent documents as a bag of features where the features are string-based rather than word-based

Challenges

Naïve approach: bag of all possible substrings

– Very high-dimensional: O(n^2) substrings, where n = |d| is the length of document d

– Redundant features

Better approach:

– Group all substrings that have the same distribution and treat each group as a single feature

– Throw out groups that are not statistically significant
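To make the O(n^2) figure concrete, here is a minimal sketch of the naïve approach. The function name `all_substrings` is a hypothetical helper for illustration, not something from the paper.

```python
def all_substrings(d):
    """Return the set of distinct non-empty substrings of d.

    There are n*(n+1)/2 (start, end) pairs for a string of length n,
    so the feature space grows quadratically with document length.
    """
    n = len(d)
    return {d[i:j] for i in range(n) for j in range(i + 1, n + 1)}


# A string of 5 distinct characters yields 5*6/2 = 15 substrings.
print(len(all_substrings("abcde")))  # 15
# Repetition makes many substrings coincide: "aaaa" has only 4 distinct ones.
print(len(all_substrings("aaaa")))   # 4
```

Even for a short document the feature count is large, and many of the substrings always occur together, which is the redundancy that grouping is meant to remove.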

Approach

Use a generalized suffix-tree to capture all substrings of a corpus.

Efficiently compute frequency statistics on the substrings and create substring-groups.

Extract key-substring-groups by eliminating groups that are

– Too frequent or not frequent enough

– Context dependent

– Redundant (based on mutual information)

Suffix Tree

– The suffix tree of a string S is a directed tree with exactly n numbered leaves and at most n internal nodes, where n = |S|

– The path from the root to leaf i spells out the suffix of S that starts at position i

– If S contains a substring P, at least one suffix will begin with that substring => can check for the existence of P by doing a search of the tree starting at the root

– The frequency of a substring can be calculated by counting the leaves in the subtree rooted at the child node of the edge where the substring search ended
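The lookup and frequency properties above can be illustrated with a simplified structure. The sketch below builds an uncompressed suffix trie in O(n^2) time rather than a true suffix tree, and stores at each node the number of suffixes passing through it, which plays the role of the leaf count in the subtree. The names `TrieNode`, `build_suffix_trie` and `frequency` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of suffixes whose path passes through this node


def build_suffix_trie(s):
    """Insert every suffix of s into a character trie (naive, O(n^2))."""
    root = TrieNode()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root


def frequency(root, p):
    """Number of occurrences of the non-empty pattern p, or 0 if absent."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return 0  # search fell off the tree: p is not a substring
    return node.count


trie = build_suffix_trie("banana")
print(frequency(trie, "ana"))  # 2
print(frequency(trie, "x"))    # 0
```

A real suffix tree compresses chains of single-child nodes into one edge, which is what brings the size down to O(n).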

Suffix Tree continued

– Each internal node v has a path string spelled by the path r->v

– If the path string of a node v is the path string of another node u with its first character removed (its longest proper suffix), there is a suffix link from u to v

– The suffix tree (including suffix links) for a corpus of documents with a total of n characters can be built in O(n) time using Ukkonen’s algorithm

– All substrings whose path ends in the edge above the same node have identical distribution and can be treated as a substring-group
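As a rough illustration of substring-groups, the sketch below groups the substrings of a single string by their set of start positions. This is a naive O(n^2) stand-in for the suffix-tree construction: substrings that end on the same edge occur at exactly the same positions, so grouping by position set should give the same groups. The function name `substring_groups` is hypothetical, and a real corpus would use (document, offset) pairs instead of plain offsets.

```python
from collections import defaultdict


def substring_groups(s):
    """Map each set of start positions to the substrings that occur exactly there."""
    starts = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[s[i:j]].add(i)

    groups = defaultdict(list)
    for sub, positions in starts.items():
        groups[frozenset(positions)].append(sub)
    # Members of one group are prefixes of each other, so length gives a clear order.
    return {pos: sorted(subs, key=len) for pos, subs in groups.items()}


groups = substring_groups("banana")
print(groups[frozenset({1, 3})])  # ['an', 'ana']
print(len(groups))                # 6 groups instead of 15 distinct substrings
```

Each group can then be treated as one feature, since its members carry identical occurrence statistics.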

Feature Selection

Compute the leaf frequency for each internal node

Mark out the nodes whose frequency is too low or too high

Mark out the nodes that have too few children (contextual independence)

Mutual information

– Mark out the nodes for which freq(node)/freq(parent) is too large

– Mark out the nodes for which freq(node)/freq(suffix) is too large
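The filters above can be sketched as a single predicate over per-node statistics. Everything named here is an assumption for illustration: the `NodeStats` record, the threshold parameter names, and the reading of freq(suffix) as the frequency of the node reached by the suffix link. Whether each threshold is inclusive is also a guess.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    freq: int          # leaf count under this node
    num_children: int  # number of child edges
    parent_freq: int   # leaf count under the parent node (assumed > 0)
    suffix_freq: int   # leaf count under the suffix-link target (assumed > 0)


def is_key_node(node, min_freq, max_freq, min_children,
                max_parent_ratio, max_suffix_ratio):
    """Return True if the node survives all four filters."""
    if not (min_freq <= node.freq <= max_freq):
        return False  # too rare or too common
    if node.num_children < min_children:
        return False  # too few distinct continuations
    if node.freq / node.parent_freq > max_parent_ratio:
        return False  # nearly as frequent as its parent: redundant
    if node.freq / node.suffix_freq > max_suffix_ratio:
        return False  # nearly as frequent as its suffix: redundant
    return True


node = NodeStats(freq=10, num_children=3, parent_freq=40, suffix_freq=20)
print(is_key_node(node, min_freq=5, max_freq=100, min_children=2,
                  max_parent_ratio=0.8, max_suffix_ratio=0.8))  # True
```

The threshold values in the usage line are arbitrary; the slides state that the actual parameters were tuned by cross-validation.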

Feature Extraction

Every substring is a prefix of some suffix, so it lies on the path from the root to some node of the suffix tree.

1. Accumulate the key-substring-groups for each node by traversing the suffix tree and collecting anything that wasn’t thrown out

2. For each document start with the node that represents the entire document and follow the suffix links - extracting the feature set for each node
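The end result of extraction is a bag of key-substring-group counts per document. The sketch below produces that output in a naive way, by scanning the document for one representative substring per group, and does not follow suffix links as the slides describe. The function names and the `key_groups` mapping from a group id to a representative substring are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def extract_features(doc, key_groups):
    """Return {group_id: count} for the key groups that occur in doc.

    key_groups maps a group id to one representative substring of that
    group; all members of a group occur at the same places, so any
    member gives the same count.
    """
    features = {}
    for group_id, representative in key_groups.items():
        c = count_occurrences(doc, representative)
        if c:
            features[group_id] = c
    return features


print(extract_features("banana", {0: "ana", 1: "na", 2: "xyz"}))  # {0: 2, 1: 2}
```

The resulting sparse dictionaries can be passed to any bag-of-features classifier such as an SVM. The suffix-link traversal described above reaches the same kind of output without rescanning the text for every group.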

Experiments

In experiments on English, Chinese and Greek text, the approach outperformed the other methods it was compared against.

Parameters optimized using cross-validation

Comments

The good

– A creative use of an existing algorithm / structure (suffix-tree) to do efficient string-based feature extraction and selection for text data

The bad

– Did not run their own baseline experiments: results were compared to published results of other researchers

– Did not compare to word-based features with feature selection

– Did not experiment with spam classification
