Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou
Simon Fraser University
Yu He
National University of Singapore
Hierarchical & Multi-classed Documents
• Topics are organized into a hierarchy of increasing specificity
• A document is classified into all relevant classes.
• For example, a document on Dance could be reached from both Arts:Performing_Arts and Recreation topics in Yahoo
New Issues
• Misclassification is non-symmetric
– Travel Outdoor vs. Travel Software
• Documents are multi-classed
– Traditional way: only one class attached
• Class space is sparse
– 2^k - 1 subsets of classes for k classes
– Exploring the similarities between classes
A New Classification Model
• The model of documents:
– {t1,t2,…,tn | C1,…,Ck}, where t1,t2,…,tn are keywords and C1,…,Ck are classes from a given class hierarchy
– {C1,…,Ck} is called a classset (CS)
• Construct a classifier
– consisting of rules of the form {ti1,…,tip} → {Ci1,…,Cip} that assigns a “good” classset to a given new document
Class Similarity
• Two classsets are similar if they “cover” similar documents.
• Anc(CS): the set of classes in a classset CS plus all ancestor classes.
• CS1 is more general than CS2 if Anc(CS1) ⊆ Anc(CS2)
– {Dance} is more general than {Fast-Dance, Music} because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
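A minimal Python sketch of Anc and the "more general" test. The parent map `PARENT` and the placement of Dance and Music under Arts are hypothetical; the slides only imply that Fast-Dance lies below Dance.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Dance": "Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        # Walk up the hierarchy until the root or an already-visited class.
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result

def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)

print(more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))  # False
```

The relation is not symmetric: {Dance} is more general than {Fast-Dance, Music}, but not the other way round.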
Class Similarity (Cont.)
• A document d is covered by a classset CS if CS is more general than the classset of d
• Cover(CS) denotes the set of documents covered by CS
• Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)
Class Similarity (Cont.)
• The dissimilarity of CS1 and CS2 is defined as the normalized difference of their coverage:
$$E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) - \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) - \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$$
• The similarity is defined as 1 - E(CS1,CS2)
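The coverage and dissimilarity definitions can be sketched in Python as follows. The hierarchy `PARENT` and the four documents in `DOCS` are made up for illustration, and the denominator is read as the union of the two coverages, which is an assumption about the original slide.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {"Arts": None, "Dance": "Arts", "Fast-Dance": "Dance", "Music": "Arts"}

def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result

def cover(cs, docs):
    """Ids of documents d such that `cs` is more general than the classset of d."""
    return {doc_id for doc_id, cs_d in docs.items() if anc(cs) <= anc(cs_d)}

def dissimilarity(cs1, cs2, docs):
    """E(CS1, CS2): symmetric difference of the coverages over their union."""
    c1, c2 = cover(cs1, docs), cover(cs2, docs)
    union = c1 | c2
    if not union:
        return 0.0  # assumption: two empty coverages are treated as identical
    return len(c1 ^ c2) / len(union)

# Hypothetical training documents: id -> classset.
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Dance"},
    "d3": {"Music"},
    "d4": {"Fast-Dance", "Music"},
}

e = dissimilarity({"Dance"}, {"Fast-Dance"}, DOCS)
print(e, 1 - e)  # dissimilarity and similarity of {Dance} and {Fast-Dance}
```

On this toy data {Dance} covers d1, d2 and d4 while {Fast-Dance} covers d1 and d4, so the two classsets differ on one of three documents.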
The Confidence
• Match(T → CS): the set of documents that contain all the terms in T.
• The confidence of T → CS is defined as:
$$\mathrm{Conf}_g(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$
What’s behind Confg?
• Intuitively, Confg(T → CS) measures the average similarity between CS and the classsets of the documents that match T → CS.
• If E(CSd,CS) is binary, i.e., 1 or 0, Confg(T → CS) degenerates to the standard confidence.
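A small Python sketch of the generalized confidence, taking the dissimilarity E as a parameter. The documents, the class names, and the value 0.5 in `graded_E` are invented for illustration; `graded_E` is only a stand-in for a coverage-based E.

```python
def confg(terms, cs, docs, E):
    """Generalized confidence of the rule terms -> cs.

    docs: list of (set_of_terms, classset) pairs.
    E:    dissimilarity function E(cs_d, cs) with values in [0, 1].
    """
    matched = [cs_d for words, cs_d in docs if terms <= words]
    if not matched:
        return 0.0  # assumption: a rule that matches nothing gets confidence 0
    return (len(matched) - sum(E(cs_d, cs) for cs_d in matched)) / len(matched)

def binary_E(cs_d, cs):
    """Binary dissimilarity: 0 for identical classsets, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0

def graded_E(cs_d, cs):
    """Stand-in graded dissimilarity with one made-up intermediate value."""
    if cs_d == cs:
        return 0.0
    made_up = {(frozenset({"Fast-Dance"}), frozenset({"Dance"})): 0.5}
    return made_up.get((cs_d, cs), 1.0)

DANCE = frozenset({"Dance"})
FAST = frozenset({"Fast-Dance"})
TRAVEL = frozenset({"Travel"})

# Hypothetical training documents.
DOCS = [
    ({"dance", "stage"}, DANCE),
    ({"dance", "tempo"}, FAST),
    ({"dance", "ticket"}, TRAVEL),
    ({"flight"}, TRAVEL),
]

print(confg({"dance"}, DANCE, DOCS, binary_E))  # standard confidence: 1 of 3 matches
print(confg({"dance"}, DANCE, DOCS, graded_E))  # higher, since Fast-Dance is close to Dance
```

With the binary E only exact classset matches count, which is the standard confidence; a graded E gives partial credit to documents whose classsets are similar to CS.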
Construction of Classifier
• Step 1: Find association rules
– Generate all association rules of the form T → CS that satisfy some user-specified minimum support and confidence.
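A brute-force Python sketch of Step 1, not the mining algorithm used by the authors. It rests on several assumptions that the slides do not state: support is taken to be the number of documents matching T, candidate classsets are limited to those seen in the training data, term sets are capped at `max_terms` terms, and the confidence is the generalized confidence with a dissimilarity function `E` passed in by the caller.

```python
from collections import defaultdict
from itertools import combinations

def mine_rules(docs, E, min_sup, min_conf, max_terms=2):
    """Enumerate rules T -> CS that meet minimum support and confidence.

    docs: list of (set_of_terms, classset) pairs, classsets as frozensets.
    E:    dissimilarity function E(cs_d, cs) with values in [0, 1].
    """
    # Collect, for every candidate term set T, the classsets of matching documents.
    match = defaultdict(list)
    for words, cs_d in docs:
        for n in range(1, max_terms + 1):
            for T in combinations(sorted(words), n):
                match[frozenset(T)].append(cs_d)

    candidates = {cs_d for _words, cs_d in docs}  # assumption: observed classsets only
    rules = []
    for T, matched in match.items():
        if len(matched) < min_sup:  # assumption: support = number of matching documents
            continue
        for cs in candidates:
            conf = (len(matched) - sum(E(cs_d, cs) for cs_d in matched)) / len(matched)
            if conf >= min_conf:
                rules.append((T, cs, conf))
    return rules

def binary_E(cs_d, cs):
    """Binary dissimilarity, used here only to keep the example small."""
    return 0.0 if cs_d == cs else 1.0

DANCE = frozenset({"Dance"})
TRAVEL = frozenset({"Travel"})

# Hypothetical training documents.
DOCS = [
    ({"dance", "stage"}, DANCE),
    ({"dance", "stage"}, DANCE),
    ({"dance", "ticket"}, TRAVEL),
    ({"flight", "ticket"}, TRAVEL),
]

for T, cs, conf in mine_rules(DOCS, binary_E, min_sup=2, min_conf=0.6):
    print(sorted(T), "->", sorted(cs), round(conf, 2))
```

Enumerating every term subset is exponential in document length, so a real system would use a frequent-itemset miner instead of this exhaustive loop.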
Construction of Classifier (Cont.)
• Step 2: rank the rules
– A document is classified by the matching rule that has the highest confidence.
– This selection is called most confidence first (MCF)
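MCF selection can be sketched in Python as a sort followed by a first-match scan. The rules, their confidence values, and the default classset {"Misc"} are hypothetical; ties are broken by input order here, which is an assumption.

```python
def classify(doc_terms, rules, default_cs):
    """Return the classset of the matching rule with the highest confidence.

    rules: list of (set_of_terms, classset, confidence) triples.
    Falls back to `default_cs` when no rule matches the document.
    """
    # sorted() is stable, so rules with equal confidence keep their input order.
    ranked = sorted(rules, key=lambda rule: rule[2], reverse=True)
    for terms, cs, _confidence in ranked:
        if terms <= doc_terms:  # the rule matches if the document has all its terms
            return cs
    return default_cs

# Hypothetical rules with made-up confidence values.
RULES = [
    ({"dance"}, {"Dance"}, 0.60),
    ({"dance", "tempo"}, {"Fast-Dance"}, 0.90),
    ({"ticket"}, {"Travel"}, 0.75),
]

print(classify({"dance", "tempo", "ticket"}, RULES, {"Misc"}))  # {'Fast-Dance'}
print(classify({"dance", "ticket"}, RULES, {"Misc"}))           # {'Travel'}
print(classify({"weather"}, RULES, {"Misc"}))                   # {'Misc'}
```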
Construction of Classifier (Cont.)
• Step 3: remove rules of low accuracy
– Let D be the set of training documents classified by rule T → CS. The accuracy of T → CS is defined as
$$\mathrm{Accu}(T \to CS) = \frac{|D| - \sum_{d \in D} E(CS_d, CS)}{|D|}$$
Construction of Classifier (Cont.)
– Confg(T → CS) is defined with respect to all the documents that match the rule, whereas Accu(T → CS) is defined with respect to the documents classified by the rule.
– Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
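The difference between matched and classified documents can be made concrete with a Python sketch of the accuracy computation. It assumes the rules are already ranked by confidence, that pruning is a single pass, and that rules classifying no training document are dropped; the rules, documents, binary E, and threshold 0.5 are all made up.

```python
def rule_accuracies(ranked_rules, docs, E):
    """Accu for each rule, computed over the documents it actually classifies.

    ranked_rules: list of (set_of_terms, classset), highest confidence first.
    docs:         list of (set_of_terms, classset) training documents.
    Returns None for a rule that classifies no document.
    """
    classified = [[] for _ in ranked_rules]
    for words, cs_d in docs:
        # Each document goes to the first (highest-ranked) rule that matches it.
        for i, (terms, _cs) in enumerate(ranked_rules):
            if terms <= words:
                classified[i].append(cs_d)
                break
    accuracies = []
    for (_terms, cs), D in zip(ranked_rules, classified):
        if D:
            accuracies.append((len(D) - sum(E(cs_d, cs) for cs_d in D)) / len(D))
        else:
            accuracies.append(None)
    return accuracies

def binary_E(cs_d, cs):
    """Binary dissimilarity, used here only to keep the example small."""
    return 0.0 if cs_d == cs else 1.0

DANCE = frozenset({"Dance"})
FAST = frozenset({"Fast-Dance"})
TRAVEL = frozenset({"Travel"})

# Hypothetical ranked rules and training documents.
RANKED = [({"dance", "tempo"}, FAST), ({"dance"}, DANCE), ({"flight"}, DANCE)]
DOCS = [
    ({"dance", "tempo"}, FAST),
    ({"dance", "stage"}, DANCE),
    ({"dance", "ticket"}, TRAVEL),
    ({"flight"}, TRAVEL),
]

accu = rule_accuracies(RANKED, DOCS, binary_E)
kept = [rule for rule, a in zip(RANKED, accu) if a is not None and a >= 0.5]
print(accu)       # [1.0, 0.5, 0.0]
print(len(kept))  # 2
```

The rule {dance} → {Dance} matches three documents but classifies only two, because the more confident rule {dance, tempo} → {Fast-Dance} takes the third; its accuracy is therefore computed over two documents.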
Construction of Classifier (Cont.)
• Step 4: cut off the ranked list
– If we cut off the list of rules r1,…,rm after the first i rules, r1,…,ri, then
– Cutoff error = PrefixError(ri) + DefaultError(ri)
– PrefixError(ri) is the sum of the rule errors Error(rj) for all rules rj, 1 ≤ j ≤ i
– DefaultError(ri) is the error caused by assigning the default classset to all the documents not classified by any rule rj
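A Python sketch of the cutoff computation. It assumes that the cutoff point is chosen to minimize the cutoff error, which the slide suggests but does not state, and it takes the per-rule and default errors as given inputs; all numbers are invented.

```python
from itertools import accumulate

def best_cutoff(rule_errors, default_errors):
    """Return (i, cutoff_errors) where keeping rules r1..ri minimizes the error.

    rule_errors[j-1]    = Error(rj)
    default_errors[i-1] = DefaultError(ri), the error of the default classset
                          on documents not classified by r1..ri
    """
    prefix_errors = list(accumulate(rule_errors))  # PrefixError(ri) for i = 1..m
    cutoff_errors = [p + d for p, d in zip(prefix_errors, default_errors)]
    best_index = min(range(len(cutoff_errors)), key=cutoff_errors.__getitem__)
    return best_index + 1, cutoff_errors  # convert to the 1-based rule index i

# Hypothetical error values for a ranked list of four rules.
RULE_ERRORS = [0.0, 1.0, 1.0, 2.0]
DEFAULT_ERRORS = [4.0, 2.5, 2.0, 1.5]

i, errors = best_cutoff(RULE_ERRORS, DEFAULT_ERRORS)
print(i, errors)  # 2 [4.0, 3.5, 4.0, 5.5]
```

Adding rules lowers the default error but raises the prefix error, so the cutoff error typically falls and then rises along the ranked list.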
Experiments
Experimental Results
• The results on the IBM data set
– The error: Coverage beats the others.
– The size: Confidence gets smaller.
– The time: Coverage takes longer.
Classification Error
Size & Execution Time