1
Finding Structure in Noisy Text: Topic Classification and Unsupervised Clustering
Rohit Prasad, Prem Natarajan, Krishna Subramanian, Shirin Saleem, and Rich Schwartz
{rprasad,pnataraj}@bbn.com
Presented by Daniel Lopresti
8th January 2007
2
Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages
3
Objectives
Develop a system that performs topic based categorization of newsgroup messages in two modes
Mode 1 – Supervised classification: topics of interest to the user are known a priori to the system
– Spot messages that are on topics of interest to a user
– Requires rejecting “off-topic” messages to ensure low false alarm rates

Mode 2 – Unsupervised classification: topics of interest to the user are not known
– Discover topics in a large corpus without human supervision
– Automatically organize/cluster the messages to support efficient navigation
4
Challenges Posed by Newsgroup Messages
Text in newsgroup messages tends to be “noisy”
– Abbreviations, misspellings
– Colloquial (non-grammatical) language of messages
– Discursive structure with frequent switching between topics
– Lack of context in some messages makes it impossible to understand the message without access to the complete thread

Supervised classification requires annotation of newsgroup messages with a set of topic labels
– Every non-trivial message contains multiple topics
– No completely annotated corpus of newsgroup messages exists
• By complete annotation we mean tagging each message with ALL relevant topics
6
Supervised Topic Classification
“President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”

[Diagram: text, or audio/images from sources such as CNN, NBC, CBS, and NPR, pass through ASR or OCR; the resulting text is fed to the Topic Classifier, which outputs topics such as: Clinton, Bill; Mexico; Money; Economic assistance, American]

Applications
• News Sorting
• Information Retrieval
• Detection of Key Events
• Improved Speech Recognition

Training
– Topic models are trained from a topic-labeled Broadcast News corpus in which each story carries several topics
– e.g., Primary Source Media: 4-5 topics / story, 40,000 stories / year, 5,000 topics
7
OnTopicTM HMM Topic Model
A probabilistic hidden Markov model (HMM) that attempts to capture the generation of a story

Assumes a story can be on multiple topics, with different words related to different topics

Uses an explicit state for General Language, because most words in a story are not related to any topic

Scalable to a large number of topics; training the model requires only the topic labels for each story

Language-independent methodology
[Diagram: HMM with states story-start → {T0 (General Language), T1, T2, …, TM} → story-end, with a loop back to topic selection for each word; the topic set has prior P(Set), topic transitions are governed by P(Tj | Set), and word emissions by Πn P(Wn | Tj)]
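As a rough sketch (not the OnTopic implementation, which also models state transitions), the generative idea in the diagram can be written as a per-word topic mixture: each word is emitted by one topic in the story's set, including T0 (General Language). All names and probabilities below are invented toy values:

```python
from math import log

def story_log_likelihood(words, topic_set, p_topic_given_set, p_word_given_topic):
    # For each word: pick a topic Tj from the story's set with
    # P(Tj | Set), then emit the word with P(w | Tj). Unseen words
    # get a tiny floor probability so the log stays finite.
    total = 0.0
    for w in words:
        mix = sum(p_topic_given_set[t] * p_word_given_topic[t].get(w, 1e-9)
                  for t in topic_set)
        total += log(mix)
    return total

# Invented toy model: General Language (T0) absorbs common words.
p_t = {"T0": 0.5, "mexico": 0.3, "economy": 0.2}
p_w = {
    "T0": {"the": 0.5, "today": 0.2, "he": 0.2, "plan": 0.1},
    "mexico": {"mexican": 0.6, "bailout": 0.4},
    "economy": {"bailout": 0.4, "plan": 0.3, "approval": 0.3},
}
score = story_log_likelihood(["mexican", "bailout", "plan"],
                             ["T0", "mexico", "economy"], p_t, p_w)
```

The General Language state matters: a topical word like "bailout" is far more likely under a topic set that contains a relevant topic than under T0 alone.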
9
Experiment Setup
Performed experiments with two newsgroup corpora
– Automated Front End (AFE) newsgroup corpus collected by Washington Univ.
– 20 Newsgroup (NG) corpus from http://people.csail.mit.edu/jrennie/20Newsgroups/

Assumed the name of the newsgroup is the ONLY topic associated with each message

Although cost effective, this assumption leads to inaccuracies in estimating system performance
– Messages typically contain multiple topics, some of which may be related to the dominant theme of another newsgroup
10
AFE Newsgroups Corpus
Google newsgroups data collected by Washington University from 12 diverse newsgroups
Messages posted to 11 newsgroups are considered to be in-topic and all messages posted to the “talk.origins” newsgroup are considered to be off-topic
Message headers were stripped to exclude newsgroup name from training and test messages
Split the corpus into training, test, and, validation sets according to the distribution specified in the config.xml file provided by the Washington University– But since the filenames were truncated we could not select the
same messages as Washington University
11
AFE Newsgroups Corpus
Newsgroup                              #Training  #Test  #Validation
Alt.sports.baseball.stl_cardinals 21 33 10
Comp.ai.neural_nets 15 25 7
Comp.programming.threads 31 47 15
Humanities.musics.composers.wagner 19 31 9
Misc.consumers.frugal_living 10 17 5
Misc.writing.moderated 24 37 12
Rec.Equestrian 27 41 13
Rec.martial_arts.moderated 18 29 9
Sci.archaelogy.moderated 46 69 23
Sci.logic 20 30 10
Soc.libraries.talk 10 17 5
Talk.origins (Chaff) 245 10401 122
Total Number of Messages (w/o chaff) 241 376 118
Total Number of Messages (w/ chaff) 486 10777 240
Total Number of Words (w/o chaff) 103K 118K 32K
Total Number of Words (w/ chaff) 187K 3.4M 63K
12
Closed-set Classification Accuracy on AFE
Trained OnTopic models on 11 newsgroups
– Excluded messages from the talk.origins newsgroup because they are “off-topic” w.r.t. the topics of interest
– Used stemming since some newsgroups had only a few training messages

Classified 376 in-topic messages

Achieved overall top-choice accuracy of 91.2%
– Top-choice accuracy: percentage of times the top-choice (best) topic returned by OnTopic was the correct answer

Top-choice accuracy was worse on newsgroups with fewer training examples
13
Closed-set Classification Accuracy (Contd.)
Newsgroup #Training Messages %Top-Choice Accuracy
Misc.consumers.frugal_living 10 47.1%
Soc.libraries.talk 10 58.8%
Comp.ai.neural_nets 15 80.0%
Rec.martial_arts.moderated 18 86.2%
Humanities.musics.composers.wagner 19 100.0%
Sci.logic 20 96.7%
Alt.sports.baseball.stl_cardinals 21 100.0%
Misc.writing.moderated 24 91.9%
Rec.Equestrian 27 97.6%
Comp.programming.threads 31 100.0%
Sci.archaelogy.moderated 46 95.7%
Overall 241 91.2%
14
“20 Newsgroups” Corpus
Downloaded the 20 Newsgroups Corpus (“20 NG”) from http://people.csail.mit.edu/jrennie/20Newsgroups/

Corpus characteristics:
– Messages from 20 newsgroups with an average of 941 messages per newsgroup
– Average of 350 threads in each newsgroup
– Average message length of 300 words (170 words after headers and “replied-to” text are excluded)
– Some newsgroups are similar – the 20 newsgroups span 6 broad subjects

Data pre-processing
– Stripped message headers, e-mail IDs, and signatures to exclude newsgroup-related information

Corpus was split into training, development, and validation sets for topic classification experiments
15
Distribution of Messages Across Newsgroups
Newsgroup Total Messages Unique Threads Messages Per Thread
alt.atheism 799 87 9.2
comp.graphics 973 532 1.8
comp.os.ms-windows.misc 985 479 2.1
comp.sys.ibm.pc.hardware 982 536 1.8
comp.sys.mac.hardware 961 467 2.1
comp.windows.x 980 773 1.3
misc.forsale 972 877 1.1
rec.autos 990 260 3.8
rec.motorcycles 994 177 5.6
rec.sport.baseball 994 272 3.7
rec.sport.hockey 999 346 2.9
sci.crypt 991 216 4.6
sci.electronics 981 395 2.5
sci.med 990 314 3.2
sci.space 987 296 3.3
soc.religion.christian 997 295 3.4
talk.politics.guns 910 145 6.3
talk.politics.mideast 940 307 3.1
talk.politics.misc 775 133 5.8
talk.religion.misc 628 103 6.1
Average 941 350 3.7
16
Organization of Newsgroups By Subject Matter
– comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
– rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
– sci.crypt, sci.electronics, sci.med, sci.space
– misc.forsale
– talk.politics.misc, talk.politics.guns, talk.politics.mideast
– talk.religion.misc, alt.atheism, soc.religion.christian
17
Splits for Training and Testing
80:20 split between training and test/validation sets for three different partitioning schemes

Thread Partitioning: entire thread is assigned to one of training, development, or validation sets

Chronological Partitioning: messages in each thread are split between training, test, and validation; first 80% in training, and the rest in test and validation

Random Partitioning: 80:20 split between training and test/validation, without regard to thread or chronology
– Prior work by other researchers with 20 NG used random partitioning
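The thread and chronological schemes can be sketched as below; the function names and the `threads` representation (a dict mapping a thread id to its messages in time order) are assumptions for illustration, not the authors' code:

```python
import random

def thread_partition(threads, train_frac=0.8, seed=0):
    # Thread partitioning: every message of a thread lands on the
    # same side of the split.
    rng = random.Random(seed)
    ids = sorted(threads)
    rng.shuffle(ids)
    cut = int(train_frac * len(ids))
    train = [m for t in ids[:cut] for m in threads[t]]
    test = [m for t in ids[cut:] for m in threads[t]]
    return train, test

def chronological_partition(threads, train_frac=0.8):
    # Chronological partitioning: within each thread (messages in
    # time order), the first 80% train and the remainder test.
    train, test = [], []
    for t in sorted(threads):
        msgs = threads[t]
        cut = int(train_frac * len(msgs))
        train += msgs[:cut]
        test += msgs[cut:]
    return train, test
```

Thread partitioning is the strictest condition: no “replied-to” text from a training message can leak into a test message of the same thread, which is consistent with it showing the lowest accuracy in the results that follow.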
18
Closed-set Classification Results
Test Message Type          %Top-Choice Accuracy
                           Thread  Chronological  Random
w/o “replied-to” text      74.5    77.8           79.7
w/ “replied-to” text       76.0    79.6           83.2
Trained an OnTopic model set consisting of 20 topics

Classified 2K test messages
– Two test conditions: one where “replied-to” text (from previous messages) is included and one where it is stripped from the test message

Classification accuracy is low due to the following
– Significant subject overlap between newsgroups
– Lack of useful a priori probabilities due to the almost uniform distribution of topics, unlike the AFE newsgroup data
19
Detailed Results for Thread Partitioning

Newsgroup                     %Top-Choice Accuracy    Top Confusion
talk.religion.misc 29.3 talk.politics.guns
misc.forsale 51.0 comp.os.ms-windows.misc
talk.politics.misc 57.5 talk.politics.guns
sci.electronics 58.3 rec.autos
comp.os.ms-windows.misc 62.0 comp.sys.mac.hardware
alt.atheism 63.4 soc.religion.christian
comp.graphics 68.6 comp.windows.x
comp.sys.ibm.pc.hardware 72.6 comp.os.ms-windows.misc
comp.sys.mac.hardware 74.5 comp.sys.ibm.pc.hardware
comp.windows.x 77.1 comp.sys.ibm.pc.hardware
rec.motorcycles 81.9 rec.autos
talk.politics.guns 82.9 sci.crypt
talk.politics.mideast 84.6 rec.motorcycles
soc.religion.christian 87.4 sci.med
sci.crypt 89.0 talk.politics.guns
rec.sport.baseball 90.7 rec.sport.hockey
rec.autos 93.4 misc.forsale
sci.med 93.6 misc.forsale
rec.sport.hockey 94.2 rec.sport.baseball
sci.space 94.6 rec.autos
Overall 76.0
20
Detailed Results for Chronological Partitioning

Newsgroup                     %Top-Choice Accuracy    Top Confusion
talk.religion.misc 35.0 alt.atheism
misc.forsale 53.8 comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc 62.9 comp.windows.x
comp.graphics 63.9 comp.os.ms-windows.misc
sci.electronics 64.3 rec.autos
talk.politics.misc 71.4 talk.politics.guns
alt.atheism 72.2 soc.religion.christian
comp.sys.ibm.pc.hardware 73.2 comp.os.ms-windows.misc
comp.sys.mac.hardware 75.8 comp.sys.ibm.pc.hardware
comp.windows.x 81.4 comp.os.ms-windows.misc
rec.motorcycles 86.7 rec.autos
sci.med 86.9 sci.space
rec.autos 88.7 comp.os.ms-windows.misc
talk.politics.guns 90.1 talk.politics.misc
talk.politics.mideast 90.2 alt.atheism
sci.space 91.9 comp.graphics
rec.sport.baseball 92.9 rec.sport.hockey
soc.religion.christian 94.0 alt.atheism
sci.crypt 96.0 sci.electronics
rec.sport.hockey 98.0 sci.med
Overall 79.6
21
Manual Clustering and Human Review
Manually clustered newsgroups into 12 topics after reviewing the content of training messages

Recomputed top-choice classification accuracy using the cluster information

Clustering        %Top-Choice Accuracy
                  Thread  Chronological  Random
w/o Clustering    76.0    79.6           83.2
w/ Clustering     81.5    84.8           88.2

Effect of the presence of multiple topics in a message and an incomplete reference topic label set
– Manually reviewed messages from the 4 categories with the lowest performance for the “Chronological” split
– Accuracy increases to 88.0% (from 84.8%) following manual rescoring
22
Cluster Table
Topic Cluster Newsgroup(s)
Autos rec.autos, rec.motorcycles
Graphics comp.graphics
Macintosh comp.sys.mac.hardware
Misc.forsale misc.forsale
Politics talk.politics.guns, talk.politics.mideast, talk.politics.misc
Windows comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.windows.x
Religion soc.religion.christian, talk.religion.misc, alt.atheism
Sports rec.sport.baseball, rec.sport.hockey
sci.crypt sci.crypt
sci.electronics sci.electronics
sci.med sci.med
sci.space sci.space
24
The Problem
Why unsupervised topic discovery and clustering?
– Topics of interest may not be known a priori
– May not be feasible to annotate documents with a large number of topics

Goals
– Discover topics and meaningful topic names
– Cluster topics instead of messages automatically, to organize messages/documents for navigation at multiple levels
[Diagram: Documents → Unsupervised Topic Discovery → Discovered Topics → Hierarchical Clustering → Hierarchical Topic Tree → Navigation GUI]
25
Unsupervised Topic Discovery [3]

Step 1 – Add phrases: augment the input documents with frequent phrases, using an MDL criterion, and names, using IdentiFinderTM

Step 2 – Initial topics for each doc: select words/phrases with the highest tf-idf; keep topic names that occur in >3 documents

Step 3 – Topic training (key step): associate many words/phrases with topics; use EM training in the OnTopicTM system to produce topic models

Step 4 – Topic classification: assign topics to all documents, yielding a topic-annotated corpus

[3] S. Sista et al. An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories. In Proceedings of ACM HLT, San Diego, CA, 2002.
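The tf-idf selection of initial topic names (Step 2 of UTD) can be sketched as follows; the function name and the plain token-list document representation are assumptions for illustration:

```python
from math import log

def top_tfidf_terms(doc_tokens, corpus, k=3):
    # Score each term of one document by tf-idf against the corpus
    # and return the k highest-scoring terms as candidate initial
    # topic names for that document.
    n_docs = len(corpus)
    df = {}  # document frequency of each term
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term) / len(doc_tokens)
        idf = log(n_docs / df[term])
        scores[term] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A word that is frequent in one document but rare across the corpus (e.g., “bailout”) outranks a common word like “the”, whose idf is zero when it appears in every document.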
26
UTD output (English document)
news source: Associated Press – November, 2001
27
UTD output (Arabic document)
News Source: Al_Hayat (Aug-Nov, 2001)
28
Unsupervised Topic Clustering
Organize automatically discovered topics (rather than documents) into a hierarchical topic tree

Leaves of the topic tree are the fine-grained topics discovered by the UTD process

Intermediate nodes are logical collections of topics

Each node in the topic tree has a set of messages associated with it
– A message can be assigned to multiple topic clusters by virtue of the multiple topic labels assigned to it by the UTD process
– Overcomes the single-cluster assignment of a document prevalent in most document clustering approaches

The resulting topic tree enables browsing of the large corpus at multiple levels of granularity
– One can find a message through different sets of logical actions
29
Topic Clustering Algorithm
Agglomerative clustering for organizing topics in a hierarchical tree structure

Topic clustering algorithm:
Step 1: Assign each topic to its own individual cluster
Step 2: For every pair of clusters, compute the distance between the two clusters
Step 3: If the distance of the closest pair is lower than a threshold, merge the pair into a single cluster and go to Step 2; else stop clustering

Modification: merge more than two clusters at each iteration to limit the number of levels in the tree
– Also add other constraints, e.g., limiting the branching factor, the number of levels, etc.
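Steps 1-3 can be sketched as below. Single-linkage (minimum over member pairs) is an assumption here; the slides do not specify how cluster-to-cluster distance is derived from topic-to-topic distance:

```python
def agglomerative_cluster(topics, distance, threshold):
    # Step 1: each topic starts in its own cluster.
    clusters = [[t] for t in topics]
    while len(clusters) > 1:
        # Step 2: distance between every pair of clusters
        # (single linkage: min over member-pair distances).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge the closest pair only while it beats the
        # threshold; otherwise stop.
        if d >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The threshold is what leaves several clusters at the end instead of one all-encompassing root; the modification on the slide (merging more than two clusters per iteration) would flatten the resulting tree.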
30
Distance Metrics for Topic Clustering
Metrics computed from topic co-occurrences:
– Co-occurrence probability
– Mutual information

Metrics computed from support/key word distributions:
– Support word overlap between Ti and Tj
– Kullback-Leibler (KL) and J-divergence between the two probability mass functions

D_MI(Ti, Tj) = P(Ti, Tj) log[ P(Ti, Tj) / (P(Ti) P(Tj)) ]

D_KL(Ti, Tj) = Σ_w P(w | Ti) log[ P(w | Ti) / P(w | Tj) ]
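The divergence-based metrics can be sketched directly from the formulas. The flooring of zero probabilities is an assumption to keep the KL term finite when a support word of Ti is unseen in Tj:

```python
from math import log

def kl_divergence(p, q, floor=1e-12):
    # D_KL(P || Q) = sum_w P(w|Ti) log[ P(w|Ti) / P(w|Tj) ]
    # over the support-word distribution of topic Ti.
    return sum(pi * log(pi / max(q.get(w, 0.0), floor))
               for w, pi in p.items() if pi > 0)

def j_divergence(p, q):
    # Symmetrised KL; symmetry makes it usable as a distance
    # between two topics regardless of argument order.
    return kl_divergence(p, q) + kl_divergence(q, p)

def mi_distance(p_i, p_j, p_ij):
    # D_MI = P(Ti,Tj) log[ P(Ti,Tj) / (P(Ti) P(Tj)) ]: positive
    # when the topics co-occur more often than chance.
    return p_ij * log(p_ij / (p_i * p_j))
```

KL itself is asymmetric, which is presumably why the symmetrised J-divergence is preferred as a clustering distance.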
31
Clustering Example
[Example: starting from the individual topics Insurance, Premiums, Coverage, Abortion Coverage, Pay, and Abortion, the clusters {Abortion Coverage + Pay + Abortion} and {Insurance + Premiums + Coverage} are formed and then merged into a single cluster]
32
Evaluation of UTC
Initial topic clustering experiments performed on the 20 NG corpus
– 3,343 topics discovered from 19K messages
– Allowed a maximum of 4 topics to be clustered at each iteration

Evaluation of UTC has been mostly subjective, with a few objective metrics used to evaluate the clustering

Clustering rate: the rate of increase of clusters with more than one topic seems to be well correlated with subjective judgments

The combination of J-divergence and topic co-occurrence seems to result in the most uniform, logical clusters
33
Key Statistics of the UTC Topic Tree for 20 NG Corpus
Key Feature                  Average  Maximum
Number of Levels             -        6
Branching Factor             2.4      4
No. of topics in a cluster   2.7      22

Measured some key features of the topic tree that could have significant impact on user experience
34
Screenshot of the UTC based Message Browser
[Screenshot of the browser, with callouts for:]
– Form to enter a search query
– Hierarchy of topic clusters
– Tree view depicting the sub-tree from the root to the selected cluster
– Topics associated with all the documents in the selected cluster
– List of documents associated with the selected topic cluster
– List of topics associated with the selected document
– History of the path taken to arrive at the current view
36
Off-topic Message Rejection
A significant fraction of the messages processed by the topic classification system are likely to be off-topic

Rejection problem: design a binary classifier for accepting or rejecting the top-choice topic
– Accepting a message means asserting that the message contains the top-choice topic
– Rejecting a message means asserting that the message does not contain the top-choice topic
37
Rejection Algorithm
Use the General Language (GL) topic model as the model for off-topic messages

Compute the ratio of the log-posteriors of the top-choice topic Tj and the GL topic as a relevance score:

LPR(Tj) = log P(Tj | Message) − log P(GL | Message)

Accept the top-choice topic Tj if LPR(Tj) ≥ θ

The threshold θ can be topic-independent or topic-specific
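A sketch of the accept rule, assuming the two posteriors are given (computing them is the topic classifier's job); the function names are illustrative, not from the paper:

```python
from math import log

def lpr(p_topic_given_msg, p_gl_given_msg):
    # Relevance score: log-posterior of the top-choice topic minus
    # log-posterior of the General Language (off-topic) model.
    return log(p_topic_given_msg) - log(p_gl_given_msg)

def accept_top_choice(p_topic_given_msg, p_gl_given_msg, threshold):
    # Accept when the relevance score clears the threshold, which
    # may be topic-independent or topic-specific.
    return lpr(p_topic_given_msg, p_gl_given_msg) >= threshold
```

Raising the threshold trades false acceptances for false rejections, which is exactly the operating curve compared across threshold-estimation methods later in the talk.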
38
Parametric Topic-Specific Threshold Estimation
Compute the empirical distribution (mean μoff and standard deviation σoff) of the log-likelihood-ratio score for a large corpus of off-topic documents
– Can assume most messages in the corpus are off-topic
– More reliable statistics than if computed for on-topic messages

Normalize the score for a test message before comparing it to a topic-independent threshold:

score_normalized = ( score − μoff(T) ) / σoff(T)

Can be thought of as a transformation of the topic-independent threshold rather than score normalization
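The normalization step can be sketched as below, with the off-topic mean and standard deviation estimated per topic from scores on a (mostly off-topic) chaff corpus; the function names are assumptions for illustration:

```python
def off_topic_stats(chaff_scores):
    # Empirical mean and standard deviation of relevance scores on
    # a corpus assumed to be (mostly) off-topic for this topic.
    mu = sum(chaff_scores) / len(chaff_scores)
    var = sum((s - mu) ** 2 for s in chaff_scores) / len(chaff_scores)
    return mu, var ** 0.5

def normalize_score(score, mu_off, sigma_off):
    # How many off-topic standard deviations the test score sits
    # above the off-topic mean; a single topic-independent
    # threshold can then be applied to this normalized value.
    return (score - mu_off) / sigma_off
```

Because (μoff, σoff) differ per topic, applying one threshold to the normalized score is equivalent to a topic-specific threshold on the raw score.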
39
Parametric Topic-Specific Threshold Estimation
Do a null-hypothesis test using the score distribution of the off-topic messages: a message that is not off-topic is on-topic, and a message several standard deviations away from the off-topic mean is very likely to be on-topic

normalized score(test) = ( score − mean(off-topic messages) ) / standard deviation(off-topic messages)

[Example histogram of normalized test scores, showing the off-topic and on-topic score distributions; the y-axis is scaled to magnify the view for on-topic messages]
40
Non-Parametric Threshold Estimation
Accept the top-choice topic Tj if:

LPR(Tj) ≥ θ(Tj)

Select θ(Tj) by constrained optimization:

min Σi fi(xi)   subject to   Σi xi ≤ k  and  xi ≥ 0

where:
– FA: number of chaff messages incorrectly labelled with a topic
– FR: number of messages with topic i that are rejected
– xi: the FA allotted to topic i
– fi: mapping from FA to FR for topic i
41
Experiment Configuration

Distribution of Messages:
Message Type      Train  Dev.  Validation
In-topic          5.6K   5.6K  2.8K
Off-topic/Chaff   9.6K   9.6K  76K

In-topic messages are from 14 newsgroups of the 20 NG corpus
– Messages from six newsgroups were discarded due to significant subject overlap with the off-topic messages

Off-topic/chaff messages are from two sources:
– the talk.origins newsgroup from the AFE corpus
– a large collection of messages from 4 Yahoo! groups

Used jack-knifing to estimate rejection thresholds on the Train+Dev set and then applied them to the validation set
42
Comparison of Threshold Estimation Techniques
[Plot: %False Acceptances (0-10%) vs. %False Rejections (0-80%) for four methods: topic-ind, param-topic-dep, non-param-topic-dep, and non-param-topic-dep-nochaff]
43
Comparison of Threshold Estimation Techniques
[Same plot zoomed to %False Acceptances 0-2%, comparing topic-ind, param-topic-dep, non-param-topic-dep, and non-param-topic-dep-nochaff]
44
Comparison of Threshold Estimation Techniques
Rejection Method                            %False Rejections @ 1% False Acceptances
Topic-independent thresholds 31.4
Topic-specific thresholds (parametric) 27.4
Topic-specific thresholds (non-parametric) 23.7
45
Conclusions
HMM-based topic classification delivers performance on the 20 NG and AFE corpora comparable to [1], [2]

Closed-set classification accuracy on 20 NG data after clustering is slightly worse than on AFE data
– Key reason is the significant subject overlap between the newsgroups

Clustered categories still exhibited significant subject overlap across clusters
– The data set creators assign only six different subjects (topics) to the 20 NG set

1. J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of ICML 2003, Washington, D.C., 2003.
2. S. Eick, J. Lockwood, R. Loui, J. Moscola, C. Kastner, A. Levine, and D. Weishar. Transformation Algorithms for Data Streams. In Proceedings of IEEE AAC, March 2005.
46
Conclusions (Contd.)
Novel estimation of topic-specific thresholds outperforms a topic-independent threshold for rejection of off-topic messages

Introduced a novel concept of unsupervised topic clustering for organizing messages
– Built a demonstration prototype for topic-tree-based browsing of a large corpus of archived messages

Future work will focus on measuring the utility of UTC for user experience, and on objective metrics to evaluate UTC performance