1
Finding Structure in Noisy Text: Topic Classification and Unsupervised Clustering
Rohit Prasad, Prem Natarajan, Krishna Subramanian, Shirin Saleem, and Rich Schwartz
{rprasad,pnataraj}@bbn.com
Presented by Daniel Lopresti
8th January 2007
2
Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages
3
Objectives
Develop a system that performs topic based categorization of newsgroup messages in two modes
Mode 1 – Supervised classification: topics of interest to the user are known a priori to the system
– Spot messages that are on topics of interest to a user
– Requires rejecting “off-topic” messages to ensure low false alarm rates

Mode 2 – Unsupervised classification: topics of interest to the user are not known
– Discover topics in a large corpus without human supervision
– Automatically organize/cluster the messages to support efficient navigation
4
Challenges Posed by Newsgroup Messages
Text in newsgroup messages tends to be “noisy”
– Abbreviations, misspellings
– Colloquial (non-grammatical) language of messages
– Discursive structure with frequent switching between topics
– Lack of context in some messages makes it impossible to understand the message without access to the complete thread

Supervised classification requires annotation of newsgroup messages with a set of topic labels
– Every non-trivial message contains multiple topics
– No completely annotated corpus of newsgroup messages exists
• By complete annotation we mean tagging each message with ALL relevant topics
6
Supervised Topic Classification
“President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”

[Diagram: text, or audio/images from sources such as CNN, NBC, CBS, and NPR, pass through ASR or OCR; the resulting text is fed to the Topic Classifier, which outputs topics such as: Clinton, Bill; Mexico; Money; Economic assistance, American]

Applications
• News Sorting
• Information Retrieval
• Detection of Key Events
• Improved Speech Recognition

Training
– Topic models are trained from a topic-labeled Broadcast News corpus in which each story carries several topics
– e.g., Primary Source Media: 4-5 topics / story, 40,000 stories / year, 5,000 topics
7
OnTopicTM HMM Topic Model
A probabilistic hidden Markov model (HMM) that attempts to capture the generation of a story

Assumes a story can be on multiple topics, with different words related to different topics

Uses an explicit state for General Language, because most words in a story are not related to any topic

Scalable to a large number of topics; training the model requires only the topic labels for each story

Language-independent methodology
[Diagram: HMM with states story-start → {T0 (General Language), T1, T2, …, TM} → story-end, with a loop back to topic selection for each word; the topic set has prior P(Set), topic transitions are governed by P(Tj | Set), and word emissions by Πn P(Wn | Tj)]
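As a rough sketch (not the OnTopic implementation, which also models state transitions), the generative idea in the diagram can be written as a per-word topic mixture: each word is emitted by one topic in the story's set, including T0 (General Language). All names and probabilities below are invented toy values:

```python
from math import log

def story_log_likelihood(words, topic_set, p_topic_given_set, p_word_given_topic):
    # For each word: pick a topic Tj from the story's set with
    # P(Tj | Set), then emit the word with P(w | Tj). Unseen words
    # get a tiny floor probability so the log stays finite.
    total = 0.0
    for w in words:
        mix = sum(p_topic_given_set[t] * p_word_given_topic[t].get(w, 1e-9)
                  for t in topic_set)
        total += log(mix)
    return total

# Invented toy model: General Language (T0) absorbs common words.
p_t = {"T0": 0.5, "mexico": 0.3, "economy": 0.2}
p_w = {
    "T0": {"the": 0.5, "today": 0.2, "he": 0.2, "plan": 0.1},
    "mexico": {"mexican": 0.6, "bailout": 0.4},
    "economy": {"bailout": 0.4, "plan": 0.3, "approval": 0.3},
}
score = story_log_likelihood(["mexican", "bailout", "plan"],
                             ["T0", "mexico", "economy"], p_t, p_w)
```

The General Language state matters: a topical word like "bailout" is far more likely under a topic set that contains a relevant topic than under T0 alone.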
9
Experiment Setup
Performed experiments with two newsgroup corpora
– Automated Front End (AFE) newsgroup corpus collected by Washington Univ.
– 20 Newsgroup (NG) corpus from http://people.csail.mit.edu/jrennie/20Newsgroups/

Assumed the name of the newsgroup is the ONLY topic associated with each message

Although cost effective, this assumption leads to inaccuracies in estimating system performance
– Messages typically contain multiple topics, some of which may be related to the dominant theme of another newsgroup
10
AFE Newsgroups Corpus
Google newsgroups data collected by Washington University from 12 diverse newsgroups
Messages posted to 11 newsgroups are considered to be in-topic and all messages posted to the “talk.origins” newsgroup are considered to be off-topic
Message headers were stripped to exclude newsgroup name from training and test messages
Split the corpus into training, test, and, validation sets according to the distribution specified in the config.xml file provided by the Washington University– But since the filenames were truncated we could not select the
same messages as Washington University
11
AFE Newsgroups Corpus
Newsgroup                              #Training  #Test  #Validation
Alt.sports.baseball.stl_cardinals 21 33 10
Comp.ai.neural_nets 15 25 7
Comp.programming.threads 31 47 15
Humanities.musics.composers.wagner 19 31 9
Misc.consumers.frugal_living 10 17 5
Misc.writing.moderated 24 37 12
Rec.Equestrian 27 41 13
Rec.martial_arts.moderated 18 29 9
Sci.archaelogy.moderated 46 69 23
Sci.logic 20 30 10
Soc.libraries.talk 10 17 5
Talk.origins (Chaff) 245 10401 122
Total Number of Messages (w/o chaff) 241 376 118
Total Number of Messages (w/ chaff) 486 10777 240
Total Number of Words (w/o chaff) 103K 118K 32K
Total Number of Words (w/ chaff) 187K 3.4M 63K
12
Closed-set Classification Accuracy on AFE
Trained OnTopic models on 11 newsgroups
– Excluded messages from the talk.origins newsgroup because they are “off-topic” w.r.t. the topics of interest
– Used stemming since some newsgroups had only a few training messages

Classified 376 in-topic messages

Achieved overall top-choice accuracy of 91.2%
– Top-choice accuracy: percentage of times the top-choice (best) topic returned by OnTopic was the correct answer

Top-choice accuracy was worse on newsgroups with fewer training examples
13
Closed-set Classification Accuracy (Contd.)
Newsgroup #Training Messages %Top-Choice Accuracy
Misc.consumers.frugal_living 10 47.1%
Soc.libraries.talk 10 58.8%
Comp.ai.neural_nets 15 80.0%
Rec.martial_arts.moderated 18 86.2%
Humanities.musics.composers.wagner 19 100.0%
Sci.logic 20 96.7%
Alt.sports.baseball.stl_cardinals 21 100.0%
Misc.writing.moderated 24 91.9%
Rec.Equestrian 27 97.6%
Comp.programming.threads 31 100.0%
Sci.archaelogy.moderated 46 95.7%
Overall 241 91.2%
14
“20 Newsgroups” Corpus
Downloaded the 20 Newsgroups Corpus (“20 NG”) from http://people.csail.mit.edu/jrennie/20Newsgroups/

Corpus characteristics:
– Messages from 20 newsgroups with an average of 941 messages per newsgroup
– Average of 350 threads in each newsgroup
– Average message length of 300 words (170 words after headers and “replied-to” text are excluded)
– Some newsgroups are similar – the 20 newsgroups span 6 broad subjects

Data pre-processing
– Stripped message headers, e-mail IDs, and signatures to exclude newsgroup-related information

Corpus was split into training, development, and validation sets for topic classification experiments
15
Distribution of Messages Across Newsgroups
Newsgroup Total Messages Unique Threads Messages Per Thread
alt.atheism 799 87 9.2
comp.graphics 973 532 1.8
comp.os.ms-windows.misc 985 479 2.1
comp.sys.ibm.pc.hardware 982 536 1.8
comp.sys.mac.hardware 961 467 2.1
comp.windows.x 980 773 1.3
misc.forsale 972 877 1.1
rec.autos 990 260 3.8
rec.motorcycles 994 177 5.6
rec.sport.baseball 994 272 3.7
rec.sport.hockey 999 346 2.9
sci.crypt 991 216 4.6
sci.electronics 981 395 2.5
sci.med 990 314 3.2
sci.space 987 296 3.3
soc.religion.christian 997 295 3.4
talk.politics.guns 910 145 6.3
talk.politics.mideast 940 307 3.1
talk.politics.misc 775 133 5.8
talk.religion.misc 628 103 6.1
Average 941 350 3.7
16
Organization of Newsgroups By Subject Matter
– comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
– rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
– sci.crypt, sci.electronics, sci.med, sci.space
– misc.forsale
– talk.politics.misc, talk.politics.guns, talk.politics.mideast
– talk.religion.misc, alt.atheism, soc.religion.christian
17
Splits for Training and Testing
80:20 split between training and test/validation sets for three different partitioning schemes

Thread Partitioning: entire thread is assigned to one of training, development, or validation sets

Chronological Partitioning: messages in each thread are split between training, test, and validation; first 80% in training, and the rest in test and validation

Random Partitioning: 80:20 split between training and test/validation, without regard to thread or chronology
– Prior work by other researchers with 20 NG used random partitioning
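The thread and chronological schemes can be sketched as below; the function names and the `threads` representation (a dict mapping a thread id to its messages in time order) are assumptions for illustration, not the authors' code:

```python
import random

def thread_partition(threads, train_frac=0.8, seed=0):
    # Thread partitioning: every message of a thread lands on the
    # same side of the split.
    rng = random.Random(seed)
    ids = sorted(threads)
    rng.shuffle(ids)
    cut = int(train_frac * len(ids))
    train = [m for t in ids[:cut] for m in threads[t]]
    test = [m for t in ids[cut:] for m in threads[t]]
    return train, test

def chronological_partition(threads, train_frac=0.8):
    # Chronological partitioning: within each thread (messages in
    # time order), the first 80% train and the remainder test.
    train, test = [], []
    for t in sorted(threads):
        msgs = threads[t]
        cut = int(train_frac * len(msgs))
        train += msgs[:cut]
        test += msgs[cut:]
    return train, test
```

Thread partitioning is the strictest condition: no “replied-to” text from a training message can leak into a test message of the same thread, which is consistent with it showing the lowest accuracy in the results that follow.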
18
Closed-set Classification Results
Test Message Type          %Top-Choice Accuracy
                           Thread  Chronological  Random
w/o “replied-to” text      74.5    77.8           79.7
w/ “replied-to” text       76.0    79.6           83.2
Trained an OnTopic model set consisting of 20 topics

Classified 2K test messages
– Two test conditions: one where “replied-to” text (from previous messages) is included and one where it is stripped from the test message

Classification accuracy is low due to the following
– Significant subject overlap between newsgroups
– Lack of useful a priori probabilities due to the almost uniform distribution of topics, unlike the AFE newsgroup data
19
Detailed Results for Thread Partitioning

Newsgroup                     %Top-Choice Accuracy    Top Confusion
talk.religion.misc 29.3 talk.politics.guns
misc.forsale 51.0 comp.os.ms-windows.misc
talk.politics.misc 57.5 talk.politics.guns
sci.electronics 58.3 rec.autos
comp.os.ms-windows.misc 62.0 comp.sys.mac.hardware
alt.atheism 63.4 soc.religion.christian
comp.graphics 68.6 comp.windows.x
comp.sys.ibm.pc.hardware 72.6 comp.os.ms-windows.misc
comp.sys.mac.hardware 74.5 comp.sys.ibm.pc.hardware
comp.windows.x 77.1 comp.sys.ibm.pc.hardware
rec.motorcycles 81.9 rec.autos
talk.politics.guns 82.9 sci.crypt
talk.politics.mideast 84.6 rec.motorcycles
soc.religion.christian 87.4 sci.med
sci.crypt 89.0 talk.politics.guns
rec.sport.baseball 90.7 rec.sport.hockey
rec.autos 93.4 misc.forsale
sci.med 93.6 misc.forsale
rec.sport.hockey 94.2 rec.sport.baseball
sci.space 94.6 rec.autos
Overall 76.0
20
Detailed Results for Chronological Partitioning

Newsgroup                     %Top-Choice Accuracy    Top Confusion
talk.religion.misc 35.0 alt.atheism
misc.forsale 53.8 comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc 62.9 comp.windows.x
comp.graphics 63.9 comp.os.ms-windows.misc
sci.electronics 64.3 rec.autos
talk.politics.misc 71.4 talk.politics.guns
alt.atheism 72.2 soc.religion.christian
comp.sys.ibm.pc.hardware 73.2 comp.os.ms-windows.misc
comp.sys.mac.hardware 75.8 comp.sys.ibm.pc.hardware
comp.windows.x 81.4 comp.os.ms-windows.misc
rec.motorcycles 86.7 rec.autos
sci.med 86.9 sci.space
rec.autos 88.7 comp.os.ms-windows.misc
talk.politics.guns 90.1 talk.politics.misc
talk.politics.mideast 90.2 alt.atheism
sci.space 91.9 comp.graphics
rec.sport.baseball 92.9 rec.sport.hockey
soc.religion.christian 94.0 alt.atheism
sci.crypt 96.0 sci.electronics
rec.sport.hockey 98.0 sci.med
Overall 79.6
21
Manual Clustering and Human Review
Manually clustered newsgroups into 12 topics after reviewing the content of training messages

Recomputed top-choice classification accuracy using the cluster information

Clustering        %Top-Choice Accuracy
                  Thread  Chronological  Random
w/o Clustering    76.0    79.6           83.2
w/ Clustering     81.5    84.8           88.2

Effect of the presence of multiple topics in a message and an incomplete reference topic label set
– Manually reviewed messages from the 4 categories with the lowest performance for the “Chronological” split
– Accuracy increases to 88.0% (from 84.8%) following manual rescoring
22
Cluster Table
Topic Cluster Newsgroup(s)
Autos rec.autos, rec.motorcycles
Graphics comp.graphics
Macintosh comp.sys.mac.hardware
Misc.forsale misc.forsale
Politics talk.politics.guns, talk.politics.mideast, talk.politics.misc
Windows comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.windows.x
Religion soc.religion.christian, talk.religion.misc, alt.atheism
Sports rec.sport.baseball, rec.sport.hockey
sci.crypt sci.crypt
sci.electronics sci.electronics
sci.med sci.med
sci.space sci.space
24
The Problem
Why unsupervised topic discovery and clustering?
– Topics of interest may not be known a priori
– May not be feasible to annotate documents with a large number of topics

Goals
– Discover topics and meaningful topic names
– Cluster topics instead of messages automatically, to organize messages/documents for navigation at multiple levels
[Diagram: Documents → Unsupervised Topic Discovery → Discovered Topics → Hierarchical Clustering → Hierarchical Topic Tree → Navigation GUI]
25
Unsupervised Topic Discovery [3]

Step 1 – Add phrases: augment the input documents with frequent phrases, using an MDL criterion, and names, using IdentiFinderTM

Step 2 – Initial topics for each doc: select words/phrases with the highest tf-idf; keep topic names that occur in >3 documents

Step 3 – Topic training (key step): associate many words/phrases with topics; use EM training in the OnTopicTM system to produce topic models

Step 4 – Topic classification: assign topics to all documents, yielding a topic-annotated corpus

[3] S. Sista et al. An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories. In Proceedings of ACM HLT, San Diego, CA, 2002.
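The tf-idf selection of initial topic names (Step 2 of UTD) can be sketched as follows; the function name and the plain token-list document representation are assumptions for illustration:

```python
from math import log

def top_tfidf_terms(doc_tokens, corpus, k=3):
    # Score each term of one document by tf-idf against the corpus
    # and return the k highest-scoring terms as candidate initial
    # topic names for that document.
    n_docs = len(corpus)
    df = {}  # document frequency of each term
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term) / len(doc_tokens)
        idf = log(n_docs / df[term])
        scores[term] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A word that is frequent in one document but rare across the corpus (e.g., “bailout”) outranks a common word like “the”, whose idf is zero when it appears in every document.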
26
UTD output (English document)
news source: Associated Press – November, 2001
27
UTD output (Arabic document)
News Source: Al_Hayat (Aug-Nov, 2001)
28
Unsupervised Topic Clustering
Organize automatically discovered topics (rather than documents) into a hierarchical topic tree

Leaves of the topic tree are the fine-grained topics discovered by the UTD process

Intermediate nodes are logical collections of topics

Each node in the topic tree has a set of messages associated with it
– A message can be assigned to multiple topic clusters by virtue of the multiple topic labels assigned to it by the UTD process
– Overcomes the single-cluster assignment of a document prevalent in most document clustering approaches

The resulting topic tree enables browsing of the large corpus at multiple levels of granularity
– One can find a message through different sets of logical actions
29
Topic Clustering Algorithm
Agglomerative clustering for organizing topics in a hierarchical tree structure

Topic clustering algorithm:
Step 1: Assign each topic to its own individual cluster
Step 2: For every pair of clusters, compute the distance between the two clusters
Step 3: If the distance of the closest pair is lower than a threshold, merge the pair into a single cluster and go to Step 2; else stop clustering

Modification: merge more than two clusters at each iteration to limit the number of levels in the tree
– Also add other constraints, e.g., limiting the branching factor, the number of levels, etc.
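Steps 1-3 can be sketched as below. Single-linkage (minimum over member pairs) is an assumption here; the slides do not specify how cluster-to-cluster distance is derived from topic-to-topic distance:

```python
def agglomerative_cluster(topics, distance, threshold):
    # Step 1: each topic starts in its own cluster.
    clusters = [[t] for t in topics]
    while len(clusters) > 1:
        # Step 2: distance between every pair of clusters
        # (single linkage: min over member-pair distances).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge the closest pair only while it beats the
        # threshold; otherwise stop.
        if d >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The threshold is what leaves several clusters at the end instead of one all-encompassing root; the modification on the slide (merging more than two clusters per iteration) would flatten the resulting tree.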
30
Distance Metrics for Topic Clustering
Metrics computed from topic co-occurrences:
– Co-occurrence probability
– Mutual information

Metrics computed from support/key word distributions:
– Support word overlap between Ti and Tj
– Kullback-Leibler (KL) and J-divergence between the two probability mass functions

D_MI(Ti, Tj) = P(Ti, Tj) log[ P(Ti, Tj) / (P(Ti) P(Tj)) ]

D_KL(Ti, Tj) = Σ_w P(w | Ti) log[ P(w | Ti) / P(w | Tj) ]
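The divergence-based metrics can be sketched directly from the formulas. The flooring of zero probabilities is an assumption to keep the KL term finite when a support word of Ti is unseen in Tj:

```python
from math import log

def kl_divergence(p, q, floor=1e-12):
    # D_KL(P || Q) = sum_w P(w|Ti) log[ P(w|Ti) / P(w|Tj) ]
    # over the support-word distribution of topic Ti.
    return sum(pi * log(pi / max(q.get(w, 0.0), floor))
               for w, pi in p.items() if pi > 0)

def j_divergence(p, q):
    # Symmetrised KL; symmetry makes it usable as a distance
    # between two topics regardless of argument order.
    return kl_divergence(p, q) + kl_divergence(q, p)

def mi_distance(p_i, p_j, p_ij):
    # D_MI = P(Ti,Tj) log[ P(Ti,Tj) / (P(Ti) P(Tj)) ]: positive
    # when the topics co-occur more often than chance.
    return p_ij * log(p_ij / (p_i * p_j))
```

KL itself is asymmetric, which is presumably why the symmetrised J-divergence is preferred as a clustering distance.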
31
Clustering Example
[Example: starting from the individual topics Insurance, Premiums, Coverage, Abortion Coverage, Pay, and Abortion, the clusters {Abortion Coverage + Pay + Abortion} and {Insurance + Premiums + Coverage} are formed and then merged into a single cluster]
32
Evaluation of UTC
Initial topic clustering experiments performed on the 20 NG corpus
– 3,343 topics discovered from 19K messages
– Allowed a maximum of 4 topics to be clustered at each iteration

Evaluation of UTC has been mostly subjective, with a few objective metrics used to evaluate the clustering

Clustering rate: the rate of increase of clusters with more than one topic seems to be well correlated with subjective judgments

The combination of J-divergence and topic co-occurrence seems to result in the most uniform, logical clusters
33
Key Statistics of the UTC Topic Tree for 20 NG Corpus
Key Feature                  Average  Maximum
Number of Levels             -        6
Branching Factor             2.4      4
No. of topics in a cluster   2.7      22

Measured some key features of the topic tree that could have significant impact on user experience
34
Screenshot of the UTC based Message Browser
[Screenshot of the browser, with callouts for:]
– Form to enter a search query
– Hierarchy of topic clusters
– Tree view depicting the sub-tree from the root to the selected cluster
– Topics associated with all the documents in the selected cluster
– List of documents associated with the selected topic cluster
– List of topics associated with the selected document
– History of the path taken to arrive at the current view
36
Off-topic Message Rejection
A significant fraction of the messages processed by the topic classification system are likely to be off-topic

Rejection problem: design a binary classifier for accepting or rejecting the top-choice topic
– Accepting a message means asserting that the message contains the top-choice topic
– Rejecting a message means asserting that the message does not contain the top-choice topic
37
Rejection Algorithm
Use the General Language (GL) topic model as the model for off-topic messages

Compute the ratio of the log-posteriors of the top-choice topic Tj and the GL topic as a relevance score:

LPR(Tj) = log P(Tj | Message) − log P(GL | Message)

Accept the top-choice topic Tj if LPR(Tj) ≥ θ

The threshold θ can be topic-independent or topic-specific
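A sketch of the accept rule, assuming the two posteriors are given (computing them is the topic classifier's job); the function names are illustrative, not from the paper:

```python
from math import log

def lpr(p_topic_given_msg, p_gl_given_msg):
    # Relevance score: log-posterior of the top-choice topic minus
    # log-posterior of the General Language (off-topic) model.
    return log(p_topic_given_msg) - log(p_gl_given_msg)

def accept_top_choice(p_topic_given_msg, p_gl_given_msg, threshold):
    # Accept when the relevance score clears the threshold, which
    # may be topic-independent or topic-specific.
    return lpr(p_topic_given_msg, p_gl_given_msg) >= threshold
```

Raising the threshold trades false acceptances for false rejections, which is exactly the operating curve compared across threshold-estimation methods later in the talk.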
38
Parametric Topic-Specific Threshold Estimation
Compute the empirical distribution (mean μoff and standard deviation σoff) of the log-likelihood-ratio score for a large corpus of off-topic documents
– Can assume most messages in the corpus are off-topic
– More reliable statistics than if computed for on-topic messages

Normalize the score for a test message before comparing it to a topic-independent threshold:

score_normalized = ( score − μoff(T) ) / σoff(T)

Can be thought of as a transformation of the topic-independent threshold rather than score normalization
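The normalization step can be sketched as below, with the off-topic mean and standard deviation estimated per topic from scores on a (mostly off-topic) chaff corpus; the function names are assumptions for illustration:

```python
def off_topic_stats(chaff_scores):
    # Empirical mean and standard deviation of relevance scores on
    # a corpus assumed to be (mostly) off-topic for this topic.
    mu = sum(chaff_scores) / len(chaff_scores)
    var = sum((s - mu) ** 2 for s in chaff_scores) / len(chaff_scores)
    return mu, var ** 0.5

def normalize_score(score, mu_off, sigma_off):
    # How many off-topic standard deviations the test score sits
    # above the off-topic mean; a single topic-independent
    # threshold can then be applied to this normalized value.
    return (score - mu_off) / sigma_off
```

Because (μoff, σoff) differ per topic, applying one threshold to the normalized score is equivalent to a topic-specific threshold on the raw score.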
39
Parametric Topic-Specific Threshold Estimation
Do a null-hypothesis test using the score distribution of the off-topic messages: a message that is not off-topic is on-topic, and a message several standard deviations away from the off-topic mean is very likely to be on-topic

normalized score(test) = ( score − mean(off-topic messages) ) / standard deviation(off-topic messages)

[Example histogram of normalized test scores, showing the off-topic and on-topic score distributions; the y-axis is scaled to magnify the view for on-topic messages]
40
Non-Parametric Threshold Estimation
Accept the top-choice topic Tj if:

LPR(Tj) ≥ θ(Tj)

Select θ(Tj) by constrained optimization:

min Σi fi(xi)   subject to   Σi xi ≤ k  and  xi ≥ 0

where:
– FA: number of chaff messages incorrectly labelled with a topic
– FR: number of messages with topic i that are rejected
– xi: the FA allotted to topic i
– fi: mapping from FA to FR for topic i
41
Experiment Configuration

Distribution of Messages:
Message Type      Train  Dev.  Validation
In-topic          5.6K   5.6K  2.8K
Off-topic/Chaff   9.6K   9.6K  76K

In-topic messages are from 14 newsgroups of the 20 NG corpus
– Messages from six newsgroups were discarded due to significant subject overlap with the off-topic messages

Off-topic/chaff messages are from two sources:
– the talk.origins newsgroup from the AFE corpus
– a large collection of messages from 4 Yahoo! groups

Used jack-knifing to estimate rejection thresholds on the Train+Dev set and then applied them to the validation set
42
Comparison of Threshold Estimation Techniques
[Plot: %False Acceptances (0-10%) vs. %False Rejections (0-80%) for four methods: topic-ind, param-topic-dep, non-param-topic-dep, and non-param-topic-dep-nochaff]
43
Comparison of Threshold Estimation Techniques
[Same plot zoomed to %False Acceptances 0-2%, comparing topic-ind, param-topic-dep, non-param-topic-dep, and non-param-topic-dep-nochaff]
44
Comparison of Threshold Estimation Techniques
Rejection Method                            %False Rejections @ 1% False Acceptances
Topic-independent thresholds 31.4
Topic-specific thresholds (parametric) 27.4
Topic-specific thresholds (non-parametric) 23.7
45
Conclusions
HMM-based topic classification delivers performance on the 20 NG and AFE corpora comparable to [1], [2]

Closed-set classification accuracy on 20 NG data after clustering is slightly worse than on AFE data
– Key reason is the significant subject overlap between the newsgroups

Clustered categories still exhibited significant subject overlap across clusters
– The data set creators assign only six different subjects (topics) to the 20 NG set

1. J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of ICML 2003, Washington, D.C., 2003.
2. S. Eick, J. Lockwood, R. Loui, J. Moscola, C. Kastner, A. Levine, and D. Weishar. Transformation Algorithms for Data Streams. In Proceedings of IEEE AAC, March 2005.
46
Conclusions (Contd.)
Novel estimation of topic-specific thresholds outperforms a topic-independent threshold for rejection of off-topic messages

Introduced a novel concept of unsupervised topic clustering for organizing messages
– Built a demonstration prototype for topic-tree-based browsing of a large corpus of archived messages

Future work will focus on measuring the utility of UTC for user experience, and on objective metrics to evaluate UTC performance