Parameter Free Bursty Events Detection in Text Streams
Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu
Systems Engineering and Engineering Management, The Chinese University of Hong Kong
VLDB 2005


Page 1:

Parameter Free Bursty Events Detection in Text Streams

Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu

Systems Engineering and Engineering Management, The Chinese University of Hong Kong

VLDB 2005

Page 2:

Outline

Introduction
– Bursty events? Text streams? Etc.

A Possible Method
– Document pivot clustering

Proposed Work
– Feature pivot clustering

Results Highlight

Related Works

Summary & Future Work


Page 4:

Introduction (1 of 5)

Parameter Free Bursty Events Detection in Text Streams

Page 5:

Introduction (2 of 5)

Parameter Free Bursty Events Detection in Text Streams
– Text stream: a sequence of documents organized temporally
» E.g. news stories and e-mails
– Two kinds of stream: online vs. offline
» Online stream: open-ended.
» Offline stream: has boundaries.

Page 6:

Introduction (3 of 5)

Parameter Free Bursty Events Detection in Text Streams
– An event consists of a set of features that are useful for identifying (understanding) the event.
– A bursty event is an event that is “hot” in a specific period of time.
– We call the features that are used to identify a bursty event bursty features.
– E.g. the event “SARS” consists of the features “Outbreak, Atypic, Respire, …”

[Figure: no. of news stories against time for an event, e.g. SARS]

Page 7:

Introduction (4 of 5)

Parameter Free Bursty Events Detection in Text Streams
– Given a text stream, find all of the bursty events
» In other words, find all of the bursty features (features that are “hot” in a specific period) and group them logically, such that the bursty features grouped together are useful for identifying an event.

Page 8:

Introduction (5 of 5)

Parameter Free Bursty Events Detection in Text Streams
– Parameter free: you do not need to tune the parameters yourself
» The framework is applicable to any corpus
» No fine tuning is necessary
» No parameter needs to be estimated
– Why is being parameter free useful?
» Without any prior knowledge about the information in a database, it is rather difficult to make any initial estimate
» In our problem, we are trying to identify the bursty events in a text stream. We do not have any prior knowledge about the information in the database. We do not know what it contains. We do not even know whether there is any burst. We do not know…

Page 9:

Problem Setting

Data archived
– Source: local news stories (South China Morning Post)
– Period: 2003-01-01 to 2004-12-31

Some major settings
– Offline detection
– News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch
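
The daily batching described in the problem setting can be sketched as follows. This is only an illustration: the `stories` list, its dates and texts, and the `batch_by_day` helper are hypothetical names and values, not something taken from the talk.

```python
from collections import defaultdict
from datetime import date

def batch_by_day(stories):
    """Group (release_date, text) pairs into one batch per release date."""
    batches = defaultdict(list)
    for release_date, text in stories:
        batches[release_date].append(text)
    # Return the batches in chronological order.
    return dict(sorted(batches.items()))

# Hypothetical stories; real input would be the archived news stream.
stories = [
    (date(2003, 1, 2), "story C"),
    (date(2003, 1, 1), "story A"),
    (date(2003, 1, 1), "story B"),
]
batches = batch_by_day(stories)
```

Each key of `batches` then plays the role of one time window (one day) in the rest of the framework.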


Page 11:

Document Pivot Clustering Approach (1 of 3)

A possible method (not our approach)
– Step 1
» Objective: group similar events together
» Method: use clustering to group similar documents together (e.g. K-Means)
– Step 2
» Objective: extract the keywords of each event
» Method: use feature selection (e.g. information gain)

[Figure: All News Stories, split via clustering into Group 1, Group 2, … (Step 1); the key features of each group are then extracted (Step 2)]

Page 12:

Document Pivot Clustering Approach (2 of 3)

Some difficulties
1. The most similar documents may not report the same event
– From our experiments, we found that two documents that are the most similar in terms of their features may not necessarily report the same event
2. Clustering requires feature weightings (e.g. tf-idf)
– Feature weighting originated in IR. Its idea is that features appearing in fewer documents in the domain are more useful (obtain higher weights).
– For clustering: features that appear in many documents in a certain period should obtain higher weights.
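
To make the second difficulty concrete, here is a minimal sketch of how an inverse-document-frequency weight reacts when a feature suddenly appears in many documents. The corpus size and document counts are made up for illustration, and `idf` is the textbook log(N / df) form, which is an assumption rather than the weighting used in the authors' experiments.

```python
import math

def idf(total_docs: int, docs_with_feature: int) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_feature)

# Hypothetical corpus of 10,000 stories.
total_docs = 10_000
quiet_df = 20       # the feature is mentioned in 20 stories before its event
bursty_df = 2_000   # the same feature once the event dominates the news

quiet_weight = idf(total_docs, quiet_df)
bursty_weight = idf(total_docs, bursty_df)

# The weight drops exactly when the feature becomes "hot".
print(f"quiet: {quiet_weight:.2f}, bursty: {bursty_weight:.2f}")
```

Under this weighting the feature is penalised precisely during the period in which it would be most useful for spotting a bursty event.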

Page 13:

Document Pivot Clustering Approach (3 of 3)

Some difficulties (cont’d)
3. A long-running event may be broken down into several small pieces
– This phenomenon appears in many reported studies (esp. in TDT)
4. Difficult to figure out the bursty features
– Assume clustering can determine bursty events. However, there can be many clusters that are not “hot” (important). Determining which of the clusters are “hot” is difficult (it may require a ranking function, which is difficult to derive).


Page 15:

Feature Pivot Clustering Approach

Overview of the framework
– Step 1
» Identify the bursty features
– Step 2
» Group the bursty features into bursty events
– Step 3
» Determine the hot periods of the bursty events

[Figure: All News Stories, from which all features are extracted; the bursty features are identified (Step 1) and clustered into Event 1, Event 2, … (Step 2); the hot period of each event is then determined (Step 3)]


Page 17:

Identify the Bursty Features (1 of 7)

General idea
– Given a single feature, f, try to figure out whether it contains any bursty period.
– If so, then it is a bursty feature (in some specific periods)

[Figure: the distribution of a feature, f, among documents (no. of docs containing the feature, f, against time), with the bursty period marked]

Page 18:

Identify the Bursty Features (2 of 7)

Some more examples

[Figure: four plots of no. of docs containing the feature, f, against time, labelled: No burst; Not a burst (stopword); Burst without fading away; Two bursts]

Page 19:

Identify the Bursty Features (3 of 7)

An obvious approach to discover whether a feature is a bursty feature is to use a “threshold cut”

[Figure: the distribution of a feature, f, among documents (no. of docs containing the feature, f, against time), with a threshold line and the bursty period marked]

Page 20:

Identify the Bursty Features (4 of 7)

Challenges
– Setting one single threshold for all features is impossible

Another attempt: set a “percentage cut”
– Figure out the relative difference between the max and min of the “no. of docs containing the feature”

[Figure: two plots of no. of docs containing the feature, f, against time, one for a stop-word and one for a normal non-bursty feature, with a threshold line]

Page 21:

Identify the Bursty Features (5 of 7)

Challenges
– Setting a percentage cut is also impossible
» Different features have different distributions:

[Figure: two plots of no. of docs containing the feature, f, against time, with the values 500 and 300 marked]
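
The two rejected cuts can be illustrated with a few made-up daily counts. Everything in this sketch is an assumption for illustration: the three count series, the threshold of 100, and the helper names `threshold_cut` and `relative_range` do not come from the talk.

```python
def threshold_cut(counts, threshold):
    """Return the indices of the days whose count exceeds a fixed threshold."""
    return [day for day, count in enumerate(counts) if count > threshold]

def relative_range(counts):
    """Relative difference between the max and min count: (max - min) / max."""
    hi, lo = max(counts), min(counts)
    return (hi - lo) / hi if hi else 0.0

# Hypothetical daily document counts for three features.
stop_word  = [480, 500, 470, 490, 510, 495]  # frequent every day, never bursty
rare_burst = [0, 1, 0, 30, 28, 1]            # rare feature with a real burst
rare_noise = [0, 1, 0, 2, 0, 1]              # rare feature with no burst

# A single threshold flags the stop-word on every day and misses the burst.
stop_word_days = threshold_cut(stop_word, 100)
burst_days = threshold_cut(rare_burst, 100)

# A percentage cut cannot separate the real burst from noise on a rare feature.
burst_range = relative_range(rare_burst)
noise_range = relative_range(rare_noise)
```

Here `stop_word_days` covers all six days, `burst_days` is empty, and both rare features reach the same relative range, so neither cut works across features with different distributions.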

Page 22:

Identify the Bursty Features (6 of 7)

Our solution
– Treat each feature in the text stream as a probabilistic distribution
– For each day, we compute the probability of the number of documents that contain a particular feature, f_j
» What we have:
N’ – no. of news stories in the stream
n’ – no. of news stories in a time window (one day)
K’ – no. of news stories that contain the specific feature
n’ - K’ – no. of news stories that do not contain the specific feature
» We can model the distribution of a feature in a time window (i.e. in a day) by a binomial distribution (the above four elements are enough for computing the binomial distribution)

(Continued on next page)

Page 23:

Identify the Bursty Features (7 of 7)

– If in any time window (day) the value of the binomial distribution (the probability of the number of documents that contain the feature) changes significantly, then it implies that the feature exhibits “abnormal” behavior
» The reason is that if the features are generated from an unknown probability distribution, then the value of the binomial distribution in each time window (each day) should be more or less constant
– Two reasons that it drops significantly:
» Suddenly very few documents contain the specific feature
We are not interested in this kind of observation, as it only tells us that the specific feature is NOT a bursty feature in the corresponding time window (day). It gives no insight about whether it is a bursty feature NOW.
» Suddenly many documents contain the specific feature
We are interested in this kind of feature
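
One way to read the binomial model on these two slides is sketched below. It rests on assumptions the slides do not state: the baseline rate `p` is estimated by pooling all days, "abnormal" is measured by the upper-tail probability of seeing at least the observed count, and days are ranked rather than cut at a fixed value. The daily counts and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k): chance of at least k of n stories containing the feature."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

# Hypothetical days: (stories that day, stories containing the feature).
days = [(100, 2), (120, 3), (90, 1), (110, 40), (100, 2)]

total_stories = sum(n for n, _ in days)   # stories in the whole stream
total_hits = sum(k for _, k in days)      # stories containing the feature
p = total_hits / total_stories            # assumed baseline rate (pooled)

# A small upper-tail value means "suddenly many documents contain the feature".
surprise = [upper_tail(k, n, p) for n, k in days]
burst_day = min(range(len(days)), key=lambda i: surprise[i])
```

Only the upper tail is used, which matches the slide's point that a sudden drop in documents containing the feature is not of interest.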


Page 26:

Group the Bursty Features (1 of 2)

General idea
– Group the features that always appear together
» If the features always appear together, they should be discussing the same event
– Cluster the features

Challenge
– Should we group these two features together?
» Situation:
Whenever Feature B appears, Feature A always appears also.
Feature A appears in 1,000 stories. Feature B appears in 200 stories.
» We claim that they should not be grouped together, as Feature B is only a subset of Feature A.
We want to group the features at the “same level”
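
The asymmetry in this situation can be checked with a small sketch. The story counts (1,000 and 200) come from the slide; the document IDs and the `conditional` helper are hypothetical and only serve the illustration.

```python
def conditional(docs_given: set, docs_target: set) -> float:
    """Fraction of the stories in docs_given that are also in docs_target."""
    return len(docs_given & docs_target) / len(docs_given)

# Hypothetical story IDs: B's 200 stories are a subset of A's 1,000 stories.
docs_a = set(range(1000))
docs_b = set(range(200))

p_a_given_b = conditional(docs_b, docs_a)  # B appears, so A appears
p_b_given_a = conditional(docs_a, docs_b)  # A appears, B usually does not
```

`p_a_given_b` is 1.0 while `p_b_given_a` is only 0.2, so co-occurrence in one direction alone does not show that the two features sit at the “same level”.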

Page 27:

Group the Bursty Features (2 of 2)

Our solution
– We try to figure out the probability of the features being grouped together, given the observed document distribution of the text stream
» Find the maximum probability that the features would be grouped together (Expectation-Maximization, EM)
– Mathematically,

[Equation not preserved in this transcript]


Page 30:

Determine the Hot Periods

General idea
– The hot period is the one with the highest average probability that the bursty features appear together

Graphically

[Figure: document distribution against time]


Page 33:

Results Highlight

Some events

Bursty Events    Bursty Features
SARS             Sars, Outbreak, Atypic, Respire, …
Legislation      Article, Yip, Law, Rally, …
Bird Flu         Bird, Flu
Taiwan Issue     Taiwan, Chen, Shu, Bian
Iraq War         Iraq, War, Saddam, …
Gas              Victim, Might, Accident, Gas

Related Works (1 of 2)

TDT (Topic Detection and Tracking) – Automatic techniques for locating topically related material in streams of data (Wayne 2000, p. 1487)
– Five major tasks: segmentation, tracking, detection, first story detection, linking
– Work well with the “document-pivot clustering” approach
» Tries to group similar documents to form an event (the event is not named, i.e. there is no need to extract or identify the main features in the event)
» No need to figure out the “bursty features”
– Other interesting issue
» Our approach naturally combines the detection task and the linking task

Related Works (2 of 2)

Many other related works
– Vlachos et al. SIGMOD’04
» Burst for online query
– Smith SIGIR’02
» Events detection
– Kleinberg KDD’02
» Burst and hierarchical structure
– Swan & Allan SIGIR’00
» Time varying features
– …

Summary & Future Work

Document Pivot Clustering vs. Feature Pivot Clustering
– Document Pivot Clustering – Clustering is based on the content of the documents
– Feature Pivot Clustering – Clustering is based on the distribution of features

Future Works
– Try to apply the framework to the TDT dataset
» However, TDT contains selected news stories from multiple sources. The distribution of features may be affected.
» Moreover, the time period of TDT is relatively short. We do not know whether the change in the distribution of features is significant enough for us to do analysis
– Try to assign the same features to multiple events (more realistic)
» However, this may lead to many new issues, such as a “cycle” appearing, or some parameters needing to be introduced
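The contrast between the two clustering views in the summary above can be made concrete with a small Python sketch. It is not the method from the paper: `jaccard` is a simple stand-in for a content-based document similarity, `feature_document_fractions` is a hypothetical helper that tracks, per daily batch, the fraction of documents containing each feature, and the toy batches are invented.

```python
from collections import Counter


def jaccard(doc_a, doc_b):
    """Document-pivot view: compare two documents by their content."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def feature_document_fractions(batches):
    """Feature-pivot view: for each batch, map each feature to the
    fraction of documents in that batch that contain it."""
    series = []
    for docs in batches:
        counts = Counter()
        for doc in docs:
            # Count each feature at most once per document.
            counts.update(set(doc.split()))
        if docs:
            series.append({f: c / len(docs) for f, c in counts.items()})
        else:
            series.append({})
    return series


# Toy daily batches, invented for illustration only.
batches = [
    ["budget speech", "traffic accident"],
    ["sars outbreak", "sars hospital", "budget debate"],
]
fractions = feature_document_fractions(batches)
similarity = jaccard("sars outbreak", "sars hospital")
```

In this toy data the feature `sars` is absent from the first batch and appears in two of the three documents of the second, which is the kind of change in a feature's distribution over time that the feature-pivot view looks at, whereas `jaccard` only compares one pair of documents at a time.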

Thank you very much

– The End –