ON THE SPATIOTEMPORAL BURSTINESS OF TERMS

Theodoros Lappas (Boston University)

Marcos Vieira (IBM Research Lab - Brazil)

Dimitrios Gunopulos (University of Athens)

Vassilis Tsotras (UC Riverside)

tlappas@cs.bu.edu

MOTIVATION

Thousands of new documents daily, recording real-life events

(online news sites, blogs, etc.)

Burstiness

Temporal

Spatiotemporal

Spatial

The deviation of the observed frequency from the expected one
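As a minimal illustration of this definition, the sketch below scores each timestamp of one term in one stream by the difference between its observed and its expected frequency. Taking the expected frequency to be the mean of the whole sequence is an assumption made only for illustration, as are the name `burstiness` and the sample counts; the slides do not fix a particular baseline.

```python
def burstiness(observed):
    """Deviation of each observed frequency from the expected one.

    Assumption: the expected frequency is the mean of the sequence.
    """
    expected = sum(observed) / len(observed)
    return [f - expected for f in observed]


# Hypothetical daily counts of one term in one stream.
counts = [2, 3, 2, 15, 18, 3, 2, 3]
scores = burstiness(counts)

# Timestamps where the term appears more often than expected.
bursty_days = [day for day, s in enumerate(scores) if s > 0]
```

Any other baseline (a moving average, a corpus-wide rate) could replace the mean without changing the shape of the computation.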

Applications

Event Detection

Trend Identification

Document Search*

SPATIOTEMPORAL BURSTINESS

Spatiotemporal Collection:

• Streams from different locations (cities, countries, etc).

• Record real life events in text.

During an event’s time in the spotlight, its characteristic terms exhibit atypically high frequencies in the affected locations.

Large Document Corpora

Streaming Data

Formalization & Identification

SPATIOTEMPORAL PATTERNS

Each pattern type captures a different type of bursty behavior

Group streams that simultaneously reported bursts for the same term during the same timeframe

Combinatorial Patterns

• Ignore geographical proximity among streams

• A combination of a temporal interval and a set of streams from arbitrary locations

• Encodes unusually high frequencies simultaneously observed for a term t in all the streams in some set C, during the same temporal interval I.

Regional Patterns

• Consider the geographical proximity among document streams.

• Defined as a combination of a temporal interval and a geographical region.

• Encode that unusually high frequencies were observed for term t in geographical region R during a temporal interval I.
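The two pattern types can be pictured as small records, as in the sketch below. The class and field names are hypothetical, and modelling the region R as an axis-aligned rectangle is an assumption, not something this slide states.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CombinatorialPattern:
    term: str
    interval: Tuple[int, int]   # temporal interval I = (start, end), inclusive
    streams: FrozenSet[str]     # set C of streams from arbitrary locations

    def covers(self, stream: str, t: int) -> bool:
        start, end = self.interval
        return stream in self.streams and start <= t <= end


@dataclass(frozen=True)
class RegionalPattern:
    term: str
    interval: Tuple[int, int]   # temporal interval I = (start, end), inclusive
    # Region R, assumed rectangular: (min_x, min_y, max_x, max_y)
    region: Tuple[float, float, float, float]

    def covers(self, x: float, y: float, t: int) -> bool:
        min_x, min_y, max_x, max_y = self.region
        start, end = self.interval
        return min_x <= x <= max_x and min_y <= y <= max_y and start <= t <= end
```

The only structural difference is the spatial component: an explicit set of streams in one case, a contiguous region in the other.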

COMBINATORIAL PATTERNS

• Process each stream separately

• Get temporal bursty intervals for the given term.

• An interval is defined by its burstiness and its endpoints

Create the interval graph of the entire interval collection

Maximum weight clique (MWC) → interval with the highest cumulative burstiness

The MWC problem is solvable in polynomial time for interval graphs!

Remove the MWC nodes and re-apply to get the 2nd highest-scoring clique, etc.
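The steps above can be sketched with a sweep over interval endpoints. This relies on the fact that every maximal clique of an interval graph is the set of intervals covering some common point, so with positive weights the heaviest clique is found at the point of maximum total weight. The names `Burst`, `max_weight_clique` and `top_cliques`, the inclusive endpoints and the positive scores are assumptions of this sketch, not details taken from the slides.

```python
from typing import List, NamedTuple, Tuple


class Burst(NamedTuple):
    stream: str
    start: int
    end: int      # inclusive
    score: float  # burstiness, assumed positive


def max_weight_clique(bursts: List[Burst]) -> Tuple[float, List[Burst]]:
    """Heaviest set of pairwise-overlapping bursts, found by a sweep."""
    events = []
    for i, b in enumerate(bursts):
        events.append((b.start, 0, i))  # 0 = open; sorts before a close at the same time
        events.append((b.end, 1, i))    # 1 = close
    events.sort()

    active, total = set(), 0.0
    best_total, best_members = 0.0, frozenset()
    for _, kind, i in events:
        if kind == 0:
            active.add(i)
            total += bursts[i].score
            if total > best_total:
                best_total, best_members = total, frozenset(active)
        else:
            active.remove(i)
            total -= bursts[i].score
    return best_total, [bursts[i] for i in sorted(best_members)]


def top_cliques(bursts: List[Burst], k: int):
    """Repeatedly extract the heaviest clique and remove its nodes."""
    remaining, result = list(bursts), []
    while remaining and len(result) < k:
        score, members = max_weight_clique(remaining)
        if not members:
            break
        result.append((score, members))
        chosen = set(members)
        remaining = [b for b in remaining if b not in chosen]
    return result


bursts = [
    Burst("A", 1, 5, 2.0),
    Burst("B", 3, 7, 3.0),
    Burst("C", 4, 6, 1.5),
    Burst("D", 10, 12, 4.0),
]
ranked = top_cliques(bursts, k=2)
```

Here the first clique groups streams A, B and C, whose bursts all overlap on timestamps 4 to 5; the second contains only D.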

REGIONAL PATTERNS

Get a new snapshot on every new timestamp

Identify the set of Bursty Rectangles within the snapshot (R-Bursty alg, using bichromatic discrepancy)

Aggregate consecutive rectangle-sets as new data arrives from the stream (STLocal alg for finding maximal windows)
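To make the per-snapshot step concrete, the sketch below finds the single rectangle with the largest total discrepancy on a discretized snapshot. It is a simplified stand-in, not the R-Bursty algorithm: it assumes the map has been cut into a grid whose cells hold observed minus expected frequency, and it uses the classic maximum-sum sub-rectangle search (Kadane's scan over row bands). The function name and the grid values are hypothetical.

```python
def max_discrepancy_rectangle(grid):
    """Return (best_sum, (top, left, bottom, right)) over all sub-rectangles.

    grid[r][c] is assumed to hold observed minus expected frequency
    for one cell of the snapshot.
    """
    rows, cols = len(grid), len(grid[0])
    best_sum, best_rect = float("-inf"), None
    for top in range(rows):
        col_sums = [0.0] * cols
        for bottom in range(top, rows):
            # Collapse rows top..bottom into one array of column sums.
            for c in range(cols):
                col_sums[c] += grid[bottom][c]
            # Kadane's scan: best contiguous run of columns in this band.
            current, left = 0.0, 0
            for c, value in enumerate(col_sums):
                if current <= 0:
                    current, left = value, c
                else:
                    current += value
                if current > best_sum:
                    best_sum, best_rect = current, (top, left, bottom, c)
    return best_sum, best_rect


# Hypothetical 3x3 snapshot: positive cells are burstier than expected.
snapshot = [
    [-1, 2, -1],
    [-1, 3, 4],
    [-2, -1, -1],
]
score, rect = max_discrepancy_rectangle(snapshot)
```

A full regional-pattern pipeline would extract a set of such rectangles per snapshot and then link them across consecutive timestamps, which this sketch does not attempt.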

DOCUMENT SEARCH

P_{t,d}: the set of patterns of term t that overlap with the timestamp of document d

f(P_{t,d}): representative score (e.g. min, max, median, avg)

• Standard IR techniques focus on relevance to the given query of terms

• We enhance the search process by considering spatiotemporal burstiness

• Retrieve documents that are relevant to the query and also discuss events with a high spatiotemporal impact
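The burstiness side of this scoring can be sketched as follows: collect the patterns of a term that overlap the document's timestamp, then reduce their scores with the chosen representative function. The `Pattern` record, the function name, and the fallback of 0.0 when no pattern overlaps are assumptions of this sketch; how the result is combined with the relevance score is not specified on the slide and is left out.

```python
from statistics import mean, median
from typing import Callable, NamedTuple, Sequence


class Pattern(NamedTuple):
    start: int    # first timestamp of the pattern (inclusive)
    end: int      # last timestamp of the pattern (inclusive)
    score: float  # burstiness score of the pattern


def representative_score(
    patterns: Sequence[Pattern],
    doc_time: int,
    f: Callable[[Sequence[float]], float] = max,
) -> float:
    """Apply f to the scores of the patterns overlapping doc_time."""
    overlapping = [p.score for p in patterns if p.start <= doc_time <= p.end]
    # Assumption: a term with no overlapping pattern contributes nothing.
    return f(overlapping) if overlapping else 0.0


# Hypothetical patterns of one term.
patterns = [Pattern(1, 5, 2.0), Pattern(4, 9, 6.0), Pattern(20, 30, 9.0)]
```

For a document stamped at time 4, the first two patterns overlap, so `max` gives 6.0, `min` gives 2.0, and `mean` and `median` both give 4.0.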

EXPERIMENTS

DATASETS

Topix Dataset

• 181 streams (countries)

• 305,641 articles

• Sep/08 – Jul/09

Major Events List

• List of influential events from Wikipedia

• 3 types of events: global impact, extended impact, localized impact

• 1 query for each event (chosen by human annotator)

Artificial Data

• Two different data generators: RandGen and DistGen

MAJOR EVENTS LIST

8/28/2012

PATTERN SIZE

• Report #streams in the top regional and top combinatorial pattern (and #streams in the MBR, the minimum bounding rectangle)

• STLocal: smaller patterns, focused around the event's source

• STComb: streams from arbitrary locations, spanning very large regions
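One reading of the MBR count is sketched below: take the streams of a pattern, compute the smallest axis-aligned rectangle that contains their locations, and count every stream of the collection that falls inside it. The function name, the planar (x, y) coordinates and the sample locations are hypothetical.

```python
def streams_in_mbr(pattern_streams, locations):
    """Streams whose location lies inside the MBR of the pattern's streams.

    locations maps a stream name to an (x, y) coordinate pair.
    """
    xs = [locations[s][0] for s in pattern_streams]
    ys = [locations[s][1] for s in pattern_streams]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return {
        s
        for s, (x, y) in locations.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    }


# Hypothetical stream locations.
locations = {"A": (0, 0), "B": (4, 4), "C": (2, 2), "D": (9, 9)}
inside = streams_in_mbr({"A", "B"}, locations)
```

A pattern made of two distant streams can have a large MBR holding many streams that are not part of the pattern, which is what makes the two counts worth reporting side by side.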

PATTERN RETRIEVAL (ARTIFICIAL)

Jaccard Sim: Jaccard similarity between the predicted and the actually affected stream-sets

Start-Error, End-Error: absolute difference between the predicted and the actual endpoints of the pattern

Report the average over 1000 artificial patterns

Both approaches are competitive for both types of patterns

Each approach is a better fit for one type
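The evaluation measures on this slide are simple enough to write down directly. The sketch below is one straightforward way to compute them; the function names and the convention of returning 1.0 for two empty sets are assumptions.

```python
def jaccard(predicted, actual):
    """Jaccard similarity between two stream-sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    # Assumption: two empty sets count as a perfect match.
    return len(predicted & actual) / len(union) if union else 1.0


def endpoint_errors(predicted_interval, actual_interval):
    """(start_error, end_error): absolute differences of the endpoints."""
    (p_start, p_end), (a_start, a_end) = predicted_interval, actual_interval
    return abs(p_start - a_start), abs(p_end - a_end)


# Hypothetical predicted vs. actual pattern.
sim = jaccard({"A", "B", "C"}, {"B", "C", "D"})
start_err, end_err = endpoint_errors((10, 20), (12, 19))
```

Averaging these three values over all generated patterns gives the numbers reported per approach.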

COMPUTATIONAL TIME (TOPIX)

• Emulate the streaming scenario, report the running time per timestamp (average over all terms)

• STLocal is unaffected: customized for streaming data

• STComb is slower: repeats the MWC computation for every timestamp

• Both approaches are competitive, a few ms are enough

DOCUMENT SEARCH

• Use STLocal, STComb, TB (simple temporal burstiness) to retrieve the top-10 docs for each event in the Major Events List

• Ask a human annotator to tag each doc as relevant/non-relevant, report precision

• Avg fraction of common docs in the top-k: STComb-TB: 0.61, STComb-STLocal: 0.58, TB-STLocal: 0.67 → Complementarity: each approach captures a different facet of burstiness
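The two measures used here can be sketched as below. Dividing by k in both functions, even when a list holds fewer than k documents, is an assumption of this sketch, and the function names and document ids are hypothetical.

```python
def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top-k documents tagged as relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


def top_k_overlap(ranking_a, ranking_b, k=10):
    """Fraction of documents shared by the top-k of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Hypothetical rankings from two approaches for one event, with k = 4.
by_stlocal = ["d1", "d2", "d3", "d4"]
by_stcomb = ["d3", "d4", "d5", "d6"]
tagged_relevant = {"d1", "d3"}

precision = precision_at_k(by_stlocal, tagged_relevant, k=4)
overlap = top_k_overlap(by_stlocal, by_stcomb, k=4)
```

An overlap well below 1.0 between two approaches means they surface different documents for the same event.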

FUTURE WORK

Improve STLocal to handle geographical regions of arbitrary shape (now only rectangular)

Improve STComb to handle streaming data (online maximum weight clique computation)

Compare patterns extracted from the same region, under different granularities (e.g. individuals → neighborhoods → cities → countries)

Visualization

ACKNOWLEDGEMENTS

CAPES (Brazilian Federal Agency for Post-Graduate Education)

NSF (IIS:0910859, IIS:1144158)

MODAP EU Project

DISFER GGET Project

THANK YOU!

SCALABILITY Vs. #STREAMS (ARTIFICIAL)

Both approaches scale almost linearly

STLocal consistently faster