Post on 26-Jul-2020

ON THE SPATIOTEMPORAL BURSTINESS OF TERMS

Theodoros Lappas (Boston University)

Marcos Vieira (IBM Research Lab - Brazil)

Dimitrios Gunopulos (University of Athens)

Vassilis Tsotras (UC Riverside)

Boston University Slideshow Title Goes Here

tlappas@cs.bu.edu

MOTIVATION

Thousands of new documents daily, recording real-life events

(online news sites, blogs, etc.)

Burstiness

Temporal

Spatiotemporal

Spatial

The deviation of the observed frequency from the expected one
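As a small illustration of this definition, the sketch below scores each timestamp by how far a term's observed frequency lies from an expected value. The choice of the series mean as the expected frequency, and the names `burstiness` and `expected_from_mean`, are assumptions made for this example only; the slides do not specify the expectation model.

```python
def expected_from_mean(freqs):
    # Assumption: the expected frequency is the mean of the whole series.
    mean = sum(freqs) / len(freqs)
    return [mean] * len(freqs)


def burstiness(observed, expected):
    # Burstiness as the deviation of the observed frequency from the expected one.
    return [o - e for o, e in zip(observed, expected)]


# Hypothetical daily frequencies of one term in one stream.
freqs = [2, 3, 2, 10, 12, 3]
scores = burstiness(freqs, expected_from_mean(freqs))

# Timestamps with a positive score are candidates for a bursty interval.
bursty_days = [day for day, score in enumerate(scores) if score > 0]
print(bursty_days)
```

Positive scores mark timestamps where the term appears more often than expected; negative scores mark quiet periods.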

Applications

Event Detection

Trend Identification

Document Search*

SPATIOTEMPORAL BURSTINESS

Spatiotemporal Collection:

• Streams from different locations (cities, countries, etc).

• Record real life events in text.

During an event’s time in the spotlight, its characteristic terms exhibit atypically high frequencies in the affected locations.

Large Document Corpora

Streaming Data

Formalization & Identification

SPATIOTEMPORAL PATTERNS

Each captures a different type of bursty behavior

Group streams that simultaneously reported bursts for the same term, during the same timeframe

Combinatorial Patterns

• Ignore geographical proximity among streams

• A combination of a temporal interval and a set of streams from arbitrary locations

• Encodes unusually high frequencies simultaneously observed for a term t in all the streams in some set C, during the same temporal interval I.

Regional Patterns

• Consider the geographical proximity among document streams.

• Defined as a combination of a temporal interval and a geographical region.

• Encode that unusually high frequencies were observed for term t in geographical region R during a temporal interval I.

COMBINATORIAL PATTERNS

• Process each stream separately

• Get temporal bursty intervals for the given term.

• An interval is defined by its burstiness and its endpoints

Create the interval graph of the entire interval collection

Maximum weight clique (MWC) → interval with the highest cumulative burstiness

The MWC problem is solvable in polynomial time on interval graphs!

Remove the MWC nodes and re-apply to get the 2nd highest-scoring clique, etc.
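The steps on this slide can be sketched as follows. In an interval graph, every clique shares a common point in time, so a sweep over the sorted endpoints is enough to find the heaviest clique. This is a minimal sketch under assumptions, not the authors' STComb implementation: the tuple layout `(start, end, burstiness, stream)`, the function names, and the requirement that burstiness values are positive are all choices made for this example.

```python
def max_weight_clique(intervals):
    """Heaviest set of mutually overlapping intervals (closed endpoints).

    Each interval is (start, end, burstiness, stream); burstiness is assumed > 0.
    """
    events = []
    for idx, (start, end, _weight, _stream) in enumerate(intervals):
        events.append((start, 0, idx))  # kind 0: openings sort before closings
        events.append((end, 1, idx))    # kind 1: closing at the same time
    events.sort()

    active, current = set(), 0.0
    best_weight, best_members = 0.0, []
    for _time, kind, idx in events:
        if kind == 0:
            active.add(idx)
            current += intervals[idx][2]
            if current > best_weight:
                best_weight, best_members = current, sorted(active)
        else:
            active.discard(idx)
            current -= intervals[idx][2]
    return best_weight, [intervals[i] for i in best_members]


def top_cliques(intervals, k):
    """Repeatedly extract the MWC and remove its nodes, up to k times."""
    remaining, result = list(intervals), []
    for _ in range(k):
        weight, members = max_weight_clique(remaining)
        if not members:
            break
        result.append((weight, members))
        remaining = [iv for iv in remaining if iv not in members]
    return result


# Hypothetical bursty intervals of one term, one per stream.
bursts = [(1, 5, 2.0, "A"), (3, 8, 3.0, "B"), (4, 6, 1.5, "C"), (10, 12, 4.0, "D")]
patterns = top_cliques(bursts, 3)
for weight, members in patterns:
    print(weight, [stream for *_rest, stream in members])
```

The sweep costs O(n log n) for the sort, which is one way to see why the problem is tractable on interval graphs.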

REGIONAL PATTERNS

Get a new snapshot on every new timestamp

Identify the set of bursty rectangles within the snapshot (R-Bursty algorithm, using bichromatic discrepancy)

Aggregate consecutive rectangle sets as new data arrives from the stream (STLocal algorithm for finding maximal windows)
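To make the idea of a bursty rectangle concrete, the sketch below finds the axis-aligned rectangle with the largest total weight in a single snapshot. It is a simplified stand-in, not the R-Bursty algorithm: it assumes the snapshot has already been discretized into a grid whose cells hold observed-minus-expected frequencies, and it uses a standard 2D Kadane scan. The grid values and the name `max_sum_rectangle` are hypothetical.

```python
def max_sum_rectangle(grid):
    """Return (best_sum, (top, left, bottom, right)) with inclusive bounds."""
    rows, cols = len(grid), len(grid[0])
    best_sum, best_rect = float("-inf"), None
    for top in range(rows):
        col_sums = [0.0] * cols
        for bottom in range(top, rows):
            # Collapse rows top..bottom into one row of column sums.
            for c in range(cols):
                col_sums[c] += grid[bottom][c]
            # 1D Kadane scan over the collapsed row.
            run, run_start = 0.0, 0
            for c in range(cols):
                if run <= 0:
                    run, run_start = col_sums[c], c
                else:
                    run += col_sums[c]
                if run > best_sum:
                    best_sum, best_rect = run, (top, run_start, bottom, c)
    return best_sum, best_rect


# Hypothetical snapshot: positive cells are locations where the term is bursty.
snapshot = [
    [-1, -1, -1, -1],
    [-1,  3,  2, -1],
    [-1,  4,  1, -1],
    [-1, -1, -1, -1],
]
best_sum, best_rect = max_sum_rectangle(snapshot)
print(best_sum, best_rect)
```

Running this once per timestamp gives one rectangle per snapshot; the aggregation of consecutive rectangle sets into maximal windows is not covered by this sketch.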

DOCUMENT SEARCH

P_{t,d}: the set of patterns of term t that overlap with the timestamp of document d

f(P_{t,d}): representative score (e.g. min, max, median, avg)

• Standard IR techniques focus on relevance to the given query of terms

• We enhance the search process by considering spatiotemporal burstiness

• Retrieve documents that are relevant to the query and also discuss events with a high spatiotemporal impact
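The representative score f(P_{t,d}) defined on this slide can be sketched as below. The pattern layout `(start, end, score)`, the default of 0.0 when no pattern overlaps, and the function names are assumptions for this example; how the score is combined with query relevance is not stated on the slide and is left out.

```python
from statistics import median

# The aggregators named on the slide: min, max, median, avg.
AGGREGATORS = {
    "min": min,
    "max": max,
    "median": median,
    "avg": lambda xs: sum(xs) / len(xs),
}


def overlapping_scores(patterns, timestamp):
    # P_{t,d}: patterns of the term whose interval contains the document's timestamp.
    return [score for start, end, score in patterns if start <= timestamp <= end]


def representative_score(patterns, timestamp, how="max"):
    scores = overlapping_scores(patterns, timestamp)
    # Assumption: a document outside every pattern gets a score of 0.0.
    return AGGREGATORS[how](scores) if scores else 0.0


# Hypothetical patterns of one term: (start, end, burstiness score).
term_patterns = [(1, 6, 2.0), (4, 9, 5.0), (10, 12, 7.0)]
doc_timestamp = 5
doc_scores = {how: representative_score(term_patterns, doc_timestamp, how)
              for how in AGGREGATORS}
print(doc_scores)
```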

EXPERIMENTS

DATASETS

Topix Dataset

• 181 streams (countries), 305,641 articles, Sep/08 – Jul/09

Major Events List

• List of influential events from Wikipedia

• 3 types of events: global impact, extended impact, localized impact

• 1 query for each event (chosen by a human annotator)

Artificial Data

• Two different data generators: RandGen and DistGen

MAJOR EVENTS LIST

8/28/2012

PATTERN SIZE

• Report the number of streams in the top regional and the top combinatorial pattern (and the number of streams in the minimum bounding rectangle, MBR)

• STLocal: smaller patterns, focused around the event’s source

• STComb: streams from arbitrary locations, spanning very large regions

PATTERN RETRIEVAL (ARTIFICIAL)

Jaccard Sim: Jaccard similarity between the predicted and the actually affected stream sets

Start-Error, End-Error: absolute difference between the predicted and the actual endpoints of the pattern

Report the average over 1000 artificial patterns

Both approaches are competitive for both types of patterns

Each approach is a better fit for one type
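The evaluation metrics on this slide can be computed as in the sketch below. The function names, the convention that two empty sets have similarity 1.0, and the sample values are assumptions for illustration; they are not results from the experiments.

```python
def jaccard(predicted, actual):
    """Jaccard similarity between predicted and actually affected stream sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    # Assumption: two empty sets count as a perfect match.
    return len(predicted & actual) / len(union) if union else 1.0


def endpoint_errors(predicted, actual):
    """(Start-Error, End-Error): absolute differences between interval endpoints."""
    return abs(predicted[0] - actual[0]), abs(predicted[1] - actual[1])


# Hypothetical predicted vs. planted pattern.
sim = jaccard({"A", "B", "C"}, {"B", "C", "D"})
start_err, end_err = endpoint_errors((3, 10), (5, 9))
print(sim, start_err, end_err)
```

Averaging these three values over all artificial patterns gives the kind of summary the slide reports.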

COMPUTATIONAL TIME (TOPIX)

• Emulate the streaming scenario and report the running time per timestamp (average over all terms)

• STLocal is unaffected: it is customized for streaming data

• STComb is slower: it repeats the MWC computation for every timestamp

• Both approaches are competitive: a few milliseconds per timestamp are enough

DOCUMENT SEARCH

• Use STLocal, STComb, and TB (simple temporal burstiness) to retrieve the top-10 docs for each event in the Major Events List

• Ask a human annotator to tag each doc as relevant or non-relevant, and report precision

• Average fraction of common docs in the top-k: STComb-TB: 0.61, STComb-STLocal: 0.58, TB-STLocal: 0.67

• Complementarity: each approach captures a different facet of burstiness

FUTURE WORK

Improve STLocal to handle geographical regions of arbitrary size (now only rectangular)

Improve STComb to handle streaming data (online maximum weight clique computation)

Compare patterns extracted from the same region under different granularities (e.g. individuals → neighborhoods → cities → countries)

Visualization

ACKNOWLEDGEMENTS

CAPES (Brazilian Federal Agency for Post-Graduate Education)

NSF (IIS:0910859, IIS:1144158)

MODAP EU Project

DISFER GGET Project

THANK YOU!

SCALABILITY VS. #STREAMS (ARTIFICIAL)

Both approaches scale almost linearly

STLocal is consistently faster