
Temporal Information Retrieval

Stefano Giovanni Rizzo1 Danilo Montesi2, Matteo Brucato3

1,2 Department of Computer Science and Engineering - University of Bologna 3 School of Computer Science - University of Massachusetts Amherst

[email protected]

Information Retrieval

§  Information Retrieval (IR) is defined as the activity of:
  §  finding information resources (usually documents)
  §  of an unstructured nature (usually text)
  §  relevant to an information need (the user query)
  §  from within a large collection (e.g., the web).
§  But… is the text truly unstructured?
§  As we'll see, structured information, like temporal expressions, can be extracted from text data resources by means of NLP tools.
§  The temporal information in documents can be used to improve IR effectiveness.
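As a toy illustration of this idea, a regular expression can already pull explicit year mentions out of raw text. This sketch only catches four-digit years; real temporal taggers (such as the HeidelTime tagger used later in these slides) also resolve relative expressions like "last year".

```python
import re

# Toy extractor: finds explicit four-digit year mentions (1000-2999) in text.
# Real temporal taggers also normalize relative expressions ("last year").
YEAR_PATTERN = re.compile(r"\b([12][0-9]{3})\b")

def extract_years(text):
    """Return the year mentions found in the text, in order of appearance."""
    return [int(y) for y in YEAR_PATTERN.findall(text)]

print(extract_years("The 2008 elections were covered again in 2009."))  # [2008, 2009]
```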

IR Effectiveness

§  To assess the effectiveness of an IR system (the quality of its search results), two measures of the system's returned results for a query are used:
  §  Precision: what fraction of the returned results are relevant to the information need?
  §  Recall: what fraction of the relevant documents in the collection were returned by the system?

IR Evaluation

§  To measure ad hoc IR effectiveness in the standard way, we need a test collection consisting of three things:
  1.  A document collection
  2.  A set of information needs, in the form of user queries
  3.  A set of relevance judgments, in the form of a binary assessment of relevant or nonrelevant for each query–document pair.

Precision and Recall

[Figure: Venn diagram over the document collection, showing the documents relevant to a specific query, the retrieved documents, and their intersection, the relevant retrieved documents]

§  Relevant Retrieved Documents = Relevant Documents ∩ Retrieved Documents
§  Precision = |Relevant Retrieved Documents| / |Retrieved Documents|
§  Recall = |Relevant Retrieved Documents| / |Relevant Documents|
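The precision and recall definitions above amount to simple set arithmetic. A minimal sketch, with document ids as integers (the example sets are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant          # relevant retrieved documents
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# 2 of the 4 retrieved documents are relevant; 2 of the 4 relevant are retrieved.
print(precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6}))  # (0.5, 0.5)
```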

Precision @ K

▪  Set a rank threshold K
▪  Compute % relevant in top K
▪  Ignores documents ranked lower than K
▪  Example:
  ▪  Precision@3 is 2/3
  ▪  Precision@4 is 2/4
  ▪  Precision@5 is 3/5
▪  In similar fashion we have Recall@K
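P@K is a truncation of the ranking followed by a count. The sketch below reproduces the slide's numbers with an assumed ranking consistent with them (relevant, relevant, non-relevant, non-relevant, relevant); the original figure is not in the transcript.

```python
from fractions import Fraction

def precision_at_k(relevance, k):
    """relevance: 0/1 judgments in rank order; returns P@K as an exact fraction."""
    return Fraction(sum(relevance[:k]), k)

# Assumed ranking matching the slide's example values.
ranking = [1, 1, 0, 0, 1]
print(precision_at_k(ranking, 3))  # 2/3
print(precision_at_k(ranking, 4))  # 1/2  (i.e., 2/4)
print(precision_at_k(ranking, 5))  # 3/5
```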

Example: Average Precision

Example: MAP
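The two worked examples above were figures in the original slides. As a sketch of the computations they illustrate: average precision (AP) averages P@k over the ranks k where relevant documents appear, and MAP averages AP over queries. One common formulation divides AP by the total number of relevant documents in the collection; for simplicity this sketch divides by the number of relevant documents retrieved.

```python
def average_precision(relevance):
    """AP: mean of P@k at each rank k where a relevant document appears."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

ap = average_precision([1, 1, 0, 0, 1])  # (1/1 + 2/2 + 3/5) / 3
print(round(ap, 4))  # 0.8667
```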

Temporal Relevance

§  The set of temporal properties of a document is one key aspect that determine a document’s relevance for a query. Despite this, temporal information is still treated as a simple textual keyword by traditional IR.

§  Temporal Information Retrieval aims at satisfying these temporal needs and at combining traditional notions of document relevance with the so-called temporal relevance.

Time and Temporal Scope

§  Time is a ubiquitous dimension of nearly every collection of documents:
  §  digital libraries, news stories, tweets, the Web, …
§  Documents
  §  Meta-level: creation date, publication date, …
  §  Content-level: periods of time mentioned in the text
  ⇒ the document temporal scope
§  Queries
  §  Meta-level: issue date (as in Google Trends)
  §  Content-level: periods of time mentioned in the query
  ⇒ the query temporal scope

Temporal Similarity vs Textual Similarity

§  Textual similarity
  §  Similarity based on term statistics
  §  Not adequate for temporal queries:
    §  "results elections 2008"
    §  "results elections last year"
  §  "2008" and "last year" are treated as terms and searched literally in the documents
§  We need to model temporal similarity

Temporal Intervals

§  Temporal intervals are semantically rich:
  §  Synonymy: "2014" = "last year" = "the year after 2013"
  §  Polysemy: "every friday", "yearly", "super bowl"
§  Algebraic structure (to correlate temporal scopes):
  §  Overlapping
  §  Containment
  §  Distance
§  We can exploit these features to improve IR models
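The three algebraic relations above can be sketched as predicates over closed intervals at year granularity (the interval values are illustrative):

```python
def overlaps(x, y):
    """True if intervals x = (a, b) and y = (c, d) share at least one chronon."""
    return max(x[0], y[0]) <= min(x[1], y[1])

def contains(x, y):
    """True if interval x fully contains interval y."""
    return x[0] <= y[0] and y[1] <= x[1]

def distance(x, y):
    """Gap between two disjoint intervals; 0 if they overlap."""
    return max(0, max(x[0], y[0]) - min(x[1], y[1]))

print(overlaps((1990, 1999), (1995, 2001)))  # True
print(contains((1990, 1999), (1995, 1997)))  # True
print(distance((1990, 1992), (1995, 1997)))  # 3
```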

Temporal Domain

§  CHRONON
  §  The smallest discrete unit of time (e.g., a second, a day, a year)
§  TEMPORAL DOMAIN
  §  Δ = { [tmin, tmin], …, [1990, 1991], [1990, 1992], …, [tmin, tmax] }
§  INTERPRETATION FUNCTION (NLP)
  §  Ψ : TIMEX → ℘(Δ)
  §  where TIMEX is the set of all possible time expressions
§  TEMPORAL SCOPE of a document D (or a query Q)
  §  TD = { [1990, 1999], [1995, 1997], [2001, 2002] }
  §  TQ = { [1991, 2001], [2002, 2003] }
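A toy interpretation function Ψ at year granularity could look like the sketch below. The table entries and the fixed "now" are illustrative assumptions; a real Ψ would be implemented with an NLP temporal tagger, not a lookup table.

```python
# Chronon: one year. Intervals are (start_year, end_year) pairs.
# Toy interpretation function Ψ : TIMEX → ℘(Δ), assuming "now" is 2015.
NOW = 2015

def interpret(timex):
    """Map a time expression to a set of intervals of the temporal domain."""
    table = {
        "2014": {(2014, 2014)},
        "last year": {(NOW - 1, NOW - 1)},
        "the nineties": {(1990, 1999)},
    }
    return table.get(timex, set())

# "2014" and "last year" are synonymous under this interpretation:
print(interpret("2014") == interpret("last year"))  # True
```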

The temporal similarity δ*

How can we effectively model δ*?

Generalized metrics

§  Given an interval from the query and one from the document:
  §  [aQ, bQ] ∈ TQ
  §  [aD, bD] ∈ TD
we define:
§  Manhattan distance
  §  δsym([aQ,bQ], [aD,bD]) = |aQ-aD| + |bQ-bD|
§  Query-biased hemimetric
  §  δcov(Q)([aQ,bQ], [aD,bD]) = (bQ-aQ) - (min{bQ, bD} - max{aQ, aD})
§  Document-biased hemimetric
  §  δcov(D)([aQ,bQ], [aD,bD]) = (bD-aD) - (min{bQ, bD} - max{aQ, aD})
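The three distances translate directly into code. A minimal sketch at year granularity (the example intervals are illustrative):

```python
def delta_sym(q, d):
    """Manhattan distance between query interval q and document interval d."""
    (aq, bq), (ad, bd) = q, d
    return abs(aq - ad) + abs(bq - bd)

def delta_cov_q(q, d):
    """Query-biased hemimetric: how much of the query the document fails to cover."""
    (aq, bq), (ad, bd) = q, d
    return (bq - aq) - (min(bq, bd) - max(aq, ad))

def delta_cov_d(q, d):
    """Document-biased hemimetric: how much of the document the query fails to cover."""
    (aq, bq), (ad, bd) = q, d
    return (bd - ad) - (min(bq, bd) - max(aq, ad))

q, d = (2000, 2005), (2003, 2008)
print(delta_sym(q, d), delta_cov_q(q, d), delta_cov_d(q, d))  # 6 3 3
```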

Manhattan Distance

§  δsym([aQ,bQ],[aD,bD]) = |aQ-aD|+|bQ-bD|

[Figure: a query interval [aQ, bQ] and a coinciding document interval [aD, bD]; δ = 0, which looks intuitively correct]

Manhattan Distance: Peculiarity

§  δsym([aQ,bQ],[aD,bD]) = |aQ-aD|+|bQ-bD|

[Figure: a query interval [aQ, bQ] and two different document intervals [aD1, bD1] and [aD2, bD2]; the two documents would have the same distance 3+3 = 6 from the query…]

Document-biased Hemimetric

§  δcov(D)([aQ,bQ],[aD,bD]) = (bD-aD) - (min{bQ, bD} - max{aQ, aD})

[Figure: for the same query and two documents, δ1 = 0 and δ2 = 6]

•  More appropriate for "broad" time queries:
  •  The query represents the broadest time interval the user is willing to accept
  •  The distance reflects document coverage

Query-biased Hemimetric

§  δcov(Q)([aQ,bQ],[aD,bD]) = (bQ-aQ) - (min{bQ, bD} - max{aQ, aD})

[Figure: for the same query and two documents, δ1 = 6 and δ2 = 0]

•  More appropriate for "narrow" time queries:
  •  The query represents the narrowest time interval the user is willing to accept
  •  The distance reflects query coverage

Generalized metrics - Summary

§  Metrics (e.g., the Manhattan distance) satisfy:
  §  Non-negativity: δ(x,y) ≥ 0
  §  Coincidence: δ(x,y) = 0 iff x = y
  §  Symmetry: δ(x,y) = δ(y,x)
  §  Triangle inequality: δ(x,z) ≤ δ(x,y) + δ(y,z)
§  The two new distances are hemimetrics:
  §  No symmetry
  §  Partial coincidence:
    §  δ(x,x) = 0
    §  but we allow y ≠ x such that δ(x,y) = 0
§  Interesting property:
  §  δsym(x,y) = δcov(D)(x,y) + δcov(Q)(x,y)

Combining text and time scores

§  Temporal similarity:
  §  simδ*(Q, D) = exp{-δ*(TQ, TD)}
§  Two models of relevance:
  §  Textual similarity: simtxt
  §  Temporal similarity: simtime
§  Combining them:
  §  sim(Q, Di) = (1 - α) simtxt(Q, Di) + α simtime(Q, Di)
  §  where α is a combination parameter in [0, 1]
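The two formulas above can be sketched as follows; the sim_txt value and the δ* value in the usage line are illustrative, not taken from the slides.

```python
import math

def temporal_similarity(delta):
    """sim_δ*(Q, D) = exp(-δ*(T_Q, T_D)): maps a distance to a (0, 1] score."""
    return math.exp(-delta)

def combined_score(sim_txt, sim_time, alpha):
    """Linear combination of textual and temporal similarity, α in [0, 1]."""
    assert 0.0 <= alpha <= 1.0
    return (1 - alpha) * sim_txt + alpha * sim_time

sim_time = temporal_similarity(delta=0.5)
print(round(combined_score(sim_txt=0.7, sim_time=sim_time, alpha=0.34), 4))
```

With α = 0 the ranking falls back to pure textual similarity, which is why the text-only baseline appears as the α = 0 point in the evaluation.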

Evaluation – TREC Novelty

§  Queries: 50 (22% with time)
§  Results over all 50 queries, for different sizes k of returned results:

         BM25 baseline               Text + temporal
         (text similarity, α = 0)    similarity (α = 0.34)
Top-k    Precision    Recall         Precision    Recall
5        0.84         0.17           0.87         0.18
10       0.82         0.34           0.85         0.35
20       0.79         0.65           0.81         0.67

Evaluation: MAP over α

§  Mean Average Precision (MAP) over different values of the α combination parameter:

[Figure: MAP (from about 0.78 to 0.84) as a function of α ∈ [0, 1], with curves for Doc Coverage, Query Coverage, Euclidean, Manhattan, Manhattan Cov., and the text baseline]

Conclusions

§  Model for temporal scopes of documents and queries
§  Novel metrics for temporal scope similarity
§  Ranking model combining textual and temporal scores
§  Experimental evaluation of the effectiveness improvements over a text-only ranking

Future Work

§  Time quantification in text
§  Temporal metrics access methods
§  Additional metric distances
§  Score distribution normalization for textual-temporal combination

Time Quantification

§  Using a state-of-the-art temporal tagger (HeidelTime), we extracted the temporal expressions within large sets of documents:
  §  English Wikipedia (4.4 million articles)
  §  New York Times (1.8 million articles)
  §  Twitter (1.2 million tweets from 23/01/2011 to 08/02/2011)
  §  AOL Search Queries (20 million search queries)
  §  IR Collections (TREC GOV2 has 25 million documents)
§  Patterns and insights are arising from the temporal distribution analysis.

Wikipedia: Temporal Quantification

Wikipedia: Centuries and Decades Bias

Wikipedia: Zipf’s Law

Rank    Interval
1       2010-01-01 : 2010-12-31 (1 year)
2       2008-01-01 : 2008-12-31 (1 year)
3       2009-01-01 : 2009-12-31 (1 year)
4       2013-01-01 : 2013-12-31 (1 year)
5       2007-01-01 : 2007-12-31 (1 year)

Wikipedia Zipf's Laws: Words [5] vs Time

§  Power-law exponent:
  §  Time: -2.34384
  §  Words: -2
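A minimal sketch of how such a power-law exponent can be estimated: an ordinary least-squares fit of log(frequency) against log(rank). The data here is synthetic with an exponent of -2.0 chosen for the example; it is not the Wikipedia measurement above.

```python
import math

def fit_power_law_exponent(frequencies):
    """Least-squares slope of log(frequency) vs log(rank), ranks 1..n."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic Zipfian frequencies with exponent -2.0: f(rank) = 1e6 * rank^-2
freqs = [1e6 * rank ** -2.0 for rank in range(1, 101)]
print(round(fit_power_law_exponent(freqs), 4))  # -2.0
```

On real frequency data the fit is usually restricted to the straight-line region of the log-log plot, since head and tail ranks often deviate from the power law.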

LA: Temporal quantification

FT: Temporal quantification

FBIS: Temporal quantification

Thank you.

Questions?

Stefano Giovanni Rizzo1 Danilo Montesi2, Matteo Brucato3

1,2 Department of Computer Science and Engineering - University of Bologna 3 School of Computer Science - University of Massachusetts Amherst

[email protected]