Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter) , Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana- Champaign Malu Castellanos, Meichun Hsu HP Laboratories


Page 1:

Information Retrieval with Time Series Query

Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai

University of Illinois at Urbana-Champaign

Malu Castellanos, Meichun Hsu

HP Laboratories

Page 2:

IR for stock market analysis?

[Figure: Dow Jones Industrial Average; x-axis: Time. Source: Yahoo Finance]

Any clues in the companion news stream?

What might have caused the stock market crash?

Sept 11 Attack!

What documents to read to analyze such a “causal” topic?

Page 3:

Analysis of Presidential Prediction Markets

What might have caused the sudden drop of price for this candidate?

What “mattered” in this election?

[Figure: time series plot; x-axis: Time]

Any clues in the companion news stream?

Tax cut?

What documents to read to analyze such a “causal” topic?

Page 4:

Analysis of Product Sales

[Figure: time series plot; x-axis: Time]

Any clues in the companion product reviews?

What might have caused the decrease of sales?

safety concerns

What reviews to read to analyze such a “causal” topic?

Page 5:

Finding documents about “trendy” topics

[Figure: time series query curve; x-axis: Time]

Draw a “time series query”: find documents about a topic that emerged this summer and attracted much attention this October

Which documents cover such a “trendy” topic?

Page 6:

Information Retrieval with Time Series Query

• Instead of a keyword query, use a time series as the query
– Retrieve documents that contain topics correlated with the query time series

• Input:
– Time series data with time stamps
– Text stream: a collection of time-stamped documents from the same time period

• Output:
– Ranked list of documents

Page 7:

Ideal Results of Information Retrieval with Time Series Query

[Figure: Apple Stock Price plotted under the companion news stream (2000, 2001, …); x-axis: Date, 7/3/2000 to 12/31/2001; y-axis: Price ($), 0 to 70]

RANK  DATE        EXCERPT
1     9/29/2000   Expect earning will be far below
2     12/8/2000   $4 billion cash in company
3     10/19/2000  Disappointing earning report
4     4/19/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple's new retail store
…     …           …

Page 8:

IR w/ TS - Method Overview

[Diagram: Input 1 is the text stream of input documents (Sep 2001, Oct 2001, …), which yields a vocabulary and word frequency curves (W1, W2, W3, W4). Input 2 is the non-text time series. Words are ranked by correlation with the time series. Output: ranked documents.]

Page 9:

IR w/ TS - Method Overview

[Diagram: Input 1 is the text stream of input documents (Sep 2001, Oct 2001, …), which yields a vocabulary and word frequency curves (W1, W2, W3, W4). Input 2 is the non-text time series. Words are ranked by correlation with the time series. Output: ranked documents.]

1. How to measure the correlation between a word and the time series

2. How to aggregate word correlations to rank documents
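The first stage of the pipeline, turning the text stream into word frequency curves, could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `word_frequency_curves`, the `(date, text)` input format, whitespace tokenization, and daily binning are all assumptions.

```python
from collections import Counter, defaultdict
from datetime import date


def word_frequency_curves(docs, days):
    """Build one frequency curve per word.

    docs: iterable of (date, text) pairs (hypothetical input format).
    days: ordered list of dates that defines the time axis.
    Returns {word: [count on days[0], count on days[1], ...]}.
    """
    index = {d: i for i, d in enumerate(days)}
    curves = defaultdict(lambda: [0] * len(days))
    for day, text in docs:
        if day not in index:
            continue  # document falls outside the query period
        # Naive whitespace tokenization; a real system would normalize more.
        for word, n in Counter(text.lower().split()).items():
            curves[word][index[day]] += n
    return dict(curves)


days = [date(2001, 9, 10), date(2001, 9, 11), date(2001, 9, 12)]
docs = [
    (date(2001, 9, 10), "markets open flat"),
    (date(2001, 9, 11), "attack closes markets"),
    (date(2001, 9, 12), "attack aftermath attack"),
]
curves = word_frequency_curves(docs, days)
print(curves["attack"])   # [0, 1, 2]
print(curves["markets"])  # [1, 1, 0]
```

Each curve has the same length as the time axis, so it can be compared point by point with the query time series.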

Page 10:

Correlation Function

• Measure the correlation between each word frequency curve and the input time series

1. Pearson Correlation
– Basic correlation

2. Dynamic Time Warping [Senin`08]
– Captures alignment of shifted or stretched time series

[Figure: two panels, "Series before alignment" and "Time series alignment"; x-axis: Time; y-axis: Values]
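The two correlation functions can be illustrated with a short sketch in plain Python. The function names are hypothetical, and the DTW below is the textbook dynamic program with absolute difference as the local cost. It returns a distance (smaller means more similar); the slides do not say how that distance is converted into a correlation-like score, so that step is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)


def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


query = [1, 2, 3, 4, 5]
same_shape = [2, 4, 6, 8, 10]   # same trend, different scale
shifted = [1, 1, 2, 3, 4, 5]    # same trend, delayed by one step

print(pearson(query, same_shape))    # approximately 1.0
print(dtw_distance(query, shifted))  # 0.0: warping absorbs the delay
```

Pearson needs series of equal length and compares them index by index, whereas DTW can align the delayed series exactly, which is the property the slide highlights.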

Page 11:

Aggregation Function

• Score document correlation by aggregating word correlations

1. Weighted TF-IDF (BM25)
– Use the top K correlated words as a text query
– Score documents with an IR formula such as BM25
– Use the correlation coefficient as the term weight
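One way to read the weighted BM25 aggregation is sketched below. It is an assumption-laden illustration, not the authors' formula: `weighted_bm25` is a hypothetical name, the IDF is the common non-negative variant, and `k1=1.2`, `b=0.75` are conventional defaults rather than values from the slides.

```python
import math
from collections import Counter


def weighted_bm25(doc_tokens, corpus, word_corr, k=10, k1=1.2, b=0.75):
    """Score one document against the top-k correlated words.

    corpus: list of token lists; word_corr: {word: |correlation|}.
    Each query term's BM25 contribution is multiplied by its correlation.
    """
    n_docs = len(corpus)
    avg_len = (sum(len(d) for d in corpus) / n_docs) or 1.0
    # The top-k correlated words act as the text query.
    query = sorted(word_corr, key=word_corr.get, reverse=True)[:k]
    tf = Counter(doc_tokens)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += word_corr[w] * idf * tf[w] * (k1 + 1) / (tf[w] + length_norm)
    return score


corpus = [
    ["taliban", "afghanistan", "war"],
    ["apple", "store", "opens"],
    ["markets", "rally"],
]
word_corr = {"afghanistan": 0.86, "taliban": 0.84, "apple": 0.10}
scores = [weighted_bm25(d, corpus, word_corr, k=2) for d in corpus]
print(scores)  # only the first document contains a top-2 correlated word
```

With `k=2` the weakly correlated word "apple" never enters the query, so the second document scores zero even though it contains a word from `word_corr`.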

Page 12:

Aggregation Function

2. Average Correlation

a) Average over all terms
– Concern: not all the words are correlated

b) Average over top-k terms
– Concern: may be dominated by multiple occurrences of the same term

c) Average over top-k unique terms
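The three averaging variants can be sketched as below. The original formulas did not survive in this transcript, so this is one interpretation: "terms" in (a) and (b) are taken to be token occurrences in the document, and words without a known correlation count as 0. All function names are hypothetical.

```python
def avg_corr_all(doc_tokens, word_corr):
    """(a) Average correlation over every token in the document."""
    vals = [word_corr.get(w, 0.0) for w in doc_tokens]
    return sum(vals) / len(vals) if vals else 0.0


def avg_corr_topk(doc_tokens, word_corr, k):
    """(b) Average over the k highest-correlated tokens (repeats allowed)."""
    vals = sorted((word_corr.get(w, 0.0) for w in doc_tokens), reverse=True)[:k]
    return sum(vals) / len(vals) if vals else 0.0


def avg_corr_topk_unique(doc_tokens, word_corr, k):
    """(c) Same as (b), but each distinct word is counted once."""
    return avg_corr_topk(set(doc_tokens), word_corr, k)


doc = ["taliban", "taliban", "taliban", "the", "stock"]
word_corr = {"taliban": 0.84, "stock": 0.30, "the": 0.02}

print(avg_corr_all(doc, word_corr))             # about 0.568
print(avg_corr_topk(doc, word_corr, 2))         # about 0.84
print(avg_corr_topk_unique(doc, word_corr, 2))  # about 0.57
```

In this toy document, variant (b) is decided entirely by the repeated word "taliban", which is the concern the slide raises; variant (c) lets the second-best distinct word contribute.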

Page 13:

Evaluation

• Data set
– New York Times corpus (Jul 2000 to Dec 2001), entity annotated
– Daily stock prices of 24 companies

• Measures
– Mean average precision (MAP)
– Normalized discounted cumulative gain (NDCG)

• Research questions

1. Can our method retrieve meaningful documents?

2. Does DTW outperform Pearson Correlation?

3. Which aggregation function works the best?
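For reference, the two measures can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance and a log2 rank discount; the slides do not state which gain values or cutoffs were used, so those details are assumptions.

```python
import math


def average_precision(ranked_rel, n_relevant=None):
    """ranked_rel: 0/1 relevance flags in rank order.

    n_relevant: total relevant documents for the query; defaults to the
    number found in the ranking (assumes the ranking covers the collection).
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    denom = n_relevant if n_relevant is not None else hits
    return total / denom if denom else 0.0


def mean_average_precision(rankings):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg(gains):
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


ranking = [1, 0, 1, 0]             # relevant documents at ranks 1 and 3
print(average_precision(ranking))  # (1/1 + 2/3) / 2, about 0.833
print(ndcg(ranking))               # about 0.92
```

In the stock-price setting, each of the 24 company time series would be one query, and MAP would average the per-company values.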

Page 14:

Top ranked documents by American Airlines stock price

Rank  Date        Excerpt
1     10/22/2001  Fleeing the war
2     12/11/2001  US and anti-Taliban forces in Afghanistan
3     11/18/2001  Fate of Taliban Soldiers Under Discussion
4     11/12/2001  Tally of dead and missing in Sep 11 terrorist attacks
5     9/25/2001   Soldiers in Afghanistan …
6     11/19/2001  Recovery operation at World Trade Center
7     11/3/2001   4343 died or missing as a result of the attacks on Sep 11
8     11/17/2001  Dead and missing report of Sep 11 attack
…     …           …

All top-ranked documents are related to the September 11 terrorist attack

Page 15:

Top Correlated Words to American Airlines stock price

• All of the terms most correlated with the input time series are related to the terrorist attack
– Highly correlated terms contributed to the retrieval of documents about this topic

Word         |ρ|
challenged   0.887031
afghanistan  0.861351
security     0.858745
sept         0.858309
terrorism    0.854865
pakistan     0.848829
aghans       0.844596
afghan       0.843481
islamic      0.842499
taliban      0.841455

Page 16:

Top ranked ‘relevant’ documents for Apple stock price

Rank  Date        Excerpt
1     9/29/2000   Fourth-quarter earnings far below estimates
2     12/8/2000   $4 billion reserve, not $11 billion
3     10/19/2000  Announced earnings report
4     4/29/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple’s new retail stores
6     12/6/2000   Apple warns it will record quarterly loss
7     3/24/2001   Stocks perk up, with Nasdaq posting gain
8     8/10/2000   Mixing Mac and Windows
…     …           …

• Retrieved relevant events: disappointing earnings report, store opening, etc.

• Useful as a new feature for re-ranking search results?

Page 17:

Quantitative Evaluation

• All our methods > Random precision (0.0013)

• Dynamic time warping >> Pearson correlation

Correlation  MAP     NDCG
Pearson      0.0019  0.3515
DTW          0.0022  0.3609

(Average performance, with average correlation as the aggregation method)

Page 18:

Comparison of Aggregation Methods

• AC << TopK, BM25

• Top5-AC << Top20-AC, but not more than K=20

• BM25 is sensitive to parameter setting
– Scores of AC methods are more meaningful

• Incomplete judgments
– Possibly much better performance in reality

Method         MAP     NDCG
AC             0.0019  0.3515
Top5-AC        0.0021  0.361
Top10-AC       0.0023  0.3618
Top20-AC       0.0024  0.3629
Top5-AC-Uniq   0.0022  0.3613
Top10-AC-Uniq  0.0022  0.3616
Top20-AC-Uniq  0.0022  0.3619
Top5-BM25      0.0019  0.3584
Top10-BM25     0.0023  0.361
Top20-BM25     0.0019  0.3582

(Average performance, with Pearson correlation)

Page 19:

“Higher” NDCG vs. Low MAP

Page 20:

Summary

• Introduced a novel retrieval problem
– Time series as query

• Studied basic solutions: time series representation of terms
– Term retrieval: correlation(query, term)
– Document retrieval: aggregation of term retrieval results

• Dynamic time warping + top-K average correlation seems to work well

Page 21:

Limitations & Future Work

• Evaluation is based on simulation
– Highly incomplete judgments!
– What’s a good way to evaluate such a new retrieval task?

• Current solutions are heuristic
– How can we develop a more principled model?

• Different notions of relevance
– “Local” relevance vs. global relevance?

• All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)

Page 22:

Thank You! Comments/Questions?