ChronoSearch: A System for Extracting a Chronological Timeline. Pete Bohman, Adam Kunk.


Post on 16-Jan-2016



ChronoSearch

Pete Bohman, Adam Kunk


ChronoSearch

ChronoSearch: A System for Extracting a Chronological Timeline



Motivation

• Current search engines do not provide a complete picture
• Latest events dominate top results
• The user is forced to parse through lots of pages to find a complete list of information
• ChronoSearch aims to summarize search results into a concise list of important events related to an entity


Problem Definition

Input: An entity, E, and a set of web pages, W, related to E

Output: A sorted list of events, L, which are related to E


Desired output characteristics

• Output: L = { l_i | l_i occurred before l_(i+1) }
  ▪ l_i is a sentence describing an event
  ▪ l_i describes a unique event
  ▪ l_i contains a link to the source web page w belonging to W
• L is Precise
  ▪ Each l_i describes an event the user is interested in
• L is Comprehensive
  ▪ L contains a description of all the events a user is interested in
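One way to picture an element of L is as a small record holding the date, the sentence, and the source link. The sketch below is illustrative only: the class name, field names, and placeholder entries are assumptions, not taken from the ChronoSearch implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimelineEntry:
    """One element l_i of the output list L (hypothetical names)."""
    when: date        # date on which the event occurred
    sentence: str     # sentence describing the event
    source_url: str   # link to the source web page w in W


def build_timeline(entries):
    """Return the entries in chronological order (earliest first)."""
    return sorted(entries, key=lambda e: e.when)


# Placeholder entries, deliberately given out of order.
entries = [
    TimelineEntry(date(2011, 10, 5), "Placeholder later event.", "http://example.com/b"),
    TimelineEntry(date(1976, 4, 1), "Placeholder earlier event.", "http://example.com/a"),
]
timeline = build_timeline(entries)
```

Sorting on the date field alone keeps the ordering requirement separate from the precision and comprehensiveness requirements, which depend on which entries are kept rather than how they are ordered.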


System Overview

• Extract textual elements from web pages
  ▪ Beautiful Soup to extract <p> elements and remove HTML tags
  ▪ NLTK sentence tokenization
• Sentence sanitization
  ▪ Avg. word length [3.2, 7.2] chars/word
• Extract sentences containing entity and date
  ▪ Regular expressions used for date extraction
• Order events by date
• Remove sentences reporting the same event
  ▪ Cosine similarity
  ▪ Verb similarity
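The sanitization, extraction, and ordering steps above can be sketched roughly as follows. This is a minimal sketch, not the authors' code: it assumes sentences have already been extracted and tokenized (the slides use Beautiful Soup and NLTK for that), it recognizes only one assumed date format ("1 April 1976"), and the entity "Ada Example" and all function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed date format for this sketch: day, full month name, year.
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def avg_word_length(sentence):
    """Average number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def is_clean(sentence, low=3.2, high=7.2):
    """Sanitization: keep sentences whose average word length is in [low, high]."""
    return low <= avg_word_length(sentence) <= high


def extract_events(sentences, entity):
    """Keep clean sentences naming the entity and a date, ordered by date."""
    events = []
    for s in sentences:
        match = DATE_RE.search(s)
        if match and entity in s and is_clean(s):
            when = datetime.strptime(match.group(0), "%d %B %Y").date()
            events.append((when, s))
    return sorted(events)


sentences = [
    "Ada Example retired on 24 August 2011.",
    "Ada Example founded a company on 1 April 1976.",
    "Ada Example gave a talk.",               # no date: skipped
    "a b c Ada Example on 2 May 2000 x y z",  # average word length too low: skipped
]
events = extract_events(sentences, "Ada Example")
```

Only the two sentences that contain the entity, an absolute date, and an acceptable average word length survive, and they come back in chronological order.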


Web Redundancy

• Focus on the strongest signal
• Absolute entity and date in the same sentence


Duplicate Removal

• Guiding Principle - Increase Precision
• Duplicate event descriptions include event descriptions using a similar set of verbs and paraphrased sentences.
• Methodology
• Verb similarity
  ▪ Remove sentences containing similar sets of verbs that occur around the same date
• Cosine similarity
  ▪ Remove paraphrased sentences
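The slides do not say how verb similarity is measured. The sketch below assumes Jaccard overlap between verb sets, assumes the verbs have already been extracted (for example by a part-of-speech tagger), and treats "around the same date" as "on the same date". All names and the placeholder data are hypothetical.

```python
def verb_similarity(verbs_a, verbs_b):
    """Jaccard overlap of two verb sets (an assumed measure for this sketch)."""
    a, b = set(verbs_a), set(verbs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_verb_duplicates(events, threshold=0.5):
    """events: list of (date, sentence, verbs) tuples.

    Keeps the first event of each group of same-date events whose
    verb similarity exceeds the threshold.
    """
    kept = []
    for when, sentence, verbs in events:
        is_dup = any(
            when == k_when and verb_similarity(verbs, k_verbs) > threshold
            for k_when, _, k_verbs in kept
        )
        if not is_dup:
            kept.append((when, sentence, verbs))
    return kept


events = [
    ("2011-08-24", "sentence A", {"resign", "announce"}),
    ("2011-08-24", "sentence B", {"resign", "announce", "say"}),  # 2/3 overlap with A
    ("2011-08-24", "sentence C", {"join"}),                       # no overlap
    ("2011-10-05", "sentence D", {"resign", "announce"}),         # different date
]
kept = drop_verb_duplicates(events)
```

Here only sentence B is dropped: it shares its date with sentence A and two of its three verbs.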


Cosine Similarity

• Remove sentences with similarity > 0.5
• Sentence 1: "I have to go to school"
• Sentence 2: "I have to go to lecture"
• Terms: [I, have, to, go, school, lecture]
  ▪ V1 = [1, 1, 2, 1, 1, 0]
  ▪ V2 = [1, 1, 2, 1, 0, 1]
• Similarity = (V1 · V2) / (||V1|| * ||V2||)
• Similarity = 7 / (√8 * √8) = 0.875
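The calculation on this slide can be reproduced with a few lines of Python. This is a sketch that assumes simple lowercased whitespace tokenization; the function name is illustrative. With the term-count vectors shown, the dot product is 7 and each vector has norm √8, so the similarity works out to 7/8 = 0.875.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between the term-count vectors of two sentences."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


similarity = cosine_similarity("I have to go to school", "I have to go to lecture")
is_duplicate = similarity > 0.5  # threshold from the slide
```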


Lessons Learned

• Guiding Principle - Increase Precision
  ▪ The web caters to user interest. The more popular an event description is on the web, the more important the event is, and therefore the more likely it is to be in a user's expected results.
• Methodology
  ▪ Increase precision by removing unpopular event descriptions as determined by search results.
• Lesson
  ▪ Insufficient correlation between search results and event importance


Lessons Learned: Event Importance

[Figure: Event, Search Result Correlation. X axis: Event (0 to 80). Y axis: Search Results (log scale, 1 to 1,000,000,000). Series labeled by rank (0 to 3).]


Results

Demo time…


Evaluation

Information Retrieval (IR) performance characteristics:

• Precision: fraction of retrieved documents that are relevant to the query
• Recall: fraction of documents relevant to the query that are successfully retrieved
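Both measures follow directly from set overlap. The sketch below uses hypothetical event identifiers in place of documents; the function name is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 items retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e5"})
```

In this example precision is 2/4 and recall is 2/3: half of what was retrieved is relevant, and two of the three relevant items were found.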


Evaluation Method (cont.)

• Evaluated each timeline against a truth set
• Compared ChronoSearch results to others
• Analyzed results for: Bill Gates, Steve Jobs, Jim Tressel
• Truth set: merged existing manual timelines


Evaluation Recall

[Figure: bar chart of recall. Y axis: Recall Percentage (0 to 0.9). Bar labels, in order: ChronoSearch, CNET, Telegraph, ChronoSearch, NPR, Personal, ChronoSearch. Groups: Steve Jobs, Bill Gates, Jim Tressel.]


Evaluation Precision

Total Sentences: total number of sentences considered for output

Sentences Removed: total number of sentences removed (3 different mechanisms combined)

Precision Improvement: Percent of non-precise results removed.

Average precision improvement: 29.33%

Entity        Total Results   Results Removed   Precision Improvement
Bill Gates    187             48                23%
Steve Jobs    206             88                35%
Jim Tressel   83              29                30%


Evaluation Precision (cont.)

Bad Sentences: sentences whose average word length fell outside the accepted range of [3.2, 7.2] chars/word

Cosine Similar Events: sentences that had a cosine similarity > 0.5 (by term vectors)

Verb Similar Results: sentences that occurred on the same day and had a verb similarity > 0.5

Entity        Bad Sentences Removed   Cosine Similar Results Removed   Verb Similar Results Removed
Bill Gates    7                       38                               3
Steve Jobs    4                       78                               6
Jim Tressel   0                       28                               1


Evaluation Precision (cont.)

False Positives for removal techniques:

Average false positive rate: 14.13%

False positives / results removed, by technique:

Entity        Bad Sentences   Cosine Similar   Verb Similar   % False Positives For Total Events Removed
Bill Gates    4/7             0/38             1/3            5/48 = 10.42%
Steve Jobs    2/4             13/78            1/6            16/88 = 18.18%
Jim Tressel   0/0             3/28             1/1            4/29 = 13.79%
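The false positive percentages can be recomputed from the counts in the slides. The sketch below also checks an inference that the slides do not state explicitly: the reported precision improvements (23%, 35%, 30%) match (results removed - false positives) / total results, rather than results removed / total results. The dictionary layout and function names are illustrative.

```python
# Per-entity counts taken from the evaluation tables in the slides.
results = {
    "Bill Gates": {"total": 187, "removed": 48, "false_positives": 5},
    "Steve Jobs": {"total": 206, "removed": 88, "false_positives": 16},
    "Jim Tressel": {"total": 83, "removed": 29, "false_positives": 4},
}


def false_positive_rate(r):
    """Percent of removed results that should not have been removed."""
    return 100 * r["false_positives"] / r["removed"]


def net_improvement(r):
    """Assumed interpretation: correctly removed results as a percent of all results."""
    return 100 * (r["removed"] - r["false_positives"]) / r["total"]


fp_rates = {name: round(false_positive_rate(r), 2) for name, r in results.items()}
avg_fp_rate = round(
    sum(false_positive_rate(r) for r in results.values()) / len(results), 2
)
net = {name: round(net_improvement(r)) for name, r in results.items()}
```

Under that interpretation the per-entity values round to 23, 35, and 30, and the average false positive rate comes out at 14.13%, both in agreement with the slides.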


Evaluation Precision (cont.)

Duplicate Events Not Detected:

Average Effectiveness of Duplicate Detection: 84.65%

Entity        Duplicates We Failed To Remove   % Of Total Duplicates We Missed
Bill Gates    10                               10/51 = 19.61%
Steve Jobs    21                               21/105 = 20%
Jim Tressel   2                                2/31 = 6.45%
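The 84.65% figure can be reproduced as 100% minus the average miss rate across the three entities. In this sketch the variable names are illustrative; the counts come from the table above. As an observation rather than something the slides state, each total (51, 105, 31) equals the cosine-similar removals plus the verb-similar removals plus the missed duplicates for that entity.

```python
# (duplicates missed, total duplicates) per entity, from the table above.
missed = {
    "Bill Gates": (10, 51),
    "Steve Jobs": (21, 105),
    "Jim Tressel": (2, 31),
}

# Percent of duplicates that were not detected, per entity.
miss_rates = {name: 100 * m / total for name, (m, total) in missed.items()}

avg_miss_rate = sum(miss_rates.values()) / len(miss_rates)
effectiveness = round(100 - avg_miss_rate, 2)
```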


Future Research (cont.)

• Improve recall by extracting weaker signals
• Attempt to handle relative dates and/or pronouns
  ▪ Could resolve all relative dates in the document to be absolute based on the last seen absolute date
  ▪ Resolve pronouns to nearest entity
  ▪ Example: Steve Jobs was named the greatest CEO in 2011. One month ago, he passed away.


Future Research (cont.)

• Improve precision by associating events to verbs
• Attempt to find events by looking for verbs
  ▪ Assumption: An event should contain a verb and an entity
  ▪ If there is no verb, then there is no event
  ▪ Example: "Farewell Steve Jobs" 06 Oct 2011.


Conclusion

Thank you, we hope you enjoy ChronoSearch!