ChronoSearch: A System for Extracting a Chronological Timeline
Pete Bohman, Adam Kunk
Motivation
Current search engines do not provide a complete picture:
▪ Latest events dominate the top results
▪ The user is forced to parse through many pages to find a complete list of information
ChronoSearch aims to summarize search results into a concise list of important events related to an entity
Problem Definition
Input: an entity, E, and a set of web pages, W, related to E
Output: a sorted list of events, L, related to E
Desired output characteristics
Output: L = { l_i | l_i occurred before l_(i+1) }
▪ l_i is a sentence describing an event
▪ l_i describes a unique event
▪ l_i contains a link to its source web page w in W

L is precise
▪ Each l_i describes an event the user is interested in

L is comprehensive
▪ L contains a description of all the events a user is interested in
System Overview
• Extract textual elements from web pages (sketched below)
  ▪ Beautiful Soup to extract <p> elements and remove HTML tags
• NLTK sentence tokenization
• Sentence sanitization
  ▪ Avg. word length must fall in [3.2, 7.2] chars/word
• Extract sentences containing the entity and a date
  ▪ Regular expressions used for date extraction
• Order events by date
• Remove sentences reporting the same event
  ▪ Cosine similarity
  ▪ Verb similarity
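A minimal sketch of the first pipeline stages (tag stripping, sentence tokenization, and the average-word-length sanity check); the helper names `extract_sentences` and `is_sane` are our own, not names from the actual system:

```python
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

def is_sane(sentence, lo=3.2, hi=7.2):
    """Keep only sentences whose average word length is in [3.2, 7.2]."""
    words = sentence.split()
    if not words:
        return False
    avg = sum(len(w) for w in words) / len(words)
    return lo <= avg <= hi

def extract_sentences(html):
    """Pull <p> elements, strip HTML tags, tokenize into sentences, sanitize."""
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return [s for s in sent_tokenize(text) if is_sane(s)]
```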
Web Redundancy
Focus on the strongest signal: the entity and an absolute date in the same sentence (a rough sketch follows).
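A rough sketch of that filter. The date pattern below is an illustrative assumption, not the exact regular expressions ChronoSearch uses:

```python
import re

# Illustrative absolute-date pattern: "Oct 5, 2011", "October 5 2011", or a bare year.
DATE_RE = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?)\s+\d{1,2},?\s+\d{4}\b"
    r"|\b\d{4}\b"
)

def has_strong_signal(sentence, entity):
    """True if the sentence mentions the entity and an absolute date."""
    return entity in sentence and DATE_RE.search(sentence) is not None
```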
Duplicate Removal
Guiding principle – increase precision. Duplicate event descriptions include event descriptions that use a similar set of verbs, and paraphrased sentences.

Methodology
▪ Verb similarity: remove sentences containing similar sets of verbs that occur around the same date (see the sketch below)
▪ Cosine similarity: remove paraphrased sentences
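The slides do not spell out the verb-similarity measure; one plausible reading is Jaccard overlap between the verb sets of two same-date sentences, sketched here with NLTK's POS tagger:

```python
import nltk  # requires punkt and averaged_perceptron_tagger data

def verb_set(sentence):
    """Set of verbs in a sentence, by Penn Treebank VB* tags."""
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    return {word.lower() for word, tag in tags if tag.startswith("VB")}

def verbs_similar(s1, s2, threshold=0.5):
    """Jaccard overlap of verb sets; applied only to same-date sentences."""
    v1, v2 = verb_set(s1), verb_set(s2)
    if not v1 or not v2:
        return False
    return len(v1 & v2) / len(v1 | v2) > threshold
```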
Cosine Similarity
Remove sentences with similarity > 0.5.
Sentence 1: "I have to go to school"
Sentence 2: "I have to go to lecture"
Vocabulary: [I, have, to, go, school, lecture]
▪ V1 = [1, 1, 2, 1, 1, 0]
▪ V2 = [1, 1, 2, 1, 0, 1]
Similarity = (V1 · V2) / (||V1|| * ||V2||) = 7 / 8 = 0.875
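The same computation as a runnable sketch over simple term-count vectors:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(s1, s2):
    """Cosine similarity between term-count vectors of two sentences."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_similarity("I have to go to school", "I have to go to lecture"))
# 0.875 -> above the 0.5 threshold, so one of the two sentences is removed
```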
Lessons Learned
Guiding principle – increase precision. The web caters to user interest: the more popular an event description is on the web, the more important the event is, and therefore the more likely it is to appear in a user's expected results.

Methodology: increase precision by removing unpopular event descriptions, as determined by search result counts.

Lesson: there is insufficient correlation between search result counts and event importance.
Lessons Learned: Event Importance
[Chart: "Event, Search Result Correlation" – x-axis: events; y-axis: search results (log scale, 1 to 1,000,000,000); series labeled Rank 0 through Rank 3]
Results
Demo time…
Evaluation
Information Retrieval (IR) performance characteristics:
▪ Precision – the fraction of retrieved documents that are relevant to the query
▪ Recall – the fraction of relevant documents that are successfully retrieved
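As set arithmetic, a minimal sketch of both measures:

```python
# Precision and recall over sets of events (assumes non-empty sets).
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant)
```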
Evaluation Method (cont.)
▪ Merged existing manual timelines to form a truth set
▪ Evaluated each timeline against the truth set
▪ Compared ChronoSearch results to other published timelines
▪ Analyzed results for: Bill Gates, Steve Jobs, Jim Tressel
Truth Set
Evaluation Recall
[Chart: Recall – y-axis: percentage (0 to 0.9); x-axis: ChronoSearch compared with CNET, Telegraph, NPR, and personal timelines; series: Steve Jobs, Bill Gates, Jim Tressel]
Evaluation Precision
Total Results: total number of sentences considered for output
Results Removed: total number of sentences removed (3 different mechanisms combined)
Precision Improvement: percent of non-precise results removed

Average precision improvement: 29.33%

Entity        Total Results   Results Removed   Precision Improvement
Bill Gates    187             48                23%
Steve Jobs    206             88                35%
Jim Tressel   83              29                30%
Evaluation Precision (cont.)
Bad Sentences: sentences whose average word length fell outside the allowed range
Cosine Similar Results: sentences with a cosine similarity > 0.5 (by term vectors)
Verb Similar Results: sentences that occurred on the same day and had a verb similarity > 0.5

Entity        Bad Sentences Removed   Cosine Similar Results Removed   Verb Similar Results Removed
Bill Gates    7                       38                               3
Steve Jobs    4                       78                               6
Jim Tressel   0                       28                               1
Evaluation Precision (cont.)
False positives for the removal techniques:

Average false positive rate: 14.13%

Entity        Bad Sentences (FP)   Cosine Similar (FP)   Verb Similar (FP)   % False Positives of Total Removed
Bill Gates    4/7                  0/38                  1/3                 5/48 = 10.42%
Steve Jobs    2/4                  13/78                 1/6                 16/88 = 18.18%
Jim Tressel   0/0                  3/28                  1/1                 4/29 = 13.79%
Evaluation Precision (cont.)
Duplicate events not detected:

Average effectiveness of duplicate detection: 84.65%

Entity        Duplicates We Failed to Remove   % of Total Duplicates Missed
Bill Gates    10                               10/51 = 19.61%
Steve Jobs    21                               21/105 = 20.00%
Jim Tressel   2                                2/31 = 6.45%
Future Research

Improve recall by extracting weaker signals. Attempt to handle relative dates and/or pronouns:
▪ Could resolve all relative dates in a document to absolute dates, anchored to the last seen absolute date (see the sketch below)
▪ Resolve pronouns to the nearest entity
▪ Example: "Steve Jobs was named the greatest CEO in 2011. One month ago, he passed away."
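A sketch of the proposed relative-date resolution. The phrase patterns, the 30-day month, and the anchor date below are illustrative assumptions:

```python
import re
from datetime import date, timedelta

RELATIVE_RE = re.compile(r"\b(one|two|three|\d+)\s+(day|week|month)s?\s+ago\b", re.I)
WORDS = {"one": 1, "two": 2, "three": 3}
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # month approximated as 30 days

def resolve_relative(sentence, last_absolute):
    """Anchor a relative phrase ("one month ago") to the last absolute date
    seen in the document; returns None if nothing matches."""
    m = RELATIVE_RE.search(sentence)
    if not m or last_absolute is None:
        return None
    n = WORDS.get(m.group(1).lower()) or int(m.group(1))
    return last_absolute - timedelta(days=n * UNIT_DAYS[m.group(2).lower()])

# Hypothetical anchor: the document's last absolute date was 2011-11-05.
print(resolve_relative("One month ago, he passed away.", date(2011, 11, 5)))
# 2011-10-06 (approximate; month lengths vary)
```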
Future Research (cont.)
Improve precision by associating events with verbs. Attempt to find events by looking for verbs:
▪ Assumption: an event should contain a verb and an entity
▪ If there is no verb, then there is no event (see the sketch below)
▪ Example: "Farewell Steve Jobs" – 06 Oct 2011 (a date and an entity, but no verb, so not an event)
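A sketch of that verb filter using NLTK's POS tagger; the VB* check is our reading of the proposal:

```python
import nltk  # requires punkt and averaged_perceptron_tagger data

def looks_like_event(sentence, entity):
    """Accept a sentence as an event only if it names the entity and
    contains at least one verb (Penn Treebank VB* tag)."""
    if entity not in sentence:
        return False
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    return any(tag.startswith("VB") for _, tag in tags)

print(looks_like_event("Farewell Steve Jobs", "Steve Jobs"))          # expected False: no verb
print(looks_like_event("Steve Jobs resigned as CEO.", "Steve Jobs"))  # expected True: "resigned"
```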
Conclusion
Thank you, we hope you enjoy ChronoSearch!