Improving Search Results Quality by Customizing Summary Lengths
Michael Kaisser★, Marti Hearst
and John B. Lowe
★University of Edinburgh, UC Berkeley, Powerset, Inc.
ACL-08: HLT
Talk Outline
• How best to display search results?
• Experiment 1: Is there a correlation between response type and response length?
• Experiment 2: Can humans predict the best response length?
• Summary and Outlook
Motivation
• Web search result listings today are largely standardized; they display a document’s surrogate (Marchionini et al., 2008)
• Typically: one header line, two lines of text fragments, one line for the URL (Source: Yahoo!)
• But: Is this the best way to present search results? In particular: Is this the optimal length for every query?
Experiment 1 – Research Question
Do different types of queries require responses of different lengths?
(And if so, is the preferred response length dependent on the expected semantic response type?)
Experiment 1 – Setup
• Data used: 12,790 queries from Powerset’s query database
• Contains search engines’ query logs and hand-crafted queries
• Disproportionally large number of natural language queries
Examples:
• “date of next US election”
• Hip Hop
• A synonym for material
• highest volcano
• What problems do federal regulations cause?
• I want to make my own candles
• industrial music
Excursus – Mechanical Turk
• Amazon Web Services API for computers to integrate "artificial artificial intelligence"
• Requesters can upload Human Intelligence Tasks (HITs)
• Workers work on these HITs and are paid small sums of money
• Examples:
  • Can you see a person in the photo?
  • Is the document relevant to a query?
  • Is the review of this product positive or negative?
• Mechanical Turk can also be seen as a platform for online experiments
Experiment 1
Turkers are asked to classify queries by
• Expected response type
• Best response length
Each query is classified by three different subjects.
Experiment 1 – Results
• The distribution of length categories differs across the individual expected response categories.
• Some results are intuitive:
  • Queries for numbers want short results
  • Advice queries want longer results
• Some results are more surprising:
  • Different length distributions for Person vs. Organization
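The comparison of length distributions across response categories can be illustrated with a minimal sketch. The labels below are hypothetical (the category names "Number" and "Advice" echo the slide, but the length names "short", "medium" and "long" and all counts are assumptions for illustration, not data from the study).

```python
from collections import Counter, defaultdict

# Hypothetical (expected response type, best length) labels; NOT study data.
labels = [
    ("Number", "short"),
    ("Number", "short"),
    ("Number", "medium"),
    ("Advice", "long"),
    ("Advice", "long"),
    ("Advice", "medium"),
    ("Advice", "long"),
]


def length_distribution(pairs):
    """Return {response_type: {length: share}}; shares sum to 1 per type."""
    counts = defaultdict(Counter)
    for response_type, length in pairs:
        counts[response_type][length] += 1
    return {
        rtype: {length: n / sum(c.values()) for length, n in c.items()}
        for rtype, c in counts.items()
    }


dist = length_distribution(labels)
for rtype, shares in dist.items():
    print(rtype, shares)
```

Normalizing per response type makes categories with different numbers of queries directly comparable.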
Experiment 2 – Research Question
Can human judges correctly predict the preferred result
length?
Experiment 2 – Setup
• Experiment 1 produced 1099 high-confidence queries (where all three turkers agreed on semantic category and length)
• For 170 of these, turkers manually created snippets of different lengths from Wikipedia:
  • Phrase
  • Sentence
  • Paragraph
  • Section
  • Article (in this case a link to the article was displayed)
• Note: Categories differ slightly from the first experiment
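The high-confidence criterion (all three turkers agree on both semantic category and length) could be sketched as below. The data structure, the function name `high_confidence`, and the category labels "Location", "General" and "Advice" are assumptions for illustration; only the unanimity rule comes from the slide.

```python
# Hypothetical judgments: query -> one (semantic category, length) label per turker.
judgments = {
    "highest volcano": [("Location", "phrase")] * 3,
    "industrial music": [
        ("General", "article"),
        ("General", "paragraph"),  # one turker disagrees on length
        ("General", "article"),
    ],
    "I want to make my own candles": [("Advice", "article")] * 3,
}


def high_confidence(judgments, n_judges=3):
    """Keep queries whose n_judges labels are all identical; map each to that label."""
    kept = {}
    for query, labels in judgments.items():
        if len(labels) == n_judges and len(set(labels)) == 1:
            kept[query] = labels[0]
    return kept


agreed = high_confidence(judgments)
print(sorted(agreed))
```

Treating the (category, length) pair as a single label means a disagreement on either dimension excludes the query.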
Experiment 2 – Setup
Displayed:
• Instructions
• Query
• One response from one length category
• Rating scale
Each HIT was shown to ten turkers.
Experiment 2 – Setup
Instructions:
Below you see a search engine query and a possible response. We would like you to give us your opinion about the response. We are especially interested in the length of the response. Is it suitable for the query? Is there too much or not enough information? Please rate the response on a scale from 0 (very bad response) to 10 (very good response).
Experiment 2 – Significance
Group        Slope    Std. Error   p-value
Phrase       -0.850   0.044        <0.0001
Sentence     -0.550   0.050        <0.0001
Paragraph     0.328   0.049        <0.0001
Article       0.856   0.053        <0.0001
Significance results of unweighted linear regression on the data for the second experiment, which was separated into four groups based on the predicted preferred length.
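As a rough illustration of what slope, standard error and p-value mean in an unweighted linear regression, the sketch below fits ratings against response length for one group. It rests on several assumptions not stated on the slide: that the predictor is the response length coded as an ordinal (0 = phrase to 4 = article), that the toy ratings are invented, and that a normal approximation to the t-distribution is acceptable for the p-value.

```python
import math
from statistics import NormalDist


def ols_slope(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (slope, std_err, p_value).

    The two-sided p-value uses a normal approximation, which is only
    reasonable for large samples; it is an assumption of this sketch.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    z = slope / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, std_err, p_value


# Invented ratings (0-10) for a group predicted to prefer short responses.
lengths = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
ratings = [9, 8, 6, 5, 3, 8, 7, 6, 4, 2]

slope, std_err, p_value = ols_slope(lengths, ratings)
print(f"slope={slope:.3f} std_err={std_err:.3f} p={p_value:.4f}")
```

Under this coding, a negative slope means ratings fall as responses get longer, which would match the sign pattern in the table for the Phrase and Sentence groups.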
Experiment 2 – Details
• 146 queries × 5 length categories per query × 10 judgments per query and length category = 7,300 judgments
• 124 judges; 16 judges did more than 146 HITs; 2 of these 16 were excluded (scammers)
• $0.01 per judgment: $73 paid to judges, plus $73 Amazon fees = $146 for Experiment 2 (excluding snippet generation)
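The judgment count and cost figures follow from simple arithmetic, worked through below. Amounts are kept in integer cents to avoid floating-point rounding; the $73 fee is taken as reported on the slide rather than derived from a fee schedule.

```python
QUERIES = 146
LENGTH_CATEGORIES = 5
JUDGMENTS_PER_ITEM = 10      # each HIT was shown to ten turkers
CENTS_PER_JUDGMENT = 1       # $0.01 per judgment
AMAZON_FEE_CENTS = 7300      # $73, as reported on the slide

total_judgments = QUERIES * LENGTH_CATEGORIES * JUDGMENTS_PER_ITEM
judge_pay_cents = total_judgments * CENTS_PER_JUDGMENT
total_cents = judge_pay_cents + AMAZON_FEE_CENTS

print(f"{total_judgments} judgments")
print(f"${judge_pay_cents / 100:.2f} paid to judges")
print(f"${total_cents / 100:.2f} total")
```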
Experiment 2 – Results
• Human judges can predict the preferred result lengths (at least for a subset of especially clear queries)
• Standard results listings are often too short (and sometimes too long)
Outlook
• Can queries be automatically classified according to their predicted result length?
• Initial experiment:
  • Unigram word counts
  • 805 training queries, 286 test queries
  • Three length bins (long, short, other)
  • Weka NaiveBayesMultinomial
• Initial result: 78% of queries correctly classified
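The initial experiment used Weka's NaiveBayesMultinomial. As a stand-in, the sketch below implements a small multinomial Naive Bayes classifier over unigram counts in plain Python; it is not the Weka implementation. The training queries are taken from the example slide, but their length labels are hypothetical, as are the function names and the Laplace smoothing parameter `alpha`.

```python
import math
from collections import Counter, defaultdict


def train(examples, alpha=1.0):
    """Collect per-class unigram counts; alpha is the Laplace smoothing constant."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict(model, text):
    """Return the class with the highest log posterior; unseen words are ignored."""
    class_counts, word_counts, vocab, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for token in text.lower().split():
            if token in vocab:
                score += math.log((word_counts[label][token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Queries from the example slide; the length-bin labels are hypothetical.
training = [
    ("date of next US election", "short"),
    ("highest volcano", "short"),
    ("I want to make my own candles", "long"),
    ("What problems do federal regulations cause?", "long"),
    ("industrial music", "other"),
    ("Hip Hop", "other"),
]

model = train(training)
print(predict(model, "highest mountain"))
print(predict(model, "I want to make soap"))
```

With only six training queries this is a toy; the reported 78% accuracy refers to the authors' experiment with 805 training and 286 test queries, not to this sketch.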
Thank you!
MT Demographics – Age
MT Demographics – Gender
MT Demographics – Education
MT Demographics – Income
MT Demographics – Purpose
Survey, data and graphs from Panos Ipeirotis’ blog: http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
Excursus – Mechanical Turk
Example HIT (not ours):