Post on 20-Dec-2015
Merging Ranks from Heterogeneous Internet Sources
Hector Garcia-Molina
Luis Gravano
Stanford University

Users Have Many Available Information Sources
User Query: “Houses near Palo Alto for around $300K.”
Query Results:
Source 1: h_11, h_12, h_13, ...
Source 2
...
Nothing!

Challenges
• Sources are too numerous
• Sources are heterogeneous (query language, model, results)
• Users want a single query result

Metasearcher
• Selects the good sources for a query
• Extracts and combines the query results from the sources

Text Sources Rank Query Results
Text Source
Doc 1: 0.8
Doc 2: 0.6
...
“Distributed Databases”

Structured Sources on the Internet also Rank Results
A real-estate agent receives queries on Location and Price:
Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

The Agent Ranks its Houses Based on its Own Scoring Function
Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”
Rank  House ID  Source Score  Location       Price
1     MV1       0.43          Mountain View  $350K
2     MV2       0.42          Mountain View  $360K
3     PA1       0.28          Palo Alto      $600K

A Metasearcher then Faces Two Problems
• Extracting the top objects from the underlying sources
• Merging the results from the various sources

Merging Query Results is Easy with Enough Information
Given a record like:
Rank  House ID  Source Score  Location       Price
1     MV1       0.43          Mountain View  $350K
the metasearcher ignores the Source score and computes its Target score from the Location and Price

Extracting the Top Objects from a Source is Hard
The metasearcher’s scoring function might be different from the source’s!
Rank  House ID  Target Score  Location       Price
1     PA1       1             Palo Alto      $600K
2     MV1       0.51          Mountain View  $350K
3     MV2       0.5           Mountain View  $360K
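
A minimal sketch of why extraction is hard, using only the Source and Target scores from the two tables above. The helper name top_k and the tuple layout are illustrative assumptions, not part of the original slides. It shows that asking the source for its single best house returns MV1, while the metasearcher's best house is PA1.

```python
# Scores copied from the slides: (house_id, source_score, target_score).
houses = [
    ("MV1", 0.43, 0.51),
    ("MV2", 0.42, 0.50),
    ("PA1", 0.28, 1.00),
]

def top_k(records, k, column):
    """Return the ids of the k records with the highest score in the given column."""
    ranked = sorted(records, key=lambda r: r[column], reverse=True)
    return [r[0] for r in ranked[:k]]

source_top1 = top_k(houses, 1, 1)  # what the source returns as its top house
target_top1 = top_k(houses, 1, 2)  # what the metasearcher actually wants
print(source_top1, target_top1)
```

Here PA1 is ranked last by the source, so a metasearcher relying on the source's ranking alone would have to retrieve every house to find it.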

We Want to Avoid Extracting All the Source’s Contents
Assume a house h with:
• Source(Q, h) = 0 (worst for source)
• Target(Q, h) = 1 (best for metasearcher)
Problem!

The Example Query is Not Manageable at the Agent
A query Q is manageable at a source if ∃ ε < 1 such that, for every object h:
Source(Q, h) ≥ Target(Q, h) − ε
[Figure: Source score plotted against Target score, from (0,0) to (1,1)]
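
A small sketch of the manageability condition, reading it as Source(Q, h) ≥ Target(Q, h) − ε with ε < 1. The function names and the sample pairs are hypothetical. Since the definition ranges over all possible objects, a finite sample can only suggest a value of ε or refute manageability; it cannot prove it.

```python
def smallest_epsilon(scored_objects):
    """Smallest eps >= 0 with source >= target - eps for every (source, target) pair."""
    return max([0.0] + [target - source for source, target in scored_objects])

def looks_manageable(scored_objects):
    """True if the sampled pairs are consistent with some eps < 1."""
    return smallest_epsilon(scored_objects) < 1.0

# The house from the previous slide: Source = 0, Target = 1 forces eps = 1.
print(looks_manageable([(0.0, 1.0)]))

# Hypothetical pairs where the source never underestimates by much.
print(looks_manageable([(0.9, 1.0), (0.5, 0.4)]))
```

The first call reports False, matching the slide's point that the example query is not manageable at the agent.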

Single-Attribute Queries Are More Likely to be Manageable
Single-attribute queries for Q:
• Q_1: Location = Palo Alto
• Q_2: Price = $300K

The Example Becomes Tractable!
… if the top Target objects for Q are among the top Source objects for Q_1 and Q_2

A Cover Bounds the Target Scores for Q
Single-attribute queries Q_1, …, Q_m form a cover for Q if ∃ g_1, …, g_m, G such that:
Target(Q_i, h) ≤ g_i (i = 1, …, m) ⇒ Target(Q, h) ≤ G
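
As an illustration of the cover condition, the sketch below assumes that Target(Q, h) is obtained by combining the single-attribute Target scores with Min or Max (the two target functions named on the performance slides). Under that assumption, G can be taken as the same function applied to the thresholds g_1, …, g_m. The names cover_bound and bound_holds are hypothetical, and the grid check is only a sanity test, not a proof.

```python
from itertools import product

def cover_bound(thresholds, combine):
    """Candidate G for a target that combines single-attribute scores monotonically."""
    return combine(thresholds)

def bound_holds(thresholds, combine, steps=10):
    """Check on a grid that scores below every g_i give a combined score of at most G."""
    G = cover_bound(thresholds, combine)
    grid = [i / steps for i in range(steps + 1)]
    for scores in product(grid, repeat=len(thresholds)):
        if all(s <= g for s, g in zip(scores, thresholds)):
            if combine(scores) > G:
                return False
    return True

g = [0.7, 0.4]
print(cover_bound(g, min), bound_holds(g, min))
print(cover_bound(g, max), bound_holds(g, max))
```

The bound follows from monotonicity: lowering any single-attribute score cannot raise a Min or Max over those scores.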

Having a Manageable Cover for a Query is Sufficient...
Manageable Cover for query Q at source S ⇒ “Efficient” Executions Possible at S

Having a Manageable Cover for a Query is Sufficient...
(1) Pick a manageable cover C = {Q_1, ..., Q_m} for Q at S
(2) For i = 1 to m: Find ε_i for Q_i
(3) Pick 0 ≤ g_1, ..., g_m, G < 1 for cover C
(4) For i = 1 to m
(5)     Retrieve all objects t with Source(Q_i, t) ≥ G_i = g_i − ε_i
(6) Compute Target(Q, t) for all objects t retrieved
(7) If ∃ i such that G_i ≤ 0 Then Go to Step (11)
(8) If for all t retrieved, Target(Q, t) < G Then
(9)     Find new, lower 0 ≤ g_1, ..., g_m, G < 1 for C
(10)    Go to Step (4)
(11) Output those objects retrieved with the highest Target score
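
The steps above can be sketched in Python under several stated assumptions: the source is simulated by an in-memory dictionary (a real source would answer one threshold query per Q_i), Target(Q, t) combines the single-attribute Target scores with a monotone function such as min or max, G is that same function applied to the thresholds, the comparison in step (8) is read as "<", and thresholds are lowered uniformly by a fixed step. The function name, data layout, and sample scores are all hypothetical.

```python
def extract_top(data, combine, eps, step=0.1):
    """Return retrieved (object id, Target score) pairs, best first.

    data: id -> {"source": per-attribute Source scores,
                 "target": per-attribute Target scores}
    combine: monotone function (e.g. min or max) giving Target(Q, t)
    eps: eps_i for each single-attribute query Q_i
    """
    m = len(eps)
    g = [1.0 - step] * m                      # step (3): thresholds below 1
    retrieved = {}                            # object id -> Target(Q, t)
    while True:
        G = combine(g)                        # assumed cover bound for these thresholds
        cutoffs = [gi - ei for gi, ei in zip(g, eps)]       # G_i = g_i - eps_i
        for i in range(m):                    # steps (4)-(5): one threshold query per Q_i
            for t, rec in data.items():
                if rec["source"][i] >= cutoffs[i]:
                    retrieved[t] = combine(rec["target"])   # step (6)
        if any(c <= 0 for c in cutoffs):      # step (7): a cutoff of 0 retrieves everything
            break
        if all(score < G for score in retrieved.values()):  # step (8)
            g = [max(0.0, gi - step) for gi in g]           # step (9): lower the thresholds
            continue                                        # step (10)
        break
    # step (11): objects ordered by Target score, highest first
    return sorted(retrieved.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical objects; every Source score is within 0.1 of its Target score.
DATA = {
    "a": {"source": (0.88, 0.15), "target": (0.90, 0.20)},
    "b": {"source": (0.47, 0.52), "target": (0.50, 0.60)},
    "c": {"source": (0.25, 0.93), "target": (0.30, 0.95)},
    "d": {"source": (0.05, 0.08), "target": (0.10, 0.10)},
    "e": {"source": (0.83, 0.75), "target": (0.85, 0.82)},
}

ranked_min = extract_top(DATA, min, eps=[0.1, 0.1])
ranked_max = extract_top(DATA, max, eps=[0.1, 0.1])
print(ranked_min[0][0], len(ranked_min))
print(ranked_max[0][0], len(ranked_max))
```

In this toy run the best object is found after retrieving three of the five objects for both Min and Max. An object that was never retrieved scored below g_i − ε_i on every Q_i, so its Target score is at most G and it cannot beat a retrieved object whose Target score reaches G.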

Algorithm to Extract Top Target Objects
[Figure: score axes from 0 to 1 for Q_1 and Q_2, with thresholds g_1 and g_2; annotation: Target(Q, h) ≤ G]

Algorithm to Extract Top Target Objects
[Figure: score axes from 0 to 1 for Q_1 and Q_2, with lowered thresholds g_1’ and g_2’; annotations: Target(Q, h) ≤ G’ and, for object h’, Target(Q, h’) ≥ G’!]

Preliminary Performance Results for our Algorithm
• Target=Min: 14% objects retrieved
• Target=Max: 4% objects retrieved
10,000 objects
4 query attributes
ε = 0

Preliminary Performance Results for our Algorithm
• Target=Min: 25% objects retrieved
• Target=Max: 44% objects retrieved
10,000 objects
4 query attributes
ε = 0.10

Having a Manageable Cover for a Query is Also Necessary...
No Manageable Cover for query Q at source S ⇒ Efficient Executions Impossible at S

A Manageable Cover is Necessary: Proof
Consider Q_1, Q_2, Q_3 minimal cover for Q with:
Q_1, Q_2 manageable, Q_3 not manageable
For any “efficient” execution, build h such that:
• h is not retrieved
• Target(Q, h) > G = max{Target(Q, o) | o retrieved}

A Manageable Cover is Necessary: Proof
[Figure: score axes from 0 to 1 for Q_1, Q_2, Q_3, with thresholds g_1, g_2, g_3 and objects h’ and h; annotations: Target(Q, h’) > G! and Target(Q, h) > G!]

We Studied Two Metasearching Problems
• Extracting the top objects from the underlying sources
• Merging the results from the various sources

Related Work: Collection Fusion
• Voorhees et al.
• Callan/Lu/Croft
• Gauch/Wang