K-relevance: measuring source relevance in data integration queries
Post on 21-Dec-2015
Queries, relations and sources
K-relevance is defined for queries over one or more relations.
Every relation is based on data extracted from one or more external sources.
The data in a relation may be out of date, because the data from some sources may have been extracted from earlier versions of those sources.
Relations and sources
Every tuple in a relation is based on exactly one source and has a column that contains a reference to that source.
Example:
Number  source
0       Even_nums.html
1       Odd_nums.html
Relations and sources
One source may be used by more than one relation.
Example:
positiveNums
number  source
0       evenNums.html
1       oddNums.html

negativeNums
number  source
-2      evenNums.html
-1      oddNums.html
Source information is needed
If a user suspects a mistake in the query results, knowing which sources the results are based on may help locate the origin of the mistake.
If a source can never contribute to the query results, there is no need to extract data from it.
If a source can contribute to the query results regardless of the other sources, its data may need to be extracted more frequently.
Query results and sources
Every tuple in the query results is a join of tuples, one tuple from each relation.
The sources of a result tuple are the union of the sources of the tuples that joined to create it.
0-relevance – the actual data sources
The union of the sources of all the tuples in the query results is called the 0-relevant sources.
If the query result is empty, there are no tuples in the results, so there are no 0-relevant sources.
0-relevance - example
SELECT allNums.n FROM allNums, evenNums WHERE allNums.n <= evenNums.n
allNums
n  src
1  nums1
2  nums2

evenNums
n  src
2  nums2

Result
n  allNums.src ∪ evenNums.src = source
1  {nums1} ∪ {nums2} = {nums1,nums2}
2  {nums2} ∪ {nums2} = {nums2}

The 0-relevant sources: {nums1,nums2} ∪ {nums2} = {nums1,nums2}
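The example above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `zero_relevant_sources` is made up for illustration, and the relations are plain lists of `(n, src)` pairs copied from the tables above.

```python
# Relations as lists of (n, src) pairs, copied from the example tables.
all_nums = [(1, "nums1"), (2, "nums2")]
even_nums = [(2, "nums2")]


def zero_relevant_sources(all_nums, even_nums):
    """Return the result rows with their source sets, and the union of those sets."""
    rows = []
    for a_n, a_src in all_nums:
        for e_n, e_src in even_nums:
            if a_n <= e_n:  # WHERE allNums.n <= evenNums.n
                # Sources of a result tuple: union of the joined tuples' sources.
                rows.append((a_n, {a_src, e_src}))
    relevant = set()
    for _, sources in rows:
        relevant |= sources
    return rows, relevant


rows, relevant = zero_relevant_sources(all_nums, even_nums)
print(sorted(relevant))
```

An empty query result gives an empty `rows` list and therefore an empty set of 0-relevant sources, which matches the definition.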
0-relevance via relation
For a relation R, if a tuple of R with source S has joined to create a result tuple, then S is 0-relevant via R.
Example:

Result
n  allNums.src ∪ evenNums.src = source
1  {nums1} ∪ {nums2} = {nums1,nums2}
2  {nums2} ∪ {nums2} = {nums2}

{nums1,nums2} are 0-relevant via allNums.
{nums2} is 0-relevant via evenNums.
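The per-relation bookkeeping can be sketched as follows. The helper `zero_relevant_via` and the dictionary layout are assumptions made for this illustration; each tuple is a `(n, src)` pair and the query condition is passed in as a Python function.

```python
from itertools import product


def zero_relevant_via(relations, predicate):
    """Map each relation name to the sources of its tuples that joined."""
    names = list(relations)
    via = {name: set() for name in names}
    # Try every combination of one real tuple per relation.
    for combo in product(*(relations[name] for name in names)):
        row = dict(zip(names, combo))
        if predicate(row):
            for name, (_, src) in row.items():
                via[name].add(src)
    return via


relations = {
    "allNums": [(1, "nums1"), (2, "nums2")],
    "evenNums": [(2, "nums2")],
}
# WHERE allNums.n <= evenNums.n
via = zero_relevant_via(relations, lambda r: r["allNums"][0] <= r["evenNums"][0])
print({name: sorted(sources) for name, sources in via.items()})
```

Taking the union of the per-relation sets gives the 0-relevant sources of the query.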
Definition: Potential tuple
A "potential tuple" for a relation is any tuple that fits the schema of the relation (it may or may not actually exist in the relation).
For example, for the relation R(string, int), every tuple of the form (string s, int i) is a potential tuple.
For a relation that contains a source column, every potential tuple that has S in this column is called a potential tuple from S.
Note: every "real" tuple in R is also a potential tuple, because it fits the schema of R.
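A schema check of this kind might look like the sketch below. Representing a schema as a tuple of Python types, and the names `is_potential_tuple`, `is_potential_tuple_from`, and `NUMS_SCHEMA`, are assumptions made for illustration only.

```python
def is_potential_tuple(candidate, schema):
    """True if `candidate` has one value of the right type per schema column."""
    return len(candidate) == len(schema) and all(
        isinstance(value, column_type)
        for value, column_type in zip(candidate, schema)
    )


def is_potential_tuple_from(candidate, schema, source, source_column=-1):
    """True if `candidate` fits the schema and names `source` in its source column."""
    return is_potential_tuple(candidate, schema) and candidate[source_column] == source


R_SCHEMA = (str, int)     # the relation R(string, int) from the text
NUMS_SCHEMA = (int, str)  # hypothetical: a number followed by a source column

print(is_potential_tuple(("abc", 7), R_SCHEMA))
print(is_potential_tuple_from((4, "nums2"), NUMS_SCHEMA, "nums2"))
```

The check never looks at the contents of the relation, which is the point of the definition: a potential tuple only has to fit the schema.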
∞-relevance via relation
If there are
a potential tuple from the source S for the relation R
and potential tuples for the other relations in the query
which can join to satisfy the query and create a result tuple,
then S is called an ∞-relevant source via R.
∞-relevance
The union of the ∞-relevant sources via the relations in the query is called the ∞-relevant sources of the query.
Note: the ∞-relevant sources are independent of the data in the relations, and depend only on the query and on the sources of the queried relations.
∞-relevance
Every source of the relations is ∞-relevant, unless the query places constraints on the source column.
Note: the data sources of the relations are shared: if S is a source of R1, it is also a source of R2.
Therefore, if there are no constraints on the source column of at least one of the relations, all of the sources are ∞-relevant.
∞-relevance - example
Suppose the data sources are {src1.html, src2.html} and the query is
SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x
No potential tuple for A from src1 can satisfy the query.
There is a potential tuple for A from src2 (for example, {x=1, src=src2}) and a potential tuple for B (for example, {x=2, src=src1}) which satisfy the query and create the result tuple (1). Therefore src2 is ∞-relevant via A.
∞-relevance - example
The data sources are {src1.html, src2.html}.
SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x
There is a potential tuple for B from src1 (for example, {x=2, src=src1}) and a potential tuple for A (for example, {x=1, src=src2}) which satisfy the query and create the result tuple (1). Therefore src1 is ∞-relevant via B.
There is a potential tuple for B from src2 (for example, {x=3, src=src2}) and a potential tuple for A (for example, {x=2, src=src2}) which satisfy the query and create the result tuple (2). Therefore src2 is ∞-relevant via B.
∞-relevance - example
{src2} is ∞-relevant via A.
{src1,src2} are ∞-relevant via B.
{src2} ∪ {src1,src2} = {src1,src2} are the ∞-relevant sources of the query.
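The same conclusion can be checked by brute force. The sketch below rests on one loud assumption: potential tuples are enumerated over a small finite range of x values (`X_DOMAIN`), which stands in for "all integers" and happens to be large enough for this query. The helper names are illustrative.

```python
from itertools import product

SOURCES = ["src1.html", "src2.html"]
X_DOMAIN = range(4)  # assumption: a small stand-in for all possible x values


def satisfies(a, b):
    # WHERE A.source != 'src1.html' AND A.x < B.x
    return a["source"] != "src1.html" and a["x"] < b["x"]


def potential_tuples():
    """Every (x, source) tuple over the assumed finite domain."""
    return [{"x": x, "source": s} for x, s in product(X_DOMAIN, SOURCES)]


def inf_relevant_via():
    """Sources that are infinity-relevant via A and via B."""
    via = {"A": set(), "B": set()}
    # No real data is consulted: both A and B range over potential tuples.
    for a, b in product(potential_tuples(), repeat=2):
        if satisfies(a, b):
            via["A"].add(a["source"])
            via["B"].add(b["source"])
    return via


via = inf_relevant_via()
print(sorted(via["A"]), sorted(via["B"]), sorted(via["A"] | via["B"]))
```

Because only the query and the list of sources are inputs, the result does not change when the contents of A or B change.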
K-relevance
Assume the query is over m relations. If there are
a potential tuple from the source S for the relation R,
at most k-1 other potential tuples for at most k-1 other relations (one tuple for each relation),
and real tuples for each of the remaining relations in the query
which can join to create a result tuple of the query,
then S is called a k-relevant source via R.
K-relevance
The union of the k-relevant sources via all relations in the query is called the k-relevant sources of the query.
Note: if k is greater than or equal to m (the number of queried relations), k-relevance is by definition the same as ∞-relevance, because all of the joining tuples may be potential tuples and there is no need to join with real tuples.
K-relevance - notes
If S is k-relevant, then k potential tuples (one of them from S) can join with m-k real tuples to satisfy the query.
k+1 potential tuples can then also join with m-k-1 real tuples, because a real tuple is also a potential tuple by definition.
Therefore, k-relevance is monotone: every k-relevant source is also a (k+1)-relevant source.
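The definition and the monotonicity note can be sketched as a brute-force search. Everything concrete here is an assumption for illustration: the function `k_relevant_via`, the finite pool of potential tuples (x in 0..9), and the hypothetical contents of A and B. The query is the one from the ∞-relevance example.

```python
from itertools import combinations, product


def k_relevant_via(real, potential, predicate, k):
    """Brute-force k-relevance (k >= 1) over a finite pool of potential tuples.

    real, potential: dict mapping relation name -> list of (x, source) tuples.
    predicate: takes a dict relation name -> tuple, returns True if they join.
    """
    names = list(real)
    via = {name: set() for name in names}
    for size in range(1, min(k, len(names)) + 1):
        for chosen in combinations(names, size):
            # Relations in `chosen` draw from potential tuples (real tuples are
            # potential too); the remaining relations use real tuples only.
            pools = [
                potential[n] + real[n] if n in chosen else real[n]
                for n in names
            ]
            for combo in product(*pools):
                row = dict(zip(names, combo))
                if predicate(row):
                    for n in chosen:
                        via[n].add(row[n][1])
    return via


SOURCES = ["src1.html", "src2.html"]
pool = [(x, s) for x in range(10) for s in SOURCES]  # assumed finite domain
potential = {"A": pool, "B": pool}
real = {"A": [(5, "src2.html")], "B": []}  # hypothetical contents


def query(row):
    # WHERE A.source != 'src1.html' AND A.x < B.x
    return row["A"][1] != "src1.html" and row["A"][0] < row["B"][0]


via1 = k_relevant_via(real, potential, query, 1)
via2 = k_relevant_via(real, potential, query, 2)
via3 = k_relevant_via(real, potential, query, 3)
print(via1, via2)
```

With these hypothetical contents, B is empty, so no source is 1-relevant via A, but src2.html becomes 2-relevant via A once a potential tuple for B is allowed. Raising k beyond m = 2 changes nothing, in line with the note that k-relevance coincides with ∞-relevance for k ≥ m.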
K-relevance - example
The sources are {sigcomm.html, sigmetrics.html}.
The query is:
SELECT Papers.title FROM Authors, Papers WHERE
Papers.author = Authors.name
AND Authors.org = 'MIT'
AND Papers.title LIKE '%Ubiquitous%'
AND Papers.src = Authors.src
K-relevance - example
The relations are: [tables not reproduced in the transcript]
The query result is empty, because there is no tuple in Authors with org = 'MIT'.
Therefore, there are no 0-relevant sources.
Moreover, whatever tuple a source adds to Papers, the result stays empty, because the new tuple cannot join with any tuple in Authors.
Therefore, there are no 1-relevant sources via Papers.
K-relevance - example
If sigcomm.html adds the tuple (sigcomm.html, John, MIT, [email protected]) to Authors, it can join with the first tuple from Papers. Therefore, sigcomm.html is 1-relevant via Authors.
However, no tuple from sigmetrics.html, not even (sigmetrics.html, John, MIT, [email protected]), can join with any tuple from Papers, because all the tuples in Papers have sigcomm.html in the source column.
Therefore, the 1-relevant sources for the query are {sigcomm.html}.
K-relevance - example
The potential tuples
(sigmetrics.html, Todd, MIT, [email protected]) from sigmetrics.html in Authors
and (sigmetrics.html, Todd, Boost Ubiquitous Access) in Papers
can join to create the result tuple (Boost Ubiquitous Access).
Therefore, sigmetrics.html is a 2-relevant source via Authors.
K-relevance - example
sigmetrics.html is also a 2-relevant source via Papers. The potential tuples
(sigmetrics.html, Todd, Boost Ubiquitous Access) from sigmetrics.html in Papers
and (sigmetrics.html, Todd, MIT, [email protected]) in Authors
can join to create the result tuple (Boost Ubiquitous Access).
sigmetrics.html is a 2-relevant source of the query.
sigcomm.html is also a 2-relevant source of the query, because it is a 1-relevant source and k-relevance is monotone.
K-relevance – example - conclusion
There are no 0-relevant sources.
The only 1-relevant source is sigcomm.html.
The 2-relevant sources are {sigcomm.html, sigmetrics.html}.
The query is over only 2 relations, therefore the ∞-relevant sources are also {sigcomm.html, sigmetrics.html}.
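The conclusions above can be replayed in code. The contents of `authors_real` and `papers_real` below are hypothetical, chosen only to match what the slides state about the data (no Authors tuple with org MIT; every Papers tuple comes from sigcomm.html; the first Papers tuple is by John). The finite pools of potential tuples are likewise an assumption, and the email column is left out because the query never reads it.

```python
from itertools import product

SOURCES = ["sigcomm.html", "sigmetrics.html"]

# Hypothetical real contents, consistent with the description in the slides.
authors_real = [("sigcomm.html", "Alice", "UW")]                  # (src, name, org)
papers_real = [("sigcomm.html", "John", "Ubiquitous Networks")]   # (src, author, title)

# Assumed finite pools of potential tuples (real tuples are potential too).
authors_pot = list(product(SOURCES, ["John", "Todd"], ["MIT", "UW"])) + authors_real
papers_pot = list(
    product(SOURCES, ["John", "Todd"], ["Boost Ubiquitous Access", "Other Topic"])
) + papers_real


def joins(author, paper):
    """The WHERE clause of the example query."""
    a_src, a_name, a_org = author
    p_src, p_author, p_title = paper
    return (p_author == a_name and a_org == "MIT"
            and "Ubiquitous" in p_title and p_src == a_src)


def sources_of(pairs, index):
    """Sources (column 0) of the tuple at `index` in every joining pair."""
    return {pair[index][0] for pair in pairs if joins(*pair)}


def relevant_sources(k):
    """k-relevant sources of the two-relation query, for k = 0, 1, 2, ..."""
    if k == 0:
        pairs = list(product(authors_real, papers_real))
        return sources_of(pairs, 0) | sources_of(pairs, 1)
    # One potential tuple: either in Authors or in Papers, the other side real.
    found = sources_of(product(authors_pot, papers_real), 0)
    found |= sources_of(product(authors_real, papers_pot), 1)
    if k >= 2:
        # Two potential tuples: no real tuple is needed any more.
        pairs = list(product(authors_pot, papers_pot))
        found |= sources_of(pairs, 0) | sources_of(pairs, 1)
    return found


for k in range(3):
    print(k, sorted(relevant_sources(k)))
```

For any k of 2 or more the function returns the same set, which is the ∞-relevant set for this two-relation query.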
K-relevance - summary
A source is 0-relevant if a tuple extracted from it into one or more of the queried relations has joined to create a tuple in the query results.
A source is ∞-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in the other relations to satisfy the query and create a tuple in the results.
A source is k-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in at most k-1 of the other relations, and with real tuples in the remaining relations, to satisfy the query and create a tuple in the results.