Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints
Shaoxu Song† Lei Chen‡ Hong Cheng§
†Tsinghua University, Beijing, China
‡The Hong Kong University of Science and Technology
§The Chinese University of Hong Kong
ICDE 2012
Data Dependencies
Recently used for capturing inconsistencies
fd1 [Address] → [Region]
t5 and t6 have equal values on Address but different values of Region.
Example
ID  Name             Address                    Region
01  West Wood Hotel  Fifth Avenue, 61st Street  Chicago      t1
01  West Wood        Fifth Avenue, 61st Street  Chicago, IL  t2
01  West Wood        (61) 5th Avenue, 61st St.  Chicago, IL  t3
22  St. Regis Hotel  No.3, West Lake Road.      Boston, MA   t4
22  St. Regis Hotel  #3, West Lake Rd.          Boston       t5
22  St. Regis        #3, West Lake Rd.          Chicago, MA  t6
Tolerance to Variations
Real-world information often has various representation formats.
The strict equality function limits the usage of fds.
fd1 [Address] → [Region]
t1 and t2 are detected as a “violation” by mistake: “Chicago” and “Chicago, IL” denote the same region.
t4 and t6 are true violations, but cannot be detected by fd1, as their Address values are not equal.
ID  Name             Address                    Region
01  West Wood Hotel  Fifth Avenue, 61st Street  Chicago      t1
01  West Wood        Fifth Avenue, 61st Street  Chicago, IL  t2
01  West Wood        (61) 5th Avenue, 61st St.  Chicago, IL  t3
22  St. Regis Hotel  No.3, West Lake Road.      Boston, MA   t4
22  St. Regis Hotel  #3, West Lake Rd.          Boston       t5
22  St. Regis        #3, West Lake Rd.          Chicago, MA  t6
Metric Distance Constraints
In order to be tolerant to small variations
Differential dependencies (dds) declare the dependencies with respect to metric distances (X → Y, ϕ)
dd1 ([Address] → [Region], < 8, 3 >)
< 8, 3 > is a pattern ϕ of distance thresholds on Address and Region respectively.
States a constraint on metric distance:
If any two tuples have a distance on Address within the threshold (≤ 8),
then their Region values should be similar as well, i.e., the edit distance on Region is within the corresponding threshold (≤ 3).
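As an illustration of how dd1 could be checked, the following minimal Python sketch flags tuple pairs whose Address values are within distance 8 while their Region values differ by more than 3. The helper names `edit_distance` and `dd_violations` are hypothetical, and plain Levenshtein distance is assumed as the metric; the slides do not fix a particular implementation.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

def dd_violations(rows, lhs, rhs, lhs_max, rhs_max):
    """Pairs that satisfy the lhs threshold but exceed the rhs threshold."""
    found = []
    for (id1, r1), (id2, r2) in combinations(rows.items(), 2):
        close_on_lhs = edit_distance(r1[lhs], r2[lhs]) <= lhs_max
        far_on_rhs = edit_distance(r1[rhs], r2[rhs]) > rhs_max
        if close_on_lhs and far_on_rhs:
            found.append((id1, id2))
    return found

# Tuples t4 and t6 from the example table.
rows = {
    "t4": {"Address": "No.3, West Lake Road.", "Region": "Boston, MA"},
    "t6": {"Address": "#3, West Lake Rd.", "Region": "Chicago, MA"},
}
violations = dd_violations(rows, "Address", "Region", lhs_max=8, rhs_max=3)
print(violations)
```

With these two rows the pair (t4, t6) is reported, which fd1 cannot detect because the Address strings are not equal. Which other pairs of the full table are flagged depends on the chosen string metric and thresholds.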
Motivation of This Work
It is difficult to determine the proper settings of distance thresholds for metric distance constraints.
Unlike fds, which already imply the equality function, dds need explicit thresholds:
a very tight threshold (≈ 0, as in fds) is too strict to be tolerant to various information formats
a loose threshold (≈ dmax, the maximum distance value) is meaningless, since any data can satisfy it
In this study, we
employ certain statistical measures to evaluate the utility of distance threshold patterns, e.g., support, confidence and dependent quality
aim to automatically determine the best settings of distance thresholds, i.e., those with higher statistical measures.
Applicable to Other Types
Metric functional dependencies (mfds)
$X \xrightarrow{\delta} A$
equality operator in the left-hand-side
metric distance operator in the right-hand-side
for violation detection
e.g., $\text{manu} \xrightarrow{2} \text{addr}$
Matching dependencies (mds)
[X ≈ X ] → [A ⇋ A]
similarity operator in the left-hand-side
matching operator in the right-hand-side
for record matching
e.g., [addr ≈ addr] → [tel ⇋ tel]
Outline
Introduction
Preliminary
Determination Algorithm
Experiment
Summary
Statistical Measures
Support of ϕ:
the proportion of tuple pairs whose distances satisfy the thresholds in ϕ[XY].
a ϕ with high support is preferred in order to detect more violations.
Confidence of ϕ:
the ratio of tuple pairs satisfying ϕ[XY] to the pairs satisfying ϕ[X].
a ϕ with high confidence is preferred to detect violations more precisely.
Dependent quality of ϕ denotes the quality of tolerance on the dependent attributes Y.
how close the distance threshold ϕ[Y] is to equality.
if the dependent quality is low (i.e., ϕ[Y] is far away from equality), the constraint is meaningless and useless.
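The three measures can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function names are hypothetical, each tuple pair is assumed to be pre-reduced to its per-attribute distances, and the dependent quality is assumed to take the normalised form Q = Σ (dmax − ϕ[A]) / (|Y| · dmax), which is consistent with the Q column of the results table when dmax = 10 but is not stated on the slides.

```python
def satisfies(dists, thresholds):
    """True if every per-attribute distance is within its threshold."""
    return all(d <= t for d, t in zip(dists, thresholds))

def measures(pairs, phi_x, phi_y, d_max):
    """Return (support, confidence, dependent quality) of the pattern (phi_x, phi_y).

    pairs: list of (dist_x, dist_y), each a tuple of per-attribute distances.
    """
    n_x = sum(1 for dx, _ in pairs if satisfies(dx, phi_x))
    n_xy = sum(1 for dx, dy in pairs
               if satisfies(dx, phi_x) and satisfies(dy, phi_y))
    support = n_xy / len(pairs)
    confidence = n_xy / n_x if n_x else 0.0
    # Assumed form of Q: 1.0 at equality (all thresholds 0), 0.0 at d_max.
    quality = sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)
    return support, confidence, quality

# Four tuple pairs with one determinant and one dependent attribute.
pairs = [((2,), (1,)), ((5,), (4,)), ((9,), (0,)), ((3,), (7,))]
tight = measures(pairs, phi_x=(8,), phi_y=(3,), d_max=10)
loose = measures(pairs, phi_x=(8,), phi_y=(10,), d_max=10)
print(tight, loose)
```

Setting ϕ[Y] to dmax drives the confidence to 1.0 and the assumed quality to 0.0, which is the trade-off between the measures that motivates a combined utility.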
Interaction of Measures
If the dependent quality is set too high
e.g., ϕ[Y ] = 0, equality in fds
too strict and may identify violations by mistake
confidence measure will be low
Contrarily, consider a ϕ with the lowest dependent quality
i.e., ϕ[Y] = dmax, the maximum distance value
has the highest confidence 1.0, since any tuple pair always has a distance ≤ dmax on Y
miss all the violations and is useless
For example, ([Address] → [Region], < 8, dmax >)
any pair of tuples always has distance on Region ≤ dmax
the confidence is 1.0
violations t4 and t6 cannot be detected by such a dd
Parameter-free Determination
To determine ϕ
applications prefer metric distance constraints with high statistical measures
it is difficult to set the parameters of minimum support, confidence and dependent quality, respectively
setting the requirements of some measures too high will make the others low
A parameter-free style
automatically return the best ϕ
such that no other setting has higher support, confidence, and dependent quality than the returned results at the same time.
Assuring the Utility
To avoid tuning parameters manually, we are interested in an overall evaluation of utility.
Let b be the matching distance of any tuple pair.
U(ϕ) = Pr(b ⊨ ϕ[Y], Q(ϕ) is high | b ⊨ ϕ[X])
the conditional probability that b satisfies (⊨) ϕ[Y] with high dependent quality, given that b satisfies ϕ[X].
to accurately detect violations with small distance, we expect this probability to be high for a ϕ.
This U(ϕ) roughly captures the utility of confidence and dependent quality, while support is not taken into account.
Expected Utility
Compute an expected utility to refine U(ϕ) w.r.t. confidence and dependent quality by using support,
U(ϕ) = E (U(ϕ) | C(ϕ),D(ϕ),Q(ϕ)),
C(ϕ),D(ϕ) and Q(ϕ) are the statistics observed from data.
C(ϕ) is confidence measure
D(ϕ) is the proportion of tuple pairs with distance satisfying ϕ[X], i.e., the support of ϕ[X]
support of ϕ is C(ϕ)D(ϕ)
Q(ϕ) is dependent quality
Computation of Expected Utility
The computation is derived by applying the Bayesian rule and Binomial distribution.
\[
U(\phi) = E(U \mid C, D, Q) = \int u \, P(U = u \mid C, D, Q) \, du = \cdots = \frac{\int u \, f(DCQ; D, u) \, \pi(u) \, du}{\int f(DCQ; D, u) \, \pi(u) \, du}.
\]
f(k; n, p) is the probability mass function of the Binomial distribution.
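The ratio of integrals above can be evaluated numerically. The sketch below is only an illustration under stated assumptions: π(u) is treated as a prior density and taken to be uniform by default, the measures are scaled by a hypothetical `total_pairs` so that n = D · total_pairs and k = n · C · Q play the roles of the Binomial counts, and the function names are invented here. Because the Binomial coefficient in f(k; n, u) does not depend on u, it cancels between numerator and denominator, so only the kernel u^k (1 − u)^(n − k) is needed.

```python
import math

def posterior_mean(k, n, prior=lambda u: 1.0, steps=20000):
    """Midpoint-rule estimate of  int u f(k;n,u) pi(u) du / int f(k;n,u) pi(u) du."""
    us = [(i + 0.5) / steps for i in range(steps)]  # midpoints strictly inside (0, 1)
    log_kernel = [k * math.log(u) + (n - k) * math.log(1.0 - u) for u in us]
    shift = max(log_kernel)                          # avoid underflow for large n
    weights = [math.exp(lk - shift) * prior(u) for lk, u in zip(log_kernel, us)]
    return sum(u * w for u, w in zip(us, weights)) / sum(weights)

def expected_utility(c, d, q, total_pairs, prior=lambda u: 1.0):
    """Expected utility from confidence c, determinant support d and quality q.

    total_pairs is an assumed scale that turns proportions into counts.
    """
    n = d * total_pairs
    k = n * c * q
    return posterior_mean(k, n, prior)

u_high = expected_utility(c=0.4, d=0.5, q=0.8, total_pairs=100)
u_low = expected_utility(c=0.3, d=0.5, q=0.8, total_pairs=100)
print(u_high, u_low)
```

Under the uniform prior the result coincides with the mean of a Beta(k + 1, n − k + 1) distribution, (k + 1) / (n + 2), which gives a quick sanity check; a different prior would shift the values.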
Property of Expected Utility
According to the calculation formula of U(ϕ)
Theorem
For any ϕ1, ϕ2, if ϕ1 has higher support than ϕ2, denoted by $\frac{S(\phi_1)}{S(\phi_2)} = \rho$, $\rho \geq 1$, and the confidence and dependent quality of ϕ1 are higher than those of ϕ2 as follows
\[
\frac{C(\phi_1)}{C(\phi_2)} \geq \rho, \qquad \frac{Q(\phi_1)}{Q(\phi_2)} \geq \frac{1}{\rho},
\]
then we have U(ϕ1) ≥ U(ϕ2).
This conclusion verifies our intuition that
higher support, confidence and dependent quality
contribute to a larger expected utility.
Definition
The distance threshold determination problem is to find a distance threshold pattern ϕ for the dd on X → Y with the maximum expected utility U(ϕ).
Overview
Determination process for the maximum U(ϕ) has two steps:
(i) to find the best ϕ[Y] when given a fixed ϕ[X];
(ii) to find the desired ϕ[X] together with its best ϕ[Y].
Candidate distance threshold patterns, e.g., ϕ[Y]
for each A ∈ Y, consider the search space of distance threshold ϕ[A] from 0 to dmax.
enumerate all the distance thresholds ϕ[A] for all the dependent attributes A ∈ Y.
each node, such as < 1, 1 >, corresponds to a ϕ[Y] ∈ CY
Determination for Dependent Attributes (PA)
Given a fixed ϕ[X], find the corresponding best ϕ[Y] on the dependent attributes Y with the maximum U(ϕ).
D(ϕ) value is the same for any ϕ with the same ϕ[X].
study the other two measures C(ϕ) and Q(ϕ) in terms of contributions to U(ϕ).
Theorem
Consider any two ϕ1, ϕ2 having the same D(ϕ1) = D(ϕ2) = D. If their confidence and dependent quality satisfy C(ϕ1)Q(ϕ1) ≥ C(ϕ2)Q(ϕ2), then we have U(ϕ1) ≥ U(ϕ2).
for a fixed ϕ[X],
finding a ϕ with the maximum U(ϕ) is equivalent to finding the one with the maximum C(ϕ)Q(ϕ).
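A brute-force version of this step (the pa idea without any pruning) could look as follows. It is a sketch under assumptions: integer thresholds from 0 to dmax are enumerated with `itertools.product`, the dependent quality uses the assumed normalised form Σ (dmax − ϕ[A]) / (|Y| · dmax), and all function names are hypothetical.

```python
from itertools import product

def confidence(pairs, phi_x, phi_y):
    """Share of pairs within phi_x whose Y distances are also within phi_y."""
    lhs = [dy for dx, dy in pairs if all(a <= t for a, t in zip(dx, phi_x))]
    if not lhs:
        return 0.0
    ok = sum(1 for dy in lhs if all(a <= t for a, t in zip(dy, phi_y)))
    return ok / len(lhs)

def quality(phi_y, d_max):
    # Assumed form of Q: 1.0 at equality, 0.0 when every threshold is d_max.
    return sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)

def pa(pairs, phi_x, n_dep, d_max):
    """Evaluate every candidate phi[Y] and keep the one maximising C * Q."""
    best, best_score = None, -1.0
    for phi_y in product(range(d_max + 1), repeat=n_dep):
        score = confidence(pairs, phi_x, phi_y) * quality(phi_y, d_max)
        if score > best_score:
            best, best_score = phi_y, score
    return best, best_score

# Toy data: one determinant and one dependent attribute, d_max = 4.
pairs = [((1,), (0,)), ((2,), (1,)), ((2,), (1,)), ((3,), (4,)), ((9,), (0,))]
best, score = pa(pairs, phi_x=(3,), n_dep=1, d_max=4)
print(best, score)
```

On this toy input the threshold 1 wins: it keeps three of the four qualifying pairs (C = 0.75) while staying close to equality (Q = 0.75). Every candidate requires a pass over the data, which is the cost the pruning rules are meant to reduce.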
Dominant Relationship for Pruning
Pruning idea
Q(ϕ) is directly computed from a given ϕ[Y]
C(ϕ) is costly to compute from statistics of the data
avoid evaluating C(ϕ) for all possible candidates
Definition
For any ϕ1, ϕ2, if ϕ1[A] ≥ ϕ2[A], ∀A ∈ Z, then we say that ϕ1[Z] dominates ϕ2[Z], denoted by ϕ1[Z] ⋖ ϕ2[Z].
Any tuple pair satisfying ϕ2[Z] will always satisfy ϕ1[Z]
Dominant Relationship for Pruning
Lemma
For any two ϕ1, ϕ2 having ϕ1[X] = ϕ2[X] and ϕ1[Y] ⋖ ϕ2[Y], we have C(ϕ1) ≥ C(ϕ2) and Q(ϕ1) ≤ Q(ϕ2).
By a downward traversal of candidates in the dominant graph,
the dependent quality increases from 0 to 1
the confidence decreases from 1 to 0
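The dominance test and the monotonicity stated in the lemma can be checked on a small example. The snippet is a hedged sketch: `dominates`, `confidence` and `quality` are hypothetical helpers, the input is assumed to hold only the Y distances of pairs that already satisfy a fixed ϕ[X], and Q again uses the assumed normalised form.

```python
def dominates(phi1, phi2):
    """phi1 dominates phi2 if it is at least as loose on every attribute."""
    return all(a >= b for a, b in zip(phi1, phi2))

def confidence(dep_dists, phi_y):
    """Fraction of Y-distance vectors that fall within phi_y."""
    ok = sum(1 for d in dep_dists if all(x <= t for x, t in zip(d, phi_y)))
    return ok / len(dep_dists)

def quality(phi_y, d_max):
    # Assumed form of Q: grows as thresholds move towards equality.
    return sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)

dep_dists = [(0, 1), (1, 1), (2, 0), (4, 4)]
loose, tight = (2, 1), (1, 1)

print(dominates(loose, tight), dominates(tight, loose))
print(confidence(dep_dists, loose), confidence(dep_dists, tight))
print(quality(loose, 4), quality(tight, 4))
```

Here the looser pattern has the higher confidence and the lower quality, as the lemma predicts. Patterns such as (2, 1) and (1, 3) are incomparable, so the dominant graph is a partial order rather than a chain.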
Pruning of Candidate Patterns (PAP)
Consider the current ϕi in the traversal of CY.
i) Pruning by ϕmax.
The first pruning opportunity is introduced by ϕmax of the previously processed i − 1 candidates.
Let Vmax denote the maximum value of C(ϕ)Q(ϕ) in the first i − 1 candidates, i.e.,
\[
V_{\max} = \max_{j=1}^{i-1} C(\phi_j) Q(\phi_j)
\]
S0 = {ϕk | Q(ϕk) ≤ Vmax, ϕk[Y] ∈ CY} can be pruned
For any ϕk[Y] ∈ CY with Q(ϕk) ≤ Vmax,
C(ϕk)Q(ϕk) ≤ Q(ϕk) ≤ Vmax, since C(ϕk) ≤ 1.
U(ϕmax) ≥ U(ϕk).
Pruning of Candidate Patterns (PAP)
ii) Pruning by ϕi.
The second pruning opportunity is developed according to the current ϕi in the i-th step.
S1 = {ϕk | ϕi ⋖ ϕk, Q(ϕk) ≤ Vmax / C(ϕi), ϕk[Y] ∈ CY} is pruned
For any ϕk[Y] ∈ CY with ϕi[Y] ⋖ ϕk[Y] and Q(ϕk) ≤ Vmax / C(ϕi),
ϕi[Y] ⋖ ϕk[Y] implies C(ϕk) ≤ C(ϕi)
it follows that C(ϕk)Q(ϕk) ≤ C(ϕi)Q(ϕk) ≤ Vmax
we have U(ϕmax) ≥ U(ϕk)
ϕk in S0, S1 can be safely pruned, without computing C(ϕk)
initialization of Vmax = 0
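The two pruning rules can be combined into a single traversal. The following is a minimal sketch, not the authors' algorithm as published: candidates are visited from loosest to tightest (sorted by threshold sum, which respects the dominance order), Vmax is updated with the current candidate before S1 is applied (still a valid bound because it is achieved by an evaluated candidate), Q uses the assumed normalised form, and all names are hypothetical. The input holds the Y distances of pairs that already satisfy the fixed ϕ[X].

```python
from itertools import product

def confidence(dep_dists, phi_y):
    ok = sum(1 for d in dep_dists if all(x <= t for x, t in zip(d, phi_y)))
    return ok / len(dep_dists)

def quality(phi_y, d_max):
    # Assumed form of Q, computable without touching the data.
    return sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)

def dominates(phi1, phi2):
    return all(a >= b for a, b in zip(phi1, phi2))

def pap(dep_dists, n_dep, d_max, v_init=0.0):
    """Return (best phi[Y], best C*Q, number of confidence evaluations)."""
    candidates = sorted(product(range(d_max + 1), repeat=n_dep),
                        key=sum, reverse=True)
    pruned = set()
    best, v_max, evaluated = None, v_init, 0
    for phi in candidates:
        q = quality(phi, d_max)
        if phi in pruned or q <= v_max:      # S0: C*Q <= Q <= Vmax
            continue
        c = confidence(dep_dists, phi)       # the costly step
        evaluated += 1
        if c * q > v_max:
            best, v_max = phi, c * q
        # S1: dominated candidates have C <= c, so C*Q <= c*Q <= Vmax.
        limit = v_max / c if c > 0 else float("inf")
        for other in candidates:
            if dominates(phi, other) and quality(other, d_max) <= limit:
                pruned.add(other)
    return best, v_max, evaluated

dep_dists = [(0, 1), (1, 1), (2, 0), (4, 4), (1, 3), (0, 0)]
best, v_max, evaluated = pap(dep_dists, n_dep=2, d_max=4)
print(best, v_max, evaluated)
```

The `v_init` parameter leaves room for starting from a bound larger than 0. If no candidate scores above `v_init`, `best` stays `None`, which a caller would need to handle.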
Determination for Determinant Attributes (DA)
To find a ϕ with the maximum U(ϕ)
consider all possible distance threshold patterns of the determinant attributes X, say CX
The straightforward approach is to compute the best ϕ[Y] for each ϕ[X] ∈ CX
The most costly part is still the computation of ϕi[Y], by either pa or pap.
In order to improve the pruning power of pap, we expect to find a larger pruning bound Vmax.
Pruning of Candidate Patterns
Pruning candidates with different ϕ[X ]
Theorem
Consider any two ϕ1, ϕ2 having D(ϕ1) ≥ D(ϕ2). If their confidence and dependent quality satisfy
\[
C(\phi_2) Q(\phi_2) \leq 1 - \frac{D(\phi_1)}{D(\phi_2)} \bigl(1 - C(\phi_1) Q(\phi_1)\bigr),
\]
then we have U(ϕ1) ≥ U(ϕ2).
We can prune those ϕ2 whose C(ϕ2)Q(ϕ2) is no higher than the right-hand side of this inequality.
To apply this pruning bound, we require the precondition D(ϕ1) ≥ D(ϕ2).
Advanced Pruning Bound (DAP)
We process CX in descending order of D(ϕ) values
Let ϕmax be the current result with the maximum expected utility by evaluating the first i − 1 candidates in CX.
for the next ϕi, we have D(ϕmax) ≥ D(ϕi)
An advanced pruning bound for computing ϕi[Y]
\[
V_{\max} = 1 - \frac{D(\phi_{\max})}{D(\phi_i)} \bigl(1 - C(\phi_{\max}) Q(\phi_{\max})\bigr)
\]
in the original pap, Vmax is initialized to 0
replace it with the above, possibly larger, bound
Analysis of Pruning
Practically, the worst case of dap is exactly the basic da, when working together with pap
If the calculated bound Vmax is less than 0, we can simply assign 0 to it.
Once the bound is Vmax > 0, it can achieve a tighter pruning bound.
Theoretically, the theorem for advanced pruning is a generalization of the theorem for basic pruning
when D(ϕ1) = D(ϕ2),
\[
1 - \frac{D(\phi_1)}{D(\phi_2)} \bigl(1 - C(\phi_1) Q(\phi_1)\bigr) = C(\phi_1) Q(\phi_1)
\]
Our experiments also verify that dap+pap is at least no worse than da+pap.
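The advanced bound and its clamping at 0 fit in one small function. This is a hedged sketch with hypothetical names: `d_best` and `cq_best` stand for D(ϕmax) and C(ϕmax)Q(ϕmax) of the best pattern so far, `d_next` for D(ϕi) of the next determinant pattern, and returning 0 when `d_next` is 0 is an assumption made here to avoid a division by zero.

```python
def advanced_bound(d_best, cq_best, d_next):
    """Initial Vmax for the next phi_i[X]: 1 - D_best/D_next * (1 - CQ_best), floored at 0."""
    if d_next <= 0:
        return 0.0  # assumption: no qualifying pairs, fall back to the basic bound
    bound = 1.0 - (d_best / d_next) * (1.0 - cq_best)
    return max(0.0, bound)

same_d = advanced_bound(d_best=0.4, cq_best=0.3, d_next=0.4)    # equal D: bound is C*Q
negative = advanced_bound(d_best=0.5, cq_best=0.3, d_next=0.25)  # 1 - 2*0.7 < 0, clamped
useful = advanced_bound(d_best=0.5, cq_best=0.9, d_next=0.4)     # 1 - 1.25*0.1
print(same_d, negative, useful)
```

The returned value would be passed as the initial Vmax of pap for ϕi instead of 0, so a clamped result simply reproduces the basic da behaviour.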
Settings
Preprocessing of three real data sets
pre-compute edit distance of all tuple pairs
store the distance results as up to 1,000,000 matching tuples
proposed techniques are then evaluated on the prepared matching tuples
To determine the distance thresholds for
Rule1 : cora(author, title → venue, year)
Rule2 : cora(venue → address, publisher, editor)
Rule3 : restaurant(name, address → city, type)
Rule4 : citeseer(address, affiliation, description → subject)
where Rule 2 has a larger Y while Rule 4 has a larger X .
Example Results
Results also verify our property analysis of expected utility
higher support, confidence and dependent quality yield higher expected utility
e.g., U(ϕ2) ≥ U(ϕ4)
there does not exist any ϕ which has higher support, confidence and dependent quality at the same time than the returned ϕ1 with the maximum expected utility
the expected utility can reflect the usefulness in applications
      ϕ[X]           ϕ[Y]         Measures                      Violation Detection
      author  title  venue  year  S(ϕ)    C(ϕ)    Q(ϕ)  U(ϕ)    Precision  Recall  F-measure
ϕ1    4       1      3      1     0.1529  0.3760  0.80  0.2325  0.3725     0.5425  0.4418
ϕ2    5       2      3      1     0.1764  0.3667  0.80  0.2296  0.3718     0.6266  0.4667
ϕ3    5       1      3      2     0.1632  0.3774  0.75  0.2232  0.3179     0.4492  0.3723
ϕ4    4       2      3      2     0.1657  0.3657  0.75  0.2188  0.3073     0.4457  0.3638
ϕ5    4       1      4      2     0.1529  0.3852  0.70  0.2108  0.2654     0.3267  0.2928
ϕ6    5       2      5      2     0.1764  0.3985  0.65  0.2106  0.2459     0.3337  0.2831
fd    0       0      0      0     0.0064  0.3595  1.00  0.1064  0.4315     0.0735  0.1256
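The table can be cross-checked with a short script. The sketch below recomputes the F-measure from precision and recall, tests an assumed form of the dependent quality, Q = 1 − (venue + year) / (2 · dmax) with dmax = 10 (an inference from the Q column, not a value stated on the slides), and confirms that no listed pattern is at least as good as ϕ1 on support, confidence and quality simultaneously. All names are hypothetical.

```python
# name: (author, title, venue, year, S, C, Q, U, precision, recall, F)
ROWS = {
    "phi1": (4, 1, 3, 1, 0.1529, 0.3760, 0.80, 0.2325, 0.3725, 0.5425, 0.4418),
    "phi2": (5, 2, 3, 1, 0.1764, 0.3667, 0.80, 0.2296, 0.3718, 0.6266, 0.4667),
    "phi3": (5, 1, 3, 2, 0.1632, 0.3774, 0.75, 0.2232, 0.3179, 0.4492, 0.3723),
    "phi4": (4, 2, 3, 2, 0.1657, 0.3657, 0.75, 0.2188, 0.3073, 0.4457, 0.3638),
    "phi5": (4, 1, 4, 2, 0.1529, 0.3852, 0.70, 0.2108, 0.2654, 0.3267, 0.2928),
    "phi6": (5, 2, 5, 2, 0.1764, 0.3985, 0.65, 0.2106, 0.2459, 0.3337, 0.2831),
    "fd":   (0, 0, 0, 0, 0.0064, 0.3595, 1.00, 0.1064, 0.4315, 0.0735, 0.1256),
}
D_MAX = 10  # assumption inferred from the Q column

def assumed_quality(venue, year):
    return 1.0 - (venue + year) / (2 * D_MAX)

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

def at_least_as_good(a, b):
    """True if row a is >= row b on S, C and Q, and strictly better on one."""
    sa, sb = a[4:7], b[4:7]
    return all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb))

f_errors = {n: abs(f_measure(r[8], r[9]) - r[10]) for n, r in ROWS.items()}
q_errors = {n: abs(assumed_quality(r[2], r[3]) - r[6]) for n, r in ROWS.items()}
best_by_u = max(ROWS, key=lambda n: ROWS[n][7])
rivals = [n for n, r in ROWS.items() if n != "phi1" and at_least_as_good(r, ROWS["phi1"])]
print(best_by_u, rivals)
```

The recomputed F-measures agree with the table up to rounding, and the pattern with the largest U(ϕ) is ϕ1. Note that ϕ2 has a slightly higher F-measure than ϕ1, so the utility ranking tracks detection accuracy closely in this table but does not reproduce it exactly.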
Pruning Evaluation
Performance of da and dap for determinant side X, pa and pap for dependent side Y
dap and pap outperform da and pa, respectively
Rule 1 shows best performance when applying dap+pap
dap+pap approach can provide a pruning bound that is at least no worse than the da+pap one
Rule 3 verifies that the dap is at least no worse than the da
[Figure: time cost (s) vs. data size (100k to 1m) for Rule 1 and Rule 3; series DA+PA, DA+PAP, DAP+PAP]
Pruning Evaluation
Rule 2 has a larger Y while Rule 4 has a larger X
Rule 2, which has more attributes in the dependent side, may have more opportunities of pruning by pap
pap can achieve a significant improvement in Rule 2
Rule 4, with smaller Y, does not improve as significantly as Rule 2 under pap
dap does help in providing an advanced pruning bound for pap
[Figure: time cost (s) vs. data size (100k to 1m) for Rule 2 and Rule 4; series DA+PA, DA+PAP, DAP+PAP]
Conclusion
We study the problem of determining the distance thresholds for metric distance constraints
it is difficult to manually specify requirements of various statistical measures
conduct the determination in a parameter-free style
i.e., compute an expected utility of the distance threshold pattern and return the results with the maximum expected utility
several advanced pruning algorithms are then developed in order to efficiently find the desired distance thresholds