cs345 compact skeletons. assume tuples components are scattered over website we have a tagger that...
Post on 22-Dec-2015
219 Views
Preview:
TRANSCRIPT
CS345CS345
Compact Skeletons
Compact SkeletonsCompact Skeletons
Assume tuples components are scattered over website
We have a tagger that can tag all tuple components on website– Assume no noise for now
Reconstruct relation
Website
Data Graph
Skeleton
Relation
Compact SkeletonsCompact Skeletons
Dept (D)
Title (T)
Salary (S)
Address (A)
Welcome to Big Corp!
Join our team.Jobs are available in these departments:
R&D
Corporate The following jobs are open:
Job #12345
Job #12346
Send resumes to:1200 Jose Blvd, CA 94123Job Title:
Salary: Must know Java…..
Programmer100K
Dept (D)
Title (T)
Salary (S)
Address (A)
Welcome to Big Corp!
Join our team.Jobs are available in these departments:
R&D
Corporate The following jobs are open:
Job #12345
Job #12346
Send resumes to:1200 Jose Blvd, CA 94123Job Title:
Salary: Must know Java…..
Programmer100K
R & D
Programmer 100K
1200 Jose Blvd
Corporate
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
Corporate
CEOAdminAsst
400 7th Ave
60K
Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 400 7th AveCEO (null) Corporate 400 7th Ave
T S D A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
Corporate
CEOAdminAsst
60K
Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 1200 Jose BlvdCEO (null) Corporate 1200 Jose Blvd
T S D A
Website
Data Graph
Skeleton
Relation
SkeletonsSkeletons
Labeled treesTransformation from data graphs to
relations
D
T S
A
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
D
T S
A
Overlays
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
D
T S
A
Overlays
T S D A
Programmer 100K R &D 1200 Jose Blvd
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
D
T S
A
Overlays
T S D A
Programmer 100K R &D 1200 Jose Blvd
CTO 150K R &D 1200 Jose Blvd
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
Overlays
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
T S D A
Programmer 150K R &D 1200 Jose Blvd
Overlays
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
T S D A
Programmer 150K R &D 1200 Jose Blvd
CTO 100K R & D 1200 Jose Blvd
Overlays
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
Inconsistent Overlays
D
T S
A
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
Inconsistent Overlays
Compact SkeletonsCompact Skeletons
A skeleton is compact if all overlays are consistent
Perfect if each node and edge of data graph is covered by at least one overlay
Given a data graph G, does G have a Perfect Compact Skeleton (PCS)?– Not always– But if it exists it is unique
R & D
Programmer 100K CTO 150K
1200 Jose Blvd
PCS Algorithm
D
T S T S
A
PCS Algorithm
Work bottom-up:Compute node signaturesPlace nodes in equivalence classes based on signatureConstruct skeleton from equivalence classes
D
T S T S
A
T S
A
D
PCS Algorithm
Corporate
CEOAdminAsst
400 7th Ave
60K
D
T S
A
Incomplete information
Corporate
CEOAdminAsst
400 7th Ave
60K
D
T S
A
T S D AAdmin Asst 60K Corporate 400 7th Ave
Incomplete information
D
T S
A
Corporate
CEOAdminAsst
400 7th Ave
60K
T S D AAdmin Asst 60K Corporate 400 7th Ave
Incomplete information
CEO Corporate 400 7th Ave
Partial Compact SkeletonsPartial Compact Skeletons
For data graphs with incomplete information, we allow partial overlays– Results in nulls in relation
If we can use consistent partial overlays to cover every node and edge of the graph, we have a partially perfect compact skeleton (PPCS)
Tuple subsumptionTuple subsumption
Tuple t subsumes tuple u if t and u agree on every component of u that is not null
tu
Noisy Data GraphsNoisy Data Graphs
Real-life websites are noisy– False positives e.g., MS = degree, state or
Microsoft?– Non-skeleton links e.g., featured products
Data graph for a retail websiteData graph for a retail website
c1 c2 c3
p1 p3 p4 a3a1 a2
i1 i2 i3 i4a4 C: CategoryI: ItemP: PriceA: Availability
C
I
Skeleton K1
P A
For simplicity: assume all nodes have a label
Coverage of a skeletonCoverage of a skeleton
c1 c2 c3
p1 p3 p4 a3
C
I
Skeleton K1
a1 a2
i1 i2 i3 i4a4
P A
Coverage of a skeletonCoverage of a skeleton
c1 c2 c3
p1 p3 p4 a3
C
I
Skeleton K1Coverage = 28
a1 a2
i1 i2 i3 i4a4
P A
Coverage of a skeletonCoverage of a skeleton
c1 c2 c3
p1 p3 p4 a3
C
I
C I
Skeleton K1Coverage = 28
Skeleton K2Coverage = 12
a1 a2
i1 i2 i3 i4a4
P A
A P
Skeletons for Noisy Data GraphsSkeletons for Noisy Data Graphs
Problem:– Find skeleton K with optimal coverage,
called the best-fit skeleton (BFS)NP-complete
Greedy Heuristic for BFSGreedy Heuristic for BFS
c1 c2 c3
p1 p3 p4 a3a1 a2
i1 i2 i3 i4a4
r
Greedy Heuristic for BFSGreedy Heuristic for BFS
C C C
P P P A CA A
I I I IA
R
R
Label Parent Count
P I 3
A I 3
C 1
I C 4
R 1
C R 1
I
P A
R
A B
CC C
D D D D
Label Parent Count
D C 4
C A 2
B 1
A R 1
B R 1
C
R
A
D
B
Greedy skeleton
R
A B
CC C
D D D D
C
R
A
D
B
Greedy skeletonCoverage = 9
R
A B
CC C
D D D D
C
R
A
D
B
Greedy skeletonCoverage = 9
C
R
A
D
B
Optimal skeletonCoverage = 15
Weighted Greedy Heuristic Weighted Greedy Heuristic
Simple Greedy heuristic uses parent counts– “Memory-less”
Weighted Greedy heuristic takes into account past selections to improve simple greedy selection– Computes “benefit” of each decision at
every stage
C
R
A
D
B
Greedy skeletonCoverage = 9
R
A B
CC C
D D D D
C
D
Weighted Greedy
C
R
A
D
B
Greedy skeletonCoverage = 9
R
A B
CC C
D D D D
C
D
Weighted Greedy
A C ) = 4benefit(
C
R
A
D
B
Greedy skeletonCoverage = 9
R
A B
CC C
D D D D
C
D
Weighted Greedy
A C ) = 4benefit((B C ) = 10benefit
C
R
A
D
B
Greedy skeletonCoverage = 9
R
A B
CC C
D D D D
C
R
A
D
B
Weighted Greedy
R
A B
CC C
D D D D
C
R
A
D
B
Greedy skeletonCoverage = 9
C
R
A
D
B
Weighted greedy skeletonCoverage = 15
Website
Data Graph
Compact Skeleton
Relation
SummarySummary
top related