![Page 1: Efficient Discovery of XML Data Redundancies Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12 th, 2006](https://reader036.vdocuments.site/reader036/viewer/2022062907/5a4d1b907f8b9ab0599c0e0e/html5/thumbnails/1.jpg)
Efficient Discovery of XML Data Redundancies
Cong Yu and H. V. Jagadish
University of Michigan, Ann Arbor
VLDB 2006, Seoul, Korea, September 12th, 2006

Talk Outline
• Motivating Example
• A Comprehensive Notion of XML FD
• XML Redundancy Discovery Algorithms
• Experimental Evaluation
• Conclusion

An Example XML Document
[Figure: the example XML data tree. A warehouse root has state children; each state holds store elements with a name (“Borders”, “Amazon”, “Borders”); each store holds a book with ISBN “… 269”, title “DB”, and a price (“$59.9”, “$51.1”, “$59.9”). The two “Borders” books also have au children “R.R.” and “J.G.”.]

Constraints on XML Data
• An example constraint: for any two books, if they have the same ISBN, then they have the same title.
  ◦ Target: books
  ◦ Condition element(s): ISBN
  ◦ Implication element(s): title
• Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04]
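
The example constraint can be checked by brute force over pairs of books. The following is a minimal sketch, not the talk's algorithm; the record layout and the `satisfies` helper are assumptions made for illustration, and the ISBN stays elided as in the slides.

```python
from itertools import combinations

# Book records mirroring the three books in the example document.
books = [
    {"ISBN": "…269", "title": "DB"},
    {"ISBN": "…269", "title": "DB"},
    {"ISBN": "…269", "title": "DB"},
]

def satisfies(records, condition, implication):
    """True if every pair that agrees on `condition` also agrees on `implication`."""
    return all(
        a[implication] == b[implication]
        for a, b in combinations(records, 2)
        if a[condition] == b[condition]
    )

isbn_implies_title = satisfies(books, "ISBN", "title")
```

Adding a hypothetical fourth book with the same ISBN and a different title would make `satisfies` return `False`.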

Data Redundancies
• E.g., title is redundantly stored
• Result of “non-optimal” design of the database schema in the presence of constraints
• Lead to:
  ◦ Update anomalies
  ◦ Increased cost for data transfer and manipulation
• Constraints are properties of the data
  ◦ May not be known at the design phase

Goal
Efficiently discover redundancies from the XML database by discovering satisfied constraints

Main Contributions
• A comprehensive notion of XML FD
  ◦ Capturing a semantically richer set of XML constraints
  ◦ Definition of XML data redundancy in terms of XML FDs and XML Keys
• Efficient algorithms for discovering FDs and data redundancies from an XML database
• Experimental evaluation

Talk Outline
• Motivating Example
• A Comprehensive Notion of XML FD
• XML Redundancy Discovery Algorithms
• Experimental Evaluation
• Conclusion

Example XML Constraints
• Hierarchical: condition and/or implication elements can come from multiple hierarchies
[Figure: the example XML data tree (states, stores, books).]

Example XML Constraints, Cont’d
• Set elements: condition and/or implication elements can involve set elements
[Figure: the example XML data tree (states, stores, books).]

Functional Dependencies (FDs)
• FDs are used to describe constraints in relational databases
• A similar notion of FD is needed for XML
• Challenges:
  ◦ Target is difficult to specify due to the hierarchical structure
  ◦ Set elements introduce new semantics
• XML FD needs richer semantics!

Previous Notions
• Path-Based Notion [LLL02, VLL04]
  ◦ Example: {/warehouse/state/store/book/ISBN} → /warehouse/state/store/book/title
  ◦ Format: LHS → RHS
  ◦ Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS
• Tree-Tuple-Based Notion [AL04]
  ◦ A tree tuple is a data tree with exactly one data node for each schema element
  ◦ Format: LHS → RHS
  ◦ Semantics: for any two tree tuples, same LHS indicates same RHS

Previous Notions, Cont’d
• Both capture hierarchical constraints
• Neither can capture set constraints
• {/store/book/ISBN} → /store/book/au
  ◦ Violated in previous notions
  ◦ Satisfied if the two au nodes are a single set
• {/store/book/title, /store/book/au} → /store/book/ISBN
  ◦ Undefined in previous notions
  ◦ Intuitive if au nodes are a single set
[Figure: a single store (“Borders”) holding one book with ISBN “… 269”, title “DB”, au “R.R.” and “J.G.”, and price “$59.9”.]

A New Comprehensive Notion
• Generalized Tree Tuple
  ◦ A data tree constructed around a pivot data node (n_p)
  ◦ Entire subtree rooted at n_p is kept
  ◦ All ancestors of n_p and their “attributes” are kept
• Tuple Class C_P
  ◦ The set of all generalized tree tuples whose pivot nodes share the same path P (called the pivot path)
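
A generalized tree tuple can be sketched as a pruning of the data tree around the pivot. This is an illustrative reading under stated assumptions, not the paper's definition: an ancestor's “attributes” are taken to be its leaf children, the `node`, `book`, and `generalized_tree_tuple` helpers are hypothetical, and only the node keys that the deck's relations give (3, 4, 6, 12, 13) are used, with `"root"` standing for the warehouse.

```python
def node(label, key=None, text=None, children=()):
    """A data node; `key` is only filled in where the deck gives an @key."""
    return {"label": label, "key": key, "text": text, "children": list(children)}

def book(key, price, authors=()):
    kids = [node("ISBN", text="…269"), node("title", text="DB")]
    kids += [node("au", text=a) for a in authors]
    kids.append(node("price", text=price))
    return node("book", key, children=kids)

warehouse = node("warehouse", "root", children=[
    node("state", 3, children=[
        node("store", 4, children=[node("name", text="Borders"),
                                   book(6, "$59.9", ["R.R.", "J.G."])]),
        node("store", 12, children=[node("name", text="Amazon"),
                                    book(13, "$51.1")]),
    ]),
])

def generalized_tree_tuple(tree, pivot_key):
    """Pruned copy of `tree` around the pivot, or None if the pivot is absent."""
    if tree["key"] == pivot_key:
        return tree  # the entire subtree rooted at the pivot is kept
    for child in tree["children"]:
        kept = generalized_tree_tuple(child, pivot_key)
        if kept is not None:
            # An ancestor keeps its leaf children (assumed to be its
            # "attributes") plus the single branch that leads to the pivot.
            attributes = [c for c in tree["children"]
                          if c is not child and not c["children"]]
            return node(tree["label"], tree["key"], tree["text"],
                        attributes + [kept])
    return None

t6 = generalized_tree_tuple(warehouse, 6)
```

With book 6 as the pivot, the tuple keeps state 3, store 4 with its name, and the whole book subtree, and it drops the sibling store 12.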

Example Generalized Tree Tuple
[Figure: the example XML data tree with a node marked as the pivot; shown on two consecutive slides.]

XML FD
• <C_P, LHS, RHS>: LHS → RHS w.r.t. C_P
• Semantics: for any two generalized tree tuples t1, t2 in C_P, if they share the same LHS, they have the same RHS.
• E.g., {./title, ./au} → ./ISBN, w.r.t. C_{/warehouse/state/store/book}
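
The semantics above can be sketched as a pairwise check in which a repeatable child such as au is compared as one set value. This is a minimal illustration under assumptions: the dictionary layout and the `holds` helper are hypothetical, the tuple names t6 and t20 follow the @key values used later in the deck, and the book without au children is left out because the slides do not say how a missing set element is compared.

```python
from itertools import combinations

# Two generalized tree tuples from C_{/warehouse/state/store/book}.
# The au children are stored as one frozenset (set semantics).
tuple_class = {
    "t6":  {"./ISBN": "…269", "./title": "DB",
            "./au": frozenset({"R.R.", "J.G."})},
    "t20": {"./ISBN": "…269", "./title": "DB",
            "./au": frozenset({"J.G.", "R.R."})},
}

def holds(tuples, lhs, rhs):
    """LHS -> RHS holds if tuples that agree on every LHS path agree on RHS."""
    for a, b in combinations(tuples.values(), 2):
        if all(a[p] == b[p] for p in lhs) and a[rhs] != b[rhs]:
            return False
    return True

title_au_to_isbn = holds(tuple_class, ["./title", "./au"], "./ISBN")
isbn_to_au = holds(tuple_class, ["./ISBN"], "./au")
```

Both checks succeed here because the two au sets compare equal regardless of the order in which the authors are listed.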

Repeatable Elements Are Special
[Figure: the example XML data tree (states, stores, books).]

Essential Tuple Classes
• Definition: tuple classes with pivot paths that correspond to repeatable schema elements
  ◦ C_{/warehouse/state/store/book} is essential
  ◦ C_{/warehouse/state/store/name} is not
• Essential tuple classes can express the XML FDs that are expressible with non-essential tuple classes
• See paper for detailed proof

XML Key and Data Redundancy
• Let attribute @key uniquely identify each node in the entire data tree
• <C_P, LHS> is an XML Key when the database satisfies the XML FD: LHS → ./@key w.r.t. C_P
• Similar to the relative key notion proposed in [BDF+01]
• Data redundancy exists if the database:
  ◦ Satisfies the XML FD <C_P, LHS, RHS>,
  ◦ But <C_P, LHS> is not an XML Key
  ◦ In that case, RHS is redundantly stored.
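
The redundancy definition translates into two checks over one tuple class. The sketch below assumes the book tuples are flattened into rows keyed by @key (the values come from the deck's R_book relation); `groups`, `fd_holds`, `is_key`, and `is_redundant` are illustrative names, not the paper's.

```python
# Rows of the book tuple class, keyed by @key.
R_book = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

def groups(rows, attrs):
    """Group @key values by the rows' values on `attrs`."""
    out = {}
    for key, row in rows.items():
        out.setdefault(tuple(row[a] for a in attrs), set()).add(key)
    return out

def fd_holds(rows, lhs, rhs):
    # Adding RHS to LHS must not split any group.
    return len(groups(rows, lhs)) == len(groups(rows, lhs + [rhs]))

def is_key(rows, lhs):
    # LHS -> ./@key holds exactly when every group is a single tuple.
    return all(len(g) == 1 for g in groups(rows, lhs).values())

def is_redundant(rows, lhs, rhs):
    return fd_holds(rows, lhs, rhs) and not is_key(rows, lhs)

title_redundant = is_redundant(R_book, ["ISBN"], "title")            # True
price_redundant = is_redundant(R_book, ["ISBN", "parent"], "price")  # False
```

The title is flagged because ISBN determines it without being a key, while {ISBN, parent} identifies each book, so the price it determines is not stored redundantly.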

Talk Outline
• Motivating Example
• A Comprehensive Notion of XML FD
• XML Redundancy Discovery Algorithms
• Experimental Evaluation
• Conclusion

Strategy
• Discover satisfied XML FDs and Keys
• Data redundancies can then be discovered based on the definition
• First, we need an efficient representation of the XML data

Hierarchical Representation of XML Data
• Each essential tuple class becomes a relation
  ◦ Similar to nested relations [OY87, MNE96]
  ◦ All relations together form a hierarchy
  ◦ Tree tuples can be reconstructed by joining @key with parent

R_state

| @key | parent |
|---|---|
| 2 | root |
| 3 | root |
| 18 | root |
| … | … |

R_store

| @key | parent | name |
|---|---|---|
| 4 | 3 | Borders |
| 12 | 3 | Amazon |
| 19 | 18 | Borders |

R_book

| @key | parent | ISBN | title | price |
|---|---|---|---|---|
| 6 | 4 | …269 | DB | $59.9 |
| 13 | 12 | …269 | DB | $51.1 |
| 20 | 19 | …269 | DB | $59.9 |

R_au

| @key | parent | @text |
|---|---|---|
| 10 | 6 | R.R. |
| 11 | 6 | J.G. |
| 24 | 20 | R.R. |
| 25 | 20 | J.G. |
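
Reconstructing a tuple by joining @key with parent can be sketched with plain dictionaries. The relation contents below are copied from the slide; the dictionary layout and the `book_tuple` helper are assumptions made for illustration.

```python
# Relations keyed by @key; `parent` points at the @key of the parent tuple.
R_store = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}
R_au = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}

def book_tuple(book_key):
    """Join one R_book row upward to R_store and downward to R_au."""
    book = R_book[book_key]
    store = R_store[book["parent"]]  # join on parent = @key
    authors = {a["@text"] for a in R_au.values() if a["parent"] == book_key}
    return {
        "state": store["parent"],
        "store": store["name"],
        "ISBN": book["ISBN"],
        "title": book["title"],
        "price": book["price"],
        "au": authors,
    }

t6 = book_tuple(6)
```

Here `t6` carries the state key 3, the store name Borders, the book's own columns, and the author set {R.R., J.G.}; `book_tuple(13)` yields an empty author set.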

Intra-Relation FDs
[Figure: the example XML data tree (states, stores, books).]
• {./ISBN} → ./title, w.r.t. C_{/warehouse/state/store/book}

Inter-Relation FDs
[Figure: the example XML data tree (states, stores, books).]
• {../name, ./ISBN} → ./price, w.r.t. C_{/warehouse/state/store/book}
  ◦ ./ISBN and ./price are present in R_book
  ◦ ../name is present in R_store
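
The inter-relation FD can be checked naively by first joining each R_book row with its parent R_store row. This is only a sketch of what the FD means, not the discovery algorithm described in the talk; the dictionary layout and the `holds` helper are assumptions.

```python
R_store = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book = {
    6:  {"parent": 4,  "ISBN": "…269", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "price": "$59.9"},
}

# Attach the parent store's name to every book row (join on parent = @key).
joined = {
    key: {"../name": R_store[b["parent"]]["name"],
          "./ISBN": b["ISBN"],
          "./price": b["price"]}
    for key, b in R_book.items()
}

def holds(rows, lhs, rhs):
    """LHS -> RHS holds if each LHS value maps to a single RHS value."""
    seen = {}
    for row in rows.values():
        lhs_value = tuple(row[a] for a in lhs)
        if seen.setdefault(lhs_value, row[rhs]) != row[rhs]:
            return False
    return True

within_book = holds(joined, ["./ISBN"], "./price")              # False
with_store = holds(joined, ["../name", "./ISBN"], "./price")    # True
```

ISBN alone does not determine the price, since the Amazon copy costs $51.1, but ISBN together with the store name does.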

Overview of the Discovery Process
• Only interested in minimal FDs
• Bottom-up
• At each relation:
  ◦ Discover intra-relation FDs and Keys
  ◦ Discover inter-relation FDs and Keys involving descendant relations
  ◦ Generate candidate inter-relation FDs and Keys for examination at the parent level
• Attribute Partition as the basic data structure

Attribute Partition
• Groups tuples according to the attribute value
  ◦ ∏{price} for C_book = { {t6, t20}, {t13} }
  ◦ ∏{@key} for C_book = { {t6}, {t20}, {t13} }
  ◦ ∏{price, @key} for C_book = { {t6}, {t20}, {t13} }
• FD: LHS → RHS w.r.t. C_P is satisfied iff ∏LHS∪RHS = ∏LHS

R_book

| @key | parent | ISBN | title | price |
|---|---|---|---|---|
| 6 | 4 | …269 | DB | $59.9 |
| 13 | 12 | …269 | DB | $51.1 |
| 20 | 19 | …269 | DB | $59.9 |
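
The partitions on this slide and the FD test can be reproduced in a few lines. This is a minimal sketch; the `partition` and `fd_satisfied` helpers and the dictionary layout are assumptions, while the data and the tuple names t6, t13, t20 come from the slide.

```python
R_book = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

def partition(rows, attrs):
    """Group tuple names by their values on `attrs`; @key is the row key."""
    classes = {}
    for key, row in rows.items():
        value = tuple(key if a == "@key" else row[a] for a in attrs)
        classes.setdefault(value, set()).add(f"t{key}")
    return {frozenset(c) for c in classes.values()}

def fd_satisfied(rows, lhs, rhs):
    # LHS -> RHS holds iff adding RHS does not refine the LHS partition.
    return partition(rows, lhs + rhs) == partition(rows, lhs)

by_price = partition(R_book, ["price"])
# by_price == {frozenset({"t6", "t20"}), frozenset({"t13"})}
```

For example, `fd_satisfied(R_book, ["ISBN"], ["title"])` is true, while `fd_satisfied(R_book, ["price"], ["@key"])` is false, so price is not a key of the book class.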

Set Attribute Partition
• Generated through refinement
  ◦ Initialize ∏{au} for R_book to be { {t6, t13, t20} }
  ◦ ∏{@text} for R_au = { {t10, t24}, {t11, t25} }
  ◦ Convert to parent: { {t6, t20}, {t6, t20} }
  ◦ Refine ∏{au} using partitions in ∏{@text}: ∏{au} for R_book = { {t6, t20}, {t13} }
• ∏{au} can then be used as a normal partition

R_au

| @key | parent | @text |
|---|---|---|
| 10 | 6 | R.R. |
| 11 | 6 | J.G. |
| 24 | 20 | R.R. |
| 25 | 20 | J.G. |

R_book

| @key | parent | ISBN | title | price |
|---|---|---|---|---|
| 6 | 4 | …269 | DB | $59.9 |
| 13 | 12 | …269 | DB | $51.1 |
| 20 | 19 | …269 | DB | $59.9 |
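
The refinement steps can be sketched as repeated splitting of the parent partition. This is one plausible reading of the slide, not the paper's code: `refine` and `set_attribute_partition` are hypothetical helpers, and duplicate values under the same parent are treated as a set, not a bag.

```python
R_au = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}

def refine(partition, splitter):
    """Split every group into the part inside `splitter` and the part outside."""
    result = []
    for group in partition:
        inside, outside = group & splitter, group - splitter
        result.extend(part for part in (inside, outside) if part)
    return result

def set_attribute_partition(parent_keys, child_rows, attr):
    # Partition the child relation on `attr`, already converted to parent keys.
    parents_by_value = {}
    for row in child_rows.values():
        parents_by_value.setdefault(row[attr], set()).add(row["parent"])
    # Start from a single group and refine with each converted class.
    partition = [set(parent_keys)]
    for parents in parents_by_value.values():
        partition = refine(partition, parents)
    return {frozenset(group) for group in partition}

au_partition = set_attribute_partition({6, 13, 20}, R_au, "@text")
# au_partition == {frozenset({6, 20}), frozenset({13})}
```

Two parents end up in the same group only if every child value occurs under both or under neither, which is equality of their child sets.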

Discovery Algorithms
• DiscoverFD:
  ◦ Discover intra-relation FDs and Keys
  ◦ Similar to existing relational algorithms
• DiscoverXFD:
  ◦ Discover inter-relation FDs and Keys
  ◦ Key component: candidate inter-relation XML FD generation

Generating Candidate Inter-Relation FDs
• Let P' be a parent relation of P
• Parent satisfaction property
  ◦ For LHS∪X → RHS w.r.t. C_P to hold for any attribute set X in relation P', LHS∪{./parent} → RHS w.r.t. C_P must hold
• Child implication property
  ◦ For LHS∪X → RHS w.r.t. C_P to be a non-trivial FD for any attribute set X in relation P', LHS → RHS w.r.t. C_P must not hold
• An FD is a candidate inter-relation FD if it satisfies both properties
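
The two properties amount to a filter that can be evaluated inside the child relation alone. The sketch below is an illustration under assumptions, not the DiscoverXFD implementation: `n_classes`, `holds`, and `is_candidate` are hypothetical names, and the data is the R_book relation from the earlier slides.

```python
R_book = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

def n_classes(rows, attrs):
    """Number of classes in the partition of `rows` on `attrs`."""
    return len({tuple(row[a] for a in attrs) for row in rows.values()})

def holds(rows, lhs, rhs):
    return n_classes(rows, lhs) == n_classes(rows, lhs + [rhs])

def is_candidate(rows, lhs, rhs):
    # Parent satisfaction: LHS ∪ {./parent} -> RHS must hold.
    parent_satisfaction = holds(rows, lhs + ["parent"], rhs)
    # Child implication: LHS -> RHS must not already hold in the child.
    child_implication = not holds(rows, lhs, rhs)
    return parent_satisfaction and child_implication

price_candidate = is_candidate(R_book, ["ISBN"], "price")  # True
title_candidate = is_candidate(R_book, ["ISBN"], "title")  # False
```

ISBN → price is passed up for examination with attributes of the parent store, whereas ISBN → title already holds within R_book and adding parent attributes would only make it non-minimal.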

Talk Outline
• Motivating Example
• A Comprehensive Notion of XML FD
• XML Redundancy Discovery Algorithms
• Experimental Evaluation
• Conclusion

Real Datasets
• DBLP contains a fair amount of redundancy, as also noted earlier in [AL04]
• ~10% redundancy in PIR (measured as # of redundant elements over total # of elements); schema modification reported to PIR

Scalability on XMark
• Linear in terms of scale factor (# of elements), even though exponential in theory
• Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm
  ◦ The latter takes over 3 hours to run on XMark scale factor 1

Related Work
• XML Integrity Constraints (FDs and Keys): [BDF+01], [LLL02], [FS03]
• XML Normal Form: [AL04], [VLL04]
• Nested Relation Normal Form: [OY87], [MNE96]
• Relational FD discovery: FUN, Dep-Miner, TANE, fdep, FastFDs

Conclusion
• A comprehensive notion of XML FDs and Keys, capturing set semantics
• A system for detecting XML data redundancies through the discovery of FDs and Keys
• The system is practical for real datasets and outperforms direct application of the best available relational algorithm by orders of magnitude.

Questions?