
Functional Dependencies: Redundancy Analysis and Correcting Violations

Solmaz Kolahi
solmaz@cs.ubc.ca

Postdoctoral Research Fellow
Department of Computer Science

University of British Columbia

Joint work with Leonid Libkin and Laks Lakshmanan

Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 1/20


Motivation

Both relational and XML databases may store redundant data.

Relational example, with functional dependency title → year:

title            director  actor      year
The Departed     Scorsese  DiCaprio   2006
The Departed     Scorsese  Nicholson  2006
Shrek the Third  Miller    Myers      2007
Shrek the Third  Miller    Murphy     2007
Shrek the Third  Hui       Myers      2007
Shrek the Third  Hui       Murphy     2007

XML example, with functional dependency @AreaCode → @City:

[Figure: XML tree in which four nodes each carry the attributes AreaCode = 416 and City = Toronto.]


Motivation

Normalization techniques try to remove redundancies:

BCNF eliminates all redundancies.
    only key dependencies are allowed.
    cannot always be achieved without losing dependencies.
    Example: R(A, B, C) with AB → C and C → B.

3NF eliminates some redundancies.
    allows redundancy on prime attributes.
    preserves dependencies.

XNF eliminates all redundancies w.r.t. XML functional dependencies.
    only XML keys are allowed: if X → p.@l, then X → p.
    introduced by Arenas & Libkin in 2002.
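The example R(A, B, C) with AB → C and C → B can be checked mechanically. The sketch below is an illustration only: the function names are made up here, and it uses the textbook attribute-closure test rather than anything specific to the talk. It reports which FDs break BCNF and whether the schema is still in 3NF.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            s = set(combo)
            # Sizes grow, so any key contained in s was found earlier.
            if closure(s, fds) == schema and not any(k <= s for k in keys):
                keys.append(s)
    return keys

def bcnf_violations(schema, fds):
    """Non-trivial FDs whose left-hand side is not a superkey."""
    return [(l, r) for l, r in fds if not r <= l and closure(l, fds) != schema]

def is_3nf(schema, fds):
    """Every violating FD may only determine prime attributes."""
    prime = set().union(*candidate_keys(schema, fds))
    return all((r - l) <= prime for l, r in bcnf_violations(schema, fds))

schema = {"A", "B", "C"}
fds = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]

print(bcnf_violations(schema, fds))  # the FD C -> B is reported
print(is_3nf(schema, fds))
```

Here C → B violates BCNF because C is not a key, yet B is a prime attribute (AB is a key), so the schema stays in 3NF. Splitting the relation to remove the violation would lose AB → C.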


Motivation

Traditional normalization theory

characterizes a database as redundant or non-redundant.

does not measure redundancy.

cannot provide guidelines to reduce redundancy.

The more redundant the data, the more prone to update anomalies.

Our goal is

to show that there is a spectrum of redundancy using an information-theoretic tool.

to choose database designs with low redundancy.

to handle databases with dependency violations.


Outline

Motivation.

Reducing redundancy in relational and XML data:
    Measure of redundancy.
    Redundancy analysis of normal forms and schemas.

Correcting functional dependency violations.

Conclusions.

Future work.


Measure of Information Content

Proposed by Arenas & Libkin in 2003.

Used to measure the redundancy of a data value in a database instance with respect to a set of constraints.

Intuitively, RIC_I(p | Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ.

Independent of data models and query languages.


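Example: RIC_I(p | Σ) as the instance I grows. Each value is for the instance made of the rows down to and including that line.

A B C D    Σ = {A → C}    Σ = {A → C, B → C}
1 2 3 4
1 2 3 5    0.875          0.781
1 2 3 6    0.781          0.629
1 2 3 7    0.711          0.522
1 2 3 8    0.658          0.446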


Measure of Information Content

R(A, B, C), Σ = {A → B}

A B C
1 2 3
1 2 4

Pick k such that adom(I) ⊆ {1, ..., k} (here k = 7).
For every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X) for every a ∈ {1, ..., k}.

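Example for one choice of X: position p holds the variable a, and the positions outside X are left blank (_).

A B C
_ a 3
1 2 _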


P(2 | X) = 48 / (48 + 6 × 42) = 0.16
P(a | X) = 42 / (48 + 6 × 42) = 0.14 for every a ≠ 2

Conditional entropy: 2.8057

Average over all possible X: RIC_I^k(p | Σ) = 2.4558

RIC_I(p | Σ) = lim_{k→∞} RIC_I^k(p | Σ) / log k = 0.875
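The counts 48 and 42 and the conditional entropy 2.8057 can be reproduced by brute force for a single X. The sketch below rests on two assumptions that are not stated in the transcript. First, p is the B-position of the first tuple, and X keeps the first tuple's C-value and the second tuple's A- and B-values. Second, a filling is discarded when it makes the two tuples identical. These were chosen because they reproduce the numbers above.

```python
from math import log2

K = 7  # domain {1, ..., k} with k = 7

def count_completions(a):
    """Count fillings of the two blank positions when position p holds a.

    Assumed layout: t1 = (x, a, 3) and t2 = (1, 2, y), with x and y free.
    """
    count = 0
    for x in range(1, K + 1):
        for y in range(1, K + 1):
            t1, t2 = (x, a, 3), (1, 2, y)
            satisfies_fd = t1[0] != t2[0] or t1[1] == t2[1]  # A -> B
            if satisfies_fd and t1 != t2:  # assumption: tuples stay distinct
                count += 1
    return count

counts = {a: count_completions(a) for a in range(1, K + 1)}
total = sum(counts.values())
probs = {a: c / total for a, c in counts.items()}
entropy = -sum(p * log2(p) for p in probs.values())

print(counts[2], counts[1], total)  # 48 42 300
print(round(entropy, 4))            # 2.8057
```

Averaging this entropy over every X ⊆ Pos(I) − {p} and dividing by log k is what the remaining two steps describe. The sketch stops at one X to stay short.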


Measure and Database Design

A schema S with constraints Σ is well-designed if, for every instance I of (S, Σ) and every position p in I, RIC_I(p | Σ) = 1.

Known results (Arenas & Libkin, 2003):

relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF.

XML documents with FDs: (S, Σ) is well-designed iff it is in XNF.


Well-designed databases cannot always be achieved:

Performance issues.

Dependency preservation.


General design goal: maximizing information content to the extent possible by enforcing some design conditions.


Guaranteed Information Content

Given a condition C, the guaranteed information content (GIC) is the smallest information content found in instances of schemas satisfying C.

[Figure: if schema (S, Σ) satisfies condition C, then RIC_I(p | Σ) ≥ GIC for instances I of (S, Σ).]

More formally, we look at the set of all possible values for information content,

POSS_C(m) = {RIC_I(p | Σ) | I is an instance of (R, Σ), R has m attributes, (R, Σ) satisfies C};

then GIC_C(m) is the infimum of POSS_C(m).


Price of Dependency Preservation

Design goal: minimizing redundancy while preserving FDs.

For a normal form NF, PRICE(NF) is the minimum information content that NF loses to guarantee dependency preservation.

If c ∈ [0, 1] is the largest information content guaranteed for decompositions into NF,

[Figure: NF-decomposition of (R, Σ) into (R1, Σ1), (R2, Σ2), (R3, Σ3), with RIC_I(p | Σ) ≥ c.]

then the price of dependency preservation, PRICE(NF), is 1 − c.


Price of Dependency Preservation

Theorem

PRICE(3NF) = 1/2.

PRICE(NF) ≥ 1/2 for any dependency-preserving normal form NF.

To pay the smallest price for achieving dependency preservation, we should do a 3NF normalization.


Not all 3NF normalizations are equal:

Special subclasses of 3NF exist (old research).

Only one subclass (3NF+) achieves the smallest price.

We compare normal forms based on the guaranteed information content, or the highest redundancy they allow.


Comparing Normal Forms

Theorem. For every m > 2:

GIC_All(m) = 2^(1−m)

GIC_3NF(m) = 2^(2−m)

GIC_3NF+(m) = 1/2

3NF is twice as good as doing nothing.

3NF+ is exponentially better.

Similar results are obtained if we compare normal forms based on guaranteed average information content.
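As a worked instance of the theorem, take m = 4 attributes. The value of m is chosen here only for illustration. The last ratio is the sense in which 3NF+ is exponentially better than imposing no condition.

```latex
\begin{align*}
\mathrm{GIC}_{\mathrm{All}}(4) &= 2^{1-4} = \tfrac{1}{8}, \\
\mathrm{GIC}_{\mathrm{3NF}}(4) &= 2^{2-4} = \tfrac{1}{4} = 2 \cdot \mathrm{GIC}_{\mathrm{All}}(4), \\
\mathrm{GIC}_{\mathrm{3NF^{+}}}(4) &= \tfrac{1}{2}, \qquad
\frac{\mathrm{GIC}_{\mathrm{3NF^{+}}}(m)}{\mathrm{GIC}_{\mathrm{All}}(m)} = \frac{1/2}{2^{1-m}} = 2^{m-2}.
\end{align*}
```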


Redundancy of an Arbitrary Schema

Normalizing into smaller relations is not always desirable:
    losing constraints.
    slowing down query answering.

A normalization decision can be made based on
    how much redundancy the schema allows; or
    where in the spectrum of redundancy the schema lies; or
    the lowest information content found in all instances of the schema.

Design goal: decomposing the schema with the highest potential for redundancy.


Redundancy of an Arbitrary Schema

Theorem. Given an arbitrary schema R with FDs Σ, let

Σ_A = {X | X → A, X is minimal and non-key};

#HS = the number of hitting sets of Σ_A;

l = |⋃_{X ∈ Σ_A} X|.

Then the smallest information content found in column A of instances is

GIC_Σ^R(A) = #HS · 2^(−l).


Examples:

R1(A, B, C, D, E), Σ1 = {AB → E, D → E}: GIC_Σ1^R1(E) = 3/8

R2(A, B, C, D, E), Σ2 = {BC → E, AC → E, BD → E}: GIC_Σ2^R2(E) = 1/2
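The two values above can be recomputed by counting hitting sets directly. This is a small sketch, not code from the talk. It assumes the caller already supplies the minimal non-key left-hand sides for the chosen attribute, and it assumes a hitting set is any subset of their union that intersects every one of them. Under that reading the formula #HS · 2^(−l) gives 3/8 and 1/2 for the two schemas.

```python
from fractions import Fraction
from itertools import combinations

def gic(minimal_nonkey_lhs):
    """Evaluate #HS * 2^(-l) for a family of left-hand sides.

    Assumption: a hitting set is any subset of the union of the family
    that intersects every member (not only the minimal ones).
    """
    family = [frozenset(x) for x in minimal_nonkey_lhs]
    universe = sorted(frozenset().union(*family))
    l = len(universe)
    hitting_sets = 0
    for size in range(l + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            if all(chosen & member for member in family):
                hitting_sets += 1
    return Fraction(hitting_sets, 2 ** l)

gic_r1 = gic(["AB", "D"])         # left-hand sides of AB -> E, D -> E
gic_r2 = gic(["BC", "AC", "BD"])  # left-hand sides of BC -> E, AC -> E, BD -> E

print(gic_r1, gic_r2)  # 3/8 1/2
```

For R1 the hitting sets are {A, D}, {B, D} and {A, B, D} out of the 8 subsets of {A, B, D}. For R2 there are 8 hitting sets out of the 16 subsets of {A, B, C, D}.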



Functional Dependency Violations

Large databases often tend to violate a set of FDs.

Σ = {cnt, arCode → reg;  cnt, reg → prov}

An inconsistent database:

     name     cnt  prov  reg  arCode  phone
t1   Smith    CAN  BC    Van  604     123 4567
t2   Adams    CAN  BC    Van  604     765 4321
t3   Simpson  CAN  BC    Man  604     345 6789
t4   Rice     CAN  AB    Vic  604     987 6543


A minimal repair:

     name     cnt  prov  reg  arCode  phone
t1   Smith    CAN  BC    Van  604     123 4567
t2   Adams    CAN  BC    Van  604     765 4321
t3   Simpson  CAN  BC    Van  604     345 6789
t4   Rice     v1   AB    Vic  604     987 6543
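To see which tuples conflict, the two instances above can be checked against both FDs with a simple grouping pass. The sketch below is illustrative only. The helper name `fd_violations` and the dictionary layout are choices made here, and v1 is treated as an ordinary value different from CAN.

```python
from collections import defaultdict

COLUMNS = ("name", "cnt", "prov", "reg", "arCode", "phone")

def fd_violations(rows, lhs, rhs):
    """Return groups of tuple ids that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(dict)
    for tid, row in rows.items():
        record = dict(zip(COLUMNS, row))
        key = tuple(record[c] for c in lhs)
        groups[key][tid] = tuple(record[c] for c in rhs)
    return {k: sorted(g) for k, g in groups.items() if len(set(g.values())) > 1}

inconsistent = {
    "t1": ("Smith", "CAN", "BC", "Van", "604", "123 4567"),
    "t2": ("Adams", "CAN", "BC", "Van", "604", "765 4321"),
    "t3": ("Simpson", "CAN", "BC", "Man", "604", "345 6789"),
    "t4": ("Rice", "CAN", "AB", "Vic", "604", "987 6543"),
}
# The repair changes t3.reg to Van and t4.cnt to the fresh value v1.
repair = dict(
    inconsistent,
    t3=("Simpson", "CAN", "BC", "Van", "604", "345 6789"),
    t4=("Rice", "v1", "AB", "Vic", "604", "987 6543"),
)

FDS = [(("cnt", "arCode"), ("reg",)), (("cnt", "reg"), ("prov",))]
before = [fd_violations(inconsistent, l, r) for l, r in FDS]
after = [fd_violations(repair, l, r) for l, r in FDS]

print(before)  # only cnt, arCode -> reg is violated, by the group (CAN, 604)
print(after)   # [{}, {}]
```

In the inconsistent instance all four tuples share (CAN, 604) but carry three different reg values. The second FD is not violated on its own.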


Handling Inconsistent Databases

Integrity constraints Σ (FDs, keys, etc.).

Inconsistent database D: does not satisfy Σ.

We can produce a repair R by inserting/deleting tuples or modifying values in D.

∆(D, R) = number of modifications


Handling inconsistency:

Consistent query answering:

certain answer for query Q = ⋂ {Q(R) | R is a minimal repair for D}

Producing an optimum repair R_opt with minimum ∆.

Both approaches are intractable in general.


Our approach: producing an approximate solution R_app for optimum repair.

∆(D, R_app) ≤ α · ∆(D, R_opt)
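The approximation bound can be made concrete on a toy instance. The sketch below assumes that ∆ counts changed cell values only, with no tuple insertions or deletions. The data, the helper names and the value α = 2 are hypothetical and not taken from the talk.

```python
def delta(original, repaired):
    """Number of cell values that differ between two instances.

    Assumption: both instances have the same tuple ids and arity, so only
    value modifications are counted.
    """
    return sum(
        a != b
        for tid in original
        for a, b in zip(original[tid], repaired[tid])
    )

def within_factor(original, candidate, optimum, alpha):
    """Check the bound delta(D, R_app) <= alpha * delta(D, R_opt)."""
    return delta(original, candidate) <= alpha * delta(original, optimum)

# Hypothetical toy data over columns (cnt, reg, arCode), FD: cnt, arCode -> reg.
D = {"t1": ("CAN", "Van", "604"), "t2": ("CAN", "Man", "604")}
R_opt = {"t1": ("CAN", "Van", "604"), "t2": ("CAN", "Van", "604")}
R_app = {"t1": ("v1", "Van", "604"), "t2": ("v2", "Man", "604")}

print(delta(D, R_opt), delta(D, R_app))    # 1 2
print(within_factor(D, R_app, R_opt, 2))   # True
```

Both repairs satisfy the FD. R_app uses two modifications where one suffices, so it is within a factor of 2 of the optimum but not within a factor of 1.5.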


Approximating Optimum Repair

Theorem. Finding an optimum solution for FD violations is NP-hard.

Theorem. Finding a constant-factor approximation for all FD violations is NP-hard.

Theorem. For every fixed set of FDs, there is a polynomial-time algorithm that approximates optimum repair within a factor of α, where α depends on the FDs.

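Σ = {A → C, B → C, CD → E}

[Example instance over attributes A, B, C, D, E with tuples t1 to t6. Column values as extracted, top to bottom:
A: a1, a2, a1, a4, a5, a6
C: c1, c2, c3, c4, c5, c4
E: e1, e2, e3, e4, e5, e6
B: b1, b1, b3, b4, b6, with a further b4 whose row is not recoverable
D: d2, d3, d4, d5, d5, with a further d1 whose row is not recoverable]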


Conclusions

We analyze schemas and normal forms based on worst cases of redundancy.

There is a spectrum of information content (redundancy) for schemas.

[Figure: scale from 0 (poorly-designed) through 1/2 to 1 (well-designed).]

Producing optimum repair for FD violations is hard.

We introduced an approximation framework.


Future Work

Comparing quality of schemas with low / high information content in practice.

Defining normalization concepts for XML such as:
    dependency preserving decomposition.

Finding an equivalent of 3NF for XML as a normal form
    that guarantees an information content of 1/2.
    to which every XML document is decomposable.

Extending the repair algorithm for other integrity constraints.
