floris geerts ( university of antwerp )

27
Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute )

Upload: linnea

Post on 22-Feb-2016

36 views

Category:

Documents


0 download

DESCRIPTION

Floris Geerts ( University of Antwerp ) Giansalvatore Mecca, Donatello Santoro ( Università della Basilicata ) Paolo Papotti ( Qatar Computing Research Institute ). Overview. Motivations and Goals. Semantics. Experimental Results. Overview. Motivations and Goals. Semantics. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Floris Geerts  ( University of Antwerp )

Floris Geerts (University of Antwerp)Giansalvatore Mecca, Donatello Santoro (Università della

Basilicata)Paolo Papotti (Qatar Computing Research Institute)

Page 2: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview 2

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 3: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview 3

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 4: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Data Cleaning as Data Repairing

•Data Cleaning is a very general term•standardization, entity resolution,

data fusion…•We consider one of its facets: data

repairing

4

Dirty Table R

Database Table R

ConstraintsC over R

Data Repairing

Cleaned Table R’

Page 5: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

5A Motivating ExampleCustomers

SSN Name Phone PhConf City CC#

t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659

Master DataSSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

fd1. Cust: SSN, Name → Phone

Treatments

SSN Salary Insurance Treat. Date

t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12er6. IF Cust.SSN = MD.SSN,

Cust.Phone = MD.Phone → TAKE Name, Street, City from MD

fd2. Cust: SSN, Name → CC#fd3. Treat: SSN → Salarycfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’]cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’]

?

INTERACTION!

PREFERRED VALUES

Page 6: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

ValuesRepair

Selection

6Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

6

How to put everything together?

Page 7: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

ValuesRepair

Selection

7Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 1:missing semantics

andmissing repair

algorithm

7

Page 8: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

ValuesRepair

Selection

8Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 2:missing formalism

to handle user-specified

preference rules

8

Page 9: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

ValuesRepair

Selection

9Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 3:no DBMS-based

scalable implementations

9

Page 10: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

10System Comparison

Depend. Language Repair Strategy Value Preference Solution Selection

FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert. Card.

Sampl

[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓

[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

Page 11: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

11Our Goal

Depend. Language Repair Strategy Value Preference Solution Selection

FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert. Card.

Sampl

[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓

GOAL[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales

VLDB ’10

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Page 12: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview 12

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 13: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

13Cleaning EGDs

Customers

SSN Name Phone PhConf City CC#

t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659

Master DataSSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’e4. Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → tr = ‘Dental’

Treatments

SSN Salary Insurance Treat. Date

t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12

e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → c = ‘SF’e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’

Σ: Cleaning EGDs

Page 14: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Semantics 14

‣ Cell Groups

‣ Upgrades

‣ LLUNs

‣ Partial Order

Page 15: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

The Partial Order• Standard preference rules

• NULLS CONST

• source values are preferred to target values

• Ordering attributeΠ(Cust.Phone) =

{dom(Cust.PhConf), ≥}Π(Treat.Salary) =

{dom(Treat.Date), ≥time}

•No orderΠ(Cust.CC#) = ∅

15

Customers

SSN Name Phone PhConf City CC#

t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659

Master DataSSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

Treatments

SSN Salary Insurance Treat. Date

t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12

Page 16: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

SSN Name Phone PhConf

t1 222 L. Lennon 123 0.9

t2 222 F. Lennon 000 0.0

SSN Name Phone PhConf

t1 222 L. Lennon 123 0.9

t2 222 F. Lennon 123 0.0

Cell Groups

• We model partial repair by a Cell-Group

• set of target cells that must be changed together• we do not disrupt this equality in the following

• Carry also lineage information

16

g1 = <123 → {t1.phone, t2.phone}>occurrencesvalue

Master Data

Name Phone

tm L. Lennon 999

e2. Cust(s, n, ph) , Master(n, ph’) → ph = ph’

222 L. Lennon 999 0.9

222 F. Lennon 999 0.0

g2 = <999→ {t1.phone, t2.phone} by {tm.ph}>

we change the value of all occurrences

justifications

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’

Partial orderover cell groups

Page 17: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Upgrades 17

SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 123

<123→ {t2.ph, t3.ph} >

SSN Name Phonet2 222 L. Lennon 999t3 222 F. Lennon 999

<999→ {t2.ph, t3.ph} by {tm.ph}>

SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 000

Starting database J

upgrade Repair 1upgrade Repair N

Page 18: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

LLUNs•There are cases in which we don’t have any

clear strategy to remove a violation•LLUNs•a new class of symbols•placeholders used to mark conflicts•not only an unknown value but rather a

“hypervalue”• the opposite of a NULL, since it upgrades

constants

18

SSN Name Phone CC#t2 222 L. Lennon 122-1876 5555t3 222 L. Lennon 000-0000 6666e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’

L0

Page 19: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Scenarios and Solutions19

•S: source schema, T: target schema, Σ: set of cleaning EGDs

•Π: the partial order specification•Π is the way to specify when a value is

preferrable to another•Solution: Given C, an instance I of S, and an

instance J of T compute an instance J’ such that:• it is a repair, i.e., “I and J’ satisfy Σ”•and “J’ is an upgrade of J according to Π”

Cleaning Scenario

Page 20: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

The Chase Algorithm•To compute solutions:•Chase algorithm for chasing egds

20

SSN Name Phonet2 L1 L. Lennon 123t3 222 F. Lennon 000

SSN Name Phonet2 222 L. Lennon 123t3 L2 F. Lennon 000

<L1→ {t2.ssn} >

Repair 2 (backward)

<L2→ {t3.ssn} >

Repair 3 (backward)

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’

SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 123

<123→ {t2.ph, t3.ph} >

Repair 1 (forward)

Page 21: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

A Few Results•Given a cleaning scenario C, and

instances <I, J>•C always has a solution for <I, J>•The chase always terminates (it never

fails)•The chase computes all minimal

solutions•The number of minimal solutions is

exponential in the size of J

21

Page 22: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview 22

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 23: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Chase Tree 23

J

R1

e0, b1

R2 R3

e0, b2e0, f

R4

e1, b1

R5 R6

e1, b2e1, f

R10e1, b1 R11 R12

e1, b2e1, f

R13

e0, b1

R14 R15

e0, b2e0, f

the e0-e1 sequence the e1-e0 sequence

Our goal: to make this scalable ! Different orders of application

give different results

Page 24: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Scalability Techniques•Chase implementation based on

equivalence classes•Delta Databases•a representation system for chase

trees•Cost managers•pluggable strategies to prune the

chase tree

24

Page 25: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

25Scalability

LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FOLLUNATIC-FR-S50

CUSTOMERSHOSPITAL

CUSTOMERS

100K 250K 400K 550K 700K 850K 1M

LLUNATICis the first scalable

DBMS-based chase algorithm for

data repairing

sec.

Page 26: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

26Quality of Repairs

5k 10k 25k

LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FO

SAMPLING-500 VERTEX COVER MIN. COST

HOSPITAL

Page 27: Floris Geerts  ( University of Antwerp )

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

That’s all Folks!