Extending Relational Database Functionality with Data Inconsistency Resolution Support

Ilya Pevzner, pevzner@cs.nyu.edu
Arthur Goldberg, artg@cs.nyu.edu

Department of Computer Science
Courant Institute
New York University

Ilya Pevzner, Arthur Goldberg 2 VLDB-2003 Ph.D. Workshop

Inconsistency

• Databases often contain information about real-world objects
• When the data is collected and entered in the database (or measured), errors are introduced
• When the same object is measured more than once, inconsistent data values may result

Object Identification

• Identification of records describing the same real-world object
• If key values are inconsistent, object identification is not trivial and its results are uncertain
• Also known as approximate matching, duplicate detection, and record linkage
• Area with multiple successful techniques; topic of a KDD-2003 workshop

Inconsistency Resolution Problem

• Given what is known about the world, find the “best” estimates for values of the inconsistent attributes
• Possible sources of the knowledge about the world:
  a) The system designer or expert
  b) The end user
  c) The data
• Inconsistency resolution is also called merging
• Existing research is almost exclusively on a) and b)
  – No systematic techniques
• Our work concentrates on c)

Matching/Merging Example

– Match using ID (trivial)
– Merge using standardization

ID  Name
10  Arthur
20  Johnny
10  Art
20  John

MATCH

ID  Name
10  Arthur
    Art
20  John
    Johnny

MERGE

ID  Name
10  Arthur
20  John

Merging Uncertainty

• Sometimes it is possible, but non-trivial, to tell which attribute value is best
• In other cases, the answer is uncertain

ID  Sex  Country  Name
10  F    US       Andrea
    F    US       Andrew
20  M    Russia   Sergey
    M    Russia   Sergio

MERGE

ID  Sex  Country  Name
10  F    US       Andrea
20  M    Russia   Sergey

Research goals

• Develop merging methodologies that rely on the analysis of the data
• Extend relational databases with
  – Integrated model for representing matching and merging uncertainties
  – Integrated support for various matching and merging methodologies

Uncertainty in Relational Databases

• Semantics of Nulls
  – E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1983.
• Fuzzy databases
  – E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988.
• Probabilistic relations
  – E.g. E. Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kluwer Publ., 1997.

Probabilistic relations overview

• Probabilistic relations model uncertainty with truth probabilities added to classic relations
  – E.g. tuple X is in relation with probability P[X]
• Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed
  – E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds: one with tuple X and one without
• Relational operations are defined through the associated classic relations
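
The possible-worlds reading above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes that tuples are present or absent independently of each other (the slides do not state this), and the function name `possible_worlds` is hypothetical.

```python
from itertools import product

def possible_worlds(prob_relation):
    """Enumerate possible worlds of a relation with tuple-level probabilities.

    prob_relation maps each tuple to P[tuple]. Assumes, for illustration only,
    that tuples are present or absent independently of each other.
    Returns a list of (world, probability) pairs; each world is a frozenset.
    """
    tuples = list(prob_relation)
    worlds = []
    # Fix every probabilistic choice to "present" (True) or "absent" (False).
    for choices in product([True, False], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, choices) if keep)
        prob = 1.0
        for t, keep in zip(tuples, choices):
            prob *= prob_relation[t] if keep else 1.0 - prob_relation[t]
        worlds.append((world, prob))
    return worlds

# One tuple X with P[X] = 0.6 yields two worlds: one with X and one without.
worlds = possible_worlds({"X": 0.6})
for world, prob in worlds:
    print(sorted(world), prob)
```

A relational operation on the probabilistic relation can then be read as applying the classic operation inside each enumerated world.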

Zimanyi’s Type-1 Probabilistic Relation

• Definition
  – A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) added to each tuple t, indicating the probability that tuple t belongs to relation R

Zimanyi’s Type-1 Probabilistic Relation Example

• Probabilistic relation

Name    SSN          Phone         ID1  ID2  w(R, t)
John    111-22-3333  212-555-1212  50   100  .6
John    111-22-3333  646-444-1212  50   200  .8
Johnny  222-33-4444  212-555-1212  60   100  .5
Johnny  222-33-4444  646-444-1212  60   200  .9

• Possible worlds (assuming unique(ID1) and unique(ID2)):

Name    SSN          Phone
(no tuples)

Name    SSN          Phone
Johnny  222-33-4444  646-444-1212

Name    SSN          Phone
John    111-22-3333  212-555-1212

Name    SSN          Phone
John    111-22-3333  212-555-1212
Johnny  222-33-4444  646-444-1212

Name    SSN          Phone
Johnny  222-33-4444  212-555-1212

Name    SSN          Phone
John    111-22-3333  646-444-1212

Name    SSN          Phone
John    111-22-3333  646-444-1212
Johnny  222-33-4444  212-555-1212
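
The seven possible worlds of this example can be reproduced by enumerating every subset of the relation and keeping those that satisfy unique(ID1) and unique(ID2). The sketch below is illustrative only; the helper names are hypothetical, and it does not assign probabilities to worlds because the slides do not say how the uniqueness constraints affect them.

```python
from itertools import combinations

# Tuples of the example relation: (Name, SSN, Phone, ID1, ID2, w).
R = [
    ("John",   "111-22-3333", "212-555-1212", 50, 100, 0.6),
    ("John",   "111-22-3333", "646-444-1212", 50, 200, 0.8),
    ("Johnny", "222-33-4444", "212-555-1212", 60, 100, 0.5),
    ("Johnny", "222-33-4444", "646-444-1212", 60, 200, 0.9),
]

def consistent(world):
    """True if no ID1 value and no ID2 value occurs more than once."""
    id1s = [t[3] for t in world]
    id2s = [t[4] for t in world]
    return len(set(id1s)) == len(id1s) and len(set(id2s)) == len(id2s)

def constrained_worlds(relation):
    """All subsets of the relation satisfying unique(ID1) and unique(ID2)."""
    worlds = []
    for size in range(len(relation) + 1):
        for subset in combinations(relation, size):
            if consistent(subset):
                worlds.append(subset)
    return worlds

worlds = constrained_worlds(R)
for world in worlds:
    # Project each world onto (Name, SSN, Phone), as in the tables above.
    print([(name, ssn, phone) for name, ssn, phone, *_ in world])
```

The enumeration is exponential in the number of tuples, so it only serves to make the definition concrete on small examples.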

Probabilistic matching

• Example: matching by name

ID1  Name
50   John
60   Johnny

ID2  Name
100  Jon
200  Johnny

MATCH

Name           ID1  ID2  w(R, t)
John / Jon     50   100  .6
John / Johnny  50   200  .8
Johnny / Jon   60   100  .5
Johnny         60   200  .9

• The way w(R, t) is computed depends on the matching methodology
  – An example of such methodology is ChoiceMaker™
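
To make the role of the matching methodology concrete, the sketch below pairs every record of one source with every record of the other and attaches a weight. The scoring function is a stand-in chosen for this illustration (string similarity from Python's standard `difflib`); it is not ChoiceMaker, it is not a calibrated probability, and it does not reproduce the w(R, t) values shown above.

```python
from difflib import SequenceMatcher

S1 = [(50, "John"), (60, "Johnny")]    # (ID1, Name)
S2 = [(100, "Jon"), (200, "Johnny")]   # (ID2, Name)

def name_similarity(a, b):
    """Stand-in score in [0, 1]; NOT a calibrated match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(left, right, score=name_similarity):
    """Pair every left record with every right record and attach a weight."""
    return [
        (name1, name2, id1, id2, score(name1, name2))
        for id1, name1 in left
        for id2, name2 in right
    ]

matched = match(S1, S2)
for row in matched:
    print(row)
```

Swapping in a different `score` function changes the weights but not the shape of the result, which is the point of keeping the methodology pluggable.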

Zimanyi’s Type-2 Probabilistic Relation

• Definition
  – Generalized relation in which attribute values can be probabilistic sets

Zimanyi’s Type-2 Probabilistic Relation Example

• Probabilistic relation

SSN          Name
             Value   Probability
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6

• Possible worlds

SSN          Name
111-22-3333  John
222-33-4444  Johnny

SSN          Name
111-22-3333  Jon
222-33-4444  Johnny

SSN          Name
111-22-3333  Jon
222-33-4444  John

SSN          Name
111-22-3333  John
222-33-4444  John
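
The four possible worlds of the type-2 example correspond to choosing one candidate value from each probabilistic set. The following sketch enumerates them; the function name is hypothetical. It deliberately does not compute world probabilities, since the probabilities in the example do not sum to 1 within each set and the slides do not say how the remainder is interpreted.

```python
from itertools import product

# Type-2 relation from the example: each SSN has a probabilistic set of names.
NAMES = {
    "111-22-3333": {"John": 0.7, "Jon": 0.25},
    "222-33-4444": {"Johnny": 0.2, "John": 0.6},
}

def type2_worlds(relation):
    """Enumerate worlds by fixing one candidate value per probabilistic set."""
    keys = list(relation)
    candidate_lists = [list(relation[k]) for k in keys]
    return [dict(zip(keys, choice)) for choice in product(*candidate_lists)]

worlds = type2_worlds(NAMES)
for world in worlds:
    print(world)
```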

Probabilistic Merging Example

• Data sources

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

• Query:
  – List all people with their correct name and social security number
• Execution plan:
  – Join using SSN (UID)
  – Merge names

Probabilistic Merging Example: Result

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

MERGE

SSN          Name
             Value   Probability
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6

Merging Methodologies

• Ad-hoc techniques
  – Standardization
    • E.g. convert both Jim and Jimmy to James
  – Pre-defined rules
    • E.g. use gender to pick Andrea and not Andrew
• Machine Learning
  – Supervised (e.g. MaxEnt)
    • Use experts to manually merge some data, use it to train and validate
  – Unsupervised (e.g. dependency-based)
    • E.g. mine data for dependencies, use dependencies to pick the best estimates
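
The two ad-hoc techniques listed above can be illustrated with a small Python sketch. The lookup tables and function names are hypothetical and exist only for this example; a real system would rely on curated dictionaries and a richer rule language.

```python
# Hypothetical lookup tables, for illustration only.
CANONICAL_NAME = {"jim": "James", "jimmy": "James", "james": "James"}
NAME_GENDER = {"Andrea": "F", "Andrew": "M"}

def standardize(name):
    """Standardization: map name variants to one canonical form."""
    return CANONICAL_NAME.get(name.lower(), name)

def merge_by_gender(candidates, sex):
    """Pre-defined rule: keep candidate names consistent with the Sex attribute.

    Returns the single surviving name, or None when the rule does not decide.
    """
    survivors = [n for n in candidates if NAME_GENDER.get(n) == sex]
    return survivors[0] if len(survivors) == 1 else None

print(standardize("Jim"), standardize("Jimmy"))
print(merge_by_gender(["Andrea", "Andrew"], "F"))
print(merge_by_gender(["Sergey", "Sergio"], "M"))
```

The last call returns `None` because the rule has no basis for choosing between the two names, which is the uncertain case that probabilistic merging is meant to represent.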

SQL Extensions

• The MATCH predicate
  – Uses a specified matching methodology to determine if specified tuples describe the same object
• The MERGE function
  – Uses a specified merging methodology to provide estimates for values of specified attributes
• The PROB function
  – Provides access to probabilities in type-1 and type-2 probabilistic relations

The MATCH Predicate

• Can be used in the WHERE clause of a SELECT statement
• Takes the name of the matcher module and the tuples to be tested
• Returns true if the tuples match with probability exceeding the matcher threshold. Otherwise, returns false
• SELECT statements with MATCH produce type-1 probabilistic relations

MATCH Example

• Data source relations

S1
Id  Name    SSN
50  John    111-22-3333
60  Johnny  222-33-4444

S2
Id   Name    Phone
100  Jon     212-555-1212
200  Johnny  646-444-1212

• Query

    SELECT S1.NAME, S1.SSN, S2.PHONE
    FROM S1, S2
    WHERE MATCH('NAME_MATCHER', S1.NAME, S2.NAME)

• Result

S1.Name  SSN          Phone         w(R, t)
John     111-22-3333  212-555-1212  .6 = Match_prob('John', 'Jon')
John     111-22-3333  646-444-1212  .8 = Match_prob('John', 'Johnny')
Johnny   222-33-4444  212-555-1212  .5 = Match_prob('Johnny', 'Jon')
Johnny   222-33-4444  646-444-1212  .9 = Match_prob('Johnny', 'Johnny')
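
The semantics of the query can be emulated in Python as a cross product filtered by the matcher threshold. This is a sketch under stated assumptions: the match probabilities are copied from the result table rather than computed, and the `threshold` parameter and function name are hypothetical, since the slides do not give the threshold used by NAME_MATCHER.

```python
S1 = [(50, "John", "111-22-3333"), (60, "Johnny", "222-33-4444")]      # Id, Name, SSN
S2 = [(100, "Jon", "212-555-1212"), (200, "Johnny", "646-444-1212")]   # Id, Name, Phone

# Match probabilities copied from the result table; a real matcher module
# would compute them.
MATCH_PROB = {
    ("John", "Jon"): 0.6,
    ("John", "Johnny"): 0.8,
    ("Johnny", "Jon"): 0.5,
    ("Johnny", "Johnny"): 0.9,
}

def select_with_match(s1, s2, threshold):
    """Emulate SELECT ... FROM S1, S2 WHERE MATCH(...), keeping w(R, t)."""
    result = []
    for _, name1, ssn in s1:
        for _, name2, phone in s2:
            w = MATCH_PROB[(name1, name2)]
            if w > threshold:  # MATCH is true only above the threshold
                result.append((name1, ssn, phone, w))
    return result

all_rows = select_with_match(S1, S2, threshold=0.0)     # all four rows, as in the table
strict_rows = select_with_match(S1, S2, threshold=0.7)  # only the .8 and .9 rows
```

Each output row carries its weight, so the result is a type-1 probabilistic relation in the sense defined earlier.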

The MERGE function

• May appear in SELECT list
• Accepts two parameters
  – Merger name
  – Merge list
• Returns a table of the form (v, w_f) where v is a value and w_f is the corresponding probability
• SELECT statements with MERGE produce type-2 probabilistic relations

MERGE Example

• Data sources

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

• Query

    SELECT S1.SSN, MERGE('NAME_MERGER', (S1.NAME, S2.NAME)) AS NAME
    FROM S1, S2
    WHERE S1.SSN = S2.SSN

• Result

SSN          Name
             Value   Probability
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6
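
The shape of the MERGE result can be emulated in Python as a join on SSN followed by a merger call on the list of candidate names. This is only a sketch: the merger here is a stub that looks up the probabilities shown in the result table, and passing the SSN to the merger is a simplification made for this example, not part of the interface described in the slides.

```python
S1 = {"111-22-3333": "John", "222-33-4444": "Johnny"}   # SSN -> Name
S2 = {"111-22-3333": "Jon", "222-33-4444": "John"}      # SSN -> Name

# Probabilities copied from the result table, keyed by (SSN, candidate name).
# A real merger module would estimate them from the data.
EXAMPLE_PROB = {
    ("111-22-3333", "John"): 0.7,
    ("111-22-3333", "Jon"): 0.25,
    ("222-33-4444", "Johnny"): 0.2,
    ("222-33-4444", "John"): 0.6,
}

def name_merger(ssn, candidates):
    """Hypothetical merger stub: return (value, probability) pairs."""
    return [(v, EXAMPLE_PROB[(ssn, v)]) for v in candidates]

def select_with_merge(s1, s2, merger):
    """Emulate the join on SSN followed by MERGE over (S1.NAME, S2.NAME)."""
    result = {}
    for ssn, name1 in s1.items():
        if ssn in s2:  # WHERE S1.SSN = S2.SSN
            result[ssn] = merger(ssn, (name1, s2[ssn]))
    return result

merged = select_with_merge(S1, S2, name_merger)
for ssn, values in merged.items():
    print(ssn, values)
```

Each SSN maps to a list of (v, w_f) pairs, which corresponds to one probabilistic set per attribute value in a type-2 relation.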

Query Processing Diagram

[Diagram. Components: Preprocessor, Matching/Merging Engine, Postprocessor, Inconsistent Data. Flow labels: Extended SQL Query; Standard SQL Query; Postprocessing Instructions; SQL Query Result (Inconsistent Relation); Probabilistic Relations; Traditional or Probabilistic Relations.]

Interfaces

[Diagram. Applications: Standard SQL Application, Extended SQL Application, Advanced Inconsistency Resolution Application. Interfaces: Standard SQL Interface, Extended SQL Interface, Advanced SQL Interface. Back end: Extended Database System, Inconsistent Data. Flow labels: Standard SQL, Extended SQL, Traditional Relation, Probabilistic Relation.]

Validating with real-world data

• MEDLINE data set
  – Affiliation Fields:
    • E-mail, Organization, Address
  – Statistics:
    • 2,391,822 affiliations
    • 523,140 matched by e-mail address
    • 182,892 with US addresses
    • 32,505 non-identical duplicates
• Looking for other interesting data sets
  – Errors
  – Dependencies
  – Duplicates
  – More distinct items
  – More Fields
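
One way to find non-identical duplicates among records matched by e-mail address is sketched below. The slides do not describe how the statistics were computed, so this is an assumption-laden illustration: the record field names, the case-insensitive e-mail comparison, and the sample records are all made up for the example.

```python
from collections import defaultdict

def non_identical_duplicates(affiliations):
    """Group affiliation records by e-mail and keep groups whose records differ.

    Each record is a dict with keys "email", "organization", "address"
    (field names are assumptions based on the affiliation fields listed above).
    """
    groups = defaultdict(set)
    for rec in affiliations:
        key = rec["email"].strip().lower()  # assumed: e-mails compare case-insensitively
        groups[key].add((rec["organization"], rec["address"]))
    # More than one distinct (organization, address) pair means the records
    # matched by e-mail but are not identical.
    return {email: sorted(variants)
            for email, variants in groups.items() if len(variants) > 1}

# Made-up records for illustration only; not taken from MEDLINE.
sample = [
    {"email": "a@example.org", "organization": "Dept. of CS", "address": "New York, NY"},
    {"email": "A@example.org", "organization": "Department of Computer Science", "address": "New York, NY"},
    {"email": "b@example.org", "organization": "Dept. of Biology", "address": "Boston, MA"},
]
dupes = non_identical_duplicates(sample)
print(dupes)
```

Groups found this way give inconsistent attribute values for what is presumably the same real-world object, which is the input a merging methodology needs.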

Future plans

• Consider several data sets
• Develop several merging methodologies
• Evaluate using real data and looking at
  – Performance
  – Merge Quality
  – Usability

Questions

• ?