
Extending Relational Database Functionality with Data Inconsistency Resolution Support

Ilya Pevzner

Arthur Goldberg

Department of Computer Science, Courant Institute

New York University

Ilya Pevzner, Arthur Goldberg, VLDB-2003 Ph.D. Workshop

Inconsistency

• Databases often contain information about real-world objects

• When the data is collected and entered in the database (or measured), errors are introduced

• When the same object is measured more than once, inconsistent data values may result


Object Identification

• Identification of records describing the same real-world object

• If key values are inconsistent, object identification is not trivial and its results are uncertain

• Also known as approximate matching, duplicate detection, and record linkage

• Area with multiple successful techniques; topic of a KDD-2003 workshop


Inconsistency Resolution Problem

• Given what is known about the world, find the “best” estimates for values of the inconsistent attributes

• Possible sources of the knowledge about the world:

a) The system designer or expert

b) The end user

c) The data

• Inconsistency resolution is also called merging

• Existing research is almost exclusively on a) and b)
  – No systematic techniques

• Our work concentrates on c)


Matching/Merging Example

– Match using ID (trivial)

– Merge using standardization

ID  Name
10  Arthur
20  Johnny
10  Art
20  John

MATCH

ID  Name
10  Arthur, Art
20  John, Johnny

MERGE

ID  Name
10  Arthur
20  John
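The two steps in this example can be sketched in Python. This is only an illustration, not the authors' implementation: the `CANONICAL` nickname table and both function names are hypothetical, and a real system would use a curated dictionary of name variants.

```python
from collections import defaultdict

# Hypothetical nickname table (assumption); maps lowercase variants to one form.
CANONICAL = {"art": "Arthur", "arthur": "Arthur", "johnny": "John", "john": "John"}

def match_by_id(records):
    """Group (id, name) records that share an ID (the trivial MATCH step)."""
    groups = defaultdict(list)
    for rec_id, name in records:
        groups[rec_id].append(name)
    return dict(groups)

def merge_by_standardization(groups):
    """Map every name variant to its canonical form (the MERGE step)."""
    merged = {}
    for rec_id, names in groups.items():
        canonical = {CANONICAL.get(n.lower(), n) for n in names}
        # Standardization resolves the conflict only if all variants agree.
        merged[rec_id] = canonical.pop() if len(canonical) == 1 else None
    return merged

records = [(10, "Arthur"), (20, "Johnny"), (10, "Art"), (20, "John")]
groups = match_by_id(records)
result = merge_by_standardization(groups)
print(result)
```

When the variants of a group do not collapse to a single canonical form, the sketch returns `None`, which is the situation where the merge result is uncertain.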


Merging Uncertainty

• Sometimes it is possible, but non-trivial, to tell which attribute value is best

• In other cases, the answer is uncertain

ID  Sex  Country  Name
10  F    US       Andrea
10  F    US       Andrew
20  M    Russia   Sergey
20  M    Russia   Sergio

MERGE

ID  Sex  Country  Name
10  F    US       Andrea
20  M    Russia   Sergey


Research goals

• Develop merging methodologies that rely on the analysis of the data

• Extend relational databases with
  – Integrated model for representing matching and merging uncertainties
  – Integrated support for various matching and merging methodologies


Uncertainty in Relational Databases

• Semantics of Nulls
  – E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1983.

• Fuzzy databases
  – E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988.

• Probabilistic relations
  – E.g. E. Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kluwer Publ., 1997.


Probabilistic relations overview

• Probabilistic relations model uncertainty with truth probabilities added to classic relations
  – E.g. tuple X is in the relation with probability P[X]

• Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed
  – E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds: one with tuple X and one without

• Relational operations are defined through the associated classic relations
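A minimal sketch of the possible-worlds idea, assuming that tuple memberships are independent events (the slides do not state this; it is an assumption of the sketch). The function names and the value P[X] = 0.3 are made up for illustration.

```python
from itertools import product

def possible_worlds(prob_relation):
    """Enumerate (world, probability) pairs for a dict {tuple: P[tuple]}.

    Assumes independent tuple memberships (assumption of this sketch).
    """
    tuples = list(prob_relation)
    worlds = []
    for choices in product([True, False], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, choices) if keep)
        prob = 1.0
        for t, keep in zip(tuples, choices):
            prob *= prob_relation[t] if keep else 1.0 - prob_relation[t]
        worlds.append((world, prob))
    return worlds

def probability(worlds, predicate):
    """Evaluate a query per classic world and add up the worlds where it holds."""
    return sum(p for world, p in worlds if predicate(world))

# One uncertain tuple X with a hypothetical P[X] = 0.3 gives two worlds.
worlds = possible_worlds({"X": 0.3})
print(worlds)
print(probability(worlds, lambda w: "X" in w))
```

Evaluating the predicate once per world and summing the world probabilities mirrors the last bullet: the operation is defined on the classic relations, and the probabilistic answer is derived from them.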


Zimanyi’s Type-1 Probabilistic Relation

• Definition
  – A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) added to each tuple t, indicating the probability that t belongs to R


Zimanyi’s Type-1 Probabilistic Relation Example

• Probabilistic relation

Name    SSN          Phone         ID1  ID2  w(R, t)
John    111-22-3333  212-555-1212  50   100  .6
John    111-22-3333  646-444-1212  50   200  .8
Johnny  222-33-4444  212-555-1212  60   100  .5
Johnny  222-33-4444  646-444-1212  60   200  .9

• Possible worlds (assuming unique(ID1) and unique(ID2)):

Name    SSN          Phone
(empty relation)

Name    SSN          Phone
Johnny  222-33-4444  646-444-1212

Name    SSN          Phone
John    111-22-3333  212-555-1212

Name    SSN          Phone
John    111-22-3333  212-555-1212
Johnny  222-33-4444  646-444-1212

Name    SSN          Phone
Johnny  222-33-4444  212-555-1212

Name    SSN          Phone
John    111-22-3333  646-444-1212

Name    SSN          Phone
John    111-22-3333  646-444-1212
Johnny  222-33-4444  212-555-1212
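The seven possible worlds of this example can be reproduced by brute force: enumerate every subset of the four tuples and keep those in which ID1 and ID2 are both unique. The sketch below does only that; it does not assign probabilities to the worlds, and the helper names are illustrative.

```python
from itertools import combinations

# (Name, SSN, Phone, ID1, ID2) tuples from the example relation.
R = [
    ("John", "111-22-3333", "212-555-1212", 50, 100),
    ("John", "111-22-3333", "646-444-1212", 50, 200),
    ("Johnny", "222-33-4444", "212-555-1212", 60, 100),
    ("Johnny", "222-33-4444", "646-444-1212", 60, 200),
]

def is_unique(world, position):
    """True if no two tuples in the world share the value at `position`."""
    values = [t[position] for t in world]
    return len(values) == len(set(values))

def consistent_worlds(relation):
    """Keep the subsets that satisfy unique(ID1) and unique(ID2)."""
    worlds = []
    for size in range(len(relation) + 1):
        for world in combinations(relation, size):
            if is_unique(world, 3) and is_unique(world, 4):
                # Project away the ID columns, as in the slide's tables.
                worlds.append([t[:3] for t in world])
    return worlds

worlds = consistent_worlds(R)
print(len(worlds))  # 7: the empty world, four singletons, two pairs
```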


Probabilistic matching

• Example: matching by name

ID1  Name
50   John
60   Johnny

ID2  Name
100  Jon
200  Johnny

MATCH

Name          ID1  ID2  w(R, t)
John, Jon     50   100  .6
John, Johnny  50   200  .8
Johnny, Jon   60   100  .5
Johnny        60   200  .9

• The way w(R, t) is computed depends on the matching methodology
  – An example of such methodology is ChoiceMaker™
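To make the dependence on the matching methodology concrete, here is a stand-in matcher built on the standard library's `difflib.SequenceMatcher`. It is not ChoiceMaker and will not reproduce the w(R, t) values on the slide; a string-similarity ratio is also not a calibrated probability, so treat the scores as placeholders.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for a real match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(s1, s2):
    """Pair every record of s1 with every record of s2 and attach a score."""
    return [
        (n1, n2, id1, id2, round(name_similarity(n1, n2), 2))
        for id1, n1 in s1
        for id2, n2 in s2
    ]

s1 = [(50, "John"), (60, "Johnny")]
s2 = [(100, "Jon"), (200, "Johnny")]
rows = match(s1, s2)
for row in rows:
    print(row)
```

Swapping `name_similarity` for a trained model changes the scores but not the shape of the result, which is the point of keeping the methodology pluggable.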


Zimanyi’s Type-2 Probabilistic Relation

• Definition
  – Generalized relation in which attribute values can be probabilistic sets


Zimanyi’s Type-2 Probabilistic Relation Example

• Probabilistic relation

SSN          Name (Value, Probability)
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6

• Possible worlds

SSN          Name
111-22-3333  John
222-33-4444  Johnny

SSN          Name
111-22-3333  Jon
222-33-4444  Johnny

SSN          Name
111-22-3333  Jon
222-33-4444  John

SSN          Name
111-22-3333  John
222-33-4444  John
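For a type-2 relation, each possible world picks one value from every probabilistic set. The sketch below enumerates those combinations for the example above. The weights assume the choices are independent, which the slides do not state, and because the listed probabilities do not sum to 1 per SSN, the weights do not sum to 1 either.

```python
from itertools import product

# Probabilistic sets for the Name attribute, keyed by SSN (values from the slide).
type2 = {
    "111-22-3333": {"John": 0.7, "Jon": 0.25},
    "222-33-4444": {"Johnny": 0.2, "John": 0.6},
}

def type2_worlds(relation):
    """Return (world, weight) pairs, one per combination of attribute values."""
    keys = list(relation)
    worlds = []
    for picks in product(*(relation[k].items() for k in keys)):
        world = {k: name for k, (name, _) in zip(keys, picks)}
        weight = 1.0
        for _, p in picks:
            weight *= p  # independence assumption (not from the slides)
        worlds.append((world, weight))
    return worlds

worlds = type2_worlds(type2)
for world, weight in worlds:
    print(world, round(weight, 3))
```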


Probabilistic Merging Example

• Data sources

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

• Query:
  – List all people with their correct name and social security number

• Execution plan:
  – Join using SSN (UID)
  – Merge names


Probabilistic Merging Example: Result

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

MERGE

SSN          Name (Value, Probability)
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6


Merging Methodologies

• Ad-hoc techniques
  – Standardization
    • E.g. convert both Jim and Jimmy to James
  – Pre-defined rules
    • E.g. use gender to pick Andrea and not Andrew

• Machine Learning
  – Supervised (e.g. MaxEnt)
    • Use experts to manually merge some data, use it to train and validate
  – Unsupervised (e.g. dependency-based)
    • E.g. mine data for dependencies, use dependencies to pick the best estimates
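One way to read the unsupervised, dependency-based bullet is sketched below: count how often values of one attribute co-occur with values of another, then score conflicting candidates by that support. This is a guess at the general shape, not the authors' method; the `reference` records and both function names are hypothetical.

```python
from collections import Counter

def mine_dependency(rows, determinant, dependent):
    """Count how often each (determinant value, dependent value) pair occurs."""
    return Counter((row[determinant], row[dependent]) for row in rows)

def merge_with_dependency(candidates, context_value, support):
    """Score each candidate by its relative support given the context value."""
    counts = {c: support[(context_value, c)] for c in candidates}
    total = sum(counts.values())
    if total == 0:
        # No evidence in the mined data: fall back to a uniform estimate.
        return {c: 1.0 / len(candidates) for c in candidates}
    return {c: n / total for c, n in counts.items()}

# Hypothetical records used to mine a soft Sex -> Name dependency.
reference = [
    {"Sex": "F", "Name": "Andrea"},
    {"Sex": "F", "Name": "Andrea"},
    {"Sex": "M", "Name": "Andrew"},
    {"Sex": "F", "Name": "Maria"},
]
support = mine_dependency(reference, "Sex", "Name")
estimates = merge_with_dependency(["Andrea", "Andrew"], "F", support)
print(estimates)
```

With this toy data the gender rule from the ad-hoc example falls out of the counts rather than being written by hand.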


SQL Extensions

• The MATCH predicate
  – Uses a specified matching methodology to determine if specified tuples describe the same object

• The MERGE function
  – Uses a specified merging methodology to provide estimates for values of specified attributes

• The PROB function
  – Provides access to probabilities in type-1 and type-2 probabilistic relations


The MATCH Predicate

• Can be used in the WHERE clause of a SELECT statement

• Takes the name of the matcher module and the tuples to be tested

• Returns true if the tuples match with probability exceeding the matcher threshold. Otherwise, returns false

• SELECT statements with MATCH produce type-1 probabilistic relations


MATCH Example

• Data source relations

S1
Id  Name    SSN
50  John    111-22-3333
60  Johnny  222-33-4444

S2
Id   Name    Phone
100  Jon     212-555-1212
200  Johnny  646-444-1212

• Query

    SELECT S1.NAME, S1.SSN, S2.PHONE
    FROM S1, S2
    WHERE MATCH('NAME_MATCHER', S1.NAME, S2.NAME)

• Result

S1.Name  SSN          Phone         w(R, t)
John     111-22-3333  212-555-1212  .6 = Match_prob('John', 'Jon')
John     111-22-3333  646-444-1212  .8 = Match_prob('John', 'Johnny')
Johnny   222-33-4444  212-555-1212  .5 = Match_prob('Johnny', 'Jon')
Johnny   222-33-4444  646-444-1212  .9 = Match_prob('Johnny', 'Johnny')
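The semantics of this query can be emulated in plain Python as a nested-loop join that keeps pairs whose match probability exceeds the matcher threshold. The probabilities are copied from the result table; the `MATCHERS` registry, the function name, and the threshold of 0.4 are assumptions (the slides give no threshold, only that all four pairs pass it).

```python
S1 = [(50, "John", "111-22-3333"), (60, "Johnny", "222-33-4444")]
S2 = [(100, "Jon", "212-555-1212"), (200, "Johnny", "646-444-1212")]

# Match probabilities copied from the slide; a real matcher would compute them.
MATCH_PROB = {
    ("John", "Jon"): 0.6,
    ("John", "Johnny"): 0.8,
    ("Johnny", "Jon"): 0.5,
    ("Johnny", "Johnny"): 0.9,
}

# Hypothetical registry of matcher modules, looked up by name.
MATCHERS = {"NAME_MATCHER": lambda a, b: MATCH_PROB[(a, b)]}

def select_with_match(matcher_name, s1, s2, threshold=0.4):
    """Emulate SELECT ... WHERE MATCH(...); returns a type-1 relation."""
    matcher = MATCHERS[matcher_name]
    result = []
    for _, name1, ssn in s1:
        for _, name2, phone in s2:
            w = matcher(name1, name2)
            if w > threshold:  # MATCH is true only above the threshold
                result.append((name1, ssn, phone, w))
    return result

result = select_with_match("NAME_MATCHER", S1, S2)
for row in result:
    print(row)
```

Raising the threshold above .5 would drop the ('Johnny', 'Jon') row, which shows how the threshold shapes the resulting relation.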


The MERGE function

• May appear in the SELECT list

• Accepts two parameters
  – Merger name
  – Merge list

• Returns a table of the form (v, w_f), where v is a value and w_f is the corresponding probability

• SELECT statements with MERGE produce type-2 probabilistic relations


MERGE Example

• Data sources

S1
SSN          Name
111-22-3333  John
222-33-4444  Johnny

S2
SSN          Name
111-22-3333  Jon
222-33-4444  John

• Query

    SELECT S1.SSN, MERGE('NAME_MERGER', (S1.NAME, S2.NAME)) AS NAME
    FROM S1, S2
    WHERE S1.SSN = S2.SSN

• Result

SSN          Name (Value, Probability)
111-22-3333  John    .7
             Jon     .25
222-33-4444  Johnny  .2
             John    .6
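A rough Python emulation of this query: join the two sources on SSN and hand each pair of conflicting names to a merger that returns (value, probability) rows. The probabilities are copied from the slides; how a real merger derives them is not specified, so the lookup table, the `MERGERS` registry, and the function name are all placeholders.

```python
S1 = {"111-22-3333": "John", "222-33-4444": "Johnny"}
S2 = {"111-22-3333": "Jon", "222-33-4444": "John"}

# Probabilities copied from the slides, keyed by merge list (placeholder
# for a real merging methodology).
NAME_MERGER_TABLE = {
    ("John", "Jon"): [("John", 0.7), ("Jon", 0.25)],
    ("Johnny", "John"): [("Johnny", 0.2), ("John", 0.6)],
}

# Hypothetical registry of merger modules, looked up by name.
MERGERS = {"NAME_MERGER": lambda merge_list: NAME_MERGER_TABLE[tuple(merge_list)]}

def select_with_merge(merger_name, s1, s2):
    """Emulate SELECT SSN, MERGE(...) with a join on SSN; returns a type-2 relation."""
    merger = MERGERS[merger_name]
    return {
        ssn: merger((name, s2[ssn]))
        for ssn, name in s1.items()
        if ssn in s2  # join condition S1.SSN = S2.SSN
    }

result = select_with_merge("NAME_MERGER", S1, S2)
for ssn, names in result.items():
    print(ssn, names)
```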


Query Processing Diagram

[Diagram. Components: Preprocessor, Matching/Merging Engine, Postprocessor, Inconsistent Data. Labeled flows: Extended SQL Query; Standard SQL Query; Postprocessing Instructions; SQL Query Result (Inconsistent Relation); Probabilistic Relations; Traditional or Probabilistic Relations.]


Interfaces

[Diagram. Applications: Standard SQL Application, Extended SQL Application, Advanced Inconsistency Resolution Application. Interfaces: Standard SQL Interface, Extended SQL Interface, Advanced SQL Interface. Labeled flows: Standard SQL; Extended SQL; Traditional Relation; Probabilistic Relation. Back end: Extended Database System over Inconsistent Data.]


Validating with real-world data

• MEDLINE data set
  – Affiliation fields:
    • E-mail, Organization, Address
  – Statistics:
    • 2,391,822 affiliations
    • 523,140 matched by e-mail address
    • 182,892 with US addresses
    • 32,505 non-identical duplicates

• Looking for other interesting data sets
  – Errors
  – Dependencies
  – Duplicates
  – More distinct items
  – More fields


Future plans

• Consider several data sets

• Develop several merging methodologies

• Evaluate using real data and looking at
  – Performance
  – Merge quality
  – Usability


Questions

• ?
