Extending Relational Database Functionality with Data Inconsistency Resolution Support
Ilya Pevzner, [email protected]
Arthur Goldberg, [email protected]
Department of Computer Science, Courant Institute
New York University
Ilya Pevzner, Arthur Goldberg 2 VLDB-2003 Ph.D. Workshop
Inconsistency
• Databases often contain information about real-world objects
• When the data is collected and entered in the database (or measured), errors are introduced
• When the same object is measured more than once, inconsistent data values may result
Object Identification
• Identification of records describing the same real-world object
• If key values are inconsistent, object identification is not trivial and its results are uncertain
• Also known as approximate matching, duplicate detection and record linkage
• An area with multiple successful techniques; topic of a KDD-2003 workshop
Inconsistency Resolution Problem
• Given what is known about the world, find the “best” estimates for values of the inconsistent attributes
• Possible sources of the knowledge about the world:
  a) The system designer or expert
  b) The end user
  c) The data
• Inconsistency resolution is also called merging
• Existing research is almost exclusively on a) and b)
  – No systematic techniques
• Our work concentrates on c)
Matching/Merging Example
  – Match using ID (trivial)
  – Merge using standardization
Source records:
  ID   Name
  10   Arthur
  20   Johnny
  10   Art
  20   John
After MATCH:
  ID   Name
  10   Arthur, Art
  20   John, Johnny
After MERGE:
  ID   Name
  10   Arthur
  20   John
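The two steps of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STANDARD_NAME` table and both function names are hypothetical and do not come from the slides, which do not say how standardization is implemented.

```python
from collections import defaultdict

# Rows from the slide's example: (ID, Name).
rows = [(10, "Arthur"), (20, "Johnny"), (10, "Art"), (20, "John")]

# Hypothetical standardization table: maps each name variant to a canonical form.
STANDARD_NAME = {"Art": "Arthur", "Arthur": "Arthur", "Johnny": "John", "John": "John"}


def match_by_id(rows):
    """Group name values by ID (the trivial matching step)."""
    groups = defaultdict(list)
    for rec_id, name in rows:
        groups[rec_id].append(name)
    return dict(groups)


def merge_by_standardization(groups):
    """Collapse each group to its canonical name when all variants agree."""
    merged = {}
    for rec_id, names in groups.items():
        canonical = {STANDARD_NAME.get(n, n) for n in names}
        # If standardization does not reduce the group to one value,
        # leave the conflict unresolved (None).
        merged[rec_id] = canonical.pop() if len(canonical) == 1 else None
    return merged


matched = match_by_id(rows)
merged = merge_by_standardization(matched)
```

With the slide's four records, `matched` groups the names under IDs 10 and 20, and `merged` holds one canonical name per ID.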
Merging Uncertainty
• Sometimes it is possible, but non-trivial, to tell which attribute value is best
• In other cases, the answer is uncertain
Matched records:
  ID   Sex   Country   Name
  10   F     US        Andrea
  10   F     US        Andrew
  20   M     Russia    Sergey
  20   M     Russia    Sergio
After MERGE:
  ID   Sex   Country   Name
  10   F     US        Andrea
  20   M     Russia    Sergey
Research goals
• Develop merging methodologies that rely on the analysis of the data
• Extend relational databases with
  – Integrated model for representing matching and merging uncertainties
  – Integrated support for various matching and merging methodologies
Uncertainty in Relational Databases
• Semantics of Nulls
  – E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1983.
• Fuzzy databases
  – E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988.
• Probabilistic relations
  – E.g. E. Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kluwer Publ., 1997.
Probabilistic relations overview
• Probabilistic relations model uncertainty with truth probabilities added to classic relations
  – E.g. tuple X is in the relation with probability P[X]
• Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed
  – E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds: one with tuple X and one without
• Relational operations are defined through the associated classic relations
Zimanyi’s Type-1 Probabilistic Relation
• Definition
  – A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) added to each tuple t, indicating the probability that t belongs to relation R
Zimanyi’s Type-1 Probabilistic Relation Example
• Probabilistic relation
  Name     SSN           Phone          ID1   ID2   w(R, t)
  John     111-22-3333   212-555-1212   50    100   .6
  John     111-22-3333   646-444-1212   50    200   .8
  Johnny   222-33-4444   212-555-1212   60    100   .5
  Johnny   222-33-4444   646-444-1212   60    200   .9
• Possible worlds (assuming unique(ID1) and unique(ID2)), each a relation over (Name, SSN, Phone):
  – (empty relation)
  – (Johnny, 222-33-4444, 646-444-1212)
  – (John, 111-22-3333, 212-555-1212)
  – (John, 111-22-3333, 212-555-1212), (Johnny, 222-33-4444, 646-444-1212)
  – (Johnny, 222-33-4444, 212-555-1212)
  – (John, 111-22-3333, 646-444-1212)
  – (John, 111-22-3333, 646-444-1212), (Johnny, 222-33-4444, 212-555-1212)
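The seven possible worlds can be reproduced by enumerating every subset of the four tuples and keeping those in which no ID1 value and no ID2 value repeats. The following Python sketch does exactly that; the function names are hypothetical, and the sketch deliberately does not assign a probability to each world, since the slides do not state how those would be derived.

```python
from itertools import combinations

# Tuples of the type-1 relation from the slide: (Name, SSN, Phone, ID1, ID2, w).
R = [
    ("John", "111-22-3333", "212-555-1212", 50, 100, 0.6),
    ("John", "111-22-3333", "646-444-1212", 50, 200, 0.8),
    ("Johnny", "222-33-4444", "212-555-1212", 60, 100, 0.5),
    ("Johnny", "222-33-4444", "646-444-1212", 60, 200, 0.9),
]


def is_consistent(world):
    """True if no ID1 value and no ID2 value occurs more than once."""
    id1s = [t[3] for t in world]
    id2s = [t[4] for t in world]
    return len(set(id1s)) == len(id1s) and len(set(id2s)) == len(id2s)


def possible_worlds(relation):
    """Enumerate every subset of tuples satisfying both uniqueness constraints."""
    worlds = []
    for k in range(len(relation) + 1):
        for subset in combinations(relation, k):
            if is_consistent(subset):
                # Project away ID1, ID2 and w, as in the slide's possible worlds.
                worlds.append([t[:3] for t in subset])
    return worlds


worlds = possible_worlds(R)
```

The enumeration yields the empty relation, four single-tuple worlds, and the two two-tuple worlds in which each person gets a different phone number.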
Probabilistic matching
• Example: matching by name
Source records:
  ID1   ID2   Name
  50          John
  60          Johnny
        100   Jon
        200   Johnny
After MATCH:
  Name            ID1   ID2   w(R, t)
  John, Jon       50    100   .6
  John, Johnny    50    200   .8
  Johnny, Jon     60    100   .5
  Johnny, Johnny  60    200   .9
• The way w(R, t) is computed depends on the matching methodology
  – An example of such a methodology is ChoiceMaker™
Zimanyi’s Type-2 Probabilistic Relation
• Definition
  – Generalized relation in which attribute values can be probabilistic sets
Zimanyi’s Type-2 Probabilistic Relation Example
• Probabilistic relation
  SSN           Name (Value)   Probability
  111-22-3333   John           .7
                Jon            .25
  222-33-4444   Johnny         .2
                John           .6
• Possible worlds, each a relation over (SSN, Name):
  – (111-22-3333, John), (222-33-4444, Johnny)
  – (111-22-3333, Jon), (222-33-4444, Johnny)
  – (111-22-3333, Jon), (222-33-4444, John)
  – (111-22-3333, John), (222-33-4444, John)
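The four possible worlds of this type-2 relation arise from picking one Name value per SSN. A minimal Python sketch of that enumeration follows; the dictionary layout and function name are assumptions made for illustration. Note that the listed probabilities for each SSN sum to less than 1 (0.95 and 0.8), and the slides do not say how the remainder is interpreted, so the sketch enumerates worlds without assigning them probabilities.

```python
from itertools import product

# Type-2 relation from the slide: SSN -> probabilistic set of Name values.
type2 = {
    "111-22-3333": {"John": 0.7, "Jon": 0.25},
    "222-33-4444": {"Johnny": 0.2, "John": 0.6},
}


def possible_worlds(relation):
    """Pick one Name value per SSN; each combination is one classic relation."""
    ssns = list(relation)
    worlds = []
    # Iterating a dict yields its keys, i.e. the candidate names for that SSN.
    for names in product(*(relation[s] for s in ssns)):
        worlds.append(list(zip(ssns, names)))
    return worlds


worlds = possible_worlds(type2)
```

Two candidate names for each of two SSNs give 2 × 2 = 4 classic relations, matching the four worlds listed on the slide.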
Probabilistic Merging Example
• Data sources
S1:
  SSN           Name
  111-22-3333   John
  222-33-4444   Johnny
S2:
  SSN           Name
  111-22-3333   Jon
  222-33-4444   John
• Query:
  – List all people with their correct name and social security number
• Execution plan:
  – Join using SSN (UID)
  – Merge names
Probabilistic Merging Example: Result
S1:
  SSN           Name
  111-22-3333   John
  222-33-4444   Johnny
S2:
  SSN           Name
  111-22-3333   Jon
  222-33-4444   John
After MERGE:
  SSN           Name (Value)   Probability
  111-22-3333   John           .7
                Jon            .25
  222-33-4444   Johnny         .2
                John           .6
Merging Methodologies
• Ad-hoc techniques
  – Standardization
    • E.g. convert both Jim and Jimmy to James
  – Pre-defined rules
    • E.g. use gender to pick Andrea and not Andrew
• Machine Learning
  – Supervised (e.g. MaxEnt)
    • Use experts to manually merge some data, use it to train and validate
  – Unsupervised (e.g. dependency-based)
    • E.g. mine data for dependencies, use dependencies to pick the best estimates
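As an illustration of the pre-defined-rule approach, the Python sketch below uses a record's Sex value to filter candidate names, as in the Andrea/Andrew case. The `NAME_SEX` lookup table and the function name are hypothetical; a real rule set would need a far larger and more careful table, and the slides do not describe one.

```python
# Hypothetical lookup of the sex usually associated with a first name.
NAME_SEX = {"Andrea": "F", "Andrew": "M", "Sergey": "M", "Sergio": "M"}


def merge_name_by_sex(sex, candidate_names):
    """Return the candidates compatible with the record's Sex value.

    A single survivor is a resolved merge; several survivors mean the rule
    cannot decide and the result stays uncertain.
    """
    compatible = [n for n in candidate_names if NAME_SEX.get(n) == sex]
    # If the rule eliminates everything, fall back to all candidates.
    return compatible or list(candidate_names)


resolved = merge_name_by_sex("F", ["Andrea", "Andrew"])
unresolved = merge_name_by_sex("M", ["Sergey", "Sergio"])
```

The rule settles the first case but leaves both Sergey and Sergio standing, which is the kind of residual uncertainty that a probabilistic representation is meant to capture.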
SQL Extensions
• The MATCH predicate
  – Uses a specified matching methodology to determine if specified tuples describe the same object
• The MERGE function
  – Uses a specified merging methodology to provide estimates for values of specified attributes
• The PROB function
  – Provides access to probabilities in type-1 and type-2 probabilistic relations
The MATCH Predicate
• Can be used in the WHERE clause of a SELECT statement
• Takes the name of the matcher module and the tuples to be tested
• Returns true if the tuples match with probability exceeding the matcher threshold. Otherwise, returns false
• SELECT statements with MATCH produce type-1 probabilistic relations
MATCH Example
• Data source relations
S1:
  Id   Name     SSN
  50   John     111-22-3333
  60   Johnny   222-33-4444
S2:
  Id    Name     Phone
  100   Jon      212-555-1212
  200   Johnny   646-444-1212
• Query
  SELECT S1.NAME, S1.SSN, S2.PHONE
  FROM S1, S2
  WHERE MATCH('NAME_MATCHER', S1.NAME, S2.NAME)
• Result
  S1.Name   SSN           Phone          w(R, t)
  John      111-22-3333   212-555-1212   .6 = Match_prob('John', 'Jon')
  John      111-22-3333   646-444-1212   .8 = Match_prob('John', 'Johnny')
  Johnny    222-33-4444   212-555-1212   .5 = Match_prob('Johnny', 'Jon')
  Johnny    222-33-4444   646-444-1212   .9 = Match_prob('Johnny', 'Johnny')
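The behaviour of this query can be imitated outside the database with a nested loop over S1 and S2. In the Python sketch below, the `MATCH_PROB` table is only a stand-in for the NAME_MATCHER module, filled with the probabilities shown on the slide; the function name and the threshold values are assumptions, since the slides do not give the matcher's actual threshold.

```python
# Source relations from the slide: (Id, Name, SSN) and (Id, Name, Phone).
S1 = [(50, "John", "111-22-3333"), (60, "Johnny", "222-33-4444")]
S2 = [(100, "Jon", "212-555-1212"), (200, "Johnny", "646-444-1212")]

# Stand-in for the matcher module: the slide's probabilities as a lookup table.
MATCH_PROB = {
    ("John", "Jon"): 0.6,
    ("John", "Johnny"): 0.8,
    ("Johnny", "Jon"): 0.5,
    ("Johnny", "Johnny"): 0.9,
}


def match_query(s1, s2, threshold):
    """Imitate SELECT S1.NAME, S1.SSN, S2.PHONE FROM S1, S2 WHERE MATCH(...)."""
    result = []
    for _, name1, ssn in s1:
        for _, name2, phone in s2:
            w = MATCH_PROB[(name1, name2)]
            if w > threshold:  # MATCH is true only above the matcher threshold
                # Keep w alongside the row: the result is a type-1 relation.
                result.append((name1, ssn, phone, w))
    return result


all_pairs = match_query(S1, S2, threshold=0.0)
confident = match_query(S1, S2, threshold=0.55)
```

A threshold of 0.0 returns all four rows of the slide's result; raising it to a hypothetical 0.55 drops the ('Johnny', 'Jon') pair.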
The MERGE function
• May appear in the SELECT list
• Accepts two parameters
  – Merger name
  – Merge list
• Returns a table of the form (v, w_f), where v is a value and w_f is the corresponding probability
• SELECT statements with MERGE produce type-2 probabilistic relations
MERGE Example
• Data sources
S1:
  SSN           Name
  111-22-3333   John
  222-33-4444   Johnny
S2:
  SSN           Name
  111-22-3333   Jon
  222-33-4444   John
• Query
  SELECT S1.SSN, MERGE('NAME_MERGER', (S1.NAME, S2.NAME)) AS NAME
  FROM S1, S2
  WHERE S1.SSN = S2.SSN
• Result
  SSN           Name (Value)   Probability
  111-22-3333   John           .7
                Jon            .25
  222-33-4444   Johnny         .2
                John           .6
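The shape of this query, a join on SSN followed by a pluggable merger that returns (value, probability) pairs, can be sketched in Python as follows. The `frequency_merger` is a hypothetical placeholder that weights candidates by relative frequency; it is not the NAME_MERGER from the slides and does not reproduce the probabilities shown there.

```python
from collections import Counter

# Source relations from the slide: (SSN, Name).
S1 = [("111-22-3333", "John"), ("222-33-4444", "Johnny")]
S2 = [("111-22-3333", "Jon"), ("222-33-4444", "John")]


def frequency_merger(values):
    """Hypothetical merger: weight each candidate by its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c / total) for v, c in counts.items()]


def merge_query(s1, s2, merger):
    """Imitate SELECT S1.SSN, MERGE(merger, (S1.NAME, S2.NAME)) ... WHERE S1.SSN = S2.SSN."""
    names2 = dict(s2)  # SSN -> name; assumes SSN is unique in S2
    result = []
    for ssn, name1 in s1:
        if ssn in names2:  # equi-join on SSN
            # Each Name cell becomes a probabilistic set: a type-2 relation.
            result.append((ssn, merger([name1, names2[ssn]])))
    return result


result = merge_query(S1, S2, frequency_merger)
```

Passing the merger as an argument mirrors the way the MERGE function takes a merger name, so a data-driven methodology could be swapped in without changing the join.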
Query Processing Diagram
[Figure: query processing diagram. Components: Preprocessor, Postprocessor, Matching/Merging Engine, Inconsistent Data. Labeled flows: Extended SQL Query, Standard SQL Query, Postprocessing Instructions, SQL Query Result (Inconsistent Relation), Probabilistic Relations, Traditional or Probabilistic Relations.]
Interfaces
[Figure: interfaces diagram. Applications: Standard SQL Application, Extended SQL Application, Advanced Inconsistency Resolution Application. Interfaces: Standard SQL Interface, Extended SQL Interface, Advanced SQL Interface. Back end: Extended Database System over Inconsistent Data. Labeled flows: Standard SQL, Extended SQL, Traditional Relation, Probabilistic Relation.]
Validating with real-world data
• MEDLINE data set
  – Affiliation fields:
    • E-mail, Organization, Address
  – Statistics:
    • 2,391,822 affiliations
    • 523,140 matched by e-mail address
    • 182,892 with US addresses
    • 32,505 non-identical duplicates
• Looking for other interesting data sets
  – Errors
  – Dependencies
  – Duplicates
  – More distinct items
  – More fields
Future plans
• Consider several data sets
• Develop several merging methodologies
• Evaluate using real data and looking at
  – Performance
  – Merge quality
  – Usability
Questions
• ?