DATA-DRIVEN UNDERSTANDING AND REFINEMENT OF SCHEMA MAPPINGS Data Integration and Service Computing ITCS 6010

Upload: gilbert-phillips

Post on 13-Jan-2016


Page 1

DATA-DRIVEN UNDERSTANDING AND REFINEMENT OF SCHEMA MAPPINGS

Data Integration and Service Computing

ITCS 6010

Page 2

INTRODUCTION

• USER
– Difficulty finding correct mappings for applications
– Schema mappings are complex, and effectively communicating the subtleties involved is hard
– Understanding source data is difficult, hence provide a facility for schema and data exploration
– Complexities of mapping and subtle differences between alternative mappings
– Reasoning about complex non-associative operators
– Increase of data and the necessity to integrate data from multiple sources
– Mappings between these schemas
– But still some issues need to be addressed

Page 3

ILLUSTRATIONS

“The ultimate goal of schema mapping is not to build correct queries but to extract correct data from the source to populate the target schema”

• The user is expected to
– Have a thorough understanding of the data
– Debug complex SQL queries or procedural transformations

• Clio makes it easy

Page 4

ILLUSTRATIONS

Source: Ling Ling Yan, Renée J. Miller, Laura M. Haas, and Ronald Fagin. 2001. Data-driven understanding and refinement of schema mappings. SIGMOD Rec. 30, 2 (May 2001), 485-496. DOI=10.1145/376284.375729 http://doi.acm.org/10.1145/376284.37572

Page 5

Source: Ling Ling Yan, Renée J. Miller, Laura M. Haas, and Ronald Fagin. 2001. Data-driven understanding and refinement of schema mappings. SIGMOD Rec. 30, 2 (May 2001), 485-496. DOI=10.1145/376284.375729 http://doi.acm.org/10.1145/376284.37572

Page 6
Page 7

MAPPINGS

• A mapping is a query on the source schema that produces a subset of the target relation

• Mapping involves three main activities
– Determining correspondences
– Data linking
– Data trimming

Page 8

• Let $\mathcal{A}$ be a set of attributes

• $A \in \mathcal{A}$ denotes a single attribute

• A relation on schema S is a named finite set of tuples on S

• $t[A] \in \mathrm{dom}(A)$ is the value of t on A

Assumption: relations in the source database do not contain any tuples that are null on any attribute

Page 9

• A predicate P over schema S maps tuples on S to true or false
– Join predicate
– Selection predicate

• A predicate is strong if it evaluates to false for every tuple that is null on all attributes in S

• A join predicate is a strong predicate

• A selection predicate is not required to be strong
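The notion of a strong predicate can be checked mechanically, since there is exactly one tuple over S that is null on every attribute. The following is a minimal Python sketch, not taken from the slides: the schema, the attribute names, and both predicates are hypothetical, and `None` stands in for null.

```python
from typing import Callable, Dict, Iterable, Optional

Row = Dict[str, Optional[object]]  # attribute name -> value; None stands for null


def all_null_tuple(schema: Iterable[str]) -> Row:
    """Build the tuple over `schema` that is null on every attribute."""
    return {attr: None for attr in schema}


def is_strong(pred: Callable[[Row], bool], schema: Iterable[str]) -> bool:
    """A predicate is strong if it is false on the all-null tuple over the schema."""
    return not pred(all_null_tuple(schema))


# Hypothetical schema and predicates, for illustration only.
SCHEMA = ["Children.mother_id", "Parents.id", "Parents.salary"]


def join_pred(row: Row) -> bool:
    # Equality join: a null on either side can never match.
    left, right = row["Children.mother_id"], row["Parents.id"]
    return left is not None and right is not None and left == right


def salary_is_null(row: Row) -> bool:
    # Selection predicate that accepts nulls, so it is not strong.
    return row["Parents.salary"] is None


print(is_strong(join_pred, SCHEMA))       # True
print(is_strong(salary_is_null, SCHEMA))  # False
```

The explicit `is not None` guards matter: in Python `None == None` is true, whereas an SQL equality join never matches on nulls.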

Page 10

Correspondence to Target

• Which attributes appear in the target relation, and how they should appear

• E.g.: Kids.FamilyIncome = parents.salary + parents2.salary (ref
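The FamilyIncome correspondence above can be read as a function from source tuples to a target value. Here is a small Python sketch of that reading. The tuple layout and the sample salaries are made up, and the null handling (a null salary makes the sum null) is an assumption borrowed from SQL arithmetic rather than something the slide states.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[float]]


def family_income(parents: Row, parents2: Row) -> Optional[float]:
    """Kids.FamilyIncome = parents.salary + parents2.salary.

    Assumption: a null salary makes the sum null, as SQL arithmetic would.
    """
    s1, s2 = parents.get("salary"), parents2.get("salary")
    if s1 is None or s2 is None:
        return None
    return s1 + s2


# Hypothetical source tuples.
mother = {"salary": 52000}
father = {"salary": 48000}
print(family_income(mother, father))  # 100000
```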

Page 11

DATA LINKING

Page 12

DATA LINKING

Page 13

DATA LINKING

Page 14

DATA LINKING

Page 15

DATA TRIMMING

• Not all tuples produced by query graph G are semantically meaningful

• Data associations in some categories may be too incomplete to include

• The user may decide to exclude some categories because they have incomplete coverage
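One way to picture trimming by category is to group data associations by which source relations actually contribute a tuple, then drop the groups the user rejects. The Python sketch below assumes that reading. The representation (a dict from relation name to a tuple or `None`) and the sample data are hypothetical, not Clio's actual data structures.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Optional

# A data association: relation name -> contributing tuple, or None if absent.
Association = Dict[str, Optional[dict]]


def coverage(assoc: Association) -> FrozenSet[str]:
    """The category of an association: the relations that contribute a tuple."""
    return frozenset(rel for rel, tup in assoc.items() if tup is not None)


def coverage_counts(assocs: Iterable[Association]) -> Counter:
    """How many associations fall into each coverage category."""
    return Counter(coverage(a) for a in assocs)


def trim(assocs: Iterable[Association], excluded: set) -> List[Association]:
    """Keep only the associations whose category the user has not excluded."""
    return [a for a in assocs if coverage(a) not in excluded]


# Hypothetical associations over two source relations.
ASSOCIATIONS = [
    {"Children": {"name": "Ann"}, "Parents": {"salary": 52000}},
    {"Children": {"name": "Bob"}, "Parents": None},
    {"Children": None, "Parents": {"salary": 30000}},
]

# The user excludes parents that have no associated child.
kept = trim(ASSOCIATIONS, excluded={frozenset({"Parents"})})
print(len(kept))  # 2
```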

Page 16

MAPPING DEFINITION

Page 17

MAPPING DEFINITION

• A mapping defines the relationship between a target relation and a set of source relations. It has three main components:

– A query graph G
– A set V of value correspondences
– Two sets of filters, C_S and C_T, defining the conditions that source and target tuples should satisfy
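The components listed above can be gathered into a single structure. The Python sketch below is a simplification under stated assumptions: the class name, field names, and sample data are hypothetical, and the data associations are supplied directly instead of being computed from the query graph, which is stored only as metadata.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Mapping:
    # Query graph G: edges (relation, relation, join condition). Metadata only here.
    graph: List[Tuple[str, str, str]]
    # V: target attribute -> function computing its value from a data association.
    correspondences: Dict[str, Callable[[Row], object]]
    # C_S and C_T: conditions on source associations and on target tuples.
    source_filters: List[Callable[[Row], bool]] = field(default_factory=list)
    target_filters: List[Callable[[Row], bool]] = field(default_factory=list)

    def apply(self, associations: List[Row]) -> List[Row]:
        """Filter associations by C_S, build target tuples with V, filter by C_T."""
        result = []
        for assoc in associations:
            if not all(f(assoc) for f in self.source_filters):
                continue
            target = {attr: fn(assoc) for attr, fn in self.correspondences.items()}
            if all(f(target) for f in self.target_filters):
                result.append(target)
        return result


# Hypothetical mapping into a Kids target relation.
kids = Mapping(
    graph=[("Children", "Parents", "Children.mother_id = Parents.id")],
    correspondences={
        "name": lambda a: a["Children.name"],
        "FamilyIncome": lambda a: a["Parents.salary"],
    },
    source_filters=[lambda a: a["Children.age"] < 18],
    target_filters=[lambda t: t["FamilyIncome"] is not None],
)

ASSOCIATIONS = [
    {"Children.name": "Ann", "Children.age": 9, "Parents.salary": 52000},
    {"Children.name": "Bob", "Children.age": 30, "Parents.salary": 41000},
    {"Children.name": "Cy", "Children.age": 7, "Parents.salary": None},
]

print(kids.apply(ASSOCIATIONS))  # [{'name': 'Ann', 'FamilyIncome': 52000}]
```

Keeping the filters separate from the correspondences mirrors the slide's split between trimming (filters) and correspondence (V): either can be edited without touching the other.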

Page 18

MAPPING EXAMPLES

• A positive example shows how source tuples combine and contribute successfully to the target relation

• A negative example shows source tuples that are combined correctly but fail to contribute to the target relation
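The distinction between positive and negative examples can be sketched as a small classifier. This sketch assumes that "fails to contribute" means the combined source tuples are rejected by a source or target filter. That interpretation, the function name, and the sample data are assumptions, not something the slide spells out.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def classify(
    assoc: Row,
    correspondences: Dict[str, Callable[[Row], object]],
    source_filters: List[Callable[[Row], bool]],
    target_filters: List[Callable[[Row], bool]],
) -> str:
    """Label a correctly combined association as a positive or negative example."""
    if not all(f(assoc) for f in source_filters):
        return "negative"  # rejected by a source filter
    target = {attr: fn(assoc) for attr, fn in correspondences.items()}
    if not all(f(target) for f in target_filters):
        return "negative"  # rejected by a target filter
    return "positive"


# Hypothetical correspondences and filters.
V = {"name": lambda a: a["Children.name"], "income": lambda a: a["Parents.salary"]}
C_S = [lambda a: a["Children.age"] < 18]
C_T = [lambda t: t["income"] is not None]

ann = {"Children.name": "Ann", "Children.age": 9, "Parents.salary": 52000}
bob = {"Children.name": "Bob", "Children.age": 30, "Parents.salary": 41000}

print(classify(ann, V, C_S, C_T))  # positive
print(classify(bob, V, C_S, C_T))  # negative
```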

Page 19

MAPPING OPERATORS

• Correspondence operators
– Permit users to change value correspondences

• Data trimming operators
– Modify the source and target filters of a mapping
– Do not change the query graph of a mapping

• Data linking operators
– Directly change the query graph of a mapping
– Are of two types: data walk and data chase

Page 20

DATA WALK

• In a data walk, the user knows where the missing data resides in the source or, more specifically, which source relation(s) contain this data.

• A data walk makes use of Clio’s knowledge of the source schema (which is gathered from schema and constraint definitions and from mining the source data, views, stored queries and metadata).
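A data walk can be pictured as a path search over the known join relationships of the source schema: starting from a relation already in the query graph, enumerate the ways to reach the relation that holds the missing data. The Python sketch below is only an illustration of that idea. The schema edges, the length bound, and the function name are hypothetical and do not reproduce Clio's actual algorithm.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (relation, relation, join condition)


def data_walk(edges: List[Edge], start: str, goal: str, max_len: int = 3) -> List[List[str]]:
    """Enumerate join paths from `start` to `goal` that visit no relation twice."""
    adjacency = defaultdict(list)
    for a, b, cond in edges:
        adjacency[a].append((b, cond))
        adjacency[b].append((a, cond))

    paths: List[List[str]] = []

    def dfs(node: str, visited: frozenset, path: List[str]) -> None:
        if node == goal:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for nxt, cond in adjacency[node]:
            if nxt not in visited:
                path.append(cond)
                dfs(nxt, visited | {nxt}, path)
                path.pop()

    dfs(start, frozenset({start}), [])
    return paths


# Hypothetical schema knowledge: two ways to join Children with Parents.
EDGES = [
    ("Children", "Parents", "Children.mother_id = Parents.id"),
    ("Children", "Parents", "Children.father_id = Parents.id"),
    ("Parents", "PhoneDir", "Parents.id = PhoneDir.person_id"),
]

for path in data_walk(EDGES, "Children", "PhoneDir"):
    print(" AND ".join(path))
```

Each returned path is one candidate extension of the query graph, so the two paths here correspond to two alternative mappings (phone via the mother or via the father).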

Page 21
Page 22
Page 23

DATA CHASE

• In a data chase, the user does not know where the missing data resides. The chase permits the user to explore the source data incrementally to locate the desired data.

• The user may not know which relations to include in the extended query graph.
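One plausible way to explore the data incrementally is to pick a value from the current example and look for every place it occurs in the source. The Python sketch below assumes that reading of the chase. The in-memory database layout, the sample rows, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Database = Dict[str, List[Dict[str, object]]]  # relation name -> rows


def data_chase(db: Database, value: object) -> List[Tuple[str, str]]:
    """Return the (relation, attribute) pairs in which `value` occurs."""
    hits = set()
    for relation, rows in db.items():
        for row in rows:
            for attr, cell in row.items():
                if cell == value:
                    hits.add((relation, attr))
    return sorted(hits)


# Hypothetical source database.
SOURCE = {
    "Children": [{"id": "c1", "name": "Ann", "mother_id": "p7"}],
    "Parents": [{"id": "p7", "salary": 52000}],
    "PhoneDir": [{"person_id": "p7", "number": "555-0100"}],
}

print(data_chase(SOURCE, "p7"))
# [('Children', 'mother_id'), ('Parents', 'id'), ('PhoneDir', 'person_id')]
```

Each hit suggests a relation the user could add to the query graph, which is how a chase helps when the user does not know in advance which relations to include.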

Page 24
Page 25
Page 26

CLIO FOR LARGE MAPPINGS

• Manage and manipulate multiple (possible) mappings while the user explores the data, creates new correspondences and extends the query graph.

• The more complex the relationship between source and target, the more (possible) mappings we must handle.

• Large schemas are a source of complexity. Large volumes of data need to be transformed.

• With unfamiliar data sources, the amount of data itself might be an obstacle to mapping.

Page 27

CLIO MAPPING FRAMEWORK

• Clio provides

– Target Viewer
    – Gives a “What You See Is What You Get” flavor to the mapping.

– Source Viewer
    – Serves as a palette from which users can choose the relations with which they want to work or explicitly select an edge to follow.
    – Provides a visualization of the query graph being constructed.

– A set of workspaces, each associated with a single mapping alternative.

Page 28

COMPLEX MAPPINGS

• Many of the single-target mappings created will have a great deal of overlap, differing only in a few correspondences or a small portion of the query graph.

• The decisions made in creating one mapping can be stored and made available to the user in order to reduce the burden and overhead of re-creating the bulk of each mapping from scratch.

Page 29

CLIO FOR COMPLEX MAPPINGS

• Clio automatically computes the possible mappings, and the user can accept one or several, adding filters as needed.

• Clio’s rich framework supports the user in specifying complex target mappings.

Page 30

SUMMARY

• Presents a new framework that uses examples drawn from source data to illustrate complex schema mappings.

• Provides formal definitions of mappings, mapping examples and mapping operators and shows how they can be used to help a user understand the data and develop mappings.

Page 31

QUESTIONS?