ontology-based conceptual design of etl processes for both structured and semi-structured data

Post on 07-Jan-2016


Ontology-based Conceptual Design of ETL Processes for both Structured and Semi-structured Data

Dimitrios Skoutas, Alkis Simitsis
{dskoutas,asimi}@dblab.ece.ntua.gr

National Technical University of Athens
Dept. of Electrical and Computer Engineering

http://www.dblab.ece.ntua.gr

2

Outline
  Introduction
  Graph-based Datastore Representation
  Application Ontology Construction and Representation
  Datastore Annotation
  ETL Transformations
  Conclusions

3

Introduction

4

Extract-Transform-Load (ETL)
[Figure: ETL architecture with the labels Sources, Extract, DSA, Transform & Clean, Load, DW]

5

Problem description
Conceptual design of ETL processes
  is a critical task performed at the early stages of a DW project
  describes the integration of data from heterogeneous sources into the Data Warehouse
Two main goals
  specify inter-schema mappings
  identify appropriate transformations

6

Motivation
The problem of heterogeneity in data sources
  structural heterogeneity
    data stored under different schemata
  semantic heterogeneity
    different naming conventions (e.g., homonyms, synonyms)
    different representation formats (e.g., units of measurement, currencies, encodings)
    different ranges of values

7

Overview of our approach
Key idea
  an ontology-based approach to facilitate the conceptual design of an ETL scenario
An ontology
  is a “formal, explicit specification of a shared conceptualization”
  describes the knowledge in a domain in terms of classes, properties, and relationships between them
  machine processable
  formal semantics
  reasoning mechanisms
The Web Ontology Language (OWL) is used as the language for the ontology
  W3C recommendation
  based on Description Logics

8

Overview of our approach
Method
  Construct a graph representation for each datastore (the datastore graph)
  Construct a suitable application ontology (the ontology graph)
  Annotate the datastores
    establish mappings between the datastore graph and the ontology graph
  Apply reasoning techniques to
    select relevant sources
    identify required transformations

9

Graph-based Datastore Representation

10

Datastore schema
The schema S_D of a datastore comprises
  elements containing the actual data
  elements containing or referring to other elements

11

Datastore graph
  Each element e defined in the schema S_D is represented by a node v_e ∈ V_D.
  Each containment relationship between elements e1, e2 is represented by an edge (v1, v2).
  Each reference from element e1 to element e2 is represented by an edge (v1, v2).
  Each edge is assigned a label of the form [min, max] denoting the corresponding cardinality.
  Elements containing the actual data are represented by leaf nodes.
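The construction rules for the datastore graph can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `DatastoreGraph`, its methods, and the `[1, 1]` cardinalities are assumptions made for illustration; only the relation PS and its attributes come from the reference example.

```python
from dataclasses import dataclass, field


@dataclass
class DatastoreGraph:
    """One node per schema element; each edge carries a [min, max] cardinality."""
    nodes: set = field(default_factory=set)
    # Maps (source, target) -> (min, max); a max of None would mean "unbounded".
    edges: dict = field(default_factory=dict)

    def add_edge(self, source, target, min_card, max_card):
        """Add a containment or reference edge and register both endpoints."""
        self.nodes.update((source, target))
        self.edges[(source, target)] = (min_card, max_card)

    def leaves(self):
        """Nodes without outgoing edges, i.e. elements holding the actual data."""
        sources = {source for (source, _target) in self.edges}
        return self.nodes - sources


# Relation PS of datastore DS1; the [1, 1] cardinalities are assumed.
ds1 = DatastoreGraph()
for attribute in ["pid", "sid", "department", "address", "date", "cost", "qty"]:
    ds1.add_edge("PS", attribute, 1, 1)
```

In this sketch `ds1.leaves()` returns the seven attribute nodes, while `PS` is an internal node because it contains other elements.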

12

Reference example

Datastore   Schema
DW          PARTSUP(pkey, supplier, quantity, cost, city, address, date)
DS1         PS(pid, sid, department, address, date, cost, qty)
DS2

13

Reference example (cont’d)
Datastore graphs

14

Application Ontology Construction and Representation

15

Application Ontology
A suitable application ontology is constructed to model
  the concepts of the domain
  the relationships between those concepts
  the attributes characterizing each concept
  the different representation formats and (ranges of) values for each attribute

16

Application Ontology
The application ontology comprises
  a set of classes C = C_C ∪ C_T ∪ C_G
    C_C: classes representing domain concepts
    C_T: classes representing value types
    C_G: classes representing aggregate functions
  a set of properties P containing
    P_P: properties representing attributes of concepts or relationships between concepts
    property: convertsTo
    property: aggregates
    property: groups
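The structure of the application ontology can be sketched as a small Python container. This is an illustration under stated assumptions, not the OWL encoding used in the slides. The class `ApplicationOntology`, the concept `PartSupply`, the value types `Euros` and `Dollars`, and the attribute property `cost` are hypothetical names; only the three class groups and the property name `convertsTo` come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationOntology:
    concept_classes: set = field(default_factory=set)    # C_C: domain concepts
    type_classes: set = field(default_factory=set)       # C_T: value types
    aggregate_classes: set = field(default_factory=set)  # C_G: aggregate functions
    properties: list = field(default_factory=list)       # (name, domain, range) triples

    def classes(self):
        """C = C_C ∪ C_T ∪ C_G."""
        return self.concept_classes | self.type_classes | self.aggregate_classes

    def add_property(self, name, domain, range_):
        """Record a property; both endpoints must already be known classes."""
        known = self.classes()
        if domain not in known or range_ not in known:
            raise ValueError(f"unknown class in property {name!r}")
        self.properties.append((name, domain, range_))


# Hypothetical fragment: a concept with a cost attribute and two currencies.
onto = ApplicationOntology(
    concept_classes={"PartSupply"},
    type_classes={"Euros", "Dollars"},
)
onto.add_property("cost", "PartSupply", "Euros")     # attribute of a concept
onto.add_property("convertsTo", "Dollars", "Euros")  # conversion between formats
```

Keeping properties as (name, domain, range) triples mirrors the ontology graph, where classes become nodes and properties become edges.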

17

Ontology Graph
  A graph representation is specified for the ontology.
  Graph nodes represent classes in the ontology.
  Graph edges represent properties in the ontology.
  Different symbols are used for the different types of classes and properties.

18

Ontology Graph

19

Reference example (cont’d)
The application ontology graph

20

Datastore Annotation

21

Datastore annotation
  The semantic annotation of each datastore consists in establishing the appropriate mappings between the datastore graph G_S and the ontology graph G_O.
  Each internal node of G_S may be mapped to one concept-node of G_O.
  A leaf node of G_S may be mapped to one or more nodes of G_O of the following types:
    type-node
    format-node
    range-node
    aggregated-node
  A node may have zero or more mappings.
  Mappings are represented as node labels.

22

Datastore annotation
  A defined class is created in the ontology for each internal labeled node of the datastore graph.
  The definition for a node is constructed based on its neighbor labeled nodes.
  A neighbor labeled node of n is each node n′ such that:
    n′ is labeled
    there is a path p in the datastore graph from node n to node n′
    p contains no other labeled nodes, except n and n′
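The definition of neighbor labeled nodes can be sketched as a breadth-first traversal that stops expanding a path as soon as it reaches a labeled node. This is a minimal illustration, not the authors' algorithm. The function name, the adjacency-dictionary encoding, and the node and label names in the example are assumptions.

```python
from collections import deque


def neighbor_labeled_nodes(edges, labels, start):
    """Return the labeled nodes reachable from start via unlabeled nodes only.

    edges:  dict mapping a node to the list of nodes it points to
    labels: dict mapping each labeled node to its ontology label
    """
    found = set()
    visited = {start}
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        if node in labels:
            # A labeled node ends the path: nodes behind it are not neighbors.
            found.add(node)
        else:
            queue.extend(edges.get(node, []))
    return found


# Hypothetical datastore graph: "b" is unlabeled, all other nodes are labeled.
edges = {"root": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
labels = {"root": "R", "a": "A", "c": "C", "d": "D"}
neighbors = neighbor_labeled_nodes(edges, labels, "root")
```

Here `neighbors` contains `a` and `d`: node `d` is reached through the unlabeled node `b`, whereas `c` is excluded because the only path to it passes through the labeled node `a`.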

23

Reference example (cont’d)
Datastore mappings

24

Reference example (cont’d)
Datastore definitions

25

ETL Transformations

26

ETL Transformations
Generic types of ETL transformations

27

Generating ETL transformations
Two main steps
  select relevant sources to populate each DW element
  identify required data transformations between the sources and the DW

28

Generating ETL transformations
Selecting relevant sources
  a source node n_S, mapped to class c_S
  a target node n_T, mapped to class c_T
  n_S is a provider for n_T, if
    c_S and c_T have a common superclass
      ensures that the integrated data records have the same semantics
    c_S and c_T are not disjoint
      prevents data integration between datastores with conflicting constraints
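The two provider conditions can be sketched over a hand-written class hierarchy. This is a simplified stand-in for a Description Logics reasoner, not the reasoning procedure used by the authors. The functions `ancestors` and `is_provider`, the class names, and the decision to ignore the top class `Thing` when looking for a common superclass are all assumptions.

```python
def ancestors(cls, parents):
    """Return cls together with all of its superclasses.

    parents: dict mapping a class to the list of its direct superclasses
    """
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def is_provider(c_s, c_t, parents, disjoint_pairs, top="Thing"):
    """Check both provider conditions for source class c_s and target class c_t."""
    source_up = ancestors(c_s, parents)
    target_up = ancestors(c_t, parents)
    # Condition 1: a common superclass other than the top class (assumption).
    if not (source_up & target_up) - {top}:
        return False
    # Condition 2: not disjoint; a declared disjointness is inherited by subclasses.
    for a in source_up:
        for b in target_up:
            if frozenset((a, b)) in disjoint_pairs:
                return False
    return True


# Hypothetical hierarchy and disjointness axiom.
parents = {
    "EuropeanSupplier": ["Supplier"],
    "AmericanSupplier": ["Supplier"],
    "Supplier": ["Thing"],
    "Part": ["Thing"],
}
disjoint_pairs = {frozenset(("EuropeanSupplier", "AmericanSupplier"))}
```

With these inputs, `is_provider("EuropeanSupplier", "Supplier", parents, disjoint_pairs)` holds, while the pair `EuropeanSupplier` and `AmericanSupplier` fails the disjointness check and the pair `Part` and `Supplier` fails the common-superclass check.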

29

Generating ETL transformations
Identifying data transformations (I)
  a RETRIEVE operation for each provider node n
  a MERGE operation to combine data from several provider nodes
  an EXTRACT operation to extract a portion of data from a provider node

30

Generating ETL transformations
Identifying data transformations (II)
  if c_S ≡ c_T or c_S ⊏ c_T, no transformations are required
  if c_T ⊏ c_S, AGGREGATE, FILTER and/or MINCARD/MAXCARD operations are required
  otherwise, the operations of the previous case plus CONVERT operations are required
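The three cases above can be sketched as a small decision function. This is an illustration under simplifying assumptions, not the authors' procedure: equivalence is approximated by identical class names, subsumption is read off a hand-written hierarchy instead of a reasoner, and the function returns the candidate operations without choosing among them. The function names and the class names in the example are hypothetical.

```python
def is_subclass(sub, sup, parents):
    """True if sub is a strict subclass of sup.

    parents: dict mapping a class to the list of its direct superclasses
    """
    stack = list(parents.get(sub, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == sup:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def required_operations(c_s, c_t, parents):
    """Candidate operations between source class c_s and target class c_t."""
    # Case 1: the source is equivalent to, or more specific than, the target.
    if c_s == c_t or is_subclass(c_s, c_t, parents):
        return []
    restricting = ["AGGREGATE", "FILTER", "MINCARD", "MAXCARD"]
    # Case 2: the target is more specific, so source data must be restricted.
    if is_subclass(c_t, c_s, parents):
        return restricting
    # Case 3: neither subsumes the other, so a conversion is needed as well.
    return restricting + ["CONVERT"]


# Hypothetical value-type hierarchy.
parents = {"CostInEuros": ["Cost"], "CostInDollars": ["Cost"], "Cost": []}
```

For example, `required_operations("CostInDollars", "CostInEuros", parents)` falls into the third case and includes `CONVERT`, while `required_operations("CostInEuros", "Cost", parents)` returns an empty list.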

31

Generating ETL transformations
Identifying data transformations (III)
  a JOIN operation to combine recordsets from nodes whose corresponding classes are related by a property
  a UNION operation, followed by a DD operation, to combine recordsets from nodes whose corresponding classes have a common superclass
  a STORE operation to denote loading of data to the target datastore

32

Reference example (cont’d)
Provider nodes and transformations for DS2

33

Conclusions

34

Conclusions

A graph-based representation, the datastore graph, as a common model for the datastores.

A suitable application ontology and a corresponding graph representation, ontology graph.

Datastore annotation through mappings from the datastore graph to the ontology graph.

Reasoning on the mappings to identify relevant sources and required transformations.

35

Current and Future Work

Semi-automatic construction of the application ontology

Semi-automatic annotation of the datastores

Executable workflow

Evaluation on real-world ETL scenarios

Maintenance/adaptation of the ETL workflow

36

Thank You

37

Questions
