Data Exchange with Data-Metadata Translations

Mauricio A. Hernández (IBM Almaden Research Center), Wang-Chiew Tan (UC Santa Cruz), Paolo Papotti (Università Roma Tre). VLDB 2008, August 24 -- Auckland, New Zealand



TRANSCRIPT

Page 1: Data Exchange with  Data-Metadata Translations

Data Exchange with Data-Metadata Translations

Mauricio A. Hernández, IBM Almaden Research Center

Wang-Chiew Tan, UC Santa Cruz

Paolo Papotti, Università Roma Tre

VLDB 2008

August 24 -- Auckland, New Zealand

Page 2: Data Exchange with  Data-Metadata Translations

Data-Metadata Translations

• Data exchange scenarios may involve metadata transformations.

– E.g., Pivot/Unpivot in spreadsheets.

[example from Miller98]

• Mapping systems support Data-to-Data transformations with fixed schemas.

• Goal: Extend mapping systems to support Data-Metadata Translations.

Page 3: Data Exchange with  Data-Metadata Translations

Mapping Systems

[Diagram: a GUI relates Source schema S to Target schema T. The mapping is kept in a declarative (internal) representation and turned into executable code (XSLT, XQuery, Java), which performs the data exchange from source instance I to target instance J.]

Example systems: IBM Clio, HepTox, MS ADO.net, Altova MapForce, StylusStudio, BEA Aqualogic

Page 4: Data Exchange with  Data-Metadata Translations

Outline

1. Data and Metadata translations

– Data-to-Data
– Metadata-to-Data
– Data-to-Metadata

2. Generation Algorithms

– Mapping Generation
– Query Generation

3. Results & Discussion

– Experiments
– Related Work
– Conclusion

Page 5: Data Exchange with  Data-Metadata Translations

Data-to-Data

• Mapping Generation Algorithm [PVMHF 2002]:

– Input: source and target schemas, and correspondences.

– Output: a declarative schema mapping.

• For example:

Source: Rcd
  Sales: SetOf Rcd
    country
    region
    style
    shipdate
    units
    price

Target: Rcd
  CountrySales: SetOf Rcd
    country
    Sales: SetOf Rcd
      style
      shipdate
      units
      id

for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and
      $c.shipdate = $s.shipdate and $c.units = $s.units
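A minimal Python sketch of what the mapping above asks of a target instance, assuming rows are plain dicts. The names `exchange` and `satisfies`, the choice to group sales by country, and leaving `id` as `None` (no correspondence feeds it) are assumptions made for illustration; this is not the output of the mapping generation algorithm.

```python
from collections import defaultdict

# Illustrative rows shaped like Source.Sales.
source_sales = [
    {"country": "USA", "region": "East", "style": "Tee",
     "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "West", "style": "Tee",
     "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee",
     "shipdate": "02-08", "units": 12, "price": 2000},
]


def exchange(sales):
    """Build Target.CountrySales, nesting each sale under its country."""
    by_country = defaultdict(list)
    for s in sales:
        by_country[s["country"]].append(
            {"style": s["style"], "shipdate": s["shipdate"],
             "units": s["units"], "id": None}  # id has no source value
        )
    return [{"country": c, "Sales": rows} for c, rows in by_country.items()]


def satisfies(sales, country_sales):
    """Check the mapping: every $s has a matching $t and nested $c."""
    return all(
        any(
            t["country"] == s["country"]
            and any(
                c["style"] == s["style"]
                and c["shipdate"] == s["shipdate"]
                and c["units"] == s["units"]
                for c in t["Sales"]
            )
            for t in country_sales
        )
        for s in sales
    )


target = exchange(source_sales)
print(satisfies(source_sales, target))
```

The mapping only requires that matching target records exist, so an ungrouped target (one CountrySales record per sale) would pass the same check.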

Page 6: Data Exchange with  Data-Metadata Translations

Mappings

• Query Generation into multiple query languages:

– Input: a data-to-data schema mapping

– Output: a query script (XQuery, XSLT, SQL, etc.)

Mapping:

for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and
      $c.shipdate = $s.shipdate and $c.units = $s.units

Generated XQuery:

for $x0 in $doc/Source/Sales
return (
  <CountrySales>
    <country> { $x0/country/text() } </country>
    …

Page 7: Data Exchange with  Data-Metadata Translations

"State-of-the-art" Metadata-to-Data

How can we transform the following source data into the corresponding target?

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Source.Sales
month  USA  UK   Italy
Jan    120  223  89
Feb    83   168  56

Target.Sales
month  country  units
Jan    USA      120
Jan    UK       223
Jan    Italy    89
Feb    USA      83
Feb    UK       168
Feb    Italy    56

Schema mapping m1 (constant "USA"):

m1: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "USA" and $t.units = $s.USA

Page 8: Data Exchange with  Data-Metadata Translations

"State-of-the-art" Metadata-to-Data

Schema mapping m2 (constant "UK"), in addition to m1:

m2: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "UK" and $t.units = $s.UK

Page 9: Data Exchange with  Data-Metadata Translations

"State-of-the-art" Metadata-to-Data

Schema mapping m3 (constant "Italy"), in addition to m1 and m2:

m3: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "Italy" and $t.units = $s.Italy

One mapping is needed per country column, each with its own constant.

Page 10: Data Exchange with  Data-Metadata Translations

Metadata-to-Data: Our solution

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Placeholder <<countries>> over USA, UK, Italy:
  label
  value

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

– Select the elements to group into the placeholder.
– Copy elements' labels.
– Copy elements' values.

Source.Sales
month  USA  UK   Italy
Jan    120  223  89
Feb    83   168  56

Target.Sales
month  country  units
Jan    USA      120
Jan    UK       223
Jan    Italy    89
Feb    USA      83
Feb    UK       168
Feb    Italy    56

MetadatA-Data (MAD) mapping:

for $s in Source.Sales, $c in {"USA", "UK", "Italy"}
exists $t in Target.Sales
where $t.month = $s.month and $t.country = $c and $t.units = $s.($c)

– {"USA", "UK", "Italy"} is a set of labels (strings).
– In $t.country = $c, $c is a label value.
– $s.($c) is a dynamic selection of the source element.
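The single MAD mapping can be read as a nested loop over source rows and placeholder labels. The Python sketch below assumes rows are dicts keyed by column label; `unpivot` is a hypothetical helper name, and the data is the Source.Sales instance from the slides.

```python
source_sales = [
    {"month": "Jan", "USA": 120, "UK": 223, "Italy": 89},
    {"month": "Feb", "USA": 83, "UK": 168, "Italy": 56},
]

# The placeholder: a set of labels (strings), not values.
countries = ["USA", "UK", "Italy"]


def unpivot(rows, labels):
    """for $s in rows, $c in labels: emit one target record per pair."""
    return [
        {
            "month": s["month"],
            "country": c,    # the label itself becomes data
            "units": s[c],   # s[c] plays the role of $s.($c)
        }
        for s in rows
        for c in labels
    ]


target_sales = unpivot(source_sales, countries)
for row in target_sales:
    print(row["month"], row["country"], row["units"])
```

Adding a country column to the source then means extending `countries` rather than writing another mapping.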

Page 11: Data Exchange with  Data-Metadata Translations

Data-to-Metadata

Now we want to support the opposite operation [example from Miller98].

Source: Rcd
  StockTicker: SetOf Rcd
    time
    symbol
    price

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (dynamic element)
      label
      value

• The target schema depends on the source data.

• We define a target template: Nested Dynamic Output Schemas (ndos).

• Run-time: the dynamic element defines the target instance and the target schema.

Page 12: Data Exchange with  Data-Metadata Translations

Data-to-Metadata: Heterogeneous records

Source: Rcd
  StockTicker: SetOf Rcd
    time
    symbol
    price

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (dynamic element)
      label
      value

Consider this mapping and this source instance:

Source Instance
StockTicker (time: 0900, symbol: MSFT, price: 27.20)
StockTicker (time: 0900, symbol: IBM, price: 120.00)
StockTicker (time: 0905, symbol: MSFT, price: 27.30)

There are two possible interpretations for the target ndos.

First alternative: Heterogeneous target records

Computed Target Instance
Stockquotes (time: 0900, MSFT: 27.20)
Stockquotes (time: 0900, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30)

Computed Target Schema
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols: Choice
      MSFT
      IBM
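As a rough illustration of the heterogeneous interpretation, the Python sketch below turns each StockTicker record into a target record whose second label comes from the data. Records are assumed to be dicts, and `pivot_heterogeneous` is a hypothetical name.

```python
stock_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def pivot_heterogeneous(rows):
    """The value of `symbol` becomes a label; `price` becomes its value."""
    return [{"time": r["time"], r["symbol"]: r["price"]} for r in rows]


stockquotes = pivot_heterogeneous(stock_ticker)
for record in stockquotes:
    print(record)
```

Each output record carries only the label it was built from, so records in the same set can have different labels.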

Page 13: Data Exchange with  Data-Metadata Translations

Data-to-Metadata: Homogeneous records

Consider the same mapping (StockTicker to the Stockquotes ndos) and this source instance:

Source Instance
StockTicker (time: 0900, symbol: MSFT, price: 27.20)
StockTicker (time: 0900, symbol: IBM, price: 120.00)
StockTicker (time: 0905, symbol: MSFT, price: 27.30)

There are two possible interpretations for the target ndos.

Second alternative: Homogeneous target records

Computed Target Instance
Stockquotes (time: 0900, MSFT: 27.20, IBM: null)
Stockquotes (time: 0900, MSFT: null, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30, IBM: null)

Computed Target Schema
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    MSFT
    IBM

Page 14: Data Exchange with  Data-Metadata Translations

Data-to-Metadata: Homogeneous records

Natural solution for the relational data model:

Stockquotes(time: 0900, MSFT: 27.20, IBM: null)
Stockquotes(time: 0900, MSFT: null, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30, IBM: null)

Homogeneity Constraint: "For every pair of tuples t1 and t2, if a is a label in t1, then a is a label in t2"

for $t1 in Target.Stockquotes, $t2 in Target.Stockquotes, $a in dom($t1)
exists $a' in dom($t2)
where $a = $a'

Natural solution for semi-structured data models (XSD, DTD, JSON):

Stockquotes(time: 0900, MSFT: 27.20)
Stockquotes(time: 0900, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30)
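The homogeneity constraint translates almost literally into a check over pairs of records. The Python sketch below assumes records are dicts and uses `None` for null; `is_homogeneous` and `homogenize` are hypothetical names, and padding with nulls is one simple way to satisfy the constraint, not necessarily how the system enforces it.

```python
def is_homogeneous(records):
    """For every pair t1, t2: each label of t1 is also a label of t2."""
    return all(set(t1) <= set(t2) for t1 in records for t2 in records)


def homogenize(records):
    """Pad every record with None for the labels it is missing."""
    labels = []
    for rec in records:
        for label in rec:
            if label not in labels:
                labels.append(label)  # keep first-seen label order
    return [{label: rec.get(label) for label in labels} for rec in records]


heterogeneous = [
    {"time": "0900", "MSFT": 27.20},
    {"time": "0900", "IBM": 120.00},
    {"time": "0905", "MSFT": 27.30},
]
homogeneous = homogenize(heterogeneous)

print(is_homogeneous(heterogeneous))
print(is_homogeneous(homogeneous))
```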

Page 15: Data Exchange with  Data-Metadata Translations

MAD Mapping Generation

Source: Rcd
  Sales: SetOf Rcd
    country
    region
    style
    shipdate
    units
    price

Target: Rcd
  ByShipdateCountry: SetOf Choice
    dates (dynamic element)
      label1
      value1: Rcd
        countries (dynamic element)
          label2
          value2: SetOf Rcd
            style
            units
            price

Source.Sales
country  region  style  shipdate  units  price
USA      East    Tee    12-07     11     1200
USA      East    Elec.  12-07     12     3600
USA      West    Tee    01-08     10     1600
UK       West    Tee    02-08     12     2000

Target instance:

<ByShipDateCountry>
  <12-07>
    <USA> <style>Tee</style><units>11</units><price>1200</price> </USA>
    <USA> <style>Elec.</style><units>12</units><price>3600</price> </USA>
  </12-07>
  <01-08>
    <USA> <style>Tee</style><units>10</units><price>1600</price> </USA>
  </01-08>
  <02-08>
    <UK> <style>Tee</style><units>12</units><price>2000</price> </UK>
  </02-08>
</ByShipDateCountry>

Page 16: Data Exchange with  Data-Metadata Translations

MAD Mapping Generation

Three steps:

1. Modify schemas with dynamic placeholders

2. Compile mappings

3. Simplify mapping

Compiled mapping. This is what we get from Clio [PVMHF 02]:

for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $y in dates, $u in case $t of $y,
       $z in countries, $v in $u.($z)
where $y = $s.shipdate and $z = $s.country and
      $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and
      $u.($z) = SK[$s.shipdate, $s.country]

Simplified mapping:

for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $u in case $t of $s.shipdate,
       $v in $u.($s.country)
where $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and
      $u.($s.country) = SK[$s.shipdate, $s.country]
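One way to read the simplified mapping is as a grouping in which the Skolem term SK[$s.shipdate, $s.country] names the set that each sale lands in. The Python sketch below assumes nested dicts stand in for the dynamic elements; `by_shipdate_country` and the dict-of-dicts layout are illustrative choices, not the generated query.

```python
source_sales = [
    {"country": "USA", "region": "East", "style": "Tee",
     "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "East", "style": "Elec.",
     "shipdate": "12-07", "units": 12, "price": 3600},
    {"country": "USA", "region": "West", "style": "Tee",
     "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee",
     "shipdate": "02-08", "units": 12, "price": 2000},
]


def by_shipdate_country(sales):
    """Nest sales under a shipdate label, then under a country label."""
    sets = {}    # Skolem key (shipdate, country) -> the set it identifies
    target = {}  # shipdate label -> {country label -> set of records}
    for s in sales:
        key = (s["shipdate"], s["country"])  # SK[$s.shipdate, $s.country]
        group = sets.setdefault(key, [])
        target.setdefault(s["shipdate"], {})[s["country"]] = group
        group.append(
            {"style": s["style"], "units": s["units"], "price": s["price"]}
        )
    return target


nested = by_shipdate_country(source_sales)
for date, countries in nested.items():
    for country, rows in countries.items():
        print(date, country, rows)
```

Sales that agree on shipdate and country share one Skolem key, which is why the two 12-07 USA sales end up in the same set.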

Page 17: Data Exchange with  Data-Metadata Translations

Query Generation: two-phase algorithm [PVMHF 02]

Phase 1: Q1 shreds the source instance I over relational views of the target schema.

Phase 2: Q2 assembles the target instance J from the relational views.

[Diagram: I (conforms to S) -> Q1 -> relational views r -> Q2 -> J (conforms to T, with nested sets T1 to T4)]

Page 18: Data Exchange with  Data-Metadata Translations

New Query Generation

Phase 1: Q1 shreds the source instance I over relational views of the target ndos.

Phase 2: Q2 assembles the target instance J from the relational views.

Q3 computes the target schema T.

Q4 is the optional post-processing.

[Diagram: I (conforms to S) -> Q1 -> relational views r -> Q2 -> J (conforms to T); Q3 derives T from the ndos; Q4 post-processes the result]
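The phases can be sketched in Python on the StockTicker example. This is a simplification under stated assumptions: the relational view is a single flat list of (time, label, value) tuples, the function names `q1_shred`, `q2_assemble` and `q3_schema` are hypothetical, and the schema is returned as a plain string in the slides' notation. The optional Q4 step is left out.

```python
stock_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def q1_shred(rows):
    """Phase 1: flatten the source into tuples of a relational view."""
    return [(r["time"], r["symbol"], r["price"]) for r in rows]


def q2_assemble(view):
    """Phase 2: build the target records from the view tuples."""
    return [{"time": t, label: value} for (t, label, value) in view]


def q3_schema(view):
    """Compute the target schema from the labels found in the view."""
    labels = []
    for _, label, _ in view:
        if label not in labels:
            labels.append(label)
    return ("Rcd Stockquotes: SetOf Rcd time symbols: Choice "
            + " ".join(labels))


view = q1_shred(stock_ticker)
instance = q2_assemble(view)
schema = q3_schema(view)
print(instance)
print(schema)
```

Both the instance and the schema are derived from the same view, so the schema lists exactly the labels that occur in the data.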

Page 19: Data Exchange with  Data-Metadata Translations

MAD Clio vs. Commercial Tools

[Chart: query execution time [s] (log scale, 1 to 100000) against number of distinct labels (0 to 600). Series shown: naive query (commercial tool).]

Page 20: Data Exchange with  Data-Metadata Translations

MAD Clio vs. Commercial Tools

[Chart: query execution time [s] (log scale, 1 to 100000) against number of distinct labels (0 to 600). Legend: naive query, dynamic query, static query. Annotations: optimized query, MAD Clio.]

48 source labels (10 MB): naïve 183 s, dynamic 14 s, optimized 10 s

Page 21: Data Exchange with  Data-Metadata Translations

MAD Clio Performance

12 target labels (10 MB): naïve 590 s, optimized 80 s [1 phase: 3 s]

Page 22: Data Exchange with  Data-Metadata Translations

Some Related Work

• Lots of related work in the relational setting:

– FIRA/FISQL [Wyss, Robertson 2005] has an excellent survey.

– SchemaSQL [Lakshmanan, Sadri, Subramanian 1996], FIRA/FISQL [Wyss, Robertson 2005]

  • Extensions to SQL to handle metadata as data

  • Only relational dynamic output schemas

  • Language and semantics, NO transformations from GUI

• In XML settings:

– HepTox [BCHLP 2005], commercial mapping tools [Altova MapForce, MS ADO.net, StylusStudio, BEA (Oracle) Aqualogic]

  • No dynamic elements in the target

Page 23: Data Exchange with  Data-Metadata Translations

MAD Clio

Data exchange with data-metadata support: Data to Data is a special case

[Diagram: Source schema S, Target schema T, GUI, Declarative (internal) representation, Executable code (XSLT, XQuery, Java)]

• New construct to iterate over elements' labels: placeholder

• Target schema can be incomplete: nested dynamic output schema (ndos)

• New constructs for the mapping language

• New mapping & query generation algorithms, including a query to generate the target schema

Page 24: Data Exchange with  Data-Metadata Translations

Thank you.

Questions?

Page 25: Data Exchange with  Data-Metadata Translations

Metadata-to-Metadata

Metadata to Metadata: placeholder to dynamic element

Source instance:

...<properties name="price" lang="en-us" date="01-01-2008" ... > <pval>48.15</pval></properties> ...

Target instance:

...<price value="48.15" lang="en-us" date="01-01-2008" ... /> ...

Source: Rcd
  properties: SetOf Rcd
    @name
    @lang
    @date
    …
    @format
    pval

Placeholder <<attrs>> over the attributes:
  label
  value

Target: Rcd
  <<names>> (dynamic element)
    label1
    value1: SetOf Rcd
      @value
      <<elems>> (dynamic element)
        label2
        value2

for $x1 in Source.properties, $x2 in { @lang, @date, …, @format }
exists $y1 in Target.($x1.@name)
where $y1.@value = $x1.pval and $y1.($x2) = $x1.($x2)
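A small Python sketch of this metadata-to-metadata translation, using the standard library's `xml.etree.ElementTree`. The `translate` function, the `Source` and `Target` wrapper elements, and the attribute list are assumptions: only the attributes named on the slide (lang, date, format) are iterated, and the ones elided by "…" are left out.

```python
import xml.etree.ElementTree as ET

source_xml = """
<Source>
  <properties name="price" lang="en-us" date="01-01-2008">
    <pval>48.15</pval>
  </properties>
</Source>
"""


def translate(source_root, attrs=("lang", "date", "format")):
    """Turn each <properties name="X"> into an element named X."""
    target = ET.Element("Target")
    for prop in source_root.findall("properties"):
        # $x1.@name is data in the source and an element name in the target.
        elem = ET.SubElement(target, prop.get("name"))
        elem.set("value", prop.findtext("pval", default=""))
        for a in attrs:  # the placeholder iterates over attribute labels
            if a in prop.attrib:
                elem.set(a, prop.get(a))  # $y1.($x2) = $x1.($x2)
    return target


target = translate(ET.fromstring(source_xml))
print(ET.tostring(target, encoding="unicode"))
```

Attribute labels are copied as labels, while the value of `@name` moves from data to metadata, so both the placeholder and the dynamic element are exercised in one pass.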