Data Exchange with Data-Metadata Translations
Paolo Papotti, Università Roma Tre. Wang-Chiew Tan, UC Santa Cruz. Mauricio A. Hernández, IBM Almaden Research Center. VLDB 2008. August 24 -- Auckland, New Zealand. Data exchange scenarios may involve metadata transformations.
Data Exchange with Data-Metadata Translations
Paolo Papotti, Università Roma Tre
Wang-Chiew Tan, UC Santa Cruz
Mauricio A. Hernández, IBM Almaden Research Center
VLDB 2008
August 24 -- Auckland, New Zealand
Data-Metadata Translations
• Data exchange scenarios may involve metadata transformations.
– E.g., Pivot/Unpivot in spreadsheets. [example from Miller98]
• Mapping systems support Data-to-Data transformations with fixed schemas.
• Goal: Extend mapping systems to support Data-Metadata Translations.
Mapping Systems
Source schema S
Target schema T
GUI
Declarative (internal) representation
Executable code (XSLT, XQuery, Java)
Data exchange: source instance I, target instance J
Examples of mapping systems:
IBM Clio
HepTox
MS ADO.net
Altova MapForce
StylusStudio
BEA Aqualogic
Outline
1. Data and Metadata translations
– Data-to-Data
– Metadata-to-Data
– Data-to-Metadata
2. Generation Algorithms
– Mapping Generation
– Query Generation
3. Results & Discussion
– Experiments
– Related Work
– Conclusion
Data-to-Data
• Mapping Generation Algorithm: [PVMHF 2002]
– Input: Source and Target schemas, and correspondences.
– Output: declarative schema mapping
• For example:
Source: Rcd
  Sales: SetOf Rcd
    country, region, style, shipdate, units, price
Target: Rcd
  CountrySales: SetOf Rcd
    country
    Sales: SetOf Rcd
      style, shipdate, units, id
Mappings
for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and $c.shipdate = $s.shipdate and $c.units = $s.units
• Query Generation into multiple query languages:
– Input: a data-to-data schema mapping
– Output: a query script (XQuery, XSLT, SQL, etc.)
for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and $c.shipdate = $s.shipdate and $c.units = $s.units
for $x0 in $doc/Source/Sales return ( <CountrySales>
<country> { $x0/country/text() } </country> …
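The effect of the data-to-data mapping can be sketched in Python. This is only an illustration under stated assumptions, not the query Clio generates: the function name `exchange_data_to_data` is hypothetical, the rows are the Source.Sales sample used later in the deck, records are assumed to be grouped by country (the slide does not show the grouping condition), and `id` is left as None because no correspondence covers it.

```python
from collections import defaultdict

# Rows of Source.Sales, taken from the sample table used later in the deck.
source_sales = [
    {"country": "USA", "region": "East", "style": "Tee", "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "East", "style": "Elec.", "shipdate": "12-07", "units": 12, "price": 3600},
    {"country": "USA", "region": "West", "style": "Tee", "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee", "shipdate": "02-08", "units": 12, "price": 2000},
]


def exchange_data_to_data(sales):
    """Build Target.CountrySales from Source.Sales (hypothetical helper).

    Assumption: one CountrySales record per distinct country.
    """
    grouped = defaultdict(list)
    for s in sales:
        # $c.style = $s.style, $c.shipdate = $s.shipdate, $c.units = $s.units;
        # id has no source correspondence, so it stays unset.
        grouped[s["country"]].append(
            {"style": s["style"], "shipdate": s["shipdate"], "units": s["units"], "id": None}
        )
    # $t.country = $s.country
    return [{"country": country, "Sales": rows} for country, rows in grouped.items()]


country_sales = exchange_data_to_data(source_sales)
```

Both schemas are fixed here: every label in the target (`country`, `style`, `shipdate`, `units`, `id`) is known before any data is read, which is what distinguishes this case from the data-metadata translations that follow.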
“State-of-the-art” Metadata-to-Data
How can we transform the following source data into the corresponding target?
Source: Rcd
  Sales: SetOf Rcd
    month, USA, UK, Italy
Target: Rcd
  Sales: SetOf Rcd
    month, country, units

Source.Sales
month  USA  UK   Italy
Jan    120  223  89
Feb    83   168  56

Target.Sales
month  country  units
Jan    USA      120
Jan    UK       223
Jan    Italy    89
Feb    USA      83
Feb    UK       168
Feb    Italy    56

Schema mapping m1 (“USA”):
m1: for $s in Source.Sales exists $t in Target.Sales where $t.month = $s.month and $t.country = “USA” and $t.units = $s.USA
Schema mapping m2 (“UK”):
m2: for $s in Source.Sales exists $t in Target.Sales where $t.month = $s.month and $t.country = “UK” and $t.units = $s.UK
Schema mapping m3 (“Italy”):
m3: for $s in Source.Sales exists $t in Target.Sales where $t.month = $s.month and $t.country = “Italy” and $t.units = $s.Italy
Metadata-to-Data: Our solution
Source: Rcd
  Sales: SetOf Rcd
    month, USA, UK, Italy
Target: Rcd
  Sales: SetOf Rcd
    month, country, units
Placeholder countries (label, value):
– Select the elements to group
– Copy elements’ labels
– Copy elements’ values

Source.Sales
month  USA  UK   Italy
Jan    120  223  89
Feb    83   168  56

Target.Sales
month  country  units
Jan    USA      120
Jan    UK       223
Jan    Italy    89
Feb    USA      83
Feb    UK       168
Feb    Italy    56

MetadatA-Data (MAD) mapping:
for $s in Source.Sales, $c in {“USA”, “UK”, “Italy”}
exists $t in Target.Sales
where $t.month = $s.month and $t.country = $c and $t.units = $s.($c)
– {“USA”, “UK”, “Italy”}: set of labels (strings)
– $c: is a label value
– $s.($c): dynamic selection of the source element
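A minimal Python sketch of what the MAD mapping expresses, using the sample Source.Sales rows from the slide. The function name `metadata_to_data` and the dict-based record encoding are assumptions made for illustration; they are not part of Clio.

```python
# Source.Sales rows from the slide: country names are column labels (metadata).
source_sales = [
    {"month": "Jan", "USA": 120, "UK": 223, "Italy": 89},
    {"month": "Feb", "USA": 83, "UK": 168, "Italy": 56},
]

# The placeholder: the set of labels to iterate over, {"USA", "UK", "Italy"}.
countries = ["USA", "UK", "Italy"]


def metadata_to_data(sales, labels):
    """One target record per (source record, label) pair (hypothetical helper)."""
    target = []
    for s in sales:          # for $s in Source.Sales
        for c in labels:     # $c in {"USA", "UK", "Italy"}
            target.append({
                "month": s["month"],  # $t.month = $s.month
                "country": c,         # $t.country = $c   (the label becomes data)
                "units": s[c],        # $t.units = $s.($c) (dynamic element selection)
            })
    return target


target_sales = metadata_to_data(source_sales, countries)
```

A single loop over the label set replaces the three near-identical mappings m1, m2 and m3: adding a country means extending `countries`, not writing another mapping.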
Data-to-Metadata
Now we want to support the opposite operation [example from Miller98]
Source: Rcd
  StockTicker: SetOf Rcd
    time, symbol, price
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (label, value): dynamic element
The target schema depends on the source data
We define a target template: Nested Dynamic Output Schemas (ndos)
Run-time: The dynamic element defines the target instance and the target schema.
Data-to-Metadata: Heterogeneous records
Source: Rcd
  StockTicker: SetOf Rcd
    time, symbol, price
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (label, value)
Consider this mapping and this source instance:
Source Instance
StockTicker (time: 0900, symbol: MSFT, price: 27.20)
StockTicker (time: 0900, symbol: IBM, price: 120.00)
StockTicker (time: 0905, symbol: MSFT, price: 27.30)
There are two possible interpretations for the target ndos:
First alternative: Heterogeneous target records
Computed Target Schema
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols: Choice
      MSFT
      IBM
Computed Target Instance
Stockquotes (time: 0900, MSFT: 27.20)
Stockquotes (time: 0900, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30)
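The heterogeneous interpretation can be sketched in Python as follows. This is an illustration only: the helper names `data_to_metadata` and `computed_choice_labels` are hypothetical, records are encoded as dicts, and the sketch assumes no symbol value collides with the fixed label `time`.

```python
# Source instance from the slide.
stock_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def data_to_metadata(ticker):
    """Each source record yields one target record whose second label is the
    *value* of symbol (hypothetical helper)."""
    return [{"time": s["time"], s["symbol"]: s["price"]} for s in ticker]


def computed_choice_labels(records):
    """Labels of the Choice in the computed target schema, in order of first use."""
    labels = []
    for record in records:
        for label in record:
            if label != "time" and label not in labels:
                labels.append(label)
    return labels


stockquotes = data_to_metadata(stock_ticker)
choice_labels = computed_choice_labels(stockquotes)
```

The records do not share a common set of labels, so the schema can only be written down after the data has been read: `choice_labels` plays the role of the computed `symbols: Choice` node.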
Data-to-Metadata: Homogeneous records
Source: Rcd
  StockTicker: SetOf Rcd
    time, symbol, price
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (label, value)
Consider this mapping and this source instance:
Source Instance
StockTicker (time: 0900, symbol: MSFT, price: 27.20)
StockTicker (time: 0900, symbol: IBM, price: 120.00)
StockTicker (time: 0905, symbol: MSFT, price: 27.30)
There are two possible interpretations for the target ndos:
Second alternative: Homogeneous target records
Computed Target Schema
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    MSFT
    IBM
Computed Target Instance
Stockquotes (time: 0900, MSFT: 27.20, IBM: null)
Stockquotes (time: 0900, MSFT: null, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30, IBM: null)
Data-to-Metadata: Homogeneous records
Source: Rcd
  StockTicker: SetOf Rcd
    time, symbol, price
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (label, value)
Natural solution for semi-structured data models (XSD, DTD, JSON)
Stockquotes(time: 0900, MSFT: 27.20)
Stockquotes(time: 0900, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30)
Natural solution for the Relational data model
Stockquotes(time: 0900, MSFT: 27.20, IBM: null)
Stockquotes(time: 0900, MSFT: null, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30, IBM: null)
Homogeneity Constraint: “For every pair of tuples t1 and t2, if a is a label in t1, then a is a label in t2”
for $t1 in Target.Stockquotes, $t2 in Target.Stockquotes, $a in dom($t1)
exists $a’ in dom($t2)
where $a = $a’
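The homogeneity constraint can be checked, and enforced by padding with nulls, in a few lines of Python. This is a sketch under assumptions: records are dicts, `dom(t)` is taken to be the set of keys of a record, null is represented by None, and the function names are hypothetical.

```python
def is_homogeneous(records):
    """For every pair t1, t2: every label of t1 is also a label of t2."""
    return all(set(t1) <= set(t2) for t1 in records for t2 in records)


def homogenize(records):
    """Pad each record with None for the labels it lacks (hypothetical helper)."""
    labels = []
    for record in records:
        for label in record:
            if label not in labels:
                labels.append(label)
    return [{label: record.get(label) for label in labels} for record in records]


# Heterogeneous target records from the slide.
heterogeneous = [
    {"time": "0900", "MSFT": 27.20},
    {"time": "0900", "IBM": 120.00},
    {"time": "0905", "MSFT": 27.30},
]

homogeneous = homogenize(heterogeneous)
```

Padding is one way to satisfy the constraint and matches the relational instance on the slide; whether a tool pads during the exchange or in a separate pass is a design choice this sketch does not settle.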
MAD Mapping Generation
Source: Rcd
  Sales: SetOf Rcd
    country, region, style, shipdate, units, price
Target: Rcd
  ByShipdateCountry: SetOf Choice
    dates: label1
      value1: Rcd
        countries: label2
          value2: SetOf Rcd
            style, units, price

Source.Sales
country  region  style  shipdate  units  price
USA      East    Tee    12-07     11     1200
USA      East    Elec.  12-07     12     3600
USA      West    Tee    01-08     10     1600
UK       West    Tee    02-08     12     2000

<ByShipDateCountry>
  <12-07>
    <USA> <style>Tee</style><units>11</units><price>1200</price> </USA>
    <USA> <style>Elec.</style><units>12</units><price>3600</price> </USA>
  </12-07>
  <01-08>
    <USA> <style>Tee</style><units>10</units><price>1600</price> </USA>
  </01-08>
  <02-08>
    <UK> <style>Tee</style><units>12</units><price>2000</price> </UK>
  </02-08>
</ByShipDateCountry>
MAD Mapping Generation
Source: Rcd
  Sales: SetOf Rcd
    country, region, style, shipdate, units, price
Target: Rcd
  ByShipdateCountry: SetOf Choice
    dates: label1
      value1: Rcd
        countries: label2
          value2: SetOf Rcd
            style, units, price
This is what we get from Clio [PVMHF 02]:
for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $y in dates, $u in case $t of $y, $z in countries, $v in $u.($z)
where $y = $s.shipdate and $z = $s.country and $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and $u.($z) = SK[$s.shipdate,$s.country]
After simplification:
for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $u in case $t of $s.shipdate, $v in $u.($s.country)
where $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and $u.($s.country) = SK[$s.shipdate,$s.country]
Three Steps:
1. Modify schemas with dynamic placeholders
2. Compile mappings
3. Simplify mapping
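The nested target that the simplified mapping describes can be sketched in Python with nested dicts. This is an illustration, not Clio's generated query: the function name `by_shipdate_country` is hypothetical, and the Skolem term SK[$s.shipdate,$s.country] is modeled as the list reached through the (shipdate, country) key path, so all records that agree on both values land in the same set.

```python
# Source.Sales rows from the slide.
source_sales = [
    {"country": "USA", "region": "East", "style": "Tee", "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "East", "style": "Elec.", "shipdate": "12-07", "units": 12, "price": 3600},
    {"country": "USA", "region": "West", "style": "Tee", "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee", "shipdate": "02-08", "units": 12, "price": 2000},
]


def by_shipdate_country(sales):
    """Nest sales under dynamic labels: shipdate first, then country."""
    target = {}
    for s in sales:
        # case $t of $s.shipdate: the shipdate *value* selects (or creates) a label.
        by_country = target.setdefault(s["shipdate"], {})
        # $u.($s.country) = SK[$s.shipdate, $s.country]: one set per (date, country).
        group = by_country.setdefault(s["country"], [])
        group.append({"style": s["style"], "units": s["units"], "price": s["price"]})
    return target


nested = by_shipdate_country(source_sales)
```

Serializing `nested` with one element per record under each country label gives the shape of the XML instance shown for this example, with two USA entries under 12-07.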
Query Generation: two-phase algorithm [PVMHF 02]
Phase 1: Q1 shreds the source instance I over relational views of the target schema
Phase 2: Q2 assembles the target instance J from the relational views
[Diagram: I (conforms-to S; node S1) → Q1 → relational views (r r r r) → Q2 → J (conforms-to T; nodes T1, T2, T3, T4)]
New Query Generation
Phase 1: Q1 shreds the source instance I over relational views of the target ndos
Phase 2: Q2 assembles the target instance J from the relational views
Q3 computes the target schema T
Q4 is the optional post-processing
[Diagram: I (conforms-to S; node S1) → Q1 → relational views (r r r r) → Q2 → J; Q3 → T (nodes T1, T2, T3, T4); Q4; ndos (nodes T1, T2, T3); two conforms-to edges on the target side]
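The division of labor among Q1 to Q4 can be pictured with the stock-ticker example. The sketch below is a loose Python analogy, not the generated queries (which the slides do not show): the function names, the single flat view of (time, label, value) tuples, and the choice of null-padding as the post-processing step are all assumptions.

```python
stock_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def q1_shred(source):
    """Phase 1: shred the source into a flat view of (time, label, value) tuples."""
    return [(s["time"], s["symbol"], s["price"]) for s in source]


def q2_assemble(view):
    """Phase 2: assemble target records from the flat view."""
    return [{"time": time, label: value} for time, label, value in view]


def q3_schema(view):
    """Compute the target schema: the fixed label plus the distinct dynamic labels."""
    labels = ["time"]
    for _, label, _ in view:
        if label not in labels:
            labels.append(label)
    return labels


def q4_postprocess(records, schema):
    """Optional post-processing, here assumed to be padding with None."""
    return [{label: r.get(label) for label in schema} for r in records]


view = q1_shred(stock_ticker)
target_instance = q2_assemble(view)
target_schema = q3_schema(view)
padded_instance = q4_postprocess(target_instance, target_schema)
```

The point of the analogy is that the schema is an output of the exchange alongside the instance: `q3_schema` reads the same intermediate view that `q2_assemble` does.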
MAD Clio vs. Commercial Tools
[Figure: query execution time [s] (axis ticks 1 to 100000) against the number of distinct labels (0 to 600). Legend: Naive query, Dynamic query, Static query, Optimized query. Annotations: Commercial Tool, MAD Clio.]
48 source labels (10 MB): naïve 183 s, dynamic 14 s, optimized 10 s
MAD Clio Performance
12 target labels (10 MB): naïve 590 s, optimized 80 s [1 phase: 3 s]
Some Related Work
• Lots of related work in the relational setting:
– FIRA/FISQL [Wyss,Robertson 2005] has an excellent survey.
– SchemaSQL [Lakshmanan,Sadri,Subramanian 1996], FIRA/FISQL [Wyss,Robertson 2005]
• Extensions to SQL to handle metadata as data
• Only relational dynamic output schemas
• Language and semantics, NO transformations from GUI
• In XML settings
– HepTox [BCHLP 2005], commercial mapping tools [Altova MapForce, MS ADO.net, StylusStudio, BEA (Oracle) Aqualogic]
• No dynamic elements in the target
MAD Clio
Data exchange with data-metadata support: Data to Data is a special case
Source schema S
– New construct to iterate over elements’ labels: placeholder
Target schema T
– Target schema can be incomplete: nested dynamic output schema (ndos)
GUI
Declarative (internal) representation
– New constructs for the mapping language
Executable code (XSLT, XQuery, Java)
– New mapping & query generation algorithms
– Including a query to generate the target schema.
Thank you.
Questions?
Metadata-to-Metadata
Metadata to Metadata: placeholder to dynamic element
Source: Rcd
  properties: SetOf Rcd
    @name, @lang, @date, …, @format, pval
<<attrs>> label value
Target: Rcd
  label1, value1: SetOf Rcd
    @value
    label2, value2
<<names>>
<<elems>>
...<properties name=“price” lang=“en-us” date=“01-01-2008” ... > <pval>48.15</pval></properties> ...
...<price value=“48.15” lang=“en-us” date=“01-01-2008” ... /> ...
for $x1 in Source.properties, $x2 in { @lang, @date, …, @format }
exists $y1 in Target.($x1.@name)
where $y1.@value = $x1.pval and $y1.($x2) = $x1.($x2)
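A small Python sketch of the metadata-to-metadata case, using the `price` property from the slide. It is an illustration under assumptions: the function name `metadata_to_metadata` is hypothetical, XML attributes are encoded as dict keys prefixed with `@`, and the placeholder lists only `@lang` and `@date` because the slide elides the remaining attributes.

```python
# Source instance from the slide; attributes beyond lang and date are elided there.
properties = [
    {"@name": "price", "@lang": "en-us", "@date": "01-01-2008", "pval": "48.15"},
]

# The <<attrs>> placeholder, restricted to the attributes visible on the slide.
attrs = ["@lang", "@date"]


def metadata_to_metadata(props, labels):
    """Turn each property into an element named after its @name value."""
    target = []
    for x1 in props:                      # for $x1 in Source.properties
        y1 = {"@value": x1["pval"]}       # $y1.@value = $x1.pval
        for x2 in labels:                 # $x2 in { @lang, @date, ... }
            y1[x2] = x1[x2]               # $y1.($x2) = $x1.($x2)
        target.append({x1["@name"]: y1})  # $y1 in Target.($x1.@name)
    return target


elements = metadata_to_metadata(properties, attrs)
```

Both directions appear at once here: the value of `@name` becomes a target label (data to metadata), while the attribute labels are copied across by iterating over them (the placeholder side).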