Straightforward Integration of Federated Data
Using standard RDBMS servers to link databases across the internet
1st Annual CASIMIR Symposium, Rome, Nov 27–28, 2007
Walter Pargent, GSF - National Research Center for Environment and Health
1st Ann. CASIMIR symposium, Rome, Nov 28, 2007
Straightforward integration of federated data
Session 3, Walter Pargent
Short Glossary
• RDBMS = Relational DataBase Management System
• SQL = Structured Query Language
Aims / Topics
• Show standard RDBMS features as options for dynamic data integration
• Discuss database linkage vs. replication
• Technical, not semantic integration level
• How can technology help semantics?
• In case: enquire about others' experiences!
Janan Eppig @ Corfu
Data integration ETL style
• ETL =
  – Extraction
  – Transformation (and Transfer)
  – Loading
• => REPLICATION
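The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (Python's built-in sqlite3 module) stands in for real RDBMS servers, and the table names mouse_line and mouse_line_copy, the sample rows, and the normalization rule are all hypothetical.

```python
import sqlite3

# Hypothetical source ("master") and target ("replica") databases.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE mouse_line (code TEXT, name TEXT)")
source.executemany(
    "INSERT INTO mouse_line VALUES (?, ?)",
    [(" em:001 ", "Line A"), ("em:002", "Line B")],
)
source.commit()
target.execute("CREATE TABLE mouse_line_copy (code TEXT PRIMARY KEY, name TEXT)")


def extract(conn):
    """Extraction: read all rows from the source table."""
    return conn.execute("SELECT code, name FROM mouse_line").fetchall()


def transform(rows):
    """Transformation: normalize codes (trim whitespace, upper-case)."""
    return [(code.strip().upper(), name) for code, name in rows]


def load(conn, rows):
    """Loading: write the transformed rows into the target table."""
    conn.executemany("INSERT OR REPLACE INTO mouse_line_copy VALUES (?, ?)", rows)
    conn.commit()


load(target, transform(extract(source)))
replicated = target.execute(
    "SELECT code, name FROM mouse_line_copy ORDER BY code"
).fetchall()
print(replicated)
```

After the run, the target holds its own copy of the data, which is exactly the redundancy that the replication approach implies.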
Rule of thumb of data design
• Redundancy abets inconsistency, so better avoid it …
• … wherever possible
  – Exceptions for
    • Performance issues
    • (Ease of implementation)
Linkage
[Diagram labels: Information-Proxy (RDBMS-Server); Web-Interfaces; DataMart (BioMart); SQL-Interface; Web-Services; Master-Database-Servers]
Linkage Security Aspects
The paranoid option: virtual private networks (VPN) in Demilitarized Firewall-Zones
[Diagram labels: Information-Proxy (RDBMS-Server); Local-Transact-Work-DBs; Local Proxies]
Integration@YourDesk
MS-ACCESS
• … as an RDBMS client
• Technical basis
  – Data source names (DSNs) defined for individual table connections
  – Open Database Connectivity (ODBC)
  – Ability to join between heterologous RDBMS tables and (local) spreadsheets
  – Relationship manager (for join predefs)
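The idea of one client joining tables that live in separately linked sources can be sketched as follows. This is not MS-Access or ODBC: as an assumption for illustration, SQLite's ATTACH DATABASE stands in for two linked servers, and the schema names server_a and server_b, the tables, and the rows are hypothetical.

```python
import sqlite3

# One client connection plays the role of the desktop client.
client = sqlite3.connect(":memory:")

# Two attached databases stand in for two independently linked servers.
client.execute("ATTACH DATABASE ':memory:' AS server_a")
client.execute("ATTACH DATABASE ':memory:' AS server_b")

client.execute("CREATE TABLE server_a.strain (line_code TEXT, strain_name TEXT)")
client.execute("CREATE TABLE server_b.phenotype (line_code TEXT, trait TEXT)")
client.executemany(
    "INSERT INTO server_a.strain VALUES (?, ?)",
    [("L1", "Strain one"), ("L2", "Strain two")],
)
client.executemany(
    "INSERT INTO server_b.phenotype VALUES (?, ?)",
    [("L1", "body weight"), ("L3", "glucose level")],
)

# A single query joins across both sources on the shared line code.
joined = client.execute(
    """
    SELECT s.line_code, s.strain_name, p.trait
    FROM server_a.strain AS s
    JOIN server_b.phenotype AS p ON p.line_code = s.line_code
    ORDER BY s.line_code
    """
).fetchall()
print(joined)
```

Only line codes present in both sources survive the inner join, which is why a shared, unambiguous identifier matters for this kind of linkage.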
Integration@YourDesk: File -> External Data -> Link Tables
Integration@YourDesk: Create an ODBC link
Integration@YourDesk: Tools -> Relationships
Integration@YourDesk: 3 in One
• EMMASTR
  – MySQL (Solaris)
• cryoDb (devel)
  – MySQL (Windows)
• MouseNet
  – Sybase ASE (HP-UX)
Integration@YourDesk: EMMA_2_MouseNet Query Building
Integration@YourDesk: EMMA_2_MouseNet Results
Integration@YourDesk
• Using a desktop database (MS-Access) is suitable for
  – Small scale and low complexity integration
    • Prototyping
  – Small scale ad hoc replication
  – Small scale data migrations / transformations
  – Ad hoc data editing
• … not suitable for
  – Handling of large datasets (no optimization)
BIG & CENTRAL
Large scale & central server
• Rather use a 'real' RDBMS …
• … on a central performant server
• e.g.:
  – MySQL FEDERATED engine
  – Sybase ASE CIS (Component Integration Services)
MySQL
• Table defined on the master server
  # create table xxx ( c1 int, c2 char, … )
    ENGINE=InnoDB;
• Table definition on proxy server
  # create table f_xxx (c1 int, c2 char, …)
    ENGINE=FEDERATED
    CONNECTION='mysql://usr:PW@host:3306/dbnam/table';
MySQL (5.1)
• Alternative to link many tables (since version 5.1)
  # CREATE SERVER fedlink
    FOREIGN DATA WRAPPER mysql
    OPTIONS (USER 'fed_user', HOST 'remote_host', PORT 9306, DATABASE 'federated');
  # CREATE TABLE test_table (col1, col2, …)
    ENGINE=FEDERATED
    CONNECTION='fedlink/test_table';
MySQL (doc)
• Find the MySQL documentation for this at:
  – "Storage Engines"
    • "The Federated Storage Engine"
  – http://dev.mysql.com/doc/refman/5.1/en/federated-storage-engine.html
Sybase ASE: ToDo on the Master DATABASE
• Add entry to interfaces file
  # GSFIEG_JAX
    master tcp ether xxx.gsf.de 5555
    query tcp ether xxx.gsf.de 5555
• As sa user on the MASTERDB
  # sp_addserver GSFIEG_JAX
  # sp_addremotelogin GSFIEG_JAX
  # sp_addremotelogin GSFIEG_JAX, pargent, <someOtherLocalUserName>
Sybase ASE: ToDo on the Proxy DATABASE
• Add entry to interfaces file
  # JAX_PUBLIC
    master tcp ether yyy.jax.org 4000
    query tcp ether yyy.jax.org 4000
• As sa user on the ProxyDB
  # sp_addserver JAX_PUBLIC
  # sp_addlogin pargent, <lognamAtJaxDb>, null, null, 'Remote user wapar at jax'
Sybase ASE: Test it on Proxy DB
• Remote Procedure Call (RPC)
  # Exec JAX_PUBLIC.mgd..sp_tables
• Example for proxy table
  # Create proxy_table jax_vocab
    at 'JAX_PUBLIC.mgd..VOC_vocab'
  # Create existing table jax_proc (c1 int, c2 varchar(10), …)
    external procedure at 'JAX_PUBLIC.mgd..some_proc'
Other RDBMS's
• PostgreSQL: contrib/dblink()
• IBM DB2 UDB: table nicknames
• Oracle: database links
• MS SQL-Server: ???
Heterologous Links
• Connections between different RDBMS's
• Implemented using
  – ODBC (universal)
  – protocols specific to individual RDBMS (esp. towards Oracle, DB2, SQL-Server)
Availability …
• … of features for heterologous database integration:
  – MySQL and PostgreSQL (NOT YET)
  – Sybase: ECDA options (Enterprise Connect Data Access)
  – Oracle: Transparent Gateway option
  – IBM DB2: Discovery Link
  – MS SQL-Server: ???
Linkage via Application vs. RDBMS
• Webservices and webapps link on application level, which is appropriate for sequential / cascading 1:n linkage
  – Typical web-application (HTML tab with links)
  – Web-services
• Must have SQL (or similar interface) to do overall set operations
Why set operations
• Scientific inference needs flexibility
  – Cannot be confined to predefined user interfaces (static query forms)
  – "Find all mouse lines with traits linked to energy metabolism …"
• Need to use deliberate set oriented queries
  – Feasibility study / data quality checks in prototype phase
  – Pre phase consistency checking to investigate data curation needs
  – Prototype functionality and performance checks
  – Data transformation and masking (views)
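Two such set oriented queries can be sketched as follows: the ad hoc question quoted above and a simple consistency check. The sketch assumes a hypothetical schema (mouse_line, trait, and a category column) with invented rows, and it uses SQLite only because it runs without a server; the SQL itself is generic.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE mouse_line (line_code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE trait (line_code TEXT, trait TEXT, category TEXT);
    INSERT INTO mouse_line VALUES ('L1', 'Line one'), ('L2', 'Line two');
    INSERT INTO trait VALUES
        ('L1', 'reduced body fat', 'energy metabolism'),
        ('L2', 'coat colour', 'morphology'),
        ('L9', 'high glucose', 'energy metabolism');
    """
)

# Ad hoc question: all mouse lines with traits linked to energy metabolism.
energy_lines = db.execute(
    """
    SELECT DISTINCT m.line_code, m.name
    FROM mouse_line AS m
    JOIN trait AS t ON t.line_code = m.line_code
    WHERE t.category = 'energy metabolism'
    ORDER BY m.line_code
    """
).fetchall()

# Consistency check: line codes used in trait records but unknown as lines.
orphans = db.execute(
    """
    SELECT line_code FROM trait
    EXCEPT
    SELECT line_code FROM mouse_line
    """
).fetchall()

print(energy_lines, orphans)
```

Neither query needs a predefined form; the second one is the kind of check that reveals data curation needs before integration starts.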
Integration at the input (write) level
• To reduce subsequent need for data curation, we'll want to use:
  – Central Controlled Vocabularies
  – Ontology servers … (SQL, Webservices or JavaEnterpriseBeans)
  – Central Source of Linkage objects
    • e.g. a proactive mouseLineDb producing worldwide unique identifiers (codes)
    • c.f. internal situation at GSF (ambiguous line codes)
Replication vs. Federation: Replication
• PRO:
  – Data are at hand for performance tuning and data curation
  – Best performance for (client-side) queries
• CON:
  – Storage needs
  – Risk of outdated data
  – Risk of inconsistencies (from duplicates)
  – Expense of regular uploads
    • Network load, script implementation, downtimes
Replication vs. Federation: Federation
• PRO:
  – Always up to date
  – No denormalization (no additional risk of inconsistencies)
  – No extra expense for implementation of auto upload
• CON:
  – Several points of failure for service availability
  – Inadequate queries may hamper the master server
    • However, this could be prevented by db usage configuration (available resources, max rowcount, etc … per user …)
Nix is fix ("nothing is fixed")
(There is more than one way to Rome …)
• Decisions need not / must not be carved in stone …
  – Find the currently appropriate mixture
  – Revalidate decisions regularly (based on query statistics and user feedback)
  – Change easily, whenever ..
    • Data volatility changes (prototype -> HTP project)
    • Service reliability changes (for the better or worse)
    • For performance issues
Substitute linkage …
[Diagram labels: View: XXX; MasterDB; Proxy-table: p_XXX]
… by replication
[Diagram labels: View: XXX; MasterDB; Proxy-table: p_XXX; ReplicatedTable: XXX]
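One possible reading of the two diagrams is that clients always query the name XXX, which is first a view on the proxy table p_XXX and later a replicated table of the same name. The sketch below illustrates that reading only. It assumes SQLite as a stand-in, in which p_xxx is an ordinary local table playing the role of the proxy table; the column names and rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The query that clients run; it must not change when the backing changes.
CLIENT_QUERY = "SELECT c1, c2 FROM xxx ORDER BY c1"

# Step 1 (linkage): xxx is a view on the proxy table p_xxx.
db.executescript(
    """
    CREATE TABLE p_xxx (c1 INTEGER, c2 TEXT);  -- stand-in for the proxy table
    INSERT INTO p_xxx VALUES (1, 'a'), (2, 'b');
    CREATE VIEW xxx AS SELECT c1, c2 FROM p_xxx;
    """
)
via_linkage = db.execute(CLIENT_QUERY).fetchall()

# Step 2 (replication): drop the view and materialize a table named xxx.
db.executescript(
    """
    BEGIN;
    DROP VIEW xxx;
    CREATE TABLE xxx AS SELECT c1, c2 FROM p_xxx;
    COMMIT;
    """
)
via_replica = db.execute(CLIENT_QUERY).fetchall()

# Check what kind of object now answers to the name xxx.
kind = db.execute(
    "SELECT type FROM sqlite_master WHERE name = 'xxx'"
).fetchone()[0]
print(via_linkage, via_replica, kind)
```

Because the client query is identical in both steps, the decision between linkage and replication can be revised later without touching client code.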
Linkage helps replication
• Use linkage of federated data for straightforward implementation of replication
  # Insert into replica_table
    select * from proxy_table
• Use master_tab timestamps (ins_on, upd_on) as filters for incremental replication …
• Use diff_log table for master_tab
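Incremental replication filtered by an update timestamp might look like the sketch below. It is an illustration under assumptions, not the author's implementation: SQLite stands in for the RDBMS, proxy_table is a local table playing the role of the linked master table, and the columns id and val, the sample rows, and the helper replicate_increment are hypothetical. Only upd_on is taken from the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE proxy_table (id INTEGER PRIMARY KEY, val TEXT, upd_on TEXT);
    CREATE TABLE replica_table (id INTEGER PRIMARY KEY, val TEXT, upd_on TEXT);
    INSERT INTO proxy_table VALUES
        (1, 'old', '2007-11-01 10:00:00'),
        (2, 'old', '2007-11-02 10:00:00');
    """
)


def replicate_increment(conn):
    """Copy rows changed since the newest timestamp already in the replica.

    Caveats of this simple filter: rows that share the last synced timestamp
    are skipped, and deletions on the master are not propagated.
    """
    last_sync = conn.execute(
        "SELECT COALESCE(MAX(upd_on), '') FROM replica_table"
    ).fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO replica_table "
            "SELECT id, val, upd_on FROM proxy_table WHERE upd_on > ?",
            (last_sync,),
        )


replicate_increment(db)  # first run copies everything
count_after_first = db.execute("SELECT COUNT(*) FROM replica_table").fetchone()[0]

# Changes on the master side: one update and one new row.
db.execute(
    "UPDATE proxy_table SET val = 'new', upd_on = '2007-11-03 09:00:00' WHERE id = 1"
)
db.execute("INSERT INTO proxy_table VALUES (3, 'new', '2007-11-03 09:30:00')")

replicate_increment(db)  # second run copies only the two changed rows
replica = db.execute("SELECT id, val, upd_on FROM replica_table ORDER BY id").fetchall()
print(replica)
```

The caveats named in the docstring are one reason a separate change log, such as the diff_log table mentioned above, may be preferable to timestamps alone.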
Where to use what
• Replication
  – Where master server is not reliably available or has inadequate performance
  – The master server is not trusted for sustainability (long term persistence)
  – Query load to this part of data is extraordinarily high
Where to use what
• Federation / Linkage
  – Master is well maintained by a "buddy"
  – Query load is intermediate or low
  – High volatility of master data
  – … or just in any case where there is no important CON
Outlook
• Need to do heterologous tests including performance checks
• Ad hoc statistics and complex queries rapidly developed by pure SQL as views or stored procedures (by power-users or bio-informaticians …)
• Combine with BioMart?
Other tools
• RDBMS inherent scheduling and replication options
• 3rd party ETL tools (Extraction, Transformation, Loading)
• Usage of client GUI for SQL stored procedures
Thanks to
• Colleagues at the GSF
• Martin Hrabé de Angelis
• Janan Eppig, Richard Butler and Damian Smedley
• Paul Schofield