Straightforward Integration of Federated Data

Using standard RDBMS servers to link databases across the internet

1st Annual CASIMIR Symposium
Rome, Nov 27–28, 2007

Walter Pargent
GSF - National Research Center for Environment and Health

1st Ann. CASIMIR symposium, Rome, Nov 28, 2007

Straightforward integration of federated data (2)

Session 3, Walter Pargent

Short Glossary

• RDBMS = Relational Database Management System

• SQL = Structured Query Language

Aims / Topics

• Show standard RDBMS features as options for dynamic data integration

• Discuss database linkage vs. replication

• Technical, not semantic, integration level
• How can technology help semantics?
• In any case: ask about others' experiences!

Janan Eppig @ Corfu

Data integration, ETL style

• ETL =
  – Extraction
  – Transformation (and Transfer)
  – Loading

• => REPLICATION

Rule of thumb for data design

• Redundancy abets inconsistency, so better avoid it …

• … wherever possible
  – Exceptions for
    • Performance issues
    • (Ease of implementation)

Linkage

[Diagram; labels: Master-Database-Servers, Information-Proxy (RDBMS-Server), SQL-Interface, Web-Interfaces, Web-Services, DataMart (BioMart)]

Linkage Security Aspects

[Diagram; labels: Information-Proxy (RDBMS-Server), Local Proxies, Local-Transact-Work-DBs, in Demilitarized Firewall-Zones]

The paranoid option: virtual private networks (VPN)

Integration@YourDesk

MS-ACCESS

• … as an RDBMS client
• Technical basis
  – Data source names (DSNs) defined for individual table connections
  – Open Database Connectivity (ODBC)
  – Ability to join between heterologous RDBMS tables and (local) spreadsheets
  – Relationship manager (for predefined joins)

Integration@YourDesk
File -> External Data -> Link Tables

Integration@YourDesk
Create an ODBC link

Integration@YourDesk
Tools -> Relationships

Integration@YourDesk

3 in One

• EMMASTR
  – MySQL (Solaris)
• cryoDb (devel)
  – MySQL (Windows)
• MouseNet
  – Sybase ASE (HP-UX)

Integration@YourDesk
EMMA_2_MouseNet: Query Building

Integration@YourDesk
EMMA_2_MouseNet: Results

Integration@YourDesk

• Using a desktop database (MS-Access) is suitable for
  – Small scale and low complexity integration
    • Prototyping
  – Small scale ad hoc replication
  – Small scale data migrations / transformations
  – Ad hoc data editing

• … not suitable for
  – Handling of large datasets (no optimization)

BIG & CENTRAL

Large scale & central server

• Rather use a 'real' RDBMS …
• … on a central, high-performance server
• e.g.:
  – MySQL FEDERATED engine
  – Sybase ASE CIS (Component Integration Services)

MySQL

• Table defined on the master server
# create table xxx ( c1 int, c2 char, … )
  ENGINE=InnoDB;

• Table definition on proxy server
# create table f_xxx (c1 int, c2 char, …)
  ENGINE=FEDERATED
  CONNECTION='mysql://usr:PW@host:3306/dbnam/table';
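The master/proxy idea can be tried without any server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in: an attached in-memory database plays the master server, and a view named f_xxx plays the proxy table. It is an analogy only, not the MySQL FEDERATED engine; the schema name masterdb and the sample rows are made up for illustration.

```python
import sqlite3

# One connection plays the "proxy server"; the attached database plays the
# remote "master server". Both are in-memory SQLite databases (an assumption
# so the sketch runs anywhere; this is NOT the MySQL FEDERATED engine).
proxy = sqlite3.connect(":memory:")
proxy.execute("ATTACH DATABASE ':memory:' AS masterdb")

# Table defined on the master (columns and rows are illustrative).
proxy.execute("CREATE TABLE masterdb.xxx (c1 INTEGER, c2 TEXT)")
proxy.executemany("INSERT INTO masterdb.xxx VALUES (?, ?)", [(1, "a"), (2, "b")])

# Proxy-side definition: f_xxx stores no rows, it only points at the master.
# SQLite needs a TEMP view for references into another attached database.
proxy.execute("CREATE TEMP VIEW f_xxx AS SELECT c1, c2 FROM masterdb.xxx")

# A change on the master is visible through the proxy immediately.
proxy.execute("INSERT INTO masterdb.xxx VALUES (3, 'c')")
rows = proxy.execute("SELECT c1, c2 FROM f_xxx ORDER BY c1").fetchall()
print(rows)
```

As with a federated table, nothing is copied: every query against f_xxx reads the master's current rows.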

MySQL (5.1)

• Alternative to link many tables (since version 5.1)
# CREATE SERVER fedlink
  FOREIGN DATA WRAPPER mysql
  OPTIONS (
    USER 'fed_user',
    HOST 'remote_host',
    PORT 9306,
    DATABASE 'federated' );

# CREATE TABLE test_table (col1, col2, …)
  ENGINE=FEDERATED
  CONNECTION='fedlink/test_table';

MySQL (doc)

• MySQL documentation on this can be found at:
  – "Storage Engines"
    • "The Federated Storage Engine"
  – http://dev.mysql.com/doc/refman/5.1/en/federated-storage-engine.html

Sybase ASE

ToDo on the Master DATABASE

• Add entry to interfaces file
# GSFIEG_JAX
  master tcp ether xxx.gsf.de 5555
  query tcp ether xxx.gsf.de 5555

• As sa user on the MASTERDB
# sp_addserver GSFIEG_JAX
# sp_addremotelogin GSFIEG_JAX
# sp_addremotelogin GSFIEG_JAX, pargent, <someOtherLocalUserName>

Sybase ASE

ToDo on the Proxy DATABASE

• Add entry to interfaces file
# JAX_PUBLIC
  master tcp ether yyy.jax.org 4000
  query tcp ether yyy.jax.org 4000

• As sa user on the ProxyDB
# sp_addserver JAX_PUBLIC
# sp_addlogin pargent, <lognamAtJaxDb>, null, null, 'Remote user wapar at jax'

Sybase ASE

Test it on Proxy DB

• Remote Procedure Call (RPC)
# Exec JAX_PUBLIC.mgd..sp_tables

• Example for proxy table
# Create proxy_table jax_vocab
  at 'JAX_PUBLIC.mgd..VOC_vocab'
# Create existing table jax_proc (c1 int, c2 varchar(10), …)
  external procedure at 'JAX_PUBLIC.mgd..some_proc'

Other RDBMS’s

• PostgreSQL: contrib/dblink()
• IBM DB2 UDB: table nicknames
• Oracle: database links
• MS SQL-Server: ???

Heterologous Links

• Connections between different RDBMS's
• Implemented using
  – ODBC (universal)
  – protocols specific to individual RDBMS (esp. towards Oracle, DB2, SQL-Server)

Availability …

• … of features for heterologous database integration:
  – MySQL and PostgreSQL (NOT YET)
  – Sybase: ECDA options (Enterprise Connect Data Access)
  – Oracle: Transparent Gateway option
  – IBM DB2: Discovery Link
  – MS SQL-Server: ???

Linkage via Application vs. RDBMS

• Web services and web apps link at the application level, which is appropriate for sequential / cascading 1:n linkage
  – Typical web application (HTML table with links)
  – Web services

• Must have SQL (or a similar interface) to do overall set operations

Why set operations

• Scientific inference needs flexibility
  – Cannot be confined to predefined user interfaces (static query forms)
  – "Find all mouse lines with traits linked to energy metabolism …"

• Need to use deliberate set-oriented queries
  – Feasibility study / data quality checks in prototype phase
  – Pre-phase consistency checking to investigate data curation needs
  – Prototype functionality and performance checks
  – Data transformation and masking (views)
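To make the two kinds of set-oriented query above concrete, here is a minimal sketch, assuming Python's built-in sqlite3 module with two attached in-memory databases standing in for two linked servers. The schemas emma.strain and mousenet.trait, their columns, and all rows are hypothetical and are not the real EMMA or MouseNet schemas.

```python
import sqlite3

# Two attached in-memory databases stand in for two linked servers.
# Schema names, table names, columns and rows are all hypothetical.
con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS emma")
con.execute("ATTACH DATABASE ':memory:' AS mousenet")
con.execute("CREATE TABLE emma.strain (line_code TEXT, name TEXT)")
con.execute("CREATE TABLE mousenet.trait (line_code TEXT, category TEXT)")
con.executemany("INSERT INTO emma.strain VALUES (?, ?)",
                [("L001", "Line A"), ("L002", "Line B"), ("L003", "Line C")])
con.executemany("INSERT INTO mousenet.trait VALUES (?, ?)",
                [("L001", "energy metabolism"), ("L002", "bone"),
                 ("L003", "energy metabolism"), ("L009", "energy metabolism")])

# Ad hoc question: all mouse lines with traits linked to energy metabolism.
energy_lines = con.execute(
    "SELECT DISTINCT s.line_code, s.name "
    "FROM emma.strain AS s "
    "JOIN mousenet.trait AS t ON t.line_code = s.line_code "
    "WHERE t.category = 'energy metabolism' "
    "ORDER BY s.line_code").fetchall()

# Consistency check: line codes used in one database but unknown in the other.
orphans = con.execute(
    "SELECT line_code FROM mousenet.trait "
    "EXCEPT SELECT line_code FROM emma.strain "
    "ORDER BY line_code").fetchall()

print(energy_lines)
print(orphans)
```

Neither query needs a predefined form. The second one is the kind of check that shows early how much data curation a link between two databases would require.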

Integration at the input (write) level

• To reduce subsequent need for data curation, we'll want to use:
  – Central controlled vocabularies
  – Ontology servers … (SQL, web services or Enterprise JavaBeans)
  – Central source of linkage objects
    • e.g. a proactive mouseLineDb producing worldwide unique identifiers (codes)
    • cf. internal situation at GSF (ambiguous line codes)

Replication vs. Federation: Replication

• PRO:
  – Data are at hand for performance tuning and data curation
  – Best performance for (client-side) queries

• CON:
  – Storage needs
  – Risk of outdated data
  – Risk of inconsistencies (from duplicates)
  – Expense of regular uploads
    • Network load, script implementation, downtimes

Replication vs. Federation: Federation

• PRO:
  – Always up to date
  – No denormalization (no additional risk of inconsistencies)
  – No extra expense for implementing automatic uploads

• CON:
  – Several points of failure for service availability
  – Inadequate queries may hamper the master server
    • However, this could be prevented by db usage configuration (available resources, max rowcount, etc. … per user …)

Nix is fix ("nothing is fixed")
(There is more than one way to Rome …)

• Decisions need not / must not be carved in stone …
  – Find the currently appropriate mixture
  – Revalidate decisions regularly (based on query statistics and user feedback)
  – Change easily, whenever …
    • Data volatility changes (prototype -> HTP project)
    • Service reliability changes (for better or worse)
    • Performance issues arise

Substitute linkage …

[Diagram; labels: View: XXX, Proxy-table p_XXX, MasterDB]

… by replication

[Diagram; labels: View: XXX, Replicated table: XXX, Proxy-table p_XXX, MasterDB]
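One possible reading of these two diagrams: clients always query the view XXX, and only the view definition decides whether rows come from the linked master or from a local copy. The sketch below assumes Python's built-in sqlite3 module, with an attached in-memory database in place of MasterDB and a direct reference to it in place of the proxy table p_XXX. The local copy is named r_XXX here only to avoid a name clash; all names and rows are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS masterdb")  # stand-in for MasterDB
con.execute("CREATE TABLE masterdb.XXX (id INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO masterdb.XXX VALUES (?, ?)", [(1, "a"), (2, "b")])
con.commit()


def client_query(con):
    """What every client runs; it never changes between the two phases."""
    return con.execute("SELECT id, val FROM XXX ORDER BY id").fetchall()


# Phase 1, linkage: the view XXX reads straight through to the master.
con.execute("CREATE TEMP VIEW XXX AS SELECT id, val FROM masterdb.XXX")
linked = client_query(con)

# Phase 2, replication: copy the rows locally and re-point the view.
con.execute("CREATE TABLE r_XXX AS SELECT id, val FROM masterdb.XXX")
con.execute("DROP VIEW XXX")
con.execute("CREATE TEMP VIEW XXX AS SELECT id, val FROM r_XXX")
replicated = client_query(con)

# The trade-off: later master changes are not seen until the next refresh.
con.execute("INSERT INTO masterdb.XXX VALUES (3, 'c')")
stale = client_query(con)

print(linked, replicated, stale)
```

Switching between the two modes touches one view definition and no client code, which is what keeps the decision cheap to revise.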

Linkage helps replication

• Use linkage of federated data for straightforward implementation of replication
# Insert into replica_table
  select * from proxy_table

• Use master_tab timestamps (ins_on, upd_on) as filters for incremental replication …

• Use diff_log table for master_tab
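The timestamp filter can be sketched as follows, assuming Python's built-in sqlite3 module, with an attached in-memory database in place of the linked master. The column layout, the ISO date strings, and the helper replicate() are illustrative assumptions; only the names master_tab, replica_table and upd_on come from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS masterdb")  # stand-in for the linked master
con.execute("CREATE TABLE masterdb.master_tab "
            "(id INTEGER PRIMARY KEY, val TEXT, upd_on TEXT)")
con.execute("CREATE TABLE replica_table "
            "(id INTEGER PRIMARY KEY, val TEXT, upd_on TEXT)")
con.executemany("INSERT INTO masterdb.master_tab VALUES (?, ?, ?)",
                [(1, "a", "2007-11-01"), (2, "b", "2007-11-02")])


def replicate(con):
    """Copy master rows newer than the newest row already replicated."""
    (last,) = con.execute(
        "SELECT COALESCE(MAX(upd_on), '') FROM replica_table").fetchone()
    (pending,) = con.execute(
        "SELECT COUNT(*) FROM masterdb.master_tab WHERE upd_on > ?",
        (last,)).fetchone()
    # The slide's "insert into replica_table select * from proxy_table",
    # restricted by the timestamp filter; REPLACE overwrites updated rows.
    con.execute(
        "INSERT OR REPLACE INTO replica_table "
        "SELECT id, val, upd_on FROM masterdb.master_tab WHERE upd_on > ?",
        (last,))
    con.commit()
    return pending


first = replicate(con)   # initial full load

# Changes on the master: one update, one new row.
con.execute("UPDATE masterdb.master_tab "
            "SET val = 'b2', upd_on = '2007-11-20' WHERE id = 2")
con.execute("INSERT INTO masterdb.master_tab VALUES (3, 'c', '2007-11-21')")

second = replicate(con)  # incremental run
third = replicate(con)   # nothing left to copy
replica = con.execute("SELECT * FROM replica_table ORDER BY id").fetchall()
print(first, second, third, replica)
```

A timestamp filter alone never notices rows deleted on the master, and a strict ">" comparison can miss rows that share the newest timestamp. A diff_log table on the master, as suggested above, would be one way to cover those cases.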

Where to use what

• Replication
  – Where the master server is not reliably available or has inadequate performance
  – Where the master server is not trusted for sustainability (long-term persistence)
  – Where the query load on this part of the data is extraordinarily high

Where to use what

• Federation / Linkage
  – Master is well maintained by a "buddy"
  – Query load is intermediate or low
  – High volatility of master data
  – … or simply in any case where there is no important CON

Outlook

• Need to do heterologous tests, including performance checks

• Ad hoc statistics and complex queries rapidly developed in pure SQL as views or stored procedures (by power users or bioinformaticians …)

• Combine with BioMart?

Other tools

• RDBMS-inherent scheduling and replication options
• 3rd-party ETL tools (Extraction, Transformation, Loading)
• Usage of client GUI for SQL stored procedures

THANX to

Thanks to

• Colleagues at the GSF
• Martin Hrabé de Angelis
• Janan Eppig, Richard Butler and Damian Smedley
• Paul Schofield