

Page 1: Integration of Biological Sources: Current Systems and Challenges Ahead (Sigmod Record, Vol. 33, No. 3, September 2004) Thomas Hernandez & Subbarao Kambhampati

Integration of Biological Sources: Current Systems and Challenges Ahead

(Sigmod Record, Vol. 33, No. 3, September 2004)

Thomas Hernandez & Subbarao Kambhampati
Dept. of Computer Science and Engineering

Arizona State University

Page 2

Introduction

Traditionally, biologists integrated biological data manually. However, the growing volume of data, the variety of formats, and the wide distribution of sources over the internet make manual integration practically infeasible. Computer-supported integration is needed.

This need is also justified by the characteristics of the biological sources:

Page 3

Characteristics of Biological Sources

Variety of data. Typical data stored cover several biological and genomic research fields (e.g. gene expression and sequences, disease characteristics, molecular structures, microarray data, etc.). Not only can the quantity of data available in a source be quite large, but the size of each record can itself be extremely large (e.g. DNA sequences, 3D protein structures, etc.).

Heterogeneous representations. Several sources containing similar data can have very different representations. The representational heterogeneity includes structural (i.e. schema), naming, semantic (i.e. the same concept under different terms, or the same term for different concepts), and content (different data for the same semantic object) differences.

Page 4

Characteristics of Biological Sources

Autonomous operations. Sources are free to modify their design and/or schema, and to remove or modify data, without any prior public notification. Nearly all sources are web-based and therefore dependent on network traffic and overall availability. The data is dynamic.

Different interfaces and querying capabilities.
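The naming and structural heterogeneity listed among these characteristics can be made concrete with a minimal Python sketch. The two record layouts, the field names, the values, and the `normalize` helper are all hypothetical, invented for illustration; they do not correspond to any real source.

```python
# Hypothetical records for the same gene as two sources might export it.
# Field names, nesting, and values are invented for illustration only.
source_a_record = {"gene_symbol": "geneX", "organism": "Homo sapiens", "seq_len": 1200}
source_b_record = {"name": "geneX", "species": {"label": "Homo sapiens"}, "length_bp": "1200"}

# Per-source mapping from a common field name to a function that extracts it.
FIELD_MAPS = {
    "A": {
        "symbol": lambda r: r["gene_symbol"],
        "organism": lambda r: r["organism"],
        "length": lambda r: int(r["seq_len"]),
    },
    "B": {
        "symbol": lambda r: r["name"],                # naming difference
        "organism": lambda r: r["species"]["label"],  # structural difference
        "length": lambda r: int(r["length_bp"]),      # type difference (string vs int)
    },
}


def normalize(source_id, record):
    """Translate a source-specific record into the common representation."""
    return {field: extract(record) for field, extract in FIELD_MAPS[source_id].items()}


if __name__ == "__main__":
    print(normalize("A", source_a_record))
    print(normalize("B", source_b_record))
```

Semantic and content differences are harder: a lookup table like this cannot decide which of two conflicting values for the same object is correct.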

Page 5

Integration Approaches in Existing Systems

Existing systems can be classified first in terms of data models, that is, the design assumptions made by the integration system as to the syntactic nature of the data being exported by the sources.

1. Text data model. Sources are viewed as exporting mainly text, and integration involves supporting keyword/text search across the sources.
2. Structured data model. When sources are viewed as exporting more structured data, there are two broad types of integration approaches: the data is either warehoused or accessed on demand from the sources.
3. Linked records model. Sources are viewed as exporting linked sets of browsable records, and integration involves supporting effective navigation across sources.

Page 6

Integration Approaches in Existing Systems

The majority of systems use the (semi-)structured or linked record models. These systems are discussed in more detail here.

They follow three types of approach:

1. Warehouse integration. It materializes the data from multiple sources into a local warehouse and executes all queries on the data contained in the warehouse instead of the actual sources. It emphasizes data translation, in contrast to the query translation of mediator-based integration.

Pros: less dependency on the network, more efficient query optimization, and the ability for users to filter, validate, modify, and annotate the data obtained from the sources.

Cons: outdated data and the need for frequent updates.
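As a rough illustration of warehouse integration, the sketch below materializes two sources into a local SQLite database and answers queries from that copy only. The source extracts, table names, and helper functions are assumptions made for this example, not part of any system discussed in the slides.

```python
import sqlite3

# Hypothetical extracts from two sources; in practice these would be fetched
# and translated from the sources' own formats.
SOURCE_GENES = [("g1", "geneA"), ("g2", "geneB")]
SOURCE_DISEASES = [("g1", "disease X"), ("g2", "disease Y"), ("g2", "disease Z")]


def build_warehouse(genes, diseases):
    """Materialize both sources into one local in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT)")
    db.execute("CREATE TABLE gene_disease (gene_id TEXT, disease TEXT, note TEXT)")
    db.executemany("INSERT INTO gene VALUES (?, ?)", genes)
    db.executemany("INSERT INTO gene_disease VALUES (?, ?, NULL)", diseases)
    db.commit()
    return db


def diseases_for(db, symbol):
    """Answer a query from the warehouse only; no source is contacted."""
    rows = db.execute(
        "SELECT d.disease FROM gene g JOIN gene_disease d ON g.gene_id = d.gene_id "
        "WHERE g.symbol = ? ORDER BY d.disease",
        (symbol,),
    )
    return [disease for (disease,) in rows]


if __name__ == "__main__":
    warehouse = build_warehouse(SOURCE_GENES, SOURCE_DISEASES)
    # Local annotation, one of the advantages listed above.
    warehouse.execute(
        "UPDATE gene_disease SET note = ? WHERE disease = ?",
        ("checked by curator", "disease Y"),
    )
    print(diseases_for(warehouse, "geneB"))
```

The drawback is visible here as well: once `build_warehouse` has run, later changes at the sources are not reflected until the warehouse is rebuilt or updated.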

Page 7

Integration Approaches in Existing Systems

2. Mediator-based integration. It concentrates on query translation. A mediator is responsible for reformulating, at runtime, a query on a single mediated schema into queries on the local schemas of the underlying data sources. The mapping between the source descriptions and the mediated schema is crucial for such a translation. There are two main approaches for establishing the mapping between each source schema and the global schema: global-as-view (GAV) and local-as-view (LAV). In GAV, the mediator relations are written directly in terms of the source relations. In LAV, every source relation is defined over the relations and the schema of the mediator. LAV is preferred for large-scale integration, and GAV is appropriate when the set of sources being integrated is known and stable.
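The GAV case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source relations, the mediated relation `GeneDisease`, and the `query` helper are hypothetical, and the sketch does not cover LAV, which requires rewriting queries using views rather than simple unfolding.

```python
# Hypothetical source relations, represented as lists of tuples.
SOURCE_1_GENES = [("g1", "geneA"), ("g2", "geneB")]          # (gene_id, symbol)
SOURCE_2_LINKS = [("g1", "disease X"), ("g2", "disease Y")]  # (gene_id, disease)


def gene_disease():
    """GAV definition: the mediated relation GeneDisease(symbol, disease)
    is written directly as a join over the two source relations."""
    symbol_by_id = dict(SOURCE_1_GENES)
    return [
        (symbol_by_id[gene_id], disease)
        for gene_id, disease in SOURCE_2_LINKS
        if gene_id in symbol_by_id
    ]


# The mediated schema maps each mediator relation to its defining view
# and to its column names.
MEDIATED_SCHEMA = {"GeneDisease": gene_disease}
COLUMNS = {"GeneDisease": ("symbol", "disease")}


def query(relation, **conditions):
    """Answer a query on the mediated schema by unfolding the view and
    filtering the resulting rows on the requested column values."""
    columns = COLUMNS[relation]
    rows = MEDIATED_SCHEMA[relation]()
    return [
        row for row in rows
        if all(row[columns.index(col)] == val for col, val in conditions.items())
    ]


if __name__ == "__main__":
    print(query("GeneDisease", symbol="geneA"))
```

Adding a source in this style means editing `gene_disease` itself, which is one way to see why GAV suits a known, stable set of sources.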

Page 8

Integration Approaches in Existing Systems

3. Navigation-based integration. It emerges from the fact that an increasing number of sources on the web require users to browse manually through several web pages and data sources in order to obtain the desired information. The specific paths essentially constitute workflows in which the output of a source is redirected to the input of the next source until the requested information is reached.
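A navigation path of this kind can be pictured as a chain of functions. In the sketch below the three lookup tables stand in for web sources and are entirely hypothetical; a real step would fetch and parse a page rather than read a dictionary.

```python
from functools import reduce

# Hypothetical stand-ins for three web sources. Each step takes the output of
# the previous one; real steps would fetch and parse web pages.
GENE_TO_PROTEIN = {"geneA": "protA"}
PROTEIN_TO_STRUCTURE = {"protA": "struct-001"}
STRUCTURE_TO_SUMMARY = {"struct-001": "hypothetical summary for struct-001"}


def lookup(table):
    """Build a workflow step that resolves its input in one source."""
    def step(key):
        if key not in table:
            raise KeyError(f"no entry for {key!r}")
        return table[key]
    return step


def run_path(steps, start):
    """Feed the output of each source into the next one along the path."""
    return reduce(lambda value, step: step(value), steps, start)


PATH = [lookup(GENE_TO_PROTEIN), lookup(PROTEIN_TO_STRUCTURE), lookup(STRUCTURE_TO_SUMMARY)]

if __name__ == "__main__":
    print(run_path(PATH, "geneA"))
```

Storing `PATH` as data is what lets a path be saved and replayed later with a different starting value.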

Page 9

Integration Approaches in Existing Systems

There are also other classifications besides the data model classification:

1. Aim of integration: portal oriented or query oriented;
2. Source model: complementary (horizontal) or overlapping (vertical; overlap exists and requires aggregation);
3. User model: low expertise, high expertise in query languages, or interactive query formulation;
4. Level of transparency: whether users choose the sources themselves or the choice of sources is hard-wired.

Page 10

Integration Approaches in Existing Systems

Page 11

Sequence Retrieval System (SRS)

SRS first parses flat files that contain structured text with field names. It then creates and stores an index for each field and uses these local indexes at query time to retrieve relevant entries. Although extensive indexes are kept locally for use by the query processor at query time, SRS is not a warehouse system, as the actual data is neither modified nor stored locally. The other main feature of SRS is that it keeps track of the cross-references between sources. It uses its own parsing component to identify links that exist between entries in different sources during parsing and indexing. These links are then used to suggest more results to a user after a query has been processed.

http://srs.embl-heidelberg.de:8000/srs5/
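The parse-then-index idea can be sketched as follows. This is not SRS code: the flat-file layout (two-letter field codes, entries separated by `//`), the `DR` cross-reference field, and both functions are simplifying assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical flat file: "FIELD value" lines, entries separated by "//".
FLAT_FILE = """\
ID E1
DE kinase binding protein
DR OTHERDB:X9
//
ID E2
DE membrane kinase
//
"""


def build_indexes(text):
    """Return (field indexes, cross-reference links) for a flat file.

    indexes[field][token] is the set of entry IDs whose field contains token.
    links[entry_id] lists references to entries in other sources.
    """
    indexes = defaultdict(lambda: defaultdict(set))
    links = defaultdict(list)
    for chunk in text.split("//"):
        fields = [line.split(None, 1) for line in chunk.strip().splitlines()]
        fields = [f for f in fields if len(f) == 2]
        entry_id = next((value for name, value in fields if name == "ID"), None)
        if entry_id is None:
            continue  # skip empty chunks, e.g. after the final separator
        for name, value in fields:
            if name == "DR":
                links[entry_id].append(value)
            for token in value.lower().split():
                indexes[name][token].add(entry_id)
    return indexes, links


def search(indexes, field, token):
    """Look up entry IDs through the local index only."""
    return sorted(indexes[field].get(token.lower(), set()))


if __name__ == "__main__":
    indexes, links = build_indexes(FLAT_FILE)
    print(search(indexes, "DE", "kinase"), links["E1"])
```

Only entry IDs and links are kept; the entry text itself is not stored, which mirrors the distinction the slide draws between indexing and warehousing.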

Page 12

BioKleisli

BioKleisli is a mediator-based integration system. The mediator on top of the underlying sources relies mainly on a high-level query language (CPL, which is more expressive than SQL) to query across several sources. Queries are decomposed into sub-queries, and source-specific wrappers map the sub-queries to specific heterogeneous sources, which are accessed through predefined atomic query functions.

BioKleisli doesn't use any global molecular biology schema or ontology.

It is aimed at performing horizontal integration. A query attribute is usually bound to an attribute in a single predetermined source, and there is essentially no content overlap.
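The decomposition described above can be sketched in Python rather than CPL. Everything in this example is hypothetical: the two wrapper functions, the attribute names, and the binding table are invented to show how a fixed attribute-to-source binding turns one query into per-source sub-queries.

```python
# Hypothetical wrappers: each exposes one atomic query function over one source.
def sequence_source(gene_id):
    return {"sequence": {"g1": "ATGC"}.get(gene_id)}


def location_source(gene_id):
    return {"chromosome": {"g1": "chr1"}.get(gene_id)}


# Horizontal integration: every attribute is bound to exactly one source.
ATTRIBUTE_BINDINGS = {
    "sequence": sequence_source,
    "chromosome": location_source,
}


def decompose(attributes):
    """Group the requested attributes into one sub-query per source wrapper."""
    sub_queries = {}
    for attribute in attributes:
        wrapper = ATTRIBUTE_BINDINGS[attribute]
        sub_queries.setdefault(wrapper, []).append(attribute)
    return sub_queries


def answer(gene_id, attributes):
    """Run each sub-query through its wrapper and merge the partial results."""
    result = {"gene_id": gene_id}
    for wrapper, wanted in decompose(attributes).items():
        partial = wrapper(gene_id)
        result.update({attribute: partial[attribute] for attribute in wanted})
    return result


if __name__ == "__main__":
    print(answer("g1", ["sequence", "chromosome"]))
```

Because each attribute has a single binding, there is no source selection step and no need to reconcile overlapping answers.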

Page 13

TAMBIS

TAMBIS is a mediator-based and ontology-driven integration system.

Query processing flow:

1. GUI (concepts defined in a global schema)
2. Source-independent GRAIL query
3. Query internal form
4. Source-dependent CPL query execution plan
5. Use the existing BioKleisli function library to access sources

Page 14

TAMBIS

The TAMBIS domain ontology mainly serves to ease the user's task of formulating queries, rather than to map schemas between sources.

Page 15

DiscoveryLink

DiscoveryLink is also a mediator-based integration system. Applications typically connect to DiscoveryLink and submit a query in SQL on the global schema, without necessarily being aware of the underlying sources. Underneath, a federated database query processor communicates with source-specific wrappers to determine the optimal plan for a given query.

The wrappers have two roles. They translate the source data models, and they provide source-specific information about query capabilities that helps the optimizer determine which parts of a query can be submitted to each source.
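The second wrapper role, reporting query capabilities, can be illustrated with a small Python sketch. It is not DiscoveryLink's interface: the `Predicate` and `Wrapper` classes and the `plan` and `execute` functions are hypothetical, and cost estimation is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str       # "=" or "<"
    value: object

    def holds(self, row):
        if self.op == "=":
            return row[self.column] == self.value
        if self.op == "<":
            return row[self.column] < self.value
        raise ValueError(f"unsupported operator {self.op!r}")


class Wrapper:
    """Hypothetical wrapper: exposes source rows and reports which operators
    the source can evaluate itself."""

    def __init__(self, rows, supported_ops):
        self.rows = rows
        self.supported_ops = set(supported_ops)

    def can_evaluate(self, predicate):
        return predicate.op in self.supported_ops

    def fetch(self, predicates):
        # Stands in for a remote call that filters at the source.
        return [r for r in self.rows if all(p.holds(r) for p in predicates)]


def plan(wrapper, predicates):
    """Split predicates into those pushed to the source and those kept local."""
    pushed = [p for p in predicates if wrapper.can_evaluate(p)]
    local = [p for p in predicates if not wrapper.can_evaluate(p)]
    return pushed, local


def execute(wrapper, predicates):
    """Fetch with the pushed predicates, then apply the rest locally."""
    pushed, local = plan(wrapper, predicates)
    rows = wrapper.fetch(pushed)
    return [r for r in rows if all(p.holds(r) for p in local)]


if __name__ == "__main__":
    source = Wrapper([{"name": "a", "mass": 10}, {"name": "b", "mass": 30}], {"="})
    print(execute(source, [Predicate("name", "=", "a"), Predicate("mass", "<", 20)]))
```

Pushing a predicate to the source reduces the data shipped back to the mediator, which is why the optimizer needs to know what each source can evaluate.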

Page 16

Other Existing Systems

BACIIS is an end-user product developed following a mediator-based approach combined with extensive use of a knowledge base (KB). The KB contains a domain ontology, which serves as a global schema, and maps the database schemas of the sources to that ontology.

BioNavigator is a commercially available navigation-based integration system. Users can define their preferred execution path for a query and reuse it later.

GUS is a warehouse-based integration system.

Page 17

Discussion

As mentioned earlier, warehouse-based approaches have two clear advantages. First, they simplify query optimization and processing by storing the data locally according to a single global schema. Second, because the data is stored locally, they enable users to add their own annotations to stored data and to specify filtering conditions to clean it.

However, it is still unclear how this user-friendly feature can be achieved efficiently, and more specifically how the data could effectively be validated or modified without human intervention and extensive domain expertise. Furthermore, data warehousing faces the big problem of handling updates in the sources, and an even bigger challenge when the data has been modified and annotated locally and therefore differs from the data in the sources.

Page 18

Discussion

Although GAV and LAV were introduced earlier for the mediator-based approach, no mediator-based integration systems implement them so far. Wrapper-oriented approaches are still relatively new.

Much like TAMBIS and BioKleisli, most of the current systems address only horizontal integration and don't consider the potential overlap between sources. DiscoveryLink attempts to solve the problem of selecting between several potential sources by using the information provided by wrappers to estimate querying costs. However, it does not consider overlap and coverage in optimization and source selection.

Page 19

Reference

Thomas Hernandez & Subbarao Kambhampati. Integration of Biological Sources: Current Systems and Challenges Ahead. Sigmod Record, Vol. 33, No. 3, September 2004.