(Almost) Hands-Off Information Integration for the Life Sciences
Ulf Leser, Felix Naumann
Presented By: Darshit Parekh
Table of Contents
• Introduction
 - Introduction to Life Sciences
 - Data Integration in Life Sciences
• ALADIN
 - Features of ALADIN
 - System Architecture for ALADIN
 - System Components Description
• A Case Study on Protein Data
 - Comparison of ALADIN with Other Existing Technologies
 - Advantages, Challenges and Bottlenecks in ALADIN
• Summary
 - Demo of the COLUMBA Project
1. Introduction to Life Sciences
Life science is the study of living things: plants and animals. It helps explain how living things relate to each other and to their surroundings. It is the in-depth study of living organisms. More specifically, the following fields are included: agrotechnology, animal science, bio-engineering, bioinformatics and biocomputing, cell biology, and neuroscience. In short, it is a broad field that studies life.
2. Data Integration in Life Sciences
Data integration in the life sciences has been a topic of intensive research. It is one of the few areas where a large number of complex databases are freely available. Research in this area is important because it advances medical technologies and hence human health and wellness. The data required for this kind of research and analysis is widely scattered over many heterogeneous and autonomous data sources. Life science databases have a number of traits that we need to consider when designing a data integration system for them. A life science database typically stores only one primary type of object, described by rich and sometimes deeply nested annotations. Consider the example of Swiss-Prot, which is essentially a protein database.
2. Data Integration in Life Sciences Contd….
The "hubs" of the life science data world provide links to a large number of databases, pointing from the primary objects (proteins) to further information such as protein structure, publications, taxonomic information, the gene encoding the protein, related diseases, known mutations, etc.
Such links are internally stored as a pair (target database, accession number) and presented as hyperlinks on web pages, helping the end user retrieve useful information from the protein database. With a large number of databases, identifying duplicates is an important task that must be performed carefully. Data integration in the life sciences follows either a manual data curation approach or a schema mapping and integration approach.
2. Data Integration in Life Sciences Contd….
• Manual Data Curation
Projects of this type achieve a high standard of quality in the integrated data through manual curation; they are data focused. The curation work is performed by experienced professionals. Swiss-Prot, for example, curates data on protein sequences from journal publications, submissions, personal communications, and other databases. Data-focused projects are typically managed by domain experts such as biologists. Very few database concepts or technologies are used; the data is kept in a text-like manner. Even if detailed schemata are developed, they cannot be used to query the database and obtain results.
2. Data Integration in Life Sciences Contd….
• Schema Focused
Projects of this type make use of database technology and are maintained by computer scientists, database analysts, database programmers, etc. They aim at providing integration middleware rather than building concrete databases. Techniques such as schema integration, schema mapping, and mediator-based query rewriting are used. Examples: TAMBIS, OPM. Query processing requires some sort of wrapper and a detailed semantic mapping between the heterogeneous source schemata and a global mediated schema. The mappings must be specified in a special language, which makes the work very difficult for domain experts.
2. Data Integration in Life Sciences Contd….
Data-focused projects are very successful on the biological scene, but this success comes at a price.
Schema-focused projects are hardly used in real life science projects and did not attract the attention they should have.
The major reason for this failure lies in the fact that they are schema centric. Schema mapping and integration also leave biologists in a fix, as they are not used to database technologies.
3. Introduction to ALADIN
ALADIN is a novel combination of data and text mining, schema matching, and duplicate detection, achieving a high level of automation.
It reveals previously unseen relationships between objects, thus directly supporting the discovery-based work of life science researchers.
ALADIN makes two major contributions: first, it is a knowledge resource for life science research; second, it poses challenges and bottlenecks for database research.
ALADIN: Almost Automatic Data Integration. Its novel features include automatic integration with minimal information loss while taking care of information quality. The proposed technique improves on both the data-focused and the schema-focused approaches.
4. Features of ALADIN
ALADIN's architecture consists of several components that together allow for the automatic integration of diverse data sources into a global, materialized repository of biological objects and the links between them.
The databases that ALADIN integrates hold data that is semi-structured and text centric.
ALADIN uses a relational database as its basis and can integrate any type of data source for which a relational representation exists or can be generated; XML files and flat files are handled by appropriate parsers. Integration in ALADIN does not depend on predefined integrity constraints structuring a schema. Instead, it uses techniques from schema matching and data and text mining to detect the relation containing the primary objects in each data source, to infer relationships between relations and objects within one source, and to infer relationships between objects in different sources.
4. Features of ALADIN Contd…
The system does not rely on any particular schema for its data sources; generic parsers are used, such as generic XML-to-relational mapping tools. ALADIN integrates a data source in a five-step process:
First step: import the data source into relational format.
Second step: from the relational representation, find the relation that represents the primary objects within the data source.
Third step: detect the fields containing annotations for the primary objects. Existing integrity constraints are used where available; otherwise they are guessed from data analysis.
Fourth step: search for links between the objects of the primary relations of different data sources. Links are generated based on the similarity of text fields.
Fifth step: detect duplicates across different data sources and remove them.
Once the data is imported, the process is almost entirely automatic.
4. Features of ALADIN Contd…
ALADIN also supports structured queries, detects and flags duplicate objects, and adds a wealth of additional links between objects that cannot be discovered when looking at each database in isolation.
ALADIN can be readily browsed without any schema information, which makes it useful in scenarios where an explorative approach is necessary.
5. System Architecture for ALADIN
The Main Components of the Architecture
1. Data Import
The data source needs to be imported into the relational database system. In cases where no downloadable import method exists, this is where ALADIN requires human work. This situation is rare; most of the time parsers are readily available. Schema design or re-design is not required.
2. Discovery of Primary Objects
This step identifies the primary objects stored in the primary relation. The primary relation contains data about the main objects of interest in the source, such as "proteins" or "diseases".
5. System Architecture for ALADIN
These relations store a primary key but contain no information about foreign keys.
3. Discovery of Secondary Objects
Secondary objects hold additional information about the primary objects. The cardinalities of relationships are determined in this step. At the end of this step, the internal structure of the newly imported data source is known. Errors are possible while identifying relationships; they can be minimized in ALADIN by introducing performance measure parameters.
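The cardinality determination can be illustrated with a small sketch that classifies a relationship from the observed key pairs alone (a toy heuristic; the paper does not spell out ALADIN's actual analysis):

```python
from collections import Counter

def cardinality(pairs):
    # Classify a relationship from observed (left_key, right_key) pairs.
    # Toy heuristic: infer the cardinality from the data itself rather
    # than from declared integrity constraints.
    left = Counter(l for l, _ in pairs)    # right-hand partners per left value
    right = Counter(r for _, r in pairs)   # left-hand partners per right value
    many_left = max(right.values()) > 1    # some right value has several left partners
    many_right = max(left.values()) > 1    # some left value has several right partners
    if many_left and many_right:
        return "N:M"
    if many_right:
        return "1:N"
    if many_left:
        return "N:1"
    return "1:1"
```

Running it on, say, (bioentry_id, seqfeature_id) pairs would classify that relationship as 1:N, since one entry has many features but each feature belongs to one entry.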
5. System Architecture for ALADIN
4. Link Discovery
In this step we search for attributes that are cross-references to objects in other data sources. Cross-references always point to primary objects in other data sources, as these are the objects with stable public IDs. The output of the second step is the necessary input for determining all possible link targets; this reduces the theoretical requirement of comparing all pairs of attributes from all sources.
5. System Architecture for ALADIN
5. Duplicate Detection
In this step, a search is initiated for a special kind of "link" between primary objects in different data sources that represent the same real-world object.
Such duplicate links are established if two objects are sufficiently similar according to some similarity measure.
Knowledge of duplicates enhances the user's browsing and querying experience.
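As a sketch of this idea, duplicate links could be established with a simple string similarity such as Jaccard similarity over character trigrams (the measure and the threshold here are illustrative; the paper does not fix concrete ones):

```python
def trigrams(s):
    # Character trigrams of a lower-cased string.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity over trigram sets: |intersection| / |union|.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def duplicate_link(obj_a, obj_b, threshold=0.6):
    # Establish a duplicate link if the two object descriptions are
    # sufficiently similar (threshold chosen for illustration only).
    return similarity(obj_a, obj_b) >= threshold
```

For instance, "hemoglobin subunit alpha" and "hemoglobin subunit alpha-1" share almost all trigrams and would be linked as duplicates, while unrelated protein names would not.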
6. Browse, Search and Query Engine
Once the data is integrated into the system, there are three modes of access. Browsing displays objects and the different kinds of links that users can follow.
5. System Architecture for ALADIN
Search allows users to run a full-text search on all stored data, as well as a focused search restricted to certain partitions of the data, such as a certain data source or a particular field.
Querying allows full SQL queries on the schemata as imported. Appropriate graphical user interfaces are provided for these operations.
7. Metadata Repository
It contains the known and discovered schemata, information about primary and secondary relations, statistical metadata, and sample data used to improve discovery efficiency.
Integration Steps in ALADIN
6. System Components Description
1. Data Import
Read the data source into the relational database; integrity constraints are not necessary at this time. Some databases, such as Swiss-Prot and the Gene Ontology, provide direct relational dump files. For text-based exports, readily downloadable parsers are available; examples are the BioSQL and BioPerl packages, which are able to read the Swiss-Prot and GenBank databases.
Some databases provide parsers with their export files, such as the OpenMMS parser for the protein structure database.
Databases exported as XML files can be parsed using a generic XML shredder.
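A generic XML shredder of the kind mentioned above can be sketched as follows (a minimal illustration, not the actual tool): each element type becomes a relation, and each element becomes a row with a surrogate key and a reference to its parent row.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(xml_text):
    # Flatten an XML document into relations: one table per element tag,
    # one row per element, with surrogate ids linking child to parent.
    root = ET.fromstring(xml_text)
    ids = count(1)
    tables = {}

    def walk(elem, parent_id):
        row_id = next(ids)
        row = {"id": row_id, "parent_id": parent_id,
               "text": (elem.text or "").strip()}
        row.update(elem.attrib)               # attributes become columns
        tables.setdefault(elem.tag, []).append(row)
        for child in elem:                    # recurse in document order
            walk(child, row_id)

    walk(root, None)
    return tables
```

Shredding `<entry accession="P11140"><name>X</name><name>Y</name></entry>` yields an `entry` relation with the accession as a column and a `name` relation whose two rows point back to the entry row.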
6. System Components Description
2. Discovery of Primary Relations
This component discovers primary objects without the use of parsers. Heuristic rules, applied to the schema and the actual data, are used to determine the primary relation; the rules are derived from previous data integration experience. An SQL query is run on each attribute. Attributes whose values are alphanumeric in nature are candidate accession numbers. Foreign key relationships and cardinalities are derived as well. The detected primary relation and the set of relationships are input to the next steps.
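The accession-number heuristic can be sketched like this (the rules and thresholds are illustrative guesses at the kind of test described, not ALADIN's actual parameters):

```python
def looks_like_accession(values):
    # Heuristic test for an accession-number attribute: non-null, unique,
    # alphanumeric values that mix letters and digits and share a constant
    # length -- e.g. 'P11140'. Illustrative rules, not ALADIN's actual ones.
    values = [v for v in values if v]
    if not values or len(set(values)) != len(values):   # must be unique
        return False
    if not all(v.isalnum() for v in values):            # alphanumeric only
        return False
    has_mix = all(any(c.isalpha() for c in v) and any(c.isdigit() for c in v)
                  for v in values)                      # letters AND digits
    same_length = len({len(v) for v in values}) == 1    # constant length
    return has_mix and same_length
```

A column of Swiss-Prot accessions such as P11140, P68871, Q9Y6K9 passes the test, while a surrogate integer key or a free-text name column fails it.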
6. System Components Description
3. Discovery of Secondary Relations
Secondary relations connect the objects in one source to additional information. This step determines the descriptions and annotations that are displayed together with the primary object in the web interface. The paths from the primary relation to each of the other relations of the data source are computed using the transitivity of relationships, and the paths are stored in the metadata repository.
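The path computation can be sketched as a breadth-first search over the discovered relationships, exploiting their transitivity (the relation names in the example are invented for illustration):

```python
from collections import deque

def annotation_paths(edges, primary):
    # Compute a join path from the primary relation to every reachable
    # relation. `edges` are discovered relationships (pairs of relation
    # names, treated as undirected); BFS yields shortest paths.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    paths = {primary: [primary]}
    queue = deque([primary])
    while queue:
        rel = queue.popleft()
        for nxt in graph.get(rel, ()):
            if nxt not in paths:              # first visit = shortest path
                paths[nxt] = paths[rel] + [nxt]
                queue.append(nxt)
    return paths
```

Given relationships bioentry-seqfeature and seqfeature-ontology_term, the path to ontology annotations is found transitively via seqfeature.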
4. Link Discovery
There are explicit links and implicit links. Explicit cross-references in life science databases are stored as accession numbers.
6. System Components Description
E.g. "ENSG00000042753" or "Uniprot:P11140". String matching techniques are needed. Many relationships are not explicitly stored; implicit relationships are discovered by searching for similar data values among the other data sources. Three types of comparisons are taken into consideration: first, DNA, RNA, or protein sequences are compared to each other; second, attributes containing longer text strings, such as textual descriptions, are analyzed using information retrieval and text mining; third, the use of standard vocabularies across the data sources. The discovered links are stored in the metadata repository to avoid repeated discovery and computation at query time.
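Explicit link discovery by string matching can be sketched with regular expressions over the accession formats quoted above (the two patterns are illustrative and cover only these formats):

```python
import re

# Illustrative patterns for the two accession formats quoted above;
# real databases define many more formats than shown here.
PATTERNS = {
    "Ensembl": re.compile(r"\bENSG\d{11}\b"),
    "UniProt": re.compile(r"\bUniprot:([A-Z][0-9][A-Z0-9]{3}[0-9])\b"),
}

def extract_links(text):
    # Find explicit cross-references in a text field by string matching;
    # each hit is stored as a (target database, accession number) pair.
    links = []
    for db, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            accession = m.group(1) if m.groups() else m.group(0)
            links.append((db, accession))
    return links
```

Applied to an annotation field mentioning "ENSG00000042753" and "Uniprot:P11140", this yields one link into Ensembl and one into UniProt.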
7. A Case Study on Protein Data
The design decisions for ALADIN are based on past experiences drawn from integration projects in this domain.
The paper discusses the recent COLUMBA project. COLUMBA is an integrated, relational data warehouse that annotates protein structures taken from the Protein Data Bank (PDB). The data covers the following properties of proteins: classification of structures, protein sequence and sequence features, functional annotation of proteins, and participation in metabolic pathways. The extraction and transformation from the initial data source schema into the target schema is currently hard-coded, which requires a lot of effort.
7. A Case Study on Protein Data
Understanding the schemata is very difficult, as they are poorly documented and often use structures that are hard to understand from the schema alone.
The transformation requires operations that are not defined in current schema mapping languages, so SQL and Java are used instead.
COLUMBA annotates protein structures from the Protein Data Bank and also includes the protein fold classification databases SCOP and CATH. Further functional and taxonomic annotation is available from Swiss-Prot, the Gene Ontology (GO), and the NCBI Taxonomy.
7. A Case Study on Protein Data
Part of the BioSQL schema. Arrows indicate candidates for primary relations and cross-references.
7. A Case Study on Protein Data
In the complete schema, there are three tables with an in-degree above five. One of them is BioEntry, which stores the primary objects:
BioEntry: Bioentry Id, Display Id, Identifier, Accession, Description, Entry Version, Bio Database Id (FK), Taxon Id (FK)
7. A Case Study on Protein Data
The OntologyTerm table stores functional descriptions, and the SeqFeature table stores a meta-representation of sequence features:
OntologyTerm: Ontology Term Id, Term Name, Term Definition, Term Identifier, Category Id (FK)
SeqFeature: Seqfeature Id, Seqfeature Rank, Bioentry Id, Ontology Term Id (FK), Seqfeature Source Id (FK)
7. A Case Study on Protein Data
The BioEntry table has an accession-number candidate whose values mix characters and integers and all have the same length.
The other fields in BioEntry are either non-unique (e.g. Taxon Id), have no alphabetic characters (e.g. Bioentry Id), or have varying length (e.g. name).
This table therefore qualifies in ALADIN as the primary relation. Primary and foreign keys are determined by analyzing the scopes of the different attributes storing surrogate keys. In the next step in COLUMBA, protein structures are connected to annotations using existing cross-references or by matching sequence similarity.
7. A Case Study on Protein Data
The BioSQL schema contains several attributes whose values are excellent candidates for finding implicit links:
OntologyTerm.Term_Definition, linking to biological ontologies;
BioEntry.Description, linking to disease- or gene-focused databases;
Biosequence.Biosequence_str, containing the actual DNA or protein sequence.
Duplicate detection is an important step here, as protein structures from the PDB are available in three different flavors: the original PDB files, a cleansed version available as dump files, and a cleansed version available with a parser. The PDB accession number is present in all three versions, so removing redundancy is easy in this case.
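Since all three flavors share the PDB accession number, the duplicate detection here reduces to grouping by that key, as in this sketch (the record layout is invented for illustration):

```python
def deduplicate(versions):
    # Merge PDB records from several flavors, keyed by the shared PDB
    # accession number: every accession ends up as one merged object
    # that remembers which source each record came from.
    merged = {}
    for source, records in versions.items():
        for rec in records:
            merged.setdefault(rec["accession"], {})[source] = rec
    return merged
```

Two flavors both carrying accession "1TIM" collapse into a single merged object with one record per source.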
7. A Case Study on Protein Data
• SQL example - fetch the accessions of all sequences from Swiss-Prot:
SELECT DISTINCT bioentry.accession
FROM bioentry
JOIN biodatabase USING (biodatabase_id)
WHERE biodatabase.name = 'swiss' -- or 'Swiss-Prot'
• SQL example - count the unique entries in GenBank:
SELECT COUNT(DISTINCT bioentry.accession)
FROM bioentry
JOIN biodatabase USING (biodatabase_id)
WHERE biodatabase.name = 'genbank'
8. Some Related Work
• Discovery Link, OPM, TAMBIS: make use of schema information rather than the data.
• SRS: the primary and secondary relations need to be stated explicitly in the parsers.
• GenMapper, BioMediator.
• The project closest to this proposal is the Revere project.
Comparison of ALADIN with the existing technologies
9. Challenges and Bottlenecks in ALADIN
The ALADIN system is a true challenge in terms of the size, number, and complexity of the data sources to be integrated.
Incorrectly identified primary or secondary relations lead to incorrect targets for link discovery.
Incorrect links in turn reduce the precision of duplicate detection. The issue of performance is not addressed in the paper: integrating new data sources into the existing ones is inefficient, as it involves a lot of computation, sorting, and schema matching, and it takes a long time to achieve the desired results.
Another important problem is that of data changes: when the data in a source changes, all links need to be recomputed, which involves a lot of overhead.
10. Summary
The ALADIN architecture and framework is a novel proposal for data integration in the life sciences.
The design is almost automatic, using text mining, schema matching, data mining, and information retrieval.
ALADIN offers clear added value to the biological user compared to the current data landscape.
It enables structured queries crossing several databases, suggests many new relationships interlinking all areas of the life sciences, and offers ranked search capabilities across databases for users who want Google-style information retrieval.
Consider a query such as: find the genes of a certain species on a certain chromosome that are connected to a disease via a protein whose function is known. For each of the object types in the query, several potential data sources exist, and this system takes all of them into account, a feature not supported by any other integration technology.
11. DEMO OF COLUMBA PROJECT