(Almost) Hands-Off Information Integration for the Life Sciences
Ulf Leser, Felix Naumann
Presented By: Darshit Parekh
Table of Contents
• Introduction
 - Introduction to Life Sciences
 - Data Integration in Life Sciences
• ALADIN
 - Features of ALADIN
 - System Architecture for ALADIN
 - System Components Description
• A Case Study on Protein Data
 - Comparison of ALADIN with Other Existing Technologies
 - Advantages, Challenges and Bottlenecks in ALADIN
• Summary
 - Demo of the COLUMBA Project
1. Introduction to Life Sciences
Life science is the study of living things: plants and animals. It helps explain how living things relate to each other and to their surroundings. It is the in-depth study of living organisms. More specifically, the following fields are included: agrotechnology, animal science, bio-engineering, bioinformatics and biocomputing, cell biology, and neuroscience. In short, it is a broad field that studies life.
2. Data Integration in Life Sciences
Data integration in the life sciences has been a topic of intensive research. It is one of the few areas where a large number of complex databases are freely available. Research in this area is important because it advances medical technologies and hence human health and wellness. The data required for this kind of research and analysis is widely scattered over many heterogeneous and autonomous data sources. Life science databases have a number of traits that we need to consider when designing a data integration system for them. A life science database typically stores only one primary type of object, described by rich and sometimes deeply nested annotations. Consider the example of Swiss-Prot, which is essentially a protein database.
2. Data Integration in Life Sciences Contd….
The "hubs" of the life science data world provide links to a large number of databases, pointing from the primary objects (proteins) to further information such as protein structure, publications, taxonomic information, the gene encoding the protein, related diseases, known mutations, etc.
Such links are internally stored as a pair (target database, accession number) and presented as hyperlinks on web pages, helping the end user retrieve useful information from the protein database. With a large number of databases, identifying duplicates is an important task that must be performed carefully. Data integration in the life sciences follows either a manual data curation approach or a schema mapping and integration approach.
2. Data Integration in Life Sciences Contd….
• Manual Data Curation
Projects of this type achieve a high standard of quality in the integrated data through manual curation; they are data focused. The curation work is performed by experienced professionals. Swiss-Prot, for example, curates data on protein sequences from journal publications, submissions, personal communications, and other databases. Data-focused projects are typically managed by domain experts such as biologists. Very few database concepts or technologies are used; the data is kept in a text-like manner. Even if detailed schemata are developed, they cannot be used to query the database and obtain results.
2. Data Integration in Life Sciences Contd….
• Schema Focused
Projects of this type make use of database technology and are maintained by computer scientists, database analysts, database programmers, etc. They aim at providing integration middleware rather than building concrete databases. Techniques such as schema integration, schema mapping, and mediator-based query rewriting are used. Examples: TAMBIS, OPM. Query processing requires some sort of wrapper and a detailed semantic mapping between the heterogeneous source schemata and a global mediated schema. The mappings must be specified in a special language, which makes the work very difficult for domain experts.
2. Data Integration in Life Sciences Contd….
Data-focused projects are very successful on the biological scene, but this success comes at a price.
Schema-focused projects are hardly used in real life science projects and did not attract the attention they should have.
The major reason for this failure lies in the fact that they are schema centric. Schema mapping and integration also leave biologists in a fix, as they are not used to database technologies.
3. Introduction to ALADIN
ALADIN is a novel combination of data and text mining, schema matching, and duplicate detection, achieving a high level of automation.
It reveals previously unseen relationships between objects, thus directly supporting the discovery-based work of life science researchers.
ALADIN makes two major contributions: first, it is a knowledge resource for life science research; second, it poses challenges and bottlenecks for database research.
ALADIN: Almost Automatic Data Integration. Its novel features include automatic integration with minimal information loss while taking care of information quality. The proposed technique improves on both the data-focused and the schema-focused approaches.
4. Features of ALADIN
ALADIN's architecture consists of several components that together allow for the automatic integration of diverse data sources into a global, materialized repository of biological objects and the links between them.
The databases that ALADIN integrates hold data that is semi-structured and text centric.
ALADIN uses a relational database as its basis and can integrate any type of data source for which a relational representation exists or can be generated; XML files and flat files are handled by appropriate parsers. Integration in ALADIN does not depend on predefined integrity constraints structuring a schema. Instead, it uses techniques from schema matching and data and text mining to detect the relation containing the primary objects in each data source, to infer relationships between relations and objects within one source, and to infer relationships between objects in different sources.
4. Features of ALADIN Contd…
The system does not rely on any particular schema for its data sources; generic parsers are used, such as generic XML-to-relational mapping tools. ALADIN integrates a data source in a five-step process:
First step: import the data source into relational format.
Second step: from the relational representation, find the relation that represents the primary objects within the data source.
Third step: detect the fields containing annotations for the primary objects. Existing integrity constraints are used where available; otherwise they are guessed from data analysis.
Fourth step: search for links between the objects of the primary relations of different data sources. Links are generated based on the similarity of text fields.
Fifth step: detect duplicates across different data sources and remove them.
Once the data is imported, the process is almost entirely automatic.
4. Features of ALADIN Contd…
ALADIN also supports structured queries, detects and flags duplicate objects, and adds a wealth of additional links between objects that cannot be discovered when looking at each database in isolation.
ALADIN can be readily browsed without any schema information, which makes it useful in scenarios where an explorative approach is necessary.
5. System Architecture for ALADIN
The Main Components of the Architecture
1. Data Import
The data source needs to be imported into the relational database system. In cases where no downloadable import method exists, this is where ALADIN requires human work. This situation is rare; most of the time parsers are readily available. Schema design or re-design is not required.
2. Discovery of Primary Objects
This step identifies the primary objects stored in the primary relation. The primary relation contains data about the main objects of interest in the source, such as "proteins" or "diseases".
5. System Architecture for ALADIN
These relations store a primary key but contain no information about foreign keys.
3. Discovery of Secondary Objects
Secondary objects hold additional information about the primary objects. The cardinalities of relationships are determined in this step. At the end of this step, the internal structure of the newly imported data source is known. Errors are possible while identifying relationships; they can be minimized in ALADIN by introducing performance measure parameters.
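The cardinality determination can be illustrated with a small sketch that classifies a relationship from the observed key pairs alone (a toy heuristic; the paper does not spell out ALADIN's actual analysis):

```python
from collections import Counter

def cardinality(pairs):
    # Classify a relationship from observed (left_key, right_key) pairs.
    # Toy heuristic: infer the cardinality from the data itself rather
    # than from declared integrity constraints.
    left = Counter(l for l, _ in pairs)    # right-hand partners per left value
    right = Counter(r for _, r in pairs)   # left-hand partners per right value
    many_left = max(right.values()) > 1    # some right value has several left partners
    many_right = max(left.values()) > 1    # some left value has several right partners
    if many_left and many_right:
        return "N:M"
    if many_right:
        return "1:N"
    if many_left:
        return "N:1"
    return "1:1"
```

Running it on, say, (bioentry_id, seqfeature_id) pairs would classify that relationship as 1:N, since one entry has many features but each feature belongs to one entry.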
5. System Architecture for ALADIN
4. Link Discovery
In this step we search for attributes that are cross-references to objects in other data sources. Cross-references always point to primary objects in other data sources, as these are the objects with stable public IDs. The output of the second step is the necessary input for determining all possible link targets; this reduces the theoretical requirement of comparing all pairs of attributes from all sources.
5. System Architecture for ALADIN
5. Duplicate Detection
In this step, a search is initiated for a special kind of "link" between primary objects in different data sources that represent the same real-world object.
Such duplicate links are established if two objects are sufficiently similar according to some similarity measure.
Knowledge of duplicates enhances the user's browsing and querying experience.
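As a sketch of this idea, duplicate links could be established with a simple string similarity such as Jaccard similarity over character trigrams (the measure and the threshold here are illustrative; the paper does not fix concrete ones):

```python
def trigrams(s):
    # Character trigrams of a lower-cased string.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity over trigram sets: |intersection| / |union|.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def duplicate_link(obj_a, obj_b, threshold=0.6):
    # Establish a duplicate link if the two object descriptions are
    # sufficiently similar (threshold chosen for illustration only).
    return similarity(obj_a, obj_b) >= threshold
```

For instance, "hemoglobin subunit alpha" and "hemoglobin subunit alpha-1" share almost all trigrams and would be linked as duplicates, while unrelated protein names would not.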
6. Browse, Search and Query Engine
Once the data is integrated into the system, there are three modes of access. Browsing displays objects and the different kinds of links that users can follow.
5. System Architecture for ALADIN
Search allows users to run a full-text search on all stored data, as well as a focused search restricted to certain partitions of the data, such as a certain data source or a particular field.
Querying allows full SQL queries on the schemata as imported. Appropriate graphical user interfaces are provided for these operations.
7. Metadata Repository
It contains the known and discovered schemata, information about primary and secondary relations, statistical metadata, and sample data used to improve discovery efficiency.
Integration Steps in ALADIN
6. System Components Description
1. Data Import
Read the data source into the relational database; integrity constraints are not necessary at this time. Some databases, such as Swiss-Prot and the Gene Ontology, provide direct relational dump files. For text-based exports, readily downloadable parsers are available; examples are the BioSQL and BioPerl packages, which are able to read the Swiss-Prot and GenBank databases.
Some databases provide parsers with their export files, such as the OpenMMS parser for the protein structure database.
Databases exported as XML files can be parsed using a generic XML shredder.
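A generic XML shredder of the kind mentioned above can be sketched as follows (a minimal illustration, not the actual tool): each element type becomes a relation, and each element becomes a row with a surrogate key and a reference to its parent row.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(xml_text):
    # Flatten an XML document into relations: one table per element tag,
    # one row per element, with surrogate ids linking child to parent.
    root = ET.fromstring(xml_text)
    ids = count(1)
    tables = {}

    def walk(elem, parent_id):
        row_id = next(ids)
        row = {"id": row_id, "parent_id": parent_id,
               "text": (elem.text or "").strip()}
        row.update(elem.attrib)               # attributes become columns
        tables.setdefault(elem.tag, []).append(row)
        for child in elem:                    # recurse in document order
            walk(child, row_id)

    walk(root, None)
    return tables
```

Shredding `<entry accession="P11140"><name>X</name><name>Y</name></entry>` yields an `entry` relation with the accession as a column and a `name` relation whose two rows point back to the entry row.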
6. System Components Description
2. Discovery of Primary Relations
This component discovers primary objects without the use of parsers. Heuristic rules, applied to the schema and the actual data, are used to determine the primary relation; the rules are derived from previous data integration experience. An SQL query is run on each attribute. Attributes whose values are alphanumeric in nature are candidate accession numbers. Foreign key relationships and cardinalities are derived as well. The detected primary relation and the set of relationships are input to the next steps.
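The accession-number heuristic can be sketched like this (the rules and thresholds are illustrative guesses at the kind of test described, not ALADIN's actual parameters):

```python
def looks_like_accession(values):
    # Heuristic test for an accession-number attribute: non-null, unique,
    # alphanumeric values that mix letters and digits and share a constant
    # length -- e.g. 'P11140'. Illustrative rules, not ALADIN's actual ones.
    values = [v for v in values if v]
    if not values or len(set(values)) != len(values):   # must be unique
        return False
    if not all(v.isalnum() for v in values):            # alphanumeric only
        return False
    has_mix = all(any(c.isalpha() for c in v) and any(c.isdigit() for c in v)
                  for v in values)                      # letters AND digits
    same_length = len({len(v) for v in values}) == 1    # constant length
    return has_mix and same_length
```

A column of Swiss-Prot accessions such as P11140, P68871, Q9Y6K9 passes the test, while a surrogate integer key or a free-text name column fails it.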
6. System Components Description
3. Discovery of Secondary Relations
Secondary relations connect the objects in one source to additional information. This step determines the descriptions and annotations that are displayed together with the primary object in the web interface. The paths from the primary relation to each of the other relations of the data source are computed using the transitivity of relationships, and the paths are stored in the metadata repository.
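The path computation can be sketched as a breadth-first search over the discovered relationships, exploiting their transitivity (the relation names in the example are invented for illustration):

```python
from collections import deque

def annotation_paths(edges, primary):
    # Compute a join path from the primary relation to every reachable
    # relation. `edges` are discovered relationships (pairs of relation
    # names, treated as undirected); BFS yields shortest paths.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    paths = {primary: [primary]}
    queue = deque([primary])
    while queue:
        rel = queue.popleft()
        for nxt in graph.get(rel, ()):
            if nxt not in paths:              # first visit = shortest path
                paths[nxt] = paths[rel] + [nxt]
                queue.append(nxt)
    return paths
```

Given relationships bioentry-seqfeature and seqfeature-ontology_term, the path to ontology annotations is found transitively via seqfeature.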
4. Link Discovery
There are explicit links and implicit links. Explicit cross-references in life science databases are stored as accession numbers.
6. System Components Description
E.g. "ENSG00000042753" or "Uniprot:P11140". String matching techniques are needed. Many relationships are not explicitly stored; implicit relationships are discovered by searching for similar data values among the other data sources. Three types of comparisons are taken into consideration: first, DNA, RNA, or protein sequences are compared to each other; second, attributes containing longer text strings, such as textual descriptions, are analyzed using information retrieval and text mining; third, the use of standard vocabularies across the data sources. The discovered links are stored in the metadata repository to avoid repeated discovery and computation at query time.
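Explicit link discovery by string matching can be sketched with regular expressions over the accession formats quoted above (the two patterns are illustrative and cover only these formats):

```python
import re

# Illustrative patterns for the two accession formats quoted above;
# real databases define many more formats than shown here.
PATTERNS = {
    "Ensembl": re.compile(r"\bENSG\d{11}\b"),
    "UniProt": re.compile(r"\bUniprot:([A-Z][0-9][A-Z0-9]{3}[0-9])\b"),
}

def extract_links(text):
    # Find explicit cross-references in a text field by string matching;
    # each hit is stored as a (target database, accession number) pair.
    links = []
    for db, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            accession = m.group(1) if m.groups() else m.group(0)
            links.append((db, accession))
    return links
```

Applied to an annotation field mentioning "ENSG00000042753" and "Uniprot:P11140", this yields one link into Ensembl and one into UniProt.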
7. A Case Study on Protein Data
The design decisions for ALADIN are based on past experiences drawn from integration projects in this domain.
The paper discusses the recent COLUMBA project. COLUMBA is an integrated, relational data warehouse that annotates protein structures taken from the Protein Data Bank (PDB). The data covers the following properties of proteins: classification of structures, protein sequence and sequence features, functional annotation of proteins, and participation in metabolic pathways. The extraction and transformation from the initial data source schema into the target schema is currently hard-coded, which requires a lot of effort.
7. A Case Study on Protein Data
Understanding the schemata is very difficult, as they are poorly documented and often use structures that are hard to understand from the schema alone.
The transformation requires operations that are not defined in current schema mapping languages, so SQL and Java are used instead.
COLUMBA annotates protein structures from the Protein Data Bank and also includes the protein fold classification databases SCOP and CATH. Further functional and taxonomic annotation is available from Swiss-Prot, the Gene Ontology (GO), and the NCBI Taxonomy.
7. A Case Study on Protein Data
Part of the BioSQL schema. Arrows indicate candidates for primary relations and cross-references.
7. A Case Study on Protein Data
In the complete schema, there are three tables with an in-degree above five. One of them is BioEntry, which stores the primary objects:
BioEntry: Bioentry Id, Display Id, Identifier, Accession, Description, Entry Version, Bio Database Id (FK), Taxon Id (FK)
7. A Case Study on Protein Data
The OntologyTerm table stores functional descriptions, and the SeqFeature table stores a meta-representation of sequence features:
OntologyTerm: Ontology Term Id, Term Name, Term Definition, Term Identifier, Category Id (FK)
SeqFeature: Seqfeature Id, Seqfeature Rank, Bioentry Id, Ontology Term Id (FK), Seqfeature Source Id (FK)
7. A Case Study on Protein Data
The BioEntry table has an accession-number candidate whose values mix characters and integers and all have the same length.
The other fields in BioEntry are either non-unique (e.g. Taxon Id), have no alphabetic characters (e.g. Bioentry Id), or have varying length (e.g. name).
This table therefore qualifies in ALADIN as the primary relation. Primary and foreign keys are determined by analyzing the scopes of the different attributes storing surrogate keys. In the next step in COLUMBA, protein structures are connected to annotations using existing cross-references or by matching sequence similarity.
7. A Case Study on Protein Data
The BioSQL schema contains several attributes whose values are excellent candidates for finding implicit links:
OntologyTerm.Term_Definition, linking to biological ontologies;
BioEntry.Description, linking to disease- or gene-focused databases;
Biosequence.Biosequence_str, containing the actual DNA or protein sequence.
Duplicate detection is an important step here, as protein structures from the PDB are available in three different flavors: the original PDB files, a cleansed version available as dump files, and a cleansed version available with a parser. The PDB accession number is present in all three versions, so removing redundancy is easy in this case.
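Since all three flavors share the PDB accession number, the duplicate detection here reduces to grouping by that key, as in this sketch (the record layout is invented for illustration):

```python
def deduplicate(versions):
    # Merge PDB records from several flavors, keyed by the shared PDB
    # accession number: every accession ends up as one merged object
    # that remembers which source each record came from.
    merged = {}
    for source, records in versions.items():
        for rec in records:
            merged.setdefault(rec["accession"], {})[source] = rec
    return merged
```

Two flavors both carrying accession "1TIM" collapse into a single merged object with one record per source.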
7. A Case Study on Protein Data
• SQL example - fetch the accessions of all sequences from Swiss-Prot:
SELECT DISTINCT bioentry.accession
FROM bioentry
JOIN biodatabase USING (biodatabase_id)
WHERE biodatabase.name = 'swiss' -- or 'Swiss-Prot'
• SQL example - count the unique entries in GenBank:
SELECT COUNT(DISTINCT bioentry.accession)
FROM bioentry
JOIN biodatabase USING (biodatabase_id)
WHERE biodatabase.name = 'genbank'
8. Some Related Work
• Discovery Link, OPM, TAMBIS: make use of schema information rather than the data.
• SRS: the primary and secondary relations need to be stated explicitly in the parsers.
• GenMapper, BioMediator.
• The project closest to this proposal is the Revere project.
Comparison of ALADIN with the existing technologies
9. Challenges and Bottlenecks in ALADIN
The ALADIN system is a true challenge in terms of the size, number, and complexity of the data sources to be integrated.
Incorrectly identified primary or secondary relations lead to incorrect targets for link discovery.
Incorrect links in turn reduce the precision of duplicate detection. The issue of performance is not addressed in the paper: integrating new data sources into the existing ones is inefficient, as it involves a lot of computation, sorting, and schema matching, and it takes a long time to achieve the desired results.
Another important problem is that of data changes: when the data in a source changes, all links need to be recomputed, which involves a lot of overhead.
10. Summary
The ALADIN architecture and framework is a novel proposal for data integration in the life sciences.
The design is almost automatic, using text mining, schema matching, data mining, and information retrieval.
ALADIN offers clear added value to the biological user compared to the current data landscape.
It enables structured queries crossing several databases, suggests many new relationships interlinking all areas of the life sciences, and offers ranked search capabilities across databases for users who want Google-style information retrieval.
Consider a query such as: find the genes of a certain species on a certain chromosome that are connected to a disease via a protein whose function is known. For each of the object types in the query, several potential data sources exist, and this system takes all of them into account, a feature not supported by any other integration technology.
11. DEMO OF COLUMBA PROJECT