Bioinformatics

The application of computational techniques to understand and organise the information associated with biological macromolecules.

Aims of Bioinformatics

1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced

2. to develop tools and resources that aid in the analysis of data

3. to conduct global analyses of all the available data, with the aim of uncovering common principles that apply across many systems and highlighting novel features

Source of data

1. DNA or Protein sequences

2. Macromolecular structures

3. Results of functional genomics and proteomics experiments (gene expression data)

DNA or Protein sequences

DNA sequences are strings of the 4 base letters that make up genes, each typically 1,000 bases long. The largest database contains at least 27 million entries.

Protein sequences are strings of the 20 amino-acid letters. At present more than 400,000 protein sequences are known.

As of April 2001:

- the GenBank database of nucleic acid sequences contained 11,546,000 entries

- the SwissProt database of protein sequences contained 95,320 entries

These databases doubled in size in 15 months.
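The 15-month doubling time can be turned into a rough projection. The sketch below assumes that growth stays exponential at that same rate, which the slide states only for the observed period; the function name `projected_entries` is illustrative.

```python
def projected_entries(current: float, months_ahead: float,
                      doubling_months: float = 15.0) -> float:
    """Project database size, ASSUMING a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


genbank_april_2001 = 11_546_000  # entries, figure quoted above

# One and two doubling periods after April 2001 (extrapolation, not data).
print(round(projected_entries(genbank_april_2001, 15)))
print(round(projected_entries(genbank_april_2001, 30)))
```

The projection is only as good as the constant-doubling assumption; it is not a measured figure.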

Size of data

Biological data are being produced at a phenomenal rate

Anthony Kerlavage of Celera recently stated that an experimental laboratory can produce over 100 gigabytes of data per day with ease.

Biological processing power

This incredible processing power has been matched by developments in computer technology.

Areas of improvements

- CPU (faster computations)

- disk storage (better data storage)

- Internet (revolutionised the methods for accessing and exchanging data)

Macromolecular structure

There are currently 15,000 entries in the Protein Data Bank (PDB).

The PDB contains atomic structures (xyz-coordinates) of proteins, DNA and RNA solved by X-ray crystallography and NMR.

A typical PDB file contains the xyz-coordinates of ca. 2000 atoms.

Gene expression data

These experiments measure the amount of mRNA (functional genomics) or protein (proteomics) that is produced by the cell under different conditions, at different stages of the cell cycle and in different cell types in multi-cellular organisms.

One of the largest datasets available comprises approximately 20 time-point measurements for 6,000 genes (yeast).


From an experimental point of view, it is possible to determine the expression levels of almost every gene in a given cell on a whole-genome level.

However, there is currently no central repository for these data and public availability is limited.

Biological data

The diversity in the size and complexity of different datasets.

Although macromolecular structures and gene expression experiments yield much more biological information than raw sequence data, sequence-based data are invariably more abundant than any other kind.

Why?

Because of the relative ease with which they can be produced.

Why?

Because they can be managed easily both by biologists and by computer scientists, even those with very little biological background.

Why?

On the other hand, gene expression data are far more complex to manage:


1. biologists rarely achieve mathematical competence beyond elementary calculus and maybe a few statistical formulae.

2. although everybody uses a computer, biologists rarely use anything but standard commercial software


3. people with a non-biological background can find it surprisingly difficult to master the complex and apparently unconnected information that is the working knowledge of every biologist


4. Genomic-scale data include biochemical information on metabolic pathways, regulatory networks, protein-protein interactions and data from two-hybrid experiments and systematic knockouts of individual genes

Integration

Integration of multiple sources of data.

At a basic level, this problem is frequently addressed by providing external links to other databases.

At a more advanced level, an integrated access across several data sources is provided.

Data organisation

The first biological databases were simple flat files.

At the moment most of them are relational databases with Web-page interfaces.
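The move from flat files to relational storage can be sketched as follows. This is a minimal illustration, assuming a made-up tagged-line record format (two-letter tag, `//` as entry terminator) with invented identifiers and sequences; it does not reproduce the format or schema of any actual database.

```python
import sqlite3

# Hypothetical flat-file records; IDs and sequences are invented.
FLAT_FILE = """\
ID   EX0001
DE   Example kinase
OS   Saccharomyces cerevisiae
SQ   MKTAYIAKQR
//
ID   EX0002
DE   Example transporter
OS   Homo sapiens
SQ   MSLLNVPAGK
//
"""


def parse_flat_file(text):
    """Yield one dict per entry, keyed by the two-letter line tag."""
    entry = {}
    for line in text.splitlines():
        if line.startswith("//"):      # end of the current entry
            if entry:
                yield entry
            entry = {}
        elif line.strip():
            entry[line[:2]] = line[5:].strip()


def load_into_db(entries):
    """Load parsed entries into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "id TEXT PRIMARY KEY, description TEXT, organism TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        [(e["ID"], e["DE"], e["OS"], e["SQ"]) for e in entries],
    )
    conn.commit()
    return conn


conn = load_into_db(parse_flat_file(FLAT_FILE))
rows = conn.execute(
    "SELECT id, description FROM protein WHERE organism = ?",
    ("Homo sapiens",),
).fetchall()
print(rows)
```

With the flat file, answering "which entries come from a given organism" means scanning every record; once the data sit in a table, the same question is a single query.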

Data and Software Tools

For example:

- software for gene finding (identification of coding regions)

- software for similarity searches

- multiple sequence alignments and searching for functional domains

- homology modeling

- calculations of surface and volume shapes, and analysis of protein interactions with DNA, RNA, other proteins or drugs (chemoinformatics)

Similarity searching

Having sequenced a particular protein, it is of interest to compare it with previously characterised sequences.

This needs more than a simple text-based search: these programs must consider what constitutes a biologically significant match.

Biologically significant match:

- two sequences share a common function

- two sequences share a common evolutionary history (homologs)
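To show why similarity searching goes beyond exact text matching, here is a minimal sketch of a local alignment score in the style of Smith-Waterman. The scoring values (`match`, `mismatch`, `gap`) and the toy sequences are assumptions chosen for illustration; production tools use substitution matrices and statistical significance estimates that are not shown here.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy "database" of invented sequences, ranked against a query.
database = {"seq_a": "MKTAYIAKQR", "seq_b": "GGGGGGGGGG", "seq_c": "TAYIAK"}
query = "KTAYIAK"
ranked = sorted(database,
                key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked)
```

A high score only says that two sequences look alike; whether the match is biologically significant (shared function or shared ancestry) still has to be judged separately.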

Homology modeling

At the structural level, the number of different tertiary structures is predicted to be finite: estimates range between 1,000 and 10,000 folds.

A structure can be predicted in a homology-based manner, by comparison with known structures (3-D structural alignments).

Although the number of structures in the PDB db has increased exponentially, the rate of discovery of novel folds has actually decreased.

Ab initio structure prediction

Prediction of the 3-D structure is based on the protein sequence only, e.g. on the propensity of certain amino-acid combinations to produce secondary structural elements.
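The idea of residue propensities can be illustrated with a sliding-window average. This is only a sketch: the propensity values below are placeholders chosen for the example, not a published scale, and the names `HELIX_PROPENSITY`, `DEFAULT_PROPENSITY` and `helix_windows` are hypothetical. Actual ab initio methods are far more elaborate.

```python
# Illustrative helix propensities (placeholder values, NOT a published scale).
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "G": 0.6, "P": 0.6}
DEFAULT_PROPENSITY = 1.0  # assumed neutral value for residues not listed


def helix_windows(sequence, window=4, threshold=1.2):
    """Return start indices of windows whose mean propensity exceeds threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        residues = sequence[start:start + window]
        mean = sum(HELIX_PROPENSITY.get(r, DEFAULT_PROPENSITY)
                   for r in residues) / window
        if mean > threshold:
            hits.append(start)
    return hits


# Invented sequence: a helix-favouring stretch flanked by G/P residues.
print(helix_windows("GPGPAELMAEGP"))  # [4, 5, 6, 7]
```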

Data exploration

Finding relationships between different proteins:

- Analysis of one type of data to infer and understand the observations for another type of data

- Comparative analysis to do classification

Expansion of biological analysis in two dimensions, depth and breadth

Expansion of biological analysis in two dimensions: depth

This approach takes a single gene and follows through an analysis that maximises our understanding of the protein it encodes.

Prediction algorithms can then be used to calculate the structure and to form hypotheses about its function.

Geometry calculations can define the shape of the protein’s surface and identify or design ligands that can become drugs specifically altering the protein’s function.

Example: Rational drug design

Expansion of biological analysis in two dimensions: breadth

This approach can extract sequence patterns or structural templates that define a family of proteins sharing a common property.

It can also be used to construct phylogenetic trees that trace evolution, e.g. that of the SARS virus.

Example: comparison of a gene or a gene product with others.


Sequence analysis

Techniques include mainly string comparison methods

Motif and pattern identification and classification

depend on:

- Machine learning

- Clustering and data mining techniques
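Before any learning or clustering is applied, a known motif can be located with plain pattern matching. The sketch below uses this simpler rule-based technique, not the machine-learning methods listed above. It expresses the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P} as a regular expression; the test sequence is invented.

```python
import re

# N-{P}-[ST]-{P}: Asn, anything but Pro, Ser or Thr, anything but Pro.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence):
    """Return (0-based position, matched text) for every occurrence."""
    return [(m.start(), m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]


print(find_motif("MKNGSAQNPTLNLTV"))  # [(2, 'NGSA'), (11, 'NLTV')]
```

The candidate at position 7 (`NPTL`) is rejected because a proline follows the asparagine, which is exactly the kind of constraint a plain substring search cannot express.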

3-D structural analysis

includes:

- Euclidean geometry calculations

- Basic application of physical chemistry

- Graphical representation of surfaces and volumes

- Structural comparison (3-D matching)
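The geometry involved can be sketched with two small functions: the Euclidean distance between two atoms and the root-mean-square deviation (RMSD) between two sets of coordinates. This sketch assumes the two structures are already superimposed and that atoms are paired by index; finding the optimal superposition is a separate step not shown here. The coordinates are invented.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length coordinate lists.

    Assumes the structures are already superimposed and atoms are
    paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(distance(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 2.0), (1.5, 0.0, 2.0), (3.0, 0.0, 2.0)]
print(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # 5.0
print(rmsd(structure_a, structure_b))              # 2.0
```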

Bio Informatics

This unexpected union between the two subjects is attributed to the fact that life itself is an information technology.

An organism's physiology is largely determined by its genes, which at their most basic can be viewed as digital information.
