bioinformatics

Bioinformatics The application of computational techniques to understand and organise the information associated with biological macromolecules

Upload: jerry-dale

Post on 30-Dec-2015


TRANSCRIPT

Page 1: Bioinformatics

Bioinformatics

The application of computational techniques to understand and organise

the information associated with biological macromolecules

Page 2: Bioinformatics

Aims of Bioinformatics

1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced

2. to develop tools and resources that aid in the analysis of data

3. to conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features

Page 3: Bioinformatics

Aims of Bioinformatics

1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced

2. to develop tools and resources that aid in the analysis of data

3. to conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features

Page 4: Bioinformatics

Source of data

1. DNA or Protein sequences

2. Macromolecular structures

3. Results of functional genomics and proteomics experiments (gene expression data)

Page 5: Bioinformatics

DNA or Protein sequences

DNA sequences are strings of the 4 base letters that make up genes, each typically 1,000 bases long. The largest database contains at least 27 million entries.

Protein sequences are strings of the 20 amino acid letters. At present more than 400,000 protein sequences are known.
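Because both kinds of sequence are plain strings over a small alphabet, they map directly onto ordinary string handling. The following is a minimal sketch; the example sequences and the function names are made up for illustration.

```python
# The 4 DNA base letters and the 20 standard amino acid letters.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence, alphabet):
    """Return True if every letter of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def base_counts(dna):
    """Count how often each of the 4 bases occurs in a DNA string."""
    dna = dna.upper()
    return {base: dna.count(base) for base in "ACGT"}


# Hypothetical sequences, not taken from any database.
gene_fragment = "ATGGCCATTGTAATGGGCCGC"
peptide = "MAIVMGR"

print(is_valid(gene_fragment, DNA_ALPHABET))
print(is_valid(peptide, PROTEIN_ALPHABET))
print(base_counts(gene_fragment))
```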

Page 6: Bioinformatics

Size of data

Biological data are being produced at a phenomenal rate.

As of April 2001:

- the GenBank database of nucleic acid sequences contained 11,546,000 entries

- the SwissProt database of protein sequences contained 95,320 entries

These databases doubled in size in 15 months.
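To get a feel for what a 15-month doubling time means, the arithmetic can be sketched as below. This assumes, purely for illustration, that growth continues at the same exponential rate; the slides do not claim that it does, and the function name is hypothetical.

```python
def projected_entries(current_entries, months_ahead, doubling_months=15):
    """Extrapolate a database size assuming a constant doubling time."""
    return current_entries * 2 ** (months_ahead / doubling_months)


genbank_april_2001 = 11_546_000  # figure quoted on the slide

# One doubling period later the count would be twice as large.
print(round(projected_entries(genbank_april_2001, 15)))
# Two doubling periods (30 months) would give four times as many entries.
print(round(projected_entries(genbank_april_2001, 30)))
```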

Page 7: Bioinformatics

Size of data

Anthony Kerlavage of Celera recently stated that an experimental laboratory can produce over 100 gigabytes of data per day with ease.

Page 8: Bioinformatics

Biological processing power

This incredible processing power has been matched by developments in computer technology.

Page 9: Bioinformatics

Areas of improvement

- CPU (faster computations)

- disk storage (better data storage)

- Internet (revolutionised the methods for accessing and exchanging data)

Page 10: Bioinformatics

Source of data

1. DNA or Protein sequences

2. Macromolecular structures

3. Results of functional genomics and proteomics experiments (gene expression data)

Page 11: Bioinformatics

Macromolecular structure

There are currently 15,000 entries in the Protein Data Bank (PDB).

The PDB database contains atomic structures (xyz-coordinates) of proteins, DNA and RNA solved by X-ray crystallography and NMR.

A typical PDB file contains the xyz-coordinates of ca. 2,000 atoms.
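As a rough illustration of what "xyz-coordinates" look like in practice, the sketch below reads coordinates from PDB-style ATOM records, where x, y and z occupy fixed columns 31-38, 39-46 and 47-54. The two sample records carry made-up coordinates, and the function names are hypothetical.

```python
import math

# Two illustrative ATOM records (coordinates invented) plus a terminator.
SAMPLE_PDB_LINES = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "TER",
]


def parse_atom_coordinates(lines):
    """Extract (x, y, z) tuples from ATOM/HETATM records by column position."""
    coords = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def distance(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)


atoms = parse_atom_coordinates(SAMPLE_PDB_LINES)
print(atoms)
print(distance(atoms[0], atoms[1]))
```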

Page 12: Bioinformatics

Source of data

1. DNA or Protein sequences

2. Macromolecular structures

3. Results of functional genomics and proteomics experiments (gene expression data)

Page 13: Bioinformatics

Gene expression data

These experiments measure the amount of mRNA (functional genomics) or protein (proteomics) that is produced by the cell under different conditions, at different stages of the cell cycle and in different cell types in multi-cellular organisms.

One of the largest datasets available comprises approximately 20 time-point measurements for 6,000 genes (yeast).

Page 14: Bioinformatics

Gene expression data

From an experimental point of view, it is possible to determine the expression levels of almost every gene in a given cell on a whole-genome level.

However, there is currently no central repository for these data, and public availability is limited.

Page 15: Bioinformatics

Biological data

Different datasets vary greatly in size and complexity.

Although macromolecular structures and gene expression experiments provide much more biological information than raw sequence data, there are invariably more sequence-based data than any other kind.

Why?

Page 16: Bioinformatics

Why?

Because of the relative ease with which they can be produced.

Page 17: Bioinformatics

Why?

Because they can be easily managed both by biologists and by computer scientists, even those with very little biological background.

Page 18: Bioinformatics

Gene expression data

On the other hand, gene expression data are far more complex to manage, and:

1. biologists rarely achieve mathematical competence beyond elementary calculus and maybe a few statistical formulae

2. although everybody uses a computer, biologists rarely use anything but standard commercial software

Page 19: Bioinformatics

Gene expression data

Gene expression data are far more complex to manage, and:

3. people with a non-biological background can find it surprisingly difficult to master the complex and apparently unconnected information that is the working knowledge of every biologist

Page 20: Bioinformatics

Source of data

1. DNA or Protein sequences

2. Macromolecular structures

3. Results of functional genomics and proteomics experiments (gene expression data)

Page 21: Bioinformatics

Source of data

4. Genomic-scale data, including biochemical information on metabolic pathways, regulatory networks and protein-protein interactions, as well as data from two-hybrid experiments and systematic knockouts of individual genes

Page 22: Bioinformatics

Integration

Integration of multiple sources of data.

At a basic level, this problem is frequently addressed by providing external links to other databases.

At a more advanced level, integrated access across several data sources is provided.

Page 23: Bioinformatics

Data organisation

The first biological databases were simple flat files.

At the moment most of them are relational databases with web-page interfaces.
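The difference between the two styles of storage can be sketched as follows: a small FASTA-style flat file is parsed and loaded into a relational table, which can then be queried with SQL. The records, the table layout and the function names are all assumptions made for illustration; no real database schema is implied.

```python
import sqlite3

# A tiny FASTA-style flat file with two hypothetical records.
FLAT_FILE = """\
>seq1 hypothetical kinase
MKTAYIAKQR
>seq2 hypothetical transporter
MSLLTEVETP
"""


def parse_flat_file(text):
    """Split a FASTA-style flat file into (id, description, residues) tuples."""
    records = []
    for block in text.strip().split(">")[1:]:
        header, *sequence_lines = block.splitlines()
        identifier, _, description = header.partition(" ")
        records.append((identifier, description, "".join(sequence_lines)))
    return records


def load_into_database(records):
    """Store the records in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    return conn


conn = load_into_database(parse_flat_file(FLAT_FILE))
# A query the flat file cannot answer without scanning and parsing every record.
row = conn.execute(
    "SELECT id, length(residues) FROM sequences WHERE description LIKE ?",
    ("%kinase%",),
).fetchone()
print(row)
```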

Page 24: Bioinformatics

Aims of Bioinformatics

1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced

2. to develop tools and resources that aid in the analysis of data

3. to conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features

Page 25: Bioinformatics

Data and Software Tools

Page 26: Bioinformatics

Data and Software Tools

For example:

- software for gene finding (identification of coding regions)

- software for similarity searches

- multiple sequence alignments and searching for functional domains

- homology modeling

- calculations of surface and volume shapes and analysis of protein interactions with DNA, RNA, other proteins or drugs (chemoinformatics)
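To give a flavour of the first item in the list above, the sketch below scans the forward strand of a DNA string for open reading frames, that is, stretches that start with ATG and run in frame to a stop codon. This is a deliberately naive stand-in: real gene-finding software is far more sophisticated, and the function name and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ATG...stop stretches.

    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence containing one short open reading frame.
example = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```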

Page 27: Bioinformatics

Similarity searching

Having sequenced a particular protein, it is of interest to compare it with previously characterised sequences.

This requires more than a simple text-based search: these programs must consider what constitutes a biologically significant match.

Biologically significant match:

- two sequences share a common function

- two sequences share a common evolutionary history (homologs)
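One way to see why similarity searching goes beyond exact text matching is a local alignment score, which rewards matching letters while tolerating mismatches and gaps. The sketch below computes a Smith-Waterman score with a simple linear gap penalty. The scoring values are arbitrary assumptions; real search tools typically use substitution matrices and statistical significance estimates, which are not shown here.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Hypothetical sequences sharing the stretch "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```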

Page 28: Bioinformatics

Homology modeling

At a structural level, there is predicted to be a finite number of different tertiary structures: estimates range between 1,000 and 10,000 folds.

A structure can be predicted in a homology-based manner, by comparison with known structures (3-D structural alignments).

Although the number of structures in the PDB database has increased exponentially, the rate of discovery of novel folds has actually decreased.

Page 29: Bioinformatics

Ab initio structure prediction

Prediction of the 3-D structure is based on the protein sequence only, e.g. on the propensity of certain amino acid combinations to produce secondary structural elements.

Page 30: Bioinformatics

Aims of Bioinformatics

1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced

2. to develop tools and resources that aid in the analysis of data

3. to conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features

Page 31: Bioinformatics

Data exploration

Finding relationships between different proteins:

- Analysis of one type of data to infer and understand the observations for another type of data

- Comparative analysis for classification

Expansion of biological analysis in two dimensions: depth and breadth

Page 32: Bioinformatics

Expansion of biological analysis in two dimensions: depth

This approach takes a single gene and follows through an analysis that maximises our understanding of the protein it encodes.

Prediction algorithms can then be used to calculate the structure and to form hypotheses about its function.

Geometry calculations can define the shape of the protein’s surface and identify or design ligands that can become drugs that specifically alter the protein’s function.

Example: rational drug design

Page 33: Bioinformatics

Expansion of biological analysis in two dimensions: breadth

This approach can be used to extract sequence patterns or structural templates that define a family of proteins sharing a common property.

It can also be used to construct phylogenetic trees that trace evolution, e.g. of the SARS virus.

Example: comparison of a gene or a gene product with others.
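Once a sequence pattern has been extracted for a family, it can be used to scan other sequences. The sketch below does this with a regular expression for the N-glycosylation site pattern N-{P}-[ST]-{P} (an asparagine, any residue except proline, a serine or threonine, any residue except proline). The protein sequence and the function name are made up for illustration.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression; the lookahead lets
# overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (position, matched text) for every occurrence of the pattern."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]


# Hypothetical sequence: the N at position 7 is followed by P, so it is skipped.
print(find_motif("MKNGSAANPTLNQTW"))
```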

Page 34: Bioinformatics

Data organisation

The first biological databases were simple flat files.

At the moment most of them are relational databases with web-page interfaces.

Page 35: Bioinformatics

Sequence analysis

Techniques mainly involve string comparison methods.
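A classic string comparison method is the edit distance: the minimum number of single-letter substitutions, insertions and deletions needed to turn one string into another. The sketch below is one generic example of such a method, not necessarily the one any particular sequence analysis tool uses; the function name is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical sequences differing by one missing base.
print(edit_distance("ACGT", "AGT"))
```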

Page 36: Bioinformatics

Motif and pattern identification and classification

depend on:

- Machine learning

- Clustering and data mining techniques

Page 37: Bioinformatics

3-D structural analysis

includes:

- Euclidean geometry calculations

- Basic application of physical chemistry

- Graphical representation of surfaces and volumes

- Structural comparison (3-D matching)
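A simple Euclidean calculation used when comparing structures is the root-mean-square deviation (RMSD) between two sets of matched atom coordinates. The sketch below assumes the two structures are already superimposed and that atoms are paired by position in the list; finding the optimal superposition is a separate step that is not shown. The coordinates and the function name are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two hypothetical two-atom structures, each atom displaced by 1 unit.
structure_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
structure_b = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(rmsd(structure_a, structure_b))
```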

Page 38: Bioinformatics

Bio Informatics

This unexpected union between the two subjects is attributed to the fact that life itself is an information technology.

An organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information.

Page 39: Bioinformatics