bioinformatics
Bioinformatics
The application of computational techniques to understand and organise
the information associated with biological macromolecules
Aims of Bioinformatics
1. to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced
2. to develop tools and resources that aid in the analysis of data
3. to conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features
Source of data
1. DNA or Protein sequences
2. Macromolecular structures
3. Results of functional genomics and proteomics experiments (gene expression data)
DNA or Protein sequences
DNA sequences are strings over the four base letters (A, C, G, T) that make up genes, each typically around 1,000 bases long. The largest database contains at least 27 million entries.
Protein sequences are strings over the 20 amino acid letters. At present more than 400,000 protein sequences are known.
As of April 2001:
- GenBank db of nucleic acid sequences contained 11,546,000 entries
- SwissProt db of protein sequences contained 95,320 entries
These databases doubled in size in 15 months
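The idea of sequences as strings over a small alphabet can be sketched in a few lines. This is only an illustration: the function names `classify_sequence` and `base_composition` are made up for this example and do not come from any database toolkit.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")                       # the 4 base letters
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acid letters


def classify_sequence(seq):
    """Guess whether a string is DNA or protein from the letters it uses."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_ALPHABET:
        return "dna"        # note: a protein made only of A, C, G, T would land here
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


def base_composition(dna):
    """Count how often each of the four bases occurs in a DNA string."""
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}
```

For instance, `classify_sequence("ATGGCC")` reports DNA, while a string containing letters outside both alphabets is reported as unknown.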
Size of data
Biological data are being produced at a phenomenal rate
Size of data
Anthony Kerlavage of Celera recently noted that an experimental laboratory can produce over 100 gigabytes of data per day with ease.
Biological processing power
This incredible processing power has been matched by developments in computer
technology
Areas of improvement
- CPU (faster computations)
- disk storage (better data storage)
- Internet (revolutionised the methods for accessing and exchanging data)
Macromolecular structure
There are currently 15,000 entries in the Protein Data Bank (PDB).
The PDB contains atomic structures (xyz-coordinates) of proteins, DNA and RNA solved by X-ray crystallography and NMR.
A typical PDB file contains the xyz-coordinates of ca. 2000 atoms.
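As a rough illustration of what "xyz-coordinates" means in practice, the sketch below reads coordinates from PDB-style ATOM records, which store x, y and z in fixed columns 31-38, 39-46 and 47-54. The two sample lines are invented for the example, and `parse_atoms` is a hypothetical helper name, not part of any PDB software.

```python
# Two made-up ATOM records in the fixed-column PDB layout, plus a non-atom line.
SAMPLE = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614
ATOM      2  CA  MET A   1      26.266  25.413   2.842
END
"""


def parse_atoms(text):
    """Return (atom name, residue name, (x, y, z)) for each ATOM/HETATM record."""
    atoms = []
    for line in text.splitlines():
        if not line.startswith(("ATOM  ", "HETATM")):
            continue                      # skip header, remark and terminator lines
        name = line[12:16].strip()        # columns 13-16: atom name
        residue = line[17:20].strip()     # columns 18-20: residue name
        x = float(line[30:38])            # columns 31-38
        y = float(line[38:46])            # columns 39-46
        z = float(line[46:54])            # columns 47-54
        atoms.append((name, residue, (x, y, z)))
    return atoms
```

Fixed-column slicing is used rather than splitting on whitespace because adjacent fields in real files can run together without a separating space.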
Gene expression data
These experiments measure the amount of mRNA (functional genomics) or protein (proteomics) that is produced by the cell under different conditions, at different stages of the cell cycle and in different cell types in multi-cellular organisms.
One of the largest datasets available comprises approximately 20 time-point measurements for 6,000 genes (yeast).
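A dataset of this kind is naturally a table of genes by time points. The following minimal sketch assumes the measurements are held as one list per gene; the gene names and values are invented, and the two helper functions are hypothetical, chosen only to show the shape of the data.

```python
import math

# Hypothetical expression levels: one row per gene, one column per time point.
EXPRESSION = {
    "geneA": [1.0, 4.0, 2.0],
    "geneB": [8.0, 4.0, 2.0],
}


def peak_time_points(expression):
    """For each gene, return the index of the time point with the highest level."""
    return {
        gene: max(range(len(levels)), key=lambda i: levels[i])
        for gene, levels in expression.items()
    }


def log2_ratios(levels):
    """Express each measurement as a log2 ratio relative to the first time point."""
    return [math.log2(value / levels[0]) for value in levels]
```

A real dataset would have about 6,000 rows and 20 columns, but the structure is the same.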
Gene expression data
From an experimental point of view, it is possible to determine the expression levels of almost every gene in a given cell on a whole-genome level.
However, there is currently no central repository for these data, and public availability is limited.
Biological data
Datasets differ widely in size and complexity.
Although macromolecular structures and gene expression experiments yield much more biological information than raw sequence data, there are invariably more sequence-based data than any other kind.
Why?
- Because of the relative ease with which sequence data can be produced
- Because they can be easily managed both by biologists and by computer scientists, even those with very little biological background
Gene expression data
Gene expression data, on the other hand, are far more complex to manage, and:
1. biologists rarely achieve mathematical competence beyond elementary calculus and perhaps a few statistical formulae
2. although everybody uses a computer, biologists rarely use anything but standard commercial software
3. people with a non-biological background can find it surprisingly difficult to master the complex and apparently unconnected information that is the working knowledge of every biologist
Source of data
1. DNA or Protein sequences
2. Macromolecular structures
3. Results of functional genomics and proteomics experiments (gene expression data)
4. Genomic-scale data, including biochemical information on metabolic pathways, regulatory networks, protein-protein interactions, and data from two-hybrid experiments and systematic knockouts of individual genes
Integration
Integration of multiple sources of data.
At a basic level, this problem is frequently addressed by providing external links to other databases.
At a more advanced level, integrated access across several data sources is provided.
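The difference between the two levels can be sketched as follows. At the basic level a sequence record merely carries the identifier of a related structure entry; at the more advanced level a single query follows that link and returns a combined view. Both toy databases, all identifiers and the function `integrated_record` are invented for illustration.

```python
# Two hypothetical data sources, linked only by a cross-reference field.
SEQUENCE_DB = {
    "SEQ001": {"name": "example kinase", "structure_id": "STR9"},
    "SEQ002": {"name": "example transporter", "structure_id": None},
}
STRUCTURE_DB = {
    "STR9": {"method": "X-ray crystallography", "atom_count": 2000},
}


def integrated_record(accession):
    """Follow the external link and merge both sources into one record."""
    entry = SEQUENCE_DB[accession]
    structure = STRUCTURE_DB.get(entry["structure_id"])   # None if no link exists
    return {
        "accession": accession,
        "name": entry["name"],
        "structure": structure,
    }
```

Here `integrated_record("SEQ001")` returns the sequence entry together with its structural data, while `integrated_record("SEQ002")` shows a record for which no structure is linked.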
Data organisation
The first biological databases were simple flat files.
At present, most are relational databases with Web-page interfaces.
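The step from flat files to a relational database can be sketched with Python's built-in `sqlite3` module. The record layout below (two-letter line codes and a `//` terminator) is a simplified stand-in loosely modelled on classic sequence flat files, and the table and column names are assumptions made for this example.

```python
import sqlite3

# A simplified, made-up flat file: one field per line, records end with "//".
FLAT_FILE = """\
ID   SEQ001
DE   hypothetical example protein
SQ   MKVLAT
//
ID   SEQ002
DE   another hypothetical protein
SQ   MKV
//
"""


def parse_flat_file(text):
    """Turn the flat file into a list of {line code: value} dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            records.append(current)
            current = {}
        elif line.strip():
            current[line[:2]] = line[5:].strip()
    return records


def load_into_database(records):
    """Store the parsed records in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (accession TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(r["ID"], r["DE"], r["SQ"]) for r in records],
    )
    return conn


conn = load_into_database(parse_flat_file(FLAT_FILE))
long_entries = conn.execute(
    "SELECT accession FROM protein WHERE length(sequence) > ?", (5,)
).fetchall()
```

Once the data are in a table, questions such as "which entries are longer than five residues" become a single query instead of a custom file scan.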
Data and Software Tools
For example:
- software for gene finding (identification of coding regions)
- software for similarity searches
- multiple sequence alignments and searching for functional domains
- homology modeling
- calculations of surface and volume shapes, and analysis of protein interactions with DNA, RNA, other proteins or drugs (chemoinformatics)
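To give a flavour of the first item, the sketch below finds open reading frames (a start codon ATG followed in the same frame by a stop codon) on the forward strand only. Real gene finders are far more sophisticated; `find_orfs` and its `min_codons` parameter are hypothetical names used for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon and must
    contain at least `min_codons` codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                        # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                         # remember the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))   # include the stop codon
                start = None
    return sorted(orfs)
```

For the string `"CCATGAAATAGCC"` the only ORF found is `ATGAAATAG`, at positions 2 to 11.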
Similarity searching
Having sequenced a particular protein, it is of interest to compare it with previously characterised sequences.
This needs more than a simple text-based search: the programs must consider what constitutes a biologically significant match.
Biologically significant match:
- two sequences share a common function
- two sequences share a common evolutionary history (homologs)
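One way to go beyond exact text matching is to score an alignment that tolerates substitutions and gaps. The sketch below uses Needleman-Wunsch global alignment, chosen here purely as an illustration; the slides do not name a specific algorithm. The match, mismatch and gap values are arbitrary assumptions, and real search tools use substitution matrices and statistical significance estimates instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    Only the score is computed, using two rows of the dynamic-programming
    table; the alignment itself is not reconstructed.
    """
    prev = [j * gap for j in range(len(b) + 1)]       # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,           # align a[i-1] with b[j-1]
                prev[j] + gap,                        # gap in b
                curr[j - 1] + gap,                    # gap in a
            ))
        prev = curr
    return prev[-1]
```

With these settings `"ACGT"` against `"AGT"` scores 1: three matching letters and one gap, a relationship that a plain substring search would miss entirely.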
Homology modeling
At a structural level, there is predicted to be a finite number of different tertiary structures; estimates range between 1,000 and 10,000 folds.
A structure can be predicted in a homology-based manner, by comparison with known structures (3-D structural alignments).
Although the number of structures in the PDB db has increased exponentially, the rate of discovery of novel folds has actually decreased.
Ab initio structure prediction
Prediction of the 3-D structure is based on the protein sequence only: e.g. the propensity of certain amino acid combinations to produce secondary structural elements.
Data exploration
Finding relationships between different proteins:
- Analysis of one type of data to infer and understand the observations for another type of data
- Comparative analysis to do classification
Expansion of biological analysis in two dimensions, depth and breadth
Expansion of biological analysis in two dimensions: depth
This approach takes a single gene and follows it through an analysis that maximises our understanding of the protein it encodes.
Prediction algorithms can then be used to calculate the structure and to formulate hypotheses about its function.
Geometry calculations can define the shape of the protein’s surface and identify or design ligands that can become drugs specifically altering the protein’s function.
Example: Rational drug design
Expansion of biological analysis in two dimensions: breadth
This approach can yield sequence patterns or structural templates that define a family of proteins sharing a common property.
It can also be used to construct phylogenetic trees that trace evolution, e.g. that of the SARS virus.
Example: comparison of a gene or a gene product with others.
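A sequence pattern that defines a family can often be written as a regular expression. As an example, the PROSITE pattern for N-glycosylation sites, N-{P}-[ST]-{P} (asparagine, anything but proline, serine or threonine, anything but proline), translates directly into Python. The helper name `find_motif` and the test sequence are made up for this sketch.

```python
import re

# N-{P}-[ST]-{P} in regular-expression form. The lookahead lets the search
# report matches that overlap each other.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (position, matched text) for every occurrence of the pattern."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]
```

In the invented sequence `"MNGTAANPSA"` the site `NGTA` at position 1 matches, whereas the `N` at position 6 does not because it is followed by a proline.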
Sequence analysis
Techniques mainly include string comparison methods
Motif and pattern identification and classification
depend on:
- Machine learning
- Clustering and data mining techniques
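As a small illustration of clustering, the sketch below groups equal-length sequences by single-linkage clustering: two sequences end up in the same group if they are connected by a chain of neighbours that each differ in at most `max_distance` positions. The distance measure, the threshold and the function names are assumptions made for this example, not the method of any particular tool.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(sequences, max_distance):
    """Group sequences so that linked neighbours share a cluster."""
    clusters = []
    for seq in sequences:
        merged, unlinked = [seq], []
        for cluster in clusters:
            if any(hamming(seq, member) <= max_distance for member in cluster):
                merged.extend(cluster)        # seq bridges into this cluster
            else:
                unlinked.append(cluster)
        clusters = unlinked + [merged]
    return clusters
```

With a threshold of 1, `AAAA`, `AAAT` and `AATT` form one cluster even though `AAAA` and `AATT` differ at two positions, because `AAAT` links them.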
3-D structural analysis
includes:
- Euclidean geometry calculations
- Basic application of physical chemistry
- Graphical representation of surface and volumes
- Structural comparison (3-D matching)
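The Euclidean geometry involved can be sketched with three small functions operating on (x, y, z) tuples: the distance between two atoms, the centroid of a set of atoms, and the root-mean-square deviation (RMSD) between two structures with the same atoms in the same order. The RMSD here assumes the structures are already superimposed; no optimal superposition is performed, which a real structural comparison would need. All names are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Geometric centre of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def rmsd(structure_a, structure_b):
    """Root-mean-square deviation between paired atoms, without superposition."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must contain the same number of atoms")
    squared = [distance(p, q) ** 2 for p, q in zip(structure_a, structure_b)]
    return math.sqrt(sum(squared) / len(squared))
```

These functions work directly on coordinates of the kind stored in a PDB file.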
This unexpected union between the two subjects is attributed to the fact that life itself is an information technology
An organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information
Bio Informatics