High throughput biology projects
Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique
LF-2001.11
The new biology
Traditional biology:
  Small team working on a specialized topic
  Well-defined experiment to answer precise questions
New "high-throughput" biology:
  Large international teams using cutting-edge technology defining the project
  Results are given raw to the scientific community without any underlying hypothesis
Examples of "high-throughput" projects
Complete genome sequencing
Large-scale sampling of the transcriptome
Simultaneous gene expression analysis of thousands of genes (DNA chips)
Large-scale sampling of the proteome
Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
Large-scale 3D structure production (yeast)
Metabolism modelling
Biodiversity
Role of bioinformatics
Control and management of the data
Analysis of primary data, e.g.
  Base calling from chromatograms
  Mass spectra analysis
  DNA chip image analysis
Statistics
Results analysis in a biological context
Genomes in numbers
Sizes:
  virus: 10^3 to 10^5 nt
  bacteria: 10^5 to 10^7 nt
  yeast: 1.35 x 10^7 nt
  mammals: 10^8 to 10^10 nt
  plants: 10^10 to 10^11 nt
Gene number:
  virus: 3 to 100
  bacteria: ~1000
  yeast: ~7000
  mammals: ~30,000
Sequencing projects
"Small" genomes (<10^7 nt): bacteria, viruses
  Many already sequenced (industry excluded)
  More than 60 bacterial genomes already in the public domain
  More to come! (one every two weeks…)
"Large" genomes (10^7 to 10^10 nt): eukaryotes
  5 finished (S. cerevisiae, C. elegans, D. melanogaster, A. thaliana, Homo sapiens)
  Many more to come: mouse, rat, rice (and other plants), fishes, many pathogenic parasites
EST sequencing
  Partial mRNA sequences
  ~8.5 x 10^6 sequences in the public domain
Human genome
Size: 3 x 10^9 nt for a haploid genome
Highly repetitive sequences: 25%; moderately repetitive sequences: 25-30%
Size of a gene: from 900 to >2,000,000 bases (introns included)
Proportion of the genome coding for proteins: 5-7%
Number of chromosomes: 22 autosomes and 1 sex chromosome (per haploid set)
Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
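
As a rough cross-check of the figures on this slide, the sketch below works out how many bases the quoted 5-7% coding fraction represents and what mean chromosome size follows from 23 chromosomes per haploid set. The function and constant names are illustrative only; the numbers are the ones given above.

```python
# Back-of-the-envelope arithmetic on the figures quoted on the slide.
HAPLOID_GENOME_NT = 3e9                # 3 x 10^9 nt (from the slide)
CODING_FRACTION_RANGE = (0.05, 0.07)   # 5-7% coding for proteins (from the slide)
CHROMOSOMES_PER_HAPLOID_SET = 23       # 22 autosomes + 1 sex chromosome


def coding_bases(genome_nt, fraction):
    """Number of bases coding for proteins, given a coding fraction."""
    return genome_nt * fraction


def mean_chromosome_size(genome_nt, n_chromosomes):
    """Average chromosome length if the genome were split evenly."""
    return genome_nt / n_chromosomes


low, high = (coding_bases(HAPLOID_GENOME_NT, f) for f in CODING_FRACTION_RANGE)
print(f"protein-coding bases: {low:.2e} to {high:.2e} nt")
print(
    "mean chromosome size: "
    f"{mean_chromosome_size(HAPLOID_GENOME_NT, CHROMOSOMES_PER_HAPLOID_SET):.2e} nt"
)
```

The mean chromosome size computed this way falls inside the 5 x 10^7 to 5 x 10^8 range quoted above.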
How to sequence the human genome?
"International" consortium approach:
  Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences
  Generate a physical map based on large clones (BAC or PAC)
  Sequence enough large clones to cover the genome
"Commercial" approach (Celera):
  Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
  Sequence both ends of enough clones to obtain a 10x coverage
  Use computer techniques to reconstitute the chromosomal sequences; check with the public project physical map
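
To give a feel for the scale of the whole-genome shotgun approach, here is a minimal sketch of the clone count needed for 10x coverage when both ends of each clone are read. The read length of 500 nt is an assumption made for illustration (the slide does not give one), and `clones_needed` is a hypothetical helper name.

```python
import math


def clones_needed(genome_size_nt, coverage, read_length_nt, reads_per_clone=2):
    """Clones to sequence so that total read length = coverage x genome size.

    reads_per_clone=2 models sequencing both ends of every clone.
    """
    total_bases = genome_size_nt * coverage
    reads = math.ceil(total_bases / read_length_nt)
    return math.ceil(reads / reads_per_clone)


# 3 x 10^9 nt genome and 10x coverage are from the slides;
# the 500 nt read length is an assumed value.
print(clones_needed(3_000_000_000, 10, 500))
```

Under these assumptions the answer is in the tens of millions of clones, which is why the assembly step has to be done by computer.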
Mapping resources
Genetic and physical maps: Genethon, GDB, NCBI
Radiation hybrid map: Sanger
BAC production & mapping: Oakland, Caltech, others
Clone information and retrieval: RZPD (Germany)
Physical maps in ACEDB format from chromosome coordinators
Sequencing
Create shotgun library from BAC/PAC
Sequence individual clones to get a ten-fold coverage
Phases:
  0 = single sequence (like STS)
  1 = unordered contigs
  2 = ordered, oriented contigs
  3 = finished, annotated sequence
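
Why ten-fold coverage? One common way to reason about it, not stated on the slide, is the idealised Poisson model (in the spirit of Lander and Waterman): if reads start uniformly at random, the chance that a given base is hit by no read at coverage c is e^(-c). The sketch below applies this to a BAC of 150 kb, which is an assumed size.

```python
import math


def uncovered_fraction(coverage):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-coverage)


def expected_uncovered_bases(clone_length_nt, coverage):
    """Expected number of bases in the clone that no read covers."""
    return clone_length_nt * uncovered_fraction(coverage)


# Compare a few coverage depths for an assumed 150 kb BAC insert.
for c in (1, 5, 10):
    print(c, uncovered_fraction(c), expected_uncovered_bases(150_000, c))
```

Real libraries are not perfectly random, so gaps remain in practice even at high coverage; the model only shows why low coverage cannot be enough.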
Chromosome size sequences
Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications
Solution: split the sequence into overlapping "chunks"
New problem: the chunks have to be reassembled before the whole sequence can be analyzed
GenBank provides “meta-entries” (CON division) with assembly instructions
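
The chunk-and-reassemble idea can be sketched as follows. This is a toy illustration, not the GenBank CON format: the chunk size, the overlap, and the function names are all assumptions, and each chunk simply carries its start offset as its "assembly instruction".

```python
def split_into_chunks(sequence, chunk_size, overlap):
    """Split a sequence into (start, chunk) pairs that overlap by `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break
        start += step
    return chunks


def reassemble(chunks):
    """Rebuild the sequence from (start, chunk) pairs sorted by start offset."""
    parts = []
    assembled = 0  # number of bases already placed
    for start, chunk in chunks:
        # Skip the part of this chunk that overlaps what is already placed.
        parts.append(chunk[assembled - start:])
        assembled = start + len(chunk)
    return "".join(parts)


seq = "ACGTACGTAC"
pieces = split_into_chunks(seq, chunk_size=4, overlap=1)
print(pieces)
print(reassemble(pieces) == seq)
```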
Interpretation of the human draft
Many gaps and unordered small pieces
A genomic sequence does not tell you where the genes are encoded. The genome is far from being "decoded"
One must combine genome and transcriptome to have a better idea
The transcriptome
The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome
The documentation of the localization (cell type) and conditions under which these RNAs are expressed
The documentation of the biological function(s) of each RNA species
Public draft transcriptome
Information about the expression specificity and the function of mRNAs
"Full" cDNA sequences of known function
"Full" cDNA sequences, but "anonymous" (e.g. KIAA or DKFZ collections)
EST sequences
  cDNA libraries derived from many different tissues
  Rapid random sequencing of the ends of all clones
ORESTES sequences
Limited set of expression data
How to organise EST collections?
Clustering: associate individual EST sequences with unique transcripts or genes
Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster
Mapping: associate ESTs (or EST contigs) with exons in genomic sequences
Interpreting: find and correct coding regions
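
The clustering step can be sketched as a toy example: two ESTs are put in the same cluster as soon as they share an exact k-mer, and clusters are merged with a union-find structure. This is a deliberate simplification and an assumption of this sketch, not the method of any particular EST database: real tools also handle reverse complements, sequencing errors and repeats. The value of k and the function names are illustrative.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group EST indices into clusters of sequences linked by shared k-mers."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        for km in kmers(seq, k):
            if km in first_seen:
                parent[find(idx)] = find(first_seen[km])
            else:
                first_seen[km] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Made-up sequences: the first two overlap on "TTGACC", the third is unrelated.
example = ["ACGTACGTTTGACC", "TTGACCGGATAGCA", "GGGGGCCCCCAAAA"]
print(cluster_ests(example, k=6))
```

Assembling would then derive one consensus per cluster, and mapping would align each consensus to the genomic sequence.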
Example mapping of ESTs and mRNAs
[Figure: genomic region with tracks labelled ESTs, mRNAs, and Computer prediction]
How to cope with the amount of data?
Enormous increase in the number of sequences
Constantly changing data (phases…)
Automatic annotation projects
  RefSeq (NCBI)
  ENSEMBL (EBI)
  HAMAP (SIB)
RefSeq: NCBI Reference sequences
mRNAs and Proteins
  NM_123456  Reference mRNA
  NP_123456  Reference Protein
  XM_123456  Predicted Transcript
  XP_123456  Predicted Protein
  XR_123456  Predicted non-coding Transcript
Gene Records
  NG_123456  Reference Genomic Sequence
Assemblies
  NT_123456  Reference Contig (Mouse and Human Genomes)
  NC_123455  Reference Chromosome, Microbial Genomes, Plasmid
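
The prefix table above translates directly into a lookup. The sketch below is a minimal illustration limited to the prefixes listed on this slide; `describe_accession` is a hypothetical helper, not an NCBI API, and it tolerates an optional version suffix such as ".2".

```python
REFSEQ_PREFIXES = {
    "NM": "Reference mRNA",
    "NP": "Reference Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding Transcript",
    "NG": "Reference Genomic Sequence",
    "NT": "Reference Contig (Mouse and Human Genomes)",
    "NC": "Reference Chromosome, Microbial Genomes, Plasmid",
}


def describe_accession(accession):
    """Return the record type for a RefSeq accession such as 'NM_123456'."""
    prefix, sep, number = accession.partition("_")
    # Accept an optional ".version" after the numeric part.
    if not sep or not number.split(".")[0].isdigit():
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    try:
        return REFSEQ_PREFIXES[prefix.upper()]
    except KeyError:
        raise ValueError(f"prefix not in the table above: {prefix!r}") from None


for acc in ("NM_123456", "XP_123456", "NT_123456"):
    print(acc, "->", describe_accession(acc))
```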
Status codes
RefSeq records carry a status code that indicates the level of review a record has undergone.
REVIEWED
  The RefSeq record has been reviewed by NCBI staff. The review process includes reviewing available sequence data and frequently also includes a review of the literature.
PROVISIONAL
  The RefSeq record has not yet been subject to individual review.
PREDICTED
  Some aspect of the RefSeq record is predicted and there is supporting evidence that the locus is valid.
GENOME ANNOTATION
  This identifies the contig (NT_ accessions), mRNA (XM_), non-coding transcript (XR_), and protein (XP_) RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing.
Map view of RefSeq
[Figure: map view with tracks labelled NM_, XM_, and NT_]
ENSEMBL
Goals of Ensembl
  Accurate, automatic analysis of genome data
  Analysis maintained on the current data
  Presentation of the analysis to biologists via the Web
  Distribution of the analysis to other bioinformatics laboratories
The Ensembl project will be a foundation for a next-generation sequence database that provides a curated, distributed, non-redundant view of the genomes of model organisms.
Commitments of the Ensembl project
  Public release of data
    All the data and analysis will be put into the public domain immediately.
  Open, collaborative software development
    The software which forms the automated pipeline will be available to everyone under an open license, modelled after the Apache license.
  Collaboration on agreed standards for distribution
    We hope to provide the data in as many useful forms as is practical, including the EMBL flat file formats and new data distribution channels such as XML and CORBA.
ENSEMBL views
HAMAP: High-quality Automated Microbial Annotation of Proteomes
Aim: automatically annotate, with the highest level of quality, a significant percentage of the proteins originating from microbial genome sequencing projects.
The programs being developed are specifically designed to track down "eccentric" proteins. Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc.), presence of paralogs, and contradiction with the biological context (e.g. a protein that belongs to a pathway supposed to be absent in a particular organism). Such "problematic" proteins will not be automatically annotated.
This project should allow annotators in the SWISS-PROT groups at SIB and EBI to concentrate on the proteins that really need careful manual annotation.
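
One of the checks listed above, size discrepancy, can be sketched as a simple filter. This is a hypothetical illustration and not HAMAP's actual rule: the 20% tolerance, the comparison against the family median, and the function name are all assumptions.

```python
from statistics import median


def flag_size_discrepancies(lengths, tolerance=0.2):
    """Return IDs whose length deviates from the family median by more than `tolerance`.

    `lengths` maps a protein ID to its length in residues for one protein family.
    Flagged proteins would be left for manual annotation.
    """
    if not lengths:
        return []
    med = median(lengths.values())
    return sorted(
        pid for pid, n in lengths.items() if abs(n - med) > tolerance * med
    )


# Made-up family: P4 is about half the typical length and gets flagged.
family = {"P1": 300, "P2": 310, "P3": 295, "P4": 150}
print(flag_size_discrepancies(family))
```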
HAMAP origin
About 60 microbial genomes are available today
>1000 in a few years; >1 million microbial proteins!
Functional analysis and detailed biochemical characterization will only be available:
  For "all" proteins in a handful of model organisms (e.g. E. coli, B. subtilis)
  For proteins involved in pathogenicity (medical and pharmaceutical interests)
  For proteins involved in specific biosynthetic or catabolic pathways (biotechnological and food industry interests)
HAMAP overview
HAMAP flow chart
HAMAP case study
The case of the Escherichia coli proteome
  According to the original analysis in 1997: 4286 protein-coding genes
  60 were missed (almost all <100 residues)
  120 are most probably "bogus"
  50 pairs or triplets of ORFs had to be fused
  719 have proven or probable wrong start sites
  ~1800 are still not biochemically characterized; only one new "functionalisation" per week…
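
The numbers above invite two quick calculations, sketched below under stated assumptions: that the bogus and fused ORFs are disjoint sets, that each fused pair removes one ORF and each fused triplet removes two, and that the rate of one characterization per week stays constant. The names are illustrative only.

```python
ORIGINAL_GENES = 4286    # 1997 analysis (from the slide)
MISSED = 60              # genes that were overlooked
BOGUS = 120              # ORFs that are most probably not real genes
FUSION_GROUPS = 50       # pairs or triplets of ORFs that had to be fused
UNCHARACTERIZED = 1800   # proteins without biochemical characterization


def revised_gene_count_range():
    """Lowest and highest corrected gene count implied by the slide's figures."""
    base = ORIGINAL_GENES + MISSED - BOGUS
    # All triplets removes 2 ORFs per group; all pairs removes 1 per group.
    return base - 2 * FUSION_GROUPS, base - FUSION_GROUPS


def years_to_characterize(n_proteins, per_week):
    """Years needed to characterize n_proteins at a constant weekly rate."""
    return n_proteins / per_week / 52


print(revised_gene_count_range())
print(round(years_to_characterize(UNCHARACTERIZED, 1), 1))
```

At the quoted pace the backlog for this single, very well studied organism would take several decades, which is the motivation for automating the routine cases.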
Unix reminder
General: man, pwd, cd, ls, mkdir, rmdir, passwd, exit
File manipulation: cat, more, cp, mv, rm, grep, find, diff, head, tail, chmod
Editing: vi, pico, emacs
Compression: tar, (un)compress, gzip
Various: redirection (<, >, >>) and piping (|)