Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique

LF-2001.11

High throughput biology projects


The new biology

Traditional biology:
  Small team working on a specialized topic
  Well-defined experiment to answer precise questions

New « high-throughput » biology:
  Large international teams using cutting-edge technology defining the project
  Results are given raw to the scientific community without any underlying hypothesis


Examples of « high-throughput » projects

Complete genome sequencing

Large-scale sampling of the transcriptome

Simultaneous gene expression analysis of thousands of genes (DNA chips)

Large-scale sampling of the proteome

Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)

Large-scale 3D structure production (yeast)

Metabolism modelling

Biodiversity


Role of bioinformatics

Control and management of the data

Analysis of primary data, e.g.
  Base calling from chromatograms
  Mass spectra analysis
  DNA chip image analysis

Statistics

Results analysis in a biological context


Genomes in numbers

Sizes:
  virus: 10^3 to 10^5 nt
  bacteria: 10^5 to 10^7 nt
  yeast: 1.35 x 10^7 nt
  mammals: 10^8 to 10^10 nt
  plants: 10^10 to 10^11 nt

Gene number:
  virus: 3 to 100
  bacteria: ~1000
  yeast: ~7000
  mammals: ~30’000


Sequencing projects

« small » genomes (<10^7 nt): bacteria, virus
  Many already sequenced (industry excluded)
  More than 60 bacterial genomes already in the public domain
  More to come! (one every two weeks…)

« large » genomes (10^7 to 10^10 nt): eucaryotes
  5 finished (S.cerevisiae, C.elegans, D.melanogaster, A.thaliana, Homo sapiens)
  Many more to come: mouse, rat, rice (and other plants), fishes, many pathogenic parasites

EST sequencing
  Partial mRNA sequences
  ~8.5 x 10^6 sequences in the public domain


Human genome

Size: 3 x 10^9 nt for a haploid genome

Highly repetitive sequences: 25%; moderately repetitive sequences: 25-30%

Size of a gene: from 900 to >2’000’000 bases (introns included)

Proportion of the genome coding for proteins: 5-7%

Number of chromosomes (haploid set): 22 autosomes, 1 sex chromosome

Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases


How to sequence the human genome?

Consortium « international » approach:
  Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences
  Generate a physical map based on large clones (BAC or PAC)
  Sequence enough large clones to cover the genome

« commercial » approach (Celera):
  Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
  Sequence both ends of enough clones to obtain a 10x coverage
  Use computer techniques to reconstitute the chromosomal sequences, check with the public project physical map
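To give a feeling for what "10x coverage" means in practice, here is a rough back-of-the-envelope sketch. It assumes a read length of 500 nt (a value not given in these slides, chosen only for illustration) and uses the idealised Lander-Waterman model, in which reads land uniformly at random, to estimate the fraction of the genome left uncovered. Function names are hypothetical.

```python
import math


def reads_for_coverage(genome_size_nt, coverage, read_length_nt):
    """Number of reads needed so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size_nt / read_length_nt)


def expected_uncovered_fraction(coverage):
    """Idealised Lander-Waterman estimate: probability that a base is hit by no read."""
    return math.exp(-coverage)


HUMAN_GENOME_NT = 3 * 10**9   # haploid size quoted on the "Human genome" slide
READ_LENGTH_NT = 500          # assumption for illustration, not from the slides

reads = reads_for_coverage(HUMAN_GENOME_NT, coverage=10, read_length_nt=READ_LENGTH_NT)
clones = math.ceil(reads / 2)  # both ends of each clone are sequenced
uncovered = expected_uncovered_fraction(10)

print(f"reads needed:  {reads:,}")
print(f"clones needed: {clones:,}")
print(f"expected uncovered fraction: {uncovered:.2e}")
```

Under these assumptions the order of magnitude is tens of millions of reads. Real projects deviate from the idealised model because of repeats and cloning bias, which is one reason the assembly is checked against a physical map.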


Mapping resources

Genetic and physical maps: Genethon, GDB, NCBI

Radiation hybrid map: Sanger

BAC production & mapping: Oakland, Caltech, others

Clone information and retrieval: RZPD (Germany)

Physical maps in ACEDB format from chromosome coordinators


Sequencing

Create shotgun library from BAC/PAC

Sequence individual clones to get a ten-fold coverage

Phases:
  0 = single sequence (like STS)
  1 = unordered contigs
  2 = ordered, oriented contigs
  3 = finished, annotated sequence


Chromosome size sequences

Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications

Solution: split the sequence into overlapping “chunks”

New problem: have to reassemble chunks if you want to analyze the whole sequence

GenBank provides “meta-entries” (CON division) with assembly instructions
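The split-and-reassemble idea can be sketched as follows. This is only an illustration: each chunk is stored with its start offset as a stand-in for the assembly instructions, and this is not the actual format of GenBank CON entries. The function names and the chunk and overlap sizes are hypothetical.

```python
def split_into_chunks(sequence, chunk_size, overlap):
    """Cut a long sequence into (start_offset, piece) chunks that overlap by `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # this chunk reaches the end of the sequence
        start += step
    return chunks


def reassemble(chunks):
    """Rebuild the full sequence from (start_offset, piece) chunks, dropping the overlaps."""
    parts = []
    end = 0  # number of bases already reconstructed
    for start, piece in sorted(chunks):
        if start > end:
            raise ValueError(f"gap between position {end} and {start}")
        parts.append(piece[end - start:])  # keep only the part not seen yet
        end = max(end, start + len(piece))
    return "".join(parts)


genome = "ACGT" * 25  # toy 100-nt "chromosome"
chunks = split_into_chunks(genome, chunk_size=30, overlap=10)
rebuilt = reassemble(chunks)
print(len(chunks), rebuilt == genome)
```

Keeping the offset with each piece means the chunks can be stored or fetched in any order and still be put back together.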


Interpretation of the human draft

Many gaps and unordered small pieces

A genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded »

One must combine genome and transcriptome to have a better idea


The transcriptome

The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome

The documentation of the localization (cell type) and conditions under which these RNAs are expressed

The documentation of the biological function(s) of each RNA species


Public draft transcriptome

Information about the expression specificity and the function of mRNAs

« full » cDNA sequences of known function

« full » cDNA sequences, but « anonymous » (e.g. KIAA or DKFZ collections)

EST sequences
  cDNA libraries derived from many different tissues
  Rapid random sequencing of the ends of all clones

ORESTES sequences

Limited set of expression data


How to organise EST collections?

Clustering: associate individual EST sequences with unique transcripts or genes

Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster

Mapping: associate ESTs (or EST contigs) with exons in genomic sequences

Interpreting: find and correct coding regions
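As a toy illustration of the clustering step only, the sketch below puts two ESTs in the same cluster when they share at least one exact word of length k, and reports the connected groups. This is a deliberately simplified stand-in, not the method of any particular EST clustering tool: it ignores sequencing errors, reverse-complement reads and repeats, and the word length and example sequences are invented.

```python
from collections import defaultdict


def cluster_ests(ests, k=8):
    """Group ESTs (by index) that are linked through at least one shared k-mer."""
    parent = list(range(len(ests)))  # union-find forest, one node per EST

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for pos in range(len(seq) - k + 1):
            word = seq[pos:pos + k]
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])  # merge the two clusters
            else:
                first_seen[word] = idx

    clusters = defaultdict(list)
    for idx in range(len(ests)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Invented sequences: the first two overlap, the third is unrelated.
example_ests = [
    "ACGTACGTTTGACCA",
    "GTTTGACCAGGATCC",
    "TTTTTTTTGGGGGGGG",
]
result = cluster_ests(example_ests, k=8)
print(result)
```

The assembling step would then derive a consensus from the overlapping members of each cluster, which requires real alignments rather than exact word matches.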


Example mapping of ESTs and mRNAs

[Figure: genomic region with tracks labelled ESTs, mRNAs and Computer prediction]


How to cope with the amount of data?

Enormous increase of sequences

Always moving data (phases…)

Automatic annotation projects
  RefSeq (NCBI)
  ENSEMBL (EBI)
  HAMAP (SIB)


RefSeq: NCBI Reference sequences

mRNAs and Proteins
  NM_123456   Reference mRNA
  NP_123456   Reference Protein
  XM_123456   Predicted Transcript
  XP_123456   Predicted Protein
  XR_123456   Predicted non-coding Transcript

Gene Records
  NG_123456   Reference Genomic Sequence

Assemblies
  NT_123456   Reference Contig (Mouse and Human Genomes)
  NC_123455   Reference Chromosome, Microbial Genomes, Plasmid
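The accession prefixes listed above lend themselves to a simple lookup. The sketch below is a hypothetical helper, not part of any NCBI tool; it only knows the prefixes shown on this slide and additionally tolerates an optional ".version" suffix.

```python
import re

# Prefix -> record type, copied from the list on this slide.
REFSEQ_PREFIXES = {
    "NM": "Reference mRNA",
    "NP": "Reference Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding Transcript",
    "NG": "Reference Genomic Sequence",
    "NT": "Reference Contig",
    "NC": "Reference Chromosome, Microbial Genome, Plasmid",
}

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_PATTERN = re.compile(r"([A-Z]{2})_(\d+)(\.\d+)?")


def describe_accession(accession):
    """Return the record type for a RefSeq-style accession, or None if not recognised."""
    match = ACCESSION_PATTERN.fullmatch(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ("NM_123456", "XP_123456", "NT_123456", "AB123456"):
    print(acc, "->", describe_accession(acc))
```

A prefix lookup like this is handy when filtering search results, for example to keep only reference (NM_, NP_) records and set aside predicted (XM_, XP_, XR_) ones.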


Status codes

RefSeq records carry a status code that indicates the level of review a record has undergone.

REVIEWED
  The RefSeq record has been reviewed by NCBI staff. The review process includes reviewing available sequence data and frequently also includes a review of the literature.

PROVISIONAL
  The RefSeq record has not yet been subject to individual review.

PREDICTED
  Some aspect of the RefSeq record is predicted and there is supporting evidence that the locus is valid.

GENOME ANNOTATION
  This identifies the contig (NT_ accessions), mRNA (XM_), non-coding transcript (XR_), and protein (XP_) RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing.


Map view of RefSeq

[Figure: map view with NM_, XM_ and NT_ records labelled]


ENSEMBL

Goals of Ensembl
  Accurate, automatic analysis of genome data
  Analysis maintained on the current data
  Presentation of the analysis to biologists via the Web
  Distribution of the analysis to other bioinformatics laboratories

The Ensembl project will be a foundation for a next-generation sequence database that provides a curated, distributed, non-redundant view of the genomes of model organisms.

Commitments of the Ensembl project

Public release of data
  All the data and analysis will be put into the public domain immediately.

Open, collaborative software development
  The software which forms the automated pipeline will be available to everyone under an open license, modelled after the Apache license.

Collaboration on agreed standards for distribution
  We hope to provide the data in as many useful forms as is practical, including the EMBL flat file formats and new data distribution channels such as XML and CORBA.


ENSEMBL


ENSEMBL views


HAMAP: High-quality Automated Microbial Annotation of Proteomes

Aim: automatically annotate with the highest level of quality a significant percentage of proteins originating from microbial genome sequencing projects.

The programs being developed are specifically designed to track down "eccentric" proteins. Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc.), presence of paralogs, and contradiction with the biological context (e.g. a protein that belongs to a pathway supposed to be absent in a particular organism). Such "problematic" proteins will not be automatically annotated.

This project should allow annotators in the SWISS-PROT groups at SIB and EBI to concentrate on the proteins that really need careful manual annotation.


HAMAP origin

About 60 microbial genomes are available today

>1000 in a few years; >1 million microbial proteins!

Functional analysis and detailed biochemical characterization will only be available:
  For « all » proteins in a handful of model organisms (e.g. E.coli, B.subtilis)
  For proteins involved in pathogenicity (medical and pharmaceutical interests)
  For proteins involved in specific biosynthetic or catabolic pathways (biotechnological and food industry interests)


HAMAP overview


HAMAP flow chart


HAMAP case study

The case of the Escherichia coli proteome

According to the original analysis in 1997: 4286 protein-coding genes

60 were missed (almost all <100 residues)

120 are most probably « bogus »

50 pairs or triplets of ORFs had to be fused

719 have proven or probable wrong start sites

~1800 are still not biochemically characterized; only one new « functionalisation » per week…


Unix reminder

General: man, pwd, cd, ls, mkdir, rmdir, passwd, exit

File manipulation: cat, more, cp, mv, rm, grep, find, diff, head, tail, chmod

Editing: vi, pico, emacs

Compression: tar, (un)compress, gzip

Various: redirection (<, >, >>) and piping (|)
