Transcript
Page 1: Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December

Genes and Genomes


Genome On Line Database (GOLD)

• 243 Published complete genomes

• 536 Prokaryotic ongoing genomes

• 434 Eukaryotic ongoing genomes

December 2004: 1245 genome projects


Common Genome Browsers

NCBI: http://www.ncbi.nlm.nih.gov/mapview/static/MVstart.html

Eukaryote only:

• UCSC: http://genome.ucsc.edu

• Ensembl: http://www.ensembl.org

Prokaryote only:

• MGV: http://cmbipc49.cmbi.kun.nl/genome/

• TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl


What can we learn from genomes

• Genes

• Splice variants

• Variation analysis

• Promoters

• Comparative Genomics

• Evolution


[Figure: gene structure showing the 5’ UTR, the CDS, and the 3’ UTR]


Alternative splice variants


Looking for genes in genomes

• Existing mRNA and EST data

• Gene prediction program

• Comparative genomics


ESTs (Expressed Sequence Tags)

• cDNA provides one of the best tools for identifying genes in a genome.

– For unsequenced genomes it was the primary source for identifying genes.

• Basic strategy: select cDNA clones at random and perform a single automated read from one or both ends of the transcript.

– Many clones will be redundant.

– Very cost effective.

– ESTs are short (400-600 b) and relatively inaccurate (about 2% error).

• ESTs are matched to known genes using a relatively small region of sequence alignment.

• Used to discover genes, alternative splice variants, etc.


Problems with ESTs

– Incomplete coverage

– Bias for high copy number genes

– Experimental mistakes: not always reliable

– Enrichment of 3’ ends of genes

– High representation of cancer cells

Usage of ESTs

– Predicting coding regions

– Detecting alternative splicing

– Clustering to form genes


RefSeq database (NCBI)

• The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

• RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.


Gene Finding Approaches

– Learn characteristics of known genes

– Search for new genes using characteristics

– Different types of genes have different characteristics.

Prediction Status

The problem of gene prediction is still very much open, even in well-studied genomes:

The estimated number of human genes keeps changing.


Gene Finding

• Input

– Chromosomal genetic sequence

• Output

– Region that encodes a gene

– Strand and reading frame

– Start and end of the coding sequence

– Exon-intron boundaries


Prokaryotes vs. Eukaryotes

Prokaryotes and eukaryotes require different gene finding strategies.

• Prokaryotes:

– The genome is compact (shorter intergenic regions, no introns).

– Several genes may reside on the same mRNA in different reading frames.

– Promoter regions are more conserved.

• Eukaryotes:

– Large genomes; intron/exon structure; alternative splicing; pseudogenes; very long intergenic regions.

– The human genome: average gene ~27,800 b; about 8 exons; exon ~100 b; intron 100-30,000 b.


ORF Finding

Open Reading Frames (ORFs): sequences that presumably code for proteins.

How can ORFs be detected?

• All reading frames are checked.

• Search for initiation and termination codons within a sequence.

• Are these codons totally conserved?

http://www.ncbi.nlm.nih.gov/gorf/gorf.html
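The ORF search described above can be sketched in a few lines. This is a minimal illustration, not the NCBI ORF Finder algorithm: it assumes ATG as the only initiation codon and TAA/TAG/TGA as termination codons, although some organisms also use alternative start codons such as GTG or TTG. The function names and the `min_codons` parameter are hypothetical.

```python
# Minimal ORF scan over all six reading frames (illustrative sketch).
# Assumption: ATG is the only start codon; TAA, TAG, TGA are the stops.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples.

    start/end are 0-based, end-exclusive offsets on the scanned strand.
    Each ORF runs from an ATG through the first in-frame stop codon;
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # open an ORF at the first ATG seen
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF and keep scanning
    return orfs
```

For example, `find_orfs("CCATGAAATGACC")` reports a single ORF, `ATGAAATGA`, on the forward strand in the third frame.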


Protein-coding Gene Characteristics

• GC Content

• Uneven codon usage

– Amino acid bias

– Species’ preferred codons

• Promoter and splicing signals

• These characteristics may aid in

– Prediction.

– Validation.
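As a small illustration of the first characteristic, GC content can be computed for a whole sequence or along sliding windows. This is a sketch only; the function names and the default window and step sizes are hypothetical choices, not values from the lecture.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=100, step=50):
    """Return (offset, GC fraction) for each full window along the sequence."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Windows whose GC fraction departs from the genome-wide average are the kind of signal a predictor might combine with the other characteristics listed above.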


Codon Usage

• DNA is not a random choice of possible codons for each amino acid.

– It is an ordered list of codons that reflects evolutionary origin and constraints related to gene expression.

• Each species has its own coding preferences – codon usage.
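A codon usage profile for a single coding sequence can be tabulated as follows. This is a minimal sketch: it assumes the input is already in frame, and it reports frequencies per thousand codons, which is a common convention for codon usage tables. The function name is hypothetical.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: frequency per thousand codons} for an in-frame CDS.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}
```

Comparing such a profile against a species-wide table is one way to ask whether a candidate ORF "looks like" a gene of that organism.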


The Genetic Code

• The genetic code: each amino acid is coded by 3 nucleotides, called a codon.

• Code redundancy: most amino acids are coded by several codons.

– 64 triplets code for 20 amino acids and 3 stop codons.
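The redundancy of the code can be checked directly by building the standard codon table and grouping codons by the amino acid they encode. The sketch below assumes the standard genetic code, written in the usual T, C, A, G ordering; `*` marks a stop codon, and the variable and function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(table=CODON_TABLE):
    """Group codons by the amino acid (or '*' for stop) they encode."""
    groups = {}
    for codon, aa in table.items():
        groups.setdefault(aa, []).append(codon)
    return groups
```

Grouping this way shows 61 sense codons shared among 20 amino acids plus 3 stop codons; methionine (ATG) and tryptophan (TGG) are the only amino acids with a single codon.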


Codon Usage Database

http://www.kazusa.or.jp/codon/

This site provides:

• Codon usage tables per organism

• Computation of codon usage for query coding sequences.


Codon Usage Preferences

• Codon usage differs between highly and weakly expressed genes.

– In E. coli, genes were divided into 3 groups based on their codon usage:

- regular genes (70%)

- highly expressed genes (15%)

- horizontally transferred genes (15%)

• There are strong preferences in ORFs for specific codon pairs and for specific codons near terminators.

• The base in the third position of each codon tends to repeat itself within the same ORF.


Sequence Signals

• Prokaryotes:

– Promoter (-35, -10 from TSS).

– Ribosome Binding Site (Shine-Dalgarno) is conserved. Located at about -15, upstream of the AUG.

• Eukaryotes:

– Transcription signals: TATA (about -30 from TSS), cap signal, polyadenylation site. Any signal may be missing.

– Translation signals: Kozak signal (immediately upstream of the ATG).

– Splicing signals: recognized by the spliceosome. Introns usually start with GT and end with AG.
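The GT...AG rule for introns lends itself to a simple scan. The sketch below only enumerates spans that begin with GT and end with AG; real splice-site predictors score a wider sequence context, so this will heavily over-predict. The function names and the `min_len` parameter are hypothetical.

```python
def has_canonical_splice_sites(intron):
    """True if the candidate intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def candidate_introns(seq, min_len=4):
    """Return (start, end) spans, 0-based and end-exclusive, that begin
    with a GT donor and end with an AG acceptor."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptor_ends = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(i, j) for i in donors for j in acceptor_ends if j - i >= min_len]
```

In `"AAGTCCCCAGTT"` the only span returned is `(2, 10)`, which corresponds to `GTCCCCAG`.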


Computational Approaches to Prediction

• Various computational methods including decision trees, neural nets, Markov models and Hidden Markov models (HMM).

• A model is learned (trained) from known genes, and then applied to genomic sequences.

• Each genome defines its own model.


Markov Models

• Probabilistic approach.

• Modeled by states and the probability of transition from one state to the next.

• The probability of being at state X in step i depends only on the state we reached at step i-1.

It has been found that ORFs have a reading-frame specific hexamer (6-mer) composition.

=> The probability of the 6th base can be computed using the previous 5.

=> The probability that a sequence is an ORF in a specific reading frame can be computed from its 6-mer composition.
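The hexamer idea corresponds to a 5th-order Markov chain: each base is predicted from the five bases before it. The sketch below trains such a model from example sequences and scores a query by log-likelihood. It is a simplification in that it uses a single model rather than one per reading-frame position, and the pseudocount smoothing and uniform fallback for unseen contexts are assumptions, as are the function names.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            base = seq[i]
            if base in "ACGT":  # skip ambiguous bases such as N
                counts[seq[i - order:i]][base] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + 4 * pseudocount
        model[context] = {b: (nxt.get(b, 0.0) + pseudocount) / total for b in "ACGT"}
    return model


def log_likelihood(seq, model, order=5):
    """Sum of log P(base | context); unseen contexts or bases score as 0.25."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs.get(seq[i], 0.25) if probs else 0.25
        total += math.log(p)
    return total
```

A typical use would be to train one model on known coding sequences and another on non-coding sequence, then compare the two log-likelihoods for a candidate region.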


Grail II for finding Exons (Neural Network)

Input layer:

– Score of 6-mers

– Score of 6-mers in flanking region

– Markov model score

– GC composition

– GC composition in flanking region

– Score for splicing acceptor

– Score for splicing donor

Hidden layer

Output: exon score


GenScan (HMM)

• One of the most accurate programs

– Best for human/vertebrate sequences

• Markov parameters for different regions

– Introns beginning at 3 phases

– Exons: first, intermediate, last

– Promoter region

– 3’ and 5’ untranslated regions

– Intergenic regions


HMM for a GC-rich intronic region

Base  S1  S2  S3    S4    S5    S6    S7  S8
A     0   0   0.33  0.60  0.49  0.71  0   1
C     0   0   0.37  0.13  0.03  0.07  1   0
G     1   0   0.18  0.14  0.45  0.12  0   0
T     0   1   0.12  0.13  0.03  0.09  0   0
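The table can be read as per-state emission probabilities. Assuming the eight states are visited strictly left to right with transition probability 1 (the slide gives no transition probabilities, so this is an assumption), the probability of an 8-base sequence is just the product of the emissions. The names below are hypothetical.

```python
# Emission probabilities copied from the table, one list entry per state S1..S8.
EMISSIONS = {
    "A": [0, 0, 0.33, 0.60, 0.49, 0.71, 0, 1],
    "C": [0, 0, 0.37, 0.13, 0.03, 0.07, 1, 0],
    "G": [1, 0, 0.18, 0.14, 0.45, 0.12, 0, 0],
    "T": [0, 1, 0.12, 0.13, 0.03, 0.09, 0, 0],
}


def sequence_probability(seq):
    """Probability of an 8-base sequence under the left-to-right model.

    Assumes every transition S(i) -> S(i+1) has probability 1.
    """
    seq = seq.upper()
    if len(seq) != 8 or any(base not in EMISSIONS for base in seq):
        raise ValueError("expected exactly 8 bases from A, C, G, T")
    p = 1.0
    for state, base in enumerate(seq):
        p *= EMISSIONS[base][state]
    return p
```

Under this reading, any sequence that does not start with GT and end with CA has probability 0, while the middle four states carry the compositional bias.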


The General Scheme

1. Obtain new genomic DNA sequence.

2. i. Translate in all 6 reading frames and compare to protein databases.

ii. Perform a database similarity search against an expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available.

3. Use gene prediction program to locate genes.

4. Analyze regulatory sequences and signals in the gene. This can help characterize putative genes.
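The six-frame translation in step 2.i can be sketched as follows. It assumes the standard genetic code; codons containing ambiguous bases are rendered as `X` and stop codons as `*`. The frame labels (`+1` to `-3`) and function names are hypothetical, and the comparison against protein databases is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame label: protein string} for the three forward and
    three reverse-complement reading frames."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, s in (("+", dna), ("-", rev)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six strings can then be used as a protein query in a similarity search.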


Other Gene Finding Tools

• GeneMark (prokaryote, eukaryote)

– http://opal.biology.gatech.edu/GeneMark/

• Glimmer (bacteria, archaea)

– http://www.tigr.org/software/glimmer/

• GeneFinder (human, mouse, arabidopsis)

– http://argon.cshl.org/genefinder/

• HMMgene (vertebrate, C. elegans)

– http://www.cbs.dtu.dk/services/HMMgene/

http://www.tigr.org/genefinding/software.shtml


Prediction Evaluation

Prediction tools are compared using two criteria:

• Sensitivity: % of the true genes in the genome that are correctly predicted. TP / (TP + FN)

• Specificity: % of the non-genes that are correctly predicted as non-genes. TN / (TN + FP)

Both need to be high; results vary from genome to genome.

Accuracy comparisons tested on vertebrates:

           SN    SP
GENSCAN    0.93  0.93
GRAIL II   0.72  0.84
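The two measures defined above translate directly into code. The counts used in the usage note are hypothetical and chosen only to illustrate the arithmetic; they are not the data behind the published figures.

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of true genes that were predicted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """TN / (TN + FP): fraction of non-genes correctly left unpredicted."""
    return tn / (tn + fp)
```

For instance, with 93 true positives and 7 false negatives, `sensitivity(93, 7)` gives 0.93.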


Functional RNA Genes

• RNA genes are transcribed but not translated, so no codon preference exists.

How can rRNA, tRNA and small RNA genes be predicted?

• Promoter regions can be characterized, but remain a big challenge.

• RNA secondary structure is important. It can be predicted using RNA structure prediction tools (e.g. the MFOLD tool).


Comparative genomics

• Finding Orthologs

• Looking for genes in one species not found in another

• Searching for conserved regulatory elements

• Gene Clusters

• Conserved regulatory networks


Conservation of IGFALS (insulin-like growth factor) between human and mouse.

