lectures part 08

7/28/2019 Lectures Part 08

1/40

Computational Biology, Part 8

Protein Coding Regions

Robert F. Murphy

Copyright 1996-2009. All rights reserved.


2/40

Sequence Analysis Tasks

Finding protein coding regions


3/40

Goal

Given a DNA or RNA sequence, find those regions that code for protein(s)

Direct approach: Look for stretches that can be interpreted as protein using the genetic code

Statistical approaches: Use other knowledge about likely coding regions


4/40

Direct Approach


5/40

Genetic codes

The set of tRNAs that an organism possesses defines its genetic code(s)

The universal (standard) genetic code is shared by nearly all organisms

Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes

More than one tRNA may be present for a given codon, allowing more than one possible translation product


6/40

Genetic codes

Differences in genetic codes occur in start and stop codons only

Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)

Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)


7/40

Genetic codes


8/40

Genetic codes


9/40

Genetic codes

Note additional start codons: UUA, UUG, CUG

Note conversion of stop codon UGA (opal) to Trp
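A genetic code can be held as a codon-to-amino-acid table plus a set of start codons, which makes variants like the one noted above easy to express. The sketch below is illustrative only: the names `make_code`, `standard` and `variant` are hypothetical, the DNA alphabet (T in place of U) is used, and the variant simply combines the two notes on this slide, since the organism the original table belonged to is not recoverable here.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids of the standard code, for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def make_code(overrides=None, starts=("ATG",)):
    """Build a genetic code: the standard table plus optional changes."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD_AA))
    table.update(overrides or {})
    return {"table": table, "starts": set(starts)}


standard = make_code()

# Hypothetical variant: UGA (opal) read as Trp, with extra start codons.
variant = make_code(overrides={"TGA": "W"},
                    starts=("ATG", "TTA", "TTG", "CTG"))
```

Keeping the start codons separate from the table reflects the point that alternate initiation codons still encode their usual amino acid when they occur inside a coding region.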


10/40

Reading Frames

Since nucleotide sequences are read three bases at a time, there are three possible frames in which a given nucleotide sequence can be read (in the forward direction)

Taking the complement of the sequence and reading in the reverse direction gives three more reading frames


11/40

Reading frames

      TTC TCA TGT TTG ACA GCT
RF1   Phe Ser Cys Leu Thr Ala>
RF2    Ser His Val *** Gln Leu>
RF3     Leu Met Phe Asp Ser>

      AAG AGT ACA AAC TGT CGA
RF4
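The frames in this example can be reproduced with a short script. This is a minimal sketch assuming the standard genetic code and the DNA alphabet; the function names are hypothetical, the labels RF4 to RF6 for the reverse-strand frames are an assumed convention, and incomplete codons at the end of a frame are dropped rather than guessed.

```python
from itertools import product

BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), STANDARD_AA))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons from the start of seq ('*' = stop)."""
    return "".join(CODE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate the three forward and three reverse-complement frames."""
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {f"RF{k + 1}": translate(seq[k:]) for k in range(3)}
    frames.update({f"RF{k + 4}": translate(reverse[k:]) for k in range(3)})
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
for name, protein in frames.items():
    print(name, protein)
```

RF1 comes out as FSCLTA (Phe Ser Cys Leu Thr Ala), matching the slide, and RF2 contains the stop codon shown as *** above.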


12/40

Open Reading Frames (ORF)

Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to absence of stop codons)

Prerequisite: A specific genetic code

Definition: (start codon) (amino acid coding codon)^n (stop codon)

Note: Not all ORFs are actually used
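The definition above translates directly into a scan of each frame for a start codon followed by in-frame codons up to the first stop. The following is a sketch for the forward strand only, assuming standard start and stop codons; `find_orfs` and its parameters are hypothetical names, and coordinates are 0-based with an exclusive end.

```python
def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA"), min_codons=1):
    """Return (start, end, frame) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon,
    frame is 1, 2 or 3. min_codons counts the start codon plus the
    amino acid codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

To cover all six frames, the same function could be run on the reverse complement of the sequence, with the coordinates mapped back to the original strand.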


13/40


14/40

Open Reading Frames

Click the boxes for List ORFs and ORF map


15/40

Check reading frame: mod(696,3) = 0 -> RF3
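The same check can be written as a one-line function. This sketch assumes that 696 is a 1-based start coordinate on the forward strand (the figure it refers to is not available here), so that positions 1, 2 and 3 begin RF1, RF2 and RF3 respectively; `reading_frame` is a hypothetical name.

```python
def reading_frame(position):
    """Forward reading frame (1, 2 or 3) of a 1-based start position."""
    return (position - 1) % 3 + 1


print(reading_frame(696))
```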


16/40

EMBOSS plotorf


17/40


18/40


19/40

Splicing ORFs

For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product

ORFs from forward and reverse directions cannot be combined


20/40

Block Diagram for Search for ORFs

Search Engine

Inputs: Sequence to be searched; Genetic code

Options: Both strands? Ends start/stop?

Output: List of ORF positions


21/40

Statistical Approaches


22/40

Calculation Windows

Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low

To do this, we define a window size to be the width of the region over which each calculation is to be done

Example: %AT
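The %AT example can be sketched as a sliding-window calculation. The function name `percent_at` and the `step` parameter are hypothetical; each value is the percentage of A and T bases in one window, and windows advance by `step` bases.

```python
def percent_at(seq, window, step=1):
    """Percent A+T in each window of the given width along seq."""
    values = []
    for i in range(0, len(seq) - window + 1, step):
        chunk = seq[i:i + window]
        at_count = chunk.count("A") + chunk.count("T")
        values.append(100.0 * at_count / window)
    return values


print(percent_at("AATTGGCC", window=4))
```

A larger window smooths the curve but blurs the boundaries of the regions of interest, so the window size is usually chosen to be somewhat smaller than the features being sought.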


23/40

Base Composition Bias

For a protein with a roughly normal amino acid composition, the first 2 positions of all codons will be about 50% GC

If an organism has a high GC content overall, the third position of all codons must be mostly GC

Useful for prokaryotes

Not useful for eukaryotes due to large amount of noncoding DNA
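Third-position bias can be checked by computing GC content separately for each codon position of a candidate ORF. This is a minimal sketch; `gc_by_codon_position` is a hypothetical name, and the input is assumed to start at the first base of a codon.

```python
def gc_by_codon_position(orf):
    """GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    return fractions


print(gc_by_codon_position("ATGGCCGCGTAA"))
```

In a GC-rich organism, a true coding frame would be expected to show a markedly higher value at position 3 than at positions 1 and 2.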


24/40

Fickett's statistic

Also called TestCode analysis

Looks for asymmetry of base composition

Strong statistical basis for calculations

Method:

For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...

Calculate statistic from resulting three numbers
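The first step of the method, counting each base at positions 1, 4, 7..., 2, 5, 8... and 3, 6, 9..., can be sketched as follows. The slide does not give the formula for the statistic; the ratio used here, max/(min + 1) of the three counts for each base, is the position-asymmetry measure as I recall it from Fickett's paper and should be treated as an assumption. The full TestCode score additionally converts these values and the overall base content into probabilities through lookup tables and weights, which are not reproduced here.

```python
def position_asymmetry(window):
    """Per-base asymmetry across the three codon positions of a window.

    For each base, count its occurrences at positions 1,4,7..., 2,5,8...
    and 3,6,9..., then return max / (min + 1) of the three counts.
    (Assumed form of the asymmetry measure; see the text above.)
    """
    asymmetry = {}
    for base in "ACGT":
        counts = [window[k::3].count(base) for k in range(3)]
        asymmetry[base] = max(counts) / (min(counts) + 1)
    return asymmetry


print(position_asymmetry("ACGACGACG"))
```

Values near 1 indicate that a base is spread evenly over the three positions, as expected in noncoding sequence; larger values indicate the periodicity typical of coding regions.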


25/40


26/40

Codon Bias (Codon Preference)

Starting point: Table of observed codon frequencies in known genes from a given organism (best to use highly expressed genes)

Method

Calculate coding potential within a moving window for all three reading frames

Look for ORFs with high scores


27/40

Codon Bias (Codon Preference)

Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues

may have to group genes into sets

Codon bias can also be used to estimate protein expression level


28/40

Portion of D. melanogaster codon frequency table

Amino Acid   Codon   Number   Freq/1000   Fraction
Gly          GGG     11       2.60        0.03
Gly          GGA     92       21.74       0.28
Gly          GGT     86       20.33       0.26
Gly          GGC     142      33.56       0.43
Glu          GAG     212      50.11       0.75
Glu          GAA     69       16.31       0.25
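The Fraction column is each codon's count divided by the total count for all codons of the same amino acid. A minimal sketch using the counts from the table above; `codon_fractions` is a hypothetical name and the input is assumed to hold exactly one synonymous family.

```python
def codon_fractions(counts):
    """Fraction of each codon within one synonymous family, rounded to 2 places."""
    total = sum(counts.values())
    return {codon: round(n / total, 2) for codon, n in counts.items()}


gly = codon_fractions({"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142})
glu = codon_fractions({"GAG": 212, "GAA": 69})
print(gly)
print(glu)
```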


29/40

Comparison of Glycine codon frequencies

Codon   E. coli   D. melanogaster
GGG     0.02      0.03
GGA     0.00      0.28
GGT     0.59      0.26
GGC     0.38      0.43


30/40


31/40


32/40


33/40


34/40

Codon Preference Algorithms

The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
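The calculation described above can be sketched for a single window as follows. This is an illustration of the description on this slide, not a reproduction of the published program: `staden_frame_probabilities` is a hypothetical name, the usage table is a toy example rather than real data, every usage value is assumed to be positive (real tables may need pseudocounts), and the same number of codons is taken from each frame so the three products are comparable. Sums of logarithms are used in place of direct products to avoid underflow in long windows.

```python
import math
from itertools import product


def staden_frame_probabilities(window, usage):
    """Relative coding probability of frames 1-3 within one window.

    usage maps each of the 64 codons to a positive frequency summing to 1.
    """
    n_codons = (len(window) - 2) // 3        # same codon count in every frame
    log_products = []
    for frame in range(3):
        codons = (window[frame + 3 * j: frame + 3 * j + 3] for j in range(n_codons))
        log_products.append(sum(math.log(usage[c]) for c in codons))
    peak = max(log_products)                  # subtract the peak for stability
    weights = [math.exp(v - peak) for v in log_products]
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage table (an assumption for illustration): GGC is 10x more common.
raw = {"".join(c): 1.0 for c in product("TCAG", repeat=3)}
raw["GGC"] = 10.0
toy_usage = {codon: v / sum(raw.values()) for codon, v in raw.items()}

print(staden_frame_probabilities("GGCGGCGGCGG", toy_usage))
```

In the toy example the first frame reads GGC three times, so almost all of the probability falls on frame 1.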


35/40

Codon Preference Algorithms

The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
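One way to sketch this score for a single frame of a window is shown below. Several details are assumptions, since the slide does not spell them out: the random expectation is also normalized within each synonymous family so that both terms are on the same scale, the per-codon ratios are combined as a geometric mean, the standard genetic code defines the families, and all usage values are positive. The names `gribskov_score`, `usage` and `base_freq` are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), STANDARD_AA))


def gribskov_score(window, usage, base_freq, frame=0):
    """Geometric mean of observed/random codon preference in one frame.

    usage: positive codon counts or frequencies from known genes.
    base_freq: mononucleotide frequencies of the target sequence.
    """
    def random_prob(codon):
        return base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]

    logs = []
    for i in range(frame, len(window) - 2, 3):
        codon = window[i:i + 3]
        family = [c for c in CODE if CODE[c] == CODE[codon]]
        # Observed share of this codon among its synonymous alternatives.
        observed = usage[codon] / sum(usage[c] for c in family)
        # Share expected by chance from base composition alone.
        expected = random_prob(codon) / sum(random_prob(c) for c in family)
        logs.append(math.log(observed / expected))
    return math.exp(sum(logs) / len(logs))


uniform_bases = {b: 0.25 for b in BASES}
toy_usage = {codon: 1.0 for codon in CODE}
toy_usage["GGC"] = 3.0      # toy value: GGC is half of all Gly codons

print(gribskov_score("GGCGGC", toy_usage, uniform_bases))
```

A score above 1 means the window uses preferred codons more often than base composition alone would predict; scores near 1 are what noncoding sequence would be expected to give.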


36/40


37/40


38/40

Plot from syco


39/40

Summary

Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code

Translation can occur in three forward and three reverse reading frames

Open reading frames are regions that can be translated without encountering a stop codon


40/40

Summary

The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon base composition or codon preference tables

This can be used to scan long sequences for possible coding regions
