7/28/2019 Lectures Part 08
1/40
Computational Biology, Part 8
Protein Coding Regions
Robert F. Murphy
Copyright 1996-2009. All rights reserved.
-
Sequence Analysis Tasks
Finding protein coding regions
-
Goal
Given a DNA or RNA sequence, find those regions that code for protein(s)
Direct approach: Look for stretches that can be interpreted as protein using the genetic code
Statistical approaches: Use other knowledge about likely coding regions
-
Direct Approach
-
Genetic codes
The set of tRNAs that an organism possesses defines its genetic code(s)
The "universal" genetic code is shared by most organisms
Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
More than one tRNA may be present for a given codon, allowing more than one possible translation product
-
Genetic codes
Differences between genetic codes occur mostly in start and stop codons
Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)
Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)
-
Genetic codes
Note additional start codons: UUA, UUG, CUG
Note conversion of stop codon UGA (opal) to Trp
-
Reading Frames
Since nucleotide sequences are read three bases at a time, there are three possible frames in which a given nucleotide sequence can be read (in the forward direction)
Taking the complement of the sequence and reading in the reverse direction gives three more reading frames
-
Reading frames
      TTC TCA TGT TTG ACA GCT
RF1   Phe Ser Cys Leu Thr Ala >
RF2    Ser His Val *** Gln Leu >
RF3     Leu Met Phe Asp Ser >
      AAG AGT ACA AAC TGT CGA
RF4
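The six translations in the diagram can be reproduced with a short script. This is a minimal sketch assuming the standard genetic code; the names `CODON_TABLE`, `translate` and `six_frames` are illustrative, and the numbering of the reverse frames (RF4 to RF6 read from the start of the reverse complement) is an assumption, since the slide's RF4 row did not survive.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return one-letter translations for three forward and three reverse frames."""
    seq = seq.upper().replace("U", "T")
    rev = reverse_complement(seq)
    frames = {f"RF{n + 1}": translate(seq, n) for n in range(3)}
    # Assumption: RF4-RF6 are offsets 0-2 of the reverse complement.
    frames.update({f"RF{n + 4}": translate(rev, n) for n in range(3)})
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
print(frames["RF1"])  # FSCLTA  (Phe Ser Cys Leu Thr Ala)
print(frames["RF2"])  # SHV*Q   (Ser His Val *** Gln)
print(frames["RF3"])  # LMFDS   (Leu Met Phe Asp Ser)
```

Only complete codons are translated, so RF2 ends at Gln here; the slide's final Leu in RF2 comes from bases beyond the 18 shown.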
-
Open Reading Frames (ORF)
Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence ("open" refers to absence of stop codons)
Prerequisite: A specific genetic code
Definition: (start codon) (amino acid coding codon)n (stop codon), with the middle codon repeated n times
Note: Not all ORFs are actually used
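The definition above translates directly into a scan of each forward frame. The following is a minimal sketch, assuming the standard code with ATG as the only start codon and TAA, TAG, TGA as stops; `find_orfs` and its `min_codons` parameter are hypothetical names, and a real search would also cover the reverse strand and alternate start codons.

```python
STARTS = {"ATG"}                  # assumption: standard start codon only
STOPS = {"TAA", "TAG", "TGA"}     # standard stop codons


def find_orfs(seq, min_codons=1):
    """Find start...stop ORFs in the three forward frames.

    Returns (frame, start, end) tuples, where frame is 1-3 and start/end are
    1-based positions; end is the last base of the stop codon.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3    # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((frame + 1, start + 1, i + 3))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(3, 3, 14)]
```

An ORF still open at the end of the sequence is not reported here, which corresponds to requiring that ends be a start and a stop.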
-
Open Reading Frames
Click boxes for List ORFS and ORF map
-
Check reading frame: mod(696,3) = 0 -> RF3
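The same check can be written as a one-line function. This sketch assumes 1-based start positions and the convention implied by the slide (remainder 1 is RF1, remainder 2 is RF2, remainder 0 is RF3); the function name is hypothetical.

```python
def reading_frame(start_position):
    """Map a 1-based start position to forward reading frame 1, 2 or 3."""
    remainder = start_position % 3
    return 3 if remainder == 0 else remainder


print(reading_frame(696))  # 3
```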
-
EMBOSS plotorf
-
Splicing ORFs
For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product
ORFs from forward and reverse directions cannot be combined
-
Block Diagram for Search for ORFs
Inputs: Sequence to be searched, Genetic code
Options: Both strands? Ends start/stop?
Search Engine
Output: List of ORF positions
-
Statistical Approaches
-
Calculation Windows
Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low
To do this, we define a window size to be the width of the region over which each calculation is to be done
Example: %AT
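The %AT example can be sketched as a sliding-window calculation. The function name and the `step` parameter are illustrative assumptions; each result is reported at the 1-based position where its window begins.

```python
def percent_at_windows(seq, window, step=1):
    """Return (start, %AT) for each window of width `window` along `seq`."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_count = chunk.count("A") + chunk.count("T")
        results.append((start + 1, 100.0 * at_count / window))
    return results


for position, value in percent_at_windows("AATTGGCC", window=4):
    print(position, value)
```

Plotting the second column against the first gives the usual profile in which unusually high or low regions stand out.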
-
Base Composition Bias
For a protein with a roughly normal amino acid composition, the first 2 positions of all codons will be about 50% GC
If an organism has a high GC content overall, the third position of all codons must be mostly GC
Useful for prokaryotes
Not useful for eukaryotes due to large amount of noncoding DNA
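One way to look for this bias is to measure GC content separately at each codon position of a candidate frame. The sketch below is illustrative only; the function name and the `frame` offset parameter are assumptions.

```python
def gc_by_codon_position(seq, frame=0):
    """Return the GC fraction at codon positions 1, 2 and 3 for one frame."""
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    fractions = []
    for position in range(3):
        bases = seq[position:usable:3]        # every third base from this position
        gc = bases.count("G") + bases.count("C")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


# Codons ATG GCC AAG TTC: third position is always G or C.
print(gc_by_codon_position("ATGGCCAAGTTC"))  # [0.25, 0.25, 1.0]
```

In a GC-rich organism, a frame whose third-position value is much higher than the other two is a candidate coding frame.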
-
Fickett's statistic
Also called TestCode analysis
Looks for asymmetry of base composition
Strong statistical basis for calculations
Method:
  For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...
  Calculate statistic from resulting three numbers
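The first step of the method, counting each base at the three interleaved positions, can be sketched as follows. The asymmetry ratio max/(min + 1) is used here as an assumed, simplified stand-in for the final statistic; the full TestCode calculation combines such values with weighting tables that are not reproduced here. All function names are hypothetical.

```python
def position_counts(window):
    """Count each base at positions 1,4,7..., 2,5,8..., and 3,6,9... of a window."""
    window = window.upper()
    return {
        base: [window[offset::3].count(base) for offset in range(3)]
        for base in "ACGT"
    }


def position_asymmetry(window):
    """Assumed asymmetry measure per base: max count / (min count + 1)."""
    return {
        base: max(counts) / (min(counts) + 1)
        for base, counts in position_counts(window).items()
    }


counts = position_counts("ATGATGATGATG")
print(counts["A"])                              # [4, 0, 0]
print(position_asymmetry("ATGATGATGATG")["A"])  # 4.0
```

Coding windows tend to give larger asymmetry values than noncoding ones, because codon usage places bases unevenly across the three positions.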
-
Codon Bias (Codon Preference)
Starting point: Table of observed codon frequencies in known genes from a given organism
  best to use highly expressed genes
Method
  Calculate coding potential within a moving window for all three reading frames
  Look for ORFs with high scores
-
Codon Bias (Codon Preference)
Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues
  may have to group genes into sets
Codon bias can also be used to estimate protein expression level
-
Portion of D. melanogaster codon frequency table
Amino Acid  Codon  Number  Freq/1000  Fraction
Gly         GGG        11       2.60      0.03
Gly         GGA        92      21.74      0.28
Gly         GGT        86      20.33      0.26
Gly         GGC       142      33.56      0.43
Glu         GAG       212      50.11      0.75
Glu         GAA        69      16.31      0.25
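The two derived columns follow from the raw counts: Freq/1000 is the count per thousand codons in the whole data set, and Fraction is the count divided by the total for that amino acid. The sketch below recomputes them from the Number column. The total of 4231 codons is not stated in the source; it is back-calculated from the Number and Freq/1000 columns and should be treated as an assumption, as should the function name.

```python
def usage_rows(counts_by_amino_acid, total_codons):
    """Return (amino acid, codon, number, freq per 1000, fraction) rows."""
    rows = []
    for amino_acid, counts in counts_by_amino_acid.items():
        family_total = sum(counts.values())   # all codons for this amino acid
        for codon, number in counts.items():
            per_thousand = round(1000 * number / total_codons, 2)
            fraction = round(number / family_total, 2)
            rows.append((amino_acid, codon, number, per_thousand, fraction))
    return rows


counts = {
    "Gly": {"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142},
    "Glu": {"GAG": 212, "GAA": 69},
}
# Assumption: 4231 total codons, inferred from the table rather than given.
for row in usage_rows(counts, total_codons=4231):
    print(row)
```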
-
Comparison of Glycine codon frequencies
Codon  E. coli  D. melanogaster
GGG    0.02     0.03
GGA    0.00     0.28
GGT    0.59     0.26
GGC    0.38     0.43
-
Codon Preference Algorithms
The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
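The calculation described above can be sketched for a single window as follows. This is an illustration of the description on the slide, not the published implementation: the function name, the fixed number of codons per frame, the `floor` value for codons missing from the table, and the toy usage table are all assumptions. Products are accumulated as sums of logarithms to avoid underflow in long windows.

```python
import math
from itertools import product


def staden_relative_probabilities(seq, start, n_codons, codon_freq, floor=1e-6):
    """Relative coding probability of the three frames of one window.

    `codon_freq` maps codons to usage values that sum to 1 over all 64 codons.
    Each frame is scored over the same number of codons, beginning at
    0-based index start, start + 1 and start + 2.
    """
    seq = seq.upper()
    if start + 2 + 3 * n_codons > len(seq):
        raise ValueError("window extends past the end of the sequence")
    log_products = []
    for frame in range(3):
        total = 0.0
        for k in range(n_codons):
            i = start + frame + 3 * k
            total += math.log(codon_freq.get(seq[i:i + 3], floor))
        log_products.append(total)
    # Normalize by the sum over the three frames (shifted for numerical safety).
    peak = max(log_products)
    weights = [math.exp(value - peak) for value in log_products]
    return [w / sum(weights) for w in weights]


# Toy usage table (hypothetical values): GGC strongly preferred, rest uniform.
toy_freq = {"".join(c): 0.5 / 63 for c in product("ACGT", repeat=3)}
toy_freq["GGC"] = 0.5
probs = staden_relative_probabilities("GGCGGCGGCGGCGG", 0, 4, toy_freq)
print(probs)
```

Sliding `start` along the sequence and plotting the three values gives one curve per reading frame.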
-
Codon Preference Algorithms
The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
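A window score along these lines can be sketched as below. It follows the slide's wording literally: each codon's within-amino-acid fraction is divided by the product of the three base frequencies of the target sequence, and the log ratios are averaged over the window. The published statistic may differ in detail (for example in how the random expectation is normalized), so treat this as an assumption-laden illustration; the function name, `floor` value and example sequence are hypothetical. The fractions are the D. melanogaster Gly and Glu values from the table earlier in these slides.

```python
import math
from collections import Counter


def gribskov_scores(seq, start, n_codons, synonymous_fraction, floor=1e-6):
    """Mean log(preference / random expectation) for the three frames of a window."""
    seq = seq.upper()
    if start + 2 + 3 * n_codons > len(seq):
        raise ValueError("window extends past the end of the sequence")
    # Mononucleotide frequencies of the target sequence.
    base_freq = {base: n / len(seq) for base, n in Counter(seq).items()}
    scores = []
    for frame in range(3):
        total = 0.0
        for k in range(n_codons):
            i = start + frame + 3 * k
            codon = seq[i:i + 3]
            expected = base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]
            total += math.log(synonymous_fraction.get(codon, floor) / expected)
        scores.append(total / n_codons)
    return scores


fractions = {"GGG": 0.03, "GGA": 0.28, "GGT": 0.26, "GGC": 0.43,
             "GAG": 0.75, "GAA": 0.25}
scores = gribskov_scores("GGCGAGGGCGAGGG", 0, 4, fractions)
print(scores)
```

The frame containing the preferred codons GGC and GAG scores highest; codons absent from the partial table fall back to `floor` and pull the other frames down.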
-
Plot from syco
-
Summary
Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code
Translation can occur in three forward and three reverse reading frames
Open reading frames are regions that can be translated without encountering a stop codon
-
Summary
The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon-position base composition or codon preference tables
This can be used to scan long sequences for possible coding regions