Computational Biology, Part 8
Protein Coding Regions
Robert F. Murphy
Copyright 1996-2009.
All rights reserved.
Sequence Analysis Tasks
Finding protein coding regions
Goal
Given a DNA or RNA sequence, find those regions that code for protein(s)
Direct approach: Look for stretches that can be interpreted as protein using the genetic code
Statistical approaches: Use other knowledge about likely coding regions
Direct Approach
Genetic codes
The set of tRNAs that an organism possesses defines its genetic code(s)
The universal genetic code is common to most organisms
Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
More than one tRNA may be present for a given codon, allowing more than one possible translation product
Genetic codes
Differences between genetic codes occur mainly in start and stop codons
Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)
Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)
Genetic codes
Note additional start codons: UUA, UUG, CUG
Note conversion of stop codon UGA (opal) to Trp
Reading Frames
Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction)
Taking the complement of the sequence and reading in the reverse direction gives three more reading frames
Reading frames
      TTC TCA TGT TTG ACA GCT
RF1   Phe Ser Cys Leu Thr Ala>
RF2   Ser His Val *** Gln Leu>
RF3   Leu Met Phe Asp Ser>
      AAG AGT ACA AAC TGT CGA
RF4   <Glu *** Thr Gln Cys Ser
RF5   <Glu His Lys Val Ala
RF6   <Arg Met Asn Ser Leu
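The six-frame example above can be reproduced with a short sketch. It assumes the standard genetic code written in the DNA alphabet, and the function names are illustrative rather than taken from any particular package. Only complete codons are translated, so RF2 stops at Gln here; the slide's final Leu follows because CT plus any third base codes for Leu. The reverse frames are printed 5' to 3', which is the slide's right-to-left order.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; 1-2 trailing bases are ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return one-letter translations for RF1-RF3 (forward) and RF4-RF6 (reverse)."""
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"RF{k + 1}"] = translate(seq[k:])
        frames[f"RF{k + 4}"] = translate(rc[k:])
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
for name in sorted(frames):
    print(name, frames[name])
```

A stop codon appears as `*` in the output, matching the `***` entries in RF2 and RF4.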
Open Reading Frames (ORF)
Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to absence of stop codons)
Prerequisite: A specific genetic code
Definition:
(start codon) (amino acid coding codon)^n (stop codon)
Note: Not all ORFs are actually used
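The definition above maps onto a simple scan. The following is a minimal sketch, assuming a DNA alphabet, ATG as the only start codon and the three standard stops; `find_orfs` and its parameters are hypothetical names, and the start and stop sets are exactly where a different genetic code would be plugged in. It covers the forward strand only; the reverse strand can be handled by passing the reverse complement.

```python
def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Find forward-strand ORFs: start codon, coding codons, stop codon.

    Returns (start, end, frame) tuples. start is 0-based, end is exclusive
    and includes the stop codon, frame is 1, 2 or 3.
    """
    orfs = []
    for offset in range(3):
        orf_start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon in starts:
                    orf_start = i      # open a candidate ORF
            elif codon in stops:
                orfs.append((orf_start, i + 3, offset + 1))
                orf_start = None       # closed; look for the next start
    return orfs


# Hypothetical toy sequence: ATG AAA TTT TAG sits in frame 3.
print(find_orfs("CCATGAAATTTTAGGG"))
```

A candidate that never reaches a stop codon is not reported, which corresponds to answering "yes" to the "Ends start/stop?" option.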
Open Reading Frames
Click boxes for List ORFS and ORF map
Check reading frame: mod(696,3)=0 -> RF3
EMBOSS plotorf
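The reading-frame check can be written as a one-line rule. This sketch assumes that 696 is the 1-based start position of an ORF and that a remainder of 0 denotes frame 3, as the slide's arithmetic suggests; `reading_frame` is a hypothetical helper name.

```python
def reading_frame(position):
    """Map a 1-based start position to forward reading frame 1, 2 or 3."""
    remainder = position % 3
    return 3 if remainder == 0 else remainder


print(reading_frame(696))
```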
Splicing ORFs
For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product
ORFs from forward and reverse directions cannot be combined
Block Diagram for Search for ORFs
Inputs: Sequence to be searched; Genetic code; Both strands?; Ends start/stop?
Search Engine
Output: List of ORF positions
Statistical Approaches
Calculation Windows
Many sequence analyses require calculating some statistic over a long sequence looking for regions where the statistic is unusually high or low
To do this, we define a window size to be the width of the region over which each calculation is to be done
Example: %AT
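The %AT example can be sketched as a sliding-window calculation. The function name and the `step` parameter are illustrative assumptions; only full windows are scored.

```python
def percent_at(seq, window, step=1):
    """Return (start, %AT) for each full window; start is 0-based."""
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at = chunk.count("A") + chunk.count("T")
        results.append((start, 100.0 * at / window))
    return results


# Toy sequence that runs from AT-rich to GC-rich.
for start, value in percent_at("AATTGGCC", window=4):
    print(start, value)
```

Larger windows smooth the curve at the cost of resolution, so the window size is normally chosen relative to the size of the feature being sought.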
Base Composition Bias
For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons will be about 50% GC
If an organism has a high GC content overall, the third position of all codons must be mostly GC
Useful for prokaryotes
Not useful for eukaryotes due to large amount of noncoding DNA
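The position-specific GC content described above can be measured directly. This sketch assumes the input is already in frame (its first base is codon position 1); the function name and the toy sequence are hypothetical.

```python
def gc_by_codon_position(orf):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    fractions = []
    for pos in range(3):
        bases = orf[pos::3]                     # every third base, starting at pos
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Toy ORF: ATG GCC GCG TAA
print(gc_by_codon_position("ATGGCCGCGTAA"))
```

In a GC-rich genome, a true coding frame would be expected to show a third value well above the first two, while the wrong frames would not.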
Fickett’s statistic
Also called TestCode analysis
Looks for asymmetry of base composition
Strong statistical basis for calculations
Method:
For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...
Calculate statistic from resulting three numbers
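The counting step of the method can be sketched as follows. The asymmetry measure used here, the largest of the three counts divided by the smallest plus one, is an assumption about how the three numbers are combined for each base. The full TestCode score additionally relies on published lookup tables and weights that are not reproduced in this sketch. The function names are hypothetical.

```python
def position_counts(window):
    """Count each base at positions 1,4,7..., 2,5,8... and 3,6,9... of the window."""
    return {b: [window[k::3].count(b) for k in range(3)] for b in "ACGT"}


def position_asymmetry(window):
    """For each base: largest phase count divided by (smallest phase count + 1)."""
    counts = position_counts(window)
    return {b: max(c) / (min(c) + 1) for b, c in counts.items()}


# Toy window with perfect three-base periodicity.
print(position_counts("ATGATGATGATG"))
print(position_asymmetry("ATGATGATGATG"))
```

A window with no periodicity gives similar counts in the three phases and therefore low asymmetry values.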
Codon Bias (Codon Preference)
Principle
Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred codon usage
Non-coding regions, on the other hand, feel no selective pressure and can drift
Codon Bias (Codon Preference)
Starting point: Table of observed codon frequencies in known genes from a given organism
best to use highly expressed genes
Method
Calculate “coding potential” within a moving window for all three reading frames
Look for ORFs with high scores
Codon Bias (Codon Preference)
Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues
may have to group genes into sets
Codon bias can also be used to estimate protein expression level
Portion of D. melanogaster codon frequency table
Amino Acid  Codon  Number  Freq/1000  Fraction
Gly         GGG    11      2.60       0.03
Gly         GGA    92      21.74      0.28
Gly         GGT    86      20.33      0.26
Gly         GGC    142     33.56      0.43
Glu         GAG    212     50.11      0.75
Glu         GAA    69      16.31      0.25
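The Fraction column can be recomputed from the Number column: each count is divided by the total for its amino acid. The sketch below uses the counts from the table; the function name is illustrative. Freq/1000 is not recomputed because it depends on the total number of codons in the data set, which this portion of the table does not state.

```python
counts = {
    "Gly": {"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142},
    "Glu": {"GAG": 212, "GAA": 69},
}


def synonymous_fractions(counts):
    """Divide each codon count by the total count for its amino acid."""
    fractions = {}
    for codons in counts.values():
        total = sum(codons.values())
        for codon, n in codons.items():
            fractions[codon] = n / total
    return fractions


fractions = synonymous_fractions(counts)
for codon, value in fractions.items():
    print(codon, round(value, 2))
```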
Comparison of Glycine codon frequencies
Codon  E. coli  D. melanogaster
GGG    0.02     0.03
GGA    0.00     0.28
GGT    0.59     0.26
GGC    0.38     0.43
Illustration of Codon Bias Plots
Use Entrez via MacVector to get sequence of lexA
Under “Database” select “Internet Entrez Search”
Select gene=lexA AND organism=Escherichia
Pick one (e.g., region from 89.2 to 92.8)
Under “Analyze” select “Codon Preference Plots”
Choose Escherichia coli codon bias file
Choose gene region corresponding to lacZ
Click on Staden codon bias and Gribskov codon bias
[Figure: Staden Codon Preference plots, Window = 40 codons, Frames +1, +2, +3, -1, -2, -3; horizontal axis 122400 to 122900; vertical axis -28 to 18]
[Figure: Gribskov Codon Preference plots, Window = 40 codons, Frames +1, +2, +3; horizontal axis 122400 to 122900; vertical axis 0.70 to 1.30]
[Figure: Staden Codon Preference plots, Window = 40 codons, Frames +1, +2, +3, -1, -2, -3; horizontal axis 122000 to 123200; vertical axis -22 to 14]
[Figure: Gribskov Codon Preference plots, Window = 40 codons, Frames +1, +2, +3; horizontal axis 122000 to 123200; vertical axis 0.70 to 1.30]
Codon Preference Algorithms
The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
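The calculation described above can be sketched for a single window. This is an illustration under stated assumptions, not the published implementation: "all three positions" is read as the three reading frames of the window, each frame is given the same number of codons, products are accumulated as log sums, and zero entries in the table are replaced by a small `floor` value so that the logarithm is defined. The function name and the usage table in the example are hypothetical.

```python
import math
from itertools import product


def staden_relative_probability(window, usage, floor=1e-6):
    """Relative coding probability of frames 1-3 of one window.

    usage maps each codon to its frequency, with all 64 entries summing to 1.
    """
    n_codons = (len(window) - 2) // 3       # same codon count in every frame
    log_p = []
    for frame in range(3):
        total = 0.0
        for j in range(n_codons):
            codon = window[frame + 3 * j: frame + 3 * j + 3]
            total += math.log(max(usage.get(codon, 0.0), floor))
        log_p.append(total)
    peak = max(log_p)                       # subtract the peak before exp() for stability
    weights = [math.exp(lp - peak) for lp in log_p]
    return [w / sum(weights) for w in weights]


# Hypothetical usage table: three favoured codons weighted 10x, normalized to sum to 1.
codons = ["".join(c) for c in product("ACGT", repeat=3)]
raw = {c: (10.0 if c in ("ATG", "GCC", "AAA") else 1.0) for c in codons}
usage = {c: v / sum(raw.values()) for c, v in raw.items()}
probs = staden_relative_probability("ATGGCCAAAATGG", usage)
print(probs)
```

Sliding the window along the sequence and plotting the value for each frame gives one curve per frame, as in the codon preference plots.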
Codon Preference Algorithms
The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
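A sketch of this calculation for one frame of one window follows. Several details are assumptions made for illustration: the random expectation is also normalized within each synonymous family, the per-codon ratios are combined as a geometric mean (so a score near 1 means no preference), and zero table entries are replaced by a small `floor`. The function name is hypothetical. The example uses the D. melanogaster Gly and Glu fractions from the table earlier in these slides and uniform base frequencies.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gribskov_preference(window, fraction, base_freq, floor=0.01):
    """Geometric mean over in-frame codons of (table fraction / random fraction).

    fraction:  codon -> usage fraction among its synonymous codons
    base_freq: base -> mononucleotide frequency in the target sequence
    """
    def random_prob(codon):
        return base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]

    logs = []
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        family = [c for c in CODE if CODE[c] == CODE[codon]]
        random_fraction = random_prob(codon) / sum(random_prob(c) for c in family)
        observed = max(fraction.get(codon, 0.0), floor)
        logs.append(math.log(observed / random_fraction))
    if not logs:
        raise ValueError("window must contain at least one codon")
    return math.exp(sum(logs) / len(logs))


fraction = {"GGG": 0.03, "GGA": 0.28, "GGT": 0.26, "GGC": 0.43, "GAG": 0.75, "GAA": 0.25}
uniform = {b: 0.25 for b in "ACGT"}
high = gribskov_preference("GGCGAGGGC", fraction, uniform)   # preferred codons
low = gribskov_preference("GGGGAAGGG", fraction, uniform)    # rare codons
print(high, low)
```

The other two frames are scored by calling the function on the window shifted by one and two bases.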
Plot from syco
Summary
Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code
Translation can occur in three forward and three reverse reading frames
Open reading frames are regions that can be translated without encountering a stop codon
Summary
The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon base composition or codon preference tables
This can be used to scan long sequences for possible coding regions