Page 1: Computational Biology, Part 8 Protein Coding Regions Robert F. Murphy Copyright  1996-2009. All rights reserved

Computational Biology, Part 8
Protein Coding Regions

Robert F. Murphy

Copyright 1996-2009. All rights reserved.

Page 2:

Sequence Analysis Tasks

Finding protein coding regions

Page 3:

Goal

Given a DNA or RNA sequence, find those regions that code for protein(s)

Direct approach: Look for stretches that can be interpreted as protein using the genetic code

Statistical approaches: Use other knowledge about likely coding regions

Page 4:

Direct Approach

Page 5:

Genetic codes

The set of tRNAs that an organism possesses defines its genetic code(s)

The universal genetic code is common to nearly all organisms

Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes

More than one tRNA may be present for a given codon, allowing more than one possible translation product

Page 6:

Genetic codes

Differences between genetic codes occur mainly in start and stop codons

Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)

Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)

Page 7:

Genetic codes

Page 8:

Genetic codes

Page 9:

Genetic codes

Note additional start codons: UUA, UUG, CUG

Note conversion of stop codon UGA (opal) to Trp

Page 10:

Reading Frames

Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction)

Taking the complement of the sequence and reading in the reverse direction gives three more reading frames

Page 11:

Reading frames

TTC TCA TGT TTG ACA GCT

RF1 Phe Ser Cys Leu Thr Ala>

RF2 Ser His Val *** Gln Leu>

RF3 Leu Met Phe Asp Ser>

AAG AGT ACA AAC TGT CGA

RF4 <Glu *** Thr Gln Cys Ser

RF5 <Glu His Lys Val Ala

RF6 <Arg Met Asn Ser Leu
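The six-frame example on this slide can be reproduced with a short sketch. It assumes the standard genetic code, DNA (T rather than U) input, and one-letter amino acid symbols; the function names are illustrative and not taken from the lecture. Only complete codons are translated, so the trailing partial codon that the slide shows as Leu in RF2 is omitted here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption:
# organisms with variant codes need a different table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for RF1-RF3 (forward) and RF4-RF6 (reverse)."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"RF{offset + 1}"] = translate(seq[offset:])
        frames[f"RF{offset + 4}"] = translate(rc[offset:])
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
for name in sorted(frames):
    print(name, frames[name])
```

The reverse frames are printed in their own 5' to 3' direction, so RF4 reads SCQT*E, which is the slide's "<Glu *** Thr Gln Cys Ser" read from right to left.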

Page 12:

Open Reading Frames (ORF)

Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to absence of stop codons)

Prerequisite: A specific genetic code

Definition:

(start codon) (amino acid coding codon)^n (stop codon), i.e. a start codon, then n consecutive amino acid coding codons, then a stop codon

Note: Not all ORFs are actually used
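A minimal sketch of the ORF definition above, scanning the three forward frames of a DNA string. The default start and stop codon sets (ATG; TAA, TAG, TGA) are assumptions that reflect the standard code; pass other sets for a different genetic code. The function name and the 0-based, end-exclusive coordinates are illustrative choices, not the lecture's.

```python
def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA"), min_codons=0):
    """Return (frame, start, end) for forward-strand ORFs.

    frame is 1, 2 or 3; start and end are 0-based with end exclusive,
    so seq[start:end] runs from the start codon through the stop codon.
    min_codons is the minimum number of codons between start and stop.
    """
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i                      # open a candidate ORF
            elif orf_start is not None and codon in stops:
                n_coding = (i - orf_start) // 3 - 1
                if n_coding >= min_codons:
                    orfs.append((frame + 1, orf_start, i + 3))
                orf_start = None                   # close it and keep scanning
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)  # [(3, 2, 14)]: ATG AAA TTT TAG in frame 3
```

To search both strands, the same function can be run on the reverse complement of the sequence. Candidates that reach the end of the sequence without a stop codon are not reported by this sketch.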

Page 13:
Page 14:

Open Reading Frames

Click boxes for List ORFS and ORF map

Page 15:

Check reading frame: mod(696,3)=0 -> RF3
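The frame check is simple modular arithmetic. The sketch below assumes that 696 is a 1-based start position and that frames are numbered so that position 1 starts RF1; the function name is illustrative.

```python
def reading_frame(start_position):
    """Forward reading frame (1, 2 or 3) of a codon starting at a 1-based position."""
    remainder = start_position % 3
    # Positions 1, 4, 7, ... give remainder 1 (RF1); 2, 5, 8, ... give 2 (RF2);
    # 3, 6, 9, ... give 0, which corresponds to RF3.
    return 3 if remainder == 0 else remainder


print(reading_frame(696))  # 3, because 696 = 3 * 232
```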

Page 16:

EMBOSS plotorf

Page 17:
Page 18:
Page 19:

Splicing ORFs

For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate final product

ORFs from forward and reverse directions cannot be combined

Page 20:

Block Diagram for Search for ORFs

Inputs: Sequence to be searched; Genetic code; options "Both strands?" and "Ends start/stop?"

Search Engine

Output: List of ORF positions

Page 21:

Statistical Approaches

Page 22:

Calculation Windows

Many sequence analyses require calculating some statistic over a long sequence looking for regions where the statistic is unusually high or low

To do this, we define a window size to be the width of the region over which each calculation is to be done

Example: %AT
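The %AT example can be sketched as a sliding-window calculation. The function name and the choice to advance the window one base at a time are assumptions; real tools often let the step size be set separately.

```python
def percent_at(seq, window):
    """Percent A+T in every window of the given width, stepping one base at a time."""
    seq = seq.upper()
    values = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        at_count = sum(base in "AT" for base in chunk)
        values.append(100.0 * at_count / window)
    return values


profile = percent_at("AATTGGCC", 4)
print(profile)  # [100.0, 75.0, 50.0, 25.0, 0.0]
```

Plotting such a profile against position and looking for unusually high or low stretches is the pattern that the later window-based statistics follow.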

Page 23:

Base Composition Bias

For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons will be about 50% GC

If an organism has a high GC content overall, the third position of all codons must be mostly GC

Useful for prokaryotes

Not useful for eukaryotes due to large amount of noncoding DNA
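A small sketch of the measurement behind this idea: the GC fraction at each of the three codon positions of a candidate coding sequence. It assumes the input starts in frame; the function name is illustrative.

```python
def gc_by_codon_position(coding_seq):
    """GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    fractions = []
    for position in range(3):
        bases = seq[position:usable:3]
        gc_count = sum(base in "GC" for base in bases)
        fractions.append(gc_count / len(bases) if bases else 0.0)
    return fractions


gc1, gc2, gc3 = gc_by_codon_position("GCAGCTGCC")
print(gc1, gc2, gc3)
```

In a GC-rich organism, a window read in the correct frame would be expected to show a markedly higher third-position value than the other two frames.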

Page 24:

Fickett’s statistic

Also called TestCode analysis

Looks for asymmetry of base composition

Strong statistical basis for calculations

Method:

For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...

Calculate statistic from resulting three numbers
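The first step of the method can be sketched as follows: count each base separately at positions 1, 4, 7..., 2, 5, 8... and 3, 6, 9... of a window, then summarize the asymmetry of the three counts. The ratio max / (min + 1) used here follows the commonly cited description of TestCode and is an assumption to be checked against the original publication. The published statistic also converts these values into probabilities through empirically derived lookup tables, which are not reproduced here.

```python
def phase_counts(window_seq):
    """Count each base at the three phases (positions 1,4,7..., 2,5,8..., 3,6,9...)."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for index, base in enumerate(window_seq.upper()):
        if base in counts:                    # skip ambiguity codes such as N
            counts[base][index % 3] += 1
    return counts


def position_asymmetry(window_seq):
    """Per-base asymmetry: largest phase count over (smallest phase count + 1)."""
    return {
        base: max(per_phase) / (min(per_phase) + 1)
        for base, per_phase in phase_counts(window_seq).items()
    }


asymmetry = position_asymmetry("ATGATGATG")
print(asymmetry)
```

A sequence with no periodicity gives similar counts in all three phases and therefore values near 1, while strongly periodic (coding-like) sequence gives larger values.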

Page 25:

Codon Bias (Codon Preference)

Principle

Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred codon usage

Non-coding regions, on the other hand, feel no selective pressure and can drift

Page 26:

Codon Bias (Codon Preference)

Starting point: Table of observed codon frequencies in known genes from a given organism

best to use highly expressed genes

Method

Calculate “coding potential” within a moving window for all three reading frames

Look for ORFs with high scores

Page 27:

Codon Bias (Codon Preference)

Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues

may have to group genes into sets

Codon bias can also be used to estimate protein expression level

Page 28:

Portion of D. melanogaster codon frequency table

Amino Acid  Codon  Number  Freq/1000  Fraction
Gly         GGG        11       2.60      0.03
Gly         GGA        92      21.74      0.28
Gly         GGT        86      20.33      0.26
Gly         GGC       142      33.56      0.43
Glu         GAG       212      50.11      0.75
Glu         GAA        69      16.31      0.25
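The Fraction column can be recomputed from the Number column: each count is divided by the total count for the same amino acid. The sketch below uses only the six rows shown on the slide; the variable and function names are illustrative. Freq/1000 additionally needs the total number of codons in the full table, which the slide does not state, so it is not recomputed here.

```python
from collections import defaultdict

# (amino acid, codon) -> observed count, copied from the slide
COUNTS = {
    ("Gly", "GGG"): 11,
    ("Gly", "GGA"): 92,
    ("Gly", "GGT"): 86,
    ("Gly", "GGC"): 142,
    ("Glu", "GAG"): 212,
    ("Glu", "GAA"): 69,
}


def synonymous_fractions(counts):
    """Fraction of each codon among the codons for the same amino acid."""
    totals = defaultdict(int)
    for (amino_acid, _codon), number in counts.items():
        totals[amino_acid] += number
    return {
        codon: number / totals[amino_acid]
        for (amino_acid, codon), number in counts.items()
    }


fractions = synonymous_fractions(COUNTS)
for codon, value in fractions.items():
    print(codon, round(value, 2))
```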

Page 29:

Comparison of Glycine codon frequencies

Codon  E. coli  D. melanogaster
GGG       0.02             0.03
GGA       0.00             0.28
GGT       0.59             0.26
GGC       0.38             0.43

Page 30:

Illustration of Codon Bias Plots

Use Entrez via MacVector to get sequence of lexA

under “Database” select “Internet Entrez Search”

Select gene=lexA AND organism=Escherichia

Pick one (e.g., region from 89.2 to 92.8)

Under “Analyze” select “Codon Preference Plots”

Choose Escherichia coli codon bias file

Choose gene region corresponding to lacZ

Click on Staden codon bias and Gribskov codon bias

Page 31:
Page 32:

[Figure: Staden Codon Preference plots (Window = 40 codons) for Frames +1, +2, +3, -1, -2 and -3, y-axis from -28 to 18, and Gribskov Codon Preference plots (Window = 40 codons) for Frames +1, +2 and +3, y-axis from 0.70 to 1.30; x-axis is sequence position, 122400 to 122900]

Page 33:

[Figure: Staden Codon Preference plots (Window = 40 codons) for Frames +1, +2, +3, -1, -2 and -3, y-axis from -22 to 14, and Gribskov Codon Preference plots (Window = 40 codons) for Frames +1, +2 and +3, y-axis from 0.70 to 1.30; x-axis is sequence position, 122000 to 123200]

Page 34:

Codon Preference Algorithms

The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
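The description above can be sketched for a single window as follows. This is an illustration under stated assumptions, not the published program: the usage table is a hypothetical toy table (two codons favoured, all others equal), every usage value is assumed to be non-zero, and the products are accumulated as sums of logarithms to avoid numerical underflow in long windows.

```python
import math
from itertools import product


def staden_relative_probabilities(window_seq, usage):
    """Relative coding probability of the three forward frames of one window.

    usage maps each of the 64 codons to a frequency; the frequencies sum to 1
    and must all be greater than zero (add a pseudocount otherwise).
    """
    n_codons = (len(window_seq) - 2) // 3     # same codon count in every frame
    log_products = []
    for frame in range(3):
        codons = [window_seq[frame + 3 * i: frame + 3 * i + 3] for i in range(n_codons)]
        log_products.append(sum(math.log(usage[codon]) for codon in codons))
    largest = max(log_products)
    weights = [math.exp(value - largest) for value in log_products]
    total = sum(weights)
    return [weight / total for weight in weights]   # P_frame / (P1 + P2 + P3)


# Hypothetical toy usage table: GGC and GAG are 20 times more frequent than the rest.
raw = {"".join(c): 1.0 for c in product("ACGT", repeat=3)}
raw["GGC"] = raw["GAG"] = 20.0
table_total = sum(raw.values())
toy_usage = {codon: value / table_total for codon, value in raw.items()}

probs = staden_relative_probabilities("GGCGAGGGCGAGGG", toy_usage)
print([round(p, 5) for p in probs])
```

Sliding the window along the sequence and plotting the result for each frame gives plots of the kind shown on the preceding slides, although the plotted scale there may be a transformed value.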

Page 35:

Codon Preference Algorithms

The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
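A simplified sketch in the spirit of this description, for one window in one frame. Several points are assumptions rather than statements about the published method or about MacVector: the standard genetic code defines which codons are synonymous, the per-codon ratios are combined as a geometric mean, the usage fractions are uniform except for the Gly and Glu values taken from the D. melanogaster table earlier in the lecture, and all names are illustrative.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fractions(weights):
    """Rescale per-codon weights so that synonymous codons sum to 1."""
    totals = defaultdict(float)
    for codon, weight in weights.items():
        totals[CODE[codon]] += weight
    return {c: (w / totals[CODE[c]] if totals[CODE[c]] else 0.0) for c, w in weights.items()}


def gribskov_score(window_seq, usage_fraction, base_freq):
    """Geometric mean over codons of (observed fraction / random expectation)."""
    random_weights = {c: base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]] for c in CODE}
    random_fraction = synonymous_fractions(random_weights)
    codons = [window_seq[i:i + 3] for i in range(0, len(window_seq) - 2, 3)]
    log_sum = sum(math.log(usage_fraction[c] / random_fraction[c]) for c in codons)
    return math.exp(log_sum / len(codons))


# Uniform usage within each amino acid, except Gly and Glu from the slide table.
usage = synonymous_fractions({codon: 1.0 for codon in CODE})
usage.update({"GGG": 0.03, "GGA": 0.28, "GGT": 0.26, "GGC": 0.43, "GAG": 0.75, "GAA": 0.25})
uniform_bases = {base: 0.25 for base in BASES}   # assumed mononucleotide frequencies

window = "GGCGAGGGCGAG"
in_frame = gribskov_score(window, usage, uniform_bases)
shifted = gribskov_score(window[1:], usage, uniform_bases)
print(round(in_frame, 3), round(shifted, 3))
```

Values above 1 indicate codons that are used more often than the base composition alone would predict, which is consistent with the 0.70 to 1.30 scale of the Gribskov plots.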

Page 36:
Page 37:
Page 38:

Plot from syco

Page 39:

Summary

Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code

Translation can occur in three forward and three reverse reading frames

Open reading frames are regions that can be translated without encountering a stop codon

Page 40:

Summary

The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon-position base composition or codon preference tables

This can be used to scan long sequences for possible coding regions