lectures part 08

Upload: dev-ashwani

Post on 03-Apr-2018

  • 7/28/2019 Lectures Part 08

    1/40

    Computational Biology, Part 8

    Protein Coding Regions

    Robert F. Murphy

    Copyright 1996-2009. All rights reserved.

    Sequence Analysis Tasks

    Finding protein coding regions

    Goal

    Given a DNA or RNA sequence, find those regions that code for protein(s)

    Direct approach: Look for stretches that can be interpreted as protein using the genetic code

    Statistical approaches: Use other knowledge about likely coding regions

    Direct Approach

    Genetic codes

    The set of tRNAs that an organism possesses defines its genetic code(s)

    The "universal" genetic code is shared by most organisms

    Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes

    More than one tRNA may be present for a given codon, allowing more than one possible translation product

    Genetic codes

    Differences in genetic codes occur mainly in start and stop codons

    Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)

    Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)

    Genetic codes

    Note additional start codons: UUA, UUG, CUG

    Note conversion of stop codon UGA (opal) to Trp

    Reading Frames

    Since nucleotide sequences are read three bases at a time, there are three possible frames in which a given nucleotide sequence can be read (in the forward direction)

    Taking the complement of the sequence and reading in the reverse direction gives three more reading frames

    Reading frames

    TTC TCA TGT TTG ACA GCT

    RF1  Phe Ser Cys Leu Thr Ala>
    RF2   Ser His Val *** Gln Leu>
    RF3    Leu Met Phe Asp Ser>

    AAG AGT ACA AAC TGT CGA

    RF4
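
    The six reading frames above can be reproduced with a short Python sketch. It assumes the standard genetic code and translates complete codons only, so the trailing partial codon that the slide shows as Leu in RF2 is dropped. The function names and the numbering of the reverse frames (RF4 to RF6, read from the reverse complement at offsets 0, 1 and 2) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only, using one-letter amino acid codes."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return {frame name: protein} for three forward and three reverse frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"RF{offset + 1}"] = translate(seq[offset:])
        frames[f"RF{offset + 4}"] = translate(rc[offset:])
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
print(frames["RF1"])  # FSCLTA  (Phe Ser Cys Leu Thr Ala)
print(frames["RF2"])  # SHV*Q   (Ser His Val *** Gln)
print(frames["RF3"])  # LMFDS   (Leu Met Phe Asp Ser)
```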

    Open Reading Frames (ORF)

    Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence ("open" refers to absence of stop codons)

    Prerequisite: A specific genetic code

    Definition: (start codon) (amino acid coding codon)n (stop codon), where n is the number of coding codons

    Note: Not all ORFs are actually used
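
    The definition above maps directly onto a scan of each forward frame. The following is a minimal sketch, not a reference implementation: the function name `find_orfs`, its parameters, and the default start and stop codons (those of the standard code) are assumptions. To search both strands, the same function could be run on the reverse complement.

```python
def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA"), min_codons=1):
    """Return (frame, start, end) tuples for ORFs on the forward strand.

    frame is 1, 2 or 3; start and end are 1-based and inclusive, and end
    includes the stop codon. min_codons counts the codons from the start
    codon up to, but not including, the stop codon.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for offset in range(3):
        start = None  # index of the first start codon seen since the last stop
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((offset + 1, start + 1, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))  # [(3, 3, 11)]: ATG AAA TGA in frame 3
```

    An ORF that runs off the end of the sequence without a stop codon is not reported by this sketch; whether sequence ends should count as start or stop is a separate choice.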

    Open Reading Frames

    Click boxes for List ORFS and ORF map

    Check reading frame: mod(696,3) = 0 -> RF3
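
    The check above can be written as a one-line rule. This sketch assumes that 696 is the 1-based position of the first base of a codon, which is consistent with the RF3 result; the function name is hypothetical.

```python
def reading_frame(start_position):
    """Forward reading frame (1, 2 or 3) for a 1-based codon start position."""
    remainder = start_position % 3
    return 3 if remainder == 0 else remainder


print(reading_frame(696))  # 3
```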

    EMBOSS plotorf

    Splicing ORFs

    For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product

    ORFs from forward and reverse directions cannot be combined

    Block Diagram for Search for ORFs

    Inputs: Sequence to be searched; Genetic code; Both strands?; Ends start/stop?

    Search Engine

    Output: List of ORF positions

    Statistical Approaches

    Calculation Windows

    Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low

    To do this, we define a window size to be the width of the region over which each calculation is to be done

    Example: %AT
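
    As an illustration of the %AT example, the sketch below slides a fixed-width window along a sequence and reports the statistic for each full window. The function name and the `step` parameter are assumptions, not part of the lecture.

```python
def percent_at_windows(seq, window, step=1):
    """Yield (start, %AT) for each full window; start is a 0-based index."""
    seq = seq.upper().replace("U", "T")
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at = chunk.count("A") + chunk.count("T")
        yield start, 100.0 * at / window


for start, pct in percent_at_windows("AATTGGCC", window=4, step=2):
    print(start, pct)
# 0 100.0
# 2 50.0
# 4 0.0
```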

    Base Composition Bias

    For a protein with a roughly normal amino acid composition, the first 2 positions of all codons will be about 50% GC

    If an organism has a high GC content overall, the third position of all codons must be mostly GC

    Useful for prokaryotes

    Not useful for eukaryotes due to large amount of noncoding DNA
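
    To make the third-position argument concrete, the following sketch computes the GC fraction separately at codon positions 1, 2 and 3 of a candidate in-frame sequence. The function name is hypothetical, and the input is assumed to start at the first base of a codon.

```python
def gc_by_codon_position(orf):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]
        gc = bases.count("G") + bases.count("C")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Codons ATG GCC AAC: position 3 is G, C, C, so its GC fraction is 1.0.
print(gc_by_codon_position("ATGGCCAAC"))
```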

    Fickett's statistic

    Also called TestCode analysis

    Looks for asymmetry of base composition

    Strong statistical basis for calculations

    Method:

        For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...

        Calculate statistic from resulting three numbers
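
    The first step of the method can be sketched as follows. For each base, the counts at the three phases (1, 4, 7...; 2, 5, 8...; 3, 6, 9...) are reduced to a single asymmetry value. The max/(min + 1) form used here follows the commonly cited description of TestCode and should be treated as an assumption; the later step that converts these values into a coding probability through lookup tables is not reproduced.

```python
def position_asymmetry(window_seq):
    """Per-base asymmetry of counts across the three codon phases.

    For each base, count its occurrences at positions 1, 4, 7..., at
    2, 5, 8... and at 3, 6, 9..., then return max / (min + 1).
    """
    window_seq = window_seq.upper().replace("U", "T")
    result = {}
    for base in "ACGT":
        counts = [window_seq[phase::3].count(base) for phase in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result


# A strongly periodic window: A appears only in the first phase.
print(position_asymmetry("ATGATGATGATG")["A"])  # 4.0
```

    A window with no periodicity, such as repeated ACGT, gives the same low value for every base, which is the kind of contrast the statistic is meant to expose.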

    Codon Bias (Codon Preference)

    Starting point: Table of observed codon frequencies in known genes from a given organism

        best to use highly expressed genes

    Method

        Calculate coding potential within a moving window for all three reading frames

        Look for ORFs with high scores
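
    The starting point, a table of observed codon counts, can be built from a set of known coding sequences. This is a minimal sketch with a hypothetical function name; the two toy genes are made up for illustration and are assumed to be in frame.

```python
from collections import Counter


def count_codons(genes):
    """Count in-frame codons over an iterable of coding sequences."""
    counts = Counter()
    for gene in genes:
        gene = gene.upper().replace("U", "T")
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    return counts


counts = count_codons(["ATGGGCGGCTAA", "ATGGGTTGA"])  # toy sequences, not real genes
print(counts["GGC"], counts["GGT"])  # 2 1
```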

    Codon Bias (Codon Preference)

    Works best for prokaryotes or unicellular eukaryotes because, for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues

        may have to group genes into sets

    Codon bias can also be used to estimate protein expression level

    Portion of D. melanogaster codon frequency table

    Amino Acid  Codon  Number  Freq/1000  Fraction
    Gly         GGG        11       2.60      0.03
    Gly         GGA        92      21.74      0.28
    Gly         GGT        86      20.33      0.26
    Gly         GGC       142      33.56      0.43
    Glu         GAG       212      50.11      0.75
    Glu         GAA        69      16.31      0.25
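
    The two derived columns can be recomputed from the Number column. Fraction is the count divided by the total for that amino acid; Freq/1000 is the count per thousand codons in the whole table. The total of 4231 codons used below is not stated in the source; it is an assumption inferred from the Number and Freq/1000 values shown.

```python
# Counts copied from the table above.
COUNTS = {
    "Gly": {"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142},
    "Glu": {"GAG": 212, "GAA": 69},
}
TOTAL_CODONS = 4231  # assumption: inferred from the table, not given in the source


def table_rows(counts, total):
    """Return (amino acid, codon, number, freq per 1000, fraction) rows."""
    rows = []
    for aa, codons in counts.items():
        family_total = sum(codons.values())  # all synonymous codons for this amino acid
        for codon, n in codons.items():
            rows.append((aa, codon, n,
                         round(1000 * n / total, 2),
                         round(n / family_total, 2)))
    return rows


for row in table_rows(COUNTS, TOTAL_CODONS):
    print(row)
```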

    Comparison of Glycine codon frequencies

    Codon  E. coli  D. melanogaster
    GGG       0.02             0.03
    GGA       0.00             0.28
    GGT       0.59             0.26
    GGC       0.38             0.43

    Codon Preference Algorithms

    The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
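
    The calculation described above can be sketched for a single window as follows. This is an illustration of the description, not the published implementation: the function name, the use of logarithms for numerical stability, and the small `pseudo` value for codons missing from the table are all assumptions. The usage table in the example is a toy table.

```python
import math


def staden_relative_probabilities(window_seq, usage, pseudo=1e-6):
    """Relative coding probability of one window in each of the three frames.

    usage maps codon -> frequency (all 64 frequencies summing to 1).
    Codons missing from usage get `pseudo` to avoid log(0).
    """
    log_p = []
    for frame in range(3):
        total = 0.0
        for i in range(frame, len(window_seq) - 2, 3):
            total += math.log(usage.get(window_seq[i:i + 3], pseudo))
        log_p.append(total)
    peak = max(log_p)  # subtract the maximum before exponentiating
    weights = [math.exp(lp - peak) for lp in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]


toy_usage = {"GGC": 0.5, "GCG": 0.25, "CGG": 0.25}  # toy values, not real data
probs = staden_relative_probabilities("GGCGGCGGCGG", toy_usage)
print([round(p, 3) for p in probs])  # [0.8, 0.1, 0.1]
```

    A window length of the form 3k + 2, as in the example, gives each frame the same number of complete codons, so the three products are directly comparable.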

    Codon Preference Algorithms

    The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
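
    A rough sketch of the idea described above, for one frame of one window: each codon's within-amino-acid fraction is divided by the probability of that codon under independent mononucleotide frequencies, and the log ratios are averaged. The function names, the averaging, and the `pseudo` value are assumptions; details of the published method, such as exactly how the random expectation is normalized, are not reproduced here.

```python
import math
from collections import Counter


def mononucleotide_frequencies(seq):
    """Frequency of each base in the full target sequence."""
    return {base: n / len(seq) for base, n in Counter(seq).items()}


def preference_score(window_seq, fraction, base_freq, frame=0, pseudo=1e-6):
    """Mean log ratio of codon preference to random expectation for one frame.

    fraction maps codon -> usage among synonymous codons (sums to 1 per
    amino acid); codons missing from it get `pseudo`.
    """
    logs = []
    for i in range(frame, len(window_seq) - 2, 3):
        codon = window_seq[i:i + 3]
        random_p = math.prod(base_freq[b] for b in codon)
        logs.append(math.log(fraction.get(codon, pseudo) / random_p))
    return sum(logs) / len(logs) if logs else 0.0


target = "GGCGGCGGC"  # toy sequence
freqs = mononucleotide_frequencies(target)
fraction = {"GGC": 0.43}  # Gly GGC fraction from the D. melanogaster table
in_frame = preference_score(target, fraction, freqs, frame=0)
shifted = preference_score(target, fraction, freqs, frame=1)
print(in_frame > shifted)  # True
```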

    Plot from syco

    Summary

    Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code

    Translation can occur in three forward and three reverse reading frames

    Open reading frames are regions that can be translated without encountering a stop codon

    Summary

    The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon-position base composition or codon preference tables

    This can be used to scan long sequences for possible coding regions