gen ant

Upload: dinesh-gupta

Post on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Gen Ant

    1/22

    5/4/12

    Genome Annotation

    Din

    es

    www.genomesonline.org

    protein-coding genes, nonprotein-coding genes

    easier to find than other functional elements

    why? genes are transcribed, which means that we can identify them by looking at RNA

    traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.

    protein-coding genes, nonprotein-coding genes

    we can also find genes ab initio using computational methods

    this is most suited to protein-coding genes

    why?

    protein-coding genes have recognizable features

    open reading frames (ORFs)

    codon bias

    known transcription and translational start and stop motifs (promoters, 3' poly-A sites)
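As a minimal sketch of how codon bias can act as a signal, the snippet below scores a reading frame by the log-odds of its codons under a "coding" frequency table versus a uniform background. The tiny frequency table, the `FLOOR` value, and the function name `codon_log_odds` are hypothetical and chosen only for illustration; real gene finders estimate such tables from known genes of the organism being annotated.

```python
import math

# Hypothetical codon frequencies for "coding" sequence (illustrative only;
# real tables are estimated from known genes of the organism of interest).
CODING_FREQ = {"GCC": 0.05, "CTG": 0.06, "AAG": 0.04, "GAG": 0.05}
BACKGROUND = 1.0 / 64  # uniform background: every codon equally likely
FLOOR = 0.005          # assumed frequency for codons missing from the toy table


def codon_log_odds(seq):
    """Sum of log2(coding frequency / background frequency) over codons in frame 0."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        score += math.log2(CODING_FREQ.get(codon, FLOOR) / BACKGROUND)
    return score


biased = "GCCCTGAAGGAG"    # built only from codons favored in the toy table
unbiased = "GCACTAAAAGAA"  # synonymous codons that are absent from the toy table
print(codon_log_odds(biased), codon_log_odds(unbiased))
```

Both toy sequences encode the same peptide (Ala-Leu-Lys-Glu); only the codon choice differs, which is exactly what the score picks up.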

    ab initio gene discovery

    Protein-coding genes have recognizable features

    We can design software to scan the genome and identify these features

    Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes

    It's a lot harder for the higher eukaryotes, where there are a lot of long introns, genes can be found within introns of other genes, etc.

    We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5' exons and the like

    ab initio gene discovery: validating predictions and refining gene models

    Standard types of evidence for validation of predictions include:

    match to previously annotated cDNA

    match to EST from same organism

    similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank

    (translation works better; why?)

    protein structure prediction

    match to a Pfam domain

    associated with recognized promoter sequences, e.g., TATA box, CpG island

    known phenotype from mutation of the locus
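One way to see why comparing conceptual translations tends to work better than comparing nucleotides: synonymous codon changes alter the DNA but leave the protein unchanged. The sketch below illustrates this with two made-up sequences; the helper names `translate` and `identity` are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Conceptually translate frame 0 of a DNA string ('*' marks a stop codon)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented sequences that differ only at third codon positions.
seq_a = "ATGGCCCTGAAGGAGTAA"
seq_b = "ATGGCACTAAAAGAATAA"

print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:   ", identity(translate(seq_a), translate(seq_b)))
```

Here four of the eighteen nucleotides differ, yet both sequences translate to the same peptide, so a protein-level comparison still reports a perfect match.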

    Finding nonprotein-coding genes

    e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs

    Harder to find than protein-coding genes

    Why?

    often not poly-A tailed, so they don't end up in cDNA libraries

    no ORF

    constraint on sequence divergence is at the nucleotide, not the protein, level, so homology is harder to detect

    So, how do we find these?

    Finding nonprotein-coding genes

    secondary structure

    homology, especially alignment of related species

    experimentally

    isolation through non-polyA-dependent cloning methods

    microarrays
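Secondary structure can be scored computationally. As a minimal sketch, the code below uses Nussinov-style base-pair maximization, a classic simplified approach that the slides do not name; it only counts the maximum number of nested base pairs and ignores folding energies. The function name `max_base_pairs` and the default minimum loop length are assumptions for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Nussinov dynamic programming: maximum number of nested base pairs.

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j stays unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, leaving >= min_loop bases between.
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))  # a small hairpin-like toy sequence
```

A region whose sequence supports many more pairs than shuffled sequence of the same composition is a candidate structured RNA, although in practice this signal is usually combined with cross-species alignment.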

    ab initio gene discovery: approaches

    Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to learn how to find a pattern.

    Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).

    ab initio gene discovery: HMMs

    An example state diagram for an HMM for gene discovery is this simplified version of one used by Genscan:

    [Figure: state diagram with the states begin gene region, 5' UTR, start translation, initial exon, single exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3' UTR, and end gene region; emitted symbols are A, T, G, C. Adapted from Gibson and Muse, A Primer of Genome Science]

    Each arrow has an associated transition probability, and each box (state) has emission probabilities for the nucleotides it emits (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.
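To make transition and emission probabilities concrete, here is a minimal sketch of Viterbi decoding on a two-state toy HMM with states `noncoding` and `coding`. All probabilities are invented for illustration and are not taken from any real gene finder, whose models have many more states (exons, introns, splice sites, UTRs) and parameters learned from known gene models.

```python
import math

STATES = ("noncoding", "coding")
# Invented parameters (assumptions for illustration only).
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},  # AT-rich
    "coding": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATGCGCGCGCGCATATAT")
print("".join("C" if s == "coding" else "n" for s in path))
```

With these toy parameters the GC-rich stretch in the middle is labeled as coding and the AT-rich flanks as noncoding, because the emission gain over ten bases outweighs the cost of two state switches.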

    Genome Annotation: much work remains

    Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been identified.

    The value of genome sequences lies in their annotation

    Annotation: characterizing genomic features using computational and experimental methods

    Genes: Four levels of annotation

    Gene Prediction: Where are genes?

    What do they look like?

    Domains: What do the proteins do?

    Role: What pathway(s) are they involved in?

    How many genes?

    Consortium: 35,000 genes?

    Celera: 30,000 genes?

    Affymetrix: 60,000 human genes on GeneChips?

    Incyte and HGS: over 120,000 genes?

    GenBank: 49,000 unique gene coding sequences?

    UniGene: > 89,000 clusters of unique

    Current consensus (in flux)

    20,000-25,000 protein-coding genes

    19,599 known genes, another 2,188 DNA segments predicted to be protein-coding genes.

    How do we get from here

    What are genes? - 1

    Complete DNA segments responsible for making functional products

    Products

    Proteins

    Functional RNA molecules

    RNAi (interfering RNA)

    rRNA (ribosomal RNA)

    snRNA (small nuclear)

    snoRNA (small nucleolar)

    tRNA (transfer RNA)

    What are genes? - 2

    Definition vs. dynamic concept

    Consider

    Prokaryotic vs. eukaryotic gene models

    Introns/exons

    Posttranscriptional modifications

    Alternative splicing

    Differential expression

    Genes-in-genes

    Genes-ad-genes

    Prokaryotic gene model: ORF-genes

    Small genomes, high gene density

    Haemophilus influenzae genome 85% genic

    Operons

    One transcript, many genes

    No introns.

    One gene, one protein

    Open reading frames

    One ORF per gene

    ORFs begin with start, end with stop codon (def.)

    TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl

    NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
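The ORF definition above (begin with a start codon, end with a stop codon) translates directly into a scan over the sequence. The following is a minimal sketch that reports ORFs on both strands in all six reading frames; the function names and the `min_codons` threshold are assumptions for illustration, and it only recognizes ATG as a start, whereas real prokaryotic gene finders also consider alternative start codons and other signals.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Yield (strand, start, end, sequence) for each ORF found.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand.
    An ORF runs from the first ATG after the previous stop to the next
    in-frame stop codon, inclusive; min_codons counts the stop codon too.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, start, i + 3, seq[start:i + 3]
                    start = None


for orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(orf)
```

In a compact prokaryotic genome a long ORF is already strong evidence for a gene, which is why the length threshold matters: short ORFs occur frequently by chance.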

    Eukaryotic gene model: spliced genes

    Posttranscriptional modification

    5'-CAP, polyA tail, splicing

    Open reading frames

    Mature mRNA contains ORF

    All internal exons contain open read-through

    Pre-start and post-stop sequences are UTRs
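A minimal sketch of the spliced gene model: exons are cut out of the pre-mRNA and joined, and the mature mRNA is then split into 5' UTR, ORF, and 3' UTR. The toy sequence, the exon coordinates, and the function names `splice` and `split_utrs` are invented for illustration; capping and polyadenylation are not modeled.

```python
STOPS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) into a mature mRNA."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def split_utrs(mrna):
    """Return (5' UTR, ORF, 3' UTR) using the first ATG and its first in-frame stop.

    Returns None if no complete ORF is found.
    """
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOPS:
            return mrna[:start], mrna[start:i + 3], mrna[i + 3:]
    return None


# Toy pre-mRNA: exon 1, one intron, exon 2 (all sequences are made up).
exon1 = "GGCATGAAAG"         # 5' UTR "GGC", start codon, start of coding sequence
intron = "GTAAGTCCCTTTTCAG"  # begins with GT and ends with AG
exon2 = "ATTTCTGACCC"        # rest of coding sequence, stop codon, 3' UTR
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mature = splice(pre_mrna, exons)
print(split_utrs(mature))
```

Note that the codon GAT is split across the exon boundary in this toy example, so the ORF is only readable after splicing.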

    Expansions and Clarifications

    ORFs

    Start, triplets, stop

    Prokaryotes: gene = ORF

    Eukaryotes: spliced genes or ORF genes

    Exons

    Remain after introns have been removed

    Flanking parts contain non-coding sequence (5'- and 3'-UTRs)
