genome annotation

Post on 14-Feb-2016

DESCRIPTION

Genome Annotation. Daniel Lawson, VectorBase @ EBI. Genome annotation - building a pipeline. Genome sequence. Map repeats. Map ESTs. Map Peptides. Genefinding. nc-RNAs. Protein-coding genes. Functional annotation. Release. Repeat features. Genomes contain repetitive sequences.

TRANSCRIPT

  • Genome Annotation
    Daniel Lawson, VectorBase @ EBI

    BDV Manaus 2007

  • Repeats

  • Genome annotation - building a pipeline
    [Flowchart boxes, in transcript order: Genome sequence; Map repeats; Genefinding; Protein-coding genes; Map ESTs; Map Peptides; nc-RNAs; Functional annotation; Release]

  • Repeat features
    Genomes contain repetitive sequences

    Genome               Size (Mb)   % Repeat
    Aedes aegypti        1,300       ~70
    Anopheles gambiae    260         ~30
    Culex pipiens        540         ~50
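
To make the scale of the repeat content concrete, the table can be turned into approximate megabases of repetitive sequence. This is a minimal sketch: the figures are copied from the slide, the percentages are approximate, and the name `repeat_mb` is only illustrative.

```python
# Genome size (Mb) and approximate repeat percentage, as given on the slide.
GENOMES = {
    "Aedes aegypti": (1300, 70),
    "Anopheles gambiae": (260, 30),
    "Culex pipiens": (540, 50),
}


def repeat_mb(size_mb, pct_repeat):
    """Approximate amount of repetitive sequence, in Mb."""
    return size_mb * pct_repeat / 100


for name, (size_mb, pct) in GENOMES.items():
    print(f"{name}: ~{repeat_mb(size_mb, pct):.0f} Mb repetitive of {size_mb} Mb")
```

On these figures, the repetitive fraction of Aedes aegypti alone is several times larger than the whole Anopheles gambiae genome.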

  • Repeat features: Tandem repeats
    Pattern of two or more nucleotides repeated, where the repetitions are directly adjacent to each other.
    Polymorphic between individuals/populations.
    Example programs: Tandem, TRF
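
The definition above can be illustrated with a short regular-expression scan. This is only a sketch for perfect (exact-copy) repeats; it is not how Tandem or TRF work, and the function name and default thresholds are assumptions chosen for the example.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. Only exact copies are found;
    dedicated tools also tolerate mismatches and indels.
    """
    # A lazily matched unit of min_unit..max_unit bases, followed by at
    # least (min_copies - 1) further adjacent copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1),
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


print(find_tandem_repeats("ggat" + "ca" * 5 + "ttg"))
```

The lazy quantifier makes the scan report the shortest repeating unit, so a run such as `atatatat` is reported with unit `at` rather than `atat`.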

  • Repeat features: Interspersed elements
    Transposable elements (TEs): transposons, retrotransposons, etc.
    Entire research field in itself.
    Example programs: RepeatScout, RECON

  • Finding repeats as a preliminary to gene prediction
    Repeat discovery:
      Literature and public databanks
      Automated approaches (e.g. RepeatScout or RECON)
    Generate a library of example repeat sequences (FASTA file with a defined header line format).
    Use RepeatMasker to search the genome and mask the sequence.

  • Masked sequence
    Repeatmasked sequence is an artificial construction where those regions which are thought to be repetitive are marked with Xs.
    Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs in the final annotation set.

    >my sequence
    atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct
    >my sequence (repeatmasked)
    atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
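
Hard masking can be sketched as overwriting every base inside a set of repeat intervals. The function below is a hypothetical helper, not part of RepeatMasker; it assumes the intervals are 0-based, end-exclusive coordinates already produced by a repeat search.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace every base inside the given intervals with mask_char.

    intervals: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumption of this sketch; tools differ in coordinate conventions).
    """
    chars = list(seq)
    for start, end in intervals:
        # Clip to the sequence so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


print(hard_mask("atgagcttcg", [(3, 6)]))
```

Because the masked bases are overwritten, the original sequence cannot be recovered from the hard-masked copy; keep the unmasked file alongside it.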

  • Masked sequence - Hard or Soft?
    Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved using a format known as soft-masked, in which the repetitive regions are written in lower case (as in the example) rather than replaced with Xs.

    >my sequence
    ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
    >my sequence (softmasked)
    ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
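
Since soft masking keeps the underlying bases, a soft-masked sequence can always be converted to a hard-masked one, but not the reverse. A minimal sketch, assuming lower case marks repeats as in the example above (function names are illustrative):

```python
def soft_to_hard(seq, mask_char="X"):
    """Convert a soft-masked sequence (repeats in lower case) to hard-masked."""
    return "".join(mask_char if base.islower() else base for base in seq)


def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked (lower case)."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


example = "ACGTacgtAC"
print(soft_to_hard(example), masked_fraction(example))
```

This is why a soft-masked genome is often the more useful file to keep: a downstream tool can choose to ignore, down-weight, or hard-mask the lower-case regions as needed.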

  • Pairwise alignments

  • Genome annotation - building a pipeline
    [Flowchart boxes, in transcript order: Genome sequence; Map repeats; Genefinding; Protein-coding genes; Map ESTs; Map Peptides; nc-RNAs; Functional annotation; Release]

  • Genefinding

  • More terminology
    Gene prediction: predicted exon structure for the primary transcript of a gene.
    CDS: coding sequence for a protein-coding gene prediction (not necessarily continuous in a genomic context).
    ORF: open reading frame, a sequence devoid of stop codons.
    Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage.
    ab initio prediction: prediction of gene structure from first principles using only the genome sequence.
    Hidden Markov Model (HMM): statistical model (dynamic Bayesian network) which can be used as a sensitive, statistically robust search algorithm. Use of profile HMMs to search sequence data is widespread.
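
The ORF definition used here (a stretch devoid of stop codons) can be made concrete with a small scan of the three forward reading frames. This is a sketch under simplifying assumptions: standard stop codons only, forward strand only, no requirement for a start codon, and an arbitrary minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_orfs(seq, min_codons=3):
    """Return (frame, start, end) for maximal stop-free codon runs.

    Forward strand only; coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current run.
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # Close the run that reaches the end of the sequence.
        if (pos - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, pos))
    return orfs


print(stop_free_orfs("ATGAAACCCTAAGGG"))
```

Note that an ORF in this sense is not yet a CDS: a gene prediction additionally needs a start codon, and in eukaryotes the coding sequence may be split across several exons.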

  • Eukaryote genome annotation
    [Diagram: genome locus (ATG ... STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity]
    Annotation tasks: find locus; find exons using transcripts; find exons using peptides; find function.

  • Prokaryote genome annotation
    [Diagram: genome loci (START ... STOP) -> transcription -> primary transcript -> RNA processing -> processed RNA -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity]
    Annotation tasks: find locus; find CDS; find function.

  • Genefinding
    ab initio
    similarity

  • Genefinding resources
    Transcript: cDNA sequences; EST sequences; other (MPSS, SAGE, ditags)
    Peptide: non-redundant (nr) protein database; protein sequence data; mass spectrometry data
    Genome: other genomic sequence

  • ab initio prediction
    [Diagram: genome locus (ATG ... STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity]

  • Genefinding - ab initio predictions
    Use compositional features of the DNA sequence to define coding segments (essentially exons):
      ORFs
      Coding bias
      Splice site consensus sequences
      Start and stop codons
    Each feature is assigned a log likelihood score.
    Use dynamic programming to find the highest scoring path.
    Need to be trained using a known set of coding sequences.
    Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
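
The combination of log likelihood scores and dynamic programming can be sketched with the Viterbi algorithm on a toy two-state model (coding "C" versus non-coding "N"). All probabilities below are invented for illustration, not trained on real data, and this is not the model used by any of the programs listed; real gene finders use far richer states for exons, introns and splice sites.

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # score[i][s]: best log score of any path ending in state s at position i.
    score = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, score[i - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            score[i][s] = best + log_emit[s][seq[i]]
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions): coding regions are GC-rich, states are sticky.
STATES = ("C", "N")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "C": {"C": math.log(0.9), "N": math.log(0.1)},
    "N": {"C": math.log(0.1), "N": math.log(0.9)},
}
LOG_EMIT = {
    "C": {b: math.log(p) for b, p in zip("ACGT", (0.1, 0.4, 0.4, 0.1))},
    "N": {b: math.log(p) for b, p in zip("ACGT", (0.4, 0.1, 0.1, 0.4))},
}

print("".join(viterbi("ATATGCGCGCATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)))
```

Training, in this picture, means estimating the emission and transition probabilities from a set of known coding sequences for the species in question.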

  • ab initio prediction
    [Diagram: genome with feature tracks: coding potential; ATG & stop codons; splice sites]

  • ab initio prediction
    Find best prediction
    [Diagram: genome with feature tracks: coding potential; ATG & stop codons; splice sites]

  • Similarity prediction
    [Diagram: genome locus (ATG ... STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity]

  • Similarity prediction
    [Diagram: genome locus (ATG ... STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity]
    Find exons using transcripts; find exons using peptides.

  • Genefinding - similarity
    Use known coding sequence to define coding regions:
      EST sequences
      Peptide sequences
    Needs to handle fuzzy alignment regions around splice sites.
    Needs to attempt to find start and stop codons.
    Examples: EST2Genome, exonerate, genewise
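
One way to picture the "fuzzy alignment around splice sites" problem is to take an approximate intron from an alignment and search a small window for canonical GT...AG boundaries. The helper below is hypothetical and much simpler than what EST2Genome, exonerate or genewise do: it ignores non-canonical splice sites and does not check that the adjusted exons still match the transcript.

```python
def snap_intron(genome, intron_start, intron_end, window=3):
    """Find the nearest canonical GT..AG intron to an approximate one.

    Coordinates are 0-based and end-exclusive. Each boundary may move by up
    to `window` bases; the candidate with the smallest total shift wins.
    Returns (start, end), or None if no GT..AG pair is found.
    """
    genome = genome.upper()
    best = None
    for shift_start in range(-window, window + 1):
        start = intron_start + shift_start
        if start < 0 or genome[start:start + 2] != "GT":
            continue
        for shift_end in range(-window, window + 1):
            end = intron_end + shift_end
            # Donor and acceptor dinucleotides must not overlap.
            if end - 2 < start + 2 or end > len(genome):
                continue
            if genome[end - 2:end] != "AG":
                continue
            cost = abs(shift_start) + abs(shift_end)
            if best is None or cost < best[0]:
                best = (cost, start, end)
    return None if best is None else (best[1], best[2])


# Exon "ATGGCC", intron "GTAAGTTTTCAG", exon "GGCTAA"; the true intron is (6, 18).
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGCTAA"
print(snap_intron(genome, 7, 17))
```

In the example the alignment is off by one base at each end, and the search recovers the boundaries that give a GT donor and an AG acceptor.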

  • Similarity-based prediction
    [Diagram: align cDNA/peptide to the genome; create prediction]

  • Genefinding - comparative
    Use two or more genomic sequences to predict genes based on conservation of exon sequences.
    Examples: Twinscan and SLAM

  • Genefinding - manual
    Manual annotation is time consuming.
    Annotators use specialized utilities to view genomic regions with tiers/columns of data, from which they construct a gene prediction.
    Most decisions are subjective and tedious to document.
    Avoids the systematic problems of ab initio predictors and automated annotation pipelines.

  • Manual prediction
    [Diagram: genomic region with evidence tiers: coding potential; ATG & stop codons; splice sites; EST similarity]

  • Manual prediction
    Predict structure
    [Diagram: genomic region with evidence tiers: coding potential; ATG & stop codons; splice sites; EST similarity]

  • Genefinding - non-coding RNA genes
    Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples.
    tRNAscan - uses an HMM and covariance model for prediction of tRNA genes
    Rfam - a suite of HMMs trained against a large number of different RNA genes

  • Example system

  • Overview of current annotation system
    [Flowchart boxes, in transcript order: Assembled genome; VectorBase gene predictions; Sequencing centre gene predictions; Merge into canonical set; Protein analysis; Display on genome browser; Release to GenBank/EMBL/DDBJ]

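  • VectorBase gene prediction pipeline
    [Diagram. Prediction sources, in transcript order: Blessed predictions; Community submissions; Manual annotations; Species-specific predictions; Similarity predictions; Transcript based predictions; Ab initio gene predictions; Protein family HMMs (Genewise); ncRNA predictions (Rfam)]
    [Tool labels, in transcript order (their pairing with the sources above is not recoverable from the transcript): (Genewise); (Genewise); (SNAP); (Exonerate); (Apollo); (Genewise, Exonerate, Apollo)]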

  • VectorBase curation database pipeline for manual/community annotation
    [Diagram labels, in transcript order: Curation warehouse db; Manual annotation (Harvard); Apollo; Apollo; Community annotation (Community representatives); Chado-XML; Chado-XML; Chado; Ensembl; GFF3; Gene build db; Community annotation]

  • Genefinding - Review
    Gene prediction relies heavily on similarity data.
    EST/cDNA sequences are vital for genefinding:
      Training for ab initio approaches
      Similarity builds
      Validating predictions
    Protein data is the predominant supporting evidence for prediction in most vector genomes.
    Need to be wary of predicting from predictions.
    Genefinding is still something of a dark art.
    Efforts to standardize and document supporting evidence for prediction and modifications are ongoing.

  • Genefinding omissions
    Alternative splice forms
    Curr