biol335: how to annotate a genome

Download BIOL335: How to annotate a genome

Post on 03-Jul-2015

134 views

Category:

Science

1 download

Embed Size (px)

DESCRIPTION

Course material for: http://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=BIOL335

TRANSCRIPT

  • 1. Genome annotation Paul Gardner March 3, 2015 Paul Gardner Genome annotation

2. Medical genomics Vicky Cameron & Anna Pilbrow at Otago are identifying genetic variation and genes associated with an increased risk of heart disease. Mike Stratton at the Sanger Institute is hunting for genetic variation that is associated with an increased risk of cancer. Rob Knight at UC Boulder is sequencing the microbes that live on us. Finding associations between our health and microbial communities. See Robs TEDTalk. Paul Gardner Genome annotation 3. Agricultural genomics Graeme Attwood at AgResearch is trying to stop cows & sheep from emitting greenhouse gases by studying their gut microbes. He has sequenced two methanogenic Archaeal genomes of Methanobrevibacter sp. Honour McCann at Massey University is trying to determine how Pseudomonas syringae pv. actinidiae (PSA) is killing kiwifruit. Rebecca Ganley at SCION is investigating how Phytophthora Taxon Agathis (PTA) is causing kauri die-back disease and killing kauri trees. Paul Gardner Genome annotation 4. Academic interest genomics Tom Gilbert at the University of Copenhagen is sequencing bird and giant squid genomes. Elizabeth Murchison is sequencing tasmanian devils (and their transmissible cancers). See Lizs TEDTalk. Neil Gemmel at Otago University is sequencing the tuatara genome. Paul Gardner Genome annotation 5. Annotate me! TTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAA CACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAG TGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCA CCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTT TGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTC ACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGG CAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACT ACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCG ATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCC AGTTCCAGATCCCTTGCCTGATTAAAAATACCGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGG GCATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGGCGGCGCGCGTCTTTGCAGCGATGTCAC GCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAA TGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGCTGGCCATTATCTCGGTGGTAGGTGATGGTATGC GCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATATCAACATTGTCGCCATTGCTCAGGGATCTTCTGAACGCTCAATCT CTGTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCG TCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACATATCGACTTACGTGTCTGCGGTGTTGCCAACT CGAAGGCTCTGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTC GCCTCGTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACTTCCAGCCAGGCAGTGGCGGATCAATATGCCGACTTCCTGCGCGAAGGTT TCCACGTTGTCACGCCGAACAAAAAGGCCAACACCTCGTCGATGGATTACTACCATCAGTTGCGTTATGCGGCGGAAAAATCGCGGCGTAAATTCCTCT ATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCTG GTTCGCTTTCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACCACGCTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGC GAGATGATCTTTCTGGTATGGATGTGGCGCGTAAACTATTGATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAAATTGAACCTG TGCTGCCCGCAGAGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATGGCGAATCTGTCACAACTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCC GTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCCGAAGTGGATGGTAATGATCCGCTGTTCA AAGTGAAAAATGGCGAAAACGCCCTGGCCTTCTATAGCCACTATTATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCAATGACGTTACAG CTGCCGGTGTCTTTGCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGACATGGTTAAAGTTTATGCCCCCATGGTTAAAGTTTATGCCCCG GCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCG GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCACGGGAAAATATCGTTTATCA Paul Gardner Genome annotation 6. Discussion How should these researchers annotate their genomes (after they have sequenced and assembled them)? What are the fast and cheap methods? What are the most accurate methods? Paul Gardner Genome annotation 7. The data tsunami Thanks to new sequencing technologies (recall Ants teeny-tiny little sequencer). Biologists no longer spend years acquiring data. The bottle-neck for research is now in the analysis phase of research. Biologists with good mathematics skills and mathematicians with an interest in biology are in high demand. Gather data Analyze-Classify Hypotheses- Predictions Experiment GCGAGCAGACGCA CCGAACAGACACA GUGAGCAGGCGCC CCGAGCAGUCAUA ACACUGAGACGCA GCGAGCGU-AACG R A A A A R C Y Y R R G Y U U U U U U U5' 0.0 1.0 2.0 A C GU CC A GA5 A GA U CAGG U A10 CA GU CU G A Paul Gardner Genome annotation 8. We can use sequence analysis... Genes leave a statistical signal in the genome... Example: identify promotors, ribosome binding sites, open-reading frames (ORFs), terminators In eukaryotes CpG islands, splicing signals and poly-A tails may be incorporated How reliable are these approaches? What are the main weaknesses & strengths? Figure from: http://zerocool.is-a-geek.net/?p=630 Paul Gardner Genome annotation 9. Sequence analysis: strengths and weaknesses ORF prediction: Prodigal, GLIMMER Strengths: very fast cheap Weaknesses: false positives (see AntiFam) misses short peptides (e.g. toxins-antitoxin systems) No ncRNAs, pseudogenes, recoding elements, ... Paul Gardner Genome annotation 10. Annotate me! TTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAA CACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAG TGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCA CCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTT TGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTC ACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGG CAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACT ACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCG ATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCC AGTTCCAGATCCCTTGCCTGATTAAAAATACCGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGG GCATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGGCGGCGCGCGTCTTTGCAGCGATGTCAC GCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAA TGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGCTGGCCATTATCTCGGTGGTAGGTGATGGTATGC GCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATATCAACATTGTCGCCATTGCTCAGGGATCTTCTGAACGCTCAATCT CTGTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCG TCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACATATCGACTTACGTGTCTGCGGTGTTGCCAACT CGAAGGCTCTGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTC GCCTCGTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACTTCCAGCCAGGCAGTGGCGGATCAATATGCCGACTTCCTGCGCGAAGGTT TCCACGTTGTCACGCCGAACAAAAAGGCCAACACCTCGTCGATGGATTACTACCATCAGTTGCGTTATGCGGCGGAAAAATCGCGGCGTAAATTCCTCT ATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCTG GTTCGCTTTCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACCACGCTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGC GAGATGATCTTTCTGGTATGGATGTGGCGCGTAAACTATTGATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAAATTGAACCTG TGCTGCCCGCAGAGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATGGCGAATCTGTCACAACTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCC GTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCCGAAGTGGATGGTAATGATCCGCTGTTCA AAGTGAAAAATGGCGAAAACGCCCTGGCCTTCTATAGCCACTATTATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCAATGACGTTACAG CTGCCGGTGTCTTTGCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGACATGGTTAAAGTTTATGCCCCCATGGTTAAAGTTTATGCCCCG GCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCG GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCACGGGAAAATATCGTTTATCA Paul Gardner Genome annotation 11. We can use homology... Evolution tends to preserve functional genomic regions... Example 1: Use an existing set of genes from related species and map these onto your genome (e.g. RATT) Example 2: Align two or more related genomes, look for conserved regions, patterns of variation can be indicative of function (e.g. QRNA, RNAz & RNAcode) How reliable are these approaches? What are the main weaknesses & strengths? Paul Gardner Genome annotation 12. The QRNA approach... Rivas et al. (2001) Computational identication of noncoding RNAs in E. coli by comparative genomics. Current Biology. Paul Gardner Genome annotation 13. DNA encodes Protein # STOCKHOLM 1.0 #33 unique RNA sequences, 1 peptide sequence #=GR PR1 G..A..D..V..T..H..P..P..A..G..D.. #=GR PR3 GlyAlaAspValThrHisProProAlaGlyAsp platypus GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT opossum GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT sloth GGAGCAGACGTCACACACCCTCCCGCGGGGGAT armadillo GGAGCAGACGTCACGCACCCTCCGGCAGGGGAT tenrec GGGGCCGACGTCACGCACCCCCCTGCGGGCGAT elephant GGAGCGGATGTCACACACCCGCCTGCGGGGGAT shrew GGCGCAGATGTCACGCATCCTCCAGCAGGGGAC hedgehog GGAGCAGATGTCACACACCCCCCAGCAGGAGAT megabat GGAGCAGATGTCACACACCCTCCTGCAGGAG