biol335: how to annotate a genome
DESCRIPTION
Course material for: http://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=BIOL335
TRANSCRIPT
Genome annotation
Paul Gardner
March 3, 2015
Paul Gardner Genome annotation
Medical genomics
- Vicky Cameron & Anna Pilbrow at Otago are identifying genetic variation and genes associated with an increased risk of heart disease.
- Mike Stratton at the Sanger Institute is hunting for genetic variation that is associated with an increased risk of cancer.
- Rob Knight at UC Boulder is sequencing the microbes that live on us, finding associations between our health and microbial communities. See Rob’s TED talk.
Agricultural genomics
- Graeme Attwood at AgResearch is trying to stop cows & sheep from emitting greenhouse gases by studying their gut microbes. He has sequenced two methanogenic archaeal genomes of Methanobrevibacter sp.
- Honour McCann at Massey University is trying to determine how Pseudomonas syringae pv. actinidiae (PSA) is killing kiwifruit.
- Rebecca Ganley at SCION is investigating how Phytophthora Taxon Agathis (PTA) is causing kauri die-back disease and killing kauri trees.
Academic interest genomics
- Tom Gilbert at the University of Copenhagen is sequencing bird and giant squid genomes.
- Elizabeth Murchison is sequencing Tasmanian devils (and their transmissible cancers). See Liz’s TED talk.
- Neil Gemmell at Otago University is sequencing the tuatara genome.
Annotate me!
TTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAA
CACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAG
TGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCA
CCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG
GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTT
TGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTC
ACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGG
CAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACT
ACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCG
ATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCC
AGTTCCAGATCCCTTGCCTGATTAAAAATACCGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGG
GCATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGGCGGCGCGCGTCTTTGCAGCGATGTCAC
GCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAA
TGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGCTGGCCATTATCTCGGTGGTAGGTGATGGTATGC
GCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATATCAACATTGTCGCCATTGCTCAGGGATCTTCTGAACGCTCAATCT
CTGTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCG
TCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACATATCGACTTACGTGTCTGCGGTGTTGCCAACT
CGAAGGCTCTGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTC
GCCTCGTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACTTCCAGCCAGGCAGTGGCGGATCAATATGCCGACTTCCTGCGCGAAGGTT
TCCACGTTGTCACGCCGAACAAAAAGGCCAACACCTCGTCGATGGATTACTACCATCAGTTGCGTTATGCGGCGGAAAAATCGCGGCGTAAATTCCTCT
ATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCTG
GTTCGCTTTCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACCACGCTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGC
GAGATGATCTTTCTGGTATGGATGTGGCGCGTAAACTATTGATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAAATTGAACCTG
TGCTGCCCGCAGAGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATGGCGAATCTGTCACAACTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCC
GTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCCGAAGTGGATGGTAATGATCCGCTGTTCA
AAGTGAAAAATGGCGAAAACGCCCTGGCCTTCTATAGCCACTATTATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCAATGACGTTACAG
CTGCCGGTGTCTTTGCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGACATGGTTAAAGTTTATGCCCCCATGGTTAAAGTTTATGCCCCG
GCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCG
GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCACGGGAAAATATCGTTTATCA
Discussion
- How should these researchers annotate their genomes (after they have sequenced and assembled them)?
- What are the fast and cheap methods?
- What are the most accurate methods?
The data tsunami
- Thanks to new sequencing technologies (recall Ant’s teeny-tiny little sequencer).
- Biologists no longer spend years acquiring data.
- The bottleneck for research is now the analysis phase.
- Biologists with good mathematics skills and mathematicians with an interest in biology are in high demand.
Figure: the research cycle (Gather data, Analyze-Classify, Hypotheses-Predictions, Experiment), shown alongside example RNA sequence, consensus motif and sequence-logo graphics.
We can use sequence analysis...
- Genes leave a statistical signal in the genome...
- Example: identify promoters, ribosome binding sites, open reading frames (ORFs), terminators
- In eukaryotes CpG islands, splicing signals and poly-A tails may be incorporated
- How reliable are these approaches? What are the main weaknesses & strengths?
Figure from: http://zerocool.is-a-geek.net/?p=630
Sequence analysis: strengths and weaknesses
- ORF prediction: Prodigal, GLIMMER
  - Strengths:
    - very fast
    - cheap
  - Weaknesses:
    - false positives (see AntiFam)
    - misses short peptides (e.g. toxin-antitoxin systems)
    - no ncRNAs, pseudogenes, recoding elements, ...
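To make the idea of ORF prediction concrete, here is a minimal sketch of a naive six-frame ORF scan. It is an illustration only and is not how Prodigal or GLIMMER work: those tools use statistical models of coding sequence, while this sketch just looks for ATG...stop runs. The function names and the `min_codons` threshold are hypothetical choices, and alternative start codons are ignored.

```python
# Naive ORF scan (illustrative assumption, not the Prodigal/GLIMMER method).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end) for each ATG...stop run.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being scanned. The length in codons includes the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None  # look for the next ORF in this frame


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

Lowering `min_codons` finds more short peptides but also more false positives, which is the trade-off listed in the weaknesses above.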
Annotate me!
(The same sequence as on the first "Annotate me!" slide.)
We can use homology...
- Evolution tends to preserve functional genomic regions...
- Example 1: Use an existing set of genes from related species and map these onto your genome (e.g. RATT)
- Example 2: Align two or more related genomes and look for conserved regions; patterns of variation can be indicative of function (e.g. QRNA, RNAz & RNAcode)
- How reliable are these approaches? What are the main weaknesses & strengths?
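As a rough illustration of "look for conserved regions", the sketch below slides a window along a pairwise alignment and reports windows whose fraction of identical columns passes a threshold. This is a deliberately simple stand-in and not the method used by QRNA, RNAz or RNAcode; the function name, window size and identity cutoff are assumptions made for the example.

```python
def conserved_windows(seq_a, seq_b, window=20, min_identity=0.9):
    """Return (start, identity) for each window at or above min_identity.

    seq_a and seq_b are two rows of an existing alignment (equal length,
    '-' for gaps). Gap columns never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(x == y and x != "-" for x, y in zip(a, b))
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


if __name__ == "__main__":
    a = "ACGTACGTAC" + "TTTTTTTTTT"
    b = "ACGTACGTAC" + "GGGGGGGGGG"
    print(conserved_windows(a, b, window=10, min_identity=1.0))
```

Identity alone says a region is conserved, not why; the tools named above go further by asking which kind of function the pattern of substitutions supports.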
The QRNA approach...
Rivas et al. (2001) Computational identification of noncoding RNAs in E. coli by comparative genomics. Current Biology.
DNA encodes Protein
# STOCKHOLM 1.0
#33 unique RNA sequences, 1 peptide sequence
#=GR PR1      G..A..D..V..T..H..P..P..A..G..D..
#=GR PR3      GlyAlaAspValThrHisProProAlaGlyAsp
platypus      GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT
opossum       GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT
sloth         GGAGCAGACGTCACACACCCTCCCGCGGGGGAT
armadillo     GGAGCAGACGTCACGCACCCTCCGGCAGGGGAT
tenrec        GGGGCCGACGTCACGCACCCCCCTGCGGGCGAT
elephant      GGAGCGGATGTCACACACCCGCCTGCGGGGGAT
shrew         GGCGCAGATGTCACGCATCCTCCAGCAGGGGAC
hedgehog      GGAGCAGATGTCACACACCCCCCAGCAGGAGAT
megabat       GGAGCAGATGTCACACACCCTCCTGCAGGAGAT
microbat      GGAGCAGATGTCACCCACCCCCCTGCAGGGGAC
dog           GGAGCGGATGTCACACACCCCCCAGCCGGGGAC
cat           GGAGCCGATGTCACGCACCCCCCAGCAGGGGAT
horse         GGAGCGGATGTCACACACCCTCCGGCAGGGGAT
pika          GGAGCAGATGTCACTCACCCTCCAGCTGGGGAT
rabbit        GGTGCAGATGTCACACACCCCCCAGCTGGAGAT
squirrel      GGAGCAGATGTCACTCACCCTCCAGCGGGAGAT
guinea_pig    GGAGCAGATGTCACACACCCACCAGCGGGAGAT
mouse         GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC
rat           GGAGCAGATGTCACTCATCCACCTGCTGGGGAT
kangaroo_rat  GGAGCAGATGTTACACACCCTCCAGCAGGGGAT
tree_shrew    GGCGCAGACGTCACGCACCCCCCGGCCGGGGAT
human         GGAGCGGATGTCACACACCCCCCAGCAGGGGAT
tarsier       GGTGCTGATGTCACACACCCCCCTGCAGGGGAT
marmoset      GGAGCAGATGTCACACACCCACCAGCAGGGGAT
zebrafinch    GGAGCAGATGTCACTCACCCTCCCGCCGGGGAT
green_anole   GGGGCAGACGTCACTCACCCGCCAGCCGGGGAC
xenopus       GGAGCAGATGTTACACACCCACCTGCTGGTGAT
pufferfish    GGTGCGGATGTTACTCATCCTCCTGCTGGTGAT
fugu          GGGGCTGATGTTACTCACCCTCCAGCTGGTGAT
stickleback   GGTGCAGACGTCACACATCCTCCAGCGGGTGAT
medaka        GGTGCCGATGTCACTCATCCTCCTGCCGGGGAC
zebrafish     GGGGCAGATGTTACACACCCGCCGGCTGGTGAT
lamprey       GGTGCCGATGTGACACACCCTCCAGCGGGAGAC
//
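The alignment above can be checked with a few lines of code. The sketch below translates four of the aligned sequences with the standard genetic code and counts which codon positions vary. The helper names are hypothetical; the sequences are copied from the alignment, and only four species are used to keep the example short.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def variable_codon_positions(seqs):
    """Count variable alignment columns at codon positions 1, 2 and 3."""
    counts = {1: 0, 2: 0, 3: 0}
    for col, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:
            counts[col % 3 + 1] += 1
    return counts


# Four rows copied from the alignment above.
alignment = {
    "platypus": "GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT",
    "human": "GGAGCGGATGTCACACACCCCCCAGCAGGGGAT",
    "zebrafish": "GGGGCAGATGTTACACACCCGCCGGCTGGTGAT",
    "lamprey": "GGTGCCGATGTGACACACCCTCCAGCGGGAGAC",
}

if __name__ == "__main__":
    for name, dna in alignment.items():
        print(name, translate(dna))
    print(variable_codon_positions(alignment.values()))
```

All four sequences translate to the same peptide, GADVTHPPAGD, and every variable column falls on a third codon position. That pattern of synonymous change is the kind of signal that suggests a region encodes protein.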
Figure: the genetic code wheel (1st, 2nd and 3rd codon positions), showing each amino acid with its three-letter and one-letter abbreviations, molecular weight, chemical structure, class (basic, acidic, polar, nonpolar/hydrophobic) and known modifications (sumo, methyl, phospho, ubiquitin, N-methyl, O-glycosyl, N-glycosyl).
Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg
DNA encodes RNA
Figure: tRNA shown as (A) primary structure, (B) secondary structure and (C) tertiary structure, with the acceptor stem, D loop, anticodon loop, variable loop and TΨC loop labelled.
Primary structure: 5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’
Homology-based annotation: strengths and weaknesses
- Example 1: map known genes onto genomes
  - Strengths: fast, cheap, ...
  - Weaknesses:
    - Inaccurate for divergent species (e.g. Graeme’s Methanobrevibacter or GEBA genomes)
    - Requires manual correction of borderline results
    - Errors are propagated throughout the databases
- Example 2: aligning genomes
  - Strengths:
    - “cheap” if genomes already exist
    - fast for small genomes
    - evolutionary support for all discoveries
  - Weaknesses:
    - Requires lots of powerful computers for large genomes
    - Inaccurate for divergent species (e.g. Neil’s tuatara or Graeme’s Methanobrevibacter)
    - Requires manual correction of borderline results
Homology annotation: nucleotides are difficult to align
Figure: Conservation of Xfam families in bacterial genomes. Conserved families (%) plotted against phylogenetic distance (0.0 to 0.6) for Pfam (N=6671) and Rfam (N=331), with a second panel showing the frequency of RNA-seq species.
Lindgreen et al. (2014) Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLOS Computational Biology.
We can use RNA detection methods...
- Remember the central dogma of molecular biology
- Example: sequence RNAs from multiple tissues, developmental stages and environmental conditions
- How reliable is this approach? What are the main weaknesses & strengths?
Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
RNA-seq: strengths and weaknesses
- RNA-seq
  - Strengths:
    - Experimental support for transcribed regions
    - Identifies untranslated regions (UTRs), ncRNAs, antisense RNAs, ...
    - Identifies alternatively spliced and edited RNAs
  - Weaknesses:
    - Expensive & lots of work
    - RNA degradation and genomic contamination
    - Transcription does not prove translation
    - Will miss genes transcribed only in specific developmental stages, tissues & environmental conditions (e.g. the lsy-6 microRNA)
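One way to picture "experimental support for transcribed regions" is to call intervals where read depth stays above a threshold. The sketch below assumes reads have already been mapped and summarised as one depth value per genome position; the function name and the `min_depth` and `min_length` cutoffs are hypothetical, and real pipelines are considerably more careful.

```python
def transcribed_regions(depth, min_depth=5, min_length=50):
    """Return (start, end) intervals where depth >= min_depth.

    depth is a sequence of per-base read counts. Intervals are 0-based,
    end-exclusive, and shorter runs than min_length are discarded.
    """
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos  # a covered run begins
        elif d < min_depth and start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None  # the run has ended
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depth) - start >= min_length:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    toy_depth = [0] * 10 + [8] * 60 + [1] * 5 + [9] * 20 + [0] * 5
    print(transcribed_regions(toy_depth))
    print(transcribed_regions(toy_depth, min_length=10))
```

A gene that is silent in the sampled condition has zero depth everywhere and is never called, which is the last weakness in the list above.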
We can use protein detection methods...
- Central dogma of molecular biology
- Example: protein mass spectrometry
- How reliable is this approach? What are the main weaknesses & strengths?
Figure from: http://en.wikipedia.org/wiki/Protein_mass_spectrometry
Protein mass spectrometry: strengths and weaknesses
- Protein mass spectrometry
  - Strengths:
    - Experimental support for translated regions
    - Identifies alternative isoforms and post-translational modifications (Ezkurdia et al. 2012)
  - Weaknesses:
    - Expensive & lots of work
    - Misses genes expressed only in specific developmental stages, tissues & environmental conditions
    - Current technology generally only detects the most abundant proteins
Ezkurdia et al. (2012) Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol.
How cool is this?!
Schwanhäusser et al. (2011) Global quantification of mammalian gene expression control. Nature.
This is also kinda neat...
Lu et al. (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology.
Relevant reading
- Reviews:
  - Stein L (2001) Genome annotation: from sequence to biology. Nature Reviews Genetics.
  - Reed JL et al. (2006) Towards multidimensional genome annotation. Nature Reviews Genetics.
- ORF finding:
  - Delcher AL et al. (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics.
  - Hyatt D et al. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics.
- RNA-seq (Ant’s lectures)
  - Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
- Proteomics (Sarah’s lectures)
  - Ezkurdia et al. (2012) Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol.
Homework: How to make a sequence alignment?
- Play: http://phylo.cs.mcgill.ca
- or even better, play Ribo: http://ribo.cs.mcgill.ca/
The End