Proteogenomics: Refining and Improving Genome Annotation

Samuel H Payne, J Craig Venter Institute


Post on 23-Feb-2016



TRANSCRIPT

Page 1: Proteogenomics : Refining and Improving Genome Annotation

Proteogenomics: Refining and Improving Genome Annotation

Samuel H Payne
J Craig Venter Institute

Page 2: Proteogenomics : Refining and Improving Genome Annotation

State of Genome Annotation

• Most prokaryotic genomes are auto-annotated.
• Sequence and function are inferred with comparative genomics; validation is sparse.
• Difficulties with novel or horizontally transferred (HGT) genes
• Mature protein features: localization, PTM, cleavage

Salzberg 2007

Page 3: Proteogenomics : Refining and Improving Genome Annotation

Diversity or Confusion

Page 4: Proteogenomics : Refining and Improving Genome Annotation

Proteomics

• Input: protein sample
• Output: list of peptides

Page 5: Proteogenomics : Refining and Improving Genome Annotation

Proteogenomics

• Definition: using proteomics data to do genome annotation
• Goals:
  – Find all coding regions of the genome, annotated and unannotated
  – Submit improved annotation to NCBI
  – Identify “mature protein” features
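One common way to find coding regions "annotated and unannotated" is to search peptides against a six-frame translation of the genome rather than against the annotated proteins only. The slides do not say which search strategy was used, so the following is only a minimal sketch under that assumption; the function names and the toy sequence are hypothetical, and the standard codon table is used for simplicity.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(genome):
    """Map (strand, frame offset) to the translated protein string."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def map_peptide(genome, peptide):
    """List every (strand, frame, residue index) where the peptide occurs."""
    hits = []
    for (strand, offset), protein in six_frames(genome).items():
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, offset, pos))
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy example: the peptide AKEL is encoded in forward frame 0.
toy_genome = "ATGGCTAAAGAACTGTAA"
print(map_peptide(toy_genome, "AKEL"))
```

A hit in a frame with no annotated gene is a candidate unannotated coding region; a hit upstream of an annotated start suggests a start site correction.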

Page 6: Proteogenomics : Refining and Improving Genome Annotation

Proteogenomics Protocol

• Data sources
  – Yersinia pestis: Pieper et al., 2008, 2009
  – Bacillus anthracis: PRC/NIAID

Page 7: Proteogenomics : Refining and Improving Genome Annotation

Correcting Errors

• Unannotated genes
  – Both known and totally novel

Page 8: Proteogenomics : Refining and Improving Genome Annotation

Correcting Errors

• Unannotated genes
  – Both known and totally novel

Page 9: Proteogenomics : Refining and Improving Genome Annotation

Correcting Errors

• Start site assignment

Page 10: Proteogenomics : Refining and Improving Genome Annotation

Exceptions to Rules

• Multi-ORF genes: self-splicing, frameshift

Page 11: Proteogenomics : Refining and Improving Genome Annotation

Exceptions to Rules

• Non-canonical start codons
  – infC: ATT (Sacerdot 1982, Payne 2010) in enterobacteria
  – ATA in Shewanella (Gupta 2007)
  – Deinococcus (Baudet 2009) suggests new non-standard starts
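To illustrate how such exceptions might be flagged during annotation review, here is a small sketch that labels the first codon of a gene. It assumes ATG, GTG and TTG as the usual bacterial start codons and treats ATT and ATA as the non-canonical starts named on this slide; the function name and labels are hypothetical.

```python
# Usual bacterial start codons (an assumption of this sketch).
CANONICAL_STARTS = {"ATG", "GTG", "TTG"}
# Non-canonical starts named on the slide: ATT (infC) and ATA (Shewanella).
REPORTED_NONCANONICAL = {"ATT", "ATA"}


def classify_start(gene_dna):
    """Label the first codon of a gene sequence given 5' to 3'."""
    codon = gene_dna[:3].upper()
    if codon in CANONICAL_STARTS:
        return "canonical"
    if codon in REPORTED_NONCANONICAL:
        return "reported non-canonical"
    return "unexpected"


print(classify_start("ATTAAAGGC"))
```

A gene predictor restricted to the canonical set would be expected to miss or truncate genes in the second category, which is where peptide evidence can help.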

Page 12: Proteogenomics : Refining and Improving Genome Annotation

Overlaps/Wrong Frames

Page 13: Proteogenomics : Refining and Improving Genome Annotation

Pseudo?genes

• Expression of ABC transporter N-terminus. Missing critical motif elements.

• 5 peptides (with splicing) map to a transposable element gene. Sequence alignment to an Arabidopsis Ulp1 (Castellana 2008)

Page 14: Proteogenomics : Refining and Improving Genome Annotation

Signal Peptide

• N-terminal motif, target protein for export
• 1983 Perlman & Halvorson

Early basic residue, hydrophobic patch, AxB motif
– A = [I,V,L,A,G,S], B = [A,G,S]
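The three features above can be turned into a rough screen for candidate cleavage sites. This is only a sketch: the residue classes for A and B come from the slide, but the window sizes, the hydrophobic residue set, and all thresholds are assumptions chosen for illustration, and the example sequence is a toy.

```python
import re

# AxB motif from the slide: A = [I,V,L,A,G,S], x = any residue, B = [A,G,S].
# The lookahead lets overlapping motif positions all be reported.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")
HYDROPHOBIC = set("AVLIMFW")  # assumed residue set for the hydrophobic patch


def candidate_cleavage_sites(protein, max_basic_pos=8, patch_len=8,
                             min_hydrophobic=6, search_to=40):
    """Return indices of the first residue after each plausible AxB motif.

    All numeric defaults are illustrative assumptions, not published values.
    """
    protein = protein.upper()
    # Feature 1: a basic residue (K or R) near the N-terminus.
    if not any(aa in "KR" for aa in protein[:max_basic_pos]):
        return []
    sites = []
    for match in AXB.finditer(protein[:search_to]):
        upstream = protein[:match.start()]
        # Feature 2: a mostly hydrophobic window somewhere before the motif.
        has_patch = any(
            sum(aa in HYDROPHOBIC for aa in upstream[i:i + patch_len])
            >= min_hydrophobic
            for i in range(len(upstream) - patch_len + 1)
        )
        if has_patch:
            # Feature 3: the AxB motif itself; cleavage would follow B.
            sites.append(match.start() + 3)
    return sites


toy_protein = "MKKTAIAIAVALAGFATVAQAAPKDNTWYTG"
print(candidate_cleavage_sites(toy_protein))
```

The motif is permissive, so a screen like this typically returns several candidates per protein. An observed peptide whose N-terminus coincides with one of them is the kind of evidence that can single out the site actually used.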

Page 15: Proteogenomics : Refining and Improving Genome Annotation

Profile of an Exported Protein

Early basic residue, hydrophobic patch, motif

Page 16: Proteogenomics : Refining and Improving Genome Annotation

Future

• Rinse and repeat
• 30 proteomes in 3 years
• Stable, robust pipeline for general use

Hosted at TeraGrid

                 Novel   New Start
Y. pestis            4           5
B. anthracis         4           6
D. radiodurans     225         117
D. vulgaris         55          89
L. interrogans      20          23

Page 17: Proteogenomics : Refining and Improving Genome Annotation

When Gene Predictors Fail

• Are GC extremes difficult?
  – 50% GC (Y. pestis): 4 missed
  – 30s (B. anthracis, L. interrogans): 4, 20 missed
  – 60s (D. vulgaris, D. radiodurans): 55, 225 missed

Page 18: Proteogenomics : Refining and Improving Genome Annotation

Are They Strange?

• Relative GC – does it fail on genes with different GC from others?
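A minimal sketch of the relative GC comparison, assuming it means the difference between a gene's GC fraction and the GC fraction of the whole genome; the slides do not define the measure, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def relative_gc(gene_seq, genome_seq):
    """Gene GC minus genome GC; positive means the gene is GC-richer."""
    return gc_fraction(gene_seq) - gc_fraction(genome_seq)


print(relative_gc("GGCC", "ATGC"))
```

Comparing the distribution of this value for missed genes against that for correctly predicted genes would show whether the predictor fails mainly on genes with atypical composition.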

Page 19: Proteogenomics : Refining and Improving Genome Annotation

Are They All Short?

Page 20: Proteogenomics : Refining and Improving Genome Annotation

We See What We Know

• Proximity to Model Organism

Yersinia/Bacillus errors: 4/4

‘Remote species’ errors: 20, 55, >200

Page 21: Proteogenomics : Refining and Improving Genome Annotation

We See What We Know

• Hypothetical vs. Named
  – Compare novel genes to observed proteome
  – Hypergeometric test where the null probability is from the observed proteome

                 Hypothetical   Named   p-value
B. anthracis                3       1   0.018
L. interrogans             12       8   0.018
D. radiodurans             31       8   10^-10
D. vulgaris                39      16   10^-14
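The test described above can be sketched as an upper-tail hypergeometric probability: if novel genes were drawn at random from the observed proteome, how likely is it to see at least this many hypothetical proteins? The slides do not give the size or composition of each observed proteome, so the proteome counts below are made-up placeholders and the output should not be expected to match the p-values in the table.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for n draws without replacement from N items, K of them marked.

    Here "marked" would mean annotated as hypothetical.
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# D. vulgaris novel genes from the slide: 39 hypothetical out of 55.
# The observed proteome composition is a hypothetical placeholder.
proteome_size = 1000          # assumed, not from the source
proteome_hypothetical = 150   # assumed, not from the source
print(hypergeom_upper_tail(39, 55, proteome_hypothetical, proteome_size))
```

A small value indicates that hypothetical proteins are over-represented among the missed genes relative to the observed proteome, in line with the slide title.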

Page 22: Proteogenomics : Refining and Improving Genome Annotation

Expressed Protein Resource

• Protein sequences
  – >30 M sequences
  – nr, UniProt
  – JCVI metagenomics
  – JGI genomes
• 40,000 clusters
• Cross-referenced with proteomics, for validated proteins

Page 23: Proteogenomics : Refining and Improving Genome Annotation

Acknowledgements

• Eli Venter
• Shih-Ting Huang, Rembert Pieper
• Granger Sutton
• Dick Smith, PNNL

• NSF