Peptide-assisted annotation of the Mlp genome
Philippe Tanguay, Nicolas Feau, David Joly, Richard Hamelin
Objective
• Use peptide libraries to validate the in silico prediction of gene models
Mapping peptides onto a translated genome sequence provides the "correct frames of translation"
Assumption: "if a peptide or protein is detected, then there must be a gene that encodes it"
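As an illustration of how a detected peptide points to a reading frame, here is a minimal Python sketch that translates a DNA sequence in all six frames and reports where a peptide occurs. It assumes the standard genetic code and exact string matching; the function names and the toy sequence are hypothetical and are not taken from the pipeline used in this work.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, offset, protein) for the six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def find_peptide(dna, peptide):
    """Return every (strand, offset, position) where the peptide occurs."""
    hits = []
    for strand, offset, protein in six_frames(dna):
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, offset, pos))
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy sequence (invented): two padding bases, then ATG GCT AAA GGT TGA.
toy_dna = "CCATGGCTAAAGGTTGA"
toy_hits = find_peptide(toy_dna, "MAKG")
```

In this toy case the peptide is found only on the forward strand at offset 2, which is the frame a gene model covering that region would need to use.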
Methodology (hardware)
Urediniospores (3729)
Protein extraction
1D SDS-PAGE
Gel slicing (64)
Trypsin digestion
LC-MS/MS
Bioinformatics
Waters MassPREP station
LTQ ThermoElectron
Extraction, slicing, digestion, elution
Peptide MS/MS data acquisition
Methodology (bioinformatics)
Spectral identification by sequence database searching
Statistical validation of peptide identifications
Protein databases built from…
• 6-frame translation of the genome (Mascot, Sequest)
• Gene catalog (16694 GM) (Mascot, Sequest)
1 - Comparison of results from both db
2 - Comparison of peptides and GM (validation/correction of genome annotations)
Mlp proteomic results so far
• 691 000 MS/MS spectra obtained from the total proteins
Unique peptides (false discovery rate below 1.6%):
Gene catalog (Mascot + Sequest): 10980
6-frame translation (Mascot only): 4699
352 unique peptides obtained from the 6-frame translation db do not match any GM of the Gene catalog
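The slides do not say how the false discovery rate was estimated. One widely used approach is a target-decoy search, in which spectra are also searched against reversed or shuffled sequences and the FDR at a score threshold is approximated as decoy hits divided by target hits. The sketch below assumes that approach; the function name and the scores are invented for illustration.

```python
def target_decoy_fdr(psms, score_threshold):
    """Estimate FDR as (decoy hits) / (target hits) at or above a threshold.

    psms: list of (score, is_decoy) tuples, one per peptide-spectrum match.
    This is an assumed method, not necessarily the one used for the 1.6% figure.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score < score_threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    return decoys / targets if targets else 0.0


# Toy scores, invented for illustration only.
example_psms = [
    (62.0, False), (55.5, False), (48.0, False),
    (41.0, True), (39.5, False), (30.0, True),
]
fdr_loose = target_decoy_fdr(example_psms, 35.0)
fdr_strict = target_decoy_fdr(example_psms, 45.0)
```

Raising the threshold trades identifications for confidence: in the toy data the stricter cut-off removes the only decoy hit above 35 along with one target hit.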
Peptide frequency distribution on GM
[Figure: histogram of the number of gene models (y-axis, 0 to 300) against the number of peptides per gene model (x-axis, 0 to 79)]
Mean 9 peptides covering 134 AA / GM
The 10980 + 4699 peptides represent assignments for nearly 10% of the Gene catalog, i.e. 1659 GM
Automated classification of peptides with no hit (352) on the Gene catalog
• 5’ extension of a predicted GM
– If peptide(s) located within the 1000 bp upstream of the predicted GM start codon
• 3’ extension of a predicted GM
– If peptide(s) located within the 1000 bp downstream of the predicted GM stop codon
• 5’ and 3’ extension of a predicted GM
– If peptides located within the 1000 bp upstream of the start codon and within the 1000 bp downstream of the predicted GM stop codon
• Internal extension of a predicted GM
– If peptide(s) located in the GM
• New GM
– If no predicted GM in the vicinity of the peptide(s)
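The classification rules above can be sketched in Python as follows. This is a simplified illustration, not the script used for the analysis: it assumes all features lie on one scaffold with 1-based inclusive coordinates, treats any overlap with a gene model as "internal", and uses hypothetical names (`GeneModel`, `locate`, `classify`). The handling of mixed cases is also an assumption, since the slide does not cover them.

```python
from dataclasses import dataclass

FLANK_BP = 1000  # window size stated on the slide


@dataclass
class GeneModel:
    name: str
    start: int   # 1-based, inclusive, lowest coordinate
    end: int     # 1-based, inclusive, highest coordinate
    strand: str  # "+" or "-"


def locate(pep_start, pep_end, gm, flank=FLANK_BP):
    """Return "internal", "5'", "3'" or None for one peptide and one GM."""
    if pep_end >= gm.start and pep_start <= gm.end:
        return "internal"
    if pep_end < gm.start and gm.start - pep_end <= flank:
        # Low-coordinate side: upstream on "+", downstream on "-".
        return "5'" if gm.strand == "+" else "3'"
    if pep_start > gm.end and pep_start - gm.end <= flank:
        # High-coordinate side: downstream on "+", upstream on "-".
        return "3'" if gm.strand == "+" else "5'"
    return None


def summarise(found):
    """Turn the set of locations seen for one GM into a category label."""
    if found == {"5'", "3'"}:
        return "5' and 3' extension"
    if len(found) == 1:
        return f"{next(iter(found))} extension"
    return "mixed: " + ", ".join(sorted(found))  # not covered by the slide


def classify(peptides, gene_models, flank=FLANK_BP):
    """Return ({GM name: category}, [peptides suggesting a new GM])."""
    found_by_gm = {}
    new_gm_candidates = []
    for pep_start, pep_end in peptides:
        assigned = False
        for gm in gene_models:
            where = locate(pep_start, pep_end, gm, flank)
            if where:
                found_by_gm.setdefault(gm.name, set()).add(where)
                assigned = True
        if not assigned:
            new_gm_candidates.append((pep_start, pep_end))
    categories = {name: summarise(f) for name, f in found_by_gm.items()}
    return categories, new_gm_candidates


# Toy coordinates, invented for illustration.
models = [
    GeneModel("gm1", 5000, 8000, "+"),
    GeneModel("gm2", 20000, 22000, "-"),
    GeneModel("gm3", 30000, 33000, "+"),
]
peptides = [(4500, 4560), (22300, 22350), (19500, 19550),
            (31000, 31060), (50000, 50060)]
categories, new_candidates = classify(peptides, models)
```

Strand matters here: for a gene model on the minus strand, a peptide at higher coordinates than the model lies upstream of its start codon, so it counts towards a 5’ extension.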
Corrections-Additions to the Gene catalog
• Mapping of the peptides with no hit on the genome allowed the following modifications
Modification               Number of GM
5’ extension               44
Internal exon extension    31
3’ extension               22
5’ and 3’ extension        5
New GM                     73
Total                      172
Manual curation- Internal extension
• EuGene’s prediction is OK
Manual curation- New GM
Summary – Peptide-assisted genome annotation
With few resources (6000 $ worth of materials and services, and a few weeks' worth of labour), our proteomic analysis:
– Validated 10 % of the predicted GM
– Corrected/found > 170 GM
According to the manual curation accomplished so far, it appears that EuGene had predicted most of the > 170 corrected/found GM
Perspectives
• A quantitative proteomic approach (iTRAQ) will be used to compare urediniospores, germinated urediniospores and haustoria protein complexes
• Analysing the Sequest output obtained from the 6-frame translation
– 5051 peptides identified with Mascot (352 with no hits on the Gene catalog)
– Sequest ?
Available material
• Our set of peptide spectra from urediniospore proteins is available to validate new GM predictions
• The peptide GFF files will be made available to the Melampsora community
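For readers unfamiliar with the format, the sketch below shows how one mapped peptide could be written as a GFF3 feature line (nine tab-separated columns, 1-based inclusive coordinates). The layout of the actual files is not described in the slides, so the source and type values, the attribute keys and the scaffold name used here are placeholders chosen for illustration.

```python
def peptide_to_gff3(seqid, start, end, strand, peptide,
                    source="peptide_mapping", feature_type="match"):
    """Format one mapped peptide as a GFF3 line.

    source and feature_type are placeholder values, not those of the real files.
    """
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    attributes = f"ID=pep_{seqid}_{start}_{end};Name={peptide}"
    columns = [
        seqid,         # 1: sequence (scaffold) identifier
        source,        # 2: program or procedure that produced the feature
        feature_type,  # 3: feature type
        str(start),    # 4: start, 1-based inclusive
        str(end),      # 5: end, 1-based inclusive
        ".",           # 6: score (not provided)
        strand,        # 7: strand
        ".",           # 8: phase (only meaningful for CDS features)
        attributes,    # 9: key=value pairs separated by ';'
    ]
    return "\t".join(columns)


# Hypothetical scaffold name and coordinates.
gff_line = peptide_to_gff3("scaffold_1", 1200, 1217, "+", "MAKGLR")
```

A six-residue peptide spans 18 bp, which is why the example interval runs from 1200 to 1217.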
Finding the peptides on the different model prediction sets
Model prediction set    Total GM    GM validated    %
Gene Catalog            16694       1659            9.9%
EuGene                  12386       1348            10.9%
Genewise1               14087       977             6.9%
Genewise1Plus           14162       1046            7.4%
fgenesh1_pg             15760       1140            7.2%
fgenesh2_pg             17833       1377            7.7%
Do we need to perform a new spectra search on the whole model prediction sets?
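The percentages in the table above are the validated GM divided by the total GM of each prediction set. The short Python sketch below reproduces them from the counts on the slide; the variable and function names are illustrative.

```python
# (total GM, GM validated) for each model prediction set, from the slide.
prediction_sets = {
    "Gene Catalog": (16694, 1659),
    "EuGene": (12386, 1348),
    "Genewise1": (14087, 977),
    "Genewise1Plus": (14162, 1046),
    "fgenesh1_pg": (15760, 1140),
    "fgenesh2_pg": (17833, 1377),
}


def percent_validated(total, validated):
    """Share of gene models supported by at least one peptide, in percent."""
    return round(100 * validated / total, 1)


percentages = {
    name: percent_validated(total, validated)
    for name, (total, validated) in prediction_sets.items()
}
best_set = max(percentages, key=percentages.get)
```

With these counts, the EuGene set has the highest validated share even though the Gene catalog has the largest absolute number of validated GM.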