REVIEWERS’ COMMENTS

Invited Reviewer 1

mzTab: exchange format for proteomics and metabolomics results

Review of specification documentation in DocProc

General thoughts:

Though the PSI has put much effort into the development of vendor-independent formats for mass spectrometry data, these formats are not easily parsable XML files. These data contain all relevant information to grasp the workflow of an analysis, starting with the machine settings and including the whole data analysis workflow. The main purpose of these formats is the compact and complete storage of the whole workflow up to the point of the generation of the reported file. This makes a re-analysis of the data possible and gives interested scientists as well as journals as much information about the proceedings and findings as possible.

The proposed mzTab does not follow these guidelines. Instead it is proposed as a lightweight, easy to parse and human readable format. It is also not intended to replace the other formats (especially mzIdentML and mzQuantML), but to give a recommendation for an interchange format containing only the data used by most tools for further analysis. This is a good idea as it is, but it also carries the threat that the “heavier” formats won’t be used at all and instead only mzTab will be used.

Apart from this threat, I think a well-defined, easily readable and parsable format for data interchange is a good idea and the specification is mainly well done, though some more work should be performed.

Specific comments:

Page 7: protein inference

Allowing only one accession per peptide is very counterintuitive. One very simple example: if the mzTab file is only used to give peptide information (which should be possible), how would it be possible to report the connection to all possible proteins of the peptide? Referring to a protein via the "reported" accession for an ambiguity group as in the given specification may be confusing. This group may have accessions in the "ambiguity group" which are not valid for the given peptide, depending on the resource.

Answer: We have decided to change this in the new version of the specification. Following the method used by various search engines (e.g. OMSSA), PSMs may be duplicated and thereby assigned to different proteins. Even though it is still (and deliberately) not possible to completely represent the complexity of the protein inference problem (as e.g. in mzIdentML), it is thereby possible not to lose any data. The fact that these PSMs are duplicates can be resolved either through the (identical) external spectral reference or through the identical search engine scores / retention times / precursors.
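A minimal sketch of how a consumer might recognise such duplicated PSMs, assuming each row has already been parsed into a dictionary. The keys `sequence`, `accession` and `spectra_ref` mirror column names discussed in this review; the function name, the second accession and the `index=...` spectrum identifiers are hypothetical.

```python
from collections import defaultdict


def group_duplicated_psms(rows):
    """Group PSM rows that share the same sequence and spectrum reference.

    Returns a dict mapping (sequence, spectra_ref) to the list of protein
    accessions the PSM was assigned to, in order of first appearance.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["sequence"], row["spectra_ref"])
        if row["accession"] not in groups[key]:
            groups[key].append(row["accession"])
    return dict(groups)


# Illustrative rows only; the values are made up for this sketch.
rows = [
    {"sequence": "EIEILACEIR", "accession": "P12345", "spectra_ref": "ms_file[1]:index=5"},
    {"sequence": "EIEILACEIR", "accession": "Q67890", "spectra_ref": "ms_file[1]:index=5"},
    {"sequence": "QTQTFTTYSDNQPGVLIQVYEGER", "accession": "P12345", "spectra_ref": "ms_file[1]:index=9"},
]
grouped = group_duplicated_psms(rows)
```

Here the first two rows collapse into one group with two accessions, which is the information a single-accession row cannot carry on its own.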

Page 14: Unit IDs

'IDs MUST NOT contain the prefix "_rep[1-n]" unless [...]' -> I think suffix is meant here.

Answer: This typo was corrected in the specification document.

Page 14 (or anywhere):

Every section must obviously start with the header line (e.g. PRH), though this is stated neither in the specification nor in the 10-minute guide. It is also not stated that each header row must appear only once in the document.

Answer: All points were clarified in the specification document.

Page 14-20:

There should be a metadata field for software settings, to capture at least the most important settings of the software used. Though this is possible via the custom field, a defined settings field would be much easier.

Answer: The field “software[1-n]-setting” was added to the format specification.

Page 20, also applies for peptides and small molecules:

Is there any specific reason why the columns MUST be in the order of the document? There is also the header row, specifying which columns appear in the protein block, and thus the ordering will also be given.

Answer: We currently do not see any disadvantage, either for software developers or for end-users to enforce a fixed order of the format’s columns. The main advantage of this feature is that sections from different files (or even answers from web services) can easily be concatenated (one of the use cases mzTab aims to support). Additionally, we believe that human users might find this convention helpful as all files will have the same structure. In addition, software viewers can be configured according to the needs/preferences of each user.
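The concatenation use case can be sketched as follows, assuming each table section is held as a list of tab-separated lines whose first line is the header row. The function name and the reduced three-column layout are hypothetical and chosen only for brevity; real sections carry many more columns.

```python
def concatenate_sections(first, second):
    """Concatenate two table sections that share an identical header row.

    Each section is a list of lines; line 0 is the header (e.g. PRH).
    With a fixed column order the check is a plain string comparison.
    """
    if not first or not second:
        raise ValueError("both sections need at least a header row")
    if first[0] != second[0]:
        raise ValueError("header rows differ; sections cannot be concatenated")
    # Keep one header, then all data rows from both sections.
    return first + second[1:]


# Illustrative sections with a shortened, made-up column set.
file_a = ["PRH\taccession\tunit_id\tdescription",
          "PRT\tP12345\tEXP_1\tExample protein A"]
file_b = ["PRH\taccession\tunit_id\tdescription",
          "PRT\tQ67890\tEXP_2\tExample protein B"]
merged = concatenate_sections(file_a, file_b)
```

If column order were free, the same merge would first have to map each file's columns onto a common layout.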

Page 24 "uri", "go_terms" and "protein_coverage":

They should be optional, as they are not easily found after a peptide search (and not even given by many search engines).

Answer: While it is true that these fields may be unavailable in several cases, we tried to keep the number of optional columns as low as possible. This ensures that all mzTab files have a similar structure and are therefore easier to use for inexperienced users.

Page 25:

Why must the peptide section follow the (optional) protein section? I don't see any reason for this, except a possible accession parsing. However, the accessions in the peptide section don't have to match any accessions in the protein section, nor must there be any protein section at all.

Answer: As mentioned before, we believe that a fixed order of sections and thereby a more “stable” format makes the format easier to use. At the same time, we cannot see any disadvantage in enforcing a certain order. People developing software that is able to generate mzTab files will on average be more experienced than people “consuming” the files. For them, it should not make a difference whether a certain order of sections is enforced while the latter group might find it helpful to know “where” to find what type of information.

Page 26:

The "unique" column may be very useful, but as mzTab should be an easily generated format and not every search engine reports this value, it should be optional.

Answer: See previous responses. If this information is not available, it is possible to just use ‘null’.

The type is specified as “Boolean”, but in the example the values "0" and "1" are used. Though it is obvious what is meant, another way would be to use "true" and "false" (or these written in capitals). It would be easier for parsers if the specification stated whether to use 0/1, true/false or both.

Answer: This has been clarified in the specification document. Only “0”/“1” are supported.
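A strict reader for such a Boolean cell could look like the sketch below. The function name is hypothetical, and mapping ‘null’ to `None` is an assumption based on the answer above that ‘null’ may be used when the information is not available.

```python
def parse_mztab_boolean(value):
    """Parse a Boolean cell: only "0" and "1" are accepted.

    "null" (value not available) is mapped to None; anything else,
    including "true"/"false", is rejected.
    """
    if value == "1":
        return True
    if value == "0":
        return False
    if value == "null":
        return None
    raise ValueError(f"not a valid Boolean cell: {value!r}")


is_unique = parse_mztab_boolean("1")
```

Rejecting "true" and "false" outright keeps producers from drifting into a second, unofficial encoding.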

Page 27:

Also, "retention_time" is neither given by every search engine nor relevant for every MS method (e.g. direct injection), and thus should be optional.

Answer: We believe that the retention time of a peptide or a small molecule is a vital piece of information that is used and usable in many workflows. Therefore, it should be possible to report it in mzTab. To minimize the number of optional columns this column was defined as non-optional.

Page 28:

As the charge (with today’s machines) can only be an Integer, the type should be Integer rather than Double.

Answer: We have decided to change it to an Integer in the new version of the format specification.

The uri should be optional for the same reason as above.

Answer: See answers above (the same applies).

In the description of the metadata it is said that the {UNIT_ID}_ms-file is an optional value (as all metadata is). But the mandatory field "spectra_ref" uses this field's information. If it is not given in the metadata, this mandatory reference format for the "spectra_ref" is of no use. So either this field should be optional or the ms-file in the metadata should be mandatory.

Answer: While the protein / peptide / small molecule sections are table based, the metadata section is key-value based. To enable the easy concatenation of the table based sections we have decided to minimize the number of optional columns. At the same time, the metadata section was defined as an “all optional” section. We believe that, while this specific constellation is not ideal, these more general rules are easier to understand.

Page 29:

Why must the small molecule section follow the (optional) protein and peptide section?

Answer: see Answer above. We prefer this decision to keep the format consistent.

Page 31:

charge: type should be Integer

retention_time: should be optional with same argument as in peptide section

Answer: see answers above.

Page 32:

uri: Though I am not used to small molecule identification, I think it should be optional (same as in proteins and peptides).

spectra_ref: see comment for peptides

Answer: See our answers about the same topic before.

Invited Reviewer 2

Generally I find the document well written and very useful.

Here my comments:

The ten minutes guide:

----------------------

In the ten minutes guide in the "Units in MzTab" chapter you specify in the developers section which symbols to use for units. No such developer specification is given for other fields such as column names.

It would be clearer to leave the details in the main document and only refer to them.

Answer: While developing mzTab in cooperation with several research groups at different locations we found that the concept of “Units” caused the most problems. We therefore decided to explicitly explain this feature in the “10 minute guide”.

In the peptides section you use "ABC" as a peptide name (better use a real peptide). This raises the question which amino acid letters are allowed in the MzTab documents, and if one uses the letters `B`,`J`,`O`, ... shouldn't one specify what they mean.

Answer: Section 5.10.5 (Reporting sequence ambiguity), including the cases of B, J, O,… has been added to the new version of the specification document.

mzTab format specification:

---------------------------

Do protein and peptide data need to be consistent? E.g. if the PRT field num_peptides = 3, is it required that 3 peptides of this protein are listed in the file? Or do modifications of the protein also appear on the peptide level? This may have some relevance for programmers writing parsers.

Answer: If a protein and a peptide section are present in an mzTab file it is expected that these two sections are consistent – as in any other proteomics file format. This has been clarified in the specification document (Section 6.1).

I would indicate for each column whether it is mandatory or not. This would help readers who just have a quick look at the text.

Answer: In the table based sections all columns are mandatory, apart from the quantification associated ones and – of course – the optional columns. This is highlighted in the format specification by the word “(Optional)” after each optional column name. In the beginning of the metadata section it explicitly states that “All fields in the metadata section are optional.”

In some applications spectrum matches to the same peptide are combined. For example, in quantitative proteomics it could make sense to combine peptides with different charge states or chemical modifications. Or in spectrum clustering a peptide could match to a consensus spectrum. This requires that one PEP entry would have several spectra, which does not seem to be possible in MzTab. Could you comment on this issue and add some clarification to the text.

Answer: Peptides can be linked to multiple spectra. This is explicitly defined in the mzTab specification: “Multiple spectra MUST be referenced using a “|” delimited list.” Thereby, a peptide identification can be linked to multiple spectra, even from different source files.
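A minimal sketch of reading such a “|” delimited cell, assuming each entry follows the ms_file[1-n]:{SPEC_REF} pattern quoted elsewhere in this review. The function name and the `index=...` spectrum identifiers are hypothetical.

```python
import re

# One reference: "ms_file[<n>]:<spectrum reference>"
_SPECTRA_REF = re.compile(r"^ms_file\[(\d+)\]:(.+)$")


def parse_spectra_ref(cell):
    """Split a spectra_ref cell into (ms_file index, spectrum reference) pairs."""
    references = []
    for part in cell.split("|"):
        match = _SPECTRA_REF.match(part.strip())
        if match is None:
            raise ValueError(f"malformed spectrum reference: {part!r}")
        references.append((int(match.group(1)), match.group(2)))
    return references


# A peptide linked to two spectra from two different source files.
refs = parse_spectra_ref("ms_file[1]:index=5|ms_file[2]:index=12")
```

The file index in each pair points back to the matching ms_file entry in the metadata section, which is how spectra from different source files stay distinguishable.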

The ratio 0/0 should be specified as NaN and not INF. I don't think that NaN should be replaced by NA, since these are 2 different things.

Answer: NaN (and also INF) have been added to the mzTab specification for these use cases.

SC_reviewer 1

SC reviewer 1 comments:

-----------------------

4. Relationship to other specifications

=> no relationship to MIAPE? Explicitly? implicitly? not necessary?

Answer: mzTab does not aim to fulfil the MIAPE guidelines. In fact, it allows different degrees of experimental metadata annotation. We believe that it is not necessary to give further explanation about this. Other formats like mzIdentML and mzQuantML have been developed with the MIAPE guidelines in mind.

=> no relationships with TraML?

Answer: TraML is a file format for SRM transition specific data while mzTab is focused on identification/ basic quantification related MS based data. In this version of the mzTab specification no support for SRM quantification is provided. In the future, it would be definitely possible to do it (and this is where TraML would have a clear role).

4.1

=> Please add NEWT as a possible CV, as it is used in examples in 6.2.23 to exemplify the unit {UNIT_ID}(-{SUB_ID})-species[1-n] ;

same for BTO, CL, DOID etc used under 6.2.nn

Answer: The mzTab format specification does not define allowed ontologies. Any suitable ontology may be used. The above mentioned CVs/ontologies are now included in Section 4.1 in the specification document.

6.2.13 {UNIT_ID}-uri

=> if multiplicity is 0 ... *, please specify 6.2.13 {UNIT_ID}-uri[1-n]

otherwise | as separator for multiple uris is not appropriate and can be misleading

Answer: This field can occur multiple times (thus there can be multiple “uri” lines for a single unit) to specify multiple URIs for a single UNIT. It would not be possible to define any alphanumeric character as a separator for URLs, as it may be part of a URL. We have extended the example provided to exemplify this convention.
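The repeated-line convention can be sketched as a key-value reader that collects every value per key, assuming metadata lines are tab-separated and start with MTD. The function name and the example.org URLs are hypothetical placeholders.

```python
from collections import defaultdict


def parse_metadata(lines):
    """Collect metadata values per key, keeping repeated keys in order.

    Each metadata line is assumed to be "MTD<TAB>key<TAB>value".
    Lines from other sections are ignored.
    """
    metadata = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "MTD":
            continue
        metadata[fields[1]].append(fields[2])
    return dict(metadata)


# Two "uri" lines for the same unit; the URLs are placeholders.
lines = [
    "MTD\tEXP_1-uri\thttp://example.org/experiment/1",
    "MTD\tEXP_1-uri\thttp://example.org/project/42?view=a|b",
    "PRH\taccession",
]
meta = parse_metadata(lines)
```

The second placeholder URL contains a “|”, which shows why a separator character inside a single value would be ambiguous for URIs.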

5.1. Handling updates to the controlled vocabulary

the form http://www.psidev.info/index.php?q=node/440 points to a url that does not exist.

=> Please provide a more stable one

Answer: This URL no longer exists in the new PSI website, so it has been removed from the specification document.

in 5.4

=> In the text "...for an experiment “EXP_1”, the replicates must have the UNIT_IDs “EXP_1-rep[1-n]”", "must" should be capitalized as MUST here.

Answer: This has been updated in the specification document.

=> "Biological replicates are not explicitly supported in the same way in mzTab." therefore anything is possible? what are the constraints? no constraints = difficult to limit imagination of people...

Answer: Experimental setups are extremely diverse and constantly evolving. Therefore, any set constraint would result in unsupported use cases. We therefore decided to deliberately provide researchers with an, in this respect, open format to be able to report the data from their experimental designs.

in 5.5

example :

The following example shows how two different quantitative experiments can be reported in one mzTab file. Not all labels are shown

MTD EXP_1-quantification_method [MS,MS:1001837,iTraq,]

=> in MS CV, MS:1001837 is defined as "iTRAQ quantitation analysis" ; please correct the example accordingly.

Answer: This was corrected.

later in same example:

MTD EXP_2-quantification_method [MS,MS:100999,SILAC,] ;

=> replace text in brackets by [MS,MS:1001835,SILAC quantitation analysis,]

Answer: This was corrected.

in second example of 5.5:

Example showing how emPAI values are reported in an additional column using

MS CV parameter emPAI value (MS:1001905)

PRH accession opt_cv_MS:1001905

PRT P12345 0.658

=> the column title is opt_cv_MS:1001905, and is therefore not readable by a non-bioinformatician. Where can one find a full text name to help targeted users read the information? As stated in 5.10.3, it is allowed to use anything, as long as it is different from another column. So why specify this possibility, which is not a human-readable one?

Answer: In the new version, for optional columns it is needed to specify both the CV param accession and the parameter name following this structure:

opt_cv_{accession}_{parameter name}.
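A minimal sketch of splitting such a header into its two parts, assuming the accession itself contains no underscore and the parameter name is written into the header unchanged. Both are assumptions of this sketch, as are the function name and the example header.

```python
def parse_optional_cv_header(header):
    """Split "opt_cv_{accession}_{parameter name}" into (accession, name)."""
    prefix = "opt_cv_"
    if not header.startswith(prefix):
        raise ValueError(f"not an optional CV column: {header!r}")
    # Split at the first underscore after the prefix; this assumes the
    # accession (e.g. "MS:1001905") never contains an underscore.
    accession, separator, name = header[len(prefix):].partition("_")
    if not separator or not accession or not name:
        raise ValueError(f"missing accession or parameter name: {header!r}")
    return accession, name


accession, name = parse_optional_cv_header("opt_cv_MS:1001905_emPAI value")
```

A header in the older accession-only form, such as opt_cv_MS:1001905, is rejected by this sketch because it carries no parameter name.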

5.8

=> in text " ... “-” must be provided as the value for each of " : please capitalize MUST

=> in the text " The reliability MUST be an integer" : change to "When a reliability value is provided, this value MUST be an integer"

Answer: This was updated in the specification document.

in 5.9

in text : "All (identified) variable modifications as well as fixed modifications MUST be reported for every identification."

=> this sounds nice, but if mzTab is meant to cover a simplified but straight-to-the-point representation of peptide, protein or small molecule identifications, this point is overkill. Take the simple example of phosphopeptides: it is not important to show the position of an oxidized methionine, or of an iTRAQ label, when an author wants to report a list of phosphorylation positions. I'm sure that this will not be followed. Therefore replace MUST with something less stringent.

Answer: This is a well-known issue when dealing with modifications. However, we believe that this should be enforced with a MUST. Otherwise, data will not be consistent and then, difficult to trust.

later in text: "Furthermore, mass deltas MUST NOT be reported if the given delta can be expressed through a known and unambiguous chemical formula."

=> as this is commonly used by tools such as Sequest and ProteinProspector, I'm not convinced at all that you can claim this to be followed, particularly because these mods are defined by a mass delta value without requiring a "name" for it.

Answer: We now changed the requirement level to SHOULD NOT.

about 5.10.1: the approach means that one peptide (or one protein) is represented twice (two lines) if it is found both by Andromeda and Mascot. This is not a simple final list. Authors prefer to have one line with both scores (this comment is based on mztab_merged_example.txt).

=> An appropriate example is probably missing, as one can encode [MS,MS:1001171,Mascot score,50]|[MS,MS:1001155,Sequest:xcorr,2] according to 6.3.9

Page 9: REVIEWERS’ COMMENTS - psidev.info€¦  · Web viewREVIEWERS’ COMMENTS. Invited Reviewer 1. mzTab: exchange format for proteomics and metabolomics results. Review of specification

Answer: The merged example demonstrates how multiple mzTab files can be merged by simple concatenation. It is therefore natural and intended that one peptide or protein might be represented by two lines but with different unit ids. In contrast, files that contain search results from multiple search engines only report every protein / peptide / small molecule once (see specification document 6.3.8-9 for the reporting format). Note that this can be achieved only by additional processing, e.g. consider a researcher wants to provide a single mzTab file based on two different mzTab files originating from different search engines. He can simply do so by reporting each peptide or protein once using the format to report multiple scores and search engines but he has to take care to provide a new unit id and adapt the meta values accordingly.
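The “|”-separated parameter list quoted in the comment above can be read with a small parser such as the following sketch. It assumes each parameter has exactly four comma-separated fields (CV label, accession, name, value) and that names contain no commas; the function name and dictionary keys are hypothetical.

```python
def parse_cv_params(cell):
    """Parse "[cv,accession,name,value]|[...]" into a list of dicts."""
    params = []
    for item in cell.split("|"):
        item = item.strip()
        if not (item.startswith("[") and item.endswith("]")):
            raise ValueError(f"parameter is not bracketed: {item!r}")
        fields = [field.strip() for field in item[1:-1].split(",")]
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {item!r}")
        cv_label, accession, name, value = fields
        params.append(
            {"cv": cv_label, "accession": accession, "name": name, "value": value}
        )
    return params


# One peptide reported once, with scores from two search engines.
scores = parse_cv_params(
    "[MS,MS:1001171,Mascot score,50]|[MS,MS:1001155,Sequest:xcorr,2]"
)
```

Values are kept as strings here so that empty values, as in a parameter without a value, pass through unchanged.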

about 5.10.4: about the text "This field MUST NOT be used to reference an external MS data file. MS data files should be referenced using the method described in Section 5.2".

=> what about referencing mzML? or a specific spectrum in mzML? this is not specified in 5.2

Answer: Referencing spectra in mzML files is done using the method described in “MS:1000777”. This method is used for any file format that has a native id format for the spectra within the files (i.e. mzML, mzData, etc.).

in 6 "Every line in an mzTab file must start"

=> capitalize MUST

Answer: This was updated in the specification document.

in text (under Params) "Any field that is not available should be left empty"

=> shouldn't that be MUST be left empty?

Answer: This was changed to MUST.

=> how are space and comma characters constrained?

About the numbers: is a scientific number forbidden (such as 1.4E10) ?

Answer: No, they are allowed.

6.2

in section Unit ID: the term unit is given in lowercase; under 5.4 and 6.1 it is always written UNIT.

=> please be consistent

Answer: This was fixed.

about

{UNIT_ID}-{SUB_ID}-custom

and

{UNIT_ID}-custom

=> what is the difference as "-" characters are allowed in the naming of these terms

Answer: {UNIT_ID}-{SUB_ID}-custom is only applicable to the sub-samples, but the general concept is the same. There are several experimental setups where a researcher may want to report additional information about a specially treated / processed sample which cannot be expressed with the fields provided. As this information may not be applicable to all subsamples (i.e. in a 4-plex setup) optional subsample fields are allowed.

in 6.3

The protein section must always come

=> capitalize MUST

Answer: Done.

There MUST NOT be any empty cells.

=> how do we need to fill these? NA?

Answer: All the empty cells need to be filled using ‘null’ if no information is available. In the new version of the specification, ‘NA’ has been substituted by ‘null’ (INF and NaN are now also possible).
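A sketch of how a reader might map these special values when parsing a numeric cell. The function name is hypothetical, and accepting scientific notation follows the answer given earlier in this review that such numbers are allowed.

```python
import math


def parse_mztab_double(value):
    """Parse a numeric cell, handling 'null', 'NaN' and 'INF' explicitly."""
    if value == "null":
        return None          # no information available
    if value == "NaN":
        return math.nan      # e.g. an undefined ratio such as 0/0
    if value == "INF":
        return math.inf
    return float(value)      # also accepts scientific notation


abundance = parse_mztab_double("1.4E10")
```

Keeping ‘null’ (missing) distinct from NaN (undefined) preserves the distinction Invited Reviewer 2 asked for.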

The columns in the protein section MUST be in the order they are presented in this document

=> it is not said whether the columns described under 6.3. are all mandatory or not. It is overkill to add all of these columns if they are not necessary, from the user's point of view, to explain and transfer the result (for instance go-terms, number of peptides, taxid, etc.)

Answer: As mentioned before for Reviewer 1, we prefer to keep a number of mandatory columns. If no information is provided, it is possible to add just ‘null’.

6.3.4 taxid and 6.3.5 species

=> what if a custom or synthesized protein? NA?

Answer: There is one CV term from NEWT (taxID) called ‘synthetic’ (accession number 32630).

6.3.7 database_version

=> today, UniProtKB version names use an underscore and not a hyphen. Please change 2011-11 to 2011_11

Answer: Done.

6.3.11, 12, 13

=> what about number of PSM? is this considered identical to 6.3.11?

Answer: This is left for each data producer to export following its specific criteria. A software may consider that two peptides with the same sequence correspond to 2 peptides (2 PSMs) or just one. We think the first approach is probably more consistent.

6.3.15 modifications

=> sounds like reinventing the wheel and is not compatible with current outputs from some tools (forcing a double between 0 and 1 when a given software provides an open scale value is not a good idea). Why not look at PEFF and at GPMDB for this?

Answer: This was changed in the specification document (please see Section 5.9, Reporting modifications and amino acid substitutions). The use of params for modification reliability is now mandatory so a Double value is no longer supported.

6.3.16 uri

=> not sure what this is: a uri representing the entry in the searched database, or something else?

Answer: The URI can represent for instance the location of the protein identification in a proteomics repository, or the name of the original mzIdentML file where that protein was detected. Please also see section 5.10.4 of the spec document.

6.3.17 go_terms

=> why using a string list and not a string with | separators

Answer: This was changed in the format specification.

6.3.19 protein_abundance_sub[1-n] (Optional)

=> what is expected here? concentrations, number of molecules? there are no units...

Answer: In the new version of the specification, a mechanism was introduced to specify units (see sections 6.2.32, 6.2.33 and 6.2.34).

6.4.

The peptide section must always

=>capitalize MUST

Answer: Fixed.

=> are they all required? this might potentially generate huge and unnecessary redundancy (for instance if one database and one database version was used, which is the case for all non-merged results)

Answer: As mentioned earlier, we prefer to keep the files consistent at the price of redundancy. As explained before, this can help to concatenate different files.

6.4.1 sequence

=> how to encode sequence ambiguity (I/L), others, and results from sequence tags experiments?

Answer: See Section 5.10.5 in the new version. I/L can be represented as ‘J’.

6.4.5 database

=> how to describe UniProtKB/Swiss-Prot human complete proteome subset + cRAP database?

Answer: The database is reported on a per entry basis in the protein / peptide / small molecule section. Thereby, even if a combined database is used, a single entry can only originate from one of the underlying original databases and the problem does not occur.

6.4.8 search_engine_score

=> I want to report score and e-values for one peptide identified by Mascot, and xcorr and PeptideProphet probabilities. How do I show this?

Answer: This can be done for the same peptide, adding both scores as CV params separated by “|”.

6.4.11 retention_time

=> why constrain to seconds? It might be simpler, and is actually more often used, as minutes or relative retention time or even retention index, depending on the application

Answer: See answer above about how units can now be represented in mzTab (see sections 6.2.32, 6.2.33 and 6.2.34).

6.4.12 charge

=> why a Double?

Answer: See previous answers to the same topic. We decided to change it to an Integer.

6.4.14 uri

=> I thought it can be the pointer to an mzIdentML file position... So how to do this?

Answer: It is a pointer to the mzIdentML files where this identification is reported. However, we think it is not needed to go further and specify the location in the file.

6.5

The small molecule section must always

=> capitalize MUST

Answer: Fixed.

6.5.3 chemical_formula

Elements should be capitalized properly to avoid confusion

=> I would change should to MUST ...

Answer: Fixed.

The chemical formula reported should refer to the neutral form.

Charge state is reported by the charge field. This permits the comparison of positive and negative mode results

=> No, the chemical formula of the neutral form plus the charge state is not sufficient. What about adducts? And how to exemplify for glycine, C2H5NO2: the protonated form is C2H3NO2 with charge 1 and the deprotonated (negative) form is C2H4NO2 with charge -1. But the sodiated form is C2H4NO2Na with charge 1. And if I take the example of a doubly charged species, it gets even worse... How do you want to compare the chemical formulae?

Please be more precise with what is expected or remove the comment.

Answer: We thank the reviewer for pointing this out. We now note that mzTab offers support for adducts via the modifications column in the specification document.

For example:

- a sodiated glycine is reported with formula: C2H5NO2, modifications: CHEMMOD:+Na-H and charge: 1,

- the deprotonated (negative) form is reported with formula: C2H5NO2, modifications: CHEMMOD:-H and charge: -1, and

- the protonated form with formula: C2H5NO2, modifications: CHEMMOD:+H and charge: 1.

6.5.6 description

if it is allowed to provide a list of identifiers, then it should also be needed to have multiple descriptions (same is true for inchi...)

=> is it allowed? if not, please add a constraint under 6.5.1; if yes, please make sure there are no ambiguities left...

Answer: More than one description, InChi and SMILES are now supported.

6.5.9 retention time

=> same as 6.4.11

Answer: Same answer applies.

6.5.16 spectra_ref

The reference must be in the format

ms_file[1-n]:{SPEC_REF} where SPEC_REF must follow the format defined in mzIdentML

=> but according to the specs, it is possible to refer to something else than a mzIdentML file, which can have another way to index or point to a spectrum. Please allow more options

Answer: We are not sure what the reviewer means. In mzTab it is possible to reference external spectra in the same way it is done in the mzIdentML format specification. We think that all the options are possible there, and to allow for more in the future, only new CV params would need to be created (see section 5.2).

7. Conclusions

"These artefacts are currently undergoing the PSI document process standardization process"

=> remove "standardization process"

Answer: Done.

Other question: why not use the PSI-MS CV for terms? Are there no relationships to mzIdentML terms? Just independent terms?

Answer: We do not quite understand the reviewer's comment. The mzTab format specification does not exclude any CV from being used; in fact, it recommends using the PSI-MS CV.

Public commenter 1

I was pleased to see this format proposed as I think it's important to allow people to exchange data in a tab-delimited format. I think the specification looks good and the supporting documents are good. I'm concerned that you have too many required columns, but I guess the easy-out is to just put NA when you don't know the value or it's not applicable.

I would not require that the columns must be in the specific order; downstream software should be able to parse the header line of each section to determine the columns that are present and the order that they're in. By requiring this order of columns you lock yourself into specific columns in a specific order long-term. Also, this would allow users to simply leave out a column if it's not applicable. That way you don't get an entire column of NA values.

Again, downstream software can read the header line, see what's there, and for the columns that it knows about that aren't there, it can just internally record NA.

Answer: As mentioned before, we currently do not see any disadvantage, either for software developers or for end users, in enforcing a fixed order of the format's columns. The main advantage of this feature is that sections from different files (or even answers from web services) can easily be concatenated (one of the use cases mzTab aims to support). Additionally, we believe that human users might find this convention helpful, as all files will have the same structure.

I don't like having to record NA for an empty cell, though I can understand having the requirement. Still, I don't like it; software can easily parse two tabs in a row as meaning there is an empty cell.

Answer: We agree that this is not ideal, but we still think it can solve a lot of potential problems. In the new version of the specification, ‘NA’ has been replaced by ‘null’ (NaN and INF are now also possible).
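
A minimal sketch of how a reader might map the reserved values mentioned in this answer onto native values when parsing a numeric column. The function name parse_numeric_cell is hypothetical, and only the three reserved values named above are handled.

```python
import math

def parse_numeric_cell(cell):
    """Convert one numeric mzTab cell to a Python value.

    'null' marks a missing value; 'NaN' and 'INF' are passed through as floats.
    """
    if cell == "null":
        return None
    if cell == "NaN":
        return math.nan
    if cell == "INF":
        return math.inf
    # An empty cell is not a valid float and raises ValueError here.
    return float(cell)

row = "1.5\tnull\tINF".split("\t")
values = [parse_numeric_cell(cell) for cell in row]
```

With an explicit marker, a missing value and a truncated or damaged line are distinguishable: an empty cell produced by two consecutive tabs is rejected instead of being silently read as missing.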

Did you consider listing database, database_version, and search_engine in the Metadata section? By including those in the protein section and in the peptide section you're replicating the same data on every line, thus leading to file-bloat. The only instance I could see where you would need these in the peptide section is if the mzTab document includes search results from multiple search engines.

Answer: This granularity makes it possible to combine results from different search engines more efficiently. The concatenation of files is also more consistent if these three essential pieces of information are annotated per protein, peptide and small molecule.

One additional thought: did you consider having an optional section for mass spectral data (scan, m/z, and intensity)? If I wanted to exchange MS data (either MS1 spectra or MS2 spectra) in a text format, what would be the suggested format? The MS2 format comes to mind (http://noble.gs.washington.edu/proj/crux/ms2-format.html), but I believe that is specific to MS/MS data. I realize we don't need yet-another file format for MS data, but you have defined a clear text-based format here for the proteins and peptides, so I thought perhaps people might also want to include some important mass spectra using this format (likely not full mass spectra, just some key peaks).

Answer: We think there are already quite a few formats for reporting mass spectra as text. Some of them are incomplete (pkl, dta), but others, such as mgf or MS2, optionally allow rich information to be reported. Reporting spectra is therefore outside the scope of mzTab.

portion 2:

---------

I have another suggestion: I think it would help readability and would provide some error checking to include the residue in the modifications. For example, instead of:

accession modifications

gi|10181184 13[0.8]-UNIMOD:35,29[0.2]|35[0.4]-UNIMOD:21

gi|1050551 50-MOD:01499,K59-MOD:01499

IPI00000980 53[68.0]-MOD:00016

IPI00002824 NA

Use:

accession modifications

gi|10181184 N13[0.8]-UNIMOD:35,Q29[0.2]|V35[0.4]-UNIMOD:21

gi|1050551 E70-MOD:01499,K79-MOD:01499

IPI00000980 E53[68.0]-MOD:00016

IPI00002824 NA

Answer: We have preferred to report only the position of the amino acid, without the residue itself. We do not think that this redundancy is needed.
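
For the position-only form retained here, a minimal parsing sketch follows. It assumes the syntax visible in the first example table: comma-separated entries, each consisting of one or more '|'-separated positions with an optional bracketed value, followed by '-' and the modification accession. The name parse_modifications is hypothetical, and the bracketed number is passed through as an unnamed optional score because its meaning is not stated in this section.

```python
import re

# A position such as '13' or '13[0.8]'
POSITION = re.compile(r"^(\d+)(?:\[([0-9.]+)\])?$")

def parse_modifications(cell):
    """Parse a modifications cell into a list of (positions, accession) tuples.

    Each position is a (residue index, optional bracketed score) tuple.
    """
    if cell in ("NA", "null"):  # no modifications reported
        return []
    parsed = []
    for entry in cell.split(","):
        positions_part, accession = entry.split("-", 1)
        positions = []
        for position in positions_part.split("|"):
            match = POSITION.match(position)
            if match is None:
                raise ValueError(f"unexpected position: {position!r}")
            score = float(match.group(2)) if match.group(2) else None
            positions.append((int(match.group(1)), score))
        parsed.append((positions, accession))
    return parsed

mods = parse_modifications("13[0.8]-UNIMOD:35,29[0.2]|35[0.4]-UNIMOD:21")
```

A reader that also has the protein or peptide sequence can recover the residue from the position, which is why the residue letter is redundant in the cell itself.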

Also, is there a reason why the iTRAQ mods (MOD:01499) weren't being shown at the protein level in mztab_merged_example.txt?

Answer: Thanks to the reviewer for pointing this out. The example file was fixed.

portion 3:

---------

The iTRAQ mods are included at the protein level in PRIDE_Exp_Complete_Ac_16649.xml-mztab.txt so that makes me feel a little better. I was worried that wasn't allowed, but now I see that I was wrong.

Public commenter 2

I have read and considered the mzTab document and most examples. It addresses well the many issues of reporting proteomics data and gives itself room for flexibility. It was readable in spite of the underlying complexity that this field of work imposes.

I noticed the following point in the documentation and examples. Page 6: The data type and terms included: "WIFF nativeID format" but did not specify source, i.e. ABSciex/ABI (maybe not important)

Answer: We think this is not really very important. This information should be included in the PSI-MS CV.

The example files did not detail as many MTD entries as are described in the documentation. If similar documents are likely to be previewed by potential users in the future, a better representation of the terms would be useful.

Answer: We have tried to improve the examples, with more metadata information. However, it is important to highlight that all metadata information is optional.

Maybe I misread the description of the terminology, but the iTRAQ example contains many uncharacteristic protein abundance values, i.e. there is one value of unity and several that are 60 thousand +/- 60k. This again is likely unimportant in the framework of the document.

Answer: There are several ways to report the results of an iTRAQ study. Our example covers the reporting of ratios (as specified in the metadata section). These can be normalized to the 114 channel (as done in our example), yielding unity for the first channel and relative ratios for the other ones. Variation over several orders of magnitude is also to be expected, depending on the biological sample and experimental setup.
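
The normalization described in this answer can be sketched as follows. The channel intensities are made-up numbers chosen for illustration, not values from the example file, and normalize_to_reference is a hypothetical helper name.

```python
def normalize_to_reference(intensities, reference="114"):
    """Divide every channel intensity by the reference channel intensity."""
    reference_value = intensities[reference]
    if reference_value == 0:
        raise ValueError("reference channel intensity is zero")
    return {channel: value / reference_value
            for channel, value in intensities.items()}

# Hypothetical reporter ion intensities for one protein
ratios = normalize_to_reference(
    {"114": 200.0, "115": 100.0, "116": 400.0, "117": 50.0}
)
```

The reference channel always comes out as exactly 1.0, which explains the single value of unity per row, while the other channels take whatever ratio the sample produces.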

Suggested text changes. Page 3:

I see "support. Section Error! Reference source not found."

Answer: Now corrected.

Page 4:

"The following use cases have driven the development of the mzTab data model,"

The following cases of usage have driven the development of the mzTab data model,......

Answer: Now corrected.

Page 4:

"The specification described in this document is not being developed in isolation; indeed, it is designed to be complementary to, and thus used in conjunction with, several existing and emerging models. Related specifications include the following:"

The specification described in this document has not been developed in isolation....

Page 5: "The CV has been generated by collection of terms from software vendors and academic groups working in the area of mass spectrometry and proteome informatics."

The CV has been generated with a collection of terms from.......

Answer: All these text changes have been made.

Many thanks to the hardworking members of the PSI community.