Supplementary Methods of “Thermal proteome profiling for unbiased identification of
direct and indirect drug targets”
Holger Franken1,†, Toby Mathieson1,†, Dorothee Childs1,2,†, Gavain Sweetman1,†, Thilo Werner1,
Ina Tögel1, Carola Doce1, Stephan Gade1, Marcus Bantscheff1, Gerard Drewes1,
Friedrich B.M. Reinhard1,*, Wolfgang Huber2,*, Mikhail M. Savitski1,*
1 Cellzome GmbH, Molecular Discovery Research, GlaxoSmithKline, Meyerhofstrasse 1,
Heidelberg, Germany
2 Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
† These authors contributed equally to this work
* Corresponding authors
The supplementary procedures were adapted with permission from the authors of the Jafari et
al protocol3. The steps were adjusted for different compound treatment (panobinostat),
different cell line (K562), and the in-house equipment used.
Materials
Reagents
Liquid nitrogen (use any local provider)
K562 cell line (ATCC, cat. no. CCL-243)
HDAC inhibitor Panobinostat (Santa Cruz Biotechnology, cat. no. SC-208148)
RPMI1640 medium (Life Technologies, cat. no. 21875-034)
FCS (Life Technologies, cat. no. 10270-106)
DPBS without calcium chloride/ magnesium chloride (Life Technologies, cat. no.
14190-094)
DMSO waterfree (Sigma-Aldrich, cat. no. 41647)
cOmplete, EDTA-free protease inhibitors (Roche, cat. no. 11873580001)
Dimethyl sulfoxide DMSO (Fluka, cat. no. 01934-1L)
Equipment
Waterbath TW20 (Julabo, cat. no. 9550120)
Forma™ Steri-Cult™ CO2 Incubator (Thermo Scientific, cat. no. 3311)
Nature Protocols: doi:10.1038/nprot.2015.101
CASY Cell Counter TT 150 µm (Roche Innovatis, cat. no. 05651697001)
PCR Tubes (Brand, cat. no. 781305)
Heraeus Multifuge 3SR (Thermo Scientific, cat. no. 75004371)
Peltier Thermal Cycler (MJ Research / BioRad, cat. no. PTC-200)
Tabletop Centrifuge 5415D (Eppendorf, cat. no. 022621425)
0.2 ml Ultracentrifuge Tubes (Beckman Coulter, cat. no. 343775)
Nunc U-bottom 96-well polypropylene plates, clear (Nunc, cat. no. 267245)
Optima Max XP benchtop ultracentrifuge (Beckman Coulter, cat. no. 393315)
Reagent Setup
Cell culture medium Supplement the RPMI cell culture medium by adding FCS to a final
concentration of 10% (vol/vol). Preheat the culture medium to 37 °C using a water bath
before use in cell culture experiments.
Cells This protocol has been optimized for use with cells in suspension; K562 cells serve as
the example in the procedure. For adherent cells normally growing as monolayer
cultures, their use as described in this protocol must be carefully validated for each individual
system. This validation would, for example, involve the investigation of whether there are any
differences in drug-target engagement in cells growing as a monolayer compared with the
format presented here (i.e., detached and in suspension). Such comparative studies can be
achieved by performing the compound incubation step in the cell culture flasks before
detaching the cells, rather than in suspension, as outlined herein, with the downstream steps
remaining the same.
Compound preparation Compounds delivered as powders are dissolved in DMSO to yield
10 mM stock solutions. All library compounds are stored frozen as 10 mM stock solutions
until use. The 10 mM stock solutions are diluted to 2 mM in DMSO before use in the
experiments.
! CAUTION Use appropriate safety equipment and a controlled environment when working
with potentially toxic and mutagenic compounds. This is particularly important when
handling DMSO solutions, as the solvent is highly skin permeable.
▲CRITICAL STEP Water-soluble compounds can be dissolved in water instead. For
nonpolar organic compounds, it might be necessary to use lower concentrations of the
compounds in DMSO owing to their lower solubility. If this is the case, it is important to
determine the DMSO tolerance of the cells that you are working with before deciding how
much of the compound to add.
EQUIPMENT SETUP
Forma™ Steri-Cult™ CO2 Incubator Place a water tray containing sterilized ultrapure
water in the incubator, and then set the incubator to 37 °C, 95% humidity and 5% CO2.
Peltier Thermal Cycler Set and equilibrate to the required temperature before the tubes are
placed inside.
PROCEDURE
Supplementary procedure 1, Performing the cell handling and compound treatment
part of the TPP-TR experiments ● TIMING 7 h
▲CRITICAL STEP Part 1 of the PROCEDURE (Steps S1–S16) describes how to perform
the cell handling and compound treatment part of a TPP-TR experiment on intact K562 cells.
S1| Expand K562 cells in cell culture medium to a cell density of ~1.4 million cells per ml
using standard sterile cell culture procedures and supplies. Approximately 120 million K562
cells are required to perform four TPP-TR experiments. This experiment should be performed
at least two times, each on different days, in order to get statistically meaningful results.
S2| Add 24 ml of the K562 cell suspension (~1.4 million cells per ml) to each of four separate
T75 flasks. If you are working with adherent cells and it is desirable to avoid detaching the
cells, the compound incubation step could instead be performed in suitable cell culture flasks
or microplates, with the heating step (Step S12) performed either immediately after cell
detachment (using the method of your choice) or directly in the culture containers.
S3| Add 120 μl of a 200 µM solution of panobinostat in DMSO to individual flasks to
get a final concentration of 1 μM. Add the same volume of DMSO to the
remaining flask serving as the vehicle control (0.5% (vol/vol) final DMSO concentration). Gently mix the cell
suspension by pipetting up and down several times using a serological pipette.
▲CRITICAL STEP If the compound concentrations are adjusted, ensure that the
predetermined DMSO tolerability of the cells in question is not exceeded. As a rule of thumb,
avoid DMSO concentrations above 1% (vol/vol), but note that many cells are much more
sensitive, so it is advisable to do a pilot experiment to determine DMSO tolerance.
! CAUTION Use appropriate safety equipment and a controlled environment when working
with potentially toxic and mutagenic compounds. This is particularly important when
handling DMSO solutions, as the solvent is highly skin-permeable.
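The dosing arithmetic in Step S3 can be checked before pipetting. The following is a minimal sketch, not part of the published protocol; the function names are hypothetical, and the calculation assumes that the added volume simply adds to the culture volume.

```python
def final_concentration_um(stock_um, added_ul, culture_ml):
    """Final compound concentration (uM) after adding stock to the culture.

    Assumption: volumes are additive, so the total volume is the
    culture volume plus the added stock volume.
    """
    total_ul = culture_ml * 1000.0 + added_ul
    return stock_um * added_ul / total_ul


def solvent_percent(added_ul, culture_ml):
    """Final solvent (e.g. DMSO) content in % (vol/vol)."""
    total_ul = culture_ml * 1000.0 + added_ul
    return 100.0 * added_ul / total_ul


# Values from Step S3: 120 ul of 200 uM panobinostat added to 24 ml of cells.
conc = final_concentration_um(stock_um=200.0, added_ul=120.0, culture_ml=24.0)
dmso = solvent_percent(added_ul=120.0, culture_ml=24.0)

print("final concentration: %.3f uM" % conc)
print("final DMSO: %.3f %% (vol/vol)" % dmso)

# Rule of thumb from the CRITICAL STEP note: stay at or below 1% DMSO.
print("within 1%% DMSO rule of thumb: %s" % (dmso <= 1.0))
```

Both values come out marginally below the nominal 1 μM and 0.5% because the added 120 μl is counted in the total volume. If the compound concentration is changed, rerunning the calculation shows whether the DMSO content stays within the tolerance determined for the cells in use.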
S4| Incubate the cell culture flasks for 5 h in the CO2 incubator at 37 °C.
S5| Collect the cell suspension with a serological pipette and transfer the cells to marked 50-
ml conical tubes.
S6| Count the cell numbers and assess cell viability using a preferred method.
▲CRITICAL STEP It is important to examine whether the compound has acutely affected
the viability or membrane integrity of the compound-treated cells.
S7| Centrifuge the conical tubes at 300g for 3 min at room temperature to pellet the cells, and
then carefully remove and discard all of the culture medium.
▲CRITICAL STEP The centrifugal force may need to be adjusted if the cellular test system
is altered such that the cells are more sensitive to disruption during centrifugation or are more
difficult to pellet.
S8| Gently resuspend the cell pellets with 20 ml of ice-cold PBS and centrifuge them at 300g
for 3 min at room temperature to pellet the cells again. Carefully remove and discard all of the
supernatant. Repeat this step.
S9| Add 1.2 ml of ice-cold PBS supplemented with protease inhibitors to each respective tube
and carefully resuspend the cell pellet.
! CAUTION Protease inhibitors must be avoided if they interfere with the target or system
being evaluated. The extent of such interference can be tested in prior experiments for each
individual target protein.
S10| Distribute each cell suspension, i.e., with vehicle control or with the test compound, into
ten different 0.2-ml PCR tubes with 100 μl of cell suspension in each tube (~3 million cells
per tube). Mark each tube or strip with a designated temperature (37–67 °C). This yields a
total of 40 PCR tubes.
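The figure of ~3 million cells per tube follows from the volumes given in Steps S2, S9 and S10. A minimal sketch of this estimate is given below; the function name is hypothetical, and the calculation assumes no cell loss during the washes and no change in cell number during the compound incubation.

```python
def cells_per_tube(culture_ml, density_per_ml, resuspend_ml, aliquot_ul):
    """Estimate the number of cells in one aliquot of the resuspended pellet.

    Assumptions: every cell from the culture is recovered in the pellet and
    the resuspension volume equals the volume of buffer added.
    """
    total_cells = culture_ml * density_per_ml
    return total_cells * (aliquot_ul / 1000.0) / resuspend_ml


# TPP-TR (Steps S2, S9, S10): 24 ml at ~1.4 million cells per ml,
# resuspended in 1.2 ml, 100 ul per PCR tube.
tr_cells = cells_per_tube(24.0, 1.4e6, 1.2, 100.0)

# TPP-CCR (Steps S18, S27, S28): 2.5 ml at ~1.4 million cells per ml,
# resuspended in 125 ul, 100 ul per PCR tube.
ccr_cells = cells_per_tube(2.5, 1.4e6, 0.125, 100.0)

print("TPP-TR:  %.1f million cells per tube" % (tr_cells / 1e6))
print("TPP-CCR: %.1f million cells per tube" % (ccr_cells / 1e6))
```

Under these assumptions both formats give the same cell number per tube, so the downstream heating and lysis steps operate on comparable material.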
S11| Centrifuge the PCR tubes at 300g for 3 min at 4°C to pellet the cells, and then carefully
remove 80 µl of PBS. Gently tap the tubes to resuspend the cells. The tubes are kept at 4°C
before the heat treatment step.
Heat treatment of cell suspensions ●TIMING 10 min
S12| Heat the PCR tubes with the first four temperature endpoints (37°C, 41°C, 44°C, 47°C)
at their designated temperature for 3 min in the PTC-200 Peltier Thermal Cycler. Immediately
after heating, remove and incubate the tubes at room temperature for 3 min. After this 3-min
incubation, immediately snap-freeze the samples according to the instructions in Step S14.
▲CRITICAL STEP Do not let the temperature in the blocks rise to the designated
temperature while the tubes are in the cycler. Only place the tubes in the blocks when the
temperature has reached the designated temperature. It is crucial to ensure consistent timing
between tubes in both the heating and cooling steps, as well as between heat stages.
S13| In the meantime, heat the samples for the next four temperature points
(50°C, 53°C, 56°C, 59°C) at their designated temperature for 3 min in the PTC-200
Peltier Thermal Cycler. Immediately after heating, remove and incubate the tubes at room
temperature for 3 min. After this 3-min incubation, immediately snap-freeze the samples
according to the instructions in Step S14. Finally, repeat this for the remaining two
temperature points (63°C and 67°C): heat the tubes at their designated temperature for 3 min
in the PTC-200 Peltier Thermal Cycler, then immediately remove and incubate them at room
temperature for 3 min.
▲CRITICAL STEP Do not let the temperature in the blocks rise to the designated
temperature while the tubes are in the cycler. Only place the tubes in the blocks when the
temperature has reached the designated temperature. It is crucial to ensure consistent timing
between tubes in both the heating and cooling steps.
S14| Snap-freeze the heat-treated cell suspensions in liquid nitrogen.
■ PAUSE POINT The experiment can be paused here, with the samples kept at −80 °C
overnight.
Cell lysis ●TIMING 1 h
S15| Freeze-thaw the cells twice using liquid nitrogen and a thermal cycler or heating block
set at 25 °C in order to ensure a uniform temperature between tubes. The tubes are vortexed
briefly after each thawing. The resulting cell lysates are kept on ice (4 °C) after the last
thawing step.
S16| Briefly vortex the tubes, transfer the cell lysates to 0.2 ml polycarbonate thickwall tubes
and centrifuge the cell lysate–containing tubes at 100,000g for 20 min at 4 °C to pellet cell
debris together with precipitated and aggregated proteins. Carefully remove the tubes from the
centrifuge and avoid disturbing the pellets. Keep the samples on ice in a cooling block.
Continue with step 3 in the main protocol.
PROCEDURE
Supplementary procedure 2, Performing the cell handling and compound treatment
part of the TPP-CCR experiments ● TIMING 7 h
▲CRITICAL STEP The cell handling and compound treatment part of a TPP-CCR is
similar to that of a TPP-TR experiment, except that the compound concentration is varied
instead of the temperature when heating the cells. Steps S17–S33 show how to perform the
cell handling and compound treatment part of a concentration-response experiment at 55 °C
on the basis of 9 different compound concentrations.
S17| Expand K562 cells in cell culture medium to a cell density of ~1.4 million cells per ml
using standard sterile cell culture procedures and supplies. Approximately 60 million K562
cells are needed for two TPP-CCR experiments.
S18| Transfer ten 2.5-ml aliquots of cells at a density of ~1.4 million cells per ml to 6-well plates and
label the wells with the compound concentration that will be applied (10, 2.5, 0.625, 0.15625,
0.03906, 0.00977, 0.00244, 0.00061, 0.00015 µM and vehicle).
S19| Place 60 μl of a 2 mM DMSO stock solution of panobinostat in one well of column 1 of
a 96U NUNC plate. Place 45 μl of DMSO in columns 2–10 of the same plate row.
! CAUTION Use appropriate safety equipment and a controlled environment when working
with potentially toxic and mutagenic compounds. This is particularly important when you are
handling DMSO solutions, as the solvent is highly skin-permeable.
S20| Serially dilute the stock solution by transferring 15 μl from column 1 to column 2 and by
mixing thoroughly with the predispensed DMSO by a minimum of seven aspiration and
dispensing steps. Continue this process by moving one column at a time, i.e., 2–3, 3–4 and
so on, until the sample in column 9 has been thoroughly mixed. Remove 15 μl from column 9
to ensure the same volume of 45 μl in all wells. This procedure generates a 9-point dose-
response curve with fourfold difference in concentration between wells. Column 10 with
added DMSO serves as a vehicle control.
S21| Split the serial dilutions into two copies by transferring 22 μl of all solutions to a second
96U NUNC plate. Place one of the copies in a −20 °C freezer as backup.
S22| Transfer 12.5 μl of all diluted compound solutions to the corresponding labeled well of
the 6-well plate containing the 2.5 ml of 1.4 million cells per ml.
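The well labels in Step S18 can be reproduced from the dilution scheme in Steps S19–S22. The sketch below is illustrative only; the function name is hypothetical, and the final concentrations use the nominal 1:200 dilution (12.5 μl into 2.5 ml), ignoring the small volume added.

```python
def dilution_series(stock_um, transfer_ul, diluent_ul, n_points):
    """Concentrations (uM) along a serial dilution, starting with the stock.

    Each step transfers `transfer_ul` into `diluent_ul` of predispensed
    solvent, so the dilution factor per step is
    (transfer_ul + diluent_ul) / transfer_ul.
    """
    factor = (transfer_ul + diluent_ul) / float(transfer_ul)
    return [stock_um / factor ** i for i in range(n_points)]


# Steps S19-S20: 2 mM (2000 uM) stock, 15 ul transferred into 45 ul of DMSO,
# nine concentration points (columns 1-9).
plate_um = dilution_series(stock_um=2000.0, transfer_ul=15.0,
                           diluent_ul=45.0, n_points=9)

# Step S22: 12.5 ul of each solution into 2.5 ml of cells (nominal 1:200).
final_um = [c * 12.5 / 2500.0 for c in plate_um]

for column, (plate, final) in enumerate(zip(plate_um, final_um), start=1):
    print("column %d: %10.5f uM in plate -> %8.5f uM on cells"
          % (column, plate, final))
```

Rounded to five decimal places, the final concentrations match the labels listed in Step S18, and the DMSO content in every well, including the vehicle control, is nominally 0.5% (vol/vol).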
S23| Incubate the plate for 5 h in the CO2 incubator at 37 °C.
S24| Transfer each of the ten 2.5-ml cell suspensions into a correspondingly labeled 15-ml
conical tube.
S25| Centrifuge the conical tubes at 300g for 3 min at room temperature to pellet the cells,
and then carefully remove and discard all of the culture medium.
▲CRITICAL STEP The centrifugal force may need to be adjusted if the cellular test system
is altered such that the cells are more sensitive to disruption during centrifugation or are more
difficult to pellet.
S26| Gently resuspend the cell pellets with 10 ml of PBS and centrifuge them at 300g for 3
min at room temperature to pellet the cells again. Carefully remove and discard all of the
supernatant. Repeat this step.
S27| Add 125 µl of ice-cold PBS supplemented with protease inhibitors to each respective
tube and carefully resuspend the cell pellet.
! CAUTION Protease inhibitors must be avoided if they interfere with the target or system
being evaluated. The extent of such interference can be tested in prior experiments for each
individual target protein.
S28| Transfer 100 µl of each of the ten cell suspensions into its own 0.2-ml PCR tube (~3 million
cells per tube). Mark each tube with its designated concentration.
S29| Centrifuge the PCR tubes at 300g for 3 min at 4°C to pellet the cells, and then carefully
remove 80 µl of PBS. Gently tap the tubes to resuspend the cells. The tubes are kept at 4°C
before the heat treatment step.
Heat treatment of cell suspensions ● TIMING 10 min
S30| Place the PCR tubes into the cycler and heat the cells for 3 min at 55 °C.
S31| Remove the tubes from the instrument and place them in an aluminum block at room
temperature for 3 min to ensure consistent cooling between wells.
Cell lysis ●TIMING 1 h
S32| Freeze-thaw the cells twice using liquid nitrogen and a thermal cycler or heating block
set at 25 °C in order to ensure a uniform temperature between tubes. The tubes are vortexed
briefly after each thawing. The resulting cell lysates are kept on ice (4 °C) after the last
thawing step.
S33| Briefly vortex the tubes, transfer the cell lysates to 0.2 ml polycarbonate thickwall tubes
and centrifuge the cell lysate–containing tubes at 100,000g for 20 min at 4 °C to pellet cell
debris together with precipitated and aggregated proteins. Carefully remove the tubes from the
centrifuge and avoid disturbing the pellets. Keep the samples on ice in a cooling block.
Continue with step 3 in the main protocol.
For detailed discussion of the experimental steps and troubleshooting please consult the Jafari
et al3 protocol.
References
3. R. Jafari, H. Almqvist, H. Axelsson et al., Nat Protoc 9 (9), 2100 (2014).
Supplementary Manual: isobarQuant
Table of Contents
INSTALLATION GUIDE
INSTALLATION REQUIREMENTS
OVERVIEW OF WORKFLOW
RUNNING THE PREMASCOT WORKFLOW
RESULTS
EXPLANATION OF PREMASCOT.CFG PARAMETERS
Runtime Section
Logging Section
General Section
DETAILED DESCRIPTION OF THE PREMASCOT WORKFLOW
Data Extraction from .raw Files
Creating Files for Mascot Searching
MASCOT SEARCHES
RUNNING THE POSTMASCOT WORKFLOW
RESULTS
EXPLANATION OF POSTMASCOT.CFG PARAMETERS
Runtime Section
Logging Section
General Section
Quantification Section
DETAILED DESCRIPTION OF THE POSTMASCOT WORKFLOW
Transfer and QC of Mascot-Derived Peptides Matches
Hook Peptides
Protein Inference
Performing Quantification Correction
Performing Protein Quantification
Output Creation
ADDING NEW QUANTIFICATION METHODS
APPENDIX 1: RESULTS TEXT FILES DESCRIPTION
PROTEIN OUTPUT
PEPTIDE OUTPUT
SUMMARY OUTPUT
APPENDIX 2: RESULTS .HDF5 FILE DESCRIPTION
FDRDATA
CONFIGPARAMETERS
PROTEINHIT
PEPTIDE
SPECTRUM
SAMPLE
SPECQUANT
PROTEINQUANT
STATISTICS
ISOTOPES
APPENDIX 3: MS.HDF5 FILE DESCRIPTION
TOP LEVEL TABLE
RAWDATA SECTION
MASCOT DATA SECTION
APPENDIX 4: DIRECTORY STRUCTURE OF ISOBARQUANT
APPENDIX 5: TROUBLE SHOOTING
Installation Guide
Installation Requirements
isobarQuant Package
The Python code developed by us is hosted on GitHub and may be downloaded by following the link:
https://github.com/protcode/isob/archive/1.0.0.zip
When extracted, this package also contains a License: please read it! By using the isobarQuant software you agree to
be bound by the terms outlined in it. If this is extracted directly to ‘C:\’ it will likely be ‘C:\isob-1.0.0’.
Python
This software was developed using Python 2.7 (32-bit version). It requires the following Python modules to be
installed:
numpy
numexpr
tables
pathlib
To install Python and the required packages simply follow the instructions below. This information is also given in
the README.txt file which is also found in the .zip file. Any updates to the required libraries will be reflected in the
README file; the user is therefore recommended to read that first and adapt the names provided below where
necessary.
The example below is for 2.7.9 and assumes that c:\temp is the location of all files downloaded. Adapt this where
necessary to a directory name of your choice. If using an earlier version of Python than 2.7.9 it might be necessary
to install the Python 'pip' package as well. Ways to do this can be found here:
https://pip.pypa.io/en/latest/installing.html
On a Windows machine with internet access, proceed as follows:
1) Download the following Microsoft installer (msi) file: https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi
Execute the .msi file and follow instructions to install Python to a location of your choice. Ensure that Python is
added to the system path (this is the bottom option on the first page of the installer after the point where the user is
requested to set the location for python to be installed)
2) Within the contents of the extracted .zip file, there are 3 binary (.whl) files taken from the external repository
maintained by Christoph Gohlke (http://www.lfd.uci.edu/~gohlke/pythonlibs ), which correspond to Python version
2.7 on a 32-bit Windows machine and were used in development and testing of this tool
numpy-1.8.2+mkl-cp27-none-win32.whl
numexpr-2.4-cp27-none-win32.whl
tables-3.2.0-cp27-none-win32.whl
These should be installed via the command prompt in the order given below:
3) From a command prompt run:
C:\>pip install pathlib
C:\>pip install c:\isob-1.0.0\numpy-1.8.2+mkl-cp27-none-win32.whl
C:\>pip install c:\isob-1.0.0\numexpr-2.4-cp27-none-win32.whl
C:\>pip install c:\isob-1.0.0\tables-3.2.0-cp27-none-win32.whl
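After the pip commands have completed, it can be useful to confirm that the four required modules can actually be imported. The following is a minimal sketch and is not part of the isobarQuant package; the function name `check_modules` is hypothetical. It uses only `importlib.import_module`, which is available in Python 2.7 and later.

```python
import importlib

# Modules listed as requirements in this manual.
REQUIRED_MODULES = ["numpy", "numexpr", "tables", "pathlib"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Print one line per module so a missing package is easy to spot.
    for name, ok in sorted(check_modules(REQUIRED_MODULES).items()):
        print("%-8s %s" % (name, "OK" if ok else "MISSING"))
```

Saved to a file of your choice and run with the newly installed interpreter, the script lists any module reported as MISSING, which can then be reinstalled with the corresponding pip command.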
Alternatively, if the computer onto which you are installing the packages is behind a firewall / unable to connect to
the internet the following procedure should be followed:
1) On a computer with internet connection, download the following Microsoft installer (msi) file:
https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi and copy it to a directory of your choosing on the
computer where you are installing isobarQuant. e.g. c:\temp
2) On a computer with internet connection, download pathlib from:
https://pypi.python.org/packages/source/p/pathlib/pathlib-1.0.1.tar.gz and copy to a directory of your choosing
on the computer where the installation is being done e.g. c:\temp
3) Execute the .msi file and follow instructions to install Python to a location of your choice. Ensure that Python is
added to the system path (this is the bottom option on the first page of the installer after the point where the
user is requested to set the location for python to be installed)
4) On the computer where the installation is being carried out, open a command prompt and run the following
commands (the first 3 are included in this download and are taken from the external repository maintained by
Christoph Gohlke (http://www.lfd.uci.edu/~gohlke/pythonlibs )):
C:\>pip install c:\isob-1.0.0\numpy-1.8.2+mkl-cp27-none-win32.whl
C:\>pip install c:\isob-1.0.0\numexpr-2.4-cp27-none-win32.whl
C:\>pip install c:\isob-1.0.0\tables-3.2.0-cp27-none-win32.whl
C:\>pip install c:\temp\pathlib-1.0.1.tar.gz
The required Python packages are now installed. The downloaded files in c:\temp may now be deleted.
Xcalibur Software (Thermo Scientific™)
The Xcalibur Foundation installation is required for the preMascot functionality of this software (tested with Xcalibur
versions 2.2 and 3.0).
NOTE: The need to link to Xcalibur modules imposes the Windows 32-bit requirement on the preMascot section of the
code. We believe that any Windows environment that allows the successful installation and running of Xcalibur will
be compatible with the analysis code. However, we have only used it on WindowsNT, Win7 and Windows Server
2012.
The postMascot section does not require Xcalibur and can be run in any operating system. We have tested it in
Win7, Windows Server 2012 and Red Hat Enterprise Linux Server release 6.6.
In order to run isobarQuant the user should change to the directory into which the code was unpacked from the .zip
file.
The intermediate results and data are all stored in HDF5 format since this allows the efficient storage and retrieval of
the large quantities of data common in .raw files. See http://www.hdfgroup.org/HDF5/whatishdf5.html for more
information about this format. As it is a binary format, it is not possible to view it in standard text editors. There are
several possible viewers for these .hdf5 files, a list of which is maintained by the HDF Group:
http://www.hdfgroup.org/products/hdf5_tools/index.html.
Overview of Workflow
The isobarQuant workflow is split into two parts. The first part (preMascot) deals with the processing of raw Xcalibur
data files. The data from the .raw file is converted to HDF5 and then the raw spectra from the mass spectrometer
are processed, performing ion detection on profile data. The expected reporter ions for isobaric tagged
quantification are also detected at this point. The accuracy of precursor ion masses is enhanced by detecting the
elution profile of the precursor ion and its isotopes in the MS1 data and reporting the intensity-weighted mean of
the precursor mass for the top (greater than 50% maximum intensity region) of the precursor peak. This data is then
used in the generation of peak lists, stored in Mascot generic format, ready for manual submission to Mascot for
searching.
NOTE: The code does not perform automated Mascot searches. These are left for the user to run manually with the
generated .mgf files.
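The precursor-mass refinement described in the overview above (intensity-weighted mean over the top of the elution peak) can be illustrated with a short calculation. This is a simplified sketch and not the isobarQuant implementation: the function name is hypothetical, the example values are invented for illustration, and the detection of the elution profile and of the isotopes is assumed to have been done already.

```python
def refined_precursor_mz(mz_values, intensities, fraction=0.5):
    """Intensity-weighted mean m/z over the top region of an elution peak.

    Only MS1 observations whose intensity is greater than `fraction` times
    the maximum intensity contribute to the mean.
    """
    if len(mz_values) != len(intensities) or not intensities:
        raise ValueError("need equally long, non-empty m/z and intensity lists")
    max_intensity = max(intensities)
    if max_intensity <= 0:
        raise ValueError("maximum intensity must be positive")
    threshold = fraction * max_intensity
    top = [(mz, inten) for mz, inten in zip(mz_values, intensities)
           if inten > threshold]
    total = sum(inten for _, inten in top)
    return sum(mz * inten for mz, inten in top) / total


# Invented example: one precursor observed in five consecutive MS1 scans.
mz_obs = [500.2510, 500.2504, 500.2500, 500.2498, 500.2506]
int_obs = [1.0e4, 6.0e4, 1.0e5, 7.0e4, 2.0e4]

refined = refined_precursor_mz(mz_obs, int_obs)
print("refined precursor m/z: %.5f" % refined)
```

In this example only the three central scans exceed 50% of the maximum intensity, so the low-intensity measurements at the edges of the peak, which tend to be less accurate, do not influence the reported mass.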
Figure 1. Flow diagram for the two-stage proteomics workflow code. The scheme shows not only the application
functionality but also the interaction with data files utilized and created during an analysis of multiple files. Solid
arrows indicate the application flow; dashed arrows indicate the flow of data to or from a data file; dotted arrows
indicate that the data files are the same entity but have been reproduced in a new location in the scheme for clarity;
red lines indicate that the user must perform the step manually.
The second part of the analysis (postMascot) collates the Mascot results with the MS data by extracting data from
the Mascot results file, linking the Mascot query information to the Xcalibur spectrum information and performing
peptide level QC. The relevant data is then stored in a separate section of the MS.hdf5 file. From this point it is
possible to perform protein inference and quantification on individual files or collectively by merging the data from
multiple files. The results of the protein inference are stored in a newly created .hdf5 file (results.hdf5) and reports
are generated from this as simple tab-delimited text files, which the user may then inspect in Microsoft Excel or use
in other data analysis software. The protein output text file is intended for use as direct input for the ‘R’ package
‘TPP’, as described in the associated manuscript.
Both pre- and postMascot parts of the workflow are performed using Python scripts executed on the command line
and are designed to run with the minimum number of parameters to make this as simple as possible. Additional
parameters are found in configuration (.cfg) files corresponding to either part. The preMascot code identifies any
Xcalibur .raw files in the specified data directory and processes each sequentially; generating one .hdf5 and one .mgf
file per .raw file. The .mgf files may then be submitted to Mascot for searching. Once complete, the Mascot results
(.dat) files should be copied to the data directory for the subsequent postMascot workflow. Provided they are
stored in the same data directory each .dat file is automatically matched to the appropriate .hdf5 file in the initial
phase of the postMascot process, and the Mascot results are transferred to the .hdf5 file. Using a single parameter,
specified on the command line, the user can control whether protein inference is performed either on each
individual file or on the combined data set once all .hdf5 files have been appended with their Mascot data (see
Figure 1 for more details of the workflow and of file creation and usage).
Log files are created by each stage of the workflow; however, the relevant logs for each section are deleted at the
start of the preMascot or postMascot step in order to prevent excessive disk usage.
Running the preMascot Workflow
Having opened a command prompt, the user should change to the directory where the isobarQuant package has
been extracted, likely called C:\isob-1.0.0 if the .zip was extracted to c:\.
The preMascot workflow (preMascot.py) is controlled using a combination of command line arguments and a
configuration file (preMascot.cfg). There are two essential parameters that must be specified each time
preMascot.py is run: datadir and quant. These are provided using the syntax:
C:\isobarQuant-1.0.0>python preMascot.py --datadir c:\vehicle_1 --quant TMT10
Here we have used "vehicle_1" as a possible directory name. We suggest that the user think of a name which makes
sense within the context of the data being processed. This directory should be created and the relevant Xcalibur
.raw file(s) to be processed copied there. The quant parameter is shown here as "TMT10" but can be any of the
quantification methods defined in the QuantMethod.cfg file or 0 (zero) or "none". The zero setting tells the
software to look at each filename to find a valid quantification method name in the name. If a valid method is found
then the file is processed with this method, otherwise no quantification is performed. The "none" setting disables
any quantification.
Default values for the datadir and quant parameters are also defined in the [runtime] section of preMascot.cfg. If
these are appropriate, preMascot.py can be started via the shortened syntax:
C:\isobarQuant-1.0.0>python preMascot.py --d
The configuration file preMascot.cfg defines a variety of further parameters that control the functionality of
preMascot.py. These are organized in sections, which are separated by section headers indicated by square
brackets. The parameter values can be either modified in place or specified on the command line for a one-off
analysis according to the syntax:
--<section>.<parameter> <value>
Where <section> is the section name, <parameter> is the parameter name and <value> is the new value for the
parameter.
Note: parameter values that are specified via the command line only apply to that specific run of preMascot.py or
postMascot.py. They are not changed in the corresponding .cfg file.
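The one-off override behaviour, in which a command line value replaces a configuration value for a single run while the .cfg file stays unchanged, can be sketched with the Python 3 standard library. The helper apply_overrides and the inline configuration text are hypothetical and only illustrate the --<section>.<parameter> <value> convention; this is not the parser used by preMascot.py.

```python
import configparser


def apply_overrides(config, args):
    """Apply '--<section>.<parameter> <value>' pairs to an in-memory config.

    The file on disk is never written, so the override only affects this run.
    """
    if len(args) % 2:
        raise ValueError("every option needs a value")
    for option, value in zip(args[0::2], args[1::2]):
        if not option.startswith("--") or "." not in option:
            raise ValueError("expected --<section>.<parameter>, got %r" % option)
        section, parameter = option[2:].split(".", 1)
        if not config.has_section(section):
            config.add_section(section)
        config.set(section, parameter, value)
    return config


# Illustrative stand-in for the [general] section of a .cfg file.
config = configparser.ConfigParser()
config.read_string("""
[general]
tolppm = 8
tolmda = 8
""")

apply_overrides(config, ["--general.tolppm", "5"])
print(config.get("general", "tolppm"))  # 5
print(config.get("general", "tolmda"))  # 8
```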
Results
For each .raw file in the selected directory two new files (.hdf5 and .mgf) will be created with the same name as the
.raw but different file extensions. The .mgf file is used for Mascot searching and the .hdf5 is the MS.hdf5 data
repository which contains the extracted MS data. The results of the Mascot search will be added in the postMascot
workflow.
Explanation of preMascot.cfg Parameters
The master configuration files are used to expose the most commonly changed parameters for ease of use. Relevant
data from the configuration is passed into each sub-process. The preMascot.cfg file is split into the three sections
[runtime], [logging] and [general]. Within any section an optional parameter "parameterconversion" may be
present. When present, this forces the conversion of parameters into the specified type e.g. path, float, integer,
Boolean, list.
Runtime Section
This section contains the parameters that must be provided on the command line. They deal with where the data is
located and the type of quantification used.
Parameter | Description | Initial Value
datadir | The directory where the Xcalibur .raw files can be found | c:\myproject
quant | The quantification method to be used. A value of 0 forces the code to extract a quantification method name from the .raw filenames. | 0
Logging Section
This contains the parameters to control the various log outputs of the preMascot analysis. The data here is passed
into all sub-processes so that all log files and screen output show the same level of detail. Please note: running in
DEBUG mode adds significantly to the time taken to complete the workflow; it is recommended to use it only to
track problems.
Parameter | Description | Initial Value
logdir | Subdirectory of the directory specified by datadir (above) where the log files of all the components are placed. | Logs
logfile | Filename for the log file | preMascot.log
screenlevel | Level of log details that are displayed on screen while the application is running. Levels = WARNING, INFO, DEBUG. | INFO
loglevel | Level of log details that are recorded in the log file. Levels = WARNING, INFO, DEBUG. | INFO
General Section
Contains the parameters controlling the preMascot workflow. The tolppm and tolmda parameters are used in
comparing masses and XIC creation. The actual error used is the maximum of the tolmda and the absolute error
calculated from the ion m/z and tolppm.
Parameter | Description | Initial Value
tolppm | Relative mass accuracy of the MS data in ppm. | 8
tolmda | Absolute mass accuracy of the MS data in mDa. | 8
beforepeak | Retention time window before the peak apex in which a MS/MS event can be linked to the peak (in minutes) | 0.5
afterpeak | Retention time window after the peak apex in which a MS/MS event can be linked to the peak (in minutes) | 0.5
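The way tolppm and tolmda combine can be made concrete with a short sketch: the allowed error is the larger of the fixed mDa window and the ppm window computed at the ion m/z. The function names are hypothetical, and applying the mDa value directly on the m/z scale is an assumption of this sketch rather than a statement about the isobarQuant code.

```python
def mass_tolerance(mz, tolppm=8.0, tolmda=8.0):
    """Return the allowed absolute error (in m/z units) for an ion at mz."""
    ppm_window = mz * tolppm / 1e6   # relative tolerance scaled to this m/z
    mda_window = tolmda / 1000.0     # absolute tolerance, mDa converted to Da
    return max(ppm_window, mda_window)


def masses_match(mz_a, mz_b, tolppm=8.0, tolmda=8.0):
    """True if two m/z values agree within the combined tolerance."""
    return abs(mz_a - mz_b) <= mass_tolerance(mz_a, tolppm, tolmda)


for mz in (500.0, 1000.0, 2000.0):
    print(mz, mass_tolerance(mz))
```

With the initial values of 8 ppm and 8 mDa the absolute window dominates below m/z 1000 and the relative window dominates above it.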
Figure 2. Scheme for the data extraction part of the processing flow of the preMascot workflow. Gray boxes
highlight the section of the workflow that performs the extraction of MS data and parameters from Xcalibur .raw
files. Green boxes highlight the section performing the reanalysis of the precursor ion data using chromatographic data
to improve the precursor ion accuracy.
Detailed Description of the preMascot Workflow
The preMascot workflow is concerned with the extraction of data from the Xcalibur .raw files and the storage of the
information in a data repository (the MS.hdf5 file). Additionally, improvements are made to the precursor ion mass
accuracy. The second part of the preMascot workflow prepares searchable data for Mascot. All MS/MS spectra are
filtered to improve the data quality and maximize Mascot’s ability to link spectra to peptide sequences. These data
files are simple text files that conform to Mascot's own generic file format (.mgf).
Data Extraction from .raw Files
The data extraction is performed by an application in the preMascot analysis called pyMSsafe (see figure 2). After
initialization using its own configuration file and parameters passed from the preMascot configuration, pyMSsafe
opens the Xcalibur .raw file and reads the tune and instrument methods, including HPLC method data. This data is
then transferred into a newly created .hdf5 file (MS.hdf5) which acts as a data repository. pyMSsafe then processes
all spectra in the .raw file.
With MS data it performs peak detection and centroiding if acquired in profile mode. All found peaks are stored in
the data repository. The application also calculates the signal-to-interference values for any precursor ions relating
to the current MS spectrum. This is done by extracting all ions from the region of the MS spectrum corresponding to
the precursor isolation window and assigning the ions as either belonging to the precursor isotopic cluster or
interference ions.
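A minimal sketch of a signal-to-interference calculation follows. It assumes that S2I is the fraction of the total intensity in the isolation window that belongs to the precursor isotopic cluster; this definition, the function name and the matching tolerance are assumptions made for illustration and are not stated in this document.

```python
def signal_to_interference(window_ions, cluster_mzs, tolerance=0.008):
    """Fraction of isolation-window intensity assigned to the precursor cluster.

    window_ions: list of (mz, intensity) found inside the isolation window.
    cluster_mzs: expected m/z values of the precursor isotope peaks.
    tolerance:   matching window in m/z units (hypothetical default).
    Returns None when the window contains no signal.
    """
    signal = 0.0
    total = 0.0
    for mz, intensity in window_ions:
        total += intensity
        # An ion counts as precursor signal if it matches any isotope peak.
        if any(abs(mz - expected) <= tolerance for expected in cluster_mzs):
            signal += intensity
    if total == 0:
        return None
    return signal / total


ions = [(500.25, 80.0), (500.75, 40.0), (500.40, 30.0), (501.10, 50.0)]
print(signal_to_interference(ions, cluster_mzs=[500.25, 500.75]))
```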
MS/MS data are also centroided if they are acquired in profile mode and the ions stored.
The reporter ions from isobaric quantification methods are detected and their intensity data stored. All reporter
ions contain natural carbon, which is subject to the normal 13C impurities, and because the ions carry heavy
isotope labels they are also subject to misincorporation of the heavy labels. This means that each label can influence
adjacent labels. For each label we have measured the relative intensities of these satellite ions, compared to the
main reporter ion, and use this data to correct for the influence of other labels. pyMSsafe applies these corrections
at this point and stores the isotope corrected data alongside the raw, unmodified, intensity data.
Once all spectra have been processed, pyMSsafe attempts to reassign the precursor m/z using chromatographic
peak data. Essentially, ion chromatograms (XIC) are constructed for all ions in the precursor ion isotope cluster.
Chromatographic peaks are detected in the XICs and these are linked together to form isotope clusters. The cluster
intensity data is compared to the theoretical isotope intensities of an averagine model with similar m/z. The cluster
with the best least-squares fit (lowest residual) is selected as representing the precursor. The new m/z is calculated
from the intensity weighted average m/z of the top of the monoisotopic peak.
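The cluster selection can be sketched as a comparison of normalised isotope intensities against a theoretical pattern. The theoretical values below are placeholder numbers, not the output of an averagine calculation, and the function names and the candidate layout are hypothetical.

```python
def normalise(intensities):
    """Scale intensities so that they sum to 1."""
    total = float(sum(intensities))
    return [value / total for value in intensities]


def squared_error(observed, theoretical):
    """Sum of squared differences between two normalised isotope patterns."""
    obs = normalise(observed)
    theo = normalise(theoretical)
    return sum((o - t) ** 2 for o, t in zip(obs, theo))


def best_cluster(candidates, theoretical):
    """Pick the candidate cluster with the lowest squared error."""
    return min(candidates,
               key=lambda c: squared_error(c["intensities"], theoretical))


# Placeholder theoretical pattern for the first three isotope peaks.
theoretical = [0.55, 0.30, 0.15]
candidates = [
    {"mz": 700.35, "intensities": [1100.0, 610.0, 290.0]},
    {"mz": 700.85, "intensities": [300.0, 900.0, 800.0]},
]
print(best_cluster(candidates, theoretical)["mz"])
```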
Creating Files for Mascot Searching
The second part of the preMascot workflow is the creation of data files for use in Mascot searches (see figure 3).
Mascot provides an accessible file format, called Mascot Generic Format (.mgf). All MS/MS spectra are read from
the MS.hdf5 file and following the application of selected filters are exported into the .mgf file. In addition to this,
the Mascot generic format allows for a TITLE field for each spectrum. The application uses this field to store some
information about the MS/MS event. The most critical of these is the msms_id which contains the Xcalibur spectrum
number, and is the way that the workflows link Mascot results with the Xcalibur spectra which they came from.
Filtering removes ions from the MS/MS spectrum that would cause interference in the Mascot matching process.
We have developed a number of filters for our data and find that different filters are required for the high
resolution, high energy collision data produced by HCD fragmentation compared to other fragmentation methods.
For lower energy, low resolution data we use a simple filter that selects the four most intense ions in each segment
of the MS/MS spectrum. The segment size varies with the charge state of the precursor: spectra from 1+ and 2+
precursors are segmented every 100 Th and spectra from more highly charged precursors are segmented every
50 Th.
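The segment-based filter for lower energy, low resolution data can be sketched as follows. The sketch assumes that segments are aligned to multiples of the segment width starting at 0 Th and it omits any removal of 13C ions; both points, and the function name, are assumptions for illustration.

```python
def zone_filter(ions, precursor_charge, ions_per_zone=4):
    """Keep the most intense ions in each m/z segment of an MS/MS spectrum.

    ions: list of (mz, intensity).
    Segments are 100 Th wide for 1+ and 2+ precursors, otherwise 50 Th.
    """
    width = 100.0 if precursor_charge <= 2 else 50.0
    zones = {}
    for mz, intensity in ions:
        zones.setdefault(int(mz // width), []).append((mz, intensity))
    kept = []
    for members in zones.values():
        # Rank the ions of this segment by intensity, highest first.
        members.sort(key=lambda ion: ion[1], reverse=True)
        kept.extend(members[:ions_per_zone])
    return sorted(kept)


spectrum = [(110.0, 1.0), (120.0, 2.0), (130.0, 3.0), (140.0, 4.0),
            (160.0, 5.0), (170.0, 6.0), (250.0, 7.0)]
print(zone_filter(spectrum, precursor_charge=2))
print(zone_filter(spectrum, precursor_charge=3))
```

For the 2+ precursor the six ions between 100 and 200 Th compete for four places, whereas for the 3+ precursor the narrower segments let all seven ions through.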
For HCD data we remove the reporter ions, because in this mode they often have high intensities and disrupt the Mascot
scoring. Additionally, we deconvolute the spectrum so that all ions with multiple charges are recalculated as singly
charged ions. The final filter combines the deconvoluted ions into a single ion if their new m/z are within instrument
accuracy of each other. The new ion has an intensity equal to the sum of the intensities of all combined ions and a
m/z equal to the intensity weighted mean of the combined ions. This marks the end of the preMascot workflow.
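The charge deconvolution and the final merging step can be sketched as below. The charge of each fragment ion is taken as given (how it is determined is not covered here), neighbouring ions are chained together whenever the gap to the previous ion is within tolerance, and the function names and the tolerance value are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in Da


def to_singly_charged(mz, charge):
    """Recalculate the m/z of a multiply charged ion as its 1+ equivalent."""
    return mz * charge - (charge - 1) * PROTON


def combine(group):
    """Replace a group of ions by one ion: summed intensity, weighted mean m/z."""
    total = sum(intensity for _, intensity in group)
    mean_mz = sum(mz * intensity for mz, intensity in group) / total
    return (mean_mz, total)


def merge_close_ions(ions, tolerance=0.008):
    """Merge ions whose m/z values lie within tolerance of their neighbour."""
    merged = []
    group = []
    for mz, intensity in sorted(ions):
        if group and mz - group[-1][0] > tolerance:
            merged.append(combine(group))
            group = []
        group.append((mz, intensity))
    if group:
        merged.append(combine(group))
    return merged


print(to_singly_charged(500.0, 2))
print(merge_close_ions([(998.992, 100.0), (998.996, 300.0), (1200.5, 50.0)]))
```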
Figure 3. Scheme for the part of the preMascot workflow dealing with the creation of search files for Mascot
analysis.
Table 1. List of MS/MS ion filters and available manipulation methods.
Filter | Description
ten | Orders the ions by intensity and removes the 10% of the ions with the lowest intensity.
zone | Removes 13C ions and then selects the most intense 4 ions per 100 Th for 1+ and 2+ precursors and per 50 Th for precursors with higher charge states.
repion | Removes the reporter ions from the spectrum.
deconv | Identifies ions that are multiply charged in the spectrum and deconvolutes them to singly charged ions.
neutralloss | Removes selected neutral losses from the precursor ion. The losses to remove are defined in the mgf.cfg file under [msmsfilters]neutrals, by their mass delta from the precursor.
multi | Groups together ions that have been deconvoluted and are within instrument tolerance of each other. The group is replaced by a single ion of m/z = intensity weighted average of all precursor m/z in the group and intensity = the sum of all intensities in the group.
immonium | Removes selected immonium ions. The immonium ions to remove are defined in the mgf.cfg file under [msmsfilters]immonium, by the expected m/z for each immonium ion.
Mascot Searches
Searches are started manually using the Mascot search interface in the normal way with the .mgf files created in the
step above. As soon as the searches are finished, the resulting .dat files should be exported and copied to the
directory where the MS.hdf5 and .raw data are. For FDR estimation (on peptide and protein level) to be carried out,
it is necessary to perform the Mascot searches against a single .fasta file containing both forward and reverse
(decoy) protein sequences. How this file may be created is described in more detail here:
http://www.matrixscience.com/help/decoy_help.html. If this is not possible, the workflow will not fail, but no FDR
estimation will be performed and consequently no filtering based on FDR will be available.
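How a forward/reverse search supports FDR estimation can be sketched with a simple target-decoy count. The sketch assumes that the FDR at a score cut-off is estimated as the number of decoy matches divided by the number of forward matches at or above that score; this formula and the function name are assumptions, since the exact estimator used by the workflow is not given in this document.

```python
def score_threshold_for_fdr(matches, fdr_threshold=0.01):
    """Lowest score cut-off whose estimated FDR stays within fdr_threshold.

    matches: list of (score, is_decoy) for the top match of each spectrum.
    Returns None if no cut-off satisfies the threshold.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = 0
    decoys = 0
    best = None
    for index, (score, is_decoy) in enumerate(ordered):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only once all matches sharing this score are counted.
        last_of_tie = (index + 1 == len(ordered)
                       or ordered[index + 1][0] != score)
        if last_of_tie and targets and decoys / targets <= fdr_threshold:
            best = score
    return best


matches = [(60, False), (55, False), (50, False),
           (45, True), (40, False), (35, False)]
print(score_threshold_for_fdr(matches, fdr_threshold=0.2))
print(score_threshold_for_fdr(matches, fdr_threshold=0.01))
```

Without decoy sequences in the searched .fasta file there would be no decoy matches to count, which is why no FDR-based filtering is available in that case.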
It is recommended that the Mascot configuration (mascot.dat) file is changed to allow the information of all
identified proteins to be stored in the results file (this gives the postMascot workflow access to protein / gene
names, descriptions and molecular weight data of all proteins) including those proteins with no top ranking peptides
of a minimum length (see Mascot Installation and Setup document, chapter 6). The value of the parameter
‘ProteinsInResultsFile’ should be set to 3 (if it is not listed in the configuration file then it should simply be added). If
it is not possible to carry this out on the Mascot server, then we have provided a small Python script to parse out the
required information from the Uniprot.fasta file (for human proteins see
http://www.uniprot.org/proteomes/UP000005640/; ‘Download’ section) to a text file that is used during the
postMascot workflow. The script may be run with the syntax shown below, with the first command line argument
being the uniprot.fasta file from which to parse the data. Currently this will only work for Uniprot .fasta files.
C:\isobarQuant-1.0.0>python parseUniprotFasta.py UniprotSequenceFile.fasta
Where "UniprotSequenceFile.fasta" is the name of the Uniprot file to be analyzed. The output, called
‘essentialuniprotdata.txt’, should be copied to the top-level directory where the isobarQuant package is installed: it
is used later on by the script creating the .txt file outputs. The workflow will not fail if this is not carried out, but
some helpful annotations might be missing in the output.
During development of this package, we have used Mascot versions 2.4 and 2.5.
Running the postMascot Workflow
The postMascot (postMascot.py) workflow is started in an analogous manner to the preMascot workflow. As
before, it is controlled using a combination of command line arguments and a configuration file (postMascot.cfg).
There are two essential parameters that must be specified each time it runs: datadir and mergeresults:
C:\isobarQuant-1.0.0>python postMascot.py --datadir c:\vehicle_1 --mergeresults yes
Here, "C:\vehicle_1" represents the absolute path of the directory that contains the data files (.dat and .hdf5 files
from the preMascot workflow) and the mergeresults parameter can be any of yes/no, 1/0 or True/False. If "yes" is
selected then the results of all files are merged together and protein inference and quantification is performed on
the combined results. Selecting "no" will separate the results and perform inference and quantification on each
individual file present.
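The effect of the mergeresults flag can be sketched as follows: the accepted spellings are mapped to a Boolean, which then decides whether one combined analysis or one analysis per file is planned. The function names and the file names in the example are hypothetical, and whether the real script matches the spellings case-insensitively is an assumption.

```python
def parse_flag(value):
    """Interpret yes/no, 1/0 or True/False as a Boolean."""
    text = str(value).strip().lower()
    if text in ("yes", "1", "true"):
        return True
    if text in ("no", "0", "false"):
        return False
    raise ValueError("unrecognised flag value: %r" % value)


def plan_analyses(hdf5_files, mergeresults):
    """Return the groups of files on which inference and quantification run."""
    if parse_flag(mergeresults):
        return [list(hdf5_files)]               # one merged analysis
    return [[name] for name in hdf5_files]      # one analysis per file


files = ["rep1.hdf5", "rep2.hdf5"]
print(plan_analyses(files, "yes"))
print(plan_analyses(files, "no"))
```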
If the default values in the [runtime] section of postMascot.cfg are appropriate, postMascot can also be started via
the shortened syntax:
C:\isobarQuant-1.0.0>python postMascot.py --d
As with the preMascot workflow, the configuration file postMascot.cfg defines a variety of additional parameters
that control the functionality of postMascot.py. These are organized into four sections, which are separated by
section headers indicated by square brackets. The parameter values can be either modified in place or specified on
the command line for a one-off analysis according to the syntax:
--<section>.<parameter> <value>
Where <section> is the section name, <parameter> is the parameter name and <value> is the new value for the
parameter. Note again that parameter values specified via the command line only apply to that specific
run of postMascot.py; they are not changed in the corresponding .cfg file.
Results
The results of the postMascot workflow are stored in four files: a master results.hdf5 file and three summary text files
containing peptide data, protein data and sample summary data, respectively. A full description of the data
contained in each file is given in Appendices 1 and 2.
Output files are named according to the following convention:
<directory>_<analysis_type>_<rundate>_<runtime>_[output_type]
directory is replaced by the first 25 characters of the data directory name;
analysis_type is ‘merged_results’ when a merged analysis has been performed and is the .raw file name plus
‘_results’ when each .raw file is analyzed separately;
rundate is the date when the process was run in the format YYYYMMDD;
runtime is the time when the process was run in the format HHMM;
output_type is one of:
1. _proteins.txt: file contains protein information and corresponding protein fold changes. This file can
be taken to the next stage.
2. _peptides.txt: file contains information about the individual peptides and their reporter ion values.
3. _summary.txt: file contains some statistical information about the runs performed.
4. .hdf5: the underlying .hdf5 file contains the results of the run started above.
We designed the filename convention to store the parent folder in the results file name since the folder name is
often related to the experiment being analyzed. The inclusion of the date/time stamp in the file name prevents the
accidental overwriting of data, especially if multiple operating parameters are being investigated.
Explanation of postMascot.cfg Parameters
The postMascot.cfg file is split into four sections: [runtime], [logging], [general] and [quantification]. Relevant data
from this .cfg file is passed to each sub-process.
Runtime Section
This section contains the parameters that must be provided on the command line. The "mergeresults" parameter
needs further explanation. It is used to control the way protein inference and quantification is performed.
Selecting "yes" will combine the search results from all .raw files and produce a single set of results files. Thus the
protein inference and quantification will be calculated using all peptides in all files.
Selecting "no" will generate a set of results files: one specific for each .raw file.
Parameter | Description | Initial Value
datadir | The directory where the .dat and .hdf5 data files can be found | c:\myproject
mergeresults | Flag to control the merging of results at the protein inference stage. | yes
Logging Section
Contains the parameters to control the log output of the postMascot analysis.
Parameter | Description | Initial Value
logdir | Subdirectory of the directory specified by datadir (above) where the log files of all the components are placed. This is transmitted to and used by all sub-processes. | logs
logfile | Filename for the log file | postMascot.log
screenlevel | Level of log details that are displayed on the console (screen) while the application is running. Levels = WARNING, INFO, DEBUG. | INFO
loglevel | Level of log details that are recorded in the log file. Levels = WARNING, INFO, DEBUG. This is transmitted to and used by all sub-processes. | INFO
General Section
Contains parameters that are used by multiple sub-processes
Parameter | Description | Initial Value
minhooklength | The minimum number of amino acids in a peptide for it to be a candidate hook peptide. | 6
delta | The minimum Mascot score difference between the top two Mascot peptides for the top peptide to be a candidate hook peptide. | 10
fdrthreshold | The false discovery rate cut-off for peptide QC: only peptides with a false discovery rate better than this threshold are selected, both for quantification and for protein inference. | 0.01
run_corrects2iquant | Flag to switch on / off the script correcting reporter ion signals according to their calculated S2I value. | yes
Quantification Section
Contains the parameters to control the calculations performed for protein quantification.
Parameter | Description | Default Value
reference | The isotope that should be used as the reference in determining the relative quantities. Value can be any valid isotope id, or isotope name, ‘first’ or ‘last’. If ‘first’ or ‘last’ is provided then the lowest/highest mass isotope is automatically selected, respectively. | last
Detailed Description of the postMascot Workflow
The postMascot workflow is divided into five steps. The major parameters for the postMascot workflow are found in
the postMascot.cfg file; however, each step is further parameterized by its own configuration file. The application
starts by finding all the Mascot results files (.dat files) and reading which file was used to create the search data. This
is used to link the .dat file with its corresponding MS.hdf5 file. This file is then examined to find if any previous
Mascot data has been loaded. If all files have Mascot data then an option is given to use the already loaded data, or
to delete the existing data and load from the current .dat files. If only some of the MS.hdf5 files have Mascot data
then the only option is to delete the existing Mascot data and perform a fresh loading. In either case the option is
presented to quit the application and examine the data manually.
Transfer and QC of Mascot-Derived Peptide Matches
The first section of the postMascot workflow (mascotParser.py) reads the contents of the Mascot results files, details
of the search parameters and the masses and modifications used. These are then transferred to the MS.hdf5 file.
The data is stored in a separate section of the .hdf5 file named after the .dat file.
The query data is linked to the Xcalibur spectrum data by the msms_id that is put into the spectrum TITLE data (see
preMascot workflow above). All peptide sequences are removed if they contain any invalid residues (as stated in the
allowedamino parameter of the mascotParser.cfg file; default: ACDEFGHIKLMNPQRSTUVWY). The peptide data are
then filtered to remove any Mascot suggestion that does not have the same score as the highest scoring Mascot
suggestion for each spectrum. The top sequences are further QC’d by checking if they are hook peptides (defined in
the Hook Peptide section below). The pure peptide data is stored in the peptides table whilst the link to protein
information is stored in the seq2acc table (peptide sequences identified by Mascot (seq) to the protein identifiers
(acc)). This table also contains summary information about the quality of the identification of the peptide sequence
and is used in the protein inference step.
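The first two QC filters can be sketched for the suggestions of a single spectrum. The sketch assumes that the residue check is applied before the top score is determined; this order, the function name and the (sequence, score) layout are assumptions for illustration.

```python
ALLOWED = set("ACDEFGHIKLMNPQRSTUVWY")  # default of the allowedamino parameter


def qc_suggestions(suggestions, allowed=ALLOWED):
    """Filter the Mascot suggestions for one spectrum.

    suggestions: list of (sequence, score).
    Sequences with residues outside the allowed alphabet are dropped, then
    only suggestions sharing the highest remaining score are kept.
    """
    valid = [(seq, score) for seq, score in suggestions
             if set(seq) <= allowed]
    if not valid:
        return []
    top = max(score for _, score in valid)
    return [(seq, score) for seq, score in valid if score == top]


suggestions = [("PEPTIDEK", 50), ("PEPTXDEK", 60),
               ("PEPTLDEK", 50), ("AAAK", 20)]
print(qc_suggestions(suggestions))
```

The suggestion containing X is discarded despite its higher score, and the two remaining suggestions with the same top score are both retained.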
Figure 4. postMascot sections.
1. Transfer of QC’d Mascot peptides to .hdf5 file. Creation of links between spectra / peptides and quantification information to protein accessions.
2. Protein inference based on QC peptides passing FDR threshold
3. Correction of peptide quant values based on measured S2I values
4. Protein Quantification following filtering step of several different criteria
5. Creation of output files and incorporation of Uniprot identifiers
Hook Peptides
Hook peptides represent high quality Mascot identifications. The default settings of isobarQuant require that each
identified protein has at least one hook peptide. Two parameters decide whether a peptide is a hook peptide or not:
sequence length and the Mascot score difference between the highest and second highest scoring Mascot
suggestion for the MSMS spectrum. Only the highest scoring Mascot match for each spectrum is considered for hook
peptide status. Shorter peptides are less reliable for protein identification compared to longer ones, so by default
a hook peptide must be at least 6 amino acids in length. Since the Mascot scoring system is at heart a probability
score (the ion score is -10·log10(P), where P is the probability that the match is a random event), the
difference in scores between multiple identifications for the same spectrum is indicative of the relative likelihoods of
the matches being correct. A peptide match with a score difference of 10 points to the next best score should be 10
times more likely to be correct. The default definition dictates that a hook peptide has to have a score difference of
at least 10 points to the next best scoring match for that spectrum.
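The two hook peptide criteria can be written as a small predicate. The function names are hypothetical, the defaults mirror minhooklength and delta from postMascot.cfg, and treating a spectrum with a single suggestion as passing the score criterion is an assumption of this sketch.

```python
def is_hook_peptide(sequence, scores, minhooklength=6, delta=10):
    """Decide whether the top-scoring match of a spectrum is a hook peptide.

    sequence: peptide sequence of the top-scoring match.
    scores:   Mascot scores of all suggestions for the same spectrum.
    """
    if len(sequence) < minhooklength:
        return False
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return True  # assumption: no competing suggestion to compare against
    return ranked[0] - ranked[1] >= delta


def relative_likelihood(score_difference):
    """Likelihood ratio implied by a score difference on a -10*log10(P) scale."""
    return 10 ** (score_difference / 10.0)


print(is_hook_peptide("PEPTIDEK", [45, 30]))
print(is_hook_peptide("PEPTIDEK", [45, 40]))
print(relative_likelihood(10))
```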
Protein Inference
The second part of the postMascot workflow (proteinInference.py) infers protein groups from a good quality subset
of the peptides imported from the Mascot .dat files into the individual MS.hdf5 files during the previous step. This is
achieved by removing those peptides which do not pass the FDR cut off, removing protein groups and associated
peptides containing only low-quality or repeat peptide identifications and then by applying the principle of Occam’s
razor to create protein groupings based on the remaining shared peptides. It is at this point that the merging of
results from several searches takes place if required. Firstly, all peptide sequences from all relevant MS.hdf5 files are
used to determine the peptide FDR; the Mascot score corresponding to the FDR-threshold given in the configuration
file is calculated. Next, peptides with their associated protein accessions are retrieved and assembled into groups
based on shared peptides. At this point multiple spectrum matches to the same peptide sequence are represented
by the top-scoring peptide sequence only. Any groups with fewer hook peptides than the given threshold
(minnumhook, default=1) are discarded at this point, as are those containing no peptides with a Mascot score
greater than the equivalent to an FDR of fdrthreshold (default=0.01 (1%)). All groups containing completely
duplicate peptides (samesets) are now merged together. The peptide groupings are scanned in descending order of
hook peptide numbers, total score, and then by all FDR-passed peptides. Peptides are marked as novel the first time
they are encountered and as non-novel in any groups thereafter. At the end of this scan all groups containing only
non-novel peptides are removed.
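The final scan over the peptide groupings can be sketched as a greedy pass. The group layout (name, hook count, total score, set of FDR-passed peptides) is hypothetical, and the earlier steps (FDR filtering, hook threshold, merging of samesets) are assumed to have been applied already.

```python
def remove_redundant_groups(groups):
    """Drop groups that contribute no novel peptide.

    groups: list of dicts with keys 'name', 'hooks', 'score' and 'peptides'
            (a set of FDR-passed peptide sequences).
    Groups are visited in descending order of hook count, total score and
    number of peptides; a peptide is novel only the first time it is seen.
    """
    ordered = sorted(
        groups,
        key=lambda g: (g["hooks"], g["score"], len(g["peptides"])),
        reverse=True,
    )
    seen = set()
    kept = []
    for group in ordered:
        novel = group["peptides"] - seen
        seen |= group["peptides"]
        if novel:
            kept.append(group["name"])
    return kept


groups = [
    {"name": "B", "hooks": 1, "score": 60, "peptides": {"p2", "p3"}},
    {"name": "A", "hooks": 3, "score": 200, "peptides": {"p1", "p2", "p3"}},
    {"name": "C", "hooks": 1, "score": 50, "peptides": {"p3", "p4"}},
]
print(remove_redundant_groups(groups))  # ['A', 'C']
```

Group B is removed because both of its peptides are already explained by the higher-ranked group A, which is the Occam's razor principle applied to shared peptides.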
All peptides are then assessed and flagged for uniqueness. The peptides used at each of the steps described above
are always those passing the FDR-threshold. All other, FDR-failed, peptides are kept with their associated accession
groups and are written to the peptide output sheet but are not used in the determination or merging of protein
groups. This might result in the situation that some of the failed peptides are not found in the protein sequences of
all members of the group. Using the default settings, none of these peptides will be used for protein quantification
(unless the user alters the parameters in the protein quantification configuration file).
Following the protein inference section, peptide, spectrum and quantification data from all spectra are extracted
from the MS.hdf5 files, processed and then added with their protein links to the new .hdf5 results file.
For the protein inference there is only one parameter, minnumhook, which is not transmitted down from the values
of the postMascot.cfg file. Its default value is set to 1, meaning that for a protein set to be valid at least one hook
peptide (also passing FDR cut-off) should be present.
Performing Quantification Correction
The third step in the postMascot workflow (corrects2iquant.py) is concerned with reducing the effect of ratio
compression in later protein fold changes. The reporter ion area values are corrected by a simple algorithm using the
signal to interference measure, S2I, which has been shown to strongly reduce the effect of co-fragmentation and
produce more accurate peptide and protein fold changes. The algorithm is described in detail in Savitski et al J
Proteome Res. 2013 Aug 2;12(8):3586-98. The default mode is to have this correction step switched on. If this is
not required, it can be switched off by setting the parameter run_corrects2iquant in the postMascot.cfg file to 0.
This script has no specific parameters of its own; it takes the logging and result file information from the .cfg of the
parent (postMascot) process.
Performing Protein Quantification
The fourth processing step in the postMascot workflow is the protein quantification step (quantifyProteins.py). Here
the quantification values for all peptides linked to each protein group are extracted from the results.hdf5 file; the
filters (as described below) are applied to these quantification values and if there are more than a given number
spectra with associated reporter ions, a bootstrap sum ratio calculation is carried out as described in Savitski et al
Anal Chem. 2011 Dec 1;83(23):8959-67. This cut-off is set by the parameter minquantspectra in the
Nature Protocols: doi:10.1038/nprot.2015.101
16
proteinquantification.cfg. The number of bootstrap iterations is set to 5000 and the result has three components:
the median fold change over all iterations and the positions of the 0.025 (lower) and 0.975(upper) quantiles. If the
required minimum number of spectra with reporter ions (default is set to 4) is not attained, a simple sum ratio is
calculated and the upper and lower quantiles are set to -1. In cases where no quantification data is available in the
‘reference’ channel, no fold change calculation is possible and a value of -1 is recorded (these -1 values are
converted later to NA in the text outputs). In cases where no reporter signal is present for that isotope but a signal
for the ‘reference’ channel is present then the fold change is set at zero with -1 for the upper and lower quantile
values.
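The decision logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of quantifyProteins.py: the function names, the input layout (one (label signal, reference signal) pair per spectrum), the nearest-rank quantile and the fixed random seed are all assumptions made for the example, and the exact resampling scheme of the published bootstrap method may differ. Returning -1 for the quantiles when no reference signal is available is also an assumption, since the text only specifies the fold change value in that case.

```python
import random
import statistics


def quantile(sorted_values, q):
    """Return a simple nearest-rank quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


def protein_fold_change(pairs, minquantspectra=4, iterations=5000, seed=0):
    """Return (fold_change, lower, upper) for one label of one protein group.

    pairs holds one (label_signal, reference_signal) tuple per quantified spectrum.
    """
    total_signal = sum(signal for signal, _ in pairs)
    total_reference = sum(reference for _, reference in pairs)
    if total_reference == 0:
        # No data in the reference channel: no fold change can be calculated.
        return (-1.0, -1.0, -1.0)
    if total_signal == 0:
        # Reference present but no reporter signal for this label.
        return (0.0, -1.0, -1.0)
    if len(pairs) < minquantspectra:
        # Too few spectra for a bootstrap: simple sum ratio, quantiles set to -1.
        return (total_signal / total_reference, -1.0, -1.0)

    rng = random.Random(seed)
    ratios = []
    for _ in range(iterations):
        # Resample the spectra with replacement and recompute the sum ratio.
        resample = [rng.choice(pairs) for _ in pairs]
        reference = sum(r for _, r in resample)
        if reference > 0:  # skip resamples without any reference signal
            ratios.append(sum(s for s, _ in resample) / reference)
    ratios.sort()
    return (statistics.median(ratios), quantile(ratios, 0.025), quantile(ratios, 0.975))
```

For example, protein_fold_change([(2.0, 1.0), (4.0, 2.0)]) falls below the default minimum of four spectra and therefore returns the simple sum ratio 2.0 with both quantiles set to -1.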
Table 2. Configurable parameters employed in the proteinQuantification part of the postMascot workflow.
Parameter Description Initial Value
minquantspectra This is the minimum number of spectra required for bootsumratio to be calculated. 4
mascotthreshold This is the minimum Mascot score required for peptides to be used for protein quantification. 15
fdrthreshold The maximum false discovery rate for a peptide to be used for quantification. Set in the postMascot.cfg file. Given as a fraction of 1. 0.01
peplengthfilter This is the minimum accepted length of a peptide for it to be used for protein quantification. Set in the postMascot.cfg file. 6
p2tthreshold Threshold above which P2T value must be to allow quant values to be used in protein quantification. 4
s2ithreshold Threshold above which S2I value must be to allow quant values to be used in protein quantification. 0.5
deltaseqfilter Score difference between the quantified peptide and the next highest scoring sequence for that spectrum. 5
quantmethod Method used to calculate protein quantification values: either ‘bootstrap’ or ‘simpleratio’. If the total number of quantified spectra in the protein group is below ‘minquantspectra’, the simpleratio is used. bootstrap
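As an illustration of how the filter parameters in Table 2 act together, the following sketch checks a single spectrum against the default values. It is not taken from quantifyProteins.py: the function name, the dictionary layout and the treatment of boundary values (>= for "minimum", > for "above") are assumptions. The field names follow the specquant table in Appendix 2, and the delta_seq rule (zero, or above the threshold) follows the description given there.

```python
# Default thresholds as listed in Table 2.
DEFAULT_FILTERS = {
    "mascotthreshold": 15,
    "fdrthreshold": 0.01,
    "peplengthfilter": 6,
    "p2tthreshold": 4,
    "s2ithreshold": 0.5,
    "deltaseqfilter": 5,
}


def passes_quant_filters(spectrum, filters=DEFAULT_FILTERS):
    """Return True if the spectrum's reporter ions may be used for protein quantification."""
    delta_ok = spectrum["delta_seq"] == 0 or spectrum["delta_seq"] > filters["deltaseqfilter"]
    return (
        spectrum["is_unique"] == 1
        and spectrum["score"] >= filters["mascotthreshold"]
        and spectrum["fdr_at_score"] <= filters["fdrthreshold"]
        and spectrum["peptide_length"] >= filters["peplengthfilter"]
        and spectrum["p2t"] > filters["p2tthreshold"]
        and spectrum["s2i"] > filters["s2ithreshold"]
        and delta_ok
    )


# Illustrative spectrum record (values invented for the example).
example = {
    "is_unique": 1, "score": 32.0, "fdr_at_score": 0.002,
    "peptide_length": 11, "p2t": 9.5, "s2i": 0.8, "delta_seq": 12.0,
}
```

With these values passes_quant_filters(example) is True; lowering s2i to 0.4 would exclude the spectrum from protein quantification.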
The advanced user might wish to run each of the four steps above as a separate process. Each process is started in a
very similar way to that described for the pre- and postMascot workflows. The parameters may be changed either
within each separate .cfg file or via the command line in the manner described above. The scripts are started from
one of three sub-directories. The full directory structure of isobarQuant is shown in Appendix 4. Please note that in
all steps of the postMascot workflow the results from any previous steps are deleted from the .hdf5 files.
For the standard user it should normally suffice to run the whole postMascot workflow and if necessary change the
parameters in the .cfg files as described. This way a new results.hdf5 file is generated each time and the protein
quantification values will not be overwritten. If the postMascot workflow is started a second time and Mascot
results data are already present in the MS.hdf5 file, the user is prompted whether or not to overwrite this data or
abandon the processing.
Output Creation
The .txt outputs are created as the final part of the postMascot workflow. The basis for the naming of the files is
described above in the section ‘Output’. The structure of these files is defined in Appendix 1 and 2. It is possible to
re-create the .txt outputs at any time, without having to re-run the whole postMascot workflow, since all
information is drawn from the underlying results.hdf5 file. This is done by issuing the following command:
C:\isobarQuant-1.0.0>python outputResults.py --datadir c:\vehicle_1 --filefilter *YYYYMMDD*
This tells the script to create the .txt outputs for all results.hdf5 files in directory c:\vehicle_1 whose names contain
the date string ‘YYYYMMDD’. To create outputs for all results.hdf5 files created in directory ‘c:\vehicle_1’ on
January 3rd 2016, YYYYMMDD would read 20160103. The filter can be made more precise by
including the time (*20160103_1600*) or by giving the exact file name ‘vehicle_1_20160103_1600.hdf5’.
This might be useful in the case that one wishes to show the protein outputs with or without the ‘reverse’ decoy hits
displayed (the parameter ignore_reverse_hits controls this and is set to 0 (off) by default).
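The effect of the --filefilter wildcard can be previewed with Python's standard fnmatch module. This is only a sketch of shell-style wildcard matching; whether outputResults.py uses fnmatch internally is not stated here, and the helper name and the second and third file names are invented for the example.

```python
from fnmatch import fnmatch


def select_result_files(file_names, filefilter):
    """Return the file names that match a shell-style wildcard pattern."""
    return [name for name in file_names if fnmatch(name, filefilter)]


# Hypothetical contents of c:\vehicle_1
files = [
    "vehicle_1_20160103_1600.hdf5",
    "vehicle_1_20160103_0900.hdf5",
    "vehicle_1_20160104_1600.hdf5",
]

same_day = select_result_files(files, "*20160103*")        # both runs from January 3rd
same_run = select_result_files(files, "*20160103_1600*")   # only the 16:00 run
```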
isobarQuant is now finished. At this point data may be inspected visually or used as input for further analysis in the
TPP R-package.
Adding New Quantification Methods
The details of the quantification methods are stored in the QuantMethod.cfg configuration file. Three methods are
included in the installation (TMT10, iTRAQ and 8TRAQ). Each method has its own section in QuantMethod.cfg.
To add a new method in the QuantMethod.cfg file a new section must be created. The name of the section should
be the same as the meth_name parameter only in lowercase. The meth_name and meth_id should both be unique
as these parameters are used by the code to identify each method. Where method variants exist (e.g. TMT with 6, 8
and 10 labels) we have adopted the convention that the number of labels in the method is part of the method name
(e.g. TMT6, TMT8, TMT10).
Parameter Explanation
meth_name Full name of the method (used to lookup methods in filenames e.g. TMT10)
meth_type Generic name of the method (without any isotope numbers e.g. TMT)
mascot_name Used by the code to identify the Mascot modification corresponding to the quantification method.
meth_id id number of the quantification method
heavy_isotopes The heavy isotopes used in the quantification method. Formatted as a python dictionary with possible keys = C13, N15, O18, H2 and values = number of the isotope present.
num_isotopes number of different labels in the method
source The location of the quantification data in the MS data: currently only MS2-based quantification strategies are supported
The mascot_name is used by the code to identify Mascot modifications that are used for quantification. The name
must be contained in the Mascot modification title, e.g. TMT which will match to any of the modifications "TMT",
"TMT2plex" and "TMT6plex". Please note that this parameter is case sensitive so "tmt" will not match to any of the
previous modifications. Any peptide that does not have any modification (variable or fixed) will not be used for
quantification.
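The naming rules for a new method and the case-sensitive matching of mascot_name can be checked with a short sketch such as the one below. The function names and the in-memory dictionary layout are assumptions for illustration, and the meth_id values in the usage example are invented; the sketch does not read QuantMethod.cfg itself.

```python
def is_quant_modification(mascot_name, modification_title):
    """Case-sensitive test: mascot_name must be contained in the Mascot modification title."""
    return mascot_name in modification_title


def validate_methods(sections):
    """Check section naming and uniqueness of meth_name and meth_id.

    sections maps each section name to a dict holding 'meth_name' and 'meth_id'.
    Returns a list of human-readable problems (empty if everything is consistent).
    """
    problems = []
    seen_names, seen_ids = set(), set()
    for section, params in sections.items():
        name, meth_id = params["meth_name"], params["meth_id"]
        if section != name.lower():
            problems.append(f"section '{section}' should be '{name.lower()}'")
        if name in seen_names:
            problems.append(f"meth_name '{name}' is not unique")
        if meth_id in seen_ids:
            problems.append(f"meth_id {meth_id} is not unique")
        seen_names.add(name)
        seen_ids.add(meth_id)
    return problems


# Hypothetical method definitions (meth_id values are illustrative only).
methods = {
    "tmt10": {"meth_name": "TMT10", "meth_id": 1},
    "tmt6": {"meth_name": "TMT6", "meth_id": 2},
}
```

Here validate_methods(methods) returns an empty list, is_quant_modification("TMT", "TMT6plex") is True and is_quant_modification("tmt", "TMT6plex") is False.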
The heavy_isotopes parameter is used in the generation of the isotope model for the precursor isotopes. In the
model an average amino acid (averagine) is created using the average elemental composition of all amino acids
weighted by their frequency in the IPI protein database (composition: C4.91, H7.75, N1.38, O1.48, S0.04). The elemental
composition of 1 – 200 averagine units and their isotopic abundances are calculated. Working with the assumption
that each peptide is labeled twice, on average, the isotopic abundance is convoluted with twice the
heavy_isotopes elemental composition. The heavy_isotopes composition is calculated from the average composition
of all labels. In the case of iTRAQ there are three different isotopic compositions that make up the four labels
(13C2 18O, 13C 15N 18O and 13C3 15N); these are combined to give the composition in QuantMethod.cfg
(13C2.25 15N0.75 18O0.5).
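The averaging can be reproduced with a small worked example. The sketch assumes that the third composition (13C3 15N) is shared by two of the four iTRAQ labels, which is the weighting that reproduces the stated averages; the dictionary keys follow the heavy_isotopes convention (C13, N15, O18), and the function names are illustrative.

```python
from fractions import Fraction

# One entry per iTRAQ label; the last composition is assumed to occur twice.
ITRAQ4_LABELS = [
    {"C13": 2, "O18": 1},
    {"C13": 1, "N15": 1, "O18": 1},
    {"C13": 3, "N15": 1},
    {"C13": 3, "N15": 1},
]


def average_composition(labels):
    """Average the number of each heavy isotope over all labels."""
    totals = {}
    for label in labels:
        for isotope, count in label.items():
            totals[isotope] = totals.get(isotope, 0) + count
    return {isotope: Fraction(total, len(labels)) for isotope, total in totals.items()}


def scale(composition, factor):
    """Multiply a composition, e.g. by 2 for a peptide that carries two labels on average."""
    return {isotope: value * factor for isotope, value in composition.items()}


average = average_composition(ITRAQ4_LABELS)   # C13: 9/4, N15: 3/4, O18: 1/2
per_peptide = scale(average, 2)                # C13: 9/2, N15: 3/2, O18: 1
```

The averages 9/4, 3/4 and 1/2 correspond to 13C2.25 15N0.75 18O0.5.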
After the source parameter there is a series of numbered lines that contain information about each isotope in the
quantification method. The lines have the format below:
<id>: [{'mass':<mass>, 'id':<id>, 'name':<name>, 'corrections':<correction>}]
<id> = unique, to the QuantMethod.cfg file, identifier for the isotope, <mass> = The mass of the reporter ion for this
label, <name> = name assigned to the isotope label, <corrections> = The fractional interference of this isotope with
other isotopes. The data for this is formatted as a python dictionary: with key = the ID of the affected isotope label
and value = fractional interference.
e.g. 1: [{'mass':126.12772591, 'id':1, 'name':'126', 'corrections':{3:0.04}}]
In this case isotope id 1 interferes with isotope id 3. The degree of interference is calculated as the intensity of
isotope id 1 multiplied by the fractional interference. This will be subtracted from the intensity of isotope id 3. This
process is repeated for all isotopes in the quantification method to calculate the quant_isocorrected values stored in
the results files. For further detail on how to measure the fractional interference values see Werner et al Anal Chem.
2012 Aug 21;84(16):7188-94.
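A minimal sketch of this correction, using the example line above, might look as follows. The function names are illustrative, the reporter intensities are invented, and the sketch assumes that the interference is computed from the uncorrected (raw) intensities; it is not the isobarQuant implementation.

```python
import ast


def parse_isotope_line(line):
    """Split '<id>: [{...}]' into (id, properties) using a safe literal parser."""
    key, _, value = line.partition(":")
    return int(key), ast.literal_eval(value.strip())[0]


def isotope_correct(raw, isotopes):
    """Subtract each isotope's fractional interference from the labels it affects.

    raw maps isotope id -> reporter intensity; isotopes maps id -> parsed properties.
    """
    corrected = dict(raw)
    for iso_id, info in isotopes.items():
        for affected_id, fraction in info["corrections"].items():
            if affected_id in corrected:
                # Interference = intensity of the interfering isotope x fraction.
                corrected[affected_id] -= raw.get(iso_id, 0.0) * fraction
    return corrected


LINE = "1: [{'mass':126.12772591, 'id':1, 'name':'126', 'corrections':{3:0.04}}]"
isotopes = dict([parse_isotope_line(LINE)])

raw = {1: 1000.0, 3: 500.0}            # invented reporter intensities
corrected = isotope_correct(raw, isotopes)
```

With these numbers, 4% of the intensity of isotope id 1 (40.0) is subtracted from isotope id 3, leaving approximately 460.0, while isotope id 1 is unchanged.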
Appendix 1: Results Text Files Description
Output files are named according to the following convention:
<directory>_<outputtype>_<rundate>_<runtime>.<suffix>; where <directory> is replaced by the first 25 characters
of the data directory name; <outputtype> is either merged_results or (if mergeresults was set to ‘no’ or 0) the full
Xcalibur filename minus the ".raw"; and <suffix> is replaced by one of "peptides", "proteins" or "summary".
Protein Output
Field Name Description
protein_group_no Generated protein number for this protein group.
protein_id Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).
description Description line from .fasta file used in Mascot search. More than one entry possible.
gene_name For data searched with Uniprot fasta files only. This corresponds to the GN value given in the description line of the fasta file. For multiple peptide-match entries this value is the minimal set of gene names associated with the entry. This may be a more appropriate identifier used for matching proteins between files from multiple experiments.
mw Molecular weight of full protein sequences. More than one entry possible: see definition of protein_id.
protein_fdr False discovery rate for protein based on max_score.
max_score Maximum Mascot score of all peptides in protein group.
total_score Sum of mascot scores for all peptides in protein group greater than peptide FDR threshold.
ssm Number of spectrum-to-sequence matches [peptides] in protein group (ssm=spectrum-sequence matches).
upm Number of peptide sequences in protein group that are unique to it (upm=unique peptide matches).
fqssm Number of spectrum to sequence matches [peptides] in protein group with reporter ions that were excluded from protein quantification by the filter criteria (such as uniqueness, S2I, P2T, FDR and mascot score) fqssm=filtered quantified spectrum-sequence matches.
qssm Number of spectrum to sequence matches [peptides] in protein group with reporter ions used in protein quantification qssm =quantified spectrum-sequence matches.
qupm Number of unique peptide sequences in protein group with reporter ions used in protein quantification qupm=quantified unique peptide matches.
reference_label Name of reporter ion label corresponding to the control on which relative protein quantification is based.
signal_sum_<label> Sum of all reporter ion signals of valid quantifiable peptides in protein group. Corrected for isotope impurity and signal to interference prior to summing. Given for each label where <label> is replaced by label name.
rel_fc_<label> Relative fold change calculated by performing bootstrap sum of ratios of quant values against the control. Given for each label where <label> is replaced by label name.
delta_conf_<label> Delta between upper and lower boundary of 95% confidence interval as obtained during the bootstrap sum of ratios calculation. Given for each label where <label> is replaced by label name.
Peptide Output
The peptide output file contains peptide centric data including quantification, FDR, and protein group data.
Field Name Description
protein_group_no Generated protein number to which the peptide sequence is matched.
protein_id Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).
sequence Peptide's amino acid sequence. The length of the peptide sequence is used in peptide QC and to exclude it from use in protein quantification. This is given as "peptide" in _results.hdf5 file
modifications Description of all modifications with their position in the sequence and the amino acid they modify. E.g. TMT6plex N-term; Oxidation M5; Oxidation M9; Carbamidomethyl C12; TMT6plex K13;
mw Calculated molecular mass of the peptide sequence corresponds to mrcalc in Mascot .dat file.
precursor_mz Observed m/z of precursor ion. Corresponds to "observed" in Mascot .dat file.
charge_state Charge state of the precursor ion.
ppm_error Mass difference between the sequence mass (mw) and the neutral_mass in Da, expressed in ppm.
score Score assigned to spectrum-to-protein match by Mascot. This may be used as a threshold to exclude peptide from being used for protein quantification.
fdr_at_score Calculated peptide false discovery rate for this score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.
rank Rank of spectrum to peptide match within Mascot's list of ten suggestions. Only rank1 peptides are quantified.
msms_id Identifier of MS/MS spectrum from the raw data file. Non unique across merged datasets.
source_file The file name of the original Xcalibur .raw file name from which the data were acquired.
peak_intensity Maximum intensity of the XIC peak identified from this spectrum.
s2i Signal-to-interference value for the precursor ion of this spectrum. If the S2I value is above s2ithreshold the peptide's quantification values may be used for protein quantification.
p2t Intensity-to-noise value for the precursor ion of this spectrum. If the P2T value is above p2tthreshold the peptide's quantification values may be used for protein quantification. (P2T = precursor2threshold).
is_unique Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.
in_quantification_of_protein Equal to 1 if peptide's reporter ions are used for protein quantification.
in_protein_inference Equals 1 if peptide sequence is used in assignment of protein groups.
seq_start Peptide start position on protein sequence given by first identifying accession in protein group.
seq_end Peptide end position on protein sequence given by first identifying accession in protein group.
sig_<label>
Summary Output
The summary output file is in three parts. The first part is a table (described below) containing the summary data for
each individual sample together with the totals for the whole workflow.
The second part contains the number of forward and reverse protein hits and the minimum Mascot scores for 1%
and 5% peptide FDR.
The final part contains all the operational parameters from the pre- and postMascot workflows.
Field Name Description
sample_id Generated identifier of the sample
source_file The file name of the original Xcalibur .raw file name from which the data were acquired.
acquired_spectra The number of MS/MS events acquired in the Xcalibur raw file.
mascot_matched_spectra The number of spectra for which Mascot found a peptide match.
spectra_in_qc_proteins The number of spectra with peptides linked to proteins fulfilling all QC criteria.
quantified_spectra The number of spectra fulfilling all quantification-specific QC criteria linked to validated proteins.
mean_precursor_ion_accuracy The precursor ion accuracy is calculated as: (measured precursor ion m/z - theoretical peptide m/z) / theoretical peptide m/z, expressed in ppm. The mean is calculated from all spectra linked to accepted proteins.
sd_precursor_ion_accuracy The standard deviation of the data calculated for the mean_precursor_ion_accuracy.
mean_reporter_ion_accuracy The reporter ion accuracy is calculated as: measured m/z - theoretical m/z, expressed in Th. The mean is calculated from all detected reporter ions from all spectra linked to a valid protein.
sd_reporter_ion_accuracy The standard deviation of the data calculated for the mean_reporter_ion_accuracy
Appendix 2: Results .hdf5 File Description
fdrdata
Column Name Data Type Description
data_type string 50 characters Describes whether the data is a peptide or protein based FDR
score float 32bit Mascot score
reverse_hits integer 32bit Number of reverse (false) hits at Mascot score
forward_hits integer 32bit Number of forward (true) hits at Mascot score
global_fdr float 32bit FDR for Mascot score >= score
local_fdr float 32bit FDR for Mascot score = score
true_spectra integer 32bit Number of spectra with Mascot score >= score
configparameters
Column Name Data Type Description
analysis string 30 characters Name of the part of the analysis that is calling the application.
application string 30 characters Name of the application using the parameter
section string 30 characters Section of the config file containing the parameter
parameter string 60 characters Parameter name
value string 100 characters Parameter value
proteinhit
Column Name Data Type Description
protein_group_no integer 32bit Generated protein number for this protein group. Master entry.
protein_id string 100 characters Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).
description string 500 characters Description line from .fasta file used in Mascot search. More than one entry possible.
gene_name string 50 characters For data searched with Uniprot fasta files only. This corresponds to the GN value given in the description line of the fasta file. For multiple peptide match entries this value is the minimal set of gene names associated with the entry. This may be a more appropriate identifier used for matching proteins between files from multiple experiments.
mw float 32bit Molecular weight of full protein sequences. More than one entry possible; see definition of protein_id.
ssm integer 32bit Number of spectrum-to-sequence matches [peptides] in protein group (ssm=spectrum-sequence matches).
total_score float 32bit Sum of mascot scores for all peptides in protein group greater than peptide FDR threshold. Note that only highest Mascot score is counted for all versions of peptide sequence in that group.
hssm integer 32bit Number of spectra matched to hook peptides in protein group. See definition of hook peptide above. hssm = hook sequence-spectra matches.
hookscore float 32bit Sum of the Mascot scores for the matched Hook peptides
upm integer 32bit Number of peptide sequences in protein group that are unique to it (upm=unique peptide matches).
is_reverse_hit integer 32bit Equals 1 if the match is a reverse (false positive) hit.
protein_fdr float 32bit False discovery rate for protein based on max_score.
max_score float 32bit Maximum Mascot score of all peptides in protein group.
peptide
Column Name Data Type Description
peptide_id integer 32bit Generated unique identifier for this peptide record.
spectrum_id integer 32bit Assigned identifier for this spectrum. Corresponds to master entry in spectrum table.
protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to master entry in proteinhit table.
peptide string 100 characters Peptide's amino acid sequence. The length of the peptide sequence is used in peptide QC and to exclude it from use in protein quantification. This is given as sequence in the output.
variable_modstring string 100 characters Description of the Mascot identified variable modifications. e.g. 2 Oxidation (M); 1 TMT6plex (N-term).
fixed_modstring string 100 characters Description of the Mascot identified fixed modifications. e.g. 1 Carbamidomethyl (C); 1 TMT6plex (K).
positional_modstring string 150 characters Description of all modifications with their position in the sequence, e.g. TMT6plex:0; Carbamidomethyl:2; Oxidation:3; Oxidation:10; TMT6plex:12.
score float 32bit Score assigned to spectrum-to-protein match by Mascot. This may be used as a threshold to exclude peptide from being used for protein quantification.
rank integer 32bit Rank of spectrum to protein match within Mascot's list of ten suggestions. Only rank1 peptides are quantified.
mw float 32bit Calculated molecular mass of the peptide sequence corresponds to mrcalc in Mascot .dat file.
da_delta float 32bit Mass difference between the sequence mass (mw) and the neutral_mass in Da. Corresponds to Mascot delta.
ppm_error float 32bit da_delta expressed in ppm.
missed_cleavage_sites integer 32bit Number of sites on the peptide sequence where expected cleavage by the protease failed to occur.
delta_seq float 32bit Difference in Mascot scores between this Mascot sequence match and the next highest scoring sequence match.
delta_mod float 32bit Difference in Mascot scores between the best two modification variants of this sequence. See paper 21057138.
is_hook integer 32bit Equals 1 if this is a hook peptide.
is_duplicate integer 32bit When multiple identifications are made to the same sequence the highest scoring identification has is_duplicate =0. All other identifications have is_duplicate = 1.
is_first_use_of_sequence integer 32bit Equals 1 for the first occurrence of this peptide sequence across all protein groups when they are ordered by descending total_score.
is_unique integer 32bit Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.
is_reverse_hit integer 32bit Equals 1 if peptide sequence maps only to a protein from the reverse database.
is_quantified integer 32bit Equals 1 if peptide links to at least one quantification event (reporter signal). May be filtered out if peptide fails one or more of the criteria
failed_fdr_filter integer 32bit Equals 1 if peptide's FDR is above the threshold given at runtime. This means it is not used for quantification or for protein inference during the aggregation / re-assignment step.
fdr_at_score float 32bit Calculated peptide false discovery rate for this score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.
in_protein_inference integer 32bit Equals 1 if peptide sequence is used in assignment of protein groups.
seq_start integer 32bit Peptide start position on protein sequence given by first identifying accession in protein group.
seq_end integer 32bit Peptide end position on protein sequence given by first identifying accession in protein group.
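The is_duplicate rule in the table above can be illustrated with a short sketch. The function name, the record layout and the example sequences and scores are invented; the scope over which duplicates are counted and the handling of tied scores (here the first record wins) are assumptions, as neither is stated in the column description.

```python
def flag_duplicates(peptides):
    """Return is_duplicate flags: 0 for the best-scoring match of each sequence, 1 otherwise."""
    best = {}
    for index, record in enumerate(peptides):
        sequence = record["peptide"]
        # Keep the index of the highest-scoring identification per sequence.
        if sequence not in best or record["score"] > peptides[best[sequence]]["score"]:
            best[sequence] = index
    return [0 if best[record["peptide"]] == index else 1
            for index, record in enumerate(peptides)]


# Invented identifications: two matches to the same sequence and one to another.
identifications = [
    {"peptide": "AAGLEK", "score": 20.0},
    {"peptide": "AAGLEK", "score": 35.0},
    {"peptide": "CCTVER", "score": 18.0},
]
flags = flag_duplicates(identifications)
```

Here the second identification of AAGLEK has the higher score, so flags is [1, 0, 0].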
spectrum
Column Name Data Type Description
spectrum_id integer 32bit Assigned identifier for this spectrum. Is unique across all merged datasets in a single analysis. Master entry.
sample_id integer 32bit Generated identifier of the sample where the spectrum was measured.
msms_id integer 32bit Identifier of MS/MS spectrum from the raw data file. Non unique across merged datasets.
query integer 32bit The spectrum identifier as assigned by Mascot.
neutral_mass float 32bit The neutral mass calculated from the precursor_mz by Mascot. Corresponds to neutral_mass in Mascot.
parent_ion float 32bit The actual m/z setting used by the instrument for the isolation of the precursor ion.
precursor_mz float 32bit Observed m/z of precursor ion. Corresponds to "observed" in Mascot .dat file.
s2i float 32bit Signal-to-interference value for the precursor ion of this spectrum
p2t float 32bit Intensity-to-noise value for the precursor ion of this spectrum. (P2T = precursor-to-threshold)
survey_id integer 32bit Identifier of the survey spectrum from the .raw file. Non unique across merged datasets.
start_time float 32bit Retention time for the MS/MS event
peak_rt float 32bit Retention time of the apex of the chromatographic peak linked to this spectrum.
peak_intensity float 32bit Maximum intensity of the XIC peak identified from this spectrum.
peak_fwhm float 32bit The full width at half maximum intensity of the chromatographic peak identified from this spectrum.
sum_reporter_ions float 32bit Sum of all reporter ion intensities in this spectrum.
quant_cancelled integer 32bit Is the quantification of this spectrum cancelled
charge_state integer 32bit Charge state of the precursor ion.
sample
Column Name Data Type Description
sample_id integer 32bit Generated identifier of the sample where the spectrum was measured. Master entry.
source_file string 150 characters The file name of the original Xcalibur .raw file name from which the data were acquired.
acquisition_time float 32bit The total acquisition time of the Xcalibur analysis
acquired_spectra integer 32bit The number of MS/MS events acquired in the Xcalibur raw file.
mascot_matched_spectra integer 32bit The number of spectra for which Mascot found a peptide match.
spectra_in_qc_proteins integer 32bit The number of spectra with peptides linked to proteins fulfilling all QC criteria.
quantified_spectra integer 32bit The number of spectra fulfilling all quantification-specific QC criteria linked to validated proteins.
mean_precursor_ion_accuracy float 32bit The precursor ion accuracy is calculated as: (measured precursor ion m/z - theoretical peptide m/z) / theoretical peptide m/z, expressed in ppm. The mean is calculated from all spectra linked to accepted proteins.
sd_precursor_ion_accuracy float 32bit The standard deviation of the data calculated for the mean_precursor_ion_accuracy.
mean_reporter_ion_accuracy float 32bit The reporter ion accuracy is calculated as: measured m/z - theoretical m/z, expressed in Th. The mean is calculated from all detected reporter ions from all spectra linked to a valid protein.
sd_reporter_ion_accuracy float 32bit The standard deviation of the data calculated for the mean_reporter_ion_accuracy.
specquant
Column Name Data Type Description
spectrum_id integer 32bit Assigned identifier for this spectrum. Is unique across all merged datasets in a single analysis. Corresponds to master entry in spectrum table.
isotopelabel_id integer 32bit Isotope label identifier. These are defined in the QuantMethod.cfg file and should be unique across all quantification methods.
protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to master entry in proteinhit table.
quant_raw float 32bit Unprocessed quantification value. In MS2 quantification this is the intensity of the reporter ion.
quant_isocorrected float 32bit Quantification value following subtraction of isotope interference from adjacent label(s).
quant_allcorrected float 32bit Quantification value after all corrections have been applied.
score float 32bit Mascot score
delta_seq float 32bit Difference in Mascot scores between this Mascot sequence match and the next highest scoring sequence match. If delta_seq equals zero or if it is above the delta_seq_threshold its quant values may be used for protein quantification.
s2i float 32bit Signal-to-interference value for the precursor ion of this spectrum. If the S2I value is above s2ithreshold the peptide's quantification values may be used for protein quantification.
p2t float 32bit Intensity-to-noise value for the precursor ion of this spectrum. If the P2T value is above p2tthreshold the peptide's quantification values may be used for protein quantification. (P2T = precursor2threshold).
in_quantification_of_protein integer 32bit Equal to 1 if peptide's reporter ions are used for protein quantification.
peptide_length integer 32bit Length of the peptide sequence.
is_unique integer 32bit Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.
fdr_at_score float 32bit Calculated peptide false discovery rate for this Mascot score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.
proteinquant
Column Name Data Type Description
protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to protein entry in proteins output.
isotopelabel_id integer 32bit Isotope label identifier. These are defined in the QuantMethod.cfg file and should be unique across all quantification methods.
reference_label string 50 characters Name of reporter ion label corresponding to the control on which relative protein quantification is based.
protein_fold_change float 32bit Protein fold change compared to reference.
lower_confidence_level float 32bit Lower bound of confidence level interval for the protein fold change.
upper_confidence_level float 32bit Upper bound of confidence level interval for the protein fold change.
sum_quant_signal float 32bit Sum of all reporter ion signals of valid quantifiable peptides in protein group following all relevant corrections.
qssm integer 32bit Number of spectrum to sequence matches [peptides] in protein group with reporter ions used in protein quantification (qssm=quantified spectrum-sequence matches).
qupm integer 32bit Number of unique peptide sequences in protein group with reporter ions used in protein quantification (qupm=quantified unique peptide matches).
statistics
Column Name Data Type Description
statistic string 100 characters Name of summary statistic
value string 200 characters Value of summary statistic
isotopes
Column Name Data Type Description
iso_id integer 32bit The isotope labels id, should be unique amongst all methods.
name string 10 characters The name for the individual isotope.
mz float 32bit The reporter ion m/z for the label for MS2 quantification.
intmz integer 32bit The m/z rounded to the nearest integer.
error float 32bit The absolute m/z error to be applied to this label.
method_id integer 32bit The unique identifier for the quantification method that this label is part of.
amino string 5 characters The amino acid the label modifies.
Appendix 3: MS.hdf5 File Description
The MS.hdf5 file is split into two sections: rawdata and Mascot. The rawdata section holds the data from the
Xcalibur .raw file and the Mascot section, named after the Mascot .dat file name, contains the Mascot peptide
matching data.
Data structure:
MS.hdf5
│ imports
├───rawdata
│ config
│ deconvions
│ ions
│ isotopes
│ lockmass
│ msmsheader
│ noise
│ parameters
│ profile
│ quan
│ specparams
│ spectra
│ units
│ xicbins
└───F000000_dat
bestprot
config
index
masses
mods
parameters
peptides
proteins
queries
seq2acc
statistics
unimodaminoacids
unimodelements
unimodmodifications
unimodspecificity
Top Level Table
imports
Column Name Data Type Description
name string 30 characters The name of the section in which the Mascot data is to be stored.
date string 30 characters Date and time of the Mascot import start.
datfile string 50 characters The name of the Mascot .dat file to be imported.
status string 10 characters The status of the import.
Rawdata Section
rawdata/config
Column Name Data Type Description
set string 30 characters The config file group.
parameter string 60 characters The parameter name.
value string 150 characters The parameter value
rawdata/deconvions
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
mz float 32bit The m/z of the MS/MS ion.
inten float 32bit The intensity of the MS/MS ion.
charge string 10 characters The charge state of the MS/MS ion.
rawdata/ions
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
mz float 32bit The m/z for the ion, after recalibration if performed.
inten float 32bit The intensity of the MS/MS ion.
mzraw float 32bit The m/z for the ion.
rt float 32bit The retention time of the spectrum.
rawdata/isotopes
Column Name Data Type Description
iso_id integer 32bit The isotope labels id, should be unique amongst all methods.
name string 10 characters The name for the individual isotope.
mz float 32bit The reporter ion m/z for the label for MS2 quantification.
intmz integer 32bit The m/z rounded to the nearest integer.
error float 32bit The absolute m/z error to be applied to this label.
method_id integer 32bit The unique identifier for the quantification method that this label is part of.
amino string 5 characters The amino acid the label modifies.
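To illustrate how the mz and error columns could be used together, the sketch below assigns an observed reporter ion to a label when it falls within the absolute m/z error of that label. The example rows and the error value are illustrative only and are not taken from QuantMethod.cfg; match_label is a hypothetical helper.

```python
# Illustrative rows using the columns described above (values are examples only).
ISOTOPES = [
    {"iso_id": 1, "name": "126", "mz": 126.1277, "intmz": 126, "error": 0.005},
    {"iso_id": 2, "name": "127", "mz": 127.1248, "intmz": 127, "error": 0.005},
]


def match_label(observed_mz, isotopes):
    """Return the closest label with |observed - mz| <= error, or None."""
    best = None
    for iso in isotopes:
        diff = abs(observed_mz - iso["mz"])
        if diff <= iso["error"] and (best is None or diff < best[0]):
            best = (diff, iso)
    return best[1] if best else None


hit = match_label(126.1290, ISOTOPES)
print(hit["name"] if hit else "no label matched")
```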
rawdata/lockmass
Column Name Data Type Description
mass float 32bit The m/z value for the lock mass ion.
start float 32bit The start of the retention time window for inclusion.
end float 32bit The end of the retention time window for inclusion.
polarity string 3 characters The polarity the instrument should be using to analyze the ion.
comment string 100 characters Comments for this ion.
rawdata/msmsheader
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
unit_id integer 32bit Unit identifier: The unit groups spectra from the same precursor so that they can be processed together.
order integer 32bit Order in which the spectra in a unit were acquired.
scanevent integer 32bit The event number assigned by Xcalibur.
precmz float 32bit The m/z for the precursor ion using XIC derived data if possible. If recalibration has been performed this is the recalibrated m/z
precmzraw float 32bit The m/z for the precursor ion using XIC derived data if possible.
precinten float 32bit The intensity value for the precursor ion. This will be the apex intensity of the XIC data where possible or the interpolated intensity of the precursor ion at the time of the MS/MS event.
peaktype string 10 characters The source of the precursor ion data. One of: XICclust (XIC data with full cluster matching), XICsingle (XIC data with only 12C ion matching) or survey (interpolated data from MS1 spectra).
monomz float 32bit The Xcalibur monoisotopicMZ value - an m/z value for the precursor ion.
precmz_xic float 32bit The m/z for the precursor ion derived from XIC data. Intensity weighted mean of the XIC data points with intensity > half the maximum intensity.
precmz_surv float 32bit The m/z for the precursor ion derived from MS1 spectra data.
rt float 32bit The retention time of the spectrum.
rtapex float 32bit The retention time of the matching XIC peak apex.
scanapex integer 32bit The spectrum number for the XIC peak apex.
charge integer 32bit The charge state of the precursor ion.
precsd float 32bit The standard deviation of the data points used to calculate the precmz_xic.
id_spec integer 32bit The spec_id of the spectrum where the sequence data is located. When units have multiple spectra one can contain the sequence information and the other the quantification data.
quan_spec integer 32bit The spec_id of the spectrum where the quantification data is located. When units have multiple spectra one can contain the sequence information and the other the quantification data.
survey_spec integer 32bit Spectrum identifier for the MS1 spectrum prior to the MS/MS event.
inten float 32bit The apex ion intensity of any linked XIC peak.
area float 32bit The maximum ion area of any linked XIC peak.
setmass float 32bit The m/z set by Xcalibur as the center of the isolation window.
fwhm float 32bit The width of the XIC peak at half maximum intensity.
c13iso integer 32bit The number of 13C isotopes identified in the isolation window.
s2i float 32bit The S2I value for the MS/MS event, this is interpolated from the two flanking MS1 spectra.
s2iearly float 32bit The S2I value for the precursor ion measured in the MS1 event prior to the MS/MS event.
s2ilate float 32bit The S2I value for the precursor ion measured in the MS1 event after the MS/MS event.
c12 float 32bit The intensity of the precursor 12C ion, this is interpolated from the two flanking MS1 spectra.
c12early float 32bit The 12C intensity value for the precursor ion measured in the MS1 event prior to the MS/MS event.
c12late float 32bit The 12C intensity value for the precursor ion measured in the MS1 event after the MS/MS event.
thresh float 32bit The Xcalibur intensity cut off (best estimation of noise) level for the MS/MS event, this is interpolated from the two flanking MS1 spectra.
threshearly float 32bit The Xcalibur intensity cut off for the precursor ion measured in the MS1 event prior to the MS/MS event.
threshlate float 32bit The Xcalibur intensity cut off for the precursor ion measured in the MS1 event after the MS/MS event.
fragmeth string 5 characters The fragmentation method used in this MS/MS event.
fragenergy float 32bit The energy used for fragmentation for this MS/MS event.
sumrepint float 32bit Sum of the intensities of all identified reporter ions.
sumreparea float 32bit Sum of the areas of all identified reporter ions.
smooth float 32bit The maximum smoothed intensity for the XIC peak.
integ float 32bit The integrated ion intensity for ions with intensities > half the maximum intensity of the XIC peak.
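Two of the derived columns above can be illustrated with a short sketch. The first function follows the description of precmz_xic (intensity weighted mean of the XIC data points with intensity > half the maximum intensity). The second shows one plausible way to obtain values such as s2i, c12 or thresh from the early and late MS1 measurements; the table only states that these are interpolated, so the use of linear interpolation in retention time is an assumption, as are the function names and example values.

```python
def weighted_precursor_mz(points):
    """Intensity-weighted mean m/z of the XIC points above half maximum.

    points: list of (mz, intensity) tuples belonging to one XIC peak.
    """
    if not points:
        raise ValueError("no XIC data points supplied")
    half_max = max(inten for _, inten in points) / 2.0
    kept = [(mz, inten) for mz, inten in points if inten > half_max]
    if not kept:
        raise ValueError("no data point lies above half the maximum intensity")
    total = sum(inten for _, inten in kept)
    return sum(mz * inten for mz, inten in kept) / total


def interpolate(rt, rt_early, value_early, rt_late, value_late):
    """Assumed linear interpolation in retention time between two MS1 values."""
    if rt_late == rt_early:
        return value_early
    fraction = (rt - rt_early) / (rt_late - rt_early)
    return value_early + fraction * (value_late - value_early)


# Hypothetical XIC peak: the 10.0 intensity point is below half maximum (50.0).
xic = [(500.0, 10.0), (500.2, 100.0), (500.4, 60.0)]
print(round(weighted_precursor_mz(xic), 4))
# Hypothetical s2iearly / s2ilate values flanking an MS/MS event at rt 10.5.
print(round(interpolate(10.5, 10.0, 0.8, 11.0, 0.9), 4))
```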
rawdata/noise
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
mz float 32bit The m/z value for this measurement point.
noise float 32bit The intensity cut off level at the m/z. This is the best estimate of noise.
baseline float 32bit ?
rawdata/parameters
Column Name Data Type Description
set string 30 characters What part of the system the parameters are from, e.g. LC, Instrument, Tune.
subset string 30 characters The grouping of the parameters e.g. analytical, EXPERIMENT, Scan Event.
parameter string 60 characters The name of the parameter.
value string 100 characters The value of the parameter.
rawdata/profile
Column Name Data Type Description
id string 12 characters The name of the LC.
time float 32bit The time point.
time_unit string 3 characters The unit of time used in time.
percent_b float 32bit The fraction of solvent B in the LC output.
flow float 32bit The solvent flow rate.
flow_unit string 6 characters The unit of flow used in flow
rawdata/quan
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
isolabel_id integer 32bit The identifier of the isotope label.
area float 32bit The area of the quantification ion.
corrected float 32bit The intensity value after isotope interference correction.
inten float 32bit The intensity value of the isotope label.
survey_id integer 32bit The MS1 spectrum prior to the MS/MS event.
mzdiff float 32bit The m/z difference between the theoretical ion and the identified ion.
ppm float 32bit mzdiff expressed in parts per million.
coalescence float 32bit The measured degree of coalescence between adjacent label ions. Scale between 0 no coalescence and 1 fully coalesced.
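The relationship between the mzdiff and ppm columns can be written out as a small sketch. The sign convention (identified minus theoretical m/z) and the function name are assumptions; the table only states that ppm is mzdiff expressed in parts per million.

```python
def ppm_error(identified_mz, theoretical_mz):
    """Return (mzdiff, ppm) for an identified reporter ion.

    mzdiff is taken here as identified minus theoretical m/z (assumed sign).
    """
    mzdiff = identified_mz - theoretical_mz
    return mzdiff, mzdiff / theoretical_mz * 1e6


mzdiff, ppm = ppm_error(1000.001, 1000.0)
print("mzdiff = %.4f, ppm = %.2f" % (mzdiff, ppm))
```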
rawdata/specparams
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
set string 30 characters The Xcalibur group of parameters: either "filter" or "Extra".
parameter string 60 characters The parameter name.
value string 200 characters The parameter value.
rawdata/spectra
Column Name Data Type Description
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
rt float 32bit The retention time of the spectrum.
type string 5 characters Type of spectrum MS1, MS2 or MS3.
rawdata/xicbins
Column Name Data Type Description
bin integer 32bit The m/z bin, this is calculated from the m/z using the binsPerDa parameter.
specid integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
rt float 32bit The retention time of the spectrum.
mz float 32bit The m/z of the ion, after recalibration if performed.
inten float 32bit The intensity of the ion.
area float 32bit The ion area, calculated by integrating the intensities of the ion.
mzraw float 32bit The m/z of the ion.
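The bin column is described only as being calculated from the m/z using the binsPerDa parameter. The sketch below assumes the simplest such rule, floor(m/z * binsPerDa), and groups rows by bin to form retention-time ordered traces. Both the rule and the binsPerDa value of 10 are assumptions made for illustration, and the rows are hypothetical.

```python
import math
from collections import defaultdict

BINS_PER_DA = 10  # hypothetical value of the binsPerDa parameter


def mz_to_bin(mz, bins_per_da=BINS_PER_DA):
    """Assumed binning rule: floor(m/z * binsPerDa)."""
    return int(math.floor(mz * bins_per_da))


def group_by_bin(rows, bins_per_da=BINS_PER_DA):
    """Group (specid, rt, mz, inten) rows by m/z bin, ordered by retention time."""
    bins = defaultdict(list)
    for specid, rt, mz, inten in rows:
        bins[mz_to_bin(mz, bins_per_da)].append((rt, specid, mz, inten))
    return {b: sorted(points) for b, points in bins.items()}


rows = [
    (12, 10.2, 500.26, 2.0e5),
    (10, 10.0, 500.27, 1.0e5),
    (11, 10.1, 612.81, 3.0e4),
]
for bin_id, points in sorted(group_by_bin(rows).items()):
    print(bin_id, [p[1] for p in points])
```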
rawdata/units
Column Name Data Type Description
unit integer 32bit The ID of the unit.
order integer 32bit The order number of the unit MS/MS event.
activation string 8 characters The fragmentation method used.
acttime float 32bit The time allowed for collisions.
energy float 32bit The energy used in the fragmentation.
isolation float 32bit The isolation window setting.
lowmz float 32bit Lowest m/z for the MS/MS event.
resolution integer 32bit Operating resolution of the instrument.
use string 5 characters How the spectrum data should be used: I = identification only, Q = quantification only, IQ = quantification and identification.
scanevents string 100 characters The scan events from the Xcalibur method linked to this unit.
colenergysteps integer 32bit The steps in collision energy if this option is being used.
colenergywidth float 32bit The range of collision energies used if multiple collision energies are being used.
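The use column takes one of three codes (I, Q or IQ). A minimal sketch of decoding it into two flags is given below; the function name and the returned dictionary layout are hypothetical.

```python
def spectrum_roles(use):
    """Decode the 'use' column into identification / quantification flags."""
    code = use.strip().upper()
    if code not in ("I", "Q", "IQ"):
        raise ValueError("unexpected value for 'use': %r" % use)
    return {"identification": "I" in code, "quantification": "Q" in code}


for code in ("I", "Q", "IQ"):
    print(code, spectrum_roles(code))
```

When a unit contains separate I and Q spectra, the id_spec and quan_spec columns of rawdata/msmsheader indicate which spectrum holds the sequence data and which holds the quantification data.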
Mascot Data Section
F000000_dat/bestprot
Column Name Data Type Description
parameter string 20 characters The name of the parameter.
type string 5 characters The data type of the parameter (string, float etc.).
value string 150 characters The value of the parameter.
F000000_dat/config
Column Name Data Type Description
set string 30 characters The config file group.
parameter string 60 characters The parameter name.
value string 150 characters The parameter value
F000000_dat/index
Column Name Data Type Description
section string 20 characters Mascot dat file section name.
linenumber integer 32bit The number of the line in the Mascot dat file where the section starts.
F000000_dat/masses
Column Name Data Type Description
name string 10 characters Name of the element or peptide terminal endings or single letter code of the amino acid.
mass float 32bit Mass of the named element / peptide terminal endings / amino acid
F000000_dat/parameters
Column Name Data Type Description
section string 10 characters Section name from the MascotParser.cfg file.
parameter string 30 characters The parameter name.
value string 100 characters The parameter value
F000000_dat/mods
Column Name Data Type Description
id string 1 characters Mascot’s identifier for the modification. This is a number for variable modifications and an amino acid code for fixed modifications.
name string 50 characters The name of the modification.
modtype string 10 characters Whether the modification is fixed or variable.
da_delta float 32bit The mass difference that the modification makes.
amino string 20 characters The amino acid specificity or terminal specificity.
neutralloss float 32bit The masses of any neutral losses.
nlmaster float 32bit The mass of the main neutral loss.
relevant integer 32bit Is the modification biologically relevant.
composition string 100 characters The elemental composition of the modification.
F000000_dat/peptides
Column Name Data Type Description
query integer 32bit The mascot assigned spectrum identifier.
pepno integer 32bit The Mascot assigned order of this peptide match to this query.
sequence string 255 characters Peptide's amino acid sequence.
is_hook integer 32bit Equals 1 if this is a hook peptide.
useinprot integer 32bit Equals 1 if the peptide is suitable to use in protein inference.
modsVariable string 255 characters Description of the Mascot identified variable modifications, e.g. 2 Oxidation (M); 1 TMT6plex (N-term).
modsFixed string 255 characters Description of the Mascot identified fixed modifications, e.g. 1 Carbamidomethyl (C); 1 TMT6plex (K).
mass float 32bit Calculated molecular mass of the peptide sequence.
da_delta float 32bit Mass difference between the sequence mass (mw) and the neutral_mass in Da. Corresponds to Mascot delta.
score float 32bit The Mascot score.
misscleave integer 32bit Number of sites on the peptide sequence where expected cleavage by the protease failed to occur.
numionsmatched integer 32bit The number of ions from the sequence that match the spectrum data.
seriesfound string 255 characters Mascot seriesfound parameter.
peaks1 integer 32bit Mascot peaks1 parameter.
peaks2 integer 32bit Mascot peaks2 parameter.
peaks3 integer 32bit Mascot peaks3 parameter.
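The modsVariable and modsFixed strings follow the pattern shown in the examples above (count, modification name, site in parentheses, entries separated by semicolons). A minimal parsing sketch under that assumption is given below; parse_mods is a hypothetical helper, and strings that deviate from the pattern are rejected rather than guessed at.

```python
import re

# One entry: "<count> <name> (<site>)", e.g. "2 Oxidation (M)".
_MOD_RE = re.compile(r"^(\d+)\s+(.+?)\s+\(([^()]+)\)$")


def parse_mods(text):
    """Split a modsVariable / modsFixed string into (count, name, site) tuples."""
    mods = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        match = _MOD_RE.match(item)
        if match is None:
            raise ValueError("unrecognised modification entry: %r" % item)
        mods.append((int(match.group(1)), match.group(2), match.group(3)))
    return mods


print(parse_mods("2 Oxidation (M); 1 TMT6plex (N-term)"))
print(parse_mods("1 Carbamidomethyl (C); 1 TMT6plex (K)"))
```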
F000000_dat/proteins
Column Name Data Type Description
accession string 15 characters The protein identifier used by Mascot.
mass float 32bit The mass of the protein sequence.
name string 255 characters The name of the protein.
F000000_dat/statistics
Column Name Data Type Description
statistic string 20 characters The name of the statistic.
value float 32bit The value of the statistic
F000000_dat/queries
Column Name Data Type Description
query integer 32bit The mascot assigned spectrum identifier.
spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.
msms_id string 10 characters spec_id reformatted in text format (F000000).
rt float 32bit The retention time of the spectrum.
prec_neutmass float 32bit The precursor m/z turned into zero charge state.
prec_mz float 32bit The precursor m/z.
prec_charge integer 32bit The precursor charge state.
matches integer 32bit Number of peptides found by Mascot within the mass accuracy of the precursor.
homology float 32bit The Mascot score required for a homology match.
numpeps integer 32bit A count of the number of peptides with valid sequences returned by Mascot for this spectrum.
delta_seq float 32bit Difference in Mascot scores between the top Mascot sequence match and the next highest scoring sequence match. If delta_seq equals zero or if it is above the delta_seq_threshold its quant values may be used for protein quantification.
delta_mod float 32bit Difference in Mascot scores between the best two modification variants of this sequence. See ref – pubmed #21057138.
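Two of the columns above lend themselves to a worked example. The first function converts a precursor m/z and charge into a zero charge state mass, assuming positive ions carrying protons; the second applies the delta_seq rule as stated in the table. The function names and the threshold value used in the example are hypothetical.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def neutral_mass(prec_mz, prec_charge):
    """Zero charge state mass of a protonated precursor (positive mode assumed)."""
    return (prec_mz - PROTON_MASS) * prec_charge


def usable_for_quant(delta_seq, delta_seq_threshold):
    """Quant values may be used if delta_seq is zero or above the threshold."""
    return delta_seq == 0 or delta_seq > delta_seq_threshold


print(round(neutral_mass(500.5, 2), 4))
for value in (0.0, 3.0, 7.5):
    print(value, usable_for_quant(value, delta_seq_threshold=5.0))
```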
F000000_dat/seq2acc
Column Name Data Type Description
sequence string 255 characters The peptide amino acid sequence.
accession string 15 characters The protein accession.
hook integer 32bit Equals 1 if this peptide has been identified as a hook peptide.
hookscore float 32bit The best score of any hook identifications of this peptide.
pepscore float 32bit The best score of any identifications of this peptide.
bestczrank integer 32bit The best position (Mascot pepno) of any identifications of this peptide.
numpeps integer 32bit The number of times this peptide has been identified.
start integer 32bit The start location of the first occurrence of this peptide in the protein.
end integer 32bit The end location of the first occurrence of this peptide in the protein
F000000_dat/unimodaminoacids
Column Name Data Type Description
code string 6 characters Three letter amino acid code or terminus.
name string 20 characters Full amino acid name or terminus.
title string 6 characters One letter amino acid code or terminus (how the entity is used in Mascot).
monomass float 32bit The monoisotopic mass of the amino acid residue / terminus.
avgmass float 32bit The average mass of the amino acid residue / terminus.
composition string 100 characters The elemental composition of the amino acid residue / terminus.
F000000_dat/unimodelements
Column Name Data Type Description
name string 20 characters The name of the element.
title string 5 characters The symbol used for the element.
monomass float 32bit The monoisotopic mass of the element.
avgmass float 32bit The average mass of the element.
F000000_dat/unimodmodifications
Column Name Data Type Description
modid integer 32bit The Mascot modification identifier.
name string 50 characters The name of the modification.
title string 50 characters The short name as used by Mascot.
delta_mono_mass float 32bit The difference in monoisotopic mass that the modification makes.
delta_avge_mass float 32bit The difference in average mass that the modification makes.
delta_comp string 100 characters The difference in elemental composition that the modification makes.
F000000_dat/unimodspecificity
Column Name Data Type Description
modid integer 32bit The Mascot modification identifier.
group integer 32bit The Mascot group for the modification. This allows Mascot to consider multiple sites as one modification.
site string 20 characters The amino acid or terminus where the modification can be found.
position string 20 characters Location that the modification can occur within the peptide / protein.
classification string 30 characters Biological classification of the potential source of this modification.
hidden string 5 characters true / false if the modification is hidden and not to be used.
nl_mono_mass float 32bit The difference in monoisotopic mass that the neutral loss makes.
nl_avge_mass float 32bit The difference in average mass that the neutral loss makes.
nl_comp string 100 characters The difference in elemental composition that the neutral loss makes.
Appendix 4: Directory structure of isobarQuant
The following directory structure is created when the contents of the zip file are extracted. Each
isobarQuant
│ LICENSE AGREEMENT.docx
│ outputResults.cfg
│ outputResults.py
│ parseUniprotFasta.py
│ postMascot.cfg
│ postMascot.py
│ preMascot.cfg
│ preMascot.py
│ QuantMethod.cfg
│ README.txt
│
├───CommonUtils
│ averagine.py
│ ConfigManager.py
│ elements.cfg
│ ExceptionHandler.py
│ hdf5Base.py
│ hdf5Config.py
│ hdf5Mascot.py
│ hdf5Results.py
│ hdf5Tables.py
│ ionmanipulation.py
│ LoggingManager.py
│ MascotModificationsHandler.py
│ MathTools.py
│ progressbar.py
│ QuantMethodHandler.py
│ tools.py
│ XICprocessor.py
│ __init__.py
│
├───pyMSsafe
│ czXcalibur.pyd
│ datastore.py
│ mgf.cfg
│ mgf.py
│ pyMSsafe.cfg
│ pyMSsafe.py
│ SpectrumProcessor.py
│ xcalparser.py
│ xRawFile.py
├───mascotParser
│ datfile.py
│ datparser.py
│ hdf2file.py
│ mascotparser.cfg
│ mascotParser.py
│ peptide.py
│ protein.py
│ unimodxmlparser.py
│
├───proteinInference
│ fileData.py
│ hdfFile.py
│ peptideset.py
│ peptidesetmanager.py
│ proteininference.cfg
│ proteinInference.py
│
└───proteinQuantification
corrects2iquant.cfg
correctS2Iquant.py
proteinquantification.cfg
quantifyProteins.py
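After extraction, the presence of the main scripts and configuration files can be verified with a short script. The sketch below checks only a subset of the files listed in the tree above; missing_files is a hypothetical helper and is not shipped with isobarQuant. It is written for Python 3, whereas this guide installs Python 2.7, so it is intended as a standalone check rather than as part of the workflow.

```python
import tempfile
from pathlib import Path

# A subset of the files listed in the tree above, keyed by subdirectory
# ("" stands for the isobarQuant top level).
EXPECTED = {
    "": ["preMascot.py", "postMascot.py", "outputResults.py", "QuantMethod.cfg"],
    "CommonUtils": ["hdf5Base.py", "QuantMethodHandler.py"],
    "pyMSsafe": ["pyMSsafe.py", "pyMSsafe.cfg"],
    "mascotParser": ["mascotParser.py", "mascotparser.cfg"],
    "proteinInference": ["proteinInference.py", "proteininference.cfg"],
    "proteinQuantification": ["quantifyProteins.py", "proteinquantification.cfg"],
}


def missing_files(root):
    """Return the expected files (relative paths) that are absent under root."""
    root = Path(root)
    missing = []
    for subdir, names in EXPECTED.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append((Path(subdir) / name).as_posix())
    return missing


# Demonstration on a throwaway directory holding a single expected file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pyMSsafe").mkdir()
    (Path(tmp) / "pyMSsafe" / "pyMSsafe.py").touch()
    demo_missing = missing_files(tmp)
print(len(demo_missing), "expected files missing")
```

In practice the function would be called with the path of the extracted isobarQuant directory; an empty list indicates that all checked files are in place.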
Appendix 5: Troubleshooting

Symptom: All FDR values for proteins and peptides show 0.
Cause: No decoy data (reverse hits) are present in the .fasta file used for searching.
Remedy: Add decoy hits to the .fasta file on the Mascot server (see http://www.matrixscience.com/help/decoy_help.html).

Symptom: postMascot exits saying ‘No *.dat files found for processing.’
Cause: The user forgot to export the Mascot .dat files to the ‘myproject’ data directory, or entered the wrong datadir name.
Remedy: Check that the value given as the argument for --datadir exists and contains .dat files.

Symptom: No quantification information for proteins.
Cause: The Mascot modification name for quantification and the mascot_name in the .cfg file differ, or the quantification method is not set correctly in the QuantMethod.cfg file.
Remedy: Check the QuantMethod.cfg file. Make sure ‘mascot_name’ is present in Mascot’s modification name. Make sure the masses of the reporter ions are correct for your quantification method (see section “Adding New Quantification Methods” in this manual).

Symptom: Peptide flag ‘in_protein_inference’ is set to 1 but the actual FDR is below the value set.
Cause: Protein inference is performed on the ‘best’ peptide sequences in the protein group. However, any reporter signals will not be used for protein quantification.
Remedy: Check for the other peptide in the protein group whose FDR value is above the threshold set. Check that ‘in_quantification_of_protein’ is set to 0.

Symptom: 'python' is not recognized as an internal or external command, operable program or batch file.
Cause: Python has not been installed, or the Python interpreter is not in the system path.
Remedy: Follow part one of this guide to reinstall Python. Make sure “C:\python27\” is in Path. Edit Control Panel\All Control Panel Items\System > Advanced settings > Environment Variables > Path.

Symptom: Error during the preMascot workflow: ‘This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.’
Cause: Xcalibur foundation software is not installed on the workstation where isobarQuant is running.
Remedy: Install the Xcalibur foundation software and try again.
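Two of the checks above (whether the directory given to --datadir contains .dat files, and whether a python executable can be found on the system path) can be scripted before starting a workflow. The sketch below is a standalone illustration written for Python 3; the function names are hypothetical and not part of isobarQuant, and the temporary directory stands in for a real ‘myproject’ data directory.

```python
import shutil
import tempfile
from pathlib import Path


def find_dat_files(datadir):
    """Return the sorted names of the .dat files found directly in datadir."""
    path = Path(datadir)
    if not path.is_dir():
        raise FileNotFoundError("datadir does not exist: %s" % path)
    return sorted(p.name for p in path.glob("*.dat"))


def python_on_path():
    """True if an executable named 'python' is found on the system path."""
    return shutil.which("python") is not None


# Demonstration with a throwaway directory standing in for 'myproject'.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_dat_files(tmp)      # nothing exported yet
    (Path(tmp) / "F000000.dat").touch()     # simulate an exported Mascot file
    filled_result = find_dat_files(tmp)

print("before export:", empty_result)
print("after export:", filled_result)
print("python on path:", python_on_path())
```

An empty list from find_dat_files corresponds to the situation in which postMascot reports that no .dat files were found.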