supplementary methods of “thermal proteome profiling for ... · different cell line (k562), and...

42
1 Supplementary Methods of “Thermal proteome profiling for unbiased identification of direct and indirect drug targets“ Holger Franken 1,, Toby Mathieson 1,, Dorothee Childs 1,2,, Gavain Sweetman 1, Thilo Werner 1 , Ina Tögel 1 , Carola Doce 1 , Stephan Gade 1 , Marcus Bantscheff 1 , Gerard Drewes 1 , Friedrich B.M. Reinhard 1,* , Wolfgang Huber 2,* , Mikhail M. Savitski 1,* 1 Cellzome GmbH, Molecular Discovery Research, GlaxoSmithKline, Meyerhofstrasse 1, Heidelberg, Germany 2 Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany. These authors contributed equally to this work * Corresponding authors [email protected] , [email protected] , [email protected] The supplementary procedures were adapted with permission from the authors of the Jafari et al protocol 3 . The steps were adjusted for different compound treatment (panobinostat), different cell line (K562), and the in-house equipment used. Materials Reagents Liquid nitrogen (use any local provider) K562 cell line (ATCC, cat. no. CCL-243) HDAC inhibitor Panobinostat (Santa Cruz Biotechnology , cat. no. SC-208148) RPMI1640 medium (Life Technologies, cat. no. 21875-034) FCS (Life Technologies, cat. no. 10270-106) DPBS without calcium chloride/ magnesium chloride (Life Technologies, cat. no. 14190-094) DMSO waterfree (Sigma-Aldrich, cat. no. 41647) cOmplete, EDTA-free protease inhibitors (Roche, cat. no. 11873580001) Dimethyl sulfoxide DMSO (Fluka , cat. no. 01934-1L) Equipment Waterbath TW20 (Julabo, cat. no. 9550120) Forma™ Steri-Cult™ CO2 Incubator (Thermo Scientific, cat. no. 3311) Nature Protocols: doi:10.1038/nprot.2015.101

Upload: hatu

Post on 02-Jul-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

1

Supplementary Methods of “Thermal proteome profiling for unbiased identification of

direct and indirect drug targets“

Holger Franken1,†

, Toby Mathieson1,†

, Dorothee Childs1,2,†

, Gavain Sweetman1†

, Thilo

Werner1, Ina Tögel

1, Carola Doce

1, Stephan Gade

1, Marcus Bantscheff

1, Gerard Drewes

1,

Friedrich B.M. Reinhard1,*

, Wolfgang Huber2,*

, Mikhail M. Savitski1,*

1 Cellzome GmbH, Molecular Discovery Research, GlaxoSmithKline, Meyerhofstrasse 1,

Heidelberg, Germany

2 Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.

† These authors contributed equally to this work

* Corresponding authors

[email protected], [email protected], [email protected]

The supplementary procedures were adapted with permission from the authors of the Jafari et

al protocol3. The steps were adjusted for different compound treatment (panobinostat),

different cell line (K562), and the in-house equipment used.

Materials

Reagents

Liquid nitrogen (use any local provider)

K562 cell line (ATCC, cat. no. CCL-243)

HDAC inhibitor Panobinostat (Santa Cruz Biotechnology , cat. no. SC-208148)

RPMI1640 medium (Life Technologies, cat. no. 21875-034)

FCS (Life Technologies, cat. no. 10270-106)

DPBS without calcium chloride/ magnesium chloride (Life Technologies, cat. no.

14190-094)

DMSO waterfree (Sigma-Aldrich, cat. no. 41647)

cOmplete, EDTA-free protease inhibitors (Roche, cat. no. 11873580001)

Dimethyl sulfoxide DMSO (Fluka , cat. no. 01934-1L)

Equipment

Waterbath TW20 (Julabo, cat. no. 9550120)

Forma™ Steri-Cult™ CO2 Incubator (Thermo Scientific, cat. no. 3311)

Nature Protocols: doi:10.1038/nprot.2015.101

2

CASY Cell Counter TT 150µM (Roche Innovatis, cat. no. 05651697001)

PCR Tubes (Brand, cat. no. 781305)

Heraeus Multifuge 3SR (Thermo Scientific, cat. no. 75004371)

Peltier Thermal Cycler (MJ Research / BioRad, cat. no. PTC-200)

Tabletop Centrifuge 5415D (Eppendorf, cat. no. 022621425)

0.2 ml Ultracentrifuge Tubes (Beckman Coulter, cat. no. 343775)

Nunc U-bottom 96-well polypropylene plates, clear (Nunc, cat. no. 267245)

Optima Max XP benchtop ultracentrifuge (Beckman Coulter, cat. no. 393315)

Reagent Setup

Cell culture medium Supplement the RPMI cell culture medium by adding FCS to a final

concentration of 10% (vol/vol). Preheat the culture medium to 37 °C using a water bath

before use in cell culture experiments.

Cells This protocol has been optimized for use with cells in suspension, and K562 cells are

used in the example used in the procedure. For adherent cells normally growing as monolayer

cultures, their use as described in this protocol must be carefully validated for each individual

system. This validation would, for example, involve the investigation of whether there are any

differences in drug-target engagement in cells growing as a monolayer compared with the

format presented here (i.e., detached and in suspension). Such comparative studies can be

achieved by performing the compound incubation step in the cell culture flasks before

detaching the cells, rather than in suspension, as outlined herein, with the downstream steps

remaining the same.

Compound preparation Compounds delivered as powders are dissolved in DMSO to yield

10 mM stock solutions. All library compounds are stored frozen as 10 mM stock solutions

until use. The 10 mM stock solutions are diluted to 2 mM in DMSO before use in the

experiments.

! CAUTION Use appropriate safety equipment and a controlled environment when working

with potentially toxic and mutagenic compounds. This is particularly important when

handling DMSO solutions, as the solvent is highly skin permeable.

▲CRITICAL STEP Water-soluble compounds can be dissolved in water instead. For

nonpolar organic compounds, it might be necessary to use lower concentrations of the

compounds in DMSO owing to their lower solubility. If this is the case, it is important to

determine the DMSO tolerance of the cells that you are working with before deciding how

much of the compound to add.

EQUIPMENT SETUP

Forma™ Steri-Cult™ CO2 Incubator Place a water tray containing sterilized ultrapure

water in the incubator, and then set the incubator to 37 °C, 95% humidity and 5% CO2.

Peltier Thermal Cycler Set and equilibrate to the required temperature before the tubes are

placed inside.

PROCEDURE

Nature Protocols: doi:10.1038/nprot.2015.101

3

Supplementary procedure 1, Performing the cell handling and compound treatment

part of the TPP-TR experiments ● TIMING 7 h

▲CRITICAL STEP Part 1 of the PROCEDURE (Steps S1–S16) describes how to perform

the cell handling and compound treatment part of a TPP-TR experiment on intact K562 cells.

S1| Expand K562 cells in cell culture medium to a cell density of ~1.4 million cells per ml

using standard sterile cell culture procedures and supplies. Approximately 120 million K562

cells are required to perform four TPP-TR experiments. This experiment should be performed

at least two times, each on different days, in order to get statistically meaningful results.

S2| Add 24 ml of the ~1.4 million K562 cells per milliliter of suspension into four separate

T75 flasks. If you are working with adherent cells and it is desirable to avoid detaching the

cells, the compound incubation step could instead be performed in suitable cell culture flasks

or microplates, with the heating step (Step S11) performed either immediately after cell

detachment (using the method of your choice) or directly in the culture containers.

S3| Add 120 μl each of a 200 µM solution of Panobinostat in DMSO to individual flasks to

get a final concentration of 1 μM of each compound. Add the same volume of DMSO to the

remaining flask serving as the vehicle control (0.5% finak concentration). Gently mix the cell

suspension by pipetting up and down several times using a serological pipette.

▲CRITICAL STEP If the compound concentrations are adjusted, ensure that the

predetermined DMSO tolerability of the cells in question is not exceeded. As a rule of thumb,

avoid DMSO concentrations above 1% (vol/vol), but note that many cells are much more

sensitive, so it is advisable to do a pilot experiment to determine DMSO tolerance.

! CAUTION Use appropriate safety equipment and a controlled environment when working

with potentially toxic and mutagenic compounds. This is particularly important when

handling DMSO solutions, as the solvent is highly skin-permeable.

S4| Incubate the cell culture flasks for 5 h in the CO2 incubator at 37 °C.

S5| Collect the cell suspension with a serological pipette and transfer the cells to marked 50-

ml conical tubes.

S6| Count the cell numbers and assess cell viability using a preferred method.

▲CRITICAL STEP It is important to examine whether the compound has acutely affected

the viability or membrane integrity of the compound-treated cells.

S7| Centrifuge the conical tubes at 300g for 3 min at room temperature to pellet the cells, and

then carefully remove and discard all of the culture medium.

▲CRITICAL STEP The centrifugal force may need to be adjusted if the cellular test system

is altered such that the cells are more sensitive to disruption during centrifugation or are more

difficult to pellet.

S8| Gently resuspend the cell pellets with 20 ml of ice cold PBS and centrifuge them at 300g

for 3 min at room temperature to pellet the cells again. Carefully remove and discard all of the

supernatant. Repeat this step.

S9| Add 1.2 ml of ice-cold PBS supplemented with protease inhibitors to each respective tube

and carefully resuspend the cell pellet.

Nature Protocols: doi:10.1038/nprot.2015.101

4

! CAUTION Protease inhibitors must be avoided if they interfere with the target or system

being evaluated. The extent of such interference can be tested in prior experiments for each

individual target protein.

S10| Distribute each cell suspension, i.e., with vehicle control or with the test compound, into

ten different 0.2-ml PCR tubes with 100 μl of cell suspension in each tube (~3 million cells

per tube). Mark each tube or strip with a designated temperature (37–67 °C). This yields a

total of 40 PCR tubes.

S11| Centrifuge the PCR tubes at 300g for 3 min at 4°C to pellet the cells, and then carefully

remove 80 µl of PBS. Gently tapp the tubes to resuspend the cells. The tubes are kept at 4°C

before the heat treatment step.

Heat treatment of cell suspensions ●TIMING 10 min

S12| Heat the PCR-tubes with the first four temperature endpoints (37°C, 41°C, 44°C, 47°C)

at their designated temperature for 3 min in the PTC-200 Peltier Thermal Cycler. Immediately

after heating, remove and incubate the tubes at room temperature for 3 min. After this 3-min

incubation, immediately snap-freeze the samples according to the instructions in Step S13.

▲CRITICAL STEP Do not let the temperature in the blocks rise to the designated

temperature while the tubes are in the cycler. Only place the tubes in the blocks when the

temperature has reached the designated temperature. It is crucial to ensure consistent timing

between tubes in both the heating and cooling steps, as well as between heat stages.

S13| In the meantime, heat the samples subjected to the next four temperature points

remaining at their designated temperature (50°C, 53°C, 56°C, 59°C) for 3 min in the PTC-200

Peltier Thermal Cycler. Immediately after heating, remove and incubate the tubes at room

temperature for 3 min. After this 3-min incubation, immediately snap-freeze the samples

according to the instructions in Step S13. Finally, perform this also for the remaining two

temperatures at their designated temperature (63°C and 67°C) for 3 min in the PTC-200

Peltier Thermal Cycler. Immediately after heating, remove and incubate the tubes at room

temperature for 3 min.

▲CRITICAL STEP Do not let the temperature in the blocks rise to the designated

temperature while the tubes are in the cycler. Only place the tubes in the blocks when the

temperature has reached the designated temperature. It is crucial to ensure consistent timing

between tubes in both the heating and cooling steps.

S14| Snap-freeze the heat-treated cell suspensions in liquid nitrogen.

■ PAUSE POINT The experiment can be paused here, with the samples kept at −80 °C

overnight.

Cell lysis ●TIMING 1 h

S15| Freeze-thaw the cells twice using liquid nitrogen and a thermal cycler or heating block

set at 25 °C in order to ensure a uniform temperature between tubes. The tubes are vortexed

briefly after each thawing. The resulting cell lysates are kept on (4 °C) ice after the last

thawing step.

S16| Briefly vortex the tubes, transfer the cell lysates to 0.2 ml polycarbonate thickwall tubes

and centrifuge the cell lysate–containing tubes at 100,000g for 20 min at 4 °C to pellet cell

Nature Protocols: doi:10.1038/nprot.2015.101

5

debris together with precipitated and aggregated proteins. Carefully remove the tubes from the

centrifuge and avoid disturbing the pellets. Keep the samples on ice in a cooling block.

Continue with step 3 in the main protocol.

PROCEDURE

Supplementary procedure 2, Performing the cell handling and compound treatment

part of the TPP-CCR experiments ● TIMING 7 h

▲CRITICAL STEP The cell handling and compound treatment part of a TPP-CCR is

similar to that of a TPP-TR experiment, except that the compound concentration is varied

instead of the temperature when heating the cells. Steps S18–S30 show how to perform the

cell handling and compound treatment part of a concentration-response experiment at 55 °C

on the basis of 9 different compound concentrations.

S17| Expand K562 cells in cell culture medium to a cell density of ~1.4 million cells per ml

using standard sterile cell culture procedures and supplies. Approximately 60 million K562

cells are needed for two TPP-CCR experiments.

S18| Transfer the 10 * 2.5 ml cells with a density of ~ 1.4 million cells/ml to 6-well plates and

label the wells with the compound concentration that will be applied (10, 2.5, 0.625, 0.15625,

0.03906, 0.00977, 0.00244, 0.00061, 0.00015 µM and vehicle).

S19| Place 60 μl of a 2 mM DMSO stock solution of panobinostat in one well of column 1 of

a 96U NUNC plate. Place 45 μl of DMSO in columns 2–10 of the same plate row.

! CAUTION Use appropriate safety equipment and a controlled environment when working

with potentially toxic and mutagenic compounds. This is particularly important when you are

handling DMSO solutions, as the solvent is highly skin-permeable.

S20| Serially dilute the stock solutions by transferring 15 μl from columns 1 to 2 and by

mixing thoroughly with the predispensed DMSO by a minimum of seven aspiration and

dispensing steps. Continue this process by moving one column at the time, i.e., 2–3, 3–4 and

so on, until the sample in column 9 has been thoroughly mixed. Remove 15 μl from column 9

to ensure the same volume of 45 μl in all wells. This procedure generates a 9-point dose-

response curve with fourfold difference in concentration between wells. Column 10 with

added DMSO serves as a vehicle control.

S21| Split the serial dilutions into two copies by transferring 22 μl of all solutions to a second

96U NUNC plate. Place one of the copies in a −20 °C freezer as backup.

S22| Transfer 12.5 μl of all diluted compound solutions to the corresponding labeled well of

th 6-well plate containing the 2.5 ml of 1.4 million cells per ml.

S23| Incubate the plate for 5 hrs in the CO2 incubator at 37 °C.

S24| Transfer 10x 2.5ml cell suspension into to 10 correspondingly labeled 15ml conical

tubes.

S25| Centrifuge the conical tubes at 300g for 3 min at room temperature to pellet the cells,

and then carefully remove and discard all of the culture medium.

Nature Protocols: doi:10.1038/nprot.2015.101

6

▲CRITICAL STEP The centrifugal force may need to be adjusted if the cellular test system

is altered such that the cells are more sensitive to disruption during centrifugation or are more

difficult to pellet.

S26| Gently resuspend the cell pellets with 10 ml of PBS and centrifuge them at 300g for 3

min at room temperature to pellet the cells again. Carefully remove and discard all of the

supernatant. Repeat this step.

S27| Add 125 µl of ice-cold PBS supplemented with protease inhibitors to each respective

tube and carefully resuspend the cell pellet.

! CAUTION Protease inhibitors must be avoided if they interfere with the target or system

being evaluated. The extent of such interference can be tested in prior experiments for each

individual target protein.

S28| Distribute 100 µl of each cell suspension into ten different 0.2-ml PCR tubes (~3 million

cells per tube). Mark each tube with a designated concentration.

S29| Centrifuge the PCR tubes at 300g for 3 min at 4°C to pellet the cells, and then carefully

remove 80 µl of PBS. Gently tapp the tubes to resuspend the cells. The tubes are kept at 4°C

before the heat treatment step.

Heat treatment of cell suspensions ● TIMING 10 min

S30| Place the PCR tubes into the cycler and heat the cells for 3 min at 55 °C.

S31| Remove the tubes from the instrument and place it in an aluminum block at room

temperature for 3 min to ensure consistent cooling between wells.

Cell lysis ●TIMING 1 h

S32| Freeze-thaw the cells twice using liquid nitrogen and a thermal cycler or heating block

set at 25 °C in order to ensure a uniform temperature between tubes. The tubes are vortexed

briefly after each thawing. The resulting cell lysates are kept on (4 °C) ice after the last

thawing step.

S33| Briefly vortex the tubes, transfer the cell lysates to 0.2 ml polycarbonate thickwall tubes

and centrifuge the cell lysate–containing tubes at 100,000g for 20 min at 4 °C to pellet cell

debris together with precipitated and aggregated proteins. Carefully remove the tubes from the

centrifuge and avoid disturbing the pellets. Keep the samples on ice in a cooling block.

Continue with step 3 in the main protocol.

For detailed discussion of the experimental steps and troubleshooting please consult the Jafari

et al3 protocol.

References

3. R. Jafari, H. Almqvist, H. Axelsson et al., Nat Protoc 9 (9), 2100 (2014).

Nature Protocols: doi:10.1038/nprot.2015.101

1

Supplementary Manual: isobarQuant

Table of Contents

INSTALLATION GUIDE ................................................................................................................................................................. 3

INSTALLATION REQUIREMENTS ............................................................................................................................................................... 3

OVERVIEW OF WORKFLOW ........................................................................................................................................................ 4

RUNNING THE PREMASCOT WORKFLOW ................................................................................................................................... 6

RESULTS ............................................................................................................................................................................................ 7

EXPLANATION OF PREMASCOT.CFG PARAMETERS ...................................................................................................................................... 7

Runtime Section ......................................................................................................................................................................... 7

Logging Section .......................................................................................................................................................................... 7

General Section .......................................................................................................................................................................... 7

DETAILED DESCRIPTION OF THE PREMASCOT WORKFLOW ........................................................................................................................... 9

Data Extraction from .raw Files .................................................................................................................................................. 9

Creating Files for Mascot Searching ........................................................................................................................................... 9

MASCOT SEARCHES .................................................................................................................................................................. 11

RUNNING THE POSTMASCOT WORKFLOW ............................................................................................................................... 11

RESULTS .......................................................................................................................................................................................... 12

EXPLANATION OF POSTMASCOT.CFG PARAMETERS .................................................................................................................................. 12

Runtime Section ....................................................................................................................................................................... 13

Logging Section ........................................................................................................................................................................ 13

General Section ........................................................................................................................................................................ 13

Quantification Section .............................................................................................................................................................. 13

DETAILED DESCRIPTION OF THE POSTMASCOT WORKFLOW ....................................................................................................................... 13

Transfer and QC of Mascot-Derived Peptides Matches............................................................................................................ 14

Hook Peptides ........................................................................................................................................................................... 14

Protein Inference ...................................................................................................................................................................... 15

Performing Quantification Correction ...................................................................................................................................... 15

Performing Protein Quantification ........................................................................................................................................... 15

Output Creation ........................................................................................................................................................................ 16

ADDING NEW QUANTIFICATION METHODS .............................................................................................................................. 17

APPENDIX 1: RESULTS TEXT FILES DESCRIPTION ....................................................................................................................... 19

PROTEIN OUTPUT .............................................................................................................................................................................. 19

PEPTIDE OUTPUT ............................................................................................................................................................................... 20

SUMMARY OUTPUT ........................................................................................................................................................................... 21

APPENDIX 2: RESULTS .HDF5 FILE DESCRIPTION ....................................................................................................................... 22

FDRDATA.......................................................................................................................................................................................... 22

CONFIGPARAMETERS .......................................................................................................................................................................... 22

PROTEINHIT ...................................................................................................................................................................................... 22

PEPTIDE ........................................................................................................................................................................................... 23

SPECTRUM ....................................................................................................................................................................................... 24

SAMPLE ........................................................................................................................................................................................... 24

SPECQUANT ...................................................................................................................................................................................... 25

PROTEINQUANT ................................................................................................................................................................................. 25

STATISTICS ........................................................................................................................................................................................ 25

ISOTOPES ......................................................................................................................................................................................... 26

Nature Protocols: doi:10.1038/nprot.2015.101

2

APPENDIX 3: MS.HDF5 FILE DESCRIPTION ................................................................................................................................ 27

TOP LEVEL TABLE .............................................................................................................................................................................. 27

RAWDATA SECTION ............................................................................................................................................................................ 27

MASCOT DATA SECTION ..................................................................................................................................................................... 31

APPENDIX 4: DIRECTORY STRUCTURE OF ISOBARQUANT ......................................................................................................... 35

APPENDIX 5: TROUBLE SHOOTING............................................................................................................................................ 36

Nature Protocols: doi:10.1038/nprot.2015.101

3

Installation Guide

Installation Requirements

isobarQuant Package

The Python code developed by us is hosted on github and may be downloaded by following the link:

https://github.com/protcode/isob/archive/1.0.0.zip

When extracted, this package also contains a License: please read it! By using the isobarQuant software you agree to

be bound by the terms outlined in it. If this is extracted directly to ‘C:\’ it will likely be ‘C:\isob-1.0.0’.

Python

This software was developed using Python2.7 (32 bit version). It requires the following python modules to be

installed:

numpy

numexpr

tables

pathlib

To install Python and the required packages simply follow the instructions below. This information is also given in

the README.txt file which is also found in the .zip file. Any updates to the required libraries will be reflected in the

README file; the user is therefore recommended to read that first and adapt the names provided below where

necessary.

The example below is for 2.7.9 and assumes that c:\temp is the location of all files downloaded. Adapt this where

necessary to a directory name of your choice. If using an earlier version of Python than 2.7.9 it might be necessary

to install the Python 'pip' package as well. Ways to do this can be found here:

https://pip.pypa.io/en/latest/installing.html

On a windows machine with internet access, proceed as follows:

1) Download the following Microsoft installer (msi) file: https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi

Execute the .msi file and follow instructions to install Python to a location of your choice. Ensure that Python is

added to the system path (this is the bottom option on the first page of the installer after the point where the user is

requested to set the location for python to be installed)

2) Within the contents of the extracted .zip file, there are 3 binary (.whl) files taken from the external repository

maintained by Christoph Gohlke (http://www.lfd.uci.edu/~gohlke/pythonlibs ), which correspond to Python version

2.7 on a 32-bit Windows machine and were used in development and testing of this tool

numpy-1.8.2+mkl-cp27-none-win32.whl

numexpr-2.4-cp27-none-win32.whl

tables-3.2.0-cp27-none-win32.whl

These should be installed via the command prompt in the order given below:

3) From a command prompt run:

C:\>pip install pathlib

C:\>pip install c:\isob-1.0.0\numpy-1.8.2+mkl-cp27-none-win32.whl

C:\>pip install c:\isob-1.0.0\numexpr-2.4-cp27-none-win32.whl

C:\>pip install c:\isob-1.0.0\tables-3.2.0-cp27-none-win32.whl

Nature Protocols: doi:10.1038/nprot.2015.101

4

Alternatively, if the computer onto which you are installing the packages is behind a firewall / unable to connect to

the internet the following procedure should be followed:

1) On a computer with internet connection, download the following Microsoft installer (msi) file:

https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi and copy it to a directory of your choosing on the

computer where you are installing isobarQuant. e.g. c:\temp

2) On a computer with internet connection, download pathlib from:

https://pypi.python.org/packages/source/p/pathlib/pathlib-1.0.1.tar.gz and copy to a directory of your choosing

on the computer where the installation is being done e.g. c:\temp

3) Execute the .msi file and follow instructions to install Python to a location of your choice. Ensure that Python is

added to the system path (this is the bottom option on the first page of the installer after the point where the

user is requested to set the location for python to be installed)

4) On the computer where the installation is being carried out, open a command prompt and run the following

commands (the first 3 are included in this download and are taken from the external repository maintained by

Christoph Gohlke (http://www.lfd.uci.edu/~gohlke/pythonlibs ))

The required python packages are now installed. The downloaded files in c:\temp may now be deleted.

Xcalibur Software (Thermo Scientific™)

The Xcalibur Foundation installation is required for the preMascot functionality of this software (tested with Xcalibur

versions 2.2 and 3.0).

NOTE: The need to link to Xcalibur modules forces the windows 32 bit requirement on the preMascot section of the

code. We believe that any Windows environment that allows the successful installation and running of Xcalibur will

be compatible with the analysis code. However, we have only used it on WindowsNT, Win7 and Windows Server

2012.

The postMascot section does not require Xcalibur and can be run in any operating system. We have tested it in

Win7, Windows Server 2012 and Red Hat Enterprise Linux Server release 6.6.

In order to run isobarQuant the user should change to the directory into which the code was unpacked from the .zip

file.

The intermediate results and data are all stored in HDF5 format since this allows the efficient storage and retrieval of

the large quantities of data common in .raw files. See http://www.hdfgroup.org/HDF5/whatishdf5.html for more

information about this format. As it is a binary format, it is not possible to view it in standard text editors. There are

several possible viewers for these .hdf5 files; a list of which is maintained by the HDF Group:

http://www.hdfgroup.org/products/hdf5_tools/index.html.

Overview of Workflow The isobarQuant workflow is split into two parts, the first part (preMascot) deals with the processing of raw Xcalibur

data files. The data from the .raw file is converted to HDF5 and then the raw spectra from the Mass Spectrometer

are processed, performing ion detection on profile data. The expected reporter ions for isobaric tagged

quantification are also detected at this point. The accuracy of precursor ion masses is enhanced by detecting the

elution profile of the precursor ion and its isotopes in the MS1 data and reporting the intensity weighted mean of

the precursor mass for the top (greater than 50% maximum intensity region) of the precursor peak. This data is then

C:\>pip install c:\isob-1.0.0\numpy-1.8.2+mkl-cp27-none-win32.whl

C:\>pip install c:\isob-1.0.0\numexpr-2.4-cp27-none-win32.whl

C:\>pip install c:\isob-1.0.0\tables-3.2.0-cp27-none-win32.whl

C:\>pip install c:\temp\pathlib-1.0.1.tar.gz

Nature Protocols: doi:10.1038/nprot.2015.101

5

used in the generation of peak lists, stored in Mascot generic format, ready for manual submission to Mascot for

searching.

NOTE: The code does not perform automated Mascot searches. These are left for the user to run manually with the

generated .mgf files.

Figure 1. Flow diagram for the two-stage proteomics workflow code. The scheme shows not only the application

functionality but also the interaction with data files utilized and created during an analysis of multiple files. Solid

arrows indicate the application flow; dashed arrows indicate the flow of data to or from a data file; dotted arrows

indicate that the data files are the same entity but have been reproduced in a new location in the scheme for clarity;

red lines indicate that the user must perform the step manually.

The second part of the analysis (postMascot) collates the Mascot results with the MS data by extracting data from

the Mascot results file, linking the Mascot query information to the Xcalibur spectrum information and performing

peptide level QC. The relevant data is then stored in a separate section of the MS.hdf5 file. From this point it is

possible to perform protein inference and quantification on individual files or collectively by merging the data from

multiple files. The results of the protein inference are stored in a newly created .hdf5 file (results.hdf5) and reports

are generated from this as simple tab-delimited text files, which the user may then inspect in Microsoft Excel or use

in other data analysis software. The protein output text file is intended for use as direct input for the ‘R’ package

‘TPP ‘, as described in the associated manuscript.

Both pre- and postMascot parts of the workflow are performed using Python scripts executed on the command line

and are designed to run with the minimum number of parameters to make this as simple as possible. Additional

parameters are found in configuration (.cfg) files corresponding to either part. The preMascot code identifies any

Xcalibur .raw files in the specified data directory and processes each sequentially; generating one .hdf5 and one .mgf

file per .raw file. The .mgf files may then submitted to Mascot for searching. Once complete, the Mascot results

Nature Protocols: doi:10.1038/nprot.2015.101

6

(.dat) files should be copied to the data directory for the subsequent postMascot workflow. Provided they are

stored in the same data directory each .dat file is automatically matched to the appropriate .hdf5 file in the initial

phase of the postMascot process, and the Mascot results are transferred to the .hdf5 file. Using a single parameter,

specified on the command line, the user can control whether protein inference is performed either on each

individual file or on the combined data set once all .hdf5 files have been appended with their Mascot data (See

figure 1 for more details of the workflow and file creation and usage).

Log files are created by each stage of the workflow; however, the relevant logs for each section are deleted at the

start of the preMascot or postMascot step in order to prevent excessive disk usage.

Running the preMascot Workflow Having opened a command prompt, the user should change to the directory where the isobarQuant package has

been extracted, likely called C:\isob-1.0.0 if the .zip was extracted to c:\.

The preMascot workflow (preMascot.py) is controlled using a combination of command line arguments and a

configuration file (preMascot.cfg). There are two essential parameters that must be specified each time

preMascot.py is run: datadir and quant. These are provided using the syntax:

Here we have used "vehicle_1" as a possible directory name. We suggest that the user think of a name which makes

sense within the context of the data being processed. This directory should be created and the relevant Xcalibur

.raw file(s) to be processed copied there. The quant parameter is shown here as "TMT10" but can be any of the

quantification methods defined in the QuantMethod.cfg file or 0 (zero) or "none". This zero setting tells the

software to look at each filename to find a valid quantification method name in the name. If a valid method is found

then the file is processed with this method, otherwise no quantification is performed. The "none" setting disables

any quantification.

Default values for the datadir and quant parameters are also defined in the [runtime] section of preMascot.cfg. If

these are appropriate preMascot.py can be started via the shortened syntax:

The configuration file preMascot.cfg defines a variety of further parameters that control the functionality of

preMascot.py. These are organized in sections, which are separated by section headers indicated by squared

brackets. The parameter values can be either modified in place or specified on the command line for a one-off

analysis according to the syntax:

Where <section> is the section name, <parameter> is the parameter name and <value> is the new value for the

parameter.

Note: parameter values that are specified via the command line only apply to that specific run of preMascot.py or

postMascot.py. They are not changed in the corresponding .cfg file.

--<section>.<parameter> <value>

C:\isobarQuant-1.0.0>python preMascot.py --d

C:\isobarQuant-1.0.0>python preMascot.py --datadir c:\vehicle_1 --quant TMT10

Nature Protocols: doi:10.1038/nprot.2015.101

7

Results For each .raw file in the selected directory two new files (.hdf5 and .mgf) will be created with the same name as the

.raw but different file extensions. The .mgf file is used for Mascot searching and the .hdf5 is the MS.hdf5 data

repository which contains the extracted MS data. The results of the Mascot search will be added in the postMascot

workflow.

Explanation of preMascot.cfg Parameters The master configuration files are used to expose the most commonly changed parameters for ease of use. Relevant

data from the configuration is passed into each sub-process. The preMascot.cfg file is split into the three sections

[runtime], [logging] and [general]. Within any section an optional parameter "parameterconversion" may be

present. When present, this forces the conversion of parameters into the specified type e.g. path, float, integer,

Boolean, list.

Runtime Section

This section contains the parameters that must be provided on the command line. They deal with where the data is

located and the type of quantification used.

Parameter Description Initial Value

datadir The directory where the Xcalibur .raw files can be found c:\myproject

quant The quantification method to be used. A value of 0 forces the code to extract a quantification method name from the .raw filenames.

0

Logging Section

This contains the parameters to control the various log outputs of the preMascot analysis. The data here is passed

into all sub-processes so that all log files and screen output shows the same level of detail. Please note: running in

DEBUG mode adds significantly to the time taken to complete the workflow; it is recommended to use in only in

order to track problems.

Parameter Description Initial Value

logdir subdirectory of the directory specified by datadir (above) where the log files of all the components are placed.

Logs

logfile filename for the log file preMascot.log

screenlevel level of log details that are displayed on screen while the application is running Levels = WARNING, INFO, DEBUG.

INFO

loglevel level of log details that are recorded in the log file. Levels = WARNING, INFO, DEBUG. INFO

General Section

Contains the parameters controlling the preMascot workflow. The tolppm and tolmda parameters are used in

comparing masses and XIC creation. The actual error used is the maximum of the tolmda and the absolute error

calculated from the ion m/z and tolppm.

Parameter Description Initial Value

tolppm relative mass accuracy of the MS data in ppm. 8

tolmda absolute mass accuracy of the MS data in mDa. 8

beforepeak retention time window before the peak apex in which a MS/MS event can be linked to the peak (in minutes)

0.5

afterpeak retention time window after the peak apex in which a MS/MS event can be linked to the peak (in minutes)

0.5

Nature Protocols: doi:10.1038/nprot.2015.101

8

Figure 2. Scheme for the data extraction part of the processing flow of the preMascot workflow. Gray boxes

highlight the section of the workflow that performs the extraction of MS data and parameters from Xcalibur .raw

files. Green boxes highlight section performing the reanalysis of the precursor ion data using chromatographic data

to improve the precursor ion accuracy.

Nature Protocols: doi:10.1038/nprot.2015.101

9

Detailed Description of the preMascot Workflow The preMascot workflow is concerned with the extraction of data from the Xcalibur .raw files and the storage of the

information in data repository (MS.hdf5 file) additionally improvements are made to the precursor ion mass

accuracy. The second part of the preMascot workflow prepares searchable data for Mascot. All the MS/MS are

filtered to improve the data quality and maximize Mascot’s ability to link spectra to peptide sequences. These data

files are simple text files that conform to Mascot's own generic file format (.mgf).

Data Extraction from .raw Files

The data extraction is performed by an application in the preMascot analysis called pyMSsafe (see figure 2). After

initialization using its own configuration file and parameters passed from the preMascot configuration pyMSsafe

opens the Xcalibur .raw file and reads the tune and instrument methods, including HPLC method data. This data is

then transferred into a newly created .hdf5 file (MS.hdf5) which acts as a data repository. pyMSsafe then processes

all spectra in the .raw file.

With MS data it performs peak detection and centroiding if acquired in profile mode. All found peaks are stored in

the data repository. The application also calculates the signal-to-interference values for any precursor ions relating

to the current MS spectrum. This is done by extracting all ions from the region of the MS spectrum corresponding to

the precursor isolation window and assigning the ions as either belonging to the precursor isotopic cluster or

interference ions.

MS/MS data are also centroided if they are acquired in profile mode and the ions stored.

The reporter ions from isobaric quantification methods are detected and their intensity data stored. All reporter

ions have natural carbon in them that is subject to the normal 13C impurities and because the ions have heavy

isotope labels they are subject to miss incorporation of the heavy labels. This means that each label can influence

adjacent labels. For each label we have measured the relative intensities of these satellite ions, compared to the

main reporter ion, and use this data to correct for the influence of other labels. pyMSsafe applies these corrections

at this point and stores the isotope corrected data alongside the raw, unmodified, intensity data.

Once all spectra have been processed, pyMSsafe attempts to reassign the precursor m/z using chromatographic

peak data. Essentially, ion chromatograms (XIC) are constructed for all ions in the precursor ion isotope cluster.

Chromatographic peaks are detected in the XICs and these are linked together to form isotope clusters. The cluster

intensity data is compared to the theoretical isotope intensities of an averagine model with similar m/z. The cluster

with the lowest least squares fit is selected as representing the precursor. The new m/z is calculated from the

intensity weighted average m/z of the top of the monoisotopic peak.

Creating Files for Mascot Searching

The second part of the preMascot workflow is the creation of data files for use in Mascot searches (see figure 3).

Mascot provides an accessible file format, called Mascot Generic Format (.mgf). All MS/MS spectra are read from

the MS.hdf5 file and following the application of selected filters are exported into the .mgf file. In addition to this,

the Mascot generic format allows for a TITLE field for each spectrum. The application uses this field to store some

information about the MS/MS event. The most critical of these is the msms_id which contains the Xcalibur spectrum

number, and is the way that the workflows link Mascot results with the Xcalibur spectra which they came from.

Filtering removes ions from the MS/MS spectrum that would cause interference in the Mascot matching process.

We have developed a number of filters for our data and find that different filters are required for the high

resolution, high energy collision data produced by HCD fragmentation compared to other fragmentation methods.

For lower energy, low resolution data we use a simple filter that selects the four most intense ions in each segment

of the MS/MS spectrum. The segment size varies with the charge state of the precursor: spectra from 1+ and 2+

precursors are segmented every 100 Th and spectra from more highly charged precursors are segmented every

50 Th.

Nature Protocols: doi:10.1038/nprot.2015.101

10

For HCD data we remove the reporter ions, in this mode they often have high intensities and disrupt the Mascot

scoring. Additionally, we deconvolute the spectrum so that all ions with multiple charges are recalculated as singly

charged ions. The final filter combines the deconvoluted ions into a single ion if their new m/z are within instrument

accuracy of each other. The new ion has an intensity equal to the sum of the intensities of all combined ions and a

m/z equal to the intensity weighted mean of the combined ions. This marks the end of the preMascot workflow.

Figure 3. Scheme for the part of the preMascot workflow dealing with the creation of search files for Mascot

analysis.

Nature Protocols: doi:10.1038/nprot.2015.101

11

Table 1. List of MS/MS ion filters and available manipulation methods.

Filter Description

ten Orders the ions by intensity and removes 10% of the ions with the lowest intensity

zone Remove 13C ions and then selects the most intense 4 ions per 100 Th for 1+ and 2+ precursors and per 50 Th for precursors with higher charge states

repion This filter removes the reporter ions from the spectrum.

deconv Identifies ions that are multiply charged in the spectrum and deconvolutes them to singly charged ions.

neutralloss Removes selected neutral losses from the precursor ion. Which losses are defined in the mgf.cfg file [msmsfilters]neutrals. These are defined by the mass delta from the precursor.

multi Groups together ions that have been deconvoluted and are within instrument tolerance of each other. The group is replaced by a single ion of m/z = intensity weighted average of all precursor m/z in the group and intensity = the sum of all intensities in the group.

immonium Removes selected immonium ions. Which immonium ions are defined in the mgf.cfg file [msmsfilters]immonium. These are defined by the expected m/z for each immonium ion.

Mascot Searches Searches are started manually using the Mascot search interface in the normal way with the .mgf files created in the

step above. As soon as the searches are finished, the resulting .dat files should be exported and copied to the

directory where the MS.hdf5 and .raw data are. For FDR estimation (on peptide and protein level) to be carried out,

it is necessary to perform the Mascot searches against a single .fasta file containing both forward and reverse

(decoy) protein sequences. How this file may be created is described in more detail here:

http://www.matrixscience.com/help/decoy_help.html. If this is not possible, the workflow will not fail, but no FDR

estimation will be performed and consequently no filtering based on FDR will be available.

It is recommended that the Mascot configuration (mascot.dat) file is changed to allow the information of all

identified proteins to be stored in the results file (this gives the postMascot workflow access to protein / gene

names, descriptions and molecular weight data of all proteins) including those proteins with no top ranking peptides

of a minimum length (see Mascot Installation and Setup document, chapter 6). The value of the parameter

‘ProteinsInResultsFile’ should be set to 3 (if it is not listed in the configuration file then it should simply be added). If

it is not possible to carry this out on the Mascot server, then we have provided a small Python script to parse out the

required information from the Uniprot.fasta file (for human proteins see

http://www.uniprot.org/proteomes/UP000005640/; ‘Download’ section) to a text file that is used during the

postMascot workflow. The script may be run with the syntax shown below, with the first command line argument

being the uniprot.fasta file from which to parse the data. Currently this will only work for Uniprot .fasta files.

Where " UniprotSequenceFile.fasta " is the name of the Uniprot file to be analyzed. The output, called

‘essentialuniprotdata.txt’, should be copied to the top-level directory where the isobarQuant package is installed: it

is used later on by the script creating the .txt file outputs. The workflow will not fail if this is not carried out, but

some helpful annotations might be missing in the output.

During development of this package, we have used Mascot versions 2.4 and 2.5.

Running the postMascot Workflow The postMascot (postMascot.py) workflow is started in an analogous manner to the preMascot workflow. As

before, it is controlled using a combination of command line arguments and configuration file (postMascot.cfg).

There are two essential parameters that must be specified each time it runs: datadir and mergeresults:

C:\isobarQuant-1.0.0>python postMascot.py --datadir c:\vehicle_1 --mergeresults yes

C:\isobarQuant-1.0.0>python parseUniprotFasta.py UniprotSequenceFile.fasta

Nature Protocols: doi:10.1038/nprot.2015.101

12

Here, "C:\vehicle_1" represents the absolute path of the directory that contains the data files (.dat and .hdf5 files

from the preMascot workflow) and the mergeresults parameter can be any of yes/no, 1/0 or True/False. If "yes" is

selected then the results of all files are merged together and protein inference and quantification is performed on

the combined results. Selecting "no" will separate the results and perform inference and quantification on each

individual file present.

If the default values in the [runtime] section of postMascot.cfg are appropriate postMascot can also be started via

the shortened syntax:

As with the preMascot workflow, the configuration file postMascot.cfg defines a variety of additional parameters

that control the functionality of postMascot.py. These are organized into four sections, which are separated by

section headers indicated by squared brackets. The parameter values can be either modified in place or specified on

the command line for a one-off analysis according to the syntax:

Where <section> is the section name, <parameter> is the parameter name and <value> is the new value for the

parameter. Once again to note: parameter values that are specified via the command line only apply to that specific

run of postMascot.py; they are not changed in the corresponding .cfg file.

Results The results of the postMascot workflow are stored in four files: A master results.hdf5 file and summary text files

containing peptide data, protein data and sample summary data respectively. A full description of the data

contained in each file is given in Appendix 1 & 2.

Output files are named according to the following convention:

<directory>_<analysis_type>_<rundate> _<runtime>_[ output _type]

directory is replaced by the first 25 characters of the data directory name;

analysis_type is ‘merged_results’ when a merged analysis has been performed and is the .raw file name plus

‘_results’ when each .raw file is analyzed separately

rundate is the date when the process was run in the format YYYYMMDD

runtime is the time when the process was run in format HHMM

output_type is either

1. _proteins.txt: file contains protein information and corresponding protein fold changes. This file can

be taken to the next stage.

2. _peptides.txt: file contains information about the individual peptides and their reporter ion values.

3. _summary.txt : file contains some statistical information about the runs performed

4. .hdf5 : the underlying .hdf5 file contains the results of the run started above

We designed the filename convention to store the parent folder in the results file name since the folder name is

often related to the experiment being analyzed. The inclusion of the date/time stamp in the file name prevents the

accidental overwriting of data, especially if multiple operating parameters are being investigated.

Explanation of postMascot.cfg Parameters The postMascot.cfg file is split into four sections: [runtime], [logging], [general] and [quantification]. Relevant data

from this .cfg file is passed to each sub-process.

--<section>.<parameter> <value>

C:\isobarQuant-1.0.0>python postMascot.py --d

Nature Protocols: doi:10.1038/nprot.2015.101

13

Runtime Section

This section contains the parameters that must be provided on the command line. The "mergeresults" parameter

needs further explanation. It is used to control the way protein inference and quantification is performed.

Selecting "yes" will combine the search results from all .raw files and produce a single set of results files. Thus the

protein inference and quantification will be calculated using all peptides in all files.

Selecting "no" will generate a set of results files: one specific for each .raw file.

Parameter Description Initial Value

datadir The directory where the .dat and .hdf5 data files can be found c:\myproject

mergeresults Flag to control the merging of results at the protein inference stage. yes

Logging Section

Contains the parameters to control the log output of the postMascot analysis.

Parameter Description Initial Value

logdir Subdirectory of the directory specified by datadir (above) where the log files of all the components are placed. This is transmitted to and used by all sub-processes.

logs

logfile filename for the log file postMascot.log

screenlevel Level of log details that are displayed on the console (screen) while the application is running. Levels = WARNING, INFO, DEBUG

INFO

loglevel Level of log details that are recorded in the log file. Levels = WARNING, INFO, DEBUG. This is transmitted to and used by all sub-processes.

INFO

General Section

Contains parameters that are used by multiple sub-processes

Parameter Description Initial Value

minhooklength This is the minimum number of amino acids in a peptide to be a candidate hook peptide.

6

delta This is the minimum Mascot score difference between the top two Mascot peptides for the top peptide to be a candidate hook peptide.

10

fdrthreshold The maximum false discovery rate for a peptide to be included. The false discovery cut off for peptide QC. i.e. only peptides with a false discovery rate better than this threshold will be selected, both for quantification and for protein inference

0.01

run_corrects2iquant Flag to switch on / off the script correcting reporter ion signals according to their calculated S2I value

yes

Quantification Section

Contains the parameters to control the calculations performed for protein quantification.

Parameter Description Default Value

reference The isotope that should be used as the reference in determining the relative quantities. Value can be any valid isotope id, or isotope name, ‘first’ or ‘last’. If ‘first’ or ‘last’ is provided then the lowest/highest mass isotope is automatically selected, respectively.

last

Detailed Description of the postMascot Workflow The postMascot workflow is divided into five steps. The major parameters for the postMascot workflow are found in

the postMascot.cfg file, however, each step is further parameterized by its own configuration file. The application

starts by finding all the Mascot results files (.dat files) and reading which file was used to create the search data. This

is used to link the .dat file with its corresponding MS.hdf5 file. This file is then examined to find if any previous

Mascot data has been loaded. If all files have Mascot data then an option is given to use the already loaded data, or

Nature Protocols: doi:10.1038/nprot.2015.101

14

to delete the existing data and load from the current .dat files. If only some of the MS.hdf5 files have Mascot data

then the only option is to delete the existing Mascot data and perform a fresh loading. In either case the option is

presented to quit the application and examine the data manually.

Transfer and QC of Mascot-Derived Peptides Matches

The first section of the postMascot workflow (mascotParser.py) reads the contents of the Mascot results files, details

of the search parameters and the masses and modifications used. These are then transferred to the MS HDF5 file.

The data is stored in a separate section of the .hdf5 file named after the .dat file.

The query data is linked to the Xcalibur spectrum data by the msms_id that is put into the spectrum TITLE data (see

preMascot workflow above). All peptide sequences are removed if they contain any invalid residues (as stated in the

allowedamino parameter of the mascotParser.cfg file [default-ACDEFGHIKLMNPQRSTUVWY]). The peptide data are

then filtered to remove any Mascot suggestion that does not have the same score as the highest scoring Mascot

suggestion for each spectrum. The top sequences are further QC’d by checking if they are hook peptides (defined in

the Hook Peptide section below). The pure peptide data is stored in the peptides table whilst the link to protein

information is stored in the seq2acc table (peptide sequences identified by Mascot (seq) to the protein identifiers

(acc)). This table also contains summary information about the quality of the identification of the peptide sequence

and is used in the protein inference step.

Figure 4. postMascot sections.

1. Transfer of QC’d Mascot peptides to .hdf5 file. Creation of links between spectra / peptides and quantification information to protein accessions.

2. Protein inference based on QC peptides passing FDR threshold

3. Correction of peptide quant values based on measured S2I values

4. Protein Quantification following filtering step of several different criteria

5. Creation of output files and incorporation of Uniprot identifiers

Hook Peptides

Hook peptides represent high quality Mascot identifications. The default settings of isobarQuant require that each

identified protein has at least one hook peptide. Two parameters decide whether a peptide is a hook peptide or not:

sequence length and the Mascot score difference between the highest and second highest scoring Mascot

suggestion for the MSMS spectrum. Only the highest scoring Mascot match for each spectrum is considered for hook

peptide status. Shorter peptides are less reliable for protein identification compared to longer ones, so a hook

peptide must be at least 6 amino acids in length. Since the Mascot scoring system is at heart a probability score the

difference in scores between multiple identifications for the same spectrum is indicative of the relative likelihoods of

Nature Protocols: doi:10.1038/nprot.2015.101

15

the matches being correct. A peptide match with a score difference of 10 points to the next best score should be 10

times more likely to be correct. The default definition dictates that a hook peptide has to have a score difference of

at least 10 points to the next best scoring match for that spectrum.

Protein Inference

The second part of the postMascot workflow (proteinInference.py) infers protein groups from a good quality subset

of the peptides imported from the Mascot .dat files into the individual MS.hdf5 files during the previous step. This is

achieved by removing those peptides which do not pass the FDR cut off, removing protein groups and associated

peptides containing only low-quality or repeat peptide identifications and then by applying the principle of Occam’s

razor to create protein groupings based on the remaining shared peptides. It is at this point that the merging of

results from several searches takes place if required. Firstly, all peptide sequences from all relevant MS.hdf5 files are

used to determine the peptide FDR; the Mascot score corresponding to the FDR-threshold given in the configuration

file is calculated. Next, peptides with their associated protein accessions are retrieved and assembled into groups

based on shared peptides. At this point multiple spectrum matches to the same peptide sequence are represented

by the top-scoring peptide sequence only. Any groups with fewer hook peptides than the given threshold

(minnumhook, default=1) are discarded at this point, as are those containing no peptides with a Mascot score

greater than the equivalent to an FDR of fdrthreshold (default=0.01 (1%)). All groups containing completely

duplicate peptides (samesets) are now merged together. The peptide groupings are scanned in descending order of

hook peptide numbers, total score, and then by all FDR-passed peptides. Peptides are marked as novel the first time

they are encountered and as non-novel in any groups thereafter. At the end of this scan all groups containing only

non-novel peptides are removed.

All peptides are then assessed and flagged for uniqueness. The peptides used at each of the steps described above

are always those passing the FDR-threshold. All other, FDR-failed, peptides are kept with their associated accession

groups and are written to the peptide output sheet but are not used in the determination or merging of protein

groups. This might result in the situation that some of the failed peptides are not found in the protein sequences of

all members of the group. Using the default settings, none of these peptides will be used for protein quantification

(unless the user alters the parameters in the protein quantification configuration file).

Following the protein inference section, peptide, spectrum and quantification data from all spectra are extracted

from the MS.hdf5 files, processed and then added with their protein links to the new .hdf5 results file.

For the protein inference there is only one parameter, minnumhook, which is not transmitted down from the values

of the postMascot.cfg file. Its default value is set to 1, meaning that for a protein set to be valid at least one hook

peptide (also passing FDR cut-off) should be present.

Performing Quantification Correction

The third step in the postMascot workflow (corrects2iquant.py) is concerned with reducing the effect of ratio

compression in later protein fold changes. The reporter ion area values are corrected by a simple algorithm using the

signal to interference measure, S2I, which has been shown to strongly reduce the effect of co-fragmentation and

produce more accurate peptide and protein fold changes. The algorithm is described in detail in Savitski et al J

Proteome Res. 2013 Aug 2;12(8):3586-98. The default mode is to have this correction step switched on. If this is

not required, it can be switched off by setting the parameter run_corrects2iquant in the postMascot.cfg file to 0.

This script has no specific parameters of its own; it takes the logging and result file information from the .cfg of the

parent (postMascot) process.

Performing Protein Quantification

The fourth processing step in the postMascot workflow is the protein quantification step (quantifyProteins.py). Here

the quantification values for all peptides linked to each protein group are extracted from the results.hdf5 file; the

filters (as described below) are applied to these quantification values and if there are more than a given number

spectra with associated reporter ions, a bootstrap sum ratio calculation is carried out as described in Savitski et al

Anal Chem. 2011 Dec 1;83(23):8959-67. This cut-off is set by the parameter minquantspectra in the

Nature Protocols: doi:10.1038/nprot.2015.101

16

proteinquantification.cfg. The number of bootstrap iterations is set to 5000 and the result has three components:

the median fold change over all iterations and the positions of the 0.025 (lower) and 0.975(upper) quantiles. If the

required minimum number of spectra with reporter ions (default is set to 4) is not attained, a simple sum ratio is

calculated and the upper and lower quantiles are set to -1. In cases where no quantification data is available in the

‘reference’ channel, no fold change calculation is possible and a value of -1 is recorded (these -1 values are

converted later to NA in the text outputs). In cases where no reporter signal is present for that isotope but a signal

for the ‘reference’ channel is present then the fold change is set at zero with -1 for the upper and lower quantile

values.

Table 2. Configurable parameters employed in the proteinQuantification part of the postMascot workflow.

Parameter Description Initial Value

minquantspectra This is the minimum number spectra required for bootsumratio to be calculated 4

mascotthreshold This is the minimum Mascot score required for peptides to be used for protein quantification.

15

fdrthreshold The maximum false discovery rate for a peptide to be used for quantification. Set in the postMascot.cfg file. Given a fraction of 1.

0.01

peplengthfilter This is the minimum accepted length of a peptide for it to be used for protein quantification: Set in the postMascot.cfg file.

6

p2tthrshold Threshold above which P2T value must be to allow quant values to be used in protein quantification.

4

s2ithreshold Threshold above which S2I value must be to allow quant values to be used in protein quantification.

0.5

deltaseqfilter Score difference between the quantified peptide and the next highest scoring sequence for that spectrum.

5

quantmethod Method used to calculate protein quantification values: either ‘bootstrap’ or ‘simpleratio’. If the total number of quantified spectra in the protein group is below ‘minquantspectra’, the simpleratio is used.

bootstrap

The advanced user might wish to run each of the four steps above as a separate process. Each process is started in a

very similar way to that described for the pre- and postMascot workflows. The parameters may be changed either

within each separate .cfg file or via the command line in the manner described above. The scripts are started from

one of three sub-directories. The full directory structure of isobarQuant is shown in Appendix 4. Please note that in

all steps of the postMacot workflow the results from the any previous steps are deleted from the .hdf5 files.

For the standard user it should normally suffice to run the whole postMascot workflow and if necessary change the

parameters in the .cfg files as described. This way a new results.hdf5 file is generated each time and the protein

quantification values will not be overwritten. If the postMascot workflow is started a second time and Mascot

results data are already present in the MS.hdf5 file, the user is prompted whether or not to overwrite this data or

abandon the processing.

Output Creation

The .txt outputs are created as the final part of the postMascot workflow. The basis for the naming of the files is

described above in the section ‘Output’. The structure of these files is defined in Appendix 1 and 2. It is possible to

re-create the .txt outputs at any time, without having to re-run the whole postMascot workflow, since all

information is drawn from the underlying results.hdf5 file. This is done by issuing the following command:

This tells the script to create the .txt outputs for all results.hdf5 files in directory c:\vehicle_1 which have the date

string matching ‘YYYYMMDD’, if one wished to create outputs for all result.hdf5 files created in directory

‘c:\vehicle_1’ on January 3rd 2016 YYYYMMDD would read 20160103. One could extend this to be more precise by

giving the time *20160103_1600* or the exact file name ‘vehicle_1_20160103_1600.hdf5’.

C:\isobarQuant-1.0.0>python outputResults.py --datadir c:\vehicle_1 --filefilter *YYYYMMDD*

Nature Protocols: doi:10.1038/nprot.2015.101

17

This might be useful in the case that one wishes to show the protein outputs with or without the ‘reverse’ decoy hits

displayed (the parameter ignore_reverse_hits controls this and is set to 0 (off) by default).

isobarQuant is now finished. At this point data may be inspected visually or used as input for further analysis in the

TPP R-package.

Adding New Quantification Methods The details of the quantification methods are stored in the QuantMethod.cfg configuration file. Three methods are

included in the installation (TMT10, iTRAQ and 8TRAQ). Each method has its own section in QuantMethod.cfg.

To add a new method in the QuantMethod.cfg file a new section must be created. The name of the section should

be the same as the meth_name parameter only in lowercase. The meth_name and meth_id should both be unique

as these parameters are used by the code to identify each method. Where method variants exist (e.g. TMT with 6, 8

and 10 labels) we have adopted the convention that the number of labels in the method is part of the method name

(e.g. TMT6, TMT8, TMT10).

Parameter Explanation

meth_name Full name of the method (used to lookup methods in filenames e.g. TMT10)

meth_type Generic name of the method (without any isotope numbers e.g. TMT)

mascot_name Used by the code to identify the Mascot modification corresponding to the quantification method.

meth_id id number of the quantification method

heavy_isotopes The heavy isotopes used in the quantification method. Formatted as a python dictionary with possible keys = C13, N15, O18, H2 and values = number of the isotope present.

num_isotopes number of different labels in the method

Source The location of the quantification data in the MS data: currently only MS2-based quantification strategies are supported

The mascot_name is used by the code to identify Mascot modifications that are used for quantification. The name

must be contained in the Mascot modification title, e.g. TMT which will match to any of the modifications "TMT",

"TMT2plex" and "TMT6plex". Please note that this parameter is case sensitive so "tmt" will not match to any of the

previous modifications. Any peptide that does not have any modification (variable or fixed) will not be used for

quantification.

The heavy_isotopes parameter is used in the generation of the isotope model for the precursor isotopes. In the

model an average amino acid (averagine) is created using the average elemental composition of all amino acids

weighted by their frequency in the IPI protein database (composition: C4.91, H7.75, N1.38, O1.48, S0.04). The elemental

composition of 1 – 200 averagine units and their isotopic abundances are calculated. Working with the assumption

that each peptide is labeled twice, on average, the isotopic abundance is convoluted by the twice the

heavy_isotopes elemental composition. The heavy_isotopes composition is calculated from the average compostion

of all labels. In the case of iTRAQ there are three different isotopic compositions that make up the four labels (13C2 18O, 13C 15N 18O and 13C3

15N), these are combined to give the composition in QuantMethods.cfg (13C2.25 15N0.75

18O0.5).

After the source parameter there is a series of numbered lines that contain information about each isotope in the

quantification method. The lines have the format below:

<id>: [{'mass':<mass>, 'id':<id>, 'name':<name>, 'corrections':<correction>}]

<id> = unique, to the QuantMethod.cfg file, identifier for the isotope, <mass> = The mass of the reporter ion for this

label, <name> = name assigned to the isotope label, <corrections> = The fractional interference of this isotope with

other isotopes. The data for this is formatted as a python dictionary: with key = the ID of the affected isotope label

and value = fractional interference.

e.g. 1: [{'mass':126.12772591, 'id':1, ‘name’:'126', ‘corrections’ :{ 3:0.04}}]

Nature Protocols: doi:10.1038/nprot.2015.101

18

In this case isotope id 1 interferes with isotope id 3. The degree of interference is calculated as the intensity of

isotope id 1 multiplied by the fractional interference. This will be subtracted from the intensity of isotope id 3. This

process is repeated for isotopes in the quantification method to calculate the quant_isocorrected values stored in

the results files. For further detail on how to measure the fractional interference values see Werner et al Anal Chem.

2012 Aug 21;84(16):7188-94.

Nature Protocols: doi:10.1038/nprot.2015.101

19

Appendix 1: Results Text Files Description Output files are named according to the following convention:

<directory>_<outputtype>_<rundate>_<runtime>.<suffix>; where <directory> is replaced by the first 25 characters

of the data directory name; <outputtype> is either merged_results or (if mergeresults was set to ‘no’ or 0, the full

Xcalibur filename minus the ".raw"; is replaced by one of "peptides", "proteins" or "summary"

Protein Output

Field Name Description

protein_group_no Generated protein number for this protein group.

protein_id Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).

description Description line from .fasta file used in Mascot search. More than one entry possible.

gene_name For data searched with Uniprot fasta files only. This corresponds to the GN value given in the description line of the fasta file. For multiple peptide-match entries this value is the minimal set of gene names associated with the entry. This may be a more appropriate identifier used for matching proteins between files from multiple experiments.

mw Molecular weight of full protein sequences. More than one entry possible: see definition of protein_id.

protein_fdr False discovery rate for protein based on max_score.

max_score Maximum Mascot score of all peptides in protein group.

total_score Sum of mascot scores for all peptides in protein group greater than peptide FDR threshold.

ssm Number of spectrum-to-sequence matches [peptides] in protein group (ssm=spectrum-sequence matches).

upm Number of peptide sequences in protein group that are unique to it (upm=unique peptide matches).

fqssm Number of spectrum to sequence matches [peptides] in protein group with reporter ions that were excluded from protein quantification by the filter criteria (such as uniqueness, S2I, P2T, FDR and mascot score) fqssm=filtered quantified spectrum-sequence matches.

qssm Number of spectrum to sequence matches [peptides] in protein group with reporter ions used in protein quantification qssm =quantified spectrum-sequence matches.

qupm Number of unique peptide sequences in protein group with reporter ions used in protein quantification qupm=quantified unique peptide matches.

reference_label Name of reporter ion label corresponding to the control on which relative protein quantification is based.

signal_sum_<label> Sum of all reporter ion signals of valid quantifiable peptides in protein group. Corrected for isotope impurity and signal to interference prior to summing. Given for each label where <label> is replaced by label name.

rel_fc_<label> Relative fold change calculated by performing bootstrap sum of ratios of quant values against the control. Given for each label where <label> is replaced by label name.

delta_conf_<label> Delta between upper and lower boundary of 95% confidence interval as obtained during the bootstrap sum of ratios calculation. Given for each label where <label> is replaced by label name.

Nature Protocols: doi:10.1038/nprot.2015.101

20

Peptide Output The peptide output file contains peptide centric data including quantification, FDR, and protein group data.

Field Name Description

protein_group_no Generated protein number to which the peptide sequence is matched.

protein_id Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).

sequence Peptide's amino acid sequence. The length of the peptide sequence is used in peptide QC and to exclude it from use in protein quantification. This is given as "peptide" in _results.hdf5 file

modifications Description of all modifications with their position in the sequence and the amino acid they modify. E.g. TMT6plex N-term; Oxidation M5; Oxidation M9; Carbamidomethyl C12; TMT6plex K13;

mw Calculated molecular mass of the peptide sequence corresponds to mrcalc in Mascot .dat file.

precursor_mz Observed m/z of precursor ion. Corresponds to "observed" in Mascot .dat file.

charge_state Charge state of the precursor ion.

ppm_error Mass difference between the sequence mass (mw) and the neutral_mass in Da, expressed in ppm.

score Score assigned to spectrum-to-protein match by Mascot. This may be used as a threshold to exclude peptide from being used for protein quantification.

fdr_at_score Calculated peptide false discovery rate for this score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.

rank Rank of spectrum to peptide match within Mascots list of ten suggestions. Only rank1 peptides are quantified.

msms_id Identifier of MS/MS spectrum from the raw data file. Non unique across merged datasets.

source_file The file name of the original Xcalibur .raw file name from which the data were acquired.

peak_intensity Maximum intensity of the XIC peak identified from this spectrum.

s2i Signal-to-interference value for the precursor ion of this spectrum. If the S2I value is above s2ithreshold the peptide's quantification values may be used for protein quantification.

p2t Intensity-to-noise value for the precursor ion of this spectrum. If the P2T value is above p2tthreshold the peptides quantification values may be used for protein quantification. (P2T = precursor2threshold).

is_unique Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.

in_quantification_of_protein Equal to 1 if peptide's reporter ions are used for protein quantification.

in_protein_inference Equals 1 if peptide sequence is used in assignment of protein groups.

seq_start Peptide start position on protein sequence given by first identifying accession in protein group.

seq_end Peptide end position on protein sequence given by first identifying accession in protein group.

sig_<label>

Nature Protocols: doi:10.1038/nprot.2015.101

21

Summary Output The summary output file is in three parts the first part is a table (described below) containing the summary data for

each individual sample together with the totals for the whole workflow.

The second part contains the number of forward and reverse protein hits and the minimum Mascot scores for 1%

and 5% peptide FDR.

The final part contains all the operational parameters from the pre- and postMascot workflows

Field Name Description

sample_id Generated identifier of the sample

source_file The file name of the original Xcalibur .raw file name from which the data were acquired.

acquired_spectra The number of MS/MS events acquired in the Xcalibur raw file.

mascot_matched_spectra The number of spectra for which Mascot found a peptide match.

spectra_in_qc_proteins The number of spectra with peptides linked to proteins fulfilling all QC criteria.

quantified_spectra The number of spectra fulfilling all quantification-specific QC criteria linked to validated proteins.

mean_precursor_ion_accuracy The precursor ion accuracy is calculated as: (measured precursor ion m/z - theoretical peptide m/z) / theoretical peptide m/z, expressed in ppm. The mean is calculated from all spectra linked to accepted proteins.

sd_precursor_ion_accuracy The standard deviation of the data calculated for the mean_precursor_ion_accuracy.

mean_reporter_ion_accuracy The reporter ion accuracy is calculated as: measured m/z - theoretical m/z, expressed in Th. The mean is calculated from all detected reporter ions from all spectra with a linked to a valid protein.

sd_reporter_ion_accuracy The standard deviation of the data calculated for the mean_reporter_ion_accuracy

Nature Protocols: doi:10.1038/nprot.2015.101

22

Appendix 2: Results .hdf5 File Description

fdrdata

Column Name Data Type Description

data_type string 50 characters Describes whether the data is a peptide or protein based FDR

score float 32bit Mascot score

reverse_hits integer 32bit Number of reverse (false) hits at Mascot score

forward_hits integer 32bit Number of forward (true) hits at Mascot score

global_fdr float 32bit FDR for Mascot score >= score

local_fdr float 32bit FDR for Mascot score = score

true_spectra integer 32bit Number of spectra with Mascot score >= score

configparameters

Column Name Data Type Description

analysis string 30 characters Name of the part of the analysis that is calling the application.

application string 30 characters Name of the application using the parameter

section string 30 characters Section of the config file containing the parameter

parameter string 60 characters Parameter name

value string 100 characters Parameter value

proteinhit

Column Name Data Type Description

protein_group_no integer 32bit Generated protein number for this protein group. Master entry.

protein_id string 100 characters

Identifying accession(s) from protein database. In cases where all identified peptides match to more than one protein sequence, all possible accessions are given here. The order in which the accessions are presented determines the order of the corresponding information in the description and mw columns. This protein_id can be used to match proteins across files from multiple experiments (as opposed to protein_group_no).

description string 500 characters

Description line from .fasta file used in Mascot search. More than one entry possible.

gene_name string 50 characters

For data searched with Uniprot fasta files only. This corresponds to the GN value given in the description line of the fasta file. For multiple peptide match entries this value is the minimal set of gene names associated with the entry. This may be a more appropriate identifier used for matching proteins between files from multiple experiments.

mw float 32bit Molecular weight of full protein sequences. More than one entry possible see definition of protein_id.

ssm integer 32bit Number of spectrum-to-sequence matches [peptides] in protein group (ssm=spectrum-sequence matches).

total_score float 32bit Sum of mascot scores for all peptides in protein group greater than peptide FDR threshold. Note that only highest Mascot score is counted for all versions of peptide sequence in that group.

hssm integer 32bit Number of spectra matched to hook peptides in protein group. See definition of hook peptide above. hssm = hook sequence-spectra matches.

hookscore float 32bit Sum of the Mascot scores for the matched Hook peptides

upm integer 32bit Number of peptide sequences in protein group that are unique to it (upm=unique peptide matches).

is_reverse_hit integer 32bit is the match a false positive match

protein_fdr float 32bit False discovery rate for protein based on max_score.

max_score float 32bit Maximum Mascot score of all peptides in protein group.

Nature Protocols: doi:10.1038/nprot.2015.101

23

peptide

Column Name Data Type Description

peptide_id integer 32bit Generated unique identifier for this peptide record.

spectrum_id integer 32bit Assigned identifier for this spectrumCorresponds to master entry in spectrum table.

protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to master entry in proteinhit table.

peptide string 100 characters

Peptide's amino acid sequence. The length of the peptide sequence is used in peptide QC and to exclude it from use in protein quantification. This is given as sequence in the output.

variable_modstring string 100 characters

Description of the Mascot identified variable modifications. e.g. 2 Oxidation (M); 1 TMT6plex (N-term).

fixed_modstring string 100 characters

Description of the Mascot identified fixed modifications. e.g. 1 Carbamidomethyl (C); 1 TMT6plex (K).

positional_modstring string 150 characters

Description of all modifications with their position in the sequence . e.g. TMT6plex:0; Carbamidomethyl:2; Oxidation:3; Oxidation:10; TMT6plex:12.

score float 32bit Score assigned to spectrum-to-protein match by Mascot. This may be used as a threshold to exclude peptide from being used for protein quantification.

rank integer 32bit Rank of spectrum to protein match within Mascots list of ten suggestions. Only rank1 peptides are quantified.

mw float 32bit Calculated molecular mass of the peptide sequence corresponds to mrcalc in Mascot .dat file.

da_delta float 32bit Mass difference between the sequence mass (mw) and the neutral_mass in Da. Corresponds to Mascot delta.

ppm_error float 32bit da_error expressed in ppm.

missed_cleavage_ sites

integer 32bit Number of sites on the peptide sequence where expected cleavage by the protease failed to occur.

delta_seq float 32bit Difference in Mascot scores between this Mascot sequence match and the next highest scoring sequence match.

delta_mod float 32bit Difference in Mascot scores between the best two modification variants of this sequence. See paper 21057138.

is_hook integer 32bit Equals 1 if this is a hook peptide.

is_duplicate integer 32bit When multiple identifications are made to the same sequence the highest scoring identification has is_duplicate =0. All other identifications have is_duplicate = 1.

is_first_use_of_ sequence

integer 32bit Equals 1 for the first occurrence of this peptide sequence across all protein groups when they are ordered by descending total_score.

is_unique integer 32bit Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.

is_reverse_hit integer 32bit Equals 1 if peptide sequence maps only to a protein from the reverse database.

is_quantified integer 32bit Equals 1 if peptide links to at least one quantification event (reporter signal). May be filtered out if peptide fails one or more of the criteria

failed_fdr_filter integer 32bit Equals 1 if peptide's FDR is above the threshold given at runtime. This means it is not used for quantification or for protein inference during the aggregation / re-assignment step.

fdr_at_score float 32bit Calculated peptide false discovery rate for this score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.

in_protein_inference integer 32bit Equals 1 if peptide sequence is used in assignment of protein groups.

seq_start integer 32bit Peptide start position on protein sequence given by first identifying accession in protein group.

seq_end integer 32bit Peptide end position on protein sequence given by first identifying accession in protein group.

Nature Protocols: doi:10.1038/nprot.2015.101

24

spectrum

Column Name Data Type Description

spectrum_id integer 32bit Assigned identifier for this spectrum. Is unique across all merged datasets in a single analysis. Master entry.

sample_id integer 32bit Generated identifier of the sample where the spectrum was measured.

msms_id integer 32bit Identifier of MS/MS spectrum from the raw data file. Non unique across merged datasets.

query integer 32bit The spectrum identifier as assigned by Mascot.

neutral_mass float 32bit The neutral mass calculated from the precursor_mz by Mascot. Corresponds to neutral_mass in Mascot.

parent_ion float 32bit The actual m/z setting used by the instrument for the isolation of the precursor ion.

precursor_mz float 32bit Observed m/z of precursor ion. Corresponds to "observed" in Mascot .dat file.

s2i float 32bit Signal-to-interference value for the precursor ion of this spectrum

p2t float 32bit Intensity- to -noise value for the precursor ion of this spectrum. (P2T = precursor-to-threshold)

survey_id integer 32bit Identifier of the survey spectrum from the .raw file. Non unique across merged datasets.

start_time float 32bit Retention time for the MS/MS event

peak_rt float 32bit Retention time of the apex of the chromatographic peak identified linked to this spectrum.

peak_intensity float 32bit Maximum intensity of the XIC peak identified from this spectrum.

peak_fwhm float 32bit The full width at half maximum intensity of the chromatographic peak identified from this spectrum.

sum_reporter_ions float 32bit Sum of all reporter ion intensities in this spectrum.

quant_cancelled integer 32bit Is the quantification of this spectrum cancelled

charge_state integer 32bit Charge state of the precursor ion.

sample

Column Name Data Type Description

sample_id integer 32bit Generated identifier of the sample where the spectrum was measured. Master entry.

source_file string 150 characters

The file name of the original Xcalibur .raw file name from which the data were acquired.

acquisition_time float 32bit The total acquisition time of the Xcalibur analysis

acquired_spectra integer 32bit The number of MS/MS events acquired in the Xcalibur raw file.

mascot_matched_spectra integer 32bit The number of spectra for which Mascot found a peptide match.

spectra_in_qc_proteins integer 32bit The number of spectra with peptides linked to proteins fulfilling all QC criteria.

quantified_spectra integer 32bit The number of spectra fulfilling all quantification-specific QC criteria linked to validated proteins.

mean_precursor_ion_ accuracy

float 32bit The precursor ion accuracy is calculated as: (measured precursor ion m/z - theoretical peptide m/z) / theoretical peptide m/z, expressed in ppm. The mean is calculated from all spectra linked to accepted proteins.

sd_precursor_ion_ accuracy

float 32bit The standard deviation of the data calculated for the mean_precursor_ion_accuracy.

mean_reporter_ion_ accuracy

float 32bit The reporter ion accuracy is calculated as: measured m/z - theoretical m/z, expressed in Th. The mean is calculated from all detected reporter ions from all spectra with a linked to a valid protein.

sd_reporter_ion_ accuracy float 32bit The standard deviation of the data calculated for the mean_reporter_ion_accuracy

Nature Protocols: doi:10.1038/nprot.2015.101

25

specquant

Column Name Data Type Description

spectrum_id integer 32bit Assigned identifier for this spectrum. Is unique across all merged datasets in a single analysis. Corresponds to master entry in spectrum table.

isotopelabel_id integer 32bit Isotope label identifier. These are defined in the QuantMethod.cfg file and should be unique across all quantification methods.

protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to master entry in proteinhit table.

quant_raw float 32bit Unprocessed quantification value. In MS2 quantification this is the intensity of the reporter ion.

quant_isocorrected float 32bit Quantification value following subtraction of isotope interference from adjacent label(s).

quant_allcorrected float 32bit Quantification value after all corrections have been applied.

score float 32bit Mascot score

delta_seq float 32bit Difference in Mascot scores between this Mascot sequence match and the next highest scoring sequence match. If delta_seq equals zero or if it is above the delta_seq_threshold its quant values may be used for protein quantification.

s2i float 32bit Signal-to-interference value for the precursor ion of this spectrum. If the S2I value is above s2ithreshold the peptide's quantification values may be used for protein quantification.

p2t float 32bit Intensity-to-noise value for the precursor ion of this spectrum. If the P2T value is above p2tthreshold the peptides quantification values may be used for protein quantification. (P2T = precursor2threshold).

in_quantification_of_protein integer 32bit Equal to 1 if peptide's reporter ions are used for protein quantification.

peptide_length integer 32bit Length of the peptide sequence.

is_unique integer 32bit Peptide sequence is found uniquely in given protein group. This is required for its quant values to be used in protein quantification.

fdr_at_score float 32bit Calculated peptide false discovery rate for this Mascot score. This may be used as a threshold to exclude peptide from use in protein quantification and from protein inference.

proteinquant

Column Name Data Type Description

protein_group_no integer 32bit Generated protein number to which the peptide sequence is matched. Corresponds to protein entry in proteins output.

isotopelabel_id integer 32bit Isotope label identifier. These are defined in the QuantMethod.cfg file and should be unique across all quantification methods.

reference_label string 50 characters

Name of reporter ion label corresponding to the control on which relative protein quantification is based.

protein_fold_change float 32bit Protein fold change compared to reference.

lower_confidence_level float 32bit Lower bound of confidence level interval for the protein fold change.

upper_confidence_level float 32bit Upper bound of confidence level interval for the protein fold change.

sum_quant_signal float 32bit Sum of all reporter ion signals of valid quantifiable peptides in protein group following all relevant corrections.

qssm integer 32bit Number of spectrum to sequence matches [peptides] in protein group with reporter ions used in protein quantification (qssm =quantfied spectrum-sequence matches).

qupm integer 32bit Number of unique peptide sequences in protein group with reporter ions used in protein quantification (qupm=quantified unique peptide matches).

statistics

Column Name Data Type Description

Nature Protocols: doi:10.1038/nprot.2015.101

26

statistic string 100 characters Name of summary statistic

value string 200 characters Value of summary statistic

isotopes

Column Name Data Type Description

iso_id integer 32bit The isotope labels id, should be unique amongst all methods.

name string 10 characters The name for the individual isotope.

mz float 32bit The reporter ion m/z for the label for MS2 quantification.

intmz integer 32bit The m/z rounded to the nearest integer.

error float 32bit The absolute m/z error to be applied to this label.

method_id integer 32bit The unique identifier for the quantification method that this label is part of.

amino string 5 characters The amino acid the label modifies.

Nature Protocols: doi:10.1038/nprot.2015.101

27

Appendix 3: MS.hdf5 File Description The MS .hdf5 file is split into two sections: rawdata and Mascot. The rawdata section holds the data from the

Xcalibur .raw file and the Mascot section, named after the Mascot .dat file name, contains the Mascot peptide

matching data.

Data structure:

MS.hdf5

│ imports

├───rawdata

│ config

│ deconvions

│ ions

│ isotopes

│ lockmass

│ msmsheader

│ noise

│ parameters

│ profile

│ quan

│ specparams

│ spectra

│ units

│ xicbins

└───F000000_dat

bestprot

config

index

masses

mods

parameters

peptides

proteins

queries

seq2acc

statistics

unimodaminoacids

unimodelements

unimodmodifications

unimodspecificity

Top Level Table

imports

Column Name Data Type Description

name string 30 characters The name of the section in which the Mascot data is to be stored.

date string 30 characters Date and time of the Mascot import start.

datfile string 50 characters The name of the Mascot .dat file to be imported.

status string 10 characters The status of the import.

Rawdata Section

rawdata/config

Column Name Data Type Description

set string 30 characters The config file group.

parameter string 60 characters The parameter name.

value string 150 characters The parameter value

Nature Protocols: doi:10.1038/nprot.2015.101

28

rawdata/deconvions

Column Name Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

mz float 32bit The m/z of the MS/MS ion.

inten float 32bit The intensity of the MS/MS ion.

charge string 10 characters The charge state of the MS/MS ion.

rawdata/ions

Column Name Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

mz float 32bit The m/z for the ion, after recalibration if performed.

inten float 32bit The intensity of the MS/MS ion.

mzraw float 32bit The m/z for the ion.

rt float 32bit The retention time of the spectrum.

rawdata/isotopes

Column Name Data Type Description

iso_id integer 32bit The isotope labels id, should be unique amongst all methods.

name string 10 characters The name for the individual isotope.

mz float 32bit The reporter ion m/z for the label for MS2 quantification.

intmz integer 32bit The m/z rounded to the nearest integer.

error float 32bit The absolute m/z error to be applied to this label.

method_id integer 32bit The unique identifier for the quantification method that this label is part of.

amino string 5 characters The amino acid the label modifies.

rawdata/lockmass

Column Name Data Type Description

mass float 32bit The m/z value for the lock mass ion.

start float 32bit The start of the retention time window for inclusion.

end float 32bit The end of the retention time window for inclusion.

polarity string 3 characters The polarity the instrument should be using to analyze the ion.

comment string 100 characters Comments for this ion.

rawdata/msmsheader

Column Name

Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

unit_id integer 32bit Unit identifier: The unit groups spectra from the same precursor so that they can be processed together.

order integer 32bit Order in which the spectra in a unit were acquired.

scanevent integer 32bit The event number assigned by Xcalibur.

precmz float 32bit The m/z for the precursor ion using XIC derived data if possible. If recalibration has been performed this is the recalibrated m/z

precmzraw float 32bit The m/z for the precursor ion using XIC derived data if possible.

precinten float 32bit The intensity value for the precursor ion. This will be the apex intensity of the XIC data where possible or the interpolated intensity of the precursor ion at the time of the MS/MS event.

peaktype string 10 characters

The source of the precursor ion data. Is One of XICclust - XIC data with full cluster matching, XICsingle - XIC data with only 12C ion matching, survey - interpolated data from MS1 spectra.

monomz float 32bit The Xcalibur monoisotopicMZ value - an m/z value for the precursor ion.

precmz_xic float 32bit The m/z for the precursor ion derived from XIC data. Intensity weighted mean of the XIC data points with intensity > half the maximum intensity.

Nature Protocols: doi:10.1038/nprot.2015.101

29

rawdata/msmsheader continued

Column Name

Data Type Description

precmz_surv float 32bit The m/z for the precursor ion derived from MS1 spectra data.

rt float 32bit The retention time of the spectrum.

rtapex float 32bit The retention time of the matching XIC peak apex.

scanapex integer 32bit The spectrum number for the XIC peak apex.

charge integer 32bit The charge state of the precursor ion.

precsd float 32bit The standard deviation of the data point used to calculate the precmz_xic.

id_spec integer 32bit The spec_id of the spectrum where the sequence data is located. When units have multiple spectra one can contain the sequence information and the other the quantification data.

quan_spec integer 32bit The spec_id of the spectrum where the quantification data is located. When units have multiple spectra one can contain the sequence information and the other the quantification data.

survey_spec integer 32bit Spectrum identifier for the MS1 spectrum prior to the MS/MS event.

inten float 32bit The apex ion intensity of any linked XIC peak.

area float 32bit The maximum ion area of any linked XIC peak.

setmass float 32bit The m/z set by Xcalibur as the center of the isolation window.

fwhm float 32bit The width of the XIC peak at half maximum intensity.

c13iso integer 32bit The number of 13C isotopes identified in the isolation window.

s2i float 32bit The S2I value for the MS/MS event, this is interpolated from the two flanking MS1 spectra.

s2iearly float 32bit The S2I value for the precursor ion measured in the MS1 event prior to the MS/MS event.

s2ilate float 32bit The S2I value for the precursor ion measured in the MS1 event after to the MS/MS event.

c12 float 32bit The intensity of the precursor 12C ion, this is interpolated from the two flanking MS1 spectra.

c12early float 32bit The 12C intensity value for the precursor ion measured in the MS1 event prior to the MS/MS event.

c12late float 32bit The 12C intensity value for the precursor ion measured in the MS1 event after to the MS/MS event.

thresh float 32bit The Xcalibur intensity cut off (best estimation of noise) level for the MS/MS event, this is interpolated from the two flanking MS1 spectra.

threshearly float 32bit The Xcalibur intensity cut off for the precursor ion measured in the MS1 event prior to the MS/MS event.

threshlate float 32bit The Xcalibur intensity cut off for the precursor ion measured in the MS1 event after to the MS/MS event.

fragmeth string 5 characters

The fragmentation method used in this MS/MS event.

fragenergy float 32bit The energy used for fragmentation for this MS/MS event.

sumrepint float 32bit Sum of the intensities of all identified reporter ions.

sumreparea float 32bit Sum of the areas of all identified reporter ions.

smooth float 32bit The maximum smoothed intensity for the XIC peak.

integ float 32bit The integrated ion intensity for ions with intensities > half the maximum intensity the XIC peak.

Nature Protocols: doi:10.1038/nprot.2015.101

30

rawdata/noise

Column Name Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

mz float 32bit The m/z value for this measurement point.

noise float 32bit The intensity cut off level at the m/z. This is the best estimate of noise.

baseline float 32bit ?

rawdata/parameters

Column Name Data Type Description

set string 30 characters What part of the system the parameters are from, e.g. LC, Instrument, Tune.

subset string 30 characters The grouping of the parameters e.g. analytical, EXPERIMENT, Scan Event.

parameter string 60 characters The name of the parameter.

value string 100 characters The value of the parameter.

rawdata/profile

Column Name Data Type Description

id string 12 characters The name of the LC.

time float 32bit The time point.

time_unit string 3 characters The unit of time used in time.

percent_b float 32bit The fraction of solvent B in the LC output.

flow float 32bit The solvent flow rate.

flow_unit string 6 characters The unit of flow used in flow

rawdata/quan

Column Name

Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

isolabel_id integer 32bit The identifier of the isotope label.

area float 32bit The area of the quantification ion.

corrected float 32bit The intensity value after isotope interference correction.

inten float 32bit The intensity value of the isotope label.

survey_id integer 32bit The MS1 spectrum prior to the MS/MS event.

mzdiff float 32bit The m/z difference between the theoretical ion and the identified ion.

ppm float 32bit mzdiff expressed in parts per million.

coalescence float 32bit The measured degree of coalescence between adjacent label ions. Scale between 0 no coalescence and 1 fully coalesced.

rawdata/specparams

Column Name Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

set string 30 characters The Xcalibur group of parameters: either "filter" or "Extra".

parameter string 60 characters The parameter name.

value string 200 characters The parameter value.

rawdata/spectra

Column Name Data Type Description

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

rt float 32bit The retention time of the spectrum.

type string 5 characters Type of spectrum MS1, MS2 or MS3.

Nature Protocols: doi:10.1038/nprot.2015.101

31

rawdata/xicbins

Column Name Data Type Description

bin integer 32bit The m/z bin, this is calculated from the m/z using the binsPerDa parameter.

specid integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

rt float 32bit The retention time of the spectrum.

mz float 32bit The m/z of the ion, after recalibration if performed.

inten float 32bit The intensity of the ion.

area float 32bit The ion area, calculated by integrating the intensities of the ion.

mzraw float 32bit The m/z of the ion.

rawdata/units

Column Name Data Type Description

unit integer 32bit The ID of the unit.

order integer 32bit The order number of the unit MS/MS event.

activation string 8 characters

The fragmentation method used.

acttime float 32bit The time allowed for collisions.

energy float 32bit The energy used in the fragmentation.

isolation float 32bit The isolation window setting.

lowmz float 32bit Lowest m/z for the MS/MS event.

resolution integer 32bit Operating resolution of the instrument.

use string 5 characters

How the spectrum data should be used I = identification only, Q = quantification only & IQ = quantification and identification.

scanevents string 100 characters

The scan events from Xcalibur method linked to this.

colenergysteps integer 32bit The steps in collision energy if this option is being used.

colenergywidth float 32bit Type range of collision energies used if multiple collision energies are being used.

Mascot Data Section

F000000_dat/bestprot

Column Name Data Type Description

parameter string 20 characters The name of the parameter.

type string 5 characters The data type of the parameter (string, float etc.).

value string 150 characters The value of the parameter.

F000000_dat/config

Column Name Data Type Description

set string 30 characters The config file group.

parameter string 60 characters The parameter name.

value string 150 characters The parameter value

F000000_dat/index

Column Name Data Type Description

section string 20 characters Mascot dat file section name.

linenumber integer 32bit The number of the line in the Mascot dat file where the section starts.

F000000_dat/masses

Column Name Data Type Description

name string 10 characters

Name of the element or peptide terminal endings or single letter code of the amino acid.

mass float 32bit Mass of the named element / peptide terminal endings / amino acid

Nature Protocols: doi:10.1038/nprot.2015.101

32

F000000_dat/parameters

Column Name Data Type Description

section string 10 characters Section name from the MascotParser.cfg file.

parameter string 30 characters The parameter name.

value string 100 characters The parameter value

F000000_dat/mods

Column Name Data Type Description

id string 1 characters

Mascot’s identifier for the modification. This is a number for variable modifications and an amino acid code for fixed modifications.

name string 50 characters

The name of the modification.

modtype string 10 characters

Whether the modification is fixed or variable.

da_delta float 32bit The mass difference that the modification makes.

amino string 20 characters

The amino acid specificity or terminal specificity.

neutralloss float 32bit The masses of any neutral losses.

nlmaster float 32bit The mass of the main neutral loss.

relevant integer 32bit Is the modification biologically relevant.

composition string 100 characters

The elemental composition of the modification

F000000_dat/peptides

Column Name Data Type Description

query integer 32bit The mascot assigned spectrum identifier.

pepno integer 32bit The Mascot assigned order of this peptide match to this query.

sequence string 255 characters

Peptide's amino acid sequence.

is_hook integer 32bit Equals 1 if this is a hook peptide.

useinprot integer 32bit Equals 1 if the peptide is suitable to use in protein inference.

modsVariable string 255 characters

Description of the Mascot identified variable modifications. e.g. 2 Oxidation (M); 1 TMT6plex (N-term).

modsFixed string 255 characters

Description of the Mascot identified fixed modifications. e.g. 1 Carbamidomethyl (C); 1 TMT6plex (K).

mass float 32bit Calculated molecular mass of the peptide sequence.

da_delta float 32bit Mass difference between the sequence mass (mw) and the neutral_mass in Da. Corresponds to Mascot delta.

score float 32bit The Mascot score.

misscleave integer 32bit Number of sites on the peptide sequence where expected cleavage by the protease failed to occur.

numionsmatched integer 32bit The number of ions from the sequence that match the spectrum data.

seriesfound string 255 characters

Mascot seriesfound parameter.

peaks1 integer 32bit Mascot peaks1 parameter.

peaks2 integer 32bit Mascot peaks2 parameter.

peaks3 integer 32bit Mascot peaks3 parameter.

Nature Protocols: doi:10.1038/nprot.2015.101

33

F000000_dat/proteins

Column Name Data Type Description

accession string 15 characters The protein identifier used by Mascot.

mass float 32bit The mass of the protein sequence.

name string 255 characters The name of the protein.

F000000_dat/statistics

Column Name Data Type Description

statistic string 20 characters The name of the statistic.

value float 32bit The value of the statistic

F000000_dat/queries

Column Name Data Type Description

query integer 32bit The mascot assigned spectrum identifier.

spec_id integer 32bit Spectrum identifier; corresponds to the Xcalibur spectrum number.

msms_id string 10 characters

spec_id reformatted in text format (F000000).

rt float 32bit The retention time of the spectrum.

prec_neutmass float 32bit The precursor m/z turned into zero charge state.

prec_mz float 32bit The precursor m/z.

prec_charge integer 32bit The precursor charge state.

matches integer 32bit Number of peptides found by Mascot within the mass accuracy of the precursor.

homology float 32bit The Mascot score required for a homology match.

numpeps integer 32bit A count of the number of peptides with valid sequences returned by Mascot for this spectrum.

delta_seq float 32bit Difference in Mascot scores between the top Mascot sequence match and the next highest scoring sequence match. If delta_seq equals zero or if it is above the delta_seq_threshold its quant values may be used for protein quantification.

delta_mod float 32bit Difference in Mascot scores between the best two modification variants of this sequence. See ref – pubmed #21057138.

F000000_dat/seq2acc

Column Name Data Type Description

sequence string 255 characters The peptide amino acid sequence.

accession string 15 characters The protein accession.

hook integer 32bit Equals 1 if this peptide has been identified as a hook peptide.

hookscore float 32bit The best score of any hook identifications of this peptide.

pepscore float 32bit The best score of any identifications of this peptide.

bestczrank integer 32bit The best position (Mascot pepno) of any identifications of this peptide.

numpeps integer 32bit The number of times this peptide has been identified.

start integer 32bit The start location of the first occurrence of this peptide in the protein.

end integer 32bit The end location of the first occurrence of this peptide in the protein

F000000_dat/unimodaminoacids

Column Name Data Type Description

code string 6 characters Three letter amino acid code or terminus.

name string 20 characters Full amino acid name or terminus.

title string 6 characters One letter amino acid code or terminus (how the entity is used in Mascot.

monomass float 32bit The monoisotopic mass of the amino acid residue / terminus.

avgmass float 32bit The average mass of the amino acid residue / terminus.

composition string 100 characters The elemental composition of the amino acid residue / terminus.

F000000_dat/unimodelements

Column Name Data Type Description

Nature Protocols: doi:10.1038/nprot.2015.101

34

name string 20 characters The name of the element.

title string 5 characters The symbol used for the element.

monomass float 32bit The monoisotopic mass of the element.

avgmass float 32bit The average mass of the element.

F000000_dat/unimodmodifications

Column Name Data Type Description

modid integer 32bit The Mascot modification identifier.

name string 50 characters The name of the modification.

title string 50 characters The short name as used by Mascot.

delta_mono_mass float 32bit The difference in monoisotopic mass that the modification makes.

delta_avge_mass float 32bit The difference in average mass that the modification makes.

delta_comp string 100 characters The difference in elemental composition that the modification makes.

F000000_dat/unimodspecificity

Column Name Data Type Description

modid integer 32bit The Mascot modification identifier.

group integer 32bit The Mascot group for the modification. This allows Mascot to consider multiple sites as one modification.

site string 20 characters

The amino acid or terminus where the modification can be found.

position string 20 characters

Location that the modification can occur within the peptide / protein.

classification string 30 characters

Biological classification of the potential source of this modification.

hidden string 5 characters

true / false if the modification is hidden and not to be used.

nl_mono_mass float 32bit The difference in monoisotopic mass that the neutral loss makes.

nl_avge_mass float 32bit The difference in average mass that the neutral loss makes.

nl_comp string 100 characters

The difference in elemental composition that the neutral loss makes.

Nature Protocols: doi:10.1038/nprot.2015.101

35

Appendix 4: Directory structure of isobarQuant The following directory structure is created when the contents of the zip file are extracted. Each

isobarQuant

│ LICENSE AGREEMENT.docx

│ outputResults.cfg

│ outputResults.py

│ parseUniprotFasta.py

│ postMascot.cfg

│ postMascot.py

│ preMascot.cfg

│ preMascot.py

│ QuantMethod.cfg

│ README.txt

├───CommonUtils

│ averagine.py

│ ConfigManager.py

│ elements.cfg

│ ExceptionHandler.py

│ hdf5Base.py

│ hdf5Config.py

│ hdf5Mascot.py

│ hdf5Results.py

│ hdf5Tables.py

│ ionmanipulation.py

│ LoggingManager.py

│ MascotModificationsHandler.py

│ MathTools.py

│ progressbar.py

│ QuantMethodHandler.py

│ tools.py

│ XICprocessor.py

│ __init__.py

├───pyMSsafe

│ czXcalibur.pyd

│ datastore.py

│ mgf.cfg

│ mgf.py

│ pyMSsafe.cfg

│ pyMSsafe.py

│ SpectrumProcessor.py

│ xcalparser.py

│ xRawFile.py

├───mascotParser

│ datfile.py

│ datparser.py

│ hdf2file.py

│ mascotparser.cfg

│ mascotParser.py

│ peptide.py

│ protein.py

│ unimodxmlparser.py

├───proteinInference

│ fileData.py

│ hdfFile.py

│ peptideset.py

│ peptidesetmanager.py

│ proteininference.cfg

│ proteinInference.py

└───proteinQuantification

corrects2iquant.cfg

correctS2Iquant.py

proteinquantification.cfg

quantifyProteins.py

Nature Protocols: doi:10.1038/nprot.2015.101

36

Appendix 5: Trouble shooting symptom Cause Remedy

all FDR values for protein and peptide show 0

No decoy data (reverse hits) present in the .fasta file used for searching

Add decoy hits to .fasta file on Mascot server (see http://www.matrixscience.com/help/decoy_help.html)

postMascot exits saying ‘No *.dat files found for processing.’

User forgot to export Mascot .dat files to the ‘myproject’ data directory User entered the wrong datadir name

Check that value given as argument for --datadir exists and contains .dat files

No quantification information for proteins

Mascot modification name for quant and mascot_name in .cfg file different Quant method not set correctly in QuantMethod.cfg file

Check QuantMethod.cfg file. Make ‘mascot_name’ is present in Mascot’s modification name. Make sure the masses of the reporter ions are correct for your quantification method (see section “Adding New Quantification Methods” this manual

peptide flag ‘in_protein_inference’ set to 1 but actual FDR is below value set

Protein inference is performed on the ‘best’ peptide sequences in protein group. However, any reporter signals will not be used for protein quant.

Check for the other in the protein group whose FDR value is above threshold set. Check that ‘in_quantification_of_protein’ is set to 0.

'python' is not recognized as an internal or external command, operable program or batch file.

Python has not been installed The Python interpreter is not in the system path.

Follow part one of this guide to re install python. Make sure “C:\python27\” is in Path. Edit Control Panel\All Control Panel Items\SystemAdvance settingsEnvironment Variables Path

Error during preMascot workflow ‘This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.’

Xcalibur foundation software is not installed on the workstation where isobarQuant is running.

Please install Xcalibur foundation software and try again

Nature Protocols: doi:10.1038/nprot.2015.101