

UPTEC X 04 036 ISSN 1401-2138 AUG 2004

RAGNAR STOLT

Peptide mapping by capillary/standard LC/MS and multivariate analysis

Master’s degree project


UPTEC X 04 036    Date of issue: 2004-08

Author: Ragnar Stolt

Title (English): Peptide mapping by capillary/standard LC/MS and multivariate analysis

Abstract: The potential of LC/MS peptide mapping combined with multivariate analysis was investigated using IgG1 as a model protein. Five batches of IgG1 were exposed to different levels of an oxidizing agent. A method to detect differences between the batches using solely MS data was developed and successfully applied. Four peptide fragments containing methionine residues were found to represent the most significant differences and were characterized using MS/MS. Principal Component Analysis (PCA) was used to evaluate different computational strategies. Attempts were also made to use the information from the whole LC/MS space.

Keywords: Peptide Mapping, LC/MS, PCA, PTM, IgG1, Genetic Algorithms, Matlab Programming

Supervisors: Rudolf Kaiser, AstraZeneca, Analytical Development, Södertälje

Scientific reviewer: Per Andrén, Uppsala University, Laboratory for Biological and Medical Mass Spectrometry

Language: English

ISSN: 1401-2138

Pages: 47

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala. Box 592, S-75124 Uppsala. Tel +46 (0)18 4710000, Fax +46 (0)18 555217

Molecular Biotechnology Programme

Uppsala University School of Engineering


Peptide mapping by capillary/standard LC/MS and

multivariate analysis

Ragnar Stolt

Summary (Sammanfattning)

In the pharmaceutical industry it is important to develop analytical methods that can detect small differences between different samples of a protein drug. It must be possible to map which changes are introduced in the protein when it is, for example, stored at room temperature for a long time. Such changes can alter the properties of the protein and may even lead to an immune response with serious consequences. Traditionally, proteins have been characterized in protein chemistry by, among other methods, so-called peptide mapping. Peptide mapping consists of enzymatically cleaving a protein and analyzing the resulting peptide fragments with liquid chromatography. Each resulting chromatogram then corresponds to a fingerprint of the protein, and small differences between samples can be traced through small changes in the fingerprints. In this way one can determine whether differences exist, but not what they consist of. This study aims to further improve the possibilities of peptide mapping by analyzing the peptide fragments with a mass spectrometer. Small changes in the form of oxidation were introduced into a model protein. Using traditional statistical methods, stochastic, non-significant differences were filtered out, and the changes could be characterized by tandem mass spectrometry as oxidation of methionine. Considerable effort was devoted to developing algorithms that can handle the large and complex data sets that mass spectrometric data constitute.

Degree project, 20 credits, in the Molecular Biotechnology Programme

Uppsala University, August 2004


Table of contents

1 Introduction
1.1 Model Protein, Immunoglobulin G1
1.2 Peptide Mapping
1.2.1 Digestion
1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)
1.4 Mass Spectrometry (MS)
1.4.1 Ion Source
1.4.2 Time of Flight Analyzer
1.4.3 Tandem Mass Spectrometry
1.4.4 Hybrid Quadrupole Time of Flight
1.4.5 The Detector
1.5 Data Analysis
1.5.1 Normalization
1.5.2 Confidence Interval
1.5.3 Principal Component Analysis (PCA)
1.5.4 Genetic Algorithms
1.5.5 Wavelet Transformation
2 Material and Methods
2.1 Equipment and Chemicals
2.1.1 Chemicals
2.1.2 Equipment
2.2 Methods
2.2.1 Oxidation of Model Protein
2.2.2 Digestion of Model Protein
2.2.3 RP-HPLC
2.2.4 LC/MS
2.2.5 Design of Experiment
2.3 Data Analysis
2.3.1 Importing Data to Matlab
2.3.2 Approach 1: Collapsed Time Scale
2.3.2.1 Normalization
2.3.2.2 Principal Component Analysis (PCA)
2.3.2.3 Confidence Interval
2.3.2.4 Finding Oxidized Fragments
2.3.3 Approach 2: Timescale
2.3.3.1 Wavelet Denoising
2.3.3.2 Preprocessing Using Genetic Algorithms and Normalization
2.3.3.3 Bucketing
2.3.3.4 Confidence Interval
3 Results
3.1 Data Analysis
3.1.1 Approach 1: Collapsed Time Scale
3.1.1.1 Normalization
3.1.1.2 Principal Component Analysis (PCA)
3.1.1.2.1 Normalization with Normalization Parameter
3.1.1.2.2 Evaluating Auto Scaling
3.1.1.2.3 Comparing Normalization Techniques
3.1.1.3 Confidence Interval
3.1.2 Approach 2: Time Scale
3.1.2.1 Wavelet Denoising
3.1.2.2 Preprocessing
3.1.2.3 Confidence Interval
3.1.2.4 Bucketing
3.2 Tandem Mass Spectrometry
4 Discussion
5 Acknowledgements
6 References


1 Introduction

Today a number of different recombinant proteins are available on the pharmaceutical market. The breakthrough for recombinant techniques is often associated with the release of insulin produced in E. coli in 1982 [1]. Right from the beginning it has been important to develop methods to characterize and analyze recombinant proteins. There are problems when using recombinant techniques due to posttranslational modifications (PTMs). Eukaryotic organisms, especially humans, have developed a complex system for PTMs, and vital proteins will not function properly if these PTMs are missing. Prokaryotic organisms such as E. coli, on the other hand, do not perform any PTMs at all. Pharmaceutical companies therefore have to be able to detect differences between the product and the native form of the drug candidate protein. Differences from the native copy can lead to dysfunction of the protein drug and also to an unwanted immune response with hazardous consequences.

There is also a great need to investigate the quality of a protein drug. What kinds of modifications will be introduced in the protein when it is, for example, stored at room temperature for days? Perhaps a couple of amino acids in the protein will be oxidized, and others will undergo deamidation or deglycosylation. These questions need to be answered before a new protein drug is commercialized.

A common method to detect differences between protein batches is peptide mapping using RP-HPLC [2]. To facilitate data analysis a multivariate approach can be successful. Principal component analysis (PCA) is often used [3] to model variations in the data set, making it easier to detect e.g. outliers and to produce information concerning system reproducibility. It is also important to minimize stochastic and system-drift variations, especially when looking for small differences in the data set; otherwise it can be difficult to separate non-chemical variations from true physical differences in the protein. The UV data collected from the HPLC are, however, often not sufficient to disclose small variations in the data set. Furthermore, the UV chromatogram does not give any qualitative information: it is not possible, using this kind of data, to answer the question "Where on the protein are the modifications located and what do they consist of?".

To further extend the possibilities of peptide mapping, the univariate approach has to be abandoned and more physical information describing the properties of the protein needs to be gathered. One possibility to enlarge the amount of available information is to use an LC/MS system, gathering information not only in the time domain but also in the m/z domain, resulting in a bivariate peptide map instead of the traditional univariate UV map. Mass data (m/z) can also give qualitative information about the parts of the protein where the modifications are situated. Using MS/MS these parts can be analyzed further, and by comparison with a reference batch individual differing amino acids can be detected.

This project focuses on studying LC/MS peptide maps and developing computational methods to separate true chemical differences from noise without any a priori information. The differences found are then characterized using MS/MS.


1.1 Model protein, Immunoglobulin G1

Immunoglobulin G1 (IgG1, κ) was chosen as the model protein. The IgG molecule is very important to the immune defense system and is the most abundant antibody, with approximately 13.5 mg/ml in serum [4]. IgG binds to foreign molecules and thereby activates other members of the immune defense system. IgG consists of two major chains, a smaller light chain and a larger heavy chain, each represented twice (fig. 1). The chains are held together by a total of four disulfide bonds. The molecular mass of the IgG molecule used in this study is 145 kDa (without any PTMs) and there are 450 amino acids. There is an N-linked glycosylation site on each heavy chain.

To be able to evaluate the possibilities of an LC/MS peptide map, small chemical changes were introduced by oxidizing IgG. Comparing batches with different amounts of added oxidizing agent should reveal information about the potential of the analytical LC/MS system. The amino acid most sensitive to oxidizing agents is methionine. Oxidation of methionine produces methionine sulfoxide [5] in a reversible reaction. This oxidation corresponds to the addition of an oxygen atom, resulting in a 16 Da increase in mass. Increasing the concentration of oxidizing agent further can irreversibly oxidize methionine sulfoxide to methionine sulfone. There are six methionine residues in the amino acid sequence.
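As a back-of-the-envelope illustration (only the 16 Da increment comes from the text above; the worked numbers are added here), the shift actually observed in the mass spectrum depends on the charge state z of the tryptic peptide carrying the oxidized methionine:

$$ \Delta(m/z) = \frac{\Delta m}{z} \approx \frac{16\ \mathrm{Da}}{z} \approx \begin{cases} 16\ \mathrm{Th}, & z = 1 \\ 8\ \mathrm{Th}, & z = 2 \\ 5.3\ \mathrm{Th}, & z = 3 \end{cases} $$

This is why the search windows used later in the data analysis (section 2.3.2.4) are centered near 16, 8 and 5.3 Th.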

1.2 Peptide Mapping

Peptide mapping is a method used to create a "fingerprint" specific for a certain protein. The protein is digested with a suitable enzyme and the peptide fragments are separated using e.g. Reversed Phase High Performance Liquid Chromatography (RP-HPLC). Traditionally a UV detector is often chosen for data collection; in this study a mass detector was used.

1.2.1 Digestion

The digestion method has to be compatible with the chemical conditions required by the HPLC system and the mass spectrometer. It is important to develop a digestion routine with high reproducibility in order to be able to compare the results from different runs. The enzyme used has to digest the protein into a suitable number of peptide fragments: too many, too small fragments risk obstructing the data analysis and lowering the signal-to-noise ratio, whereas too few fragments decrease the amount of information that can be gathered from a peptide map.

Figure 1: Immunoglobulin G1


1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)

RP-HPLC is a widely used and well-established tool for the analysis and purification of biomolecules, e.g. a protein digest. The system uses high pressure to force a mobile phase through a column packed with porous microparticles. Particle sizes typically range between 3 and 50 µm; the smaller the particle diameter, the more pressure is generated in the system. The particle pore size generally ranges between 100 and 1000 Å. Smaller-pore silicas may sometimes separate small or hydrophilic peptides better than larger-pore silicas [6]. The most common columns are packed with silica particles to which different alkylsilane chains are chemically attached. Butyl (C4), octyl (C8) and octadecyl (C18) silane chains are the most commonly used; C4 is generally used for proteins and C18 for small molecules. The idea is that large proteins, with many hydrophobic moieties, need shorter chains on the stationary phase for sufficient hydrophobic interaction. The choice of column diameter depends on the required sample load and the flow rate. Small-bore columns (1.0 and 2.1 mm i.d.) can improve sensitivity and reduce solvent usage. Column length does not significantly affect most polypeptide separations [6]. To speed up the analytical cycle time, short columns with high flow rates and fast gradients can be used at the expense of resolution.

An HPLC system optimized for columns with small inner diameters and low flows is called micro-HPLC. A micro-HPLC system has narrow capillaries, typically 50 µm i.d., and the pumps commonly work with a split flow, enabling low flow rates with high accuracy. The advantages of micro-HPLC are mainly the reduction in mobile phase consumption and the high sensitivity, which makes it possible to load small amounts of sample, facilitating the connection to a mass spectrometer.

In this form of liquid chromatography the stationary phase is non-polar and the mobile phase relatively polar. Analytes are thus separated mainly according to their hydrophobic properties. During a gradient separation two different solvents are used as mobile phase: one relatively hydrophilic and one relatively organic (hydrophobic). The two solvents are mixed together and the relative content of the organic solvent increases with time. At the beginning of the gradient the analytes are attached to the stationary phase through hydrophobic interaction. When the organic content of the mobile phase reaches a critical value, desorption takes place and the analytes pass through the column. The majority of peptides (10 to 30 amino acid residues in length) have reached their critical value when the gradient reaches 30% organic content. The separation is, however, also influenced by molecular size: smaller molecules move more slowly through the column than larger ones, because smaller molecules have access to a larger volume of the column.

Figure 2: The idea behind gradient separation with RP-HPLC. The polypeptide enters the column at injection, adsorbs to the hydrophobic surface, and desorbs from the stationary phase when the organic solvent reaches a critical concentration.


The analytes' partitioning process between the mobile and stationary phases also influences the separation. However, it is quite safe to say that polar analytes elute first and non-polar analytes last. To obtain a separation based mainly on hydrophobic differences, an ion-pairing agent is often added to the mobile phase in order to serve one or more of the following functions: pH control, suppression of unwanted interactions between basic analytes and the silanol surface, suppression of unwanted interactions between analytes, or complexation with oppositely charged ionic groups. It has been shown [6] that the addition of an ion-pairing agent has a dramatic beneficial effect on RP-HPLC, not only enhancing separation but also improving peak symmetry. Trifluoroacetic acid (TFA) is a widely used ion-pairing agent; it is volatile and has a long history of proven reliability. To be able to connect the HPLC system to a mass spectrometer it is important to choose the ion-pairing agent with care, since ion suppression reduces the sensitivity of the mass spectrometer system.

Another effect that can influence peptide separations is temperature. Higher temperature is associated with increased diffusion according to the Stokes-Einstein diffusion constant D:

$$ D = \frac{k_B T}{6 \pi \eta r} \qquad (1) $$

where η is the viscosity, k_B is Boltzmann's constant, T is the temperature and r the radius of the diffusing particle. It is, however, difficult to draw any general conclusions, because it has been shown that an increase in temperature increases the resolution between certain analytes and decreases the resolution between others [6]. For good reproducibility a firm temperature control has to be applied.

RP-HPLC is one of the most widely used forms of chromatography, mainly because of its high resolution. Chromatographic resolution is defined as the ratio of the difference in retention time between two neighboring peaks A and B and the mean of their base widths:

$$ R_S = \frac{t_{R,B} - t_{R,A}}{(w_A + w_B)/2} = \frac{\Delta t_R}{\bar{w}} \qquad (2) $$

where t_R is the retention time and w the base width. Using RP-HPLC it is possible to separate peptides whose sequences differ by only a single amino acid residue.

1.4 Mass Spectrometry (MS)

A mass spectrometer is an analytical instrument that determines the molecular weight of ions from their mass-to-charge ratio m/z. The device consists mainly of three basic components: the ion source, the mass analyzer/filter and the detector.


1.4.1 Ion Source

Ionization is an essential part of the mass spectrometric process: the molecules have to be charged and in the gas phase in order to be accelerated by the electric fields inside the mass spectrometer. Several different ionization techniques have been developed. The techniques most often used for the analysis of peptides and proteins are matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). These are so-called soft ionization techniques, which means that the molecules are ionized without fragmentation. In this study ESI was used.

ESI creates a fine spray of highly charged droplets in the presence of a strong electric field. The sample solution is injected at a constant flow, which makes ESI particularly useful when the sample is introduced by an LC system; if the LC flow is compatible with the mass spectrometer, an online LC/MS system is easily established. The charged droplets are introduced into the mass analyzer compartment together with a dry gas, heat or both, which leads to solvent evaporation. As the droplets decrease in volume the charge density increases, and eventually the repulsion exceeds the surface tension and charged molecules start to leave the droplet via a so-called Taylor cone [7]. This process is conducted at atmospheric pressure and is sometimes also called atmospheric pressure ionization (API). Using ESI it is possible to study molecules with masses up to 150,000 Da, mainly because ESI generates multiply charged ions, which means that a low upper m/z limit is sufficient for the analysis of large biomolecules. A typical detection limit using ESI is in the femtomole range [7].

1.4.2 Time of Flight Analyzer

The most commonly used analyzers are quadrupoles, Fourier transform ion cyclotron resonance analyzers and time-of-flight (TOF) analyzers. In this study a TOF analyzer with a reflectron was used. The TOF analyzer has the simplest construction, based on the idea that the ions are accelerated through an electric field to the same kinetic energy:

$$ zU = \frac{1}{2} m v^{2} \qquad (3) $$

where U is the accelerating voltage. The ions therefore differ in velocity according to their mass-to-charge ratio, which in turn leads to different flight times from the ion source to the detector. One advantage of TOF instruments is that no scanning of the m/z spectrum is necessary; another is that there is virtually no upper mass limit. However, the resolving power of TOF instruments is low. Resolving power, also called resolution, is defined as the ability of a mass spectrometer to distinguish between different m/z ratios at a certain peak height. Looking at a single peak in the mass spectrum, the resolution is commonly defined as the ratio between the m/z value and the full width of the peak at half maximum. Analyzers with a reflectron can improve the resolution. A reflectron is a device with a gradient electrostatic field. This so-called "ion mirror" redirects the ion beam towards the detector. Ions with greater kinetic energy will penetrate


deeper into the reflectron than ions with lower kinetic energy. This mechanism compensates for a wide distribution of initial kinetic energies and thus increases the mass resolution.
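For illustration only (these specific values are not from the thesis): solving equation (3) for the velocity and assuming a field-free drift length L, the flight time of an ion is

$$ t = \frac{L}{v} = L\,\sqrt{\frac{m}{2zeU}} \approx 23\ \mu\mathrm{s} \quad \text{for } m = 1000\ \mathrm{Da},\ z = 1,\ U = 10\ \mathrm{kV},\ L = 1\ \mathrm{m}, $$

where e is the elementary charge, so heavier (or less charged) ions arrive correspondingly later at the detector.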

1.4.3 Tandem Mass Spectrometry

The peptide has to be fragmented in order to determine its sequence. Fragmentation can be achieved by inducing ion-molecule collisions in a process called collision-induced dissociation (CID). The idea behind CID is to select the peptide ion of interest and introduce it into a collision cell containing a collision gas, often argon, resulting in cleavage of the peptide backbone. From the resulting daughter-ion spectrum the m/z values of the constituent amino acids can be found. The peptide fragments can be divided into different series: when the charge is retained on the N-terminus (fig. 3), the resulting fragment series are called a_n, b_n and c_n; when the charge is retained on the C-terminus, fragmentation can likewise occur at three different positions, giving the x_n, y_n and z_n series.

1.4.4 Hybrid Quadrupole Time of Flight

To select the peptide ions that are to be investigated by CID, a quadrupole device can be used. A quadrupole consists of four parallel rods to which a direct current and a radio-frequency electromagnetic field are applied. When ions enter the quadrupole they start to oscillate depending on the radio-frequency field and their m/z value. Only ions with a particular m/z value are able to pass through the quadrupole; the rest collide with the rods. The quadrupole thus works as a mass filter, and by scanning the radio-frequency field an entire mass spectrum can be obtained.

Figure 3: Collision-induced dissociation (CID). Fragmentation of the peptide backbone produces the a, b and c series when the charge is retained on the N-terminus and the x, y and z series when it is retained on the C-terminus (structures omitted). Ions of the b and y series often dominate the daughter-ion spectrum.


The instrument used in this study is a quadrupole TOF hybrid (fig. 4). The quadrupole is used to select an ion of interest, which is fragmented in the collision cell. The resulting daughter ions are analyzed using a TOF device and a detector.

1.4.5 The Detector

The detector (fig. 5) converts the kinetic energy of the arriving ions into an electrical current. The amplitude of the current is correlated with the number of ions reaching the detector. Most detectors available today are based on the principle of electron multiplication. The detector in this particular instrument is a microchannel plate (MCP), which consists of a large number of electron multiplier tubes.

When a charged particle collides with the tube wall, secondary electrons are emitted and reflected further down the tube, leading to a spatially confined cascade of secondary electrons. The signal is typically amplified in the MCP detector by a factor of 10^3 to 10^4.

1.5 Data Analysis

1.5.1 Normalization

When using an LC/MS device, small differences in sample concentration, injection volume or loss of sensitivity will introduce variations in the data set that complicate the comparison between different batches. These variations can, however, be compensated for by normalizing the data set.

Figure 4: The concept behind a quadrupole-TOF system. Ions from the ESI source enter the quadrupole, where the ion of interest is selected; the selected ion is fragmented in the Ar collision cell, and the ion fragments are analyzed in the reflectron TOF before reaching the detector.

Figure 5: Microchannel plate detector. Ions strike a hollow glass capillary with a secondary-electron-emission coating, releasing secondary electrons that cascade down the tube.


Most normalization techniques, e.g. when treating HPLC data, are based on an internal standard or an external standard. Normalization in this context means that the data set is divided by the area or height of the standard peak. The peak with the largest area in the chromatogram can often be used as standard peak with good results. However, when analyzing MS data with a large number (up to thousands) of m/z values, normalization is not a trivial task. Which m/z value should be chosen as standard peak in order to produce the most accurate normalization? What if an m/z value with large variation, or, equally bad, with too little variation, is chosen? Under these conditions the normalized data set will poorly represent the true values. A better approach is to calculate intensity quotients between the m/z values of a reference sample and a target sample. The mean of these quotients can then be used as a normalization parameter. Averaging the quotients works as a low-pass filter (fig. 6), so only significant trends in the data set are represented in the normalization parameter, minimizing the impact of m/z values with large variation. This normalization technique works well provided that the number of m/z values is fairly large and the chemical differences between the batches are fairly small. Large chemical differences will slip through the low-pass filter and give rise to a skewed normalization.

1.5.2 Confidence Interval

A classic way of treating the problem of stochastic variation between different samples of the same batch is to estimate a confidence interval. Assuming that the observed variable is normally distributed, it is fairly easy to calculate the probability of finding the true mean value within the variation of the measured variable, or, the other way around, to calculate the limits within which the true mean value can be found with a certain probability. When two batches are compared, the confidence interval for the difference in mean intensity of the measured m/z values gives useful information.

Figure 6: The low-pass nature of a mean operation. ω corresponds to frequency and H to the transfer function; the magnitude response is plotted for n = 5 against frequency (rad/s). The z-transform clearly shows that only low-frequency components slip through the filter:

$$ \bar{x}(k) = \frac{x(k) + x(k-1) + \dots + x(k-(n-1))}{n} $$

$$ \xrightarrow{\;z\text{-transform}\;} \quad H(z) = \frac{1}{n}\left(1 + z^{-1} + z^{-2} + \dots + z^{-(n-1)}\right) $$

$$ \left| H\!\left(e^{j\omega}\right) \right| = \frac{1}{n}\left| \frac{\sin(n\omega/2)}{\sin(\omega/2)} \right| $$


If the calculated confidence interval ranges from a positive to a negative value, i.e. includes zero, the calculated difference cannot be declared statistically significant. By using this approach, non-significant changes can be removed.

1.5.3 Principal Component Analysis (PCA)

Principal component analysis is a multivariate projection method designed to extract and display the systematic variation of a data set [8]. The data set is composed of N observations and K variables. Examples of observations are samples of different batches or time points in a continuous process. The variables are often represented by different kinds of analytical results, e.g. UV data, NIR data or m/z data. Geometrically, the data set can be interpreted by representing each observation as a point in the K-dimensional orthogonal variable space (fig. 7), where each axis constitutes a variable. A new set of orthogonal variables is then introduced, where each new variable minimizes the residual variance of the observations by least squares; minimizing the residual variance is equivalent to maximizing the variance of the observations along the new variable axis. This new set of variables is called the principal components (PCs). It is possible to calculate as many PCs as there are variables. The Euclidean distance between the projection point of an observation on a PC and the PC center point is called the score value; each observation is thus represented by a single score value along each principal component. The PC space has the same number of dimensions as the original variable space. However, the dimensionality can be reduced by choosing for the PCA model the PCs that together describe most of the variance in the original data set. The degree of variance explained is called the cumulative variance. Two or three PCs are often sufficient, meaning that the original K-dimensional variable space has been reduced to a new variable space with two or three orthogonal axes without any significant loss of information. The eigenvalue of each PC is proportional to the variance explained by that particular PC and can thus be a useful tool when ranking the PCs.

Figure 7: Geometrical interpretation of PCA with only two variables; the score value and residual variance of an observation are indicated relative to PC 1 and PC 2. The cosine of the angle δ corresponds to the loading value of variable 1.


Another important value besides the score value is the cosine of the angle between an original variable axis and the new PC axis. This value is proportional to the importance of the original variable for the direction of the PC and is called the loading; each original variable gives rise to one loading value. In the resulting PCA plot it is easy to find relations between observations, and it is also possible to collect information about e.g. outliers and classification. The loading plot, where the loadings are plotted in PC space, reveals information about relationships between variables. To facilitate the interpretation of the PCA plot, the data are often mean-centered and auto scaled. Mean centering means that the average value of each variable is subtracted from the data set; after mean centering, the mean value of each variable is zero. Auto scaling means that the standard deviation is calculated for each variable and that the obtained scaling factor (1/σ_i, where σ is the standard deviation and i = 1, 2, 3, ..., K) is multiplied with each variable. By putting all variables on a comparable footing, no variable is allowed to dominate over another because of its variance. PCA is an efficient and nowadays common chemometric method for the decomposition of two-dimensional data sets; however, it is important to emphasize that PCA poorly represents nonlinear correlations.
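In formula form (the notation is introduced here for clarity and is not taken from the thesis), mean centering followed by auto scaling transforms each element x_{ik} of the N x K data matrix into

$$ \tilde{x}_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k}, \qquad i = 1,\dots,N,\quad k = 1,\dots,K, $$

where x̄_k and σ_k are the mean and standard deviation of variable k over all observations; every variable then has zero mean and unit variance before the PCA decomposition.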

1.5.4 Genetic Algorithms

When using RP-HPLC, subtle variations are introduced in the chromatographic profiles despite identical experimental conditions. These variations can be due to e.g. small changes in TFA concentration (remember that TFA is volatile), column temperature or degeneration of the column silica. Since these variations do not represent a true change in the sample but still affect the chromatogram, it can be difficult to draw analytical conclusions. Peak shapes, retention times and baselines are all variables that are exposed to small, non-sample-related variations. To compensate for these subtle variations, different alignment algorithms have been developed [10,21] that try to optimize the alignment between chromatograms by slightly altering peak shape and baseline structure.

Today many different mathematical techniques are described that deal with the optimization problem. If an explicit function describing the experimental system exists, optimization techniques such as Newton-Raphson or steepest descent can be used with success. These traditional iterative methods are, however, computationally demanding, and if the system is too complex to be described by an explicit function they will not be successful. The risk of finding a local optimum instead of the global one must also be considered when using these techniques. Another approach to the optimization problem is to ignore explicit relations and search the solution space with biased stochastic methods. A genetic algorithm (GA) is a typical example of such a stochastic optimization method that can handle fairly large and complex systems without enormous computational power [11].


Genetic algorithms simulate biological evolution and consider populations of solutions rather than one solution at a time. A reproduction process that is biased towards better solutions forms the next population, and after a certain number of generations, or when a specific criterion is met, the optimum is hopefully found. The first step in a genetic algorithm is to create an initial population. This can be done using a priori information or by random initialization. The created population of chromosomes can be, e.g. when studying energy minimization, coordinate vectors of the atoms involved. The next step is to evaluate the chromosomes and give each a specific fitness value; the chromosomes that produce the best solutions are given the highest fitness. The next population of chromosomes is a combination of the chromosomes in the preceding generation. The number of offspring each chromosome produces is proportional to its fitness, i.e. chromosomes with higher fitness have a greater impact on the qualities of the next generation than chromosomes with lower fitness. Mutations and cross-over effects are also introduced during the breeding process; these stochastic elements make it possible to escape a local optimum. Aligning target chromatograms against a reference is a typical problem that can be solved using genetic algorithms [12,3]. To evaluate the fitness of a chromosome, the Euclidean distance between the two chromatograms to be aligned can be used, as sketched below.
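To make the idea concrete, here is a minimal MATLAB sketch of a genetic algorithm that aligns a target chromatogram to a reference using a single integer time shift as the chromosome. The function name, population size, mutation rate and the use of a circular shift are illustrative choices only; the thesis alignment also adjusts peak shapes and baselines, and both input vectors are assumed to have the same length.

```matlab
function best_shift = ga_align(ref, target, max_shift, pop_size, n_gen)
% Toy genetic algorithm: align a target chromatogram to a reference by a
% single integer time shift. Fitness is the negative Euclidean distance
% between the reference and the (circularly) shifted target.
ref = ref(:); target = target(:);
pop = round((2*rand(pop_size, 1) - 1) * max_shift);   % random initial shifts
for g = 1:n_gen
    fit = zeros(pop_size, 1);
    for i = 1:pop_size
        fit(i) = -norm(ref - circshift(target, pop(i)));   % evaluate fitness
    end
    newpop = zeros(pop_size, 1);
    for i = 1:pop_size
        % Tournament selection: the better of two random parents survives
        a = ceil(rand * pop_size); b = ceil(rand * pop_size);
        if fit(a) >= fit(b), child = pop(a); else child = pop(b); end
        if rand < 0.2                                      % mutation
            child = child + round(2 * randn);
        end
        newpop(i) = max(-max_shift, min(max_shift, child));
    end
    pop = newpop;                                          % next generation
end
fit = zeros(pop_size, 1);                                  % final evaluation
for i = 1:pop_size
    fit(i) = -norm(ref - circshift(target, pop(i)));
end
[dummy, ibest] = max(fit);
best_shift = pop(ibest);
```

Called e.g. as best_shift = ga_align(tic_ref, tic_target, 20, 40, 50), it returns the shift that minimizes the Euclidean distance between the two traces.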

1.5.5 Wavelet Transformation

Normalization and genetic algorithms are not always sufficient when preprocessing LC/MS data. Stochastic noise often disturbs the interpretation of the chromatograms and introduces larger variations than acceptable; furthermore, the alignment genetic algorithm produces a better result if the raw data are denoised. Traditionally, in the field of signal analysis, denoising and compression of time-dependent data are done using different methods of Fourier transformation, e.g. the Fast Fourier Transform (FFT) or the Discrete Fourier Transform (DFT) [13]. These methods transform the signal from the time-dependent space to a frequency-dependent space (Fourier space). Using the information from Fourier space, a low-pass filter can be applied and high-frequency noise can easily be removed; the filtered signal can then be analyzed in the time-dependent space via the inverse Fourier transform. Fourier space also reveals how the energy of the signal is distributed over different frequencies. Frequency components representing only a small part of the energy can be removed without losing any significant information, thus compressing the original signal. However, Fourier transformation is not capable of coping with non-stationary signals, where the nature of the signal's frequency components changes over time. Solving this problem by Fourier transforming small time portions of the signal is hazardous to resolution because of Heisenberg's uncertainty principle: a good resolution in the time domain leads to a miserable resolution in the frequency domain. A better approach when studying this type of signal, e.g. chromatograms, is to use wavelet analysis.


Wavelet analysis is a technique that, in contrast to Fourier transformation, preserves time information and is capable of revealing aspects of the data such as trends, discontinuities in higher derivatives and self-similarity. With wavelet transformation, Heisenberg's uncertainty principle does not cause any problems, since wavelet analysis is a multiresolution technique where the resolution is proportional to the frequency [14]. A wavelet is a waveform of effectively limited duration that has an average value of zero (fig. 8). Wavelets tend to be irregular and asymmetric.

Figure 8: Example of a mother wavelet: Daubechies 2 (db2).

Wavelet analysis can be summarized as the process of describing a signal via a number of shifted and scaled versions of the so-called mother wavelet. Mathematically, the wavelet transform can be described as the inner product of the test signal with the basis functions:

$$ CWT_x^{\psi}(\tau, s) = \Psi_x^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) dt \qquad (4) $$

where ψ corresponds to the mother wavelet, s is the scale, τ is the translation and x is the test signal. The basis functions are the scaled and translated versions of the mother wavelet. This definition shows that wavelet analysis is a measure of the similarity, in the sense of frequency content, between the basis functions and the signal itself. The calculated wavelet coefficient refers to the closeness of the signal to the wavelet at the current scale. The resulting coefficients have a scale and a translational component: the scale component describes the inverse of the frequency and the translational component describes the time domain of the signal, i.e. the coefficients describe the frequency components of the signal at all time points. Using discrete implementations of the CWT makes it easy to compress or denoise a signal via low-pass filtering [15].
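As a minimal illustration of the discrete wavelet idea, the sketch below implements a single-level Haar transform with hard thresholding. It is only a toy: the thesis work uses Daubechies wavelets and Matlab's Wavelet Toolbox, and the function name and threshold argument thr are illustrative choices.

```matlab
function xd = haar_denoise(x, thr)
% One-level Haar wavelet denoising (sketch). Detail coefficients with
% magnitude below thr are set to zero (hard thresholding) before the
% signal is reconstructed.
x = x(:);
if mod(length(x), 2) ~= 0
    x = [x; x(end)];                       % pad to an even length
end
odd  = x(1:2:end); even = x(2:2:end);
a = (odd + even) / sqrt(2);                % approximation (low-pass) coefficients
d = (odd - even) / sqrt(2);                % detail (high-pass) coefficients
d(abs(d) < thr) = 0;                       % remove small, noise-like details
xd = zeros(size(x));                       % inverse Haar transform
xd(1:2:end) = (a + d) / sqrt(2);
xd(2:2:end) = (a - d) / sqrt(2);
```

Small detail coefficients mostly encode high-frequency noise; repeating the decomposition on the approximation coefficients gives a multi-level version of the same idea.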


2 Material and Methods

2.1 Equipment and Chemicals

2.1.1 Chemicals

IgG1 was supplied dissolved in 10 mM acetic acid¹. Guanidine-HCl, analytical grade, was purchased from ICN Biomedicals. NH4HCO3, analytical grade, was purchased from BDH Laboratory Supplies. Trypsin was purchased from Promega and H2O2 from Acros Organics.

2.1.2 Equipment

Agilent 1100 micro-HPLC system
Micromass LCT
Micromass Quattro Ultima

¹ For reasons of confidentiality the name of the producing company cannot be mentioned. This clone of IgG is slightly modified compared with native clones.


2.2 Methods

2.2.1 Oxidation of Model Protein

Hydrogen peroxide (H2O2) was chosen as the oxidizing agent, since it has been shown [5] to oxidize proteins gently. Prior to oxidation the pH was adjusted with ammonium bicarbonate, NH4HCO3 (AmBic): 30 µl of 1 M AmBic was added to 250 µl IgG solution (1.1 µg/µl) [16]. As oxidizing agent, 45.5 µl H2O2 (35% w/v) in 955 µl H2O was used. This oxidizing agent was diluted by adding 100 µl of it to 300 µl H2O; the resulting reagent is called the 1:1 ox-agent and was diluted even further according to Table 1. Five batches, each with 50 µl of pH-corrected IgG solution, were prepared using the following scheme:

Table 1: Oxidization scheme.

Batch nr   Added reagent    Added volume (µl)
1          H2O              1.0
2          1:30 ox-agent    1.0
3          1:20 ox-agent    1.0
4          1:10 ox-agent    1.0
5          1:1 ox-agent     1.0

The oxidized batches were incubated for 10 min at 4°C and then evaporated in a SpeedVac (40 min, 32°C). The remaining pellets were stored at -75°C until further analysis.

2.2.2 Digestion of Model Protein

The pellets with more or less oxidized IgG were dissolved by thorough vortexing in 5 µl 6 M guanidine-HCl in order to denature the protein. To enhance denaturation the batches were preincubated for 75 min at 65°C. Prior to digestion, 5 µl 1 M AmBic and 40 µl H2O were added. The final pH was approximately 8, which corresponds to the pH optimum of the enzyme. The final guanidine-HCl concentration was 0.6 M and the AmBic concentration 0.1 M; AmBic is volatile, and the remaining salt concentrations should be low enough to avoid sensitivity loss during mass spectrometry. For the cleavage reaction, 1.25 µl of trypsin (1 µg/µl) was added, corresponding to an enzyme-to-substrate ratio of 1:40 (w/w). The batches were incubated for 15 h at 37°C.

2.2.3 RP-HPLC

A systematic approach to the problem of optimizing an HPLC separation is to use factorial design together with a suitable optimization algorithm, e.g. multisimplex [17].


However, complex systems such as peptide digests are not easily handled with this approach within a limited period of time, so in this study a more empirical trial-and-error method was used instead. The slope of the gradient, the flow rate and the column temperature are all variables that have to be considered when optimizing an HPLC separation. Acetonitrile with 0.05% TFA was used as the organic mobile phase and water with 0.05% TFA as the hydrophilic mobile phase. The following conditions (Tables 2 and 3) were found to give acceptable separation within a reasonable period of time and were used throughout the study:

Table 2: Gradient conditions.

Time (min)   % organic mobile phase
0            2
30           62
35           90
38           90
40           2
50           2

Table 3: System conditions.

Flow rate:            56 µl/min
Column:               Zorbax Extend-C18, 1 mm x 150 mm, 3.5 µm
Sample temperature:   8°C
Column temperature:   32°C

2.2.4 LC/MS

The outlet capillary from the HPLC system was connected to the inlet of the mass spectrometer. The two systems were controlled independently from two different computers: the operating program controlling the mass spectrometer was MassLynx 4.0, and the corresponding program for the HPLC system was ChemStation (2002). The mass spectrometer was tuned and calibrated with NaI following the standard procedure described in [18]. The resolution was found to be 3700, which is regarded as low for this specific instrumental setup. Mass data were collected from m/z = 300 Th to m/z = 1500 Th at a rate of 30 centroid spectra per minute.

2.2.5 Design of Experiment

Three samples of each batch were analyzed; due to a technical breakdown, only two samples of batch 1 and batch 5 were run. Theoretically the concentration of digested protein in the samples should be about 7 pmol/µl, i.e. 1 µl should be enough given the sensitivity of the mass spectrometer. However, an injection volume of 1 µl turned out to give almost no signal at all, so 10 µl was used instead. In order to reduce the influence of systematic errors, the samples were analyzed in a randomized order according to the following scheme (Table 4):


Table 4: Injection scheme.

Batch, sample   Randomized order   Injection volume (µl)
1               1                  10.0
1b              12                 10.0
1c              15                 Not run
2               3                  10.0
2b              9                  10.0
2c              2                  10.0
3               11                 10.0
3b              10                 10.0
3c              5                  10.0
4               8                  10.0
4b              6                  10.0
4c              4                  10.0
5               13                 10.0
5b              7                  10.0
5c              14                 Not run

2.3 Data Analysis

2.3.1 Importing Data to Matlab

All data processing was carried out using Matlab 6.1 (MathWorks Inc., USA). Two-dimensional LC/MS data (fig. 9), with one m/z spectrum for each time point, were exported from MassLynx to ASCII format via a software utility called Databridge. The ASCII file contains an array of approximately 160,000 elements in which the m/z spectra are stored in ascending time order (fig. 10). However, since only m/z values with intensities above the detection limit are collected, the spectra from different time points (fig. 10) do not have identical lengths. This has to be compensated for in order to obtain a matrix that represents the whole LC/MS space, with columns of equal length. An algorithm with the following pseudo code was constructed to solve this problem (a simplified MATLAB sketch is given after the pseudo code):

Step 1: Find the m/z value with the highest intensity at time point n.
Step 2: Collect intensities from all time points (including the present one) where this m/z value can be found within a small m/z window. If the m/z value cannot be found at a certain time point, insert zero intensity.


Step 3: Repeat steps 1 and 2 with the m/z value corresponding to the second highest intensity, and so on, until all intensities within time point n are analyzed.
Step 4: Remove all found intensities from the original data set. Repeat steps 1 to 4 for n = 1, 2, 3, ..., N (N is the maximum number of time points).
Step 5: Sort the resulting matrix in ascending m/z order.

The size of the m/z window was set to ±0.125 Thomson (Th). Prior to further analysis the actual peptide map was separated from additional data, i.e. the solvent peak in the beginning and the baseline after the last peptide peak; only data from t = 5 min to t = 21 min were saved.

Figure 9: The nature of LC/MS data (intensity as a function of time and m/z).

Figure 10: Importing data to Matlab.
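The following much simplified MATLAB sketch illustrates the import step. It assumes the raw data have already been parsed into a cell array scans, one cell per time point holding an n-by-2 array of [m/z, intensity] pairs, and it groups m/z values by a greedy clustering along the m/z axis rather than by the intensity-ordered search of the pseudo code above; all names are illustrative.

```matlab
function [mz_axis, M] = build_lcms_matrix(scans, win)
% Build a rectangular LC/MS matrix from centroided scans (simplified sketch).
% scans : cell array, one cell per time point, each an n-by-2 array [mz intensity]
% win   : m/z tolerance (e.g. 0.125 Th); values closer than this share a channel
T = length(scans);
all_mz = [];
for t = 1:T
    all_mz = [all_mz; scans{t}(:,1)];          % pool every observed m/z value
end
all_mz = sort(all_mz);
% Greedy clustering: start a new m/z channel whenever the gap exceeds win
mz_axis = all_mz(1);
for k = 2:length(all_mz)
    if all_mz(k) - mz_axis(end) > win
        mz_axis(end+1, 1) = all_mz(k);
    end
end
M = zeros(length(mz_axis), T);                 % rows: m/z channels, columns: time
for t = 1:T
    mz  = scans{t}(:,1);
    int = scans{t}(:,2);
    for k = 1:length(mz)
        [dummy, row] = min(abs(mz_axis - mz(k)));   % nearest m/z channel
        M(row, t) = M(row, t) + int(k);             % missing values stay zero
    end
end
```

The resulting matrix M has one row per m/z channel and one column per time point, with zeros wherever an m/z value was below the detection limit.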

2.3.2 Approach 1: Collapsed Time Scale

The first approach to detecting differences between the batches was to ignore the chromatographic scale and compare only m/z spectra. Each sample gives rise to an LC/MS matrix as described in 2.3.1. By projecting the data onto the m/z axis using Simpson's numerical integration method (5), each value on the m/z axis becomes proportional to the amount of the corresponding peptide fragment; instead of a matrix describing the whole LC/MS space, an m/z vector then describes each sample. Simpson's rule is shown in (5), where R_T is the truncation error and h is the distance along the x-axis between two neighboring data points:

$$ \int_a^b f(x)\,dx = \frac{h}{3}\left( f(x_0) + 4\sum_{k=1}^{n} f(x_{2k-1}) + 2\sum_{k=1}^{n-1} f(x_{2k}) + f(x_{2n}) \right) + R_T = S + R_T \qquad (5) $$
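A short MATLAB sketch of this projection is given below, assuming equally spaced scans h minutes apart and the matrix M from section 2.3.1 (rows = m/z channels, columns = time points); the function name and the handling of an even number of scans are choices made here, not taken from the thesis.

```matlab
function v = project_mz(M, h)
% Project an LC/MS matrix onto the m/z axis by integrating each m/z channel
% over time with the composite Simpson rule. Assumes equally spaced time
% points; an even number of intervals is required, so one scan may be dropped.
[nmz, T] = size(M);
if mod(T, 2) == 0
    M = M(:, 1:T-1);                       % drop last scan to get an even interval count
    T = T - 1;
end
w = ones(1, T);                            % Simpson weights: 1 4 2 4 ... 2 4 1
w(2:2:T-1) = 4;
w(3:2:T-2) = 2;
v = (h/3) * (M * w');                      % one integrated intensity per m/z value
```

Each element of v is then the time-integrated intensity of one m/z channel, i.e. a quantity proportional to the amount of the corresponding peptide fragment.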


2.3.2.1 Normalization

Prior to normalization one has to make sure that the m/z vectors contain exactly the same m/z values; otherwise different m/z values would be incorrectly normalized against each other. This is done by using the sorting algorithm described in 2.3.1, with m/z vectors instead of time points. The normalization algorithm can be described by the following pseudo code (a MATLAB sketch is given at the end of this subsection):

Step 1: Calculate the quotient between the intensity of m/z value n in the reference m/z vector and in the target m/z vector. This is called the normalization quotient. The quotient is only saved if it is smaller than cutoff1 and if both the reference value and the target value are larger than cutoff2.
Step 2: Repeat step 1 for n = 1, 2, 3, ..., N (N is the number of m/z values).
Step 3: Calculate the mean of the resulting quotients. This is called the normalization parameter.
Step 4: Multiply all intensities of the target m/z vector with the normalization parameter. The resulting vector is the new, normalized m/z vector.

Cutoff1 is set to 10 to avoid the influence of quotients that result from division by a fairly small denominator, a situation that occurs when an m/z value is close to the noise level. Cutoff2 is set to zero to avoid division by zero. Prior to normalization, the normalization algorithm was run with no cutoff1 restraint in order to find any correlation between the size of the quotients and the m/z values. It is also important to find out whether a correlation exists within the normalized matrix between the variance of the intensities and the m/z values, and whether the variance is correlated with the height of the intensities. However, since the variance (the square of σ in (6)) depends on the height of the measured intensity, two m/z values of different intensity cannot be compared directly. A better approach is to calculate the relative m/z intensities within each batch, i.e. steps 1 and 2 in the algorithm above, and compare the variances between them. As reference, the first normalized sample of each batch was chosen (i.e. samples 1, 2, 3, 4, 5). The resulting matrix with relative intensity values is referred to below as the relative intensity matrix. This approach makes it possible to compare the variance between variables of different size.
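A minimal MATLAB sketch of the normalization steps above (function and variable names are illustrative; the vectors are assumed to be sorted so that element n holds the same m/z value in both samples):

```matlab
function target_norm = normalize_mz(ref, target, cutoff1, cutoff2)
% Normalize a target m/z intensity vector against a reference vector by
% the mean intensity quotient (sketch of the pseudo code in 2.3.2.1).
% cutoff1: discard quotients larger than this value (e.g. 10)
% cutoff2: both intensities must exceed this value (e.g. 0)
q = [];                                   % collected normalization quotients
for n = 1:length(ref)
    if ref(n) > cutoff2 & target(n) > cutoff2
        quot = ref(n) / target(n);        % normalization quotient
        if quot < cutoff1
            q = [q; quot];                % keep accepted quotients
        end
    end
end
p = mean(q);                              % normalization parameter
target_norm = target * p;                 % rescale the whole target vector
```

With the settings used in the thesis this would be called as target_norm = normalize_mz(ref, target, 10, 0).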

2.3.2.2 Principal Component Analysis (PCA)

Principal component analysis was performed using built-in algorithms in Matlab, based on a singular value decomposition (SVD) method. The intensities at the different m/z values were chosen as variables and each sample represented an individual object. Pre-treatment of the data, i.e. mean centering and auto scaling, was performed and evaluated.
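A minimal sketch of how such an SVD-based PCA can be performed in Matlab is given below; D is the assumed name of the samples-by-variables data matrix, and the snippet is only an illustration of the general technique, not the exact code used.

% Assumed input: D - samples-by-variables matrix (one row per sample, one column per m/z value)
Xc = D - repmat(mean(D, 1), size(D, 1), 1);       % mean centering

s = std(D, 0, 1);                                 % column standard deviations
s(s == 0) = 1;                                    % guard against constant variables
Xa = Xc ./ repmat(s, size(D, 1), 1);              % auto scaling (unit variance)

[U, S, V] = svd(Xa, 'econ');                      % singular value decomposition
scores    = U * S;                                % sample coordinates (score plots)
loadings  = V;                                    % variable loadings
explained = 100 * diag(S).^2 ./ sum(diag(S).^2);  % variance explained per PC (%)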


2.3.2.3 Confidence Interval

At a significance level of 95%, i.e. when the probability that the measured variations belong to the normal distribution is more than 95%, the confidence interval is calculated as

\[ x \subset \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \]

where \(\bar{x}\) is the mean value of the measured m/z intensities within each batch, 1.96 represents 95% of the area under the standard normal (Gaussian) curve, \(\sigma\) is the standard deviation,

\[ \sigma = \sqrt{\frac{n\sum x_k^2 - \left(\sum x_k\right)^2}{n\,(n-1)}} \qquad (6) \]

and n is the sample size. Notice that an increment of the sample size will make the confidence interval narrower; in this study only three samples were collected. The confidence interval of the difference between two stochastic variables can be calculated as

\[ x - y \subset \left(\bar{x} - \bar{y}\right) \pm 1.96\,\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}} \qquad (7) \]

When comparing two batches, one is called reference and the other target, where the reference is defined as the batch believed to have undergone the smallest chemical change compared to the native state. The calculated interval of confidence is used to filter out the non-significant differences.
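As an illustration only, the significance test of eqs. (6)-(7) for a single m/z value could be written as follows in Matlab, with x and y denoting the replicate intensities of that m/z value in the reference and target batch.

% Assumed inputs: x, y - replicate intensities of one m/z value in the reference
%                        and target batch (here only a few replicates per batch)
mx = mean(x);   my = mean(y);
sx = std(x);    sy = std(y);            % eq. (6), n-1 in the denominator
nx = numel(x);  ny = numel(y);

d     = mx - my;                                 % observed difference of the means
halfW = 1.96 * sqrt(sx^2/nx + sy^2/ny);          % half-width of the 95% interval, eq. (7)
ci    = [d - halfW, d + halfW];

significant = (ci(1) > 0) || (ci(2) < 0);        % the interval does not contain zero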

2.3.2.4 Finding Oxidized Fragments

In order to confirm whether observed modifications are due to oxidization, an algorithm was constructed to swiftly filter out all differences that do not correspond to an increment in mass between reference and target. According to the oxidization reaction described in 1.1, this increment should correspond to the mass of oxygen or any of its possible m/z values. The pseudo code describing this algorithm is as follows (a sketch is given after Table 5):

Step 1: Calculate the difference at m/z value n between reference and target intensities using the procedure described in 2.3.2.3.
Step 2: If a confirmed difference with the opposite sign can also be found at m/z value n + window_i, the found differences are defined as oxidizations. Repeat this step for i = 1, 2, 3.
Step 3: Repeat steps 1-2 for n = 1, 2, 3, …, N−1, N (where N is the length of the m/z vector).

Window_i is an m/z window defined to allow variations in m/z, compensating for the limited m/z reproducibility, as defined in Table 5.



Table 5: Definition of oxidization windows.

i   Lower m/z   Upper m/z   z
1   15.9        16.1        1
2   7.95        8.03        2
3   5.3         5.6         3
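A minimal Matlab sketch of the oxidization search (steps 1-3, with the windows of Table 5) is given below; mz and diffSig are assumed to be a sorted column vector of m/z values and the corresponding vector of significant differences (reference minus target, zero where a difference was filtered out).

% Assumed inputs: mz      - sorted column vector of m/z values
%                 diffSig - column vector of significant differences (same length)
windows = [15.9 16.1; 7.95 8.03; 5.3 5.6];   % Table 5, rows correspond to z = 1, 2, 3
pairs = [];                                   % rows: [non-oxidized m/z, oxidized m/z, z]

for n = find(diffSig > 0)'                    % fragments more abundant in the reference
    for i = 1:size(windows, 1)
        hit = find(mz > mz(n) + windows(i, 1) & ...
                   mz < mz(n) + windows(i, 2) & diffSig < 0);   % opposite sign in window
        if ~isempty(hit)
            pairs = [pairs; mz(n), mz(hit(1)), i];   %#ok<AGROW>
        end
    end
end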

2.3.3 Approach 2: Time Scale

Collapsing the time scale means that all chromatographic information is lost. This might not be a problem when looking for chemical modifications that correspond to a fairly large change in m/z. However, modifications that correspond to a very small change in m/z, e.g. deamidation, which only gives rise to ∆(m/z) = 1 Th for z = 1 and ∆(m/z) = 0.5 Th for z = 2, will be much harder to characterize using this approach. It might therefore be beneficial to include chromatographic changes in the analysis: the combination of chromatographic and m/z information might enhance the possibility of finding subtle modifications. To investigate this hypothesis the whole LC/MS space was analyzed. Using this approach each sample was represented by the entire LC/MS matrix described in 2.3.1. Prior to further analysis the LC/MS matrixes were sorted using the procedure described in 2.3.1, but with entire LC/MS matrixes instead of single vectors, to make sure that every m/z value was represented in all matrixes. Summing all m/z intensities for each time point yields the total ion current (TIC) chromatogram.

2.3.3.1 Wavelet Denoising

Chromatographic data was denoised using a level 1 decomposition with the Daubechies 2 mother wavelet (fig. 8). Built-in Matlab denoising and reconstruction algorithms were used. Due to the low sampling frequency (only 30 m/z spectra per minute) no data compression was found to be necessary. The difference d(t) between the original and the denoised TIC data is a time-dependent function that corresponds to the wavelet operation. For every time point this difference is allocated onto the m/z intensities by adding d(t)/N (t = 5, 5.033, 5.066, …, 21; N = number of m/z values) to each m/z intensity. This denoising procedure is only valid under the approximation that noise is mainly due to chromatographic, time-dependent fluctuations, i.e. that all m/z values at a specific time point behave identically. The result is called the denoised LC/MS matrix, and the difference between the denoised LC/MS matrix and raw data is called the shift surface.
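A minimal Matlab sketch of this denoising and allocation step is given below, assuming the Wavelet Toolbox function wden and an LC/MS matrix X (m/z values as rows, time points as columns); the threshold rule is chosen for illustration only, and d(t) is taken here as denoised minus raw, so that adding d(t)/N to every m/z trace reproduces the denoised TIC.

% Assumed input: X - n_mz x n_time LC/MS matrix
tic0   = sum(X, 1);                                        % total ion current chromatogram
ticDen = wden(tic0, 'sqtwolog', 's', 'sln', 1, 'db2');     % level-1 Daubechies-2 denoising

d   = ticDen - tic0;                     % time-dependent correction d(t)
nMz = size(X, 1);
Xden = X + repmat(d / nMz, nMz, 1);      % allocate d(t)/N to every m/z intensity
shiftSurface = Xden - X;                 % constant along the m/z axis, as described above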

2.3.3.2 Preprocessing Using Genetic Algorithms and Normalization

There are many genetic algorithms available that are constructed to handle the alignment problem. The algorithm chosen in this study is described in detail in [3]. It is a combination of a peak shift alignment and a baseline correction algorithm. Inputs to the algorithm are a reference



chromatogram and a target chromatogram; the target chromatogram is aligned with the reference chromatogram. Sample 1 was chosen as reference throughout the study. The first step in the preprocessing procedure is to roughly align the target chromatograms against the reference chromatogram in order to facilitate normalization. This first alignment only uses the peak shift alignment part of the genetic algorithm, with 200 generations and a population of 60 individuals. The resulting difference between the original target and the aligned chromatogram was allocated onto the target LC/MS matrix according to the procedure described in 2.3.3.1. The second step is to normalize the target LC/MS matrix against the reference matrix. This is done by collapsing the time scale (as in 2.3.2) for each sample and multiplying the target LC/MS matrix by the normalization parameter (calculated as described in 2.3.2.1). The chromatogram corresponding to the normalized LC/MS matrix is once again aligned against the reference chromatogram using genetic algorithms, this time using not only the peak shift alignment part but also the baseline correction. Peak heights are allowed adjustments of at most 10% of the total height. Both the peak shift alignment part and the baseline correction part use 200 generations and a population of 60 individuals. Again the resulting difference between the original target and the aligned chromatogram was allocated onto the target LC/MS matrix according to the procedure described in 2.3.3.1. The resulting target matrix is called the preprocessed matrix.

2.3.3.3 Bucketing

To further reduce the influence of chromatographic differences, bucketing of the LC/MS matrix might be a successful approach. Bucketing means that the time scale is decomposed into several buckets, i.e. the time scale within each bucket window is collapsed using the procedure described in 2.3.2. In this study a bucket window of 10 time points was evaluated. Bucketing reduces the chromatographic resolution, but when excellent chromatographic resolution is not required, as is often the case with LC/MS data, bucketing is a much simpler and more straightforward method than genetic algorithms.
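A minimal Matlab sketch of the bucketing step, reusing the collapseTimeScale sketch from 2.3.2 to collapse each bucket window (a plain sum over the bucket would also do); X and t are assumed as before.

% Assumed inputs: X - n_mz x n_time preprocessed LC/MS matrix, t - matching time axis
bucketSize = 10;                                 % time points per bucket
nBuckets   = floor(size(X, 2) / bucketSize);
Xbuck      = zeros(size(X, 1), nBuckets);

for b = 1:nBuckets
    cols = (b-1)*bucketSize + 1 : b*bucketSize;
    Xbuck(:, b) = collapseTimeScale(X(:, cols), t(cols));   % collapse the bucket window
end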

2.3.3.4 Confidence Interval

Comparing two mean LC/MS matrixes from different batches is done by extrapolating the procedure described in 2.3.2.3 onto the entire LC/MS matrix. The result is also a matrix, in which differences can be found not only on the m/z scale but also on the chromatographic scale.


3 Results

3.1 Data Analysis

Under the conditions used in this study, the LC/MS space was found to be represented by a matrix with 1798 rows (m/z values) and 1050 columns (time points). The chromatographic TIC profile of the reference sample is shown in fig. 11. The time points between the arrows represent the peptide map, and LC/MS data from this region was saved for further analysis. The resulting LC/MS matrixes were found to have 1798 m/z values and 458 time points and are from here on referred to as raw data.

3.1.1 Approach 1: Collapsed Time Scale

Prior to analysis all intensities lower than 200 counts were ignored in order to reduce the influence of noisy m/z values. The resulting LC/MS matrixes were found to have 976 m/z values and 458 time points. Collapsing the time scale yields a unique vector with 976 m/z values for each sample.

Figure 11: Chromatographic TIC profile of the reference sample (t in minutes vs. intensity in counts/s).


3.1.1.1 Normalization

The following figure shows the normalization quotients with no cutoff1 restraint.

As can be seen in fig. 12 there appears to be a trend, however small, that a larger m/z value corresponds to a larger normalization quotient. The normalization quotients and the m/z values were fitted to linear curves using least squares algorithms [19]. The slopes of these curves are given in table 6.

Figure 12: Normalization quotients and fitted linear curves.


Table 6: Slopes of the linear curves describing the correlation between the m/z values and the normalization quotients.

Sample   Slope     Intercept with y-axis
1         0         1.00
1b        0.007     2.44
2         0.0004    1.10
2b        0.0003    1.40
2c        0.0004    1.82
3         0.0004    1.67
3b       -0.0004    1.52
3c        0.0005    1.25
4         0.0003    1.52
4b       -0.0003    1.51
4c        0.0006    1.32
5         0.0014    1.30
5b        0.0010    1.39

Standard deviation: 0.00048
Mean: 0.00041
Interval of confidence: -0.003, 0.0038

Since the interval of confidence describing the mean of the slopes contains zero, no statistical trend can be found regarding the correlation of m/z values and normalization quotients. Therefore it should be safe to calculate an unbiased mean of the normalization quotients and use this as a normalization parameter. Figure 13 shows the effects of normalization and it is obvious that normalization is necessary: non-normalized intensity bars are not comparable to reference data. Figure 13 shows only a portion of the m/z spectrum, but the trend is identical for most m/z values. Table 7 shows the normalization parameters when using cutoff1 = 10. The normalization parameters are all larger than one, which means that the reference sample generally has the largest intensities.


Table 7: Normalization parameters.

Sample   Normalization parameter
1        1.0000
1b       2.8075
2        1.3614
2b       1.5992
2c       2.1263
3        1.8882
3b       1.2394
3c       1.5658
4        1.6897
4b       1.2209
4c       1.7092
5        1.9833
5b       1.9503

Figure 13: Example of the effects of normalization. Normalized data sample 1b: blue bars. Raw data sample 1b shifted –0.1 m/z: red bars. Reference sample shifted –0.2 m/z: green bars.


The large variation and range of the normalization parameters can be interpreted as low reproducibility. Note that the normalization parameter for sample 1 is 1.0000, since sample 1 is also the reference sample. In order to evaluate this normalization technique, another identical data set was normalized by dividing each m/z vector by the highest intensity within each sample. Linear approximations of the batch-specific variance for the relative intensity matrix are shown in fig. 14. The slopes of the fitted curves are a measure of the size of the correlation: large positive slopes mean that large m/z values have larger variance compared to smaller m/z values. Two slopes are negative and three are positive; it is obvious without further statistical analysis that no correlation between variance and m/z values can be found in the data set.

Figure 14: Fitted curves of first order showing the correlation between m/z value and variance. The numbers correspond to different batches.

Figure 15: Intensity distribution of sample 1. dps corresponds to the number of m/z values.

Figure 16: Variance calculated for each intensity in the relative intensity matrix. Color code as in fig. 14. dps corresponds to the number of m/z values.


The distribution of intensities from sample 1 (fig. 15) is analogous for all samples and shows that the majority of the intensities are approximately of equal size, while a small portion is of much greater height than the rest. Higher intensity corresponds to larger variation according to fig. 16. Remember that this result is calculated from the relative intensity matrix; this means that m/z values with higher intensities have a non-proportionally larger variance compared to m/z values with lower intensities.

3.1.1.2 Principal Component Analysis (PCA)

3.1.1.2.1 Normalization with Normalization Parameter

As can be seen in fig. 17 the different batches are divided into distinct groups when normalization is used. Without normalization the variations within the batches cannot visually be separated from the variations between the batches.

Figure 17: PCA score plots. Mean-centered and auto-scaled raw data a) and c); mean-centered and auto-scaled normalized data b) and d). Cumulative variances: 87.6% and 78.5%.


A cumulative variance of 87.6% is regarded as good, while 78.5% is only acceptable.

3.1.1.2.2 Evaluating Auto Scaling

The initial variance of a variable can be interpreted as the squared "length" of that variable. Auto scaling often makes the comparison between different variables more accurate, since all variables are put on a comparable footing and no variable is allowed to dominate because of its "length". This is a standard technique used very frequently when the variables have different origins. However, if the measured variables all have the same unit, auto scaling can produce a skewed PCA model. Another reason not to apply auto scaling automatically is when variables with large variance actually are supposed to dominate the PCA model. MS data, with many noisy variables of little importance and short lengths, will be given a smaller impact on a PCA model without auto scaling. This can be seen in the score plots of non-auto-scaled data (fig. 18). Comparing fig. 18 with fig. 17 shows that it is easier to separate the objects into different batches in the non-auto-scaled score plots than in the auto-scaled score plots. Note also that the cumulative variances increase. It is however hard to draw any reliable conclusions because of the small number of samples in each batch.

Figure 18: Score plots without auto scaling. Mean-centered raw data a) and c); mean-centered normalized data b) and d). Cumulative variances: 84.3% and 88.6%.


3.1.1.2.3 Comparing Normalization Techniques

Normalizing against the highest intensity within each sample proves to be less successful than using the normalization parameter technique (2.3.2.1). Score plots of the data set normalized against the highest intensity are shown in fig. 19. It is more difficult to separate the batches into distinct groups in fig. 19 compared to the situation in fig. 18. The cumulative variance has also decreased.

There are other preprocessing techniques than mean centering and auto scaling. In this project logarithms (ln, log10) and root transformations of the data (square root and fourth-order root) were evaluated (results not shown here) but found to give no improvement in either normalization results or PCA plots compared to the results shown.

3.1.1.3 Confidence Interval

Calculating the differences of the mean m/z intensities between the reference batch and batch 4 yields 976 differences (fig. 20).

Figure 20: Differences between the average intensities of the reference batch and batch 4.

Figure 19: Score plots of normalized and mean-centered data when normalizing against the highest intensity in each sample. Cumulative variance: 60.7%.


The intervals of confidence for the calculated differences (fig. 21), based on the standard deviations within each batch, reveal that many differences were not significant and could thus be ignored. The remaining significant differences are shown in fig. 22; only 185 differences were found to pass the criterion of significance. Differences that were significant but still small in comparison with the maximum intensity of that specific variable were also removed. Differences larger than 2400 counts were spared if one or both of the reference and target intensities were smaller than 4000 counts. m/z values with both reference and target intensity higher than 4000 counts were only spared if the corresponding differences were higher than 5000 counts, in order to compensate for the fact that high intensities correspond to a non-proportionally large variance (fig. 16).

Figure 22: Significant differences, α = 0.05.

Figure 21: Confidence intervals for the calculated differences. Only a portion of the m/z vector is shown; * corresponds to the mean value of each difference.


The remaining 23 differences are shown in fig. 23.

Among the remaining m/z values a repeating theme could be found: close to a difference with a positive sign, a difference with the opposite sign could be found in the near neighborhood of ascending m/z values. This means that some fragments were more abundant in the reference batch, while other fragments with a slightly higher m/z value were more abundant in batch 4. At a closer look (table 8) these fragments were separated by an m/z value that corresponds to a possible addition of oxygen. With this a priori information the search for interesting differences could be narrowed even further. Looking only for differences due to oxidation resulted in six interesting pairs of peptide fragments (fig. 24).

Table 8: Significant differences larger than 2400 counts, repeating patterns are bold.

m/z           318.2    417.3    426.4    492.4    506.4    508.4    522.4    591.5
∆I (counts)   -4436    -2513    -3864    2820     6764     -2462    -3924    8771

m/z           607.4    619.5    786.6    835.6    851.6    890.3    898.2    940.2
∆I (counts)   -11013   3623     -3791    6158     -6071    -11346   -2530    3943

m/z           940.7    948.2    948.7    949.1    963      1171.3   1179.4
∆I (counts)   4215     -7230    -7175    -3611    2780     2852     -2463


Figure 23: Significant differences larger than 2400 counts.



Notice that one of the fragments marked with (*) is repeated. This pattern can appear if, for example, a salt ion has attached to the fragment. In this case ∆(m/z) corresponds to 18 Th (*), and there are several chemical reactions that can occur in the protein that correspond to this result [20]. Assuming that oxidization has taken place, z can also be calculated knowing the mass of oxygen (16 Da) and the difference in m/z value between the believed oxidized fragments and the corresponding non-oxidized copies (table 9): z = 16/∆(m/z), so that e.g. ∆(m/z) = 8 Th gives z = 2. Of the six fragments, four could be found in the known sequence, and all four contain methionine (table 10).

Figure 24: Differences that correspond to addition of an oxygen atom. Fragments discussed in the text are marked with (*) and (+).

Table 9: Differences that correspond to addition of an oxygen atom.

Positive sign (m/z)   Negative sign (m/z)   ∆(m/z)   z
492.4                 508.4                 16       1
591.5                 607.5                 16       1
835.6                 851.6                 16       1
940.2                 948.2                 8        2
1171.2                1179.2                8        2
Not found             310.2                 -        -

Table 10: Matching theoretical m/z values with the corresponding sequence. The two largest fragments containing methionine were not found in the experimental data irrespective of the charge; here their m/z values are shown for z = 3.

Theoretical m/z     Measured m/z   Sequence
1170.6 (z = 2)      1171.2         GLEWLGMIWGDGGNTDYNSALK
940.0 (z = 1)       940.2          DIQMTQSPSSLSASVGD
835.4 (z = 1)       835.6          DTLMISR
591.4 (z = 1)       591.5          VTMLK
Not found (z = 1)   492.4          -
Not found           310.2          -
1283.5 (z = 3)      Not found      WQQGNVFSCSVMHEALHNHYTQK
879.4 (z = 3)       Not found      VTALYAMDYWGQGSLVTVSSASTK


A closer look at the quotients between the intensities of the oxidized and non-oxidized fragments (fig. 25) clearly shows that there is a trend. Quotients larger than one mean that the intensity of the oxidized fragment is higher than that of the non-oxidized copy. It appears as if the batches, in ascending order, have been exposed to increasing oxidization. Using this information it is possible to find out which of the fragments are most sensitive to oxidization by comparing the initial levels of the oxidization quotients and how fast they grow.

Figure 25: Oxidization quotients plotted against the concentration of the oxidization agent. m/z values corresponding to the non-oxidized copy: a) m/z 1171, b) m/z 835.6, c) m/z 940.2, d) m/z 591.5, e) m/z 492.4.


The most sensitive fragment is m/z 835.6, while fragment m/z 1171.2 appears to be least sensitive to oxidization. The remaining fragments all have approximately the same oxidization characteristics. This result can be explained by considering the chemical environment around each fragment: fragments very sensitive to oxidization are probably more exposed to the surrounding environment, whereas non-sensitive fragments might be protected deep within the three-dimensional structure of the protein. Note that batch 5 (corresponding to a dilution coefficient of the oxidization agent of 1) was excluded from this calculation to avoid division by zero; in this batch the model protein appeared to be so oxidized that no non-oxidized fragments could be found. The large variation of the oxidization quotient in batch 4 (corresponding to a dilution coefficient of the oxidization agent of 0.1) can be explained by the observation that the intensity of the non-oxidized fragments in this batch is close to noise.

3.1.2 Approach 2: Time Scale

Examining the TIC chromatograms of raw data revealed that sample 5b most probably was an outlier (marked with an arrow in fig. 26). The outlier was not found when analyzing collapsed m/z data (3.1) suggesting that it is mainly deviation in time and not in m/z intensities that is responsible for the outlying behavior. Sample 5b was removed prior to further analysis.

3.1.2.1 Wavelet Denoising

Denoised TIC chromatograms (fig. 27) were found to have smoother characteristics than the original chromatograms, suggesting that at least a part of the unwanted noise has been removed.

Figure 26: TIC chromatograms of raw data. Blue: batch 1, red: batch 2, green: batch 3, yellow: batch 4, magenta: batch 5. The outlier 5b is marked with an arrow.


The denoising algorithm can be visualized by looking at the shift surface (fig. 28), i.e. the difference between the denoised LC/MS matrix and the raw data matrix. The shift surface is constant along the direction of the m/z axis, since only variations in the time domain are taken into consideration when wavelet denoising. This approximation is valid when the reproducibility in the m/z domain is much better than the reproducibility in the time domain, which is the case when using LC/MS.

Figure 27: Batch 2 raw data a) compared to denoised data b).


3.1.2.2 Preprocessing

The genetic algorithms and the normalization technique resulted in a dramatic improvement of the chromatographic alignment compared to raw data. Comparing fig. 27 and fig. 29 shows that, without preprocessing of the LC/MS matrixes, almost no true chemical differences between the samples can be detected because of the huge variations due to large stochastic differences along the chromatographic axis. The calculated normalization parameters are, except for the last three decimals, identical to those in table 7.

Figure 29: Preprocessed data from batch 2.

Figure 28: Shift surface from the reference sample.


3.1.2.3 Confidence Interval

Finding differences between preprocessed LC/MS matrixes was found to be easy (fig. 31), and even after removal of the non-significant differences a huge number of peptide fragments, with different m/z and at different retention times, turned out to have a non-negligible intensity difference. The result suggests that more differences can be found between the LC/MS matrixes than between the collapsed data. This is not surprising, because a unique m/z value can have several eluting peaks along the chromatographic scale; those peaks are regarded as just one when the time scale is collapsed. Small remaining differences along the time scale will also appear as individual spots on the LC/MS peptide map.

Figure 30: Preprocessed data; color code as in fig. 26.

Figure 31: Differences between batch 1 and batch 4. Differences without restraint a) and significant differences b).


3.1.2.4 Bucketing

Bucketing the LC/MS matrixes (fig. 32) will of course result in a reduction of time resolution, but also in a reduction of the differences that are due to chromatographic variations that still exist despite the preprocessing techniques. Bucketed data with larger and larger bucketing windows support this idea. Bucketing the whole LC/MS matrix into a single m/z vector yields the same result as in 3.1.1.3. Fragment 835.4 is encircled in fig. 32. Note that the oxidized copy (m/z = 851.4) has a shorter retention time than the non-oxidized one.

3.2 Tandem Mass Spectrometry

To find out where on the fragments the changes have occurred, and to verify that the found fragments correspond to the theoretical ones, MS/MS was used on sample 1 and sample 9. Four of the fragments in table 9 were characterized using MS/MS; the other two fragments, m/z = 310.2 Th and m/z = 492.4 Th, could not be found. In all four fragments a change in m/z corresponding to the addition of oxygen was identified at the position of methionine. The conclusion must be that methionine has been oxidized to methionine sulfoxide. Figures 33 and 34 show the daughter ion spectra of fragment m/z = 940.2 Th and its oxidized copy. The m/z values that were found in the y and b series are shown in tables 11 and 12 together with the corresponding amino acid sequence.

Figure 32: Significant differences between bucketed data from batch 1 and batch 4. Ten time points are bucketed into one. Positive differences are turquoise and negative blue.


Figure 33: Daughter ion spectrum of the non-oxidized fragment from sample 1, m/z = 940.2 Th.

Figure 34: Daughter ion spectrum of the oxidized fragment from sample 9, m/z = 948.2 Th.

Table 11: Non-oxidized fragment, sample 1, m/z = 940.2 Th.

aa       D       I       Q       M       T       Q       S       P       S
b (Th)   116.0   229.1   357.2   488.2   589.3   717.3   804.4   901.4   988.4
y (Th)   -       1763.9  1650.8  1522.7  1391.7  1290.6  1162.6  1075.5  978.5

aa       S       L       S       A       S       V       G       D       R
b (Th)   1075.5  1188.6  1275.6  1346.6  1433.7  1532.7  1589.7  1704.8  -
y (Th)   891.5   804.4   691.3   604.3   533.3   446.2   347.2   290.1   175.1


Daughter ion identification was carried out using the BioLynx software. Ions written in italics could not be found. Note that the b-series m/z values beyond the methionine position are all shifted +16 Th in the oxidized copy.

Table 12: Oxidized fragment, sample 9, m/z = 948.2 Th.

aa       D       I       Q       oxM     T       Q       S       P       S
b (Th)   116.0   229.1   357.2   504.2   605.3   733.3   820.4   917.4   1004.4
y (Th)   -       1779.9  1666.8  1538.7  1391.7  1290.6  1162.6  1075.5  978.5

aa       S       L       S       A       S       V       G       D       R
b (Th)   1091.5  1204.6  1291.6  1362.6  1449.7  1548.7  1605.7  1720.8  -
y (Th)   891.5   804.4   691.3   604.3   533.3   446.2   347.2   290.1   175.1


4 Discussion

Traditionally, peptide mapping has been performed using RP-HPLC and a UV detector. However, few qualitative conclusions can be drawn from pure UV data: it is possible to detect changes between different chromatograms, but the underlying reasons for these changes remain unknown. In this study the advantages of mass spectrometry were combined with the traditional procedure of peptide mapping. Mass spectrometry has higher sensitivity than a UV detector, but perhaps more interesting is the qualitative information that can be gathered from the mass spectrometer. It is possible to find the m/z values of fragments that differ from sample to sample. Comparing the mass information for each fragment to theoretical values gives a good idea of the amino acid sequence. However, due to the complex pattern of m/z data, with an almost infinite number of peaks, it is difficult to be sure that the found fragments really agree with the theoretical ones. If the sequence of the protein is unknown this also has to be considered. Tandem MS makes it possible to determine the sequence of unknown peptides; using this, qualitative information down to individual amino acids can be gathered. This has been the focus of this study: small changes were introduced in a model protein and later successfully characterized as methionine oxidizations using an MS peptide map and tandem MS.

Attempts have been made to characterize the nature of MS and LC/MS data. It was shown that, at least under the conditions used in this study, preprocessing of complex MS data was necessary in order to be able to compare different batches; without e.g. normalization no interesting results could be found. The low reproducibility cannot be explained solely by the stochastic nature of MS data. Most probably the model protein and the digestion method contributed to this result. Normally, when digesting a protein, the disulfide bonds are reduced in order to avoid very large connected fragments and to enhance the enzymatic reaction. Very large fragments will be difficult to detect with the mass detector and can contribute to peptide adsorption. The problem with peptide and protein adsorption within the RP-HPLC system is of special importance when using a micro HPLC system with tiny dimensions. In this study no reduction of the disulfide bonds took place, in order to avoid reduction of the oxidized model protein, and this can be part of the explanation for the low reproducibility. On the other hand, after normalization of the data set four of the six peptide fragments containing methionine were found, which must be considered a good result. The remaining two fragments were not found at all, neither the oxidized nor the non-oxidized copy. These lost fragments were the largest fragments and thus much more difficult to find, since only multiply charged ions with a mass larger than 1500 Da will be detected. The two fragments that could not be matched to the theoretical ones can have been the result of many uncontrolled reactions: enzyme self-degradation, partial cleavage and sample contamination are a couple of them. Since these two fragments could not be found when using tandem MS they were ignored.


The low sensitivity of the mass spectrometer in this study (70 picomoles were injected from each sample) can be due to the use of TFA as ion-pairing agent. Low concentrations of TFA in the mobile phase should not give rise to any problematic loss of sensitivity, but the 0.05% TFA used in this study may have affected the sensitivity. A better approach might have been to use formic acid instead.

The batch-comparing method developed in this study can be used as a filter to remove all non-interesting differences. In this study batch four was chosen as target because differences between this batch and the reference could not be detected visually by looking at normalized m/z vectors, as they could for batch five. But how do we know that no significant, and perhaps vital, differences will be missed? In this study not only non-significant differences were removed but also differences that were small in comparison to the maximum intensity of each m/z value; m/z values with intensities lower than 4000 counts were, for example, only saved if the significant differences were larger than 2400 counts. This criterion of selection was based on empirical knowledge, which means that for each specific batch comparison new criteria of selection have to be determined from the nature of that specific data. More general methods need to be developed in order to be able to compare whole classes of batches instead of only individual batches.

In order to use LC/MS peptide maps on a more regular basis, reproducibility and robustness have to be thoroughly investigated; there has to be an answer to the question whether the results can be trusted or not. A suggestion for further research is to use a model protein with copies of known mutations. The problems of using a non-controlled oxidization reaction would then be avoided, and it would also be possible to reduce the troublesome disulfide bonds. A validated LC/MS system could dramatically reduce analytical cycle times and provide qualitative data that answers not only the question whether there is a difference, but also what it is. More studies have to be done in order to characterize the capacity for finding small PTMs. If a protein is only partially modified (as was the case in this study), how small differences can then be detected? And how small differences really need to be detected? These questions differ from case to case and do not have a general answer, which makes analytical chemistry dependent on non-automatic, expensive human resources for a long time to come.

Attempts have also been made to characterize the nature of the entire LC/MS matrix. It was found that this more complicated approach comes with many obstacles: it is difficult to separate variations with a stochastic origin from the chemical variations, and no general methods have yet been developed concerning the treatment of LC/MS matrixes. However, if differences in retention time and m/z can be added together in a synergetic manner, then huge possibilities of finding very small PTMs will emerge.


5 Acknowledgements

I would like to thank Sven Jacobsson for supporting this thesis. I would also like to thank my supervisors Rudolf Kaiser and Yang Yang who guided me through the nasty jungle of awkward instruments. Thanks Fredrik Andersson for your genetic algorithms. Thanks Ingrid Marle for your never ending smile. And thanks to the whole bunch of brownies working at AstraZeneca. You have all been very nice!


6 References

[1] Lilly, About us; History, http://www.lilly.co.uk/aboutus/history.cfm (25.05.2004)
[2] Bongers J., Cummings M.B., Federici M., Gledhill L., Gulati D., Hilliard G., Jones B., Lee K., Mozdzanowski J., Validation of a Peptide Mapping Method for a Therapeutic Monoclonal Antibody: what could we possibly learn about a method we have run 100 times?, Journal of Pharmaceutical and Biomedical Analysis 21 (2000), 1099-1128
[3] Andersson F., Kaiser R., Jacobsson S., Data Preprocessing by Wavelets and Genetic Algorithms for Enhanced Multivariate Analysis of LC Peptide Mapping, Journal of Pharmaceutical and Biomedical Analysis 34 (2004), 531-541
[4] Abbas A., Lichtman A., Cellular and Molecular Immunology, fifth edition, W.B. Saunders Company, Philadelphia
[5] Vogt W., Oxidation of Methionyl Residues in Proteins, Tools, Targets and Reversal, Free Radical Biology and Medicine 18 (1995), 93-105
[6] Vydac Separation Group, The Handbook of Analysis and Purification of Peptides and Proteins by Reversed-Phase HPLC
[7] Siuzdak G., Mass Spectrometry for Biotechnology, Academic Press (1996)
[8] Eriksson L., Johansson E., Kettaneh-Wold N., Multi- and Megavariate Data Analysis, Umetrics (2001)
[9] Yerevan Physics Institute, Patterns of Gene Expression in Normal and Neoplastic Tissues and Associated Statistical Problems, crdlx5.yerphi.am/proj/gene/crdf.php3 (09.07.2004)
[10] Rohrback B., Ramos S., Aligning Chromatograms, http://www.infometrix.com/apps/GCC2003_Align_L.pdf (09.07.2004)
[11] Wehrens R., Buydens M.C., Evolutionary Optimisation: a Tutorial, Trends in Analytical Chemistry 17 (1998), 193-203
[12] Leardi R., Genetic Algorithms in Chemometrics and Chemistry: a Review, J. Chemometrics 15 (2001), 559-569
[13] Svärdström A., Signaler och System, Studentlitteratur, Lund (1999)
[14] Polikar R., Fundamental Concepts & an Overview of the Wavelet Theory, http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html (2004.07.12)
[15] Misiti M., Misiti Y., Oppenheim G., Poggi J., Wavelet Toolbox User's Guide, The MathWorks Inc. (1996)
[16] Dionex, Separation of Tryptophan and Methionine Oxidized Peptides from their Unoxidized Forms, http://www1.dionex.com/en-us/webdocs/application/industry/lifesci/ic/AN129_V16.pdf (2004.07.12)
[17] Klein E.J., Rivera S.L., A Review of Criteria Functions and Response Surface Methodology for the Optimization of Analytic Scale HPLC Separations, J. Liq. Chrom. & Rel. Technol. 23 (2000), 2097-2121
[18] Micromass, Operator Training Course, User Manual (2003)
[19] Heath M., Scientific Computing, second edition, McGraw-Hill Science/Engineering/Math (2001)
[20] Solomon G., Fryhle C., Organic Chemistry, 7th ed., Wiley Text Books (2000)
[21] Torgrip R., Åberg M., Karlberg B., Jacobsson S., Peak Alignment Using Reduced Set Mapping, J. Chemometrics 17 (2003), 573-582
