Journal of Chromatography A, 1256 (2012) 150–159

journal homepage: www.elsevier.com/locate/chroma

New supervised alignment method as a preprocessing tool for chromatographic data in metabolomic studies

Wiktoria Struck, Paweł Wiczling, Małgorzata Waszczuk-Jankowska, Roman Kaliszan, Michał Jan Markuszewski

Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Al. Gen. Hallera 107, 80-416 Gdańsk, Poland

Corresponding author. Tel.: +48 58 349 3260; fax: +48 58 349 3262. E-mail address: [email protected] (M.J. Markuszewski).

0021-9673/$ see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.chroma.2012.07.084

Article info

Article history: Received 19 March 2012; Received in revised form 19 June 2012; Accepted 26 July 2012; Available online 2 August 2012

Keywords: Metabolomics; HPLC; Retention time shift; Data alignment; Warping; Preprocessing methods

Abstract

The purpose of this work was to develop a new alignment algorithm, called supervised alignment, and to compare its performance with correlation optimized warping. Supervised alignment is based on a "supervised" selection of a few common peaks present in each chromatogram. The selected peaks are aligned based on the difference in retention time of the selected analytes between the sample and the reference chromatogram. The retention times of the fragments between known peaks are subsequently linearly interpolated. The performance of the proposed algorithm has been tested on a series of simulated and experimental chromatograms. The simulated chromatograms comprised analytes with systematic or random retention time shifts. The experimental chromatographic (RP-HPLC) data were obtained during the analysis of nucleosides from 208 urine samples and contain both systematic and random displacements. All the data sets have been aligned using correlation optimized warping and supervised alignment. The time required to complete the alignment, the overall complexity of both algorithms, and their performance measured by the average correlation coefficients are compared to assess the tested methods. In the case of systematic shifts, both methods lead to successful alignment. However, for random shifts, correlation optimized warping requires more time than supervised alignment (a few hours versus a few minutes), and the quality of the alignment, described as the correlation coefficient of the newly aligned matrix, is worse (0.8593 versus 0.9629). For the experimental data set, supervised alignment successfully aligns 208 samples using 10 previously identified peaks. Knowledge of the retention times of a few analytes in the data sets is necessary to perform supervised alignment for both systematic and random shifts. Supervised alignment is a faster, more effective and simpler preprocessing method than correlation optimized warping and can be applied to chromatographic and electrophoretic data sets.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Now, in the era of evolving bioinformatics methods, there is a clear trend toward the analysis of the entire chromatographic data matrix, rather than of selected peaks detected in the chromatograms. This broad approach does not require choosing individual analytes for integration and subsequent analysis and therefore does not cause data loss. Such a holistic approach overcomes the problem of the enormity of data in areas like metabolomics, which refers, inter alia, to the analysis of metabolic profiles and metabolic fingerprinting and examines the interactions between levels of not necessarily identified metabolites. By analyzing the entire chromatographic data matrix instead of concentrations or peak areas of selected analytes, more relevant information can be extracted about the analyzed sample using appropriate classification and prediction methods. However, prior to such chemometric analyses, it is necessary to correct retention time shifts that occur either globally or in small sections of the chromatograms. The peaks are shifted because of unavoidable changes in the experimental conditions caused by minor changes in the mobile phase composition or stationary phase properties, or by the impact of the sample matrix (particularly in the case of a biological sample matrix such as urine or serum).

Two types of peak shifts can be distinguished in a real set of chromatograms. In the first one, called systematic, the difference between retention times of the corresponding analytes on two consecutive chromatograms is a continuous function of retention time. This is a very common situation and might be a consequence of column ageing, changes in chromatographic conditions, minor changes in the mobile phase composition, etc. In contrast, for the random displacement the difference between retention times of corresponding analytes is a random variable, so it affects each peak
in a different way. It might be a consequence of some unexplained variability, e.g. caused by interaction between analytes. Very often both types of displacement are present in a chromatogram.

Numerous methods have been applied to correct retention time shifts. They are based on different mathematical transformations, namely correlation optimized warping [1–8], dynamic time warping [4–8], parametric time warping [4,7,9], semi-parametric time warping [4], peak alignment with a genetic algorithm [10], local warping [11], automated alignment [12,13], fuzzy warping [14] as well as alignment using differential evolution (DE) [15]. The aim of the above-mentioned methods is to align all shifted signals so that corresponding peaks in different chromatograms appear at the same retention time. There are also alignment methods that are suitable for mass spectrometry data [16,17]. Recently, a freely accessible tool for chromatogram alignment has been developed [18]. Among the different warping methods, correlation optimized warping seems to be the most often used. This method was first reported by Nielsen et al. [19] and concerns the alignment of two chromatographic profiles by piecewise linear stretching and compression along the time axis of one of the profiles. Since 1998 it has been applied to both chromatographic [1–6] and electrophoretic peak shifts [7,8]. The quality of the alignment is calculated as the correlation coefficient among the newly aligned chromatograms. Moreover, the alignment is performed without any preliminary information about the correspondence of peaks. However, the main disadvantage is that the whole procedure depends on choosing two parameters, the so-called segment length and slack parameter. This is very time-consuming and does not guarantee the best alignment for signals that contain a large number of data points (such as thousands of data points). Moreover, this method does not give satisfactory results when it is applied to random peak shifts [4,12].

In this paper we propose a new, fast and simple alignment method that can be applied to both systematic and random data shifts and is not limited by the data matrix size. To achieve this goal, two artificial data sets were first created. The first one contains systematic retention time shifts and the second one is composed of random retention time displacements. For both simulated data sets, correlation optimized warping as well as the newly developed method, called supervised alignment (SA), have been performed and the obtained results have been compared. The proposed algorithm has also been applied to a real chromatographic data set from the analysis of nucleosides and other metabolites derived from urine samples.

2. Theory and implementation

2.1. Correlation optimized warping

Correlation optimized warping (COW) is a well-known and popular alignment method that is usually applied to correct peak shifts in chromatographic data. The method is based on piecewise linear stretching and compression along the time axis relative to the reference chromatogram. COW was previously successfully applied and described by us for the alignment of electrophoretic data in which systematic shifts were observed [7,8]. For random shifts this method seems not to be the best choice [4,12]. To verify this, we applied correlation optimized warping to simulated data sets with both systematic and random peak shifts. The sample with the highest average correlation coefficient among all samples within the matrix served as the reference chromatogram. Subsequently, the peaks were aligned by choosing, usually by trial and error, two parameters: the so-called segment length and the slack parameter. In short, a segment is a range of points along the time axis within which the alignment is performed. The slack parameter is the number of points by which the peak can be moved (flexibility). In other words, the slack parameter is the maximum range of warping within a segment length. An important advantage of the method is that no information on peak identity is required to compensate for peak shifts. However, the method is very time-consuming, especially when a large data set has to be aligned (over a thousand points).
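The reference selection step described above (the sample with the highest average correlation to all other samples) can be sketched as follows. This is a minimal illustration in plain Python rather than the Matlab implementation used in the paper; the function names `pearson` and `select_reference` are hypothetical, and the sketch assumes that all chromatograms have the same length and that at least two are supplied.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def select_reference(chromatograms):
    """Return (index, score) of the sample with the highest mean correlation
    to all other samples; needs at least two chromatograms."""
    best_index, best_score = -1, -math.inf
    for i, sample in enumerate(chromatograms):
        others = [pearson(sample, other)
                  for j, other in enumerate(chromatograms) if j != i]
        score = sum(others) / len(others)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

For three otherwise identical signals whose single peak sits at increasing positions, the middle one is returned, because it overlaps with both neighbours.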

2.2. Supervised alignment

Our goal was to create a method that, in a relatively short time, enables the alignment of shifted peaks based on the known retention times of a few analytes common to all analyzed chromatograms. We called our method supervised alignment (SA) because the shift of peaks is based on the shifts of peaks that correspond to the same analyte and are pointed out by the user (supervised by the user). Similarly to correlation optimized warping, the SA procedure starts with selecting the reference chromatogram. The selection is based on the correlation coefficient calculated for each pair of samples in the matrix. The sample with the highest average correlation coefficient among the samples in the matrix is chosen as the reference (target) chromatogram, T. Here the similarities with COW end. After selecting the target chromatogram, peaks that are common to each chromatogram from the analyzed matrix and the reference chromatogram are selected. Therefore, identification of peaks present in all chromatograms is required. This step seems to be the main drawback of our method; however, it could be easily automated, which would make the method very fast. It is also not required to confirm the identity of all common peaks existing in the chromatograms. Only some of them, occurring at the beginning, middle and end of the time vector, are crucial for the whole procedure. Depending on the nature of the data, it is necessary to select a minimum of 2–3 out of the total number of peaks present in the chromatogram. In the next step, the retention times of each selected peak (i.e. the times of maximal absorbance) are determined for the sample and target chromatograms. In the final step, linear interpolation is used to map the whole range of times from the sample chromatogram to the target chromatogram based on the retention times of the selected peaks, as illustrated in Fig. 1. We decided to preserve the heights of peaks, as height is a more useful measure of concentration for the subsequent chemometric analysis. The Matlab code is given in Appendix 1.
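The interpolation step can be illustrated with a short sketch. It is written in plain Python and is not the authors' Matlab code from Appendix 1; the names `interp` and `supervised_align` are hypothetical. Two assumptions are made that the text does not state: the first and last points of the time axis are treated as fixed anchors, so that the mapping covers the whole chromatogram, and the selected peak times lie strictly inside the time axis in increasing order. Because the signal is only resampled, not rescaled, peak heights are preserved.

```python
from bisect import bisect_right


def interp(x, xp, fp):
    """Piecewise linear interpolation of the points (xp, fp) at x.
    xp must be strictly increasing; values outside the range are clamped."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    k = bisect_right(xp, x) - 1
    frac = (x - xp[k]) / (xp[k + 1] - xp[k])
    return fp[k] + frac * (fp[k + 1] - fp[k])


def supervised_align(times, signal, sample_peaks, target_peaks):
    """Warp `signal` so that its selected peaks land on the target peak times.

    sample_peaks / target_peaks: retention times of the same analytes in the
    sample and in the reference (target) chromatogram, in increasing order.
    """
    # Assumption: both ends of the time axis stay fixed.
    src = [times[0]] + list(sample_peaks) + [times[-1]]
    dst = [times[0]] + list(target_peaks) + [times[-1]]
    aligned = []
    for t in times:
        u = interp(t, dst, src)                   # matching time in the sample
        aligned.append(interp(u, times, signal))  # sample intensity at that time
    return aligned
```

With a sample whose peaks sit at 8 and 15 and a target with the same analytes at 6 and 16, the aligned signal has its maxima at 6 and 16 with unchanged heights, while the fragments in between are stretched or compressed linearly.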

Undoubtedly, the most time-consuming phase of the analysis is to properly choose the peaks that exist in each chromatogram, because it requires inspection of every single chromatogram from what is, in fact, a very large data set. However, after this step the rest of the analysis is performed automatically and lasts from a few seconds to a few minutes, depending on the computational power of the computer. It should also be underlined that the time required for the identification of those several peaks is incomparably shorter than the time needed to optimize the key parameters of correlation optimized warping (segment length and slack parameter). Finally, to check the potential of the supervised alignment method to align randomly and systematically shifted peaks, the method was applied to the simulated data sets. We also decided to examine the number of identified peaks necessary to correctly align both data sets.

3. Materials and methods

Fig. 1. The procedure of supervised alignment. (A) Selection of the target chromatogram, (B) selection of a few common peaks, whose retention times are aligned to the corresponding retention times on the target chromatogram while the fragments between the selected peaks are linearly interpolated, and (C) the results of SA.

All simulations and calculations have been conducted using Matlab 9.1 (The MathWorks, Inc., USA). We used the COW algorithm that was developed at the Department of Analytical Chemistry and Pharmaceutical Technology, Vrije University, Brussels, Belgium [7,8]. To assess the quality of the alignment, two parameters have been compared, namely the root mean square error (RMSE) and the maximum average correlation coefficient (R) among all samples in the analyzed matrix. RMSE is the square root of the mean squared difference between the retention times of individual peaks from the matrix and their retention times from the reference chromatogram:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (TR_i - TR)^2}{n}},$$

where $TR$ and $TR_i$ denote the retention time from the reference chromatogram and the sample chromatogram, respectively, and $n$ is the number of compared peaks.
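As a worked illustration of this definition, the following sketch computes the RMSE between the retention times of matched peaks in one sample and in the reference chromatogram. It assumes that the sum runs over the n matched peaks and that both lists are given in the same peak order; the function name `rmse` is hypothetical.

```python
import math


def rmse(sample_times, reference_times):
    """Root mean square error between matched peak retention times."""
    if len(sample_times) != len(reference_times):
        raise ValueError("both lists must contain the same matched peaks")
    n = len(reference_times)
    squared = [(s - r) ** 2 for s, r in zip(sample_times, reference_times)]
    return math.sqrt(sum(squared) / n)
```

A perfectly aligned sample gives an RMSE of zero, and a sample in which every peak is displaced by the same amount gives an RMSE equal to the absolute value of that displacement.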

3.1. Chromatographic data simulations

Thirty chromatograms for each type of displacement (systematic and random) have been simulated. Each peak is assumed to be Gaussian shaped and includes many sorts of variability encountered in a real chromatogram. The ith chromatogram at time point j has been obtained according to the formula:

$$X(i,j) = \sum_{k=1}^{10} AUC_{ik} \, \frac{1}{\sqrt{2\pi (W_{ik}/4)^2}} \exp\left(-\frac{(j - TR_{ik})^2}{2 (W_{ik}/4)^2}\right)$$

where k denotes the number of a peak in a chromatogram. Each chromatogram has 10 peaks described by area ($AUC_{ik}$), retention time ($TR_{ik}$) and peak width ($W_{ik}$). Since we aimed to mimic real gradient elution conditions, the peak width has been kept approximately constant for each analyte present in the chromatogram. To challenge the analysis, the peak width and the area under the peak have been simulated as random variables. Thus the shape and height of each peak differ between chromatograms. Similarly, the retention times are simulated with an additive random component (ε). For the systematic displacement, ε is equal for all the peaks present in a particular chromatogram but differs between chromatograms. In contrast, for the random displacement ε differs between peaks and between chromatograms. The mean values and the random components of all parameters used for simulation are given in Table 1.
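The simulation scheme can be sketched as follows, using the typical values and standard deviations listed in Table 1. This is an illustrative plain-Python version, not the authors' Matlab code; the function name `simulate_chromatogram` and its `shift` argument are hypothetical. The time grid is an assumption: the source does not give the time units or step, and a grid from 0 to 17.5 in steps of 0.01 is used here only because it yields 1751 points.

```python
import math
import random

MEAN_TR = [2, 3, 5, 6, 8, 10, 11, 12, 14, 15]        # typical retention times
MEAN_AUC = [50, 25, 10, 20, 30, 10, 15, 40, 10, 50]  # typical peak areas
MEAN_WIDTH = 2.0                                      # typical peak width


def simulate_chromatogram(times, shift="systematic", rng=random):
    """Return (signal, peak_times) for one simulated chromatogram.

    shift="systematic": one retention time offset shared by all peaks.
    shift="random": an independent offset for every peak.
    """
    common = rng.gauss(0.0, 0.25)  # per-chromatogram offset, SD 0.25
    signal = [0.0] * len(times)
    peak_times = []
    for mean_tr, mean_auc in zip(MEAN_TR, MEAN_AUC):
        eps = common if shift == "systematic" else rng.gauss(0.0, 0.25)
        tr = mean_tr + eps
        width = MEAN_WIDTH + rng.gauss(0.0, 0.1)
        auc = mean_auc * math.exp(rng.gauss(0.0, 0.5))  # lognormal area
        sigma = width / 4.0                             # Gaussian SD is W/4
        norm = auc / math.sqrt(2.0 * math.pi * sigma ** 2)
        for idx, t in enumerate(times):
            signal[idx] += norm * math.exp(-(t - tr) ** 2 / (2.0 * sigma ** 2))
        peak_times.append(tr)
    return signal, peak_times


# Assumed grid: 1751 points from 0 to 17.5 (units not stated in the source).
TIMES = [i * 0.01 for i in range(1751)]
```

Calling `simulate_chromatogram(TIMES, "systematic")` thirty times, and the same with `"random"`, gives two data sets of the size described in this section.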

Figs. 2 and 3 show the simulated data sets for the systematic and random displacement of the peaks. Together with the chromatograms, the average correlation coefficient between samples is presented. As illustrated in Figs. 2 and 3, the peaks are shifted in both data sets and, as a consequence, the samples vary considerably among themselves. Hence, before any statistical analysis is applied to extract valuable information from the data set, peak alignment should be performed. The goal of the alignment procedure is to shift the peaks so that the chromatograms become more similar to each other.

3.2. Experimental chromatograms

The chromatograms were obtained by reversed-phase high-performance liquid chromatography (RP-HPLC) analysis of metabolic profiles of nucleosides from urine. Urine samples were collected from patients suffering from urological tract cancers and, before HPLC analysis, were subjected to solid-phase extraction (SPE) on phenylboronic acid (PBA) columns. The experiments were carried out on an Agilent Technologies 1200 Series system (Waldbronn, Germany), consisting of a G1311A pump, a G1329A autosampler, a G1316A column oven and a G1315 DAD diode array detector. The detector wavelength was set at 254 nm. The separation of the isolated urinary nucleosides was performed on two connected Gemini C18 columns, 250 mm × 4.6 mm (Phenomenex, USA), giving a total column length of 500 mm, with a particle size of 3 μm. The columns were thermostated at 55 °C, the flow rate was 0.5 mL min−1 and the samples were introduced onto the columns in a volume of 5 μL. LC separations were performed using the following mobile phase gradient profile: from 2:98 to 20:80 (v/v) of MeOH:0.1% formic acid, pH 2.8, in 50 min. Data were collected and analyzed using the Chemstation software (Agilent Technologies).

More detailed information on the analytical procedure for sample preparation and on the optimized and validated liquid chromatography method is available in [20].

Table 1. The values of retention times, peak widths and peak areas used to simulate the chromatograms with systematic and random peak displacements. All symbols with an overbar are typical values, whereas ε is a random, normally distributed component, N(0, σ), with mean zero and standard deviation σ. The AUC was modeled using a lognormal distribution to avoid negative values.

| Type of displacement | Retention time, $TR_{ik}$ | Peak width, $W_{ik}$ | Area of the peak, $AUC_{ik}$ |
|---|---|---|---|
| Systematic | $TR_{ik} = \overline{TR}_k + \varepsilon^{TR}_{i}$; $\overline{TR}_k = \{2, 3, 5, 6, 8, 10, 11, 12, 14, 15\}$; $\varepsilon^{TR}_{i} = N(0, 0.25)$ | $W_{ik} = \overline{W}_k + \varepsilon^{W}_{ik}$; $\overline{W}_k = \{2\}$; $\varepsilon^{W}_{ik} = N(0, 0.1)$ | $AUC_{ik} = \overline{AUC}_k \exp(\varepsilon^{AUC}_{ik})$; $\overline{AUC}_k = \{50, 25, 10, 20, 30, 10, 15, 40, 10, 50\}$; $\varepsilon^{AUC}_{ik} = N(0, 0.5)$ |
| Random | $TR_{ik} = \overline{TR}_k + \varepsilon^{TR}_{ik}$; $\overline{TR}_k = \{2, 3, 5, 6, 8, 10, 11, 12, 14, 15\}$; $\varepsilon^{TR}_{ik} = N(0, 0.25)$ | | |

4. Results and discussion

4.1. Application of correlation optimized warping

In the first step, correlation optimized warping was used to align both simulated data sets. Figs. 4 and 5 present the data for 4 samples before (Figs. 4A and 5A) and after COW (Figs. 4B and 5B). The results for the whole data set are illustrated in Figs. 4C and 5C. As presented in Table 2, the averaged correlation coefficient ranges from 0.725 to 0.878 for systematic shifts and from 0.483 to 0.806 for random shifts. These values are higher than for raw data (0.148–0.629 for systematic and 0.337–0.599 for random peak displacement), which suggests a considerable improvement in peak alignment. The time for alignment depends on the time for optimization of the key parameters of the method as well as on the length of the time vector, and amounts to 2.26 min for systematic displacements and 4.37 min for random shifts. The time is relatively short because the time vector comprises 2001 points with 10 peaks only. As illustrated in Fig. 4 and Table 2, the COW method works well for systematic displacements. The root mean square error calculated between the retention times of the target and sample chromatograms is in the range from 0 (perfect alignment) to 0.365 (slight movement). As a result, the newly aligned analytical data may be subjected to further chemometric analysis without losing important information during the alignment. In contrast, the data with random peak shifts are not aligned properly by the COW algorithm, so contradictory information can be obtained during classification or predictive analysis. The root mean square error is in the range from 0.17 to 0.51. Besides, many analytes coelute completely, which made the alignment difficult to apply. Therefore, if randomly shifted peaks need to be aligned, correlation optimized warping should not be the method of choice.

Fig. 2. The simulated data with systematic peak shifts (30 samples × 1751 data points). The box on the right describes the correlation between samples: the color of each grid cell shows the value of the correlation between successive samples. Red reflects low and white high correlation between samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Fig. 3. The simulated data with random peak shifts (30 samples × 1751 data points). The box on the right describes the correlation between samples: the color of each grid cell shows the value of the correlation between successive samples. Red reflects low and white high correlation between samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Fig. 4. Correlation optimized warping on the simulated data set with systematic peak shifts. (A) Original samples (n = 4) before alignment, (B) samples (n = 4) after correlation optimized warping (segment length = 10, slack parameter = 3), (C) results of alignment of the whole data set with COW, and (D) correlation between samples. The target chromatogram is presented as a black dashed line.

Fig. 5. Correlation optimized warping on the simulated data set with random peak shifts. (A) Original samples (n = 4) before alignment, (B) samples (n = 4) after correlation optimized warping (segment length = 30, slack parameter = 6), (C) results of alignment of the whole data set with COW, and (D) correlation between samples (red reflects low and white high correlation). The target chromatogram is presented as a black dashed line. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

4.2. Application of supervised alignment

4.2.1. Systematic shifts

The supervised alignment of the chromatograms with systematic shifts starts by choosing the reference chromatogram, T. In the next step, common peaks on all chromatograms are identified. The common peaks present in each chromatogram were chosen by visual inspection of the data. First, we selected a few time frames (regions) where repeatable and high peaks were observed. The highest peak in each region was considered to belong to the same analyte and its retention time was noted for further calculation. The selection of regions was required because the analyte retention times change from chromatogram to chromatogram. To check how many peaks have to be identified to properly align this data set, we considered the following selections: (i) only the first peak, (ii) the first and the last peak, and (iii) the first peak, the peak in the region of 710–910 points and the last peak. Fig. 6 shows the alignment of systematically shifted peaks using the SA method for the three selected scenarios. The best alignment was obtained in the third case, when 3 peaks were chosen for the calculations. However, aligning the sample chromatogram based on only the first peak already leads to a considerable improvement in the initial part of the chromatogram. The comparison of the SA and COW algorithms is given in Table 2. As far as systematic displacements using the three peaks are concerned, SA perfectly aligns all analytes, with RMSE equal to zero. In contrast, RMSE equals 0.27 for raw data and is in the range from 0 to 0.36 after COW. The averaged correlation coefficient is much better (0.728–0.892) than for raw data (0.148–0.629) and similar to the same parameter calculated after COW (0.725–0.878). Therefore, the alignment of the data set is successful using both methods (SA and COW). However, the time required for alignment is less than 1 min when supervised alignment is performed.

Table 2. The accuracy of peak alignment for two simulated data sets by the COW and SA algorithms. RMSE presents the differences in peak retention times between the reference chromatogram (target) and the chromatograms before and after alignment. R is the mean correlation coefficient between all vectors (samples) in the data set. The range of R values was summarized by mean, range and standard deviation (STD).

| Type of pre-processing method | Statistic | Random shifts, RMSE | Random shifts, R | Systematic shifts, RMSE | Systematic shifts, R |
|---|---|---|---|---|---|
| Raw data | Mean | 0.29 | 0.478 | 0.27 | 0.477 |
| | Max | 0.45 | 0.559 | 0.27 | 0.629 |
| | Min | 0.23 | 0.337 | 0.27 | 0.148 |
| | STD | | 0.066 | | 0.13 |
| Correlation optimized warping | Mean | 0.29 | 0.705 | 0.14 | 0.828 |
| | Max | 0.51 | 0.806 | 0.36 | 0.878 |
| | Min | 0.17 | 0.483 | 0 | 0.725 |
| | STD | | 0.063 | | 0.038 |
| Supervised alignment performed according to the 3 peaks (first peak, peak in the region 710–910 points and last peak) | Mean | 0.33 | 0.625 | 0 | 0.831 |
| | Max | 0.58 | 0.721 | 0 | 0.892 |
| | Min | 0.11 | 0.454 | 0 | 0.728 |
| | STD | | 0.061 | | 0.040 |

Fig. 6. Supervised alignment on the simulated data set with systematic peak shifts. (A) Original samples (n = 4) before alignment, (B) samples (n = 4) after supervised alignment using the first peak, (C) samples (n = 4) after supervised alignment using the first and the last peak, and (D) samples (n = 4) after supervised alignment using the first peak, the peak in the region of 710–910 points and the last peak. The target chromatogram is presented as a black dashed line.

Fig. 7. Supervised alignment on the simulated data set with random peak shifts. (A) Original samples (n = 4) before alignment, (B) samples (n = 4) after supervised alignment using the first peak, (C) samples (n = 4) after supervised alignment using the first and the last peak, (D) samples (n = 4) after supervised alignment using the first peak, the peak in the region of 710–910 points and the last peak, and (E) samples (n = 4) after supervised alignment using all peaks. The target chromatogram is presented as a black dashed line.

Fig. 8. Result of alignment of the whole data set (208 samples × 7500 points) with the supervised alignment method. (A) Original data set before alignment, (B) enlarged original data set, (C) correlation between samples before alignment, (D) data set after supervised alignment using the 8 peaks according to which the alignment was performed, (E) enlarged data after supervised alignment, and (F) correlation between samples after supervised alignment (red and white reflect low and high correlation, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
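The region-based peak selection used in this section (the highest point inside a user-chosen window of data points is taken as the peak of the same analyte) can be sketched as below. The function name `pick_peak_in_region` is hypothetical, the window limits are treated as inclusive point indices, and picking the single highest point is an assumed simplification of the visual inspection described in the text.

```python
def pick_peak_in_region(times, signal, start, stop):
    """Return the retention time of the highest point whose index lies in
    the inclusive window [start, stop]."""
    if not 0 <= start <= stop < len(signal):
        raise ValueError("region must lie inside the chromatogram")
    best = max(range(start, stop + 1), key=lambda idx: signal[idx])
    return times[best]
```

For the middle peak of the simulated data this would be called with `start=710` and `stop=910` for every chromatogram, and the returned times would then serve as the known retention times for the alignment.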

.2.2. Random shiftsIn the case of the randomly shifted peaks, similar procedure has

een used as for systematic data. Therefore the alignment has beenerformed using the information about retention time of the (i)rst peak, (ii) the first peak and the last peak, and (iii) the first peak,eak in the region of 750–850 points and the last peak. Next, the SAethod has been applied. That approach does not give satisfactory

esults because only these peaks are aligned according to whichhe procedure is applied. Therefore, we expanded the number ofeaks to (iv) to all present in the chromatogram. Fig. 7 illustrateshe obtained results. Only by using the prior information aboutetention time of all peaks, the chromatograms could be properlyorrected. It further indicates that for random peak displacements

ither COW and SA algorithms does not work, and only by using theA it is possible to align some already identified peaks with all oth-rs being poorly aligned. In a real chromatogram, the peak shifts are

combination of systematic and random displacements. If random

lignment using 8 peaks according to which alignment was performed, (E) enlargedlignment (red and white color reflect low and high correlation, respectively). (For

web version of the article.)

changes are small the data set can be treated as systematicallyshifted and the SA algorithm is very powerful in aligning the data.However, if random component is large the SA algorithms works,but the retention time of all peaks need to be known in advance.
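The landmark-based correction discussed above can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `supervised_align`, the use of NumPy, and the choice to pin the first and last data points of the time axis are assumptions made for illustration. It only shows the general idea of moving selected peaks onto their target positions and linearly interpolating the time axis in between.

```python
import numpy as np

def supervised_align(sample, sample_peaks, target_peaks):
    """Warp one chromatogram so its selected peaks land on the target positions.

    sample       : 1-D detector signal (one chromatogram)
    sample_peaks : indices of the selected peaks in the sample (increasing)
    target_peaks : indices of the same peaks in the target (increasing)

    Assumption of this sketch: the peak indices are strictly inside the
    axis, so adding the two end points keeps the landmarks increasing.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Pin both ends of the time axis so the warp covers every point.
    src = np.concatenate(([0], sample_peaks, [n - 1])).astype(float)
    dst = np.concatenate(([0], target_peaks, [n - 1])).astype(float)
    grid = np.arange(n, dtype=float)
    # For every point on the target axis, find the (fractional) position
    # in the sample it should be read from: linear between landmarks.
    source_pos = np.interp(grid, dst, src)
    # Resample the signal at those positions.
    return np.interp(source_pos, grid, sample)

# Example: three Gaussian peaks with unequal (random-like) shifts.
t = np.arange(600)
gauss = lambda c: np.exp(-0.5 * ((t - c) / 5.0) ** 2)
target_peaks = [100, 300, 500]
sample_peaks = [110, 290, 520]
sample = sum(gauss(c) for c in sample_peaks)
aligned = supervised_align(sample, sample_peaks, target_peaks)
```

In this toy case every peak is a landmark, so all three apexes end up on the target positions; a peak that is not a landmark would only be moved by the interpolated shift of its neighbours, which is the limitation described in the text for random displacements.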

Results after supervised alignment using 3 peaks (the first, the last and the peak in the range of 710–910 points) are presented in Table 2. Both RMSE (0.11–0.58) and mean correlation coefficient (0.454–0.721) are lower than for COW (RMSE 0.17–0.51, mean R 0.483–0.806), which, in fact, results from the impossibility of aligning peaks with random displacement. This observation can lead to the conclusion that randomly shifted peaks can be aligned perfectly, but only using the information about the retention time of all peaks present in the analyzed samples. However, this situation is of no practical interest.
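The two quality measures quoted here can be sketched as follows. The exact definitions used for Table 2 are not given in this section, so both functions are assumptions: the mean correlation is taken as the average Pearson coefficient over all distinct pairs of chromatograms, and RMSE as the root mean square difference between peak positions in the samples and in the target.

```python
import numpy as np

def mean_pairwise_correlation(X):
    """Average Pearson correlation over all distinct pairs of rows of X.

    X : array (n_samples, n_points), one chromatogram per row.
    """
    R = np.corrcoef(X)                  # correlation matrix between rows
    iu = np.triu_indices_from(R, k=1)   # upper triangle, diagonal excluded
    return float(R[iu].mean())

def peak_rmse(sample_peaks, target_peaks):
    """RMSE between peak positions in the samples and in the target.

    sample_peaks : array (n_samples, n_peaks) of peak positions
    target_peaks : array (n_peaks,) of target positions
    """
    diff = np.asarray(sample_peaks, float) - np.asarray(target_peaks, float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these definitions an RMSE of zero means every tracked peak sits exactly on its target position, which matches the way the systematic-shift result for SA is described.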

4.2.3. The experimental chromatographic data

After the simulated data sets were tested, we applied the SA to the real chromatographic data derived from the analysis of 208 urine samples for the determination of nucleosides. It is important to note that some of the nucleosides are weakly retained, as illustrated in Fig. 8A in the initial part of the chromatogram (up to 2000 points). Nucleosides such as 3-methylcytidine, cytidine and pseudouridine are poorly retained on a reversed-phase chromatographic column, so their retention times can be unrepeatable. The analytes of moderate and strong retention properties elute from the column in a predictable way and their possible shifts are mainly due to analytical variability. Hence, for these analytes systematic shifts dominate over random ones. In order to illustrate the nature of the displacements (systematic or random), the difference between the retention times
of common peaks and their positions in the target chromatogram has been calculated. Such curves were plotted for (a) the systematic and (b) the random simulated data sets as well as for (c) the real chromatographic data for comparison, and are presented in Fig. 9. In the case of systematic displacements, the difference between the retention times of peaks in the sample chromatogram and the target chromatogram is constant. In the case of random displacements, no pattern can be distinguished.

Fig. 9. The character of possible shifts on simulated (A and B) and experimental data set (C). (A) Systematic shifts, (B) random shifts, and (C) systematic and random shifts.

As can be seen, the first two peaks and the last one from the real chromatographic data behave randomly, whereas the rest have a systematic character. Thus, the COW method does not seem adequate, especially for the random fragments. Nonetheless, we tried it, but the results were so unsatisfactory in the initial section of the chromatogram that we rejected this method and applied the supervised alignment instead. The time needed for optimization of the two adjustable parameters, segment length and slack parameter, was about 11 h (635 min). The analysis using the optimized parameters (segment length = 100, slack parameter = 9) then took an additional 12.53 min. As a result, we aligned 208 samples, but the first two peaks of each chromatogram were still significantly shifted relative to each other. The highest average correlation coefficient after alignment was 0.436, only slightly improved in comparison with the chromatograms before the alignment (R = 0.316). Fig. S1 in the Supplementary material illustrates the alignment of the 208 chromatographic samples after the implementation of the COW method. As far as the supervised alignment is concerned, we chose 10 peaks that occurred in every single chromatogram from the 208 samples in the matrix and aligned them using the SA algorithm. Fig. 8A and B describes the chromatograms and the maximum averaged correlation coefficient of the matrix before and after alignment. The total analysis time is about 1 h, and most of the time is spent on choosing the most relevant peaks present in all samples, as well as their number, in order to obtain a well-aligned data set for further statistical analysis.
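The shift diagnostic described for Fig. 9 (the difference between the retention time of each common peak and its position in the target chromatogram) can be sketched in a few lines. The function names and the tolerance `tol` are hypothetical and not taken from the article; the sketch only encodes the stated criterion that a constant difference indicates a systematic shift.

```python
import numpy as np

def shift_profile(sample_peaks, target_peaks):
    """Per-peak retention shift (sample minus target) for one chromatogram."""
    return np.asarray(sample_peaks, float) - np.asarray(target_peaks, float)

def looks_systematic(shifts, tol=1.0):
    """True when all peaks move by (nearly) the same amount.

    `tol` is a hypothetical tolerance in data points, chosen for
    illustration only; it is not a value from the article.
    """
    shifts = np.asarray(shifts, float)
    return bool(shifts.max() - shifts.min() <= tol)

target = [100, 300, 500]
systematic = shift_profile([105, 305, 505], target)   # constant offset
random_like = shift_profile([110, 290, 520], target)  # no common offset
```

Plotting such profiles against peak position for each sample would reproduce the kind of curves the text describes: a flat line for systematic shifts and no visible pattern for random ones.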

5. Conclusions

In this article the supervised alignment (SA) method of peak alignment has been introduced and compared with correlation optimized warping (COW). Using simulated data sets, it has been shown that supervised alignment can very efficiently handle systematic peak shifts. Alignment by the COW method comes down to the use of two input parameters, the so-called segment length and slack parameter, and the choice of these parameters is often made by trial and error. Additionally, applying the COW algorithm to data containing more than 1000 points makes this method very time consuming. In contrast to the COW, the SA neither splits the time vector into segments nor moves peaks within the segments according to the value of the slack parameter. Instead, using the supervised alignment we can determine the maximum absorbance of several peaks present in each sample and change their position on the time axis relative to their corresponding location on the target chromatogram. It is important to emphasize that the main limitation of the SA is the need to confirm the identity of analytes in each sample. However, it should also be remembered that only a few peaks in a chromatogram need to be identified. For example, the alignment of a matrix that consists of 208 urine samples is performed based on the retention times of only 10 peaks. The quality of the alignment, which is reflected by the correlation coefficient calculated among the newly aligned samples, confirms the usefulness of the supervised alignment, especially for random peak displacements. Moreover, the total analysis time depends mainly on the peaks needed for alignment and lasts from a few minutes to 2 h in total. Concerning analysis of such a large matrix as 208 urine samples by 7500 points, the

total correction time is relatively short. Hence, we would like to underline that the SA method can be a fast and very effective tool for the alignment of chromatographic data, and competitive in terms of alignment quality with another widely used method such as the COW. Our further research will concentrate on implementation of the supervised alignment to new, two-dimensional data sets obtained from CE and NMR techniques.
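The Conclusions mention determining the maximum absorbance of selected peaks in each sample before moving them on the time axis. One simple way to do this, shown here only as an assumed sketch (the function `find_apex` and the idea of a user-supplied search window are not described in the article), is to take the highest point inside a window known to contain the identified peak.

```python
import numpy as np

def find_apex(signal, lo, hi):
    """Index of the highest point of `signal` in the half-open window [lo, hi).

    The window limits are assumed to be supplied by the analyst, who has
    already identified which peak elutes in that region.
    """
    signal = np.asarray(signal, dtype=float)
    return lo + int(np.argmax(signal[lo:hi]))

# Example: two Gaussian peaks, each located within its own window.
t = np.arange(600)
signal = np.exp(-0.5 * ((t - 120) / 5.0) ** 2) \
       + 0.5 * np.exp(-0.5 * ((t - 400) / 8.0) ** 2)
first_apex = find_apex(signal, 100, 150)
second_apex = find_apex(signal, 350, 450)
```

The apex indices found this way would play the role of the per-sample peak positions that are then mapped onto the target chromatogram.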

Conflict of interest

The authors have declared no conflict of interest.

Acknowledgments

The authors would like to thank Prof. Yvan Vander Heyden from Vrije Universiteit Brussel for providing the algorithm of correlation optimized warping.

This work was financially supported by the Ministry of Science and Higher Education, Warsaw, Poland (grant numbers N405 101334 and N405 630338).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.chroma.2012.07.084.

