DISCRIMINATION OF ACCEPTABLE AND CONTAMINATED HEPARIN BY
CHEMOMETRIC ANALYSIS OF PROTON NUCLEAR MAGNETIC
RESONANCE SPECTRAL DATA
By
Qingda Zang
A Dissertation Submitted to
the University of Medicine and Dentistry of New Jersey – School of
Health Related Professions in partial fulfillment of the Requirements for
the Degree of Doctor of Philosophy
Department of Health Informatics
April, 2011
ABSTRACT
DISCRIMINATION OF ACCEPTABLE AND CONTAMINATED HEPARIN BY
CHEMOMETRIC ANALYSIS OF PROTON NUCLEAR MAGNETIC
RESONANCE SPECTRAL DATA
Qingda Zang
Heparin is a highly effective anticoagulant that can contain varying
amounts of undesirable galactosamine impurities (mostly dermatan sulfate or
DS), the level of which indicates the purity of the drug substance. Currently,
the United States Pharmacopeia (USP) monograph for heparin purity dictates
that the weight percent of galactosamine in total hexosamine (%Gal) may not
exceed 1%. In 2007 and 2008, heparin contaminated with oversulfated
chondroitin sulfate (OSCS) was associated with adverse clinical effects, i.e., a
rapid and acute onset of a potentially fatal anaphylactoid-type reaction. In
order to develop efficient and reliable screening methods for detecting and
identifying contaminants in existing and future lots of heparin to ensure the
integrity of the global supply, chemometric techniques for heparin proton
nuclear magnetic resonance (1H NMR) spectral data were applied to establish
adequate multivariate statistical models for discrimination between pure
heparin samples and those deemed unacceptable based on their levels of DS
and/or OSCS.
This research consisted of two parts: (1) the development of
quantitative regression models to predict the %Gal in various heparin
samples from NMR spectral data. Multivariate analyses including multiple
linear regression (MLR), Ridge regression (RR), partial least squares
regression (PLSR), and support vector regression (SVR) were employed in
this investigation. To obtain stable and robust models with high predictive
ability, variables were selected by genetic algorithms (GA) and stepwise
methods; (2) differentiation of heparin samples from impurities and
contaminants using various pattern recognition and classification
approaches, including principal components analysis (PCA), partial least
squares discriminant analysis (PLS-DA), linear discriminant analysis (LDA), k-
nearest-neighbor (kNN), classification and regression tree (CART), artificial
neural networks (ANN) and support vector machine (SVM), as well as the
class modeling techniques soft-independent modeling of class analogy
(SIMCA) and unequal dispersed classes (UNEQ).
Overall, the results from this study demonstrate that NMR spectroscopy
coupled with multivariate chemometric techniques shows promise as a
valuable tool for evaluating the quality of heparin sodium active
pharmaceutical ingredients (APIs). The models developed here may also
prove useful for monitoring the purity of other complex pharmaceutical
products from high-information-content data.
ACKNOWLEDGEMENTS
I would like to acknowledge my advisor, Dr. Dinesh P. Mital, for his inspiring
supervision and supportive attitude. The completion of this dissertation could
not have been possible without his invaluable guidance and unending
patience.
I wish to express my gratitude to my co-advisor, Dr. William J. Welsh, who
has given me the opportunity to be where I am today. I would like to thank
him for trusting me and letting me go my own way.
I want to express my sincere thanks to the faculty members at the
Department of Health Informatics, especially to Dr. Syed S. Haque, Dr.
Shankar Srinivasan, and Dr. Masayuki Shibata, for their expertise, training,
advice and assistance throughout my graduate study.
I am very grateful to Dr. Richard D. Wood at Snowdon, Inc. for his
stimulating discussion, timely encouragement and constructive suggestions.
I would also like to thank the staff at the US Food and Drug Administration
(FDA). They provided the analysis data and more importantly, the financial
support, which made the research work possible. The collaboration with them
has greatly broadened my perspectives and I have learned a great deal from
them. Special thanks to Dr. Lucinda F. Buhse, Dr. David A. Keire, Dr.
Christine M. V. Moore, Dr. Moheb Nasr, Dr. Ali Al-Hakim, and Dr. Michael L.
Trehy.
I would like to extend my gratitude to Dr. Dmitriy Chekmarev at the
Department of Pharmacology for taking the time to review this dissertation
and for his valuable comments and feedback.
Finally, I wish to thank my colleagues in Dr. Welsh's group, Dr. Ni Ai, Dr.
Vladyslav Kholodovych, Dr. Eric Kaipeen Yang and Dr. Oyenike Olabisi, for
their consistent enthusiasm, their reliable willingness to help, and the friendly
and pleasant environment they created.
TABLE OF CONTENTS
ABSTRACT ..................................................................................................... iii
ACKNOWLEDGEMENTS ............................................................................... v
LIST OF TABLES ............................................................................................ix
LIST OF FIGURES ..........................................................................................xi
Chapter I. INTRODUCTION ............................................................................ 1
1.1 Statement of the Problem ...................................................................... 1
1.2 Background of the Problem ................................................................... 4
1.3 Objectives of the Research .................................................................... 7
1.4 Research Hypotheses ........................................................................... 9
1.5 Results and Significance of the Research ........................................... 11
Chapter II. LITERATURE REVIEW ............................................................... 16
2.1 The Structure, Preparation and Medical Use of Heparin ..................... 17
2.1.1 Structures of Glycosaminoglycans (GAGs) ................................... 17
2.1.2 Preparation of Heparin .................................................................. 21
2.1.3 Medical Use of Heparin ................................................................. 22
2.2 Heparin Crisis ...................................................................................... 24
2.2.1 Adverse Events ............................................................................. 25
2.2.2 Contaminant Identification ............................................................. 26
2.2.3 USP Monograph for Heparin Quality ............................................. 32
2.3 Chemometrics and its Application in Heparin Field ............................. 33
2.3.1 Variable Selection ......................................................................... 34
2.3.2 Multivariate Regression Analysis .................................................. 39
2.3.3 Chemometric Pattern Recognition ................................................ 46
2.3.4 Application of Chemometrics in Heparin Field .............................. 67
Chapter III. DATA AND METHODS .............................................................. 72
3.1 Heparin Samples ................................................................................. 72
3.1.1 Pure, Impure and Contaminated Heparin APIs for Classification .. 72
3.1.2 Heparin API Samples for %Gal Determination .............................. 73
3.1.3 Blends of Heparin Spiked with other GAGs .................................. 74
3.2 Proton NMR Spectra............................................................................ 75
3.3 Data Processing .................................................................................. 77
3.4 Computational Programs ..................................................................... 79
3.5 Performance Validation ....................................................................... 80
Chapter IV. RESULTS AND DISCUSSION ................................................... 82
4.1 Multivariate Regression Analysis for Predicting %Gal ......................... 82
4.1.1 Variable Selection ......................................................................... 82
4.1.2 Multiple Linear Regression Analysis ............................................. 90
4.1.3 Ridge Regression Analysis ........................................................... 97
4.1.4 Partial Least Squares Regression Analysis ................................ 101
4.1.5 Support Vector Regression Analysis ........................................... 105
4.2 Classification of Pure and Contaminated Heparin Samples .............. 108
4.2.1 Principal Components Analysis ................................................... 110
4.2.2 Partial Least Squares Discriminant Analysis ............................... 115
4.2.3 Linear Discriminant Analysis ....................................................... 119
4.2.4 k-Nearest-Neighbor ..................................................................... 123
4.2.5 Classification and Regression Tree ............................................. 128
4.2.6 Artificial Neural Networks ............................................................ 133
4.2.7 Support Vector Machine .............................................................. 137
4.2.8 Analysis of Misclassifications ...................................................... 141
4.2.9 Classification Analysis of Heparin Spiked with other GAGs ........ 145
4.3 Class Modeling for Discriminating Heparin Samples ......................... 149
4.3.1 SIMCA Analysis .......................................................................... 149
4.3.2 UNEQ Analysis ........................................................................... 165
Chapter V. SUMMARY AND CONCLUSIONS ............................................ 173
5.1 Multivariate Regression for Predicting %Gal ..................................... 173
5.2 Classification for Pure and Contaminated Heparin Samples ............. 175
5.3 Class Modeling Using SIMCA and UNEQ ......................................... 180
Chapter VI. FUTURE DIRECTION FOR RESEARCH ................................ 184
References .................................................................................................. 188
Appendix A: Abbreviations .......................................................................... 204
Appendix B: Index ....................................................................................... 207
LIST OF TABLES
Table 1. Summary Statistics of %Gal Measured from HPLC ........................... 74
Table 2. Variable IDs and their Corresponding Chemical Shifts ...................... 79
Table 3. The Stepwise Variable Selection Procedure for Dataset A ............... 85
Table 4. The Stepwise Variable Selection Procedure for Dataset B ............... 86
Table 5. Parameters for the Genetic Algorithms ................................................ 87
Table 6. The Variables (ppm) Selected by Genetic Algorithms ....................... 89
Table 7. Model Parameters of Multiple Linear Regression (MLR) ................... 92
Table 8. Model Parameters of Ridge Regression (RR) ................................... 100
Table 9. Model Parameters of Partial Least Squares Regression (PLSR) .. 104
Table 10. Model Parameters for Support Vector Regression with RBF Kernel................................................................................................................................... 107
Table 11. Number and Type of Misclassifications (Errors) by PLS-DA Classification ........................................................................................................... 118
Table 12. Wilks' Lambda (λ) and F-to-enter (F) of Variables (V) for Various
Models ...................................................................................................................... 120
Table 13. Performance of LDA Classification Models under Different Variables .................................................................................................................. 121
Table 14. Performance of kNN Classification Models for Original Data ....... 124
Table 15. Performance of PCA-kNN Classification Models under Different PCs ........................................................................................................................... 125
Table 16. Model Parameters and Classification Rates for CART .................. 130
Table 17. Model Parameters and Classification Rates for ANN .................... 137
Table 18. Model Parameters and Classification Rates for SVM .................... 141
Table 19. Classification Matrices for the Heparin vs DS Model in the 1.95-5.70 ppm Region .................................................................................................... 143
Table 20. Classification Matrices for the Heparin vs [DS + OSCS] Model in the 1.95-5.70 ppm Region .................................................................................... 144
Table 21. Classification Matrices for the Heparin vs DS vs OSCS Model in the 1.95-5.70 ppm Region ........................................................................................... 144
Table 22. Compositions of the Series of Blends of Heparin Spiked with other GAGs and Test Results for Classification from SVM, CART and ANN in the 1.95-5.70 ppm Region ........................................................................................... 148
Table 23. Sensitivity and Specificity from SIMCA Modeling for Heparin, DS, and OSCS ............................................................................................................... 151
Table 24. Classification Matrices and Success Rates from SIMCA Class Modeling for Heparin, DS and OSCS ................................................................. 157
Table 25. Discriminant Powers (DP) of Variables (V) for Various Models ... 161
Table 26. The Compositions of the Series of Blends of Heparin Spiked with other GAGs and Test Results from Class Modeling ......................................... 164
Table 27. Wilks Lambda (λ) and F-to-enter (F) Values of Variables (V) ....... 167
Table 28. Sensitivity and Specificity from UNEQ Class Modeling for Heparin, DS and OSCS ......................................................................................................... 169
Table 29. Classification Matrices from UNEQ Class Modeling for Heparin, DS and OSCS ............................................................................................................... 172
LIST OF FIGURES
Figure 1. Three-dimensional structures of heparin. ........................................... 18
Figure 2. Structural formulae of heparin, dermatan sulfate, chondroitin sulfate A and C, and oversulfated chondroitin sulfate ..................................................... 19
Figure 3. Monthly event date distributions of heparin allergic-type reports received from January 1, 2007 to September 30, 2008 ..................................... 26
Figure 4. NMR analysis of standard heparin, heparin containing natural dermatan sulfate and contaminated heparin ....................................................... 29
Figure 5. The molecular structures of heparin and OSCS ................................ 30
Figure 6. Schematic diagram representing the process of assessing sample class from raw NMR spectra .................................................................................. 49
Figure 7. Structure of a classification or regression tree .................................. 56
Figure 8. A fully connected multilayer feedforward network ............................. 58
Figure 9. Non-linear separation case in the low dimension input space and linear separation case in the high dimension feature space ............................. 61
Figure 10. Scores plot of the PCA analysis of the spectral data set ............... 69
Figure 11. Separation of the samples containing OSCS from those not containing OSCS in a score-plot of a PCA model .............................................. 70
Figure 12. Comparison of Raman spectra of heparin and the principal contaminants and Raman PLS model test for OSCS ........................................ 71
Figure 13. An overlay of the 500 MHz 1H NMR spectra of a heparin sodium
Figure 14. The relationship between the Bayes information criterion (BIC) and the number of variables selected by the stepwise procedure ................... 84
Figure 15. Histograms of frequency for the selected variables by GAs ......... 88
Figure 16. Predicted (from NMR data) versus measured (from HPLC) %Gal for Dataset A (%Gal: 0-10) ..................................................................................... 93
Figure 17. Predicted (from NMR data) versus measured (from HPLC) %Gal for Dataset B (%Gal: 0-2) ........................................................................................ 96
Figure 18. Ridge regression for the heparin 1H NMR data at 40 variables selected from GA ...................................................................................................... 99
Figure 19. The relationship between the component number of PLSR and the standard error of prediction (SEP) for Dataset A .............................................. 102
Figure 20. Scores plots for the model Heparin vs DS ..................................... 112
Figure 21. Scores plots for the model Heparin vs OSCS ............................... 113
Figure 22. Scores plots for the model Heparin vs DS vs OSCS .................... 114
Figure 23. Misclassification rate as a function of the number of PLS components for the PLS-DA model ..................................................................... 116
Figure 24. kNN classification for heparin-contaminant data over the range k =1 to k = 25 ............................................................................................................. 127
Figure 25. Classification trees and their corresponding complexity parameter CP for model Heparin vs DS vs OSCS ............................................................... 129
Figure 26. The variations of misclassification errors from ANN with the hidden units and weight decay for the model Heparin vs DS vs OSCS for the data set in the 1.95-5.70 ppm range ................................................................... 136
Figure 27. Contour plots obtained from 9×9 grid search of the optimal values of γ and C for the SVM model .............................................................................. 140
Figure 28. Dendrogram on the blends of heparin spiked with other GAGs 147
Figure 29. Coomans plots for SIMCA class modeling ..................................... 153
Figure 30. Coomans plots for UNEQ class modeling ...................................... 171
Figure 31. Comparison of the classification results of the six approaches .. 179
Figure 32. Overlaid plots of the SAX-HPLC chromatograms ......................... 185
Figure 33. Near infrared spectra of 108 heparin samples that contain DS impurities and OSCS contaminants .................................................................... 186
Chapter I
INTRODUCTION
1.1 Statement of the Problem
Heparin, a highly sulfated glycosaminoglycan, is widely used as an
anticoagulant. This drug substance is obtained from biological sources and
always contains varying amounts of undesirable impurities. Among these,
chondroitin sulfate A (CSA) and chondroitin sulfate B (i.e., dermatan sulfate or
DS) have been identified. These chondroitin derivatives differ from heparin in
that they contain galactosamine, the level of which is used as an indicator for
the quality of the drug. Currently, the United States Pharmacopeia (USP)
monograph for heparin purity dictates that the weight percent of
galactosamine (%Gal) may not exceed 1%. Hence the accurate
measurement of the %Gal in heparin is an important parameter to assure the
safety and efficacy of the drug. The experimental determination of %Gal by
acid digestion and high-performance liquid chromatography (HPLC) with a
pulsed amperometric detector requires expert operators, expensive
equipment and careful sample preparation. By contrast, although the nuclear
magnetic resonance (NMR) approach requires more expensive equipment
than the HPLC method, the sample preparation is minimal and the data are
already required for other aspects of USP testing. Therefore, the development
of theoretical methods for the prediction of %Gal values from NMR spectral
data is of particular interest.
In late 2007 and early 2008, heparin sodium contaminated with
oversulfated chondroitin sulfate A (OSCS) was associated with a rapid and
acute onset of an anaphylactic reaction. In addition, naturally occurring
dermatan sulfate (DS) with concentrations up to a few percent was found to
be present in heparin samples as an impurity due to incomplete purification. It
is desirable to develop simple and effective screening analytical methods for
detecting and identifying contaminants and impurities in existing and future
lots of heparin. Because unique signals associated with OSCS or DS in
contaminated or impure heparin were observed in the NMR spectra, the
present study was undertaken to determine whether chemometric statistical
analysis of these NMR spectral data would be useful for discrimination
between USP-grade samples of heparin sodium active pharmaceutical
ingredients (APIs) and those deemed unacceptable based on their levels of
OSCS and/or DS. For this purpose, pattern recognition techniques for 1H
NMR spectral data were applied to establish adequate mathematical models
for revealing similarities and differences between heparin and contaminants.
In order to differentiate heparin samples with varying amounts of DS
impurities and OSCS contaminants, proton NMR spectral data for heparin
sodium API samples from different manufacturers were analyzed by
multivariate statistical methods for quantitative determination and qualitative
classification. The research was divided into two parts:
multivariate regression analysis for the prediction of %Gal and pattern
recognition analysis for the differentiation of pure, impure and contaminated
heparin samples.
1. Quantitative determination of %Gal. A combination of
spectroscopy and chemometric methods was proposed for the prediction of
%Gal. Multivariate analyses including multiple linear regression (MLR), Ridge
regression (RR), partial least squares regression (PLSR), and support vector
regression (SVR) were employed in the present investigation. To obtain
stable and robust models with high predictive ability, variables were selected
by genetic algorithms (GAs) and stepwise methods.
2. Discrimination of pure, impure and contaminated heparin samples.
Heparin sample classifications were performed by applying multivariate
statistical approaches such as principal component analysis (PCA), partial
least squares discriminant analysis (PLS-DA), linear discriminant analysis
(LDA), k-nearest neighbors (kNN), classification and regression tree (CART),
artificial neural network (ANN), support vector machine (SVM), as well as
class-modeling techniques, such as soft-independent modeling of class
analogy (SIMCA) and unequal dispersed classes (UNEQ) for analysis of
proton NMR spectral data in order to distinguish between pure, impure and
contaminated heparin. The NMR signals were employed as fingerprints, and
classification models were built and validated for the determination of the
contaminant and/or impurity in the lots of heparin.
1.2 Background of the Problem
Heparin is a naturally occurring polydisperse mixture of linear, highly
sulfated carbohydrates composed of repeating disaccharide units, which
generally comprise a 6-O-sulfated, N-sulfated glucosamine alternating with a
2-O-sulfated iduronic acid [1-3]. As a member of the glycosaminoglycan
(GAG) family, heparin has the highest negative charge density among known
biological molecules. During heparin biosynthesis, the polysaccharide chains
are incompletely modified and variably elongated, leading to heterogeneity in
chemical structure, diversity in sulfation patterns, and polydispersity in
molecular mass [4]. As one of the oldest drugs still in widespread clinical use,
heparin is highly effective in kidney dialysis and cardiac surgery. Heparin is
the most widely used anticoagulant for preventing or treating thromboembolic
disorders, and for inhibiting coagulation during hemodialysis and
extracorporeal blood circulation [5-8].
Pharmaceutical heparin is usually derived by extraction from animal tissues,
such as bovine, ovine, and porcine intestinal mucosa or bovine lung after
proteolytic digestion, and then precipitating the preparations as quaternary
ammonium complexes or barium salts, and eventually as sodium or calcium
salts [9-12]. Crude heparin contains proteins, nucleic acid, and other related
GAGs, such as heparan sulfate (HS), dermatan sulfate (DS), chondroitin
sulfate (CS), and hyaluronic acid (HA) [13]. Subsequent purification by
proprietary processes converts raw heparin into active pharmaceutical
ingredients (APIs), and the differences in these processes lead to variation in
the amount of native impurities in the final product [14, 15]. Dermatan sulfate
(DS) is the most common chondroitin sulfate impurity in heparin. DS is
composed of alternating iduronic acid-galactosamine disaccharide units and,
due to their similarity to the iduronic acid-glucosamine disaccharide units of
heparin, heparin APIs always contain varying levels of DS owing to this strong
affinity and incomplete purification [16]. The stage 2 USP monograph
for heparin sodium limits %Gal to not more than 1%. To ensure the
appropriate biological activity, chemical parameters, including purity,
molecular mass distribution, degree of sulfation, as well as the presence of
specific oligosaccharide sequences, must be strictly controlled. It is difficult to
accurately determine the precise chemical structure and to measure the
performance of purification protocol due to the heterogeneity of heparin
preparations [17-20].
Starting in November 2007, hundreds of cases of adverse reactions to
heparin, such as hypotension, severe allergic symptoms, and even death in
patients undergoing hemodialysis and receiving bolus injections of heparin
sodium, were reported to the US Food and Drug Administration (FDA) [21-
23]. Prompted by these adverse events, biological and analytical methods
were developed to identify contaminants and impurities in heparin [14, 15, 24-
28]. Oversulfated chondroitin sulfate (OSCS) was identified as a contaminant
associated with these adverse clinical effects. In standard drug potency
assays, the OSCS molecule can partially mimic the anti-coagulation activity of
heparin. OSCS is not known to be a natural product, but is semi-synthesized
by chemically modifying another GAG, chondroitin sulfate A (CSA). While
CSA normally contains one sulfate group per disaccharide unit, the
predominant structure of OSCS was found to have four sulfate groups per
disaccharide [13], suggesting that CSA was undergone complete or nearly
complete sulfonation of all hydroxyl groups. Since OSCS is a synthetic
substance, it must have been accidentally or deliberately mixed with the
heparin lots from outside a normal process step.
To ensure the safety and quality of heparin, spectroscopic and
chromatographic methods have been added to the USP monograph for
heparin APIs to detect and screen for impurities and contaminants [14, 15,
26, 27]. During the recent contamination crisis, nuclear magnetic resonance
(NMR) spectroscopy played a critical role in identifying the structure of OSCS
contaminating heparin [21, 29-33] while capillary electrophoresis (CE) [17, 27,
34, 35] and strong anion exchange high-performance liquid chromatography
(SAX-HPLC) [14, 15, 26] were used to measure the relative amounts of
heparin, DS and OSCS. Of these three analytical techniques, the complex
pattern of overlapping 1H NMR signals found in the heparin spectra was
judged most effective to assess structural information. As part of this study,
blinded 1H NMR data from heparin samples analyzed by FDA personnel were
provided for chemometric analysis.
1.3 Objectives of the Research
OSCS and DS have been determined as potential contaminants by NMR
spectroscopy, CE and SAX-HPLC. In general, these techniques require
expert operators and sophisticated instrumentation (e.g., high field NMR) with
a concomitant added cost to the analysis, which underscores the need to
develop rapid and sensitive analytical methods to screen for the presence of
these substances in existing and future lots of heparin and to ensure the
integrity of the global supply of heparin. In addition, the new USP specification
states that the limit for galactosamine concentrations (%Gal) is 1.0%, so it is
crucial to accurately determine the %Gal in heparin. The experimental
determination of %Gal is time consuming and tedious, and hence
development of theoretical methods for the prediction of this value is of
particular interest.
At present, powerful analytical approaches such as spectroscopic
techniques allow us to acquire high dimensional datasets from which valuable
information can be extracted by multivariate statistical methods. Pattern
recognition techniques are becoming increasingly popular in food chemistry,
pharmaceutical chemistry and medical sciences. Chemometric methods can
be applied to discern inherent patterns, classify objects and predict their
origin, reveal groupings, similarities or differences among samples in complex
datasets, and are especially suitable for cases in which there are more
variables than objects in the data matrices [36-39]. Discrimination of different
groups can be carried out either in an unsupervised way if no information
about the classes is available [36, 40], or in a supervised way where the class
membership of a sample from a test dataset can be predicted based on the
mathematical models derived from the training dataset, and class information
can be used to maximize the separation between groups [41-43].
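The unsupervised case can be sketched with a bare-bones PCA: projecting mean-centered spectra onto their leading principal components can reveal groupings without using any class labels. The data below are synthetic (two hypothetical groups differing in a few spectral bins), not the heparin data themselves.

```python
import numpy as np

# Two synthetic groups of "spectra": group_b has extra intensity in a
# few bins, mimicking a contaminant signal; no labels are given to PCA.
rng = np.random.default_rng(1)
group_a = rng.normal(size=(20, 50))
group_b = rng.normal(size=(20, 50))
group_b[:, 5:10] += 3.0
X = np.vstack([group_a, group_b])

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # sample coordinates on the PCs

# When the contaminant signal dominates the variance, the two groups
# separate along PC1 even though PCA never saw the group labels.
pc1 = scores[:, 0]
print(round(abs(pc1[:20].mean() - pc1[20:].mean()), 2))
```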
The research objectives of the present study are:
1. The development of quantitative statistical models to predict the %Gal in
various heparin samples from NMR data. The combination of spectroscopy
and chemometric methods is consequently proposed for the quantitative
determination of %Gal. Multivariate analyses including multiple linear
regression (MLR), Ridge regression (RR), partial least squares regression
(PLSR), and support vector regression (SVR) are used in the present
investigation. In order to obtain stable and robust models with high predictive
ability, variables are selected by genetic algorithms (GAs) and stepwise
methods.
2. The application of chemometric tools for analysis of proton NMR data in
order to distinguish between acceptable and contaminated heparin from
various origins in complex systems. The NMR signals are employed as
fingerprints, and classification models are built and validated for the
identification of the contaminant and/or impurity in the lots of heparin.
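The stepwise variable selection mentioned in objective 1 can be sketched as a greedy forward search scored by the Bayesian information criterion (BIC). This is an illustrative numpy-only version on synthetic data, not the procedure used in the study; the data, variable indices, and stopping rule are assumptions.

```python
import numpy as np

# Synthetic data: only variables 3 and 7 actually influence y.
rng = np.random.default_rng(3)
n, p = 50, 30
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 3] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=n)

def rss(X_sub, y):
    """Residual sum of squares of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    r = y - X_sub @ beta
    return r @ r

# Forward stepwise selection: at each step add the variable that most
# improves BIC = n*log(RSS/n) + k*log(n); stop when nothing improves.
selected, best_bic = [], np.inf
while True:
    candidates = [j for j in range(p) if j not in selected]
    if not candidates:
        break
    scored = []
    for j in candidates:
        cols = selected + [j]
        k = len(cols) + 1                      # +1 for the intercept
        Xs = np.column_stack([np.ones(n), X[:, cols]])
        scored.append((n * np.log(rss(Xs, y) / n) + k * np.log(n), j))
    bic, j = min(scored)
    if bic >= best_bic:
        break
    best_bic, selected = bic, selected + [j]

print(sorted(selected))
```

The genetic algorithm alternative searches the same model space, but by evolving a population of candidate variable subsets instead of growing one subset greedily.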
The overall purpose of this study was to develop multivariate statistical
models that, once validated, will enable rapid and effective screening of new
lots of bulk heparin APIs to detect and quantify DS impurities and OSCS
contaminants. In practice, these models are intended for use by a non-expert
operator to afford decision support on sample quality from high information
content data and to aid in the analysis of complex drugs like heparin.
1.4 Research Hypotheses
1H NMR spectroscopy is very sensitive to minor structural variations, and
hence the repeating disaccharide units of heparin can be easily identified in
1H NMR spectra by specific signals [29, 44, 45]. The 1H NMR technique is
commonly used for determination of the chemical composition of heparin and
its derivatives, as well as for the identification of contaminants from various
sources [21, 30, 33, 46].
When analyzing complex samples, the assignment of all peaks in the NMR
spectrum is seldom accomplished. However, this does not invalidate the
analysis, since even unidentified signals can be used as fingerprints of
analytes for quality assessment and purity control in drug research. These
capabilities can be reinforced by a combination of chemometric tools,
which extract additional information from the data generated [47, 48].
In heparin studies, the NMR technique can produce data sets with high
information content, and the spectral fingerprints provide an overview of
similarities and differences among heparin samples with different DS and
OSCS levels. While some differences can be determined simply by inspection
of these spectra, a quantitative analysis is required to extract the maximum
information from the datasets.
Multivariate analysis approaches for classification and differentiation are
well-established [49]. Chemometric pattern recognition has been widely
applied in the fields of foods [50-52] and drugs [53-55] for authenticating and
identifying the origin of products. With the help of chemometric techniques,
valuable chemical information from complex NMR spectra can be extracted
by transforming the spectral data into discrete variables, and the
characterization and quantification of analytes can be accomplished by using
the NMR signals as fingerprints. Chemometric models have been
successfully applied to the study of 1H NMR spectra of several heparin
samples [8, 44, 46].
In the present study, 1H NMR spectra of heparin samples are used as
multivariate data for chemometric analysis, and the following hypotheses are
proposed:
1. Chemometric approaches can reduce the complexity of information-rich
data sets: 1H NMR spectral data can be converted into useful information
using multivariate tools. The procedure for processing all spectra under the
same conditions should be kept as simple as possible without compromising
the accuracy of the quantification.
2. The galactosamine content (%Gal) measured by SAX-HPLC can be
correlated with the structural information extracted from 1H NMR spectra of
heparin. That is, it is possible to reliably quantify galactosamine in heparin
samples and predict %Gal from characteristic 1H NMR signals by multivariate
calibration techniques.
3. Subtle changes in the structure of heparin from different sources can be
used for the quality control of pharmaceutical preparations. Chemometric
pattern recognition can be applied as a highly sensitive assay to test for the
presence of oversulfated contaminants in heparin and reveal inherent
patterns. These multivariate models can then be used to rapidly screen new
lots of bulk heparin API for the presence of OSCS and GAG contaminants.
They are able to statistically distinguish good samples from bad ones.
1.5 Results and Significance of the Research
For heparin samples studied here, the individual NMR fingerprints were
analyzed using chemometric tools to characterize and quantify galactosamine
for quality control or purity assessment and to differentiate the samples into
separate groups corresponding to pure, impure or contaminated heparin. The
following results were achieved.
1. Regression analysis. Multivariate statistical analysis of 1H NMR spectral
data obtained on heparin samples was employed to build computational
models for the prediction of %Gal. Genetic algorithms (GAs) and stepwise
selection methods were applied for variable selection prior to multivariate
regression (MVR) analysis by multiple linear regression (MLR), Ridge
regression (RR), partial least squares regression (PLSR), and support vector
regression (SVR). Two data sets were extracted from the NMR data: Dataset
A covered 0-10% galactosamine, and Dataset B covered 0-2%
galactosamine. In all cases, the MVR models obtained using
variable selection outperformed those obtained when all the variables were
considered. Compared with the stepwise selection method, GA variable
selection produced the best MVR models in terms of model simplicity (fewest
independent variables) and predictive ability. The
four regression techniques were comparable in performance for Dataset A
with low prediction errors under optimal conditions, whereas SVR was clearly
superior to the other three regression approaches for Dataset B. The
coefficient of determination (R2) of the linear regression analysis between the
galactosamine content obtained by rigorous HPLC analysis and that predicted
by the models based on NMR data for the test samples using the optimal
number of variables was 0.992 for Dataset A and 0.972 for Dataset B.
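The GA-based variable selection highlighted above can be sketched in miniature as follows. The population size, number of generations, mutation rate, and fitness function (cross-validated R² of a linear model) are illustrative assumptions on synthetic data, not the settings used in this work.

```python
# Sketch: genetic-algorithm variable selection for a linear regression model,
# in the spirit of the GA/MVR pairing described above. Chromosomes are
# binary masks over the variables; fitness is 5-fold cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p, informative = 60, 30, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:informative] = 2.0                  # only the first 5 variables matter
y = X @ beta + rng.normal(scale=0.5, size=n)

def fitness(mask):
    """Cross-validated R^2 of a linear model on the selected variables."""
    if not mask.any():
        return -1.0
    return cross_val_score(LinearRegression(), X[:, mask], y,
                           cv=5, scoring="r2").mean()

pop = rng.random((20, p)) < 0.3           # 20 random binary chromosomes
for generation in range(25):
    scores = np.array([fitness(m) for m in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:10]]             # keep the fitter half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, p)          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(p) < 0.02       # light mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected variables:", np.flatnonzero(best))
```

Selection pressure drives the masks toward subsets that predict well with few variables, which is exactly the simplicity/predictive-ability trade-off noted above.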
2. Classification analysis. The samples were treated as two-class models
(Heparin vs DS, Heparin vs OSCS, and Heparin vs [DS + OSCS]) and three-
class models (Heparin vs DS vs OSCS). Several multivariate chemometric
methods for clustering and classification were evaluated, specifically principal
components analysis (PCA), hierarchical cluster analysis (HCA), partial least
squares discriminant analysis (PLS-DA), linear discriminant analysis (LDA), k-
nearest-neighbors (kNN), classification and regression tree (CART), artificial
neural network (ANN), and support vector machine (SVM). Discrimination of
heparin samples from impurities and contaminants was achieved by the
different models. Data dimension reduction and variable selection,
implemented by retaining only the significant PCA components to avoid over-
fitting the training data, markedly improved the performance of the
PLS-DA, LDA and kNN classification models. Three data sets corresponding
to different chemical shift regions (1.95-2.20, 3.10-5.70, and 1.95-5.70 ppm)
were analyzed for CART, ANN and SVM. While all three multivariate
statistical approaches were able to effectively model the data from the 1.95-
2.20 ppm region, SVM was found to substantially outperform CART and ANN
from the 3.10-5.70 ppm region in terms of classification success rate. Under
optimum conditions, a 100% prediction rate was frequently achieved for
discrimination between Heparin and OSCS samples on external test sets.
The classification rates for the Heparin vs DS, Heparin vs [DS + OSCS], and
Heparin vs DS vs OSCS models were 93%, 95%, and 95%, respectively. The
majority of classification errors between Heparin and DS involved cases
where the DS content was close to the 1.0% DS boundary between the two
classes, and could be ascribed to the similarity in NMR chemical shifts of
heparin and DS. When the borderline samples were removed, almost perfect
classification was attained. Among the chemometric methods
evaluated in this study, it was found that the SVM models were superior to the
other models for classification. This study demonstrated that the combination
of proton NMR spectroscopy with multivariate chemometric methods
represents a powerful tool for heparin quality control and purity assessment.
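A minimal two-class SVM of the kind evaluated above can be sketched as follows. The two features stand in for binned intensities near the heparin (2.05 ppm) and OSCS (2.15 ppm) N-acetyl signals; the data are synthetic and the kernel and C value are illustrative choices, not the optimized models of this study.

```python
# Sketch: a two-class SVM discriminating "heparin" from "OSCS-contaminated"
# profiles, echoing the Heparin-vs-OSCS models above. The two synthetic
# features mimic intensities near 2.05 ppm and 2.15 ppm.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 60
# feature 0 ~ heparin N-acetyl signal (2.05 ppm); feature 1 ~ OSCS (2.15 ppm)
heparin = np.column_stack([rng.normal(1.0, 0.1, n), rng.normal(0.02, 0.02, n)])
oscs    = np.column_stack([rng.normal(0.9, 0.1, n), rng.normal(0.30, 0.05, n)])
X = np.vstack([heparin, oscs])
y = np.array([0] * n + [1] * n)          # 0 = heparin, 1 = OSCS-contaminated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"test accuracy = {acc:.2f}")
```

Replacing `SVC` with `DecisionTreeClassifier`, `MLPClassifier`, or `KNeighborsClassifier` gives the CART, ANN, and kNN counterparts on the same split.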
3. Class modeling analysis. The chemometric models were constructed
using soft-independent modeling of class analogy (SIMCA) and unequal class
models (UNEQ) class-modeling techniques, and validated using the leave-
one-out cross-validation (LOO-CV). While SIMCA modeling was conducted
using the entire set of original variables, UNEQ modeling was combined with
variable reduction performed by stepwise linear discriminant analysis (SLDA)
to ensure that the number of samples per class exceeded the number of
variables in the model by at least three-fold. When comparing the modeling
results from these two approaches, it was found that UNEQ exhibited greater
sensitivity (fewer false negatives) while SIMCA exhibited greater specificity
(fewer false positives). For Heparin, DS and OSCS, the sensitivity was 78%
(56/72), 74% (37/50) and 85% (39/46) from SIMCA modeling and 88%
(63/72), 90% (45/50) and 94% (43/46) from UNEQ modeling. For both
approaches, no OSCS sample was accepted by the Heparin class; hence, the
specificity of Heparin with respect to OSCS was 100% (46/46). SIMCA
showed better specificity for Heparin with respect to DS with 90% (45/50)
compared to 54% (27/50) from UNEQ. The overall prediction ability of
classification for Heparin vs DS vs OSCS was superior for UNEQ (85%)
compared with SIMCA (76%). These two chemometric techniques were also
applied to the class modeling for blends of heparin spiked with non-, partially-,
or fully oversulfated chondroitin sulfate A (CSA), chondroitin sulfate B (CSB)
and heparan sulfate (HS) at the 1.0%, 5.0% and 10.0% weight percent levels.
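The class-modeling idea behind SIMCA can be sketched as follows: fit a principal-component model to one class only, then accept or reject new samples by their residual distance to that model. This is a simplified illustration on synthetic data with a 95th-percentile acceptance threshold, not the SIMCA/UNEQ implementation used in this study.

```python
# Sketch of the SIMCA idea: a per-class PCA model plus a residual-distance
# acceptance boundary. Samples far from the class subspace are rejected.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# "heparin" class: variation along a few latent directions plus small noise
latent = rng.normal(size=(70, 3))
loadings = rng.normal(size=(3, 20))
heparin = latent @ loadings + rng.normal(scale=0.1, size=(70, 20))
outlier = rng.normal(scale=3.0, size=(10, 20))   # clearly foreign samples

pca = PCA(n_components=3).fit(heparin)

def residual(X):
    """Distance from samples to the class's PCA subspace."""
    recon = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - recon, axis=1)

threshold = np.percentile(residual(heparin), 95)  # class acceptance boundary
accepted_own = (residual(heparin) <= threshold).mean()
accepted_out = (residual(outlier) <= threshold).mean()
print(f"own-class acceptance: {accepted_own:.2f}, "
      f"outlier acceptance: {accepted_out:.2f}")
```

Own-class acceptance corresponds to sensitivity, and rejection of foreign samples to specificity, which is how the SIMCA and UNEQ models above were compared.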
The results from this study show that 1H NMR spectroscopy, already a
USP requirement for screening of contaminants in heparin, could offer utility
as a rapid method for quantitative determination of %Gal in heparin samples
when used in conjunction with MVR approaches, thereby potentially obviating
labor intensive and costly chemical analysis. In addition, NMR spectroscopy
coupled with chemometric multivariate techniques can be used to differentiate
heparin and its contaminants and identify potential contamination.
Chapter II
LITERATURE REVIEW
Heparin is a member of the glycosaminoglycan (GAG) family of
carbohydrates and is widely used as an injectable anticoagulant and anti-
thrombotic agent. In late 2007 and early 2008, contaminated lots of heparin
were associated with an acute, rapid onset of a potentially fatal
anaphylactoid-type reaction. Nuclear magnetic resonance (NMR)
spectroscopy and other analytical techniques have identified oversulfated
chondroitin sulfate (OSCS) as the contaminant. Fast sample preparation and
straightforward spectral evaluation, together with the use of proton NMR
spectra as unique fingerprints, have made the method highly popular for
quality and purity control. Data analysis has become a fundamental task in the pharmaceutical
field due to the great quantity of analytical information provided by modern
analytical instruments such as NMR. The use of chemometrics is a solution
for performing either qualitative or quantitative analyses.
In this literature review, the first part describes the structure, preparation
and medical use of heparin, and the commonly used chemometric methods
for pharmaceutical applications are reviewed in the second part.
2.1 The Structure, Preparation and Medical Use of Heparin
Heparin, a highly-sulfated glycosaminoglycan polysaccharide and complex
pharmaceutical agent, is widely used as an anticoagulant in multiple settings,
including kidney dialysis, invasive surgical procedures, acute coronary
syndromes, and deep venous thrombosis treatment [5-8]. As one of the oldest
drugs currently still in widespread clinical use, heparin is one of the few
carbohydrate drugs and one of the first biopolymeric drugs [56]. Since its
introduction in the early 20th century, heparin has been an essential drug for
many patients and has become one of the top-selling anticoagulants world-
wide with yearly sales of nearly four billion dollars. Millions of doses of
heparin are dispensed every month and tons of heparin are used every year.
2.1.1 Structures of Glycosaminoglycans (GAGs)
Heparin is a biopolymeric glycosaminoglycan (GAG) consisting of linear
polymer chains. GAGs are composed of repeating disaccharide units
comprised of a hexosamine and a hexuronic acid which may be N- or O-
sulfated in different positions [1-4]. Structurally, they are long, unbranched,
negatively charged, and polydisperse polysaccharides (Figure 1).
Figure 1. Three-dimensional structures of heparin. A & B: 2S0 conformation; C & D: 1C4 conformation. Taken from the Protein Data Bank (www.rcsb.org/pdb).
Depending upon the type of hexosamine unit, GAGs can be classified into
galactosaminoglycans (GalGs) and glucosaminoglycans (GlcGs) [44-46].
Chondroitin sulfates (CSA and CSC) and dermatan sulfate (chondroitin sulfate
B or CSB) are all GalGs that differ in the major uronic acid, which is D-
glucuronic acid for CSA and CSC, and L-iduronic acid for CSB. The uronic
acid is β(1→3)-linked to the N-acetyl-D-galactosamine unit, which is
commonly sulfated at C-4 in the case of CSB and chondroitin 4-sulfate (C4S
or CSA), or at C-6 in chondroitin 6-sulfate (C6S or CSC) (Figure 2). GalGs
are present in connective tissues where CSA predominates in cartilage and
CSB in skin.
Figure 2. Structural formulae of heparin (a), dermatan sulfate (b), chondroitin sulfate A and C (c), and oversulfated chondroitin sulfate (d). For chondroitin sulfate A, R marks the sulfated moiety. For chondroitin sulfate C, the residual group R’ is sulfated. For OSCS, R1–R4 label possibly sulfated moieties. Taken from Ref [30].
The glucosaminoglycans heparin and heparan sulfate are composed of
alternating, α(1→4)-linked N-sulfo-glucosamine and iduronic or
glucuronic acid residues. The most common disaccharide unit of heparin is
composed of a 2-O-sulfo-α-L-iduronic acid 1,4-linked to 6-O-sulfo-N-sulfo-α-
D-glucosamine (Figure 2). On the other hand, the constituent units are
primarily N-acetyl-D-glucosamine and D-glucuronic acid in heparan sulfate.
The disaccharide units can be O-sulfated at C-6 and/or C-3 of glucosamine
unit and also at C-2 of the acid residues. Heparin has the highest negative
charge density of any known biological macromolecule due to the O- and N-
sulfate groups as well as the iduronic acid carboxylate moiety [27].
For GAGs, there are various differences in saccharide units, chain length
and degree of sulfation between the different classes [57]. There is a
significant level of sequence heterogeneity with variation in N-acetylation, N-
sulfation, O-sulfation, and iduronic versus glucuronic acid content.
Superimposed on the polysaccharide backbone are complex patterns of
amido (N) or ester (O)-linked sulfo group substitutions. These subtle
differences create great structural diversity within the GAGs, which underpins
their functional diversity, and presents an enormous challenge for structure
elucidation of these complex molecules [18].
When the various stereoisomers, sugars and sulfation patterns are
combined, there are potentially 32 disaccharide units to be included in
heparin. Heparin is a polydisperse mixture of linear acidic polysaccharides
that vary in molecular weight from 5,000 to 40,000 Da. Heparin consists of
heterogeneous mixtures of highly sulfated glycosaminoglycans (GAGs),
which considerably differ in their individual structure. The mass range and
structural heterogeneity of heparin is due to the variable elongation of the
polysaccharide chains and incomplete modification during its biosynthesis
[19, 20].
2.1.2 Preparation of Heparin
Heparin is usually extracted from the tissues of animals used for
consumption, such as porcine intestinal mucosa and bovine lung, and then
purified and administered as an anticoagulant [10]. For medical applications,
pharmaceutical-grade heparin in the USA is required to be obtained from a
porcine intestinal source. The production process involves a proteolytic
digestion, followed by treatment with ion pairing reagents, precipitation with
quaternary ammonium complexes or barium salts, and fractionation and
purification based on anion exchange and gel filtration chromatography [11,
12].
In the preparation of heparin, the first step is the fractionation of crude
heparin from tissue. The constituents of crude heparin include heparin itself,
and small amounts of other GAGs, including chondroitin sulfate (CS),
dermatan sulfate (DS), hyaluronic acid (HA), heparan sulfate (HS), and some
percentage of non-polysaccharidic components, such as nucleic acid and
proteins [13]. Subsequent purification leads to the conversion of crude
heparin into active pharmaceutical ingredient (API) heparin through a series
of isolation steps as well as specific steps to inactivate adventitious agents,
including viruses.
When heparin APIs are purified from crude heparin by proprietary
processes, the differences in these processes can lead to variation in the
level of native impurities in the heparin APIs produced. The level of
chondroitin sulfates, heparan sulfate, insoluble material, and proteins varies
widely from batch to batch of the crude unrefined heparin.
Heparin APIs and formulations always contain varying amounts (normally
less than 1%) of several natural GAG impurities. Among these
GAGs, dermatan sulfate (DS), a GAG containing L-iduronic acid units as does
heparin, is the most common impurity in heparin due to the structural
similarity and the high chemical affinity between them, which make
effective purification difficult [58]. The content of DS is
an indicator of the purity of the heparin drug substance.
The biological activity of the resulting heparin and related GAGs
preparations depends on various chemical parameters, such as purity,
molecular mass distribution and the extent of sulfation, and the presence of
specific oligosaccharide sequences responsible for certain functions. All these
factors must be controlled in order to obtain the appropriate anticoagulant and
anti-proliferative activities [59, 60].
2.1.3 Medical Use of Heparin
Heparin is a blood thinner that comes in either vials or syringes. It is
degraded when taken orally and therefore has to be administered
parenterally. In some situations, heparin treatment is initiated using a high
bolus dose given directly into the bloodstream (intravenously) over a short
period of time, usually less than one hour [5]. The blood-thinning drug is
highly effective for preventing and treating blood clots in arteries, lungs and
veins. Heparin is often used to thin the blood during surgery, during kidney
dialysis, or while a patient is bedridden. It is also used as a flush product
injected into IV lines to keep them clear of blood clots.
In addition to its classic anticoagulant activity, heparin is extensively
applied in the treatment of a wide range of diseases and is used to form
an inner anticoagulant coating on various experimental and medical
devices such as catheters, stents, filters, test tubes and renal dialysis
machines [24].
Among its clinical applications, natural heparin acts as an anticoagulant,
preventing the formation of clots or extension of existing clots within the blood
and avoiding coagulation during hemodialysis and extracorporeal blood
circulation. While heparin does not itself break down clots that have already
formed, it allows the body's natural clot lysis mechanisms to work
normally. Heparin is generally used for
anticoagulation for the following conditions [6, 7]:
Acute coronary syndrome, e.g., NSTEMI
ECMO circuit for extracorporeal life support
Atrial fibrillation
Cardiopulmonary bypass for heart surgery
Deep-vein thrombosis and pulmonary embolism
In special medical circumstances, high doses of heparin have to be
injected. Thus, it is vital for pharmaceutical companies as well as for
independent quality control laboratories to be able to control its purity by
reliable analytical methods.
Under physiological conditions, the ester and amide sulfate groups are
deprotonated and attract positively-charged counter-ions to form a heparin
salt. It is in this form that heparin is usually administered as an anticoagulant
by binding to the enzyme inhibitor antithrombin III (AT-III). Upon binding to
heparin, AT-III undergoes a conformational change that results in its
activation through an increase in the flexibility of its reactive site loop, which
plays a critical role in blood clot formation, or factor Xa that produces
thrombin. For thrombin inhibition, however, thrombin must also bind to the
heparin polymer at a site proximal to the pentasaccharide. The highly-
negative charge density of heparin contributes to its very strong electrostatic
interaction with thrombin. The formation of a ternary complex between AT,
thrombin, and heparin results in the inactivation of thrombin. The rate of
inactivation of these proteases by AT can increase by up to 1000-fold due to
the binding of heparin [8].
2.2 Heparin Crisis
In 2007 and 2008, heparin raw materials and finished drug products
imported into the United States from foreign countries were found to contain
non-native contaminants that put U.S. consumers at risk and were linked with
an increased incidence of serious adverse events and numerous deaths. This
contamination crisis led to a
collaborative study involving researchers from the FDA, industry, and
academia that identified oversulfated chondroitin sulfate A (OSCS) as the
heparin contaminant whose presence in heparin was associated with
anaphylactic reactions in certain patients.
2.2.1 Adverse Events
From January 1, 2007 through May 31, 2008 during a national
investigation of allergic-type events, the US FDA received over 800 reports of
serious adverse reactions not only in patients undergoing kidney dialysis
treatment but also in patients in other clinical settings, such as those
undergoing cardiac surgical procedures, and at least 238 patients died after
injection of bolus heparin sodium [21, 23]. The presence of the contaminant
within heparin likely led to clinical manifestations and symptoms that
occurred within several minutes after intravenous infusion. Adverse
reactions may include: refractory hypotension leading to organ damage,
organ failure, shock, severe nausea, diaphoresis, tachycardia, urticaria,
angioedema, vasodilation, diarrhea, swelling of the larynx, a sudden drop in
blood pressure, and other symptoms of anaphylaxis such as flushing and
fainting, in some cases ending in death [61, 62].
Because heparin is a drug commonly used in the clinic, occurrence of
these adverse events resulted in a crisis in the United States. Researchers at
the Centers for Disease Control and Prevention realized that the adverse
events were associated with the receipt of heparin sodium for injection,
manufactured by Baxter Healthcare. Thus, Baxter Healthcare issued recalls
of its batches of heparin sodium injection and heparin lock flush solution in
January and February 2008. This was followed by recalls for a number of
medical devices that contain or are coated with heparin. On February 18,
2008, the company recalled all its heparin lots and stopped heparin
production. Since that
recall, monitoring by the FDA indicated that, in May 2008, the number of
deaths reported in association with heparin usage had returned to baseline
levels (Figure 3) [23].
Figure 3. Monthly event date distributions of heparin allergic-type reports received from January 1, 2007 to September 30, 2008. Taken from Ref [23].
2.2.2 Contaminant Identification
In response to this outbreak of the adverse events, and in order to remove
tainted or suspect products from the market and to prevent further exposure
to patients by contaminated heparin, FDA developed both qualitative and
quantitative analytical methods in an attempt to detect the contaminant and
identify potential causes for this sudden rise in side effects [63, 64]. Heparin
lots correlated with adverse events were examined using orthogonal high-
resolution analytical techniques, including high-field nuclear magnetic
resonance (NMR) spectroscopy [13, 29-32], capillary electrophoresis (CE)
[27, 34] and high performance liquid chromatography (HPLC) [65]. After
intense studies, CE of the samples suggested that the suspect lots were
contaminated. Subsequent analysis by means of sophisticated two-
dimensional NMR techniques identified oversulfated chondroitin sulfate
(OSCS) as a contaminant and as the likely source of the adverse responses.
OSCS is a heparin-like compound, but it is not heparin. Like heparin,
OSCS has an anticoagulant effect and can mimic heparin's blood-thinning
properties [22]. Given the nature of OSCS, traditional screening tests cannot
differentiate between affected and unaffected lots. OSCS was not detected by
common analytical methods, for instance assays of anticoagulative activities
or size exclusion chromatography methods. Even though some batches of
heparin were found to contain up to a third of this non-natural form of
chondroitin sulfate, its presence was masked in standard quality-control
assays owing to the inherent anticoagulant activity of OSCS.
Due to its high sensitivity to even minor structural variations, NMR
spectroscopy has proven to be a most promising and suitable routine method
for analyzing complex mixtures, and has become a successful technique for
characterizing their chemical composition. 1H NMR
spectroscopy has also been used as a tool to provide characteristic
fingerprints of complex carbohydrates for quality assessment and purity
control. During the contamination crisis, NMR was critical in identifying the
structure of OSCS-contaminating heparin. It is also useful for the quantitative
determination of OSCS and DS content in heparin [66, 67].
Although OSCS is extremely close in chemical structure to heparin, the
researchers' detailed structural analysis of the drug was able to detect the
minute differences between the contaminated drug and normal
heparin. The structure of OSCS was elucidated by 1H and 13C NMR
spectroscopic methods (Figure 4) [21]. With NMR, other signals apart from
the heparin signals were observed. For example, particularly evident in the
proton NMR spectrum (Figure 4a) is the signal at 2.15 ppm corresponding to
an N-acetyl group different from that of heparin (2.05 ppm). This N-acetyl
signal is also distinct from that of DS (2.08 ppm). To complement and extend
the proton analysis, carbon NMR spectroscopy was performed. Comparison
of the carbon spectra indicates the presence of several additional signals not
normally associated with heparin structural signatures (Figure 4b). The acetyl
signal at 25.6 ppm together with the signal at 53.5 ppm are indicative of the
presence of an O-substituted N-acetylgalactosamine residue of unknown
structure, but again distinct from the N-acetylgalactosamine contained within
DS, with corresponding signals at 24.8 ppm and 54.1 ppm, respectively.
Figure 4. NMR analysis of standard heparin, heparin containing natural dermatan sulfate (DS) and contaminated heparin. (a) Proton NMR spectra; (b) Carbon NMR spectra. Taken from Ref [21].
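The diagnostic value of the N-acetyl region discussed above can be sketched as a minimal screen: flag a spectrum if it shows signal near 2.15 ppm (OSCS) in addition to the heparin signal near 2.05 ppm. The simulated spectra, peak widths, and intensity threshold are illustrative assumptions, not a validated assay.

```python
# Sketch: flagging possible OSCS contamination from the N-acetyl region of a
# simulated proton spectrum (heparin ~2.05 ppm, OSCS ~2.15 ppm).
import numpy as np

ppm = np.linspace(1.9, 2.3, 400)                 # chemical-shift axis

def peak(center, height, width=0.01):
    """Gaussian line shape centered at `center` on the ppm axis."""
    return height * np.exp(-((ppm - center) ** 2) / (2 * width ** 2))

rng = np.random.default_rng(4)
pure = peak(2.05, 1.0) + rng.normal(0, 0.01, ppm.size)
tainted = pure + peak(2.15, 0.25)                # added OSCS N-acetyl signal

def flags_oscs(spectrum, threshold=0.1):
    """True if significant intensity appears in the 2.13-2.17 ppm window."""
    region = (ppm > 2.13) & (ppm < 2.17)
    return spectrum[region].max() > threshold

print(flags_oscs(pure), flags_oscs(tainted))
```

A single-peak rule like this is exactly what the multivariate models in later chapters generalize: they weigh many spectral variables at once rather than one window.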
Through detailed structural analysis, the contaminant was found to contain
a disaccharide repeating unit of glucuronic acid linked to an N-
acetylgalactosamine. The disaccharide unit has an unusual sulfation pattern
and is sulfated at the 2-O and 3-O positions of the glucuronic acid as well as
at the 4-O and 6-O positions of the galactosamine (Figure 5). The
predominant structure of OSCS has four sulfates per disaccharide, with both
sugars in the disaccharide unit carrying two sulfate groups, a condition
never before seen in normal heparin and not found in any natural sources of
chondroitin sulfate. The OSCS molecule is not a natural product and cannot
be formed in any of the steps in the production of heparin. Since OSCS is a
synthetic glycosaminoglycan product, it must have been added to the heparin
deliberately. The structure of OSCS suggests that all hydroxyl groups were
completely or nearly completely sulfated before its introduction into heparin.
Figure 5. The molecular structures of heparin and OSCS. Taken from Ref [17].
In addition, greater than 1% w/w levels of dermatan sulfate (DS, a known
impurity in pharmaceutical heparin) were also detected in many of the same
samples contaminated with OSCS, indicating that many manufacturers had
poor process controls in producing the drug [59, 65].
An impurity is a substance that can be introduced or retained in the natural
processing of heparin from animal tissue while a contaminant is a substance
that is accidentally or intentionally added outside of a normal process step.
While DS has no known toxicity, OSCS was toxic, leading to patient deaths.
Screening of more than 100 heparin samples collected from international
markets revealed a high number of samples containing substantial amounts
of DS and a number of samples containing OSCS in an amount higher than
0.1%. Preliminary screening of contaminated heparin batches collected from
different sources by means of 1H NMR spectroscopy and capillary
electrophoresis (CE) revealed four different groups, i.e., pure heparin with
almost no DS, heparin containing DS in varying amounts, heparin with OSCS,
and heparin with OSCS and varying amounts of DS [30].
It has been shown that OSCS has a hypotensive effect. Kishimoto et al.
[22] were able to partially reproduce the clinical syndrome in a porcine model
by administering a large dose of the pure contaminant, suggesting that the
presence of OSCS was linked to or possibly responsible for the adverse
events. The contaminant activates enzymes in the body that generate
inflammatory mediators, which can lead to some
of the symptoms such as low blood pressure, abdominal symptoms and
shortness of breath. This mechanism can explain many of the serious
adverse events that occurred immediately after patients were given the
contaminated heparin.
2.2.3 USP Monograph for Heparin Quality
The health crisis resulting from contamination of lots of pharmaceutical
heparin with chemically modified chondroitin sulfate underscores the need for
sensitive, selective, and robust methods for profiling the composition of
glycosaminoglycans, especially those used for therapeutic purposes.
To better secure the immediate supply of the drug for doctors and patients,
new proposed U.S. Pharmacopeia (www.usp.org/hottopics/heparin.html)
assays for OSCS were developed. USP released a first revision to its heparin
monograph standards in June 2008 to detect OSCS, including an NMR
identification assay which focused on the N-methyl acetyl proton region of the
spectrum and a capillary electrophoresis (CE) assay.
In the stage 2 revision of the monograph in 2009, the USP further
improved the monograph for heparin sodium by expanding the NMR
identification assay, replacing the CE assay with a strong-anion-exchange
high-performance liquid chromatography (SAX-HPLC) test for determining the
percent galactosamine in total hexosamine (%Gal), and adding an
assay that measures the heparin-induced delay in the coagulation time
associated with purified factor IIa and factor Xa [14, 15, 26].
It has been shown that the quality and purity of API heparin sodium in the
marketplace improved dramatically following issuance of the revised USP
monograph, which added tests for the composition and structure of
heparin [60].
2.3 Chemometrics and its Application in Heparin Field
Modern analytical instruments produce great amounts of
information for a large number of samples, leading to the availability of
multivariate data matrices. Chemometrics is a discipline using mathematical
and statistical methods to efficiently select the optimal experimental
procedure and extract the maximum useful information from data. The two
main techniques in chemometrics are: (a) regression methods which link the
chemical information to quantifiable properties of the samples and (b)
classification methods which group samples together according to the
available information.
All chemometric techniques share a common strategy, regardless of the
algorithm applied, consisting of the following steps [39, 68, 69]:
1. Selection of a training or calibration set and a test set. The training set is
used for the optimization of parameters characteristic of each multivariate
technique.
2. Variable selection. Variables that carry information relevant to the
intended analysis are kept, whereas variables encoding noise and/or lacking
discriminating power are eliminated.
3. Building of a model using the training set. A mathematical model is
derived between a certain number of variables measured on the samples that
constitute the training set and their known categories.
4. Validation of the model using an independent test set of samples in
order to evaluate the reliability of the model achieved.
In practice, multivariate chemometric analysis begins by dividing the total
data set into two subsets: a training set that is used to construct the models,
and a test set that is used to validate and test the model's predictive ability.
The division should be random such that the training and test sets are
non-overlapping and each is representative of the total data set. This division
process may be performed multiple times to control for the composition of the training
and test sets. Stringent measures, such as cross-validation and external
validation procedures using test sets, are recommended to ensure that the
final model possesses the statistical rigor and applicability domain needed for
use under operational conditions [42, 43].
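The initial random division of the data can be sketched as follows (an illustrative Python/NumPy example; the function name and the 30% test fraction are assumptions for illustration, not the protocol used in this work):

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.3, seed=0):
    """Randomly divide a data set into non-overlapping training and
    test subsets that together cover the whole data set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

Repeating this split with different seeds gives the multiple training/test compositions mentioned above.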
2.3.1 Variable Selection
Variable selection is a crucial step in statistical analysis, as it controls both
the number of variables and the mathematical complexity of the model [39].
The presence of variables not related to the response can produce
background noise, and redundant variables may confound models, resulting
in the reduction of predictive ability. It is important to determine those
variables that are relevant for building multivariate models and to eliminate
useless data. The selection of variables for chemometric analysis is an
optimization procedure, with the goal of identifying a subset of variables that
can produce simpler and more stable models with high prediction
performance and low errors.
2.3.1.1 Stepwise Method
The stepwise method covers three variable selection procedures: forward
addition, backward elimination, and “both direction”. Forward selection starts
with a single variable and then builds a model by subsequently adding other
variables; backward selection starts with all available variables and then
deletes the unnecessary variables step-by-step. The “both direction”
approach adds or drops variables at the same time [36]. In stepwise multiple
regression, the inclusion of variables in the model follows the forward
selection procedure, but at each stage backward elimination is also applied.
The variable most correlated with the response enters the model first, and
then forward selection continues. Each time a new variable is added, the
significance of the regression terms is tested. If the contribution of a variable
existing in the model is decreased and made no longer significant by a new
variable, then the insignificant variable is removed from the model. Any
variables that entered the model in the earlier stages can be discarded at the
later stages. The process of forward addition and backward elimination is
repeated until the inclusion of any other variables cannot further improve the
model, and finally each variable included in the model is significant [70].
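A simplified sketch of the forward-addition part of this procedure follows (an illustrative Python/NumPy example that uses the reduction in residual sum of squares as the inclusion criterion rather than a formal significance test, and omits the backward-elimination step):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_stepwise(X, y, tol=1e-8):
    """Greedy forward selection: repeatedly add the variable that most
    reduces the RSS, stopping when no addition improves the fit."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    best = float(((y - y.mean()) ** 2).sum())   # intercept-only model
    while remaining:
        new_rss, j = min((rss(X[:, selected + [c]], y), c) for c in remaining)
        if best - new_rss <= tol:               # no significant improvement
            break
        selected.append(j)
        remaining.remove(j)
        best = new_rss
    return selected
```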
2.3.1.2 Genetic Algorithms
Genetic algorithms (GAs) are numerical optimization tools and randomized
search techniques, which simulate biological evolution based on the Darwin
theory of natural selection. GAs are widely used in chemometrics for variable
selection [71-77]. The basic operation of GAs consists of five steps: encoding
variables into chromosomes, initial population of chromosomes, evaluation of
the fitness function, creation of next generation, and checking for the stopping
conditions [74, 78].
a. Coding of variables. In variable selection by GAs, each variable is called
a gene, and a group of variables is called a chromosome which can be
represented by a binary string. Each string contains as many elements as the
number of variables. A gene can be coded as the value “1” or “0”. If this gene
is “1”, the variable is selected, whereas the variable is not selected if its value
is “0”.
b. Random generation of an initial population. An initial population of
individuals is randomly generated as the first step in the GA procedure.
Thereafter, the size of the population is kept constant.
c. Evaluation of the fitness of each chromosome in the population. A
chromosome is evaluated by a fitness function for its survival ability.
According to the rules of biological evolution, the higher the fitness value, the
greater the chance for the chromosome to survive to the next generation.
Thus, the best string from the initial population is selected to reproduce. One
approach to calculating the fitness value is based on cross-validation.
d. Creation of the next generation from the previous one by genetic
operators. Depending on the fitness values, some pairs of chromosomes are
selected to undergo crossover where two existing chromosomes exchange
parts of their genomes and two new chromosomes are formed. After the
crossings, one or more mutations may occur, where the bits of an individual's
strings are randomly flipped with small probability and the state of the gene is
changed from “0” to “1” or vice versa. The mutation process avoids the
possibility that all chromosomes share the same code values, and leads to a
more heterogeneous system. According to the fitness, the current population
of chromosomes is selected, recombined and mutated to generate the next
population with strong survival ability.
e. Test of the stop condition. The operations of evaluation, selection,
crossing and mutation form one cycle by which a new generation of
chromosomes is produced. If the stopping criteria are not met by the new
population, steps b to d of the above are iterated by using the generated
chromosomes as the new initial population. The process is repeated until a
satisfactory result is achieved. After many generations, the final selected
chromosomes or subsets of variables are retained and employed for model
building and prediction.
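The five GA steps above can be sketched as follows (an illustrative Python/NumPy example; the population size, mutation rate, and a least-squares fitness function penalized for model size are arbitrary assumptions, not the settings used in this work):

```python
import numpy as np

def fitness(mask, X, y):
    """Chromosome fitness: negative residual sum of squares of a least-squares
    fit on the selected variables, minus a small penalty per selected variable."""
    if not mask.any():
        return -np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return -float(r @ r) - 0.01 * int(mask.sum())

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    # (a, b) chromosomes are binary strings; random initial population
    pop = rng.integers(0, 2, size=(pop_size, m)).astype(bool)
    for _ in range(generations):
        fit = np.array([fitness(c, X, y) for c in pop])  # (c) evaluate fitness
        order = np.argsort(fit)[::-1]
        parents = pop[order[: pop_size // 2]]            # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.integers(0, len(parents), size=2)
            cut = rng.integers(1, m)                     # (d) single-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            child ^= rng.random(m) < p_mut               # mutation flips bits
            children.append(child)
        pop = np.vstack([parents, *children])
    fit = np.array([fitness(c, X, y) for c in pop])      # (e) after the last cycle,
    return pop[int(np.argmax(fit))]                      # return the best chromosome
```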
2.3.1.3 Stepwise LDA Variable Reduction
Stepwise linear discriminant analysis (SLDA) is carried out using an
aggregative procedure, which starts with no variables in the model and adds
the variables with the greatest discriminating ability in the successive steps
[79-81]. In SLDA, Wilks' lambda is employed as a selection criterion to
determine the variables included in the procedure. Wilks' lambda is defined
as the ratio of the intra-class covariance to the total covariance; hence its
value varies between 0 and 1. A value close to 0 denotes that the classes are
well separated, while a value close to 1 denotes that the classes are poorly
separated.
As the first step, the variable that best discriminates the groups is selected
for the model. Each successive step involves evaluation of all remaining
variables in order to select the one that can yield the minimum intra-category
covariance, i.e., the smallest Wilks' lambda, which implies that the within-
category sum of squares is minimized while the inter-category sum of squares
is maximized. The selection procedure stops when all variables have been
evaluated. At the step when v variables have been selected, the value of the
Wilks' lambda Λ_v is calculated according to [80]:

Λ_v = |∑W| / |∑T|    (1)
where n is the total number of samples, and g is the number of classes, while
∑W and ∑T are the intra-category and the total variance–covariance
matrices, respectively. The change in Wilks' lambda when a new variable is
tested can be converted to an F-ratio that follows a Fisher distribution, so its
statistical significance is evaluated using the partial F statistic:

F-to-enter = [(n - g - v) / (g - 1)] [Λ_v / Λ_{v+1} - 1]    (2)

where g - 1 and n - g - v are the degrees of freedom for F-to-enter. The new
variable that leads to the highest partial F-ratio, i.e., the largest decrease in
Wilks' lambda, is added to the model.
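Wilks' lambda as defined in Equation (1) can be computed directly from the scatter matrices (an illustrative Python/NumPy sketch):

```python
import numpy as np

def wilks_lambda(X, labels):
    """Wilks' lambda, Equation (1): det(within-class scatter) / det(total
    scatter). Values near 0 indicate well-separated classes."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    dev = X - X.mean(axis=0)
    T = dev.T @ dev                      # total scatter matrix
    W = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        dg = Xg - Xg.mean(axis=0)
        W += dg.T @ dg                   # pooled within-class scatter
    return float(np.linalg.det(W) / np.linalg.det(T))
```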
2.3.2 Multivariate Regression Analyses
The aim of quantitative modeling is to predict a property of unknown
samples from their spectral data. A model is built and validated using several
sample sets: a calibration set used to compute the model, and a validation
set used to evaluate the ability of the model to predict unknown samples.
The calibration and validation sets must be independent, consisting of
samples from different batches.
2.3.2.1 Multiple Linear Regression
Multiple linear regression (MLR) produces a linear model describing the
relationship between a dependent (response) variable and independent
variables [78, 82]:
y = Xb + e    (3)

where y is the measured response vector (y_1, y_2, …, y_n), and X is a matrix
of size n × (m + 1) in which the first column is assigned the value 1 as the
intercept term and the remaining columns are assigned the values x_ij. The
parameters n, m, i, and j correspond respectively to the number of samples,
the number of variables, the index for samples and the index for variables.
The parameter b is the vector of the estimated regression coefficients, and e
is the vector of the y residuals resulting from systematic modeling errors and
random measurement errors, assumed to be normally distributed with
expected value E(e) = 0. By minimizing the sum of the squared residuals, the
regression coefficients can be estimated as [83, 84]:

b = (X^T X)^{-1} X^T y    (4)

Each variable x_j is then multiplied by its regression coefficient b_j to obtain
the predicted value for y, denoted ŷ:

ŷ = b_0 + b_1 x_1 + b_2 x_2 + … + b_m x_m    (5)
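Equations (3)-(5) can be implemented directly via the normal equations (an illustrative Python/NumPy sketch):

```python
import numpy as np

def mlr_fit(X, y):
    """Least-squares coefficients of Equation (4), b = (X'X)^-1 X'y, after
    prepending a column of ones for the intercept term of Equation (3)."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(A.T @ A, A.T @ y)

def mlr_predict(X, b):
    """Predicted response of Equation (5): yhat = b0 + b1*x1 + ... + bm*xm."""
    return np.column_stack([np.ones(np.shape(X)[0]), X]) @ b
```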
2.3.2.2 Ridge Regression
MLR is particularly sensitive to highly correlated (co-linear) variables,
which can result in highly unreliable model predictions. In addition, MLR is
inappropriate when there are fewer samples than variables. As a shrinkage
method, Ridge regression (RR) limits the range of the regression coefficients
and thereby stabilizes their estimation [36]. The RR technique aims to resolve
the co-linearity problem associated with MLR by modifying the X’X matrix so
that its determinant can be appreciably different from 0. The objective of RR
is to minimize:
Σ_{i=1}^{n} (y_i - ŷ_i)^2 + λ Σ_{j=1}^{m} b_j^2    (6)
where the first term is the residual sum of squares (RSS), and the second
term is a regularizer which penalizes a large norm of the regression
coefficients. The Ridge parameter or complexity parameter λ determines the
deviation between the Ridge regression and the MLR regression, and thereby
controls the amount of shrinkage [83]. As can be seen from Equation (4), the
expressions for Ridge regression and MLR are identical when the
regularization parameter λ = 0. The larger the value of λ is, the greater the
penalty (shrinkage) that is applied to the regression coefficients. The Ridge
regression coefficient b_ridge can be estimated by solving the minimization
problem in Equation (6) and has the following form [82, 83]:

b_ridge = (X^T X + λI)^{-1} X^T y    (7)

Equation (7) is a linear function of the response variable y. The coefficient
b_ridge is similar to the regression coefficient of MLR in Equation (4), but the
inverse is stabilized by the Ridge parameter λ. The performance of Ridge
regression depends heavily on proper choice of the parameter λ, which is
achieved using cross-validation procedures.
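The closed-form Ridge solution of Equation (7) can be sketched as follows (an illustrative Python/NumPy example that assumes mean-centred data, so no intercept is penalized):

```python
import numpy as np

def ridge_fit(Xc, yc, lam):
    """Ridge coefficients of Equation (7): b = (X'X + lambda*I)^-1 X'y.
    Assumes Xc and yc are mean-centred so no intercept is penalised."""
    m = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(m), Xc.T @ yc)
```

With lam = 0 this reduces to the MLR solution of Equation (4); larger lam shrinks the coefficient norm.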
2.3.2.3 Partial Least Squares Regression
Partial least squares regression (PLSR) is one of the most commonly used
multivariate regression methods in chemometrics [36]. The advantage of this
method over multiple linear regression (MLR) is its capacity to build a
regression model based on highly correlated variables. In the model, the X-
data are first transformed into a set of orthogonal latent variables or
components, each a linear combination of the original variables, and these new
variables are used for regression with a dependent variable y. The aim of
PLSR is to construct predictive models between two blocks of variables, the
latent variables and the response variables, so that the covariance between
them is maximized. The number of latent variables determines the complexity
of the model and can be optimized by a leave-one-out cross-validation (LOO-
CV) procedure on the calibration set. The relationship between original data X
and the latent variables T is [76]:
X = T P^T + E    (8)

Replacing X in Equation (3) by the latent variables T of lower dimension, the
regression model for y on T can be presented as follows [84]:

y = Xb + e = (T P^T + E)b + e = T(P^T b) + f = Tq + f    (9)
where T represents the n × r score matrix for X and y, P the m × r loading
matrix representing the regression coefficients of X on T, E the n × m residual
matrix of X, b the m × 1 vector of regression coefficients, q the r × 1 loading
vector representing the regression coefficients of y on T, f the n × 1 residual
vector of y, and r is the number of selected latent factors.
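A minimal NIPALS-style sketch of PLS1 regression for a single response (an illustrative Python/NumPy example, not the implementation used in this work):

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal NIPALS-style PLS1 sketch for a single response variable.
    Returns regression coefficients b for mean-centred data."""
    X = np.asarray(X, float) - np.mean(X, axis=0)
    y = np.asarray(y, float) - np.mean(y)
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w = w / np.linalg.norm(w)          # weight: direction of max covariance
        t = X @ w                          # scores for this latent variable
        p = X.T @ t / (t @ t)              # X loadings
        q = (y @ t) / (t @ t)              # y loading
        X = X - np.outer(t, p)             # deflate X
        y = y - t * q                      # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(Q))  # b with yhat = Xc @ b
```

With the number of components equal to the rank of X, the PLS1 fit coincides with the least-squares fit.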
2.3.2.4 Support Vector Regression
As a powerful machine learning technique, the support vector machine is
becoming increasingly popular. Support vector regression (SVR) is able to
model complex non-linear relationships by using an appropriate kernel
function that maps the input matrix X onto a higher-dimensional feature space
and transforms the non-linear relationships into linear forms. The feature
space is then used as a new input to deal with the regression problem [85].
By introducing an ε-insensitive loss function, Vapnik extended support vector
machines for classification to regression [86, 87]. In this loss function, the
fitted regression function is surrounded by a tube of radius ε. If a data point
is situated inside the tube, its loss is 0, whereas if a data point is located
outside the tube, its loss increases with its distance from the tube boundary
[43]. Thus, the ε-insensitive loss function can be expressed as [77, 88]:
L(ŷ_i, y_i, ε) = 0,                  if |ŷ_i - y_i| ≤ ε
L(ŷ_i, y_i, ε) = |ŷ_i - y_i| - ε,    otherwise    (10)
A cost function is defined by [83]:

I = (1/2) Σ_{j=1}^{m} b_j^2 + C Σ_{i=1}^{n} L(ŷ_i, y_i, ε)    (11)
It is a combination of a 2-norm term of the regression coefficients and an error
term multiplied by the error weight, C, a regularizing parameter which
determines the trade-off between the training error and model complexity [89].
The slack variables ξ_i and ξ_i* are introduced for deviations of more
than ε above (ξ_i) or below (ξ_i*) the target [90], and thus:

I = (1/2) Σ_{j=1}^{m} b_j^2 + C Σ_{i=1}^{n} (ξ_i + ξ_i*)    (12)
Subject to the constraints:

y_i - b^T x_i - b_0 ≤ ε + ξ_i
b^T x_i + b_0 - y_i ≤ ε + ξ_i*    (13)
ξ_i, ξ_i* ≥ 0
The Lagrangian is defined as the cost function plus a linear combination of
the above constraints, and the combination coefficients are called the
Lagrange multipliers [83]:

L = (1/2) Σ_{j=1}^{m} b_j^2 + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
    - Σ_{i=1}^{n} α_i (ε + ξ_i - y_i + b^T x_i + b_0)
    - Σ_{i=1}^{n} α_i* (ε + ξ_i* + y_i - b^T x_i - b_0)
    - Σ_{i=1}^{n} (η_i ξ_i + η_i* ξ_i*)    (14)

with the Lagrange multipliers α_i ≥ 0, α_i* ≥ 0, η_i ≥ 0, η_i* ≥ 0 for i = 1, …, n. For
training objects with prediction errors smaller than ±ε, their Lagrange
multipliers αi and αi* are zero, while the training objects with prediction errors
larger than ±ε have nonzero αi and αi*, contribute to the final regression
model, and are called support vectors. Therefore, the number of support
vectors is determined by the value of ε. The larger the ε value is, the fewer
the support vectors are, and hence the poorer the prediction performance of
the model will be.
A set of values for the Lagrange multipliers can be obtained based on the
Lagrange optimization, and the regression coefficients are expressed as an
expansion of the Lagrange multipliers multiplied by the corresponding training
objects [83]:
b = Σ_{i=1}^{n} (α_i - α_i*) x_i    (15)
Thus, the regression model becomes:

ŷ = Xb + b_0 = Σ_{i=1}^{n} (α_i - α_i*) x_i^T x + b_0    (16)

In other words, the response variable can be predicted via the inner products
only, instead of through the individual properties of the objects:

ŷ = X X^T (α - α*) + b_0    (17)
By replacing the inner product X X^T with a kernel function K(x_i, x_j), this
linear approach can be extended to nonlinear functions. For the
non-transformed data set, the element k_ij is x_i^T x_j. After nonlinear
mapping, k_ij becomes the inner product of the transformed objects:

k_ij = K(x_i, x_j) = Φ(x_i) · Φ(x_j)    (18)
ŷ(x_j) = Σ_{i=1}^{n} (α_i - α_i*) K(x_i, x_j) + b_0
       = Σ_{i=1}^{n} (α_i - α_i*) (Φ(x_i) · Φ(x_j)) + b_0    (19)

where Φ is the mapping function from the data X to the feature space.
In support vector regression, there are four typically used kernel functions,
which are linear kernel, polynomial kernel, radial basis function (RBF) kernel,
and sigmoid kernel.
The linear kernel is the inner product of x_i and x_j:

K(x_i, x_j) = x_i^T x_j    (20)
The polynomial kernel can model nonlinear relationships in a simple and
efficient way:

K(x_i, x_j) = (x_i^T x_j + 1)^d    (21)
RBF is a commonly used kernel, which is usually in the Gaussian form:

K(x_i, x_j) = exp(-‖x_i - x_j‖^2 / (2σ^2))    (22)
The sigmoid kernel is:

K(x_i, x_j) = tanh(a x_i^T x_j + b)    (23)
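The four kernels of Equations (20)-(23) can be written directly in code (an illustrative Python/NumPy sketch; the default parameter values are arbitrary assumptions):

```python
import numpy as np

def linear_kernel(xi, xj):
    """Equation (20): inner product of the two objects."""
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, d=2):
    """Equation (21): polynomial kernel of degree d."""
    return float((np.dot(xi, xj) + 1) ** d)

def rbf_kernel(xi, xj, sigma=1.0):
    """Equation (22): Gaussian radial basis function kernel."""
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))

def sigmoid_kernel(xi, xj, a=1.0, b=0.0):
    """Equation (23): sigmoid (hyperbolic tangent) kernel."""
    return float(np.tanh(a * np.dot(xi, xj) + b))
```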
2.3.3 Chemometric Pattern Recognition
Chemometric pattern recognition techniques are very powerful in
analyzing multi-dimensional chemical data, and have been widely applied in
such fields as food and pharmaceuticals for identification of their origin,
impurity assessment, and quality control [37, 39, 40, 46]. Chemometric
discrimination of different groups is generally divided into two distinct
categories, viz., “unsupervised” (clustering) and “supervised” (classification)
[36, 39]. Unsupervised techniques aim to explore the natural structure of the
data and no information about class membership is required. The most
commonly used methods include principal components analysis (PCA) and
hierarchical cluster analysis (HCA). On the other hand, supervised techniques
focus on defining a classification rule where class membership information
can be used to maximize the separation between groups and the class of a
sample from a test dataset can be predicted based on the mathematical
models derived from the training dataset [37, 38, 91, 92]. The classification of
a collection of samples into groups is usually conducted using supervised
techniques if their origin is known beforehand. Although many pattern
recognition methods are available for classification, the selection of
appropriate methods relies heavily on the specific nature of the data set, such
as the number of classes, samples and variables, the expected complexity of
the boundaries among classes, and the level of noise. While many algorithms
can achieve satisfactory results in typical cases with linear boundaries and
high ratio of samples to variables, the choice of appropriate approaches
should be made carefully in more complicated cases to attain optimal
performance.
Currently, two kinds of supervised pattern recognition methods are
available - pure discriminating approaches and class modeling techniques
[79, 93]. The two methods present substantially different modeling strategies
[94]. Discriminating approaches focus on the dissimilarity between classes,
whereas the class modeling techniques emphasize the similarity within
each class [95]. For pure classification, the data space is partitioned into as
many regions as there are classes, and the classification rule constructs the
borders between these regions. A test sample can only be assigned to the
one region, or class, to which it
most probably belongs. On the other hand, class-modeling analysis considers
only one category at a time and defines a frontier in the feature space to
separate a specific class from the others. A separate mathematical model is
built for each category from a training set, and then the fitting of samples is
evaluated. A sample is accepted by a class if it falls within that model's space,
whereas it is considered an outlier for that specific class if it falls outside the
model's space. If more than a single class is modeled, a particular region of
the data space from one class may overlap with the boundaries of other
class models. Therefore, a sample can be assigned to a single class, to more
than one class, or to none of the classes [93]. In chemometrics, the most
commonly used class-modeling tools are soft independent modeling of class
analogy (SIMCA) [96-98] and unequal class modeling (UNEQ, also known as
multidimensional Gauss class modeling or MGCM) [80, 93, 95, 99], which are
distance- and probabilistic-based modeling techniques, respectively. As a
modeling version of quadratic discriminant analysis (QDA), UNEQ is the
simplest modeling method based on multivariate normal distribution [79, 94].
An NMR data-analysis procedure is shown in Figure 6 [100]. After
spectra are accumulated and processed (panel a), a primary data reduction is
carried out that digitizes the one-dimensional spectrum into a series of
integrated regions (panel b). After removal of redundant signals and
appropriate scaling, primary data analysis is used to map the samples
according to their composition and property, using methods such as PCA.
Samples that share a similar property are generally intrinsically similar in
composition, and therefore occupy neighboring positions in the PC space
(panel c). Each class of samples is then modeled separately, and class
boundaries and confidence limits are calculated to construct a model for the
prediction of independent data (panel d).
Figure 6. Schematic diagram representing the process of assessing sample class from raw NMR spectra. Taken from Ref [100].
2.3.3.1 Principal Components Analysis
As a well-established multivariate statistical technique, principal
components analysis (PCA) is able to determine the directions of greatest
variance in the dataset, to reduce the dimensionality of the dataset where
there are a large number of intercorrelated variables, and to simplify complex
datasets to generate a lower number of parameters while retaining as much
as possible of the information present in the original data [44-46, 101]. PCA
clusters samples into separate groups in n-dimensional space, where “n” is
the number of features or variables that characterizes each sample. PCA is
especially useful as a discovery tool for complex multivariate data sets,
because this approach reduces the original variables to a much smaller set
that greatly simplifies visualization of the data, revealing hidden patterns and
similarities/dissimilarities between the clusters.
The PCA approach transforms the original correlated variables into
uncorrelated ones known as principal components (PC), which are a linear
combination of the original variables and are orthogonal to each other. The
first component explains the maximum amount of variance in the data, and
each succeeding component accounts for the remaining variations. PCA is an
unsupervised method in that no a priori knowledge relating to class affiliation
is required [102]. PCA is commonly used to visualize samples as scores plots
of two dimensions (PC1 vs PC2) or three dimensions (PC1 vs PC2 vs PC3)
that exhibit the number of distinct clusters and the differences between
clusters in terms of their characteristic location in variable space. PCA has
been widely applied in conjunction with various discriminant analysis
techniques to handle classification problems. In addition, the PC scores can
be used as inputs to multivariate analyses [103, 104]. In PCA, the data
matrix X is decomposed into the product of the PCA scores matrix T and the
loading matrix P plus the error or residual matrix E [77]:

X = T P^T + E    (24)
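The decomposition of Equation (24) can be obtained from a singular value decomposition of the mean-centred data (an illustrative Python/NumPy sketch):

```python
import numpy as np

def pca_scores(X, n_components):
    """PCA via singular value decomposition of the mean-centred data,
    giving the decomposition X = T P^T + E of Equation (24)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores, ordered by variance
    P = Vt[:n_components].T                      # orthonormal loadings
    return T, P
```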
2.3.3.2 Partial Least Squares Discriminant Analysis
Partial least squares discriminant analysis (PLS-DA) is a linear regression
approach in which the multivariate variables from the observations are
correlated to the class membership of each sample [41, 105-107]. As an
extension of PCA, PLS-DA attempts to build models that can maximize the
separation among classes of objects. Since the class affiliation of the objects
is included in the regression calculation, PLS-DA is a supervised approach.
PLS-DA models the dataset in a way similar to PCA, but with the addition of
discriminant analysis. Unlike PCA which focuses on the overall variation of
each class, PLS-DA focuses mainly on the variation between classes.
There are two steps for the PLS-DA procedure [42, 103, 108]: the first one
is the application of a PLS regression model on the latent variables which
indicates the grouping information, and the second one is classification of the
objects from the regression results on indicator variables. Once built and
validated, a PLS-DA model can be used to predict the class membership for
unknown samples.
The regression of the data (X) against a “dummy matrix” (Y) describes the
variation according to class affiliation, where Y contains the values of 1 and 0
for each class and consists of as many columns as there are classes [35]. For
the training set, an observation is assigned the value of 1 for its class
affiliation, and assigned 0 for the other classes. The output of PLS-DA
regression is a matrix which can be used to classify unknown samples. The
prediction result from the PLS-DA model is a numeric value. If the value is
close to 1, then the test sample is assigned to the modeled class; if the value
is close to 0, then the object is unassigned or assigned to another class.
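The construction of the dummy response matrix Y described above can be sketched as follows (an illustrative Python/NumPy example):

```python
import numpy as np

def dummy_matrix(labels):
    """Indicator ('dummy') response matrix Y for PLS-DA: one column per
    class, with 1 marking a sample's own class and 0 elsewhere."""
    classes = sorted(set(labels))
    Y = np.zeros((len(labels), len(classes)))
    for i, lab in enumerate(labels):
        Y[i, classes.index(lab)] = 1.0
    return Y, classes
```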
2.3.3.3 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a widely used supervised pattern
recognition method, and is also a well-established dimension reduction
technique [103, 109]. In LDA, a linear function of the dataset is sought so that
the ratio of between-class variance is maximized and the ratio of within-class
variance is minimized, and finally the optimal separation among the given
classes is achieved. Like PLS-DA, the ultimate aim of LDA is to qualitatively
predict the group affiliation for unknown samples. Discrimination of the
classes is performed by calculating the Mahalanobis distance of a sample
from the center of gravity of each specified class, and then assigning the
sample to the class associated with the smallest distance [103, 110]. The
Mahalanobis distance between a sample x_i and the data center x̄ is
defined as [111]:

D(x_i) = [(x_i - x̄)^T (X^T X)^{-1} (x_i - x̄)]^{0.5}    (25)

where (X^T X)^{-1} is the inverse of the sample covariance matrix and i
denotes the index of samples. The center is estimated by the arithmetic mean
vector x̄. A test
sample is correctly classified if it is located nearest the center of gravity of its
actual class. Otherwise, the sample would be incorrectly classified to another
class for which the Mahalanobis distance was the smallest.
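Classification by smallest Mahalanobis distance, Equation (25), can be sketched as follows (an illustrative Python/NumPy example in which the inverse covariance matrix is supplied by the caller):

```python
import numpy as np

def mahalanobis(x, center, cov_inv):
    """Mahalanobis distance of Equation (25) from x to a class centre,
    given the inverse covariance matrix."""
    d = np.asarray(x, float) - center
    return float(np.sqrt(d @ cov_inv @ d))

def classify_by_mahalanobis(x, class_means, cov_inv):
    """Assign x to the class whose centre of gravity is nearest in
    Mahalanobis distance (a pooled covariance is assumed, as in LDA)."""
    dists = {c: mahalanobis(x, m, cov_inv) for c, m in class_means.items()}
    return min(dists, key=dists.get)
```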
2.3.3.4 k-Nearest Neighbors
The classification of k-nearest neighbors (kNN) is performed by calculating
the distances between a new object (a test data point) and all objects in the
training set in n-dimensional variable space [112, 113]. Unlike PLS-DA and
LDA, the kNN approach avoids the need for model generation. Neighbor
determination is based on the Euclidean distance, and the nearest k
objects are used to estimate the class affiliation of the test object. Euclidean
distance is expressed as [36]:
D(x, x_i) = [Σ_{j=1}^{m} (x_j - x_{i,j})^2]^{0.5}    (26)
where i and j denote the index of samples and variables, respectively, and m
is the number of variables. By applying the majority rule, the new object is
assigned to the class of the majority of the k objects, i.e., the prediction is
related to a majority vote among the neighbors. To correctly assign the group
affiliation for a test data point, this technique requires tuning of the adjustable
parameter k (i.e., the optimal number of nearest neighbors to choose). Values
of k that are too small or too large can lead to poor classification of new
objects. Over-fitting may occur if k is too small (such as k = 1), while under-
fitting is more likely if k is too large. By testing a series of k values and
assessing the prediction performance, the optimal value of k is selected which
gives the lowest number of misclassifications.
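The kNN majority-vote rule can be sketched as follows (an illustrative Python/NumPy example; the simple majority rule here does not handle ties specially):

```python
import numpy as np

def knn_classify(x, X_train, labels, k=3):
    """Classify x by majority vote among its k nearest training objects,
    using the Euclidean distance of Equation (26)."""
    d = np.sqrt(((X_train - np.asarray(x, float)) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```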
2.3.3.5 Classification and Regression Tree
As a non-parametric approach, classification and regression tree (CART)
models a data set with the structure of a tree and makes no assumption about
the distribution of the data. This methodology applies decision trees to solve
classification and regression problems for handling both categorical and
continuous responses. A classification tree is yielded when the response
variable is categorical, while the final output is a regression tree when the
response variable is continuous. In general, a CART analysis consists of
three steps [114-117]: in the first step, an over-large tree, called the maximal
tree, is built by recursive partitioning of the original training data using a
binary split-procedure; in the second step, the overgrown tree, which usually
shows overfitting, is pruned so that a series of less complex trees is derived;
in the last step, the tree with the optimal size is selected by a cross-validation
(CV) procedure.
The tree construction starts by dividing the root node, containing all
objects in the training set, into exactly two sub-groups or child nodes, and
then each child node becomes a parent node that is further split into two
mutually exclusive child nodes. The splitting procedure is repeated for each of
the resulting nodes until the maximal tree is grown, which is defined as the
tree in which each terminal node consists of either just one object, or contains
a predefined number of objects, or all objects contained in the node are as
pure or homogeneous as possible, i.e., the samples in a node share the same
or similar values of the response variable (Figure 7). To find the most
appropriate variable for splitting and the best split point on the variable so that
the error measure is minimized or the predictive power is maximized, CART
scans through all possible split values over all explanatory variables. In the
decision tree, the first branch is produced by the variable with the best split
point, and each sequential split is conducted by following some fit criteria or
error measures Ql(T) with the purpose of decreasing the misclassification as
much as possible. For classification trees to choose the best split point,
several splitting criteria have been proposed, one of them being the Gini
index which represents the product sum of the relative frequency of one class
and the relative frequency of all other classes, and can be expressed as [36]:
Gini = Σ_{j=1}^{k} p_{lj} (1 - p_{lj}) = Σ_{j=1}^{k} (n_{lj}/n_l)(1 - n_{lj}/n_l)    (27)
where k denotes the number of possible classes; nl is the number of objects
in node l, and nlj is the number of objects from class j present in the node l.
When the node is pure, i.e., contains only objects of the same group, the
minimum Gini index value is attained.
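The Gini index of Equation (27) for a single node can be computed from the per-class object counts (an illustrative Python sketch):

```python
def gini_index(class_counts):
    """Gini impurity of a node, Equation (27): sum over classes of
    p_lj * (1 - p_lj), from per-class object counts n_lj."""
    n_l = sum(class_counts)
    if n_l == 0:
        return 0.0
    return sum((n_lj / n_l) * (1 - n_lj / n_l) for n_lj in class_counts)
```

A pure node gives the minimum value 0, while an evenly mixed two-class node gives 0.5.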
Figure 7. Structure of a classification or regression tree. Nodes 1, 3 and 4 are TNs, node 2 is a parent node, and nodes 3 and 4 are child nodes. Taken from Ref [114].
The tree built in the first step fits the training set almost perfectly, but
usually exhibits poor predictive ability for new samples, because it has a large
number of terminal nodes (TNs). It is necessary to find trees with less
complexity but better predictive accuracy. The optimal tree size can be
determined by successively cutting back the terminal branches of the
overlarge tree. During this pruning procedure, a series of smaller sub-trees T
are derived from the maximal tree, and the optimal tree with the minimum
classification error is obtained by calculating its cost-complexity parameter
CPα(T) as a measure, which is defined as a linear combination of the tree
cost Ql(T) and its complexity |T| [36]:
Minimize:  CP_α(T) = (1/n) Σ_{l=1}^{|T|} Q_l(T) + α |T|    (28)
where |T| denotes the size of a tree, or the number of terminal nodes, i.e., the
complexity of the sub-tree T; and α, which takes values between 0 and 1, is a
penalty for each additional terminal node, and it establishes the compromise
between classification error and tree size. For each value of α, the optimal
tree size is selected by minimizing CPα(T). A value of α equal to zero results
in the maximal tree where the measure Ql(T) of misclassification is minimized
while the value α > 0 penalizes large trees. By gradually increasing the value
of α starting from 0, a nested sequence of trees with decreasing size or
complexity is then derived. The last stage of this procedure is to compare the
different sub-trees and select the optimal tree from the remaining sequence of
sub-trees, which is determined by cross validation for evaluation of the
predictive error.
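The pruning criterion of Eq. (28) can be made concrete with a small numeric sketch; the node errors, node sizes, and the value of α below are hypothetical:

```python
def cost_complexity(node_errors, node_sizes, n_total, alpha):
    """Cost-complexity measure of Eq. (28):
    CP_alpha(T) = (1/n) * sum_l n_l * Q_l(T) + alpha * |T|,
    where |T| is the number of terminal nodes of the sub-tree."""
    cost = sum(n_l * q_l for q_l, n_l in zip(node_errors, node_sizes)) / n_total
    return cost + alpha * len(node_errors)

# A 4-leaf tree with a lower training error can still score worse than
# a 2-leaf tree once the alpha * |T| penalty is added.
big = cost_complexity([0.0, 0.0, 0.1, 0.0], [25, 25, 25, 25], 100, alpha=0.05)
small = cost_complexity([0.05, 0.05], [50, 50], 100, alpha=0.05)
print(small < big)  # True: the pruned tree is preferred at this alpha
```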
2.3.3.6 Artificial Neural Networks (ANNs)
An artificial neural network (ANN) is a well-established modeling
technique for solving problems such as classification or pattern
recognition, regression and estimation [37, 38, 91, 92, 118-122]. ANN is able
to handle linear as well as non-linear data for model fitting, and consequently
58
it excels in cases where the data sets contain substantial uncertainty and
measurement errors. ANN is particularly suitable when a mathematical
relationship between the independent and response variables cannot be
established. The typical feed-forward back-propagation ANN is composed of
a large number of fully interconnected processing elements (PE) or neurons
which are organized into a sequence of layers (Figure 8).
Figure 8. A fully connected multilayer feed-forward back-propagation network. Taken from Ref [122].
The first layer is the input layer that contains as many neurons as the
number of independent variables and is used to receive the information from
the outside. The last layer is the output layer consisting of as many neurons
as the number of dependent variables and serves to provide the ANN's response
to the input data. A series of one or more hidden layers are in between, which
are responsible for communicating with the neurons of the input and output
layers. A number of learning algorithms are available for training a neural
network. For multilayer ANNs, the most commonly used architecture is a
single hidden layer with a sufficient number of neurons, which can
approximate any nonlinear function to any required accuracy. In a feed-forward
architecture, signals are propagated sequentially only in the forward direction,
i.e., from the input layer through the hidden layer to the output layer, where
the output from a previous layer is employed as an input for the successive
layer.
The propagation of the signal through the network from one neuron in a
layer to another neuron in the next layer greatly depends on the strength of
the connection. The interconnections between neurons are represented by a
set of adjustable parameters called weights that are calculated by the ANN
algorithm, which trains the neural network and adapts the weights to an
optimum set of values. In the training process, some interconnections are
strengthened while the others are weakened, in such a way that the ANN will
yield more accurate results. In the popular back-propagation learning
strategy, the corrections to the weights in a layer are proportional to the
error propagated back from the subsequent layer. These errors are fed backwards through the
network to adjust the weights. This process is repeated until the
interconnections are optimized, the error is minimized, the trained network
attains a specified level of accuracy, or a pre-defined number of iterations are
reached.
The activation of the neuron is obtained through the weighted sum of the
inputs, and a transfer function is used to pass the activation signal and
produce a single output of the neuron. The relationship between the input
variable x_i and the output variable y is defined by the following equation [118]:
y = f( Σ_j w_j f( Σ_i w_ij x_i + b_i ) + b_j )        (29)
where wij and wj represent the connection weights from the input layer to the
hidden layer and from the hidden layer to the output layer, respectively, and bi
and bj are bias constants. The transfer function f(x), which can be linear or
non-linear depending on the topology of the network, determines the
processing inside the neuron. The logistic sigmoid activation function is a
widely used transfer function:
f(x) = 1 / (1 + e^(−x))        (30)
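Equations (29) and (30) can be combined into a single forward pass. The following Python sketch uses hypothetical weights, with one bias per hidden neuron and one for the output, mirroring the bias constants of Eq. (29):

```python
import math

def sigmoid(x):
    # Logistic transfer function of Eq. (30): f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, b_h, w_ho, b_o):
    """One forward pass through a single-hidden-layer network, Eq. (29):
    y = f( sum_j w_j * f( sum_i w_ij * x_i + b_j ) + b_o )."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_ih, b_h)]
    return sigmoid(sum(w * h for w, h in zip(w_ho, hidden)) + b_o)

# Two inputs, two hidden neurons, one output; all weights are illustrative.
y = forward([0.5, -1.0],
            w_ih=[[0.2, 0.4], [-0.3, 0.1]], b_h=[0.0, 0.1],
            w_ho=[0.7, -0.5], b_o=0.05)
print(0.0 < y < 1.0)  # True: a sigmoid output always lies in (0, 1)
```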
2.3.3.7 Support Vector Machine (SVM)
Support vector machine (SVM) is a recently developed modeling
technique that has demonstrated its utility for a broad range of classification
and regression problems [88, 123-133]. SVM performs pattern recognition by
finding an optimal hyperplane as the decision boundary for separating two
classes of patterns, which can maximize the margin between the closest data
points of each class. The SVM algorithm derives the classification rule using
only a fraction of the training samples that are known as support vectors
(SVs) and typically are situated nearby on the margin borders. In general, the
number of SVs is much lower than the number of training samples. For the
linearly separable case, the class boundary is determined in the space of the
original variables by defining an optimal hyperplane with maximal margin,
which divides the data-space into two regions with opposite sign, and leaves
all the vectors of the same sign or class on the same side [123, 124]:
w^T x + b = 0        (31)
where w is a weight vector normal to the hyperplane, b is a free threshold
parameter, b/||w|| is the perpendicular distance to the origin, and ||w|| is the
Euclidean norm of w. In linearly non-separable situations, the principle of
linear separation is extended and the complex class boundaries are modeled
by using adequate kernel functions that map the original vectors from input
space to higher dimensional feature space where the non-linear relationship
is expressed in linear form and a linear separation operation can be
performed (Figure 9).
Figure 9. Non-linear separation case in the low dimension input space and linear separation case in the high dimension feature space. Taken from Ref [126].
In the presence of noisy data, the learned classifier may fit the noise into
the model and force zero training error, leading to poor generalization.
Violations of the margin constraints of the hyperplane are allowed by
introducing a set of non-negative slack variables ξ_i ≥ 0 (i = 1, . . ., n),
each of which represents the distance of sample x_i from the margin of its
class. Given the sum of the allowed deviations Σξ_i, the optimization
simultaneously maximizes the margin (by minimizing (1/2)||w||²) and
minimizes the number of misclassifications. Accordingly, the objective
function, designed to balance the classification error against the
complexity of the model, can be expressed as follows [88]:
Minimize: (1/2)||w||² + C Σ_{i=1}^{n} ξ_i        (32)
A soft-margin separating hyperplane is constructed by minimizing the dual
form of the above expression, where the regularization parameter C controls
the trade-off between maximizing the margin and minimizing the training
error. A small value of C allows large deviations ξ_i, so the emphasis is
placed on margin maximization; many samples are retained as support vectors,
the boundary is less complex and the margin wider, but the training data may
be underfitted. In contrast, when C is large, the second term dominates,
allowing only small deviations ξ_i and minimizing the training error; this
yields a more complex boundary with a smaller margin and a risk of
overfitting the training data.
The results of the SVM approach depend highly on the choice of the
kernel function that decides the sample distribution in the mapping space and
may influence the performance of the final model. The most commonly used
kernel function in SVM is radial basis function (RBF) or Gaussian function,
and is formulated as [125]:
K(x_i, x_j) = exp(−γ ||x_i − x_j||²)        (33)
where x_i and x_j are two sample vectors; γ is a tuning parameter that
controls the amplitude of the kernel function and, therefore, controls the
generalization performance of the SVM. A very large γ value can produce
models with overfitting because most of the training objects are used as the
support vectors, while a very small γ value can lead to poor predictive ability
as all data points are regarded as one object.
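The influence of γ in Eq. (33) can be demonstrated directly; the vectors and γ values below are hypothetical:

```python
import math

def rbf_kernel(xi, xj, gamma):
    """Gaussian RBF kernel of Eq. (33): K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

a, b = [1.0, 2.0], [2.0, 0.0]
print(rbf_kernel(a, a, gamma=0.5))   # 1.0: identical points
print(rbf_kernel(a, b, gamma=0.5))   # between 0 and 1 for distinct points
print(rbf_kernel(a, b, gamma=50.0))  # near 0: a very large gamma makes every
                                     # point dissimilar to every other point
```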
2.3.3.8 SIMCA Analysis
Soft independent modeling of class analogy (SIMCA) is a widely applied
class modeling technique in chemometrics. SIMCA uses the principal
component analysis (PCA) to develop a statistical model which describes the
similarities among the samples of a category [79, 134, 135]. The class model
for each category is derived separately in the training set based on the
computation of the principal components (PCs). The number of significant
components, which determines the dimensionality of the inner space for each
category and can differ for each category, is evaluated by a cross validation
procedure. Depending on the number of PCs or the variance retained in each
data class, classes can be modeled by one of a series of linear structures,
such as a point, a line, a plane, and so on [36]. In the space of the first few
PCs, the SIMCA model exhibits a parallelepiped structure, delimited by the
range of the scores in the direction of each PC.
The class boundaries around these linear structures can be built on the
basis of the distribution of Euclidean distance between the data points of
training samples and the fitted class model. The mean distance between the
samples belonging to a class and the class model, i.e., the class residual
standard deviation s_0, is defined as [136]:
s_0² = Σ_{i=1}^{n} Σ_{j=1}^{m} e_ij² / [(m − A)(n − A − 1)]        (34)
where n, m and A denote the number of samples in the class, the number of
variables, and the number of principal components retained in the class
model, respectively, and e_ij² is the squared residual of the ith sample for
the jth variable. A critical distance s_crit is computed based on an F-test
at a chosen confidence level:
s_crit = s_0 √F_crit        (35)
In the present study, a 95% confidence level was set to define each class.
After the model has been developed on the training set, a new sample can be
tested for its membership in the defined classes by the orthogonal projection
distance between the new sample and the PC model of each class. The
squared distance of the test sample is determined by:
s_test² = Σ_{j=1}^{m} e_test,j² / (m − A)        (36)
It is then compared with the class confidence limit s_crit. The new sample is
assigned to one or more classes if it lies within the statistical limits, i.e.,
s_test < s_crit, and it is considered to be an outlier if the distance is larger [79].
Therefore, a sample can be a member of a single class, more than one class,
or none of the defined classes.
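The acceptance rule of Eqs. (35)-(36) amounts to comparing a residual distance against the critical distance. A minimal Python sketch, with hypothetical residuals for one test sample against one class model:

```python
import math

def simca_membership(e_test, m, A, s_crit):
    """SIMCA class test: compute s_test from the residuals of the test
    sample (Eq. 36) and accept the sample if s_test < s_crit (Eq. 35)."""
    s_test = math.sqrt(sum(e * e for e in e_test) / (m - A))
    return s_test, s_test < s_crit

# Hypothetical residuals against a 2-PC class model built on m = 6 variables.
s, accepted = simca_membership([0.1, -0.2, 0.05, 0.15, 0.0, 0.1],
                               m=6, A=2, s_crit=0.3)
print(accepted)  # True: the sample lies within the class boundary
```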
The model generated by SIMCA for each category can be evaluated in
terms of sensitivity (SENS) and specificity (SPEC), which are associated with
the number of false positive and false negative errors for each class. The
SENS of a class is the proportion of samples belonging to that class and
correctly identified by the model, while SPEC corresponds to the proportion of
samples outside the class and correctly rejected by the model [95, 135, 137].
When more than two classes are present, specificity can be calculated
individually for each class. SENS and SPEC are closely associated with the
concepts of type I (α) errors which refer to the probability of erroneously
rejecting a member of the class as a non-member (false negative), and type II
(β) errors which refer to the probability of erroneously accepting a non-
member of the class as a member (false positive). Assume An and An are
the number of samples belonging to category A and the number of samples
66
accepted by the model, respectively, while An and An are the number of
samples not belonging to category A and the number of samples rejected by
the model, respectively. Given these definitions [85], the two relationships
follow:
100A
A
n
nSENS (37)
100A
A
n
nSPEC (38)
indicating that SENS and SPEC are the complementary percent measure of
type I and II errors, respectively.
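Equations (37) and (38) reduce to two ratios; a brief sketch with hypothetical acceptance counts:

```python
def sens_spec(n_members_accepted, n_members, n_nonmembers_rejected, n_nonmembers):
    """Eq. (37): SENS = class members accepted / class members * 100.
    Eq. (38): SPEC = non-members rejected / non-members * 100."""
    sens = 100.0 * n_members_accepted / n_members
    spec = 100.0 * n_nonmembers_rejected / n_nonmembers
    return sens, spec

# Hypothetical class model: 45 of 50 members accepted (5 false negatives),
# 38 of 40 non-members rejected (2 false positives).
sens, spec = sens_spec(45, 50, 38, 40)
print(sens, spec)  # 90.0 95.0
```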
2.3.3.9 UNEQ Analysis
UNEQ is a class-modeling technique equivalent to quadratic discriminant
analysis (QDA) and is based on the assumption of multivariate normal
distribution of the measured or transformed variables for each class
population [79, 94, 95]. In this method, each category is represented by
means of its centroid.
In a specific class, the distance of each sample from the barycenter, or
centroid, is calculated according to various measures that follow a
chi-squared distribution. Usually, the Mahalanobis distance is
applied, which is measured on the basis of correlations between variables
and is a useful way for determining similarity of an unknown sample set to a
known one. The Mahalanobis distance is different from Euclidean distance in
that it accounts for the covariance structure, i.e., it considers the distribution
of the sample points in the variable space and is independent of the scale of
measurements (scale-invariant). Thus, for UNEQ class modeling, three
parameters, i.e., the centroid, the matrix of covariance, and the Mahalanobis
distance of each sample to the centroid, need to be estimated [97]. As in
SIMCA, a confidence interval that represents the class boundary is defined,
and the membership of new samples is tested based on whether they fall
within the defined class boundary. The class space is constructed as the
confidence limit of hyper-ellipsoids around each centroid, which determines
the 95% probability of the multivariate normal distribution.
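For two variables the Mahalanobis distance used by UNEQ can be written out explicitly with the closed-form inverse of the 2 × 2 covariance matrix; the sample, centroid, and covariances below are hypothetical:

```python
def mahalanobis_sq_2d(x, centroid, cov):
    """Squared Mahalanobis distance of a two-variable sample from a class
    centroid: d^2 = (x - mu)^T S^(-1) (x - mu), where S is the (assumed
    non-singular) 2x2 covariance matrix."""
    dx0, dx1 = x[0] - centroid[0], x[1] - centroid[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    i00, i01, i10, i11 = d / det, -b / det, -c / det, a / det
    return dx0 * (i00 * dx0 + i01 * dx1) + dx1 * (i10 * dx0 + i11 * dx1)

# With an identity covariance the measure reduces to the squared Euclidean
# distance; a larger variance along the first axis shrinks that contribution,
# illustrating the scale-invariance of the measure.
print(mahalanobis_sq_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 25.0
print(mahalanobis_sq_2d([3.0, 4.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))  # 18.25
```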
2.3.4 Application of Chemometrics in Heparin Field
Although chemometric techniques are becoming increasingly popular in
the pharmaceutical field, and multivariate approaches are an attractive alternative
to classical analytical methods which are more tedious and time-consuming,
the application of chemometrics in heparin investigation is limited [17, 25, 30,
40, 44-46, 59]. Here, quantitative determination of DS and OSCS content as
well as discrimination of heparin contaminants are briefly summarized.
2.3.4.1 DS Concentration Determination
The estimation of dermatan sulfate (DS) impurity in heparin by means of
the quantitation of the corresponding 1H NMR signals was performed by Ruiz-
Calero et al., who examined the potential of the 1H NMR technique for the
quantification of DS in heparin samples and estimated the concentration of
DS present as an impurity in heparin samples using partial least squares
regression (PLSR) [44]. The 1H NMR spectra of heparin and DS standards
showed characteristic profiles. Thus, differences in the methyl peaks of
acetamido groups of heparin and DS were greatly advantageous for the
analysis. Other hydrogens of the sugar ring were also relevant in this study.
The determination of DS content by multivariate calibration depended on all
these differences. In addition, a data standardization procedure was
developed so that 1H NMR spectra recorded on different instruments under
different measurement conditions would be comparable. The
quantification of DS in the samples was satisfactory, with an overall prediction
error of 6%.
2.3.4.2 PCA Analysis of Heparin and its Contaminants
More than 100 samples of heparin collected from international markets
were subjected to a PCA analysis by Holzgrabe et al [30]. Spectra containing
both DS and OSCS are represented by points aligned along the principal
components. PC1 and PC2 account for 83.6% and 12.6% of the total
variance, respectively, and their score values scale with relative concentration.
The PC1 scores are dominated by the effect of OSCS contamination whereas
PC2 variation results from DS concentration variation (Figure 10). Both
effects are rather independent, because the PCs are orthogonal.
Figure 10. Scores plot of the PCA analysis for the 1H NMR spectral data of heparin samples containing DS and OSCS. Taken from Ref [30].
Beyer et al [17] conducted PCA analysis of qualitative characteristics of
heparin samples in order to evaluate whether these contaminants are related
to each other. The existence of the various contaminants was represented by
the encoding scheme: -1 encoded a contaminant that was not detected, while
+1 encoded a detected contaminant. The application of the PCA revealed that
the samples containing OSCS can be separated from all other samples by
plotting the scores of PC3 against the scores of PC4 (Figure 11).
Figure 11. Separation of the samples containing OSCS (marked by +1.000) from those not containing OSCS (marked by –1.000) in a score-plot of a PCA model. Taken from Ref [17].
2.3.4.3 Raman Spectra for Screening Suspect Heparin Lots
In order to screen suspect lots of heparin, Spencer et al [25] studied a set
of 69 heparin powder samples obtained from several foreign and domestic
suppliers by means of near infrared (NIR) reflectance and laser Raman
spectroscopy techniques. The baseline-corrected, vector normalized Raman
spectra of heparin, OSCS, chondroitin sulfate A and DS are shown in Figure
12A. Both the NIR and Raman spectra of individual heparin API samples
were correlated with sample compositions determined from response-
corrected relative peak areas of the capillary electrophoresis (CE) electropherograms of the
samples using a PLSR model. Chemometric models were found to produce
accurate predictive models. OSCS prediction plots for the Raman test sets
are displayed in Figure 12B. The plot suggests that a threshold value of 1%
predicted OSCS can be used to eliminate suspicious heparin samples. When
the NIR model is used, a 1% threshold resulted in 38 out of 41 samples
correctly classified as being either good (15 samples at OSCS < 1%) or
suspect (26 samples > 1% OSCS). One good sample was classified as
suspect (1 false negative) and one suspect sample was classified as good
(false positive). Prediction with the Raman model showed similar accuracy,
with 36 out of 38 samples being correctly classified with one false positive
and one false negative. The overall accuracy in classifying heparin samples
as suspect or good using these spectroscopic/chemometric methods as
screening tools can be expected to exceed 95%. Both NIR and Raman allow
the elimination of over 60% of the heparin samples as suspicious. The
remaining 40% would be subjected to additional analyses by CE, NMR or
other separation methods to detect the presence of low levels of OSCS.
Figure 12. Comparison of Raman spectra of heparin and the principal contaminants (A) and Raman PLS model test for OSCS of thirty eight samples (B). Solid diamond points are considered ‘‘Good’’; open square points are ‘‘Suspect’’. Taken from Ref [25].
Chapter III
DATA AND METHODS
In the present study, all 1H NMR spectral data were provided by the
Division of Pharmaceutical Analysis (DPA) of the US FDA, and various
multivariate regression approaches as well as pattern recognition techniques
were applied to the data.
3.1 Heparin Samples
Over 200 heparin sodium API samples from different manufacturers and
suppliers were analyzed. These samples contained substantial amounts of
DS (up to 19% of the polymer mixture) and OSCS (in an amount from 0 to
27%).
3.1.1 Pure, Impure and Contaminated Heparin APIs for Classification
Preliminary screening of heparin batches collected from different sources
by means of 1H NMR spectroscopy and capillary electrophoresis (CE)
revealed four different groups, i.e., pure heparin with DS ≤ 1.0%, heparin
containing DS in varying amounts but without OSCS, heparin with OSCS and
without DS, and heparin with both OSCS and DS.
Revisions proposed by the FDA for the Stage 3 Heparin Sodium USP
monograph specify that the weight percent of galactosamine in total
hexosamine (%Gal) may not exceed 1.0% and no level is acceptable for
OSCS. Thus, the samples in this study were divided into three groups: (a)
pure heparin with DS ≤ 1.0% and OSCS = 0% (Heparin); (b) impure heparin
with DS > 1.0% and OSCS = 0% (DS); and (c) contaminated heparin with
OSCS > 0% and any content of DS (OSCS). An additional fourth class,
namely [DS + OSCS], was included to characterize samples that contained
DS > 1.0%, OSCS > 0%, or both. In order to obtain a model with validation
capabilities, the data were divided into two data sets: a training set employed to
build the model, and a validation set employed to test the predictive ability of
the model using data excluded from the training set. The data set of 178
heparin samples was split 2:1 into 118 samples for training (54 Heparin, 33
DS, and 31 OSCS) and 60 samples for external validation and testing (28
Heparin, 17 DS, and 15 OSCS). Multivariate statistical modeling was
conducted separately on the entire region (1.95-5.70 ppm) and two local
regions (1.95-2.20 and 3.10-5.70 ppm), which correspond to 74, 9 and 65
variables, respectively.
3.1.2 Heparin API Samples for %Gal Determination
1H NMR analytical data of over 100 heparin sodium API samples from
different suppliers with varying levels of chondroitins were obtained from the
chromatographic and spectroscopic experiments. DS is the primary
chondroitin impurity observed in heparin APIs and, for the purpose of this
study, the %Gal is presumed to be the same as the %DS for samples not
containing OSCS. These samples contained up to 10% by weight of
chondroitins in the API by the %Gal HPLC assay. Based on the range of
%Gal, the NMR spectral data were classified into two datasets, Dataset A and
Dataset B, which correspond to 0-10% and 0-2% galactosamine, respectively,
so Dataset B is a subgroup of Dataset A. For each dataset, heparin samples
were randomly split into two subsets: a training set that is used to build the
calibration models and an independent test set that is used to evaluate and
validate the model's predictive ability. The statistics of these two datasets are
summarized in Table 1. In the present study, models built by Dataset A and
Dataset B are named Model A and Model B, respectively.
Table 1. Summary Statistics of %Gal Measured from HPLC
__________________________________________________________________________________
                 Number of samples   Minimum   Maximum   Median   Mean
__________________________________________________________________________________
Dataset A
  Training set          76             0.01      9.68     0.86    1.74
  Test set              25             0.11      8.05     0.87    1.76
Dataset B
  Training set          57             0.01      1.86     0.66    0.71
  Test set              19             0.11      1.74     0.72    0.73
__________________________________________________________________________________
3.1.3 Blends of Heparin Spiked with other GAGs
A series of blends was prepared by spiking heparin APIs with native
impurities chondroitin sulfate A (CSA), chondroitin sulfate B (CSB, or DS),
heparan sulfate (HS), or synthetic contaminants oversulfated-(OS)-CSA (i.e.,
OSCS), OS-CSB, OS-HS or OS-heparin at the 1.0%, 5.0% and 10.0% weight
percent levels [15]. The detailed composition of the series of blends is
reported in Chapter IV: Results and Discussion.
3.2 Proton NMR Spectra
Figure 13 illustrates the overlaid 500 MHz 1H NMR spectra of heparin
samples that contained 10.0% weight percent spikes of native and synthetic
GAGs, i.e., chondroitin sulfate A (CSA), oversulfated CSA (OS-CSA or
OSCS), chondroitin sulfate B (CSB) or dermatan sulfate (DS), and
oversulfated CSB (OS-DS), plotted in the range from 1.95 to 6.00 ppm. The
methyl protons of the N-acetyl methyl groups resonated around a chemical
shift of ca. 2 ppm, which was well separated from the other NMR signals in
the 3.0 to 6.0 ppm range where a complex pattern of overlapping signals
occurred. Each spectrum revealed distinctive features, and their respective
patterns were easily distinguished from one another in the range from 1.95 to
2.20 ppm (Figure 13A). The basic repeating disaccharide unit for heparin is 2-
O-sulfated uronic acid and 6-O-sulfated N-sulfated glucosamine, whereas the
corresponding repeating unit for DS or OSCS is uronic or glucuronic acid,
respectively, and galactosamine. About every fifth amino group is acetylated
for heparin, but almost all of the amino groups are acetylated in DS and
OSCS [21, 30]. A single peak appeared at 2.05 ppm for the N-acetyl protons
of heparin, and the methyl signal shifted about 0.03 ppm downfield in DS.
Figure 13. An overlay of the 500 MHz 1H NMR spectra of a heparin sodium API spiked with 10.0% weight percent of CSA, OS-CSA, DS and OS-DS. (A) In the 2.20-1.95 ppm region; (B) In the 6.00-3.00 ppm region.
Thus, a small peak, corresponding to the N-acetyl protons of DS, was
observed near 2.08 ppm. For OS-DS, two signals, which were located at 2.09
and 2.11 ppm, appeared downfield of the heparin methyl signal. A shoulder
peak at 2.02 ppm appeared upfield of the heparin methyl proton signal for
CSA while OS-CSA exhibited a characteristic signal near 2.15 ppm. Figure
13B shows the 3.0-6.0 ppm region of the overlaid spectra. The presence of
CSA resulted in the signals at 3.38, 3.58 and 4.02 ppm while the
characteristic peaks at 4.16, 4.48, 4.97 and 5.01 ppm came from OS-CSA.
DS displayed resonances at chemical shifts distinct from those of heparin at
3.54, 3.87, 4.03, 4.68 and 4.87 ppm. In addition, the signals at 4.27 and 4.93
ppm were associated with the OS-DS sample.
The proton NMR spectra of heparin samples are rich in information.
Although it is difficult to assign all peaks in the spectrum for use in the
determination of the quality of complex APIs such as heparin, these patterns
of intensities are valuable for characterizing and quantifying analytes for
quality control and purity assessment [39], and ideal for analysis using
chemometric approaches.
3.3 Data Processing
Prior to building multivariate models, the 1H NMR spectra of the heparin
samples were preprocessed into a discrete set of variables that served as the
input to the pattern recognition tools for subsequent analysis of the pure, DS-
impure, and OSCS-contaminated heparin samples.
1H NMR spectra were processed using the software MestRe-C (Version
5.3.0). Phase correction was achieved through automatic zero- and first-order
correction procedures, and peak integration was performed for each spectral
region. Chemical shifts were referenced to internal 4, 4-dimethyl-4-
silapentane-1-sulfonic acid (DSS). For the chemometric analysis, each 1H
NMR spectrum was automatically data-reduced and converted into 125
variables by dividing the 1.95 to 5.70 ppm region into sequential windows with
width of 0.03 ppm. During initial processing of the data, heparin lots were
found to contain residual solvents and reagents such as ethanol (triplet at
1.18 and quartet at 3.66 ppm), acetate (singlet at 1.92 ppm), and methanol
(singlet at 3.35 ppm) at varying levels. In addition, the residual H2O in the D2O
had a strong signal at 4.77 ppm. These regions were excluded from the
analysis, and the total data set was reduced to 74 regions or variables,
which are listed in Table 2 together with their corresponding chemical shifts.
The area within the spectral regions was integrated. In order to
compensate for differences in concentration among the heparin samples, the
74 variables for each spectrum were normalized to the total of the summed
integral value. Prior to chemometric analysis, the spectra were converted into
ASCII files where the data were represented in n × m-dimensional space (n
and m equal to the number of samples and the number of variables,
respectively), and the resulting data matrix was imported into Microsoft Excel
2003. The data were preprocessed by autoscaling, also known as unit
variance scaling (i.e., each of the variables is mean-centered and then
divided by its standard deviation) [138].
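The two preprocessing steps described above (normalization of each spectrum to its total integral, then autoscaling of each variable) can be sketched as follows; the 3 × 3 matrix is hypothetical and far smaller than the actual 74-variable data:

```python
def normalize_rows(X):
    """Divide each spectrum (row) by its total summed integral so that
    samples of different concentration become comparable."""
    return [[v / sum(row) for v in row] for row in X]

def autoscale(X):
    """Unit-variance scaling: mean-center each variable (column), then
    divide it by its standard deviation."""
    n, m = len(X), len(X[0])
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [(sum((v - mu) ** 2 for v in c) / (n - 1)) ** 0.5
           for c, mu in zip(cols, means)]
    return [[(X[i][j] - means[j]) / sds[j] for j in range(m)]
            for i in range(n)]

X = normalize_rows([[2.0, 6.0, 2.0], [1.0, 8.0, 1.0], [3.0, 4.0, 3.0]])
Z = autoscale(X)
# Every row of X now sums to 1, and every column of Z has zero mean.
print(all(abs(sum(row) - 1.0) < 1e-12 for row in X))  # True
print(all(abs(sum(col)) < 1e-9 for col in zip(*Z)))   # True
```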
Table 2. Variable IDs and their Corresponding Chemical Shifts
__________________________________________________________________________________
ID  shift (ppm)   ID  shift (ppm)   ID  shift (ppm)   ID  shift (ppm)
__________________________________________________________________________________
 1   1.96         20   3.80         39   4.37         57   5.16
 2   1.99         21   3.83         40   4.40         58   5.19
 3   2.02         22   3.86         41   4.43         59   5.22
 4   2.05         23   3.89         42   4.46         60   5.25
 5   2.08         24   3.92         43   4.49         61   5.28
 6   2.11         25   3.95         44   4.52         62   5.31
 7   2.14         26   3.98         45   4.55         63   5.34
 8   2.17         27   4.01         46   4.58         64   5.37
 9   2.20         28   4.04         47   4.61         65   5.40
10   3.50         29   4.07         48   4.64         66   5.43
11   3.53         30   4.10         49   4.92         67   5.46
12   3.56         31   4.13         50   4.95         68   5.49
13   3.59         32   4.16         51   4.98         69   5.52
14   3.62         33   4.19         52   5.01         70   5.55
15   3.65         34   4.22         53   5.04         71   5.58
16   3.68         35   4.25         54   5.07         72   5.61
17   3.71         36   4.28         55   5.10         73   5.64
18   3.74         37   4.31         56   5.13         74   5.67
19   3.77         38   4.34
__________________________________________________________________________________
3.4 Computational Programs
Mathematical treatments for data standardization, multivariate analysis,
and statistical model building were performed using the R statistical analysis
software for Windows (Version 2.8.1) [139]. Stepwise variable selection,
genetic algorithms, multiple linear regression, Ridge regression, partial least
squares regression and support vector regression were implemented using
the packages chemometrics, subselect, stats, MASS, pls and e1071,
respectively [36, 140, 141]. The packages stats, caret, MASS, rpart, nnet, as
well as class and chemometrics were used to perform principal component
analysis, partial least squares discriminant analysis, linear discriminant
analysis, classification and regression tree, artificial neural network, and k-
nearest neighbors analysis, respectively. All the class modeling analyses
were performed using the chemometric software V-Parvus 2008 [142].
3.5 Performance Validation
The quality of the calibration model is evaluated by building a regression
between the experimental values and the predicted values. The statistical
parameters, viz., the coefficient of determination (R²), root mean squared error
(RMSE), and relative standard deviation (RSD), are used to measure the
performance, and take the following forms [143, 144]:
R² = 1 − Σ_{i=1}^{n} (ŷ_i − y_i)² / Σ_{i=1}^{n} (y_i − ȳ)²        (39)
RMSE = √[ Σ_{i=1}^{n} (ŷ_i − y_i)² / (n − 1) ]        (40)
RSD = (RMSE / ȳ) × 100%        (41)
where y_i is the actual %Gal of sample i measured by HPLC, ŷ_i is the %Gal
predicted by the model, and ȳ is the mean %Gal of all samples in a data set.
R² is the most popular measure of the model's ability to fit the data. A
value of R² near zero suggests no linear relationship, while a value
approaching unity indicates a near perfect linear fit. An acceptable model
should have a large R², a small RMSE, and a small RSD. The value of R² will
increase as the model increases in complexity (i.e., more independent
variables), so the number of variables in the model must be considered. An
alternative to R² is the adjusted coefficient R²_adj, which accounts for the
number of variables m in the model and favors models with a small number of
variables. R²_adj is defined by [36]:
R²_adj = 1 − [(n − 1)/(n − m − 1)] (1 − R²)        (42)
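Equations (39)-(42) can be computed directly; the actual and predicted %Gal values in the following Python sketch are hypothetical:

```python
def regression_metrics(y_actual, y_pred, m_vars=None):
    """R^2 (Eq. 39), RMSE with the (n - 1) denominator used here (Eq. 40),
    RSD as a percentage of the mean (Eq. 41), and, when the number of
    variables m_vars is given, the adjusted R^2 (Eq. 42)."""
    n = len(y_actual)
    y_bar = sum(y_actual) / n
    ss_res = sum((yh - y) ** 2 for y, yh in zip(y_actual, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_actual)
    out = {"R2": 1 - ss_res / ss_tot,
           "RMSE": (ss_res / (n - 1)) ** 0.5}
    out["RSD%"] = out["RMSE"] / y_bar * 100
    if m_vars is not None:
        out["R2adj"] = 1 - (n - 1) / (n - m_vars - 1) * (1 - out["R2"])
    return out

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], m_vars=1)
print(round(m["R2"], 4), round(m["R2adj"], 4))  # 0.98 0.97
```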
In order to evaluate and validate the built models, training-test validation
as well as leave-one-out cross-validation (LOO-CV) methods are employed to
compare the predictive performance. For LOO-CV, the data set is divided into
n subsets: the training is performed on the (n - 1) blocks, and the test is
conducted on the objects belonging to the nth subset. In order to predict all
the objects, this process is repeated n times through block permutation [42,
103, 145].
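The LOO-CV loop itself is independent of the model being validated. The sketch below applies it to the simplest conceivable "model" (predicting the mean of the training block) purely to show the resampling scheme; the data are hypothetical:

```python
def loo_cv_mean_predictor(y):
    """Leave-one-out cross-validation: each object is held out once and
    predicted from a model (here, the mean) fit on the other n - 1 objects."""
    preds = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]          # all blocks except the i-th
        preds.append(sum(train) / len(train))
    return preds

print(loo_cv_mean_predictor([1.0, 2.0, 3.0]))  # [2.5, 2.0, 1.5]
```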
Chapter IV
RESULTS AND DISCUSSION
The whole research work was divided into two parts, i.e., multivariate
regression analysis for the determination of the weight percent of
galactosamine (%Gal) and pattern recognition analysis for the differentiation
of pure, impure and contaminated heparin samples.
4.1 Multivariate Regression Analysis for Predicting %Gal
Multivariate regression (MVR) analysis of 1H NMR spectral data obtained
from heparin samples was employed to build quantitative models for the
prediction of %Gal. The MVR analysis was conducted using four separate
methods: multiple linear regression (MLR), Ridge regression (RR), partial
least squares regression (PLSR), and support vector regression (SVR).
Genetic algorithms (GAs) and stepwise selection methods were applied for
variable selection.
4.1.1 Variable Selection
In order to build robust regression models with high predictive
performance, stepwise selection methods and genetic algorithms were used
here to select a subset of variables from the original NMR spectral matrix.
4.1.1.1 Stepwise Procedure
In the stepwise selection, variables are added one at a time, and can be
deleted later if they fail to make a significant contribution to the model. The number
of variables retained in the final model is based on the significance levels.
The Bayes information criterion (BIC) was used as a measure of the model fit,
which can be expressed as [36]:
BIC = n log(RSS/n) + m log n    (43)
where RSS is the residual sum of squares, n is the number of samples, and
m is the number of regression variables. The variable is added to or removed
from the model so as to achieve the largest reduction in the BIC. When the
BIC value can no longer be reduced, the model selection process stops,
yielding the optimal subset of variables.
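A minimal sketch of this BIC-driven stepwise search (synthetic data stand in for the NMR matrix; the function names are illustrative, not from this study):

```python
import numpy as np

def bic(X, y):
    """BIC = n*log(RSS/n) + m*log(n) for an ordinary least-squares fit (Eq. 43)."""
    n, m = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + m * np.log(n)

def stepwise_bic(X, y):
    """Greedy forward/backward selection: at each step make the single
    addition or deletion that most reduces the BIC; stop when no move helps."""
    selected, best = [], np.inf
    while True:
        candidates = [selected + [j] for j in range(X.shape[1]) if j not in selected]
        if len(selected) > 1:
            candidates += [[j for j in selected if j != k] for k in selected]
        score, subset = min((bic(X[:, s], y), s) for s in candidates)
        if score >= best:
            return sorted(selected)
        best, selected = score, subset

# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=60)
selected = stepwise_bic(X, y)
```

The log(n) penalty per variable is what allows deletions such as the removal of 2.11 ppm: once stronger variables enter, a marginal one no longer pays for itself.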
The variation of BIC values with the model size for all steps of the
stepwise procedure is plotted in Figure 14. Datasets A and B follow similar
trends in that each model search starts from the point in the upper left corner
of the plot, and ends in the lower right corner. The BIC measure decreases
continuously to a minimum value. However, the two datasets follow different
paths to minimize the BIC value. For Dataset A, the most highly correlated
variable, i.e., variable 2.08 ppm, entered the model first, followed by the
inclusion of variables 2.02, 2.11, 4.31, 3.53, 3.50, 5.61, and 5.34 ppm, and
then variable 2.11 ppm was dropped due to its insignificance in the model.
After that, variables 5.43, 4.25, 3.59, and 2.14 ppm were added sequentially.
Figure 14. The relationship between the Bayes information criterion (BIC) and the number of variables selected by the stepwise procedure. (A) Dataset A; (B) Dataset B.
Finally, the model retained 11 variables as summarized in Table 3. With
regard to Dataset B, variables 2.08, 2.02, 2.11, 1.99, 4.37, and 4.22 ppm
were added to the model step-by-step, followed by the removal of variable
2.02 ppm, the inclusion of variable 2.20 ppm, and the elimination of variable
2.11 ppm. This process led to the 5-variable subset as shown in Table 4.
Comparing the final variable subsets for Datasets A and B, the only variable
in common is 2.08 ppm. This finding implies that differences in DS content
greatly influence the selection of variables. The selected variables can be
directly used for MLR and Ridge regression analysis, or they can be
employed to derive PLSR and SVR models.
Table 3. The Stepwise Variable Selection Procedure for Dataset A
Model Size BIC Selected Variables (ppm) Add(+)/Drop(-)
1 190.97 2.08 + 2.08
2 124.81 2.02, 2.08 + 2.02
3 99.64 2.02, 2.08, 2.11 + 2.11
4 84.84 2.02, 2.08, 2.11, 4.31 + 4.31
5 76.98 2.02, 2.08, 2.11, 3.53, 4.31 + 3.53
6 56.00 2.02, 2.08, 2.11, 3.50, 3.53, 4.31 + 3.50
7 54.23 2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.61 + 5.61
8 50.01 2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.34, 5.61 + 5.34
7′ 45.47 2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.61 - 2.11
8′ 45.05 2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.43, 5.61 + 5.43
9 42.17 2.02, 2.08, 3.50, 3.53, 4.25, 4.31, 5.34, 5.43, 5.61 + 4.25
10 40.48 2.02, 2.08, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61 + 3.59
11 37.49 2.02, 2.08, 2.14, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61 + 2.14
Table 4. The Stepwise Variable Selection Procedure for Dataset B
Model Size BIC Selected Variables (ppm) Add(+)/Drop(-)
1 69.71 2.08 + 2.08
2 42.48 2.02, 2.08 + 2.02
3 30.73 2.02, 2.08, 2.11 + 2.11
4 27.61 1.99, 2.02, 2.08, 2.11 + 1.99
5 24.13 1.99, 2.02, 2.08, 2.11, 4.37 + 4.37
6 20.93 1.99, 2.02, 2.08, 2.11, 4.22, 4.37 + 4.22
5′ 17.17 1.99, 2.08, 2.11, 4.22, 4.37 - 2.02
6′ 15.50 1.99, 2.08, 2.11, 2.20, 4.22, 4.37 + 2.20
5′′ 14.45 1.99, 2.08, 2.20, 4.22, 4.37 - 2.11
4.1.1.2 Genetic Algorithms
As a probabilistic global optimization method in which various
combinations of variables are evaluated, GAs have been proven to be a
valuable tool for selecting optimal variables in multivariate calibration [71-77].
The approach is designed to select variables with the lowest prediction error
and is especially useful for data sets ranging between 30 and 200 variables,
and hence it is suitable for the heparin NMR datasets. GA training requires
the selection of several parameters, i.e., the number of chromosomes, initial
population, selection mode, crossover parameters, mutation rate, and
convergence criteria, all of which can influence the final results. In the present
investigation, the entire set of 74 variables was used as inputs to the GA for
selection of the subset of variables that works best for predicting %Gal.
Table 5. Parameters for the Genetic Algorithms
Population size 200 chromosomes
Chromosome size (the total number of variables) 74
Generation gap (initialization probability) 0.9
Crossover scheme Single-point
Crossover probability 50%
Mutation scheme Simple mutation
Mutation probability 1%
Number of generations 100
Number of variables selected in the chromosome 5, 10, 20, 30, and 40
Number of runs 500
The initial population size was set to 200, and the maximum number of
selected variables in the model was maintained between 5 and 40. In each
generation, the chromosome with the maximum fitness value was retained.
The crossover probability and mutation probability were set to 50% and 1%,
respectively, and the search was run for 100 generations. The configuration
of the proposed GA is summarized
in Table 5. As the GA process is characteristically stochastic, the search
results depend on the randomly generated original population, and the
variables selected after each search process can be substantially different.
Therefore, it is necessary to carry out multiple independent runs. In this study,
each GA procedure was run 500 times, and the most frequently selected
variables were retained to build the calibration model. Figure 15 shows the
histograms of frequency with which each variable was selected in the case of
Figure 15. Histograms of frequency for the selected variables by genetic algorithms for 500 runs in the case of selecting 10 (A) and 20 (B) variables.
Table 6. The Variables (ppm) Selected by Genetic Algorithms
Number of variables Selected variables __________________________________________________________________________________
Dataset A
5 variables 2.08, 2.11, 3.50, 3.53, 4.46
10 variables 2.02, 2.08, 3.50, 3.53, 3.56, 3.71, 3.80, 5.49, 5.55, 5.67
20 variables 2.08, 2.11, 2.17, 2.20, 3.50, 3.53, 3.56, 3.71, 3.74, 3.92,
4.01, 4.04, 4.40, 4.46, 4.52, 4.92, 5.01, 5.46, 5.58, 5.67
30 variables 2.02, 2.08, 2.11, 2.14, 2.20, 3.53, 3.71, 3.74, 3.89, 3.98,
4.04, 4.13, 4.19, 4.34, 4.40, 4.46, 4.52, 4.92, 4.95, 4.98,
5.01, 5.04, 5.07, 5.22, 5.25, 5.37, 5.40, 5.58, 5.61, 5.67
40 variables 1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59,
3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07,
4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07,
5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67 Dataset B
5 variables 2.08, 3.50, 3.56, 3.71, 4.46
10 variables 2.02, 2.08, 2.14, 3.50, 3.56, 3.71, 4.46, 5.19, 5.49, 5.64
20 variables 2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.71, 3.77, 4.07, 4.13,
4.37, 4.43, 4.46, 4.49, 4.58, 5.04, 5.10, 5.19, 5.49, 5.61
30 variables 1.96, 2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.62, 3.71, 3.92,
3.95, 3.98, 4.07, 4.13, 4.37, 4.43, 4.46, 4.49, 4.58, 4.64,
5.04, 5.07, 5.10, 5.13, 5.16, 5.19, 5.22, 5.31, 5.49, 5.52
40 variables 1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59,
3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07,
4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07,
5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67
10 and 20 variables. Some variables, such as variable 5 (i.e. 2.08 ppm), are
selected each time, while others are less common. The output of the
algorithm consists of the subsets of 5, 10, 20, 30 and 40 variables which are
presented in Table 6. The most frequently selected variables are 2.08, 3.50
and 3.53 ppm, which correspond to the characteristic chemical shifts of DS.
These results added confidence that the information provided by GAs is
useful for determination of %Gal. For Dataset A, only variable 2.08 ppm was
selected in all the five subsets of variables. Other frequently selected
variables were 3.50 and 3.53 ppm. Variables 2.08, 3.50, 3.56 and 3.71 ppm
were the most frequently selected ones in Dataset B. Only two variables, 2.08
and 3.50 ppm, were found in common for both Dataset A and Dataset B.
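The GA selection scheme just described can be sketched as follows. This is a deliberately simplified GA (tournament selection, single-point crossover, bit-flip mutation, with a BIC fitness standing in for the prediction-error fitness) run on synthetic data, not the implementation used in this study; as in the study, repeated runs are tallied into selection frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)

def bic_fitness(X, y, mask):
    """Fitness of a chromosome (binary mask over variables): BIC of an
    ordinary least-squares fit on the selected columns (lower is better)."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return np.inf
    n = len(y)
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    rss = np.sum((y - X[:, cols] @ beta) ** 2)
    return n * np.log(rss / n) + cols.size * np.log(n)

def ga_select(X, y, pop_size=40, n_generations=30, p_mut=0.02, p_init=0.2):
    """Minimal GA: tournament selection, single-point crossover, bit-flip
    mutation, and elitism; each chromosome encodes a subset of variables."""
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < p_init
    for _ in range(n_generations):
        fit = np.array([bic_fitness(X, y, c) for c in pop])
        new = [pop[np.argmin(fit)].copy()]              # elitism: keep the best
        while len(new) < pop_size:
            def pick():                                 # tournament of two
                i, j = rng.integers(pop_size, size=2)
                return pop[i] if fit[i] < fit[j] else pop[j]
            cut = rng.integers(1, p)                    # single-point crossover
            child = np.concatenate([pick()[:cut], pick()[cut:]])
            child ^= rng.random(p) < p_mut              # bit-flip mutation
            new.append(child)
        pop = np.array(new)
    fit = np.array([bic_fitness(X, y, c) for c in pop])
    return pop[np.argmin(fit)]

# Synthetic data: only variables 0 and 3 carry signal; because each run is
# stochastic, repeated runs are summed into a selection-frequency histogram.
X = rng.normal(size=(60, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=60)
runs = [ga_select(X, y) for _ in range(3)]
freq = np.sum(runs, axis=0)
```

Truly informative variables dominate the frequency counts across runs, which is the rationale for retaining the most frequently selected variables from the 500 repeated searches.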
4.1.2 Multiple Linear Regression Analysis
MLR is a simple and easy calibration method that avoids the need for
adjustable parameters such as the factor number in partial least squares
regression, the regularization parameter λ in Ridge regression, and the kernel
parameters in SVR. Consequently, MLR is among the most common
approaches used to build multivariate regression models. However, overly
complex MVR models with large numbers of independent variables may
actually lose their predictive ability. This common problem occurs when too
many variables are used to fit the calibration set, and can be solved by using
a subset of the selected variables to build the model.
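The over-fitting effect described above is easy to reproduce on synthetic data; the following sketch (made-up data, illustrative function name) contrasts an all-variable OLS fit with a fit on the single relevant variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 30, 30, 29            # nearly as many variables as samples
X = rng.normal(size=(n_train + n_test, p))
y = X[:, 0] + 0.1 * rng.normal(size=n_train + n_test)   # only variable 0 matters

def train_test_r2(cols):
    """Fit OLS on the training half using the given columns; return the
    coefficient of determination on the training and test halves."""
    Xtr, Xte = X[:n_train][:, cols], X[n_train:][:, cols]
    ytr, yte = y[:n_train], y[n_train:]
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    r2 = lambda Xm, ym: 1.0 - np.sum((ym - Xm @ beta) ** 2) / np.sum((ym - ym.mean()) ** 2)
    return r2(Xtr, ytr), r2(Xte, yte)

full_train, full_test = train_test_r2(list(range(p)))  # near-perfect fit, poor test
sub_train, sub_test = train_test_r2([0])               # small subset generalizes
```

The all-variable fit nearly interpolates the training data yet predicts the test half far worse than the one-variable model, mirroring the 1.000 versus 0.616 pattern seen for Dataset A.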
The performance of MLR models was compared for different numbers of
variables selected by either stepwise or GA methods (Table 7). For Dataset
A, when all 74 variables were employed for the regression analysis, the
model yielded an R²adj of 1.000 for the training dataset but only 0.616 for
the test set. Figure 16A depicts the experimental %Gal by HPLC versus that
predicted from the NMR data. All the training sample points are located on a
straight line through the origin and with a slope equal to 1. However, many
test samples deviate from the diagonal in the plot. When MLR is trained using
all variables, some of the variables are unrelated to the variation of the
response, i.e., the %Gal. Such cases produce models that are over-fitted for
the training set but yield poor predictability for the test set.
When the most informative variables were selected, and variables that
were redundant or not correlated to the response were discarded, the
performance of the model was enhanced significantly. The predictive ability
was remarkably improved for the models up to 11 variables based on
stepwise selection. Compared to the all-variable model, the R²adj for the test
set increased from 0.616 to 0.976, even though the R²adj value for the training
set dropped slightly from 1.000 to 0.985. Taken together, these results reflect
the excellent agreement between the measured and the predicted values
after the selection of variables.
Table 7. Model Parameters of Multiple Linear Regression (MLR)
__________________________________________________________________________________
All Stepwise Genetic Algorithms _____ __________ ____________________________________
# of Variables 74 11 5 10 20 30 40
__________________________________________________________________________________
Model A
Dataset A
Training RMSE 0.01 0.26 0.35 0.27 0.26 0.22 0.17
RSD 0.01 0.15 0.20 0.15 0.15 0.13 0.10
R²adj 1.000 0.985 0.971 0.983 0.985 0.989 0.993
Test RMSE 1.34 0.33 0.29 0.23 0.29 0.31 0.55
RSD 0.76 0.19 0.17 0.13 0.16 0.18 0.31
R²adj 0.616 0.976 0.981 0.987 0.980 0.976 0.930
Dataset B
Training RMSE 0.01 0.19 0.26 0.19 0.18 0.16 0.14
RSD 0.01 0.27 0.39 0.27 0.25 0.22 0.20
R²adj 1.000 0.860 0.784 0.861 0.892 0.901 0.918
Test RMSE 1.47 0.29 0.29 0.20 0.27 0.28 0.55
RSD 1.99 0.40 0.39 0.27 0.36 0.38 0.75
R²adj 0.105 0.656 0.696 0.845 0.764 0.723 0.587
Model B
Dataset B
Training RMSE NA 0.21 0.18 0.13 0.10 0.07 0.03
RSD NA 0.30 0.25 0.18 0.14 0.10 0.04
R²adj NA 0.797 0.853 0.922 0.955 0.979 0.997
Test RMSE NA 0.26 0.25 0.18 0.15 0.10 0.14
RSD NA 0.36 0.34 0.24 0.20 0.13 0.19
R²adj NA 0.694 0.733 0.862 0.917 0.959 0.941
__________________________________________________________________________________
Figure 16. Predicted (from NMR data) versus measured (from HPLC) %Gal for Dataset A (%Gal: 0-10). (A) Predicted by Model A, using all 74 variables; (B) Predicted by Model A, using 10 variables selected from GA.
When considering the results from GA variable selection, the model quality
relied heavily on the number of selected variables. Table 7 shows that R²adj for
the training set improved continuously from 0.971 to 0.993 between 5 and 40
variables. In contrast, the test set followed a different pattern, i.e., the R²adj
value initially increased to a maximum 0.987 at 10 variables, after which it
gradually decreased to 0.930 at 40 variables. Therefore, the minimum
prediction error occurred when the model was of moderate
complexity. In the present case, the resulting model demonstrated good
performance in estimating the %Gal concentrations at 10 variables. A strong
correlation between the measured and the predicted values over the entire
concentration range was obtained for both the training and test data sets as
illustrated in Figure 16B. Comparing the two variable selection approaches,
GA and stepwise selection, the statistical parameters R²adj and RMSE
revealed a slight advantage for GA over stepwise selection. The R²adj values
obtained using GA for variable selection were 0.981 for 5 variables and 0.980
for 20 variables, which exceed the value 0.976 for the model obtained using
stepwise variable selection.
As the USP upper limit for %Gal is 1.0%, we checked the predictive
performance of our models at low %Gal concentration. When only Dataset B
(0.0-2.0%Gal) is considered, the results predicted using Model A are only
mediocre as expected. Using the all-variable model, R²adj approaches 1 for the
training set but is unsatisfactory at 0.105 for the test set. Although variable
selection enhanced the predictive ability, the best R²adj for the test set was
only 0.845 at 10 variables (Figure 17A).
In order to improve the predictive ability in the lower range of 0.0-2.0%Gal,
Dataset B was employed to construct the MLR models. When building MLR
models, the number of samples must equal or exceed the number of
independent variables. The training set for Dataset B contained only 57
samples, much lower than the 74 independent variables extracted from the
NMR data; consequently, the full-variable model was not feasible. The results,
summarized in Table 7, reveal that the top model performance was attained
(R²adj = 0.959) using a subset of 30 variables selected by GA. The superb
agreement between the predicted and experimental values (Figure 17B)
confirms the high predictive ability of Model B in the lower range of 0.0-
2.0%Gal. Stepwise variable selection yielded unsatisfactory results in terms
of the predictive ability of Model B. The R²adj values for the training and test
sets were 0.797 and 0.694, respectively, which were far lower than those
obtained from the corresponding GA models for any number of variables. A
possible explanation is that stepwise variable selection is limited in its ability
to explore possible combinations of variables.
Figure 17. Predicted (from NMR data) versus measured (from HPLC) %Gal for Dataset B (%Gal: 0-2). (A) Predicted by Model A, using 10 variables selected from GA; (B) predicted by Model B, using 30 variables selected from GA.
4.1.3 Ridge Regression Analysis
Multiple linear regression is very sensitive to variables with high
correlation, since near-collinearity causes a large variance or uncertainty of
the model parameters and therefore makes model predictions highly
unreliable [36]. In addition, when there are fewer samples than parameters to
be estimated, the MLR method cannot be used. By applying the Ridge
regression technique, the collinearity problem that arises in MLR can be
overcome, because the X'X matrix is artificially modified so that its
determinant is appreciably different from zero. An extra parameter is introduced in the
model, i.e., the Ridge parameter or complexity parameter λ which can
constrain the size of the regression coefficients. The value of λ determines
how much the Ridge regression deviates from the MLR solution. If λ is too
small, Ridge regression cannot effectively counter collinearity; if λ is too
large, the bias of the estimated regression coefficients becomes substantial.
In Ridge regression, the first step is to find the optimal parameter λ which
can produce the smallest prediction error. By estimating the prediction error
as the mean squared error for prediction (MSEP) using the generalized cross-
validation (GCV), a series of λ values corresponding to a range of variables
was obtained (Table 8). The dependence of the MSEP on the Ridge
parameter λ for the 40-variable model selected using GA is illustrated in
Figure 18A. The optimal value of λ is 0.267, which yielded the smallest
prediction error. The relationship between the regression coefficients and the
parameter λ is shown in Figure 18B, where the regression coefficient of each
variable is represented by a particular curve and the size changes as a
function of λ. It is clear that larger values of the Ridge parameter lead to
greater shrinkage of the coefficients which approach zero as λ approaches
infinity. The optimal choice of λ (0.267) is depicted by the vertical line in
Figure 18B, which intersects the optimized regression coefficients of the
curves.
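The λ-selection procedure just described can be sketched as follows. The closed-form Ridge estimator is standard; since the text does not give the GCV formula, the common form GCV(λ) = n·RSS/(n − trace(H))² is assumed here, and the collinear demonstration data are made up:

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    """Ridge regression, beta = (X'X + lam*I)^-1 X'y, with lam chosen by
    generalized cross-validation: GCV(lam) = n*RSS / (n - trace(H))^2,
    where H = X (X'X + lam*I)^-1 X' is the hat matrix."""
    n, m = X.shape
    best_gcv, best_lam, best_beta = np.inf, None, None
    for lam in lambdas:
        A = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T)  # (X'X + lam I)^-1 X'
        H = X @ A
        rss = np.sum((y - H @ y) ** 2)
        gcv = n * rss / (n - np.trace(H)) ** 2
        if gcv < best_gcv:
            best_gcv, best_lam, best_beta = gcv, lam, A @ y
    return best_lam, best_beta

# Strongly collinear columns: MLR would be unstable, Ridge remains usable.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.001 * rng.normal(size=50)        # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)
lam, beta = ridge_gcv(X, y, [1e-3, 1e-2, 1e-1, 1.0, 10.0])
```

Adding λ to the diagonal of X'X keeps the system solvable even when the columns are nearly identical, at the cost of the coefficient shrinkage visible in Figure 18B.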
Prediction of the test data was achieved using the optimized regression
coefficients. The statistical parameters calculated for the Ridge regression
models, including the adjusted coefficient R²adj, root mean squared error
(RMSE), and relative standard deviation (RSD) for both training and test sets,
are presented in Table 8. For the all-variable model, the coefficient of
determination R²adj for the test set increases from 0.616 to 0.801 compared
with the MLR model for Dataset A. The all-variable MLR model is unavailable
for Dataset B since the number of variables exceeds the number of samples.
Ridge regression is not constrained by this condition, and the all-variable
model gave R²adj = 1.000 for the training set and 0.778 for the test set (Table
8). The sub-optimal R²adj for the test set, along with the large difference
between errors for the training set and test set (0.01 to 0.38), are indicative of
model over-fitting and poor predictive ability. When the variables were
reduced by the stepwise and GA selection methods, the co-linearity effect
was eliminated and the predictive ability of the RR models approached that of
MLR models. Like the MLR models, the RR model showed poor predictive
ability when the model contains too few variables (under-fitting) or too many
variables (over-fitting). Therefore, selecting the appropriate number of
variables was a key factor in achieving good predictive results in Ridge
regression.
Figure 18. Ridge regression for the heparin 1H NMR data at 40 variables selected from
GA. The optimal Ridge parameter λ = 0.267 is determined by generalized cross validation (GCV) (A), and the corresponding regression coefficients are the intersections of the curves of the regression coefficients with the vertical line at λ = 0.267 (B).
Table 8. Model Parameters of Ridge Regression (RR)
__________________________________________________________________________________
All Stepwise Genetic Algorithms _____ _________ _________________________________________
# of Variables 74 11 5 10 20 30 40
__________________________________________________________________________________
Model A
λ 0.01 0.28 0.18 0.56 0.64 0.34 0.27
Dataset A
Training
RMSE 0.02 0.26 0.35 0.27 0.26 0.23 0.17
RSD 0.01 0.15 0.20 0.16 0.15 0.13 0.10
R²adj 1.000 0.985 0.971 0.983 0.984 0.987 0.992
Test
RMSE 0.93 0.32 0.28 0.23 0.29 0.33 0.64
RSD 0.53 0.18 0.16 0.13 0.17 0.19 0.36
R²adj 0.801 0.978 0.982 0.988 0.981 0.973 0.902
Dataset B
Training
RMSE 0.02 0.18 0.27 0.18 0.17 0.15 0.14
RSD 0.03 0.26 0.38 0.26 0.24 0.22 0.20
R²adj 0.997 0.857 0.780 0.864 0.898 0.902 0.919
Test RMSE 0.97 0.29 0.28 0.20 0.26 0.27 0.54
RSD 1.30 0.38 0.37 0.27 0.35 0.36 0.73
R²adj 0.308 0.688 0.692 0.850 0.768 0.749 0.598
Model B
λ 0.01 0.27 0.06 0.02 0.05 0.03 0.01
Dataset B
Training
RMSE 0.01 0.21 0.17 0.13 0.10 0.07 0.03
RSD 0.01 0.30 0.25 0.19 0.14 0.10 0.04
R²adj 1.000 0.796 0.852 0.921 0.954 0.977 0.996
Test RMSE 0.23 0.26 0.25 0.18 0.15 0.11 0.14
RSD 0.31 0.35 0.34 0.24 0.20 0.14 0.19
R²adj 0.778 0.693 0.731 0.863 0.907 0.952 0.949
__________________________________________________________________________________
4.1.4 Partial Least Squares Regression Analysis
As one of the most common multivariate analysis techniques, partial least
squares regression (PLSR) can be applied to spectroscopic data to transform
the large amount of correlated variables into a small set of orthogonal
variables. In PLSR, information in the independent variable matrix X is
projected onto a small number of latent variables, where the response
variable matrix Y is simultaneously used in estimating the latent variables in X
that will be most relevant for predicting the Y variables [36]. The linear
combinations of all original variables considerably reduce the dimensionality
of regression model. Unlike the MLR, variable selection is not essential for
PLSR since the latent variables are orthogonal and not sensitive to
collinearity.
The performance of PLSR depends on the selection of the appropriate
PCs used to build the regression model, and the optimal number of PCs
determines the complexity of the model and can be optimized by a leave-one-
out cross-validation (LOO-CV) procedure on the training set [76, 84]. The
optimal model size corresponds to that with the lowest uncertainty estimates
obtained from the predictive error sum of squares (PRESS). The black lines in
Figure 19 depict the standard error of prediction (SEP) values from a single
cross-validation with 10 segments while the gray lines are produced by
repeating this procedure 100 times [36]. The dashed horizontal line
represents the SEP value for the test set at the optimal number of
Figure 19. The relationship between the component number of PLSR and the standard error of prediction (SEP) for Dataset A. The black lines were produced from a single 10-fold CV, while the gray lines correspond to 100 repetitions of the 10-fold CV. (A) Plot of SEP versus number of components for the all-variable model; (B) Plot of SEP versus number of components for the 20-variable model selected by GA.
components depicted by the dashed vertical line. By repeating this cross-
validation procedure 100 times, the SEP variation was much larger for the all-
variable model than for the corresponding 20-variable model with variables
selected by GA, indicating the latter's greater stability. The optimal numbers of
PCs were 12 and 15 for the all-variable and 20-variable models,
respectively.
Training set models were constructed using variables selected by either
GA or stepwise methods. The number of PCs previously judged to be optimal
was employed and the computed models were applied to the test set. The
optimal number of PCs for each model, along with corresponding values
of R²adj, RMSE, and relative standard deviation (RSD), are summarized in Table 9.
As mentioned above, the all-variable model required 12 PCs which
corresponded to the minimal cross-validation error. PLSR models built using
11 variables selected by the stepwise method yielded R²adj = 0.984 for the
training set and 0.979 for the test set. The prediction performance of the
model for %Gal was also satisfactory using GA for variable selection. The
prediction performance of the model with 5 to 20 variables selected by GA
was better than the all-variable model. The 10-variable model, which gave a
high R²adj of 0.988 and a low RSD of 0.124 (Table 9), was therefore chosen as
the optimal model.
Table 9. Model Parameters of Partial Least Squares Regression (PLSR)
__________________________________________________________________________________
All Stepwise Genetic Algorithms _____ ___________ __________________________________
# of Variables 74 11 5 10 20 30 40
__________________________________________________________________________________
Model A
Optimal PCs 12 8 5 8 15 18 22
Dataset A
Training RMSE 0.16 0.26 0.35 0.27 0.26 0.26 0.23
RSD 0.09 0.15 0.20 0.16 0.15 0.15 0.13
R²adj 0.994 0.984 0.972 0.982 0.983 0.985 0.988
Test RMSE 0.39 0.31 0.29 0.22 0.28 0.33 0.37
RSD 0.22 0.18 0.17 0.12 0.16 0.19 0.21
R²adj 0.962 0.979 0.980 0.989 0.982 0.974 0.970
Dataset B
Training RMSE 0.14 0.17 0.26 0.23 0.19 0.18 0.16
RSD 0.20 0.25 0.38 0.33 0.27 0.25 0.24
R²adj 0.912 0.869 0.784 0.817 0.863 0.868 0.897
Test RMSE 0.29 0.27 0.29 0.20 0.26 0.27 0.28
RSD 0.39 0.36 0.39 0.26 0.35 0.36 0.38
R²adj 0.696 0.740 0.694 0.855 0.751 0.735 0.718
Model B
Optimal PCs 28 5 5 9 19 23 34
Dataset B
Training RMSE 0.03 0.20 0.17 0.13 0.10 0.06 0.04
RSD 0.04 0.28 0.25 0.18 0.13 0.09 0.05
R²adj 0.994 0.799 0.855 0.924 0.958 0.980 0.987
Test RMSE 0.20 0.26 0.25 0.18 0.15 0.09 0.14
RSD 0.27 0.34 0.33 0.24 0.20 0.12 0.19
R²adj 0.846 0.697 0.733 0.864 0.917 0.965 0.948
__________________________________________________________________________________
When the %Gal of Dataset B (%Gal = 0.0-2.0) was predicted by Model B,
the all-variable PLSR model yielded R²adj = 0.846 and RMSE = 0.20 (Table 9).
Variable selection by GAs on Dataset B greatly enhanced the prediction
performance. The optimal model was obtained using 30 variables, with an
R²adj value of 0.965 for the test set.
4.1.5 Support Vector Regression Analysis
In multivariate regression models (i.e. MLR and PLSR), a linear
relationship is assumed between the NMR spectral variables and the %Gal.
Consequently, the predictive ability of a model will suffer if the actual
relationship between the dependent and independent variables is non-linear
rather than linear. In these cases, regression methods that encompass both
linear and non-linear models represent an effective strategy. Support vector
regression (SVR) processes both linear and non-linear relationships by using
an appropriate kernel function that maps the input matrix X onto a higher-
dimensional feature space and transforms the non-linear relationships into
linear forms [82, 83]. This new feature space is then implemented to deal with
the regression problem [85]. We employed SVR to construct both linear and
nonlinear prediction models for assessing whether nonlinear regression
models would improve prediction results on the same datasets.
Therefore, the proper kernel function was first selected and its parameters
then optimized. Unlike the Lagrange multipliers, which can
be optimized automatically by the program, SVR requires the user to adjust
the kernel parameters, the radius of the tube ε, and the regularizing
parameter C. When applying the RBF kernel, the generalization property is
dependent on the parameter γ which controls the amplitude of the kernel
function. If γ is too large, all training objects are used as the support vectors
leading to over-fitting. If γ is too small, all data points are regarded as one
object resulting in poor ability to generalize [83]. In addition, the penalty
weight C and the tube size ε also require optimization. As the regularization
parameter, C controls the trade-off between minimizing the training error and
maximizing the margin. Generally, values of C that are too large or too small
lead to regression models with poor prediction ability. When C is very low, the
predictive ability of the model is exclusively determined by the weights of
regression coefficients [86]. When C is large, the cost function decides the
performance while the regression coefficients have little bearing even if their
values are very high. Data points with prediction errors larger than ±ε are the
support vectors which determine the predictive ability of the SVR model. A
large number of support vectors occur at low ε, while sparse models are
obtained when the value of ε is high. The optimal value of ε depends heavily
on the individual datasets. Small values of ε should be used for low levels of
noise, whereas higher values of ε are appropriate for large experimental
errors. Thus, in order to find the optimal combination of the parameters γ, C
and ε, cross-validation via a parallel grid search was performed.
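Such a joint grid search can be sketched with scikit-learn's SVR; the grid values and the non-linear synthetic response are illustrative assumptions, not the settings of this study:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Non-linear synthetic response; the RBF kernel must capture the curvature.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(120, 1))
y = np.sinc(X.ravel()) + 0.05 * rng.normal(size=120)

# Joint search over gamma (kernel width), C (regularization weight) and
# epsilon (tube radius) with 5-fold cross-validation.
param_grid = {"gamma": [0.1, 1.0, 10.0],
              "C": [1.0, 10.0, 100.0],
              "epsilon": [0.01, 0.1, 0.5]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
model = search.best_estimator_
```

Because the three parameters interact (a wide tube can mask a poor γ, for example), they are tuned jointly rather than one at a time.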
Table 10. Model Parameters for Support Vector Regression with RBF Kernel
__________________________________________________________________________________
All Stepwise Genetic Algorithms _____ ___________ __________________________________
# of Variables 74 11 5 10 20 30 40
__________________________________________________________________________________
Model A
SVR parameters
ε 0.14 0.01 0.18 0.10 0.07 0.05 0.10
C × 10⁻⁴ 10 1.0 10 100 10 10 1.0
γ × 10⁵ 1.0 1.0 1.0 1.0 1.0 1.0 1.0
# of Vectors 28 71 21 43 39 59 37
Dataset A
Training RMSE 0.22 0.28 0.36 0.28 0.27 0.25 0.24
RSD 0.13 0.16 0.21 0.16 0.16 0.14 0.14
R²adj 0.989 0.983 0.971 0.983 0.984 0.986 0.987
Test RMSE 0.43 0.25 0.28 0.23 0.22 0.21 0.41
RSD 0.24 0.14 0.16 0.13 0.13 0.12 0.23
R²adj 0.956 0.985 0.983 0.987 0.988 0.990 0.960
Dataset B
Training RMSE 0.21 0.18 0.27 0.17 0.17 0.16 0.15
RSD 0.31 0.25 0.39 0.25 0.24 0.23 0.22
R²adj 0.816 0.863 0.774 0.878 0.884 0.896 0.901
Test RMSE 0.39 0.23 0.25 0.20 0.18 0.16 0.36
RSD 0.53 0.31 0.34 0.26 0.24 0.22 0.49
R²adj 0.663 0.783 0.761 0.839 0.870 0.887 0.701
Model B
SVR parameters
ε 0 0.60 0.15 0.40 0.03 0.05 0.07
C × 10⁻⁵ 10 10 1.0 10 10 10 10
γ × 10⁵ 1.0 1.0 100 1.0 1.0 1.0 1.0
# of Vectors 57 16 39 15 53 51 49
Dataset B
Training RMSE 0.02 0.21 0.14 0.14 0.10 0.07 0.04
RSD 0.02 0.30 0.21 0.19 0.14 0.10 0.05
R²adj 0.999 0.787 0.902 0.913 0.958 0.976 0.994
Test RMSE 0.20 0.24 0.23 0.18 0.16 0.10 0.15
RSD 0.27 0.33 0.32 0.24 0.21 0.13 0.20
R²adj 0.821 0.736 0.756 0.868 0.913 0.960 0.922
__________________________________________________________________________________
The values of the optimal parameters γ, C and ε as well as the predicted
results of the optimal SVR models are shown in Table 10. For Dataset A, the
coefficient of determination R²adj between the measured and predicted %Gal
for the test set was 0.956 for the all-variable model. The predictive
ability of the model with variables selected by GA gradually increased starting
with 5 variables, reached a maximum at 30 variables, and then receded
beyond this number. The R²adj values for the test set were 0.983, 0.987,
0.988, 0.990, and 0.960 for 5, 10, 20, 30, and 40 variables, respectively.
As with RR and PLSR, SVR model performance was poorer for Dataset B
than for Dataset A. For the all-variable models, the RBF kernel yielded R²adj =
0.999 for the training set, but only 0.821 for the test set, suggesting the
existence of over-fitting. The predictive ability of the models improved
considerably using GA for variable selection with an appropriate number of
variables. A maximum R²adj of 0.960 for the test set was achieved at 30
variables.
4.2 Classification of Pure and Contaminated Heparin Samples
Preliminary screening of contaminated heparin batches collected from
different sources by means of 1H NMR spectroscopy and capillary
electrophoresis (CE) revealed four different groups, i.e., pure heparin with
almost no DS, heparin containing DS in varying amounts but without OSCS,
heparin with OSCS and without DS, and heparin with both OSCS and DS. In
this study, 178 heparin samples from various suppliers were analyzed, where
the DS content is up to 19% of the polymer mixture and the OSCS varies
from 0 to 27%. The new USP specification states that the impurity acceptance
limit for DS is 1.0%, and no any OSCS level is acceptable. Thus, the 178
samples were classified into three groups, i.e., pure heparin with DS ≤ 1%
and OSCS = 0%; impure heparin with DS > 1% and OSCS = 0%; and
contaminated heparin with OSCS > 0% and any content of DS.
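The three-way grouping can be written down directly (a trivial sketch; the function name is illustrative):

```python
def heparin_class(ds_percent, oscs_percent):
    """Three-way grouping used in this study: DS <= 1.0% with no OSCS is
    pure; DS > 1.0% with no OSCS is impure; any detectable OSCS makes the
    sample contaminated regardless of its DS content."""
    if oscs_percent > 0.0:
        return "contaminated"
    return "pure" if ds_percent <= 1.0 else "impure"
```

Note the asymmetry of the rule: DS is tolerated up to a threshold, whereas OSCS has a zero-tolerance limit.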
The high-resolution 1H NMR spectroscopy data were represented as
complex matrices with rows as objects and columns as variables. By applying
multivariate statistical methods and pattern recognition techniques, the
dimensionality of the data can be reduced to facilitate visualization, the
inherent patterns among the sets of spectral measurements can be revealed,
and classification models can be built. Classification, one of the
fundamental methodologies in chemometrics, seeks a mathematical model
that assigns new objects to their proper classes. In this study, the NMR
data were analyzed by both
the unsupervised approaches such as principal component analysis (PCA)
and the supervised ones such as partial least squares discriminant analysis
(PLS-DA) to distinguish the pure and contaminated heparin samples.
4.2.1 Principal Components Analysis
Principal components analysis (PCA) is a nonparametric approach that
reduces a complex dataset to lower dimensions, performs an optimal
coordinate rotation, and maximizes the variance captured along each
successive component [36, 40]. In this
study, PCA is employed to provide an overview of the spectral data, from
which a general picture of the classification of heparin samples into groups
can be acquired. Since PCA preserves most of the variance in just a few
principal components (PCs), this information can be readily
displayed in a graph of reduced dimensions and data can be visualized by
using the scores plots that differentiate samples from various sources based
on the measured properties. The most common way is to project the spectra
into the subspace of PC1 versus PC2 with PC1 along the x-axis and PC2
along the y-axis, where the sample distribution on this graph may reveal
patterns, clusters and other features that might be related to the general
characteristics of the samples [44-46].
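The score computation behind such plots can be sketched in a few lines (an illustrative NumPy example with toy data and hypothetical names, not the software or spectra used in this work):

```python
import numpy as np

def pca_scores(X, n_pc=2):
    """Return the first n_pc principal-component scores of spectra X
    (rows = samples, columns = chemical-shift variables)."""
    Xc = X - X.mean(axis=0)             # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pc] * s[:n_pc]     # project onto the leading PCs
    explained = s**2 / np.sum(s**2)     # fraction of variance per PC
    return scores, explained[:n_pc]

# Toy example: two clusters differing mainly in one "OSCS-like" variable
rng = np.random.default_rng(0)
pure = rng.normal(0.0, 0.1, size=(10, 5))
contam = rng.normal(0.0, 0.1, size=(10, 5))
contam[:, 0] += 2.0                     # contaminated samples shifted
X = np.vstack([pure, contam])
scores, ev = pca_scores(X)
# In a PC1-vs-PC2 scores plot, the two groups separate along PC1
```

Plotting `scores[:, 0]` against `scores[:, 1]` reproduces the kind of PC1-versus-PC2 scores plot described above.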
The PCA scores plots obtained from analysis of the 1H NMR spectra for
representative heparin samples are shown in Figures 20A, 21A and 22A.
Each point on the plots represents one spectrum of an individual sample, and
points of the same color indicate samples of the same origin, such as pure
heparin, heparin with the impurity DS, or heparin with the contaminant OSCS.
The spectra with similar characteristics form a cluster and the variations along
the PC axes maximize the differences between the spectra. The Heparin and
DS samples were not well separated using this approach (Figure 20A). The
Heparin class is located on the upper side while DS class is distributed on the
lower side. The closer the DS content is to 1.0%, the more the two classes
overlap. This result is unsurprising, in view of the NMR spectral similarity
of heparin and DS. For the Heparin and OSCS samples considered together,
the scores plot of PC1 versus PC2 showed that the samples were separated
into two distinct clusters (Figure 21A). The Heparin group formed a tighter
cluster than the OSCS group. Heparin samples were situated on the left side.
By contrast, the contaminant samples were distributed from left to right side
as the content of OSCS increased. For the Heparin vs DS vs OSCS samples
together, the PC1 scores were dominated by OSCS while the variations of
heparin, DS and OSCS led to PC2 variability (Figure 22A). The three types of
samples were separated by the first principal component (PC1), with some
sample overlap. OSCS clustered in a range lying toward the positive side of
PC1, whereas the scores near zero or on the negative side of PC1
corresponded to Heparin, and the DS samples were mostly centered on the
PC1 axis with some samples dispersed on the positive side of the PC1 axis.
To achieve further separation and classify these samples, supervised
pattern recognition analysis was performed.
4.2.2 Partial Least Squares Discriminant Analysis
To optimize separation between heparin and impure or contaminated
samples and to build predictive models for class identification, PLS-DA was
performed using the classes of Heparin, DS or OSCS as the y variables. The
scores plots of the first and the second latent variables are displayed in
Figures 20B, 21B and 22B. With PLS-DA, nearly all samples were in distinct
classes, and a clear discrimination of heparin samples from the DS impurity
and OSCS contaminant was observed. Here, the heparin samples appeared
in a more compact grouping, while the OSCS contaminated samples
exhibited a distribution similar to that in the PCA model. Applying PLS-DA, the
correct classification of these samples in three different groups was obtained
as shown in Figure 22B, where Heparin and DS were located in the upper-
and lower-left zones, respectively, while OSCS was distributed toward the
right side. This supervised clustering approach gave much improved
separation compared with the PCA model, and excellent class discrimination
was achieved between the different types of heparin samples.
After PLS data compression, PLS-DA classification models were built and
tested while increasing the number of PLS components starting at 1. The
number of correct classifications in both the training and test sets was taken
as a measure of performance. Figure 23 illustrates the evolution of the
misclassification rates in the training and test sets as a function of the number
of PLS components in the model. As expected for the training set, the number
of correct classifications increased with the number of dimensions (PCs). For
any model, the misclassification rates were small even with few PLS
components and reached a plateau at which all the rates approached zero
after 20 to 40 components.
Figure 23. Misclassification rate as a function of the number of PLS dimensions for the PLS-DA model. (A) Heparin vs DS; (B) Heparin vs OSCS; (C) Heparin vs [DS + OSCS]; (D) Heparin vs DS vs OSCS.
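A minimal sketch of how such a PLS-DA model can be fit is given below (a NIPALS PLS1 with 0/1 class coding and a 0.5 decision threshold; the toy data and all names are hypothetical, and this is not the implementation used in the study):

```python
import numpy as np

def pls1_coefficients(X, y, n_comp):
    """Minimal NIPALS PLS1: regression coefficients b (for centered X)
    using n_comp latent variables -- a sketch, not a validated library."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # weight vector
        t = Xc @ w                       # scores of this latent variable
        pv = Xc.T @ t / (t @ t)          # X loadings
        qk = (yc @ t) / (t @ t)          # y loading
        Xc = Xc - np.outer(t, pv)        # deflate X and y
        yc = yc - qk * t
        W.append(w); P.append(pv); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.inv(P.T @ W) @ q

# Toy PLS-DA: code the two classes as 0/1 and threshold predictions at 0.5
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = np.array([0] * 20 + [1] * 20, dtype=float)
X[y == 1, 2] += 2.5                      # class difference on one variable
b = pls1_coefficients(X, y, n_comp=2)
yhat = (X - X.mean(axis=0)) @ b + y.mean()
pred = (yhat > 0.5).astype(float)
```

Multi-class PLS-DA, as used in this chapter, extends the same idea by coding each class as a separate 0/1 response column.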
Leave-one-out cross-validation (LOO-CV) was employed to select the
model with the optimal number of PLS components that minimize the
misclassification rate. For LOO-CV, the data set was split into s segments:
the training was performed on the (s - 1) blocks, and the testing was
conducted on the objects belonging to the sth subset. To predict all the
objects, this process was repeated s times through block permutation [104,
145]. Classification rates of 85, 97 and 82% were obtained for Heparin vs DS,
Heparin vs OSCS, and Heparin vs [DS + OSCS] models, respectively. In
addition, a 75% classification rate was attained by the threefold Heparin vs
DS vs OSCS model. The majority of misclassifications between Heparin and
DS involved cases where the DS content was close to the 1.0% DS boundary
between the two classes, as determined by HPLC.
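The s-segment scheme reduces to a simple loop when s equals the number of samples. The sketch below pairs leave-one-out cross-validation with a toy nearest-centroid classifier standing in for the PLS-DA models (illustrative assumptions throughout):

```python
import numpy as np

def loo_cv_accuracy(X, y, classify):
    """Leave-one-out CV: train on n-1 samples, test on the held-out one,
    repeated n times (the s-segment scheme with s = n)."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        correct += classify(X[mask], y[mask], X[i]) == y[i]
    return correct / n

def nearest_centroid(Xtr, ytr, x):
    """Toy classifier: assign x to the class with the closest centroid."""
    classes = np.unique(ytr)
    d = [np.linalg.norm(x - Xtr[ytr == c].mean(axis=0)) for c in classes]
    return classes[int(np.argmin(d))]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 4)), rng.normal(3, 1, (15, 4))])
y = np.array([0] * 15 + [1] * 15)
acc = loo_cv_accuracy(X, y, nearest_centroid)
```

Any classifier with the same three-argument signature can be dropped into the same loop.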
The true test of the model depends on its performance when applied to an
external test set of samples that were not employed for building the model.
Consequently, the model was validated using an external test set of 60
samples. The results, plotted in Figure 23, point to the same conclusions as
described above for the LOO-CV. By increasing the number of PLS
components incrementally, it was observed that the classification rates were
optimal for the Heparin vs DS (84%), Heparin vs OSCS (100%), and Heparin
vs [DS + OSCS] (88%) models when the number of PCs = 2-6, 10-12, and 6-
10, respectively. Even for the threefold Heparin vs DS vs OSCS model, the
classification rate was 85% using 16 PCs.
Table 11. Number and Type of Misclassifications (Errors) by PLS-DA Classification
Model for Test Sets Using Different Numbers of Components
Components 1 2 4 6 8 10 12 14 16 18 20
__________________________________________________________________________________
Model
Heparin vs DS
Heparin errors / 28 samples 4 2 1 1 2 4 4 5 5 5 6
DS errors / 17 samples 5 5 6 6 6 6 7 7 7 8 8
Heparin vs OSCS
Heparin errors / 28 samples 0 0 0 0 0 0 0 1 1 1 1
OSCS errors / 15 samples 3 2 2 1 1 0 0 0 1 1 1
Heparin vs [DS + OSCS]
Heparin errors / 28 samples 3 4 2 1 2 2 3 3 4 5 8
[DS + OSCS] errors / 32 samples 9 6 7 6 5 5 5 5 5 6 8
Heparin vs DS vs OSCS
Heparin errors / 28 samples 4 3 1 1 1 2 3 3 3 3 4
DS errors / 17 samples 7 7 7 8 8 8 7 7 5 7 8
OSCS errors / 15 samples 6 6 5 4 2 1 1 1 1 1 2
The results for the corresponding test sets are presented in Table 11. For
the Heparin vs DS model using 4-6 PCs, misclassification of Heparin as DS
occurred only once and DS as Heparin six times. In nearly all of these cases
the DS content was 1.06-1.20%, i.e., near the 1.0% boundary specifying the
two classes. For the Heparin vs OSCS model using 1-12 PCs, no Heparin
sample was misclassified as OSCS (100% success rate), while
misclassifications of OSCS as Heparin varied from 0 to 3; with 10-12 PCs,
the model made no misclassifications in either direction.
For the Heparin vs [DS + OSCS] model using 8-10 PCs, only two Heparin
samples and five samples in the [DS + OSCS] group were misclassified. As
noted for the Heparin vs DS model, in most cases these misclassifications
occurred when the DS content was near the 1.0% DS boundary defining the
Heparin and DS classes. The same interpretation applies to the threefold
Heparin vs DS vs OSCS model, where most of the misclassifications involved
samples near the 1.0% DS borderline between Heparin and DS. Notably, the
discrimination between the Heparin and OSCS samples was 100%.
4.2.3 Linear Discriminant Analysis
As an alternative approach, linear discriminant analysis (LDA) was
employed to classify the Heparin, DS and OSCS samples based on
predefined classes. For LDA, the variance-covariance matrix must be
inverted, which is impossible if the number of samples is less than the
number of variables [79, 93]. Therefore, a preliminary
variable reduction step is necessary so that the data matrix for each class
presents a high ratio between the number of training samples and the number
of variables. In order to select a subset of the original variables that affords
the maximum improvement of the discriminating ability between classes,
stepwise linear discriminant analysis (SLDA) was performed before LDA
analysis. Preliminary variable reduction using SLDA led to the selection of 20
variables (Table 12).
Table 12. Wilks’ Lambda (Λ) and F-to-enter (F) of Variables (V) for Various Models
__________________________________________________________________________________
Order Heparin vs DS Heparin vs OSCS Heparin vs [DS + OSCS] Heparin vs DS vs OSCS
V (ppm) F Λ V (ppm) F Λ V (ppm) F Λ V (ppm) F Λ
__________________________________________________________________________________
1 2.08 103.0 0.54 2.14 14.0 0.36 2.08 97.1 0.63 2.11 134.3 0.38
2 3.62 15.8 0.48 2.08 15.1 0.33 4.49 23.3 0.55 3.86 30.9 0.28
3 5.34 8.9 0.45 4.49 8.1 0.29 2.14 3.6 0.52 3.53 9.2 0.25
4 2.17 1.7 0.44 4.16 6.7 0.26 4.16 5.3 0.50 4.49 7.1 0.23
5 2.14 2.3 0.43 4.04 5.5 0.24 4.46 3.3 0.49 5.16 10.1 0.20
6 4.61 1.5 0.42 3.56 2.5 0.22 5.16 2.7 0.47 3.59 6.5 0.19
7 2.11 1.1 0.42 4.52 5.2 0.21 5.10 2.6 0.46 2.14 4.1 0.18
8 3.95 2.1 0.41 3.65 4.2 0.20 5.61 2.8 0.46 3.95 4.4 0.17
9 5.67 1.2 0.41 5.61 8.0 0.19 4.28 3.9 0.45 4.46 3.5 0.16
10 4.04 1.9 0.40 5.67 4.0 0.18 3.56 4.1 0.44 5.01 3.5 0.15
11 5.43 1.6 0.40 4.37 1.9 0.18 4.95 2.2 0.43 4.43 3.0 0.15
12 3.71 1.1 0.39 5.25 4.4 0.17 5.49 3.8 0.42 3.71 5.3 0.14
13 4.46 1.7 0.39 3.74 3.1 0.16 4.98 1.9 0.41 5.13 2.1 0.14
14 3.77 1.7 0.39 5.04 3.9 0.15 4.61 2.2 0.40 5.04 2.5 0.13
15 3.74 1.5 0.38 2.17 2.4 0.15 4.22 1.0 0.40 5.46 1.5 0.13
16 5.40 1.7 0.38 5.49 3.5 0.14 5.19 2.2 0.40 4.64 2.1 0.13
17 3.68 1.0 0.37 3.68 2.7 0.14 5.43 1.9 0.39 4.13 1.8 0.12
18 4.01 1.1 0.37 4.10 3.9 0.13 4.34 1.1 0.39 4.16 1.5 0.12
19 5.19 1.6 0.37 5.28 4.0 0.13 5.58 1.4 0.39 4.28 1.9 0.12
20 5.31 1.5 0.36 5.19 2.3 0.12 5.25 1.6 0.38 4.22 1.8 0.11
__________________________________________________________________________________
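The Wilks' lambda criterion underlying SLDA can be computed directly from the within-class and total scatter matrices. The sketch below (toy data, hypothetical names) shows why a discriminating variable receives a smaller Λ than a noise variable:

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' lambda Λ = det(W) / det(T), where W is the pooled
    within-class scatter and T the total scatter of the variables in X.
    Smaller Λ indicates better class separation."""
    mu = X.mean(axis=0)
    T = (X - mu).T @ (X - mu)
    W = np.zeros_like(T)
    for c in np.unique(y):
        Xc = X[y == c]
        W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    return np.linalg.det(W) / np.linalg.det(T)

# Toy data: variable 0 separates the classes, variable 1 is pure noise
rng = np.random.default_rng(3)
y = np.array([0] * 25 + [1] * 25)
X = rng.normal(size=(50, 2))
X[y == 1, 0] += 2.0
lam_disc = wilks_lambda(X[:, [0]], y)    # discriminating variable
lam_noise = wilks_lambda(X[:, [1]], y)   # noise variable
```

Stepwise selection repeatedly adds the variable giving the largest drop in Λ (equivalently, the largest F-to-enter), which is the pattern seen down each column of Table 12.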
After variable selection and dimension reduction, LDA analysis was
conducted using the squared Mahalanobis distance from the centers of
gravity of each group for assigning the class affiliation of each sample. For
the training set, the success rates gradually rose as the number of variables
increased (Table 13). The Heparin vs OSCS model required very few
variables to achieve 100% success rates due to the clear distinction in
Table 13. Performance of LDA Classification Models with Different Numbers of Variables
Number of variables 2 4 6 8 10 12 14 16 18 20
Model
Heparin vs DS
Training set Errors / 87 samples 14 12 10 10 9 9 8 6 5 3
Success rates (%) 84 86 89 89 90 90 91 93 94 97
CV set Errors / 87 samples 15 13 12 12 10 10 12 13 14 14
Success rates (%) 83 85 86 86 89 89 86 85 84 84
Test set Errors / 45 samples 7 6 5 5 6 6 7 8 8 10
Success rates (%) 84 87 89 89 87 87 84 82 82 78
Heparin vs OSCS
Training set Errors / 85 samples 6 4 4 2 1 1 0 0 0 0
Success rates (%) 93 95 95 98 99 99 100 100 100 100
CV set Errors / 85 samples 6 5 4 4 2 0 1 2 3 5
Success rates (%) 93 94 95 95 98 100 99 98 97 94
Test set Errors / 43 samples 2 1 1 1 0 0 1 2 2 3
Success rates (%) 95 98 98 98 100 100 98 95 95 93
Heparin vs [DS + OSCS]
Training set Errors / 118 samples 17 15 14 14 13 13 12 10 9 9
Success rates (%) 86 87 88 88 89 89 90 92 93 93
CV set Errors / 118 samples 19 18 18 16 14 11 10 12 15 17
Success rates (%) 84 85 85 86 88 91 92 90 87 86
Test set Errors / 60 samples 7 6 5 5 4 5 6 6 6 8
Success rates (%) 88 90 92 92 93 92 90 90 90 87
Heparin vs DS vs OSCS
Training set Errors / 118 samples 26 24 21 19 16 14 12 12 10 8
Success rates (%) 78 80 82 84 86 88 90 90 92 93
CV set Errors / 118 samples 28 27 25 19 15 13 16 18 19 21
Success rates (%) 76 77 79 84 87 89 86 85 84 82
Test set Errors / 60 samples 12 11 10 9 6 6 8 8 10 10
Success rates (%) 80 82 83 85 90 90 87 87 83 83
spectral features between heparin and OSCS. Cross validation and external
validation studies indicated that model performance reached a maximum
using an intermediate number of variables. LDA models typically include a
set of tunable parameters, the number of which grows with the number of
variables. Given enough tunable parameters, even complex relationships in
the training samples can be fit quite well, but such over-parameterized
models typically show much higher error rates for the test set than for the
training set, as occurred in the present instance.
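The Mahalanobis classification rule described above can be sketched as follows (an illustrative implementation assuming equal priors and a pooled covariance; note that the inversion requires more training samples than variables, as discussed above):

```python
import numpy as np

def mahalanobis_classify(Xtr, ytr, Xte):
    """Assign each test sample to the class whose centroid has the
    smallest squared Mahalanobis distance, using the pooled within-class
    covariance (the LDA rule with equal priors)."""
    classes = np.unique(ytr)
    n, p = Xtr.shape
    S = np.zeros((p, p))
    means = {}
    for c in classes:
        Xc = Xtr[ytr == c]
        means[c] = Xc.mean(axis=0)
        S += (Xc - means[c]).T @ (Xc - means[c])
    S /= (n - len(classes))              # pooled covariance estimate
    Sinv = np.linalg.inv(S)              # requires n > p
    preds = []
    for x in Xte:
        d2 = {c: (x - means[c]) @ Sinv @ (x - means[c]) for c in classes}
        preds.append(min(d2, key=d2.get))
    return np.array(preds)

# Toy two-class example with well-separated Gaussian clusters
rng = np.random.default_rng(4)
Xtr = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(2, 1, (30, 3))])
ytr = np.array([0] * 30 + [1] * 30)
Xte = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(2, 1, (10, 3))])
yte = np.array([0] * 10 + [1] * 10)
pred = mahalanobis_classify(Xtr, ytr, Xte)
```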
The risks of over-fitting can be alleviated by selecting the optimal number
of variables, which was determined by the success rate of classifications
using LOO-CV and validation with external test sets. Optimal success rates,
varying from 89% to 100%, for the Heparin vs DS, Heparin vs OSCS, Heparin
vs [DS + OSCS] models were achieved using 6-14 variables depending on
the specific model and testing procedure (Table 13). In the same way, the
threefold Heparin vs DS vs OSCS model achieved an optimal success rate of
89% using 10-12 variables. Once again, the majority of misclassifications are
attributed to Heparin and DS samples in which the DS content was near the
1.0% boundary between the two classes.
With respect to classification of individual samples and overall success
rates, the performance of LDA was comparable to PLS-DA for the Heparin vs
OSCS model and superior to PLS-DA for the other three models. For the external
test set under optimal conditions, the success rates for the Heparin vs DS,
Heparin vs [DS + OSCS], and Heparin vs DS vs OSCS models were
respectively 89, 93, and 90% using LDA compared to 84, 88 and 85% using
PLS-DA.
4.2.4 k-Nearest-Neighbor
The kNN method was implemented to evaluate its performance for
classification. Various k values (3, 5 or 7) were tested using the all-variable
data set, and the success rates for the training set, LOO-CV, and the test set
are summarized in Table 14. Overall, the results obtained were inferior for
kNN compared with LDA and PLS-DA. For example, the success rates for the
Heparin vs DS, Heparin vs OSCS, Heparin vs [DS + OSCS], and Heparin vs
DS vs OSCS models using k = 3 were respectively 69, 91, 82 and 68% for the
test set.
To obtain better classification results, the PCA scores were employed as
inputs to build the kNN models. Various combinations of PCs and k values
were investigated, and the results are summarized in Table 15. Unlike the
PLS-DA and LDA models where the misclassification rates for the training set
decreased monotonically to 0% as the number of PCs or variables increased,
the misclassification rates of the kNN models for the training set fluctuated
within a range of values. This fluctuating pattern is commonly observed with
kNN. The optimal performance of the kNN model was achieved using 15-25
PCs depending on the specific model.
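Using PCA scores as kNN inputs can be sketched as follows (illustrative NumPy code with toy data; the projection is computed from the training spectra only, and all names are hypothetical):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    """Classify each test row by majority vote among its k nearest
    training samples (Euclidean distance)."""
    preds = []
    for x in Xte:
        nearest = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
        vals, counts = np.unique(ytr[nearest], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def pca_scores_for_knn(Xtr, Xte, n_pc):
    """Project training and test spectra onto the leading PCs of the
    training set, so kNN votes in the reduced score space."""
    mu = Xtr.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xtr - mu, full_matrices=False)
    return (Xtr - mu) @ Vt[:n_pc].T, (Xte - mu) @ Vt[:n_pc].T

# Toy two-class example in a 10-variable space
rng = np.random.default_rng(5)
Xtr = rng.normal(size=(40, 10))
ytr = np.array([0] * 20 + [1] * 20)
Xtr[ytr == 1, 0] += 3.0
Xte = rng.normal(size=(10, 10))
yte = np.array([0] * 5 + [1] * 5)
Xte[yte == 1, 0] += 3.0
Ttr, Tte = pca_scores_for_knn(Xtr, Xte, n_pc=3)
pred = knn_predict(Ttr, ytr, Tte, k=3)
```

Varying `n_pc` and `k` over a grid mirrors the search over PCs and k values summarized in Table 15.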
Table 14. Performance of kNN Classification Models for Original Data
Model Hep vs DS Hep vs OSCS Hep vs [DS + OSCS] Hep vs DS vs OSCS
k = 3 Training set
Errors / samples 7 / 87 1 / 85 13 / 118 16 / 118
Success rate (%) 92 99 89 86
LOO-CV set
Errors / samples 16 / 87 4 / 85 25 / 118 32 / 118
Success rate (%) 82 95 79 73
Test set
Errors / samples 14 / 45 4 / 43 11 / 60 19 / 60
Success rate (%) 69 91 82 68
k = 5 Training set
Errors / samples 12 / 87 2 / 85 17 / 118 21 / 118
Success rate (%) 86 98 86 82
LOO-CV set
Errors / samples 17 / 87 5 / 85 25 / 118 30 / 118
Success rate (%) 81 94 79 75
Test set
Errors / samples 13 / 45 4 / 43 11 / 60 22 / 60
Success rate (%) 71 91 82 63
k = 7 Training set
Errors / samples 13 / 87 2 / 85 17 / 118 20 / 118
Success rate (%) 85 98 86 83
LOO-CV set
Errors / samples 14 / 87 5 / 85 27 / 118 33 / 118
Success rate (%) 84 94 77 72
Test set
Errors / samples 13 / 45 4 / 43 13 / 60 21 / 60
Success rate (%) 71 91 78 65
Table 15. Performance of PCA-kNN Classification Models with Different Numbers of PCs
PCs 5 10 15 20 25 30 35 40 45 50 55 60
Heparin vs DS (k = 2)
Training set Errors / 87 samples 13 11 7 5 12 8 10 12 10 13 15 14
Success rates (%) 85 87 92 94 86 91 89 86 89 85 83 84
CV set Errors / 87 samples 25 20 17 20 25 25 27 22 29 34 31 33
Success rates (%) 71 77 80 77 71 71 69 75 67 61 64 62
Test set Errors / 45 samples 12 15 16 12 10 14 12 15 15 12 16 19
Success rates (%) 73 67 64 73 78 69 73 67 67 73 64 58
Heparin vs OSCS (k = 4)
Training set Errors / 85 samples 6 3 5 5 9 8 8 11 11 16 13 19
Success rates (%) 93 96 94 94 89 91 91 87 87 81 85 78
CV set Errors / 85 samples 10 13 11 10 14 18 19 25 22 24 25 26
Success rates (%) 88 85 87 88 84 79 78 71 74 72 71 69
Test set Correct / 43 samples 37 38 40 39 39 37 30 33 33 31 30 33
Success rates (%) 86 88 93 91 91 86 70 77 77 72 70 77
Heparin vs [DS + OSCS] (k = 3)
Training set Errors / 118 samples 17 10 13 17 19 11 16 14 18 17 19 25
Success rates (%) 86 92 89 86 84 91 86 88 85 86 84 79
CV set Errors / 118 samples 23 30 26 34 33 39 31 28 34 36 34 43
Success rates (%) 81 75 78 71 72 67 74 76 71 69 71 64
Test set Errors / 60 samples 13 13 12 9 17 15 17 19 23 22 22 21
Success rates (%) 78 78 80 85 72 75 72 68 62 63 63 65
Heparin vs DS vs OSCS (k = 3)
Training set Errors / 118 samples 18 13 19 23 22 17 21 21 23 23 25 32
Success rates (%) 85 89 84 81 81 86 82 82 81 81 78 73
CV set Errors / 118 samples 30 39 32 42 42 40 43 43 47 41 46 52
Success rates (%) 75 67 73 64 64 66 64 64 60 65 61 56
Test set Errors / 60 samples 21 19 18 15 20 23 23 23 25 24 27 27
Success rates (%) 65 68 70 75 67 62 62 62 58 60 55 55
126
The misclassification rates for nearest neighbors k from 1 to 25 are plotted
in Figure 24. The black dots and the vertical bars represent the means as well
as mean ±1 standard error for the misclassification rates using LOO-CV. The
smallest LOO-CV error is depicted by a dotted horizontal line corresponding
to the position of the mean plus one standard error. For the training sets, the
misclassification rate was always zero for k = 1 and increased with larger k
values for all four models. The test sets showed a similar pattern, i.e., the
misclassification rates varied within a tight range, except the Heparin vs
OSCS model, for which the rates rose for k > 4. The optimal k values were 2,
4, 3 and 3 for the Heparin vs DS, Heparin vs OSCS, Heparin vs [DS + OSCS],
and Heparin vs DS vs OSCS models, respectively.
When the predictive ability was evaluated for the external test set based
on the above analysis for different numbers of PCs and a series of k values,
the optimal success rates were 78, 93, 83 and 75% for the four models as
shown in Table 15. For the Heparin vs DS model, one heparin sample was
misclassified as DS but nine out of the seventeen DS test samples were
misclassified as Heparin. Unlike PLS-DA and LDA, kNN was unable to
completely discriminate Heparin and OSCS. For the Heparin vs [DS + OSCS]
model, three Heparin samples were misclassified as [DS + OSCS] while six
DS samples and one OSCS sample were misclassified as Heparin. Likewise
for the threefold Heparin vs DS vs OSCS model, kNN produced a total of
fifteen misclassifications.
Figure 24. kNN classification over the range k = 1 to 25. (A) Heparin vs DS (PCs = 25); (B) Heparin vs OSCS (PCs = 15); (C) Heparin vs [DS+OSCS] (PCs = 20); (D) Heparin vs DS vs OSCS (PCs = 20).
4.2.5 Classification and Regression Tree
Classification tree models were built using the three data sets, composed
of 9, 65 and 74 variables corresponding to the three regions 1.95-2.20, 3.10-
5.70 and 1.95-5.70 ppm, respectively. The four known classes (Heparin, DS,
OSCS and [DS + OSCS]) were used as response variables. The trees were
grown and pruned using the Gini index as a splitting criterion and the optimal
size of the tree was determined using 10-fold cross validation (CV), in which
the samples are randomly divided into 10 segments, and then a model is built
on nine segments and the remaining one is used for evaluating the predictive
power until each segment has been used once as a test set. For Heparin vs
DS vs OSCS in the region of 1.95-2.20 ppm, the division of the samples by
the nodes of the classification tree is shown in Figure 25A. The data were
split according to 2.08 and 2.15 ppm, the characteristic chemical shifts of DS
and OSCS, respectively. The first split is defined by the variable at 2.15
ppm, which separated the samples into two groups, (Heparin + DS) and OSCS;
the variable at 2.08 ppm then divided the (Heparin + DS) samples into two
separate classes: Heparin and DS, leading to a classification tree with a
complexity of three nodes (Figure 25C).
Each terminal node is labeled with the majority class of the samples it
contains. The OSCS terminal node is called a pure node in that it contains only
samples of the OSCS class, i.e., all of the 31 OSCS samples are correctly
classified and no Heparin or DS samples are located in this terminal. The
Figure 25. Classification trees and their corresponding complexity parameter CP for model Heparin vs DS vs OSCS. (A) and (C): the region of 1.95-2.20 ppm; (B) and (D): the region of 3.10-5.70 ppm.
(Heparin + DS) group was split into the DS and Heparin classes solely by the
chemical shift 2.08 ppm. Both of these terminal nodes contain
misclassifications. The DS node contains two Heparin samples, while the
Heparin node contains six DS samples. The classification rates, summarized
in Table 16, were 93.2% (110/118) for the training set (8 misclassifications)
and 90.0% (54/60) for the test set (6 misclassifications).
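The Gini splitting criterion used to grow these trees can be illustrated with a short sketch (toy arrays only; the second column plays the role of the discriminating 2.15 ppm variable):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector: 1 - sum of squared class shares."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    """Exhaustive search for the single (variable, threshold) split that
    minimizes the weighted Gini impurity of the two child nodes."""
    best_j, best_t, best_g = None, None, np.inf
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():     # skip degenerate splits
                continue
            g = (left.sum() * gini(y[left])
                 + (~left).sum() * gini(y[~left])) / n
            if g < best_g:
                best_j, best_t, best_g = j, t, g
    return best_j, best_t, best_g

# Toy data: column 0 is uninformative, column 1 separates the classes
y = np.array([0] * 6 + [1] * 4)
X = np.column_stack([
    np.array([0., 1, 0, 1, 0, 1, 0, 1, 0, 1]),         # noise
    np.concatenate([np.full(6, 0.1), np.full(4, 1.0)])  # "2.15 ppm"
])
j, t, g = best_split(X, y)   # finds the pure split on column 1
```

Growing a full tree applies this search recursively to each child node, followed by cost-complexity pruning as described above.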
When modeling the data set of the 3.10-5.70 ppm region, the resulting tree
was slightly more complex, consisting of five terminal nodes (Figure 25B).
The variables splitting the data are 3.53, 3.95, 4.48 and 5.67 ppm. Variable
Table 16. Model Parameters and Classification Rates for CART
__________________________________________________________________________________
Model Region (ppm) Nodes Variables (ppm) Training (%) Test (%)
__________________________________________________________________________________
Heparin vs DS 1.95 - 5.70 2 2.08 90.8 (79/87) 88.9 (40/45)
1.95 - 2.20 2 2.08 90.8 (79/87) 88.9 (40/45)
3.10 - 5.70 3 3.53, 3.86 83.9 (73/87) 80.0 (36/45)
Heparin vs OSCS 1.95 - 5.70 2 2.15 100 (85/85) 100 (43/43)
1.95 - 2.20 2 2.15 100 (85/85) 100 (43/43)
3.10 - 5.70 2 4.48 97.6 (83/85) 97.7 (42/43)
Heparin vs [DS + OSCS] 1.95 - 5.70 3 2.08, 2.15 91.5 (108/118) 90.0 (54/60)
1.95 - 2.20 3 2.08, 2.15 91.5 (108/118) 90.0 (54/60)
3.10 - 5.70 5 3.53, 3.95, 4.48, 5.67 89.8 (106/118) 78.3 (47/60)
3.10 - 5.70 4 3.53, 3.95, 4.48 88.1 (104/118) 83.3 (50/60)
Heparin vs DS vs OSCS 1.95 - 5.70 3 2.08, 2.15 93.2 (110/118) 90.0 (54/60)
1.95 - 2.20 3 2.08, 2.15 93.2 (110/118) 90.0 (54/60)
3.10 - 5.70 5 3.53, 3.95, 4.48, 5.67 88.1 (104/118) 80.0 (48/60)
3.10 - 5.70 4 3.53, 3.95, 4.48 86.4 (102/118) 85.0 (51/60)
__________________________________________________________________________________
4.48 ppm split off the OSCS class from Heparin and DS; the variables at
3.53, 3.95 and 5.67 ppm then sequentially divided the samples on the left
side into two
separate classes: Heparin and DS. Figure 25D shows the evolution of the
relative error (RE, vertical axis) and complexity parameter (CP, horizontal)
with the tree size, where the dashed line represents the standard errors. The
RE decreases as the number of terminal nodes increases, having its lowest
value for a tree with five terminal nodes. On the basis of the lowest cost-
complexity measure, the optimal sized tree is the one with five nodes. In order
to select a simpler tree than the one with the minimum CV error, the one-
standard-error (1-SE) rule is applied: the optimal tree is the simplest one
whose CV error lies within one standard error of the minimal CV error. As
shown in Figure 25D, the tree with the lowest error appeared at size 5,
whereas the tree of size 4, with a CP of 0.058, is a simpler one within 1-SE
of the size-5 tree.
Although the tree of size 4 was slightly less accurate than the tree of size
5 for the training set (86.4% versus 88.1%), the former yielded an
improved predictive rate for the test set (85.0% versus 80.0%). Consequently,
the pruned tree is more appropriate for prediction purposes. It should be
noted that both results are poorer than those from the region 1.95-2.20 ppm.
With respect to Heparin vs DS, the corresponding model has two terminal
nodes by splitting the data using 2.08 ppm for both 1.95-2.20 and 1.95-5.70
ppm. The success rates of 90.8% (79/87) and 88.9% (40/45) were achieved
for training and test sets, respectively. For the region of 3.10-5.70 ppm,
chemical shifts 3.53 and 3.86 ppm were selected to divide the data, leading to
a success rate of 83.9% (73/87) for the training set and 80.0% (36/45) for the
test set. These trees have no pure nodes, meaning that absolute
discrimination between Heparin and DS was not achieved by CART. For the
model Heparin vs OSCS, the classification tree presents two terminal nodes,
splitting the data of the 1.95-2.20 or 1.95-5.70 ppm region at 2.15 ppm. As
a result,
both Heparin and OSCS samples were classified on their respective terminal
nodes on the classification tree, giving a perfect separation of the two groups
(100% discrimination). By comparison, with 4.48 ppm as the splitting
variable, the accuracies for the 3.10-5.70 ppm region are 97.6% (83/85) for
the training set and 97.7% (42/43) for the test set. For the case of Heparin
vs [DS + OSCS], a model with tree size 3 is built by splitting at 2.08 and
2.15 ppm, as with Heparin vs DS vs OSCS, for both 1.95-2.20 and
1.95-5.70 ppm. The predictive ability of this model was 91.5% (108/118) for
the training set and 90.0% (54/60) for the test set (Table 16). For the 3.10-
5.70 ppm region, a classification tree with five terminal nodes was obtained
for the discrimination of Heparin from [DS + OSCS] by selecting four variables
3.53, 3.95, 4.48, and 5.67 ppm to divide the data, resulting in a tree very
similar to that of Heparin vs DS vs OSCS, and the test set of 60 samples was
predicted with 83.3% (50/60) accuracy.
Analysis of the above results reveals that the predictive and discrimination
ability is much better with trees built from 1.95-2.20 and 1.95-5.70 ppm than
from 3.10-5.70 ppm. In addition, the discrimination results are exactly the
same using the entire region 1.95-5.70 ppm as using the local region 1.95-
2.20 ppm. Although the 1.95-5.70 ppm region contains more variables (74)
and many more details in terms of chemical shifts, the CART model selected
only 2.08 and 2.15 ppm as the splitting variables and ignored the 3.10-5.70
ppm region entirely, suggesting that the N-acetyl methyl proton chemical
shifts (1.95-2.20 ppm) play a critical role in discriminating heparin from
its impurities and contaminants in the CART model.
4.2.6 Artificial Neural Networks
A three-layer feed-forward network trained with a back propagation
algorithm was investigated to optimize separation between pure, impure and
contaminated heparin samples, and to build predictive models for class
identification. The input layer contained one neuron per independent
variable, i.e., 9, 65 and 74 chemical-shift variables for the 1.95-2.20,
3.10-5.70 and 1.95-5.70 ppm data sets, respectively, and the output layer
corresponded to the four classes Heparin, DS, OSCS and [DS + OSCS]. The
number of neurons in the hidden layer was
varied to assess its influence on network performance. Too few hidden
neurons lead to poor generalization and an unstable model, whereas with
too many hidden neurons the neural network
will overfit the training data. The sigmoid transfer function was exclusively
employed for activation in both hidden and output layers. The output from the
ANN is a prediction of the class membership in the samples of each class,
consisting of a matrix Ŷ with the same dimensions as the dependent variable
Y that contains the binary values of 1 or 0 for each class and comprises as
many columns as there are classes. The numeric value of element ŷij in Ŷ is
in an interval between 0 and 1, which can be regarded as an estimate of the
probability for assigning the ith sample to the jth class. If the output value is
close to 1, then the test sample is ascribed to the modeled class while the
sample is assigned to other classes if the value is close to 0.
For ANN, a commonly used error function is the cross entropy or deviance
defined in Equation 44 [36]:
Minimize:  −Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log(ŷ_ij)                    (44)
Since ANN is very sensitive to overfitting, a regularization term, called weight
decay, is introduced. The modified criterion is given by Equation 45:
Minimize:  −Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log(ŷ_ij) + λ Σ (parameters)²                    (45)
where “parameters” represents the values of all parameters that are used in
the ANN training. Therefore, the second term takes into account the
magnitude of all the parameters. The tuning parameter λ controls the
strength of the shrinkage constraint on the parameters. When λ is zero
(i.e., no weight decay) or small, the
boundary or edge between classes is rough or non-smooth, leading to
overfitting of the model, while a smoother boundary is yielded as the weight
decay increases.
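Equations 44 and 45 can be checked numerically with a short sketch (hypothetical one-hot targets, predicted probabilities, and stand-in weights):

```python
import numpy as np

def penalized_cross_entropy(Y, Yhat, params, lam):
    """Cross-entropy deviance of Eq. 44 plus the weight-decay penalty of
    Eq. 45: -sum_ij y_ij * log(yhat_ij) + lam * sum(params**2)."""
    eps = 1e-12                          # guard against log(0)
    return -np.sum(Y * np.log(Yhat + eps)) + lam * np.sum(np.square(params))

# One-hot targets for three samples / three classes, imperfect predictions
Y = np.eye(3)
Yhat = np.array([[0.8, 0.1, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.2, 0.7]])
w = np.array([0.5, -1.0, 2.0])           # stand-in network weights
loss_no_decay = penalized_cross_entropy(Y, Yhat, w, lam=0.0)
loss_decay = penalized_cross_entropy(Y, Yhat, w, lam=0.1)
```

With λ = 0 only the deviance remains; increasing λ adds a penalty proportional to the squared weights, which is what shrinks the parameters and smooths the class boundary.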
For ANN classification, the number of hidden units and the weight decay
need to be optimized, which can be done through cross validation. Figure 26
shows the relationship between the misclassification rate and both the
weight decay and the number of hidden neurons for the training set, test set
and 10-fold CV process for Heparin vs DS vs OSCS with the data set of 1.95-
5.70 ppm. In order to investigate the influence of the number of neurons in
the hidden layer on the prediction accuracy, ANNs with 3 to 30 hidden
neurons were developed with the weight decay fixed at 0.1. The prediction
results are plotted as a function of the number of hidden units in Figure
26A, which shows that 9 hidden neurons are optimal. The
dependency of the error rate on the weight decay λ for 9 hidden units is
depicted in Figure 26B.
The optimal values of these parameters and the results of the ANN
analyses with the various types of input data are presented in Table 17. For
Heparin vs DS vs OSCS with the optimal settings of λ = 0.075 and 6 hidden
neurons, this ANN achieved a classification rate of 96.6% (114/118) for the
training set and a prediction accuracy of 91.7% (55/60), with only five test
samples misclassified, for the 1.95-2.20 ppm range.
prediction accuracy for the training set and test set corresponded to 95.8%
Figure 26. Variation of the ANN misclassification error with the number of hidden units and the weight decay for the Heparin vs DS vs OSCS model, for the data set in the 1.95-5.70 ppm range. (A) Weight decay fixed at λ = 0.1; (B) number of hidden units fixed at 9.
Table 17. Model Parameters and Classification Rates for ANN
__________________________________________________________________________________
Model                    Region (ppm)   Hidden size   Weight decay λ   Training %       Test %
__________________________________________________________________________________
Heparin vs DS            1.95 - 5.70    9             5.0 × 10⁻¹       95.4 (83/87)     86.7 (39/45)
                         1.95 - 2.20    6             5.0 × 10⁻²       97.7 (85/87)     88.9 (40/45)
                         3.10 - 5.70    9             3.0 × 10⁻¹       96.6 (84/87)     84.4 (38/45)
Heparin vs OSCS          1.95 - 5.70    9             1.0 × 10⁻¹       100 (85/85)      100 (43/43)
                         1.95 - 2.20    6             2.0 × 10⁻²       100 (85/85)      100 (43/43)
                         3.10 - 5.70    9             1.0 × 10⁻¹       100 (85/85)      100 (43/43)
Heparin vs [DS + OSCS]   1.95 - 5.70    9             2.5 × 10⁻¹       94.9 (112/118)   91.7 (55/60)
                         1.95 - 2.20    6             3.0 × 10⁻²       99.2 (117/118)   91.7 (55/60)
                         3.10 - 5.70    9             2.0 × 10⁻¹       99.2 (117/118)   90.0 (54/60)
Heparin vs DS vs OSCS    1.95 - 5.70    9             9.0 × 10⁻¹       95.8 (113/118)   88.3 (53/60)
                         1.95 - 2.20    6             7.5 × 10⁻²       96.6 (114/118)   91.7 (55/60)
                         3.10 - 5.70    9             8.0 × 10⁻¹       93.2 (110/118)   86.7 (52/60)
__________________________________________________________________________________
(113/118) and 88.3% (53/60) for the 1.95-5.70 ppm range and, similarly,
93.2% (110/118) and 86.7% (52/60) for the 3.10-5.70 ppm range, as
reflected by the numbers of misclassified samples indicated in Table 17. For
Heparin vs OSCS, the ANN model classified all members of the training and
test sets correctly with 100% prediction accuracy. The prediction rates for
the Heparin vs DS model for the three regions are very close with 95.4-97.7%
for the training set and 84.4-88.9% for the test set. For the Heparin vs [DS +
OSCS] model, the prediction rates of the various networks were quite similar
at 90.0-91.7% for the three regions as summarized in Table 17. In general,
the performance of the models was slightly better for those built from the
1.95-2.20 ppm than from either the 3.10-5.70 or 1.95-5.70 ppm regions.
4.2.7 Support Vector Machine
Using the same training and test sets as for CART and ANN, the SVM
algorithm with the non-linear soft margin was employed to build classification
models. For SVM classification with the RBF kernel, the optimization requires
two parameters to be specified, i.e., the width of the kernel function γ and the
regularization parameter C. Their combination determines the boundary
complexity and thus the classification performance, i.e., the prediction ability.
Cross-validation (CV) is widely used to determine the parameters for
evaluating the performance of the model and minimizing the risk of overfitting.
The parameters C and γ are optimized by the user, and the optimal values
are obtained by performing an exhaustive grid search with 10-fold CV on the
training set using their various combinations. The set of C and γ values giving
the highest percentage accuracy or the lowest error rate is selected for further
analysis. In this study, a wide range of γ and C values were tuned
simultaneously in a 9 × 9 grid of 81 possible combinations, with C from 1 to 10⁸
and γ from 10⁻⁸ to 1. After all the combinations have been searched, a
contour plot is created in decimal logarithmic scales, which indicates the
prediction accuracy or classification error. Figure 27 presents the optimization
grids in terms of cross validation classification rate for the models Heparin vs
DS vs OSCS and Heparin vs OSCS. The two coarse grid plots of γ and C
values delineate regions where the optimal parameter settings might be
located. The two deep red “islands” in Figure 27A correspond to the lowest
prediction error for Heparin vs DS vs OSCS, reflecting the difficulty in tuning
the γ and C values to achieve optimal discrimination of Heparin, DS and
OSCS. In contrast, the large red stripe in Figure 27B reflects the relative ease
in tuning the γ and C values for optimal discrimination of Heparin vs OSCS. In
order to obtain higher resolution, this range of γ and C values is further refined
to arrive at the final SVM model. The SVM model of Heparin vs DS vs OSCS
with the optimum values C = 6.0 × 10⁴ and γ = 1.0 × 10⁻³ for 1.95-5.70 ppm
gave the best classification performance. For Heparin vs OSCS, the optimal
parameter settings C = 1.0 × 10³ and γ = 1.0 × 10⁻⁴ led to perfect
discrimination.
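The grid search described above can be sketched as follows (assuming scikit-learn; synthetic stand-in data, with the same log-spaced 9 × 9 grid of C from 1 to 10⁸ and γ from 10⁻⁸ to 1):

```python
# Sketch: exhaustive 9 x 9 grid search with 10-fold CV for RBF-kernel SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the NMR data matrix
X, y = make_classification(n_samples=178, n_features=74, random_state=1)

param_grid = {"C": np.logspace(0, 8, 9),        # 1 ... 1e8
              "gamma": np.logspace(-8, 0, 9)}   # 1e-8 ... 1
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print("optimal C =", search.best_params_["C"],
      ", gamma =", search.best_params_["gamma"],
      f", CV classification rate = {search.best_score_:.3f}")
```

The `cv_results_` attribute of the fitted search holds the score at every grid point, which is what a contour plot such as Figure 27 visualizes on log axes.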
Using the optimal paired values of γ and C, the results from SVM are
summarized in Table 18. The prediction accuracy exceeded 90% in all cases
for all data sets. Most samples were classified correctly, and the models
generally produced no more than three to five misclassifications. It
is worth noting that SVM achieved nearly identical results for both the 1.95-
2.20 ppm and 3.10-5.70 ppm regions, giving credence to its ability to
differentiate even subtle structural differences between pure, impure, and
contaminated heparin. In contrast, visual inspection of the Heparin, DS, and
OSCS spectra (Figure 13) clearly reveals distinctions in the 1.95-2.20 ppm
region but not in the 3.10-5.70 ppm region.
Figure 27. Contour plots in decimal logarithmic scales obtained from 9 × 9 grid search of the optimal values of γ and C for the SVM model. (A) Heparin vs DS vs OSCS for the 1.95-5.70 ppm region; (B) Heparin vs OSCS for the 1.95-5.70 ppm region.
Table 18. Model Parameters and Classification Rates for SVM
__________________________________________________________________________________
Model                    Region (ppm)   C           γ            Training %       Test %
__________________________________________________________________________________
Heparin vs DS            1.95 - 5.70    2.0 × 10³   1.0 × 10⁻⁴   97.7 (85/87)     91.1 (41/45)
                         1.95 - 2.20    1.0 × 10⁴   1.0 × 10⁻³   96.6 (84/87)     93.3 (42/45)
                         3.10 - 5.70    1.8 × 10⁴   1.0 × 10⁻⁴   97.7 (85/87)     93.3 (42/45)
Heparin vs OSCS          1.95 - 5.70    1.0 × 10³   1.0 × 10⁻⁴   100 (85/85)      100 (43/43)
                         1.95 - 2.20    1.0 × 10³   1.0 × 10⁻⁴   100 (85/85)      100 (43/43)
                         3.10 - 5.70    1.0 × 10³   1.0 × 10⁻⁴   100 (85/85)      100 (43/43)
Heparin vs [DS + OSCS]   1.95 - 5.70    8.0 × 10⁴   1.0 × 10⁻⁵   98.3 (116/118)   95.0 (57/60)
                         1.95 - 2.20    1.0 × 10⁷   2.0 × 10⁻⁴   97.5 (115/118)   95.0 (57/60)
                         3.10 - 5.70    1.0 × 10⁵   1.8 × 10⁻⁵   98.3 (116/118)   95.0 (57/60)
Heparin vs DS vs OSCS    1.95 - 5.70    6.0 × 10⁴   1.0 × 10⁻³   97.5 (115/118)   95.0 (57/60)
                         1.95 - 2.20    2.0 × 10⁵   1.0 × 10⁻³   99.2 (117/118)   95.0 (57/60)
                         3.10 - 5.70    1.5 × 10⁵   1.0 × 10⁻⁵   98.3 (116/118)   95.0 (57/60)
__________________________________________________________________________________
4.2.8 Analysis of Misclassifications
As shown in Tables 16, 17 and 18, the predictive abilities of the
classification models built from CART, ANN and SVM were outstanding in
differentiating Heparin from DS and OSCS with few errors. In particular,
higher predictive accuracies or fewer misclassifications were attained for the
Heparin vs OSCS model than for Heparin vs DS, Heparin vs [DS + OSCS]
and Heparin vs DS vs OSCS models. While all three pattern recognition
approaches were able to completely discriminate Heparin and OSCS with
success rates of 100% under optimal conditions, for the other models it can
be seen by cross comparison from Tables 16 to 18 that using the same input
variables, the model generated from the SVM algorithm consistently
outperformed ANN, which in turn marginally outperformed CART. When the
entire chemical shift region was divided into two subsets (1.95-2.20 and 3.10-
5.70 ppm), better results were achieved for the former than the latter region.
The sole exception was SVM, which achieved nearly identical results from
both regions. SVM performed better in every respect, as can be appreciated
by comparing the misclassification rates in Tables 16-18. Taking the Heparin vs
DS vs OSCS model for the region of 3.10-5.70 ppm as an example, the
success rates of the training set and test set were appreciably higher for SVM
(98.3% and 95.0%) than for CART (86.4% and 85.0%) and ANN (93.2% and
86.7%).
Tables 19-21 summarize the results of the classification matrices
evaluated by means of both training and test sets in the region of 1.95-5.70
ppm. All of the misclassified samples were between Heparin and DS: several
samples belonging to Heparin were predicted as DS, while some DS samples
were predicted as Heparin. Using SVM, only one Heparin sample was
misclassified as DS and three DS samples were misclassified as Heparin for
the Heparin vs DS model in the test set (Table 19). Misclassification of
Heparin as DS occurred only once and DS as Heparin twice for the Heparin
vs [DS + OSCS] model (Table 20). The same pattern held for the three-class
Heparin vs DS vs OSCS model, in which SVM produced a total of three
misclassifications (Table 21).
When examining the misclassified samples, it was noted that in most
cases the misclassifications occurred when the DS content of the sample
ranged from 0.90% to 1.20%, i.e., close to the DS = 1.0% impurity limit
defining the Heparin and DS classes; such borderline samples are hard to
distinguish from each other because of the similarity of the 1H NMR spectral
patterns of Heparin and DS. When these borderline samples were removed
from the data set, the overall performance of the models improved greatly,
with very few misclassifications remaining, especially for the SVM model,
where only one sample was misclassified in the test set (Tables 19-21).
Table 19. Classification Matrices for the Heparin vs DS Model in the 1.95-5.70 ppm Region
__________________________________________________________________________________
                    All samples                     After removing borderline samples
             Training set     Test set          Training set     Test set
             Heparin   DS     Heparin   DS      Heparin   DS     Heparin   DS
__________________________________________________________________________________
CART Heparin    52     6         25     2          48     3         23     0
     DS          2     27         3     15          2     23         2     13
ANN  Heparin    52     2         27     5          50     0         25     2
     DS          2     31         1     12          0     26         0     11
SVM  Heparin    53     1         27     3          50     0         25     1
     DS          1     32         1     14          0     26         0     12
__________________________________________________________________________________
Table 20. Classification Matrices for the Heparin vs [DS + OSCS] Model
in the 1.95-5.70 ppm Region
__________________________________________________________________________________
                         All samples                          After removing borderline samples
                  Training set           Test set             Training set           Test set
                  Hep  [DS + OSCS]       Hep  [DS + OSCS]     Hep  [DS + OSCS]       Hep  [DS + OSCS]
__________________________________________________________________________________
CART Heparin       49       5             24       2           46       3             23       0
     [DS + OSCS]    5      59              4      30            4      54              2      28
ANN  Heparin       52       4             27       4           50       0             24       2
     [DS + OSCS]    2      60              1      28            0      57              1      26
SVM  Heparin       52       1             27       2           50       0             25       1
     [DS + OSCS]    2      63              1      30            0      57              0      27
__________________________________________________________________________________
Table 21. Classification Matrices for the Heparin vs DS vs OSCS Model
in the 1.95-5.70 ppm Region
__________________________________________________________________________________
                     All samples                        After removing borderline samples
              Training set       Test set           Training set       Test set
              Hep  DS  OSCS      Hep  DS  OSCS      Hep  DS  OSCS      Hep  DS  OSCS
__________________________________________________________________________________
CART Heparin   52   6   0         25   2   0         48   3   0         23   1   0
     DS         2  27   0          3  14   0          2  23   0          2  12   0
     OSCS       0   0  31          0   1  15          0   0  31          0   0  15
ANN  Heparin   52   3   0         25   5   0         50   0   0         25   2   0
     DS         2  30   0          3  12   0          0  26   0          0  11   0
     OSCS       0   0  31          0   0  15          0   0  31          0   0  15
SVM  Heparin   53   0   0         27   2   0         50   1   0         25   1   0
     DS         1  33   0          1  15   0          0  25   0          0  12   0
     OSCS       0   0  31          0   0  15          0   0  31          0   0  15
__________________________________________________________________________________
4.2.9 Classification Analysis of Heparin Spiked with other GAGs
Heparin APIs may contain GAG impurities other than dermatan sulfate
(DS), such as chondroitin sulfate A (CSA) and heparan sulfate (HS), and
other possible synthetic oversulfated contaminants that mimic the functions
of heparin could also be found in heparin lots. In order to assess the
capability of the developed models to discriminate and detect a wide range of
potential GAG-like impurities and contaminants previously unseen in the
heparin samples, a series of blends was prepared by spiking heparin APIs
with native impurities CSA, DS and HS, as well as their partially- or fully-
oversulfated (OS) versions OS-CSA (i.e., OSCS), OS-DS, OS-HS and OS-
heparin at the 1.0%, 5.0% and 10.0% weight percent levels [15], and the
resulting multivariate statistical models were used to test their class
assignations for the Heparin vs DS vs OSCS model. The blend samples are
highly diverse in composition when compared to the clearly defined Heparin,
DS and OSCS classes, since they contain multiple components with varying
degrees of sulfation and concentration from 1% to 10% as shown in Table 22.
For exploratory purposes, agglomerative hierarchical cluster analysis
(HCA) was performed on the 30 blend samples. As an unsupervised
technique, HCA describes the nearness between objects, identifies specific
differences, finds natural groupings of the data set, and allows the
visualization of the relationships between objects in the form of a dendrogram
[112, 115, 146]. The procedure starts by setting each object in its own cluster,
and then two objects closest together are joined, followed by the next step in
which either a third object joins the just formed cluster, or two clusters join
together into a new cluster. Each step yields one cluster fewer than the
previous step. The iterative procedure repeats until all objects are merged
into a single cluster. HCA analysis was implemented using the Euclidean
distance for measuring the similarity among blend samples with average
linkage for merging the clusters. Figure 28 depicts the hierarchical clustering
of the blend samples in the 1.95-5.70 ppm region. Two distinct clusters can
be observed in this dendrogram, formed according to the content of GAGs.
The cluster on the left side included the samples with low GAG content (1%),
while the samples with high GAG content (5% and 10%) comprised the
right-side cluster, which consists of two sub-clusters: one of the native GAGs,
i.e., CSA (B1 and B2), DS (B4 and B5) and HS (B7 and B8), and another of
the oversulfated GAGs, where samples with the same GAG composition lay
close to each other and clustered in pairs.
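The clustering step behind Figure 28 can be sketched with SciPy (illustrative only; random stand-in spectra rather than the actual blend data):

```python
# Sketch: agglomerative HCA with Euclidean distances and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
spectra = rng.normal(size=(30, 74))   # 30 blends x 74 chemical-shift bins (stand-in)

Z = linkage(pdist(spectra, metric="euclidean"), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
print("cluster sizes:", np.bincount(labels)[1:])
# dendrogram(Z) would draw the tree shown in Figure 28
```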
The test results obtained in the identification of the blend samples using
the resulting models from CART, ANN and SVM are summarized in Table 22.
Blend samples B28-30 (blank or control samples), B4-6 (DS) and B10-12
(OS-CSA) correspond to the classes Heparin, DS and OSCS, respectively.
As expected, all of them were correctly classified into their respective classes.
All other blends, by nature, do not belong to any of the designated classes, but
they nonetheless have to be assigned to one. As can be seen, some blends containing low
levels (1%) of GAGs were assigned to Heparin, most of the native impurities
(CSA and HS) were classified as DS, while the blends with oversulfated
synthetic compounds were assigned to OSCS, except for several
samples with low content (1%). Overall, the models can distinguish between
pure heparin and unacceptable samples.
Figure 28. Dendrogram of the series of blends of heparin spiked with other GAGs, generated from their Euclidean distances with average linkage.
Table 22. Compositions of the Series of Blends of Heparin Spiked with other GAGs and
Test Results for Classification from SVM, CART and ANN in the 1.95-5.70 ppm Region
__________________________________________________________________________________
ID    GAG        Content (%)   SVM   CART   ANN
                               (classified as Heparin (H), DS (D) or OSCS (O))
__________________________________________________________________________________
B1    CSA        10            D     D      D
B2    CSA        5             D     D      D
B3    CSA        1             D     D      D
B4    DS         10            D     D      D
B5    DS         5             D     D      D
B6    DS         1             D     D      D
B7    HS         10            D     D      D
B8    HS         5             D     D      D
B9    HS         1             H     H      H
B10   FS-CSA     10            O     O      O
B11   FS-CSA     5             O     O      O
B12   FS-CSA     1             O     D      D
B13   FS-DS      10            O     O      O
B14   FS-DS      5             O     O      O
B15   FS-DS      1             D     D      D
B16   OS-HS      10            O     O      O
B17   OS-HS      5             O     O      O
B18   OS-HS      1             H     H      H
B19   OS-Hep     10            O     O      O
B20   OS-Hep     5             O     O      O
B21   OS-Hep     1             H     H      H
B22   PS-CSA#1   10            O     O      O
B23   PS-CSA#1   5             O     O      D
B24   PS-CSA#1   1             D     H      H
B25   PS-CSA#2   10            O     O      O
B26   PS-CSA#2   5             O     O      D
B27   PS-CSA#2   1             D     H      H
B28   Blank      -             H     H      H
B29   Blank      -             H     H      H
B30   Blank      -             H     H      H
__________________________________________________________________________________
CSA: chondroitin sulfate A; DS: dermatan sulfate; HS: heparan sulfate; FS: fully sulfated;
OS: oversulfated; PS: partially sulfated; Blank: control (pure heparin sample). The weight
percent sulfur for PS-CSA#1 and PS-CSA#2 is 11.01% and 11.14%, respectively.
4.3 Class Modeling for Discriminating Heparin Samples
Previously we explored the ability of pure classification methods, i.e.,
principal component analysis (PCA), partial least squares discriminant
analysis (PLS-DA), linear discriminant analysis (LDA), k-nearest neighbor
(kNN), classification and regression tree (CART), artificial neural network
(ANN), and support vector machine (SVM), to distinguish between pure,
impure and contaminated heparin samples based on evaluation of their 1H
NMR spectra. Class modeling techniques represent a substantially different
modeling strategy. Whereas pure discriminating methods focus on the
dissimilarity between classes, class modeling approaches emphasize the
similarity within each class. In this section, soft independent modeling of class
analogy (SIMCA) and unequal class modeling (UNEQ) were applied to
differentiate heparin samples that contain varying amounts of dermatan
sulfate (DS) impurities and oversulfated chondroitin sulfate (OSCS)
contaminants. The two methods enable the construction of individual models
for each class and the determination of the modeling ability of each variable in
a class.
4.3.1 SIMCA Analysis
In SIMCA, each class is modeled separately using principal component
analysis (PCA). Class boundaries which define the range of acceptable
samples at a selected confidence level are built around the PC model that
encloses the internal space. SIMCA is able to indicate the discriminant power
and modeling power for each variable when defining the similarity among the
members of a class of samples.
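A minimal SIMCA-style sketch of this idea, assuming scikit-learn and a simple percentile cutoff on the residual distance in place of the F-test boundary used in this work (synthetic data; the function names are hypothetical):

```python
# Sketch: per-class PCA models with a residual-distance acceptance rule.
import numpy as np
from sklearn.decomposition import PCA

def fit_class_model(X, n_components):
    """PCA model of one class: loadings plus a 95th-percentile residual cutoff."""
    pca = PCA(n_components=n_components).fit(X)
    resid = X - pca.inverse_transform(pca.transform(X))
    cutoff = np.percentile(np.sum(resid**2, axis=1), 95)
    return pca, cutoff

def accepted(pca, cutoff, x):
    """Accept sample x if its squared residual distance is within the boundary."""
    r = x - pca.inverse_transform(pca.transform(x.reshape(1, -1)))
    return float(np.sum(r**2)) <= cutoff

rng = np.random.default_rng(0)
heparin = rng.normal(0.0, 1.0, size=(72, 10))   # stand-in Heparin class
oscs = rng.normal(5.0, 1.0, size=(46, 10))      # well-separated stand-in class
model, cut = fit_class_model(heparin, n_components=3)
print("OSCS-like sample accepted by the Heparin model:",
      accepted(model, cut, oscs[0]))
```

A well-separated foreign sample is rejected by the class model, while roughly 95% of the class's own training samples fall inside the boundary by construction.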
4.3.1.1 Analysis of Pure, Impure and Contaminated Heparin Samples
The SIMCA model was developed using a set of 1H NMR spectral data
with 168 samples corresponding to 72 heparin samples, 50 DS/heparin
samples and 46 OSCS/heparin samples with 74 variables. As defined above,
three classes, i.e., Heparin, DS, and OSCS were considered. An additional
fourth class, namely [DS + OSCS], was included to characterize samples that
contained DS > 1.0% or OSCS > 0%. For each class, only components with
eigenvalues greater than unity were employed to build the model. The
numbers of PCs used for the class models were twelve for the class Heparin,
and nine each for the DS, OSCS and [DS + OSCS] classes, accounting for
98.4, 99.3, 99.4, and 98.7% of the total variance, respectively. The results of
SIMCA modeling after separate category autoscaling and column centering
are reported in Table 23. It was observed that 16 of the 72 Heparin, 13 of the
50 DS, 7 of the 46 OSCS and 20 of the 96 [DS + OSCS] samples were
erroneously rejected by their own category models by the SIMCA F-test at
the 95% confidence level, resulting in sensitivities (SENS) of 77.8%, 74.0%,
84.8% and 79.2% for the four classes, respectively. The class models built
using SIMCA exhibited high specificity (SPEC), particularly the OSCS class
model. Both Heparin and
DS rejected all samples in OSCS, leading to a SPEC of 100%. OSCS also
rejected all samples in Heparin and accepted only one sample in DS. In
addition, the Heparin class model accepted the same five DS samples from
both DS and [DS + OSCS] classes; hence, the SPECs of Heparin for DS and
for [DS + OSCS] were 90.0% (45/50) and 94.8% (91/96), respectively. The
DS content in these five samples was in the range 1.06% to 1.20%, i.e., they
were near the borderline of the 1.0% acceptance criterion. The same
observation was identified as the cause of misclassifications in the previous
work [16, 147]. On the other hand, the DS and [DS + OSCS] class models
accepted 13 and 32 Heparin samples, respectively, corresponding to SPEC
values of 81.9% and 55.6%, respectively. The low SPEC value of the [DS +
OSCS] class model was due to its difficulty in discriminating Heparin samples
from DS samples in cases where the DS content was near the 1.0%
acceptance criterion for DS.
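In this class-modeling sense, SENS and SPEC reduce to simple acceptance and rejection fractions; a minimal helper (hypothetical function names) reproduces the Heparin-model figures quoted above:

```python
# Class-modeling metrics: SENS = fraction of a class's own samples its model
# accepts; SPEC = fraction of another class's samples it rejects.
def sensitivity(accepted_own, n_own):
    return accepted_own / n_own

def specificity(rejected_other, n_other):
    return rejected_other / n_other

# Heparin class model (Table 23): 56 of 72 own samples accepted,
# 45 of 50 DS samples rejected.
print(f"SENS = {sensitivity(56, 72):.1%}")
print(f"SPEC vs DS = {specificity(45, 50):.1%}")
```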
Table 23. Sensitivity and Specificity from SIMCA Modeling
__________________________________________________________________________________
Model          Number    Explained       Sensitivity (%)   Specificity (%)
               of PCs    variance (%)
__________________________________________________________________________________
Heparin        12        98.4            77.8 (56/72)      90.0 (45/50) for DS;
                                                           100 (46/46) for OSCS;
                                                           94.8 (91/96) for [DS + OSCS].
DS             9         99.3            74.0 (37/50)      81.9 (59/72) for Heparin;
                                                           100 (46/46) for OSCS.
OSCS           9         99.4            84.8 (39/46)      100 (72/72) for Heparin;
                                                           98.0 (49/50) for DS.
[DS + OSCS]    9         98.7            79.2 (76/96)      55.6 (40/72) for Heparin.
__________________________________________________________________________________
The results of class modeling can be displayed by means of Coomans
plots, which are a useful tool for visualizing the groupings [93, 99, 148]. In a
Coomans plot, two classes are drawn against one another, and each
category is plotted as a rectangle whose boundary corresponds to the
confidence limit defined by the class space. The distance of each sample
from both categories is measured by the coordinates in the axes [80, 97,
135]. The plot is divided into four areas by the boundary of 95% confidence
level for both categories. The samples accepted by only one model fall in two
areas of the Coomans plot: one in the upper left rectangle and the other in
the lower right rectangle. Samples located in the lower left corner area, where
the two categories overlap, are accepted by both of the two classes. A
sample whose distance is beyond the critical limit for the class model is
rejected as an outlier for that specific class. Consequently, it is plotted outside
the area defining the class model. Samples rejected by both models are
plotted in the upper-right square.
The Coomans plots for different pairs of classes are displayed in Figure
29, in which each sample is represented by its category index. The
distribution of the samples from these models at the critical distance for 95%
confidence is shown. Most of the samples were correctly accepted by their
respective classes, with only few samples plotted beyond their critical limits.
Figure 29A shows the Coomans plot for the Heparin and OSCS classes,
which are located in the upper left quadrant and lower right quadrant of
Figure 29. Coomans plots for SIMCA class modeling. (A) Heparin vs OSCS; (B) Heparin vs DS; (C) Heparin vs [DS + OSCS]; (D) DS vs OSCS.
the plot, respectively. All the OSCS samples are clustered at the right side,
forming a tight group, and all are far from the lower left corner. Meanwhile no
Heparin sample fell into the bottom box. All of the OSCS samples were
completely separated from the Heparin class without any overlap between the
two classes, indicating 100% successful discrimination.
The Coomans plot for the Heparin and DS classes is shown in Figure 29B.
The upper left zone corresponds to the samples accepted by the Heparin
class model while the bottom right zone corresponds to the samples accepted
by the DS class model. Heparin samples with low DS content are far from the
bottom box, i.e., the DS class model, while samples with DS content close to
1.0% are located near or within the lower left square. One sample (with %DS
= 1.04) was accepted by the DS model and 12 samples (with %DS = 0.80-
1.02) appear in the overlapping area. Although 13 DS samples were rejected
by the DS class model, all of these samples fell close to the boundary.
Samples with high DS content are situated on the right side while samples
with low DS content are very close to the Heparin model. The samples
situated in the lower left square of the diagram are accepted by both models.
Unsurprisingly, a certain degree of overlap occurred between the models of
these two classes. The Heparin class model accepted five DS samples, while
the DS class model accepted 13 Heparin samples as indicated in the left
bottom square.
The Coomans plot for the Heparin and [DS + OSCS] classes is shown in
Figure 29C. Similar to Figure 29A and 29B, Figure 29C demonstrates that all
OSCS samples are located on the right side and five DS samples are in the
lower left square. Of the 72 samples belonging to the Heparin class, 32 are
plotted in the lower left quadrant belonging to both classes, revealing the low
degree of specificity of the [DS + OSCS] class model for the Heparin class.
In the Coomans plot for the DS and OSCS classes (Figure 29D), all of the
OSCS samples were significantly distant from the region of the left rectangle
corresponding to the DS class model and far from the critical distance of the
DS class model. No OSCS samples fell in the region for the DS model, thus
the specificity of DS with respect to OSCS was 100%. Likewise, the OSCS
model accepted only 1 of the 50 DS samples corresponding to 98%
specificity. Overall, excellent separation was achieved between the Heparin
and OSCS classes and between the DS and OSCS classes.
In SIMCA, a sample is classified according to its analogy with samples
belonging to a class defined by principal components (PCs). Classification is
carried out based on the orthogonal distance of the sample to the hyperplane
of the class model defined by the first few PCs. The classification
performance is evaluated in terms of prediction ability. Validation of the class
models was performed using a full leave-one-out cross-validation (LOO-CV)
approach, which recalculates the local models after each sample is
sequentially excluded from the model [99]. Training and prediction rates were
computed as the average of the classification rates for each class,
corresponding to the success rate in classifying the training set samples and
the success rate in classifying the test set samples. The results obtained for
the training and test sets are summarized in Table 24, recorded as the
classification matrix of the model indicating the correct predictions for each
class. The success rates for the training and test sets were 99.2% and 92.4%
for Heparin vs OSCS and 94.3% and 81.1% for Heparin vs DS. The number
of misclassifications was unevenly distributed among the different classes,
with higher number of errors occurring in the Heparin and DS classes. The
OSCS samples were sufficiently distant from the Heparin and DS class
models, and consequently none of the OSCS samples were misclassified as
either Heparin or DS. The OSCS samples were classified perfectly (no
misclassifications) both in the training and the validation phases. However,
this was not the case for other SIMCA class models. For Heparin vs OSCS, 3
of the 72 Heparin samples were misclassified as OSCS. For DS vs OSCS, 11
of the 50 DS samples were misclassified as OSCS.
Given the similarity in the 1H NMR spectra of heparin and DS, several
samples were misclassified for Heparin vs DS. Fifteen of the 72 Heparin
samples were misclassified as DS while 8 of the 50 DS samples were
misclassified as Heparin. Many Heparin samples were misclassified in the
Heparin vs [DS + OSCS] model: 29 of the 72 Heparin samples were
assigned to [DS + OSCS], whereas 7 of the 96 [DS + OSCS]
samples were assigned to Heparin, all 7 of them belonging to the DS
class.
Table 24. Classification Matrices and Success Rates from SIMCA Class Modeling
__________________________________________________________________________________
Model                                Training                            Prediction
__________________________________________________________________________________
Heparin vs DS            Hep   DS            Rate (%)      Hep   DS            Rate (%)
  Hep                    68    4             94.4          57    15            79.2
  DS                     3     47            94.0          8     42            84.0
  Total                  -     -             94.3          -     -             81.1
Heparin vs OSCS          Hep   OSCS          Rate (%)      Hep   OSCS          Rate (%)
  Hep                    71    1             98.6          63    9             87.5
  OSCS                   0     46            100           0     46            100
  Total                  -     -             99.2          -     -             92.4
Heparin vs [DS + OSCS]   Hep   [DS + OSCS]   Rate (%)      Hep   [DS + OSCS]   Rate (%)
  Hep                    68    4             94.4          47    25            65.3
  [DS + OSCS]            7     89            92.7          9     87            90.6
  Total                  -     -             93.5          -     -             79.8
DS vs OSCS               DS    OSCS          Rate (%)      DS    OSCS          Rate (%)
  DS                     49    1             98.0          39    11            78.0
  OSCS                   0     46            100           0     46            100
  Total                  -     -             99.0          -     -             88.5
Heparin vs DS vs OSCS    Hep   DS    OSCS    Rate (%)      Hep   DS    OSCS    Rate (%)
  Hep                    54    17    1       75.0          45    22    5       62.5
  DS                     0     49    1       98.0          5     36    9       72.0
  OSCS                   0     0     46      100           0     0     46      100
  Total                  -     -     -       88.7          -     -     -       75.6
__________________________________________________________________________________
With regard to the three-class system Heparin vs DS vs OSCS, OSCS
yielded a 100% success rate in prediction ability on the test set. On the other
hand, 5 and 22 of the 72 samples from the Heparin class were misclassified
as OSCS and DS, respectively, while 5 and 9 of the 50 samples from the DS
class were misclassified as Heparin and OSCS, respectively. The relatively
poor prediction ability of both the Heparin and DS classes resulted in a modest overall
classification rate of 75.6%.
As a highly informative multivariate analysis technique, SIMCA allows the
discrimination between those variables which make great contributions to
distinguishing between classes and those which provide little useful
information [97]. The discriminant power (DP) of the variables indicates the
importance of each variable in discriminating the samples into different class
models [99]. DP is defined as the ratio of the residual standard deviation of
samples in one class when fitted to the other class to the residual standard
deviation of the samples when fitted to their own class [149]. For two classes
c and g, the squared DP for variable j is:
2
,
2
,
2
,
2
,2)()(
),(gjcj
gjcj
jss
csgsgcDP
(46)
where
c
n
i
ijccj ngegsc
/)()(1
22
,
(47)
159
g
n
i
ijggj ncecsg
/)()(1
22
,
(48)
)1/(1
22
,
cc
n
i
ijccj Anesc
(49)
)1/(1
22
,
gg
n
i
ijggj Anesg
(50)
)(2
, gs cj and )(2
, cs gj are the residual standard deviations for samples in class c
and class g when fitted to class g and class c, respectively; 2
,cjs and 2
,gjs are
the residual standard deviations for samples in class c and class g when fitted
to their own classes, i.e., class c and class g, respectively; 2
ijce and 2
ijge are
the residual distances for sample i in the class to the class itself while )(2 geijc
and )(2 ceijc are the residual distances for samples in class c and class g to the
class g and class c, respectively; cn and gn denote the number of samples in
class c and class g, and cA and gA are the number of PCs for class c and
class g, respectively. DP implies the ability for each variable to contribute to
the discrimination between classes. A large value suggests a great
contribution to the differentiation between the two corresponding classes,
while a value of unity indicates no discrimination power at all.
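Equations 46-50 can be evaluated directly from two per-class PCA models. The sketch below (synthetic data, scikit-learn assumed; not the dissertation's code) deliberately separates the classes in one variable and recovers it as the most discriminating:

```python
# Sketch: squared discriminant power (Eq. 46) per variable from two PCA models.
import numpy as np
from sklearn.decomposition import PCA

def residuals(X, pca):
    """Per-variable residuals of samples X against a fitted PCA class model."""
    return X - pca.inverse_transform(pca.transform(X))

rng = np.random.default_rng(1)
Xc = rng.normal(0, 1, size=(50, 8))   # class c
Xg = rng.normal(0, 1, size=(40, 8))
Xg[:, 0] += 4                          # class g differs in variable 0

Ac = Ag = 2
pca_c = PCA(n_components=Ac).fit(Xc)
pca_g = PCA(n_components=Ag).fit(Xg)

# Eq. 47-48: each class fitted to the *other* class's model
s2_cj_g = np.mean(residuals(Xc, pca_g) ** 2, axis=0)
s2_gj_c = np.mean(residuals(Xg, pca_c) ** 2, axis=0)
# Eq. 49-50: each class fitted to its own model
s2_cj = np.sum(residuals(Xc, pca_c) ** 2, axis=0) / (len(Xc) - Ac - 1)
s2_gj = np.sum(residuals(Xg, pca_g) ** 2, axis=0) / (len(Xg) - Ag - 1)

DP2 = (s2_cj_g + s2_gj_c) / (s2_cj + s2_gj)   # Eq. 46
print("variable with highest DP:", int(np.argmax(DP2)))
```

Non-discriminating variables yield DP² values near unity, while the shifted variable stands out clearly.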
The importance of the individual variables and their DP for various class
pairs were examined, and the variables that made the greatest contribution to
the class discrimination are listed in Table 25. When analyzing the
discriminating ability of the different variables, 2.08 ppm (DP = 8.29) was
found to be the chemical shift with the highest discriminating power, being
most effective in discriminating between the Heparin and DS classes.
Significant discriminating ability was also shown by 3.56 ppm (DP = 3.46),
4.46 ppm (DP = 3.05), 4.04 ppm (DP = 3.04) and 2.11 ppm (DP = 3.00). The
highest DP value in the Heparin vs OSCS, DS vs OSCS, Heparin vs [DS +
OSCS] and Heparin vs DS vs OSCS models was at 2.14 ppm, corresponding
to 61.88, 41.41, 38.63 and 34.86, respectively. The same chemical shift
contributed substantially to discriminating OSCS from all of the other classes.
Other variables showing a significant discriminating power were 4.07 ppm
(DP = 16.81), 2.20 ppm (DP = 15.19), 2.17 ppm (DP = 15.06), 5.01 ppm (DP
= 12.02) and 5.04 ppm (DP = 11.71) for Heparin vs OSCS; 2.17 ppm (DP =
15.64), 4.07 ppm (DP = 15.10), 3.80 ppm (DP = 13.95), 5.04 ppm (DP =
12.46), 3.95 ppm (DP = 12.34), 5.34 ppm (DP = 12.16) and 5.01 ppm (DP =
11.13) for Heparin vs [DS + OSCS]; and 4.31 ppm (DP = 12.80), 2.08 ppm
(DP = 12.32), 5.01 ppm (DP = 11.85) and 4.49 ppm (DP = 10.58) for DS vs
OSCS. For Heparin vs DS vs OSCS, the results in Table 25 show that the
variables with the greatest discriminating power are 2.14 ppm (DP = 34.86)
and 2.08 ppm (DP = 10.03), which are the characteristic chemical shifts of
OSCS and DS, respectively.
Table 25. Discriminant Powers (DP) of Variables (V) for Various Models
__________________________________________________________________________________
Order Hep vs DS Hep vs OSCS Hep vs [DS + OSCS] DS vs OSCS Hep vs DS vs OSCS
V (ppm) DP V (ppm) DP V (ppm) DP V (ppm) DP V (ppm) DP
__________________________________________________________________________________
1 2.08 8.29 2.14 61.88 2.14 38.63 2.14 41.41 2.14 34.86
2 3.56 3.46 4.07 16.81 2.17 15.64 4.31 12.80 2.08 10.03
3 4.46 3.05 2.20 15.19 4.07 15.10 2.08 12.32 2.17 8.67
4 4.04 3.04 2.17 15.06 3.80 13.95 5.01 11.85 4.07 8.63
5 2.11 3.00 5.01 12.02 5.04 12.46 4.49 10.58 5.01 8.44
6 3.92 2.89 5.04 11.71 3.95 12.34 2.17 9.45 2.20 7.68
7 4.01 2.82 4.22 10.92 5.34 12.16 5.16 8.37 4.49 6.80
8 3.53 2.80 4.37 10.02 5.01 11.13 5.19 7.12 4.31 6.64
9 3.71 2.68 2.08 9.65 4.01 9.25 4.98 7.02 5.04 6.14
10 3.95 2.51 3.80 9.56 4.43 9.13 4.07 7.01 5.19 5.76
11 4.31 2.48 5.43 9.44 5.61 8.76 3.89 6.95 4.98 5.57
12 3.86 2.47 5.37 9.23 5.43 8.68 2.20 6.85 3.95 5.34
13 3.59 2.39 4.25 9.18 3.89 8.17 3.98 6.71 4.61 5.24
14 5.37 2.34 4.58 9.13 4.22 7.86 5.10 6.55 2.11 5.22
15 3.89 2.25 4.10 9.03 5.25 7.81 4.64 6.40 4.22 5.21
16 3.50 2.23 4.55 8.71 5.31 7.76 2.11 6.05 4.10 5.06
17 4.25 2.22 4.61 8.69 3.92 7.73 3.74 5.88 4.04 4.98
18 3.80 2.19 5.19 8.52 3.50 7.68 4.61 5.77 3.80 4.91
19 5.34 2.05 4.49 8.41 3.53 7.66 2.02 5.59 5.43 4.90
20 3.74 2.00 4.98 8.31 2.08 7.47 3.53 5.54 4.19 4.84
__________________________________________________________________________________
The SIMCA class distance is defined as the ratio of the sum of the residual
standard deviations for all variables within one class when fitted to the other
class to the sum of the residual standard deviations for all variables when
fitted to their own class [102]. The distance is used to measure how far two
models are from each other. The squared SIMCA class distance between
category c and category g is given by:
\[ D^2(c,g) = \frac{\sum_{j=1}^{m}\left(s_{j,c}^2(g) + s_{j,g}^2(c)\right)}{\sum_{j=1}^{m}\left(s_{j,c}^2 + s_{j,g}^2\right)} - 1 \tag{51} \]
When g = c, the first term is 1 and the distance between a category and itself
becomes 0. A class distance of less than 1 indicates that the two classes
overlap, while if a class distance is greater than 1 but smaller than 3, a partial
separation of the classes occurs. A model distance of greater than 3 indicates
separation of the classes. It was found that the SIMCA class distances were
4.9 for Heparin vs DS, 53.0 for Heparin vs OSCS, 35.2 for Heparin vs [DS +
OSCS], and 26.1 for DS vs OSCS. Therefore, DS was very close to Heparin
while OSCS was far from the Heparin, and not surprisingly, [DS + OSCS] was
intermediate between DS and OSCS.
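The class-distance computation of Eq. (51) differs from the DP of Eq. (46) only in summing the residual variances over all variables before taking the ratio and subtracting one. A minimal NumPy sketch follows; the helper names are hypothetical and the class models are simple mean-plus-axes PCA fits.

```python
import numpy as np

def _pca(X, A):
    """Simple PCA class model: class mean and top A principal axes."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:A]

def _resid2(X, model):
    """Per-variable squared residuals of X under a PCA class model."""
    mu, P = model
    Xc = X - mu
    return (Xc - Xc @ P.T @ P) ** 2

def simca_class_distance(Xc, Xg, Ac=2, Ag=2):
    """SIMCA distance between class models c and g, following Eq. (51):
    the squared distance is the cross-fitted/self-fitted ratio minus one."""
    mc, mg = _pca(Xc, Ac), _pca(Xg, Ag)
    nc, ng = len(Xc), len(Xg)
    # numerator: residual variances of each class under the OTHER model
    num = (_resid2(Xc, mg).sum(0) / nc + _resid2(Xg, mc).sum(0) / ng).sum()
    # denominator: residual variances of each class under its OWN model
    den = (_resid2(Xc, mc).sum(0) / (nc - Ac - 1)
           + _resid2(Xg, mg).sum(0) / (ng - Ag - 1)).sum()
    return np.sqrt(max(num / den - 1.0, 0.0))
```

Under this convention a distance below 1 indicates overlapping classes and a distance above 3 indicates separated classes, matching the interpretation in the text.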
4.3.1.2 Analysis of Heparin Samples Spiked with other GAGs
Heparin API can also contain GAGs other than DS, such as chondroitin
sulfate A (CSA) and heparan sulfate (HS). In addition, oversulfated versions
of GAGs other than CSA could be used to adulterate heparin in the future.
The methods described herein are expected to identify a wide range of
potential GAG-like contaminants in NMR data. To further test the usefulness
of the method, blend samples of heparin spiked with non-, partially- or fully
oversulfated CSA, DS and HS at the 1.0, 5.0 and 10.0 weight percent
levels were tested for their class assignments by the built models, which
allowed us to investigate the capability of the models (e.g., Heparin, DS, and
OSCS) to accept or reject the blend samples, and hence to detect fraudulent
or contaminated products. The detailed compositions of the series of blends
as well as the test results from class modeling are summarized in Table 26.
As can be seen from Table 26, the blend samples were diverse when
compared to the Heparin, DS and OSCS classes. They covered multiple
components, including CSA, DS, heparan sulfate, and crude and purified
heparin with varying degrees of sulfation, with component content ranging
from 1% to 10%. A blend sample can be assigned to one or more classes if it
is situated within the statistical limits, and it can be considered an outlier
if its distance is beyond the limits. Thus, a blend sample can be assigned to
a single class, to more than one class, or to none of the above defined classes.
Samples B28, B29 and B30 are blanks, that is, pure heparin samples;
as expected, all three are accepted by the Heparin
class. In addition, the Heparin class accepts some samples with low content
(1%) of GAGs, such as B9 (1% HS), B18 (1% OS-HS), B21 (1% OS-Hep),
B24 (1% PS-CSA#1) and B27 (1% PS-CSA#2). The Heparin class rejects all
blend samples with high content of GAGs (5% and 10%) as well as four low
content samples, which are B3 (1% CSA), B6 (1% DS), B12 (1% FS-CSA)
and B15 (1% FS-DS).
Blends B4, B5 and B6 are heparin samples spiked with 10%, 5% and 1%
DS, respectively. As expected, they are all accepted by the DS class.
Table 26. The Compositions of the Series of Blends of Heparin Spiked with other
GAGs and Test Results from SIMCA Class Modeling
__________________________________________________________________________________
ID    GAGs    Content (%)    Accepted (A) or Rejected (R) by the classes
                             _________________________________________
                             Heparin    DS    OSCS
__________________________________________________________________________________
B1 CSA 10 R R R
B2 CSA 5 R R R
B3 CSA 1 R R R
B4 DS 10 R A R
B5 DS 5 R A R
B6 DS 1 R A R
B7 HS 10 R R R
B8 HS 5 R R R
B9 HS 1 A A R
B10 FS-CSA 10 R R A
B11 FS-CSA 5 R R A
B12 FS-CSA 1 R R A
B13 FS-DS 10 R R R
B14 FS-DS 5 R R R
B15 FS-DS 1 R A R
B16 OS-HS 10 R R R
B17 OS-HS 5 R R R
B18 OS-HS 1 A A R
B19 OS-HEP 10 R R R
B20 OS-HEP 5 R R R
B21 OS-HEP 1 A R R
B22 PS-CSA#1 10 R R A
B23 PS-CSA#1 5 R R R
B24 PS-CSA#1 1 A R R
B25 PS-CSA#2 10 R R A
B26 PS-CSA#2 5 R R R
B27 PS-CSA#2 1 A R R
B28 Blank - A R R
B29 Blank - A R R
B30 Blank - A R R
__________________________________________________________________________________
CSA: Chondroitin Sulfate A; DS: Dermatan Sulfate; HS: Heparan Sulfate; FS: Fully Sulfated;
OS: Over Sulfated; PS: Partially Sulfated; Blank: control (pure heparin sample).
Samples B13, B14 and B15 correspond to fully-sulfated DS at contents of
10%, 5% and 1%, respectively. The DS class accepts only the low-content
sample (B15) and rejects B13 and B14. As with the Heparin class, samples
B9 (1% HS) and B18 (1% OS-HS) are also accepted into the DS class.
The OSCS class model accepts five blend samples, viz., B10, B11, B12,
B22, and B25. B10, B11 and B12 are heparin samples spiked with 10%, 5%
and 1% fully-sulfated CSA, i.e., OSCS, and hence clearly belong to the
OSCS class. Samples B22 and B25 contain 10% partially-sulfated CSA,
whose structure is very similar to that of OSCS.
4.3.2 UNEQ Analysis
The heparin 1H NMR data set was also analyzed using the unequal class
modeling (UNEQ) method. UNEQ, similar to quadratic discriminant analysis
(QDA), is based on the assumption of multivariate normal distribution of the
measured or transformed variables for each class population. In general,
UNEQ represents each class by means of its centroid. Within a specific
class, the distance of each sample from the barycenter (center of mass, or
centroid) is calculated according to a measure that follows a chi-squared
distribution.
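The acceptance rule just described can be sketched as follows, assuming a multivariate normal class model. The function name `uneq_accepts` and the three-variable example are hypothetical, and the chi-squared critical value is hard-coded for df = 3 to keep the sketch dependency-free.

```python
import numpy as np

# 95th-percentile chi-squared critical value for df = 3 variables
CHI2_95_DF3 = 7.815

def uneq_accepts(X_train, x_new, crit=CHI2_95_DF3):
    """UNEQ class model: accept x_new when its squared Mahalanobis
    distance from the class centroid falls below the chi-squared cutoff."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)   # class-specific ("unequal") covariance
    diff = x_new - mu
    d2 = float(diff @ np.linalg.inv(cov) @ diff)
    return d2 <= crit
```

Because each class keeps its own covariance matrix, the acceptance region is an ellipsoid shaped to that class alone, which is what distinguishes UNEQ from pooled-covariance methods such as LDA.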
4.3.2.1 Stepwise LDA Variable Reduction
For UNEQ modeling, the variance-covariance matrix must be inverted,
which is impossible if the number of samples is less than that
of the variables [79, 93]. Therefore, a preliminary variable reduction step is
necessary so that the data matrix for each category presents a high ratio
between the number of training samples and the number of variables. In
general, the number of samples is required to be at least three times the
number of variables. In order to select a subset of original variables that
affords the maximum improvement of the discriminating ability between
categories, stepwise linear discriminant analysis (SLDA) was performed
before UNEQ modeling.
In the present study, only variables with an F-to-enter value equal to or
greater than the entry threshold of 1.0 were allowed into the model. If the
number of variables with F-to-enter ≥ 1.0 exceeded one third of the number
of samples, then the number of variables retained in the model was capped
at one third of the sample number. Preliminary variable reduction using
stepwise LDA led to the
selection of 15, 14, 14 and 15 variables for Heparin vs DS, Heparin vs OSCS,
Heparin vs [DS + OSCS], and DS vs OSCS, respectively (Table 27). For
Heparin vs DS, chemical shift 2.08 ppm had the highest F-value (101.6), so it
was the most important variable for the differentiation of Heparin from DS.
The next most important variable was 3.53 ppm with F-value of 8.9. These
two variables (2.08 and 3.53 ppm) were also found to be highly discriminating
in SIMCA modeling. The variable 4.49 ppm was significant for Heparin vs
OSCS, Heparin vs [DS + OSCS] and DS vs OSCS with F-values of 14.0, 23.3
and 103.0, respectively. In addition, the variable 2.08 ppm was also important
for Heparin vs OSCS and Heparin vs [DS + OSCS] with F-values of 15.1 and
97.1, respectively. Other significant variables for DS vs OSCS were 4.04 ppm
(F-value = 21.9) and 3.71 ppm (F-value = 14.3).
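The F-to-enter rule can be illustrated with a simplified sketch that ranks variables by their univariate two-class F-statistic rather than performing the full stepwise LDA with Wilks' lambda; `forward_select` is a hypothetical name, and the thresholds follow the rules stated above (F ≥ 1.0, at most one third of the sample count).

```python
import numpy as np

def forward_select(Xc, Xg, f_enter=1.0, max_frac=1 / 3):
    """Simplified variable selection: rank variables by the univariate
    F-statistic (between-class / within-class variance), keep those with
    F >= f_enter, capped at one third of the total sample count."""
    nc, ng = len(Xc), len(Xg)
    grand = np.vstack([Xc, Xg]).mean(axis=0)
    # between-class sum of squares (1 degree of freedom for two classes)
    ssb = nc * (Xc.mean(0) - grand) ** 2 + ng * (Xg.mean(0) - grand) ** 2
    # within-class sum of squares (nc + ng - 2 degrees of freedom)
    ssw = ((Xc - Xc.mean(0)) ** 2).sum(0) + ((Xg - Xg.mean(0)) ** 2).sum(0)
    F = ssb / (ssw / (nc + ng - 2))
    order = np.argsort(F)[::-1]                  # descending F
    keep = [int(j) for j in order if F[j] >= f_enter]
    cap = int((nc + ng) * max_frac)
    return keep[:cap]
```

The true SLDA recomputes F-to-enter conditionally at every step; this univariate version only conveys the thresholding and capping logic.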
Table 27. Wilks Lambda (λ) and F-to-enter (F) Values of Variables (V)
__________________________________________________________________________________
Order Heparin vs DS Heparin vs OSCS Heparin vs [DS + OSCS] DS vs OSCS
V (ppm) λ F V (ppm) λ F V (ppm) λ F V (ppm) λ F
__________________________________________________________________________________
1 2.08 0.54 101.6 4.49 0.36 14.0 2.08 0.63 97.1 4.49 0.48 103.0
2 3.53 0.48 8.9 2.08 0.33 15.1 4.49 0.55 23.3 3.71 0.41 14.3
3 2.17 0.45 1.7 2.17 0.29 8.1 2.17 0.52 3.6 4.04 0.33 21.9
4 2.14 0.44 2.5 3.92 0.26 7.6 4.16 0.50 5.3 5.22 0.30 10.9
5 3.95 0.43 1.1 3.68 0.24 5.0 4.46 0.49 3.3 3.65 0.28 5.7
6 4.04 0.42 2.2 5.16 0.22 8.9 5.16 0.47 2.7 4.19 0.27 3.6
7 5.43 0.42 1.6 3.56 0.21 5.6 5.10 0.46 2.6 3.74 0.26 2.9
8 3.92 0.41 1.3 5.13 0.20 7.0 5.61 0.46 2.8 4.10 0.25 2.4
9 4.46 0.41 2.4 3.74 0.19 3.4 4.28 0.45 3.9 3.59 0.25 2.5
10 4.49 0.40 2.6 3.86 0.18 3.8 3.56 0.44 4.1 5.40 0.24 1.4
11 3.89 0.39 1.5 5.61 0.17 3.4 4.95 0.43 2.2 4.43 0.15 3.0
12 5.61 0.39 2.0 4.37 0.17 3.4 5.49 0.42 3.8 3.71 0.14 5.3
13 1.96 0.38 1.5 5.52 0.16 3.6 4.98 0.41 1.9 5.13 0.14 2.1
14 4.55 0.37 2.3 5.25 0.16 2.4 4.61 0.40 2.2 5.04 0.13 2.5
15 4.16 0.15 3.9 5.46 0.13 1.5
__________________________________________________________________________________
4.3.2.2 Analysis of Pure, Impure and Contaminated Heparin Samples
Results from UNEQ modeling using the selected subsets of variables as
inputs are summarized in Tables 28 and 29. Table 28 shows the sensitivity for
each of the four categories Heparin, DS, OSCS and [DS + OSCS], together
with the specificity of each model between each pair of categories. For
different systems, the subsets of selected variables were different, so that the
values of sensitivity and specificity varied within a certain range. The values
of sensitivity for Heparin, DS and OSCS were 84.7-87.5% (61-63/72), 80.0-
90.0% (40-45/50) and 87.0-91.3% (40-42/46), respectively. In all cases, the
sensitivity obtained using UNEQ was better than that evaluated with SIMCA
for which the values for Heparin, DS, OSCS and [DS + OSCS] were 77.8%
(56/72), 74.0% (37/50), 84.8% (39/46) and 79.2% (76/96), respectively.
Compared with SIMCA, UNEQ accepted 7 more Heparin, 8 more DS, 3 more
OSCS, and 5 more [DS + OSCS] samples by their specific category models
under optimal conditions. In the modeling of the category Heparin, 63 of the
72 samples were accepted by the category model built using UNEQ for
Heparin vs DS, while 56 of them were accepted by the SIMCA class model.
The differences between the two methods were far more marked when the
modeling of the other classes was considered. Of the 50 DS samples, 45
were correctly accepted by the UNEQ class model compared with 37 by the
SIMCA model. In addition, 42 out of the 46 OSCS samples and 81 out of the
96 [DS + OSCS] samples were correctly recognized by the UNEQ class
models compared with 39 out of 46 OSCS samples and 76 out of 96 [DS +
OSCS] samples by the SIMCA models.
Even though the sensitivity was greatly improved for the UNEQ model, a
corresponding decrease was observed in the specificity of UNEQ compared
to SIMCA. This is most evident by comparing the models for the Heparin and
DS classes as reported in Table 28. The specificity of the individual class
models was rather poor, most of the values being lower than 50%. The
classes DS and [DS + OSCS] accepted a large number of Heparin samples
(52/72 and 57/72, respectively), leading to significantly lower specificities
(27.8% and 20.8%) than the corresponding SIMCA values 81.9% (59/72) and
55.6% (40/72). The specificity of Heparin with respect to DS remarkably
decreased to 72.0% (36/50) from 90.0% (45/50), and that of Heparin to [DS +
OSCS] decreased to 82.3% (79/96) from 94.8% (91/96). Furthermore, the
UNEQ model showed a much poorer specificity for OSCS to Heparin and DS.
Table 28. Sensitivity and Specificity from UNEQ Class Modeling
__________________________________________________________________________________
Model                       Sensitivity (%)        Specificity (%)
__________________________________________________________________________________
Heparin vs DS Heparin 87.5 (63/72) 72.0 (36/50) for DS
DS 80.0 (40/50) 27.8 (20/72) for Heparin
Heparin vs OSCS Heparin 84.7 (61/72) 100 (46/46) for OSCS
OSCS 87.0 (40/46) 26.4 (19/72) for Heparin
Heparin vs [DS + OSCS] Heparin 86.1 (62/72) 82.3 (79/96) for [DS + OSCS]
[DS + OSCS] 84.4 (81/96) 20.8 (15/72) for Heparin
DS vs OSCS DS 90.0 (45/50) 97.8 (45/46) for OSCS
OSCS 91.3 (42/46) 26.0 (13/50) for DS
Heparin vs DS vs OSCS Heparin 84.7 (61/72) 54.0 (27/50) for DS;
100 (46/46) for OSCS
DS 86.0 (43/50) 23.6 (17/72) for Heparin;
89.1 (41/46) for OSCS
OSCS 87.0 (40/46) 15.3 (11/72) for Heparin;
14.0 (7/50) for DS __________________________________________________________________________________
The values of the specificity for OSCS with respect to Heparin and DS
samples considerably decreased to 26.4% (19/72) and 26.0% (13/50) in
UNEQ compared with 100% (72/72) and 98.0% (49/50) in SIMCA. A major
exception was that of Heparin for OSCS, which remained a perfect 100%
(46/46). The specificity of DS for OSCS was also high at 97.8% (45/46) for
UNEQ compared with 100% (46/46) for SIMCA.
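In class-modeling terms, sensitivity is the percentage of a class's own samples accepted by its model, and specificity is the percentage of foreign-class samples it rejects. A trivial sketch (hypothetical helper names) reproduces the Table 28 entries:

```python
def sensitivity(n_accepted_own, n_own):
    """Percentage of a class's own samples accepted by its class model."""
    return 100.0 * n_accepted_own / n_own

def specificity(n_rejected_foreign, n_foreign):
    """Percentage of foreign-class samples rejected by the class model."""
    return 100.0 * n_rejected_foreign / n_foreign
```

For instance, the UNEQ Heparin model in Heparin vs DS accepts 63 of 72 Heparin samples (sensitivity 87.5%) and rejects 36 of 50 DS samples (specificity 72.0%), matching Table 28.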
The UNEQ class modeling results can be visualized in the Coomans plots
displayed in Figure 30. Compared with those corresponding to the SIMCA
models, the Coomans plots produced from the UNEQ models revealed a
large number of samples occupying the lower left quadrant, i.e., belonging to
both classes. This outcome is a consequence of the low specificity of the
UNEQ class models.
Table 29 summarizes the results of the classification matrix evaluated by
means of leave-one-out cross-validation. Compared with SIMCA, the UNEQ
models exhibited better overall prediction ability. For example, the prediction
rates increased from 79.8% to 86.9% for Heparin vs [DS + OSCS] and from
75.6% to 84.5% for Heparin vs DS vs OSCS. Comparing these overall
abilities with those computed for the individual categories, it was noted that
the increase in the overall performance was mainly due to Heparin, and to a
lesser extent to DS, as the number of misclassified samples from these
classes was lower than in the corresponding SIMCA model. For Heparin vs
OSCS, Heparin vs [DS + OSCS] and Heparin vs DS vs OSCS, the prediction
Figure 30. Coomans plots for UNEQ class modeling. (A) Heparin vs OSCS; (B) Heparin vs DS; (C) Heparin vs [DS + OSCS]; (D) DS vs OSCS.
rates of the Heparin class increased from 87.5%, 65.3% and 62.5% for
SIMCA to 91.7%, 88.9% and 80.3% for UNEQ. For Heparin vs DS, DS vs
OSCS and Heparin vs DS vs OSCS, the prediction rates of DS class
increased from 84.0%, 78.0% and 72.0% in SIMCA to 86.0%, 88.0% and
78.0% in UNEQ.
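A leave-one-out classification matrix of the kind shown in Table 29 can be sketched as follows; the nearest-centroid rule here is a simple stand-in for the actual UNEQ decision rule, and `loo_confusion` is a hypothetical name.

```python
import numpy as np

def loo_confusion(X, y, n_classes):
    """Leave-one-out cross-validated confusion matrix using a simple
    nearest-centroid classifier as a stand-in for the UNEQ class rules."""
    n = len(X)
    C = np.zeros((n_classes, n_classes), dtype=int)
    for i in range(n):
        keep = np.arange(n) != i                       # hold out sample i
        cents = [X[keep & (y == k)].mean(axis=0) for k in range(n_classes)]
        pred = int(np.argmin([np.linalg.norm(X[i] - c) for c in cents]))
        C[y[i], pred] += 1                             # row: true, column: predicted
    return C
```

Row sums give per-class totals, diagonal entries the correctly predicted counts, so each Table 29 rate is `100 * C[k, k] / C[k].sum()`.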
Table 29. Classification Matrices from UNEQ Class Modeling
__________________________________________________________________________________
Model                                 Training                   Prediction
__________________________________________________________________________________
Heparin vs DS            Heparin  DS    Rate (%)      Heparin  DS    Rate (%)
  Heparin                  64      8      88.9          56      15     79.2
  DS                        3     47      94.0           7      43     86.0
  Total                     -      -      91.0           -       -     81.8
Heparin vs OSCS          Heparin  OSCS  Rate (%)      Heparin  OSCS  Rate (%)
  Heparin                  72      0      100           66       6     91.7
  OSCS                      0     46      100            1      45     97.8
  Total                     -      -      100            -       -     94.9
Heparin vs [DS + OSCS]   Heparin  [DS + OSCS]  Rate (%)   Heparin  [DS + OSCS]  Rate (%)
  Heparin                  72      0           100          64      8           88.9
  [DS + OSCS]              14     82           85.4         14     82           85.4
  Total                     -      -           91.7          -      -           86.9
DS vs OSCS               DS   OSCS  Rate (%)          DS   OSCS  Rate (%)
  DS                      50    0     100              44    6     88.0
  OSCS                     0   46     100               2   44     95.7
  Total                    -    -     100               -    -     91.7
Heparin vs DS vs OSCS    Heparin  DS  OSCS  Rate (%)  Heparin  DS  OSCS  Rate (%)
  Heparin                  64     8    0     88.9       57     12    2    80.3
  DS                        4    46    0     92.0        8     39    3    78.0
  OSCS                      0     0   46     100         1      0   45    97.8
  Total                     -     -    -     92.9        -      -    -    84.5
__________________________________________________________________________________
Chapter V
SUMMARY AND CONCLUSIONS
In order to differentiate heparin samples with varying amounts of dermatan
sulfate (DS) impurities and oversulfated chondroitin sulfate (OSCS)
contaminants, proton NMR spectral data for heparin sodium active
pharmaceutical ingredient (API) samples from different manufacturers were
analyzed by multivariate chemometric methods for qualitative and quantitative
evaluation. The following conclusions were drawn separately for the
multivariate regression and the pattern recognition analyses.
5.1 Multivariate Regression for Predicting %Gal
In this study, the content of galactosamine (%Gal) in heparin (primarily
originating from the impurity dermatan sulfate, DS) was predicted from 1H
NMR spectral data by means of four multivariate analysis approaches, i.e.,
multiple linear regression (MLR), Ridge regression (RR), partial least squares
regression (PLSR), and support vector regression (SVR). Variable selection
was performed by genetic algorithms (GAs) or stepwise method in order to
build robust and reliable models. The results demonstrated that excellent
prediction performance was achieved in the determination of %Gal by all four
regression models under optimal conditions. Variable selection substantially
enhanced the predictive ability of all models, particularly the MLR model.
Simple models were obtained using a subset of selected variables that
predicted %Gal with high coefficients of determination and low prediction
errors.
In general, GA was superior to the stepwise method for variable selection.
Because GA can select any number of variables, subsets ranging from 5
to 40 variables were evaluated for building predictive models. Over-fitted models based on
the training sets due to use of excessive variables led to poor predictive ability
on the test sets. Likewise, under-fitted models resulting from an insufficient
number of variables for model building led to statistically unstable models.
The optimal subsets for Datasets A and B were 10 and 30 variables,
respectively. After variable selection, the four regression models considered
in this study produced very similar results.
The range of %Gal in the samples influences several modeling decisions, i.e., the
selection of regression approach; the choice of variable selection method and
number of variables; and the interpretation of the models. Dataset A covered
the full range 0-10%Gal, while Dataset B was the subset covering 0-2%Gal.
As expected, Model A performed best for Dataset A while Model B was
preferred for Dataset B, indicating that a multi-stage modeling approach could
provide the best accuracy and range. Variable selection influenced the PLSR
and SVR models only slightly for Dataset A but was required to achieve
optimal results for Dataset B. All four MVR approaches (MLR, RR, PLSR, and
SVR) performed equally well and were robust under optimal conditions.
However, SVR was slightly superior to the other three regression approaches
when building models with Dataset B.
The present study offers assistance in selecting the appropriate MVR
approach to predict the %Gal in heparin based on analysis of 1D 1H-NMR
data. The results demonstrate that the combination of 1H NMR spectroscopy
and chemometric techniques provides a rapid and efficient way to
quantitatively determine the galactosamine content in heparin. More
generally, the present study underscores the importance in choosing the
appropriate regression method, variable selection approach, and fitting
parameters to build highly predictive regression models.
5.2 Classification for Pure, Impure and Contaminated Heparin Samples
To develop robust classification models for rapid screening of heparin
samples with varying amounts of dermatan sulfate (DS) impurities and
oversulfated chondroitin sulfate (OSCS) contaminants, several multivariate
statistical approaches, i.e., PCA, PLS-DA, LDA, kNN, CART, ANN and SVM,
were employed in combination with 1H NMR spectroscopy, and their
performance was compared using three data sets based on different chemical
shift regions (1.95-2.20, 3.10-5.70 and 1.95-5.70 ppm). It is shown that these
chemometric methods are useful tools for the exploration and visualization of
heparin NMR spectral data, and for the generation of classification models
with outstanding performance attributes. The large number of original
variables was reduced by chemometric methods into a much smaller number
of new variables (PCs, or latent variables) for effective clustering and
classification. The degree of success of the classification models in
discriminating the samples of pure heparin from those containing the impurity
DS and the contaminant OSCS depended on the specific chemometric
procedures for choosing the appropriate variables.
The well-known unsupervised chemometric method of PCA was used to
explore the similarities and differences in the complex pattern of overlapping
1H NMR signals found in the heparin spectra. The PCA results showed that
the samples were separated into two distinct clusters for the Heparin vs
OSCS groups, but the distinction between Heparin and DS was less evident.
Excellent discrimination of the Heparin samples from those samples
containing impurities (DS) and contaminants (OSCS) was achieved with the
supervised method PLS-DA.
The predictive performance of the models obtained from PLS-DA and LDA
were outstanding in differentiating Heparin from DS and OSCS with very few
misclassifications. In all cases, better classification rates (fewer
misclassifications) were attained for Heparin vs OSCS models than for
Heparin vs DS models regardless of the clustering and classification
approach. Under optimal conditions, success rates of 100% were frequently
achieved for discrimination between Heparin and OSCS samples. This
outcome is plausible, in view of the much closer similarity in the 1H NMR
spectral patterns between Heparin and DS than between Heparin and OSCS.
CART is a simple but powerful technique for class discrimination. It is able
to select the most relevant explanatory variables from the dataset and derive
classification rules on the basis of the reduced set of variables, so the tree-
structured models are easy to interpret and understand. For heparin and its
derivatives, the characteristic N-acetyl methyl proton chemical shifts are
located in the 1.95-2.20 ppm region. Specifically 2.08 and 2.15 ppm, the
characteristic chemical shifts of DS and OSCS, were found to possess the
greatest discriminating power for Heparin vs DS and Heparin vs OSCS,
respectively. After excluding the N-acetyl region, the classification and
prediction rates in the local region 3.10-5.70 ppm were markedly poorer
owing to the lack of distinguishable characteristic peaks, implying
that the 2.0-2.2 ppm region plays an important role in discriminating heparin
from its impurities and contaminants for CART. Therefore, the variables or
chemical shifts selected in the CART analysis were interpretable and the
resulting trees were chemically justified.
As a widely applied learning approach, ANN is able to model highly
complex relationships with non-linear trends. Nevertheless, ANN modeling is
prone to overfitting and suffers difficulties with generalization since there are a
large number of model parameters to be optimized. By introducing the weight
decay in our study, the overfitting effect was greatly alleviated. The results
show that while the predictive performance of the ANN models on the test
set was comparable to that of the CART models for 1.95-2.20 ppm, it was
slightly superior to CART in the 3.10-5.70 ppm region.
SVM represents a more recent statistical learning technique and can model
complex non-linear boundaries through the use of adapted kernel functions. The
problem of overfitting can be effectively solved and remarkable generalization
performance can be achieved due to the high mapping power of the kernel,
resulting in a more highly predictive model. SVM can deal with high-
dimensional data with relatively few samples in the training set, and as a
consequence, no prior step of variable reduction is required. The SVM
algorithm does not provide the best solution automatically; model learning
requires optimization of the kernel parameter γ and the regularization
parameter C. The parameter tuning is a critical step, and the optimal values
are classically acquired by exhaustive search. In the present study, these
parameters were quite easy to tune for Heparin vs OSCS, but for Heparin vs
DS, great care was needed because Heparin and DS are difficult to
discriminate owing to the similarity of samples near the 1.0% DS boundary.
SVM outperformed all other approaches for discrimination of the Heparin and
DS samples, and gave the best classification results in all cases (Figure 31).
In addition, it was found that the predictive rates for both 1.95-2.20 and
3.10-5.70 ppm were very close to each other, indicating that even minute
structural differences between heparin and DS can be exploited by SVM for
effective discrimination.
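The exhaustive search over γ and C can be sketched with scikit-learn's grid search. This is an illustrative sketch rather than the study's actual tuning protocol, and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Exhaustive grid search over the RBF kernel parameter gamma and the
    regularization parameter C, scored by 5-fold cross-validation."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

For a hard boundary such as Heparin vs DS, a finer grid around the first optimum (and stratified folds) would typically be used in a second pass.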
Figure 31. Comparison of the classification performance of the six approaches: Heparin vs DS; Heparin vs OSCS; Heparin vs [DS + OSCS]; Heparin vs DS vs OSCS.
The validated Heparin vs DS vs OSCS model was challenged for
classification of the blend samples in which the heparin APIs were spiked with
native or partially/fully oversulfated chondroitin sulfate A (CSA), dermatan
sulfate (DS) and heparan sulfate (HS) at the 1.0, 5.0, or 10.0 weight percent
levels. Overall, the class assignment results on the blends were excellent,
although the three multivariate pattern recognition approaches are not
class-modeling techniques, which means that any object, even a clear
outlier, is always assigned to some class. We conclude that all of
the samples containing partially or fully oversulfated components, and the
potential GAG impurities, were readily distinguished from USP grade heparin
by the resulting models.
In summary, the present study reveals that 1H NMR spectroscopy, in
combination with multivariate chemometric methods, represents an effective
strategy for fast and reliable identification of impurities (DS) and contaminants
(OSCS) in heparin API samples. The pattern recognition approach applied
here may be useful in monitoring purity of other complex naturally derived
compounds.
5.3 Class Modeling Using SIMCA and UNEQ
In this work, two chemometric class-modeling techniques SIMCA and
UNEQ were employed to assess the quality of the heparin samples and to
perform pattern recognition among the various classes (pure heparin,
impurities and contaminants). Compared to pure classification techniques,
class-modeling approaches focus more on the analogies among the samples
from the same class than on the differences among the different classes;
hence, class modeling approaches allow us to explore the fundamental
details and individual characteristics of the classes. One of the advantages of
class modeling is that a sample can be recognized to be a member of one or
more classes, or none of the classes. The sensitivity, specificity and
prediction ability were computed as indicators of the quality of the models.
SIMCA can work on a small set of samples (as few as 10) per class and
places no restriction on the number of measurement variables. This is
especially important because in analytical measurements such as 1H NMR
the number of variables usually exceeds the number of samples. In contrast,
UNEQ requires variable reduction, since the number of samples per class
must be at least three times the number of variables in the model. The
computation of Wilks' lambda on the basis of stepwise linear
discriminant analysis (SLDA) enabled the selection of optimal subsets of
variables. The selected variables were useful for classification and
discrimination of the heparin samples by their origins. Depending on the
specific systems, the individual subsets of variables were different. The
subsets selected from two-class systems were more useful for discrimination
of the different classes.
For the heparin 1H NMR analytical data, significant differences were
observed between SIMCA and UNEQ analysis. The SIMCA models produced
excellent class separation between the Heparin and OSCS classes and
between the DS and OSCS classes, achieving nearly 100% specificity. On
the contrary, the UNEQ models produced excellent sensitivity but poor
specificity. Although the Heparin and DS classes rejected most of the OSCS
samples in UNEQ analysis, the OSCS class accepted a large number of
Heparin and DS samples, leading to extremely poor specificity. On the other
hand, the UNEQ models were significantly better in terms of sensitivity and
prediction ability: UNEQ exhibited sensitivities of 88%, 90% and 91% for
Heparin, DS and OSCS, compared to 78%, 74% and 85% from SIMCA.
The compositions of the blend samples, in which the heparin APIs were
spiked with non-, partially- or fully oversulfated chondroitin sulfate A (CSA),
dermatan sulfate (DS) and heparan sulfate (HS) at the 1.0, 5.0 and 10.0
weight percent levels, were highly diverse. These blend samples were
employed to challenge the Heparin, DS and OSCS class models. Overall, the
results obtained from SIMCA on the blends were excellent. The Heparin class
accepted pure heparin samples as well as some blends with low content (1%)
of GAGs, while the DS and OSCS classes accepted their respective GAG
blends. Importantly, some blends, such as OS-HS and OS-Hep, were
rejected by all the three class models. We conclude that all of the samples
containing partially or fully oversulfated components, and the potential GAG
impurities, were readily distinguished from USP grade heparin by the SIMCA
class models. The poor specificity (SPEC) of corresponding UNEQ class
models led to subpar performance metrics for the blend samples and,
therefore, were omitted here from detailed analysis.
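The way a class model accepts or rejects challenge samples can be illustrated with a minimal SIMCA-style sketch in Python: a PCA model is fitted to one class, and a sample is accepted when its orthogonal (residual) distance to the class subspace falls below a threshold. The toy data, component count, and fixed threshold below are illustrative assumptions only; a real SIMCA implementation derives its critical distance from an F-statistic on the residuals.

```python
import numpy as np

def fit_class_model(X, n_components=2):
    """Fit a per-class PCA model: class mean plus leading principal directions."""
    mean = X.mean(axis=0)
    # SVD of the mean-centred class data; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def residual_distance(x, mean, components):
    """Orthogonal distance of a sample to the class PCA subspace."""
    centred = x - mean
    projected = components.T @ (components @ centred)
    return float(np.linalg.norm(centred - projected))

def accepts(x, model, threshold):
    """Accept the sample into the class if its residual distance is small."""
    mean, components = model
    return residual_distance(x, mean, components) < threshold

rng = np.random.default_rng(0)
X_class = rng.normal(0.0, 0.1, size=(30, 10))   # toy "spectra" for one class
model = fit_class_model(X_class, n_components=2)
threshold = 0.5  # illustrative; real SIMCA derives it from an F-test on residuals

in_class = X_class[0]
foreign = X_class[0] + 2.0  # grossly shifted sample, e.g. a contaminated blend
print(accepts(in_class, model, threshold))
print(accepts(foreign, model, threshold))
```

A blend sample, like the OS-HS and OS-Hep blends above, may be rejected by every class model when its residual distance exceeds each class's threshold.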
According to USP specifications, the acceptance criterion for OSCS
content in heparin API and finished dose products is 0%. Although there are
no criteria for crude heparin products, it is desirable to use robust and
validated methods to identify and screen lots before they are fully processed.
The present study demonstrates that pattern recognition techniques such as SIMCA and UNEQ are useful tools for discriminating between pure and impure heparin samples. The results reported here show that, through the employment of these two chemometric class-modeling techniques, it is possible to assess the quality of heparin samples.
In class modeling, it is important to consider the compromise between sensitivity and specificity. Although it is desirable for a class model to accept more samples of its own class and thereby achieve higher sensitivity, the model should not also accept too many samples from foreign classes, or its specificity will decline. In the present study, the specificities were higher for the SIMCA models while the sensitivities were greater for the UNEQ models. The UNEQ approach was better than SIMCA at differentiating good-quality heparin from impure or contaminated samples, whereas SIMCA performed better at distinguishing samples with high levels of DS from good-quality heparin.
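The sensitivity and specificity figures discussed above follow the standard class-modeling definitions; a minimal Python sketch (with hypothetical accept/reject decisions, not the study's data) makes the bookkeeping explicit:

```python
def sensitivity_specificity(accepted, is_own_class):
    """Class-model performance from accept/reject decisions.

    accepted     -- bools: did the class model accept each sample?
    is_own_class -- bools: does each sample truly belong to the class?
    """
    pairs = list(zip(accepted, is_own_class))
    tp = sum(a and o for a, o in pairs)            # own samples accepted
    fn = sum(not a and o for a, o in pairs)        # own samples rejected
    tn = sum(not a and not o for a, o in pairs)    # foreign samples rejected
    fp = sum(a and not o for a, o in pairs)        # foreign samples accepted
    return tp / (tp + fn), tn / (tn + fp)          # (sensitivity, specificity)

# Hypothetical decisions for a single class model: 4 own samples, 4 foreign.
accepted = [True, True, True, False, False, True, False, False]
own      = [True, True, True, True,  False, False, False, False]
sens, spec = sensitivity_specificity(accepted, own)
print(sens, spec)  # 0.75 0.75
```

Raising the acceptance threshold of a class model trades one metric against the other, which is exactly the SIMCA/UNEQ compromise described above.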
Chapter VI
FUTURE DIRECTIONS FOR RESEARCH
Besides the proton NMR spectral data, the FDA also provided us with
strong-anion-exchange high-performance liquid chromatography (SAX-HPLC)
and near infrared (NIR) spectral data for a set of heparin samples obtained
from several foreign and domestic manufacturers.
During the heparin crisis, new tests and specifications were developed by
the US FDA and the USP in order to detect the contaminant as well as to
improve assurance of quality and purity of the drug product. In 2009, a new
USP monograph was put in place that included 1H NMR, a SAX-HPLC test,
and a measurement of the weight percent of galactosamine in total hexosamine (%Gal), assays that are orthogonal to each other [14, 15, 26, 58]. While the 1H
NMR spectra are primarily used to identify the presence or absence of
possible impurities or contaminants in heparin, SAX-HPLC data have
sufficiently resolved signals for the GAGs and have been used to quantify the
levels of DS or OSCS because the HPLC method is more sensitive and
robust for measurement of these GAGs in heparin. Figure 32 shows the
overlaid chromatograms of a heparin API spiked with CSB or OS-CSB at the
1.0%, 5.0% or 10.0% level. The CSB, heparin, and OS-CSB components
elute at 16.2, 20.4 and 22.5 min, respectively.
Figure 32. Overlaid plots of the 10–30 min portion of SAX-HPLC chromatograms derived from injections of a heparin API alone or spiked with 1.0%, 5.0% or 10.0% CSB and the same heparin API alone or spiked with 1.0%, 5.0% or 10.0% OS-CSB.
NIR spectroscopy covers the transition from the visible to the mid-infrared region, with wavelengths of approximately 780–2500 nm (wavenumbers of 12821–4000 cm-1), where the absorption results from overtones or combinations of the fundamental mid-infrared bands. The stretching vibrations of functional groups containing –CH, –OH, –SH and –NH bonds are observed in NIR spectra [25, 69, 73, 150]. As a rapid and non-destructive analytical method, the NIR technique can provide a fingerprint for drug products and has been successfully applied in the pharmaceutical industry. Many hydrogen-bonding groups are present in heparin molecules, and hence NIR spectra contain information about the chemical and physical properties of heparin. As shown
in Figure 33, absorption bands at 5200 and 6900 cm-1 are consistent with Raman spectroscopic studies. Heparin displays an irregular peak at 4730 cm-1 and a
shoulder at 6500 cm-1 which distinguish it from dermatan sulfate. OSCS has
two small peaks in the region around 4730 cm-1, another peak at 5800 cm-1
and a third peak at 7000 cm-1. The presence of OSCS as a contaminant in
heparin is expected to shift the large heparin peak at 6900 cm-1 to higher
energy.
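The wavelength/wavenumber correspondence quoted above follows from the relation wavenumber (cm-1) = 10^7 / wavelength (nm); a quick check in Python:

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert wavelength (nm) to wavenumber (cm^-1); 1 cm = 1e7 nm."""
    return 1e7 / wavelength_nm

# End points of the NIR region: 12821 cm^-1 corresponds to 780 nm.
print(round(nm_to_wavenumber(780)))   # 12821
print(round(nm_to_wavenumber(2500)))  # 4000
```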
Figure 33. Near infrared spectra of 108 heparin samples that contain DS impurities and OSCS contaminants.
In future research, the use of 1H NMR, SAX-HPLC and NIR data in combination with multivariate chemometric approaches is proposed for the following qualitative and quantitative analyses:
(1) Classification of samples for discriminating pure heparin, impurities
and contaminants according to the SAX-HPLC chromatographic data to
qualify raw materials and to control final products;
(2) Pattern recognition investigation for the three major components
heparin, DS and OSCS based on the NIR spectral data to demonstrate
the feasibility of NIR to identify contamination of heparin;
(3) Quantification of both DS and OSCS compositions in heparin sodium
by the specific signals in 1H NMR spectra coupled with multivariate
regression methods;
(4) Establishing calibration models by correlating NIR spectra of individual
heparin samples with the DS and OSCS content determined by SAX-
HPLC.
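As a sketch of the calibration idea behind items (3) and (4), the snippet below fits an ordinary least-squares model relating a few illustrative signal intensities to an analyte level. The data are synthetic stand-ins, and in practice methods such as PLS or ridge regression (as used elsewhere in this work) would be preferred for the many collinear variables of real spectra.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic calibration set: 20 samples, 3 "signal intensity" variables whose
# true relation to the analyte level (e.g. weight-% DS) is linear plus noise.
true_coef = np.array([0.8, -0.3, 0.5])
X = rng.uniform(0.0, 1.0, size=(20, 3))
y = X @ true_coef + 0.2 + rng.normal(0.0, 0.01, size=20)

# Ordinary least squares with an explicit intercept column.
A = np.hstack([np.ones((20, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the analyte level for a new "sample" (intercept + three intensities).
x_new = np.array([1.0, 0.5, 0.4, 0.1])
prediction = float(x_new @ coef)
print(prediction)  # close to 0.2 + 0.8*0.5 - 0.3*0.4 + 0.5*0.1 = 0.53
```

The same fit-then-predict pattern applies when the predictors are SAX-HPLC peak areas or NIR absorbances and the response is the DS or OSCS content.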
References
[1] Ampofo SA, Wang HM, Linhardt RJ. Disaccharide compositional analysis of heparin and heparan sulfate using capillary zone electrophoresis. Analytical Biochemistry. 1991, 199:249-255.
[2] Rabenstein DL. Heparin and heparan sulfate: structure and function.
Natural Product Reports. 2002, 19:312-331.
[3] Casu B. Heparin structure. Haemostasis. 1990, 20:62-73.
[4] Sudo M, Sato K, Chaidedgumjorn A, Toyoda H, Toida T, Imanari T. 1H nuclear magnetic resonance spectroscopic analysis for determination of glucuronic and iduronic acids in dermatan sulfate, heparin, and heparan sulfate. Analytical Biochemistry. 2001, 297:42-51.
[5] Linhardt RJ. Heparin: An important drug enters its seventh decade. Chemistry and Industry. 1991, 2:45-50.
[6] Lepor NE. Anticoagulation for acute coronary syndromes: from heparin to direct thrombin inhibitors. Reviews in Cardiovascular Medicine. 2007, 8 (suppl. 3):S9-S17.
[7] Fischer KG. Essentials of anticoagulation in hemodialysis. Hemodialysis International. 2007, 11:178-189.
[8] Maruyama T, Toida T, Imanari T, Yu G, Linhardt RJ. Conformational changes and anticoagulant activity of chondroitin sulfate following its O-sulfonation. Carbohydrate Research. 1998, 306:35-43.
[9] Guerrini M, Bisio A, Torri G. Combined quantitative 1H and 13C nuclear magnetic resonance spectroscopy for characterization of heparin preparations. Seminars in Thrombosis and Hemostasis. 2001, 27:473-482.
[10] Toida T, Maruyama T, Ogita Y, Suzuki A, Toyoda H, Imanari T, Linhardt RJ. Preparation and anticoagulant activity of fully O-sulphonated glycosaminoglycans. International Journal of Biological Macromolecules. 1999, 26:233-241.
[11] Griffin CC, Linhardt RJ, Van Gorp CL, Toida T, Hileman RE, Schubert RL, Brown SE. Isolation and characterization of heparan sulfate from crude porcine intestinal mucosal peptidoglycan heparin. Carbohydrate Research. 1995, 276:183-197.
[12] Pervin A, Gallo C, Jandik KA, Han XJ, Linhardt RJ. Preparation and
structural characterization of large heparin-derived oligosaccharides. Glycobiology. 1995, 5:83-95.
[13] Guerrini M, Zhang Z, Shriver Z, Naggi A, Masuko S, Langer R, Casu B,
Linhardt RJ, Torri G, Sasisekharan R. Orthogonal analytical approaches to detect potential contaminants in heparin. Proceedings of the National Academy of Sciences. 2009, 106(40):16956-16961.
[14] Keire DA, Trehy ML, Reepmeyer JC, Kolinski RE, Ye W, Dunn J,
Westenberger BJ, Buhse LF. Analysis of crude heparin by 1H NMR, capillary electrophoresis, and strong-anion-exchange-HPLC for contamination by over sulfated chondroitin sulfate. Journal of Pharmaceutical and Biomedical Analysis. 2010, 51:921-926.
[15] Keire DA, Mans DJ, Ye H, Kolinski RE, Buhse LF. Assay of possible economically motivated additives or native impurities levels in heparin by 1H NMR, SAX-HPLC, and anticoagulation time approaches. Journal of Pharmaceutical and Biomedical Analysis. 2010, 52:656-664.
[16] Zang Q, Keire DA, Wood RD, Buhse LF, Moore CMV, Nasr M, Al-Hakim A, Trehy ML, Welsh WJ. Determination of Galactosamine impurities in Heparin samples by multivariate regression analysis of their 1H NMR spectra. Analytical and Bioanalytical Chemistry. 2011, 399(2):635-649.
[17] Beyer T, Matz M, Brinz D, Rädler O, Wolf B, Norwig J, Baumann K, Alban S, Holzgrabe U. Composition of OSCS-contaminated heparin occurring in 2008 in batches on the German market. European Journal of Pharmaceutical Sciences. 2010, 40:297-304.
[18] Korir AK, Larive CK. Advances in the separation, sensitive detection,
and characterization of heparin and heparan sulfate. Analytical and Bioanalytical Chemistry. 2009, 393:155-169.
[19] Casu B, Guerrini M, Naggi A, Torri G, De-Ambrosi L, Boveri G, Gonella
S, Cedro A, Ferró L, Lanzarotti E et al. Characterization of sulfation patterns of beef and pig mucosal heparins by nuclear magnetic resonance spectroscopy. Arzneimittelforschung. 1996, 46:472-477.
[20] Eldridge SL, Korir AK, Gutierrez SM, Campos F, Limtiaco JFK, Larive CK. Heterogeneity of depolymerized heparin SEC fractions: to pool or not to pool? Carbohydrate Research. 2008, 343:2963-2970.
[21] Guerrini M, Beccati D, Shriver Z, Naggi A, Viswanathan K, Bisio A,
Capila I, Lansing JC, Guglieri S, Fraser B et al. Oversulfated chondroitin sulfate is a contaminant in heparin associated with adverse clinical events. Nature Biotechnology. 2008, 26(6):669-675.
[22] Kishimoto TK, Viswanathan K, Ganguly T, Elankumaran S, Smith S, Pelzer K, Lansing JC, Sriranganathan N, Zhao G, Galcheva-Gargova Z et al. Contaminated heparin associated with adverse clinical events and activation of the contact system. The New England Journal of Medicine. 2008, 358:2457-2467.
[23] McMahon W, Pratt RG, Hammad TA, Kozlowski S, Zhou E, Lu S,
Kulick CG, Mallick T, Pan GD. Pharmacoepidemiology and Drug Safety. 2010, 19:921-933.
[24] Tami C, Puig M, Reepmeyer JC, Ye H, D'Avignon DA, Buhse L, Verthelyi D. Inhibition of Taq polymerase as a method for screening heparin for oversulfated contaminants. Biomaterials. 2008, 29:4808-4814.
[25] Spencer JA, Kauffman JF, Reepmeyer JC, Gryniewicz CM, Ye W, Toler DY, Buhse LF, Westenberger BJ. Screening of heparin API by near infrared reflectance and Raman spectroscopy. Journal of Pharmaceutical Sciences. 2009, 98(10):3540-3547.
[26] Trehy ML, Reepmeyer JC, Kolinski RE, Westenberger BJ, Buhse LF. Analysis of heparin sodium by SAX/HPLC for contaminants and impurities. Journal of Pharmaceutical and Biomedical Analysis. 2009, 49:670-673.
[27] Wielgos T, Havel K, Ivanova N, Weinberger R. Determination of impurities in heparin by capillary electrophoresis using high molarity phosphate buffers. Journal of Pharmaceutical and Biomedical Analysis. 2009, 49:319-326.
[28] Jagt RBC, Gómez-Biagi RF, Nitz M. Pattern-based recognition of heparin contaminants by an array of self-assembling fluorescent receptors. Angewandte Chemie, International Edition. 2009, 48:1995-1997.
[29] McEwen I, Rundlöf T, Ek M, Hakkarainen B, Carlin G, Arvidsson T. Effect of Ca2+ on the 1H NMR chemical shift of the methyl signal of oversulphated chondroitin sulphate, a contaminant in heparin. Journal of Pharmaceutical and Biomedical Analysis. 2009, 49:816-819.
[30] Beyer T, Diehl B, Randel G, Humpfer E, Schäfer H, Spraul M, Schollmayer C, Holzgrabe U. Quality assessment of unfractionated heparin using 1H nuclear magnetic resonance spectroscopy. Journal of Pharmaceutical and Biomedical Analysis. 2008, 48:13-19.
[31] Zhang Z, Weïwer M, Li B, Kemp MM, Daman TH, Linhardt RJ. Oversulfated chondroitin sulfate: impact of a heparin impurity, associated with adverse clinical events, on low-molecular-weight heparin preparation. Journal of Medicinal Chemistry. 2008, 51(18):5498-5501.
[32] Bigler P, Brenneisen R. Improved impurity fingerprinting of heparin by high resolution 1H NMR Spectroscopy. Journal of Pharmaceutical and Biomedical Analysis. 2009, 49:1060-1064.
[33] Sitkowski J, Bednarek E, Bocian W, Kozerski L. Assessment of oversulfated chondroitin sulfate in low molecular weight and unfractionated heparins by diffusion-ordered nuclear magnetic resonance spectroscopy methods. Journal of Medicinal Chemistry. 2008, 51:7663-7665.
[34] King JT, Desai UR. A capillary electrophoretic method for fingerprinting low molecular weight heparins. Analytical Biochemistry. 2008, 380:229-234.
[35] Domanig R, Jöbstl W, Gruber S, Freudemann T. One-dimensional cellulose acetate plate electrophoresis - A feasible method for analysis of dermatan sulfate and other glycosaminoglycan impurities in pharmaceutical heparin. Journal of Pharmaceutical and Biomedical Analysis. 2009, 49:151-155.
[36] Varmuza K, Filzmoser P. Introduction to Multivariate Statistical Analysis in Chemometrics. Boca Raton: CRC Press; 2009.
[37] Welsh WJ, Lin W, Tersigni SH, Collantes E, Duta R, Carey MS, Zielinski WL, Brower J, Spencer JA, Layloff TP. Pharmaceutical Fingerprinting: evaluation of neural networks and chemometric techniques for distinguishing among same-product manufacturers. Analytical Chemistry. 1996, 68(19):3473-3482.
[38] Tetko IV, Villa AEP, Aksenova TI, Zielinski WL, Brower J, Collantes ER, Welsh WJ. Application of a pruning algorithm to optimize artificial neural networks for pharmaceutical fingerprinting. Journal of Chemical Information and Computer Sciences. 1998, 38(4):660-668.
[39] Berrueta LA, Alonso-Salces RM, Héberger K. Supervised pattern
recognition in food analysis. Journal of Chromatography A. 2007, 1158:196-214.
[40] Rudd TR, Skidmore MA, Guimond SE, Cosentino C, Torri G, Fernig
DG, Lauder RM, Guerrini M, Yates EA. Glycosaminoglycan origin and structure revealed by multivariate analysis of NMR and CD spectra. Glycobiology. 2009, 19(1):52-67.
[41] Constantinou MA, Papakonstantinou E, Spraul M, Sevastiadou S, Costalos C, Koupparis MA, Shulpis K, Tsantili-Kakoulidou A, Mikros E. 1H NMR-based metabonomics for the diagnosis of inborn errors of metabolism in urine. Analytica Chimica Acta. 2005, 542:169-177.
[42] Keun HC, Ebbels TMD, Antti H, Bollard ME, Beckonert O, Holmes E,
Lindon JC, Nicholson JK. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta. 2003, 490:265-276.
[43] Bailey NJC, Wang Y, Sampson J, Davis W, Whitcombe I, Hylands PJ,
Croft SL, Holmes E. Prediction of anti-plasmodial activity of Artemisia annua extracts: application of 1H NMR spectroscopy and chemometrics. Journal of Pharmaceutical and Biomedical Analysis. 2004, 35:117-126.
[44] Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S,
Puignou L. Potentiality of proton nuclear magnetic resonance and multivariate calibration methods for the determination of dermatan sulfate contamination in heparin samples. Analyst. 2000, 125:933-938.
[45] Ruiz-Calero V, Saurina J, Hernández-Cassou S, Galceran MT, Puignou L. Proton nuclear magnetic resonance characterization of glycosaminoglycans using chemometric techniques. Analyst. 2002, 127:407-415.
[46] Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S, Puignou L. Estimation of the composition of heparin mixtures from various origins using proton nuclear magnetic resonance and multivariate calibration methods. Analytical and Bioanalytical Chemistry. 2002, 373:259-265.
[47] Holmes E, Antti H. Chemometric contributions to the evolution of
metabonomics: mathematical solutions to characterising and interpreting complex biological NMR spectra. Analyst. 2002, 127:1549-1557.
[48] Waters NJ, Holmes E, Waterfield CJ, Farrant RD, Nicholson JK. NMR
and pattern recognition studies on liver extracts and intact livers from rats treated with α-naphthylisothiocyanate. Biochemical Pharmacology. 2002, 64:67-77.
[49] Brereton RG. Chemometrics for pattern recognition. West Sussex: A
John Wiley and Sons, Ltd.; 2009.
[50] El-Abassy RM, Donfack P, Materny A. Visible Raman spectroscopy for the discrimination of olive oils from different vegetable oils and the detection of adulteration. Journal of Raman Spectroscopy. 2009, 40:1284-1289.
[51] Gurdeniz G, Ozen B. Detection of adulteration of extra-virgin olive oil
by chemometric analysis of mid-infrared spectral data. Food Chemistry. 2009, 116:519-525.
[52] Reid LM, O'Donnell CP, Downey G. Potential of SPME-GC and
chemometrics to detect adulteration of soft fruit purées. Journal of Agricultural Food Chemistry. 2004, 52:421-427.
[53] de Veij M, Vandenabeele P, Hall KA, Fernandez FM, Green MD, White
NJ, Dondorp AM, Newton PN, Moens L. Fast detection and identification of counterfeit antimalarial tablets by Raman spectroscopy. Journal of Raman Spectroscopy. 2007, 38:181-187.
[54] de Veij M, Deneckere A, Vandenabeele P, de Kaste D, Moens L.
Detection of counterfeit Viagra with Raman spectroscopy. Journal of Pharmaceutical and Biomedical Analysis. 2008, 46:303-309.
[55] Storme-Paris I, Rebiere H, Matoga M, Civade C, Bonnet PA, Tissier
MH, Chaminade P. Challenging near infrared spectroscopy discriminating ability for counterfeit pharmaceuticals detection. Analytica Chimica Acta. 2010, 658:163-174.
[56] Zhang Z, Li B, Suwan J, Zhang F, Wang Z, Liu H, Mulloy B, Linhardt RJ. Analysis of pharmaceutical heparins and potential contaminants using 1H-NMR and PAGE. Journal of Pharmaceutical Sciences. 2009, 98(11):4017-4026.
[57] Rudd TR, Guimond SE, Skidmore MA, Duchesne L, Guerrini M, Torri
G, Cosentino C, Brown A, Clarke DT, Turnbull JE, Fernig DG, Yates EA. Influence of substitution pattern and cation binding on conformation and activity in heparin derivatives. Glycobiology. 2007, 17(9):983-993.
[58] Perlin AS, Sauriol F, Cooper B, Folkman J. Dermatan sulfate in
pharmaceutical heparins. Thrombosis and Haemostasis. 1987, 58:792-793.
[59] Alban S, Lühn S, Schiemann S, Beyer T, Norwig J, Schilling C, Rädler
O, Wolf B, Matz M, Baumann K, Holzgrabe U. Comparison of established and novel purity tests for the quality control of heparin by means of a set of 177 heparin samples. Analytical and Bioanalytical Chemistry. 2011, 399(2):605-620.
[60] Keire DA, Ye H, Trehy ML, Ye W, Kolinski RE, Westenberger BJ,
Buhse LF, Nasr M, Al-Hakim A. Characterization of currently marketed heparin products: key tests for quality assurance. Analytical and Bioanalytical Chemistry. 2011, 399(2):581-591.
[61] Laurencin CT, Nair L. The FDA and safety – beyond the heparin crisis.
Nature Biotechnology. 2008, 26(6):621-623.
[62] Guerrini M, Shriver Z, Bisio A, Naggi A, Casu B, Sasisekharan R, Torri G. The tainted heparin story: an update. Thrombosis Haemostasis. 2009, 102:907-911.
[63] Beni S, Limtiaco JFK, Larive CK. Analysis and characterization of
heparin impurities. Analytical and Bioanalytical Chemistry. 2011, 399(2):527-539.
[64] Brustkern AM, Buhse LF, Nasr M, Al-Hakim A, Keire DA.
Characterization of currently marketed heparin products: reversed-phase ion-pairing liquid chromatography mass spectrometry of heparin digests. Analytical Chemistry. 2010, 82:9865-9870.
[65] Limtiaco JF, Jones CJ, Larive CK. Characterization of heparin impurities with HPLC-NMR using weak anion exchange chromatography. Analytical Chemistry. 2009, 81:10116-10123.
[66] Üstün B, Sanders KB, Dani P, Kellenbach ER. Quantification of
chondroitin sulfate and dermatan sulfate in danaparoid sodium by 1H NMR spectroscopy and PLS regression. Analytical and Bioanalytical Chemistry. 2011, 399:629-634.
[67] McEwen I, Mulloy B, Hellwig E, Kozerski L, Beyer T, Holzgrabe U,
Rodomonte A, Wanko R, Spieser JM. Determination of oversulphated chondroitin sulphate and dermatan sulphate in unfractionated heparin by 1H NMR. Pharmeuropa Bio, the Biological Standardisation Programme. 2008, 1:31-39.
[68] Mutihac L, Mutihac R. Mining in chemometrics. Analytica Chimica
Acta. 2008, 612:1-18.
[69] Roggo Y, Chalus P, Maurer L, Lema-Martinez C, Edmond A, Jent N. A review of near infrared spectroscopy and chemometrics in pharmaceutical technologies. Journal of Pharmaceutical and Biomedical Analysis. 2007, 44:683-700.
[70] Estienne F, Massart DL, Zanier-Szydlowski N, Marteau P. Multivariate
calibration with Raman spectroscopic data: a case study. Analytica Chimica Acta. 2000, 424:185-201.
[71] Leardi R. Genetic algorithms in chemometrics and chemistry: a review.
Journal of Chemometrics. 2001, 15:559-569.
[72] Jouan-Rimbaud D, Massart D, Leardi R, De Noord OE. Genetic algorithms as a tool for wavelength selection in multivariate calibration. Analytical Chemistry. 1995, 67:4295-4301.
[73] Liebmann B, Friedl A, Varmuza K. Determination of glucose and
ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Analytica Chimica Acta. 2009, 642:171-178.
[74] Carneiro RL, Braga JWB, Bottoli CBG, Poppi RJ. Application of genetic
algorithm for selection of variables for the BLLS method applied to determination of pesticides and metabolites in wine. Analytica Chimica Acta. 2007, 595:51-58.
[75] Gourvénec S, Capron X, Massart DL. Genetic algorithms (GA) applied to the orthogonal projection approach (OPA) for variable selection. Analytica Chimica Acta. 2004, 519:11-21.
[76] Forshed J, Schuppe-Koistinen I, Jacobsson SP. Peak alignment of
NMR signals by means of a genetic algorithm. Analytica Chimica Acta. 2003, 487:189-199.
[77] Üstün B, Melssen WJ, Oudenhuijzen M, Buydens LMC. Determination
of optimal support vector regression parameters by genetic algorithms and simplex optimization. Analytica Chimica Acta. 2005, 544:292-305.
[78] Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB. Genetic
algorithms as a method for variable selection in multiple linear regression and partial least squares regression with applications to pyrolysis mass spectrometry. Analytica Chimica Acta. 1997, 348:71-86.
[79] Forina M, Oliveri P, Lanteri S, Casale M. Class-modeling techniques
classic and new for old and new problems. Chemometrics and Intelligent Laboratory Systems. 2008, 93:132-148.
[80] Marini F, Magri AL, Balestrieri F, Fabretti F, Marini D. Supervised
pattern recognition applied to the discrimination of the floral origin of six types of Italian honey samples. Analytica Chimica Acta. 2004, 515:117-125.
[81] Pérez-Magariño S, Ortega-Heras M, González-San José ML, Boger Z.
Comparative study of artificial neural network and multivariate methods to classify Spanish DO rose wines. Talanta. 2004, 62:983-990.
[82] Huang J, Brennan D, Sattler L, Alderman J, Lane B, O'Mathuna C. A
comparison of calibration methods based on calibration data size and robustness. Chemometrics and Intelligent Laboratory Systems. 2002, 62:25-35.
[83] Czekaj T, Wu W, Walczak B. About kernel latent variable approaches
and SVM. Journal of Chemometrics. 2005, 19:341-354.
[84] Tistaert C, Dejaegher B, Nguyen Hoai N, Chataigné G, Riviere C, Nguyen Thi Hong V, Chau Van M, Quetin-Leclercq J, Vander Heyden Y. Potential antioxidant compounds in Mallotus species fingerprints. Part I: Indication using linear multivariate calibration techniques. Analytica Chimica Acta. 2009, 649:24-32.
[85] Liu H, Zhang R, Yao X, Liu M, Hu Z, Fan B. Prediction of electrophoretic mobility of substituted aromatic acids in different aqueous-alcoholic solvents by capillary zone electrophoresis based on support vector machine. Analytica Chimica Acta. 2004, 525:31-41.
[86] Vapnik V. The Nature of Statistical Learning Theory. New York:
Springer-Verlag; 1995.
[87] Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.
[88] Li H, Liang Y, Xu Q. Support vector machines and its applications in
chemistry. Chemometrics and Intelligent Laboratory Systems. 2009, 95:188-198.
[89] Thissen U, Pepers M, Üstün B, Melssen WJ, Buydens LMC.
Comparing support vector machines to PLS for spectral regression applications. Chemometrics and Intelligent Laboratory Systems. 2004, 73:169-179.
[90] Pan Y, Jiang J, Wang R, Cao H. Advantages of support vector
machine in QSPR studies for predicting auto-ignition temperatures of organic compounds. Chemometrics and Intelligent Laboratory Systems. 2008, 92:169-178.
[91] Collantes ER, Duta R, Welsh WJ, Zielinski WL, Brower J.
Preprocessing of HPLC trace impurity patterns by wavelet packets for pharmaceutical fingerprinting using artificial neural networks. Analytical Chemistry. 1997, 69(7):1392-1397.
[92] Zielinski WL, Brower JF, Welsh WJ, Collantes E, Layloff TP. A strategy
for developing consistent HPLC data for assessing sameness and difference in consistency of pharmaceutical products. American Pharmaceutical Review. 1998, 1:44-54.
[93] Marini F, Bucci R, Magrì AL, Magrì AD. Authentication of Italian CDO
wines by class-modeling techniques. Chemometrics and Intelligent Laboratory Systems. 2006, 84:164-171.
[94] Forina M, Oliveri P, Casale M, Lanteri S. Multivariate range modeling,
a new technique for multivariate class modeling: The uncertainty of the estimates of sensitivity and specificity. Analytica Chimica Acta. 2008, 622:85-93.
[95] Sáiz-Abajo MJ, González-Sáiz JM, Pizarro C. Near infrared spectroscopy and pattern recognition methods applied to the classification of vinegar according to raw material and elaboration process. Journal of Near Infrared Spectroscopy. 2004, 12:207-219.
[96] Casale M, Armanino C, Casolino C, Forina M. Combining information
from headspace mass spectrometry and visible spectroscopy in the classification of the Ligurian olive oils. Analytica Chimica Acta. 2007, 589:89-95.
[97] Meléndez ME, Sánchez MS, Íñiguez M, Sarabia LA, Ortiz MC.
Psychophysical parameters of color and the chemometric characterization of wines of the certified denomination of origin 'Rioja'. Analytica Chimica Acta. 2001, 446:159-169.
[98] Sáiz-Abajo M, González-Sáiz J, Pizarro C. Classification of wine and
alcohol vinegar samples based on near-infrared spectroscopy: feasibility study on the detection of adulterated vinegar samples. Journal of Agricultural and Food Chemistry. 2004, 52:7711-7719.
[99] Marini F, Magrì AL, Bucci R, Balestrieri F, Marini D. Class-modeling techniques in the authentication of Italian oils from Sicily with a Protected Denomination of Origin (PDO). Chemometrics and Intelligent Laboratory Systems. 2006, 80:140-149.
[100] Nicholson JK, Connelly J, Lindon JC, Holmes E. Metabonomics: a
platform for studying drug toxicity and gene function. Nature Reviews Drug Discovery. 2002, 1:153-161.
[101] Ramadan Z, Jacobs D, Grigorov M, Kochhar S. Metabolic profiling
using principal component analysis, discriminant partial least squares, and genetic algorithms. Talanta. 2006, 68:1683-1691.
[102] Foot M, Mulholland M. Classification of chondroitin sulfate A,
chondroitin sulfate C, glucosamine hydrochloride and glucosamine 6 sulfate using chemometric techniques. Journal of Pharmaceutical and Biomedical Analysis. 2005, 38:397-407.
[103] Rezzi S, Axelson DE, Héberger K, Reniero F, Mariani C, Guillou C.
Classification of olive oils using high throughput flow 1H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks. Analytica Chimica Acta. 2005, 552:13-24.
[104] Kemsley EK. Discriminant analysis of high-dimensional data: a comparison of principal components analysis and partial least squares data reduction methods. Chemometrics and Intelligent Laboratory Systems. 1996, 33:47-61.
[105] Eriksson L, Antti H, Gottfries J, Holmes E, Johansson E, Lindgren F,
Long I, Lundstedt T, Trygg J, Wold S. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics. Analytical and Bioanalytical Chemistry. 2004, 380:419-429.
[106] Whelehan OP, Earll ME, Johansson E, Toft M, Eriksson L. Detection of
ovarian cancer using chemometric analysis of proteomic profiles. Chemometrics and Intelligent Laboratory Systems. 2006, 84:82-87.
[107] Zhou J, Xu B, Huang J, Jia X, Xue J, Shi X, Xiao L, Li W. 1H-NMR-
based metabonomic and pattern recognition analysis for detection of oral squamous cell carcinoma. Clinica Chimica Acta. 2009, 401:8-13.
[108] Chevallier S, Bertrand D, Kohler A, Courcoux P. Application of PLS-DA
in multivariate image analysis. Journal of Chemometrics. 2006, 20:221-229.
[109] Ballabio D, Skov T, Leardi R, Bro R. Classification of GC-MS
measurements of wines by combining data dimension reduction and variable selection techniques. Journal of Chemometrics. 2008, 22:457-463.
[110] Pereira GE, Gaudillere JP, van Leeuwen C, Hilbert G, Maucourt M,
Deborde C, Moing A, Rolin D. 1H NMR metabolite fingerprints of grape berry: Comparison of vintage and soil effects in Bordeaux grapevine growing areas. Analytica Chimica Acta. 2006, 563:346-352.
[111] Domingo C, Arcis RW, Osorio E, Toledao M, Saurina J. Principal
component analysis and cluster analysis for the characterization of dental composites. Analyst. 2000, 125:2044-2048.
[112] Beckonert O, Bollard ME, Ebbels TMD, Keun HC, Antti H, Holmes E,
Lindon JC, Nicholson JK. NMR-based metabonomic toxicity classification: hierarchical cluster analysis and k-nearest-neighbor approaches. Analytica Chimica Acta. 2003, 490:3-15.
[113] Sikorska E, Gorecki T, Khmelinskii IV, Sikorski M, Koziol J. Classification of edible oils using synchronous scanning fluorescence spectroscopy. Food Chemistry. 2005, 89:217-225.
[114] Caetano S, Aires-de-Sousa J, Daszykowski M, Vander Heyden Y.
Prediction of enantioselectivity using chirality codes and classification and regression trees. Analytica Chimica Acta. 2005, 544:315-326.
[115] Questier F, Put R, Coomans D, Walczak B, Vander Heyden Y. The use
of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems. 2005, 76:45-54.
[116] Deconinck E, Hancock T, Coomans D, Massart DL, Vander Heyden Y.
Classification of drugs in absorption classes using the classification and regression trees (CART) methodology. Journal of Pharmaceutical and Biomedical Analysis. 2005, 39:91-103.
[117] Caetano S, Üstun B, Hennessy S, Smeyers-Verbeke J, Melssen W,
Downey G, Buydens L, Heyden YV. Geographical classification of olive oils by the application of CART and SVM to their FT-IR. Journal of Chemometrics. 2007, 21:324-334.
[118] Marini F. Artificial neural networks in foodstuff analyses: trends and perspectives, a review. Analytica Chimica Acta. 2009, 635:121-131.
[119] Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. Journal of Pharmaceutical and Biomedical Analysis. 2000, 22:717-727.
[120] Ginoris YP, Amaral AL, Nicolau A, Coelho MAZ, Ferreira EC.
Recognition of protozoa and metazoa using image analysis tools, discriminant analysis, neural networks and decision trees. Analytica Chimica Acta, 2007. 595:160-169.
[121] Hernández-Caraballo EA, Rivas F, Pérez AG, Marcó-Parra LM.
Evaluation of chemometric techniques and artificial neural networks for cancer screening using Cu, Fe, Se and Zn concentrations in blood serum. Analytica Chimica Acta. 2005, 533:161-168.
[122] Ma Q, Yan A, Hu Z, Li Z, Fan B. Principal component analysis and
artificial neural networks applied to the classification of Chinese pottery of neolithic age. Analytica Chimica Acta. 2000, 406:247-256.
[123] Belousov AI, Verzakov SA, von Frese J. Applicational aspects of support vector machines. Journal of Chemometrics. 2002, 16:482-489.
[124] Xu Y, Zomer S, Brereton RG. Support vector machines: a recent method for classification in chemometrics. Critical Reviews in Analytical Chemistry. 2006, 36:177-188.
[125] Devos O, Ruckebusch C, Durand A, Duponchel L, Huvenne JP. Support vector machines (SVM) in near infrared (NIR) spectroscopy: focus on parameters optimization and model interpretation. Chemometrics and Intelligent Laboratory Systems. 2009, 96:27-33.
[126] Chen Q, Guo Z, Zhao J. Identification of green tea's (Camellia sinensis (L.)) quality level according to measurement of main catechins and caffeine contents by HPLC and support vector classification pattern recognition. Journal of Pharmaceutical and Biomedical Analysis. 2008, 48:1321-1325.
[127] Amendolia SR, Cossu G, Ganadu ML, Golosio B, Masala GL, Mura GM. A comparative study of k-nearest neighbour, support vector machine and multi-layer perceptron for thalassemia screening. Chemometrics and Intelligent Laboratory Systems. 2003, 69:13-20.
[128] Zomer S, Guillo C, Brereton RG, Hanna-Brown M. Toxicological classification of urine samples using pattern recognition techniques and capillary electrophoresis. Analytical and Bioanalytical Chemistry. 2003, 378:2008-2020.
[129] Zheng L, Watson DG, Johnston BF, Clark RL, Edrada-Ebel R, Elseheri W. A chemometric study of chromatograms of tea extracts by correlation optimization warping in conjunction with PCA, support vector machines and random forest data modeling. Analytica Chimica Acta. 2009, 642:257-265.
[130] Fernández Pierna JA, Baeten V, Michotte Renier A, Cogdill RP, Dardenne P. Combination of support vector machines (SVM) and near-infrared (NIR) imaging spectroscopy for the detection of meat and bone meal (MBM) in compound feeds. Journal of Chemometrics. 2004, 18:341-349.
[131] Yao XJ, Panaye A, Doucet JP, Chen HF, Zhang RS, Fan BT, Liu MC, Hu ZD. Comparative classification study of toxicity mechanisms using support vector machines and radial basis function neural networks. Analytica Chimica Acta. 2005, 535:259-273.
[132] Ren Y, Liu H, Xue C, Yao X, Liu M, Fan B. Classification study of skin sensitizers based on support vector machine and linear discriminant analysis. Analytica Chimica Acta. 2006, 572:272-282.
[133] Zhang Q, Yoon S, Welsh WJ. Improved method for predicting β-turn using support vector machine. Bioinformatics. 2005, 21(10):2370-2374.
[134] Zang Q, Keire DA, Wood RD, Buhse LF, Moore CMV, Nasr M, Al-Hakim A, Trehy ML, Welsh WJ. Class modeling analysis of heparin 1H NMR spectral data using the soft independent modeling of class analogy and unequal class modeling techniques. Analytical Chemistry. 2011, 83(3):1030-1039.
[135] Parisi D, Magliulo M, Nanni P, Casale M, Forina M, Roda A. Analysis and classification of bacteria by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry and a chemometric approach. Analytical and Bioanalytical Chemistry. 2008, 391:2127-2134.
[136] Candolfi A, De Maesschalck R, Massart DL, Hailey PA, Harrington ACE. Identification of pharmaceutical excipients using NIR spectroscopy and SIMCA. Journal of Pharmaceutical and Biomedical Analysis. 1999, 19:923-935.
[137] Alonso-Salces RM, Herrero C, Barranco A, Berrueta LA, Gallo B, Vicente F. Classification of apple fruits according to their maturity state by the pattern recognition analysis of their polyphenolic compositions. Food Chemistry. 2005, 93:113-123.
[138] Weljie AM, Newton J, Mercier P, Carlson E, Slupsky CM. Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Analytical Chemistry. 2006, 78:4430-4442.
[139] R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing. www.r-project.org.
[140] Maindonald J, Braun J. Data analysis and graphics using R. Cambridge (UK): Cambridge University Press; 2003.
[141] Wehrens R. Chemometrics with R: multivariate data analysis in the natural sciences and life sciences. Berlin Heidelberg: Springer-Verlag; 2011.
[142] Forina M, Lanteri S, Armanino C, Casolino C, Casale M. V-Parvus. 2007. http://www.parvus.unige.it.
[143] Sun M, Zheng Y, Wei H, Chen J, Cai J, Ji M. Enhanced replacement method-based quantitative structure-activity relationship modeling and support vector classification of 4-anilino-3-quinolinecarbonitriles as Src kinase inhibitors. QSAR & Combinatorial Science. 2009, 28:312-324.
[144] Zhu D, Ji B, Meng C, Shi B, Tu Z, Qing Z. The performance of v-support vector regression on determination of soluble solids content of apple by acousto-optic tunable filter near-infrared spectroscopy. Analytica Chimica Acta. 2007, 598:227-234.
[145] Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, van Duijnhoven JPM, van Dorsten FA. Assessment of PLSDA cross validation. Metabolomics. 2008, 4:81-89.
[146] Chen Y, Zhu S, Xie M, Nie S, Liu W, Li C, Gong X, Wang Y. Quality control and original discrimination of Ganoderma lucidum based on high-performance liquid chromatographic fingerprints and combined chemometrics methods. Analytica Chimica Acta. 2008, 623:146-156.
[147] Zang Q, Keire DA, Wood RD, Buhse LF, Moore CMV, Nasr M, Al-Hakim A, Trehy ML, Welsh WJ. Combining 1H NMR spectroscopy and chemometrics to identify heparin samples that may possess dermatan sulfate (DS) impurities or oversulfated chondroitin sulfate (OSCS) contaminants. Journal of Pharmaceutical and Biomedical Analysis. 2011, 54(5):1020-1029.
[148] Armanino C, Casolino MC, Casale M, Forina M. Modelling aroma of three Italian red wines by headspace-mass spectrometry and potential functions. Analytica Chimica Acta. 2008, 614:134-142.
[149] Ryan EA, Farquharson MJ. Breast tissue classification using x-ray scattering measurements and multivariate data analysis. Physics in Medicine and Biology. 2007, 52:6679-6696.
[150] Sun C, Zang H, Liu X, Dong Q, Li L, Wang F, Sui L. Determination of potency of heparin active pharmaceutical ingredient by near infrared reflectance spectroscopy. Journal of Pharmaceutical and Biomedical Analysis. 2010, 51:1060-1063.
Appendix A: Abbreviations
ANN: artificial neural network
APIs: active pharmaceutical ingredients
BIC: Bayes information criterion
CART: classification and regression tree
CE: capillary electrophoresis
CSA: chondroitin sulfate A
CSB: chondroitin sulfate B
CSC: chondroitin sulfate C
CV: cross validation
DP: discriminant power
DPA: Division of Pharmaceutical Analysis
DS: dermatan sulfate
DSS: 4,4-dimethyl-4-silapentane-1-sulfonic acid
FDA: the US Food and Drug Administration
GAG: glycosaminoglycan
GAs: genetic algorithms
GCV: generalized cross-validation
HA: hyaluronic acid
HCA: hierarchical cluster analysis
HPLC: high-performance liquid chromatography
HS: heparan sulfate
kNN: k-nearest neighbors
LDA: linear discriminant analysis
LOO-CV: leave-one-out cross-validation
MLR: multiple linear regression
MSEP: mean squared error for prediction
MVR: multivariate regression
NIR: near infrared
NMR: nuclear magnetic resonance
OSCS: oversulfated chondroitin sulfate
PC: principal component
PCA: principal component analysis
PE: processing element
PLS-DA: partial least squares discriminant analysis
PLSR: partial least squares regression
PRESS: predictive error sum of squares
QDA: quadratic discriminant analysis
RBF: radial basis function
RMSE: root mean squared error
RR: Ridge regression
RSD: relative standard deviation
RSE: relative standard error
RSS: residual sum of squares
SAX-HPLC: strong-anion-exchange high-performance liquid chromatography
SEP: standard error of prediction
SIMCA: soft-independent modeling of class analogy
SLDA: stepwise linear discriminant analysis
SVM: support vector machine
SVR: support vector regression
TNs: terminal nodes
UNEQ: unequal dispersed classes
USP: the United States Pharmacopeia
Appendix B: Index
Active pharmaceutical ingredients (APIs): 2, 5, 6, 9, 11, 21, 22, 32, 70, 72-75, 77, 78, 148, 165, 177, 183, 184, 186, 187, 189, 190
Allergic: 5, 25, 26
Anaphylactic reaction: 2, 25
Anticoagulant: 1, 4, 16, 17, 21-24, 27
Artificial neural network (ANN): 3, 13, 57-59, 80, 136-141, 144-147, 149, 151, 180, 182
Bayes information criterion (BIC): 84-87
Blend: 15, 75, 148-151, 165-168, 183, 186, 187
Calibration: 11, 33, 34, 39, 42, 68, 74, 80, 87, 88, 92, 192
Capillary electrophoresis (CE): 6, 7, 27, 31, 32, 70, 72, 111
Carbohydrate: 4, 16, 17, 28
Centroid: 66, 67, 169
Chemical shift: 13, 76, 78, 79, 131, 132, 135, 136, 145, 163, 170, 180-182
Chemometric: 2, 3, 7-12, 14-16, 33, 35, 41, 46, 48, 67, 70, 71, 73, 78-80, 111, 177, 179, 180, 183, 185, 187, 191
Chi-squared distribution: 66, 169
Chondroitin sulfate: 1, 2, 4-6, 18, 19, 21, 22, 27, 30, 32, 70, 75, 148, 151, 165, 168, 183, 186
Class modeling: 14, 15, 47, 48, 63, 67, 80, 152, 155, 157, 160, 166, 168, 172, 173, 175, 176, 183, 185, 187
Classification: 3, 4, 9, 10, 12-15, 33, 43, 46, 47, 50, 51, 53-57, 60, 62, 72, 73, 80, 83, 111, 112, 117-120, 124-133, 135, 138, 140-142, 144-148, 151, 152, 158-161, 174, 175, 179-181, 183-185, 192
Classification and regression tree (CART): 3, 13, 54, 55, 131, 133, 135, 136, 141, 144-147, 149, 151, 180-182
Cluster: 13, 46, 49, 50, 112, 113, 148, 149, 156, 180
Clustering: 2, 12, 46, 117, 180, 181
Coefficient of determination: 12, 80, 102, 109, 178
Collinearity: 40, 98, 102, 103
Confidence level: 49, 64, 65, 67, 89, 152, 153, 155, 179
Contaminant: 2, 4-7, 9, 11, 13, 15, 16, 24-27, 29, 31, 67-69, 71, 75, 113, 117, 130, 136, 148, 152, 165, 177, 179-181, 183, 185, 189, 191, 192
Coomans plot: 155-158, 173, 176
Cost complexity parameter: 56, 134
Cost function: 43, 44, 108
Covariance: 37, 38, 42, 52, 66, 67, 122, 169
Cross entropy: 137
Cross validation (CV): 54, 57, 63, 99, 100, 104, 108, 123, 125, 128, 131, 134, 138, 141
Dendrogram: 148-150
Dermatan sulfate (DS): 1, 2, 4-7, 9, 10, 12-15, 18, 19, 21, 22, 28-31, 67, 68, 70, 72-76, 78, 86, 89, 111, 113, 114, 116-121, 123-136, 138-140, 142-154, 156-161, 163-168, 170-177, 179-181, 183, 184, 186-189, 191, 192
Deviance: 137
Dimension: 7, 13, 18, 27, 42, 46, 48-50, 52, 53, 61, 63, 79, 103, 107, 111, 112, 118, 121, 122, 137, 182
Disaccharide: 4-6, 9, 17, 19, 20, 29, 30, 76
Discriminant power (DP): 161-164
Euclidean distance: 43, 53, 61, 64, 66, 149, 150
Feature space: 42, 45, 48, 61, 107
Fingerprint: 3, 9-11, 16, 28, 190
Galactosamine: 1, 5, 7, 11, 12, 18, 28, 29, 32, 72, 74, 76, 83, 189
Galactosamine content (%Gal): 1-3, 5, 7, 8, 11, 12, 15, 73-75, 81, 83, 87, 89, 92, 94-98, 105, 107, 109, 177-179
Gaussian: 46, 48, 63
Generalization: 61, 63, 108, 136, 182
Generalized cross-validation (GCV): 99, 100
Genetic algorithms (GAs): 3, 8, 12, 35, 36, 80, 83, 87-102, 104-107, 109, 110, 177, 178
Gini index: 55, 131
Glycosaminoglycan (GAG): 1, 4, 6, 11, 16-18, 20-22, 32, 75, 148-151, 165-168, 183, 187, 189
Grid search: 108, 141, 143
Heparan sulfate (HS): 4, 15, 19, 21, 22, 75, 148-151, 165-168, 183, 186, 187
Hexosamine: 17, 18, 32, 73, 189
Hexuronic acid: 17
Hidden layer: 58-60, 136-138
Hierarchical cluster analysis (HCA): 13, 46, 148, 149
High performance liquid chromatography (HPLC): 1, 6, 7, 11, 12, 27, 32, 74, 75, 81, 92, 94, 97, 119, 189-192
Hyaluronic acid (HA): 5, 21
Hyperplane: 60-62, 158
Impure: 2, 3, 11, 72, 73, 78, 83, 111, 117, 136, 142, 152, 153, 171, 179, 187, 188
Impurity: 1, 2, 4-6, 9, 13, 21, 22, 30, 31, 46, 67, 73, 75, 111, 113, 117, 136, 146, 148, 150, 152, 177, 179-181, 183, 185, 187, 189, 191, 192
Inner product: 45
Input: 42, 50, 58-61, 78, 87, 107, 126, 136, 138, 144, 171
Kernel function: 42, 45, 46, 61-63, 92, 107-110, 141, 178, 182
k-nearest-neighbor (kNN): 3, 13, 53, 114, 126-130, 152, 180
Lagrange multiplier: 44, 107
Latent variable: 41, 42, 51, 103, 117, 180
Leave-one-out cross-validation (LOO-CV): 14, 42, 82, 103, 119, 124, 126, 127, 129, 158, 174
Linear discriminant analysis (LDA): 3, 13, 52, 53, 80, 114, 121-126, 129, 152, 181, 189
Loss function: 42, 43
Mahalanobis distance: 52, 66, 67, 121, 122
Mapping function: 45, 62, 182
Margin: 60-62, 108, 141
Mean squared error for prediction (MSEP): 99
Misclassification: 53, 55, 57, 62, 117-121, 124, 126, 128, 129, 133, 139, 142, 144-146, 154, 159, 181
Model parameter: 93, 98, 101, 106, 110, 133, 140, 144, 182
Multiple linear regression (MLR): 3, 8, 12, 39-41, 83, 86, 92, 93, 96, 98, 102, 103, 107, 177-179
Multivariate: 3, 7-15, 33, 34, 39, 41, 48-51, 67, 68, 72-74, 78, 80, 83, 87, 92, 103, 107, 111, 148, 161, 169, 177, 180, 183, 191, 192
Multivariate regression (MVR): 3, 12, 34, 39, 41, 72, 83, 92, 107, 177, 192
Near infrared (NIR): 70, 71, 189-192
Normal distribution: 39, 48, 66, 67, 169
Nuclear magnetic resonance (NMR): 1-3, 6-16, 27-29, 31, 32, 48, 49, 67, 68, 71, 72, 74-78, 83, 87, 92, 94, 96, 97, 100, 107, 111-113, 146, 152, 153, 159, 165, 169, 177, 179-181, 183, 185, 186, 189, 191, 192
Objective function: 62
One standard deviation: 129, 134
Optimization: 33-35, 44, 62, 87, 108, 141, 182
Output: 51, 54, 58-60, 89, 136, 137
Overfitting: 13, 53, 54, 62, 63, 92, 102, 108, 109, 124, 137, 138, 141, 178, 182
Oversulfated chondroitin sulfate (OSCS): 2, 6, 7, 9-16, 19, 25, 27, 28, 30-32, 67-73, 75-78, 111, 113, 115-136, 138-145, 147-161, 163-168, 170-177, 179-184, 186, 187, 189-192
Partial least squares discriminant analysis (PLS-DA): 3, 13, 50-53, 68, 80, 112, 114-121, 124, 126, 129, 152, 180, 181
Partial least squares regression (PLSR): 3, 8, 12, 41, 70, 71, 80, 83, 86, 103-107, 109, 177-179
Pattern recognition: 2, 7, 10, 11, 46, 47, 52, 57, 60, 72, 78, 111, 114, 121, 144, 177, 183-185, 187, 192
Polysaccharide: 4, 17, 20, 21
Predictive error sum of squares (PRESS): 103
Principal components (PC): 49, 50, 63, 64, 68, 69, 103, 105, 106, 112, 113, 118-121, 126, 128-130, 152-154, 158, 162, 180
Principal components analysis (PCA): 3, 13, 46, 48-51, 63, 68-70, 112, 114-117, 126, 128, 152, 180
Quadratic discriminant analysis (QDA): 48, 66, 169
Radial basis function (RBF): 45, 46, 62, 108, 109, 141
Regression coefficient: 39-44, 98-100, 102, 108
Regularization parameter: 41, 43, 62, 92, 108, 141, 182
Relative standard deviation (RSD): 80, 81, 93, 101, 102, 105, 106, 110
Relative standard error (RSE): 105
Residual standard deviation: 64, 161, 162, 164
Residual sum of squares (RSS): 40, 84
Ridge regression (RR): 3, 8, 12, 40, 41, 80, 83, 86, 92, 98-102, 177
Root mean squared error (RMSE): 80, 81, 93, 96, 101, 102, 105-107, 110
Screening: 2, 9, 15, 27, 31, 70-72, 111, 179
Sensitivity: 14, 27, 65, 154, 171-173, 185-187
Slack variable: 43, 62
Soft-independent modeling of class analogy (SIMCA): 3, 14, 15, 48, 63-65, 67, 152-154, 157-161, 164, 165, 171-174, 185-188
Specificity: 14, 65, 154, 158, 171-174, 185-187
Spectra: 2, 3, 6, 9-11, 16, 28, 29, 48, 49, 68, 70, 71, 75-79, 112, 113, 142, 152, 159, 180, 189-191
Spectral data: 39, 69, 72, 74, 83, 112, 177, 180, 189, 192
Standard error of prediction (SEP): 103-105
Stepwise linear discriminant analysis (SLDA): 14, 37, 122, 169, 185
Stepwise selection: 12, 83, 95, 96
Strong anion exchange (SAX): 6, 7, 11, 32, 119, 189-192
Supervised: 8, 46, 47, 51, 52, 112, 113, 117, 121, 180
Support vector: 44, 60, 62, 63, 108
Support vector machine (SVM): 3, 13, 14, 42, 60, 62, 63, 141-147, 149, 151, 180, 182, 183
Support vector regression (SVR): 3, 8, 12, 42, 45, 80, 83, 86, 92, 107-110, 177-179
Synthetic: 6, 30, 75, 148, 150
Terminal nodes (TNs): 54, 56, 57, 131-135
Test set: 13, 33, 34, 70, 73-75, 92, 95, 98, 102, 103, 105, 107, 109, 117, 119, 120, 123-129, 131, 133-135, 138, 140, 141, 145-147, 159, 161, 178, 182
The US Food and Drug Administration (FDA): 5, 7, 24-27, 72, 189
The United States Pharmacopeia (USP): 1, 2, 5-7, 15, 32, 72, 96, 111, 183, 187, 189
Training set: 13, 33, 34, 48, 51, 53, 54, 56, 63, 64, 73-75, 92, 95, 96, 102, 103, 105, 109, 117, 122, 123, 125-129, 133-135, 138, 140, 141, 145-147, 159, 178, 182
Transfer function: 59, 60, 137
Underfitting: 53, 102, 178
Unequal dispersed classes (UNEQ): 3, 14, 15, 48, 66, 67, 152, 169, 171-176, 185-188
Unsupervised: 8, 46, 50, 112, 148, 180
Variable reduction: 14, 37, 122, 169, 170, 182, 185
Variable selection: 12, 13, 33-36, 80, 83, 86, 87, 92, 95, 96, 98, 103, 105, 107, 109, 121, 122, 177-179
Visualization: 49, 111, 148, 180
Weight decay: 137-140, 182