
Chemometrics: The secrets behind multivariate methods in a nutshell!

• "Statistics means never having to say you're certain."
– We will use spectroscopy as the analytical method to explain the most commonly applied multivariate models … and their + and - !

• Least Squares Regression (LSR) – the starting point!
• Classical Least Squares Regression (CLS)
• Inverse Least Squares Regression (ILS)
• Principal Components Analysis/Regression (PCA/R)
• Partial Least Squares Regression (PLS)
(Remember: we are still thinking linear!)
• Nonlinear? Artificial Neural Networks (ANN)!

What do we want? Take advantage of changes we do not see!

[Figure: Salt ions in water. Absorbance (absorbance units) vs. wavenumber (cm⁻¹), 2000–1000 cm⁻¹. Panels: Na₂SO₄ (0.09, 0.25, 0.5 % w/v); CaCl₂, MgCl₂, KCl, KBr, NaCl, NaBr (1, 3, 5 % w/v each).]

What do we want? Another example …

(courtesy K. Booksh, ASU)

• 80 corn flour samples
– NIR reflectance measurements (differences?)
– Calibrate for moisture, oil, protein, and starch

[Figure: NIR reflectance spectra of the corn flour samples. %R (0–0.8) vs. wavelength (nm), 1000–2500 nm.]

What we really want … Calibration models!

• Quantitation of analytical results (in our case spectral analysis) requires prior training of the system with samples of known concentration!
– Simplest case: measure the band height or band area of samples with known concentration and compare the raw numeric values to those of the unknown sample (measured under the same conditions). (Note: one-point calibration)

• Requirement – well resolved bands, but … how about the real world ???

A little more sophisticated … Calibration equations instead of singular values!

• Create a calibration equation (or series of equations) based on a set of standard mixtures (= training set).
– The set reflects the composition of the unknown samples as closely as possible.
– The set spans the expected range of concentrations and composition.
– The training set is measured under the same conditions (e.g. path length, sampling method, instrument, resolution, etc.) as the unknown sample.


• $c_a = A_a\,(\mathrm{area}_a) + B$
• $c_a = A_a\,(\mathrm{height}_a)^2 + B\,(\mathrm{height}_a) + C$, etc. …

– A, B, C … calibration coefficients.
– Coefficients usually not known.
– Samples with known concentrations (training set).
– Minimum number of calibration samples is the number of unknown coefficients!
– Usually more samples are measured to improve the accuracy of the calibration coefficients. (Note: robust model; minimize sum of squared errors/residuals)
– Repeated measurements of the same concentration for averaging (Note: noise!).


• Best way to find calibration coefficients: Least Squares Regression (LSR)!
Calculate the coefficients of the given equation such that the sum of the squared differences between the known responses (peak areas or heights) and the predicted responses is minimized.

The areas of a spectral component band and the known concentrations are used to compute the coefficients of the calibration equation by Least Squares Regression.
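As a minimal sketch of such a univariate calibration, the linear equation c = A·area + B can be fitted with NumPy. The concentrations and band areas below are invented for illustration and do not come from the lecture data.

```python
import numpy as np

# Hypothetical training set: known concentrations and measured band areas.
conc = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. % w/v
area = np.array([0.52, 1.01, 1.49, 2.03, 2.50])   # integrated band area (a.u.)

# Fit c = A * area + B by minimizing the sum of squared residuals.
A_coef, B_coef = np.polyfit(area, conc, deg=1)

def predict_conc(band_area):
    """Apply the fitted calibration equation to a new band area."""
    return A_coef * band_area + B_coef

residuals = conc - predict_conc(area)
sse = float(np.sum(residuals ** 2))
print(f"A = {A_coef:.3f}, B = {B_coef:.3f}, SSE = {sse:.5f}")
```

Five samples are used for two coefficients, so the fit averages over measurement noise instead of passing exactly through the minimum of two points.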

LSR: The easy way!

• What do we need to pay attention to:
– If there is more than one component in the samples, a separate band must be used for each.
– Hence, one equation is necessary for each component.
– LSR assumes that the absorbance measurement of peak height or peak area is the result of only one component.
– Hence, results are not accurate for mixtures with overlapping bands.
– Predictions will have large errors if interfering (spectrally overlapping) components are present.

Solution: more sophisticated statistics!

More sophisticated … getting closer to real world samples!

• We should calculate the absorptivity coefficients across a much larger portion of the spectrum.

• Beer’s law: $A_\lambda = \varepsilon_\lambda\, c\, b$
– Assuming that for a given λ, $\varepsilon_\lambda$ and b remain constant, we define the constant $K_\lambda = \varepsilon_\lambda\, b$.
– Solving this equation: measure the absorbance of a single sample of known concentration and use these values to solve for the given $K_\lambda$.
– Prediction of unknown sample: $c = A_\lambda / K_\lambda$

Classical Least Squares Regression (CLS) … also called K-Matrix!

• Problem?
– Basing an entire calibration on a single sample is generally not a good idea (noise, instrument error, sample handling error, etc.).

• Solution?
– Measure the absorbances of a series of different concentrations and calculate the best fit line through all the data points (see LSR).

CLS has more problems …

• Sample with two constituents:
– Algebraic solution requires: # equations = # unknowns.
– Let’s consider that the absorbances of components A and B are at different λ and that the absorptivity constants $K_\lambda$ are different.
– We can solve each equation independently provided that the spectrum of one constituent does not interfere with the spectrum of the other.

$A_{\lambda_1} = c_A\,K_{A,\lambda_1}$
$A_{\lambda_2} = c_B\,K_{B,\lambda_2}$

• Do you see a problem with that?

CLS has more problems …

• Yes, there is a problem!
– The equations assume that the absorbance at λ1 is entirely due to constituent A and at λ2 entirely due to constituent B!
– Similar to LSR, we need to find two λ in the spectra of the training set exclusively representing constituents A and B.
– Difficult with complex mixtures or simple mixtures of similar materials!

Do you see a solution to that?

CLS

• Beer’s law can help us:
– Absorbances of multiple constituents at the same λ are additive.

$A_{\lambda_1} = c_A\,K_{A,\lambda_1} + c_B\,K_{B,\lambda_1}$
$A_{\lambda_2} = c_A\,K_{A,\lambda_2} + c_B\,K_{B,\lambda_2}$

– Did we forget something? We assume there is no error in the measurement (i.e. the calculated least squares line(s) that best fit the calibration samples are perfect). Unfortunately, this never happens! Hence, we add a variable E describing the residual error between the least squares fit line and the actual absorbances.
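For the error-free case (E = 0), the two additive equations form a 2×2 linear system that can be solved directly for the concentrations. The sketch below uses made-up K values; it only illustrates that overlapping bands are no obstacle once the additivity is written out.

```python
import numpy as np

# Hypothetical constants K (rows: wavelengths λ1, λ2; columns: constituents A, B).
# Both constituents absorb at both wavelengths, i.e. the bands overlap.
K = np.array([[0.80, 0.30],
              [0.20, 0.90]])

c_true = np.array([0.5, 1.5])      # concentrations of A and B
a = K @ c_true                     # additive absorbances at λ1 and λ2 (E = 0)

# Two equations, two unknowns: solve for the concentrations.
c_solved = np.linalg.solve(K, a)
print(c_solved)
```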

CLS

• Now we write:

$A_{\lambda_1} = c_A\,K_{A,\lambda_1} + c_B\,K_{B,\lambda_1} + \dots + c_n\,K_{n,\lambda_1} + E_{\lambda_1}$
$A_{\lambda_2} = c_A\,K_{A,\lambda_2} + c_B\,K_{B,\lambda_2} + \dots + c_n\,K_{n,\lambda_2} + E_{\lambda_2}$
$A_{\lambda_n} = \dots$

– As with most calibration models – CLS requires many more training samples to build an accurate calibration.
– Hence, we need to solve many equations (for many constituents and λs).
– Solution? Linear algebra: formulating the equations into a matrix every PC is craving for!

CLS

• If we solve the equations for the K matrix we can use the resulting best fit least squares line(s) to predict concentrations of unknowns:
$K = A\,C^{-1}$
(Note: check back on matrix algebra in the beginning of this class!)

• Advantage compared to LSR:
– We can use parts or the entire spectrum for calibration.
– Averaging effect increases accuracy of prediction.
– If the entire spectrum is used for calibration, the rows of the K matrix are actually spectra of the absorptivities for each of the constituents, which look very similar to the pure constituent spectra.
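A minimal CLS sketch on synthetic data follows. It assumes samples are stored in rows, so the model reads A = C K + E and the slide's shorthand K = A C⁻¹ becomes a pseudo-inverse (C is generally not square). The Gaussian "pure spectra", noise level and concentrations are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pure-component spectra (2 constituents x 50 wavelengths), overlapping bands.
wl = np.linspace(0.0, 1.0, 50)
K_true = np.vstack([np.exp(-((wl - 0.35) / 0.10) ** 2),
                    np.exp(-((wl - 0.55) / 0.10) ** 2)])

# Training set: 8 mixtures with known concentrations of BOTH constituents.
C_train = rng.uniform(0.1, 1.0, size=(8, 2))
A_train = C_train @ K_true + rng.normal(0.0, 1e-3, size=(8, 50))

# Calibration: least-squares estimate of K from A = C K (pseudo-inverse of C).
K_hat = np.linalg.pinv(C_train) @ A_train      # rows resemble the pure spectra

# Prediction: express the unknown spectrum as a combination of the rows of K_hat.
c_unknown = np.array([0.40, 0.70])
a_unknown = c_unknown @ K_true
c_pred = a_unknown @ np.linalg.pinv(K_hat)
print(c_pred)
```

All 50 wavelengths enter the fit, which is the averaging effect mentioned above; the price is that both constituent concentrations must be known for every training sample.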

CLS: The Good, the Bad and the Ugly …

• Advantages:
– Based on Beer’s law.
– Calculations are relatively fast.
– Applicable to moderately complex mixtures.
– Calibrations do not require wavelength selection as long as the # wavelengths exceeds the # constituents.

• Disadvantages:
– Requires knowing the entire composition and concentration of every constituent of the training set.
– Limited applicability for mixtures with constituents that interact.
– Very susceptible to baseline effects (e.g. drifts) as the equations assume the response at a wavelength is only due to the calibrated constituents.

Again more sophisticated … getting even closer to real world samples!

• Beer’s law: $A_\lambda = \varepsilon_\lambda\, c\, b$
– No interference in the spectrum between the individual sample constituents.
– Concentrations of all the constituents in the samples are known ahead of time.
– Very unlikely for real world samples!

Solution: let’s rearrange Beer’s law again!

Inverse Least Squares Regression (ILS) … also called P-Matrix or Multiple Linear Regression (MLR)!

• Beer’s law rearranged: $c = A_\lambda / (\varepsilon_\lambda\, b)$
– Assuming that for a given λ, $\varepsilon_\lambda$ and b remain constant, we define the constant $P = 1/(\varepsilon_\lambda\, b)$.
– Now we can write: $c = P\,A_\lambda + E$
– In this expression Beer’s law says that the concentration is a function of the absorbances at a series of given wavelengths!

??? Where is the difference/advantage to CLS ???

ILS

• CLS:
$A_{\lambda_1} = c_A\,K_{A,\lambda_1} + c_B\,K_{B,\lambda_1} + E_{\lambda_1}$
$A_{\lambda_2} = c_A\,K_{A,\lambda_2} + c_B\,K_{B,\lambda_2} + E_{\lambda_2}$
– Absorbance at a single wavelength is calculated as an additive function of the constituent concentrations, i.e. concentrations of ALL components need to be known!

• ILS:
$c_A = A_{\lambda_1}\,P_{A,\lambda_1} + A_{\lambda_2}\,P_{A,\lambda_2} + E_A$
$c_B = A_{\lambda_1}\,P_{B,\lambda_1} + A_{\lambda_2}\,P_{B,\lambda_2} + E_B$
NOTE: even if the concentrations of all the other constituents in the mixture are not known, the matrix of coefficients P can still be calculated correctly!!!
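The note above can be checked on a small synthetic example: three constituents vary in the training mixtures, but only the concentration of constituent A is given to the calibration. The absorptivities and concentrations are made up, and the data are noise-free so that the result is exact; three wavelengths are used because there are three independent variations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical absorptivities (rows: constituents A, B, C; columns: 3 selected wavelengths).
K_sel = np.array([[0.90, 0.30, 0.10],    # constituent A (calibrated)
                  [0.20, 0.80, 0.30],    # constituent B (unknown interferent)
                  [0.10, 0.20, 0.70]])   # constituent C (unknown interferent)

C_all = rng.uniform(0.1, 1.0, size=(12, 3))   # true composition of 12 training samples
A_train = C_all @ K_sel                        # absorbances (12 samples x 3 wavelengths)
c_A_known = C_all[:, 0]                        # the only concentrations the analyst knows

# ILS calibration: solve c_A = A P in the least-squares sense.
P, *_ = np.linalg.lstsq(A_train, c_A_known, rcond=None)

# Predict A in a new mixture whose B and C levels are different and still unknown.
c_new = np.array([0.55, 0.90, 0.15])
c_A_pred = float((c_new @ K_sel) @ P)
print(c_A_pred)
```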

ILS

• Consequently:
– Only concentrations of the constituents of interest need to be known.
– No knowledge of the remaining sample composition is needed.

• What do we need?
– Selected wavelengths must be in a region where the constituent of interest contributes to the overall spectrum.
– Measurements of the absorbances at different wavelengths are needed for each constituent.
– A measurement at at least one additional λ is needed for each additional independent variation (constituent) in the spectrum.

ILS

• Matrix algebra is helping again:
$P = C\,A^{-1}$
– Now we can accurately build models for complex mixtures when only some of the constituent concentrations are known.
– We just need to select wavelengths corresponding to the absorbances of the desired constituents.

So where is the UGLY?

ILS

• Dimensionality of matrix equations: (see also: very beginning of this class on matrix algebra!)
– Number of selected wavelengths cannot exceed the number of training samples.

• Collinearity: (see also: linear independence!)
– We could measure many more training samples to allow for additional wavelengths.
– BUT: absorbances in a spectrum tend to all increase and decrease together as the concentrations of the constituents in the mixture change.
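One way to make the collinearity problem visible is the condition number of the absorbance matrix built from the selected wavelengths: the larger it is, the more noise is amplified when P is computed. The sketch below is an illustration on synthetic spectra (band positions, widths and the chosen wavelength indices are assumptions), comparing two neighbouring wavelengths with two well-separated ones.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical two-constituent mixtures with broad, overlapping bands.
wl = np.linspace(0.0, 1.0, 200)
pure = np.vstack([np.exp(-((wl - 0.4) / 0.15) ** 2),
                  np.exp(-((wl - 0.6) / 0.15) ** 2)])
C = rng.uniform(0.1, 1.0, size=(20, 2))
A = C @ pure + rng.normal(0.0, 1e-4, size=(20, 200))

# Two adjacent wavelengths rise and fall together: nearly collinear columns.
neighbours = A[:, [99, 100]]
# Two wavelengths on opposite band flanks carry largely independent information.
spread = A[:, [60, 140]]

cond_neighbours = np.linalg.cond(neighbours)
cond_spread = np.linalg.cond(spread)
print(f"neighbouring: {cond_neighbours:.1f}, separated: {cond_spread:.1f}")
```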

ILS

• Overfitting: (see also: chance correlations!)
– In general, starting from very few λ, and adding more to the model (of course selected to reflect the constituents of interest) will improve the prediction accuracy.
– BUT: if the number of λ increases in the calibration equations, the likelihood that "unknown" samples will vary in exactly the same manner decreases and prediction accuracy goes down again.

• Noise: (see also: issues of noise!)
– If too much information (too many λ) is used to calibrate, the model starts to include the spectral noise (which is unique to the training set only) as constituent signal and the prediction accuracy for unknown samples suffers.

ILS

• Consequence:
– Averaging effect gained by selecting many wavelengths as in CLS is effectively lost.
– Wavelength selection is critically important to building an accurate ILS model.
– Ideal situation: selecting sufficient wavelengths to compute accurate least squares line(s) and few enough that the calibration is not (overly) affected by the collinearity of the spectral data.
– Hence – optimization of model required!

• Advantage:
The main advantage of this multivariate method is the ability to calibrate for a constituent of interest without having to account for interferences in the spectra.

ILS: The Good, the Bad and the Ugly …

• Advantages:
– Based on Beer’s law.
– Calculations are relatively fast.
– True multivariate model, which allows calibration of very complex mixtures since only knowledge of the constituents of interest is required.
(Note: multivariate because the concentration (dependent variable) is solved by calculating a solution from responses at several selected wavelengths (multiple independent variables).)

• Disadvantages:
– Wavelength selection can be difficult and time consuming.
– Collinearity of wavelengths must be avoided.
– # wavelengths used in the model limited by # calibration samples.
– A large number of samples is required for accurate calibration.

Maybe we need to think more abstractly to solve some of these problems …

• Why?
– Spectrum of real world samples: many different variations contribute, incl. constituents in the mixture, interaction between constituents, instrument variations (e.g. detector noise), changing ambient conditions affecting baseline and absorbance, sample handling, etc.

• What do we hope for?
– That the largest variations in the calibration set are the changes in the spectrum due to the different concentrations of the constituents!

Do we need the absolute absorbance values in the spectrum?

• Well …
… even if many complex variations affect the spectrum, there should be a finite number of independent common variations in the spectral data.

• Conclusion?
– If we can calculate a set of variation spectra representing the changes in the absorbances at all wavelengths in the spectra, this data could be used instead of the raw spectral data for building the calibration model!

Let’s continue this idea …

• Can we reconstruct a spectrum from ‘variations’?
– If we multiply a sufficient number of variation spectra each with a different constant scaling factor and add the results together, we should be able to reconstruct a ‘real spectrum’.
– Each spectrum in the calibration set would have a different set of scaling constants for each variation since the concentrations of the constituents are all different.
– Hence, the fraction of each variation spectrum that must be added to reconstruct the unknown data should be related to the concentration of the constituents.

What does this mean mathematically?

• ‘Variation spectra’ are called Eigenvectors! (also: loading vectors or principal components)
– The scaling constants applied to reconstruct the spectra (which we multiply with the ‘variation spectra’ = Eigenvectors) are called scores.

• What do we need to do?
– We break down a spectroscopic data set into its most basic variations …

[Figure: Variations vs. absorbances; spectrum of 3 components.]

… which is called: Principal Component Analysis (PCA)

• Why do we want to do that?
– Because there should be far fewer common variations in the calibration spectra than the number of calibration spectra.
– Hence, because we are lazy (or time is pressing) we expect to significantly reduce the number of calculations for the calibration equations!

Principal Component Analysis (PCA)

• Why does prediction work on the basis of Eigenvectors?
– The calculated eigenvectors derive from the original calibration data (spectra).
– Hence, inherently they must relate to the concentrations of the constituents making up the samples.
– It follows that the same loading vectors (eigenvectors, principal components) can be used to predict unknown samples.

• Consequence:
– The only difference between the spectra of samples with different constituent concentrations is the fraction of each added loading vector (scores).


• What do we need to do?
– The calculated scores are unique to each separate principal component and to each training spectrum.
– In fact, they can be used in lieu of absorbances in either of the classical model equations in CLS or ILS.
– Consequently, the representation of the mixture spectrum is reduced from many wavelengths to a few scores.


• Now comes the real advantage!
– We can use the ILS expression of Beer’s law ($c = P\,A_\lambda + E$) to calculate concentrations as this allows us to calculate concentrations among interfering species.
– At the same time the calculations maintain the averaging effect of CLS by using a large number of wavelengths in the spectrum (up to the entire spectrum) for calculating the eigenvectors.

Eigenvector models combine the best of both worlds!


• What do we get?
– Spectral data condensed into the most prevalent spectral variations (principal components, eigenvectors, loadings) and the corresponding scaling coefficients (scores).


• Difference to CLS and ILS:
– PCA models base the concentration predictions on changes in the data (‘variation spectra’) and not absolute absorbance measurements.
– Conclusion: in order to establish a PCA model, the spectral data must change.
– Simplest way: vary the concentrations of the constituents of interest in the training set.
– Important: avoid collinearity, i.e. 2 or more components in the calibration samples should not be present in the same ratio (e.g. A and B are present in the stock solution in ratio 2:1 and the training set is prepared by dilution of that solution)! PCA will detect only ONE variation!
– Calibration of Eigenvector models requires randomly distributed ratios of the constituents of interest.


• Mean centering of data:
– Data is commonly mean centered prior to PCA.
– Mean centering: the mean spectrum (average spectrum) is calculated from all calibration spectra and then subtracted from every calibration spectrum.
– Effect: enhancement of small differences between spectra, as the changes in the absorbance data are important and not the absolute absorbance (i.e. data not falsified!).
– Following mean centering, a set of Eigenvectors (principal components) is created that represents the changes in the absorbances common to all calibration spectra.
– After the training data has been fully processed by the PCA algorithm, two main matrices remain:
+ The Eigenvectors (spectra)
+ The scores (the eigenvector weighting values for all the calibration spectra)


• Matrix expression of PCA:
$A = S\,F + E_A$
- $E_A$ … error matrix describing the model’s ability to predict the calibration absorbances; it has the same dimensionality as the A matrix.
- $E_A$ is called the matrix of residual spectra (Note: see residual analysis!).

(Note: the mean spectrum is only added if the data were mean centered; the spectral residual ($E_A$) is the difference between the reconstructed spectrum and the original. Keep in mind: no model is perfect!)

[Figure: reconstruction of an original calibration spectrum.]
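The decomposition A = S F + E_A, including mean centering, can be sketched with a singular value decomposition. The lecture does not say which PCA algorithm is used, so the SVD route, the synthetic spectra and the choice of two PCs are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical calibration set: 10 spectra x 40 wavelengths, two varying constituents.
wl = np.linspace(0.0, 1.0, 40)
pure = np.vstack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                  np.exp(-((wl - 0.6) / 0.1) ** 2)])
C = rng.uniform(0.1, 1.0, size=(10, 2))
A = C @ pure + rng.normal(0.0, 1e-4, size=(10, 40))

# Mean centering: subtract the average calibration spectrum from every spectrum.
mean_spectrum = A.mean(axis=0)
A_centered = A - mean_spectrum

# PCA via SVD of the centered data; components come out ordered by variance.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
n_pc = 2
F = Vt[:n_pc]                    # eigenvectors (loadings), one per row
S = U[:, :n_pc] * s[:n_pc]       # scores, one row per calibration spectrum

# Residual spectra: what the n_pc-component model cannot reconstruct.
E_A = A_centered - S @ F
explained = s ** 2 / np.sum(s ** 2)
print("variance explained by 2 PCs:", explained[:n_pc].sum())
```

Each 40-point spectrum is now represented by two scores; the original is recovered as `mean_spectrum + S @ F + E_A`.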


• Now the question arises – how many PCs do we need to model our data? (Note: same question for # of factors in PLS)
- Calculated Eigenvectors are ordered by their degree of importance to the model – in case of PCA the decisive parameter is the variance.
- If too many PCs are taken into account (‘overfit’), the Eigenvectors will begin modeling the system noise as the smallest contributions of variance in the training data set.
- This is great: if we select the correct number of PCs we effectively filter out noise!
- However: if the number of PCs is too small (‘underfit’), the concentration prediction for unknown samples will suffer.
- So here is the task: define a model that contains enough orthogonal (linearly independent) Eigenvectors to properly model the components of interest without adding too much contribution from noise!

But how?


• Calculate the PRESS value for every possible factor (PRESS = Prediction Residual Error Sum of Squares)
- We build a calibration model with a number of factors.
- Then we predict some samples of known concentration (usually from the training set) with the model.
- The sum of the squared differences between the predicted and known concentrations gives the Prediction Residual Error Sum of Squares for that model.

(n is the number of samples in the training set; m is the number of constituents; Cp is the matrix of predicted sample concentrations from the model; C is the matrix of known concentrations of the samples)
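The formula these symbols belong to is not reproduced in the text. Written out from the verbal definition above (a reconstruction, with i indexing samples and j indexing constituents as assumed index names), it reads:

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( C_{p,ij} - C_{ij} \right)^{2}
```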


• Self prediction:
- Models are built using all the spectra in the training set.
- Then the same spectra are predicted back against these models.
- Disadvantage: all vectors calculated exist in all training spectra. Hence, the PRESS plot will continue to fall as new factors are added to the model and will never rise. This gives the (false) impression that all vectors are ‘constituent’ vectors and that there are no ‘noise’ vectors to eliminate (which is never the case!).
- However, there is one tempting advantage – this method is very fast as the model is only built once!

Better way: Cross validation


• Cross validation (1):
- Again, unknown samples are emulated by the training set.
- However: the sample to be predicted is left out during calibration.
- The procedure is repeated until every calibration sample has been left out and predicted at least once. Each calculated squared residual error is added to the running PRESS sum.
- Disadvantage: time consuming as re-calculation is required for every left-out sample.
- However, as the predicted samples are not the same as the samples used to build the model, the calculated PRESS value is a very good indication of the error in the accuracy of the model when used to predict "unknown" samples in the future!

Hence: The only recommended way!
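Self prediction and leave-one-out cross validation can be compared side by side in a short sketch. PRESS needs concentration predictions, so the example uses a simple PCA-plus-regression model (an assumption for illustration; the function names `pcr_fit`, `pcr_predict`, `press_self` and `press_loo` are hypothetical) on synthetic two-constituent spectra.

```python
import numpy as np

def pcr_fit(A, c, n_pc):
    """Mean-center, take n_pc loadings from an SVD, regress c on the scores."""
    a_mean, c_mean = A.mean(axis=0), c.mean()
    _, _, Vt = np.linalg.svd(A - a_mean, full_matrices=False)
    F = Vt[:n_pc]
    scores = (A - a_mean) @ F.T
    b, *_ = np.linalg.lstsq(scores, c - c_mean, rcond=None)
    return a_mean, c_mean, F, b

def pcr_predict(model, a):
    a_mean, c_mean, F, b = model
    return c_mean + ((a - a_mean) @ F.T) @ b

def press_self(A, c, n_pc):
    """Self prediction: the model is built once and predicts its own training set."""
    model = pcr_fit(A, c, n_pc)
    return float(np.sum((pcr_predict(model, A) - c) ** 2))

def press_loo(A, c, n_pc):
    """Leave-one-out: every sample is predicted by a model built without it."""
    press = 0.0
    for i in range(len(c)):
        keep = np.arange(len(c)) != i
        model = pcr_fit(A[keep], c[keep], n_pc)
        press += (pcr_predict(model, A[i]) - c[i]) ** 2
    return float(press)

# Hypothetical training set: 15 noisy mixtures of two constituents, 30 wavelengths.
rng = np.random.default_rng(3)
wl = np.linspace(0.0, 1.0, 30)
pure = np.vstack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                  np.exp(-((wl - 0.5) / 0.1) ** 2)])
C = rng.uniform(0.1, 1.0, size=(15, 2))
A = C @ pure + rng.normal(0.0, 0.01, size=(15, 30))
c = C[:, 0]

self_curve = [press_self(A, c, k) for k in range(1, 7)]
loo_curve = [press_loo(A, c, k) for k in range(1, 7)]
for k, (ps, pl) in enumerate(zip(self_curve, loo_curve), start=1):
    print(k, ps, pl)
```

The self-prediction curve can only fall as factors are added, whereas the leave-one-out curve drops sharply once both constituents are modeled and gives a usable criterion for the number of factors.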


• Cross validation (2):
- Initially the prediction error (PRESS value) decreases as new Eigenvectors (PCs) are added to the model. This indicates that the model is still underfit and there are not enough factors to completely account for the constituents of interest.
- At some point the PRESS values reach a minimum and start to rise again as PCs that contain uncorrelated noise are added, indicating that the model is overfit.

Principal Component Regression (PCR)

• PCA combined with ILS:
- Quantitative models for complex samples can be established.
- Instead of directly regressing the constituent concentrations against the spectroscopic response via Beer’s law, we regress the concentrations against the PCA scores.
- Eigenvectors of the PCA decomposition represent the spectral variations common to all of the spectroscopic calibration data.
- We can use that information to calculate a regression equation providing a robust model for predicting concentrations of the desired constituents in very complex samples (instead of directly utilizing absorbances).

Principal Component Regression (PCR)

• How does it work?
- Let’s compare against the techniques we know:
CLS: $K = A\,C^{-1}$, $A_{\lambda_1} = c_A\,K_{A,\lambda_1} + c_B\,K_{B,\lambda_1} + E_{\lambda_1}$
ILS: $P = C\,A^{-1}$, $c_A = A_{\lambda_1}\,P_{A,\lambda_1} + A_{\lambda_2}\,P_{A,\lambda_2} + E_A$
PCA: $A = S\,F + E_A$
- The F-Matrix in PCA (containing the PCs) has a similar function as the K-Matrix in CLS: it stores the spectral (or spectral variance) data of the constituents. The F-Matrix ‘needs’ the S-Matrix (scores) to be useful; likewise, the K-Matrix needs the C-Matrix.
- The scores summarized in the S-Matrix are unique to each calibration spectrum.
- An optical spectrum is represented by a collection of absorbances at a series of wavelengths. In analogy, the very same spectrum can be represented by a series of scores for a given set of factors.

Hence: we can regress the concentrations (C-Matrix) against the scores (similar to the classical approach regressing the concentrations against the absorbances, i.e. A-Matrix).

Principal Component Regression (PCR)

• Using the ILS approach we can formulate:
- $C = B\,S + E_C$
C represents the constituent concentration matrix, B the matrix of regression coefficients and S the scores matrix from the PCA.
- Now we understand why this approach is called PCR: we combine PCA (first step) with ILS regression (second step) to solve the calibration equation for the model. In contrast (and as we shall see later), partial least squares (PLS) regression performs these operations in one step.
- We can use $A = S\,F$ rearranged to $S = A\,F^{-1}$ (neglecting the error matrix for simplicity):

$C = B\,A\,F^{-1} + E_C$ … PCR model equation
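The two PCR steps and the combined model equation can be sketched as follows. Samples are stored in rows here, so the slide's C = B S appears as C = S B, and F⁻¹ is realized as the transpose of F because the PCA loadings are orthonormal. Data, noise level and the choice of two PCs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical calibration data: 12 mixtures, 40 wavelengths, 2 constituents.
wl = np.linspace(0.0, 1.0, 40)
pure = np.vstack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                  np.exp(-((wl - 0.6) / 0.1) ** 2)])
C = rng.uniform(0.1, 1.0, size=(12, 2))
A = C @ pure + rng.normal(0.0, 1e-3, size=(12, 40))
a_mean, c_mean = A.mean(axis=0), C.mean(axis=0)

# Step 1 (PCA): loadings F and scores S of the mean-centered spectra.
_, _, Vt = np.linalg.svd(A - a_mean, full_matrices=False)
F = Vt[:2]
S = (A - a_mean) @ F.T            # S = A F^-1, with F^-1 = F^T for orthonormal rows

# Step 2 (ILS-type regression): concentrations regressed against the scores.
B, *_ = np.linalg.lstsq(S, C - c_mean, rcond=None)   # shape (2 PCs, 2 constituents)

# Combined PCR model: one regression vector per constituent, applied to spectra directly.
W = F.T @ B                       # shape (40 wavelengths, 2 constituents)

c_new = np.array([0.30, 0.80])
a_new = c_new @ pure              # spectrum of a hypothetical unknown mixture
c_pred = c_mean + (a_new - a_mean) @ W
print(c_pred)
```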

Principal Component Regression (PCR)

• IMPORTANT (1):
- The PCR calibration model is a two-step process: (1) PCA Eigenvectors and scores are calculated; (2) scores are regressed against the constituent concentrations using a regression method similar to ILS.
- NOTE: Remember that the ILS approach can build accurate calibrations, provided that the selected variables are physically related to the constituent concentrations. However, the PCA factors/scores are calculated independently of any knowledge of these concentrations and represent only the largest common variations among all the spectra in the training set.
- We assume that these variations will be mostly related to changes in the constituent concentrations, but there is no guarantee this will be true.

Principal Component Regression (PCR)

• IMPORTANT (2):
- Practically, many PCR models include more factors than are actually necessary as some of the Eigenvectors are probably not related to any of the constituents of interest.
- Ideally, a PCR model should be built by performing a selection on the scores (similar to wavelength selection in the ILS model) determining which factors should be used to build a model for each constituent.
- As these selection rules are difficult to establish and to wrap into algorithms, corresponding treatment is not included in most chemometrics packages!

That’s why we have yet another technique – PLS ;-)

PCA/PCR: The Good, the Bad and the Ugly …

• Advantages:
– Does not require wavelength selection, usually the whole spectrum or large regions are used (though scores selection might be advantageous sometimes!).
– Larger number of wavelengths provides an averaging effect (model less susceptible to spectral noise).
– PCA data compression (far fewer PCs than spectra) allows using inverse regression to calculate model coefficients, calibrating only for constituents of interest.
– Can be used for very complex mixtures since only knowledge of constituents of interest is required.
– Can sometimes be used to predict samples with constituents (contaminants) not present in the original calibration mixtures.

PCA/PCR: The Good, the Bad and the Ugly …

• Disadvantages:
– Calculations slower than most classical methods (not a tremendous problem nowadays given the available computation power).
– Optimization requires some knowledge of PCA; models are more complex to understand and interpret.
– A large number of samples is required for accurate calibration.
– Hence – preparation/collection of calibration samples can be difficult when collinearity of constituent concentrations has to be avoided.

Partial Least Squares (PLS) … yet another technique!

• More focused on concentrations …
- PLS is closely related to PCA.
- Main difference: spectral decomposition uses the concentration information provided in the training set.
- PCA: first we decompose the spectral matrix into a set of Eigenvectors and scores; then we regress them against the concentrations in a separate step.
- PLS: concentration information is used already during the decomposition process; hence, spectra containing higher constituent concentrations are weighted more heavily than spectra containing low concentrations.
- Consequence: Eigenvectors and scores calculated using PLS are different from those in PCR.

The main idea of PLS is to get as much concentration information as possible into the first few loading vectors.

Partial Least Squares (PLS)

• Here’s what we do …
- PCA decomposes spectra into the most common variations.
- PLS takes advantage of the correlation relationship already existing between the spectral data and the constituent concentrations and decomposes the concentration data also into the most common variations.
- Consequently: two sets of vectors (one set for spectral data; one set for constituent concentrations) and two sets of corresponding scores are generated for the calibration model.


• Let’s contrast the results of PCA and PLS:
- PCA decomposes first and then performs the regression.
- PLS performs decomposition of spectral and concentration data simultaneously (= regression already included in one step).

Principal components are called factors in PLS.

[Figure: schematic comparison of the PCA and PLS decompositions.]


• How does it work?
- We will not derive the algorithms. For those interested in applying the methodology, guidelines for implementation will be posted on the web.
- It is assumed that the two sets of scores (spectral and concentration scores) are related to each other through some type of regression (which appears natural as the spectral features are dominated by the constituent concentrations). Hence, a calibration model can be constructed.
- As each new factor is calculated for the model, the scores are "swapped" before the contribution of the factor is removed from the raw data. The reduced data matrices are then used to calculate the next factor. This process is repeated until the desired number of factors is calculated.
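For readers who want to see the factor-by-factor deflation in code, here is a minimal PLS-1 sketch in the common NIPALS formulation for a single constituent. It is an assumption that this matches the variant used in the lecture; the function name `pls1_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """PLS-1 for one constituent: extract factors one at a time, deflating X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # residual (reduced) data
    W, P, q, T = [], [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr                        # spectral weight: direction most covariant with y
        w /= np.linalg.norm(w)
        t = Xr @ w                           # spectral scores for this factor
        tt = t @ t
        p = Xr.T @ t / tt                    # spectral loading vector
        qk = yr @ t / tt                     # concentration loading (a scalar in PLS-1)
        Xr = Xr - np.outer(t, p)             # remove this factor from the spectra ...
        yr = yr - qk * t                     # ... and from the concentrations
        W.append(w); P.append(p); q.append(qk); T.append(t)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # regression vector in wavelength space
    return x_mean, y_mean, b, np.array(T).T

# Hypothetical calibration data: 12 mixtures of two constituents, 40 wavelengths.
rng = np.random.default_rng(6)
wl = np.linspace(0.0, 1.0, 40)
pure = np.vstack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                  np.exp(-((wl - 0.6) / 0.1) ** 2)])
C = rng.uniform(0.1, 1.0, size=(12, 2))
X = C @ pure + rng.normal(0.0, 1e-3, size=(12, 40))

x_mean, y_mean, b, T = pls1_fit(X, C[:, 0], n_factors=2)

c_new = np.array([0.30, 0.80])
y_pred = float(y_mean + (c_new @ pure - x_mean) @ b)
print(y_pred)
```

Because the weight vector is computed from the concentrations, each factor is steered toward the constituent being calibrated, which is the contrast to PCA drawn above.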


• Main difference between PCA and PLS
- In PLS the resulting spectral vectors are directly related to the constituents of interest.
- In PCR the vectors only represent the most common spectral variations in the data, completely ignoring their relation to the constituents of interest until the final regression step.

Partial Least Squares (PLS) … to make it even more complicated …

• … there is a PLS-1 and a PLS-2 algorithm!
- PLS-1 is the procedure we just discussed, which results in a separate set of scores and loading vectors for each constituent of interest (see previous slide). Hence, the calculated vectors are optimized for each individual constituent.
- PLS-2 basically adopts the strategy of PCA and calibrates for all constituents simultaneously. Hence, the calculated vectors are not optimized for each individual constituent.
- Consequence: in principle, the predictions derived from PLS-1 should be more accurate than those from PLS-2 and PCA. But: speed of calculation!

(Note: a separate set of eigenvectors and scores must be calculated for every constituent of interest; training sets with a large number of samples and constituents will significantly increase the time of calculation.)


• Advantage of PLS:
- For systems that have constituent concentrations that are widely varied.
- Example: calibration spectra contain A in the concentration range 40-60%, B in the concentration range 5-8% and C in the concentration range 0.1-0.5%.
- Here, PLS-1 will very likely predict better than PLS-2 or PCA.
- If the concentration ranges of A, B and C are approx. the same, PCA and PLS-2 will perform with similar predictive quality; however, PLS-1 will definitely take longer to calculate.

PLS: The Good, the Bad and the Ugly …

• Advantages:
– Combines the full spectral coverage of CLS with the partial composition regression of ILS (‘best of both worlds’ argument!).
– Single step decomposition and regression.
– Eigenvectors/Factors directly related to constituents of interest rather than largest common spectral variations.
– Calibrations are generally more robust if the calibration set accurately reflects the range of variability expected in unknown samples.
– Can sometimes be used to predict samples with constituents (contaminants) not present in the original calibration mixtures.
– In general, literature argues that PLS has superior predictive ability - HOWEVER: there are many published examples where certain calibrations simply have performed better using PCR or PLS-2 instead of PLS-1 !!!

PLS: The Good, the Bad and the Ugly …

• Disadvantages:
– Extensive calculation times.
– Models are fairly abstract and difficult to understand and interpret.
– A large number of samples is required for accurate calibration.
– Hence – preparation/collection of calibration samples can be difficult when collinearity of constituent concentrations has to be avoided.

Decision maker on the correct method …

What do we want?

• 80 corn flour samples
– NIR reflectance measurements (differences?)
– Calibrate for moisture, oil, protein, and starch

[Figure: NIR reflectance spectra of the corn flour samples. %R (0–0.8) vs. wavelength (nm), 1000–2500 nm.]

80 samples: 40 calibration, 20 validation, 20 test

Corn flour samples: PCR vs. PLS for Oil in Corn Flour

Percent and cumulative percent spectral variance per factor

           PCR                        PLS
Factors    %        cumulative %      %        cumulative %
 1         88.21    88.21             82.54    82.54
 2          6.57    94.78             12.19    94.73
 3          2.16    96.94              0.84    95.58
 4          0.98    97.92              0.57    96.15
 5          0.76    98.68              1.78    97.94
 6          0.46    99.14              0.78    98.72
 7          0.30    99.45              0.61    99.33
 8          0.25    99.70              0.11    99.44
 9          0.11    99.81              0.22    99.66
10          0.06    99.87              0.12    99.78
11          0.03    99.90              0.07    99.85
12          0.03    99.93              0.05    99.90
13          0.02    99.95              0.02    99.92
14          0.01    99.96              0.01    99.93
15          0.01    99.97              0.02    99.96

• Left: PCR; right: PLS

Salinity: PCR

[Figure: Salt ions in water. Absorbance (absorbance units) vs. wavenumber (cm⁻¹), 2000–1000 cm⁻¹. Panels: Na₂SO₄ (0.09, 0.25, 0.5 % w/v); CaCl₂, MgCl₂, KCl, KBr, NaCl, NaBr (1, 3, 5 % w/v each).]

Salinity
PCR

Spectral evaluation

• around 3350 cm⁻¹ - water absorption is too strong
• around 2100 cm⁻¹ - weak water absorption (included)
• around 1635 cm⁻¹ - appropriate water absorption (included)
• around 900 cm⁻¹ - appropriate water absorption (included)
• around 1100 cm⁻¹ - absorption band of the SO₄²⁻ ion

→ wavenumber range used for chemometric data evaluation: 2300 cm⁻¹ – 700 cm⁻¹
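Restricting the spectra to the 2300 to 700 cm-1 window before building the model amounts to a simple mask on the wavenumber axis. A minimal sketch, assuming the spectra are stored as rows of a NumPy array; the axis spacing, the dummy data, and the name `select_range` are illustrative assumptions.

```python
import numpy as np

def select_range(wavenumbers, absorbance, low=700.0, high=2300.0):
    """Keep only the spectral points with low <= wavenumber <= high (cm-1)."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    return wavenumbers[mask], absorbance[..., mask]

# Hypothetical axis: 4000 down to 600 cm-1 in 2 cm-1 steps, 5 dummy spectra
wavenumbers = np.arange(4000.0, 598.0, -2.0)
absorbance = np.zeros((5, wavenumbers.size))

wn_sel, abs_sel = select_range(wavenumbers, absorbance)
print(wn_sel.max(), wn_sel.min(), abs_sel.shape)
```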

Salinity
PCR

[Figure: salt ions in water; PCR loadings (a. u.) of PC #1 to PC #8 vs. wavenumber (2000 to 1000 cm⁻¹), shown in four panels of two PCs each.]

Salinity
PCR

Synthetic samples

[Figure: measured conc. vs. input conc. (both in mM·L⁻¹) for synthetic samples. Panels: Na⁺ + K⁺, Cl⁻, Br⁻, Ca²⁺, Mg²⁺, SO₄²⁻. Annotated estimated detection limits: 100 mM·L⁻¹ and, in the SO₄²⁻ panel, 0.3 mM·L⁻¹.]

Salinity
PCR

Artificial seawater

[Figure: measured conc. vs. input conc. (both in mM·L⁻¹) for artificial seawater. Panels, each with its "typ." concentration: Br⁻ (typ. 1 mM·L⁻¹), Na⁺ + K⁺ (typ. 479 mM·L⁻¹), Cl⁻ (typ. 559 mM·L⁻¹), Ca²⁺ (typ. 11 mM·L⁻¹), Mg²⁺ (typ. 54 mM·L⁻¹), SO₄²⁻ (typ. 29 mM·L⁻¹). Annotated values: 100 mM·L⁻¹ and, in the SO₄²⁻ panel, 0.3 mM·L⁻¹.]

Salinity
PCR

Conclusions

• a sensor is proposed for salinity analysis of aqueous samples
• investigated ions: Cl⁻, Na⁺, Mg²⁺, SO₄²⁻, Ca²⁺, K⁺, Br⁻
• measurement principle is based on changes of the water IR spectrum due to the ions (species and concentration dependent)
• multicomponent analysis of several salt ions was successful
• the influence of Na⁺ and K⁺ on the water spectrum is too similar to be discriminated
• estimated detection limits: 100 mM·L⁻¹ for all ion species except SO₄²⁻; 0.3 mM·L⁻¹ for SO₄²⁻
• Cl⁻, Na⁺ + K⁺, and SO₄²⁻ can be determined at the concentrations present in sea water
• Ca²⁺, Br⁻, and Mg²⁺ are present in real world samples at too low concentrations

Design of Training Data Set
In general …

• Quality of the training data set is the most important aspect!
– Predictive ability of the equations is only as good as the data used to calculate them in the first place!

• Control the variables:
– Collecting representative samples.
– Accurate primary calibration method.
– Appropriate sample measurements (reproducibility of conditions, etc.).

Design of Training Data Set
Training samples similar to unknown samples

• Training samples should be as similar as possible to unknown samples!
– Spectrum of a pristine constituent looks different from when it is part of a mixture!
– Exception: very simple mixtures; samples in gas phase.
– Factor based models (PCA, PLS) can compensate for interconstituent interactions.

• BUT … only if the training set contains examples of these!


• If samples are simple mixtures:
– Few components, distinct absorption features.
– Use simple models (CLS, ILS)!

• If samples are complex mixtures:
– Many components, overlapping absorption features.
– Use factor based models (PCR, PLS) extracting the relevant information from the spectra and ignoring the rest!
– HOWEVER: give the model the best chance to learn! Train it using samples that emulate the unknowns as closely as possible.


• Strategy:
– Collect actual samples from the measurement environment (e.g. plant, the field, etc.).
– Analyze them in the lab using other primary calibration methods (e.g. chromatography, wet chemical test, etc.).
– These data, along with the sample spectra, form the training set used to build a reliable calibration model.

Design of Training Data Set
Bracket concentration range

• Strategy:
– Constituent values for the training samples should span the expected range of all future unknown samples.
– Extrapolation is generally not a good idea!
– External validation is the only way to determine how well a model will predict outside the original calibration range.
– Consequently: constituent values in the training samples should extend both above and below the values expected in unknown samples.
– Do not hesitate to use a lot of calibration samples!
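Whether a value falls outside the bracketed calibration range is easy to check programmatically. The sketch below flags concentrations that would require extrapolation; the function name `outside_calibration_range` and the numbers are hypothetical.

```python
import numpy as np

def outside_calibration_range(cal_conc, new_conc):
    """Flag values outside the concentration range spanned by the training set.

    cal_conc: (n_training, n_constituents) reference concentrations.
    new_conc: (n_unknown, n_constituents) predicted or expected concentrations.
    Returns a boolean array, True where a value would require extrapolation.
    """
    cal_conc = np.asarray(cal_conc, dtype=float)
    new_conc = np.asarray(new_conc, dtype=float)
    low, high = cal_conc.min(axis=0), cal_conc.max(axis=0)   # per-constituent range
    return (new_conc < low) | (new_conc > high)

# Hypothetical training set: two constituents, arbitrary units
cal = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
unknowns = np.array([[2.0, 15.0], [6.0, 25.0], [4.0, 5.0]])
flags = outside_calibration_range(cal, unknowns)
print(flags)
```

A flagged prediction is not necessarily wrong, but it lies where the model has never been validated.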

Design of Training Data Set
Use enough samples …

• Strategy:
– Training set must have at least as many samples as there are constituents of interest.
– Usually many more than that (Note: noise)!

• How many samples are required to build a good model?
– As many samples as it takes!
– The more data fed into the model, the higher your confidence in the prediction!


• Strategy:
– Use a sufficiently large number of samples for calibration to allow sufficient factors in the model.
– For complex matrices you need enough samples to account for all the variability in the real samples.
– Note: the maximum number of factors that can be calculated for a given training set is limited by the smallest dimension of the data matrix.
  Example: if a training set has 300 samples but the calibration regions have only 20 total spectral data points, then the maximum number of factors is limited to 20 as well.
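The 300 samples by 20 points example can be verified numerically: a decomposition of such a matrix yields at most 20 factors. The data below are random numbers standing in for spectra, which is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.normal(size=(300, 20))             # 300 samples, 20 spectral points
centered = spectra - spectra.mean(axis=0)        # mean-center each spectral point

s = np.linalg.svd(centered, compute_uv=False)    # one singular value per possible factor
max_factors = min(spectra.shape)                 # smallest dimension of the data matrix
rank = np.linalg.matrix_rank(centered)

print(s.size, max_factors, rank)
```

With mean-centering and fewer samples than spectral points, the limit drops by one, to the number of samples minus 1, because centering removes one degree of freedom.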


• Keep in mind:
– Quantity AND quality of the data are important!
– The more samples, the better the discrimination between analytically relevant signatures and noise.
– BUT: Only accumulating a huge number of spectra as a training set will not guarantee a better model - carefully measuring and qualifying a much smaller number for calibration is the way to go!

Design of Training Data Set
Constituent collinearity …

• We already know:
– Collinearity is the effect observed when the relative amounts of two or more constituents are constant throughout all the training samples.

• Why is this a problem?
– Factor based models do not calibrate by creating a direct relationship between the constituent data and spectral response.
– They correlate the change in concentration to corresponding changes in the spectra.
– If constituents are collinear, multivariate models cannot differentiate them, and calibrations for the constituents will be unstable.


• Note:
– For simple bivariate calibrations we make one stock solution with high concentrations of all constituents of interest. Then we make multiple dilutions of that one mixture to create the remaining samples.
– This procedure will completely fail for multivariate models!
– In an eigenvector-based model, only one factor will arise, containing nearly all the variance in the training data set.
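The failure of a dilution series can be reproduced with synthetic data: three constituents diluted from one stock give spectra whose variance sits almost entirely in a single factor. The pure-component spectra, stock concentrations, and noise level below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
pure = np.abs(rng.normal(size=(3, 100)))             # hypothetical pure-component spectra
stock = np.array([5.0, 3.0, 1.0])                    # stock concentrations (arbitrary units)
dilutions = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1])

conc = np.outer(dilutions, stock)                    # every sample is a multiple of the stock
spectra = conc @ pure + 1e-4 * rng.normal(size=(6, 100))   # linear mixing plus small noise

centered = spectra - spectra.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
percent = 100.0 * s**2 / np.sum(s**2)                # variance share of each factor
print(np.round(percent[:3], 4))
```

Although three constituents are present, the model sees only one direction of variation and therefore cannot assign it to any single constituent.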

How can we determine that collinearity happens?


• Plot the sample concentrations of each constituent in the model against the others:
– If the points fall on a straight line, the concentrations are collinear.
– If the constituents were uncorrelated, they would form a cluster of points!
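The same check can be done numerically by computing the pairwise correlation of the concentration columns. In this sketch the function name `collinear_pairs` and the 0.95 threshold are arbitrary assumptions, not values from the slides.

```python
import numpy as np

def collinear_pairs(conc, names, threshold=0.95):
    """Return constituent pairs whose concentrations are strongly correlated.

    conc: (n_samples, n_constituents) training concentrations.
    threshold: |R| above which a pair is reported (assumed value).
    """
    conc = np.asarray(conc, dtype=float)
    r = np.atleast_2d(np.corrcoef(conc, rowvar=False))   # constituent-by-constituent R
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

names = ["A", "B"]
# Dilution series of one stock: points fall on a straight line
diluted = np.outer([1.0, 0.8, 0.6, 0.4, 0.2], [5.0, 3.0])
# 3 x 3 factorial design: every level of A combined with every level of B
levels = [1.0, 3.0, 5.0]
factorial = np.array([[a, b] for a in levels for b in levels])

print(collinear_pairs(diluted, names))
print(collinear_pairs(factorial, names))
```

The factorial design is one simple way to obtain uncorrelated concentrations; any design that varies the constituents independently serves the same purpose.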

Spectral region selection
An optimization problem …

• Should we always use the whole spectrum?
– It is very easy (and convenient) to simply select the entire range of the training spectra as the set of data to use for calibration.
– PLS and PCR models will certainly be able to figure out the regions in the spectra that are most important for calibration.

Since there is no apparent penalty in using as many wavelengths as possible for calibrations, why not just use the entire spectrum?


• There are many reasons why not …
– There are regions of the spectrum where the detector, the spectrometer source, or the optics are not effective (e.g. noise at detector cut-off).
– Example: including data from wavelengths below the detector cut-off adds randomly distributed and uncorrelated absorbances to the factors.
– In general, selecting only the highly correlated regions of the spectrum for calibration will improve the accuracy for predicting the constituents of interest (Note: we evaluate the changes in the spectra!).


• There are more reasons why not …
– Use this information along with your chemical knowledge on the samples to pre-select spectral regions for inclusion in a calibration.
– PCA and PLS factor analysis can correct for some non-linearities (e.g. deviations from Beer’s law).
– However, they cannot correct for regions of over-absorbance (total absorbance).


• What is the price to pay?
– Discovery of impurities in the samples or unknown absorbers may be impaired.
– If the spectral bands of the impurities do not appear in the selected calibration regions, then there will be no indication that the predicted constituent values are potentially incorrect.
– Only a problem if samples are “entirely” unknown, which is a problem as such anyway (see also: how much should we know about the sample to establish reliable calibration models)!

Spectral region selection
How do we determine the useful regions?

• Correlation analysis
– Calculate the correlation of the absorbance at every wavelength in the training spectra to the concentrations of constituents.
– Regions that show high correlation are regions that should be selected for calibration; regions that show low or no correlation should be ignored.


– 20 FT-IR spectra of ethanol/water mixtures.
– Coefficient of determination (R²).
  (Note: goodness of fit for linear regression; 1 = perfect fit; values near 1 indicate regions of high correlation between the spectral absorbances and the constituent concentrations; regions near 0 are not correlated.)
– Linear correlation (R).
  (Note: this type of correlation plot indicates not only the regions of the spectrum that are correlated to the constituents but the type of correlation as well.)
– Negative correlation in (R): two-constituent mixture; an increase in ethanol concentration gives a corresponding decrease in water.
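A correlation analysis of this kind can be sketched in a few lines: compute R between the absorbance at each spectral point and the constituent concentration, then square it for R². The two Gaussian bands below are synthetic stand-ins for an "ethanol" and a "water" band, not the 20 FT-IR spectra mentioned above, and the function name `correlation_spectrum` is hypothetical.

```python
import numpy as np

def correlation_spectrum(spectra, conc):
    """Pearson R between the absorbance at each spectral point and one constituent."""
    xc = spectra - spectra.mean(axis=0)               # center each spectral point
    yc = conc - conc.mean()                           # center the concentrations
    denom = np.sqrt((xc**2).sum(axis=0) * (yc**2).sum())
    r = np.zeros(spectra.shape[1])
    np.divide(xc.T @ yc, denom, out=r, where=denom > 0)   # leave R = 0 where undefined
    return r

# Synthetic two-constituent mixture (assumption): fraction c of "ethanol", 1 - c of "water"
x = np.arange(200)
band_e = np.exp(-0.5 * ((x - 50) / 5.0) ** 2)         # "ethanol" band at point 50
band_w = np.exp(-0.5 * ((x - 150) / 5.0) ** 2)        # "water" band at point 150
c = np.linspace(0.05, 0.95, 20)
rng = np.random.default_rng(4)
spectra = np.outer(c, band_e) + np.outer(1.0 - c, band_w)
spectra += 1e-3 * rng.normal(size=spectra.shape)

r = correlation_spectrum(spectra, c)
r2 = r**2                                             # coefficient of determination
print(round(r[50], 3), round(r[150], 3))
```

The "water" band shows a strongly negative R with respect to the "ethanol" concentration, yet its R² is just as high, which is why negatively correlated regions can be equally useful.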


• Reason for ‘negative correlation’
– Increasing the concentration of one constituent in a mixture "dilutes" the others.
– If dilution occurs as a ratio function of the increase of the constituent added, a negative correlation will appear.
– In most cases, these regions are as useful for calibration as the positively correlated regions!

Always true?


• No!
– Collinearity!
– If the concentrations of the constituents of interest vary as a function of one another (or of other unknown constituents), the correlations will indicate regions that are not really useful.
– Example: creating a training set by simply making dilutions of a single mixture!

Be careful in correlation analysis!

Artificial neural networks
What is it?

• Definition?
– Sophisticated modeling techniques capable of modeling extremely complex functions.
– Capable of modeling non-linear relationships.
– Can handle large numbers of variables.
– They learn by example: the NN user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Hence: easy to use!

Artificial neural networks
What is it?

• When to apply?
– NN are applicable in virtually every situation in which a relationship between the predictor variables (independents, inputs) and predicted variables (dependents, outputs) exists.
– Even when the relationship is very complex and not easy to articulate in the usual terms of ‘correlations’ or ‘defined differences between groups’.


• Why ‘neural’?
– Grew out of research in Artificial Intelligence.
– Attempts to mimic the fault-tolerance and capacity to learn of biological neural systems by modeling the low-level structure of the brain.
– Idea: brain composed of a large number (approx. 10¹⁰) of neurons; massively interconnected (average of several thousand interconnects per neuron, although this varies enormously ;-)


• Why ‘neural’?
– Each neuron is a specialized cell which can propagate an electrochemical signal.
– Neuron has a branching input structure (dendrites), cell body, and a branching output structure (axon).
– Axons of one cell connect to the dendrites of another via a synapse.
– If a neuron is activated, it fires an electrochemical signal along the axon.
– Signal crosses the synapses to other neurons, which may in turn fire.
– Neuron fires only if the total signal received at the cell body from the dendrites exceeds a certain level (firing threshold).


• Why ‘neural’?
– Conclusion: from a very large number of extremely simple processing units (each performing a weighted sum of its inputs, and then firing a binary signal if the total input exceeds a certain level) the brain manages to perform extremely complex tasks.

Can we use that model?


• The artificial neural network (ANN)
– Artificial neuron:
  + Receives a number of inputs (either from original data, or from the output of other neurons in the ANN).
  + Each input comes via a connection that has a strength (‘weight’).
  + Weights correspond to synaptic efficacy in a biological neuron.
  + Each neuron also has a single threshold value.
  + The weighted sum of the inputs is formed, and the threshold subtracted, to compose the ‘activation’ of the neuron (post-synaptic potential).
  + The activation signal is passed through an activation function (‘transfer function’) to produce the output.
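The artificial neuron described above translates almost directly into code: a weighted sum of the inputs, minus the threshold, passed through an activation function. The sketch below offers a step function and a sigmoid as two common choices of activation; the function names and the example numbers are illustrative assumptions.

```python
import numpy as np

def step(a):
    """Step activation: 1 if the activation is >= 0, else 0."""
    return np.where(a >= 0.0, 1.0, 0.0)

def sigmoid(a):
    """Smooth alternative to the step function, with output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(inputs, weights, threshold, activation=sigmoid):
    """Output of one artificial neuron."""
    a = np.dot(weights, inputs) - threshold    # weighted sum minus threshold
    return activation(a)                       # pass through the transfer function

inputs = [1.0, 0.5]
weights = [0.5, 0.5]                           # weighted sum = 0.75
print(float(neuron(inputs, weights, 0.5, activation=step)))   # 0.75 - 0.5 >= 0, fires
print(float(neuron(inputs, weights, 1.0, activation=step)))   # 0.75 - 1.0 < 0, silent
print(float(neuron(inputs, weights, 0.75)))                   # sigmoid(0) = 0.5
```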


• The artificial neural network (ANN)
– How does it work?
  If a step activation function is used (i.e. neuron output is 0 if the input is less than zero and 1 if the input is greater than or equal to 0), then the neuron acts like the biological neuron. (Note: usually sigmoid functions are applied.)
– Network: inputs (which carry the values of variables of interest in the outside world) and outputs (which form predictions, or control signals) have to be connected. Inputs and outputs correspond to sensory and motor nerves such as those coming from the eyes and leading to the hands.


• The artificial neural network (ANN)
– Usual structure:
  Input layer, hidden layer and output layer connected together in a ‘feed-forward’ structure: signals flow from inputs, forward through any hidden units, finally reaching the output units.
– Distinct layered topology.
– Hidden and output layer neurons are each connected to all of the units in the preceding layer.

Artificial neural networks
How does it work?

• Operation of an ANN
– Feed information:
  The input variable values are placed in the input units.
– Processing:
  + The hidden and output layer units are progressively executed.
  + Each of them calculates its activation value by taking the weighted sum of the outputs of the units in the preceding layer and subtracting the threshold.
  + Activation value is passed through the activation function to produce the output of the neuron.
  + When the entire network has been executed: outputs of the output layer act as the output of the entire network.
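The operation steps above can be sketched as a short feed-forward pass. The network size (3 inputs, 4 hidden units, 2 outputs), the random untrained weights, and the use of a sigmoid in every layer are assumptions for illustration; training the weights is not shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, layers):
    """Execute a fully connected feed-forward network layer by layer.

    layers: list of (weights, thresholds) pairs, one per hidden/output layer;
            weights has shape (n_units, n_units_in_preceding_layer).
    """
    out = np.asarray(x, dtype=float)               # values placed in the input units
    for weights, thresholds in layers:
        activation = weights @ out - thresholds    # weighted sum minus threshold
        out = sigmoid(activation)                  # output of every unit in this layer
    return out                                     # outputs of the output layer

rng = np.random.default_rng(3)
layers = [
    (rng.normal(size=(4, 3)), rng.normal(size=4)),   # hidden layer: 4 units, 3 inputs each
    (rng.normal(size=(2, 4)), rng.normal(size=2)),   # output layer: 2 units, 4 inputs each
]
output = feed_forward([0.2, 0.5, 0.1], layers)
print(output)
```

For quantitative calibration a linear output unit is often used instead of a sigmoid, so that predictions are not confined to the interval between 0 and 1.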
