matti hotokka physical chemistry Åbo akademi university

Chemometrics

Matti Hotokka, Physical chemistry

Åbo Akademi University

Signal y: the quantity to be modeled or optimized, e.g., yield, measuring time, figure of merit, or deviation from a model.

Category x: numbers the experiments

Signal

Analytical function

y

x

Signal processing, e.g., Fourier transforms, Hadamard transforms, etc.; see courses in spectroscopy.

Signal

Analytical function

Factor, or feature: pH, concentration, temperature, ...

A huge number of factors govern every measurement. The chemist must know which are important and must be tested.

The others are kept as constant as possible.

Factor

Every measurement is repeated from the start a number of times so that a mean, a standard error and a confidence limit can be determined.

Observation: the mean of a set of parallel measurements.

Blank: reference observation with the default value of all the important factors, y_B.
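As a minimal sketch of how an observation is formed from parallel measurements, the snippet below computes the mean, the standard error and 95 % confidence limits. The four readings are the replicates of run 1 in the factorial example of these notes; the Student t value 3.18 (3 degrees of freedom) is the one quoted there. The variable names are illustrative only.

```python
import math
import statistics

# Four replicate readings (run 1 of the factorial example in these notes).
readings = [6.60, 6.74, 6.81, 6.52]

n = len(readings)
mean = statistics.mean(readings)
s = statistics.stdev(readings)       # sample standard deviation, n-1 degrees of freedom
std_error = s / math.sqrt(n)         # standard error of the mean

# Two-sided 95 % Student t value for n-1 = 3 degrees of freedom.
T_95_DF3 = 3.18
half_width = T_95_DF3 * std_error    # half-width of the confidence interval

print(f"mean = {mean:.2f}, s = {s:.2f}, SE = {std_error:.3f}")
print(f"95 % confidence limits: {mean - half_width:.2f} to {mean + half_width:.2f}")
```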

Observation

Repetition: Repeated reading of the meter.

Replication: New measurement from start.

Repetitions test your ability to read a digital meter. Replications test the experimental errors in the measuring procedure.

Observation

Replications vs. Repetitions

* Sensitivity
* Detection limit
* Precision and trueness
* Specificity and selectivity

Definitions

Calibration parameters

Sensitivity = slope

Definitions

Calibration curve

[Figure: calibration curve, signal y versus concentration x, with the slope triangle Δx, Δy and the intercept b0 marked.]

The intercept b0 can be ignored if the sample is measured against a blank (reference).

Dynamic range: the valid range of x where the signal y depends functionally on x.

Analytical range: the interval of x where the signal y can be determined accurately.

Definitions

Analytical range

Definitions

[Figure: signal y versus concentration x. Labels: Dynamic range, Analytical range, DL, LoD.]

Detection limit: lowest value of x where the signal can still be separated from the noise. The noise is measured as the variance of the blank.

Definitions

Detection limit

Limit of determination: lowest value of x (concentration) where y can be determined with a useful accuracy.

Definitions

Limit of determination

Error e in variable x (say, concentration)

Definitions

Bias

[Figure: bias. An x axis with the mean x̄ and the true value x_true marked. Labels: Random error, Systematic error.]

Precision = repeatability

Definitions

Precision and trueness

Trueness = deviation from true value

Selectivity: possibility to measure in the presence of interfering components.

Specificity: sensitivity for a given analyte.

Analytical resolution: N = x/Δx.

Definitions

Selectivity


All experiments must always be made in random order.

Random sampling

Why randomization

[Figure: two panels, Systematic and Random, plotting response versus concentration for measurements 1 to 5. Labels: True slope, Drift.]

The random sequences are obtained from tables of random numbers. Nowadays the random number generators of pocket calculators may be used.

Normally, you get the same sequence every time. This is OK. If you want truly random numbers you should use a random seed.
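The effect of the seed can be illustrated with a short sketch. It uses Python's standard `random` module; the seed value 12345 and the variable names are arbitrary choices made for this illustration, not part of the original notes.

```python
import random

concentrations = ["A", "B", "C", "D"]

# A fixed seed reproduces the same "random" order on every execution.
fixed = random.Random(12345)
order_1 = fixed.sample(concentrations, k=len(concentrations))
fixed = random.Random(12345)
order_2 = fixed.sample(concentrations, k=len(concentrations))
same_every_time = order_1 == order_2

# Drawing from the operating system's entropy source gives an
# unpredictable order instead.
unpredictable = random.SystemRandom()
order_3 = unpredictable.sample(concentrations, k=len(concentrations))

print(order_1, order_2, order_3)
```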

Random sampling

Random number lists

Assume that four different concentrations are to be tested. Name them A, B, C, D.

Make four parallel measurements for each: A1, A2, A3, A4 etc.

Random sampling

Always randomize

First run: Measure A1, B1, C1, D1.

Second run: Start from scratch and do A2, B2, etc.

Random sampling

Always randomize

A1, B1, C1, D1

A2, B2, C2, D2

A3, B3, C3, D3

A4, B4, C4, D4

Wrong! Systematic errors will not be found.

Randomize the order of concentrations in the runs.

Random sampling

Always randomize

A1, B1, C1, D1

C2, D2, A2, B2

D3, A3, B3, C3

B4, C4, D4, A4
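A randomized plan of this kind could be generated as follows. This is only a sketch: the function name `randomized_plan` and the seed are hypothetical, and each run is shuffled independently, so the generated orders will generally differ from the plan shown above.

```python
import random

def randomized_plan(levels, n_runs, seed=None):
    """Return one independently shuffled measurement order per run."""
    rng = random.Random(seed)
    plan = []
    for run in range(1, n_runs + 1):
        order = list(levels)
        rng.shuffle(order)
        plan.append([f"{level}{run}" for level in order])
    return plan

plan = randomized_plan(["A", "B", "C", "D"], n_runs=4, seed=1)
for row in plan:
    print(", ".join(row))
```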

Use linear (or non-linear) regression to analyse the results. Specifically, plot the residuals to see whether some effects were not captured.
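The sketch below illustrates the idea with synthetic data. The measurement order is the randomized plan listed earlier; the concentrations 1 to 4, the true slope 2.0 and the drift of 0.05 signal units per measurement are assumptions chosen only for the demonstration. A straight line is fitted by ordinary least squares and the residuals are averaged per run.

```python
# Measurement order from the randomized plan; labels encode the
# concentration (A-D) and the run number (1-4).
order = ["A1", "B1", "C1", "D1", "C2", "D2", "A2", "B2",
         "D3", "A3", "B3", "C3", "B4", "C4", "D4", "A4"]
conc = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}   # hypothetical concentrations

# Synthetic signal: true slope 2.0 plus a linear instrument drift in time.
x = [conc[label[0]] for label in order]
y = [2.0 * xi + 0.05 * t for t, xi in enumerate(x)]

# Ordinary least-squares straight line y = b0 + b1*x.
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Average residual per run: a steady trend across the runs reveals the drift.
run_mean = {}
for run in "1234":
    r = [res for label, res in zip(order, residuals) if label[1] == run]
    run_mean[run] = sum(r) / len(r)
print({run: round(m, 3) for run, m in run_mean.items()})
```

Because every run contains each concentration once, the per-run averages of the residuals rise steadily from run 1 to run 4, which is the signature of drift.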

Random sampling

Always randomize

Random sampling

Analysis

[Figure: signal y versus concentration (A, B, C, D); each point is labeled with its run number, 1 to 4.]

A1, B1, C1, D1

C2, D2, A2, B2

D3, A3, B3, C3

B4, C4, D4, A4

Random sampling

Residuals

[Figure: residuals versus concentration (A, B, C, D); each point is labeled with its run number, 1 to 4.]

A1, B1, C1, D1

C2, D2, A2, B2

D3, A3, B3, C3

B4, C4, D4, A4

Drift!

* Controlled factors
  * Varied systematically or kept constant
* Known factors that cannot be controlled
  * E.g., drift of instrument
* Unknown factors that can be anticipated
  * E.g., impurities of the chemicals
* Truly unknown effects

Random sampling

Types of factors

Some constant factors cannot be kept fixed but vary from batch to batch, day to day, ...

Make a series of measurements varying one factor and keeping the other conditions as constant as possible => A1, B1, C1, D1. This is a block. Then measure A2, B2, C2, D2, again keeping the conditions constant, but not necessarily the same as in block 1 if this is not possible.

Random sampling

Blocking

In 1747, while serving as surgeon on HM Bark Salisbury, James Lind carried out a controlled experiment to develop a cure for scurvy.

Lind selected 12 men from the ship, all suffering from scurvy, and divided them into six pairs, giving each group different additions to their basic diet for a period of two weeks. The treatments were all remedies that had been proposed at one time or another. They were:

* A quart of cider every day

* Twenty five gutts (drops) of elixir vitriol (sulphuric acid) three times a day upon an empty stomach,

* One half-pint of seawater every day

* A mixture of garlic, mustard, and horseradish in a lump the size of a nutmeg

* Two spoonfuls of vinegar three times a day

* Two oranges and one lemon every day.

The men who had been given citrus fruits recovered dramatically within a week. One of them returned to duty after 6 days and the other became nurse to the rest. The others experienced some improvement, but nothing was comparable to the citrus fruits, which were proved to be substantially superior to the other treatments.

In this study his subjects' cases "were as similar as I could have them", that is, he provided strict entry requirements to reduce extraneous variation. The men were paired, which provided blocking. From a modern perspective, the main thing that is missing is randomized allocation of subjects to treatments.

[http://en.wikipedia.org/wiki/Design_of_experiments]

Random sampling

Controlled experiment

In mathematics, a Latin square is an n×n table where n different symbols are placed so that each symbol occurs exactly once in every row and in every column.

Random sampling

Latin squares

2 1 3
1 3 2
3 2 1

A reduced (or normalized) Latin square has the symbols in the natural order in the first row and the first column.

1 2 3
2 3 1
3 1 2

Special case: Sudoku.

* 3×3 squares: 12 different (1 reduced)
* 4×4: 576 (4)
* 5×5: 161280 (56)
* 9×9: 5.5×10^27 (3.8×10^17)
* Tabulated for a few simple cases

Random sampling

Latin squares

Randomize the blocking experiment.

Random sampling

Latin square designs

Run   Sample
      1   2   3   4
1     A   B   D   C
2     D   C   A   B
3     B   D   C   A
4     C   A   B   D

Observe the good balance.
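A Latin square design of this kind can be produced and checked with a few lines of code. The sketch below shuffles the rows, columns and symbols of a cyclic square; this is a simple construction and does not sample uniformly from all Latin squares. The function names are hypothetical.

```python
import random

def random_latin_square(symbols, seed=None):
    """Build a Latin square by shuffling rows, columns and symbols of a cyclic square."""
    rng = random.Random(seed)
    n = len(symbols)
    # Cyclic square: row i is the index list rotated by i positions.
    square = [[(i + j) % n for j in range(n)] for i in range(n)]
    rng.shuffle(square)                       # permute the rows
    cols = list(range(n))
    rng.shuffle(cols)                         # permute the columns
    labels = list(symbols)
    rng.shuffle(labels)                       # permute the symbols
    return [[labels[row[c]] for c in cols] for row in square]

def is_latin(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

design = random_latin_square("ABCD", seed=7)
for run, row in enumerate(design, start=1):
    print(run, " ".join(row))
```

The same `is_latin` check also confirms that the 4×4 design shown above is a valid Latin square.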

* Typically two-level experiments
  * A low level and a high level for each factor.
* Typically for screening
  * Study which of the presumed factors really show a significant effect.

Factorial designs

What?

Each factor is tested at a low and a high level. Designate the levels symbolically -1 and +1.

Factorial designs

Two levels

Rate of p-phenylenediamine (PPD) oxidation at a constant enzyme level of 13.6 mg L^-1 is studied using spectrophotometry:

Factor       Level -1   Level +1
T, °C        35         40
pH           4.8        6.4
[PPD], mM    0.5        27.3

Let there be k factors. Each has one of two values. There will be 2^k possible combinations.

Factorial designs

2^k design

Run   Coded factor levels
      T     PPD   pH
1     +1    -1    -1
2     +1    +1    -1
3     +1    +1    +1
4     +1    -1    +1
5     -1    -1    -1
6     -1    +1    -1
7     -1    +1    +1
8     -1    -1    +1

[Figure: the design points as corners of a cube with axes x1, x2 and x3, each running from -1 to +1.]
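The 2^k combinations can be enumerated mechanically. The following sketch uses `itertools.product` from the Python standard library; the run order it produces is the natural enumeration order, not the order of the table above.

```python
from itertools import product

factors = ["T", "PPD", "pH"]

# All 2^k combinations of the coded levels -1 and +1.
design = list(product([-1, +1], repeat=len(factors)))

for run, levels in enumerate(design, start=1):
    print(run, dict(zip(factors, levels)))
```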

Factorial design

Experiment plan

Run   Factors          y1      y2      y3      y4      E(y)    s
      T    PPD   pH
1     +    -     -     6.60    6.74    6.81    6.52    6.67    0.13
2     +    +     -     11.56   11.86   11.80   11.66   11.72   0.14
3     +    +     +     14.71   14.56   14.95   14.88   14.78   0.18
4     +    -     +     8.16    7.93    8.27    8.12    8.12    0.14
5     -    -     -     6.31    6.45    6.42    6.22    6.35    0.10
6     -    +     -     11.24   11.14   11.01   11.04   11.11   0.10
7     -    +     +     14.12   13.88   14.26   14.08   14.09   0.16
8     -    -     +     7.80    7.40    7.62    7.71    7.63    0.17

No, not quite like this ...

Factorial design

Experimental plan

Run   Factors          y1      y2      y3      y4      E(y)    s
      T    PPD   pH
4     +    -     +     8.16    7.93    8.27    8.12    8.12    0.14
7     -    +     +     14.12   13.88   14.26   14.08   14.09   0.16
1     +    -     -     6.60    6.74    6.81    6.52    6.67    0.13
3     +    +     +     14.71   14.56   14.95   14.88   14.78   0.18
6     -    +     -     11.24   11.14   11.01   11.04   11.11   0.10
5     -    -     -     6.31    6.45    6.42    6.22    6.35    0.10
8     -    -     +     7.80    7.40    7.62    7.71    7.63    0.17
2     +    +     -     11.56   11.86   11.80   11.66   11.72   0.14

... but like this, randomized.

N.B. Also randomize the replications for each run!

Factorial design

2^k design

Run   Coded factor levels                              E(y)
      Main effects        Interaction effects
      T     PPD   pH      TxPPD   TxpH   PPDxpH
1     +1    -1    -1      -1      -1     +1            6.67
2     +1    +1    -1      +1      -1     -1            11.72
3     +1    +1    +1      +1      +1     +1            14.78
4     +1    -1    +1      -1      +1     -1            8.12
5     -1    -1    -1      +1      +1     +1            6.35
6     -1    +1    -1      -1      +1     -1            11.11
7     -1    +1    +1      -1      -1     +1            14.09
8     -1    -1    +1      +1      -1     -1            7.63

Compute the differences high level - low level, e.g., for temperature:

D_T = (y1 + y2 + y3 + y4)/4 - (y5 + y6 + y7 + y8)/4

Experimental accuracy?

Four parallel determinations => s = 0.14, d.f. = 3.

Factorial design

2^k design

D_T = 0.53
D_PPD = 5.73
D_pH = 2.19
D_TxPPD = 0.123
D_TxpH = 0.062
D_PPDxpH = 0.828

Statistically significant effects at 95 % confidence: |D| > t * s, where t is the Student t value.
s = 0.18 (largest), 3 degrees of freedom => t = 3.18
|D| > 3.18 * 0.18 = 0.57

D_PPD, D_pH and D_PPDxpH are significant.
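The arithmetic behind these effects can be sketched in a few lines. The coded levels and mean responses E(y) are taken from the table of the eight runs; the interaction columns are formed as products of the main-effect columns, and the threshold uses the values t = 3.18 and s = 0.18 quoted above. The helper name `effect` is an illustrative choice.

```python
# Coded levels (T, PPD, pH) and mean responses E(y) for runs 1-8.
runs = [
    (+1, -1, -1, 6.67), (+1, +1, -1, 11.72), (+1, +1, +1, 14.78), (+1, -1, +1, 8.12),
    (-1, -1, -1, 6.35), (-1, +1, -1, 11.11), (-1, +1, +1, 14.09), (-1, -1, +1, 7.63),
]

def effect(signs, responses):
    """Mean response at the +1 level minus mean response at the -1 level."""
    high = [y for s, y in zip(signs, responses) if s > 0]
    low = [y for s, y in zip(signs, responses) if s < 0]
    return sum(high) / len(high) - sum(low) / len(low)

y = [r[3] for r in runs]
columns = {
    "T": [r[0] for r in runs],
    "PPD": [r[1] for r in runs],
    "pH": [r[2] for r in runs],
    "TxPPD": [r[0] * r[1] for r in runs],
    "TxpH": [r[0] * r[2] for r in runs],
    "PPDxpH": [r[1] * r[2] for r in runs],
}
effects = {name: effect(signs, y) for name, signs in columns.items()}

# Significance threshold: t (95 %, 3 d.f.) times the largest s.
threshold = 3.18 * 0.18
significant = [name for name, d in effects.items() if abs(d) > threshold]

for name, d in effects.items():
    print(f"D_{name} = {d:.3f}")
print("significant:", significant)
```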

A graphical inspection shows the effects qualitatively.

Factorial design

2^k design

[Figure: three panels showing the response (scale 0 to 15) at the coded levels -1 and +1 of PPD, temperature and pH.]

ANOVA is very practical if there are two factors. Multi-way ANOVA is possible but not so illustrative as an example.

MANOVA is not discussed here.

Factorial design

2^k design

Cumulative distribution function (assume normal distribution). Most statistical quantities are normally distributed.

Factorial design

2^k design

[Figure: three panels: Probability distribution (P versus x), Cumulative distribution (y versus x, rising from 0 to 1), and An approximation.]

Quantiles: Select the desired value of q > 1. The q-quantiles are the q-1 cut points, numbered by the index k, 0 < k < q, that divide the total range of a random variable x into q sections of equal probability. The kth q-quantile is the x value where the cumulative probability of the random variable reaches k/q, so that the probability of a larger value is (q-k)/q.

The most common cases are

* The 2-quantile: median
* The 4-quantiles: quartiles
* The 10-quantiles: deciles

Factorial design

2^k design

[Figure: cumulative distribution y versus x, rising from 0 to 1, with the levels 1/4 (Q) and 1/2 (Median) marked.]

Let y = Φ(x) be the cumulative probability function. Then the x corresponding to a given probability is x = Φ^-1(y). The points x are not equidistant but the cumulative probabilities y are equidistant.

The x values of the limits for the kth q-quantile are shifted here so that they start from zero. They are obtained from the formula

x_k = Φ^-1(0.5 + 0.5*(k - 0.5)/q),  k = 1, ..., q

Factorial design

2^k design

Half-normal q-plot: Compare the distribution of your data points with the normal distribution. On the x axis, choose q quantile points at positions obtained from the theoretical normal distribution. Plot your q data points in ascending order against these x values. If the points lie on a straight line, the data points are normally distributed.

Factorial design

2^k design

Six differences, D_T, D_PPD, D_pH, D_TxPPD, D_TxpH and D_PPDxpH, therefore q = 6. The normally distributed x values are

k    y = 0.5 + 0.5*(k-0.5)/6    x
1    0.542                      0.106
2    0.625                      0.319
3    0.708                      0.548
4    0.792                      0.813
5    0.875                      1.150
6    0.958                      1.728

Sort the D values in ascending order and associate them with the theoretical quantile points.

Factorial design

2^k design

x        y        Effect
0.106    0.062    TxpH
0.319    0.123    TxPPD
0.548    0.53     T
0.813    0.828    PPDxpH
1.150    2.19     pH
1.728    5.73     PPD
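The quantile points and the pairing with the sorted effects can be reproduced as sketched below, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for the inverse cumulative normal distribution. The computed x values agree with the tabulated ones to within a few thousandths.

```python
from statistics import NormalDist

effects = {"T": 0.53, "PPD": 5.73, "pH": 2.19,
           "TxPPD": 0.123, "TxpH": 0.062, "PPDxpH": 0.828}

q = len(effects)
std_normal = NormalDist()   # mean 0, standard deviation 1

# Theoretical half-normal quantile points x_k = Phi^-1(0.5 + 0.5*(k - 0.5)/q).
x = [std_normal.inv_cdf(0.5 + 0.5 * (k - 0.5) / q) for k in range(1, q + 1)]

# Pair the effects, sorted by absolute size, with the quantile points.
ranked = sorted(effects.items(), key=lambda item: abs(item[1]))
for xk, (name, d) in zip(x, ranked):
    print(f"{xk:.3f}  {abs(d):.3f}  {name}")
```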

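[Figure: half-normal plot of the six effects against the theoretical quantile points; points labeled TxpH, TxPPD, T, PPDxpH, pH and PPD.]

The effects of PPD and pH are stronger than expected for normally distributed variables.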

The combinations of high and low values are arrays.

Factorial design

Orthogonal arrays

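Run   Coded factor levels
      T     PPD   pH
1     +1    -1    -1
2     +1    +1    -1
3     +1    +1    +1
4     +1    -1    +1
5     -1    -1    -1
6     -1    +1    -1
7     -1    +1    +1
8     -1    -1    +1

[Figure: the runs drawn as arrays (vectors) in a coordinate system with axes x1, x2 and x3, each running from 0 to +1.]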

Vectors are orthogonal if the scalar product is zero. With the low level coded as 0, runs 1 and 6 give 1*0 + 0*1 + 0*0 = 0 (orthogonal); runs 1 and 2 give 1*1 + 0*1 + 0*0 = 1 (not orthogonal).
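The two scalar products can be checked directly. In this small sketch the runs are written with the low level coded as 0 and the high level as 1, in the order (T, PPD, pH).

```python
def dot(u, v):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

# Runs with the low level coded as 0 and the high level as 1 (T, PPD, pH).
run_1 = (1, 0, 0)
run_2 = (1, 1, 0)
run_6 = (0, 1, 0)

print(dot(run_1, run_6))  # 0: orthogonal
print(dot(run_1, run_2))  # 1: not orthogonal
```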

The 2^k design gives 8 combinations for k = 3. This can be handled. However, k = 8 gives 2^8 = 256 combinations. Too many degrees of freedom!

Choose half of the combinations, 2^(k-1). However, you cannot choose just any set of combinations. The arrays must be orthogonal.

In the 2^(3-1) case four vectors must be chosen from the total of 8. They are

Factorial design

Orthogonal vectors

Run   Coded factor levels
      T     PPD   pH
1     +1    -1    -1
5     -1    -1    -1
6     -1    +1    -1
8     -1    -1    +1

or actually

Run   Coded factor levels
      T     PPD   pH
1     +1    0     0
5     0     0     0
6     0     +1    0
8     0     0     +1

What you lose when using orthogonal arrays is (some of) the interaction effects.

There are many ways of choosing orthogonal arrays. Plackett and Burman, and Hall, and Taguchi, have published large selections based on Hadamard matrices.

Note that the orthogonal arrays should be well balanced in order to make the analysis meaningful. The arrays shown previously are orthogonal but not balanced. Consider, e.g., the factor T. There is one line with a high value and three lines with a low value, which makes the table unbalanced. The Taguchi tables and others correct this problem.

Factorial design

2^(k-1) design

Full set of experiments, high and low values

Factorial design

Taguchi table L4 (2^3)

1 1 1
1 1 2
1 2 1
2 1 1
1 2 2
2 1 2
2 2 1
2 2 2

Eight experiments

Taguchi design

Factorial design

Taguchi table L4 (2^3)

1 1 1
1 2 2
2 1 2
2 2 1

Four experiments
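Whether a set of runs is balanced can be tested mechanically. The sketch below uses one common criterion, stated here as an assumption: every pair of columns must contain each combination of levels equally often. The function name `is_balanced` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Taguchi L4 (2^3) array: levels 1 (low) and 2 (high) for three factors.
L4 = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]

def is_balanced(array):
    """True if every pair of columns contains each level combination equally often."""
    n_cols = len(array[0])
    for i, j in combinations(range(n_cols), 2):
        counts = Counter((row[i], row[j]) for row in array)
        levels_i = {row[i] for row in array}
        levels_j = {row[j] for row in array}
        if len(counts) != len(levels_i) * len(levels_j):
            return False      # some combination never occurs
        if len(set(counts.values())) != 1:
            return False      # combinations occur unequally often
    return True

print(is_balanced(L4))

# Runs 1, 5, 6 and 8 in 0/1 coding (T, PPD, pH) do not pass the same test.
unbalanced = [(1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
print(is_balanced(unbalanced))
```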

Designs of size 2^(k-p), p > 1, have also been proposed.

Factorial design

More reduction

Factorial design

Orthogonal vectors, analysis

Run   Coded factor levels   y1      y2      y3      y4      E(y)    s
      T     PPD   pH
5     0     0     0         6.31    6.45    6.42    6.22    6.35    0.10
7     0     +1    +1        14.12   13.88   14.26   14.08   14.09   0.16
4     +1    0     +1        8.16    7.93    8.27    8.12    8.12    0.14
2     +1    +1    0         11.56   11.86   11.80   11.66   11.72   0.14

Sums over similar factors and over the total data table.

Here n is the number of repetitions in each run level, n = 4, and k is the number of values a factor can have, k = 2 (low, high).

Factorial design

Orthogonal vectors, analysis

The sum of squares for each factor is calculated from a formula where N is the number of replications, here N = 4. For temperature

For the remaining factors (observe that for products, high and high gives low):

Factorial design

Orthogonal vectors, analysis

The total sum of squares for all replications is

The temperature factor explains 0.2 % of the total SSQ.

The PPD concentration explains 88 % of the variation and pH 12 %. Also the combination PPDxpH has a significant impact.
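The percentages can be reproduced from the four-run data table with the sketch below. The formula used for a two-level factor, SS = (sum at high level - sum at low level)^2 / (total number of observations), is a standard contrast formula and is an assumption here; it is not necessarily the exact expression used on the original slides. The total sum of squares is taken about the grand mean of all 16 replications.

```python
# Replicate data for the four runs; coded levels (T, PPD, pH), 0 = low, 1 = high.
data = {
    (0, 0, 0): [6.31, 6.45, 6.42, 6.22],       # run 5
    (0, 1, 1): [14.12, 13.88, 14.26, 14.08],   # run 7
    (1, 0, 1): [8.16, 7.93, 8.27, 8.12],       # run 4
    (1, 1, 0): [11.56, 11.86, 11.80, 11.66],   # run 2
}
factors = ["T", "PPD", "pH"]

all_y = [y for ys in data.values() for y in ys]
n_total = len(all_y)
grand_mean = sum(all_y) / n_total
ss_total = sum((y - grand_mean) ** 2 for y in all_y)

# Assumed formula: SS = (sum at high level - sum at low level)^2 / n_total.
percent = {}
for i, name in enumerate(factors):
    high = sum(sum(ys) for levels, ys in data.items() if levels[i] == 1)
    low = sum(sum(ys) for levels, ys in data.items() if levels[i] == 0)
    ss = (high - low) ** 2 / n_total
    percent[name] = 100.0 * ss / ss_total

print({name: round(p, 1) for name, p in percent.items()})
```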

* Only certain combinations of factors, run levels and screening values are available
  * L4 (used here) contains two screening levels (low and high), four run levels and three factors
  * L8 has two screening levels, 8 experiments and 7 factors
  * L9 has three screening levels (low, mid, high), 9 experiments and 4 factors
* For available designs, see http://www.york.ac.uk/depts/maths/tables/orthogonal.htm

Taguchi tables

Available combinations

Factorial design

Example of a three-level design

[Figure: response versus factor level for x1 at the three levels -1, 0 and +1.]

Factorial design

Central composite design

Run          x1    x2    x3    Response
1            -1    -1    -1    y1
2            +1    -1    -1    y2
3            +1    +1    -1    y3
4            -1    +1    -1    y4
5            -1    -1    +1    y5
6            +1    -1    +1    y6
7            +1    +1    +1    y7
8            -1    +1    +1    y8
9            -a    0     0     y9
10           +a    0     0     y10
11           0     -a    0     y11
12           0     +a    0     y12
13           0     0     -a    y13
14           0     0     +a    y14
15, 16, 17   0     0     0     y15, y16, y17
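The structure of this table (2^k corner points, 2k axial points at distance a, and replicated center points) can be generated as sketched below. The function name `central_composite` is hypothetical, and the value a = 1.682, often quoted for a rotatable three-factor design, is only an illustrative assumption since the notes do not specify a.

```python
from itertools import product

def central_composite(k, a, n_center):
    """Return the coded points of a central composite design for k factors."""
    # 2^k factorial (corner) points.
    corners = [tuple(p) for p in product([-1, +1], repeat=k)]
    # 2k axial (star) points at distance a along each axis.
    axial = []
    for i in range(k):
        for sign in (-1, +1):
            point = [0.0] * k
            point[i] = sign * a
            axial.append(tuple(point))
    # Replicated center point.
    center = [tuple([0.0] * k)] * n_center
    return corners + axial + center

# a = 1.682 is only an illustrative choice of the axial distance.
design = central_composite(k=3, a=1.682, n_center=3)
for run, point in enumerate(design, start=1):
    print(run, point)
```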

Factorial design

Box-Behnken design

Factorial design

Lattice design

A higher order polynomial (typically second order) is fitted to the observed data.

Factorial design

Response surface method

Intermediate points need to be measured in order to obtain the non-linear coefficients. These are called three-level designs. Multivariate polynomial regression is a common analysis method.
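For a single factor measured at the coded levels -1, 0 and +1, the second-order polynomial through the three points has a closed form, which shows why the middle level is needed for the quadratic coefficient. The sketch below uses hypothetical response values and an illustrative function name.

```python
def quadratic_from_three_levels(y_low, y_mid, y_high):
    """Coefficients of y = b0 + b1*x + b2*x^2 through x = -1, 0, +1."""
    b0 = y_mid
    b1 = (y_high - y_low) / 2.0
    b2 = (y_high + y_low) / 2.0 - y_mid   # needs the middle point
    return b0, b1, b2

# Hypothetical responses at the low, middle and high level of one factor.
b0, b1, b2 = quadratic_from_three_levels(6.0, 10.0, 10.0)
print(b0, b1, b2)

# Stationary point of the fitted parabola (a maximum if b2 < 0).
x_opt = -b1 / (2.0 * b2)
print(x_opt)
```

With only the two outer levels, b1 could still be computed but b2 would be undetermined.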

* Biggest is best
  * Find a set of factor values that give maximal response (e.g., yield)
* Smallest is best
  * Find minimum
* Nominal is best
  * Minimize the difference (measured - nominal)

Response surface

Optimization tasks

Response surface

The response

[Figure: the response drawn as a surface over the factors pH and PPD.]

* Any optimization strategy can be used
* The brute force method is often expensive
* Single factor at a time (the engineering method) may miss the optimum
* Fixed-size simplex algorithm may work better

Response surface

Optimization techniques

Response surface

Scan the whole range, e.g., using multiwell plates

[Figure: measurement points in the pH-PPD plane; the maximum is marked Max.]

Response surface

The engineering method

Measure at the indicated points.

[Figure: measurement points in the pH-PPD plane; the maximum is marked Max.]

Code the factor values to the range (0,1). Generate the initial simplex.

Response surface

The simplex method

Measure at the indicated points. If there are N factors (here N = 2), the simplex has N+1 points. Here the points are (0, 0), (1, 0) and (0.5, 0.87).

[Figure: the initial simplex on the unknown response surface; axes pH and PPD, both coded from 0 to 1.]

Response surface

The simplex method

Remove the worst point, w. Calculate the centroid, p, of the remaining points.

[Figure: the simplex with the worst point w and the centroid p; axes pH and PPD, both coded from 0 to 1.]

Response surface

The simplex method

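Generate a new point, r. Measure at the indicated point.

[Figure: the simplex with the worst point w, the centroid p and the new point r; axes pH and PPD, both coded from 0 to 1.]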

Response surface

The simplex method

Measure at the indicated points.

[Figure: the new simplex with its worst point w; axes pH and PPD, both coded from 0 to 1.]
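One move of the fixed-size simplex method can be sketched as follows. It is assumed here, as in the usual formulation, that the new point r is the reflection of the worst point w through the centroid p, r = p + (p - w), and that the response is to be maximized. The response function `demo_response` is a hypothetical stand-in for the real measurement.

```python
def simplex_step(simplex, response):
    """One fixed-size simplex move: reflect the worst vertex through the centroid."""
    values = [response(point) for point in simplex]
    worst_index = values.index(min(values))     # maximization: lowest response is worst
    w = simplex[worst_index]
    rest = [pt for i, pt in enumerate(simplex) if i != worst_index]
    # Centroid p of the remaining vertices.
    p = tuple(sum(coords) / len(rest) for coords in zip(*rest))
    # Reflected vertex r = p + (p - w).
    r = tuple(2.0 * pc - wc for pc, wc in zip(p, w))
    return rest + [r], w, p, r

def demo_response(point):
    """Hypothetical smooth response with its maximum at (0.8, 0.8)."""
    x, y = point
    return -((x - 0.8) ** 2 + (y - 0.8) ** 2)

start = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.87)]
new_simplex, w, p, r = simplex_step(start, demo_response)
print("worst:", w, "centroid:", p, "new point:", r)
```

In a real experiment each call to the response function corresponds to a measurement, and the step is repeated until the simplex circles around the optimum.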

* Control factors
  * Can be kept fixed once chosen
* Noise factors
  * Cannot be controlled
  * Create the variations in the quality of the product

Robust parameters

Factor categories

Usually, the quality of the product is improved by reducing the noise.

Unfortunately the noise factors are difficult (=expensive) to reduce.

Robust parameters

Typical procedure

Robust parameters

2^k design, a reminder

Run   Coded factor levels                              E(y)
      Main effects        Interaction effects
      T     PPD   pH      TxPPD   TxpH   PPDxpH
1     +1    -1    -1      -1      -1     +1            6.67
2     +1    +1    -1      +1      -1     -1            11.72
3     +1    +1    +1      +1      +1     +1            14.78
4     +1    -1    +1      -1      +1     -1            8.12
5     -1    -1    -1      +1      +1     +1            6.35
6     -1    +1    -1      -1      +1     -1            11.11
7     -1    +1    +1      -1      -1     +1            14.09
8     -1    -1    +1      +1      -1     -1            7.63

Main effects: All control and noise factors.

Interaction effects between control factors and noise factors may be quite large. If this is the case, then variations in the product quality may be reduced by adjusting the control factors so that the effect of noise is reduced.

Robust parameters

Control of noise

Consider the dependence of y (signal level) on x (voltage over detector = control factor).

Robust parameters

An example

[Figure: signal y versus detector voltage x, with the width of the noise indicated.]
