Chemometrics
Matti Hotokka
Physical chemistry
Åbo Akademi University
Signal y: to be modeled or optimized, e.g., yield, measuring time, figure of merit, deviation from a model, etc.
Category x: numbers the experiments
Signal
Analytical function
[Figure: signal y plotted against category x]
Signal processing, e.g., Fourier transforms, Hadamard transforms, etc.; see courses in spectroscopy.
Signal
Analytical function
Factor, or feature: pH, concentration, temperature, ...
A huge number of factors govern every measurement. The chemist must know which ones are important and must be tested.
The others are kept as constant as possible.
Factor
Every measurement is repeated from the start a number of times so that a mean, a standard error and a confidence limit can be determined.
Observation: the mean of a set of parallel measurements.
Blank: reference observation with default values of all the important factors, yB.
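As an illustration of how an observation is obtained from replications, the sketch below computes the mean, the standard error and a 95 % confidence limit. The four values are the replications of run 1 in the factorial example of this lecture; the Student t value 3.18 (3 degrees of freedom) is the tabulated value quoted there. The variable names are chosen for this sketch only.

```python
import math
import statistics

# Four replications of one run (run 1 of the factorial example in this lecture)
replicates = [6.60, 6.74, 6.81, 6.52]

n = len(replicates)
mean = statistics.mean(replicates)      # the observation
s = statistics.stdev(replicates)        # sample standard deviation, n - 1 degrees of freedom
std_error = s / math.sqrt(n)            # standard error of the mean

# Tabulated Student t for 95 % confidence and n - 1 = 3 degrees of freedom
T_95_3DF = 3.18
half_width = T_95_3DF * std_error       # half-width of the confidence interval

print(f"mean = {mean:.2f}, s = {s:.2f}, 95 % limits = {mean:.2f} +/- {half_width:.2f}")
```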
Observation
Repetition: Repeated reading of the meter.
Replication: New measurement from start.
Repetitions test your ability to read a digital meter. Replications test the experimental errors in the measuring procedure.
Observation
Replications vs. Repetitions
* Sensitivity
* Detection limit
* Precision and trueness
* Specificity and selectivity
Definitions
Calibration parameters
Sensitivity = slope
Definitions
Calibration curve
[Figure: calibration curve, signal y against concentration x, with slope Δy/Δx and intercept b0]
Intercept b0 can be ignored if the sample is measured against a blank (reference).
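A minimal sketch of a calibration fit, assuming a straight-line calibration curve y = b0 + b1 x fitted by ordinary least squares. The standards and the unknown signal are hypothetical numbers invented for the illustration; the slope b1 is the sensitivity.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical calibration standards: concentration and measured signal
conc = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [0.52, 2.49, 4.51, 6.50, 8.48]

b0, b1 = fit_line(conc, signal)
print(f"sensitivity (slope) = {b1:.3f}, intercept = {b0:.3f}")

# Inverse prediction: concentration of an unknown sample from its signal
unknown_signal = 5.0
print(f"estimated concentration = {(unknown_signal - b0) / b1:.2f}")
```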
Dynamic range: the valid range of x where the signal y depends functionally on x.
Analytical range: the interval of x where the signal y can be determined accurately.
Definitions
Analytical range
Definitions
[Figure: signal y against concentration x, with the dynamic range, the analytical range and the points DL and LoD marked]
Detection limit: lowest value of x where the signal can still be separated from the noise. The noise is measured as the variance of the blank.
Definitions
Detection limit
Limit of determination: lowest value of x (concentration) where y can be determined with a useful accuracy.
Definitions
Limit of determination
Error e in variable x (say, concentration)
Definitions
Bias
[Figure: distribution of measured x values around their mean x̄; the spread is the random error and the offset of x̄ from xtrue is the systematic error]
Precision = repeatability
Definitions
Precision and trueness
Trueness = deviation from true value
Selectivity: possibility to measure in the presence of interfering components.
Specificity: sensitivity for a given analyte.
Analytical resolution: N = x/Δx.
Definitions
Selectivity
All experiments must always be made in random order.
Random sampling
Why randomization
[Figure: response against concentration in two panels, "Systematic" and "Random" measurement order, with the measurements numbered 1 to 5 and the true slope and the drift indicated]
The random sequences are obtained from tables of random numbers. Nowadays the random number generator of a pocket calculator may be used.
Normally, a generator gives the same sequence every time it is started. This is OK. If you want a different sequence each time, use a random seed.
Random sampling
Random number lists
Assume that four different concentrations are to be tested. Name them A, B, C, D.
Make four parallel measurements for each: A1, A2, A3, A4, etc.
Random sampling
Always randomize
First run: Measure A1, B1, C1, D1.
Second run: Start from scratch and do A2, B2, etc.
Random sampling
Always randomize
A1, B1, C1, D1
A2, B2, C2, D2
A3, B3, C3, D3
A4, B4, C4, D4
Wrong! Systematic errors will not be found.
Randomize the order of concentrations in the runs.
Random sampling
Always randomize
A1, B1, C1, D1
C2, D2, A2, B2
D3, A3, B3, C3
B4, C4, D4, A4
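A randomized plan of this kind can be generated with any random number generator. The sketch below uses Python's standard `random` module; the function name `make_plan` and the seed value are assumptions made for the illustration. A fixed seed reproduces the same plan every time, while `seed=None` gives a fresh one.

```python
import random

def make_plan(samples, n_runs, seed=None):
    """Return one randomly ordered list of sample labels per run."""
    rng = random.Random(seed)       # seed=None: seeded from the operating system
    plan = []
    for run in range(1, n_runs + 1):
        order = list(samples)       # copy, so the input is not modified
        rng.shuffle(order)          # random order of the concentrations in this run
        plan.append([f"{name}{run}" for name in order])
    return plan

plan = make_plan(["A", "B", "C", "D"], n_runs=4, seed=2024)
for row in plan:
    print(", ".join(row))
```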
Use linear (or non-linear) regression to analyse the results. Specifically, plot the residuals to see whether some effects were not captured.
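The effect of the measurement order can be sketched with a simulated experiment. The signal model (true slope 2.0, linear instrument drift of 0.05 per measurement) and the concentrations 1 to 4 for A to D are hypothetical; the two orders are the systematic and the randomized plans of this lecture. With the systematic order the drift biases the fitted slope; with the randomized order the slope is recovered and the residuals, listed in time order, expose the drift.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

CONC = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}   # hypothetical concentrations
TRUE_SLOPE, DRIFT = 2.0, 0.05                      # hypothetical signal model

def slope_and_residuals(order):
    """Simulate signal = slope * conc + drift * time index, then fit a line."""
    x = [CONC[name] for name in order]
    y = [TRUE_SLOPE * xi + DRIFT * t for t, xi in enumerate(x)]
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b1, residuals

systematic = list("ABCD" * 4)                      # A1 B1 C1 D1, A2 B2 C2 D2, ...
randomized = list("ABCD" "CDAB" "DABC" "BCDA")     # the randomized plan

slope_sys, res_sys = slope_and_residuals(systematic)
slope_rand, res_rand = slope_and_residuals(randomized)
print(f"systematic order: slope = {slope_sys:.3f}")
print(f"randomized order: slope = {slope_rand:.3f}")
print("residuals in time order:", [round(r, 3) for r in res_rand])
```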
Random sampling
Always randomize
Random sampling
Analysis
[Figure: signal y against concentration (A, B, C, D) for the randomized plan, points labelled by run number 1 to 4]
Random sampling
Residues
[Figure: residuals against concentration (A, B, C, D) for the randomized plan, points labelled by run number 1 to 4; annotation: Drift!]
* Controlled factors
  * Varied systematically or kept constant
* Known factors that cannot be controlled
  * E.g., drift of instrument
* Unknown factors that can be anticipated
  * E.g., impurities of the chemicals
* Truly unknown effects
Random sampling
Types of factors
Some constant factors cannot be kept fixed but vary from batch to batch, day to day, ...
Make a series of measurements varying one factor and keeping the other conditions as constant as possible => A1, B1, C1, D1. This is a block. Then measure A2, B2, C2, D2 keeping the conditions constant, but not necessarily the same as in block 1 if this is not possible.
Random sampling
Blocking
In 1747, while serving as surgeon on HMS Salisbury, James Lind carried out a controlled experiment to develop a cure for scurvy.
Lind selected 12 men from the ship, all suffering from scurvy, and divided them into six pairs, giving each pair different additions to their basic diet for a period of two weeks. The treatments were all remedies that had been proposed at one time or another. They were:
* A quart of cider every day
* Twenty five gutts (drops) of elixir vitriol (sulphuric acid) three times a day upon an empty stomach,
* One half-pint of seawater every day
* A mixture of garlic, mustard, and horseradish in a lump the size of a nutmeg
* Two spoonfuls of vinegar three times a day
* Two oranges and one lemon every day.
The men who had been given citrus fruits recovered dramatically within a week. One of them returned to duty after 6 days and the other became nurse to the rest. The others experienced some improvement, but nothing was comparable to the citrus fruits, which were proved to be substantially superior to the other treatments.
In this study his subjects' cases "were as similar as I could have them", that is, he provided strict entry requirements to reduce extraneous variation. The men were paired, which provided blocking. From a modern perspective, the main thing that is missing is randomized allocation of subjects to treatments.
[http://en.wikipedia.org/wiki/Design_of_experiments]
Random sampling
Controlled experiment
In mathematics, a Latin square is an n x n table where n different symbols are placed so that each symbol occurs exactly once in every row and in every column.
Random sampling
Latin squares
2 1 3
1 3 2
3 2 1
A reduced (or normalized) Latin square has the symbols in the natural order in the first row and the first column.
1 2 3
2 3 1
3 1 2
Special case: Sudoku.
* 3x3 squares: 12 different (1 reduced)
* 4x4: 576 (4)
* 5x5: 161280 (56)
* 9x9: 5.5 x 10^27 (3.8 x 10^17)
* Tabulated for a few simple cases
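The counts for the small squares can be checked by brute force. The sketch below tests the Latin and reduced properties directly from the definitions and enumerates all candidate squares built from row permutations; this is only feasible for small n, and the function names are chosen for this illustration.

```python
from itertools import permutations, product

def is_latin(square):
    """Each symbol 1..n occurs exactly once in every row and every column."""
    n = len(square)
    symbols = set(range(1, n + 1))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok

def is_reduced(square):
    """First row and first column are in the natural order 1, 2, ..., n."""
    natural = list(range(1, len(square) + 1))
    return list(square[0]) == natural and [row[0] for row in square] == natural

def count_latin(n):
    """Brute-force count of (all, reduced) Latin squares of order n."""
    rows = list(permutations(range(1, n + 1)))
    total = reduced = 0
    for square in product(rows, repeat=n):
        if is_latin(square):
            total += 1
            reduced += is_reduced(square)
    return total, reduced

print(count_latin(3))   # (12, 1)
print(count_latin(4))   # (576, 4)
```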
Random sampling
Latin squares
Randomize the blocking experiment.
Random sampling
Latin square designs
       Sample
Run    1  2  3  4
1      A  B  D  C
2      D  C  A  B
3      B  D  C  A
4      C  A  B  D
Observe the good balance.
* Typically two-level experiments
  * A low level and a high level for each factor.
* Typically for screening
  * Study which of the presumed factors really show a significant effect.
Factorial designs
What?
Each factor is tested at a low and a high level. Designate the levels symbolically -1 and +1.
Factorial designs
Two levels
Rate of p-phenylenediamine (PPD) oxidation at a constant enzyme level of 13.6 mg L^-1 is studied using spectrophotometry:
Factor       Level -1   Level +1
T, °C        35         40
pH           4.8        6.4
[PPD], mM    0.5        27.3
Let there be k factors. Each has one of two values. There will be 2^k possible combinations.
Factorial designs
2^k design
Coded factor levels
Run    T    PPD   pH
1     +1    -1    -1
2     +1    +1    -1
3     +1    +1    +1
4     +1    -1    +1
5     -1    -1    -1
6     -1    +1    -1
7     -1    +1    +1
8     -1    -1    +1
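The 2^k combinations can be enumerated mechanically. The sketch below uses `itertools.product` and maps the coded levels back to the real factor values of the PPD example; the function name `full_factorial` is an assumption for this illustration, and the run numbering it produces does not follow the numbering used in the lecture.

```python
from itertools import product

def full_factorial(factor_names):
    """All 2^k combinations of the coded levels +1 and -1."""
    return [dict(zip(factor_names, levels))
            for levels in product((+1, -1), repeat=len(factor_names))]

# Real factor values behind the coded levels (PPD example of this lecture)
levels = {
    "T":   {-1: 35,  +1: 40},     # degrees Celsius
    "PPD": {-1: 0.5, +1: 27.3},   # mM
    "pH":  {-1: 4.8, +1: 6.4},
}

design = full_factorial(list(levels))
for run, coded in enumerate(design, start=1):
    real = {name: levels[name][code] for name, code in coded.items()}
    print(run, coded, real)
```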
[Figure: cube with axes x1, x2, x3 and corners at -1 and +1; each corner is a design point]
Factorial design
Experiment plan
Run   T   PPD   pH    y1      y2      y3      y4      E(y)    s
1     +   -     -     6.60    6.74    6.81    6.52    6.67    0.13
2     +   +     -     11.56   11.86   11.80   11.66   11.72   0.14
3     +   +     +     14.71   14.56   14.95   14.88   14.78   0.18
4     +   -     +     8.16    7.93    8.27    8.12    8.12    0.14
5     -   -     -     6.31    6.45    6.42    6.22    6.35    0.10
6     -   +     -     11.24   11.14   11.01   11.04   11.11   0.10
7     -   +     +     14.12   13.88   14.26   14.08   14.09   0.16
8     -   -     +     7.80    7.40    7.62    7.71    7.63    0.17
No, not quite like this ...
Factorial design
Experimental plan
Run   T   PPD   pH    y1      y2      y3      y4      E(y)    s
4     +   -     +     8.16    7.93    8.27    8.12    8.12    0.14
7     -   +     +     14.12   13.88   14.26   14.08   14.09   0.16
1     +   -     -     6.60    6.74    6.81    6.52    6.67    0.13
3     +   +     +     14.71   14.56   14.95   14.88   14.78   0.18
6     -   +     -     11.24   11.14   11.01   11.04   11.11   0.10
5     -   -     -     6.31    6.45    6.42    6.22    6.35    0.10
8     -   -     +     7.80    7.40    7.62    7.71    7.63    0.17
2     +   +     -     11.56   11.86   11.80   11.66   11.72   0.14
... but like this, randomized.
N.B. Also randomize the replications for each run!
Factorial design
2^k design
       Main effects       Interaction effects
Run    T    PPD   pH     TxPPD   TxpH   PPDxpH   E(y)
1     +1    -1    -1     -1      -1     +1       6.67
2     +1    +1    -1     +1      -1     -1       11.72
3     +1    +1    +1     +1      +1     +1       14.78
4     +1    -1    +1     -1      +1     -1       8.12
5     -1    -1    -1     +1      +1     +1       6.35
6     -1    +1    -1     -1      +1     -1       11.11
7     -1    +1    +1     -1      -1     +1       14.09
8     -1    -1    +1     +1      -1     -1       7.63
Compute the differences high level - low level:
D_T = (y1 + y2 + y3 + y4)/4 - (y5 + y6 + y7 + y8)/4
Experimental accuracy?
Four parallel determinations => s = 0.14. D.f. = 3.
Factorial design
2^k design
D_T = 0.53
D_PPD = 5.73
D_pH = 2.19
D_TxPPD = 0.123
D_TxpH = 0.062
D_PPDxpH = 0.828
Statistically significant effects at 95 % confidence: |D| > t · s, where t is Student's t.
s = 0.18 (largest), 3 degrees of freedom => t = 3.18
|D| > 3.18 · 0.18 = 0.57.
D_PPD, D_pH and D_PPDxpH are significant.
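The effects and the significance test can be reproduced from the mean responses E(y). In the sketch below each effect is the mean response at the +1 level of a column minus the mean response at its -1 level; the interaction columns are the products of the main-effect columns. The data and the values t = 3.18 and s = 0.18 are taken from the lecture; the function and variable names are chosen for this illustration.

```python
from itertools import combinations
from math import prod

factors = ("T", "PPD", "pH")

# (coded levels of T, PPD, pH) and mean response E(y) for runs 1 to 8
runs = [
    ((+1, -1, -1), 6.67), ((+1, +1, -1), 11.72),
    ((+1, +1, +1), 14.78), ((+1, -1, +1), 8.12),
    ((-1, -1, -1), 6.35), ((-1, +1, -1), 11.11),
    ((-1, +1, +1), 14.09), ((-1, -1, +1), 7.63),
]

def effect(columns):
    """Mean response where the product of the chosen columns is +1
    minus the mean response where it is -1."""
    plus = [y for lv, y in runs if prod(lv[i] for i in columns) > 0]
    minus = [y for lv, y in runs if prod(lv[i] for i in columns) < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

effects = {name: effect((i,)) for i, name in enumerate(factors)}
for i, j in combinations(range(len(factors)), 2):
    effects[f"{factors[i]}x{factors[j]}"] = effect((i, j))

T_95_3DF = 3.18                 # tabulated Student t, 95 %, 3 degrees of freedom
S_LARGEST = 0.18                # largest standard deviation among the runs
threshold = T_95_3DF * S_LARGEST

for name, d in effects.items():
    verdict = "significant" if abs(d) > threshold else "not significant"
    print(f"D_{name} = {d:.3f}  ({verdict})")
```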
A graphical inspection of the effects shows the effects qualitatively
Factorial design
2^k design
[Figure: three panels of response (0 to 15) against coded level (-1, +1), one each for the PPD level, the temperature level and the pH level]
ANOVA is very practical if there are two factors. Multi-way ANOVA is possible but not so illustrative as an example.
MANOVA is not discussed here.
Factorial design
2^k design
Cumulative distribution function (assume normal distribution). Most statistical quantities are normally distributed.
Factorial design
2^k design
[Figure: probability distribution P(x), the corresponding cumulative distribution y(x) rising from 0 to 1, and an approximation to it]
Quantiles: Select the desired value of q > 1. The q-quantiles are the q-1 cut points, numbered by the index k, 0 < k < q, that divide the total range of a random variable x into q sections of equal probability. The kth q-quantile is the x value where the cumulative probability of the random variable reaches k/q.
The most common cases are
The 2-quantile: median
The 4-quantiles: quartiles
The 10-quantiles: deciles
Factorial design
2^k design
[Figure: cumulative distribution y(x) from 0 to 1, with the median marked at y = 1/2 and the quartile Q at y = 1/4]
Let y = Φ(x) be the cumulative probability function. Then the x corresponding to a given probability is x = Φ^-1(y). The points x are not equidistant but the cumulative probabilities y are equidistant.
The x values of the limits for the kth q-quantile are shifted here so they start from zero. They are obtained from the formula
$y_k = 0.5 + 0.5\,(k - 0.5)/q, \quad x_k = \Phi^{-1}(y_k), \quad k = 1, \dots, q$
Factorial design
2^k design
Half-normal q-plot: Compare the distribution of your data points with the normal distribution. On the x axis choose q quantile points at positions obtained from the theoretical normal distribution. Plot your q data points in ascending order against the x values. If the points lie on a straight line, the data points are normally distributed.
Factorial design
2^k design
Six differences, D_T, D_PPD, D_pH, D_TxPPD, D_TxpH and D_PPDxpH, therefore q = 6. The normally distributed x values are
k    y = 0.5 + 0.5*(k-0.5)/6    x
1    0.542                      0.106
2    0.625                      0.319
3    0.708                      0.548
4    0.792                      0.813
5    0.875                      1.150
6    0.958                      1.728
Sort the D values in ascending order and associate them with the theoretical quantile points.
Factorial design
2^k design
x        y        Effect
0.106    0.062    TxpH
0.319    0.123    TxPPD
0.548    0.53     T
0.813    0.828    PPDxpH
1.150    2.19     pH
1.728    5.73     PPD
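The pairing of effects with theoretical quantile points can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` method is the inverse cumulative normal distribution. The effect values are those of the lecture; the variable names are chosen for this illustration. The x values computed from unrounded y may differ from the tabulated ones in the last digit.

```python
from statistics import NormalDist

# Effects (differences high - low) from the 2^k example
effects = {"T": 0.53, "PPD": 5.73, "pH": 2.19,
           "TxPPD": 0.123, "TxpH": 0.062, "PPDxpH": 0.828}

q = len(effects)
std_normal = NormalDist()       # mean 0, standard deviation 1

# Sort by absolute size and attach the k-th half-normal quantile point
ordered = sorted(effects.items(), key=lambda item: abs(item[1]))
points = []
for k, (name, d) in enumerate(ordered, start=1):
    y = 0.5 + 0.5 * (k - 0.5) / q       # cumulative probability
    x = std_normal.inv_cdf(y)           # theoretical quantile
    points.append((x, abs(d), name))
    print(f"{k}  y = {y:.3f}  x = {x:.3f}  |D| = {abs(d):<6} {name}")
```

Plotting `|D|` against `x` gives the half-normal plot; effects far above the line through the small effects stand out.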
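[Figure: half-normal plot of the six effects against the theoretical quantile points, with the points labelled T, PPD, pH, TxPPD, TxpH and PPDxpH]
The effects of PPD and pH are stronger than those of normally distributed variables.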
The combinations of high and low values are arrays.
Factorial design
Orthogonal arrays
Coded factor levels
Run    T    PPD   pH
1     +1    -1    -1
2     +1    +1    -1
3     +1    +1    +1
4     +1    -1    +1
5     -1    -1    -1
6     -1    +1    -1
7     -1    +1    +1
8     -1    -1    +1
[Figure: cube with axes x1, x2, x3 running from 0 to +1; each corner is an array]
Vectors are orthogonal if the scalar product is zero. With the levels coded 0 and +1, arrays 1 and 6: 1*0 + 0*1 + 0*0 = 0 (orthogonal); arrays 1 and 2: 1*1 + 0*1 + 0*0 = 1 (not orthogonal).
The 2^k design gives 8 combinations for k = 3. This can be handled. However, k = 8 gives 2^8 = 256 combinations. Too many degrees of freedom!
Choose half of the combinations, 2^(k-1). However, you cannot choose just any set of combinations. The arrays must be orthogonal.
In the 2^(3-1) case four vectors must be chosen from the total of 8. They are
Factorial design
Orthogonal vectors
Coded factor levels
Run    T    PPD   pH
1     +1    -1    -1
5     -1    -1    -1
6     -1    +1    -1
8     -1    -1    +1
or actually
Coded factor levels
Run    T    PPD   pH
1     +1     0     0
5      0     0     0
6      0    +1     0
8      0     0    +1
What you lose when using orthogonal arrays is (some of) the interaction effects.
There are many ways of choosing orthogonal arrays. Plackett and Burman, and Hall, and Taguchi, have published large selections based on Hadamard matrices.
Note that the orthogonal arrays should be well balanced in order to make the analysis meaningful. The arrays shown previously are orthogonal but not balanced. Consider, e.g., the factor T. There is one line with a high value and three lines with a low value, which makes the table unbalanced. The Taguchi tables and others correct this problem.
Factorial design
2^(k-1) design
Full set of experiments, high and low values
Factorial design
Taguchi table L4 (2^3)
1 1 1
1 1 2
1 2 1
2 1 1
1 2 2
2 1 2
2 2 1
2 2 2
Eight experiments
Taguchi design
Factorial design
Taguchi table L4 (2^3)
1 1 1
1 2 2
2 1 2
2 2 1
Four experiments
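Balance and orthogonality of a reduced design can be checked numerically. The sketch below is one possible check and makes an assumption that differs from the lecture's row-vector example: the levels are coded -1 and +1 and orthogonality is tested between the factor columns. Under this check the earlier selection of runs 1, 5, 6, 8 is orthogonal but unbalanced, while the Taguchi L4 table is both.

```python
from itertools import combinations

def columns_orthogonal(design):
    """True if the scalar product of every pair of factor columns is zero
    (levels coded -1 and +1)."""
    cols = list(zip(*design))
    return all(sum(a * b for a, b in zip(c1, c2)) == 0
               for c1, c2 in combinations(cols, 2))

def is_balanced(design):
    """True if every factor column has as many +1 as -1 entries."""
    return all(sum(col) == 0 for col in zip(*design))

# Runs 1, 5, 6 and 8 of the full 2^3 design (T, PPD, pH)
runs_1_5_6_8 = [(+1, -1, -1), (-1, -1, -1), (-1, +1, -1), (-1, -1, +1)]

# Taguchi L4 with level 1 coded as -1 and level 2 coded as +1
taguchi_l4 = [(-1, -1, -1), (-1, +1, +1), (+1, -1, +1), (+1, +1, -1)]

for name, design in (("runs 1, 5, 6, 8", runs_1_5_6_8), ("Taguchi L4", taguchi_l4)):
    print(name, "orthogonal:", columns_orthogonal(design),
          "balanced:", is_balanced(design))
```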
Designs of size 2^(k-p), p > 1, have also been proposed.
Factorial design
More reduction
Factorial design
Orthogonal vectors, analysis
Coded factor levels
Run   T    PPD   pH    y1      y2      y3      y4      E(y)    s
5     0    0     0     6.31    6.45    6.42    6.22    6.35    0.10
7     0    +1    +1    14.12   13.88   14.26   14.08   14.09   0.16
4     +1   0     +1    8.16    7.93    8.27    8.12    8.12    0.14
2     +1   +1    0     11.56   11.86   11.80   11.66   11.72   0.14
Sums over similar factors and over the total data table.
Here n is the number of replications in each run, here n = 4, and k is the number of values a factor can have, here k = 2 (low, high).
Factorial design
Orthogonal vectors, analysis
The sum of squares for each factor is calculated from a formula where N is the number of replications, here N = 4. For temperature
For the remaining factors (observe that for products, high and high give low):
Factorial design
Orthogonal vectors, analysis
The total sum of squares for all replications is
The temperature factor explains 0.2 % of the total SSQ.
The PPD concentration explains 88 % of the variation and pH 12 %. Also the combination PPDxpH has a significant impact.
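The quoted percentages can be reproduced from the four-run table. The sketch below assumes the usual between-level sum of squares, SS = sum over levels of (number of values at the level) x (level mean - grand mean)^2, divided by the total sum of squares about the grand mean; this decomposition is an assumption of the sketch, not a formula taken from the lecture. The data are the replications of runs 5, 7, 4 and 2.

```python
# Run number: (coded levels of T, PPD, pH), four replications
runs = {
    5: ((0, 0, 0), [6.31, 6.45, 6.42, 6.22]),
    7: ((0, 1, 1), [14.12, 13.88, 14.26, 14.08]),
    4: ((1, 0, 1), [8.16, 7.93, 8.27, 8.12]),
    2: ((1, 1, 0), [11.56, 11.86, 11.80, 11.66]),
}
factors = ("T", "PPD", "pH")

all_y = [y for _, values in runs.values() for y in values]
grand_mean = sum(all_y) / len(all_y)
ss_total = sum((y - grand_mean) ** 2 for y in all_y)   # total sum of squares

def ss_factor(index):
    """Between-level sum of squares for the factor in column `index`."""
    ss = 0.0
    for level in (0, 1):
        ys = [y for lv, values in runs.values() if lv[index] == level
              for y in values]
        ss += len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
    return ss

share = {name: 100 * ss_factor(i) / ss_total for i, name in enumerate(factors)}
for name, pct in share.items():
    print(f"{name}: {pct:.1f} % of the total SSQ")
```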
* Only certain combinations of factors, run levels and screening values are available
  * L4 (used here) contains two screening levels (low and high), four run levels and three factors
  * L8 has two screening levels, 8 experiments and 7 factors
  * L9 has three screening levels (low, mid, high), 9 experiments and 4 factors
* For available designs, see http://www.york.ac.uk/depts/maths/tables/orthogonal.htm
Taguchi tables
Available combinations
Factorial design
Example of a three-level design
[Figure: response against factor x1 at the three coded levels -1, 0 and +1]
Factorial design
Central composite design
Run           x1    x2    x3    Response
1             -1    -1    -1    y1
2             +1    -1    -1    y2
3             +1    +1    -1    y3
4             -1    +1    -1    y4
5             -1    -1    +1    y5
6             +1    -1    +1    y6
7             +1    +1    +1    y7
8             -1    +1    +1    y8
9             -a     0     0    y9
10            +a     0     0    y10
11             0    -a     0    y11
12             0    +a     0    y12
13             0     0    -a    y13
14             0     0    +a    y14
15, 16, 17     0     0     0    y15, y16, y17
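The points of a central composite design can be generated from three groups: the 2^k cube corners, the 2k axial points at distance a from the centre, and a number of centre points. In the sketch below the function name and the value of a are assumptions; a = (2^k)^(1/4) is one common choice (it makes the design rotatable), and the ordering of the points differs from the table.

```python
from itertools import product

def central_composite(k, a, n_center):
    """Coded points of a central composite design for k factors:
    2^k cube corners, 2k axial points at distance a, n_center centre points."""
    cube = [tuple(point) for point in product((-1.0, 1.0), repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * a
            axial.append(tuple(point))
    center = [(0.0,) * k] * n_center
    return cube + axial + center

k = 3
a = (2 ** k) ** 0.25        # assumed axial distance (rotatable design), about 1.68
design = central_composite(k, a, n_center=3)

for run, point in enumerate(design, start=1):
    print(run, " ".join(f"{value:+.2f}" for value in point))
```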
Factorial design
Box-Behnken design
Factorial design
Lattice design
A higher order polynomial (typically second order) is fitted to the observed data.
Factorial design
Response surface method
Intermediate points need to be measured in order to obtain the non-linear coefficients.
These are called three-level designs.
Multivariate polynomial regression is a common analysis method.
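For a single factor measured at the three coded levels -1, 0 and +1, the second-order polynomial y = b0 + b1 x + b2 x^2 passes exactly through the three points, so the coefficients follow directly. The sketch below shows this one-factor case only; the three response values are hypothetical, and with several factors a least-squares regression would be used instead.

```python
def quadratic_from_three_levels(y_low, y_mid, y_high):
    """Coefficients of y = b0 + b1*x + b2*x**2 through the points
    (-1, y_low), (0, y_mid), (+1, y_high)."""
    b0 = y_mid
    b1 = (y_high - y_low) / 2
    b2 = (y_high + y_low) / 2 - y_mid
    return b0, b1, b2

def stationary_point(b1, b2):
    """Coded factor value where dy/dx = 0 (a maximum if b2 < 0)."""
    return -b1 / (2 * b2)

# Hypothetical responses at the low, middle and high level
b0, b1, b2 = quadratic_from_three_levels(6.0, 9.0, 8.0)
x_opt = stationary_point(b1, b2)
y_opt = b0 + b1 * x_opt + b2 * x_opt ** 2

print(f"b0 = {b0}, b1 = {b1}, b2 = {b2}")
print(f"optimum at coded x = {x_opt}, predicted response = {y_opt}")
```

A two-level design cannot supply b2, which is why the intermediate level is needed.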
* Biggest is best
  * Find a set of factor values that gives maximal response (e.g., yield)
* Smallest is best
  * Find minimum
* Nominal is best
  * Minimize the difference (measured - nominal)
Response surface
Optimization tasks
Response surface
The response
[Figure: the response as a surface over the factors pH and PPD]
* Any optimization strategy can be used
* The brute force method is often expensive
* Single factor at a time (the engineering method) may miss the optimum
* Fixed-size simplex algorithm may work better
Response surface
Optimization techniques
Response surface
Scan the whole range, e.g., using multiwell plates
[Figure: grid of measurement points covering the whole pH and PPD range, with the maximum marked]
Response surface
The engineering method
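[Figure: pH and PPD plane with the maximum marked and the measurement points of the engineering method indicated]
Measure at the indicated points.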
Code the factor values to the range (0,1). Generate the initial simplex.
Response surface
The simplex method
[Figure: initial simplex on the unknown response surface, pH and PPD coded from 0 to 1]
Measure at the indicated points.
If there are N factors (here N = 2) the simplex has N+1 points. Here the points are (0, 0), (1, 0) and (0.5, 0.87).
Response surface
The simplex method
[Figure: simplex in the pH and PPD plane with the worst point w and the centroid p marked]
Remove the worst point, w. Calculate the centroid, p, of the remaining points.
Response surface
The simplex method
[Figure: simplex in the pH and PPD plane with the worst point w, the centroid p and the new point r marked]
Generate a new point, r.
Measure at the indicated point.
Response surface
The simplex method
[Figure: the new simplex in the pH and PPD plane with its worst point w marked]
Measure at the indicated points.
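The steps above can be sketched as a fixed-size simplex search. Several things in the sketch are assumptions: the new point is taken as the reflection r = p + (p - w) of the worst point through the centroid, the second-worst point is reflected when the newest point is again the worst (to avoid bouncing back), the response surface is a hypothetical function with its maximum at (0.6, 0.4), and a smaller starting simplex than the one in the slides is used so that the search can close in on the maximum.

```python
import math

def response(x, y):
    """Hypothetical response surface with a single maximum at (0.6, 0.4)."""
    return math.exp(-((x - 0.6) ** 2 + (y - 0.4) ** 2) / 0.1)

def reflect(worst, others):
    """Reflect the worst point w through the centroid p of the other points."""
    dim = len(worst)
    p = [sum(v[i] for v in others) / len(others) for i in range(dim)]
    return tuple(2 * p[i] - worst[i] for i in range(dim))

def fixed_size_simplex(f, simplex, steps):
    """Maximize f by repeated reflection; returns the best point seen."""
    pts = [tuple(v) for v in simplex]
    best = max(pts, key=lambda v: f(*v))
    last_new = None
    for _ in range(steps):
        ranked = sorted(pts, key=lambda v: f(*v))    # worst point first
        w = ranked[0]
        if w == last_new:          # newest point is worst: reflect the next worst
            w = ranked[1]
        others = [v for v in pts if v != w]
        r = reflect(w, others)
        pts = others + [r]
        last_new = r
        if f(*r) > f(*best):
            best = r
    return best

# One reflection step with the simplex of the slides, worst point (0, 0)
print(reflect((0.0, 0.0), [(1.0, 0.0), (0.5, 0.87)]))

# Search with a smaller equilateral starting simplex (edge length 0.1)
start = [(0.0, 0.0), (0.1, 0.0), (0.05, 0.1 * math.sqrt(3) / 2)]
best = fixed_size_simplex(response, start, steps=30)
print("best point:", best, "response:", response(*best))
```

Because the simplex never shrinks, the search ends up circling the maximum at a distance set by the edge length; a smaller simplex gives a finer answer at the cost of more measurements.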
* Control factors
  * Can be kept fixed once chosen
* Noise factors
  * Cannot be controlled
  * Create the variations in the quality of the product
Robust parameters
Factor categories
Usually, the quality of the product is improved by reducing the noise.
Unfortunately the noise factors are difficult (=expensive) to reduce.
Robust parameters
Typical procedure
Robust parameters
2^k design, a reminder
       Main effects       Interaction effects
Run    T    PPD   pH     TxPPD   TxpH   PPDxpH   E(y)
1     +1    -1    -1     -1      -1     +1       6.67
2     +1    +1    -1     +1      -1     -1       11.72
3     +1    +1    +1     +1      +1     +1       14.78
4     +1    -1    +1     -1      +1     -1       8.12
5     -1    -1    -1     +1      +1     +1       6.35
6     -1    +1    -1     -1      +1     -1       11.11
7     -1    +1    +1     -1      -1     +1       14.09
8     -1    -1    +1     +1      -1     -1       7.63
Main effects: All control and noise factors.
Interaction effects between control factors and noise factors may be quite large. If this is the case, then variations in the product quality may be reduced by adjusting the control factors so that the effect of noise is reduced.
Robust parameters
Control of noise
Consider the dependence of y (signal level) on x (voltage over detector = control factor).
Robust parameters
An example
[Figure: signal level y against detector voltage x, with the width of the noise indicated]