TRANSCRIPT
Chemometrics
Matti Hotokka, Physical Chemistry
Åbo Akademi University
Consider a case where two quantities (or features), pH and the Ca concentration, are measured. Assume that three different samples are analysed.
Data analysis
Definitions
[Figure: the observations matrix; rows are objects (samples), columns are the features Ca and pH.]
In general
- Purpose: to classify the objects.
- Method: plot the points so that they are spread as much as possible.
  - For this one needs to find the line of largest spread.
  - Then just rotate so that this line becomes a coordinate axis.
  - At the same time you see which original features contribute most to the spread.
Data analysis
Principle
[Figure: the three points in the (x = Ca, y = pH) plane; the spread along each coordinate axis is s = 1, while the spread along the diagonal line through the points is s = √2.]
Preprocessing the data is perhaps the most crucial step in data analysis.
Many steps may be needed. It is important to keep track of the transformations so that all steps can be applied in reverse order on the final result in order to obtain numbers comparable to the original data.
Preprocessing
A constant offset may be removed from each data point. Each feature may have its own offset. Usually the offset is the mean of the feature (column), meaning that the feature is centered around the origin.
Preprocessing
Centering
Thus, for a given feature (column), i.e., fixed index k,

x′_ik = x_ik − x̄_k,  where x̄_k = (1/n) Σᵢ x_ik.
Preprocessing
Centering
[Figure: the centered data in the (x = Ca, y = pH) plane; the points now straddle the origin.]
The features represent different aspects of the same samples and may have numerically quite varying orders of magnitude. One may scale each feature to make their influence on the system more easily comparable.
Preprocessing
Scaling
Range scaling: the feature is mapped onto a fixed interval, e.g., x′_ik = (x_ik − min_k) / (max_k − min_k).

Autoscaling: the centered feature is divided by its standard deviation s_k. In autoscaling, the feature vectors are thereby scaled to length √(n−1).
Preprocessing
Scaling
[Figure: the autoscaled data in the (x = Ca, y = pH) plane.]
The feature vectors may be normalized to length 1.
Preprocessing
Normalisation
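As an illustration, a minimal numpy sketch of these preprocessing steps (centering, autoscaling, normalisation to length 1); the toy data and variable names are my own:

```python
import numpy as np

# toy data: rows = objects (samples), columns = features (Ca, pH)
X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])

X_centered = X - X.mean(axis=0)                  # remove each feature's mean (offset)

# autoscaling: divide each centered feature by its standard deviation
# (1/(n-1) convention), which leaves each feature vector with length sqrt(n-1)
X_auto = X_centered / X_centered.std(axis=0, ddof=1)

# normalisation: scale each feature vector to length 1
X_unit = X_centered / np.linalg.norm(X_centered, axis=0)
```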
Statistical analysis is possible even if data points are missing. The missing points should be replaced with a suitable mean value or, in the worst case, with a random number.
Preprocessing
Missing data
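A minimal sketch of mean-value imputation for the missing points (the data values are made up):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [3.0, np.nan],     # a missing data point
              [4.0, 3.0]])

col_means = np.nanmean(X, axis=0)     # feature means, ignoring the gaps
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]       # replace each gap with its feature's mean
```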
Features should be removed if they are strongly correlated with other features, do not have any influence on the data, or are constant.
Preprocessing
Redundant features
The variances (diagonal elements) and covariances (off-diagonal elements) are

s_jj = (1/(n−1)) Σᵢ (x_ij − x̄_j)²,  for j = 1, …, p

s_jk = (1/(n−1)) Σᵢ (x_ij − x̄_j)(x_ik − x̄_k),  for j, k = 1, …, p, j ≠ k

Notation: there are p features and n objects.

Variances
Variance-covariance matrix
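A numpy sketch of the same matrix with the 1/(n−1) convention above (the toy data and names are mine):

```python
import numpy as np

X = np.array([[2.0, 1.0],      # rows = n objects, columns = p features
              [3.0, 2.0],
              [4.0, 3.0]])

n = X.shape[0]
Xc = X - X.mean(axis=0)        # center each feature
S = Xc.T @ Xc / (n - 1)        # p x p variance-covariance matrix

# the library routine gives the same result
assert np.allclose(S, np.cov(X, rowvar=False))
```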
- What determines it: the largest spread means the largest variance.
- How to find it:
  - The eigenvalues of the variance-covariance matrix give all spreads in independent directions.
  - The eigenvectors tell in which directions the lines of largest spread lie.
  - The eigenvectors are the new rotated coordinate axes.
Variances
Largest spread
Definition:
- Assume that A is a square matrix.
- If one can find a number a and a vector X that fulfil the equation A X = a X, then the number a is an eigenvalue and the vector X is an eigenvector.

How to find them:
- Diagonalize the matrix; in this way you will obtain all eigenvalues and eigenvectors.
- It is always possible to diagonalize, but this is usually unnecessary work because you only need the two, at most three, directions where the spreads are the largest.

Variances
Eigenvalues and eigenvectors
- An n×n matrix has n eigenvalues; a 2×2 matrix has two eigenvalues.
Variances
Eigenvalues and eigenvectors
For the model system the variance-covariance matrix is (1 1; 1 1). The eigenvalues are 2 and 0 and the corresponding eigenvectors are (1, 1)ᵀ and (1, −1)ᵀ. The spread along the first line is 2 and the spread along the second axis is zero, no spread at all, so the points lie exactly on one line.
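This is easy to verify numerically; a minimal sketch with numpy's symmetric eigensolver:

```python
import numpy as np

S = np.array([[1.0, 1.0],       # variance-covariance matrix of the model system
              [1.0, 1.0]])

vals, vecs = np.linalg.eigh(S)  # eigh: for symmetric matrices, eigenvalues ascending
print(vals)                     # [0. 2.]
print(vecs)                     # columns are (1,-1)/sqrt(2) and (1,1)/sqrt(2), up to sign
```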
The elements of the correlation matrix are

r_jk = s_jk / √(s_jj s_kk) = s_jk / (s_j s_k).
Variances
Correlation matrix
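A numpy sketch of this normalisation (np.corrcoef computes the same matrix):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])

S = np.cov(X, rowvar=False)      # variance-covariance matrix
d = np.sqrt(np.diag(S))          # standard deviations s_j
R = S / np.outer(d, d)           # r_jk = s_jk / (s_j s_k)

assert np.allclose(R, np.corrcoef(X, rowvar=False))
```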
A reminder of the model system to be used as an example.
PCA
Principal component analysis
[Figure: the observations matrix again; rows are objects, columns are the features Ca and pH.]
PCA finds a direction along which the points lie.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane with the principal component direction (1, 1)ᵀ drawn through them.]
PCA just classifies your observations. It does not perform any regression.
PCA
What is it?
[Figure: the points grouped into categories labelled Low, Medium and High along the principal component.]
PCA classifies the points in categories with similar properties.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane, classified along the principal component (1, 1)ᵀ.]
When a two-dimensional coordinate system is rotated there must be two new rotated coordinate axes that are orthogonal. If there are n features there will be n new rotated coordinate axes.
Sure but ...
You are not interested in more than a few directions with the widest spread.
The first new coordinate axis is chosen so that it gives the widest spread of points. The second new axis gives the widest spread in any orthogonal direction. The last ones give essentially no spread and are therefore uninteresting.
Usually the first two, or maybe three, axes suffice to group the observations in clusters.
PCA
The next principal component
The second principal component.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane with both principal components drawn: (1, 1)ᵀ and (1, −1)ᵀ.]
Hair samples from a crime site were analyzed. The following elemental compositions of the hairs of the suspects were detected.
PCA
A real case
Hair   Cu    Mn    Cl    Br    I
1      9.2   0.30  1730  12.0  3.6
2     12.4   0.39   930  50.0  2.3
3      7.2   0.32  2750  65.3  3.4
4     10.2   0.36  1500   3.4  5.3
5     10.1   0.50  1040  39.2  1.9
6      6.5   0.20  2490  90.0  4.6
7      5.6   0.29  2940  88.0  5.6
8     11.8   0.42   867  43.1  1.5
9      8.5   0.25  1620   5.2  6.2
At first glance the samples look quite random.
Consider the two most important principal components.
PCA
A real case
[Figure: scores plot of the nine hair samples in the (PC1, PC2) plane; the samples fall into clusters.]
The details of the computation will follow.
The direction of largest spread needs to be found. The spread along the coordinate axes is given by the variance-covariance matrix. Its eigenvalues give the characteristic spreads. The corresponding eigenvectors give the directions.
The eigenvectors of a symmetric matrix are automatically orthogonal.
So, diagonalize the σ matrix.
PCA
How is it done?
Only, you don’t.
Diagonalization gives ALL eigenvalues and eigenvectors. For systems with many features and many objects this could mean a lot of work. If you compare, e.g., vibrational spectra in the range 400 to 3600 cm⁻¹ at a resolution of 4 cm⁻¹ you will have 800 features. This means that the variance-covariance matrix is 800×800 and most of the 800 principal components are completely irrelevant. You just need the first two.
The spread of the first component is largest, that of the second smaller, etc. Two or three components usually explain all the spread down to experimental errors.
PCA
Eigenvalues
[Figure: scree plot of eigenvalue (spread) versus component number; the later components do not differentiate the observations.]
Consider the hair example. In order to calculate all eigenvalues, the variance-covariance matrix was formed. As the numerical scales of the features are widely different, a comparison of them is not relevant. Therefore the correlation matrix was derived from the covariance matrix and diagonalised. The eigenvalues are:
PCA
Explained variances
Component  Eigenvalue λ  Explained variance %  Cumulative variance %
1          3.375         67.5                   67.5
2          1.178         23.6                   91.1
3          0.294          5.9                   96.9
4          0.127          2.5                   99.5
5          0.028          0.5                  100.0
The first two principal components explain over 90 % of the spread. The rest is merely noise.
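Under the assumption that the table comes from diagonalising the correlation matrix of the hair data above, a numpy sketch that should reproduce it up to rounding:

```python
import numpy as np

# hair data from the table above: columns Cu, Mn, Cl, Br, I
X = np.array([
    [ 9.2, 0.30, 1730, 12.0, 3.6],
    [12.4, 0.39,  930, 50.0, 2.3],
    [ 7.2, 0.32, 2750, 65.3, 3.4],
    [10.2, 0.36, 1500,  3.4, 5.3],
    [10.1, 0.50, 1040, 39.2, 1.9],
    [ 6.5, 0.20, 2490, 90.0, 4.6],
    [ 5.6, 0.29, 2940, 88.0, 5.6],
    [11.8, 0.42,  867, 43.1, 1.5],
    [ 8.5, 0.25, 1620,  5.2, 6.2],
])

R = np.corrcoef(X, rowvar=False)           # correlation matrix of the 5 features
vals = np.linalg.eigvalsh(R)[::-1]         # eigenvalues, largest first
explained = 100 * vals / vals.sum()        # explained variance in %
print(np.round(vals, 3))                   # expected: 3.375, 1.178, 0.294, ...
print(np.round(np.cumsum(explained), 1))   # cumulative %, ending at 100.0
```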
The original observations matrix X shows where the points lie in the (Ca, pH) coordinate system.
PCA
How is it done?
[Figure: the observations in the (x = Ca, y = pH) coordinate system.]
Break down the observations X into a product of a scores matrix T and a loadings matrix L.
PCA
How is it done?
X = T Lᵀ  (scores × loadings)

Compare: y = a x.

This example is mathematically inconsistent! The real mathematics is shown later.
The loadings matrix tells the direction of the principal component.
These are the rotated coordinates expressed in terms of the original ones.
PCA
Loadings
[Figure: the principal component Lᵀ = (1 1) drawn in the (x = Ca, y = pH) plane.]
The scores matrix tells where the points lie along the new coordinate axis.
PCA
Scores
[Figure: the points plotted along the new coordinate axes PC1 and PC2.]
In linear algebra, y = a x. To solve for a one writes a = y x⁻¹.
Here X = T Lᵀ. To solve for T we multiply by (Lᵀ)⁻¹ from the right. Note that the matrix L is an orthogonal matrix, i.e., Lᵀ = L⁻¹, so T = X L.
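A quick numerical check of this identity with numpy; the data are the toy observations from the worked example below, and the orthonormal loadings are hypothetical:

```python
import numpy as np

# normalized toy observations (see the worked example below)
X = np.array([[-1.0, -1.0],
              [ 0.0,  0.0],
              [ 1.0,  1.0]]) / np.sqrt(2)

# hypothetical orthonormal loadings; columns are principal directions (sign is arbitrary)
L = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2)

assert np.allclose(L.T @ L, np.eye(2))  # L is orthogonal, so L^T = L^(-1)
T = X @ L                               # scores follow without any matrix inversion
assert np.allclose(T @ L.T, X)          # and indeed X = T L^T
```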
Assume that you have p features (columns) and n objects (rows) in the original observations matrix X. Assume that you need d principal components. Then
PCA
Dimensions of the matrices
X = T Lᵀ

where X is an n×p matrix, T (the scores) is an n×d matrix, and Lᵀ (the loadings) is a d×p matrix.
The method is iterative.
Step 1: Normalize (center to origin and scale to length 1) the original observations matrix X.
PCA
The real mathematics
X = ( 2 1; 3 2; 4 3 )  →  centered: ( −1 −1; 0 0; 1 1 )  →  scaled to length 1: ( −1/√2 −1/√2; 0 0; 1/√2 1/√2 )
Step 2: Estimate the first loading vector lᵀ (i.e., the first row of the matrix Lᵀ). Usually, the first row of the normalized X matrix is chosen.

lᵀ = ( −1/√2  −1/√2 )
Step 3: Form the new score vector t:

t = X l
PCA
The real mathematics
Compare the new score vector with that from the previous iteration. If they differ by less than a preset limit (e.g., 10⁻⁵) go to step 6; if they differ by more, refine the score vector further in step 4.
Step 4: In order to refine the score vector, compute a new loadings vector:

lᵀ = tᵀ X / (tᵀ t)
PCA
The real mathematics
Normalize the new loadings vector.
Step 5: If the number of iterations for this score vector is less than a preset limit, say 10, go to step 3; if the allowed number of iterations is exceeded, go to step 6.
PCA
The real mathematics
Step 6: Determine the residuals, i.e., the so far unexplained part of the observations:

E = X − t lᵀ
PCA
The real mathematics
If the necessary number of principal components has been obtained, go to step 8. Otherwise go to step 7.
Step 7: Use the residuals matrix E as the new observations matrix X and go to step 2 to obtain the next score vector.
PCA
The real mathematics
Step 8: Report the results. The original observations in matrix X can be explained as the product of the matrix T containing the score vectors and the matrix Lᵀ containing the corresponding loadings vectors.
Here only one principal component is needed.
PCA
The real mathematics
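Putting steps 1–8 together, a compact Python sketch of this iterative procedure (it is NIPALS-style; the function and variable names are mine, not from the lecture):

```python
import numpy as np

def pca_nipals(X, n_components=1, tol=1e-5, max_iter=10):
    """Iterative PCA following steps 1-8 in the text (a NIPALS-style sketch)."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature and scale it to length 1
    X = X - X.mean(axis=0)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0.0] = 1.0                # guard against constant features
    E = X / norms
    scores, loadings = [], []
    for _ in range(n_components):
        l = E[0] / np.linalg.norm(E[0])      # Step 2: first row as initial loading
        t = E @ l                            # Step 3: score vector
        for _ in range(max_iter):            # Step 5: cap on the iterations
            l = E.T @ t / (t @ t)            # Step 4: refine the loading vector
            l = l / np.linalg.norm(l)        # ... and normalize it
            t_new = E @ l                    # Step 3 again: new score vector
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break                        # converged: proceed to Step 6
            t = t_new
        E = E - np.outer(t, l)               # Step 6: residuals (Step 7 reuses E)
        scores.append(t)
        loadings.append(l)
    # Step 8: the preprocessed X is approximately T @ L.T
    return np.column_stack(scores), np.column_stack(loadings)

# The worked example: one principal component explains the data exactly.
T, L = pca_nipals([[2, 1], [3, 2], [4, 3]], n_components=1)
print(T.ravel())   # approx ( 1, 0, -1)
print(L.ravel())   # approx (-1/sqrt(2), -1/sqrt(2))
```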
The hair samples: scores and loadings
PCA
The real case
[Table: scores on PC1 and PC2 for the nine hair samples, and loadings on PC1 and PC2 for the features Cu, Mn, Cl, Br and I.]
Scores plot shows clusters
PCA
The real case
[Figure: scores plot in the (PC1, PC2) plane; the nine samples form three clusters.]
The loadings plot shows which features contribute to each principal component.
PCA
The real case
[Figure: loadings plot in the (PC1, PC2) plane showing Cu, Mn, Cl, Br and I as lines from the origin.]
A small angle between the lines means that the features are strongly correlated. Copper and manganese are strongly correlated. Copper and chlorine anticorrelate.
All features contribute to PC1. PC2 depends mostly on Br and I.
A biplot, properly scaled, shows which features separate the clusters.
PCA
The real case
[Figure: biplot of the scores and loadings in the (PC1, PC2) plane.]
Cu, Mn, Br and Cl separate the clusters (2, 5, 8) and (3, 6, 7). The cluster (1, 4, 9) is characterized by I.