TRANSCRIPT
Chemometrics
Matti Hotokka, Physical Chemistry
Åbo Akademi University
Consider a case where two quantities (or features), pH and the Ca concentration, are measured. Assume that three different samples are analysed.
Data analysis
Definitions
[Figure: the observations matrix; rows are objects (samples), columns are the features Ca and pH.]
In general
- Purpose: to classify the objects.
- Method: plot the points so that they are spread as much as possible.
  - For this one needs to find the line of largest spread.
  - Then just rotate so that this line becomes a coordinate axis.
  - At the same time you see which original features contribute most to the spread.
Data analysis
Principle
[Figure: the three points in the (x = Ca, y = pH) plane; the spread along each coordinate axis is s = 1, while the spread along the diagonal line through the points is s = √2.]
Preprocessing the data is perhaps the most crucial step in data analysis.
Many steps may be needed. It is important to keep track of the transformations so that all steps can be applied in reverse order on the final result in order to obtain numbers comparable to the original data.
Preprocessing
A constant offset may be removed from each data point. Each feature may have its own offset. Usually the offset is the mean of the feature (column), meaning that the feature is centered around the origin.
Preprocessing
Centering
Thus, for a given feature (column), i.e., fixed index k,

x′_ik = x_ik − x̄_k,  where x̄_k = (1/n) Σᵢ x_ik.
Preprocessing
Centering
[Figure: the centered data in the (x = Ca, y = pH) plane; the points now straddle the origin.]
The features represent different aspects of the same samples and may have numerically quite varying orders of magnitude. One may scale each feature to make their influence on the system more easily comparable.
Preprocessing
Scaling
Range scaling: the feature is mapped onto a fixed interval, e.g., x′_ik = (x_ik − min_k) / (max_k − min_k).

Autoscaling: the centered feature is divided by its standard deviation s_k. In autoscaling, the feature vectors are thereby scaled to length √(n−1).
Preprocessing
Scaling
[Figure: the autoscaled data in the (x = Ca, y = pH) plane.]
The feature vectors may be normalized to length 1.
Preprocessing
Normalisation
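As an illustration, a minimal numpy sketch of these preprocessing steps (centering, autoscaling, normalisation to length 1); the toy data and variable names are my own:

```python
import numpy as np

# toy data: rows = objects (samples), columns = features (Ca, pH)
X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])

X_centered = X - X.mean(axis=0)                  # remove each feature's mean (offset)

# autoscaling: divide each centered feature by its standard deviation
# (1/(n-1) convention), which leaves each feature vector with length sqrt(n-1)
X_auto = X_centered / X_centered.std(axis=0, ddof=1)

# normalisation: scale each feature vector to length 1
X_unit = X_centered / np.linalg.norm(X_centered, axis=0)
```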
Statistical analysis is possible even if data points are missing. The missing points should be replaced with a suitable mean value or, in the worst case, with a random number.
Preprocessing
Missing data
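A minimal sketch of mean-value imputation for the missing points (the data values are made up):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [3.0, np.nan],     # a missing data point
              [4.0, 3.0]])

col_means = np.nanmean(X, axis=0)     # feature means, ignoring the gaps
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]       # replace each gap with its feature's mean
```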
Features should be removed if they are strongly correlated with other features, do not have any influence on the data, or are constant.
Preprocessing
Redundant features
The variances (diagonal elements) and covariances (off-diagonal elements) are

s_jj = (1/(n−1)) Σᵢ (x_ij − x̄_j)²,  for j = 1, …, p

s_jk = (1/(n−1)) Σᵢ (x_ij − x̄_j)(x_ik − x̄_k),  for j, k = 1, …, p, j ≠ k

Notation: there are p features and n objects.

Variances
Variance-covariance matrix
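A numpy sketch of the same matrix with the 1/(n−1) convention above (the toy data and names are mine):

```python
import numpy as np

X = np.array([[2.0, 1.0],      # rows = n objects, columns = p features
              [3.0, 2.0],
              [4.0, 3.0]])

n = X.shape[0]
Xc = X - X.mean(axis=0)        # center each feature
S = Xc.T @ Xc / (n - 1)        # p x p variance-covariance matrix

# the library routine gives the same result
assert np.allclose(S, np.cov(X, rowvar=False))
```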
- What determines it: the largest spread means the largest variance.
- How to find it:
  - The eigenvalues of the variance-covariance matrix give all spreads in independent directions.
  - The eigenvectors tell in which directions the lines of largest spread lie.
  - The eigenvectors are the new rotated coordinate axes.
Variances
Largest spread
Definition:
- Assume that A is a square matrix.
- If one can find a number a and a vector X that fulfil the equation A X = a X, then the number a is an eigenvalue and the vector X is an eigenvector.

How to find them:
- Diagonalize the matrix; in this way you will obtain all eigenvalues and eigenvectors.
- It is always possible to diagonalize, but this is usually unnecessary work because you only need the two, at most three, directions where the spreads are the largest.

Variances
Eigenvalues and eigenvectors
- An n×n matrix has n eigenvalues; a 2×2 matrix has two eigenvalues.
Variances
Eigenvalues and eigenvectors
For the model system the variance-covariance matrix is (1 1; 1 1). The eigenvalues are 2 and 0 and the corresponding eigenvectors are (1, 1)ᵀ and (1, −1)ᵀ. The spread along the first line is 2 and the spread along the second axis is zero, no spread at all, so the points lie exactly on one line.
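This is easy to verify numerically; a minimal sketch with numpy's symmetric eigensolver:

```python
import numpy as np

S = np.array([[1.0, 1.0],       # variance-covariance matrix of the model system
              [1.0, 1.0]])

vals, vecs = np.linalg.eigh(S)  # eigh: for symmetric matrices, eigenvalues ascending
print(vals)                     # [0. 2.]
print(vecs)                     # columns are (1,-1)/sqrt(2) and (1,1)/sqrt(2), up to sign
```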
The elements of the correlation matrix are

r_jk = s_jk / √(s_jj s_kk) = s_jk / (s_j s_k).
Variances
Correlation matrix
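A numpy sketch of this normalisation (np.corrcoef computes the same matrix):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])

S = np.cov(X, rowvar=False)      # variance-covariance matrix
d = np.sqrt(np.diag(S))          # standard deviations s_j
R = S / np.outer(d, d)           # r_jk = s_jk / (s_j s_k)

assert np.allclose(R, np.corrcoef(X, rowvar=False))
```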
A reminder of the model system to be used as an example.
PCA
Principal component analysis
[Figure: the observations matrix again; rows are objects, columns are the features Ca and pH.]
PCA finds a direction along which the points lie.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane with the principal component direction (1, 1)ᵀ drawn through them.]
PCA just classifies your observations. It does not perform any regression.
PCA
What is it?
[Figure: the points grouped into categories labelled Low, Medium and High along the principal component.]
PCA classifies the points in categories with similar properties.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane, classified along the principal component (1, 1)ᵀ.]
When a two-dimensional coordinate system is rotated there must be two new rotated coordinate axes that are orthogonal. If there are n features there will be n new rotated coordinate axes.
Sure but ...
You are not interested in more than a few directions with the widest spread.
The first new coordinate axis is chosen so that it gives the widest spread of points. The second new axis gives the widest spread in any orthogonal direction. The last ones give essentially no spread and are therefore uninteresting.
Usually the first two, or maybe three, axes suffice to group the observations in clusters.
PCA
The next principal component
The second principal component.
PCA
What does it mean?
[Figure: the points in the (x = Ca, y = pH) plane with both principal components drawn: (1, 1)ᵀ and (1, −1)ᵀ.]
Hair samples from a crime site were analyzed. The following elemental compositions of the hairs of the suspects were detected.
PCA
A real case
Hair   Cu    Mn    Cl    Br    I
1      9.2   0.30  1730  12.0  3.6
2     12.4   0.39   930  50.0  2.3
3      7.2   0.32  2750  65.3  3.4
4     10.2   0.36  1500   3.4  5.3
5     10.1   0.50  1040  39.2  1.9
6      6.5   0.20  2490  90.0  4.6
7      5.6   0.29  2940  88.0  5.6
8     11.8   0.42   867  43.1  1.5
9      8.5   0.25  1620   5.2  6.2
At first glance the samples look quite random.
Consider the two most important principal components.
PCA
A real case
[Figure: scores plot of the nine hair samples in the (PC1, PC2) plane; the samples fall into clusters.]
The details of the computation will follow.
The direction of largest spread needs to be found. The spread along the coordinate axes is given by the variance-covariance matrix. Its eigenvalues give the characteristic spreads. The corresponding eigenvectors give the directions.
The eigenvectors of a symmetric matrix are automatically orthogonal.
So, diagonalize the σ matrix.
PCA
How is it done?
Only, you don’t.
Diagonalization gives ALL eigenvalues and eigenvectors. For systems with many features and many objects this could mean a lot of work. If you compare, e.g., vibrational spectra in the range 400 to 3600 cm⁻¹ at a resolution of 4 cm⁻¹ you will have 800 features. This means that the variance-covariance matrix is 800×800 and most of the 800 principal components are completely irrelevant. You just need the first two.
The spread of the first component is largest, that of the second smaller, etc. Two or three components usually explain all the spread down to experimental errors.
PCA
Eigenvalues
[Figure: scree plot of eigenvalue (spread) versus component number; the later components do not differentiate the observations.]
Consider the hair example. In order to calculate all eigenvalues, the variance-covariance matrix was formed. As the numerical scales of the features are widely different, a comparison of them is not relevant. Therefore the correlation matrix was derived from the covariance matrix and diagonalised. The eigenvalues are:
PCA
Explained variances
Component  Eigenvalue λ  Explained variance %  Cumulative variance %
1          3.375         67.5                   67.5
2          1.178         23.6                   91.1
3          0.294          5.9                   96.9
4          0.127          2.5                   99.5
5          0.028          0.5                  100.0
The first two principal components explain over 90 % of the spread. The rest is merely noise.
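Under the assumption that the table comes from diagonalising the correlation matrix of the hair data above, a numpy sketch that should reproduce it up to rounding:

```python
import numpy as np

# hair data from the table above: columns Cu, Mn, Cl, Br, I
X = np.array([
    [ 9.2, 0.30, 1730, 12.0, 3.6],
    [12.4, 0.39,  930, 50.0, 2.3],
    [ 7.2, 0.32, 2750, 65.3, 3.4],
    [10.2, 0.36, 1500,  3.4, 5.3],
    [10.1, 0.50, 1040, 39.2, 1.9],
    [ 6.5, 0.20, 2490, 90.0, 4.6],
    [ 5.6, 0.29, 2940, 88.0, 5.6],
    [11.8, 0.42,  867, 43.1, 1.5],
    [ 8.5, 0.25, 1620,  5.2, 6.2],
])

R = np.corrcoef(X, rowvar=False)           # correlation matrix of the 5 features
vals = np.linalg.eigvalsh(R)[::-1]         # eigenvalues, largest first
explained = 100 * vals / vals.sum()        # explained variance in %
print(np.round(vals, 3))                   # expected: 3.375, 1.178, 0.294, ...
print(np.round(np.cumsum(explained), 1))   # cumulative %, ending at 100.0
```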
The original observations matrix X shows where the points lie in the (Ca, pH) coordinate system.
PCA
How is it done?
[Figure: the observations in the (x = Ca, y = pH) coordinate system.]
Break down the observations X into a product of a scores matrix T and a loadings matrix L.
PCA
How is it done?
X = T Lᵀ  (scores × loadings)

Compare: y = a x.

This example is mathematically inconsistent! The real mathematics is shown later.
The loadings matrix tells the direction of the principal component.
These are the rotated coordinates expressed in terms of the original ones.
PCA
Loadings
[Figure: the principal component Lᵀ = (1 1) drawn in the (x = Ca, y = pH) plane.]
The scores matrix tells where the points lie along the new coordinate axis.
PCA
Scores
[Figure: the points plotted along the new coordinate axes PC1 and PC2.]
In linear algebra, y = a x. To solve for a one writes a = y x⁻¹.
Here X = T Lᵀ. To solve for T we multiply by (Lᵀ)⁻¹ from the right. Note that the matrix L is an orthogonal matrix, i.e., Lᵀ = L⁻¹, so T = X L.
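A quick numerical check of this identity with numpy; the data are the toy observations from the worked example below, and the orthonormal loadings are hypothetical:

```python
import numpy as np

# normalized toy observations (see the worked example below)
X = np.array([[-1.0, -1.0],
              [ 0.0,  0.0],
              [ 1.0,  1.0]]) / np.sqrt(2)

# hypothetical orthonormal loadings; columns are principal directions (sign is arbitrary)
L = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2)

assert np.allclose(L.T @ L, np.eye(2))  # L is orthogonal, so L^T = L^(-1)
T = X @ L                               # scores follow without any matrix inversion
assert np.allclose(T @ L.T, X)          # and indeed X = T L^T
```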
Assume that you have p features (columns) and n objects (rows) in the original observations matrix X. Assume that you need d principal components. Then
PCA
Dimensions of the matrices
X = T Lᵀ

where X is an n×p matrix, T (the scores) is an n×d matrix, and Lᵀ (the loadings) is a d×p matrix.
The method is iterative.
Step 1: Normalize (center to origin and scale to length 1) the original observations matrix X.
PCA
The real mathematics
X = ( 2 1; 3 2; 4 3 )  →  centered: ( −1 −1; 0 0; 1 1 )  →  scaled to length 1: ( −1/√2 −1/√2; 0 0; 1/√2 1/√2 )
Step 2: Estimate the first loading vector lᵀ (i.e., the first row of the matrix Lᵀ). Usually, the first row of the normalized X matrix is chosen.

lᵀ = ( −1/√2  −1/√2 )
Step 3: Form the new score vector t:

t = X l
PCA
The real mathematics
Compare the new score vector with that from the previous iteration. If they differ by less than a preset limit (e.g., 10⁻⁵) go to step 6; if they differ by more, refine the score vector further in step 4.
Step 4: In order to refine the score vector, compute a new loadings vector:

lᵀ = tᵀ X / (tᵀ t)
PCA
The real mathematics
Normalize the new loadings vector.
Step 5: If the number of iterations for this score vector is less than a preset limit, say 10, go to step 3; if the allowed number of iterations is exceeded, go to step 6.
PCA
The real mathematics
Step 6: Determine the residuals, i.e., the so far unexplained part of the observations:

E = X − t lᵀ
PCA
The real mathematics
If the necessary number of principal components has been obtained, go to step 8. Otherwise go to step 7.
Step 7: Use the residuals matrix E as the new observations matrix X and go to step 2 to obtain the next score vector.
PCA
The real mathematics
Step 8: Report the results. The original observations in matrix X can be explained as the product of the matrix T containing the score vectors and the matrix Lᵀ containing the corresponding loadings vectors.
Here only one principal component is needed.
PCA
The real mathematics
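Putting steps 1–8 together, a compact Python sketch of this iterative procedure (it is NIPALS-style; the function and variable names are mine, not from the lecture):

```python
import numpy as np

def pca_nipals(X, n_components=1, tol=1e-5, max_iter=10):
    """Iterative PCA following steps 1-8 in the text (a NIPALS-style sketch)."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature and scale it to length 1
    X = X - X.mean(axis=0)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0.0] = 1.0                # guard against constant features
    E = X / norms
    scores, loadings = [], []
    for _ in range(n_components):
        l = E[0] / np.linalg.norm(E[0])      # Step 2: first row as initial loading
        t = E @ l                            # Step 3: score vector
        for _ in range(max_iter):            # Step 5: cap on the iterations
            l = E.T @ t / (t @ t)            # Step 4: refine the loading vector
            l = l / np.linalg.norm(l)        # ... and normalize it
            t_new = E @ l                    # Step 3 again: new score vector
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break                        # converged: proceed to Step 6
            t = t_new
        E = E - np.outer(t, l)               # Step 6: residuals (Step 7 reuses E)
        scores.append(t)
        loadings.append(l)
    # Step 8: the preprocessed X is approximately T @ L.T
    return np.column_stack(scores), np.column_stack(loadings)

# The worked example: one principal component explains the data exactly.
T, L = pca_nipals([[2, 1], [3, 2], [4, 3]], n_components=1)
print(T.ravel())   # approx ( 1, 0, -1)
print(L.ravel())   # approx (-1/sqrt(2), -1/sqrt(2))
```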
The hair samples: scores and loadings
PCA
The real case
[Table: scores on PC1 and PC2 for the nine hair samples, and loadings on PC1 and PC2 for the features Cu, Mn, Cl, Br and I.]
Scores plot shows clusters
PCA
The real case
[Figure: scores plot in the (PC1, PC2) plane; the nine samples form three clusters.]
The loadings plot shows which features contribute to each principal component.
PCA
The real case
[Figure: loadings plot in the (PC1, PC2) plane showing Cu, Mn, Cl, Br and I as lines from the origin.]
A small angle between the lines means that the features are strongly correlated. Copper and manganese are strongly correlated. Copper and chlorine anticorrelate.
All features contribute to PC1. PC2 depends mostly on Br and I.
A biplot, properly scaled, shows which features separate the clusters.
PCA
The real case
[Figure: biplot of the scores and loadings in the (PC1, PC2) plane.]
Cu, Mn, Br and Cl separate the clusters (2, 5, 8) and (3, 6, 7). The cluster (1, 4, 9) is characterized by I.