Post on 25-Aug-2020

Page 1: Matti Hotokka Physical chemistry Åbo Akademi Universitymhotokka/mhotokka/lecturenotes/statistik/L6_P… · The variances (diagonal elements) and covariances (off-diagonal elements)

Chemometrics

Matti Hotokka

Physical chemistry

Åbo Akademi University


Consider a case where two quantities (or features), pH and the Ca concentration, are measured. Assume that three different samples are analysed.

Data analysis

Definitions

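[Table: the observations matrix with the objects (samples) as rows and the features (Ca, pH) as columns, followed by its general form.]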


• Purpose
  - To classify the objects

• Method
  - Plot the points so that they are spread as much as possible
  - For this one needs to find the line of largest spread
  - Then just rotate so that this line becomes a coordinate axis
  - At the same time you see which original features contribute most to the spread

Data analysis

Principle


Data analysis

Principle

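[Figure: the sample points in the (x = Ca, y = pH) plane; the spread is s = 1 along each original axis and s = √2 along the line of largest spread.]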


Preprocessing the data is the (most?) crucial step in data analysis.

Many steps may be needed. It is important to keep track of the transformations so that all steps can be applied in reverse order on the final result in order to obtain numbers comparable to the original data.

Preprocessing


A constant offset may be removed from each data point. Each feature may have its own offset. Usually the offset is the mean of the feature (column), meaning that the feature is centered around the origin.

Preprocessing

Centering

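Thus, for a given feature (column), i.e., fixed index k, the centered value for object i is $x'_{ik} = x_{ik} - \bar{x}_k$, where $\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$ is the mean of feature k over the n objects.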
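Centering can be sketched in a few lines of NumPy. This is only an illustration: the array holds the three-sample Ca/pH values used in the worked example later in these notes, and the variable names are chosen here, not taken from the slides.

```python
import numpy as np

# Rows are objects (samples), columns are features (Ca, pH).
X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])

# One offset per feature: the column mean.
offsets = X.mean(axis=0)

# Broadcasting subtracts each column's own mean from that column.
X_centered = X - offsets

# Storing the offsets lets the step be undone on the final result.
X_restored = X_centered + offsets

print(X_centered)
```

Keeping `offsets` around is what makes it possible to apply the preprocessing in reverse later.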


Preprocessing

Centering

[Figure: the sample points after centering, plotted against x = Ca and y = pH.]


The features represent different aspects of the same samples and may have numerically quite varying orders of magnitude. One may scale each feature to make their influence on the system more easily comparable.

Preprocessing

Scaling

Range scaling

Autoscaling

In autoscaling, the centered feature vectors are divided by their standard deviation, so that each is scaled to length $\sqrt{n-1}$.
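The two scalings can be sketched as follows. The formulas on the slide did not survive extraction, so this assumes the common definitions: range scaling maps each feature onto [0, 1] using its minimum and maximum, and autoscaling divides the centered feature by its sample standard deviation (n − 1 divisor). The data are the three-sample Ca/pH values from the worked example.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])
n = X.shape[0]

# Range scaling (assumed definition): (x - min) / (max - min) per column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_range = (X - col_min) / (col_max - col_min)

# Autoscaling: center, then divide by the sample standard deviation.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Each autoscaled feature vector has length sqrt(n - 1).
lengths = np.linalg.norm(X_auto, axis=0)
print(lengths, np.sqrt(n - 1))
```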


Preprocessing

Scaling

[Figure: the sample points after scaling, plotted against x = Ca and y = pH.]


The feature vectors may be normalized to length 1.

Preprocessing

Normalisation
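A small sketch of normalisation to unit length, assuming that "feature vector" means a column of the (already centered) observations matrix. The numbers are the centered Ca/pH example values.

```python
import numpy as np

# Centered example data: rows are objects, columns are features.
X_centered = np.array([[-1.0, -1.0],
                       [ 0.0,  0.0],
                       [ 1.0,  1.0]])

# Euclidean length of each feature vector (column).
norms = np.linalg.norm(X_centered, axis=0)

# Divide every column by its own length.
X_unit = X_centered / norms

print(np.linalg.norm(X_unit, axis=0))
```

As with centering, the `norms` would need to be stored if the step is to be reversed later.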


Statistical analysis is possible even if data points are missing. The missing points should be replaced with a suitable mean value or, in the worst case, with a random number.

Preprocessing

Missing data
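Replacing a missing point by a mean value might look like the sketch below. It assumes that gaps are encoded as NaN and that the "suitable mean" is the mean of the same feature over the other objects; the gap itself is hypothetical and was introduced only for illustration.

```python
import numpy as np

# Hypothetical data set with one missing pH value (encoded as NaN).
X = np.array([[2.0, 1.0],
              [3.0, np.nan],
              [4.0, 3.0]])

# Column means computed while ignoring the missing entries.
col_means = np.nanmean(X, axis=0)

# Fill each gap with the mean of its own column.
X_filled = np.where(np.isnan(X), col_means, X)

print(X_filled)
```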


Features should be removed if they are strongly correlated with other features, do not have any influence on the data, or are constant.

Preprocessing

Redundant features
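Two of these criteria (constant features and strongly correlated features) can be screened automatically. The sketch below is one possible way to do it; the function name, the threshold `corr_limit` and the test data are all hypothetical. Whether a feature "has any influence on the data" is a judgement that this sketch does not attempt.

```python
import numpy as np

def select_features(X, corr_limit=0.95):
    """Return the column indices kept after dropping redundant features."""
    X = np.asarray(X, dtype=float)
    # Constant features have zero standard deviation.
    candidates = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    selected = []
    for j in candidates:
        # Keep j only if it is not strongly correlated with a kept feature.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_limit
               for k in selected):
            selected.append(j)
    return selected

# Hypothetical data: column 1 is twice column 0, column 2 is constant.
X = np.array([[1.0, 2.0, 5.0,  1.0],
              [2.0, 4.0, 5.0, -1.0],
              [3.0, 6.0, 5.0,  1.0],
              [4.0, 8.0, 5.0, -1.0]])

kept = select_features(X)
print(kept)
```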


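Variances

Variance-covariance matrix

The variances (diagonal elements) and covariances (off-diagonal elements) are

$$\sigma_{jj} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2 \qquad \text{for } j = 1, \dots, p$$

$$\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right) \qquad \text{for } j, k = 1, \dots, p,\ j \neq k$$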

Notation: There are p features and n objects.
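The variance-covariance matrix of the small Ca/pH example can be formed as sketched below. The sketch assumes the n − 1 divisor, which is the choice that reproduces the spreads quoted in these notes; `np.cov` is used only as a cross-check.

```python
import numpy as np

# n = 3 objects (rows), p = 2 features (columns: Ca, pH).
X = np.array([[2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0]])
n, p = X.shape

# Center each feature, then form the p x p matrix of (co)variances.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)

# Diagonal: variances. Off-diagonal: covariances.
print(S)

# Cross-check against NumPy's own estimator (features in columns).
S_check = np.cov(X, rowvar=False)
```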


Variances

Variance-covariance matrix

[Figure: the variance-covariance matrix, with the variances marked on the diagonal and the covariances off the diagonal.]


• What determines
  - Largest spread means largest variance

• How to find
  - Eigenvalues of the variance-covariance matrix give all spreads in independent directions
  - Eigenvectors tell in which directions the lines of largest spreads lie
  - Eigenvectors are the new rotated coordinate axes

Variances

Largest spread


Variances

Eigenvalues and eigenvectors

• Definition
  - Assume that A is a square matrix
  - If one can find a number a and a vector X that fulfil the equation $AX = aX$, the number is an eigenvalue and the vector is an eigenvector

• How to find
  - Diagonalize the matrix
  - In this way you will obtain all eigenvalues and eigenvectors
  - A symmetric matrix such as the variance-covariance matrix can always be diagonalized, but this is usually unnecessary work because you only need two, at most three, directions where the spreads are the largest


• An n×n matrix has n eigenvalues
  - A 2×2 matrix has two eigenvalues

Variances

Eigenvalues and eigenvectors

The eigenvalues are 2 and 0 and the corresponding eigenvectors are $(1, 1)^T$ and $(1, -1)^T$. The spread along the first axis is 2 and the spread along the second axis is zero, no spread at all, so the points lie exactly on one line.
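These numbers can be checked numerically. The matrix on the slide did not survive extraction; the sketch below assumes it is the variance-covariance matrix of the three-sample Ca/pH example (all elements equal to 1 with an n − 1 divisor). Note that NumPy returns eigenvectors normalized to length 1 and with arbitrary sign, so (1, 1) appears as ±(1/√2, 1/√2).

```python
import numpy as np

# Assumed example matrix: variance-covariance matrix of the Ca/pH data.
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])

# eigh is meant for symmetric matrices; eigenvalues come in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(S)

# Reorder so that the largest spread comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the eigenvectors

print(eigenvalues)
print(eigenvectors)
```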


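The elements of the correlation matrix are $r_{jk} = \sigma_{jk} / \sqrt{\sigma_{jj}\,\sigma_{kk}}$, where $\sigma_{jk}$ are the elements of the variance-covariance matrix.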

Variances

Correlation matrix
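Deriving a correlation matrix from a variance-covariance matrix can be sketched as below, assuming the usual definition in which each covariance is divided by the two corresponding standard deviations. The function name and the 2×2 input matrix are hypothetical and serve only as an illustration.

```python
import numpy as np

def correlation_from_covariance(S):
    """Scale a variance-covariance matrix to a correlation matrix."""
    S = np.asarray(S, dtype=float)
    std = np.sqrt(np.diag(S))          # standard deviation of each feature
    return S / np.outer(std, std)      # divide element (j, k) by std_j * std_k

# Hypothetical covariance matrix: variances 4 and 9, covariance 2.
S = np.array([[4.0, 2.0],
              [2.0, 9.0]])

R = correlation_from_covariance(S)
print(R)
```

The diagonal of the result is always 1, which is why the eigenvalues of a correlation matrix of p features sum to p.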


A reminder of the model system to be used as an example.

PCA

Principal component analysis

[Table: the observations matrix with the objects (samples) as rows and the features (Ca, pH) as columns.]


PCA finds a direction along which the points lie.

PCA

What does it mean?

[Figure: the sample points in the (x = Ca, y = pH) plane with the principal component drawn along the direction (1 1).]


PCA just classifies your observations. It does not perform any regression.

PCA

What is it?

[Figure: the observations grouped into the classes Low, Medium and High.]


PCA classifies the points in categories with similar properties.

PCA

What does it mean?

[Figure: the sample points in the (x = Ca, y = pH) plane, grouped along the principal component (1 1).]


When a two-dimensional coordinate system is rotated there must be two new rotated coordinate axes that are orthogonal. If there are p features there will be p new rotated coordinate axes.

Sure but ...

You are not interested in more than a few directions with the widest spread.

The first new coordinate axis is chosen so that it gives the widest spread of points. The second new axis gives the widest spread in any orthogonal direction. The last ones give essentially no spread and are therefore uninteresting.

Usually the first two, or maybe three, axes suffice to group the observations in clusters.

PCA

The next principal component


The second principal component.

PCA

What does it mean?

[Figure: the sample points in the (x = Ca, y = pH) plane with the first principal component (1 1) and the second principal component (1 -1).]


Hair samples from a crime site were analyzed. The following elemental compositions of the hairs of the suspects were detected.

PCA

A real case

Hair   Cu     Mn     Cl     Br     I
1      9.2    0.30   1730   12.0   3.6
2      12.4   0.39   930    50.0   2.3
3      7.2    0.32   2750   65.3   3.4
4      10.2   0.36   1500   3.4    5.3
5      10.1   0.50   1040   39.2   1.9
6      6.5    0.20   2490   90.0   4.6
7      5.6    0.29   2940   88.0   5.6
8      11.8   0.42   867    43.1   1.5
9      8.5    0.25   1620   5.2    6.2

At first glance the samples look quite random.


Consider the two most important principal components.

PCA

A real case

[Figure: scores plot of the nine hair samples, labelled 1 to 9, on the axes PC1 and PC2.]

The details of the computation will follow.


The direction of largest spread needs to be found. Spread along the coordinate axes is given by the variance-covariance matrix. Its eigenvalue gives the characteristic spread. The corresponding eigenvector gives the direction.

Eigenvectors are automatically orthogonal.

So, diagonalize the variance-covariance matrix σ.

PCA

How is it done?

Only, you don’t.

Diagonalization gives ALL eigenvalues and eigenvectors. For systems with many features and many objects this could mean a lot of work. If you compare, e.g., vibrational spectra in the range 400 to 3600 cm⁻¹ at resolution 4 cm⁻¹ you will have 800 features. This means that the variance-covariance matrix is 800×800 and most of the 800 principal components are completely irrelevant. You just need the first two.


The spread of the first component is largest, that of the second smaller, etc. Two or three components usually explain all the spread down to experimental errors.

PCA

Eigenvalues

[Figure: eigenvalue (= spread) plotted against component number 1 to 5. The components with small eigenvalues do not differentiate the observations.]


Consider the hair example. In order to calculate all eigenvalues, the variance-covariance matrix was formed. As the numerical scales of the features are widely different, a comparison of them is not relevant. Therefore the correlation matrix was derived from the covariance matrix and diagonalised. The eigenvalues are:

PCA

Explained variances

Component   Eigenvalue λ   Explained variance %   Cumulative variance %
1           3.375          67.5                   67.5
2           1.178          23.6                   91.1
3           0.294          5.9                    96.9
4           0.127          2.5                    99.5
5           0.028          0.5                    100.0

The first two principal components explain over 90 % of the spread. The rest is merely noise.
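The procedure described for the hair example can be sketched with NumPy as follows. The array is transcribed from the hair table in these notes (columns Cu, Mn, Cl, Br, I); the variable names are chosen here. The printed values should be compared with the tabulated ones rather than taken as authoritative, since small differences can arise from rounding or transcription.

```python
import numpy as np

# Hair samples 1 to 9 (rows); features Cu, Mn, Cl, Br, I (columns).
hair = np.array([
    [ 9.2, 0.30, 1730.0, 12.0, 3.6],
    [12.4, 0.39,  930.0, 50.0, 2.3],
    [ 7.2, 0.32, 2750.0, 65.3, 3.4],
    [10.2, 0.36, 1500.0,  3.4, 5.3],
    [10.1, 0.50, 1040.0, 39.2, 1.9],
    [ 6.5, 0.20, 2490.0, 90.0, 4.6],
    [ 5.6, 0.29, 2940.0, 88.0, 5.6],
    [11.8, 0.42,  867.0, 43.1, 1.5],
    [ 8.5, 0.25, 1620.0,  5.2, 6.2],
])

# Correlation matrix of the five features (features are in columns).
R = np.corrcoef(hair, rowvar=False)

# Eigenvalues of the symmetric matrix, largest first.
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Share of the total spread carried by each component, in percent.
explained = 100.0 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

for k, (lam, e, c) in enumerate(zip(eigenvalues, explained, cumulative), 1):
    print(f"{k}  {lam:6.3f}  {e:5.1f}  {c:5.1f}")
```

Because the diagonal of a correlation matrix is 1, the eigenvalues sum to the number of features, here 5, which is what the explained-variance percentages are taken relative to.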


The original observations matrix X shows where the points lie in the (Ca, pH) coordinate system.

PCA

How is it done?

[Figure: the sample points in the (x = Ca, y = pH) coordinate system.]


Break down the observations X to a product of a scores matrix T and a loadings matrix L.

PCA

How is it done?

[Equation: the observations matrix X written as the product of the Scores and the Loadings.]

Compare: y = ax

This example is mathematically inconsistent! The real mathematics is shown later.


The loadings matrix tells what is the direction of the principal component.

These are the rotated coordinates expressed in terms of the original ones.

PCA

Loadings

[Figure: the sample points in the (x = Ca, y = pH) plane with the principal component drawn along the loading vector $L^{T} = (1\ \ 1)$.]


The scores matrix tells where the points lie along the new coordinate axis.

PCA

Scores

[Figure: the sample points in the (x = Ca, y = pH) plane with the new coordinate axes PC1 and PC2.]


In linear algebra, y = ax. To solve for a one writes $a = y x^{-1}$.

Here $X = TL^{T}$. To solve for T we multiply by $(L^{T})^{-1}$ from the right, which gives $T = XL$ (note that the matrix L is an orthogonal matrix, i.e., $L^{T} = L^{-1}$).

Assume that you have p features (columns) and n objects (rows) in the original observations matrix X. Assume that you need d principal components. Then

PCA

Dimensions of the matrices

$X = TL^{T}$, where X is an n × p matrix, T is an n × d matrix and $L^{T}$ is a d × p matrix.
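The shapes can be checked on the small Ca/pH example with n = 3 objects, p = 2 features and d = 1 principal component. In this sketch the loading vector is assumed to be the principal component direction (1 1) normalized to length 1, and the scores are obtained as T = XL from the centered data.

```python
import numpy as np

# Centered observations: n = 3 rows (objects), p = 2 columns (features).
Xc = np.array([[-1.0, -1.0],
               [ 0.0,  0.0],
               [ 1.0,  1.0]])

# Loadings matrix L is p x d; here d = 1 and the column is (1, 1)/sqrt(2).
L = np.array([[1.0],
              [1.0]]) / np.sqrt(2.0)

# Scores matrix T is n x d.
T = Xc @ L

# The product T L^T has the shape of X again (n x p).
X_rebuilt = T @ L.T

print(Xc.shape, T.shape, L.T.shape)
```

For this data set one component already reproduces the centered observations exactly, because all points lie on a single line.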


The method is iterative.

Step 1: Normalize (center to origin and scale to length 1) the original observations matrix X.

PCA

The real mathematics

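$$\begin{pmatrix} 2 & 1 \\ 3 & 2 \\ 4 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} -1 & -1 \\ 0 & 0 \\ 1 & 1 \end{pmatrix} \rightarrow \begin{pmatrix} -1/\sqrt{2} & -1/\sqrt{2} \\ 0 & 0 \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

Step 2: Estimate the first loading vector $l^{T}$ (i.e., the first row of the matrix $L^{T}$). Usually, the first row of the normalized X matrix is chosen.

$$l^{T} = \begin{pmatrix} -1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$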


Step 3: Form the new score vector t:

PCA

The real mathematics

Compare the new score vector with that from the previous iteration. If they differ by less than a preset limit (e.g., $10^{-5}$), go to step 6; if they differ by more, refine the score vector further in step 4.


Step 4: In order to refine the score vector, compute a new loadings vector:

PCA

The real mathematics

Normalize the new loadings vector.


Step 5: If the number of iterations for this score vector is less than a preset limit, say 10, go to step 3; if the allowed number of iterations is exceeded, go to step 6.

PCA

The real mathematics


Step 6: Determine the residuals, i.e., the so far unexplained part of the observations:

PCA

The real mathematics

If the necessary number of principal components has been obtained, go to step 8. Otherwise go to step 7.


Step 7: Use the residuals matrix E as the new observations matrix X and go to step 2 to obtain the next score vector.

PCA

The real mathematics


Step 8: Report the results. The original observations in matrix X can be explained as the product of the matrix T containing the score vectors and the matrix $L^{T}$ containing the corresponding loadings vectors.

Here only one principal component is needed.
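Steps 2 to 8 can be sketched as a short NumPy routine. The update formulas on the slides did not survive extraction, so the sketch assumes the standard NIPALS-type updates: score t = Xl/(lᵀl), loading l = Xᵀt/(tᵀt) followed by normalization, and residual E = X − tlᵀ. The function name, the fallback for an all-zero first row and the early stop when nothing is left to explain are additions made here, not part of the slides.

```python
import numpy as np

def iterative_pca(X, n_components, tol=1e-5, max_iter=10):
    """Sketch of steps 2-8; X is assumed to be preprocessed already (step 1)."""
    E = np.asarray(X, dtype=float).copy()
    n, p = E.shape
    T = np.zeros((n, n_components))
    L = np.zeros((p, n_components))
    for a in range(n_components):
        if np.allclose(E, 0.0):
            break                      # nothing left to explain (added guard)
        # Step 2: first guess for the loading vector is the first row.
        l = E[0].copy()
        if np.allclose(l, 0.0):        # added fallback: use the longest row
            l = E[np.argmax(np.linalg.norm(E, axis=1))].copy()
        t_old = None
        for _ in range(max_iter):      # step 5: iteration limit
            t = E @ l / (l @ l)        # step 3: new score vector
            if t_old is not None and np.linalg.norm(t - t_old) < tol:
                break                  # converged
            l = E.T @ t / (t @ t)      # step 4: new loading vector
            l /= np.linalg.norm(l)     # normalize the loading vector
            t_old = t
        t = E @ l                      # score consistent with the final loading
        T[:, a], L[:, a] = t, l
        E = E - np.outer(t, l)         # step 6: residuals; step 7: reuse as X
    return T, L, E                     # step 8: X is approximately T @ L.T

# Normalized Ca/pH example (centered, columns scaled to length 1).
a = 1.0 / np.sqrt(2.0)
X = np.array([[-a, -a],
              [0.0, 0.0],
              [ a,  a]])

T, L, E = iterative_pca(X, n_components=1)
print(T.ravel(), L.ravel())
```

For this example the residual matrix is zero after the first component, in line with the remark that only one principal component is needed.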

PCA

The real mathematics


The hair samples: scores and loadings

PCA

The real case

[Tables: the scores of the hair samples on PC1 and PC2, and the loadings of Cu, Mn, Cl, Br and I on PC1 and PC2.]


Scores plot shows clusters

PCA

The real case

[Figure: scores plot of the nine hair samples, labelled 1 to 9, on the axes PC1 and PC2.]


Loadings plot shows which features contribute to each principal component

PCA

The real case

[Figure: loadings plot with Cu, Mn, Cl, Br and I drawn as lines from the origin on the axes PC1 and PC2.]

A small angle between lines means that the features are strongly correlated. Copper and manganese are strongly correlated. Copper and chlorine anticorrelate.

All features contribute to PC1. PC2 depends mostly on Br and I.


A biplot - properly scaled - shows which features separate the clusters

PCA

The real case

[Figure: biplot combining the scores of the hair samples 1 to 9 and the loadings of Cu, Mn, Cl, Br and I on the axes PC1 and PC2.]

Cu, Mn, Br and Cl separate the clusters (2,5,8) and (3,6,7). The cluster (1,4,9) is characterized by I.
