NAMP Module 17: “Introduction to Multivariate Analysis” Tier 1, Part 1, Rev.: 0

Page 1:

The Least Squares Principle

• Regression tries to produce a “best fit equation”, but what is “best”?

• Criterion: minimize the sum of squared deviations of data points from the regression line.
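The module itself contains no code, but here is a minimal Python sketch of the criterion (with made-up data, not from the slides): the line’s coefficients are chosen to minimise the sum of squared deviations.

```python
import numpy as np

# Illustrative data (hypothetical): y depends roughly linearly on x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Least squares: choose a, b to minimise sum((y - (a*x + b))**2)
A = np.column_stack([x, np.ones_like(x)])
(a, b), residual_ss, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"y = {a:.2f}x + {b:.2f}, sum of squared deviations = {residual_ss[0]:.2f}")
```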

Page 2:

How Good is the Regression (Part 1)?

How well does the regression equation represent our original data?

The proportion (percentage) of the variance in y that is explained by the regression equation is denoted by the symbol R2:

R2 = 1 - (sum of squares about regression line) / (sum of squares about mean of Y)
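A quick illustrative check of the formula in Python (values are hypothetical):

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # observed values
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # values predicted by the regression

ss_mean = np.sum((y - y.mean()) ** 2)  # sum of squares about the mean of Y
ss_line = np.sum((y - y_hat) ** 2)     # sum of squares about the regression line

r2 = 1 - ss_line / ss_mean             # proportion of variance explained
print(f"R2 = {r2:.3f}")
```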

Page 3:

Explained Variability - illustration

High R2 - good explanation
Low R2 - poor explanation

Page 4:

How Good is the Regression (Part 2)?

• Remember you used a sample from the population of potential data points to determine your regression equation.
– e.g., one value every 15 minutes, 1-2 weeks of operating data

• A different sample would give you a different equation with different coefficients bi

• As illustrated on the next slide, the sample can greatly affect the regression equation…

How well would this regression equation predict NEW data points?

Page 5:

Sampling variability of Regression Coefficients - illustration

Sample 1: y = a’x + b’ + e
Sample 2: y = a’’x + b’’ + e

Page 6:

Confidence Limits

• Confidence limits (x%) are upper and lower bounds which have an x% probability of enclosing the true population value of a given variable

• Often shown as bars above and below a predicted data point:
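As an illustrative sketch only (the module does not prescribe a method), 95% confidence limits for a sample mean can be computed in Python with the t-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=30)  # hypothetical measurements

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# Bounds with a 95% probability of enclosing the true population mean
lower, upper = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% confidence limits: [{lower:.2f}, {upper:.2f}]")
```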

Page 7:

Normalisation of Data

• Data used for regression are usually normalised to have mean zero and variance one.

• Otherwise the calculations would be dominated (biased) by variables having:
– numerically large values
– large variance

• This means that the MVA software never sees the original data, just the normalised version
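A minimal sketch of the normalisation step in Python (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two raw variables with very different scales (hypothetical process data)
X = rng.normal(loc=[100.0, 0.5], scale=[25.0, 0.01], size=(200, 2))

# Normalise each variable to mean zero and variance one
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```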

Page 8:

Normalisation of Data - illustration

Each variable is represented by a variance bar and its mean (centre).

[Panels: Raw data / Mean-centred only / Variance-centred only / Normalised]

Page 9:

Requirements for Regression

• Data Requirements
– Normalised data
– Errors normally distributed with mean zero
– Independent variables uncorrelated

• Implications if Requirements Not Met
– Larger confidence limits around regression coefficients (bi)
– Poorer prediction on new data

Page 10:

Multivariate Analysis

1. Principal Component Analysis (PCA)
• X’s only

2. Projection to Latent Structures (PLS)
• a.k.a. “Partial Least Squares”
• X’s and Y’s

Can be same dataset, i.e., you can do PCA on the whole thing (X’s and Y’s together)

Now we are ready to start talking about multivariate analysis (MVA) itself. There are two main types of MVA:

Let’s start with PCA. Note that the European food example at the beginning was PCA, because all the food types were treated as equivalent.

Page 11:

Purpose of PCA

The purpose of PCA is to project a data space with a large number of correlated dimensions (variables) into a second data space with a much smaller number of independent (orthogonal) dimensions.

This is justified scientifically because of Ockham’s Razor. Deep down, Nature IS simple. Often the lower dimensional space corresponds more closely to what is actually happening at a physical level.

The challenge is interpreting the MVA results in a scientifically valid way.

Reminder…“Ockham’s Razor”

Page 12:

Advantages of PCA

Among the advantages of PCA:

• Uncorrelated variables lend themselves to traditional statistical analysis
• Lower-dimensional space easier to work with
• New dimensions often represent more clearly the underlying structure of the set of variables (our friend Ockham)

Reminder…“Latent Attributes”

Page 13:

How PCA works (Concept)

• Find a component (dimension vector) which explains as much x-variation as possible

• Find a second component which:

– is orthogonal to (uncorrelated with) the first

– explains as much as possible of the remaining x-variation

• The process continues until the researcher is satisfied or the increase in explanation is judged minimal

PCA is a step-wise process. This is how it works conceptually:

Page 14:

How PCA Works (Math)

• Consider an (n x k) data matrix X (n observations, k variables)

• PCA models this as (assuming normalised data):

X = T * P’ + E

• where T is the scores of each observation on the new components

P is the loadings of the original variables on the new components

E is the residual matrix, containing the noise

Like linear regression, only using matrices

This is how PCA works mathematically:
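One way to compute this decomposition, sketched in Python via the singular value decomposition (the module does not specify an algorithm; the data here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # n = 100 observations, k = 5 variables

# Normalise first (mean zero, variance one), as the slides assume
Xn = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via the singular value decomposition: Xn = U S V'
U, S, Vt = np.linalg.svd(Xn, full_matrices=False)

A = 2                 # number of components retained
T = U[:, :A] * S[:A]  # scores: one row per observation, one column per component
P = Vt[:A].T          # loadings of the original variables on the new components
E = Xn - T @ P.T      # residual matrix, containing the noise
```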

Page 15:

[Figure: three original variables X1, X2, X3; the data cloud (in red) is projected onto the plane defined by the first 2 components.]

How PCA Works (Visually)

The way PCA works visually is by projecting the multidimensional data cloud onto the “hyperplane” defined by the first two components. The image below shows this in 3-D, for ease of understanding, but in reality there can be dozens or even hundreds of dimensions:

Page 16:

Number of Components

Components are simply the new axes which are created to explain the most variance with the least dimensions. The PCA methodology ensures that components are extracted in decreasing order of explained variance. In other words, the first component always explains the most variance, the second component explains the next most variance, and so forth:


Eventually, the higher-level components represent mainly noise. This is a good thing, and in fact one of the reasons we use PCA in the first place. Because noise is relegated to the higher-level components, it is largely absent from the first few. This works because the components are mutually orthogonal, i.e., uncorrelated with one another (and, for normally distributed data, statistically independent).

Page 17:

The Eigenvalue Criterion

• Eigenvalues λ of a matrix A:
– Mathematically defined by det(A - λI) = 0
– Useful as an “importance measure” for variables

There are two ways to determine when to stop creating new components:

–Eigenvalue criterion

–Scree test

The first of these uses the following mathematical definition:

Usually, components with eigenvalues less than one are discarded, since they have less explanatory power than the original variables did in the first place.
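A minimal sketch of the eigenvalue criterion in Python (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # hypothetical data, 6 variables

corr = np.corrcoef(X, rowvar=False)                # correlation matrix (normalised data)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first

# Eigenvalue criterion: keep only components whose eigenvalue exceeds 1
n_components = int(np.sum(eigvals > 1.0))
print(eigvals.round(2), "-> keep", n_components, "components")
```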

Page 18:

The Inflection Point Criterion (Scree Test)

The second method is a simple graphical technique:

• Plot eigenvalues vs. number of components

• Extract components up to the point where the plot “levels off”

• Right-hand tail of the curve is “scree” (like lower part of a rocky slope)

[Figure: scree plot of Eigenvalue (y-axis, 1-8) vs. Component # (x-axis, 1-6); the curve levels off after the first few components.]
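A short Python sketch of the scree test, using illustrative eigenvalues:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative eigenvalues, largest first (e.g., from the previous sketch)
eigvals = np.array([3.1, 1.6, 0.7, 0.3, 0.2, 0.1])

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("Component #")
plt.ylabel("Eigenvalue")
plt.title("Scree test: extract components up to where the plot levels off")
plt.show()
```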

Page 19:

Interpretation of the PCA Components

• Look at strength and direction of loadings
• Look for clusters of variables which may be physically related or have a common origin
– e.g., in papermaking, strength properties such as tear, burst, and breaking length in the paper are all related to the length and bonding propensity of the initial fibres.

As with any type of MVA, the most difficult part of PCA is interpreting the components. The software is 100% mathematical, and gives the same outputs whether the data relates to diesel fuel composition or last night’s horse racing results. It is up to the engineer to make sense of the outputs. Generally, you have to:

Page 20:

What is the difference between PCA and PLS?

PLS is the multivariate version of regression. It uses two different PCA models, one for the X’s and one for the Y’s, and finds the links between the two.

Mathematically, the difference is as follows:

In PCA, we are maximising the variance that is explained by the model.

In PLS, we are maximising the covariance.

PCA vs. PLS


Page 21:

How PLS works (Concept)

• PLS finds a set of orthogonal components that:

– maximize the level of explanation of both X and Y

– provide a predictive equation for Y in terms of the X’s

• This is done by:
– fitting a set of components to X (as in PCA)

– similarly fitting a set of components to Y

– reconciling the two sets of components so as to maximize explanation of X and Y

PLS is also a step-wise process. This is how it works conceptually:

Page 22:

How PLS works (Math)

• X = TP’ + E outer relation for X (like PCA)

• Y = UQ’ + F outer relation for Y (like PCA)

• u_h = b_h t_h   inner relation for the components, h = 1, …, (# of components)

• Weighting factors w are used to make sure dimensions are orthogonal

This is how PLS works mathematically:
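As a sketch (not the module’s own software), the same quantities can be obtained in Python with scikit-learn’s PLSRegression, a NIPALS-based implementation; the data here are made up:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # hypothetical process variables
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(100, 2))

pls = PLSRegression(n_components=2)  # normalises X and Y by default
pls.fit(X, Y)

T = pls.x_scores_    # scores on the X components (outer relation X = TP' + E)
U = pls.y_scores_    # scores on the Y components (outer relation Y = UQ' + F)
P = pls.x_loadings_  # X loadings
Q = pls.y_loadings_  # Y loadings
W = pls.x_weights_   # weighting factors w, keeping the components orthogonal
```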

Page 23:

PLS – the “Inner Relation”

All 3 are solved simultaneously via numerical methods

The way PLS works visually is by “tweaking” the two PCA models (X and Y) until their covariance is optimised. It is this “tweaking” that led to the name partial least squares.

Page 24:

Interpretation of the PLS Components

Interpretation of the PLS results has all the difficulties of PCA, plus one more: making sense of the individual components in both X and Y space.

In other words, for the results to make sense, the first component in X must be related somehow to the first component in Y.

Note that throughout this course, the words “cause” and “effect” are absent. MVA determines correlations ONLY. The only exception is when a proper design-of-experiment has been used.

Here is an example of a false correlation: the birdfeeder remains full of seed all winter, then the seed suddenly disappears in the spring. You conclude that the warm weather made the seeds disintegrate…

Page 25:

Types of MVA Outputs

• Results
– Score Plots
– Loadings Plots

• Diagnostics
– Plot of Residuals
– Observed vs. Predicted
– … (many more)

Already seen…

MVA software generates two types of outputs: results and diagnostics. We have already seen the Score plot and Loadings plot in the food example. Some others are shown on the next few slides.

Page 26:

Residuals

• Also called “Distance to Model” (DModX)
– Contains all the noise
– Definition:

DModX = ( Σ e_ik² / D.F. )^(1/2)

• Used to identify moderate outliers
– Extreme outliers visible on Score Plot

[Figure: DModX[1](Norm) vs. Obs ID (TIME) for “32-months of 1 day.M2 (PLS)”; daily observations from 1999-11-23 to 2002-08-01, y-axis 1-5, with the critical distance line D-Crit(0.05), M2-D-Crit[4] = 1.157. Annotation: “Original observations (next slide)”.]

Page 27:

“Distance to Model”

e_ik, where i = observation and k = variable
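A minimal Python sketch of the DModX calculation; note that the degrees-of-freedom convention used here is an assumption and varies between software packages:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(100, 5))  # residual matrix e_ik from a fitted model

k, A = E.shape[1], 2  # k variables, A components fitted
dof = k - A           # assumed residual degrees of freedom per observation
DModX = np.sqrt((E ** 2).sum(axis=1) / dof)  # one distance-to-model per observation

# Flag moderate outliers, e.g., observations well above the typical distance
moderate_outliers = np.where(DModX > DModX.mean() + 2 * DModX.std())[0]
```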

Page 28:

Observed vs. Predicted

[Figure: YPred[14](53AI034.AI) vs. YVar(53AI034.AI) for “32-months of 1 day.M3 (PLS)”, both axes roughly 150-240; RMSEE = 24.6664; the diagonal line is labelled “IDEAL MODEL”.]

This graph plots the Y values predicted by the model, against the original Y values. A perfect model would only have points along the diagonal line.
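For illustration, a plot of this kind can be generated in Python (hypothetical data, not the mill data shown in the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_obs = rng.uniform(150, 240, size=80)          # hypothetical observed Y values
y_pred = y_obs + rng.normal(scale=15, size=80)  # hypothetical model predictions

plt.scatter(y_obs, y_pred, s=12)
lims = [y_obs.min(), y_obs.max()]
plt.plot(lims, lims, "k--", label="ideal model")  # diagonal: perfect predictions
plt.xlabel("Observed Y")
plt.ylabel("Predicted Y")
plt.legend()
plt.show()
```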

Page 29:

Here is a list of some of the main challenges you will encounter when doing MVA. You have been warned!

• Difficulty interpreting the plots (“like reading tea leaves”)
• Data pre-processing
• Control loops can disguise real correlations
• Discrete vs. averaged vs. interpolated data
• Determining lags to account for flowsheet residence times
• Time increment issues
– e.g., second-by-second values, or daily averages?

Some typical sensitivity variables for the application of MVA to real process data are shown on the next page…

MVA Challenges

Page 30:

Typical Sensitivity Variables

MVA Calculations
– Time step / averages
– Which variables are used
– How many components?
– Data pre-processing
– Treatment of noise/outliers
– PCA vs. PLS

Physical reality
– Which are the X’s and Y’s?
– Sub-sections of flowsheet
– Time lags, mixing & recirculation
– Process/equipment changes
– Seasonal effects

Unmeasured variables
– Known but not measured
– Unknown and unmeasured

Page 31:

End of Tier 1

Congratulations!

Assuming that you have done all the reading, this is the end of Tier 1. No doubt much of this information seems confusing, but things will become clearer when we look at real-life examples in Tier 2.

All that is left is to complete the short quiz that follows…

Page 32:

Tier 1 Quiz

Question 1:

Looking at one or two variables at a time is not recommended, because often variables are correlated. What does this mean exactly?

a) These variables tend to increase and decrease in unison.
b) These variables are probably measuring the same thing, however indirectly.
c) These variables reveal a common, deeper variable that is probably unmeasured.
d) These variables are not statistically independent.
e) All of the above.

Page 33:

Tier 1 Quiz

Question 2:

What is the difference between “information” and “knowledge”?

a) Information is in a computer or on a piece of paper, while knowledge is inside a person’s head.

b) Only scientists have “true” knowledge.
c) Information is mathematical, while knowledge is not.
d) Information includes relationships between variables, but without regard for the underlying scientific causes.
e) Knowledge can only be acquired through experience.

Page 34:

Tier 1 Quiz

Question 3:

Why does MVA never reveal cause-and-effect, unless a designed experiment is used?

a) Cause-and-effect can only be determined in a laboratory.
b) Designed experiments eliminate error.
c) MVA without a designed experiment is only inductive, whereas a cause-and-effect relationship requires deduction.
d) Only effects are measurable.
e) Scientists design experiments to work perfectly the first time.

Page 35:

Tier 1 Quiz

Question 4:

What is the biggest disadvantage to using a “black-box” model instead of one based on first principles?

a) There are no unit operations.
b) The model is only as good as the data used to create it.
c) Chemical reactions and thermodynamic data are not used.
d) A black-box model can never take into account the entire flowsheet.
e) MVA models are linear only.

Page 36:

Tier 1 Quiz

Question 5:

What does a confidence interval tell you?

a) How widely your data are spread out around a regression line.
b) The range within which a certain percentage of sample values can be expected to lie.
c) The area within which your regression line should fall.
d) The level of believability of the results of a specific analysis.
e) The number of times you should repeat your analysis to be sure of your results.

Page 37:

Tier 1 Quiz

Question 6:

When your data were being recorded, one of the mill sensors was malfunctioning and giving you wildly inaccurate readings. What are the implications likely to be for statistical analysis?

a) More square and cross-product terms in the model you fit to the data.

b) Higher mean values than would normally be expected.
c) Higher variance values for the variables associated with the malfunctioning sensor.
d) Different selection of variables to include in the analysis.
e) Bigger residual term in your model.

Page 38:

Tier 1 Quiz

Question 7:

Why does reducing the number of dimensions (more variables to fewer components) make sense from a scientific point of view?

a) The new components might correspond to underlying physical phenomena that can’t be measured directly.

b) Fewer dimensions are easier to view on a graph or computer output.

c) Ockham’s Razor limits scientists to less than five dimensions.
d) The real world is limited to just three dimensions.
e) All of the above.

Page 39:

Tier 1 Quiz

Question 8:

If two points on a score plot are almost touching, does that mean that these two observations are nearly identical?

a) Yes, because they lie in the same position within the same quadrant.

b) No, because of experimental error.
c) Yes, because they have virtually the same effect on the MVA model.
d) No, because the score plot is only a projection.
e) Answers (a) and (c).

Page 40:

Tier 1 Quiz

Question 9:

Looking at the food example, what countries appear to be correlated with high consumption of olive oil?

a) Italy and Spain, and to a lesser degree Portugal and Austria.
b) Italy and Spain only.
c) Just Italy.
d) Ireland and Italy.
e) All the countries except Sweden, Denmark and England.

Page 41:

Tier 1 Quiz

Question 10:

Why does error get relegated to higher-order components when doing PCA?

a) Because Ockham’s Razor says it will.
b) Because the real world has only three dimensions.
c) Because noise is false information.
d) Because MVA is able to correct for poor data.
e) Because noise is uncorrelated to the other variables.