Multivariate Data Analysis in Practice
6th Edition
Supplementary Tutorial Book for Multivariate Data Analysis
Kim H. Esbensen & Brad Swarbrick
2019
Published by CAMO Software AS:
CAMO Software AS
Oslo Science Park
Gaustadalléen 21
0349 Oslo
Norway
Tel: (+47) 223 963 00
The Unscrambler® is a trademark of CAMO Software AS.
Design-Expert® is a trademark of Stat-Ease, Inc.
ISBN 978-82-691104-1-8
© 2019 CAMO Software AS
All Rights reserved. No part of this publication may be reproduced, stored or transmitted, in any
form or by any means, except with the prior permission in writing of the publishers.
Cover art by Gry Andrea Esbensen Norang.
Contents
1. Introduction to this tutorial short book
2. Data sets used in this tutorial short book
2.1. The Jam Data Set (Chapter 2)
2.2. Product Mass Testing and Method Comparison Testing (Chapter 2)
2.3. Beverage Consumption in Europe (Chapter 4)
2.4. Ripeness of Green Peas (Chapter 4)
2.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)
2.6. City Temperatures in Europe (Chapter 4)
2.7. Scaling Process Data (Chapter 5)
2.8. Preprocessing Mid Infrared Spectra of Vegetable Oils (Chapter 5)
2.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)
2.10. The Gluten-Starch Data Set: Preprocessing a Difficult Problem (Chapter 5)
2.11. Octane Number in Gasoline (Chapter 6)
2.12. Alcohols in Water (Chapter 6)
2.13. Detecting Outliers: Troodos (Chapter 6)
2.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)
2.15. Development of a Predictive Model of Octane Number in Gasoline (Chapter 7)
2.16. Prediction of Paper Quality (Chapter 7)
2.17. Prediction of Octane Number in Gasoline (Chapter 7)
2.18. Prediction of Gluten-Starch Mixtures (Chapter 7)
2.19. Raw Material Identification Using Cluster Analysis (Chapter 10)
2.20. Fisher's Iris Classification Data (Chapter 10)
2.21. Classification of Vegetable Oils Using Supervised Classification (Chapter 10)
2.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)
2.23. Understanding a Chemical Manufacturing Process Using Full and Fractional Factorial Designs (Chapter 11)
2.24. Optimisation of Bread Baking Using a Central Composite Design (CCD) (Chapter 11)
2.25. Blending Wines Using a Mixture Design (Chapter 11)
2.26. Blending Fruit Juices Using a Constrained Mixture Design (Chapter 11)
2.27. Fat Content in Fish Using Factor Rotation (Chapter 12)
2.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)
2.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 12)
3. The Unscrambler Environment
3.1. Data Import
3.2. Data Visualization
3.3. Transform
3.4. Analyze
3.5. Predict
4. Overview of the Modelling Process
5. Tutorials
5.1. The Jam Data Set (Chapter 2)
5.1.1. Description of the Data Set
5.1.2. Overview of the Data
5.1.3. Data Visualisation
5.1.4. Descriptive Statistics
5.1.5. Summary
5.2. Product Mass Testing and Method Comparison Testing (Chapter 2)
5.2.1. Description of the Data Set
5.2.2. Setup of the Data Table
5.2.3. Evaluation of the Data
5.2.4. Summary
5.3. Beverage Consumption in Europe (Chapter 4)
5.3.1. Description of the Data Set
5.3.2. Evaluation of the Data
5.3.3. Running a PCA on the Beverage Data
5.3.4. The PCA Overview
5.3.5. Summary
5.4. Ripeness of Green Peas (Chapter 4)
5.4.1. Description of the Data Set
5.4.2. Evaluation of the Data
5.4.3. Descriptive Statistics
5.4.4. Principal Component Analysis of Peas Data
5.4.5. The PCA Overview
5.4.6. Influence Plot for Peas Analysis
5.4.7. Summary
5.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)
5.5.1. Description of the Data Set
5.5.2. Evaluation of the Data
5.5.3. Principal Component Analysis of Raw Vegetable Oil Data
5.5.4. The PCA Overview
5.5.5. Influence Plot for Vegetable Oil Analysis
5.5.6. PCA Projection of Unknown Samples onto Vegetable Oil PCA Model
5.5.7. Summary
5.6. City Temperatures in Europe (Chapter 4)
5.6.1. Description of the Data Set
5.6.2. Evaluation of the Data
5.6.3. Principal Component Analysis of European City Temperature Data
5.6.4. The PCA Overview
5.6.5. Assessment of 1D Loadings of City Temperature Data
5.6.6. The Influence Plot of City Temperature Data for 3 PCs
5.6.7. Recalculate the Model without Belgrade
5.6.8. Summary
5.7. Scaling Process Data (Chapter 5)
5.7.1. Description of the Data Set
5.7.2. Evaluation of the Data
5.7.3. Autoscaling the Data
5.7.4. Summary
5.8. Preprocessing of Mid-Infrared Spectroscopic Data of Vegetable Oils (Chapter 5)
5.8.1. Description of the Data Set
5.8.2. Evaluation of the Data
5.8.3. Data Visualization and Descriptive Statistics
5.8.4. Summary
5.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)
5.9.1. Description of the Data Set
5.9.2. Evaluation of the Data
5.9.3. Data Visualization and Descriptive Statistics
5.9.4. Application of SNV to the Data
5.9.5. Application of Multiplicative Scatter Correction (MSC) to the Data
5.9.6. Application of Derivatives to the Data
5.9.7. Application of First Derivative and SNV
5.9.8. Summary
5.10. The Gluten-Starch Data Set: A Difficult Preprocessing Problem (Chapter 5)
5.10.1. Description of the Data Set
5.10.2. Data Visualization and Descriptive Statistics.
5.10.3. Application of Multiplicative Scatter Correction (MSC)
5.10.4. Application of Extended Multiplicative Scatter Correction (EMSC)
5.10.5. Application of Modified Extended Multiplicative Scatter Correction (mEMSC)
5.10.6. Summary
5.11. Octane Number in Gasoline: Part 1 - PCA of Spectra (Chapter 6)
5.11.1. Description of the Data Set
5.11.2. Data Visualization and Grouping
5.11.3. Principal Component Analysis of Gasoline Spectra
5.11.4. Summary
5.12. Alcohols in Water (Chapter 6)
5.12.1. Description of the Data Set
5.12.2. Data Visualization and Grouping
5.12.3. Principal Component Analysis of Alcohol Spectra
5.12.4. Summary
5.13. Detecting Outliers (Troodos) (Chapter 6)
5.13.1. Description of the Data Set
5.13.2. Data Visualization and Grouping
5.13.3. Principal Component Analysis of Troodos Data
5.13.4. Imputation of Missing Values
5.13.5. Full Interpretation Troodos PCA Model
5.13.6. Summary
5.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)
5.14.1. Description of the Data
5.14.2. Application of Principal Component Regression (PCR) to the Alcohols data set.
5.14.3. Application of Partial Least Squares (PLS) Regression to the Alcohols data set.
5.14.4. Summary
5.15. Development of a Predictive Model Part 2: Octane Number in Gasoline (Chapter 7)
5.15.1. Description of the Data
5.15.2. Application of Partial Least Squares (PLS) Regression to the Octane data set.
5.15.3. Recalculation of Model Without Suspect Samples
5.15.4. Recalculation of the Octane Model Without Selected Variables
5.15.5. Summary
5.16. Prediction of Paper Quality (Chapter 7)
5.16.1. Description of the Data Set
5.16.2. Data Visualization and Grouping
5.16.3. Perform PLS on the Paper Data Set
5.16.4. Recalculate the Paper Model With Important Variables Only
5.16.5. Prediction of New Samples
5.16.6. Summary
5.17. Octane in Gasoline (part 3): Prediction of New Samples Using Various Models (Chapter 7)
5.17.1. Application of the Full Model to the Test Set
5.17.2. Application of the Model Without Outliers to the Test Set
5.17.3. Application of the Model Without Outliers to the Test Set
5.17.4. Summary
5.18. Prediction of Gluten Starch Mixtures (Chapter 7)
5.18.1. Development and Application of PLS to the Raw Data Set
5.18.2. Development and Application of PLS to the MSC Preprocessed Data Set
5.18.3. Development and Application of PLS to the EMSC Preprocessed Data Set
5.18.4. Development and Application of PLS to the mEMSC Preprocessed Data Set
5.18.5. Summary
5.19. Raw Material Classification Using Cluster Analysis (Chapter 10)
5.19.1. Description of the Data Set.
5.19.2. Overview of the Data.
5.19.3. Application of k-Means Clustering to the Data.
5.19.4. Application of Hierarchical Cluster Analysis (HCA) to the Data.
5.19.5. Application of Principal Component Analysis (PCA) to the Data.
5.19.6. Grouping PCA Scores by the Results of Cluster Analysis Methods.
5.19.7. Summary
5.20. Fisher's Iris Data (Chapter 10)
5.20.1. Description of the Data
5.20.2. Data Visualisation
5.20.3. Classification Using k-Means and Hierarchical Cluster Analysis (HCA)
5.20.4. Application of PCA to the Iris Data Set.
5.20.5. Developing a SIMCA Library for the Iris Data
5.20.6. Summary
5.21. Classification of Vegetable Oils Using Supervised Methods (Chapter 10)
5.21.1. Development of PCA Class Models for Vegetable Oils
5.21.2. Classification of Oil Samples Using Partial Least Squares Discriminant Analysis (PLS-DA)
5.21.3. Classification of Vegetable Oils Using Linear Discriminant Analysis (LDA)
5.21.4. Classification of Vegetable Oils Using Support Vector Machine Classification
5.21.5. Summary
5.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)
5.22.1. Description of the Data Set
5.22.2. Building a Design
5.22.3. Summary
5.23. Understanding a Chemical Manufacturing Process Using Designed Experiments (Chapter 11)
5.23.1. Experimental Approach – Define Stage
5.23.2. Analysis of the Fractional Factorial Design.
5.23.3. Extension of the Fractional Factorial Design into a Full Factorial Design.
5.23.4. Summary
5.24. Optimisation of Bread Baking Using a Central Composite Design (Chapter 11)
5.24.1. Optimisation – Define Stage
5.24.2. Optimisation – Design Stage
5.24.3. Joint Optimisation of Two Responses Using Graphical Optimisation.
5.24.4. Summary
5.25. Blending Wines Using a Mixture Design (Chapter 11)
5.25.1. Mixture Design – Design Stage
5.25.2. Mixture Design – Design Analysis
5.25.3. Graphical Optimisation of Wine Preference Criteria.
5.25.4. Summary
5.26. Blending Fruit Juices Using A Constrained Mixture Design (Chapter 11)
5.26.1. Define Stage
5.26.2. Design Stage
5.26.3. Design Table
5.26.4. Design Analysis
5.26.5. Summary
5.27. Fat Content in Fish Using Factor Rotation (Chapter 12)
5.27.1. Visualisation of the Data
5.27.2. PCA of Second Derivative Spectra.
5.27.3. Parsimax Rotation of PC Axes.
5.27.4. Summary
5.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)
5.28.1. Data Visualisation
5.28.2. Principal Component Analysis (PCA) of the UV-Vis Spectra.
5.28.3. Multivariate Curve Resolution (MCR) of the UV-Vis Data.
5.28.4. Summary
5.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 12)
5.29.1. Application of MCR to the NIR Spectra of Fish
5.29.2. Application of PLS Regression to Preprocessed NIR Spectra of Fish
5.29.2.1. No Preprocessing
5.29.2.2. Savitzky-Golay Second Derivative
5.29.2.3. Multiplicative Scatter Correction (MSC)
5.29.2.4. Extended Multiplicative Scatter Correction (EMSC).
5.29.2.5. Standard Normal Variate (SNV).
5.29.2.6. Modified Extended Multiplicative Scatter Correction (mEMSC)
5.29.3. Model Comparisons
5.29.4. Summary
6. Resources
7. Final Words of Wisdom
1. Introduction to this tutorial short book
This tutorial study guide provides a step-by-step procedure for performing the software steps used
to generate the analyses provided in ‘Multivariate Data Analysis’ 6th edition published by Camo.
Once the data sets used in this tutorial have been downloaded, the procedures described can be
followed to see how the final results were arrived at. For the Design of Experiments (DoE) exercises,
a valid copy of the Design Expert package is required. If this package is not part of your Unscrambler
installation or if you do not have a standalone version of Design Expert, then please contact Camo
Analytics for more details.
The tutorials in this short book are best performed using The Unscrambler version 10.5; however,
many of the tutorials can be performed using the 10.3 or 10.4 platforms.
Throughout this short book, a number of the data sets are used in multiple chapters to describe a
‘story’ of the data from preprocessing, to data mining and regression analysis. The next section
describes the motivation behind the use of each of the datasets used in the tutorials and their
relevance in a multivariate data analysis setting.
As always, tutorials are used to gain a better understanding of the functions and special features of
The Unscrambler and Design Expert. When analysing the data in the tutorials, it is important
that you, as a data analyst, translate what you learn to your own applications and build your
own toolkit for data analysis. Applying a tutorial prescriptively to your own datasets is not
recommended. Where possible, the steps in the tutorials follow a ‘Define, Design,
Analyse, Implement’ logic; that is where the prescriptiveness should stop and your own
expertise should come through.
If you perform the tutorials with an open mind for learning, then this tutorial book will open up
many new insights into The Unscrambler and Design Expert that will allow you to progress in your
multivariate analysis journey.
2. Data sets used in this tutorial short book
The datasets used in the tutorials are those that many of the trainers of The Unscrambler around
the world have used in their short courses to describe the power of the Multivariate Analysis and
Design of Experiments methodologies. Up until the 6th edition of the Multivariate Data Analysis
book, the tutorials formed a major part of the book. The authors felt that in the 6th Edition, their
approach to solving data analysis problems should be highlighted to a user so as not to distract from
the learning objectives. This tutorial short book supplements that initial learning with a follow up
‘hands on’ experience with the software. With this in mind, we suggest that you read our approach
in the book first, then use the tutorials to reinforce keystroking in the software and then further
investigate the data using all of the analysis and diagnostic tools in the software.
2.1. The Jam Data Set (Chapter 2)
Highlights the functionality of the ‘Descriptive Statistics’ section of the software. Introduces the use
of ‘Box Plots’ for the analysis of sensory data when applied to the taste and appearance of
raspberries used to make fruit jams.
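For readers who want to see what a box plot summarises outside the software, the following is a minimal sketch in Python using only the standard library. The function name and the scores are hypothetical illustrations and are not taken from the jam data set; The Unscrambler may compute quartiles with a different convention.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the values a box plot draws."""
    # 'inclusive' treats the data as the whole population of interest;
    # other quartile conventions give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical sensory scores for one attribute (illustrative only).
sweetness = [3.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0]
summary = five_number_summary(sweetness)
print(summary)
```

Comparing these five values across attributes gives the same at-a-glance view of location and spread that the box plots in the tutorial provide.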
2.2. Product Mass Testing and Method Comparison Testing (Chapter 2)
Demonstrates the use of the ‘Statistical Tests’ functionality of the software. In this tutorial, the use
of normality testing, tests for equivalence of variances and means will be investigated. The
appropriate use of one sample, two population and paired t-tests will be described.
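As a rough illustration of how the three t-tests differ, the sketch below computes their test statistics in plain Python. It is not The Unscrambler's implementation; the function names and the mass values are hypothetical, and the two-sample version shown is Welch's form, which does not assume equal variances.

```python
import math
import statistics


def one_sample_t(sample, target):
    """t statistic for H0: the population mean equals `target`."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - target) / standard_error


def paired_t(first, second):
    """t statistic for paired measurements (same items measured twice)."""
    differences = [a - b for a, b in zip(first, second)]
    return one_sample_t(differences, 0.0)


def welch_t(first, second):
    """Two-sample t statistic that does not assume equal variances."""
    var_a = statistics.variance(first) / len(first)
    var_b = statistics.variance(second) / len(second)
    return (statistics.mean(first) - statistics.mean(second)) / math.sqrt(var_a + var_b)


# Hypothetical masses (g) of the same four items measured by two methods.
method_a = [10.1, 10.3, 9.9, 10.2]
method_b = [10.0, 10.1, 9.8, 10.0]

print("one-sample t vs 10.0 g:", round(one_sample_t(method_a, 10.0), 3))
print("paired t:", round(paired_t(method_a, method_b), 3))
print("Welch two-sample t:", round(welch_t(method_a, method_b), 3))
```

For these made-up values the paired statistic is considerably larger than the two-sample one, because pairing removes the item-to-item variation from the comparison. This is why the choice of test should follow the way the data were collected.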
2.3. Beverage Consumption in Europe (Chapter 4)
Investigates a data set collected on the beverage consumption of 17 cities located around Europe
and Scandinavia. Demonstrates how the power of Principal Component Analysis (PCA) can be used
to assess the drinking patterns of various demographics and is particularly useful in marketing/
product placement studies.
2.4. Ripeness of Green Peas (Chapter 4)
An oldie, but a goodie. This classical data set has survived many editions of the book and training
courses due to its educational appeal. Uses a set of sensory data attributes to classify green peas.
This data reveals a hidden variable when external information is used and highlights the graphical
ability of The Unscrambler to reveal the hidden structures.
2.5. Classification of Vegetable Oils Using Spectroscopic Methods (Chapter 4)
This is the first example in the book on the application of multivariate methods to spectroscopic
data. PCA is an excellent cluster analysis method and this tutorial shows how the simple collection of
a spectrum from known oil samples can be used to separate the oils into their respective types. This
example forms the basis of the classification methods discussed in chapter 10 of the book.
2.6. City Temperatures in Europe (Chapter 4)
Monthly average temperature data for 26 European cities were tabulated and the regions of their
origin were provided for possible grouping (clustering). Temperature profiles are similar to spectral
profiles in many ways and what may not be obvious to the naked eye is perfectly clear to PCA. This
example also introduces the analyst to the concept of outliers and how to deal with them.
2.7. Scaling Process Data (Chapter 5)
This tutorial describes how to compare three (or more) variables when their natural scales differ
from each other by orders of magnitude. It also demonstrates some of the simple univariate plotting
routines available in The Unscrambler.
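One common way to put such variables on a comparable footing is autoscaling (mean centring followed by division by the standard deviation). The sketch below assumes this is the kind of scaling meant; the variable names and values are hypothetical and are not taken from the tutorial data.

```python
import statistics


def autoscale(columns):
    """Centre each variable on zero and scale it to unit standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.mean(values)
        std_dev = statistics.stdev(values)  # sample standard deviation
        scaled[name] = [(value - mean) / std_dev for value in values]
    return scaled


# Hypothetical process variables whose raw scales differ by orders of magnitude.
process = {
    "temperature_C": [148.0, 150.0, 152.0, 151.0, 149.0],
    "pressure_Pa": [101000.0, 101500.0, 102200.0, 101900.0, 101200.0],
    "catalyst_fraction": [0.0010, 0.0012, 0.0015, 0.0013, 0.0011],
}

scaled = autoscale(process)
for name, values in scaled.items():
    print(name, [round(value, 2) for value in values])
```

After scaling, every variable has mean zero and unit standard deviation, so none of them dominates a plot or a variance-based model simply because of its units.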
2.8. Preprocessing Mid Infrared Spectra of Vegetable Oils (Chapter 5)
Extends the vegetable oil example introduced in chapter 4. Demonstrates how application-specific
preprocessing techniques can reduce physical effects in spectral data and so better reveal the
chemical information in the data.
2.9. Preprocessing of Process Near Infrared Spectra (Chapter 5)
This tutorial investigates the application of preprocessing methods to data collected on a
pharmaceutical formulation during the process of Fluid Bed Drying (FBD). Spectra collected in an FBD
operation are highly affected by physical density and light scattering effects and the use of
application-specific preprocessing methods can minimise this variability to reveal the chemical
information in the data.
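To make the idea of removing scatter effects concrete, the sketch below applies Standard Normal Variate (SNV), one widely used scatter correction. Whether SNV is the method chosen in this tutorial is an assumption; the spectrum values and the offset and gain used to mimic scattering are invented for illustration.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre and scale each spectrum individually."""
    mean = statistics.mean(spectrum)
    std_dev = statistics.stdev(spectrum)
    return [(value - mean) / std_dev for value in spectrum]


# Hypothetical absorbance values at five wavelengths.
base = [0.10, 0.25, 0.60, 0.30, 0.12]

# The same chemistry distorted by a multiplicative and an additive
# effect, which is a simple model of light scattering.
scattered = [1.8 * value + 0.4 for value in base]

print([round(value, 3) for value in snv(base)])
print([round(value, 3) for value in snv(scattered)])
```

Both printed rows coincide, since SNV is unchanged by a positive gain and a constant offset applied to a whole spectrum. Real scatter is rarely this simple, which is why the choice of preprocessing remains application-specific.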
2.10. The Gluten-Starch Data Set: Preprocessing a Difficult Problem (Chapter 5)
This data set is another example of Near Infrared spectroscopy applied to a binary mixture of gluten
and starch in known proportions. Sounds easy… right? This tutorial will highlight the intricacies and
pitfalls to be aware of before the application of any preprocessing method to a data set.
2.11. Octane Number in Gasoline (Chapter 6)
This is another example using Near Infrared spectroscopy, here to determine whether the technique is
capable of detecting the differences between various grades of gasoline. The use of sample grouping
helps to highlight the hidden classes within the data. This data set is used extensively throughout
the book because it shows, in particular, that not all things that look like outliers are outliers.
2.12. Alcohols in Water (Chapter 6)
This data set, again based on Near Infrared Spectroscopy, shows how Principal Component Analysis
can be used to solve the Mixture problem. This data set is based on a type of experimental design
known as a Mixture design and the Scores plot can be used to reveal the structure of the design,
provided the preprocessing method used is correct.
2.13. Detecting Outliers: Troodos (Chapter 6)
Another classical data set that has survived a number of editions of the book. The data is from the
field of Geochemistry, in particular in the Troodos region of Cyprus. This tutorial shows how outliers
can be detected and justifiably removed from the data set. Reanalysis of the data without the
outliers reveals the true structure in the samples.
2.14. Prediction of Alcohol Concentration in Mixtures (Chapter 7)
This tutorial extends the Alcohols in Water example introduced in chapter 6. It shows the analyst
how to develop a Principal Component Regression (PCR) model in The Unscrambler, how to interpret it
and, most importantly, how to validate the model.
2.15. Development of a Predictive Model of Octane Number in Gasoline (Chapter 7)
This tutorial extends the Octane Number in Gasoline example introduced in chapter 6. It shows the
analyst how to develop a Partial Least Squares (PLS) regression model in The Unscrambler and
explains why some visual outliers are not actually outliers. It also shows how to apply PLS
regression models to new data in order to predict new values for unknown samples.
2.16. Prediction of Paper Quality (Chapter 7)
This tutorial presents a set of process variables used in the paper manufacturing industry and
determines whether such variables can be used to predict the quality indicator Print Through, the
amount of visibility of ink from one side of the paper when viewed through the other side. This
tutorial uses PLS regression to model the data and introduces Martens’ Uncertainty Test as a method
for Variable Selection. The resulting model is applied to a separate prediction set, and model
diagnostics are introduced to show how they can be used to determine the quality of a prediction.
2.17. Prediction of Octane Number in Gasoline (Chapter 7)
This tutorial provides an analyst with the steps for evaluating PLS regression models when applied to
new data in order to predict new values for unknown samples for the various models developed in
the tutorial described in 2.15.
2.18. Prediction of Gluten-Starch Mixtures (Chapter 7)
This tutorial is an extension of the tutorial described in section 2.10. PLS models are developed
for the data using the various preprocessing methods, and these models are used to predict the
gluten content in a test set of samples. When all models have been developed, the predictive
ability of each model is compared using the optimal number of factors and a one-factor model.
2.19. Raw Material Identification Using Cluster Analysis (Chapter 10)
One of the most powerful applications of non-destructive spectroscopic methods is their use in the
identification of incoming materials, particularly in highly regulated industries such as the
pharmaceutical and related industries. This tutorial introduces the unsupervised cluster analysis
methods applied to spectra of three different raw materials so that an analyst can investigate the
outputs and graphical capabilities of The Unscrambler.
2.20. Fisher’s Iris Classification Data (Chapter 10)
This is the classical data set used to verify nearly every cluster analysis method developed. The data,
collected by Edgar Anderson and published in the 1930s by Sir Ronald Fisher, were used in an attempt
to classify Iris species by four characteristics: Sepal Length, Sepal Width, Petal Length and Petal
Width. This tutorial investigates the use of unsupervised methods for the classification of the Iris
data and introduces one of the most powerful supervised classification methods, known as Soft
Independent Modelling of Class Analogy (SIMCA).
2.21. Classification of Vegetable Oils Using Supervised Classification (Chapter 10)
This tutorial is an extension of the data set introduced in section 2.8, where the method of infrared
spectroscopy was used to analyse samples of various vegetable oil types. The methods investigated
in this tutorial are SIMCA, Linear Discriminant Analysis (LDA), Partial Least Squares Discriminant
Analysis (PLS-DA) and Support Vector Machines (SVM) Classification.
2.22. Sports Drink Formulation Using Factorial Designs (Chapter 11)
This tutorial provides a first introduction to the development and analysis of a simple Factorial
Design using the Design Expert package. This is an excellent first step to exploring the Design Expert
software and it is suggested this tutorial is performed first, even if you have experience with the
software as it provides much detail on keystroking, which is reduced in later tutorials.
2.23. Understanding a Chemical Manufacturing Process Using Full and Fractional Factorial Designs (Chapter 11)
This tutorial shows how Fractional Factorial Designs can be extended into Full Factorial Designs
without having to repeat any of the initial experiments. The concept of Blocking is demonstrated
and a full analysis of the model with its interpretation is provided.
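The extension of a fraction into a full design can be sketched for the smallest case, a three-factor design run as two half fractions. This is an illustrative assumption about the mechanism, not the design used in the tutorial; the function name and the use of coded levels -1 and +1 are choices made for this sketch.

```python
from itertools import product


def half_fraction(sign):
    """2^(3-1) half fraction: A and B are varied freely, C is set to sign * A * B."""
    return [(a, b, sign * a * b) for a, b in product((-1, 1), repeat=2)]


block_1 = half_fraction(+1)  # the initial experiments (defining relation I = +ABC)
block_2 = half_fraction(-1)  # the complementary fraction, run later (I = -ABC)

# Each run carries its block number so that a shift between the two
# experimental campaigns can be included in the model as a block effect.
combined = [(run, 1) for run in block_1] + [(run, 2) for run in block_2]

full_factorial = set(product((-1, 1), repeat=3))

for run, block in combined:
    print("block", block, "A, B, C =", run)
print("covers full 2^3 design:", set(block_1 + block_2) == full_factorial)
```

None of the first four runs is repeated in the second block, and together the eight runs form the full 2^3 design. In this particular arrangement the block effect is confounded with the ABC interaction, since ABC is +1 throughout the first block and -1 throughout the second.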
2.24. Optimisation of Bread Baking Using a Central Composite Design (CCD) (Chapter 11)
This tutorial introduces one of the most commonly used Optimisation designs known as the Central
Composite Design (CCD). A CCD is a composite of an initial Factorial Design with points that extend
the design outside of the original such that all design points lie on the surface of a sphere. It also
uses centre points in the model where polynomials up to the quartic can be used to analyse the
data. In this case, we investigate how to optimise two attributes of bread individually and then use
the method of Graphical Optimisation to find a space in the design where both parameters are
jointly optimised.
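The geometry described above can be sketched by generating the coded design points of a spherical CCD. The axial distance alpha = sqrt(k) is an assumption made here so that the factorial and axial points share one radius; Design Expert offers several alpha choices, and the function names and number of centre points are hypothetical.

```python
import math
from itertools import product


def spherical_ccd(k, n_center=3):
    """Return (factorial, axial, centre) points of a CCD in coded units."""
    alpha = math.sqrt(k)  # assumed axial distance for a spherical design
    factorial = [tuple(float(level) for level in run)
                 for run in product((-1, 1), repeat=k)]
    axial = []
    for factor in range(k):
        for distance in (-alpha, alpha):
            point = [0.0] * k
            point[factor] = distance
            axial.append(tuple(point))
    center = [tuple([0.0] * k) for _ in range(n_center)]
    return factorial, axial, center


def radius(point):
    """Distance of a design point from the centre of the design."""
    return math.sqrt(sum(coordinate ** 2 for coordinate in point))


factorial, axial, center = spherical_ccd(2)
for label, points in (("factorial", factorial), ("axial", axial), ("centre", center)):
    for point in points:
        print(label, point, "radius =", round(radius(point), 3))
```

With two factors this gives four factorial points, four axial points and the replicated centre point. Every non-centre point lies at the same distance from the centre, which is the spherical property the text refers to.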
2.25. Blending Wines Using a Mixture Design (Chapter 11)
This tutorial introduces the concept of the Mixture Design. This type of design is based on a
Factorial Design; however, the design is naturally Constrained, i.e. the mixture components are all
dependent on each other. We evaluate two responses in this design and use Graphical Optimisation
to determine if a suitable blend of wines can be made (as determined by a trained sensory panel) at
a cost that a consumer will consider acceptable.
2.26. Blending Fruit Juices Using a Constrained Mixture Design (Chapter 11)
This tutorial extends the concepts of the previous tutorial and shows how a Lower Bound on one
component imposes Upper Bounds on the rest of the components in the mixture. When only Lower
Bounds are placed on components, the design is still Simplex Shaped and can be analysed using the
standard methods used for Designed Experiments. In this tutorial, we only model one response and
to find the most acceptable blends, we use Numerical Optimisation to maximise the inclusion of
some components and minimise the addition of others, while at the same time, ensuring the final
blend is still acceptable to the consumer. This tutorial is an adaptation of a classical problem
described in the book by John Cornell, Mixture Designs.
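The way Lower Bounds imply Upper Bounds follows from the requirement that the proportions sum to one, and can be sketched as below. The juice names and bound values are hypothetical and are not taken from the tutorial; the pseudo-component rescaling is shown only as one standard way of mapping the constrained region back onto a full simplex.

```python
def implied_upper_bounds(lower, total=1.0):
    """Upper bound of each component when the others sit at their lower bounds."""
    slack = total - sum(lower.values())
    if slack < 0:
        raise ValueError("lower bounds exceed the mixture total")
    return {name: bound + slack for name, bound in lower.items()}


def to_pseudo_components(blend, lower, total=1.0):
    """Rescale a blend so that the constrained region becomes a full simplex."""
    slack = total - sum(lower.values())
    return {name: (blend[name] - lower[name]) / slack for name in blend}


# Hypothetical lower bounds on three juices (proportions of the blend).
lower = {"orange": 0.30, "apple": 0.20, "pineapple": 0.00}

upper = implied_upper_bounds(lower)
print({name: round(bound, 2) for name, bound in upper.items()})

blend = {"orange": 0.50, "apple": 0.30, "pineapple": 0.20}
pseudo = to_pseudo_components(blend, lower)
print({name: round(value, 2) for name, value in pseudo.items()})
```

Here the 0.30 and 0.20 lower bounds leave only 0.50 of the blend free, so pineapple can never exceed 0.50 even though no upper bound was set on it. The pseudo-components again sum to one, which is why a design with lower bounds only remains simplex shaped.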
2.27. Fat Content in Fish Using Factor Rotation (Chapter 12)
This tutorial is a first introduction to PCA Rotation (also called Factor Rotation) in order to find the
Simple Structure in the data set. The data set consists of NIR spectra collected in transmission mode
on fish samples with associated Fat reference measurements. We show how PCA can provide
abstract components that describe the variance in the data set; however, PCA Rotation can be used
to find a set of factors that better describe the chemistry of the samples.
Even though the new factors are orthogonal to each other, they are no longer orthogonal to the
original PC axes used to describe them, therefore the new factors are not independent of each
other.
2.28. Chemical Reaction Monitoring Using Multivariate Curve Resolution (MCR) (Chapter 12)
This tutorial introduces the concepts of Multivariate Curve Resolution (MCR) applied to a set of
Ultraviolet-Visible (UV-VIS) spectra collected from a chemical reaction. We show how PCA can
reveal the majority of the information in the data set; however, MCR is better able to provide
chemically meaningful information from the data.
In particular, MCR is able to find a reaction intermediate profile in the data, consistent with the
kinetics of the reaction, and also to provide an estimated pure spectrum of the intermediate, which
could not otherwise have been physically or chemically separated from the reaction mixture.
2.29. Combining MCR and PLS to Solve Difficult Problems (Fat in Fish Analysis) (Chapter 11)
This tutorial uses the NIR spectra from the Fat in Fish data described in 2.27. Here, MCR is used to
find a pure spectrum of Fat in the spectra. This estimated spectrum is used as a Good Spectrum in
Modified Extended Multiplicative Scatter Correction (mEMSC). A number of typical preprocessing
methods used for such data are applied and the results are compared to show which methods perform
best.