05-08-17 SIMCA-P Getting started.ppt 1 (29) www.umetrics.com

What is Multivariate Analysis

• Multivariate analysis is an effective way to summarize a data table with many variables by creating a few new variables containing most of the information. These new variables are then used for problem solving and display, e.g., classification, relationships, control charts, and more.

• The new variables, the scores, denoted by t, are created as weighted linear combinations of the original variables. Each observation has t-values.

• PCA, the basic MV method, summarizes one data table.

• Plotting the scores (t’s) gives an overview of the observations (objects).

• PLS simultaneously summarizes two data tables, X (the predictor variables) and Y (the response variables), in order to develop a relationship between them.

• PCA and PLS are called projection methods.
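
As a rough illustration of scores being weighted linear combinations of the original variables, the sketch below computes PCA scores with NumPy. The data, the variable names and the choice of two components are assumptions made for the example; this is not SIMCA-P's own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data table: 20 observations (rows) x 5 variables (columns).
X = rng.normal(size=(20, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=20)  # make two variables correlated

# Center each variable and scale it to unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via the singular value decomposition: Xs = U * diag(S) * Vt.
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

A = 2            # number of components kept (arbitrary choice here)
P = Vt[:A].T     # loadings p, one column per component (K x A)
T = Xs @ P       # scores t: each column is a weighted sum of the variables

print(T.shape)   # one row of t-values per observation
```

Plotting the first column of `T` against the second would give the score plot described above.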


What is a Projection?
Reduction of dimensionality, model in latent variables

• Algebraically
– Summarizes the information in the observations as a few new (latent) variables

• Geometrically
– The swarm of points in a K-dimensional space (K = number of variables) is approximated by a (hyper)plane, and the points are projected onto that plane.
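
The geometric view can be sketched in a few lines: a swarm of points in K = 3 dimensions is approximated by a plane, and each point is projected onto it. The random data and the names `X_hat` and `E` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))       # hypothetical swarm of 30 points, K = 3
Xc = X - X.mean(axis=0)            # move the origin to the center of the swarm

# The first two right singular vectors span the best-fitting plane.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                       # K x 2 basis of the plane

T = Xc @ P                         # coordinates of each point within the plane
X_hat = T @ P.T                    # projected points, expressed in the original space
E = Xc - X_hat                     # what the plane does not capture (residuals)
```

The residuals `E` are orthogonal to the plane, so the total sum of squares splits into a modelled part and a residual part.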


Notation
Each obs has values of t (and u) – Each variable has values of p (and w and c)

• t: the X scores; the new summarizing variables (coordinates in the hyperplane of X-space)

• u: the Y scores in PLS; the new summarizing variables (coordinates in the hyperplane of Y-space, when Y is multidimensional)

• p: the PC loadings. These are the weights that in PCA combine the original variables in X to form the new variables, scores t.

• w*: the PLS weights. These are the weights that in PLS combine the original variables in X to form the new variables, scores t.

• c: the weights used to combine the Y's to form the scores u.
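
To make the notation concrete, here is a minimal sketch of one PLS component computed with the classical NIPALS iteration, returning t, u, w, c and p. The function name, tolerance and test data are hypothetical, and SIMCA-P's own algorithm may differ in its details. For the first component, w* equals w.

```python
import numpy as np

def pls_first_component(X, Y, tol=1e-12, max_iter=500):
    """One PLS component by NIPALS; X and Y must be column-centered 2-D arrays."""
    u = Y[:, [0]]                      # start from the first Y column
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)        # X weights
        w /= np.linalg.norm(w)         # normalize to unit length
        t = X @ w                      # X scores
        c = Y.T @ t / (t.T @ t)        # Y weights
        u_new = Y @ c / (c.T @ c)      # Y scores
        converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
        u = u_new
        if converged:
            break
    p = X.T @ t / (t.T @ t)            # X loadings
    return t, u, w, c, p

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))           # hypothetical predictors
B = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, -1.0], [0.0, 0.0]])
Y = X @ B + 0.1 * rng.normal(size=(25, 2))   # hypothetical responses
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

t, u, w, c, p = pls_first_component(X, Y)
```

Each observation gets one value of t and one of u; each X variable gets one value of w and p; each Y variable gets one value of c.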


Notation
Each obs has values of t (and u) – Each variable has values of p (and w and c)

• One component consists of one t and one p (PCA) or t, p, w, u, c (PLS). The total number of components is A.

• Model: The data are approximated by a plane or hyperplane (the model) with as many dimensions as components extracted.

• DModX: also called Distance to the Model, is the distance of a given observation to the model plane.

• T2: Hotelling’s T2 is a combination of all the scores (t) of all A components. T2 measures how far an observation is from the center of a PC or PLS model.
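
The two diagnostics can be sketched for a PCA model as follows. This assumes the common definitions: T2 as the sum over components of the squared score divided by the score variance, and the distance to the model as a residual standard deviation per observation. SIMCA-P applies its own normalization and correction factors to DModX, which are not reproduced here, and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, A = 40, 6, 2                      # observations, variables, components
X = rng.normal(size=(N, K))             # hypothetical data table
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:A].T                            # loadings
T = Xs @ P                              # scores
E = Xs - T @ P.T                        # residuals: part of X outside the model plane

# Hotelling's T2: squared scores, each weighted by the variance of its component.
t2 = np.sum(T**2 / T.var(axis=0, ddof=1), axis=1)

# Distance to the model: residual standard deviation of each observation
# (unnormalized; an assumption, not SIMCA-P's exact DModX).
dmodx = np.sqrt(np.sum(E**2, axis=1) / (K - A))
```

T2 measures distance within the model plane, while the residual distance measures how far an observation lies from the plane, so the two are complementary.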


Notation

• R2X: The fraction of the variation of the X variables explained by the model.

• R2Y: The fraction of the variation of the Y variables explained by the model.

• Q2X: The fraction of the variation of the X variables predicted by the model, as estimated by cross-validation.

• Q2Y: The fraction of the variation of the Y variables predicted by the model, as estimated by cross-validation.
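
The difference between explained (R2) and predicted (Q2) variation can be illustrated with a one-component PLS model for a single y. R2Y is computed on the data used for fitting, and Q2Y from cross-validated predictions. The helper names, the data and the choice of seven cross-validation groups are assumptions for this sketch.

```python
import numpy as np

def fit_pls1(X, y):
    """One-component PLS for a single y; returns centering values, weights w and coefficient c."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w
    c = (yc @ t) / (t @ t)
    return x_mean, y_mean, w, c

def predict_pls1(model, X):
    x_mean, y_mean, w, c = model
    return y_mean + (X - x_mean) @ w * c

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))                                   # hypothetical X
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + 0.3 * rng.normal(size=60)

ss_tot = np.sum((y - y.mean())**2)

# R2Y: fraction of the variation of y explained on the fitted data.
r2y = 1 - np.sum((y - predict_pls1(fit_pls1(X, y), X))**2) / ss_tot

# Q2Y: fraction predicted for observations left out during fitting.
n_groups = 7
groups = np.arange(len(y)) % n_groups
press = 0.0
for g in range(n_groups):
    held_out = groups == g
    model = fit_pls1(X[~held_out], y[~held_out])
    press += np.sum((y[held_out] - predict_pls1(model, X[held_out]))**2)
q2y = 1 - press / ss_tot

print(f"R2Y = {r2y:.3f}, Q2Y = {q2y:.3f}")
```

A Q2 much lower than R2 suggests that the model fits the training data better than it predicts new observations.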


MVA – SIMCA Road Map
Methods available

• Preprocessing; trimming and Winsorizing (take away extremes)

• Principal Components Analysis (PCA; overview of data)

• Projection to Latent Structures (PLS; relationships X↔Y)

• SIMCA classification

• PLS-discriminant analysis (classification)

• Hierarchical PCA and PLS

• Predictions and classification of new data using any model
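
As an illustration of the first preprocessing item, the sketch below shows one common reading of the two operations: Winsorizing replaces extreme values with the limit values, while trimming marks them as missing. The percentile limits and function names are assumptions, not SIMCA-P defaults.

```python
import numpy as np

def winsorize(x, lower_pct=5.0, upper_pct=95.0):
    """Replace values beyond the percentile limits with the limits themselves."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def trim(x, lower_pct=5.0, upper_pct=95.0):
    """Mark values beyond the percentile limits as missing (NaN)."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    out = x.astype(float)              # astype returns a copy, x is left untouched
    out[(x < lo) | (x > hi)] = np.nan
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical variable with one extreme
x_wins = winsorize(x, 0, 80)
x_trim = trim(x, 0, 80)
```

Both functions return new arrays, in keeping with the principle that the raw data are never modified.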


MVA – SIMCA Road Map
Data set = all data; Work set = working copy of data

1. Start a project
File New
Read Data File
Specify Label Cols & Rows

2. Look at the data
Data set
Quick Info; Variables or Obs.
Preprocessing, Trim, etc.

3. Prepare a work copy
Workset
variables, observations
Preprocessing, Class spec.

4. Fit the model
Analysis
Autofit
or fast button

5. Plot results
Analysis
Scores, Loadings
Distance to Model

6a. Outliers in scores
Polish data
Prepare new workset
Graphically or via Workset

6b. No outliers in scores
Continue
Interpret model (plots)
Relate to Objective

7. New data
Predictions
Select Pred.set (observations)
T_pred, Y_pred, DModX, etc.

Work main menus from left to right and pop-up menus from top to bottom.

Plot / List allows you to plot or list anything non-standard, not found under Analysis.


Steps in using SIMCA-P using the wizard

• Start a new project and import the data set

• Use the workset wizard to guide you through building the workset and fitting the model

• Use the report writer to walk through the model results and interpretation

• When displaying SIMCA-P plots, always use the Analysis Advisor to guide you.


Workset wizard on



Workset wizard


Autotransform variables
To transform all variables that need transformation, mark the check box


Automatic creation of classes for classification or discrimination


Selection and Fit of model


Report writer
Walks you through the model results with interpretation: File | Generate Report


Steps in Using SIMCA-P, Advanced Mode

• Start a new project and import the data set

• Explore and preprocess the data

• Make working copy of selected data (workset) for model building

• Specify model type and fit it to the workset

• Review fit (plots, diagnostics, coefficients, etc.)

• Predictions

• Generate Report


1a. File New
Starting a new project

• Select the data file containing the raw data of the project
– directory, file type (XLS, DIF, TXT, …), file name

• A Wizard opens (see next page) allowing you to specify (optionally) the row containing the Variable names, and (optionally) the columns with the Obs. Numbers and Names

• Here (Commands) you can also do additional things such as
– transposing the input data matrix

• Use simple mode with workset wizard

• At the last Wizard page, you can (optionally) specify another name and directory for the project.

• A map of the missing data is shown

• The Wizard finishes and puts you in the Simca-window

• A starting workset (M1, all data, all X's, UV-scaled) is ready
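
Two of the items above, the missing-data map and UV (unit variance) scaling, can be sketched with NumPy. The small table is hypothetical, and the sketch assumes that UV scaling means centering each variable and dividing by its standard deviation, computed over the non-missing values.

```python
import numpy as np

# Hypothetical raw table with two missing values (NaN).
X = np.array([[1.0, 200.0, np.nan],
              [2.0, 180.0, 0.5],
              [3.0, np.nan, 0.7],
              [4.0, 220.0, 0.9]])

# Map of the missing data: True where a value is absent.
missing_map = np.isnan(X)

# UV scaling: subtract the column mean and divide by the column standard
# deviation, ignoring missing values.
mean = np.nanmean(X, axis=0)
std = np.nanstd(X, axis=0, ddof=1)
X_uv = (X - mean) / std
```

After scaling, every variable has mean 0 and standard deviation 1, so variables measured in different units carry equal weight in the model.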


1b. The second screen of the Wizard


2. Looking at the data

• With the data set table open (Data set edit):

• Quick Info (both var and obs windows can be open)
– variables
– observations

• Moving the cursor in the data set table up and down, or sideways, changes the displayed variable and observation

• In the quick info options you can specify what you want to look at (histograms, auto-correlations, …), as well as which items should be the basis for the plots


View variables or Observations, Trim, etc.
Quick Info


3. Prepare a work copy: The Workset
Simple Mode with guidance, or Advanced Mode

• In Workset, you prepare a working copy of the part of the data you will analyze, i.e., use as the basis of your model.

• Here you specify transformation, scaling, and roles of variables (X or Y or excluded).

• Also, you select the observations (your “training set”).

• You can start with the previous workset (Workset / New as model xx) and then modify it, e.g., excluding observations.

• Whatever you do in Workset does NOT touch the raw data.

• Note that outliers are just specified as “not included” in the next workset (the “polished” data). Outliers are NEVER removed from the raw data set.
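
The workset idea can be mimicked outside SIMCA-P by keeping the raw table untouched and describing the workset as a selection on top of it. The array, the variable roles and the excluded observation below are hypothetical.

```python
import numpy as np

raw = np.arange(20, dtype=float).reshape(5, 4)   # hypothetical raw data set, 5 obs x 4 vars

# Roles of variables: columns 0-2 are X, column 3 is Y.
x_cols, y_cols = [0, 1, 2], [3]

# Observations included in the training set; the fourth one is marked
# "not included" rather than deleted.
include_obs = np.ones(raw.shape[0], dtype=bool)
include_obs[3] = False

# The workset is a copy built from the selection, so the raw data stay as they were.
X_work = raw[include_obs][:, x_cols]
Y_work = raw[include_obs][:, y_cols]
```

A new "polished" workset is then just another `include_obs` mask applied to the same raw table.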


Workset: two Modes, Simple and Advanced


4. Analysis
Fit the Model to the Workset Data

• Either menu “Analysis / Autofit” or Fast Button

• A model with an appropriate number of components is found

– If nothing happens, get the first two components (also menu or fast button)

• A table appears showing the model, component by component.

• More components can be added (menu or fast button)

• Double click on a model to specify a title


5. Plot results
Analysis / menu (or fast buttons)

• Summary / X/Y-Overview shows R2 and Q2 for all variables

• Scores – scatter plot, t1-t2 and t1-u1 & t2-u2 (PLS)

• Loadings – scatter plot (p1-p2 for PCA, wc1-wc2 for PLS)

• Distance to Model – line plot

• Contribution plots to interpret interesting observations, e.g. outliers, jumps, …

• For all plots, the right mouse button (Properties) allows choice of plot markers, and more

• The graphical tool box allows further modifications


6a. Outliers were seen in the score plot
(well outside the Hotelling ellipse)

• Start another workset (either from Workset / New as model xx, or using the graphical tool box to remove outliers from the score plot)

• Note that outliers should NOT be deleted from the data by Edit/Data set

• When the new workset is all right, return to “4. Analysis” to fit a new model to the new workset (fast button or Analysis/Autofit)


6b. No outliers were seen in the score plots
(or they have been excluded, and the score plots now look all right)

• Now, interpret the model

• Look at “patterns”, trends, etc., in the score plots

• Inspect the loading plots to interpret the above patterns

• Look at DModX

• What do these patterns say about the objective of the investigation?


Analysis Advisor to understand and interpret model results


7. Predictions
New Data, Prediction Set

• Under Predictions, specify the set of observations for which predictions will be made, the prediction set

• New data can be read in as a secondary data set (File / Import) and predictions can be made for these

• Prediction set / Complement WS gives a prediction set with those observations that were not in the training set

• Predictions / Y-predicted, T-predicted, etc., calculates and displays the predicted values accordingly
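
The prediction step can be sketched for a PCA model: observations that were not in the workset are scaled with the training-set parameters, projected onto the model to get predicted scores, and given a distance to the model. The data, the split and the unnormalized distance are assumptions for illustration and do not reproduce SIMCA-P's exact calculations.

```python
import numpy as np

rng = np.random.default_rng(4)
X_all = rng.normal(size=(50, 4))          # hypothetical data set
in_workset = np.arange(50) < 35           # first 35 observations form the training set

X_train = X_all[in_workset]
X_pred = X_all[~in_workset]               # prediction set = complement of the workset

# Fit the model on the workset only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd((X_train - mean) / std, full_matrices=False)
A = 2
P = Vt[:A].T

# Apply the training-set scaling to the new observations, then project them.
Xp = (X_pred - mean) / std
T_pred = Xp @ P                           # predicted scores
E_pred = Xp - T_pred @ P.T                # residuals of the new observations
dmodx_pred = np.sqrt(np.sum(E_pred**2, axis=1) / (X_all.shape[1] - A))
```

New observations with an unusually large distance to the model do not resemble the training set, so their predictions should be treated with caution.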


8. Generate the report, with customizable templates


Use of these slides

• You may use any or all of these slides in your own presentations, provided that you keep (and do not modify) the Umetrics logo and web reference

• If you have any problems with the software, or with understanding of the material, please e-mail us at

[email protected]