Multivariate Analysis Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Post on 29-Dec-2015

Multivariate Analysis

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Everitt, BS, and G Dunn (2001) Applied Multivariate Data Analysis, London: Arnold.

• Everitt, BS (2005) An R and S-PLUS® Companion to Multivariate Analysis, London: Springer.

Roadmap

• PBL group assignments

• Multivariate data graphics tutorials

• Testing distributional assumptions

• Principal components analysis

• Cluster analysis

• Summary

PBL group assignments

• Two groups

Multivariate data graphics tutorials

• Available on the module website

• Cover both standard and lattice graphics

Testing distributional assumptions

• For these techniques to work, the data must follow a multivariate normal distribution.

• There are two ways of testing this:

– Examine each variable separately (marginal normality of every variable does not imply that the data follow a multivariate normal distribution).

– Convert each observation to a single number (a generalised distance) and plot the ordered distances against the quantiles of the appropriate chi-squared distribution.

Separate Examination

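• X has two columns, and the combined data are bivariate normal:

par(mfrow=c(1,2))   # two plots side by side
qqnorm(X[,1], ylab="Ordered observations")
qqline(X[,1])       # reference line for the first variable
qqnorm(X[,2], ylab="Ordered observations")
qqline(X[,2])       # reference line for the second variable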
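The slide does not show how X was obtained. As a minimal sketch, assuming simulated data rather than the lecture's data set, a bivariate normal sample can be built in base R from independent normals and a Cholesky factor. The sample size, correlation of 0.6, and the name p_values are illustrative assumptions.

```r
set.seed(1)                                    # reproducible draw
Sigma <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)   # assumed covariance matrix
Z <- matrix(rnorm(400), ncol = 2)              # 200 rows of independent N(0, 1) pairs
X <- Z %*% chol(Sigma)                         # rows now have covariance Sigma

# Shapiro-Wilk test on each column: a numerical companion to the Q-Q plots
p_values <- apply(X, 2, function(col) shapiro.test(col)$p.value)
```

A large p-value for each column is consistent with marginal normality only; it says nothing about the joint distribution.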

Comparison to a chi-squared distribution

• Same data, using chisplot, available at http://biostatistics.iop.kcl.ac.uk/publications/everitt/

par(mfrow=c(1,1))   # back to a single plot
chisplot(X)
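If chisplot is not to hand, the same idea can be sketched in base R. This is not Everitt's function: chisq_plot is a hypothetical helper, and X_demo is simulated purely for illustration. It computes the squared generalised (Mahalanobis) distance of each row from the sample mean and plots the ordered distances against chi-squared quantiles with as many degrees of freedom as there are variables.

```r
# Hypothetical helper, not Everitt's chisplot(): a base-R chi-squared plot
chisq_plot <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  q <- ncol(X)
  # Squared generalised distance of each row from the sample mean vector
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  # Chi-squared quantiles (q degrees of freedom) at standard plotting positions
  quantiles <- qchisq(ppoints(n), df = q)
  plot(quantiles, sort(d2),
       xlab = "Chi-squared quantile", ylab = "Ordered squared distance")
  abline(0, 1)  # points close to this line are consistent with multivariate normality
  invisible(data.frame(quantile = quantiles, distance = sort(d2)))
}

set.seed(2)
X_demo <- matrix(rnorm(400), ncol = 2)  # simulated data, for illustration only
res <- chisq_plot(X_demo)
```

Systematic departures from the line, especially in the upper tail, suggest outliers or non-normality.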

Principal components analysis (PCA)

• Describe the variation of a set of multivariate data in terms of a set of uncorrelated variables, each a linear combination of the original variables.

• The goal is to reduce the number of meaningful variables to a small number that summarise the data set.

• Deals with highly correlated explanatory variables.

• Representative of projection pursuit methods.
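A minimal PCA sketch using prcomp from base R. The USArrests data set ships with R and is used here only as an example; it is not the lecture's data. Setting scale. = TRUE standardises the variables first, which is equivalent to working from the correlation matrix.

```r
# USArrests ships with R; it stands in for the lecture's data
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)      # proportion of variance explained by each component
pca$rotation      # loadings: coefficients of each linear combination
scores <- pca$x   # the data expressed in the new variables

# The component scores are uncorrelated: off-diagonal correlations vanish
score_cor <- cor(scores)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))
```

If the first one or two components account for most of the variance in summary(pca), they can serve as the small set of summary variables described above.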

Cluster analysis

• A tool for classifying a phenomenon that sorts the samples into a small number of groups or clusters, usually non-overlapping.

• These clusters may not be unique.

– Predictive clustering

– Clustering based on causation

• Hence a cluster analysis is neither true nor false, but is simply useful.

Cluster analysis approaches

• Agglomerative hierarchical clustering (fusion from the bottom-up)

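• K-means type methods (partition from the top down)

• Classification maximum likelihood methods (assume a model for the shape of the clusters)

• Or you can simply use the tree library.

library(tree)
# ozone.pollution must already be loaded as a data frame with a column named ozone
model <- tree(ozone ~ ., data = ozone.pollution)
plot(model)   # draw the tree
text(model)   # label the splits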
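The first two approaches can be sketched with base R alone. USArrests again stands in for real data, and the choice of three clusters and of complete linkage are assumptions made for illustration. Classification maximum likelihood methods need an add-on package and are not shown.

```r
x <- scale(USArrests)            # standardise so no variable dominates the distances

# Agglomerative hierarchical clustering: start from single samples and fuse upward
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)   # cut the dendrogram into 3 groups (k is assumed)

# K-means: partition the samples into k groups around k centroids
set.seed(3)                      # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate the two partitions to see where they agree
agreement <- table(hierarchical = hc_groups, kmeans = km$cluster)
```

The two methods need not produce the same groups, which illustrates the earlier point that clusters may not be unique.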

Summary

• Multivariate statistics is usually done from the point of view that there are no laws of scientific inference—‘anything goes’.

• First, you explore the data to come up with hypotheses—the models.

• Then you confirm the models on a second data set.

• If you have a single data set, split it into two parts, one for exploration and one for confirmation.

• Good data analysis is based on the skilful interpretation of evidence and the subsequent development of hunches.
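Splitting a single data set can be sketched as follows. The even 50/50 split and the use of USArrests as example data are assumptions; the slides do not prescribe a ratio.

```r
set.seed(4)                                     # reproducible split
n <- nrow(USArrests)                            # example data shipped with R
explore_idx <- sample(n, size = floor(n / 2))   # random half of the row indices

explore <- USArrests[explore_idx, ]             # form hypotheses on this half
confirm <- USArrests[-explore_idx, ]            # test them on the held-out half
```

The confirmation half should stay untouched until the models from the exploration half are fixed.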