
Post on 29-Dec-2015


Multivariate Analysis

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Everitt, BS, and G Dunn (2001) Applied Multivariate Data Analysis, London: Arnold.

• Everitt, BS (2005) An R and S-PLUS® Companion to Multivariate Analysis, London: Springer.

Roadmap

• PBL group assignments
• Multivariate data graphics tutorials
• Testing distributional assumptions
• Principal components analysis
• Cluster analysis
• Summary

PBL group assignments

• Two groups

Multivariate data graphics tutorials

• Available on the module website
• Covers both standard and lattice graphics

Testing distributional assumptions

• For these techniques to work, the data must follow a multivariate normal distribution.

• There are two ways of testing this:
– Examine each variable separately (passing this check does not imply that the data are jointly multivariate normal).
– Convert each observation to a single number (a generalised distance) and plot the ordered distances against the quantiles of the appropriate chi-squared distribution.

Separate Examination

• X has two columns, and the combined data are bivariate normal:

    par(mfrow = c(1, 2))  # two plots side by side
    qqnorm(X[, 1], ylab = "Ordered observations")
    qqline(X[, 1])
    qqnorm(X[, 2], ylab = "Ordered observations")
    qqline(X[, 2])
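The slide assumes that X already exists. A minimal sketch of how such a matrix could be simulated with base R, together with a numerical check of each column, is given below. The sample size, the correlation of 0.6, the seed, and the use of shapiro.test are illustrative assumptions, not part of the original slides.

```r
# Simulate bivariate normal data (assumed parameters, for illustration only).
set.seed(1)
n <- 200
Sigma <- matrix(c(1, 0.6,
                  0.6, 1), nrow = 2)   # assumed covariance matrix

# Independent standard normals, then impose the covariance via a Cholesky factor.
Z <- matrix(rnorm(n * 2), ncol = 2)
X <- Z %*% chol(Sigma)

# Shapiro-Wilk test on each column: a numerical complement to the Q-Q plots.
# Like the plots, it only checks the margins, not joint normality.
p_values <- apply(X, 2, function(col) shapiro.test(col)$p.value)
print(p_values)
```

Small p-values would cast doubt on the normality of that column; large ones do not prove the data are jointly normal.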

Comparison to a chi-squared distribution

• Same data, using chisplot, available at http://biostatistics.iop.kcl.ac.uk/publications/everitt/

    par(mfrow = c(1, 1))  # back to a single plot
    chisplot(X)
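If chisplot is not to hand, the idea behind a chi-squared plot can be sketched with base R alone. The helper below, chisq_plot_sketch, is a hypothetical name and is not Everitt's function; it only assumes that the generalised distance is the squared Mahalanobis distance from the sample mean, which is compared with chi-squared quantiles on p degrees of freedom.

```r
# Hypothetical helper: a sketch of a chi-squared plot, not Everitt's chisplot().
chisq_plot_sketch <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Squared generalised (Mahalanobis) distance of each row from the mean.
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  # Matching quantiles of the chi-squared distribution with p degrees of freedom.
  q <- qchisq(ppoints(n), df = p)
  plot(q, sort(d2),
       xlab = "Chi-squared quantile",
       ylab = "Ordered squared distance")
  abline(0, 1)  # points near this line are consistent with multivariate normality
  invisible(data.frame(quantile = q, distance = sort(d2)))
}

# Illustration on simulated data (assumed, not from the slides).
set.seed(2)
X_demo <- matrix(rnorm(400), ncol = 2)
res <- chisq_plot_sketch(X_demo)
```

Points that bend away from the line, especially in the upper tail, suggest outliers or a departure from multivariate normality.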

Principal components analysis (PCA)

• Describe the variation of a set of multivariate data in terms of a set of uncorrelated variables, each a linear combination of the original variables.

• The goal is to reduce the number of meaningful variables to a small number that summarise the data set.

• Deals with highly correlated explanatory variables.
• Representative of projection pursuit methods.
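A minimal PCA sketch using prcomp from base R follows. The USArrests data set (shipped with R) is an assumption chosen only for illustration; the slides do not name a data set. Variables are standardised first because they are measured on different scales.

```r
# PCA on a built-in data set (illustrative choice, not from the slides).
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: each component is a linear combination of the original variables.
print(pca$rotation)

# Component scores are uncorrelated with one another.
scores <- pca$x
print(round(cor(scores), 10))
```

Keeping only the first few columns of scores gives the reduced set of variables that summarises the data.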

Cluster analysis

• A tool for classifying a phenomenon that sorts the samples into a small number of groups or clusters, usually non-overlapping.

• These clusters may not be unique.
– Predictive clustering
– Clustering based on causation

• Hence a cluster analysis is neither true nor false, but is simply useful.

Cluster analysis approaches

• Agglomerative hierarchical clustering (fusion from the bottom up)
• K-means type methods (partition from the top down)
• Classification maximum likelihood methods (assume a model for the shape of the clusters)
• Or you can simply use the tree library:

    library(tree)
    # assumes the ozone.pollution data frame is already loaded
    model <- tree(ozone ~ ., data = ozone.pollution)
    plot(model)
    text(model)
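The first two approaches can be sketched with base R functions hclust and kmeans. The data set (USArrests), the choice of three clusters, average linkage, and the seed are all assumptions made for illustration; none of them comes from the slides.

```r
# Standardise so that no single variable dominates the distances.
X_scaled <- scale(USArrests)

# Agglomerative hierarchical clustering, then cut the tree into 3 groups.
hc <- hclust(dist(X_scaled), method = "average")
hc_groups <- cutree(hc, k = 3)
plot(hc, cex = 0.6)  # dendrogram

# K-means with 3 centres; several random starts reduce the risk of a poor solution.
set.seed(3)
km <- kmeans(X_scaled, centers = 3, nstart = 20)

# Cross-tabulate the two solutions: they need not agree.
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```

Disagreement between the two partitions illustrates the earlier point that clusters may not be unique.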

Summary

• Multivariate statistics is usually done from the point of view that there are no laws of scientific inference—‘anything goes’.

• First, you explore the data to come up with hypotheses—the models.

• Then you confirm the models on a second data set.
• If you have a single data set, split it into two parts, one for exploration and one for confirmation.
• Good data analysis is based on the skilful interpretation of evidence and the subsequent development of hunches.
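Splitting a single data set into exploration and confirmation halves could look like the sketch below. The data set, the 50/50 split, and the seed are illustrative assumptions.

```r
# Random split of one data set into two non-overlapping parts.
set.seed(4)
n <- nrow(USArrests)
explore_idx <- sample(n, size = floor(n / 2))

explore <- USArrests[explore_idx, ]   # use this half to form hypotheses
confirm <- USArrests[-explore_idx, ]  # keep this half untouched until confirmation
```

The confirmation half should only be examined once the models suggested by the exploration half have been fixed.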
