
eNote 1

Introduction to R

Updated: 01/02/16 at 16:10

Contents

1 Introduction to R
  1.1 Getting started with R and Rstudio
    1.1.1 Console and scripts
    1.1.2 Assignments and vectors
    1.1.3 Descriptive statistics
    1.1.4 Use of R
  1.2 Basic plotting, graphics - data visualisation
    1.2.1 Frequency distributions and the histogram
    1.2.2 Cumulative distributions
    1.2.3 The Box-Plot and the modified Box-Plot
    1.2.4 The Scatter plot
    1.2.5 Bar plots and Pie charts
    1.2.6 More plots in R?
    1.2.7 R in 27411
    1.2.8 Storage of Text and Graphics
    1.2.9 Scatterplots
  1.3 Introduction day
    1.3.1 Height-weight example from introstat eNote1
    1.3.2 Report ready tables with xtable
    1.3.3 Height-weight example - continued
    1.3.4 Height-weight example - continued: with some details related to PCA and SVD - singular value decomposition (Appendix 2.6 and 2.7)
  1.4 Something more on correlation and covariance
  1.5 Some additional matrix scatterplotting for the Varmuza toy data in 2.6.3
  1.6 Matrix scatterplotting the mtcars data
  1.7 Exercises


1.1 Getting started with R and Rstudio

The program R is an open source statistics program that you can download to your own laptop for free. Go to http://mirrors.dotsrc.org/cran/, select your platform (Windows, Mac, or Linux) and follow the instructions.

RStudio is a free and open source integrated development environment (IDE) for R. You can run it on your desktop (Windows, Mac, or Linux) or even over the web using RStudio Server. It works as an (extended) alternative to running R in the basic way, and it will be used in the course. Download it from http://www.rstudio.com/ and follow the installation instructions. To use the software, you only need to open Rstudio (not R itself).

1.1.1 Console and scripts

Once you have opened Rstudio, you will see a number of different windows. One of them is the console. Here you can write commands and execute them by hitting Enter. For instance:

> ## Adding numbers in the console

> 2+3

[1] 5

In the console you cannot go back and change previous commands, and neither can you save your work for later. To do this you need to write a script. Go to File → New → R Script. In the script you can write a line and execute it in the console by hitting Ctrl+Enter (Windows) or Cmd+Enter (Mac). You can also mark several lines and execute them all at the same time.



1.1.2 Assignments and vectors

If you want to assign a value to a variable, you can use = or <-. The latter is preferred by most R users, so for instance:

> y <- 3

It is often useful to assign a set of values to a variable like a vector. This is done with the function c (short for concatenate).

> x <- c(1, 4, 6, 2)

> x

[1] 1 4 6 2

Use the colon :, if you need a sequence, e.g. 1 to 10:

> x <- 1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10

You can also make a sequence with a specific step size different from 1 with seq(from, to, stepsize):

> x <- seq( 0, 1, by = 0.1)

> x

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

If you are in doubt about how to use a certain function, the help page can be opened by typing ? followed by the function, e.g. ?seq.
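As a small supplementary sketch (not part of the original note), the lines below combine c, the colon operator and seq, and show that arithmetic on vectors works element by element. The variable names a, b, d and s are chosen only for this illustration.

```r
## Three ways of building the same vector 1, 2, ..., 5
a <- c(1, 2, 3, 4, 5)
b <- 1:5
d <- seq(1, 5, by = 1)

## Arithmetic is done element by element
a * 2        # 2 4 6 8 10
a + b        # 2 4 6 8 10

## length.out asks seq for a given number of equally spaced values
s <- seq(0, 1, length.out = 11)
length(s)    # 11
```

The argument length.out is an alternative to by: you state how many values you want instead of the step size.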


1.1.3 Descriptive statistics

All the basic summary statistics measures can be found as functions or parts of functions in R:

• mean(x) - mean value of the vector x

• var(x) - variance

• sd(x) - standard deviation

• median(x) - median

• quantile(x,p) - finds the pth quantile. p can consist of several different values, e.g. quantile(x,c(0.25,0.75)) or quantile(x,c(0.25,0.75), type=2)

• cov(x, y) - the covariance of the vectors x and y

• cor(x, y) - the correlation

Please note again that the words quantiles and percentiles are used interchangeably - they are essentially synonyms, even though the formal distinction has been clarified earlier.

Example 1.1

Consider some n = 10 data on student heights. We can read these data into R and compute the sample mean and sample median as follows:

## Sample Mean and Median

x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

mean(x)

[1] 178

median(x)

[1] 179


The sample variance and sample standard deviation are found as follows:

## Sample variance and standard deviation

var(x)

[1] 149.1111

sqrt(var(x))

[1] 12.21111

sd(x)

[1] 12.21111

The sample quartiles can be found by using the quantile function as follows:

## Sample quartiles

quantile(x, type = 2)

0% 25% 50% 75% 100%

161 167 179 187 198

The option “type=2” makes sure that the quantiles found by the function are found using the definition given in the basic section of eNote1 of the introstat course. By default, the quantile function would use another definition (not detailed here). Generally, we consider this default choice just as valid as the one explicitly given here; it is merely a different one. Also, the quantile function has an option called “probs” where any list of probability values from 0 to 1 can be given. For instance:

## Sample quantiles 0%, 10%,..,90%, 100%:

quantile(x, probs = seq(0, 1, by = 0.10), type = 2)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

161.0 163.5 166.5 168.0 173.5 179.0 184.0 187.0 189.0 194.5 198.0
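To connect the functions above with their definitions, here is a minimal sketch that computes the sample variance directly from its formula and compares the quartiles from type = 2 with those from R's default definition (type = 7). The names s2, q2 and q7 are made up for this illustration.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)

## Sample variance computed directly from its definition
s2 <- sum((x - mean(x))^2) / (n - 1)
all.equal(s2, var(x))        # TRUE

## Quartiles with the eNote definition and with R's default (type = 7)
q2 <- quantile(x, probs = c(0.25, 0.75), type = 2)
q7 <- quantile(x, probs = c(0.25, 0.75))
q2   # 167 and 187
q7   # 167.25 and 186.25
```

The two definitions differ only slightly here, which illustrates why the default is considered just as valid.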


1.1.4 Use of R

Apart from access to probability distributions, the R-software can be used in several ways in our course (and in your future engineering activity):

1. As a pocket calculator substitute - that is, making R calculate ”manually” - by simple routines - plus, minus, square root etc. - whatever needs to be calculated, that you have identified by applying the right formulas from the proper definitions and methods in the written material.

2. As a ”statistical analysis machine” where, with some data fed into it, it will, by inbuilt functions and procedures, do all relevant computations for you and present the final results in some overview tables and plots.

3. As a high level graphics tool - using it for visualizing both data and models.

We will see and present all types of applications of R during the course, and any kind of flexibility jumping between the three is possible.

It must be stressed that even though the program is able to calculate things for the user, understanding the background of the calculations must NOT be forgotten - understanding the methods will always be good.

Remark 1.2 R is not a substitute for your brain activity in this course!

The software R should be seen as the most fantastic and easy computational companion that we can have for doing statistical computation. A good question to ask yourself each time that you apply an inbuilt R-function is: ”Do I really understand what R is computing for me now?”

1.2 Basic plotting, graphics - data visualisation

A really important part of working with data analysis is the visualisation of the raw data as well as of the results of the statistical analysis. Let us focus on the first part now. Depending on the data at hand, different types of plots and graphics could be relevant. One can distinguish between quantitative and categorical data. We will touch on the following types of basic plots:


• Quantitative data:

– Frequency plots and histograms

– Boxplots

– Cumulative distribution

– Scatter plot (xy plot)

• Categorical data:

– Bar charts

– Pie charts

1.2.1 Frequency distributions and the histogram

The frequency distribution of the data for a certain grouping of the data is nicely depicted by the histogram, which is a bar plot of either raw frequencies or densities for some number of classes.


Example 1.3

Consider again the n = 10 data from Example 1.1.

## A histogram of the heights:

hist(x)

[Figure: Histogram of x. x-axis: x (160 to 200); y-axis: Frequency (0 to 4).]

The default histogram uses equidistant class widths (the same width for all classes) and depicts the raw frequencies/counts in each class. One may change the scale to show what we will learn to be densities, that is, dividing the raw counts by n as well as by the class width:

\[ \text{Density in histogram} = \frac{\text{Class counts}}{n \cdot (\text{Class width})} \]

In a density histogram the areas of all the bars add up to 1.
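The density formula can be checked numerically. The following sketch asks hist for its class counts without drawing anything and applies the formula by hand; the names h, width and dens are chosen only for this illustration.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

## Let hist do the grouping without drawing anything
h <- hist(x, plot = FALSE)

n     <- length(x)
width <- diff(h$breaks)               # class widths
dens  <- h$counts / (n * width)       # density = count / (n * width)

all.equal(dens, h$density)            # TRUE
sum(dens * width)                     # 1: the bar areas add up to 1
```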


Example 1.4

## A density histogram of the heights:

hist(x, freq = FALSE, col = "red", nclass = 8)

[Figure: density histogram of x. x-axis: x (160 to 200); y-axis: Density.]

The R-function hist makes some choice of the number of classes based on the number of observations - it may be changed by the user option nclass as illustrated here, although the original choice seems better in this case due to the very small data set.

1.2.2 Cumulative distributions

The cumulative distribution can be visualized simply as the cumulated relative frequencies either across data classes, as also used in the histogram, or across individual data points, which is then called the empirical cumulative distribution function:


Example 1.5

plot(ecdf(x), verticals = TRUE)

[Figure: ecdf(x). x-axis: x (160 to 200); y-axis: Fn(x).]

The empirical cumulative distribution function Fn is a step function with jumps i/n at observation values, where i is the number of identical (tied) observations at that value.

For observations (x1, x2, . . . , xn), Fn(x) is the fraction of observations less than or equal to x, i.e.,

\[ F_n(x) = \frac{\#\{x_i \le x\}}{n} \]
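As a brief sketch of this definition, the value of Fn at a given point can be computed both with the function returned by ecdf and directly as a fraction of observations. The evaluation points 179 and 178 are chosen only for this illustration.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

Fn <- ecdf(x)          # ecdf returns a function

## Fraction of observations less than or equal to 179, computed two ways
Fn(179)                # 0.6
mean(x <= 179)         # 0.6

## The tied value 179 occurs twice, so the jump there is 2/n
Fn(179) - Fn(178)      # 0.2
```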

One amazing thing about R is the thousands of available free/open source extra packages, where one can find basically anything that you could imagine.


Be sure that you have installed the qcc package on your local computer before you try to carry out the R-code in the next example! As this is the first time in the material that we explicitly refer to and use an add-on package, here is the instruction on how to install the qcc package:

1. Make sure that you are online

2. At the top of the lower right window of Rstudio, click Packages, then click Install

3. Write ”qcc” in the empty field (without the quotation marks)

4. Click Install

Alternatively, simply run install.packages("qcc") at the command prompt.

Example 1.6

## A Pareto diagram based on the class counts from the hist-function:

library(qcc)

myhist <- hist(x,plot=FALSE)

mycounts=myhist$counts

names(mycounts)=myhist$breaks[-1]

pareto.chart(mycounts)

[Figure: Pareto Chart for mycounts. Classes: 170, 180, 190, 200; left axis: Frequency; right axis: Cumulative Percentage.]


1.2.3 The Box-Plot and the modified Box-Plot

The so-called boxplot in its basic form depicts the five quartiles (min, Q1, median, Q3, max) with a box from Q1 to Q3 emphasizing the Inter Quartile Range (IQR):

Example 1.7

## A basic boxplot of the heights: (range=0 makes it "basic")

boxplot(x, range = 0, col = "red", main = "Basic boxplot")

text(1.3, quantile(x), c("Minimum","Q1","Median","Q3","Maximum"), col="blue")

[Figure: Basic boxplot of the heights, with labels Minimum, Q1, Median, Q3 and Maximum.]

In the modified boxplot the whiskers only extend to the largest/smallest observations if these are not too far away from the box, where too far is defined as more than 1.5 × IQR. In other words, the whiskers extend to the largest/smallest observations within a distance of 1.5 × IQR of the box (that is, no more than 1.5 × IQR larger than Q3 or 1.5 × IQR smaller than Q1), and observations beyond that, the extreme observations, are plotted individually.

Example 1.8

If we add an extreme observation, 235 cm, to the heights data, and then make both the so-called modified boxplot - the default in R - and the basic one, we get the following. (Note that since there are no extreme observations among the original 10 observations, the two ”different” plots are actually the same, so we cannot illustrate the difference without having at least one extreme data point.)

boxplot(c(x, 235), col = "red", main = "Modified boxplot")

text(1.4, quantile(c(x, 235)), c("Minimum","Q1","Median","Q3","Maximum"),

col = "blue")

boxplot(c(x, 235), col = "red", main = "Basic boxplot", range = 0)

text(1.4, quantile(c(x, 235)),c("Minimum","Q1","Median","Q3","Maximum"),

col = "blue")

[Figures: Modified boxplot and Basic boxplot of the heights including 235 cm, each labelled Minimum, Q1, Median, Q3 and Maximum.]

The boxplot hence is an alternative to the histogram in visualising the distribution of the data. It is a convenient way of comparing distributions in different groups, if such data is at hand.
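The 1.5 × IQR rule of the modified boxplot can also be checked numerically. The sketch below uses boxplot.stats, which returns the numbers that boxplot draws. As a caveat, it computes the box from the so-called hinges (see ?boxplot.stats), which may differ slightly from the quartiles returned by quantile. The names y, bs, box and fence are chosen only for this illustration.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
y <- c(x, 235)                      # heights with the extreme observation added

bs <- boxplot.stats(y, coef = 1.5)  # coef = 1.5 is the default
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # observations plotted individually

## The same rule by hand
box   <- bs$stats[c(2, 4)]
fence <- box + c(-1.5, 1.5) * diff(box)
y[y < fence[1] | y > fence[2]]      # the extreme observation(s)
```

With the added observation the upper whisker stops at 198, the largest height inside the fence, and 235 is reported separately.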


Example 1.9

This example shows some ways of working with R to illustrate data.

In another sample of statistics course participants, the following heights of 17 females and 23 males were found:

Males    152 171 173 173 178 179 180 180 182 182 182 185
         185 185 185 185 186 187 190 190 192 192 197

Females  159 166 168 168 171 171 172 172 173 174 175 175
         175 175 175 177 178

The two modified boxplots to visualize the height sample distributions for each gender can be constructed by a single call to the boxplot function:

Males <- c(152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185,
           185, 185, 185, 185, 186, 187, 190, 190, 192, 192, 197)

Females <- c(159, 166, 168, 168, 171, 171, 172, 172, 173, 174, 175, 175,
             175, 175, 175, 177, 178)

boxplot(list(Males, Females), col = 2:3, names = c("Males", "Females"))

[Figure: boxplots of height for Males and Females.]

At this point, it should be noted that in real work with data using R, one would generally not import data into R by explicit listings in an R-script file as done here. This only works for very small data sets like this one. The more realistic approach is to import the data from somewhere else, e.g. from a spreadsheet program such as Microsoft Excel.

Example 1.10

Some gender grouped student heights data is available as a .csv-file via http://www2.compute.dtu.dk/courses/introstat/data/studentheights.csv. The structure of the data file, as it would appear in Excel, is two columns and 40+1 rows including a header row:

1 Height Gender

2 152 male

3 171 male

4 173 male

. . .

. . .

24 197 male

25 159 female

26 166 female

27 168 female

. . .

. . .

39 175 female

40 177 female

41 178 female

The data can now be imported into R by the read.table function:

studentheights <- read.table("studentheights.csv", sep = ";", dec = ".",

header = TRUE)

The resulting object studentheights is now a so-called data.frame, which is the R-name for data sets within R. There are some ways of getting a quick look at what kind of data is really in a data set:



## Have a look at the first 6 rows of the data:

head(studentheights)

Height Gender

1 152 male

2 171 male

3 173 male

4 173 male

5 178 male

6 179 male

## Get a summary of each column/variable in the data:

summary(studentheights)

Height Gender

Min. :152.0 female:17

1st Qu.:172.8 male :23

Median :177.5

Mean :177.9

3rd Qu.:185.0

Max. :197.0

For quantitative variables we get the quartiles and the mean. For categorical variables we see (some of) the category frequencies. A data structure like this would be the most commonly encountered (and needed) for statistical analysis of data. The gender grouped boxplot could now be done by the following:


boxplot(Height ~ Gender, data = studentheights, col=2:3)

[Figure: boxplots of Height by Gender (female, male).]

The R-syntax Height ~ Gender with the tilde symbol “~” is one that we will use a lot in various contexts such as plotting and model fitting. In this context it can be understood as “Height is plotted as a function of Gender”.
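The formula syntax is not limited to plotting. As a sketch, assuming the data file is not at hand, the data frame below is rebuilt from the heights listed in Example 1.9 (the names heights_df and grp_means are made up for this illustration), and aggregate then uses the same Height ~ Gender formula to compute a mean per group.

```r
Males <- c(152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185,
           185, 185, 185, 185, 186, 187, 190, 190, 192, 192, 197)
Females <- c(159, 166, 168, 168, 171, 171, 172, 172, 173, 174, 175, 175,
             175, 175, 175, 177, 178)

## One row per student: a quantitative and a categorical column
heights_df <- data.frame(
  Height = c(Males, Females),
  Gender = factor(rep(c("male", "female"),
                      times = c(length(Males), length(Females))))
)

table(heights_df$Gender)      # number of students per gender

## "Height summarised as a function of Gender"
grp_means <- aggregate(Height ~ Gender, data = heights_df, FUN = mean)
grp_means
```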

1.2.4 The Scatter plot

The scatter plot can be used when there are two quantitative variables at hand, and issimply one variable plotted versus the other using some plotting symbol.

Example 1.11

Now we will use a data set available as part of R itself. Both base R and many add-on R-packages include data sets that can be used for testing, trying and practicing. Here we will use the mtcars data set. If you write:

?mtcars

you will be able to read the following as part of the help info:


“The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). A data frame with 32 observations on 11 variables. Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.”

Let us plot the gasoline use (mpg = miles per gallon) versus the weight (wt):

## To make 2 plots on a single plot-region:

par(mfrow=c(1,2))

## First the default version:

plot(mtcars$wt, mtcars$mpg)

## Then a nicer version:

plot(mpg ~ wt, xlab = "Car Weight (1000lbs)", data = mtcars,

ylab = "Miles pr. Gallon", col = factor(am),

sub = "Red: manual transmission", main = "Inverse fuel usage vs. size")

[Figure: two scatter plots of mpg versus wt. Left: the default version (axes mtcars$wt and mtcars$mpg). Right: "Inverse fuel usage vs. size" (x-axis: Car Weight (1000lbs); y-axis: Miles pr. Gallon; subtitle: Red: manual transmission).]

In the second plot call we have used the so-called formula syntax of R that was introduced above for the grouped boxplot. Again, it can be read: “mpg is plotted as a function of wt.”


1.2.5 Bar plots and Pie charts

All the plots described so far were for quantitative variables. For categorical variables the natural basic plot would be a bar plot or pie chart visualizing the relative frequencies in each category.

Example 1.12

For the gender grouped student heights data we can plot the gender distribution:

## Barplot:

barplot(table(studentheights$Gender), col=2:3)

[Figure: bar plot of the gender counts (female, male).]


## Pie chart:

pie(table(studentheights$Gender), cex=1, radius=1)

[Figure: pie chart with the slices female and male.]
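The two plots above show raw counts. If relative frequencies are wanted on the axis, the table can be passed through prop.table first. In this sketch the Gender column is rebuilt from the counts reported by summary (17 females, 23 males) so that the lines run without the data file; the names counts and props are chosen only for this illustration.

```r
## Gender column rebuilt from the counts reported in Example 1.10
Gender <- factor(rep(c("female", "male"), times = c(17, 23)))

counts <- table(Gender)
props  <- prop.table(counts)   # relative frequencies, summing to 1
props

barplot(props, col = 2:3, ylab = "Relative frequency")
```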

1.2.6 More plots in R?

A good place for getting more inspired on how to do easy and nice plots in R is: http://www.statmethods.net/.

1.2.7 R in 27411

Some of the methods in this course can be covered with R-tools that can be found in the base R installation and some are available in various add-on R-packages. And not uncommonly, the methods will be available in more than a single version, as there are by now thousands of R-packages available. This R-note should be seen more as a help to find a good way through (or into) the tools, help and material already available than as providing lots of specific new tutorial help in itself. So mostly ”copy-and-paste” from and links to other sources! The two main sources of inspiration for us are the two R-packages chemometrics and ChemometricswithR, which again are companions to the two books:



• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg.

Both of these books are available as ebooks (that you can access using your DTU ID):

• http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

• http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1

In what follows we will mostly base our introduction on the Varmuza and Filzmoser book.

1.2.8 Storage of Text and Graphics

Text from windows in R can be copied into other programs in the usual manner:

• mark the text by holding the left mouse button down and dragging it over the desired text.

• open another program (e.g. StarOffice), place the pointer at the desired location and press the middle mouse button.

All text in the ’Commands Window’ or the ’Report Window’ can be stored in a text file by activating the window and choosing ’File’ → ’Save As . . . ’.

Graphics can be stored in a graphics file by activating the graphics window and choosing ’File’ → ’Save As . . . ’. It is possible to choose from a range of graphics formats (JPEG is the default). One may also explicitly define graphics devices, e.g. ”pdf”, to write the graphics directly to a pdf-file.
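Writing directly to a pdf device could look like the following sketch. The file name heights_hist.pdf and the plot size are assumptions made for this illustration, and tempdir() is used so that no existing file is overwritten.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

## Hypothetical file name, placed in R's temporary directory
outfile <- file.path(tempdir(), "heights_hist.pdf")

pdf(outfile, width = 6, height = 4)   # open the pdf device (size in inches)
hist(x)                               # the plot goes to the file, not the screen
dev.off()                             # close the device to finish the file

file.exists(outfile)
```

The call to dev.off() matters: until the device is closed, the file is not complete.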

1.2.9 Scatterplots

Have a look at: http://www.statmethods.net/graphs/scatterplot.html


1.3 Introduction day

1.3.1 Height-weight example from introstat eNote1

(http://introstat.compute.dtu.dk/enote/afsnit/NUID172/) Illustrating centering and scaling, cf. Varmuza & Filzmoser, sec. 2.2.2.

# Reading data:

X1 <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

X2 <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)

Basic means and sds:

mean(X1)

[1] 178

mean(X2)

[1] 78.11

sd(X1)

[1] 12.21111

sd(X2)

[1] 14.07184

Centered and standardized data:



# Only centering:

X_cent1 <- X1-mean(X1)

X_cent2 <- X2-mean(X2)

# Standardization

X_auto1 <- X_cent1/sd(X1)

X_auto2 <- X_cent2/sd(X2)

# Table 2.1:

Tab21 <- cbind(X1, X2, X_cent1, X_cent2, X_auto1, X_auto2)

Tab21

X1 X2 X_cent1 X_cent2 X_auto1 X_auto2

[1,] 168 65.5 -10 -12.61 -0.81892664 -0.89611621

[2,] 161 58.3 -17 -19.81 -1.39217528 -1.40777654

[3,] 167 68.1 -11 -10.01 -0.90081930 -0.71134998

[4,] 179 85.7 1 7.59 0.08189266 0.53937526

[5,] 184 80.5 6 2.39 0.49135598 0.16984280

[6,] 166 63.4 -12 -14.71 -0.98271196 -1.04535048

[7,] 198 102.6 20 24.49 1.63785327 1.74035576

[8,] 187 91.4 9 13.29 0.73703397 0.94443969

[9,] 191 86.7 13 8.59 1.06460463 0.61043920

[10,] 179 78.9 1 0.79 0.08189266 0.05614051

1.3.2 Report ready tables with xtable

Nice tables can be produced by the xtable function of the xtable-package. An example:

library(xtable)

first5obs <- Tab21[1:5,]

xtable(first5obs)

% latex table generated in R 3.2.1 by xtable 1.7-4 package


% Mon Feb 01 16:07:04 2016

\begin{table}[ht]

\centering

\begin{tabular}{rrrrrrr}

\hline

& X1 & X2 & X\_cent1 & X\_cent2 & X\_auto1 & X\_auto2 \\

\hline

1 & 168.00 & 65.50 & -10.00 & -12.61 & -0.82 & -0.90 \\

2 & 161.00 & 58.30 & -17.00 & -19.81 & -1.39 & -1.41 \\

3 & 167.00 & 68.10 & -11.00 & -10.01 & -0.90 & -0.71 \\

4 & 179.00 & 85.70 & 1.00 & 7.59 & 0.08 & 0.54 \\

5 & 184.00 & 80.50 & 6.00 & 2.39 & 0.49 & 0.17 \\

\hline

\end{tabular}

\end{table}

When this tex-code is included in your tex-file, it will appear in the report as a nice table:

       X1     X2  X_cent1  X_cent2  X_auto1  X_auto2
1  168.00  65.50   -10.00   -12.61    -0.82    -0.90
2  161.00  58.30   -17.00   -19.81    -1.39    -1.41
3  167.00  68.10   -11.00   -10.01    -0.90    -0.71
4  179.00  85.70     1.00     7.59     0.08     0.54
5  184.00  80.50     6.00     2.39     0.49     0.17

Note how the input to xtable was a matrix here. The function is prepared to recognize a number of different R-objects, see e.g.:

methods(xtable)

[1] xtable.anova* xtable.aov*

[3] xtable.aovlist* xtable.coxph*

[5] xtable.data.frame* xtable.glm*

[7] xtable.lm* xtable.matrix*

[9] xtable.prcomp* xtable.summary.aov*

[11] xtable.summary.aovlist* xtable.summary.glm*

[13] xtable.summary.lm* xtable.summary.prcomp*

[15] xtable.table* xtable.ts*

[17] xtable.zoo*

see ’?methods’ for accessing help and source code

For instance, ANOVA-tables will be recognized. A LaTeX user can then copy these tex-lines into the report .tex-document. Alternatively, to integrate the R-code into the tex-code, use the knitr package to create the pure tex-file from an .Rnw file, which is a kind of tex-file with all the R-code integrated into it, with a lot of flexibility in controlling what will be shown/evaluated etc. in the output. This can be used for raw code/results, tables and figures alike.

A Word user may also use xtable through the html print option:

print(xtable(first5obs), type = "html")





<table border=1>

<tr> <th> </th> <th> X1 </th> <th> X2 </th> <th> X_cent1 </th> <th> X_cent2 </th> <th> X_auto1 </th> <th> X_auto2 </th> </tr>

<tr> <td align="right"> 1 </td> <td align="right"> 168.00 </td> <td align="right"> 65.50 </td> <td align="right"> -10.00 </td> <td align="right"> -12.61 </td> <td align="right"> -0.82 </td> <td align="right"> -0.90 </td> </tr>



<tr> <td align="right"> 4 </td> <td align="right"> 179.00 </td> <td align="right"> 85.70 </td> <td align="right"> 1.00 </td> <td align="right"> 7.59 </td> <td align="right"> 0.08 </td> <td align="right"> 0.54 </td> </tr>

<tr> <td align="right"> 5 </td> <td align="right"> 184.00 </td> <td align="right"> 80.50 </td> <td align="right"> 6.00 </td> <td align="right"> 2.39 </td> <td align="right"> 0.49 </td> <td align="right"> 0.17 </td> </tr>

</table>

And then print the table directly into a file:

print(xtable(first5obs), type = "html", file = "myhtmltable.html")

Open the file in a browser and copy-paste to Word.

1.3.3 Height-weight example - continued

Centering and standardization can most easily be performed by the scale-function:

# Raw data in matrix X:

X <- cbind(X1, X2)

# Using scale function to only center:

X_cent <- scale(X, scale = FALSE)

# Using scale function to center and standardize:

X_auto <- scale(X)
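As a quick check, and only as a sketch, the result of scale can be compared with the manual centering and standardization from section 1.3.1. The name X_manual is made up for this illustration.

```r
X1 <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
X2 <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)

X <- cbind(X1, X2)
X_auto <- scale(X)

## Manual version for comparison
X_manual <- cbind(X1 = (X1 - mean(X1)) / sd(X1),
                  X2 = (X2 - mean(X2)) / sd(X2))

all.equal(X_manual, X_auto, check.attributes = FALSE)   # TRUE

## scale stores what it subtracted and divided by as attributes
attr(X_auto, "scaled:center")   # the column means
attr(X_auto, "scaled:scale")    # the column standard deviations
```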


Means and standard deviations in each column of a matrix can easily be found by the apply-function (using the round function to show fewer decimals):

# Means by columns:

round(apply(Tab21, 2, mean), 2)


178.00 78.11 0.00 0.00 0.00 0.00

# Standard deviations by columns:

round(apply(Tab21, 2, sd), 2)


12.21 14.07 12.21 14.07 1.00 1.00

par(mfrow=c(1, 2)) # To make two plots in one page in a 1x2 structure

plot(X1, X2, las = 1)

plot(X_auto1, X_auto2, las = 1)

abline(h = 0, v = 0) # Adding horizontal and vertical lines at zeros

arrows(0, 0, 1, 1) # Adding the arrow


[Figure: scatter plots of X2 versus X1 (left) and X_auto2 versus X_auto1 (right).]

1.3.4 Height-weight example - continued: with some details related to PCA and SVD - singular value decomposition (Appendix 2.6 and 2.7)

How to find the correlation and the arrow direction from data


# Correlation using function:

cor(X)

X1 X2

X1 1.0000000 0.9656034

X2 0.9656034 1.0000000

# Correlation using matrix-multiplication:

t(X_auto) %*% X_auto/9

X1 X2

X1 1.0000000 0.9656034

X2 0.9656034 1.0000000

# How to find the arrow direction from data:

eigen(cor(X))

$values

[1] 1.96560343 0.03439657

$vectors

[,1] [,2]

[1,] 0.7071068 -0.7071068

[2,] 0.7071068 0.7071068

# Using standard notation:

W <- eigen(cor(X))$vectors

W # In PCA: loadings

[,1] [,2]

[1,] 0.7071068 -0.7071068

[2,] 0.7071068 0.7071068


# The projected values:

z1 <- X_auto %*% W[,1]

z2 <- X_auto %*% W[,2]

plot(z1, z2, ylim = c(-3, 3), xlim = c(-3, 3))

[Figure: scatter plot of z2 versus z1, both axes from -3 to 3.]

par(mfrow = c(1, 2))

var(z1)

[,1]

[1,] 1.965603

var(z2)

[,1]

[1,] 0.03439657

hist(z1, xlim = c(-3, 3))

hist(z2, xlim = c(-3, 3))


[Figure: Histogram of z1 and Histogram of z2 (x-axes from -3 to 3; y-axes: Frequency).]

These variances are the eigenvalues of the correlation matrix (their square roots are, up to the factor sqrt(n - 1), the singular values of the standardized data matrix):

eigen(cor(X))$values

[1] 1.96560343 0.03439657

# And the D-matrix is the diagonal of the square-roots of these:

D <- diag(sqrt(eigen(cor(X))$values))

D # In PCA: the standard deviations of the scores (square roots of the explained variances)

[,1] [,2]

[1,] 1.402 0.0000000

[2,] 0.000 0.1854631


# And the z1 and z2 can be standardized by these sds:

z_auto1 <- z1/sd(z1)

z_auto2 <- z2/sd(z2)

cbind(z_auto1, z_auto2)

[,1] [,2]

[1,] -0.86499187 -0.29429720

[2,] -1.41217204 -0.05948223

[3,] -0.81310699 0.72238105

[4,] 0.31334011 1.74422312

[5,] 0.33347947 -1.22581868

[6,] -1.02286513 -0.23881902

[7,] 1.70381944 0.39080657

[8,] 0.84806106 0.79076636

[9,] 0.84481813 -1.73157589

[10,] 0.06961784 -0.09818407


# Or the same could be extracted from the matrices:

Z_auto <- X_auto %*% W %*% solve(D)

Z_auto # In PCA: The standardized scores

[,1] [,2]

[1,] -0.86499187 -0.29429720

[2,] -1.41217204 -0.05948223

[3,] -0.81310699 0.72238105

[4,] 0.31334011 1.74422312

[5,] 0.33347947 -1.22581868

[6,] -1.02286513 -0.23881902

[7,] 1.70381944 0.39080657

[8,] 0.84806106 0.79076636

[9,] 0.84481813 -1.73157589

[10,] 0.06961784 -0.09818407

var(Z_auto)

[,1] [,2]

[1,] 1.000000e+00 4.700949e-16

[2,] 4.700949e-16 1.000000e+00

# So we have done the SVD, check:

cbind(Z_auto %*% D %*% t(W), X_auto)

X1 X2

[1,] -0.81892664 -0.89611621 -0.81892664 -0.89611621

[2,] -1.39217528 -1.40777654 -1.39217528 -1.40777654

[3,] -0.90081930 -0.71134998 -0.90081930 -0.71134998

[4,] 0.08189266 0.53937526 0.08189266 0.53937526

[5,] 0.49135598 0.16984280 0.49135598 0.16984280

[6,] -0.98271196 -1.04535048 -0.98271196 -1.04535048

[7,] 1.63785327 1.74035576 1.63785327 1.74035576

[8,] 0.73703397 0.94443969 0.73703397 0.94443969

[9,] 1.06460463 0.61043920 1.06460463 0.61043920

[10,] 0.08189266 0.05614051 0.08189266 0.05614051
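The same decomposition can be obtained in one step with the base R function svd. The sketch below, with the illustrative names s and X_back, confirms the link between the singular values and the eigenvalues of the correlation matrix. The signs of the singular vectors are not unique, so only the values and the reconstruction are compared.

```r
X1 <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
X2 <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)

X_auto <- scale(cbind(X1, X2))
n <- nrow(X_auto)

s <- svd(X_auto)                 # X_auto = U diag(d) V'

## Squared singular values divided by n - 1 are the eigenvalues of cor(X)
s$d^2 / (n - 1)
eigen(cor(cbind(X1, X2)))$values

## The decomposition reproduces the standardized data
X_back <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(X_back, X_auto, check.attributes = FALSE)   # TRUE
```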


1.4 Something more on correlation and covariance

In section 2.3 in the Varmuza book, some toy data are simulated from a so-called 2-dimensional normal distribution:

library(mvtnorm)

library(StatDA)

sigma <- matrix(c(1, 0.8, 0.8, 1), ncol=2) # sigma1 in Fig. 2.8

X <- rmvnorm(200, mean = c(0, 0), sigma = sigma)

par(mfrow = c(1, 3))

plot(X[,1], X[,2])

edaplot(X[,1])

edaplot(X[,2])

[Figure: scatter plot of X[, 2] versus X[, 1], and edaplots titled Histogram of X[, 1] and Histogram of X[, 2] (y-axes: Frequency).]

The covariance:


## On raw data:

cov(X)

[,1] [,2]

[1,] 1.2278453 0.9541705

[2,] 0.9541705 1.1094357

## On centered data with matrix multiplication
## (X_cent must be recomputed for the simulated X)

X_cent <- scale(X, scale = FALSE)

t(X_cent)%*%X_cent/199

[,1] [,2]

[1,] 1.2278453 0.9541705

[2,] 0.9541705 1.1094357

The correlation:

## On raw data:

cor(X)

[,1] [,2]

[1,] 1.0000000 0.8175289

[2,] 0.8175289 1.0000000

## On centered AND scaled data with Matrix multiplication

X_auto <- scale(X)

t(X_auto)%*%X_auto/199

[,1] [,2]

[1,] 1.0000000 0.8175289

[2,] 0.8175289 1.0000000

Also works for data matrices of higher dimension than 2!
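To illustrate the higher-dimensional case, here is a sketch with three simulated variables. It uses rnorm from base R instead of rmvnorm so that no add-on package is needed; the seed, the sample size and the way the third column is made to depend on the first are arbitrary choices for this illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 50
X <- cbind(rnorm(n), rnorm(n), rnorm(n))
X[, 3] <- X[, 1] + 0.5 * X[, 3]  # make column 3 correlated with column 1

X_cent <- scale(X, scale = FALSE)
X_auto <- scale(X)

## Covariance and correlation matrices by matrix multiplication
S <- t(X_cent) %*% X_cent / (n - 1)
R <- t(X_auto) %*% X_auto / (n - 1)

all.equal(S, cov(X), check.attributes = FALSE)   # TRUE
all.equal(R, cor(X), check.attributes = FALSE)   # TRUE
```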


1.5 Some additional matrix scatterplotting for the Varmuza toy data in 2.6.3

A first step in exploring several variables simultaneously is to draw multiple scatterplots on a single page. We show here a few ways of doing this using the small artificial data set of Table 2.4, imported below:

# Importing data: (the Table 2.4 data with an x0 column added)

tab24data <- read.table("Tab24ArtificialData.txt",

header = TRUE, sep = ",", dec = ".")

tab24data

x0 x1 x2 y

1 0.9 0.8 3.5 1

2 0.2 3.0 4.0 1

3 -0.2 4.2 4.8 1

4 -0.7 6.0 6.0 1

5 0.3 6.7 7.1 1

6 0.8 1.5 1.0 2

7 -1.1 4.0 2.5 2

8 -0.9 5.5 3.0 2

9 -0.7 7.3 3.5 2

10 -0.4 8.5 4.5 2

X <- tab24data[,1:3]
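If the text file is not at hand, the ten rows printed above can be typed in directly. This is only a sketch that copies the printed values; the file itself may store them with a different column type:

```r
# The printed Table 2.4 values, typed in by hand
tab24data <- data.frame(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5),
  y  = rep(1:2, each = 5)          # group label: rows 1-5 are group 1, rows 6-10 group 2
)
X <- tab24data[, 1:3]              # the three x-columns used in the plots
str(tab24data)
```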

# Scatterplot Matrices using pairs:

pairs(X)


[Figure: scatterplot matrix (pairs plot) of x0, x1 and x2.]

# Scatterplot Matrices using pairs WITH color coding of groups

Gender <- factor(tab24data[,4]) # Defining column 4 as a grouping factor

pairs(X, main = "Scatterplots - Gender grouping",

col = c("red", "blue")[Gender])


[Figure: scatterplot matrix of x0, x1 and x2 titled "Scatterplots - Gender grouping", with the points coloured red and blue by group.]

library(GGally)

ggpairs(X)


[Figure: ggpairs plot of x0, x1 and x2, with scatterplots below the diagonal and correlations above it: Corr(x0, x1) = -0.618, Corr(x0, x2) = -0.0724, Corr(x1, x2) = 0.53.]

# 3D Scatterplot

library(scatterplot3d)

scatterplot3d(X, main="3D Scatterplot")


[Figure: 3D scatterplot of x0, x1 and x2, titled "3D Scatterplot".]

# 3D Scatterplot with Coloring and Vertical Drop Lines

scatterplot3d(X, pch = 16, highlight.3d = TRUE,

type = "h", main = "3D Scatterplot")


[Figure: 3D scatterplot of x0, x1 and x2 with filled, colour-highlighted points and vertical drop lines, titled "3D Scatterplot".]

# 3D Scatterplot with Coloring and Vertical Lines

# and Regression Plane

s3d <- scatterplot3d(X, pch = 16, highlight.3d = TRUE,

type = "h", main = "3D Scatterplot")

fit <- lm(X[,3] ~ X[,1] + X[,2])

s3d$plane3d(fit)


[Figure: 3D scatterplot of x0, x1 and x2 with vertical drop lines and the fitted regression plane, titled "3D Scatterplot".]
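The regression plane is simply the fitted linear model for x2 given x0 and x1. As a small sketch (with the printed Table 2.4 values typed in by hand, and the point (0, 5) chosen arbitrarily for illustration), the height of the plane can be computed from the coefficients:

```r
X <- data.frame(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)
fit <- lm(x2 ~ x0 + x1, data = X)   # same model as lm(X[,3] ~ X[,1] + X[,2])
b <- coef(fit)                      # intercept and the two slopes of the plane
# Height of the plane above (x0, x1) = (0, 5), by hand and with predict()
b[1] + b[2] * 0 + b[3] * 5
predict(fit, newdata = data.frame(x0 = 0, x1 = 5))
```

Writing the model with column names and a data argument makes predict() with new data possible, which the X[,3] ~ X[,1] + X[,2] form does not easily allow.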

An interactive spinning 3D-plot can be done with the rgl-package:

# Spinning 3d Scatterplot

library(rgl)

plot3d(X, col="red", size=3)

The code shown here opens a separate plotting window, in which you can spin the plot using the mouse.

1.6 Matrix scatterplotting the mtcars data

data(mtcars)

?mtcars

#import:

head(mtcars) # List the top of the data set

summary(mtcars) # Summarize each variable in the data set

dim(mtcars) # Show number of rows and columns in the data set

pairs(mtcars)


[Figure: scatterplot matrix of all 11 mtcars variables (mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb).]
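With 11 variables the panels become very small. One possible way out, sketched here in base R, is to plot a subset and to use the upper panels for the correlations. The subset in vars and the helper panel.cor are assumptions made for illustration, not something used elsewhere in this note:

```r
# Hypothetical helper: prints the correlation of the two variables
# in the middle of the panel instead of drawing the points
panel.cor <- function(x, y, digits = 2, ...) {
  op <- par(usr = c(0, 1, 0, 1))   # temporary 0-1 coordinate system
  on.exit(par(op))                 # restore the coordinates of the panel
  text(0.5, 0.5, format(cor(x, y), digits = digits), cex = 1.5)
}

vars <- c("mpg", "disp", "hp", "wt")   # assumed subset, chosen only for readability
pairs(mtcars[, vars], lower.panel = panel.smooth, upper.panel = panel.cor)
```

The lower panels use panel.smooth, which comes with base R and adds a smooth curve to each scatterplot.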


# Scatterplot Matrices from the gclus Package

library(cluster)

library(gclus)

dta <- mtcars # get data - just put your own data here!

dta.r <- abs(cor(dta)) # get correlations

dta.col <- dmat.color(dta.r) # get colors

# reorder variables so those with highest correlation

# are closest to the diagonal

dta.o <- order.single(dta.r)

cpairs(dta, dta.o, panel.colors = dta.col, gap =.5,

main = "Variables Ordered and Colored by Correlation" )


[Figure: scatterplot matrix of the mtcars variables titled "Variables Ordered and Colored by Correlation"; the variables appear in the order qsec, carb, vs, hp, cyl, disp, wt, mpg, drat, am, gear.]
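The idea of the reordering can be imitated in base R with a single-linkage clustering of the absolute correlations. This is only a sketch of the idea under that assumption; it is not claimed to reproduce the exact order or the colouring that gclus gives:

```r
dta.r <- abs(cor(mtcars))                   # absolute correlations, as above
d <- as.dist(1 - dta.r)                     # high correlation = small distance
ord <- hclust(d, method = "single")$order   # leaf order of the dendrogram
names(mtcars)[ord]                          # the variables in their new order
pairs(mtcars[, ord], gap = 0.5)             # strongly correlated variables end up near each other
```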

# 10 normal probability plots on the same plot page:

library(car)

par(mfrow=c(2,5))

for (i in 1:10) qqPlot(mtcars[,i])


[Figure: ten normal QQ-plots in a 2 x 5 layout, one for each of the first ten mtcars columns, each with "norm quantiles" on the horizontal axis and "mtcars[, i]" on the vertical axis.]
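Since every panel gets the same axis label mtcars[, i], it can be hard to tell the plots apart. A possible base R variant (without the car package, and hence without the confidence bands that qqPlot draws) puts the variable name in the title of each panel:

```r
par(mfrow = c(2, 5))
for (v in names(mtcars)[1:10]) {
  qqnorm(mtcars[[v]], main = v)       # sample quantiles against normal quantiles
  qqline(mtcars[[v]], col = "red")    # line through the first and third quartiles
}
```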

# 8 histograms with color:

par(mfrow=c(2,4))

for (i in 1:8) hist(mtcars[,i],col=i)


[Figure: eight coloured histograms in a 2 x 4 layout, one for each of the first eight mtcars columns; every panel is titled "Histogram of mtcars[, i]" with Frequency on the vertical axis.]

# For all of them:

par(mfrow=c(2,5))

for (i in 2:11) {plot(mtcars$mpg ~ mtcars[,i], type="n",xlab=names(mtcars)[i])

text(mtcars[,i],mtcars$mpg,labels=row.names(mtcars))

abline(lm(mtcars$mpg~mtcars[,i]), col="red")

}


[Figure: ten panels in a 2 x 5 layout showing mpg against cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb, with the car names as plotting labels and the fitted regression line in red.]


# Saving directly to a png-file:

png("mpg_relations.png",width=800,height=600)

par(mfrow=c(2,5))

for (i in 2:11) {plot(mtcars$mpg ~ mtcars[,i], type="n",xlab=names(mtcars)[i])

text(mtcars[,i],mtcars$mpg,labels=row.names(mtcars))

abline(lm(mtcars$mpg~mtcars[,i]), col="red")

}

dev.off() # close the png device so that the file is written

# Correlation (with 2 decimals)

round(cor(mtcars),2)

mpg cyl disp hp drat wt qsec vs am gear carb

mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55

cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53

disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39

hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75

drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09

wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43

qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66

vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57

am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06

gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27

carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
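When one variable is of main interest, here mpg, a single sorted column of the matrix is often easier to read than the full table. A small sketch (the name r_mpg is only illustrative):

```r
r_mpg <- cor(mtcars)[, "mpg"]            # correlations of every variable with mpg
r_mpg <- r_mpg[names(r_mpg) != "mpg"]    # drop the correlation of mpg with itself
# Sort by absolute size, strongest first
round(r_mpg[order(abs(r_mpg), decreasing = TRUE)], 2)
```

In agreement with the table above, wt comes first with a correlation of -0.87.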

# Making a table for viewing in a browser - to be included in e.g. Word/Powerpoint

# Go find/use the cortable.html file afterwards

library(xtable)

capture.output(print(xtable(cor(mtcars)),type="html"),file="cortable.html")

ggpairs(mtcars)

[Figure: ggpairs plot of all 11 mtcars variables, with scatterplots below the diagonal and the pairwise correlations (the same values as in the correlation table above) above it.]

The ggplot2-package offers some really nice plotting options, although for plain pairwise matrix scatterplots the suggestions above should be used.

The ggplot2-package is particularly useful for plotting various combinations of factors in the data. Two of the variables in the mtcars data set, cyl and am, are actually categorical (am is binary), which we can let R know with the factor function:


mtcars$cyl <- factor(mtcars$cyl)

mtcars$am <- factor(mtcars$am)
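To see what the conversion does, here is a small sketch on a copy of the data (the copy mt and the level labels are illustrative choices; the coding 0 = automatic, 1 = manual is taken from the mtcars help page):

```r
mt <- mtcars                         # work on a copy, leaving mtcars itself untouched
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("automatic", "manual"))  # 0 = automatic, 1 = manual
levels(mt$cyl)                       # the distinct cylinder counts, now as categories
table(mt$cyl, mt$am)                 # number of cars in each combination
summary(mt$am)                       # a factor is summarised by counts, not by quartiles
```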

This could then be used to create the following plot of mpg versus wt conditioned on the factors:

library(ggplot2)

# First for each cyl:

p <- ggplot(mtcars, aes(wt, mpg, colour = cyl,

label = row.names(mtcars)))

p <- p + geom_text()

p <- p + geom_smooth(method = lm, fullrange=T)

print(p)

[Figure: mpg against wt with the car names as labels, coloured by cyl (4, 6, 8), with a fitted line and shaded band for each cyl group.]

The shaded area is, by default, the (pointwise) 95% confidence interval for the line estimate in each subgroup.
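Such a band can be sketched in base R for a single subgroup, assuming that it corresponds to the usual confidence interval for the mean from a linear model. The weights 2 and 3 and the names fit4, new_wt and ci are chosen only for illustration:

```r
fit4 <- lm(mpg ~ wt, data = subset(mtcars, cyl == 4))   # one subgroup: 4 cylinders
new_wt <- data.frame(wt = c(2, 3))                      # weights at which to evaluate the line
ci <- predict(fit4, newdata = new_wt,
              interval = "confidence", level = 0.95)    # pointwise 95% interval
ci                                                      # columns: fit, lwr, upr
```

Repeating this over a fine grid of wt values, and for each subgroup, gives the kind of bands seen in the plot.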

Note the special way of making plots this way - you have to get used to it - copy and paste from here and search for more help, e.g.

http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/

library(ggplot2)

# Then for each transmission type:

p <- ggplot(mtcars, aes(wt, mpg, colour = am,

label = row.names(mtcars)))

p <- p + geom_text()

p <- p + geom_smooth(method = lm, fullrange=TRUE)

print(p)

[Figure: mpg against wt with the car names as labels, coloured by am (0, 1), with a fitted line and shaded band for each transmission type.]

# Or all combinations on separate plots: (not so nice here)

library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg, label = row.names(mtcars)))

p <- p + geom_text()

p <- p + geom_smooth(method = lm, fullrange=FALSE)

p <- p + facet_wrap(~ am + cyl)

print(p)



[Figure: mpg against wt in six panels, one for each combination of am and cyl (0, 4; 0, 6; 0, 8; 1, 4; 1, 6; 1, 8), with the car names as labels and a fitted line in each panel.]

We can use the same feature to create nicer versions of the multiple scatterplots of mpg versus all the x-variables made above. Using the melt function of the reshape2-package, we create a version of the data set where the (relevant) variables are "stringed out on top of each other" as a single variable, with a new variable coding which original variable each value comes from:

library(reshape2)

mtcars2 <- melt(mtcars, measure.vars=c(3:8, 10:11))

summary(mtcars2)

      mpg        cyl     am       variable       value
 Min.   :10.40   4: 88   0:152   disp   :32   Min.   :  0.000
 1st Qu.:15.43   6: 56   1:104   hp     :32   1st Qu.:  2.982
 Median :19.20   8:112           drat   :32   Median :  4.000
 Mean   :20.09                   wt     :32   Mean   : 51.126
 3rd Qu.:22.80                   qsec   :32   3rd Qu.: 30.175
 Max.   :33.90                   vs     :32   Max.   :472.000
                                 (Other):64
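To see what "on top of each other" means, the reshaping can be imitated with base R. This is only a rough analogue of melt, not how reshape2 implements it; the names `measure`, `stacked`, `ids` and `long` are illustrative:

```r
# Columns to stack: disp, hp, drat, wt, qsec, vs, gear, carb
measure <- c(3:8, 10:11)

# stack() puts the selected columns on top of each other:
# 'values' holds the numbers, 'ind' records the source column
stacked <- stack(mtcars[, measure])

# Repeat the remaining (id) columns mpg, cyl, am once per stacked variable
ids <- mtcars[rep(seq_len(nrow(mtcars)), times = length(measure)), -measure]

long <- data.frame(ids, variable = stacked$ind, value = stacked$values)
rownames(long) <- NULL
dim(long)
```

The result has 32 rows for each of the 8 stacked variables, which matches the counts in the summary above.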


And then using this new "variable coding" factor to produce multiple plots, one for each variable:

library(ggplot2)

p <- ggplot(mtcars2, aes(value, mpg))

p <- p + geom_point(shape=1)

p <- p + geom_smooth(method = lm)

p <- p + facet_wrap(~ variable, scales="free")

print(p)

[Figure: mpg versus value in eight panels (disp, hp, drat, wt, qsec, vs, gear, carb), each with its own x-axis scale and a linear fit.]

One can then easily use other fit types than the linear one:

library(ggplot2)

p <- ggplot(mtcars2, aes(value, mpg))

p <- p + geom_point(shape=1)

p <- p + geom_smooth(method="loess")

p <- p + facet_wrap(~ variable, scales="free")

print(p)

[Figure: mpg versus value in eight panels (disp, hp, drat, wt, qsec, vs, gear, carb), each with its own x-axis scale and a loess smooth.]
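The loess smooth can also be computed directly with the loess function from base R, which makes it easy to compare with the linear fit at chosen points. A minimal sketch for mpg versus wt only, assuming the default settings of loess (span = 0.75, degree = 2); the evaluation points in `grid` are arbitrary choices inside the observed weight range:

```r
# Local regression and ordinary linear regression of mpg on weight
fit_loess <- loess(mpg ~ wt, data = mtcars)
fit_lm    <- lm(mpg ~ wt, data = mtcars)

# Evaluate both fits at a few weights inside the observed range
grid <- data.frame(wt = c(2, 3, 4, 5))
comparison <- cbind(grid,
                    loess = predict(fit_loess, grid),
                    lm    = predict(fit_lm, grid))
print(comparison)
```

Where the two columns differ clearly, the relation between mpg and wt is not well described by a straight line.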

1.7 Exercises

Exercise 1 Table 2.4 artificial data from the Varmuza book

a) Have a look at the R introduction in Appendix 3 of the Varmuza book: http://www.crcnetbase.com.globalproxy.cvt.dk/doi/pdfplus/10.1201/9781420059496.ax3


(They do not mention Rstudio, but this is still recommended)

b) Consider the (extended) artificial data from Table 2.4:

1. If needed: Install R and Rstudio

2. Start Rstudio

3. Import the Table 2.4 artificial data from the Varmuza book (Hint: Use the file tab24Artificialdata.txt available under File Sharing in Campusnet)

#import:

tab24data <- read.table("Tab24ArtificialData.txt",

header = TRUE, sep = ",", dec = ".")

tab24data

     x0  x1  x2 y
1   0.9 0.8 3.5 1
2   0.2 3.0 4.0 1
3  -0.2 4.2 4.8 1
4  -0.7 6.0 6.0 1
5   0.3 6.7 7.1 1
6   0.8 1.5 1.0 2
7  -1.1 4.0 2.5 2
8  -0.9 5.5 3.0 2
9  -0.7 7.3 3.5 2
10 -0.4 8.5 4.5 2



#Select the relevant columns:

# The first three columns (x0, x1, x2), matching the output below
X <- tab24data[, 1:3]

X

     x0  x1  x2
1   0.9 0.8 3.5
2   0.2 3.0 4.0
3  -0.2 4.2 4.8
4  -0.7 6.0 6.0
5   0.3 6.7 7.1
6   0.8 1.5 1.0
7  -1.1 4.0 2.5
8  -0.9 5.5 3.0
9  -0.7 7.3 3.5
10 -0.4 8.5 4.5

c) Find the "centroid" vector of X, that is, the 3 means:

?apply

apply(X, 2, mean)

x0 x1 x2

-0.18 4.75 3.99
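As a side note, column means can also be found with colMeans. The sketch below types in the x0, x1, x2 values from the table above so that it runs without the data file; the name `centroid` is illustrative:

```r
# Table 2.4 data as printed above (columns x0, x1, x2)
X <- cbind(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)

# colMeans() is a dedicated shortcut for apply(X, 2, mean)
centroid <- colMeans(X)
all.equal(centroid, apply(X, 2, mean))
```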

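d) Find the mean-centered matrix $X_c$:

?scale

# Centre the columns without rescaling (the output below has only a "scaled:center" attribute)
X_cent <- scale(X, scale = FALSE)

X_cent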

x0 x1 x2

[1,] 1.08 -3.95 -0.49

[2,] 0.38 -1.75 0.01

[3,] -0.02 -0.55 0.81

[4,] -0.52 1.25 2.01

[5,] 0.48 1.95 3.11

[6,] 0.98 -3.25 -2.99

[7,] -0.92 -0.75 -1.49

[8,] -0.72 0.75 -0.99

[9,] -0.52 2.55 -0.49

[10,] -0.22 3.75 0.51

attr(,"scaled:center")

x0 x1 x2

-0.18 4.75 3.99
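Mean centering simply subtracts each column mean from its column. A small sketch of doing this by hand with sweep and comparing with scale(X, scale = FALSE); the data are typed in from the table above and `X_cent_manual` is an illustrative name:

```r
X <- cbind(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)

# Subtract the column means from each column (sweep's default FUN is "-")
X_cent_manual <- sweep(X, 2, colMeans(X))

# Same numbers as scale(); check.attributes = FALSE ignores "scaled:center"
all.equal(X_cent_manual, scale(X, scale = FALSE), check.attributes = FALSE)

# The centered columns have mean zero (up to rounding error)
colMeans(X_cent_manual)
```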

e) Find the standardized matrix $X_a$ ("autoscaled"):


apply(X, 2, sd)

x0 x1 x2

0.7036413 2.5074334 1.7400192

X_auto <- scale(X)

X_auto

x0 x1 x2

[1,] 1.53487290 -1.5753160 -0.281606095

[2,] 0.54004787 -0.6979248 0.005747063

[3,] -0.02842357 -0.2193478 0.465512116

[4,] -0.73901288 0.4985177 1.155159696

[5,] 0.68216573 0.7776877 1.787336644

[6,] 1.39275504 -1.2961461 -1.718371886

[7,] -1.30748433 -0.2991106 -0.856312411

[8,] -1.02324860 0.2991106 -0.568959253

[9,] -0.73901288 1.0169762 -0.281606095

[10,] -0.31265930 1.4955532 0.293100221

attr(,"scaled:center")

x0 x1 x2

-0.18 4.75 3.99

attr(,"scaled:scale")

x0 x1 x2

0.7036413 2.5074334 1.7400192
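Autoscaling is centering followed by division by the column standard deviations (computed with n - 1 in the denominator, as sd does). A sketch of the two steps done by hand, with the data typed in from the table above; `X_auto_manual` is an illustrative name:

```r
X <- cbind(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)

# Step 1: centre each column; step 2: divide by its standard deviation
X_cent <- sweep(X, 2, colMeans(X))
X_auto_manual <- sweep(X_cent, 2, apply(X, 2, sd), FUN = "/")

# Same numbers as scale(X), ignoring the attributes that scale() adds
all.equal(X_auto_manual, scale(X), check.attributes = FALSE)

# Each autoscaled column has standard deviation 1
apply(X_auto_manual, 2, sd)
```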

f) Find the sum of squares matrix $X_c^t X_c$:


?cov

t(X_cent) %*% X_cent

x0 x1 x2

x0 4.456 -9.820 -0.798

x1 -9.820 56.585 20.805

x2 -0.798 20.805 27.249

9*cov(X)

x0 x1 x2

x0 4.456 -9.820 -0.798

x1 -9.820 56.585 20.805

x2 -0.798 20.805 27.249
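The factor 9 above is n - 1 with n = 10 observations. A sketch that makes this explicit and uses crossprod, which computes the same product as t(X_cent) %*% X_cent; the data are typed in from the table above and the name `S` is illustrative:

```r
X <- cbind(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)
n <- nrow(X)

X_cent <- scale(X, scale = FALSE)

# crossprod(A) computes t(A) %*% A
S <- crossprod(X_cent)

# The sum of squares matrix is (n - 1) times the covariance matrix
all.equal(S, (n - 1) * cov(X), check.attributes = FALSE)
```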

g) Find the covariance matrix $X_c^t X_c/(n-1)$:

(t(X_cent)%*%X_cent)/9

x0 x1 x2

x0 0.49511111 -1.091111 -0.08866667

x1 -1.09111111 6.287222 2.31166667

x2 -0.08866667 2.311667 3.02766667

cov(X)

x0 x1 x2

x0 0.49511111 -1.091111 -0.08866667

x1 -1.09111111 6.287222 2.31166667

x2 -0.08866667 2.311667 3.02766667

h) Find the correlation matrix $(X_a^t X_a)/(n-1)$:


(t(X_auto)%*%X_auto)/9

x0 x1 x2

x0 1.00000000 -0.6184267 -0.07241942

x1 -0.61842671 1.0000000 0.52983638

x2 -0.07241942 0.5298364 1.00000000

cor(X)

x0 x1 x2

x0 1.00000000 -0.6184267 -0.07241942

x1 -0.61842671 1.0000000 0.52983638

x2 -0.07241942 0.5298364 1.00000000
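The correlation matrix can also be obtained from the covariance matrix by dividing each entry by the product of the two standard deviations. A sketch of this relation, with the data typed in from the table above; `C`, `D_inv` and `R_manual` are illustrative names:

```r
X <- cbind(
  x0 = c(0.9, 0.2, -0.2, -0.7, 0.3, 0.8, -1.1, -0.9, -0.7, -0.4),
  x1 = c(0.8, 3.0, 4.2, 6.0, 6.7, 1.5, 4.0, 5.5, 7.3, 8.5),
  x2 = c(3.5, 4.0, 4.8, 6.0, 7.1, 1.0, 2.5, 3.0, 3.5, 4.5)
)

C <- cov(X)

# Diagonal matrix holding 1 / standard deviation for each variable
D_inv <- diag(1 / sqrt(diag(C)))

# Scale rows and columns of the covariance matrix
R_manual <- D_inv %*% C %*% D_inv
all.equal(R_manual, cor(X), check.attributes = FALSE)

# cov2cor() from base R does the same rescaling
all.equal(cov2cor(C), cor(X))
```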

Exercise 2 Matrix scatterplotting

a) Work your way through some initial explorative analysis of the mtcars data, see the subsection on this above.
