Assignment 2 Solutions - Dalhousie University
Total Value: 40 points (8+12+4+10+6)
QUESTION 1
(a) (VALUE: 2) Plot histograms of the 12 variables x1, y1, z1, ..., x4, y4, z4. Using these histograms, identify variables that have outlying values.
load(url('http://www.mathstat.dal.ca/~aarms2014/StatLearn/data/har.RData'))
source("http://www.mathstat.dal.ca/~aarms2014/StatLearn/R/A2funs.R")
par(mfrow=c(3,4),mar=c(2,2,2,1))
for (i in 7:18) {
  hist(har[,i],nclass=100,main=names(har)[i])
}
[Figure: 3x4 grid of histograms (Frequency vs. value) of x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4]
Since a histogram chooses the range of its horizontal axis to correspond to the range of the data, a histogram with a "spike" in the middle and no apparent data at the extremes must have outlying values. With this in mind, inspecting the histograms suggests that all variables except y2 and z2 have outlying values in at least one of the two "tails" of their distributions. x2 and x4 seem to have outliers in only one direction.
(b) (VALUE: 2) With so many observations we must automate the processing of outliers. The mytrim function (in A2funs.R, see above) will do this. Describe what it does. Is it deleting data or changing it? How?
The mytrim function calculates quantiles q_{p/2} and q_{1-p/2} corresponding to probabilities p/2 and 1 - p/2. If a data value x_i <= q_{p/2} then x_i is set to q_{p/2}; similarly, if x_i >= q_{1-p/2} then x_i is set to q_{1-p/2}. This is evident in the plot, where extremes on the x axis correspond to thresholded values on the y axis. Note that the resulting vector has the same length as the original data vector: observations are not deleted, they are replaced.
mytrim
## function (x, p)
## {
##     q = c(p/2, 1 - p/2)
##     xq = quantile(x, q, na.rm = TRUE)
##     x[x <= xq[1]] = xq[1]
##     x[x >= xq[2]] = xq[2]
##     return(x)
## }
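As a quick numerical check, applying the function body printed above to a vector with one planted extreme in each tail shows values beyond the p/2 and 1 - p/2 quantiles being clamped to those quantiles, while the length stays the same (a sketch, with the test vector made up for illustration):

```r
# mytrim reproduced from the printed source above
mytrim = function(x, p) {
  q = c(p/2, 1 - p/2)                 # lower and upper tail probabilities
  xq = quantile(x, q, na.rm = TRUE)   # the corresponding sample quantiles
  x[x <= xq[1]] = xq[1]               # clamp the lower tail up to the lower quantile
  x[x >= xq[2]] = xq[2]               # clamp the upper tail down to the upper quantile
  return(x)
}

x = c(-1000, 1:98, 1000)  # 100 values with one extreme in each tail
xt = mytrim(x, 0.02)      # trim 1% in each tail
length(xt) == length(x)   # TRUE: observations are replaced, not deleted
range(xt)                 # much narrower than range(x) = c(-1000, 1000)
```

The interior values are left untouched; only the tails are replaced by the threshold quantiles.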
x = rt(10000,df=1)
plot(x,mytrim(x,.005))
[Figure: scatterplot of mytrim(x, 0.005) against x; extreme x values map to constant (thresholded) y values]
(c) (VALUE: 2) Run the mytrim function on the columns x1, ..., z4 using p=0.02. Compare histograms of the data "before" and "after" and comment on the changes.
for (i in 7:18) har[,i]=mytrim(har[,i],p=0.02)
par(mfrow=c(3,4),mar=c(2,2,2,1))
for (i in 7:18) {
  hist(har[,i],nclass=100,main=names(har)[i])
}
[Figure: 3x4 grid of histograms of x1, ..., z4 after trimming with mytrim]
We see that the visible bars in the histograms now extend over the whole range of the data. The small "spikes" at the extremes of the new histograms are data values that have been replaced by the threshold values using mytrim.
(d) (VALUE: 2) Generate a subsample of 5,000 observations and generate 4 different scatterplot matrices using the pairs() command. The first plot should use x1, y1, z1, the second x2, y2, z2, and so on. In each plot, color the points using the class label. Based on how well separated the classes are in the plots, do you think it will be easy, challenging or impossible to classify the 5 different classes?
mysample = sample(nrow(har),5000)
for (i in 0:3)
  pairs(har[mysample,7:9+i*3],col=har$class[mysample])
[Figure: four scatterplot matrices, one per sensor: (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), (x4, y4, z4), points colored by class]
It seems that some of the classes are separated in some of the scatterplots, but there isn't any one plot that separates all the classes. So the task of classifying the 5 categories may be "challenging".
QUESTION 2: Logistic regression with two classes
Preliminary setup for question 2 and also questions 3 and 4. Note that I am removing the columns of har that correspond to the subjects and their height, etc.
templev = levels(har$class)
templev = c('down','down','up','up','up')
har$class2 = har$class
levels(har$class2)=templev
table(har$class,har$class2)
##
##                down    up
##   sitting     50631     0
##   sittingdown 11827     0
##   standing        0 47370
##   standingup      0 12414
##   walking         0 43390
har = har[,-(1:6)]
set.seed(101)
train = sample(nrow(har),size=nrow(har)-20000)
har.test = har[-train,]
(a) (VALUE: 2) Using this two class variable as a response, fit a logistic regression using the 12 variables x1, y1, z1, ..., x4, y4, z4 as predictors. Show the result of summary() for this model.
logit1 = glm(class2~.-class,data=har,subset=train,family=binomial)
summary(logit1)
##
## Call:
## glm(formula = class2 ~ . - class, family = binomial, data = har,
##     subset = train)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -4.143  -0.157   0.104   0.228   4.067
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17e+01   2.57e-01  -84.49   <2e-16 ***
## x1          -4.59e-02   1.37e-03  -33.49   <2e-16 ***
## y1           6.66e-02   8.55e-04   77.91   <2e-16 ***
## z1          -3.10e-02   4.68e-04  -66.22   <2e-16 ***
## x2           1.12e-02   4.71e-04   23.78   <2e-16 ***
## y2           1.54e-02   3.14e-04   48.99   <2e-16 ***
## z2          -2.71e-02   2.69e-04 -100.70   <2e-16 ***
## x3           4.64e-04   3.63e-04    1.28      0.2
## y3          -1.12e-02   4.35e-04  -25.69   <2e-16 ***
## z3           8.50e-03   5.07e-04   16.78   <2e-16 ***
## x4          -6.03e-03   4.82e-04  -12.51   <2e-16 ***
## y4           3.64e-02   9.51e-04   38.29   <2e-16 ***
## z4          -9.48e-02   1.05e-03  -90.22   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 193017 on 145631 degrees of freedom
## Residual deviance:  47065 on 145619 degrees of freedom
## AIC: 47091
##
## Number of Fisher Scoring iterations: 7
(b) (VALUE: 2) Comment on the statistical significance of the estimated coefficients of this model. Do you think that any of these coefficients can be interpreted? What might you need to know to interpret them?
All the predictors have significant coefficients with p-values less than 2e-16, except x3. The corresponding z values suggest that some coefficients are more important than others (e.g., z2 has a z-value of about -100).
The general interpretation of coefficients in a logistic regression is that they are the change in log odds of the Y = 1 class corresponding to a unit change in X, holding all other variables fixed. Here, we don't know enough from the paper or the other available information about what the predictor variables are, or what a unit change means, so we can't comment much on the interpretation of the coefficients.
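If we did know the units, a coefficient could still be turned into an odds ratio by exponentiating it. A sketch using the z2 estimate from the summary above (interpretation assumes all other predictors are held fixed):

```r
beta_z2 = -2.71e-02        # estimated z2 coefficient from summary(logit1)
odds_ratio = exp(beta_z2)  # odds of 'up' are multiplied by this per unit increase in z2
round(odds_ratio, 3)       # 0.973, i.e. about a 2.7% decrease in the odds
```

For small coefficients, exp(beta) is close to 1 + beta, which is why the odds change here is roughly the coefficient itself.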
(c) (VALUE: 2) As I mentioned in the lectures, the performance of a classifier is often measured in terms of its classification accuracy. Both an overall accuracy rate and a within-class rate can be calculated from a frequency table of the observed classes and the predicted classes. Construct such a table for the training data (the table function in R is helpful) and calculate the percentage of correctly classified training observations from it (both an overall rate, and a rate within each class).
logit1.prob = predict(logit1,newdata=har[train,],type="response")
logit1.pred = rep('down',length(train))
logit1.pred[logit1.prob>.5] = 'up'
table(obs = har$class2[train],pred=logit1.pred)
##       pred
## obs     down    up
##   down 49710  5226
##   up    3438 87258
(49710+87258)/length(train)
## [1] 0.9405
49710/(5226+49710)
## [1] 0.9049
87258/(3438+87258)
## [1] 0.9621
So overall we correctly classify about 94%. Within the "down" class it is about 91%, and within the "up" class it is about 96%.
(d) (VALUE: 2) The class.table() function in the file A2funs.R (see above) calculates both the accuracy of the predictions within each class, and an overall "percent correct". Use the class.table() function to assess the accuracy of the predicted classes of the model (for the training data) and explain the results.
class.table(obs=har$class2[train],pred=logit1.pred)
##       pred
## obs    down   up
##   down 90.5  9.5
##   up    3.8 96.2
## overall: 94.1
This function starts with a table like the one in part (c), of observed vs. predicted classes. It then divides each observed row by its row total, giving the percent of predicted categories within each observed category. The diagonals are the percent correct within each observed class. In a 2-class problem, the off-diagonals are the percent incorrect within each observed class.
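The source of class.table() isn't shown here, but based on that description it could be implemented roughly as follows (a hypothetical sketch, not the actual A2funs.R code):

```r
class.table = function(obs, pred) {
  tab = table(obs = obs, pred = pred)                # observed vs. predicted frequencies
  pct = round(100 * prop.table(tab, margin = 1), 1)  # divide each row by its row total
  print(pct)
  cat("overall:", round(100 * mean(obs == pred), 1), "\n")  # overall percent correct
  invisible(pct)
}
```

prop.table(tab, margin = 1) does the row-wise division, so the diagonal of pct holds the within-class percent correct.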
(e) (VALUE: 2) Use class.table() to summarize the test set predictions from the logistic model.
logit1.test.prob = predict(logit1,newdata=har.test,type="response")
logit1.test.pred = rep('down',20000)
logit1.test.pred[logit1.test.prob>.5] = 'up'
class.table(obs=har.test$class2,pred=logit1.test.pred)
##       pred
## obs    down   up
##   down 89.9 10.1
##   up    3.7 96.3
## overall: 93.9
The results are not that different from those for the training set.
(f) (VALUE: 2) In this question, and in the rest of the assignment questions dealing with this data, we did not use the variables user, gender, age, how_tall_in_meters, weight, body_mass_index as predictors. Explain why one might want to exclude these as predictors.
There are only four people (user). This means that the variables listed above will be the same for all observations with the same user, so these predictors wouldn't be able to contribute to the prediction of the classes within the same user. We could have included a user effect by adding user as a factor to the model, and I suppose we could have allowed the logistic regression coefficients to vary across users by including an interaction between user and the 12 other predictors. Such a model wouldn't have much extra predictive power, however.
QUESTION 3. LDA with two classes
(a) (VALUE: 4) Fit an LDA model using the same predictors and responses (and training data) as in question 2. Report on the test set classification accuracy and the other results of class.table().
library(MASS)
l1 = lda(class2~.-class,data=har,subset=train)
# train error
l1.pred = predict(l1,har[train,])$class
class.table(obs=har$class2[train],pred=l1.pred)
##       pred
## obs    down   up
##   down 89.2 10.8
##   up    3.4 96.6
## overall: 93.8
# test error
l1.test.pred = predict(l1,har[-train,])$class
class.table(obs=har$class2[-train],pred=l1.test.pred)
##       pred
## obs    down   up
##   down 89.0 11.0
##   up    3.3 96.7
## overall: 93.8
In both the training set and the test set the overall accuracy is about the same (the training set was not requested in the question). For the test set we have 93.8% overall, with 89% correct in the down class and 96.7% correct in the up class.
QUESTION 4. 5-class discriminant analysis
(a) (VALUE: 4) Fit a 5-class LDA model using the original "class" variable and the same 12 predictors. Report on the test set classification accuracy and the other results of class.table().
l2 = lda(class~.-class2,data=har,subset=train)
l2.test.pred = predict(l2,har.test)$class
class.table(obs=har$class[-train],pred=l2.test.pred)
##              pred
## obs           sitting sittingdown standing standingup walking
##   sitting        99.7         0.2      0.0        0.1     0.0
##   sittingdown    11.3        45.6     24.9       16.5     1.7
##   standing        0.0         0.6     92.7        0.1     6.6
##   standingup     13.5        13.0     22.8       46.2     4.6
##   walking         0.0         1.4     31.8        2.1    64.6
## overall: 80.7
Overall test set accuracy is 80.7%. Accuracy within class varies considerably: 99.7%, 45.6%, 92.7%, 46.2%, 64.6%. The lowest accuracies are in the "sitting down" and "standing up" categories.
(b) (VALUE: 4) Perform the same analysis for a QDA model. Does the use of a QDA model (which in general should be more flexible than LDA) seem to give better predictions? Why?
l3 = qda(class~.-class2,data=har,subset=train)
l3.test.pred = predict(l3,har.test)$class
class.table(obs=har$class[-train],pred=l3.test.pred)
##              pred
## obs           sitting sittingdown standing standingup walking
##   sitting        97.6         1.2      0.0        1.2     0.0
##   sittingdown     3.6        79.8      9.1        4.9     2.5
##   standing        0.0         1.7     95.8        0.9     1.7
##   standingup      0.6        11.1     12.0       70.9     5.4
##   walking         0.0         2.9      6.0        1.6    89.5
## overall: 91.7
Overall accuracy (91.7%) and most within-class accuracies have increased. Notable gains occur for sitting down (45.6% to 79.8%), standing up (46.2% to 70.9%) and walking (64.6% to 89.5%).
With a dataset this large, the chance of overfitting is smaller, so the additional parameters of the QDA model enable it to represent the decision boundaries more accurately.
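To make "additional parameters" concrete: with p = 12 predictors and K = 5 classes, LDA pools one covariance matrix across classes while QDA fits one per class (a back-of-the-envelope count that ignores the class means and priors, which are the same in both models):

```r
p = 12                           # number of predictors
K = 5                            # number of classes
cov_params = p * (p + 1) / 2     # free parameters in one symmetric p x p covariance matrix
lda_cov = cov_params             # LDA: a single pooled covariance matrix
qda_cov = K * cov_params         # QDA: a separate covariance matrix per class
c(lda = lda_cov, qda = qda_cov)  # 78 vs. 390 covariance parameters
```

With roughly 146,000 training observations, even 390 covariance parameters are well supported, which is consistent with QDA's improved test accuracy here.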
(c) (VALUE: 2) In part (b) I called the QDA model "more flexible". Why? What property of the QDA model makes it more flexible than the LDA model?
The QDA model is more flexible because it allows the covariance matrices to differ from class to class. The LDA model is a special case of this model, with a common covariance matrix, so the QDA model can represent both the LDA model and nonlinear (quadratic) boundaries.
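In symbols (the standard normal-theory discriminant functions, stated here for reference rather than taken from the assignment): QDA assigns $x$ to the class $k$ maximizing

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log \pi_k,$$

which is quadratic in $x$. When all $\Sigma_k$ equal a common $\Sigma$, the term $-\tfrac{1}{2}x^T\Sigma^{-1}x$ is shared by every class and cancels in comparisons, leaving the linear LDA discriminant

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k.$$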
QUESTION 5: Cross-validation simulation
(a) (VALUE: 2) The code isn’t doing leave-one-out CV. What is the name of the procedure it is doing?
Inspection of the code indicates that it is doing 5-fold CV to evaluate the performance of each of 4 polynomial models. The for (degree in 1:4) loop is over the polynomial degree, and the for (i in 1:5) loop is over CV folds (identified by the vector fold, which assigns one of the integers 1 to 5 to each observation). It trains on the training set, which is everything except fold i, and predicts for the validation set (the [-train] subscripting).
(b) (VALUE: 2) Using this code you should be able to choose one of the four models as the best one. Which is it? What criterion are you using to make this choice? Are there other models among the four that are nearly as good as your choice?
set.seed(1)
y = rnorm(100)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)
plot(x,y)
[Figure: scatterplot of y vs. x, showing a concave quadratic pattern]
mydata = data.frame(x,y)
set.seed(3)
fold = sample(rep(1:5,20))
mse = rep(0,4)
for (degree in 1:4){
  yhat = rep(0,100)
  for (i in 1:5){
    train = (1:100)[fold!=i]
    lm1.fit = lm(y~poly(x,degree),data = mydata, subset=train)
    yhat[-train] = predict(lm1.fit,mydata)[-train]
  }
  mse[degree] = mean((yhat-y)^2)
}
plot(1:4,mse)
[Figure: plot of cross-validated MSE against polynomial degree 1 to 4]
mse
## [1] 5.717 1.062 1.077 1.091
I would choose the degree=2 model, since it has the smallest cross-validated MSE. The degree=3 and degree=4 models are nearly as good in terms of MSE, but they are more complex, so it makes sense to choose the simplest model with good fit.
(c) (VALUE: 2) In Ch 3 we learned several methods (hypothesis tests for coefficients and stepwise model selection) that could be used to determine the best model. Apply one of these to the data, using all 100 observations. You'll need to write a few lines of R code. Does your result agree with your conclusion in part (b)? Note that if you want step() to be able to remove terms, you can't specify the model using poly(). Instead you must specify the formula in lm as y~x+I(x^2)+I(x^3)+I(x^4). The I() function "insulates" a mathematical expression from being interpreted as a model formula and forces R to evaluate the expression.
mylm = lm(y~x+I(x^2)+I(x^3)+I(x^4),data=data.frame(x,y))
summary(mylm)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4), data = data.frame(x,
##     y))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8913 -0.5244  0.0749  0.5932  2.7796
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13897    0.15973   -0.87   0.3865
## x            0.90980    0.24249    3.75   0.0003 ***
## I(x^2)      -1.72802    0.28379   -6.09  2.4e-08 ***
## I(x^3)       0.00715    0.10832    0.07   0.9475
## I(x^4)      -0.03807    0.08049   -0.47   0.6373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared: 0.813, Adjusted R-squared: 0.806
## F-statistic: 104 on 4 and 95 DF, p-value: <2e-16
step(mylm)
## Start: AIC=13
## y ~ x + I(x^2) + I(x^3) + I(x^4)
##
##          Df Sum of Sq RSS  AIC
## - I(x^3)  1       0.0 103 11.0
## - I(x^4)  1       0.2 103 11.2
## <none>              103 13.0
## - x       1      15.3 118 24.8
## - I(x^2)  1      40.2 143 43.9
##
## Step: AIC=11
## y ~ x + I(x^2) + I(x^4)
##
##          Df Sum of Sq RSS  AIC
## - I(x^4)  1       0.3 103  9.3
## <none>              103 11.0
## - I(x^2)  1      51.1 154 49.2
## - x       1      62.2 165 56.2
##
## Step: AIC=9.32
## y ~ x + I(x^2)
##
##          Df Sum of Sq RSS   AIC
## <none>               103   9.3
## - x       1        68 171  57.6
## - I(x^2)  1       443 547 173.9
##
## Call:
## lm(formula = y ~ x + I(x^2), data = data.frame(x, y))
##
## Coefficients:
## (Intercept)            x       I(x^2)
##     -0.0954       0.8996      -1.8666
We see that backward deletion has removed the terms x^3 and x^4, leaving us with the same model selected by cross-validation.