Assignment 2 Solutions - Dalhousie University
Total Value: 40 points (8+12+4+10+6)
QUESTION 1
(a) (VALUE: 2) Plot histograms of the 12 variables x1, y1, z1, ..., x4, y4, z4. Using these histograms, identify variables that have outlying values.
load(url('http://www.mathstat.dal.ca/~aarms2014/StatLearn/data/har.RData'))
source("http://www.mathstat.dal.ca/~aarms2014/StatLearn/R/A2funs.R")
par(mfrow=c(3,4),mar=c(2,2,2,1))
for (i in 7:18) {
  hist(har[,i],nclass=100,main=names(har)[i])
}
[Figure: 3x4 grid of histograms (Frequency vs. value) of x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4]
Since a histogram chooses the range of its horizontal axis to correspond to the range of the data, a histogram with a "spike" in the middle and no apparent data at the extremes must have outlying values. With this in mind, inspecting the histograms suggests that all variables except y2 and z2 have outlying values in at least one of the two "tails" of their distributions. x2 and x4 seem to have outliers in only one direction.
(b) (VALUE: 2) With so many observations we must automate the processing of outliers. The mytrim function (in A2funs.R, see above) will do this. Describe what it does. Is it deleting data or changing it? How?
The mytrim function calculates quantiles q_{p/2} and q_{1-p/2} corresponding to probabilities p/2 and 1 - p/2. If a data value x_i <= q_{p/2} then x_i is set to q_{p/2}; similarly, if x_i >= q_{1-p/2} then x_i is set to q_{1-p/2}. This is evident in the plot, where extremes on the x axis correspond to thresholded values on the y axis. Note that the resulting vector has the same length as the original data vector: observations are not deleted, they are replaced.
mytrim
## function (x, p)
## {
##     q = c(p/2, 1 - p/2)
##     xq = quantile(x, q, na.rm = TRUE)
##     x[x <= xq[1]] = xq[1]
##     x[x >= xq[2]] = xq[2]
##     return(x)
## }
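As a quick numerical check, applying the function body printed above to a vector with one planted extreme in each tail shows values beyond the p/2 and 1 - p/2 quantiles being clamped to those quantiles, while the length stays the same (a sketch, with the test vector made up for illustration):

```r
# mytrim reproduced from the printed source above
mytrim = function(x, p) {
  q = c(p/2, 1 - p/2)                 # lower and upper tail probabilities
  xq = quantile(x, q, na.rm = TRUE)   # the corresponding sample quantiles
  x[x <= xq[1]] = xq[1]               # clamp the lower tail up to the lower quantile
  x[x >= xq[2]] = xq[2]               # clamp the upper tail down to the upper quantile
  return(x)
}

x = c(-1000, 1:98, 1000)  # 100 values with one extreme in each tail
xt = mytrim(x, 0.02)      # trim 1% in each tail
length(xt) == length(x)   # TRUE: observations are replaced, not deleted
range(xt)                 # much narrower than range(x) = c(-1000, 1000)
```

The interior values are left untouched; only the tails are replaced by the threshold quantiles.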
x = rt(10000,df=1)
plot(x,mytrim(x,.005))
[Figure: scatterplot of mytrim(x, 0.005) against x; extreme x values map to constant (thresholded) y values]
(c) (VALUE: 2) Run the mytrim function on the columns x1, ..., z4 using p=0.02. Compare histograms of the data "before" and "after" and comment on the changes.
for (i in 7:18) har[,i]=mytrim(har[,i],p=0.02)
par(mfrow=c(3,4),mar=c(2,2,2,1))
for (i in 7:18) {
  hist(har[,i],nclass=100,main=names(har)[i])
}
[Figure: 3x4 grid of histograms of x1, ..., z4 after trimming with mytrim]
We see that the visible bars in the histograms now extend over the whole range of the data. The small "spikes" at the extremes of the new histograms are data values that have been replaced by the threshold values using mytrim.
(d) (VALUE: 2) Generate a subsample of 5,000 observations and generate 4 different scatterplot matrices using the pairs() command. The first plot should use x1, y1, z1, the second x2, y2, z2, and so on. In each plot, color the points using the class label. Based on how well separated the classes are in the plots, do you think it will be easy, challenging or impossible to classify the 5 different classes?
mysample = sample(nrow(har),5000)
for (i in 0:3)
  pairs(har[mysample,7:9+i*3],col=har$class[mysample])
[Figure: four scatterplot matrices, one per sensor: (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), (x4, y4, z4), points colored by class]
It seems that some of the classes are separated in some of the scatterplots, but there isn't any one plot that separates all the classes. So the task of classifying the 5 categories may be "challenging".
QUESTION 2: Logistic regression with two classes
Preliminary setup for question 2 and also questions 3 and 4. Note that I am removing the columns of har that correspond to the subjects and their height, etc.
templev = levels(har$class)
templev = c('down','down','up','up','up')
har$class2 = har$class
levels(har$class2)=templev
table(har$class,har$class2)
##
##                down    up
##   sitting     50631     0
##   sittingdown 11827     0
##   standing        0 47370
##   standingup      0 12414
##   walking         0 43390
har = har[,-(1:6)]
set.seed(101)
train = sample(nrow(har),size=nrow(har)-20000)
har.test = har[-train,]
(a) (VALUE: 2) Using this two class variable as a response, fit a logistic regression using the 12 variables x1, y1, z1, ..., x4, y4, z4 as predictors. Show the result of summary() for this model.
logit1 = glm(class2~.-class,data=har,subset=train,family=binomial)
summary(logit1)
##
## Call:
## glm(formula = class2 ~ . - class, family = binomial, data = har,
##     subset = train)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -4.143  -0.157   0.104   0.228   4.067
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17e+01   2.57e-01  -84.49   <2e-16 ***
## x1          -4.59e-02   1.37e-03  -33.49   <2e-16 ***
## y1           6.66e-02   8.55e-04   77.91   <2e-16 ***
## z1          -3.10e-02   4.68e-04  -66.22   <2e-16 ***
## x2           1.12e-02   4.71e-04   23.78   <2e-16 ***
## y2           1.54e-02   3.14e-04   48.99   <2e-16 ***
## z2          -2.71e-02   2.69e-04 -100.70   <2e-16 ***
## x3           4.64e-04   3.63e-04    1.28      0.2
## y3          -1.12e-02   4.35e-04  -25.69   <2e-16 ***
## z3           8.50e-03   5.07e-04   16.78   <2e-16 ***
## x4          -6.03e-03   4.82e-04  -12.51   <2e-16 ***
## y4           3.64e-02   9.51e-04   38.29   <2e-16 ***
## z4          -9.48e-02   1.05e-03  -90.22   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 193017 on 145631 degrees of freedom
## Residual deviance:  47065 on 145619 degrees of freedom
## AIC: 47091
##
## Number of Fisher Scoring iterations: 7
(b) (VALUE: 2) Comment on the statistical significance of the estimated coefficients of this model. Do you think that any of these coefficients can be interpreted? What might you need to know to interpret them?
All the predictors have significant coefficients with p-values less than 2e-16, except x3. The corresponding z values suggest that some coefficients are more important than others (e.g., z2 has a z-value of about -100).
The general interpretation of coefficients in a logistic regression is that they are the change in log odds of the Y = 1 class corresponding to a unit change in X, holding all other variables fixed. Here, we don't know enough from the paper or the other available information about what the predictor variables are, or what a unit change means, so we can't comment much on the interpretation of the coefficients.
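If we did know the units, a coefficient could still be turned into an odds ratio by exponentiating it. A sketch using the z2 estimate from the summary above (interpretation assumes all other predictors are held fixed):

```r
beta_z2 = -2.71e-02        # estimated z2 coefficient from summary(logit1)
odds_ratio = exp(beta_z2)  # odds of 'up' are multiplied by this per unit increase in z2
round(odds_ratio, 3)       # 0.973, i.e. about a 2.7% decrease in the odds
```

For small coefficients, exp(beta) is close to 1 + beta, which is why the odds change here is roughly the coefficient itself.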
(c) (VALUE: 2) As I mentioned in the lectures, the performance of a classifier is often measured in terms of its classification accuracy. Both an overall accuracy rate and a within-class rate can be calculated from a frequency table of the observed classes and the predicted classes. Construct such a table for the training data (the table function in R is helpful) and calculate the percentage of correctly classified training observations from it (both an overall rate, and a rate within each class).
logit1.prob = predict(logit1,newdata=har[train,],type="response")
logit1.pred = rep('down',length(train))
logit1.pred[logit1.prob>.5] = 'up'
table(obs = har$class2[train],pred=logit1.pred)
##       pred
## obs     down    up
##   down 49710  5226
##   up    3438 87258
(49710+87258)/length(train)
## [1] 0.9405
49710/(5226+49710)
## [1] 0.9049
87258/(3438+87258)
## [1] 0.9621
So overall we correctly classify about 94%. Within the "down" class it is about 91%, and within the "up" class it is about 96%.
(d) (VALUE: 2) The class.table() function in the file A2funs.R (see above) calculates both the accuracy of the predictions within each class, and an overall "percent correct". Use the class.table() function to assess the accuracy of the predicted classes of the model (for the training data) and explain the results.
class.table(obs=har$class2[train],pred=logit1.pred)
##       pred
## obs    down   up
##   down 90.5  9.5
##   up    3.8 96.2
## overall: 94.1
This function starts with a table like the one in part (c), of observed vs. predicted classes. It then divides each observed row by its row total, giving the percent of predicted categories within each observed category. The diagonals are the percent correct within each observed class. In a 2-class problem, the off-diagonals are the percent incorrect within each observed class.
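The source of class.table() isn't shown here, but based on that description it could be implemented roughly as follows (a hypothetical sketch, not the actual A2funs.R code):

```r
class.table = function(obs, pred) {
  tab = table(obs = obs, pred = pred)                # observed vs. predicted frequencies
  pct = round(100 * prop.table(tab, margin = 1), 1)  # divide each row by its row total
  print(pct)
  cat("overall:", round(100 * mean(obs == pred), 1), "\n")  # overall percent correct
  invisible(pct)
}
```

prop.table(tab, margin = 1) does the row-wise division, so the diagonal of pct holds the within-class percent correct.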
(e) (VALUE: 2) Use class.table() to summarize the test set predictions from the logistic model.
logit1.test.prob = predict(logit1,newdata=har.test,type="response")
logit1.test.pred = rep('down',20000)
logit1.test.pred[logit1.test.prob>.5] = 'up'
class.table(obs=har.test$class2,pred=logit1.test.pred)
##       pred
## obs    down   up
##   down 89.9 10.1
##   up    3.7 96.3
## overall: 93.9
The results are not that different from those for the training set.
(f) (VALUE: 2) In this question, and in the rest of the assignment questions dealing with this data, we did not use the variables user, gender, age, how_tall_in_meters, weight, body_mass_index as predictors. Explain why one might want to exclude these as predictors.
There are only four people (user). This means that the variables listed above will be the same for all observations with the same user, so these predictors wouldn't be able to contribute to the prediction of the classes within the same user. We could have included a user effect by adding user as a factor to the model, and I suppose we could have allowed the logistic regression coefficients to vary across users by including an interaction between user and the 12 other predictors. Such a model wouldn't have much extra predictive power, however.
QUESTION 3. LDA with two classes
(a) (VALUE: 4) Fit an LDA model using the same predictors and responses (and training data) as in question 2. Report on the test set classification accuracy and the other results of class.table().
library(MASS)
l1 = lda(class2~.-class,data=har,subset=train)
# train error
l1.pred = predict(l1,har[train,])$class
class.table(obs=har$class2[train],pred=l1.pred)
##       pred
## obs    down   up
##   down 89.2 10.8
##   up    3.4 96.6
## overall: 93.8
# test error
l1.test.pred = predict(l1,har[-train,])$class
class.table(obs=har$class2[-train],pred=l1.test.pred)
##       pred
## obs    down   up
##   down 89.0 11.0
##   up    3.3 96.7
## overall: 93.8
In both the training set and the test set the overall accuracy is about the same (the training set was not requested in the question). For the test set we have 93.8% overall, with 89% correct in the down class and 96.7% correct in the up class.
QUESTION 4. 5-class discriminant analysis
(a) (VALUE: 4) Fit a 5-class LDA model using the original "class" variable and the same 12 predictors. Report on the test set classification accuracy and the other results of class.table().
l2 = lda(class~.-class2,data=har,subset=train)
l2.test.pred = predict(l2,har.test)$class
class.table(obs=har$class[-train],pred=l2.test.pred)
##              pred
## obs           sitting sittingdown standing standingup walking
##   sitting        99.7         0.2      0.0        0.1     0.0
##   sittingdown    11.3        45.6     24.9       16.5     1.7
##   standing        0.0         0.6     92.7        0.1     6.6
##   standingup     13.5        13.0     22.8       46.2     4.6
##   walking         0.0         1.4     31.8        2.1    64.6
## overall: 80.7
Overall test set accuracy is 80.7%. Accuracy within class varies considerably: 99.7%, 45.6%, 92.7%, 46.2%, 64.6%. The lowest accuracies are in the "sitting down" and "standing up" categories.
(b) (VALUE: 4) Perform the same analysis for a QDA model. Does the use of a QDA model (which in general should be more flexible than LDA) seem to give better predictions? Why?
l3 = qda(class~.-class2,data=har,subset=train)
l3.test.pred = predict(l3,har.test)$class
class.table(obs=har$class[-train],pred=l3.test.pred)
##              pred
## obs           sitting sittingdown standing standingup walking
##   sitting        97.6         1.2      0.0        1.2     0.0
##   sittingdown     3.6        79.8      9.1        4.9     2.5
##   standing        0.0         1.7     95.8        0.9     1.7
##   standingup      0.6        11.1     12.0       70.9     5.4
##   walking         0.0         2.9      6.0        1.6    89.5
## overall: 91.7
Overall accuracy (91.7%) and most within-class accuracies have increased. Notable gains occur for sitting down (45.6% to 79.8%), standing up (46.2% to 70.9%) and walking (64.6% to 89.5%).
With a dataset this large, the chance of overfitting is smaller, so the additional parameters of the QDA model enable it to represent the decision boundaries more accurately.
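To make "additional parameters" concrete: with p = 12 predictors and K = 5 classes, LDA pools one covariance matrix across classes while QDA fits one per class (a back-of-the-envelope count that ignores the class means and priors, which are the same in both models):

```r
p = 12                           # number of predictors
K = 5                            # number of classes
cov_params = p * (p + 1) / 2     # free parameters in one symmetric p x p covariance matrix
lda_cov = cov_params             # LDA: a single pooled covariance matrix
qda_cov = K * cov_params         # QDA: a separate covariance matrix per class
c(lda = lda_cov, qda = qda_cov)  # 78 vs. 390 covariance parameters
```

With roughly 146,000 training observations, even 390 covariance parameters are well supported, which is consistent with QDA's improved test accuracy here.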
(c) (VALUE: 2) In part (b) I called the QDA model "more flexible". Why? What property of the QDA model makes it more flexible than the LDA model?
The QDA model is more flexible because it allows the covariance matrices to differ from class to class. The LDA model is a special case of this model, with a common covariance matrix, so the QDA model can represent both the LDA model and nonlinear (quadratic) boundaries.
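In symbols (the standard normal-theory discriminant functions, stated here for reference rather than taken from the assignment): QDA assigns $x$ to the class $k$ maximizing

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log \pi_k,$$

which is quadratic in $x$. When all $\Sigma_k$ equal a common $\Sigma$, the term $-\tfrac{1}{2}x^T\Sigma^{-1}x$ is shared by every class and cancels in comparisons, leaving the linear LDA discriminant

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k.$$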
QUESTION 5: Cross-validation simulation
(a) (VALUE: 2) The code isn’t doing leave-one-out CV. What is the name of the procedure it is doing?
Inspection of the code indicates that it is doing 5-fold CV to evaluate the performance of each of 4 polynomial models. The for (degree in 1:4) loop is over the polynomial degree, and the for (i in 1:5) loop is over CV folds (identified by the vector fold, which assigns one of the integers 1 to 5 to each observation). It trains on the training set, which is everything except fold i, and predicts for the validation set (the [-train] subscripting).
(b) (VALUE: 2) Using this code you should be able to choose one of the four models as the best one. Which is it? What criterion are you using to make this choice? Are there other models among the four that are nearly as good as your choice?
set.seed(1)
y = rnorm(100)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)
plot(x,y)
[Figure: scatterplot of y vs. x, showing a concave quadratic pattern]
mydata = data.frame(x,y)
set.seed(3)
fold = sample(rep(1:5,20))
mse = rep(0,4)
for (degree in 1:4){
  yhat = rep(0,100)
  for (i in 1:5){
    train = (1:100)[fold!=i]
    lm1.fit = lm(y~poly(x,degree),data = mydata, subset=train)
    yhat[-train] = predict(lm1.fit,mydata)[-train]
  }
  mse[degree] = mean((yhat-y)^2)
}
plot(1:4,mse)
[Figure: plot of cross-validated MSE against polynomial degree 1 to 4]
mse
## [1] 5.717 1.062 1.077 1.091
I would choose the degree=2 model, since it has the smallest cross-validated MSE. The degree=3 and degree=4 models are nearly as good in terms of MSE, but they are more complex, so it makes sense to choose the simplest model with good fit.
(c) (VALUE: 2) In Ch 3 we learned several methods (hypothesis tests for coefficients and stepwise model selection) that could be used to determine the best model. Apply one of these to the data, using all 100 observations. You'll need to write a few lines of R code. Does your result agree with your conclusion in part (b)? Note that if you want step() to be able to remove terms, you can't specify the model using poly(). Instead you must specify the formula in lm as y~x+I(x^2)+I(x^3)+I(x^4). The I() function "insulates" a mathematical expression from being interpreted as a model formula and forces R to evaluate the expression.
mylm = lm(y~x+I(x^2)+I(x^3)+I(x^4),data=data.frame(x,y))
summary(mylm)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4), data = data.frame(x,
##     y))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8913 -0.5244  0.0749  0.5932  2.7796
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13897    0.15973   -0.87   0.3865
## x            0.90980    0.24249    3.75   0.0003 ***
## I(x^2)      -1.72802    0.28379   -6.09  2.4e-08 ***
## I(x^3)       0.00715    0.10832    0.07   0.9475
## I(x^4)      -0.03807    0.08049   -0.47   0.6373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared: 0.813, Adjusted R-squared: 0.806
## F-statistic: 104 on 4 and 95 DF, p-value: <2e-16
step(mylm)
## Start: AIC=13
## y ~ x + I(x^2) + I(x^3) + I(x^4)
##
##          Df Sum of Sq RSS  AIC
## - I(x^3)  1       0.0 103 11.0
## - I(x^4)  1       0.2 103 11.2
## <none>              103 13.0
## - x       1      15.3 118 24.8
## - I(x^2)  1      40.2 143 43.9
##
## Step: AIC=11
## y ~ x + I(x^2) + I(x^4)
##
##          Df Sum of Sq RSS  AIC
## - I(x^4)  1       0.3 103  9.3
## <none>              103 11.0
## - I(x^2)  1      51.1 154 49.2
## - x       1      62.2 165 56.2
##
## Step: AIC=9.32
## y ~ x + I(x^2)
##
##          Df Sum of Sq RSS   AIC
## <none>               103   9.3
## - x       1        68 171  57.6
## - I(x^2)  1       443 547 173.9
##
## Call:
## lm(formula = y ~ x + I(x^2), data = data.frame(x, y))
##
## Coefficients:
## (Intercept)            x       I(x^2)
##     -0.0954       0.8996      -1.8666
We see that backward deletion has removed the terms x^3 and x^4, leaving us with the same model selected by cross-validation.