
Project #1 Answers STAT 873 Fall 2013

Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

This problem is partially based on an example in Chapters 2 and 5 of Johnson (1998). Below is the description of the data from Johnson:

Forty-eight individuals who had applied for a job with a large firm were interviewed and rated on 15 criteria. Individuals were rated on

The form of their letter of application (FL)
Their appearance (APP)
Their academic ability (AA)
Their likability (LA)
Their self-confidence (SC)
Their lucidity (LC)
Their honesty (HON)
Their salesmanship (SMS)
Their experience (EXP)
Their drive (DRV)
Their ambition (AMB)
Their ability to grasp concepts (GSP)
Their potential (POT)
Their keenness to join (KJ)
Their suitability (SUIT)

Each criterion was evaluated on an integer scale ranging from 0 to 10 with 0 being a very low and very unsatisfactory rating, and 10 being a very high rating.

The data is in the file job_applicant.csv which is available on my website. Using this data, complete the following.

1) (10 points) Examine appropriate plots of the data and interpret them in the context of the problem. In your interpretations, make sure to specifically indicate which individuals are overall good applicants.

I used 32-bit R to complete this project.

In order to represent all 15 variables on a plot at once, the most appropriate plots are stars and parallel coordinate plots.

> set1<-read.table(file = "C:\\chris\\job_applicant.csv", header = TRUE, sep = ",")
> head(set1)



  Applicant FL APP AA LA SC LC HON SMS EXP DRV AMB GSP POT KJ SUIT
1         1  6   7  2  5  8  7   8   8   3   8   9   7   5  7   10
2         2  9  10  5  8 10  9   9  10   5   9   9   8   8  8   10
3         3  7   8  3  6  9  8   9   7   4   9   9   8   6  8   10
4         4  5   6  8  5  6  5   9   2   8   4   5   8   7  6    5
5         5  6   8  8  8  4  4   9   5   8   5   5   8   8  7    7
6         6  7   7  7  6  8  7  10   5   9   6   5   8   6  6    6

> win.graph(width = 11, height = 7)
> stars(x = set1[,-1], draw.segments = TRUE, key.loc = c(20,10), main = "Job applicant star plot", labels = set1$Applicant)

[Figure: Job applicant star plot - one star per applicant (1 to 48), with rays for FL, APP, AA, LA, SC, LC, HON, SMS, EXP, DRV, AMB, GSP, POT, KJ, and SUIT]

Because larger ratings for each question are better, we are looking for stars with large rays extending out from the center. Those stars that have this quality include 39 and 40, where only their appearance (APP) is somewhat in the middle in comparison to others. Applicants 2, 7, 8, 9, 22, 23, and 24 tend to have most of their rays being large. Applicants 28 and 29 tend to have the smallest rays overall, so they may be the least desirable to hire. There are a number of other individuals with small values for some criteria, but large values in other areas. For example, applicants 41 and 42 stand out as having a lot of experience (EXP), good form to their application letters (FL), high suitability (SUIT), and good academic ability (AA), but they lack in other areas.
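As an optional numerical check of these visual impressions (a small sketch, not required for the problem), one could compare each applicant's average rating across the 15 criteria; the applicants with small stars should have the smallest averages:

> #Optional check: average rating per applicant across the 15 criteria
> avg.rating<-rowMeans(set1[,-1])
> names(avg.rating)<-set1$Applicant
> head(sort(avg.rating), n = 5) #Smallest overall averages
> tail(sort(avg.rating), n = 5) #Largest overall averages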



> library(MASS)
> parcoord(x = set1, main = "Job applicant parallel coordinate plot", col = 1:nrow(set1))

[Figure: Job applicant parallel coordinate plot - axes for Applicant, FL, APP, AA, LA, SC, LC, HON, SMS, EXP, DRV, AMB, GSP, POT, KJ, SUIT]

> #Highlight #39 and 40 that stood out in the stars plot
> color.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = "red", no = "black")
> #Also could use
> #color.select<-c(rep(x = "black", times = 38), "red", "red", rep(x = "black", times = 8))
> lwd.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = 2, no = 1) #Helpful due to overlapping of some lines
> parcoord(x = set1, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)



[Figure: Job applicant parallel coordinate plot with #39 and #40 highlighted in red - axes for Applicant, FL, APP, AA, LA, SC, LC, HON, SMS, EXP, DRV, AMB, GSP, POT, KJ, SUIT]

> library(iplots)
> ipcp(vars = set1)



The second plot above shows that #39 and #40 almost always have ratings close to 10, except for APP.

One could do some brushing on the third plot. I used brushing on a few of the variables to determine if large values for a particular variable of interest tended to also coincide with large values for other variables. For example, the suitability of an applicant is likely a very important quality. When I highlight those applicants with ratings of 9 and 10, I obtain the plot below:

[Figure: ipcp() display with applicants rated 9 or 10 on SUIT highlighted]

We see that a mix of scores can occur with some variables such as academic ability (AA), but a much smaller mix of scores occurs for some areas like lucidity (LC) and honesty (HON).
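Because ipcp() is interactive and the brushed display cannot be reproduced directly, a rough static sketch of the same idea is to highlight the applicants with suitability ratings of 9 or 10 in parcoord():

> #Static sketch of the brushing: highlight applicants with SUIT ratings of 9 or 10
> col.suit<-ifelse(test = set1$SUIT >= 9, yes = "red", no = "gray")
> parcoord(x = set1[,-1], main = "Applicants with SUIT of 9 or 10 highlighted", col = col.suit)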

Parallel coordinate plots can be a little more difficult to interpret with this type of data due to the discreteness of the ratings. In particular one needs to realize that some observations may be hidden due to this discreteness. For example, it is difficult to know how many observations truly are at a level of 10 for honesty (HON). While I did not require students to solve this problem, I did expect students to at least identify it (discreteness problems were discussed in class).

One possible way to solve the problem described above is to jitter each of the ratings by a small amount. For example, add a simulated value from a normal distribution with mean 0 and standard deviation 0.1 to each rating. Below is an example:

> N<-nrow(set1)
> set.seed(7811)
> set1.jit<-set1[,-1] + matrix(data = rnorm(n = N*15, mean = 0, sd = 0.1), nrow = N, ncol = 15)
> parcoord(x = set1.jit, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)

[Figure: Jittered job applicant parallel coordinate plot with #39 and #40 highlighted in red - axes for FL, APP, AA, LA, SC, LC, HON, SMS, EXP, DRV, AMB, GSP, POT, KJ, SUIT]

Due to the jittering, one needs to be careful with judging when one observation has a “higher” rating than another observation very close to it.

One group used my parcoord2() function to plot the data in its original units. This is desirable because each variable is measured on a 0 to 10 scale.
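Note that parcoord() rescales each variable to a [0, 1] range before plotting. The sketch below (an illustration with matplot() only, not the parcoord2() function itself) shows one way to keep the original 0 to 10 units:

> #Sketch: parallel coordinate style plot in the original 0 to 10 units
> matplot(x = 1:15, y = t(set1[,-1]), type = "l", lty = 1, col = color.select, lwd = lwd.select, xaxt = "n", ylim = c(0, 10), xlab = "", ylab = "Rating")
> axis(side = 1, at = 1:15, labels = names(set1)[-1], cex.axis = 0.7)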

Trellis plots can be helpful as well. Based on a plot given by a group, below is one possible plot:

> library(lattice)
> var.names<-names(set1)[-1] #Variable names
> set1.long<-reshape(data = set1, idvar = "Applicant", varying = var.names, v.names = "score", timevar = "Criteria", times = var.names, direction = "long")
> row.names(set1.long)<-NULL
> set1.long[set1.long$Applicant == 1,] #Applicant #1
    Applicant Criteria score
1           1       FL     6
49          1      APP     7
97          1       AA     2
145         1       LA     5
193         1       SC     8
241         1       LC     7
289         1      HON     8
337         1      SMS     8
385         1      EXP     3
433         1      DRV     8
481         1      AMB     9
529         1      GSP     7
577         1      POT     5
625         1       KJ     7
673         1     SUIT    10

> #Next, I need to create a variable that nicely displays the panel titles
> set1.long$Applicant2<-as.factor(paste("Applicant #", set1.long$Applicant, sep = ""))
> class(set1.long$Applicant2)
[1] "factor"
> #levels(set1.long$Applicant2) #Notice ordering is not ideal, and this ordering will be used by histogram()
> set1.long$Applicant3<-factor(x = set1.long$Applicant2, levels = paste("Applicant #", 1:48, sep = "")) #Specifies one ordering to the levels of the factor
> levels(set1.long$Applicant3)
 [1] "Applicant #1"  "Applicant #2"  "Applicant #3"  "Applicant #4"  "Applicant #5"  "Applicant #6"  "Applicant #7"
 [8] "Applicant #8"  "Applicant #9"  "Applicant #10" "Applicant #11" "Applicant #12" "Applicant #13" "Applicant #14"
[15] "Applicant #15" "Applicant #16" "Applicant #17" "Applicant #18" "Applicant #19" "Applicant #20" "Applicant #21"
[22] "Applicant #22" "Applicant #23" "Applicant #24" "Applicant #25" "Applicant #26" "Applicant #27" "Applicant #28"
[29] "Applicant #29" "Applicant #30" "Applicant #31" "Applicant #32" "Applicant #33" "Applicant #34" "Applicant #35"
[36] "Applicant #36" "Applicant #37" "Applicant #38" "Applicant #39" "Applicant #40" "Applicant #41" "Applicant #42"
[43] "Applicant #43" "Applicant #44" "Applicant #45" "Applicant #46" "Applicant #47" "Applicant #48"
> win.graph(width = 11)
> histogram(x = ~ score | Applicant3, data = set1.long, type = "count", layout = c(8,6), xlab = "Score", main = "Histograms of scores by applicant")

[Figure: Histograms of scores by applicant - trellis display with one panel per applicant (Applicant #1 through Applicant #48), Score on the x-axis and Count on the y-axis]

This plot helps to show that Applicants #39 and #40 obtain mostly 10’s.
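A quick count of the number of 10’s received by each applicant (an optional check, not part of the required answer) supports this:

> #Optional check: number of ratings equal to 10 for each applicant
> num.tens<-apply(X = set1[,-1] == 10, MARGIN = 1, FUN = sum)
> names(num.tens)<-set1$Applicant
> sort(num.tens, decreasing = TRUE)[1:5] #Applicants with the most 10's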

2) This part involves performing a PCA for the data using the correlation matrix.

a) (3 points) Discuss the positive and negative aspects of using the covariance matrix for a PCA rather than the correlation matrix.



In order to receive full credit for this problem, the answer needs to be in the context of the data being analyzed for the project.

Because all of the variables are measured on the same scale, it may be of interest to use the covariance matrix rather than the correlation matrix.

Below are the standard deviations and variances for the data:

> apply(X = set1[,-1], MARGIN = 2, FUN = sd)
      FL      APP       AA       LA       SC       LC
2.673749 1.966023 1.987550 2.805690 2.418072 3.170048
     HON      SMS      EXP      DRV      AMB      GSP
2.534514 3.439381 3.308529 2.947457 2.935401 3.035254
     POT       KJ     SUIT
3.183443 2.657036 3.300279

> apply(X = set1[,-1], MARGIN = 2, FUN = var)
       FL       APP        AA        LA        SC        LC
 7.148936  3.865248  3.950355  7.871897  5.847074 10.049202
      HON       SMS       EXP       DRV       AMB       GSP
 6.423759 11.829344 10.946365  8.687500  8.616578  9.212766
      POT        KJ      SUIT
10.134309  7.059840 10.891844

Due to some differences between the above values, we see that some variables will play a larger role in the PCA than others. This may or may not be of interest. For example, is an applicant’s salesmanship (SMS) more important than academic ability (AA)? A PCA with the covariance matrix will treat SMS as more important due to its variance being almost 3 times larger than the variance of AA.

Note that Johnson does the PCA with both the covariance and correlation matrices. Personally, I feel more comfortable with using the correlation matrix in this setting.
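For comparison only (a sketch; the remainder of the project uses the correlation matrix), the covariance matrix version is obtained by setting cor = FALSE in princomp():

> #PCA with the covariance matrix, shown only for comparison
> pca.cov<-princomp(x = set1[,-1], cor = FALSE, scores = TRUE)
> summary(pca.cov) #Variables with large variances, like SMS, dominate these components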

b) (5 points) Determine the number of PCs needed when using the correlation matrix.

> #Using x = set1[,-1] can be an easier way to specify the variables when there are a lot of them
> pca.cor<-princomp(x = set1[,-1], cor = TRUE, scores = TRUE)
> summary(pca.cor, loadings = TRUE, cutoff = 0.0)
Importance of components:
                          Comp.1    Comp.2     Comp.3
Standard deviation     2.7411301 1.4339809 1.20657345
Proportion of Variance 0.5009196 0.1370867 0.09705463
Cumulative Proportion  0.5009196 0.6380064 0.73506099
                           Comp.4     Comp.5     Comp.6
Standard deviation     1.09448513 0.85973985 0.70326316
Proportion of Variance 0.07985985 0.04927684 0.03297194
Cumulative Proportion  0.81492084 0.86419768 0.89716961
                           Comp.7     Comp.8     Comp.9
Standard deviation     0.59267346 0.55668844 0.50691374
Proportion of Variance 0.02341746 0.02066013 0.01713077
Cumulative Proportion  0.92058707 0.94124720 0.95837797
                          Comp.10    Comp.11     Comp.12
Standard deviation     0.43001206 0.39074335 0.312350893
Proportion of Variance 0.01232736 0.01017869 0.006504205
Cumulative Proportion  0.97070533 0.98088402 0.987388228
                           Comp.13     Comp.14     Comp.15
Standard deviation     0.298024834 0.254230665 0.189009390
Proportion of Variance 0.005921253 0.004308882 0.002381637
Cumulative Proportion  0.993309481 0.997618363 1.000000000

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL   -0.162  0.429 -0.315 -0.094  0.114  0.621  0.171
APP  -0.213 -0.035  0.023  0.262  0.870 -0.038 -0.010
AA   -0.040  0.237  0.430  0.636 -0.213  0.223  0.311
LA   -0.225 -0.130 -0.466  0.345 -0.129  0.112 -0.131
SC   -0.290 -0.249  0.241 -0.173  0.005  0.020  0.143
LC   -0.315 -0.131  0.150 -0.071 -0.207  0.175 -0.515
HON  -0.158 -0.405 -0.284  0.416 -0.064 -0.304  0.144
SMS  -0.324 -0.029  0.186 -0.198  0.037 -0.118  0.010
EXP  -0.134  0.553 -0.083  0.068 -0.103 -0.367 -0.113
DRV  -0.315  0.046  0.080 -0.156 -0.201 -0.250  0.490
AMB  -0.318 -0.068  0.209 -0.199  0.163  0.113  0.201
GSP  -0.331 -0.023  0.117  0.075 -0.082  0.148 -0.408
POT  -0.333  0.022  0.073  0.188 -0.127  0.059 -0.016
KJ   -0.259 -0.082 -0.467 -0.201 -0.112  0.075  0.247
SUIT -0.236  0.421 -0.089 -0.020  0.081 -0.414 -0.173

     Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
FL   -0.155  0.411   0.033  -0.121  -0.185   0.032   0.091
APP   0.009 -0.071   0.014   0.114  -0.004  -0.314  -0.087
AA    0.043 -0.065   0.290   0.184   0.174   0.038  -0.011
LA    0.308 -0.369   0.172  -0.417   0.025  -0.059   0.149
SC   -0.386  0.126   0.226  -0.423   0.339   0.001  -0.380
LC   -0.024  0.124   0.250  -0.025   0.002  -0.421   0.113
HON  -0.344  0.424  -0.007   0.051  -0.145   0.202   0.217
SMS   0.142 -0.083   0.478   0.171  -0.633   0.299  -0.032
EXP  -0.584 -0.348  -0.012  -0.082  -0.126  -0.138   0.046
DRV   0.256  0.157  -0.199   0.026  -0.062  -0.558   0.225
AMB  -0.041 -0.336  -0.209  -0.179   0.181   0.378   0.524
GSP  -0.106  0.082  -0.335   0.450   0.163   0.109   0.161
POT   0.149 -0.051  -0.555  -0.170  -0.286   0.131  -0.557
KJ   -0.052 -0.275   0.144   0.525   0.286   0.009  -0.301
SUIT  0.382  0.353   0.138  -0.083   0.396   0.292  -0.051

     Comp.15
FL    -0.028
APP    0.022
AA     0.068
LA    -0.298
SC    -0.302
LC     0.499
HON    0.173
SMS   -0.171
EXP   -0.001
DRV   -0.171
AMB    0.302
GSP   -0.527
POT    0.238
KJ     0.203
SUIT   0.099

> plot(pca.cor, type = "lines", main = "Scree plot for job applicant data")



[Figure: Scree plot for job applicant data - component variances for Comp.1 through Comp.15]

Notice the following:
- There are four PCs with eigenvalues greater than 1 (checked numerically below)
- The scree plot tends to level off after four PCs
- 73.5% of the total variability in the data is accounted for by 3 PCs; 81.5% of the total variability in the data is accounted for by 4 PCs

At least three PCs are needed due to the amount of variability being explained. Four PCs may be enough to use.

It is also important to note that the first PC accounts for 50% of the total variation! Thus, only one PC is needed to account for roughly half of the “information” in the data. Given the interpretation of PC #1 (see next part), this component is very important to examine when judging whether or not to hire an applicant.
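Because princomp() reports component standard deviations, the eigenvalues are obtained by squaring them, so the eigenvalue-greater-than-1 count in the list above can be checked directly:

> #Eigenvalues are the squared component standard deviations
> pca.cor$sdev^2
> sum(pca.cor$sdev^2 > 1) #Four components have eigenvalues greater than 1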

c) (8 points) Interpret the PCs chosen from b). Make sure to specifically comment on whether positive or negative scores (or scores close to 0) for a PC would likely be preferred by the firm.

To help see which of the loadings are “away from zero”, one could set the cutoff argument value in summary() to something other than 0.0. For example, below are some of the results when using a value of 0.2.

> summary(pca.cor, loadings = TRUE, cutoff = 0.2)
Importance of components:

<OUTPUT EDITED>

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL           0.429 -0.315                0.621
APP  -0.213                0.262  0.870
AA           0.237  0.430  0.636 -0.213  0.223  0.311
LA   -0.225        -0.466  0.345
SC   -0.290 -0.249  0.241
LC   -0.315                      -0.207        -0.515
HON         -0.405 -0.284  0.416        -0.304
SMS  -0.324
EXP          0.553                      -0.367
DRV  -0.315                      -0.201 -0.250  0.490
AMB  -0.318         0.209                       0.201
GSP  -0.331                                    -0.408
POT  -0.333
KJ   -0.259        -0.467 -0.201                0.247
SUIT -0.236  0.421                      -0.414

<OUTPUT EDITED>

Large values for each of the fifteen original variables would likely be preferred. For example, the larger SUIT is, the more suitable an applicant would be for the firm. This type of thinking then needs to be taken into account when interpreting the PCs. Below are my interpretations:

PC #1: All of the loadings are negative. Thus, the smaller the value of PC #1, the better the applicant. This PC appears to be an overall measure of the applicant.

PC #2: This appears to be a contrast between FL, AA, EXP, SUIT and SC, HON. The interpretation here is difficult. Notice that SC and HON appear to be character issues. AA and EXP are easily measurable qualities (e.g., received all A’s in school or has 10 years of prior experience). Perhaps FL is also easily measurable and reflects experience (e.g., are there misspellings? Is it organized professionally?). Excluding SUIT then, the PC could be a contrast between measurable aspects and more difficult to measure aspects. It would be nice to know more about what SUIT really is in order to better interpret the PC.

PC #3: This appears to be a contrast between AA, SC, AMB and FL, LA, KJ. The interpretation here is again difficult. People who have a lot of self-confidence (SC) and ambition (AMB) are likely to do well in school (large AA). A person who is lucid (LC) is likely to have a good form to their cover letter (FL). I am not sure how KJ factors in with LC and FL.

PC #4: This appears to be a contrast between APP, AA, LA, HON and KJ. The interpretation here is again difficult. Note that 64-bit R users obtained an eigenvector that is -1 times the eigenvector given above.
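The sign of an eigenvector is arbitrary, so this does not change the interpretation; 64-bit users could simply examine the negated loadings to match the output above (a small sketch):

> #Signs of eigenvectors are arbitrary; negating PC #4 gives the same component
> -pca.cor$loadings[,4]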

With respect to PCs #2 – #4, it is not necessarily true that large positive or small negative scores would be desirable. A later part in this project will address this more.

Overall, you can see how it can be difficult to interpret PCs. A subject matter researcher could likely judge better why particular linear combinations of the original variables make sense.

d) (5 points) Examine plots of the PC scores and interpret them in the context of the problem. For example, what do you think of applicant #42?

Below are a number of 3D and 4D plots:

> #Need to change the scale component of pca.cor list
> pca.cor$scale<-apply(X = set1[,-1], MARGIN = 2, FUN = sd)
> score.cor<-predict(pca.cor, newdata = set1)
> #head(score.cor)
> PC3.positive<-score.cor[,3] - min(score.cor[,3]) #Bubble needs to contain all values > 0
> common.limits<-c(min(score.cor[,1:2]), max(score.cor[,1:2]))
> #Different colors for positive and negative PC #3
> pos.PC3<-score.cor[,3]>0
> col.symbol<-ifelse(test = score.cor[,3]>0, yes = "red", no = "blue")
> symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1", ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits, ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> text(x = score.cor[,1], y = score.cor[,2])
> abline(h = 0)
> abline(v = 0)

[Figure: Principal components bubble plot - PC #1 vs. PC #2 with bubble size based on PC #3 (red = positive PC #3, blue = negative), points labeled by applicant number 1 to 48]

> #Due to the overlapping of some points, I constructed the same plot as above but now using identify() to identify particular points.
> symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1", ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits, ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)



> identify(x = score.cor[,1], y = score.cor[,2])
[1]  2  7  8 22 23 24 39 40
> abline(h = 0)
> abline(v = 0)

[Figure: Principal components bubble plot with identify() used to label applicants 2, 7, 8, 22, 23, 24, 39, and 40]

> library(rgl)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], xlab = "PC #1", ylab = "PC #2", zlab = "PC #3", type = "h", xlim = common.limits, ylim = common.limits)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], add = TRUE, col = "red", size = 6)
> persp3d(x = common.limits, y = common.limits, z = matrix(data = c(0,0,0,0), nrow = 2, ncol = 2), add = TRUE, col = "green")
> grid3d(side = c("x", "y", "z"), col = "lightgray")



> score.cor2<-data.frame(Applicant = 1:N, score.cor[,1:4])
> parcoord(x = score.cor2, main = "PC parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)

[Figure: PC parallel coordinate plot with #39 and #40 highlighted in red - axes for Applicant, Comp.1, Comp.2, Comp.3, Comp.4]

Again, PC #1 appears to be an overall measure of the applicant. Because this is the first PC, we see the most variability among all PCs.

Applicant #42 has the largest PC #1 score and the largest PC #2 score. This indicates that his/her ratings overall are very low while also having a large contrast between FL, AA, EXP, SUIT and SC, HON (i.e., he/she has the experience, academic ability, and a suitable, well-formed application, but rates low in areas such as self-confidence and honesty).

Applicants #39 and #40 stand out as having the smallest PC #1 scores. Using this PC alone, they would appear to be the best applicants.

Note that identifying the grouping of observations was not necessary for this problem (it was for the goblet data due to its stated goal).

e) (5 points) Suppose a late applicant submits his/her application after the PCA has been completed. The applicant receives 10’s for all 15 original variables! Using the previous PCA results, discuss how this particular individual would compare to the other applicants.

I chose these ratings for the new applicant because this would seem to be a “perfect” applicant. I put the applicant’s ratings into a new data frame and predicted the PC scores:

> new.app<-data.frame(FL = 10, APP = 10, AA = 10, LA = 10, SC = 10, LC = 10, HON = 10, SMS = 10, EXP = 10, DRV = 10, AMB = 10, GSP = 10, POT = 10, KJ = 10, SUIT = 10)
> new.PC<-predict(pca.cor, newdata = new.app)
> new.PC
        Comp.1   Comp.2     Comp.3 Comp.4    Comp.5
[1,] -5.022355 1.257912 -0.2854768 0.9633 0.1075786
        Comp.6    Comp.7     Comp.8    Comp.9   Comp.10
[1,] 0.1703098 0.7593572 -0.5198497 -0.397986 0.6378478
       Comp.11    Comp.12    Comp.13     Comp.14    Comp.15
[1,] 0.1678422 0.08579758 -0.2559453 0.001761864 0.02957071

We can see that this applicant has the smallest PC #1 value. Notice that PC #2 – #4 are not necessarily the largest or smallest among the other individuals.
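This can be checked directly by comparing the late applicant's PC #1 score to the smallest PC #1 score among the 48 applicants:

> #Check: the late applicant's PC #1 score is below even applicant #40's score
> new.PC[1,1] < min(score.cor[,1])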

This set of scores indicates the types of scores we want to look for with respect to the other applicants in order to choose who to hire.

f) (10 points) If the overall goal is to find the best individuals to hire, where each of the 15 criteria are given equal weighting, suggest which applicants are the best. Remember that it is most desirable to score as high as possible among the fifteen variables, so you need to take this into account when using the PCA to make your judgments.

> order(score.cor[,1])
 [1] 40 39 23  8 22  2 24  7  9 10  3 16 44 12 20  1 46 17
[19] 11 21  6 27 38  5 37 45 13 14  4 18 26 19 15 32 31 36
[37] 25 33 30 43 34 35 41 28 47 29 48 42

> score.cor[c(39,40),1:4]
        Comp.1   Comp.2     Comp.3    Comp.4
[1,] -4.231745 1.359829 -0.7904450 0.3170062
[2,] -4.448426 1.313393 -0.6482829 0.1812216

> symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1", ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = c(-6, 6), ylim = c(-6, 6), panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> identify(x = score.cor[,1], y = score.cor[,2])
[1]  2  7  8 22 23 24 39 40
> abline(h = 0)
> abline(v = 0)
> text(x = new.PC[,1], y = new.PC[, 2], labels = "YES!")

[Figure: Principal components bubble plot with the "perfect" late applicant's scores marked by "YES!" and applicants 2, 7, 8, 22, 23, 24, 39, and 40 labeled]

> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], xlab = "PC #1", ylab = "PC #2", zlab = "PC #3", type = "h", xlim = c(-6,6), ylim = c(-6,6))
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], add = TRUE, col = "red", size = 6)
> persp3d(x = c(-6,6), y = c(-6,6), z = matrix(data = c(0,0,0,0), nrow = 2, ncol = 2), add = TRUE, col = "green")
> grid3d(side = c("x", "y", "z"), col = "lightgray")
> plot3d(x = new.PC[,1], y = new.PC[, 2], z = new.PC[, 3], add = TRUE, col = "blue", size = 12)



With respect to PC #1, we see the best individuals are #40 and #39. The next best individuals include #8, #22, #23, #2, and #24.

In order to choose among these individuals, we could use the results from e) to allow us to determine a “sweet spot” on the PC scores plot. Applicants with similar PC scores to this spot may be the best to hire. Using this as a criterion, applicants #40 and #39 are the best.
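One way to make the "sweet spot" idea concrete (a sketch only, not something I required) is to compute each applicant's Euclidean distance from the perfect applicant's scores on the first four PCs and rank the applicants by that distance:

> #Sketch: distance from each applicant's first four PC scores to the "perfect" applicant's scores
> perfect4<-matrix(data = new.PC[1,1:4], nrow = N, ncol = 4, byrow = TRUE)
> dist.perfect<-sqrt(rowSums((score.cor[,1:4] - perfect4)^2))
> set1$Applicant[order(dist.perfect)][1:7] #Applicants closest to the sweet spot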

Overall, I would expect PC #1 to be used to reduce the number of applicants to consider further. The remaining applicants could then be decided among by looking at the PC scores or simply focusing on particular criteria from among the 15 original variables.
