Chapter 5
ST 370 - Correlation and Regression
Readings: Chapter 11.1-11.4, 11.7.2-11.8, Chapter 12.1-12.2
Recap: So far we’ve learned:
• Why we want a ‘random’ sample and how to achieve it (Sampling Scheme)
• How to use randomization, replication, and control/blocking to create a valid experiment.
• Methods for summarizing and graphing data in meaningful ways.
• Analyzing CRD type designs to investigate which factors are important for a particular response.
• Next we’ll look at how to deal with quantitative explanatory variables and a quantitative response.
Multi-Way ANOVA vs Linear Regression
• Multi-Way ANOVA used with categorical explanatory variable(s) and a quantitative response
– Determines which factors are significantly associated with the response
• Correlation and Linear Regression used with quantitative explanatory variable(s) and a quantitative response
– Determines if there is a significant linear relationship between the variables
– Not necessarily cause and effect!
Scatterplots and Correlation
Goal: Determine the relationship between two quantitative variables!
• Scatterplots can display the relationship between two quantitative variables
• Each dot corresponds to one observation
Example: Using number of cups of coffee to predict quiz score (observational study).
Cups of Coffee   Quiz Score
      10             16
       4              7
       2              5
       3              7
       8             11
       1              5
       2              8
       7              9
       5             10
       6             11
Examining a Scatterplot: We are looking for 3 things
• Form (linear, curved, clustered?)
• Strength (how tightly do the points follow the form?)
• Direction (positive or negative association?)
Correlation
• We will focus on the linear relationship between the variables.
• Correlation is a statistic that measures the strength and direction of the linear association between two quantitative variables.
• No need to distinguish between explanatory variable and response for correlation!
Properties of Correlation:
• Unitless quantity
• ρ and R (or r) are always between −1 and 1
• Sign indicates the direction of association
• Distance from 0 (magnitude) indicates the strength of the association
Interpreting Correlation:
ρ (or R/r) close to 1 = strong positive linear association
ρ (or R/r) close to 0 = little or no linear association
ρ (or R/r) close to −1 = strong negative linear association
http://www.rossmanchance.com/applets/GuessCorrelation.html
What would a correlation of 1 or -1 look like on a scatterplot?
• Not resistant to outliers, why?
Example: For the cups of coffee and quiz score example, we can find the sample correlation, r (which estimates the true underlying population correlation ρ), and interpret it.
Cups of Coffee   Quiz Score
      10             16
       4              7
       2              5
       3              7
       8             11
       1              5
       2              8
       7              9
       5             10
       6             11
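As a check on the hand calculation, the sample correlation can be computed directly from its definition; a minimal sketch in pure Python (variable names are illustrative):

```python
# Sketch: sample correlation r = Sxy / sqrt(Sxx * Syy) for the coffee data
import math

cups  = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]
score = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]

n = len(cups)
xbar = sum(cups) / n
ybar = sum(score) / n

# Sums of squares and cross-products about the means
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(cups, score))
sxx = sum((x - xbar) ** 2 for x in cups)
syy = sum((y - ybar) ** 2 for y in score)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # 0.9109: a strong positive linear association
```

Squaring this value also recovers the r² = 0.8298 quoted later for this example.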
Linear Regression
What if I want to use an explanatory variable (x) to predict a response (y)?
• Simple Linear Regression (SLR) finds the line of best fit between one quantitative explanatory variable x and the response y measured on the same subjects.
• Data in the form of pairs of observations (x1, y1), (x2, y2), . . . , (xn, yn) - same setup as correlation data
Often done using a model of the form:
Yi = β0 + β1xi + Ei
Note: β0 and β1 are true unknown parameters. When we have data we ’fit’ this model and get estimates denoted by b0 (or β̂0) and b1 (or β̂1).
How to get the estimates? Least squares: choose b0 and b1 to minimize the sum of squared residuals, Σ(yi − ŷi)². This gives
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1x̄
Names of line with estimates plugged in: ’fitted line’ or ’regression line’ or ’prediction line’ or ’prediction equation’ or ’least squares line’
Interpretation of line:
Major use of Regression line!!
Useful for predicting the response (called ŷ) for a given value of x
• Just plug the new x into the equation!
• Warning! Extrapolation (predicting outside the observed range of x) is bad!
Interpretation of parameter estimates:
• b0 = β̂0 = sample intercept (estimated mean response when x = 0)
• b1 = β̂1 = sample slope (estimated change in mean response for a one-unit increase in x)
Example: Coffee and Quiz Score
1. Find the fitted least squares regression line.
2. Interpret the slope.
3. Interpret the intercept. Is it meaningful in this example?
4. Jenny happens to stop in the coffee shop. She notices your plot and says, ’I had 5 cups of coffee the night before the quiz. Guess what my score was!’
5. Jenny actually scored a 14. Find her residual.
6. If Enrique drank 12 cups of coffee, can we safely predict his quiz score?
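The computations behind questions 1, 4, and 5 can be sketched in pure Python using the least-squares formulas b1 = Sxy/Sxx and b0 = ȳ − b1x̄ (variable names are mine):

```python
# Sketch: fitted line, prediction for Jenny (x = 5), and her residual
cups  = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]
score = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]

n = len(cups)
xbar, ybar = sum(cups) / n, sum(score) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(cups, score))
sxx = sum((x - xbar) ** 2 for x in cups)

b1 = sxy / sxx          # fitted slope
b0 = ybar - b1 * xbar   # fitted intercept

y_hat_jenny = b0 + b1 * 5        # plug x = 5 into the fitted line
residual    = 14 - y_hat_jenny   # observed minus predicted

print(round(b0, 2), round(b1, 2))
print(round(y_hat_jenny, 2), round(residual, 2))
```

For question 6, note that predicting at 12 cups would be extrapolation: the observed x values only run from 1 to 10, so the fitted line gives no safe prediction there.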
Note: Always Graph your Data
• To check linearity assumption.
• Four data sets below give the same fitted line, but only the first one makes sense!
Note: Causation vs Correlation
• Association between variables does not mean that one variable causes the changes in another.
• Only well designed experiments can establish that an explanatory variable causes a change in a response variable.
How do we know we have a Significant Linear Relationship?
IDEA:
• Given data, we find the fitted line ŷ = b0 + b1x
• This is an estimate of the (true) mean response
• If we did another experiment would we get the same data??
• Need to do a statistical test to see if there is a significant linear relationship between the explanatory and response variable.
What ’hypothesis’ do we want to test? H0: β1 = 0 (no linear relationship) vs H1: β1 ≠ 0
How to investigate this hypothesis? ........ ANOVA!
ANOVA = ANalysis Of VAriance. What variation are we measuring here?
Which line above gives more evidence of a significant linear relationship? Why?
A measure of the total amount of variability in the response is the sample variance of Y. Consider the numerator, which we call the total sum of squares:

SS(Tot) = Σᵢ₌₁ⁿ (yi − ȳ)²   (variability of observations about the mean),   dfTot = n − 1

This variability is partitioned into independent components:

Sum of squares due to regression, SS(R) (or sum of squares model, SS(M)):

SS(R) = Σᵢ₌₁ⁿ (ŷi − ȳ)²   (variability of the fitted line about the mean),   dfR = 1

Sum of squares due to error, SS(E):

SS(E) = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yi − ŷi)²   (variability of observations about the fitted line),   dfE = n − 2

Note: SS(Tot) = SS(R) + SS(E) and dfTot = dfR + dfE
(Recall: DF = number of independent pieces of data for the sum of squares.)
The ANOVA table from simple linear regression:

Source       df      Sum of Squares   Mean Square   F-Ratio
Regression   1       SS(R)            MS(R)         MS(R)/MS(E)
Error        n − 2   SS(E)            MS(E)
Total        n − 1   SS(Tot)
Meaning of pieces very similar to Multi-Way ANOVA!
• mean squares - represent standardized measures of variation due to the different sources and are given by SS(source)/df(source).
• F-Ratio - Ratios of mean squares often follow an F-distribution and are appropriate for testing different hypotheses of interest.
In this case, to test H0: β1 = 0 vs H1: β1 ≠ 0,
use the test statistic F = MS(R)/MS(E).
This is used to find a p-value which can then be used to determine if we have a statistically significant result.
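As a sketch, the sums of squares and F-ratio for the coffee example can be computed directly from their definitions (pure Python, illustrative variable names); this also verifies the partition SS(Tot) = SS(R) + SS(E):

```python
# Sketch: ANOVA quantities for the coffee data
cups  = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]
score = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]

n = len(cups)
xbar, ybar = sum(cups) / n, sum(score) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(cups, score))
sxx = sum((x - xbar) ** 2 for x in cups)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in cups]

ss_tot = sum((y - ybar) ** 2 for y in score)                # df = n - 1
ss_r   = sum((f - ybar) ** 2 for f in fitted)               # df = 1
ss_e   = sum((y - f) ** 2 for y, f in zip(score, fitted))   # df = n - 2

ms_r = ss_r / 1
ms_e = ss_e / (n - 2)
f_ratio = ms_r / ms_e

print(round(ss_tot, 2), round(ss_r + ss_e, 2))  # the two totals agree
print(round(f_ratio, 2))
```

A large F-ratio like this one gives strong evidence against H0: β1 = 0; comparing it to an F distribution with 1 and n − 2 degrees of freedom yields the p-value.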
Conducting the analysis in Statcrunch
r², the ’coefficient of determination’
r² = SS(Model)/SS(Tot) - Interpretation: the proportion of the variability in the response explained by the linear regression.
1. A high r² means the line is useful for prediction.
2. In the coffee example, we obtained r² = 0.8298. Interpretation: about 83% of the variability in quiz scores is explained by the linear regression on cups of coffee.
Example: One type of fuel is biodiesel, which comes from plants. An experiment was done to determine how much biodiesel could be generated from a certain type of plant grown in different media. The final biomass was also recorded on the 44 plants from the experiment. Let’s consider these two variables, the log of biodiesel and biomass.
proc reg data=bioexp;
  model logbiodiesel = biomass / clb;
run;
1. Describe the scatterplot (3 things!).
2. Give the value of the sample correlation.
Output From Proc Reg for Biomass and Log(Biodiesel) Example 1
The REG Procedure
Model: MODEL1
Dependent Variable: logBiodiesel
Number of Observations Read 44
Number of Observations Used 44
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         26.76509      26.76509     21.14   <.0001
Error             42         53.18557       1.26632
Corrected Total   43         79.95066

Root MSE          1.12531   R-Square   0.3348
Dependent Mean    2.46474   Adj R-Sq   0.3189
Coeff Var        45.65627
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1              0.27977          0.50463      0.55     0.5822   -0.73862   1.29816
biomass      1              0.19769          0.04300      4.60     <.0001    0.11091   0.28447
3. State and interpret the value of the MSE.
4. Determine if there is a significant linear relationship between these variables. State your reasoning and your hypothesis.
5. Give the fitted line and use it to predict for a biomass of 10. If you have an observed value of 4 for a biomass of 10, what is the residual?
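For question 5, a quick sketch using the parameter estimates from the output above (intercept 0.27977, slope 0.19769):

```python
# Sketch: prediction and residual from the fitted biodiesel line
b0, b1 = 0.27977, 0.19769   # from the Parameter Estimates table

biomass = 10
y_hat = b0 + b1 * biomass   # predicted log(biodiesel) at biomass 10
residual = 4 - y_hat        # observed 4, so residual = observed - predicted

print(round(y_hat, 3), round(residual, 3))
```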
Consider the regression of Y = brain size (MRI counts) on x = Height (in). Use the output below to answer the following:
1. Interpret the scatterplot and state the correlation.
2. What is the % of variability in the MRI counts explained by the regression on height?
3. Determine the fitted regression line.
4. Is there a significant linear relationship between the variables? Why or why not?
5. Does the intercept have a meaningful interpretation?
6. Predict brain size for an individual that is 60 inches tall.
Multiple Linear Regression (Optional Reading)
These ideas can easily be extended to the case of more than 1 quantitative explanatory variable. This is called Multiple Linear Regression:
Model Yi = β0 + β1xi1 + β2xi2 + ...+ βpxip + Ei
• βj are called ’partial slopes’
• xp can be quadratic values, cubics, or interaction terms, e.g. x1², x1³, x1 ∗ x2
• fitted model, ŷ = b0 + b1x1 + b2x2 + ... + bpxp
• Model often still fit using ’least squares’, i.e. minimize SS(E) = Σᵢ₌₁ⁿ (yi − ŷi)²
Example (taken from Probability and Statistics, Devore): Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. A study was done on 13 soil samples that measured Y = phosphate adsorption index, X1 = amount of extractable aluminum, and X2 = amount of extractable iron. The data are given below:
Adsorption   Label   Aluminum   Label   Iron   Label
 4           y1       13        x1,1     61    x1,2
18           y2       21        x2,1    175    x2,2
14           y3       24        x3,1    111    x3,2
18           y4       23        x4,1    124    x4,2
26           y5       64        x5,1    130    x5,2
26           y6       38        x6,1    173    x6,2
21           y7       33        x7,1    169    x7,2
30           y8       61        x8,1    169    x8,2
28           y9       39        x9,1    160    x9,2
36           y10      71        x10,1   244    x10,2
65           y11     112        x11,1   257    x11,2
62           y12      88        x12,1   333    x12,2
40           y13      54        x13,1   199    x13,2
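As a sketch, this model can be fit outside SAS by least squares using numpy (illustrative variable names; the estimates should agree with the proc reg output that follows):

```python
# Sketch: multiple linear regression for the adsorption data via least squares
import numpy as np

adsorption = [4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40]
aluminum   = [13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54]
iron       = [61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199]

# Design matrix: intercept column, then the two explanatory variables
X = np.column_stack([np.ones(len(adsorption)), aluminum, iron])
y = np.array(adsorption, dtype=float)

# Least squares: minimizes SS(E) = sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b
print(np.round(b, 4))   # intercept, partial slope for aluminum, for iron
```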
Output From Proc Reg for Adsorption Example 1
The REG Procedure
Model: MODEL1
Dependent Variable: adsorp
Number of Observations Read 14
Number of Observations Used 13
Number of Observations with Missing Values 1
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       3529.90308    1764.95154     92.03   <.0001
Error             10        191.78922      19.17892
Corrected Total   12       3721.69231

Root MSE          4.37937   R-Square   0.9485
Dependent Mean   29.84615   Adj R-Sq   0.9382
Coeff Var        14.67316
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             -7.35066          3.48467     -2.11     0.0611   -15.11498   0.41366
aluminum     1              0.34900          0.07131      4.89     0.0006     0.19012   0.50788
iron         1              0.11273          0.02969      3.80     0.0035     0.04658   0.17889