
Chapter 5

ST 370 - Correlation and Regression

Readings: Chapter 11.1-11.4, 11.7.2-11.8, Chapter 12.1-12.2

Recap: So far we’ve learned:

• Why we want a ‘random’ sample and how to achieve it (Sampling Scheme)

• How to use randomization, replication, and control/blocking to create a valid experi-ment.

• Methods for summarizing and graphing data in meaningful ways.

• Analyzing CRD type designs to investigate which factors are important for a particularresponse.

• Next we’ll look at how to deal with quantitative explanatory variables and a quantita-tive response.

Multi-Way ANOVA vs Linear Regression

• Multi-Way ANOVA: used with multiple categorical variable(s), i.e. factors, and a quantitative response

– Determines which factors are significantly associated with the response

• Correlation and Linear Regression: used with quantitative (mostly continuous) variable(s) and a quantitative response

– Determines if there is a significant linear relationship between the variables

– Not necessarily cause and effect!

Scatterplots and Correlation

Goal: Determine the relationship between two quantitative variables!

• Scatterplots can display the relationship between two quantitative variables

• Each dot corresponds to one observation

Example: Using number of cups of coffee to predict quiz score (observational study).

Cups of Coffee    Quiz Score
      10              16
       4               7
       2               5
       3               7
       8              11
       1               5
       2               8
       7               9
       5              10
       6              11


Examining a Scatterplot: We are looking for 3 things: form, strength, and direction.

Form

Strength

Direction

Correlation

• We will focus on the linear association between two variables.

• Correlation is a statistic that measures the strength and direction of the linear association between two quantitative variables.

• ρ = σ_XY / (σ_X σ_Y) is the population correlation parameter; R (or r) = S_XY / (S_X S_Y) is the sample correlation.

• No need to distinguish between explanatory variable and response for correlation!

Properties of Correlation:

• Unitless quantity

• ρ and R (or r) are always between -1 and +1

• Sign indicates the direction of association

• Distance from 0 (magnitude) indicates the strength of the association

Interpreting Correlation:

ρ (or R/r) close to 1 = strong positive linear association

ρ (or R/r) close to 0 = no linear association

ρ (or R/r) close to -1 = strong negative linear association

http://www.rossmanchance.com/applets/GuessCorrelation.html

What would a correlation of 1 or -1 look like on a scatterplot? R = 1: a straight increasing line; R = -1: a straight decreasing line.

• Not resistant to outliers. Why? Because its calculation depends on the sample means and sample variances, which are themselves sensitive to outliers.

Example: For the cups of coffee and quiz score example, we can find the sample correlation, r (which estimates the true underlying population correlation ρ), and interpret it.

Cups of Coffee    Quiz Score
      10              16
       4               7
       2               5
       3               7
       8              11
       1               5
       2               8
       7               9
       5              10
       6              11

Summary statistics: x̄ = 4.8; ȳ = 8.9; S_XY = 8.87; S_XX = 8.62, so S_X = 2.94; S_YY = 10.99, so S_Y = 3.31.

r = S_XY / (S_X · S_Y) = 8.87 / (2.94 × 3.31) = 0.91
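As a quick check, the sample correlation can be reproduced from the data table. This Python sketch is purely illustrative (the course itself uses StatCrunch/SAS); the data values are the ten (x, y) pairs above:

```python
# Sample correlation for the coffee/quiz data (values from the notes).
from math import sqrt

x = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]      # cups of coffee
y = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]   # quiz scores

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = s_xy / (s_x * s_y)
print(round(r, 2))  # 0.91
```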

Linear Regression

What if I want to use an explanatory variable (x) to predict a response (y)?

• Simple Linear Regression (SLR) finds the line of best fit between one quantitative explanatory variable x and the response y measured on the same subjects.

• Data in the form of pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), the same set-up as correlation data

Often done using a model of the form:

E(Y|x) = β0 + β1x, or equivalently Y = β0 + β1x + ε,

where ε is the random error term with mean 0 and variance σ².

Note: β0 and β1 are true unknown parameters. When we have data we 'fit' this model and get estimates denoted by β̂0 and β̂1.

How to get the estimates? Use the method of least squares:

L = ∑_{i=1}^{n} (y_i − β0 − β1x_i)²

We want to find β0 and β1 to minimize L. The resulting minimizers are called the least squares estimators:

β̂1 = S_xy / S_xx,   β̂0 = ȳ − β̂1x̄,

where S_xy = ∑_{i=1}^{n} (y_i − ȳ)(x_i − x̄), S_xx = ∑_{i=1}^{n} (x_i − x̄)², and x̄ and ȳ are the sample averages of the x's and y's respectively.

Names of line with estimates plugged in: 'fitted line', 'regression line', 'prediction line', 'prediction equation', or 'least squares line':

ŷ = β̂0 + β̂1x
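The least squares formulas above can be sketched in a few lines of Python (illustrative only, not the course software), using the coffee/quiz data from the earlier correlation example:

```python
# Least squares estimates for the coffee/quiz data, following the
# formulas beta1_hat = S_xy / S_xx and beta0_hat = ybar - beta1_hat * xbar.
x = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]
y = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
s_xx = sum((xi - xbar) ** 2 for xi in x)

beta1_hat = s_xy / s_xx              # slope
beta0_hat = ybar - beta1_hat * xbar  # intercept
print(round(beta0_hat, 2), round(beta1_hat, 2))  # 3.96 1.03
```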

Interpretation of line:

Major use of Regression line!!

Useful for predicting the response (ŷ) for a given value of x

• Just plug the new x into the equation: ŷ = β̂0 + β̂1x_new

• Warning! Extrapolation is bad! Extrapolation means predicting the response for an x value that is outside the range of the x values in the data used to fit the least squares line, for example predicting the response at x = 0 when none of the observed x values are near 0.

Interpretation of parameter estimates:

• β̂0 = b0 = sample intercept: where the fitted line crosses the y-axis, i.e. the value of ŷ when x = 0.

• β̂1 = b1 = sample slope: the change in the fitted response (ŷ) when the explanatory variable (x) increases by 1 unit. It can be positive or negative, depending on the sign of β̂1.

Residual: e_i = y_i − (β̂0 + β̂1x_i) = y_i − ŷ_i, which measures the difference between the actual response value and the fitted/predicted response value from the least squares line. Here ŷ_i = β̂0 + β̂1x_i is called the fitted (or predicted) value at x = x_i.

The estimated error variance: σ̂² = SS(E)/(n − 2) = MSE, where n is the number of observations (x_i, y_i) and SS(E) = ∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} (y_i − ŷ_i)².

Example: Coffee and Quiz Score

1. Find the fitted least squares regression line.
β̂1 = S_xy / S_xx = 8.87 / 8.62 = 1.03; β̂0 = ȳ − β̂1x̄ = 8.9 − 1.03 × 4.8 = 3.96. Fitted line: ŷ = 3.96 + 1.03x.

2. Interpret the slope.
For each additional cup of coffee, the predicted quiz score increases by 1.03 points.

3. Interpret the intercept. Is it meaningful in this example?
It is the predicted quiz score when drinking 0 cups of coffee. Here 0 is a meaningful value for x, so the intercept is meaningful.

4. Jenny happens to stop in the coffee shop. She notices your plot and says, 'I had 5 cups of coffee the night before the quiz. Guess what my score was!'
ŷ = 3.96 + 1.03 × 5 = 9.11.

5. Jenny actually scored a 14. Find her residual.
e = 14 − 9.11 = 4.89.

6. If Enrique drank 12 cups of coffee, can we safely predict his quiz score?
We should be cautious, since 12 is outside the range of the x values in our data (1 to 10). If the same linear relationship still holds, the predicted quiz score is 3.96 + 1.03 × 12 = 16.32.

Note: Always Graph your Data

• To check the linearity assumption.

• Four data sets below give the same fitted line, but only the first one makes sense! In the others, the relationship is not linear, there is an outlier in the data, or the explanatory variable is not quantitative.

Note: Causation vs Correlation

• Association between variables does not mean that one variable causes the changes in another.

• Only well designed experiments can establish that an explanatory variable causes a change in a response variable.

How do we know we have a Significant Linear Relationship?

IDEA:

• Given data, we find the best fitted line using the least squares method: ŷ = β̂0 + β̂1x

• This is an estimate of the (true) mean response.

• If we did another experiment, would we get the same data??

• Need to do a statistical test to see if there is a significant linear relationship between the explanatory and response variable. In our example, we want to know whether there is a statistically significant relationship between cups of coffee and quiz score.

What 'hypothesis' do we want to test?

H0: β1 = 0 vs. Ha: β1 ≠ 0. Under H0, the linear model becomes Y = β0 + ε; in other words, the responses are randomly distributed around a constant β0 and have no relationship with the explanatory variable X. Under Ha, there is a linear relationship between the response variable and explanatory variable, and the direction of the association (positive vs. negative) depends on the sign of β1.

How to investigate this hypothesis? ........ ANOVA!

ANOVA = ANalysis Of VAriance. What variation are we measuring here?

Which line above gives more evidence of a significant linear relationship? Why? The left panel, because the data are less spread around the fitted line; in other words, the fitted line explains more of the variation in the data compared with the right panel.

A measure of the total amount of variability in the response is the sample variance of Y. Consider the numerator, which we call the total sum of squares:

SS(Tot) = ∑_{i=1}^{n} (y_i − ȳ)²   (variability of observations about the mean),   df_Tot = n − 1

This variability is partitioned into independent components.

Sum of squares due to regression, SS(R) (or sum of squares model, SS(M)):

SS(R) = ∑_{i=1}^{n} (ŷ_i − ȳ)²   (variability of the fitted line about the mean),   df_R = 1

Sum of squares due to error, SS(E):

SS(E) = ∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} (y_i − ŷ_i)²   (variability of observations about the fitted line),   df_E = n − 2

Note: SS(Tot) = SS(R) + SS(E) and df_Tot = df_R + df_E. (Recall: df = number of independent pieces of data for the sum of squares.)

The ANOVA table from simple linear regression:

Source        df      Sum of Squares    Mean Square    F-Ratio
Regression    1       SS(R)             MS(R)          MS(R)/MS(E)
Error         n − 2   SS(E)             MS(E)
Total         n − 1   SS(Tot)

Meaning of pieces very similar to Multi-Way ANOVA!

• Mean squares represent standardized measures of variation due to the different sources and are given by SS(source)/df_source. In particular, MS(E) is an estimator of σ².

• F-Ratio: ratios of mean squares often follow an F-distribution and are appropriate for testing different hypotheses of interest.

In this case, to test (H0) β1 = 0 vs (Ha) β1 ≠ 0, use the test statistic F = MS(R)/MS(E). This is used to find a p-value, which can then be used to determine if we have a statistically significant result. The larger the F statistic, the smaller the p-value, and thus the stronger the evidence to reject the null hypothesis and claim that the response and explanatory variable have a significant linear relationship.
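To make the decomposition concrete, here is an illustrative Python sketch (not from the notes) that computes SS(Tot), SS(R), SS(E), and the F ratio for the coffee/quiz data:

```python
# ANOVA decomposition and F statistic for the coffee/quiz data,
# following the sums-of-squares formulas above.
x = [10, 4, 2, 3, 8, 1, 2, 7, 5, 6]
y = [16, 7, 5, 7, 11, 5, 8, 9, 10, 11]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
s_xx = sum((xi - xbar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - ybar) ** 2 for yi in y)             # total SS, df = n - 1
ss_r = sum((yh - ybar) ** 2 for yh in yhat)            # regression SS, df = 1
ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error SS, df = n - 2

f_stat = (ss_r / 1) / (ss_e / (n - 2))                 # MS(R) / MS(E)
print(round(ss_tot, 1), round(ss_r, 1), round(ss_e, 1), round(f_stat, 1))  # 98.9 82.1 16.8 39.0
```

Note that SS(R) + SS(E) reproduces SS(Tot), as the partition above requires.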

Conducting the analysis in Statcrunch

Questions:

1. Is there a statistically significant linear relationship between cups of coffee and quiz score? Why?

2. What is the error sum of squares?

3. What is the estimated error variance?

4. What are the estimated standard errors of the least squares estimates, β̂0 and β̂1?

r², the 'coefficient of determination'

r² = SS(Model)/SS(Tot). Interpretation: it measures the proportion of variation in Y that can be explained by the fitted least squares line based on X. It ranges from 0 to 100% (and is often expressed as a percentage), and it is equal to the square of the sample correlation.

1. High r² means the line is good for prediction.

2. In the coffee example, we obtained r² = 0.8298. Interpretation: about 83% of the variation in the quiz scores can be explained by the fitted least squares line based on the cups of coffee.

Some remarks:

• A steep slope (positive or negative) does not necessarily indicate a strong linear relationship, and a small (in absolute value) slope does not indicate a weak one; we can have a perfect linear relationship along a fairly flat line.

• That is why we use correlation, a unit-free measure, to express the strength and direction of the linear relationship between X and Y. The sample correlation r measures the strength and direction of the linear relationship, and the sign of r is the same as the sign of the slope estimate.

• The slope estimate and correlation are related by β̂1 = sqrt(S_YY / S_XX) · r.

Example: One type of fuel is biodiesel, which comes from plants. An experiment was done to determine how much biodiesel could be generated from a certain type of plant grown in different media. The final biomass was also recorded on the 44 plants from the experiment. Let's consider these two variables, the log of biodiesel and biomass.

proc reg data=bioexp;

model logbiodiesel=biomass/clb;

run;

1. Describe the scatterplot (3 things!).
Form: linear; strength: moderate; direction: positive association.

2. Give the value of the sample correlation.
r = +sqrt(0.3348) = 0.58 (positive, matching the sign of the slope estimate).

Note: the 95% prediction interval is usually wider than the 95% (estimation) confidence interval. Recall our linear model is Y = β0 + β1x + ε; the (estimation) confidence interval is for ŷ = β̂0 + β̂1x, whereas the prediction interval is for y_pred = β̂0 + β̂1x + ε.

Output From Proc Reg for Biomass and Log(Biodiesel) Example

The REG Procedure
Model: MODEL1
Dependent Variable: logBiodiesel

Number of Observations Read    44
Number of Observations Used    44

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          26.76509       26.76509      21.14    <.0001
Error              42          53.18557        1.26632
Corrected Total    43          79.95066

Root MSE            1.12531    R-Square    0.3348
Dependent Mean      2.46474    Adj R-Sq    0.3189
Coeff Var          45.65627

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    95% Confidence Limits
Intercept     1              0.27977            0.50463       0.55      0.5822    -0.73862    1.29816
biomass       1              0.19769            0.04300       4.60      <.0001     0.11091    0.28447

What is the percentage of variation in log of biodiesel that can be explained by the fitted line based on biomass? 33.5% (the R-Square value).

What is the fitted least squares line? ŷ = 0.28 + 0.20 × biomass.

3. State and interpret the value of the MSE.
MSE = 1.27, which is an unbiased estimate of the error variance σ².

4. Determine if there is a significant linear relationship between these variables. State your reasoning and your hypothesis.
H0: β1 = 0 vs. Ha: β1 ≠ 0. There is a significant linear relationship, since the p-value in the ANOVA table is <.0001.

5. Give the fitted line and use it to predict for a biomass of 10. If you have an observed value of 4 for a biomass of 10, what is the residual?
ŷ = 0.28 + 0.20 × 10 = 2.28; residual = y − ŷ = 4 − 2.28 = 1.72.
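The F value and R-Square in the output can be reproduced from the printed sums of squares. A small Python cross-check (not part of the SAS analysis):

```python
# Cross-checking the SAS ANOVA output: F = MS(Model)/MS(Error) and
# R-Square = SS(Model)/SS(Total), using the printed sums of squares.
ss_model, df_model = 26.76509, 1
ss_error, df_error = 53.18557, 42
ss_total = 79.95066

f_value = (ss_model / df_model) / (ss_error / df_error)
r_square = ss_model / ss_total
print(round(f_value, 2), round(r_square, 4))  # 21.14 0.3348
```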

Consider the regression of Y = brain size (MRI counts) on x = Height (in). Use the output below to answer the following:

1. Interpret the scatterplot and state the correlation.
Correlation = 0.6.

2. What is the % of variability in the MRI counts explained by the fitted line?
36.2%.

3. Determine the fitted regression line.
MRI_Count-hat = 153834.09 + 11022.732 × Height.

4. Is there a significant linear relationship between the variables? Why or why not?
Yes, there is, because the p-value in the ANOVA table is <0.0001.

5. Does the intercept have a meaningful interpretation?
No, since no one has a height of 0.

6. Predict brain size for an individual that is 60 inches tall.
MRI_Count-hat = 153834.09 + 11022.732 × 60 = 815198.01.

Multiple Linear Regression (Optional Reading)

These ideas can easily be extended to the case of more than one quantitative explanatory variable. This is called Multiple Linear Regression:

Model: Y_i = β0 + β1x_i1 + β2x_i2 + ... + βp x_ip + E_i

• The βj are called 'partial slopes'

• The x's can be quadratic values, cubics, or interaction terms, e.g. x1², x1³, x1·x2

• Fitted model: ŷ = b0 + b1x1 + b2x2 + ... + bp xp

• Model often still fit using 'least squares', i.e. minimize SS(E) = ∑_{i=1}^{n} (y_i − ŷ_i)²

Example (taken from Probability and Statistics, Devore): Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. A study was done on 13 soil samples that measured Y = phosphate adsorption index, X1 = amount of extractable aluminum, and X2 = amount of extractable iron. The data are given below:

Adsorption (y)    Aluminum (x1)    Iron (x2)
       4                13             61
      18                21            175
      14                24            111
      18                23            124
      26                64            130
      26                38            173
      21                33            169
      30                61            169
      28                39            160
      36                71            244
      65               112            257
      62                88            333
      40                54            199
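For readers curious how the estimates in the output arise, here is an illustrative Python fit of this model by least squares via the normal equations (a sketch under the data above, not the SAS procedure the notes use):

```python
# Fitting adsorp ~ aluminum + iron by least squares: solve the normal
# equations (X'X) beta = X'y with Gauss-Jordan elimination.
y  = [4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40]
x1 = [13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54]
x2 = [61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199]

X = [[1.0, a, b] for a, b in zip(x1, x2)]  # design matrix with intercept column
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

# Gauss-Jordan elimination with partial pivoting on the augmented system.
A = [row[:] + [b] for row, b in zip(XtX, Xty)]
for c in range(3):
    p = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[p] = A[p], A[c]
    for r in range(3):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
beta = [A[i][3] / A[i][i] for i in range(3)]  # [intercept, aluminum, iron]
print([round(b, 5) for b in beta])
```

The three coefficients should match the Parameter Estimates in the SAS output below (about -7.35, 0.349, and 0.113).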

Output From Proc Reg for Adsorption Example

The REG Procedure
Model: MODEL1
Dependent Variable: adsorp

Number of Observations Read                  14
Number of Observations Used                  13
Number of Observations with Missing Values    1

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        3529.90308     1764.95154      92.03    <.0001
Error              10         191.78922       19.17892
Corrected Total    12        3721.69231

Root MSE            4.37937    R-Square    0.9485
Dependent Mean     29.84615    Adj R-Sq    0.9382
Coeff Var          14.67316

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    95% Confidence Limits
Intercept     1             -7.35066           3.48467      -2.11      0.0611    -15.11498    0.41366
aluminum      1              0.34900           0.07131       4.89      0.0006      0.19012    0.50788
iron          1              0.11273           0.02969       3.80      0.0035      0.04658    0.17889