Transcript
Page 1: Multiple linear regression

MULTIPLE LINEAR REGRESSION
Avjinder Singh Kaler and Kristi Mai

Page 2: Multiple linear regression

We will look at a method for analyzing a linear relationship involving more than two variables.

We focus on these key elements:

1. Finding the multiple regression equation.

2. The values of adjusted R² and the p-value as measures of how well the multiple regression equation fits the sample data.

Page 3: Multiple linear regression

• Multiple Regression Equation – given a collection of sample data with several (k-many) explanatory variables, the regression equation algebraically describes the relationship between the response variable y and two or more explanatory variables x1, x2, …, xk:

ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk

• We are now using more than one explanatory variable to predict a response variable

• In practice, you need large amounts of data to use several predictor/explanatory variables

* Guideline: your sample size should be 10 times larger than the number of x variables *

• Multiple Regression Line – the graph of the multiple regression equation

• This multiple regression line still fits the sample points best according to the least squares property
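As an illustration that is not part of the original slides, here is a minimal Python sketch of fitting such an equation by least squares; it assumes the statsmodels package, and the sample data are made up:

    import numpy as np
    import statsmodels.api as sm

    # Made-up sample data: two explanatory variables and one response.
    rng = np.random.default_rng(1)
    x1 = rng.normal(63, 3, size=40)   # e.g., mother's height
    x2 = rng.normal(69, 3, size=40)   # e.g., father's height
    y = 7 + 0.7 * x1 + 0.2 * x2 + rng.normal(0, 2, size=40)

    # Add an intercept column so the fit returns b0, b1, b2.
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    print(fit.params)        # least squares estimates b0, b1, b2
    print(fit.rsquared_adj)  # adjusted R^2 for this model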

Page 4: Multiple linear regression

• Visualization – multiple scatterplots of each pair (xk, y) of quantitative data can still be helpful in determining whether there is a relationship between two variables

• These scatterplots can be created one at a time. However, it is common to visualize all the pairs of variables within one plot. This is often called a pairs plot, pairwise scatterplot, or scatterplot matrix.
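A minimal sketch of one way to draw such a scatterplot matrix in Python with pandas; the column names and values below are made up:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    # Made-up heights; any data frame of quantitative columns works.
    df = pd.DataFrame({
        "mother":   [63, 60, 65, 63, 61, 62, 64, 59],
        "father":   [64, 69, 70, 72, 66, 69, 71, 68],
        "daughter": [58.6, 64.7, 65.3, 61.0, 65.4, 67.4, 66.1, 63.0],
    })

    # One figure containing every pairwise scatterplot,
    # with histograms on the diagonal.
    scatter_matrix(df, diagonal="hist")
    plt.show()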

Page 5: Multiple linear regression

Equation:

Population parameter: y = β0 + β1x1 + β2x2 + ⋯ + βkxk

Sample statistic: ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk

Note:

• ŷ is the predicted value of y

• k is the number of predictor variables (also called independent variables or x variables)

Page 6: Multiple linear regression

• Requirements for Regression:

1. The sample data is a Simple Random Sample of quantitative data

2. Each of the pairs of data (xk, y) has a bivariate normal distribution (recall this definition)

3. Random errors associated with the regression equation (i.e., residuals) are independent and normally distributed with a mean of 0 and a standard deviation σ

• Formulas for bk:

• Statistical software will be used to calculate the individual coefficient estimates, bk
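As a reference sketch (standard least squares algebra, not shown on the slide), the estimates solve b = (XᵀX)⁻¹Xᵀy, where X is the design matrix whose first column is all 1s; Python can compute this directly with made-up numbers:

    import numpy as np

    # Made-up design matrix X (first column of 1s for the intercept)
    # and response vector y: four observations, two predictors.
    X = np.array([
        [1.0, 63.0, 64.0],
        [1.0, 60.0, 69.0],
        [1.0, 65.0, 70.0],
        [1.0, 63.0, 72.0],
    ])
    y = np.array([58.6, 64.7, 65.3, 61.0])

    # Least squares estimates b = (X^T X)^(-1) X^T y, computed stably.
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)  # b0, b1, b2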

Page 7: Multiple linear regression

1. Use common sense and practical considerations to include or exclude variables

2. Consider the p-value for the test of overall model significance

• Hypotheses: H0: β1 = β2 = ⋯ = βk = 0 vs. H1: at least one βk ≠ 0

• Test Statistic: F = MS(Regression) / MS(Error)

• This will result in an ANOVA table with a p-value that expresses the overall statistical significance of the model
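A minimal sketch of this F test in Python, assuming scipy; the sums of squares below are made-up values:

    from scipy import stats

    # Made-up ANOVA quantities: a model with k predictors and n observations.
    n, k = 40, 2
    ss_regression = 300.0  # sum of squares explained by the regression
    ss_error = 150.0       # residual (error) sum of squares

    ms_regression = ss_regression / k   # MS(Regression)
    ms_error = ss_error / (n - k - 1)   # MS(Error)
    F = ms_regression / ms_error

    # p-value from the F distribution with (k, n - k - 1) degrees of freedom.
    p_value = stats.f.sf(F, k, n - k - 1)
    print(F, p_value)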

Page 8: Multiple linear regression

3. Consider equations with high adjusted R² values

• R is the multiple correlation coefficient that describes the correlation between the observed y values and the predicted ŷ values

• R² is the multiple coefficient of determination and measures how well the multiple regression equation fits the sample data

• Problem: this measure of model "fit" increases as more variables are included, sometimes only by a very small amount, regardless of how significant the most recently added predictor variable may be

• Adjusted R² is the multiple coefficient of determination modified to account for the number of variables in the model and the sample size
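For reference, adjusted R² can be computed from R², the sample size n, and the number of predictors k with the standard formula, a well-known result that is not printed on the slide:

    def adjusted_r2(r2, n, k):
        # n = sample size, k = number of predictor variables; the adjustment
        # penalizes models that add predictors without adding real fit.
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.66, n=20, k=2))  # approximately 0.62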

Page 9: Multiple linear regression

4. Consider equations with the fewest predictor/explanatory variables when the models being compared are nearly equivalent in terms of significance and fit (i.e., p-value and adjusted R²)

• This is known as the "Law of Parsimony"

• We are looking for the simplest yet most informative model

• Individual t-tests of particular regression parameters may help select the correct model and eliminate insignificant explanatory variables
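As a hedged sketch of one such t-test in Python (the coefficient estimate, standard error, and sample sizes below are made-up values):

    from scipy import stats

    # Made-up output for a single coefficient: estimate and standard error,
    # from a model with k predictors fit to n observations.
    b_j, se_j = 0.707, 0.162
    n, k = 20, 2

    t = b_j / se_j                                  # t = b_j / SE(b_j)
    p_value = 2 * stats.t.sf(abs(t), df=n - k - 1)  # two-sided p-value
    print(t, p_value)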

Notice: If the regression equation does not appear to be useful for predictions, the best predicted value of a y variable is still its point estimate [i.e., the sample mean of the y variable would be the best predicted value for that variable]

Page 10: Multiple linear regression

• Identify the response and potential explanatory variables by constructing a scatterplot matrix

• Create a multiple regression model

• Perform the appropriate tests of the following:

• Overall model significance (the ANOVA, i.e., the F test)

• Individual variable significance (t tests)

• In addition, find the following:

• Find the adjusted R² value to assess the predictive power of the model

Page 11: Multiple linear regression

• Perform a Residual Analysis to verify the Requirements for Linear Regression have been satisfied:

1. Construct a residual plot and verify that there is no pattern (other than a straight-line pattern) and also verify that the residual plot does not become thicker or thinner

• Examples are shown below:
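The slide's example plots are images that are not reproduced in this transcript; as a hedged stand-in, a minimal Python sketch that draws such a residual plot from synthetic data:

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Synthetic data and a quick fit, so the plot is reproducible end to end.
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 1 + 2 * x + rng.normal(scale=0.5, size=50)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Residuals vs. fitted values: look for no pattern and constant spread.
    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()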

Page 12: Multiple linear regression

2. Use a histogram, normal quantile plot, or Shapiro-Wilk test of normality to confirm that the values of the residuals have a distribution that is approximately normal

• Normal Quantile Plot (aka QQ Plot) * Examples on the next 3 slides *

• Shapiro-Wilk Normality Test

• This will help you assess the normality of a given set of data (in this case, the normality of the residuals) when the visual examination of the QQ Plot and/or the histogram of the data seems unclear to you and leaves you stumped!

• Hypotheses:

H0: The data come from a normal distribution

H1: The data do not come from a normal distribution
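A minimal sketch of running this test in Python, assuming scipy; the residuals below are simulated stand-ins:

    import numpy as np
    from scipy import stats

    # Simulated stand-in for the residuals of a fitted regression model.
    rng = np.random.default_rng(0)
    residuals = rng.normal(loc=0.0, scale=1.0, size=40)

    # Shapiro-Wilk: a small p-value is evidence against H0 (normality).
    statistic, p_value = stats.shapiro(residuals)
    print(statistic, p_value)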

Page 13: Multiple linear regression

Normal: The histogram of IQ scores is close to bell-shaped, which suggests that the IQ scores are from a normal distribution. The normal quantile plot shows points that are reasonably close to a straight-line pattern. It is safe to assume that these IQ scores are from a normally distributed population.

Page 14: Multiple linear regression

Uniform: Histogram of data having a uniform distribution. The corresponding normal quantile plot suggests that the points are not normally distributed because the points show a systematic pattern that is not a straight-line pattern. These sample values are not from a population having a normal distribution.

Page 15: Multiple linear regression

Skewed: Histogram of the amounts of rainfall in Boston for every Monday during one year. The shape of the histogram is skewed, not bell-shaped. The corresponding normal quantile plot shows points that are not at all close to a straight-line pattern. These rainfall amounts are not from a population having a normal distribution.

Page 16: Multiple linear regression

The table to the right includes a random sample of heights of mothers, fathers, and their daughters (based on data from the National Health and Nutrition Examination Survey).

Find the multiple regression equation in which the response (y) variable is the height of a daughter and the predictor (x) variables are the height of the mother and the height of the father.

Page 17: Multiple linear regression

The StatCrunch results are shown here:

From the display, we see that the multiple regression equation is:

Daughter = 7.5 + 0.707(Mother) + 0.164(Father)

We could write this equation as:

ŷ = 7.5 + 0.707x1 + 0.164x2

where ŷ is the predicted height of a daughter, x1 is the height of the mother, and x2 is the height of the father.

Page 18: Multiple linear regression

The preceding technology display shows the adjusted coefficient of determination as R-Sq(adj) = 63.7%.

When we compare this multiple regression equation to others, it is better to use the adjusted R² of 63.7%.

Page 19: Multiple linear regression

Based on StatCrunch, the p-value is less than 0.0001, indicating that the multiple regression equation has good overall significance and is usable for predictions.

That is, it makes sense to predict the heights of daughters based on heights of mothers and fathers.

The p-value results from a test of the null hypothesis that β1 = β2 = 0, and rejection of this hypothesis indicates the equation is effective in predicting the heights of daughters.

Page 20: Multiple linear regression

Data Set 2 in Appendix B includes the age, foot length, shoe print length, shoe size, and height for each of 40 different subjects. Using those sample data, find the regression equation that is best for predicting height. The table on the next slide includes key results from the combinations of the five predictor variables.
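As a hedged sketch (not the authors' code) of how such a table of results can be produced, Python can fit every combination of predictors and record each adjusted R²; the data frame below is a made-up stand-in for Data Set 2:

    from itertools import combinations

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Made-up stand-in for Data Set 2 (40 subjects, height as the response).
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        rng.normal(size=(40, 5)),
        columns=["age", "foot_length", "shoeprint", "shoe_size", "height"],
    )
    predictors = ["age", "foot_length", "shoeprint", "shoe_size"]

    # Fit every non-empty subset of predictors; record its adjusted R^2.
    results = {}
    for r in range(1, len(predictors) + 1):
        for subset in combinations(predictors, r):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df["height"], X).fit()
            results[subset] = fit.rsquared_adj

    # Rank the candidate models, highest adjusted R^2 first.
    for subset, adj_r2 in sorted(results.items(), key=lambda kv: -kv[1]):
        print(subset, round(adj_r2, 4))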

Pages 21-28: Multiple linear regression

[Table of key results (adjusted R² values and p-values) for the combinations of the predictor variables appeared on these slides; it was not captured in the transcript.]
Page 29: Multiple linear regression

Using critical thinking and statistical analysis:

1. Delete the variable age.

2. Delete the variable shoe size, because it is really a rounded form of foot length.

3. For the remaining variables of foot length and shoe print length, select foot length because its adjusted R² of 0.7014 is greater than 0.6520 for shoe print length.

4. Although it appears that foot length alone is best, we note that criminals usually wear shoes, so shoe print lengths are more likely to be found than foot lengths.

Hence, the final regression equation, including only foot length: y = β0 + β1x1

where β0 is the intercept and β1 is the coefficient corresponding to the x1 variable (foot length).

Page 30: Multiple linear regression

The methods of the above section (Multiple Linear Regression) rely on variables that are continuous in nature. Many times we are interested in dichotomous or binary variables.

These variables have only two possible categorical outcomes such as male/female, success/failure, dead/alive, etc.

Indicator or dummy variables are artificial variables used to specify the categories of a binary variable, such as 0 = male / 1 = female.

If an indicator variable is included in the regression model as a predictor/explanatory variable, the methods we have are appropriate.

HOWEVER, can we handle a situation when the variable we are trying to predict is categorical and/or binary? Notice that this is a different situation.

But, YES!!

Page 31: Multiple linear regression

The data in the table also includes the dummy variable of sex (coded as 0 = female and 1 = male).

Given that a mother is 63 inches tall and a father is 69 inches tall, find the regression equation and use it to predict the height of a daughter and a son.

Page 32: Multiple linear regression

Using technology, we get the regression equation:

π»π‘’π‘–π‘”β„Žπ‘‘ π‘œπ‘“ πΆβ„Žπ‘–π‘™π‘‘ = 25.6 + 0.377 π»π‘’π‘–π‘”β„Žπ‘‘ π‘œπ‘“ π‘€π‘œπ‘‘β„Žπ‘’π‘Ÿ + 0.195 π»π‘’π‘–π‘”β„Žπ‘‘ π‘œπ‘“ πΉπ‘Žπ‘‘β„Žπ‘’π‘Ÿ + 4.15(𝑠𝑒π‘₯)

We substitute in 0 for the sex variable, 63 for the mother, and 69 for the father, and predict the daughter will be 62.8 inches tall.

We substitute in 1 for the sex variable, 63 for the mother, and 69 for the father, and predict the son will be 67 inches tall.
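A quick arithmetic check of these two predictions in Python; the equation comes from the slide, while the helper function name is ours:

    def predicted_height(mother, father, sex):
        # Regression equation from the slide; sex is 0 = female, 1 = male.
        return 25.6 + 0.377 * mother + 0.195 * father + 4.15 * sex

    print(round(predicted_height(63, 69, sex=0), 1))  # 62.8 (daughter)
    print(round(predicted_height(63, 69, sex=1), 1))  # 67.0 (son)

Both values match the slide's predicted heights of 62.8 inches and 67 inches.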

