chapter 10 regression and correlationmathcs.slu.edu/~currey/statistics/spss guide/spss_10.… ·...

13
CHAPTER 10 REGRESSION AND CORRELATION SIMPLE LINEAR REGRESSION: TWO VARIABLES (SECTIONS 10.1–10.3 OF UNDERSTANDABLE STATISTICS) Chapter 10 of Understandable Statistics introduces linear regression. The formula for the correlation coefficient r is given in Section 10.1. Formulas to find the equation of the least squares line, y = a + bx are given in Section 10.2. This section also contains the formula for the coefficient of determination . The equation for the standard error of estimate and the procedure to find a confidence interval for the predicted value of y are given in Section 10.3. 2 r The menu selection Analyze Regression Linear gives the equation of the least-squares line, the value of the standard error of estimate (Std. Error of the Estimate), the value of the correlation coefficient r, the value of the coefficient of determination (R Square), as well as several other values such as Adjusted R Square (an unbiased estimate of the population 2 r 2 ). r The standard deviation (Std Error), t-ratio and P values (Sig.) of the coefficients are also given. The P value is useful for testing the coefficients to see that the population coefficient is not zero (see Section 10.4 of Understandable Statistics for a discussion about testing the coefficients). For the time being we do not use these values. An analysis of variance chart is also given. We do not use the analysis of variance chart in the introduction to regression. However, in more advanced treatments of regression, it will be useful. To find the equation of the least-squares line and the value of the correlation coefficient, use the menu options Analyze Regression Linear Dialog Box Responses Dependent: Enter the variable name of the column containing the responses (that is Y values). Independent: Enter the variable name of the column containing the explanatory variables (that is, X values). [Options]: Check the option [Include Constant in Equation]. [Save]: This is used to find predicted values and their confidence intervals. To graph the scatter plot and show the least-squares line on the graph, use the menu options Analyze Regression Curve Estimate Dialog Box Responses Dependent: Enter the variable name of the column containing the Y values. Independent: Enter the variable name of the column containing the X values. Models: Select Linear. Check on the options [Include constant in equation] and [Plot models]. We may also use Graph Interactive Scatter Plot to show the least-square line, least-square equation, as well as optional prediction bands. Dialog Box Responses Move the dependent variable to the vertical axis box, independent variable to the horizontal axis box. Copyright © Houghton Mifflin Company. All rights reserved. 353

Upload: hoangliem

Post on 22-May-2018

230 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

CHAPTER 10 REGRESSION AND CORRELATION

SIMPLE LINEAR REGRESSION: TWO VARIABLES (SECTIONS 10.1–10.3 OF UNDERSTANDABLE STATISTICS)

Chapter 10 of Understandable Statistics introduces linear regression. The formula for the correlation coefficient r is given in Section 10.1. Formulas to find the equation of the least squares line,

y = a + bx

are given in Section 10.2. This section also contains the formula for the coefficient of determination . The equation for the standard error of estimate and the procedure to find a confidence interval for the predicted value of y are given in Section 10.3.

2r

The menu selection Analyze Regression Linear gives the equation of the least-squares line, the value of the standard error of estimate (Std. Error of the Estimate), the value of the correlation coefficient r, the value of the coefficient of determination (R Square), as well as several other values such as Adjusted R Square (an unbiased estimate of the population

2r2 ).r

The standard deviation (Std Error), t-ratio and P values (Sig.) of the coefficients are also given. The P value is useful for testing the coefficients to see that the population coefficient is not zero (see Section 10.4 of Understandable Statistics for a discussion about testing the coefficients). For the time being we do not use these values. An analysis of variance chart is also given. We do not use the analysis of variance chart in the introduction to regression. However, in more advanced treatments of regression, it will be useful.

To find the equation of the least-squares line and the value of the correlation coefficient, use the menu options

Analyze Regression Linear

Dialog Box Responses

Dependent: Enter the variable name of the column containing the responses (that is Y values).

Independent: Enter the variable name of the column containing the explanatory variables (that

is, X values).

[Options]: Check the option [Include Constant in Equation].

[Save]: This is used to find predicted values and their confidence intervals.

To graph the scatter plot and show the least-squares line on the graph, use the menu options

Analyze Regression Curve Estimate

Dialog Box Responses

Dependent: Enter the variable name of the column containing the Y values.

Independent: Enter the variable name of the column containing the X values.

Models: Select Linear.

Check on the options [Include constant in equation] and [Plot models].

We may also use Graph Interactive Scatter Plot to show the least-square line, least-square equation, as well as optional prediction bands.

Dialog Box Responses

Move the dependent variable to the vertical axis box, independent variable to the horizontal axis box.

Copyright © Houghton Mifflin Company. All rights reserved. 353

Page 2: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

354 Technology Guide Understandable Statistics, 8th Edition

Click the [fit] tab, select [Regression], and check the option [Include constant in the equation]. For optional prediction band, check the options [individual], enter the confidence level, and check [total].

Example

Merchandise loss due to shoplifting, damage, and other causes is called shrinkage. Shrinkage is a major concern to retailers. The managers of H.R. Merchandise think there is a relationship between shrinkage and number of clerks on duty. To explore this relationship, a random sample of 7 weeks was selected. During each week the staffing level of sales clerks was kept constant and the dollar value (in hundreds of dollars) of the shrinkage was recorded.

X 10 12 11 15 9 13 8

Y 19 15 20 9 25 12 31 (in hundreds)

Store the value of X in the first column and name it X. Store the values of Y in the second column with variable name Y.

Use menu choices to give descriptive statistics regarding the values of X and Y. Use commands to draw an (X, Y) scatter plot together with the least-squares line and then to find the equation of the regression line. Find the value of the correlation coefficient, and test to see if it is significant.

(a) First we will use Analyze Descriptive Statistics Explore and each of the columns X, and Y. The results follow.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 3: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 355

(b) Next we will use Analyze Regression Curve Estimate to graph the scatter plot and show the least-squares line on the graph.

Notice that the coefficients of the equation of the regression line are given in the table above the figure, as well as the value of 2.r

(c) To find out more information about the linear regression model, we use the menu selection Analyze Regression Linear. Use Y as Dependent and X as Independent.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 4: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

356 Technology Guide Understandable Statistics, 8th Edition

The results follow.

Notice that of the coefficients of the regression equation are given in the table Coefficients. Thus the regression equation is

y = 52.5 – 3.03x

The value of the standard error of estimate is given as Std. Error of the Estimate = 2.228. The value of reS 2 (R Square in the screen) is 92.8%, and the value of r is 96.3%.

(d) Next, let’s use the prediction option to find the shrinkage when 14 clerks are available. To do that, we first add the value 14 to the column for variable X, but leave the corresponding cell in the Y column blank, as shown below.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 5: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 357

Next, in the Analyze Regression Linear dialog box, click on [Save]. In the [Save] tab, check [Unstandardized] under “Predicted values”, check [Individual] under “Prediction intervals” with a confidence level 95%.

Click [Continue] and then [OK]. The results follow.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 6: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

358 Technology Guide Understandable Statistics, 8th Edition

The predicted value (PRE_1) of the shrinkage when 14 clerks are on duty is 10.05 hundred dollars, or

$1005. A 95% prediction interval goes from 3.33 hundred dollars (LICI_1) to 16.77 hundred dollars (UICI_1) — that is, from $333 to $1677.

(e) Graph a prediction band at a confidence level 95% for predicted values. For this we use Graph Interactive Scatter Plot. First we move the dependent variable to the vertical axis box, and

independent variable to the horizontal axis box, as shown below.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 7: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 359

Next, we click the [fit] tab, select [Regression], and check the option [Include constant in the equation]. Then

we check the options [individual], enter the confidence level 95%, and check [total].

The results follow.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 8: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

360 Technology Guide Understandable Statistics, 8th Edition

LAB ACTIVITIES FOR SIMPLE LINEAR REGRESSION: TWO VARIABLES

1. Open or retrieve the worksheet Slr01.sav from the CD-ROM. This worksheet contains the following data, with the list price in the first column and the best price in the second column. The best price is the best price negotiated by a team from the magazine.

List Price versus Best Price for a New GMC Pickup Truck In the following data pairs (X, Y) X = List Price (in $1000) for a GMC Pickup Truck Y = Best Price (in $1000) for a GMC Pickup Truck

SOURCE: CONSUMERS DIGEST, FEBRUARY 1994

(12.4, 11.2) (14.3, 12.5) (14.5, 12.7)

(14.9, 13.1) (16.1, 14.1) (16.9, 14.8)

(16.5, 14.4) (15.4, 13.4) (17.0, 14.9)

(17.9, 15.6) (18.8, 16.4) (20.3, 17.7)

(22.4, 19.6) (19.4, 16.9) (15.5, 14.0)

(16.7, 14.6) (17.3, 15.1) (18.4, 16.1)

(19.2, 16.8) (17.4, 15.2) (19.5, 17.0)

(19.7, 17.2) (21.2, (18.6)

(a) Use SPSS to find the least-squares regression line using the best price as the response variable and list price as the explanatory variable.

(b) Use SPSS to draw a scatter plot of the data. (c) What is the value of the standard error of estimate?

(d) What is the value of the coefficient of determination of the correlation coefficient r? 2 ?r (e) Use the least-squares model to predict the best price for a truck with a list price of $20,000. Note:

Enter this value as 20 since X is assumed to be in thousands of dollars. Find a 95% confidence interval for the prediction.

2. Other SPSS data files appropriate to use for simple linear regression are

Cricket Chirps Versus Temperature: Slr02.sav Source: The Song of Insects by Dr. G.W. Pierce, Harvard College Press

The chirps per second for the striped grouped cricket are stored in the first column; the corresponding temperature in degrees Fahrenheit is stored in the second column.

Diameter of Sand Granules Versus Slope on a Beach: Slr03.sav Source: Physical Geography by A.M. King, Oxford press The median diameter (mm) of granules of sand is stored in the first column; the corresponding gradient of

beach slope in degrees is stored in the second column. National Unemployment Rate Male Versus Female: Slr04.sav Source: Statistical Abstract of the United States

Copyright © Houghton Mifflin Company. All rights reserved.

Page 9: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 361

The national unemployment rate for adult males is stored in the first column; the corresponding unemployment rate for adult females for the same period of time is stored in the second column.

The data in these worksheets are described in the Appendix of this Guide. Select these worksheets and repeat parts (a)–(e) of problem 1, using the first column as the explanatory variable and the second column as the response variable.

3. A psychologist interested in job stress is studying the possible correlation between interruptions and job stress. A clerical worker who is expected to type, answer the phone and do reception work has many interruptions. A store manager who has to help out in various departments as customers make demands also has interruptions. An accountant who is given tasks to accomplish each day and who is not expected to interact with other colleagues or customers except during specified meeting times has few interruptions. The psychologist rated a group of jobs for interruption level. The results follow, with X being interruption level of the job on a scale of 1 to 20, with 20 having the most interruptions, and Y the stress level on a scale of 1 to 50, with 50 the most stressed.

Person 1 2 3 4 5 6 7 8 9 10 11 12

X 9 15 12 18 20 9 5 3 17 12 17 6 Y 20 37 45 42 35 40 20 10 15 39 32 25

(a) Enter the X values into the first column and the Y values into the second column. Use the menu

selections Analyze Descriptive Statistics Explore on the two columns. What is the mean of the Y-values? Of the X-values? What are the respective standard deviations?

(b) Make a scatter plot of the data using the Analyze Regression Curve Estimate menu selection. From the diagram, do you expect a positive or negative correlation?

(c) Use the Analyze Regression Linear menu choices to get the value of r. Is this value consistent with your response in part (b)? Also, find the equation of the regression line.

(d) Use the Analyze Regression Linear menu choices with Y as the response variable and X as the explanatory variable. Get the predicted stress level of jobs with interruption levels of 5, 10, 15, 20. Look at the 95% P.I. intervals. Which are the longest? Why would you expect these results? Find the standard error of estimate. Is value of r the same as that you found in part (c)? What is the equation of the least-squares line?

(e) Use the Analyze Regression Linear menu option again, this time using X as the response variable and Y as the explanatory variable. Is the equation different than that of part (d)? What about the value of the standard error of estimate (s on your output)? Did it change? Did R Square change?

4. The researcher of problem 3 was able to add to her data. Another random sample of 11 people had their jobs rated for interruption level and were then evaluated for stress level.

Person 13 14 15 16 17 18 19 20 21 22 23

X 4 15 19 13 10 9 3 11 12 15 4 Y 20 35 42 37 40 23 15 32 28 38 12

Add this data to the data in problem 3, and repeat parts (a) through (e). Compare the values of s, the

standard error of estimate in parts (c). Did more data tend to reduce the value of s? Look at the 95% P.I. intervals. How do they compare to the corresponding ones of problem 3? Are they shorter or longer? Why would you expect this result?

Copyright © Houghton Mifflin Company. All rights reserved.

Page 10: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

362 Technology Guide Understandable Statistics, 8th Edition

MULTIPLE REGRESSION (SECTION 10.4 OF UNDERSTANDABLE STATISTICS)

The Analyze Regression Linear menu choices also do multiple regression.

Analyze Regression Linear

Dialog Box Responses

Dependent: Enter the variable name of the column containing the responses (that is Y values).

Independent: Enter the variable names of the columns containing the explanatory variables.

[Options]: Check the option [Include Constant in Equation].

[Save]: This is used to find predicted values and their confidence intervals.

Example

Bowman Brothers is a large sporting goods store in Denver that has a giant ski sale every year during the month of October. The chief executive officer at Bowman Brothers is studying the following variables regarding the ski sale:

1X = Total dollar receipts from October ski sale

2X = Total dollar amount spent advertising ski sale on local TV

3X = Total dollar amount spent advertising ski sale on local radio

4X = Total dollar amount spent advertising ski sale in Denver newspapers

Data for the past eight years is shown below (in thousands of dollars):

Year 1 2 3 4 5 6 7 8

X1 751 768 801 832 775 718 739 780

X2 19 23 27 32 25 18 20 24

X3 14 17 20 24 19 9 10 19

X4 11 15 16 18 12 5 7 14

(a) Enter the data in the first four columns and name them as X1, X2, X3, and X4. Use Analyze Descriptive Statistics Explore to study the data.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 11: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 363

(b) Next use Analyze Regression Linear menu option to conduct a linear regression, using X1 as dependent variable, X2, X3 and X4 as independent variables. We may also obtain the correlation between each pair of variables by clicking on [Statistics] and selecting [Descriptives]. The results are shown below (in two screen images).

Copyright © Houghton Mifflin Company. All rights reserved.

Page 12: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

364 Technology Guide Understandable Statistics, 8th Edition

For this regression model, note the least-squares equation, the standard error of estimate, and the coefficient of multiple determination R Square. Look at the P values of the coefficients. Remember we are testing the null hypothesis 0 1:H 0.β = against the alternate hypothesis .1 1: 0H β ≠ A P value less than α is evidence to reject 0.H

(c) Finally, we use Analyze Regression Linear. Use X1 as the response variable with 3 predictors X2, X3, X4. Find the predicted value of X1 for X2 = 21, X3 = 11, and X4 = 8. The procedure is similar to that for simple linear regression. Add values 21, 11 and 8 to the columns X2, X3 and X4, respectively. Leave the X4 cell in that row blank. Then use the [Save] option in the Analyze

Regression Linear dialog box. The results follow.

Copyright © Houghton Mifflin Company. All rights reserved.

Page 13: CHAPTER 10 REGRESSION AND CORRELATIONmathcs.slu.edu/~currey/Statistics/SPSS guide/spss_10.… ·  · 2008-08-20CHAPTER 10 REGRESSION AND CORRELATION ... (12.4, 11.2) (14.3, 12.5)

Part IV: SPSS Guide 365

LAB ACTIVITIES FOR MULTIPLE REGRESSION Solve problems 3–6 of Section 10.4. Each of these problems have SPSS data files stored on the CD-ROM.

Section 10.4 problem #3 (Systolic Blood Pressure Data)

SPSS data file: Mlr02.sav

Section 10.4 problem #4 (Test Scores for General Psychology)

SPSS data file: Mlr03.sav

Section 10.4 problem #5 (Hollywood Movies data)

SPSS data file: Mlr04.sav

Section 10.4 problem #6 (All Greens Franchise Data)

SPSS data file: Mlr05.sav

Two additional case studies are available on the CD-ROM. The data are listed in the Appendix. For each of these studies, explore the relationships among the variables.

SPSS data file Mlr07.sav

This is file contains data of a case study of public health, income, and population density for small cities in eight midwestern states: Ohio, Indiana, Illinois, Iowa, Missouri, Nebraska, Kansas, and Oklahoma. The data is for a sample of 53 small cities in these states.

X1 = Death Rate per 1000 Residents

X2 = Doctor Availability per 100,000 Residents

X3 = Hospital Availability per 100,000 Residents

X4 = Annual per Capita Income in Thousands of Dollars

X5 = Population Density People per Square Mile

SPSS data file Mlr06.sav

This file contains data of a case study of education, crime, and police funding for small cities in ten eastern and southeastern states. The states are New Hampshire, Connecticut, Rhode Island, Maine, New York, Virginia, North Carolina, South Carolina, Georgia, and Florida. The data is for a sample of 50 small cities in these states.

X1 = Total Overall Reported Crime Rate per 1 Million Residents

X2 = Reported Violent Crime Rate per 100,000 Residents

X3 = Annual Police Funding in Dollars per Resident

X4 = Percent of People 25 Years and Older that have had 4 years

of High School

X5 = Percent of 16 to 19 Year-Olds Not in High School and Not High School

Graduates

X6 = Percent of 18 to 24 Year-Olds Enrolled in College

X7 = Percent of People 25 Years and Older with at Least 4 Years of College

Copyright © Houghton Mifflin Company. All rights reserved.