solving stepwise regression problems


Upload: neetu-gupta

Post on 28-Oct-2014


Page 1: Solving Stepwise Regression Problems

Slide 1

Stepwise Multiple Regression

Page 2: Solving Stepwise Regression Problems

Slide 2

Different Methods for Entering Variables in Multiple Regression

Different types of multiple regression are distinguished by the method for entering the independent variables into the analysis.

In standard (or simultaneous) multiple regression, all of the independent variables are entered into the analysis at the same time.

In hierarchical (or sequential) multiple regression, the independent variables are entered in an order prescribed by the analyst.

In stepwise (or statistical) multiple regression, the independent variables are entered according to their statistical contribution in explaining the variance in the dependent variable.

No matter what method of entry is chosen, a multiple regression that includes the same independent variables and the same dependent variables will produce the same multiple regression equation.

The number of cases required for stepwise regression is greater than the number for the other forms. We will use the norm of 40 cases for each independent variable.

Page 3: Solving Stepwise Regression Problems

Slide 3

Purpose of Stepwise Multiple Regression

Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the dependent variable.

Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables.

After each variable is entered, each of the included variables is tested to see if the model would be better off if it were excluded. This does not happen often.

The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R² using any of the variables not yet included.

Since variables will not be added to the regression equation unless they make a statistically significant addition to the analysis, all of the independent variables selected for inclusion will have a statistically significant relationship to the dependent variable.
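The add-one-variable, re-test, stop-when-nothing-helps procedure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SPSS's implementation: it works from a correlation matrix rather than raw data, it uses fixed F-to-enter and F-to-remove thresholds (3.84 and 2.71 are illustrative values chosen here) instead of probability values, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rxx, rxy, subset):
    """R-squared for the predictors in `subset`, computed from correlations."""
    if not subset:
        return 0.0
    a = [[rxx[i][j] for j in subset] for i in subset]
    b = [rxy[i] for i in subset]
    beta = solve(a, b)  # standardized regression coefficients
    return sum(bi * ri for bi, ri in zip(beta, b))


def f_change(r2_small, r2_big, n, k_big):
    """F statistic for the change in R-squared when one predictor is added."""
    return (r2_big - r2_small) / ((1.0 - r2_big) / (n - k_big - 1))


def stepwise(rxx, rxy, n, f_enter=3.84, f_remove=2.71):
    """Return (indices of selected predictors in order of entry, final R-squared)."""
    included = []
    while True:
        r2_now = r_squared(rxx, rxy, included)
        best, best_f = None, 0.0
        for j in range(len(rxy)):
            if j in included:
                continue
            r2_new = r_squared(rxx, rxy, included + [j])
            f = f_change(r2_now, r2_new, n, len(included) + 1)
            if f > best_f:
                best, best_f = j, f
        # Stop when no candidate is left or the best one fails the entry test.
        if best is None or best_f < f_enter:
            return included, r2_now
        included.append(best)
        # Re-test every included variable; drop any that no longer contributes.
        for j in list(included):
            r2_all = r_squared(rxx, rxy, included)
            rest = [i for i in included if i != j]
            if f_change(r_squared(rxx, rxy, rest), r2_all, n, len(included)) < f_remove:
                included.remove(j)


# Two predictors correlated .70 and .40 with the DV and .20 with each other;
# the sample size of 100 is a hypothetical value.
selected, r2 = stepwise([[1.0, 0.2], [0.2, 1.0]], [0.7, 0.4], n=100)
```

With a much smaller hypothetical sample (for example n=10), the second predictor's F for change falls below the entry threshold and only the first predictor is selected, which is one way to see why stepwise regression calls for larger samples.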

An example of how SPSS does stepwise regression is shown below.

Page 4: Solving Stepwise Regression Problems

Slide 4

Stepwise Multiple Regression in SPSS

Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new step or model, i.e. there will be one model and result for each variable included in the analysis.

SPSS provides a table of variables included in the analysis and a table of variables excluded from the analysis.  It is possible that none of the variables will be included.  It is possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance.

Once a variable is included, its interpretation in stepwise regression is the same as it would be using other methods for including regression variables.

Page 5: Solving Stepwise Regression Problems

Slide 5

Pros and Cons of Stepwise Regression

Stepwise multiple regression can be used when the goal is to produce a predictive model that is parsimonious and accurate because it excludes variables that do not contribute to explaining differences in the dependent variable.

Stepwise multiple regression is less useful for testing hypotheses about statistical relationships. It is widely regarded as atheoretical and its usage is not recommended.

Stepwise multiple regression can be useful in finding relationships that have not been tested before. Its findings invite one to speculate on why an unusual relationship makes sense.

It is not legitimate to do a stepwise multiple regression and present the results as though one were testing a hypothesis that included the variables found to be significant in the stepwise regression.

Using statistical criteria to determine relationships is vulnerable to over-fitting the data set used to develop the model at the expense of generalizability.

When stepwise regression is used, some form of validation analysis is a necessity. We will use 75/25% cross-validation.

Page 6: Solving Stepwise Regression Problems

Slide 6

75/25% Cross-validation

To do cross-validation, we randomly split the data set into a 75% training sample and a 25% validation sample. We use the training sample to develop the model, and then apply the model to the validation sample to test how well it applies to cases not used to develop it.

For the validation to be successful, the following two questions must be answered affirmatively:

• Did the stepwise regression of the training sample produce the same subset of predictors produced by the regression model of the full data set?

• If so, is the shrinkage (R² for the 75% training sample - R² for the 25% validation sample) 2% (0.02) or less? If it is, we conclude that validation was successful.

Note: shrinkage may be a negative value, indicating that the accuracy rate for the validation sample is larger than the accuracy rate for the training sample. Negative shrinkage (increase in accuracy) is evidence of a successful validation analysis.
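The two-part validation rule can be written as a small check. This is a minimal sketch: the function name and the example R² values are hypothetical, and a tiny tolerance is added only so that floating-point rounding does not reject a shrinkage of exactly 0.02.

```python
def validation_successful(full_predictors, training_predictors,
                          r2_training, r2_validation, max_shrinkage=0.02):
    """Apply the 75/25% cross-validation rule.

    Returns (success flag, shrinkage). Shrinkage is the training R-squared
    minus the validation R-squared, so a negative value means the model did
    better on the validation sample.
    """
    same_subset = set(full_predictors) == set(training_predictors)
    shrinkage = r2_training - r2_validation
    success = same_subset and shrinkage <= max_shrinkage + 1e-9
    return success, shrinkage


# Hypothetical values for illustration only.
ok, shrinkage = validation_successful(
    ["hhrace_1", "hhrace_2", "attend"],
    ["hhrace_1", "hhrace_2", "attend"],
    r2_training=0.140,
    r2_validation=0.131,
)
```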

If the validation is successful, we base our interpretation on the model that included all cases.

Page 7: Solving Stepwise Regression Problems

Slide 7

Correlations between dependent variable and independent variables

[Figure: Venn diagrams showing DV overlapping IV1 and DV overlapping IV2]

We have two independent variables, IV1 and IV2, each of which has a relationship to the dependent variable. The areas of IV1 and IV2 that overlap with DV are r² values, i.e. the proportion of the variance in DV that is explained by each IV.

DV and IV1 are correlated at r = .70. The area of overlap is r² = .49.

DV and IV2 are correlated at r = .40. The area of overlap is r² = .16.

Page 8: Solving Stepwise Regression Problems

Slide 8

Correlations between independent variables

[Figure: Venn diagram showing the overlap between IV1 and IV2]

The two independent variables, IV1 and IV2, are correlated at r = .20. This correlation represents redundant information in the independent variables.

Page 9: Solving Stepwise Regression Problems

Slide 9

Variance in the dependent variable explained by the independent variables

The variance explained in DV is divided into three areas. The total variance explained is the sum of the three areas.

[Figure: Venn diagram showing DV overlapping IV1 and IV2]

The brown area is the variance in DV that is explained by both IV1 and IV2.

The green area is the variance in DV uniquely explained by IV1.

The orange area is the variance in DV uniquely explained by IV2.

Page 10: Solving Stepwise Regression Problems

Slide 10

Correlations at step 1 of the stepwise regression

Since IV1 had the stronger relationship with DV (.70 versus .40), it will be the variable entered first in the stepwise regression.

As the only variable in the regression equation, it is given full credit (.70) for its relationship to DV.

The partial correlation and the part correlation have the same value as the zero-order correlation at .70.

[Figure: Venn diagram showing DV overlapping IV1]

Page 11: Solving Stepwise Regression Problems

Slide 11

Change in variance explained when a second variable is included

At step 2, IV2 enters the model, increasing the total variance explained from .49 to .56, an increase of .07.

By itself, IV2 explained .16 of the variance in DV, but since it was itself correlated with IV1, a portion of what it could explain had already been attributed to IV1.
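The .56 and .07 figures follow from the three correlations given on the earlier slides (.70, .40, and .20). A short Python sketch, using the standard two-predictor formula for R² and a hypothetical function name, reproduces them:

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared for a regression of Y on two predictors, from their correlations.

    r_y1, r_y2: correlations of each predictor with the dependent variable.
    r_12: correlation between the two predictors.
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r2_step1 = 0.70 ** 2                            # IV1 alone: .49
r2_step2 = r2_two_predictors(0.70, 0.40, 0.20)  # IV1 and IV2 together
gain = r2_step2 - r2_step1                      # what IV2 adds beyond IV1
print(round(r2_step2, 2), round(gain, 2))       # prints 0.56 0.07
```

The gain of about .07 is smaller than the .16 that IV2 explains on its own because part of that .16 overlaps with what IV1 already explains.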

Page 12: Solving Stepwise Regression Problems

Slide 12

Differences in correlations when a second variable is entered

While the zero-order correlations do not change, both the partial and the part correlations decrease.

Partial correlation represents the relationship between the dependent variable and an independent variable when the relationship between the dependent variable and other independent variables has been removed from the variance of both the dependent and the independent variable.

Part (or semi-partial) correlation reflects the portion of the total variance in the dependent variable that is explained by only that independent variable. The square of the part correlation is the amount of change in R² from including this variable.
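For the two-predictor example in these slides (r = .70, .40, and .20), both quantities can be computed directly from the zero-order correlations. The sketch below uses the standard first-order formulas; the function names are hypothetical.

```python
import math


def part_correlation(r_y1, r_y2, r_12):
    """Part (semi-partial) correlation of predictor 1 with Y, controlling predictor 2.

    Predictor 2 is removed from predictor 1 only, so the denominator
    involves only the overlap between the two predictors.
    """
    return (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12 ** 2)


def partial_correlation(r_y1, r_y2, r_12):
    """Partial correlation of predictor 1 with Y, controlling predictor 2.

    Predictor 2 is removed from both Y and predictor 1.
    """
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))


# IV1: zero-order .70, with IV2 in the model
part_iv1 = part_correlation(0.70, 0.40, 0.20)        # about .63
partial_iv1 = partial_correlation(0.70, 0.40, 0.20)  # about .69

# IV2: zero-order .40, with IV1 in the model
part_iv2 = part_correlation(0.40, 0.70, 0.20)        # about .27
partial_iv2 = partial_correlation(0.40, 0.70, 0.20)  # about .37

# The squared part correlation of IV2 is the R-squared change at step 2.
r2_change = part_iv2 ** 2                            # about .07
```

All four values are below the corresponding zero-order correlations, and the squared part correlation for IV2 matches the .07 increase in R² at step 2.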

Page 13: Solving Stepwise Regression Problems

Slide 13

Zero-order, partial, and part correlations

[Figure: Venn diagrams of DV with IV1 and IV2, and of DV with IV1 alone]

The zero-order correlation is based on the relationship between the independent variable and the dependent variable, ignoring all other independent variables.

The partial correlation for IV1 is the green area divided by the area in DV and IV1 that is not part of IV2, i.e. green divided by green + yellow.

Part correlation for IV1 is the green area divided by all parts of DV, i.e. including areas associated with IV2.

NOTE: diagrams are scaled to r² rather than r.

Page 14: Solving Stepwise Regression Problems

Slide 14

Zero-order, partial, and part correlations

[Figure: Venn diagrams of DV with IV2 alone, and of DV with IV1 and IV2]

The zero-order correlation is based on the relationship between the independent variable and the dependent variable, ignoring all other independent variables.

The partial correlation for IV2 is the orange area divided by the area in DV and IV2 that is not part of IV1, i.e. orange divided by orange + yellow.

Part correlation for IV2 is the orange area divided by all parts of DV, i.e. including areas associated with IV1.

Page 15: Solving Stepwise Regression Problems

Slide 15

How SPSS Stepwise Regression Chooses Variables - 1

The table of Correlations shows that the variable with the strongest individual relationship with the dependent variable is RACE OF HOUSEHOLD=WHITE, with a correlation of -.247.

Provided that the relationship between this variable and the dependent variable is statistically significant, this will be the variable that enters first.

We can use the table of correlations to identify which variable will be entered at the first step of the stepwise regression.

Page 16: Solving Stepwise Regression Problems

Slide 16

How SPSS Stepwise Regression Chooses Variables - 2

The correlation between RACE OF HOUSEHOLD=WHITE and importance of ethnic group to R is statistically significant at p < .001.

It will be the first variable entered into the regression equation.

Page 17: Solving Stepwise Regression Problems

Slide 17

How SPSS Stepwise Regression Chooses Variables - 3

Model 1 contains the variable RACE OF HOUSEHOLD=WHITE, with a Multiple R of .247, producing an R² of .061 (.247²), which is statistically significant at p < .001.

We cannot use the table of correlations to show which variable will be entered second, since the variable entered second must take into account its correlation to the independent variable entered first.

Page 18: Solving Stepwise Regression Problems

Slide 18

How SPSS Stepwise Regression Chooses Variables - 4

The table of Excluded Variables, however, shows the Partial Correlation between each candidate for entry and the dependent variable.

Partial correlation is a measure of the relationship of the dependent variable to an independent variable, where the variance explained by previously entered independent variables has been removed from both.

In this example, RACE OF HOUSEHOLD=BLACK has the largest Partial Correlation (.252) and is statistically significant at p < .001, so it will be entered on the next step.

Page 19: Solving Stepwise Regression Problems

Slide 19

How SPSS Stepwise Regression Chooses Variables - 5

As expected, Model 2 contains the variables RACE OF HOUSEHOLD=WHITE and RACE OF HOUSEHOLD=BLACK. The R² for Model 2 increased by .059 to a total of .120. The increase in R² was statistically significant at p < .001.

Page 20: Solving Stepwise Regression Problems

Slide 20

How SPSS Stepwise Regression Chooses Variables - 6

The increase in R² of .059 is the square of the Part Correlation for RACE OF HOUSEHOLD=BLACK (.244² = 0.059). Part correlation, also referred to as semi-partial correlation, is the unique relationship between this independent variable and the dependent variable.

Page 21: Solving Stepwise Regression Problems

Slide 21

How SPSS Stepwise Regression Chooses Variables - 7

In the table of Excluded Variables for model 2, the next largest partial correlation is HOW OFTEN R ATTENDS RELIGIOUS SERVICES at .149.

This is the variable that will be added in Model 3 because the relationship is statistically significant at p = .032.

[Figure callouts: Partial Correlation column; Sig. column]

Page 22: Solving Stepwise Regression Problems

Slide 22

How SPSS Stepwise Regression Chooses Variables - 8

As expected, Model 3 contains the variables RACE OF HOUSEHOLD=WHITE, RACE OF HOUSEHOLD=BLACK, and HOW OFTEN R ATTENDS RELIGIOUS SERVICES. The R² for Model 3 increased by .019 to a total of .140. The increase in R² was statistically significant at p = .032.

Page 23: Solving Stepwise Regression Problems

Slide 23

How SPSS Stepwise Regression Chooses Variables - 9

[Figure callouts: Partial Correlation column; Sig. column]

In the table of Excluded Variables for model 3, the next largest partial correlation is THINK OF SELF AS LIBERAL OR CONSERVATIVE at .089.

However, the partial correlation is not significant (p = .203), so no additional variables will be added to the model.

Page 24: Solving Stepwise Regression Problems

Slide 24

What SPSS Displays when Nothing is Significant

If none of the independent variables has a statistically significant relationship to the dependent variable, SPSS displays an empty table for Variables Entered/Removed.

Page 25: Solving Stepwise Regression Problems

Slide 25

The Problem in BlackBoard - 1

The introductory problem statement tells us:

• the data set to use: GSS2002_PrejudiceAndAltruism.SAV

• the method for including variables in the regression

• the dependent variable for the analysis

• the list of independent variables that stepwise regression will select from

Page 26: Solving Stepwise Regression Problems

Slide 26

This Week’s Problems

The problems this week take the 13 questions on prejudice from the general social survey and explore the relationship of each to the demographic characteristics of age, education, income, political views (conservative versus liberal), religiosity (attendance at church), socioeconomic index, gender, and race.

I had no specific hypothesis about which demographic factors would be related to which question on prejudice, beyond an expectation that race would be a significant contributor to explaining differences on each of the questions.

My analyses were exploratory (to identify what demographic characteristics were associated with different aspects of prejudice) and, thus, appropriate for stepwise regression.

Page 27: Solving Stepwise Regression Problems

Slide 27

The Problem in BlackBoard - 2

In these problems, we will assume that our data satisfies the assumptions required by multiple regression without explicitly testing for it.

We should recognize that failing to use a needed transformation could preclude a variable from being selected as a predictor.

In your analyses, you would, of course, want to test for conformity to all of the assumptions.

Page 28: Solving Stepwise Regression Problems

Slide 28

The Problem in BlackBoard - 3

The next sequence of specific instructions tells us whether each variable should be treated as metric or non-metric, along with the reference category to use when dummy-coding non-metric variables.

Though we will not use the script to test for assumptions, we can use it to do the dummy coding that we need for the problem.

Page 29: Solving Stepwise Regression Problems

Slide 29

The Problem in BlackBoard - 4

The next pair of instructions tell us the probability values to use for alpha for both the tests of statistical relationships and for the diagnostic tests.

Page 30: Solving Stepwise Regression Problems

Slide 30

The Problem in BlackBoard - 5

The final instruction tells us the random number seed to use in the validation analysis.

If you do not use this number for the seed, it is likely that you will get different results from those shown in the feedback.

Page 31: Solving Stepwise Regression Problems

Slide 31

The Statement about Level of Measurement

The first statement in the problem asks about level of measurement. Stepwise multiple regression requires the dependent variable and the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous.

The only way we would violate the level of measurement would be to use a nominal variable as the dependent variable, or to attempt to dummy-code an interval level variable that was not grouped.

Page 32: Solving Stepwise Regression Problems

Slide 32

Marking the Statement about Level of Measurement - 1

Mark the check box as a correct statement because:

• "Importance of ethnic identity" [ethimp] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level.

• The metric independent variable "age" [age] was interval level, satisfying the requirement for independent variables.

• The metric independent variable "highest year of school completed" [educ] was interval level, satisfying the requirement for independent variables.

• "Income" [rincom98] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level.

Stepwise multiple regression requires the dependent variable and the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous.

Page 33: Solving Stepwise Regression Problems

Slide 33

Marking the Statement about Level of Measurement - 2

In addition:

• "Description of political views" [polviews] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level.

• "Frequency of attendance at religious services" [attend] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level.

• The metric independent variable "socioeconomic index" [sei] was interval level, satisfying the requirement for independent variables.

• The non-metric independent variable "sex" [sex] was dichotomous level, satisfying the requirement for independent variables.

• The non-metric independent variable "race of the household" [hhrace] was nominal level, but will satisfy the requirement for independent variables when dummy coded.

Page 34: Solving Stepwise Regression Problems

Slide 34

The Statement for Sample Size

The statement for sample size indicates that the available data satisfies the requirement.

Because of the tendency for stepwise regression to over-fit the data, we have a larger sample size requirement, i.e. 40 cases per independent variable (Tabachnick and Fidell, p. 117).

To obtain the number of cases available for this analysis, we run the stepwise regression.

Page 35: Solving Stepwise Regression Problems

Slide 35

Using the Script to Create Dummy-coded Variables - 1

Before we can run the stepwise regression, we need to dummy code sex and race. We will use the script to create the dummy-coded variables.

Select the Run Script command from the Utilities menu.

Page 36: Solving Stepwise Regression Problems

Slide 36

Using the Script to Create Dummy-coded Variables - 2

Navigate to the My Documents folder, if necessary.

Highlight the script file SatisfyingRegressionAssumptionsWithMetricAndNonMetricVariables.SBS.

Click on the Run button to open the script.

Page 37: Solving Stepwise Regression Problems

Slide 37

Using the Script to Create Dummy-coded Variables - 3

Move the non-metric variable "sex" [sex] to the Non-metric independent variables list box.

With the variable highlighted, select the reference category, 2=FEMALE from the Reference category drop down menu.

Page 38: Solving Stepwise Regression Problems

Slide 38

Using the Script to Create Dummy-coded Variables - 4

Move the non-metric variable "race of the household" [hhrace] to the Non-metric independent variables list box.

With the variable highlighted, select the reference category, 3=OTHER from the Reference category drop down menu.

The OK button to run the regression is deactivated until we select a dependent variable.

Page 39: Solving Stepwise Regression Problems

Slide 39

Using the Script to Create Dummy-coded Variables - 5

We select the dependent variable "importance of ethnic identity" [ethimp], though since we are not going to interpret the output, we could select any variable.

To have the script save the dummy-coded variables, clear the check box Delete variables created in this analysis.

Page 40: Solving Stepwise Regression Problems

Slide 40

Using the Script to Create Dummy-coded Variables - 6

Click on the OK button to run the regression, creating the dummy-coded variables as a by-product.

Page 41: Solving Stepwise Regression Problems

Slide 41

The Dummy-Coded Variables in the Data Editor

If we scroll the variable list to the right, we see that the three dummy-coded variables have been added to the data set.
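To make clear what the script produces, here is a rough Python sketch of dummy coding with a reference category. It is not the SPSS script itself: the function name and the sample data are hypothetical, and the variable naming (original name plus category code) simply mirrors the names used in this problem (sex_1, hhrace_1, hhrace_2). It assumes there are no missing values.

```python
def dummy_code(values, name, reference):
    """Create one 0/1 indicator per category, omitting the reference category.

    values: category codes for each case (assumed to have no missing values).
    name: base variable name used to label the new indicators.
    reference: the category code that gets no indicator of its own.
    """
    categories = sorted(set(values) - {reference})
    return {
        f"{name}_{code}": [1 if v == code else 0 for v in values]
        for code in categories
    }


# Hypothetical data for five cases.
sex = [1, 2, 2, 1, 1]     # 1 = male, 2 = female (reference)
hhrace = [1, 2, 3, 1, 2]  # 1 = white, 2 = black, 3 = other (reference)

dummies = {}
dummies.update(dummy_code(sex, "sex", reference=2))
dummies.update(dummy_code(hhrace, "hhrace", reference=3))
# dummies now holds three indicators: sex_1, hhrace_1, hhrace_2
```

A case in the reference category scores 0 on every indicator for that variable, which is why two categories of sex yield one dummy variable and three categories of race yield two.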

Page 42: Solving Stepwise Regression Problems

Slide 42

Run the Stepwise Regression - 1

To run the regression, select Regression > Linear from the Analyze menu.

Page 43: Solving Stepwise Regression Problems

Slide 43

Run the Stepwise Regression - 2

Move the dependent variable "importance of ethnic identity" [ethimp] to the Dependent text box.

Move the independent variables:

• "age" [age],

• "highest year of school completed" [educ],

• "income" [rincom98],

• "description of political views" [polviews],

• "frequency of attendance at religious services" [attend],

• "socioeconomic index" [sei],

• "survey respondents who were male" [sex_1],

• "survey respondents who were white" [hhrace_1],

• "survey respondents who were black" [hhrace_2]

to the Independent(s) list box.

Page 44: Solving Stepwise Regression Problems

Slide 44

Run the Stepwise Regression - 3

Select Stepwise from the Method drop down menu.

The critical step to produce a stepwise regression is the selection of the method for entering variables.

Page 45: Solving Stepwise Regression Problems

Slide 45

Run the Stepwise Regression - 4

Click on the Statistics button to specify additional output.

Page 46: Solving Stepwise Regression Problems

Slide 46

Run the Stepwise Regression - 5

We mark the check boxes for optional statistics:

• R squared change,

• Descriptives,

• Part and partial correlations,

• Collinearity diagnostics, and

• Durbin-Watson.

Click on the Continue button to close the dialog box.

Page 47: Solving Stepwise Regression Problems

Slide 47

Run the Stepwise Regression - 6

Click on the OK button to produce the output.

Page 48: Solving Stepwise Regression Problems

Slide 48

Answering the Sample Size Question

The analysis included 9 independent variables (6 metric independent variables plus 3 dummy-coded variables). The number of cases available for the analysis was 209, which does not satisfy the requirement of 360 cases. That requirement follows the rule of thumb recommended by Tabachnick and Fidell (p. 117) that stepwise multiple regression should have 40 cases for each independent variable (40 x 9 = 360).

We should consider mentioning the sample size issue as a limitation of the analysis.
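The sample size check is simple arithmetic, shown here as a small Python sketch. The function name is hypothetical; the 40-cases-per-variable norm and the counts are the ones stated above.

```python
def required_cases(n_independent_variables, cases_per_variable=40):
    """Minimum number of cases under the 40-cases-per-variable rule of thumb."""
    return n_independent_variables * cases_per_variable


n_predictors = 6 + 3  # 6 metric variables plus 3 dummy-coded variables
available = 209       # cases reported in the SPSS output
needed = required_cases(n_predictors)
requirement_met = available >= needed
print(needed, requirement_met)  # prints 360 False
```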

Page 49: Solving Stepwise Regression Problems

Slide 49

Marking the Statement for Sample Size

The check box is not marked because we did not satisfy the sample size requirement.

Page 50: Solving Stepwise Regression Problems

Slide 50

Statements about Variables Included in Stepwise Regression

Three statements in the problem list different combinations of the variables included in the stepwise regression.

To determine which is correct, we look at the table of Variables Entered and Removed in the SPSS output.

Page 51: Solving Stepwise Regression Problems

Slide 51

Answering the Question about Variables Included in Stepwise Regression - 1

Three independent variables satisfied the statistical criteria for entry into the model. The variable "survey respondents who were white" [hhrace_1] had the largest individual impact on the dependent variable "importance of ethnic identity" [ethimp]. The second variable included in the model was "survey respondents who were black" [hhrace_2]. The third variable included in the model was "frequency of attendance at religious services" [attend].

The column for Variables Removed is empty, telling us that no variables were removed after being entered.

Page 52: Solving Stepwise Regression Problems

Slide 52

Marking the Statement about Variables Included in Stepwise Regression

Three independent variables satisfied the statistical criteria for entry into the model. The variable "survey respondents who were white" [hhrace_1] had the largest individual impact on the dependent variable "importance of ethnic identity" [ethimp]. The second variable included in the model was "survey respondents who were black" [hhrace_2]. The third variable included in the model was "frequency of attendance at religious services" [attend].

We mark the check box for the first of the three statements.

Page 53: Solving Stepwise Regression Problems

Slide 53

Statement about the Strength of the Relationship

The next two statements focus on the strength of the overall relationship between the dependent variable and the set of predictors that are selected in the stepwise entry of variables. The statement assumes that the overall relationship will be statistically significant, which will be true if any variables are selected for the model.

We will use Cohen’s scale for assigning an adjective to the strength of the relationship:

• less than .10 = trivial

• .10 up to .30 = weak

• .30 up to .50 = moderately strong

• .50 or greater = strong
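The scale can be expressed as a small lookup, sketched here in Python with a hypothetical function name. It reads "up to" as excluding the upper bound, so a Multiple R of exactly .30 is labeled moderately strong.

```python
def strength_label(multiple_r):
    """Adjective for the strength of a relationship, using the cutoffs listed above."""
    r = abs(multiple_r)
    if r < 0.10:
        return "trivial"
    if r < 0.30:
        return "weak"
    if r < 0.50:
        return "moderately strong"
    return "strong"


print(strength_label(0.374))  # prints moderately strong
```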

Page 54: Solving Stepwise Regression Problems

Slide 54

Statement about the Strength of the Relationship

The overall relationship was statistically significant (F(3, 205) = 11.11, p < .001). The null hypothesis that "all of the partial slopes (b coefficients) = 0" is rejected, supporting the research hypothesis that "at least one of the partial slopes (b coefficients) is not equal to 0".

Applying Cohen's criteria for effect size, the relationship was correctly characterized as moderately strong (Multiple R = .374).

Three independent variables satisfied the statistical criteria for inclusion in the model. We interpret the results for the last step for all of the questions about statistical relationships (Model 3 in this example).
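As a consistency check, the overall F statistic can be recovered from the Multiple R, the number of cases, and the number of predictors in the final model. The Python sketch below uses the standard formula for the overall F test in multiple regression; the function name is hypothetical, and the inputs (.374, 209 cases, 3 predictors) are the values reported in this example.

```python
def overall_f(multiple_r, n_cases, n_predictors):
    """Overall F test for a regression model, computed from Multiple R.

    Returns (F, numerator df, denominator df).
    """
    r2 = multiple_r ** 2
    df1 = n_predictors
    df2 = n_cases - n_predictors - 1
    f = (r2 / df1) / ((1 - r2) / df2)
    return f, df1, df2


f, df1, df2 = overall_f(0.374, n_cases=209, n_predictors=3)
print(df1, df2, round(f, 2))  # prints 3 205 11.11
```

Small differences from the SPSS output are possible in general because the Multiple R used here is rounded to three decimals.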

Page 55: Solving Stepwise Regression Problems

Slide 55

Marking the Statement about the Strength of the Relationship

The Multiple R of .374 translates to a moderately strong relationship, so we mark the check box for the second statement on strength of the relationship.

Page 56: Solving Stepwise Regression Problems

Slide 56

Statements about Relationships to the Dependent Variable for Individual Predictors

The next set of statements focus on individual relationships between predictors and the dependent variable. In order for a statement to be true, it must have a statistically significant individual relationship (i.e. it entered into the model), and the direction of the relationship must be interpreted correctly.

Page 57: Solving Stepwise Regression Problems

Slide 57

Answering Question about Relationship of RACE OF HOUSEHOLD=WHITE

Again, we base our interpretation about statistical relationships on the last model for variables entered, i.e. Model 3 for this problem.

We reject the null hypothesis that the partial slope (b coefficient) for the variable "survey respondents who were white" = 0 and conclude that the partial slope (b coefficient) for the variable "survey respondents who were white" is not equal to 0. The negative sign of the b coefficient (-0.518) means that survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents.

The statement that "survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents" is correct. The individual relationship between the independent variable "survey respondents who were white" [hhrace_1] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = -.290, t(199) = -4.38, p < .001.

Page 58: Solving Stepwise Regression Problems

Slide 58

Marking the Statement about Relationship of RACE OF HOUSEHOLD=WHITE

Since the statement "survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents" is supported by our statistical results, we mark the check box.

Page 59: Solving Stepwise Regression Problems

Slide 59

Answering Question about Relationship of RACE OF HOUSEHOLD=BLACK

We reject the null hypothesis that the partial slope (b coefficient) for the variable "survey respondents who were black" = 0 and conclude that the partial slope (b coefficient) for the variable "survey respondents who were black" is not equal to 0. The positive sign of the b coefficient (0.524) means that survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents.

The statement that "survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents" is correct. The individual relationship between the independent variable "survey respondents who were black" [hhrace_2] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = .225, t(199) = 3.37, p < .001.

Page 60: Solving Stepwise Regression Problems

Slide 60

Marking the Statement about Relationship of RACE OF HOUSEHOLD=BLACK

Since the statement "survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents" is supported by our statistical results, we mark the check box.

Since the previous statement was correct, this statement cannot be true, so the check box is not marked.

Page 61: Solving Stepwise Regression Problems

Slide 61

Answering Question about Relationship of ATTEND RELIGIOUS SERVICES

We reject the null hypothesis that the partial slope (b coefficient) for the variable "frequency of attendance at religious services" = 0 and conclude that the partial slope (b coefficient) for the variable "frequency of attendance at religious services" is not equal to 0. The positive sign of the b coefficient (0.062) means that higher values of frequency of attendance at religious services were associated with higher values of "importance of ethnic identity".

The statement that "survey respondents who attended religious services more often attached greater importance to ethnic identity" is correct. The individual relationship between the independent variable "frequency of attendance at religious services" [attend] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = .141, t(199) = 2.16, p = .032.

Page 62: Solving Stepwise Regression Problems

Slide 62

Marking the Statement about Relationship of ATTEND RELIGIOUS SERVICES

Since the statement "survey respondents who attended religious services more often attached greater importance to ethnic identity" is supported by our statistical results, we mark the check box.

The following check box is not marked because the statement contradicts the finding we have just made.

Page 63: Solving Stepwise Regression Problems

Slide 63

Answering Question about Relationship of AGE

The statement that "survey respondents who were older attached greater importance to ethnic identity" is not correct. The variable "age" [age] was not among the list of variables included in the stepwise model.

Page 64: Solving Stepwise Regression Problems

Slide 64

Marking the Statement for Age

The check box for the statement for age is not marked because the variable did not enter the model in the stepwise regression.

Page 65: Solving Stepwise Regression Problems

Slide 65

Statement about Cross-validation

The final statement concerns the generalizability of our findings to the larger population. To answer this question, we will do a 75/25% cross-validation.

The findings from our analysis are generalizable to the extent that they are applicable to cases not included in the analysis. Since we cannot collect new cases, we will divide our sample into two subsets, using one subset to create the model and testing the findings on the second subset, whose cases were not included in the analysis that created the model.

Page 66: Solving Stepwise Regression Problems

Slide 66

Creating the Training Sample and the Validation Sample - 1

The 75/25% cross-validation requires that we randomly divide the cases for this analysis into two parts: 75% of the cases will be used to run the stepwise regression (the training sample), and the resulting model will be tested for accuracy on the remaining 25% of the cases (the validation sample).

To set the seed for the random number generator, select Random Number Generator from the Transform menu.

NOTE: you must use the random number seed that is stated in the problem in order to produce the same results that I found. Any other seed will generate a different random sequence that can produce results that are very different from mine.

Page 67: Solving Stepwise Regression Problems

Slide 67

Creating the Training Sample and the Validation Sample - 2

First, mark the check box for Set Starting Point.

Second, select the option button for a Fixed Value.

Third, type the seed number provided in the problem directions: 726201.

Fourth, click on the OK button to complete the action.

NOTE: SPSS does not provide any feedback that the seed has been set or changed. If you are in doubt, you can reopen the dialog box and see what it indicates.

Page 68: Solving Stepwise Regression Problems

Slide 68

Creating the Training Sample and the Validation Sample - 3

We will create a variable that will contain the information about whether a case is in the training sample or the validation sample. We will name this variable “split” and use a value of 1 to indicate the training sample and a value of 0 to indicate the validation sample.

To create the new variable, select Compute from the Transform menu.

Page 69: Solving Stepwise Regression Problems

Slide 69

Creating the Training Sample and the Validation Sample - 4

Type the name of the new variable, split, in the Target Variable text box.

Type the formula as shown in the Numeric Expression text box.

Click on the OK button to create the variable.

The formula uses the SPSS UNIFORM function to create a uniform distribution of decimal numbers between 0 and 1. If the generated number for a case is less than or equal to 0.75, the statement in the text box is True and the split variable will be assigned a 1 for that case. If the generated number is larger than 0.75, the statement is false and the case will be assigned a 0 for split.
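The same assignment rule can be sketched outside SPSS. The Python below is only an illustration: Python's random module is not the SPSS generator, so seeding it with 726201 will not reproduce the SPSS assignments. The function name make_split and the case count of 200 are assumptions made for the example.

```python
import random

def make_split(n_cases, seed, train_share=0.75):
    """Return one flag per case: 1 = training sample, 0 = validation sample."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    # rng.random() draws a uniform number in [0, 1); the flag is 1 when the
    # draw is less than or equal to train_share, otherwise 0.
    return [1 if rng.random() <= train_share else 0 for _ in range(n_cases)]

# Hypothetical number of cases; the seed is the one given in the problem.
split = make_split(n_cases=200, seed=726201)
print(sum(split), "training cases;", len(split) - sum(split), "validation cases")
```

Running the function twice with the same seed gives the same flags, which is the reason the problem specifies a seed.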

Page 70: Solving Stepwise Regression Problems

Slide 70

Creating the Training Sample and the Validation Sample - 5

If we scroll the data editor window to the right, we see the split variable in a new column.

Page 71: Solving Stepwise Regression Problems

Slide 71

Creating the Training Sample and the Validation Sample - 6

If we create a frequency distribution for the split variable, we see that the breakdown is approximately, not exactly, 75/25. This is a consequence of generating random numbers: you have no control over the sequence the generator produces beyond setting an initial seed.

Though I have done it to create specific results for homework problems, it is not acceptable to run repeated series of random numbers until one gets a sequence that has desirable properties.

Page 72: Solving Stepwise Regression Problems

Slide 72

An Additional Task before Running the Stepwise Regression on the Training Sample

Before we run the regression on the training sample, we need an additional step that will enable us to compare the accuracy of the model for the training sample to the accuracy of the model for the validation sample, using the R² for each as our measure of accuracy.

We need to exclude from the analysis cases that are missing data for any of the variables that we have designated as candidates for inclusion. If we don’t specifically do this, SPSS may include different cases in predicting values for the dependent variable than it does in determining which variables to include in the model.

In model building, SPSS does listwise exclusion of missing data and omits any cases that have missing data for any variable. In predicting scores on the dependent variable, it excludes cases that are missing data for only the variables included in the stepwise model. Thus, when selecting variables, SPSS assumes that only respondents who answer all questions are valid cases; in predicting scores, it assumes that failing to answer a question on a variable that is not included has no importance in the analysis.
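A small sketch may make the difference between the two exclusion rules concrete. The records below are made up for illustration, only a few of the variables are shown, and the assumption that only attend entered the model is hypothetical.

```python
# Toy records (made up); None marks a missing value.
cases = [
    {"ethimp": 3, "attend": 5, "age": 40, "educ": 12},
    {"ethimp": 2, "attend": 1, "age": None, "educ": 16},
    {"ethimp": 4, "attend": None, "age": 33, "educ": 14},
    {"ethimp": 1, "attend": 2, "age": 58, "educ": None},
]

def complete_on(case, variables):
    """True when the case has a value for every listed variable."""
    return all(case[v] is not None for v in variables)

candidates = ["ethimp", "attend", "age", "educ"]  # dependent variable plus all candidates
in_model = ["ethimp", "attend"]                   # hypothetical: only attend entered

# Model building uses cases complete on every candidate variable.
n_building = sum(complete_on(c, candidates) for c in cases)
# Prediction uses cases complete on the variables in the final model only.
n_predicting = sum(complete_on(c, in_model) for c in cases)
print(n_building, n_predicting)
```

The two counts differ, so an R² computed for one set of cases would not be comparable to an R² computed for the other unless incomplete cases are excluded first.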

Page 73: Solving Stepwise Regression Problems

Slide 73

Selecting Cases with Valid Data for All Variables in the Analysis - 1

To include only those cases that have valid data for all variables in the analysis, choose the Select Cases command from the Data menu.

Page 74: Solving Stepwise Regression Problems

Slide 74

Selecting Cases with Valid Data for All Variables in the Analysis - 2

First, mark the option button for If condition is satisfied.

Second, click on the If button to add the condition.

Page 75: Solving Stepwise Regression Problems

Slide 75

Selecting Cases with Valid Data for All Variables in the Analysis - 3

Type

NMISS(ethimp,age,educ,rincom98,polviews,attend,sei,sex_1,hhrace_1,hhrace_2) = 0

in the condition textbox. In the parentheses, we type the names of the dependent variable and all of the independent variables.

The SPSS NMISS function counts the number of variables in the list that have missing data.

A result of 0 indicates that the case was not missing data for any of the variables, so we tell SPSS to include only the cases for which the calculation equals 0.
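For readers who want to see the counting logic spelled out, here is a rough Python equivalent of the NMISS(...) = 0 condition. The helper nmiss is a hypothetical stand-in written for this example, not an SPSS or library function, and the two cases are made up.

```python
VARIABLES = ["ethimp", "age", "educ", "rincom98", "polviews",
             "attend", "sei", "sex_1", "hhrace_1", "hhrace_2"]

def nmiss(case, variables):
    """Count how many of the listed variables are missing (None or absent)."""
    return sum(1 for name in variables if case.get(name) is None)

# Two made-up cases: the first is complete, the second lacks rincom98 and sei.
complete = {name: 1 for name in VARIABLES}
partial = dict(complete, rincom98=None, sei=None)

# Keep only the cases with zero missing values, as the SPSS condition does.
selected = [case for case in (complete, partial) if nmiss(case, VARIABLES) == 0]
print(nmiss(complete, VARIABLES), nmiss(partial, VARIABLES), len(selected))
```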

Page 76: Solving Stepwise Regression Problems

Slide 76

Selecting Cases with Valid Data for All Variables in the Analysis - 4

Click on the Continue button to close the dialog box.

Page 77: Solving Stepwise Regression Problems

Slide 77

Selecting Cases with Valid Data for All Variables in the Analysis - 5

Click on the OK button to execute the command.

Page 78: Solving Stepwise Regression Problems

Slide 78

Selecting Cases with Valid Data for All Variables in the Analysis - 6

The excluded cases have a slash through the case number.

Page 79: Solving Stepwise Regression Problems

Slide 79

Run the Stepwise Regression on the Training Sample - 1

To run the regression, select Regression > Linear from the Analyze menu.

Page 80: Solving Stepwise Regression Problems

Slide 80

Run the Stepwise Regression on the Training Sample - 2

Move the dependent variable "importance of ethnic identity" [ethimp] to the Dependent text box.

Move the independent variables:

• "age" [age],
• "highest year of school completed" [educ],
• "income" [rincom98],
• "description of political views" [polviews],
• "frequency of attendance at religious services" [attend],
• "socioeconomic index" [sei],
• "survey respondents who were male" [sex_1],
• "survey respondents who were white" [hhrace_1],
• "survey respondents who were black" [hhrace_2]

to the Independent(s) list box.

Page 81: Solving Stepwise Regression Problems

Slide 81

Run the Stepwise Regression on the Training Sample - 3

Select Stepwise from the Method drop down menu.

The critical steps to produce a stepwise regression on the training sample are the selection of the stepwise method for entering variables and the inclusion of the training sample cases.

Page 82: Solving Stepwise Regression Problems

Slide 82

Run the Stepwise Regression on the Training Sample - 4

To select the training sample, we move the split variable to the Selection Variable text box.

First, highlight the split variable.

Second, click on the right arrow button to the left of the Selection Variable text box.

Page 83: Solving Stepwise Regression Problems

Slide 83

Run the Stepwise Regression on the Training Sample - 5

Click on the Rule button to specify the value that we want split to use to select cases.

Page 84: Solving Stepwise Regression Problems

Slide 84

Run the Stepwise Regression on the Training Sample - 6

First, type 1 in the Value text box. Recall that this is the value of split indicating training cases.

Second, click on the Continue button to close the dialog box.

Page 85: Solving Stepwise Regression Problems

Slide 85

Run the Stepwise Regression on the Training Sample - 7

Click on the Statistics button to specify additional output.

Page 86: Solving Stepwise Regression Problems

Slide 86

Run the Stepwise Regression on the Training Sample - 8

We mark the check boxes for optional statistics:

• R squared change,
• Descriptives,
• Part and partial correlations,
• Collinearity diagnostics, and
• Durbin-Watson.

Click on the Continue button to close the dialog box.

Page 87: Solving Stepwise Regression Problems

Slide 87

Run the Stepwise Regression on the Training Sample - 9

Page 88: Solving Stepwise Regression Problems

Slide 88

Run the Stepwise Regression on the Training Sample - 10

Click on the OK button to produce the output.

Page 89: Solving Stepwise Regression Problems

Slide 89

Validating the Model - 1

The first step in our validation is to make certain that the model based on the training sample reasonably approximates the model based on the full sample.

Here we see that both models included 3 variables.

If the number of models (steps) were different, the validation would fail.

Page 90: Solving Stepwise Regression Problems

Slide 90

Validating the Model - 2

Second, we verify that the model based on the training sample included the same three variables as the model based on the full data set.

We do not require that the variables be entered in the same order, as the difference in samples can easily result in small shifts.

The same variables entered the stepwise regression on the training sample as entered the stepwise regression on the full sample: "frequency of attendance at religious services" [attend], "survey respondents who were black" [hhrace_2], and "survey respondents who were white" [hhrace_1].
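Because the order of entry does not matter for this check, the comparison amounts to comparing two sets. A minimal sketch follows; the variable names come from the problem, but the entry orders shown are illustrative assumptions rather than the actual SPSS output.

```python
# Entry orders below are illustrative only.
full_sample_vars = ["hhrace_2", "attend", "hhrace_1"]
training_vars = ["attend", "hhrace_2", "hhrace_1"]

# Converting to sets ignores entry order and compares membership only.
same_variables = set(full_sample_vars) == set(training_vars)
print(same_variables)
```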

Page 91: Solving Stepwise Regression Problems

Slide 91

Validating the Model - 3

Third, we compare the accuracy of the model for the validation sample to the accuracy of the model for the training sample.

We have to calculate the R² for the validation sample (split ~= 1.0) by hand from the Multiple R: .402² = .162.

The R² for the 75% training sample was .131 and the R² for the 25% validation sample was .162, resulting in a value of .131 – .162 = -.031 for shrinkage. Since -.031 is <= .02, the validation is successful.

If the shrinkage were greater than .02 (2%), the validation would fail.
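The arithmetic for this step can be checked with a few lines of Python. The numbers are the ones reported above; the variable names are our own.

```python
multiple_r_validation = 0.402   # Multiple R for the validation sample
r2_training = 0.131             # R squared for the training sample

# Square the Multiple R to get R squared for the validation sample.
r2_validation = round(multiple_r_validation ** 2, 3)

# Shrinkage is training R squared minus validation R squared.
shrinkage = round(r2_training - r2_validation, 3)

# The validation succeeds when shrinkage is no more than .02 (2%).
validation_ok = shrinkage <= 0.02
print(r2_validation, shrinkage, validation_ok)
```

A negative shrinkage, as here, means the model fit the validation cases slightly better than the training cases.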

Page 92: Solving Stepwise Regression Problems

Slide 92

Marking the Check Box for the Cross-validation Statement

The validation analysis supported the generalizability of the findings of the analysis to the population represented by the sample in the data set.

We mark the check box for the validation.

Page 93: Solving Stepwise Regression Problems

Slide 93

The Question Graded in Blackboard

When the problem was submitted, Blackboard confirmed that all marked answers were correct.

Page 94: Solving Stepwise Regression Problems

Slide 94

Logic Diagram for Solving Homework Problems: Level of Measurement

Level of measurement ok?

• Yes: Run script to dummy-code non-metric variables, if needed, then run stepwise regression.

• No: Ordinal level variable treated as metric?

  • Yes: Consider limitation in discussion of findings, then proceed as for "Yes" above.

  • No: Do not mark check box. Mark: Inappropriate application of the statistic. Stop.

Page 95: Solving Stepwise Regression Problems

Slide 95

Logic Diagram for Solving Homework Problems: Sample Size and Overall Relationship

Sample size ok (number of IVs x 40)?

• Yes: Mark check box for correct sample size.

• No: Do not mark check box. Consider limitation in discussion of findings.

1+ variables entered in model?

• Yes: Model will be statistically significant if any variables entered.

• No: Stop (no significant predictors).

Model is not trivial (Multiple R >= .10)?

• Yes: Continue.

• No: Stop (model is not usable).
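The sample size rule is simple enough to express directly. In this sketch the function name is hypothetical, and treating a sample exactly equal to the minimum as acceptable is an assumption; the norm of 40 cases per independent variable is the one used in these slides.

```python
def sample_size_ok(n_cases, n_independent_vars, cases_per_iv=40):
    """True when the sample meets the norm of cases_per_iv cases per independent variable."""
    return n_cases >= n_independent_vars * cases_per_iv

# With the 9 independent variables in this problem, the norm calls for 360 cases.
print(9 * 40)
print(sample_size_ok(400, 9))  # 400 is a hypothetical sample size
```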

Page 96: Solving Stepwise Regression Problems

Slide 96

Logic Diagram for Solving Homework Problems: Strength of Overall Relationship

Strength of model correctly characterized?

• Yes: Mark check box for correct strength.

• No: Do not mark check box.

Subset of entered variables correctly identified?

• Yes: Mark check box for correct subset.

• No: Do not mark check box for correct subset.

Page 97: Solving Stepwise Regression Problems

Slide 97

Logic Diagram for Solving Homework Problems: Individual Relationships

Variable entered and not removed?

• No: Do not mark check box for individual relationship.

• Yes: Correct interpretation of direction of relationship?

  • Yes: Mark check box for individual relationship.

  • No: Do not mark check box for individual relationship.

Additional variables entered?

• Yes: Repeat the checks above for the next variable.

• No: Continue.

Page 98: Solving Stepwise Regression Problems

Slide 98

Logic Diagram for Solving Homework Problems: Cross-validation

Create split variable using specified seed.

Select cases with no missing values for all variables.

Run stepwise regression on training sample.

Same variables entered in full model?

• No: Do not mark check box for supporting validation.

• Yes: Shrinkage < or = 2%?

  • Yes: Mark check box for supporting validation.

  • No: Do not mark check box for supporting validation.