Week 11 November 10-14 Four Mini-Lectures QMM 510 Fall 2014

Upload: janine

Post on 22-Feb-2016



TRANSCRIPT

Page 1: Week  11  November  10-14

Week 11 November 10-14

Four Mini-Lectures QMM 510 Fall 2014

Page 2: Week  11  November  10-14

12-2

Chapter Learning Objectives

LO12-1: Calculate and test a correlation coefficient for significance.
LO12-2: Interpret the slope and intercept of a regression equation.
LO12-3: Make a prediction for a given x value using a regression equation.
LO12-4: Fit a simple regression on an Excel scatter plot.
LO12-5: Calculate and interpret confidence intervals for regression coefficients.
LO12-6: Test hypotheses about the slope and intercept by using t tests.
LO12-7: Perform regression analysis with Excel or other software.
LO12-8: Interpret the standard error, R², ANOVA table, and F test.
LO12-9: Distinguish between confidence and prediction intervals for Y.
LO12-10: Test residuals for violations of regression assumptions.
LO12-11: Identify unusual residuals and high-leverage observations.

Chapter 12

Chapter 12: Correlation and Regression

Too much?

Page 3: Week  11  November  10-14

12-3

Correlation Analysis ML 11.1

• Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.

• A scatter plot:
- displays each observed data pair (xi, yi) as a dot on an X/Y grid.
- indicates visually the strength of the relationship between X and Y.

Visual Displays

Chapter 12

Sample Scatter Plot

Page 4: Week  11  November  10-14

12-4

[Figure: six scatter plot panels labeled Strong Positive Correlation, Weak Positive Correlation, Weak Negative Correlation, Strong Negative Correlation, No Correlation, and Nonlinear Relation]

Chapter 12

Correlation Analysis

Note: r is an estimate of the population correlation coefficient ρ (rho).

Page 5: Week  11  November  10-14

12-5

• Step 1: State the Hypotheses
H0: ρ = 0
H1: ρ ≠ 0

• Step 2: Specify the Decision Rule
For degrees of freedom d.f. = n − 2, look up the critical value tα/2 for a two-tailed test in Appendix D or use Excel's =T.INV.2T(α,df).

Steps in Testing if r = 0 (population correlation = 0)

Chapter 12

Correlation Analysis

• Step 3: Calculate the Test Statistic

$t_{calc} = r\sqrt{\dfrac{n-2}{1-r^2}}$

• Step 4: Make the Decision
If using the t statistic method, reject H0 if |tcalc| > tα/2 or if the p-value ≤ α.

−1 ≤ r ≤ +1

r = 0 indicates no linear relationship
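The four steps can be sketched in a few lines of Python. This is only an illustration: it uses the 20 quant/verbal scores that appear later in this deck, the standard statistic t = r·sqrt((n − 2)/(1 − r²)), and a critical value of 2.101 that is assumed from a t table (α = 0.05, two tails, d.f. = 18).

```python
import math

# Quant (x) and verbal (y) scores for the 20 students listed later in this deck.
quant = [520, 329, 225, 424, 650, 491, 384, 311, 236, 344,
         541, 324, 515, 528, 380, 629, 228, 454, 514, 677]
verbal = [398, 505, 183, 332, 737, 578, 344, 367, 298, 600,
          643, 328, 556, 527, 504, 695, 133, 478, 413, 742]


def correlation(x, y):
    """Sample correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


n = len(quant)
r = correlation(quant, verbal)

# Step 3: test statistic with n - 2 degrees of freedom.
t_calc = r * math.sqrt((n - 2) / (1 - r ** 2))

# Step 2: critical value (assumed from a t table, alpha = 0.05, d.f. = 18).
T_CRIT = 2.101

# Step 4: two-tailed decision.
reject_h0 = abs(t_calc) > T_CRIT
print(f"r = {r:.4f}, t_calc = {t_calc:.3f}, reject H0: {reject_h0}")
```

The computed r should agree with the "Multiple R" value shown in the Excel output for this example.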

Page 6: Week  11  November  10-14

12-6

• Equivalently, you can calculate the critical value for the correlation coefficient using

$r_{crit} = \dfrac{t_{\alpha/2}}{\sqrt{t_{\alpha/2}^2 + n - 2}}$

• This method gives a benchmark for the correlation coefficient.

• However, it yields no p-value and is inflexible if you change your mind about α.

• MegaStat uses this method, giving two-tail critical values for α = 0.05 and α = 0.01.
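As a rough sketch of the benchmark approach, the critical r can be computed from a t critical value with the standard conversion r = t / sqrt(t² + n − 2). The t values below are assumptions copied from a t table (α = 0.05, two tails); the function name is hypothetical.

```python
import math


def critical_r(t_crit, n):
    """Critical value of r implied by a two-tailed t critical value with n - 2 d.f."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


# Assumed t-table values for alpha = 0.05 (two tails), keyed by sample size n.
T_TABLE = {10: 2.306, 20: 2.101, 30: 2.048}

for n, t_crit in T_TABLE.items():
    print(f"n = {n:2d}: |r| must exceed {critical_r(t_crit, n):.3f}")
```

Larger samples need a smaller |r| to reach significance, which is why the benchmark is tabulated by sample size.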

Alternative Method to Test for r = 0

Chapter 12

Correlation Analysis

Critical values of r for various sample sizes

Page 7: Week  11  November  10-14

12-7

• Simple regression analyzes the relationship between two variables.

• It specifies one dependent (response) variable and one independent (predictor) variable.

• This hypothesized relationship (in this chapter) will be linear.

What is Simple Regression?

Chapter 12

Simple Regression ML 11.2

Page 8: Week  11  November  10-14

12-8

Interpreting an Estimated Regression Equation

Chapter 12

Simple Regression

Page 9: Week  11  November  10-14

12-9

Prediction Using Regression: Examples

Chapter 12

Simple Regression

Page 10: Week  11  November  10-14

12-10

Cause-and-Effect?

Chapter 12

Simple Regression

Can We Make Predictions?

Page 11: Week  11  November  10-14

12-11

• The assumed model for a linear relationship is

y = β0 + β1x + ε.

• The relationship holds for all pairs (xi, yi).

• The error term ε is not observable; it is assumed to be independently normally distributed with a mean of 0 and standard deviation σ.

Model and Parameters

Chapter 12

Regression Terminology

• The unknown parameters are:
β0 Intercept
β1 Slope

Page 12: Week  11  November  10-14

12-12

• The fitted model or regression model used to predict the expected value of Y for a given value of X is

ŷ = b0 + b1x.

Model and Parameters

• The fitted coefficients are:
b0 Estimated intercept
b1 Estimated slope

Chapter 12

Regression Terminology

Page 13: Week  11  November  10-14

12-13

Chapter 12

A more precise method is to let Excel calculate the estimates. Enter observations on the independent variable x1, x2, . . ., xn and the dependent variable y1, y2, . . ., yn into separate columns, and let Excel fit the regression equation. Excel will choose the regression coefficients so as to produce a good fit.

Regression Terminology

Page 14: Week  11  November  10-14

12-14

Chapter 12

• Slope Interpretation: The slope of −0.0785 says that for each additional unit of engine horsepower, fuel efficiency decreases by 0.0785 miles per gallon. This estimated slope is a statistic because a different sample might yield a different estimate of the slope.

• Intercept Interpretation: The intercept value of 49.216 suggests that when the engine has no horsepower, the fuel efficiency would be quite high. However, the intercept has little meaning in this case, not only because zero horsepower makes no logical sense, but also because extrapolating to x = 0 is beyond the range of the observed data.
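A prediction from this fitted equation is just arithmetic. The sketch below assumes the slope is negative (the slide says MPG decreases with horsepower) and uses hypothetical horsepower values; whether they fall inside the range of the 15 sampled vehicles is not known from the slide.

```python
B0 = 49.216    # estimated intercept from the slide
B1 = -0.0785   # estimated slope; sign assumed negative since MPG falls as HP rises


def predict_mpg(horsepower):
    """Predicted miles per gallon for a given engine horsepower."""
    return B0 + B1 * horsepower


for hp in (150, 200, 250):  # hypothetical horsepower values
    print(f"HP = {hp}: predicted MPG = {predict_mpg(hp):.2f}")
```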

Regression Terminology

Scatter plot shows a sample of miles per gallon and horsepower for 15 vehicles.

Page 15: Week  11  November  10-14

12-15

• The ordinary least squares method (OLS) estimates the slope and intercept of the regression line so that the sum of squared residuals is minimized.

• The sum of the residuals = 0.

• The sum of the squared residuals is SSE.
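These three properties can be checked numerically. The sketch below uses a tiny made-up data set (not from the lecture) and hypothetical helper names: it fits the line, confirms that the residuals sum to zero, and shows that nudging either coefficient away from the OLS values increases SSE.

```python
x = [1, 2, 3, 4, 5]   # made-up data for illustration only
y = [2, 4, 5, 4, 5]


def fit_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def sse(b0, b1, x, y):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))


b0, b1 = fit_ols(x, y)
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

print("sum of residuals:", round(sum(residuals), 10))
print("SSE at OLS fit:  ", round(sse(b0, b1, x, y), 4))
print("SSE, slope + 0.1:", round(sse(b0, b1 + 0.1, x, y), 4))
print("SSE, intcpt + 0.1:", round(sse(b0 + 0.1, b1, x, y), 4))
```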

OLS Method

Chapter 12

Ordinary Least Squares (OLS) Formulas

Page 16: Week  11  November  10-14

12-16

• The OLS estimator for the slope is:

$b_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Excel function: =SLOPE(YData, XData)

• The OLS estimator for the intercept is:

$b_0 = \bar{y} - b_1\bar{x}$

Excel function: =INTERCEPT(YData, XData)

Slope and Intercept

Chapter 12

Ordinary Least Squares (OLS) Formulas

These formulas are built into Excel.
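For readers without Excel, here is a minimal Python sketch of what =SLOPE and =INTERCEPT compute, using the standard OLS formulas (slope = Sxy/Sxx, intercept = ȳ − b1·x̄). The data are the 20 quant/verbal scores from the achievement test example in this deck; the function name is hypothetical.

```python
# Quant (x) and verbal (y) scores from the achievement test example.
quant = [520, 329, 225, 424, 650, 491, 384, 311, 236, 344,
         541, 324, 515, 528, 380, 629, 228, 454, 514, 677]
verbal = [398, 505, 183, 332, 737, 578, 344, 367, 298, 600,
          643, 328, 556, 527, 504, 695, 133, 478, 413, 742]


def ols_slope_intercept(x, y):
    """Return (slope, intercept) from the OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = ols_slope_intercept(quant, verbal)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The results should match the coefficients in the Excel regression output shown for this example.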

Page 17: Week  11  November  10-14

12-17

Example: Achievement Test Scores

Chapter 12

20 high school students’ achievement exam scores.

Obs   Quant (X)   Verbal (Y)
  1      520         398
  2      329         505
  3      225         183
  4      424         332
  5      650         737
  6      491         578
  7      384         344
  8      311         367
  9      236         298
 10      344         600
 11      541         643
 12      324         328
 13      515         556
 14      528         527
 15      380         504
 16      629         695
 17      228         133
 18      454         478
 19      514         413
 20      677         742

Note that verbal scores average higher than quant scores (slope exceeds 1 and intercept shifts the line up almost 20 points).

Ordinary Least Squares (OLS) Formulas

Page 18: Week  11  November  10-14

12-18

Slope and Intercept

Chapter 12

Ordinary Least Squares (OLS) Formulas

Page 19: Week  11  November  10-14

12-19

• We want to explain the total variation in Y around its mean (SST for total sums of squares).

• The regression sum of squares (SSR) is the explained variation in Y.

Assessing Fit

Chapter 12

Assessing Fit

Page 20: Week  11  November  10-14

12-20

• The error sum of squares (SSE) is the unexplained variation in Y.

• If the fit is good, SSE will be relatively small compared to SST.

• A perfect fit is indicated by an SSE = 0.

• The magnitude of SSE depends on n and on the units of measurement.

Chapter 12

Assessing Fit

Assessing Fit

Page 21: Week  11  November  10-14

12-21

Coefficient of Determination

0 ≤ R² ≤ 1

• Often expressed as a percent, an R² = 1 (i.e., 100%) indicates perfect fit.

• In simple regression, R² = r² (where r² is the squared correlation coefficient).

• R² is a measure of relative fit based on a comparison of SSR (explained variation) and SST (total variation).
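The decomposition SST = SSR + SSE and the resulting R² = SSR/SST can be verified on a small made-up data set (not from the lecture). The sketch also checks that R² equals the squared correlation coefficient in simple regression.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data for illustration only
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((a - x_bar) ** 2 for a in x)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)                # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained variation
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))    # unexplained variation

r_squared = ssr / sst
r = sxy / math.sqrt(sxx * sst)   # correlation coefficient (Syy equals SST)

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```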

Chapter 12

Assessing Fit

Page 22: Week  11  November  10-14

12-22

Example: Achievement Test Scores

Chapter 12

Excel shows the sums needed to calculate R2.

Strong relationship between quant score and verbal score (68 percent of variation explained)

Regression Statistics
Multiple R           0.82694384
R Square             0.68383612
Adjusted R Square    0.66627146
Standard Error       99.8002618
Observations         20

ANOVA Table
Source       df   SS         MS         F
Regression    1   387771.3   387771.3   38.9325
Residual     18   179281.7   9960.092
Total        19   567053

R² = SSR / SST = 387771 / 567053 = .6838 (SSR is the Regression SS; SST is the Total SS)

Assessing Fit

Page 23: Week  11  November  10-14

12-23

Standard Error of Regression

• The standard error (se) is an overall measure of model fit:

$s_e = \sqrt{\dfrac{SSE}{n-2}}$

• If the fitted model’s predictions are perfect (SSE = 0), then se = 0. Thus, a small se indicates a better fit.

• Used to construct confidence intervals.

• Magnitude of se depends on the units of measurement of Y and on data magnitude.
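As a quick check, the standard error can be recovered from the residual sum of squares in the ANOVA table for the achievement test example, using the usual formula se = sqrt(SSE / (n − 2)).

```python
import math

SSE = 179281.6606   # residual sum of squares from the ANOVA table in this deck
N = 20              # number of observations

# Standard error of the regression: square root of the mean squared error.
se = math.sqrt(SSE / (N - 2))
print(f"se = {se:.4f}")
```

The result should agree with the "Standard Error" line of the Excel regression statistics.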

Chapter 12

Tests for Significance

Excel’s Data Analysis > Regression calculates se.

Page 24: Week  11  November  10-14

12-24

• Standard error of the slope and intercept:

Confidence Intervals for Slope and Intercept

Chapter 12

Excel’s Data Analysis > Regression constructs confidence intervals for the slope and intercept.

Tests for Significance

• Confidence interval for the true slope and intercept:
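A confidence interval of this kind has the usual form estimate ± tα/2 × standard error. The sketch below plugs in the coefficients and standard errors reported in the Excel output for the achievement test example; the critical value 2.1009 is an assumption taken from a t table (95% confidence, d.f. = 18).

```python
T_CRIT = 2.1009  # assumed t critical value: 95% confidence, d.f. = n - 2 = 18

# Coefficients and standard errors as reported in the Excel output in this deck.
estimates = {
    "Intercept": (19.5924736, 75.25768037),
    "Slope": (1.03046307, 0.165149128),
}


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) limits: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


intervals = {name: confidence_interval(b, s) for name, (b, s) in estimates.items()}
for name, (lower, upper) in intervals.items():
    print(f"{name}: [{lower:.4f}, {upper:.4f}]")
```

Because the slope interval excludes zero while the intercept interval includes it, only the slope differs significantly from zero at the 5 percent level.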

Page 25: Week  11  November  10-14

12-25

• If β1 = 0, then the regression model collapses to a constant β0 plus random error.

• The hypotheses to be tested are:
H0: β1 = 0
H1: β1 ≠ 0

Hypothesis Tests

Chapter 12

Reject H0 if |tcalc| > tα/2 or if p-value ≤ α.

d.f. = n − 2

Tests for Significance

Excel’s Data Analysis > Regression performs these tests.

Page 26: Week  11  November  10-14

12-26

Example: Achievement Test Scores

Chapter 12

Analysis of Variance: Overall Fit

MegaStat is similar but rounds off and highlights p-values to show significance (light yellow .05, bright yellow .01)

20 high school students’ achievement exam scores.

Excel shows 95% confidence intervals and t test statistics

            Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   19.5924736     75.25768037      0.260339   0.797557   -138.5180458    177.702993
Slope       1.03046307     0.165149128      6.239591   6.93E-06   0.683497623     1.37742851

Regression output (with confidence interval)
variables   coefficients   std. error   t (df=18)   p-value    95% lower    95% upper
Intercept   19.5925        75.2577      0.260       .7976      -138.5180    177.7030
Slope       1.0305         0.1651       6.240       6.93E-06   0.6835       1.3774

Page 27: Week  11  November  10-14

12-27

• To test a regression for overall significance, we use an F test to compare the explained (SSR) and unexplained (SSE) sums of squares.
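The F statistic is the ratio of the two mean squares, MSR = SSR/k and MSE = SSE/(n − k − 1). The sketch below uses the sums of squares reported in the ANOVA table for the achievement test example, and also checks the standard fact that in simple regression (k = 1) F equals the square of the slope's t statistic.

```python
SSR = 387771.2894   # regression (explained) sum of squares from the ANOVA table
SSE = 179281.6606   # residual (unexplained) sum of squares from the ANOVA table
N = 20              # observations
K = 1               # number of predictors in simple regression

msr = SSR / K              # mean square for regression
mse = SSE / (N - K - 1)    # mean square for error
f_calc = msr / mse

T_SLOPE = 6.239591         # t statistic for the slope from the Excel output

print(f"F = {f_calc:.4f}")
print(f"t^2 = {T_SLOPE ** 2:.4f}")
```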

F Test for Overall Fit

Chapter 12

Analysis of Variance: Overall Fit

Page 28: Week  11  November  10-14

12-28

Example: Achievement Test Scores

Chapter 12

Analysis of Variance: Overall Fit

MegaStat is similar, but also highlights p-values to indicate significance (light yellow .05, bright yellow .01)

20 high school students’ achievement exam scores.

ANOVA Table
Source       df   SS            MS         F         Significance F
Regression    1   387771.2894   387771.3   38.9325   6.92538E-06
Residual     18   179281.6606   9960.092
Total        19   567052.95

ANOVA table
Source       SS             df   MS             F       p-value
Regression   387,771.2894    1   387,771.2894   38.93   6.93E-06
Residual     179,281.6606   18   9,960.0923
Total        567,052.9500   19

Excel shows the ANOVA sums, the F test statistic , and its p-value.

Page 29: Week  11  November  10-14

12-29

Confidence and Prediction Intervals for Y

• Confidence interval for the conditional mean of Y is shown below.

• Prediction intervals are wider than confidence intervals for the mean because individual Y values vary more than the mean of Y.
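Since the interval formulas themselves did not survive in this transcript, the sketch below assumes the standard textbook forms: the confidence interval for the mean uses sqrt(1/n + (x − x̄)²/Sxx), and the prediction interval adds 1 under the square root. It uses the quant scores and fitted values from the achievement test example; the critical value 2.101 is assumed from a t table, and x = 500 is a hypothetical input.

```python
import math

# Quant (x) scores from the achievement test example.
quant = [520, 329, 225, 424, 650, 491, 384, 311, 236, 344,
         541, 324, 515, 528, 380, 629, 228, 454, 514, 677]

B0, B1 = 19.5925, 1.0305   # fitted coefficients reported in this deck
SE = 99.80                 # standard error of the regression reported in this deck
T_CRIT = 2.101             # assumed t critical value: 95%, d.f. = 18

n = len(quant)
x_bar = sum(quant) / n
sxx = sum((x - x_bar) ** 2 for x in quant)


def intervals(x_new):
    """Return (y_hat, CI half-width for mean of Y, PI half-width for individual Y)."""
    y_hat = B0 + B1 * x_new
    core = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci_half = T_CRIT * SE * math.sqrt(core)        # conditional mean of Y
    pi_half = T_CRIT * SE * math.sqrt(1 + core)    # individual Y value
    return y_hat, ci_half, pi_half


y_hat, ci_half, pi_half = intervals(500)  # hypothetical quant score
print(f"predicted verbal = {y_hat:.1f}")
print(f"95% CI for mean:    +/- {ci_half:.1f}")
print(f"95% PI for a score: +/- {pi_half:.1f}")
```

Both intervals are narrowest at x̄ and widen as x moves away from the mean of X.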

How to Construct an Interval Estimate for Y

Chapter 12

Excel does not do these CIs!

Page 30: Week  11  November  10-14

12-30

Tests of Assumptions 11.3

Three Important Assumptions

1. The errors are normally distributed.

2. The errors have constant variance (i.e., they are homoscedastic).

3. The errors are independent (i.e., they are nonautocorrelated).

Chapter 12

Non-normal Errors

• Non-normality of errors is a mild violation since the regression parameter estimates b0 and b1 and their variances remain unbiased and consistent.

• Confidence intervals for the parameters may be untrustworthy because the normality assumption is used to justify using Student’s t distribution.

Page 31: Week  11  November  10-14

12-31

Non-normal Errors

• A large sample size would compensate.

• Outliers could pose serious problems.

Chapter 12

Normal Probability Plot

• The normal probability plot tests the assumption
H0: Errors are normally distributed
H1: Errors are not normally distributed

• If H0 is true, the residual probability plot should be linear, as shown in the example.

Residual Tests

Page 32: Week  11  November  10-14

12-32

What to Do about Non-normality?

1. Trim outliers only if they clearly are mistakes.

2. Increase the sample size if possible.

3. If data are totals, try a logarithmic transformation of both X and Y.

Chapter 12

Residual Tests

Page 33: Week  11  November  10-14

12-33

Heteroscedastic Errors (Nonconstant Variance)

• The ideal condition is if the error magnitude is constant (i.e., errors are homoscedastic).

• Heteroscedastic errors increase or decrease with X.

• In the most common form of heteroscedasticity, the variances of the estimators are likely to be understated.

• This results in overstated t statistics and artificially narrow confidence intervals.

Chapter 12

Tests for Heteroscedasticity

• Plot the residuals against X.

Ideally, there is no pattern in the residuals moving from left to right.

Residual Tests

Page 34: Week  11  November  10-14

12-34

Tests for Heteroscedasticity

• The “fan-out” pattern of increasing residual variance is the most common pattern indicating heteroscedasticity.

Chapter 12

Residual Tests

Page 35: Week  11  November  10-14

12-35

What to Do about Heteroscedasticity?

• Transform both X and Y, for example, by taking logs.

• Although it can widen the confidence intervals for the coefficients, heteroscedasticity does not bias the estimates.

Chapter 12

Autocorrelated Errors

• Autocorrelation is a pattern of non-independent errors.

• In a first-order autocorrelation, et is correlated with et−1.

• The estimated variances of the OLS estimators are biased, resulting in confidence intervals that are too narrow, overstating the model’s fit.

Residual Tests

Page 36: Week  11  November  10-14

12-36

Runs Test for Autocorrelation

• In the runs test, count the number of the residuals’ sign reversals (i.e., how often does the residual cross the zero centerline?).

• If the pattern is random, the number of sign changes should be n/2.

• Fewer than n/2 would suggest positive autocorrelation.

• More than n/2 would suggest negative autocorrelation.
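Counting sign reversals is easy to automate. The sketch below applies the idea to the 20 residuals listed for the achievement test example later in this deck and compares the count with the n/2 benchmark; the function name is hypothetical, and this is only the informal count described on the slide, not a formal runs test with a p-value.

```python
# Residuals for observations 1-20 from the achievement test example.
residuals = [-157.4, 146.4, -68.4, -124.5, 47.6, 52.5, -71.3, 26.9, 35.2, 225.9,
             65.9, -25.5, 5.7, -36.7, 92.8, 27.2, -121.5, -9.4, -136.3, 24.8]


def sign_changes(res):
    """Count how often consecutive residuals switch sign (zeros are skipped)."""
    signs = [e > 0 for e in res if e != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


crossings = sign_changes(residuals)
benchmark = len(residuals) / 2

print(f"sign changes = {crossings}, benchmark n/2 = {benchmark:.0f}")
```

A count close to the benchmark gives no visual evidence of autocorrelation.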

Chapter 12

Durbin-Watson (DW) Test

• Tests for autocorrelation under the hypotheses
H0: Errors are nonautocorrelated
H1: Errors are autocorrelated

• The DW statistic will range from 0 to 4.
DW < 2 suggests positive autocorrelation
DW = 2 suggests no autocorrelation (ideal)
DW > 2 suggests negative autocorrelation
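The slide gives the interpretation of DW but not its formula. The sketch below assumes the standard definition, DW = Σ(et − et−1)² / Σet², and applies it to the residuals listed for the achievement test example (which are cross-sectional, so the result is only illustrative). The function name is hypothetical.

```python
def durbin_watson(res):
    """Standard DW statistic: sum of squared successive differences over sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(res, res[1:]))
    denominator = sum(e ** 2 for e in res)
    return numerator / denominator


# Residuals for observations 1-20 from the achievement test example.
residuals = [-157.4, 146.4, -68.4, -124.5, 47.6, 52.5, -71.3, 26.9, 35.2, 225.9,
             65.9, -25.5, 5.7, -36.7, 92.8, 27.2, -121.5, -9.4, -136.3, 24.8]

dw = durbin_watson(residuals)
print(f"DW = {dw:.3f}")

# A perfectly alternating series pushes DW above 2 (negative autocorrelation).
print(durbin_watson([1, -1, 1, -1]))
```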

Residual Tests

Page 37: Week  11  November  10-14

12-37

What to Do about Autocorrelation?

• Transform both variables using the method of first differences, in which both variables are redefined as changes. Then we regress the change in Y against the change in X.

• Although it can widen the confidence interval for the coefficients, autocorrelation does not bias the estimates.

• Don’t worry about it at this stage of your training. Just learn to detect whether it exists.

Chapter 12

Residual Tests

Page 38: Week  11  November  10-14

12-38

Example: Excel’s Tests of Assumptions

Chapter 12

Residual Tests

Warning: Excel offers normal probability plots for residuals, but they are done incorrectly.

Excel’s Data Analysis > Regression does residual plots and gives the DW test statistic. Its standardized residuals are done in a strange way, but usually they are not misleading.

Page 39: Week  11  November  10-14

12-39

Example: MegaStat’s Tests of Assumptions

Chapter 12

Residual Tests

MegaStat will do all three tests (if you check the boxes). Its runs plot (residuals by observation) is a visual test for autocorrelation, which Excel does not offer.

Page 40: Week  11  November  10-14

12-40

Example: MegaStat’s Tests of Assumptions

Chapter 12

Residual Tests

near-linear plot - indicates normal errors
no pattern - suggests homoscedastic errors
no pattern - suggests homoscedastic errors
DW near 2 - suggests no autocorrelation

Page 41: Week  11  November  10-14

12-41

Unusual Observations

Standardized Residuals

• Use Excel, MINITAB, MegaStat or other software to compute standardized residuals.

• If the absolute value of any standardized residual is at least 2, then it is classified as unusual.

Chapter 12

Leverage and Influence

• A high leverage statistic indicates the observation is far from the mean of X.

• These observations are influential because they are at the “end of the lever.”

• The leverage for observation i is denoted hi.

Page 42: Week  11  November  10-14

12-42

Leverage

• A leverage that exceeds 4/n is unusual.

Chapter 12

Unusual Observations

Page 43: Week  11  November  10-14

12-43

• If the absolute value of any standardized residual is at least 2, then it is classified as unusual.

• Leverage that exceeds 4/n indicates an influential X value (far from mean of X).

Chapter 12

Unusual Observations

Obs   Quant (X)   Verbal (Y)   Predicted Y   Residual   Leverage   Std Residual
  1      520         398          555.4        -157.4     0.070       -1.636
  2      329         505          358.6         146.4     0.081        1.530
  3      225         183          251.4         -68.4     0.171       -0.753
  4      424         332          456.5        -124.5     0.050       -1.280
  5      650         737          689.4          47.6     0.176        0.526
  6      491         578          525.5          52.5     0.059        0.542
  7      384         344          415.3         -71.3     0.057       -0.736
  8      311         367          340.1          26.9     0.092        0.283
  9      236         298          262.8          35.2     0.159        0.385
 10      344         600          374.1         225.9     0.073        2.351
 11      541         643          577.1          65.9     0.081        0.689
 12      324         328          353.5         -25.5     0.084       -0.267
 13      515         556          550.3           5.7     0.067        0.059
 14      528         527          563.7         -36.7     0.074       -0.382
 15      380         504          411.2          92.8     0.058        0.959
 16      629         695          667.8          27.2     0.153        0.297
 17      228         133          254.5        -121.5     0.168       -1.335
 18      454         478          487.4          -9.4     0.051       -0.097
 19      514         413          549.3        -136.3     0.067       -1.413
 20      677         742          717.2          24.8     0.210        0.279
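The leverage and standardized residual columns can be reproduced with a short script. The formulas are assumptions, since the slides do not show them: leverage is taken as hi = 1/n + (xi − x̄)²/Sxx, and each residual is divided by se·sqrt(1 − hi), which appears to match the tabulated values. The script then applies the two rules of thumb from the slides (|standardized residual| ≥ 2, leverage > 4/n).

```python
import math

quant = [520, 329, 225, 424, 650, 491, 384, 311, 236, 344,
         541, 324, 515, 528, 380, 629, 228, 454, 514, 677]
verbal = [398, 505, 183, 332, 737, 578, 344, 367, 298, 600,
          643, 328, 556, 527, 504, 695, 133, 478, 413, 742]

n = len(quant)
x_bar = sum(quant) / n
y_bar = sum(verbal) / n
sxx = sum((x - x_bar) ** 2 for x in quant)

# OLS fit.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(quant, verbal)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(quant, verbal)]
se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Assumed formulas (not shown on the slides).
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in quant]
std_resid = [e / (se * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Rules of thumb from the slides; observation numbers are 1-based.
high_leverage = [i + 1 for i, h in enumerate(leverage) if h > 4 / n]
unusual = [i + 1 for i, r in enumerate(std_resid) if abs(r) >= 2]

print("high leverage observations:", high_leverage)
print("unusual residual observations:", unusual)
```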

Example: Achievement Test Scores

Page 44: Week  11  November  10-14

12-44

Other Regression Problems

Outliers

Outliers may be caused by

• an error in recording data.

• impossible data (can be omitted).

• an observation that has been influenced by an unspecified “lurking” variable that should have been controlled but wasn’t.

To fix the problem

• delete the observation(s) if you are sure they are actually wrong.

• formulate a multiple regression model that includes the lurking variable.

Chapter 12

Page 45: Week  11  November  10-14

12-45

Model Misspecification

• If a relevant predictor has been omitted, then the model is misspecified.

• For example, Height depends on Gender as well as Age.

• Use multiple regression instead of bivariate regression.

Ill-Conditioned Data

• Well-conditioned data values are of the same general order of magnitude.

• Ill-conditioned data have unusually large or small data values and can cause loss of regression accuracy or awkward estimates.

Chapter 12

Other Regression Problems

Page 46: Week  11  November  10-14

12-46

Ill-Conditioned Data

• Avoid mixing magnitudes by adjusting the magnitude of your data before running the regression.

• For example, Revenue = 139,405,377 mixed with ROI = .037.

Spurious Correlation

• In a spurious correlation two variables appear related because of the way they are defined.

• This problem is called the size effect or problem of totals.

• Expressing variables as per capita or per cent may be helpful.

Chapter 12

Other Regression Problems

Page 47: Week  11  November  10-14

12-47

Model Form and Variable Transforms

• Sometimes a nonlinear model is a better fit than a linear model. Excel offers other model forms for simple regression (one X and one Y)

• Variables may be transformed (e.g., logarithmic or exponential functions) in order to provide a better fit.

• Log transformations reduce heteroscedasticity.

• Nonlinear models may be difficult to interpret.

Chapter 12

Other Regression Problems

Page 48: Week  11  November  10-14

0-48

Assignments ML 11.4

• Connect C-8 (covers chapter 12)
- You get three attempts
- Feedback is given if requested
- Printable if you wish
- Deadline is midnight each Monday

• Project P-3 (data, tasks, questions)
- Review instructions
- Look at the data
- Your task is to write a nice, readable report (not a spreadsheet)
- Length is up to you

Page 49: Week  11  November  10-14

0-49

Projects: General Instructions

General Instructions

For each team project, submit a short (5-10 page) report (using Microsoft Word or equivalent) that answers the questions posed. Strive for effective writing (see textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless spelling and grammar. Paste graphs and computer tables or output into your written report (it may be easier to format tables in Excel and then use Paste Special > Picture to avoid weird formatting and permit sizing within Word). Allocate tasks among team members as you see fit, but all should review and proofread the report (submit only one report).

Page 50: Week  11  November  10-14

0-50

You will be assigned team members and a dependent variable (see Moodle) from the 2010 state database Big Dataset 09 - US States. The team may change the assigned dependent variable (instructor assigned one just to give you a quick start). Delegate tasks and collaborate as seems appropriate, based on your various skills. Submit one report.

Data: Choose an interesting dependent variable (non-binary) from the 2010 state database posted on Moodle.

Analysis:
(a) Propose a reasonable model of the form Y = f(X1, X2, ... , Xk) using not more than 12 predictors.
(b) Use regression to investigate the hypothesized relationship.
(c) Try deleting poor predictors until you feel that you have a parsimonious model, based on the t-values, p-values, standard error, and adjusted R².
(d) For the preferred model only, obtain a list of residuals and request residual tests and VIFs.
(e) List the states with high leverage and/or unusual residuals.
(f) Make a histogram and/or probability plot of the residuals. Are the residuals normal?
(g) For the predictors that were retained, analyze the correlation matrix and/or VIFs. Is multicollinearity a problem? If so, what could be done?
(h) If you had more time, what might you do?

Watch the instructor’s video walkthrough using Assault as an example (posted on Moodle)

Project P-3

Page 51: Week  11  November  10-14

12-51

Project P-3 (preview, data, tasks)

• Example using the 2005 state database:

• 170 variables on n = 50 states

• Choose one variable as Y (the response).

• Goal: to explain why Y varies from state to state.

• Start choosing X1, X2, … , Xk (the predictors).

• Copy Y and X1, X2, … , Xk to a new spreadsheet.

• Study the definitions for each variable (e.g., Burglary is the burglary rate per 100,000 population).

Page 52: Week  11  November  10-14

12-52

• Why multiple predictors?

• One predictor usually is an incorrect specification.

• Fit can usually be improved.

• How many predictors: Evans’ Rule (k ≤ n/10)

• Up to one predictor per 10 observations

• For example, n = 50 suggests k = 5 predictors.

• Evans’ Rule is conservative. It’s OK to start with more (you will end up with fewer after deleting weak predictors).
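Evans’ Rule reduces to one line of arithmetic. A minimal sketch, with a hypothetical function name:

```python
def max_predictors(n):
    """Evans' Rule: at most one predictor per 10 observations (k <= n/10)."""
    return n // 10


for n in (20, 50, 120):
    print(f"n = {n}: up to {max_predictors(n)} predictors")
```

For the 50-state database this gives k = 5, which the slides treat as a conservative target rather than a hard limit.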

Project P-3 (preview, data, tasks)

Page 53: Week  11  November  10-14

12-53

Project P-3 (preview, data, tasks)

• Work with partners? Absolutely – it will be more fun.

• Post questions for peers or instructor on Moodle.

• Get started. But don’t run a bunch of regressions until you have studied Chapter 13.

• It’s a good idea to have the instructor look over your list of intended Y and X1, X2, … , Xk in order to avoid unnecessary re-work if there are obvious problems.

• Look at all the categories of variables – don’t just grab the first one you see (there are 170 variables). Or just use the one your instructor assigned.