l11 regression

Upload: donald-yum

Post on 20-Feb-2018


  • 7/24/2019 L11 Regression

    1/62

    David Chow, Nov 2014

    Chapter 13

    Simple Linear Regression

  • 7/24/2019 L11 Regression

    Learning Objectives

    Understand the simple linear regression model
      Model assumptions
      Meaning of the coefficients b0 and b1
    Predict the value of a dependent variable using regression results
    Make inferences about the coefficients
    Estimate mean values and predict individual values

  • 7/24/2019 L11 Regression

    Overview

    This PPT
      Linear regression model
      Estimate regression line
      Measures of variation
      Residual analysis
      Statistical inference: testing & interval estimate

    Excel file
      Eg: House price
      Scatter plot
      Reg: standard output
      Reg: residuals
      Data for review question
      Answer

  • 7/24/2019 L11 Regression

    What We Know

    1. Equation of a straight line
    2. Scatter plot
       Linear / non-linear relationship
       Strong / weak relationship
    3. Correlation: ____ relationship between two variables
       Correlation doesn't imply a causal relationship
       Eg: No. of firefighters and damage (in $)
    4. Hypothesis testing

  • 7/24/2019 L11 Regression

    Regression Analysis

    Dependent variable (y): the variable we wish to predict or explain
    Independent variable (x): the variable used to predict or explain y

    Regression analysis is used to:
      Explain the impact of changes in x on the dependent variable y
      Predict the value of y based on the value of x(s)

    Simple linear regression:
      Only one x
      Linear relationship between x and y

  • 7/24/2019 L11 Regression

    Examples

    Y=___, X=___

    Useful tip: A visual check BEFORE running regression is always helpful!

  • 7/24/2019 L11 Regression


    Linear Regression Model

  • 7/24/2019 L11 Regression

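    Linear Regression Model

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

    where
      $Y_i$ = dependent variable
      $\beta_0$ = population Y-intercept
      $\beta_1$ = population slope coefficient
      $X_i$ = independent variable
      $\varepsilon_i$ = random error term
    $\beta_0 + \beta_1 X_i$ is the linear component; $\varepsilon_i$ is the random error component

    Assumptions
    1. X and Y are linearly related
    2. $\beta_0$ and $\beta_1$ are population parameters
    3. Given an X-value (say, $X_i$), $Y = Y_i$ is random, because of $\varepsilon_i$
    4. Assume $E(\varepsilon) = 0$, so that $E(Y)$ = ____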

  • 7/24/2019 L11 Regression

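    Linear Regression Model

    [Figure: population regression line $E(Y) = \beta_0 + \beta_1 X$ on X and Y axes, with intercept $\beta_0$ and slope $\beta_1$. At a given $X_i$, the observed value of Y and the predicted value of Y on the line differ by the random error $\varepsilon_i$ for this $X_i$ value, so that $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$]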

  • 7/24/2019 L11 Regression

    Model Assumptions: LINE

    Linearity
      The relationship between X and Y is linear
    Independence of errors
      Error values are statistically independent
    Normality of error
      Error values are normally distributed for any given value of X
    Equal variance (also called homoscedasticity)
      The probability distribution of the errors has constant variance

    Assumption on error terms: $\varepsilon \sim N(0, \sigma^2)$

    See the appendix for a graphical presentation

  • 7/24/2019 L11 Regression

    Estimate the Regression Line and Predict Y Values

  • 7/24/2019 L11 Regression

    Est. Regression Equation (Prediction Line)

    The regression equation provides an estimate of the population regression line:

    $\hat{Y}_i = b_0 + b_1 X_i$

    where
      $\hat{Y}_i$ = estimated (or predicted) Y value for observation i
      $b_0$ = estimate of the intercept
      $b_1$ = estimate of the slope
      $X_i$ = value of X for observation i

  • 7/24/2019 L11 Regression

    Least Squares Method (The Best Fitted Line)

    b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:

    $\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2$

    The computations (to find b0 and b1) are done by Excel
    Formulae for b0 and b1 are in the appendix
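A minimal Python sketch of the least squares idea, for readers who want to see the computation that Excel performs. It uses the closed-form formulae from the appendix; the function names `least_squares` and `sse` and the toy data are assumptions made for illustration and are not part of the slides.

```python
def least_squares(xs, ys):
    """Return (b0, b1) that minimize sum((y - (b0 + b1 * x)) ** 2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared differences between observed Y and predicted Y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data (hypothetical, not from the slides)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(toy_x, toy_y)
best = sse(toy_x, toy_y, b0, b1)

# Nudging either coefficient away from the fitted values should never lower the SSE.
nudged = [sse(toy_x, toy_y, b0 + d0, b1 + d1)
          for d0 in (-0.1, 0.0, 0.1)
          for d1 in (-0.1, 0.0, 0.1)]
print(b0, b1, best, min(nudged))
```

The nudging loop is only a sanity check on the "minimize" claim; it is not how the coefficients are found.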

  • 7/24/2019 L11 Regression

    Interpreting b0 and b1

    b0 is the estimated mean value of Y when the value of X is zero

    b1 is the estimated change in the mean value of Y as a result of a one-unit increase in X

  • 7/24/2019 L11 Regression

    Eg: House Price (in Excel File)

    A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

    A random sample of 10 houses is selected
      Dependent var (Y) = house price (in $1000s)
      Independent var (X) = square feet

  • 7/24/2019 L11 Regression

    Eg: House Price Data & Scatter Plot

    House Price in $1000s (Y)    Square Feet (X)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255                          1700

    [Figure: scatter plot of House Price ($1000s) against Square Feet]

  • 7/24/2019 L11 Regression

    Using Excel

    1. Choose Data
    2. Choose Data Analysis
    3. Choose Regression
    4. Input data range and output options

  • 7/24/2019 L11 Regression

    Using Excel: Regression Output

    Regression Statistics
    Multiple R           0.76211
    R Square             0.58082
    Adjusted R Square    0.52842
    Standard Error       41.33032
    Observations         10

    ANOVA
                  df    SS            MS            F          Significance F
    Regression    1     18934.9348    18934.9348    11.0848    0.01039
    Residual      8     13665.5652    1708.1957
    Total         9     32600.5000

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
    Intercept      98.24833        58.03348          1.69296    0.12892    -35.57720    232.07386
    Square Feet    0.10977         0.03297           3.32938    0.01039    0.03374      0.18580

    The regression equation is:

    $\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$
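As a cross-check on the Excel output, the coefficients can be recomputed from the ten houses in the data slide. This is a sketch in plain Python using the appendix formulae; the variable names are illustrative choices, not something the slides define.

```python
# House data from the slides: size in square feet, price in $1000s
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(house_price) / n

# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b0 = y_bar - b1 * x_bar
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sxy / ssx
b0 = y_bar - b1 * x_bar

print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
```

The printed coefficients should agree with the Intercept and Square Feet rows of the Excel output to the digits shown there.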

  • 7/24/2019 L11 Regression

    Eg: House Price

    House price model: Scatter Plot and Prediction Line

    [Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted prediction line; Slope = 0.10977, Intercept = 98.248]

    $\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

  • 7/24/2019 L11 Regression

    Eg: Interpreting b0

    $\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

    b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

    Because a house cannot have a square footage of 0, b0 has no practical application

    Generally speaking, the intercept b0 is not our focus in regression

  • 7/24/2019 L11 Regression

    Eg: Interpreting b1

    $\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

    b1 estimates the change in the mean value of Y as a result of a one-unit increase in X

    Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size

  • 7/24/2019 L11 Regression

    Eg: Making Predictions

    Predict the price for a house with 2000 sq feet:

    $\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{sq. ft.}) = 98.24833 + 0.10977\,(2000) = 317.78$

    The predicted price for a house with 2000 square feet is 317.78 ($1,000s) = $317,780

  • 7/24/2019 L11 Regression

    Making Predictions

    General rule: Predict only within the relevant range of Xs

    [Figure: scatter plot of House Price ($1000s) against Square Feet with the prediction line; the span of observed X values is marked as the relevant range for interpolation]

    Do not try to extrapolate beyond the range of observed Xs
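The prediction rule can be sketched as a small function that refuses to extrapolate. The coefficients are the rounded values from the Excel output, and the range limits are the smallest and largest square footage in the sample; the function name `predict_price` and the choice to raise an error are assumptions made for illustration. Because the coefficients are rounded to five decimals, the result can differ from the slide's 317.78 in the last digit.

```python
B0 = 98.24833               # intercept from the Excel output
B1 = 0.10977                # slope from the Excel output
X_MIN, X_MAX = 1100, 2450   # smallest and largest square footage in the sample


def predict_price(square_feet):
    """Predicted house price in $1000s, only inside the relevant range of X."""
    if not X_MIN <= square_feet <= X_MAX:
        raise ValueError(
            f"{square_feet} sq ft is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; prediction would be extrapolation"
        )
    return B0 + B1 * square_feet


print(predict_price(2000))

try:
    predict_price(3000)
except ValueError as err:
    print("refused:", err)
```

Raising an error is one design choice; a gentler variant could return the prediction together with a warning flag.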

  • 7/24/2019 L11 Regression

    Measures of Variation and $r^2$

  • 7/24/2019 L11 Regression

    Measures of Variation

    Total variation is made up of two parts:

    $SST = SSR + SSE$

    Total Sum of Squares:       $SST = \sum (Y_i - \bar{Y})^2$
    Regression Sum of Squares:  $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
    Error Sum of Squares:       $SSE = \sum (Y_i - \hat{Y}_i)^2$

    where:
      $\bar{Y}$ = mean value of the dependent variable
      $Y_i$ = observed value of the dependent variable
      $\hat{Y}_i$ = predicted value of Y for the given $X_i$ value

  • 7/24/2019 L11 Regression

    Measures of Variation

    SST = total sum of squares (Total Variation)
      Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
    SSR = regression sum of squares (Explained Variation)
      Variation attributable to the relationship between X and Y
    SSE = error sum of squares (Unexplained Variation)
      Variation in Y attributable to factors other than X

  • 7/24/2019 L11 Regression

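    Measures of Variation

    [Figure: regression line on X and Y axes. At a given $X_i$, the distance from $Y_i$ to $\bar{Y}$ (SST $= \sum (Y_i - \bar{Y})^2$) splits into the distance from $Y_i$ to $\hat{Y}_i$ (SSE $= \sum (Y_i - \hat{Y}_i)^2$) and the distance from $\hat{Y}_i$ to $\bar{Y}$ (SSR $= \sum (\hat{Y}_i - \bar{Y})^2$)]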

  • 7/24/2019 L11 Regression

    Coefficient of Determination, $r^2$

    The coefficient of determination is the portion of the total variation in Y that is explained by variation in X

    The coefficient of determination is also called r-squared and is denoted as $r^2$

    $r^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$

    $0 \le r^2 \le 1$

  • 7/24/2019 L11 Regression

    Examples of $r^2$

    [Figure: two scatter plots of Y against X]

    $0 < r^2 < 1$

    Weaker linear relationships between X and Y:
    Some but not all of the variation in Y is explained by variation in X

    Q: Graph for $r^2 = 1$?

  • 7/24/2019 L11 Regression

    Measures of Variation and $r^2$ in Excel: House Price Again

    Regression Statistics
    Multiple R           0.76211
    R Square             0.58082
    Adjusted R Square    0.52842
    Standard Error       41.33032
    Observations         10

    ANOVA
                  df    SS            MS            F          Significance F
    Regression    1     18934.9348    18934.9348    11.0848    0.01039
    Residual      8     13665.5652    1708.1957
    Total         9     32600.5000

    $r^2 = \dfrac{SSR}{SST} = \dfrac{18934.9348}{32600.5000} = 0.58082$

    58.08% of the variation in house prices is explained by variation in sq feet
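To see where the SS column of the ANOVA table comes from, the three sums of squares can be recomputed from the raw house data. This is a sketch in plain Python; the variable names are illustrative assumptions.

```python
# House data from the slides: size in square feet, price in $1000s
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(house_price) / n

# Least squares fit (appendix formulae)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in house_price)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)                    # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(house_price, y_hat))   # unexplained variation
r2 = ssr / sst

print(f"SST={sst:.4f}  SSR={ssr:.4f}  SSE={sse:.4f}  r2={r2:.5f}")
```

SSR and SSE should add up to SST apart from floating-point rounding, which mirrors the decomposition SST = SSR + SSE.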

  • 7/24/2019 L11 Regression

    Standard Error of Estimate

    The standard deviation of the variation of observations around the regression line is estimated by

    $S_{YX} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$

    where
      SSE = error sum of squares
      n = sample size
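A short check of the formula with the numbers from the house price output (SSE from the Residual row of the ANOVA table, n from Observations). The variable names are illustrative.

```python
import math

sse = 13665.5652   # error sum of squares, Residual row of the ANOVA table
n = 10             # sample size

s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate, in $1000s
print(round(s_yx, 5))
```

The result should match the "Standard Error" line under Regression Statistics in the Excel output.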

  • 7/24/2019 L11 Regression

    Standard Error of Estimate in Excel: House Price Again

    Regression Statistics
    Multiple R           0.76211
    R Square             0.58082
    Adjusted R Square    0.52842
    Standard Error       41.33032
    Observations         10

    $S_{YX} = 41.33032$

  • 7/24/2019 L11 Regression

    Comparing Standard Errors

    [Figure: two scatter plots of Y against X with fitted lines, one labeled small $S_{YX}$ and one labeled large $S_{YX}$]

    $S_{YX}$ is a measure of the variation of observed Y values from the regression line

    $S_{YX}$ carries the same unit as Y
      It should be judged relative to the size of the Y values in the sample data
      Eg: S_YX = $41.33K is moderately small relative to house prices in the $200K - $400K range

  • 7/24/2019 L11 Regression

    Example: Salary Data

    Years of employment and salary ($1000) in 7 subjects:

    Years    Salary
    6.6      32
    7.4      42
    8.8      52
    9.7      61
    10.5     62
    10.7     66
    11.8     65

  • 7/24/2019 L11 Regression


    Residual Analysis

    (Autocorrelation is NOT covered)

  • 7/24/2019 L11 Regression

    Residual Analysis

    The residual for observation i, $e_i$, is the difference between its observed and predicted value:

    $e_i = Y_i - \hat{Y}_i$

    Recall the regression assumptions
      L: X and Y linearly related?
      I: Errors statistically independent?
      N: Errors normally distributed?
      E: Errors have constant variance (homoscedasticity)?

    A residual plot (residuals vs X) is very useful in checking the assumptions

  • 7/24/2019 L11 Regression

    L: Linearity

    [Figure: two pairs of panels labeled "Not Linear" and "Linear"; each pair shows a plot of Y against x and the corresponding plot of residuals against x]

  • 7/24/2019 L11 Regression

    I: Independence

    [Figure: plots of residuals against X, labeled "Not Independent" and "Independent"]

  • 7/24/2019 L11 Regression

    N: Normality

    N: Are the errors normally distributed?

    Many ways to check for normality
      Stem-and-Leaf
      Histogram
      Normal Probability Plot, etc.

    My recommendation is always ___

  • 7/24/2019 L11 Regression

    E: Equal Variance

    [Figure: two pairs of panels labeled "Non-constant variance" and "Constant variance"; each pair shows a plot of Y against x and the corresponding plot of residuals against x]

  • 7/24/2019 L11 Regression

    A Residual Plot by Excel: House Price Again

    RESIDUAL OUTPUT

    Observation    Predicted House Price    Residuals
    1              251.92316                -6.923162
    2              273.87671                38.12329
    3              284.85348                -5.853484
    4              304.06284                3.937162
    5              218.99284                -19.99284
    6              268.38832                -49.38832
    7              356.20251                48.79749
    8              367.17929                -43.17929
    9              254.6674                 64.33264
    10             284.85348                -29.85348

    [Figure: House Price Model Residual Plot, Residuals against Square Feet]

    Key: Any pattern in the residual plot?

    NOTE: Autocorrelation is NOT covered in this course
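The residual output can be reproduced from the raw data. The sketch below refits the line and lists predicted values and residuals; it does not draw the plot, and the variable names are illustrative assumptions.

```python
# House data from the slides: size in square feet, price in $1000s
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(house_price) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - yh for y, yh in zip(house_price, predicted)]   # e_i = Y_i - Y_hat_i

for i, (x, yh, e) in enumerate(zip(square_feet, predicted, residuals), start=1):
    print(f"{i:2d}  X={x:4d}  predicted={yh:9.5f}  residual={e:10.5f}")

# With an intercept in the model, least squares residuals sum to (numerically) zero.
print("sum of residuals:", sum(residuals))
```

Plotting `residuals` against `square_feet` in any charting tool gives the residual plot; the point of the check is to look for a pattern, not at the individual values.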

  • 7/24/2019 L11 Regression

    Testing for Significance and Interval Estimate

  • 7/24/2019 L11 Regression

    Inferences About the Slope

    The standard error of the regression slope coefficient (b1) is estimated by

    $S_{b_1} = \dfrac{S_{YX}}{\sqrt{SSX}} = \dfrac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$

    where:
      $S_{b_1}$ = estimate of the standard error of the slope
      $S_{YX} = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
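A sketch of the slope standard error for the house price example. SSX is not printed in the Excel output, so it is computed here from the square-feet column of the data slide; S_YX is taken from the output. Variable names are illustrative.

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
s_yx = 41.33032   # standard error of the estimate from the Excel output

x_bar = sum(square_feet) / len(square_feet)
ssx = sum((x - x_bar) ** 2 for x in square_feet)   # sum of squared deviations of X

s_b1 = s_yx / math.sqrt(ssx)
print(ssx, round(s_b1, 5))
```

The value of `s_b1` should agree with the Standard Error entry in the Square Feet row of the coefficient table.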

  • 7/24/2019 L11 Regression

    t Test for the Slope

    Typical question: X and Y linearly related?
      H0: $\beta_1 = 0$ (no linear relationship)
      H1: $\beta_1 \ne 0$ (linear relationship does exist)

    Test statistic

    $t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}}$, with d.f. $= n - 2$

    where:
      $b_1$ = regression slope coefficient
      $\beta_1$ = hypothesized slope
      $S_{b_1}$ = standard error of the slope

  • 7/24/2019 L11 Regression

    Inferences About the Slope in Excel: House Price Again

    Want to know if ____
    H0: $\beta_1 = 0$; H1: $\beta_1 \ne 0$

                   Coefficients    Standard Error    t Stat     P-value
    Intercept      98.24833        58.03348          1.69296    0.12892
    Square Feet    0.10977         0.03297           3.32938    0.01039

    (In the Square Feet row, the Coefficients column is $b_1$ and the Standard Error column is $S_{b_1}$)

    $t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.10977 - 0}{0.03297} = 3.32938$

    Next, how to test if slope equals a particular value?

  • 7/24/2019 L11 Regression

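    Inferences About the Slope in Excel: House Price Again

    Test Statistic: $t_{STAT} = 3.329$

    Critical values? With $\alpha = 0.05$ and d.f. $= 10 - 2 = 8$, $\pm t_{\alpha/2} = \pm 2.3060$

    [Figure: t distribution with rejection regions of area $\alpha/2 = .025$ in each tail, below -2.3060 and above 2.3060, and "Do not reject H0" in between; $t_{STAT} = 3.329$ falls in the upper rejection region]

    Decision: Reject H0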
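The critical-value approach can be written out as a few lines of Python. The slope, its standard error, and the critical value 2.3060 are copied from the slides (the critical value is not recomputed here); the variable names are illustrative.

```python
b1 = 0.10977       # estimated slope, from the Excel output
s_b1 = 0.03297     # standard error of the slope, from the Excel output
beta1_h0 = 0.0     # hypothesized slope under H0
t_crit = 2.3060    # two-tail critical value for alpha = 0.05, d.f. = 8 (from the slide)

t_stat = (b1 - beta1_h0) / s_b1
reject_h0 = abs(t_stat) > t_crit

print(f"t_stat = {t_stat:.3f}, reject H0: {reject_h0}")
```

To test whether the slope equals some other particular value, the same statistic applies with `beta1_h0` set to that value.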

  • 7/24/2019 L11 Regression

    Inferences About the Slope in Excel: House Price Again
    The p-value Approach

                   Coefficients    Standard Error    t Stat     P-value
    Intercept      98.24833        58.03348          1.69296    0.12892
    Square Feet    0.10977         0.03297           3.32938    0.01039

    p-value = 0.01039 (P-value column, Square Feet row)

    Decision: Reject H0, since p-value < $\alpha$

  • 7/24/2019 L11 Regression

    F Test for Model Significance

    H0: $\beta_1 = 0$; H1: $\beta_1 \ne 0$

    F Test statistic:

    $F_{STAT} = \dfrac{MSR}{MSE}$

    where $MSR = \dfrac{SSR}{k}$ and $MSE = \dfrac{SSE}{n-k-1}$

    $F_{STAT}$ follows an F distribution
      Numerator d.f. = k
      Denominator d.f. = n-k-1
      k = the number of independent variables in the regression model

  • 7/24/2019 L11 Regression

    F-Test for Model Significance: House Price Again

    ANOVA
                  df    SS            MS            F          Significance F
    Regression    1     18934.9348    18934.9348    11.0848    0.01039
    Residual      8     13665.5652    1708.1957
    Total         9     32600.5000

    $F_{STAT} = \dfrac{MSR}{MSE} = \dfrac{18934.9348}{1708.1957} = 11.0848$

    With 1 and 8 degrees of freedom

    p-value for the F-Test = 0.01039 (Significance F column)

  • 7/24/2019 L11 Regression

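    F Test for Significance

    H0: $\beta_1 = 0$; H1: $\beta_1 \ne 0$
    $\alpha = .05$, df1 = 1, df2 = 8

    Critical Value: $F_{.05} = 5.32$

    Test Statistic: $F_{STAT} = \dfrac{MSR}{MSE} = 11.08$

    [Figure: F distribution with a rejection region of area $\alpha = .05$ to the right of $F_{.05} = 5.32$; "Do not reject H0" to the left, "Reject H0" to the right]

    Decision: Reject H0 at $\alpha = 0.05$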
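A sketch of the F test arithmetic using the ANOVA numbers for the house price example. The critical value 5.32 is taken from the slide rather than recomputed. The last lines compare F with the square of the slope's t statistic; that the two agree when there is a single independent variable is a standard result, stated here as background and not taken from the slides.

```python
ssr = 18934.9348   # regression sum of squares (ANOVA table)
sse = 13665.5652   # error sum of squares (ANOVA table)
n = 10             # observations
k = 1              # number of independent variables
f_crit = 5.32      # F critical value for alpha = .05 with 1 and 8 d.f. (from the slide)

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse
reject_h0 = f_stat > f_crit

t_stat = 3.32938   # t Stat for Square Feet from the Excel output
print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}, reject H0: {reject_h0}")
```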

  • 7/24/2019 L11 Regression

    Interval Estimate for the Slope: House Price Again

    $b_1 \pm t_{\alpha/2}\, S_{b_1}$, with d.f. $= n - 2$

    Excel Printout

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
    Intercept      98.24833        58.03348          1.69296    0.12892    -35.57720    232.07386
    Square Feet    0.10977         0.03297           3.32938    0.01039    0.03374      0.18580

    At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)

    We are 95% confident that the average impact on sales price is between $33.74 and $185.80 per sq foot of size

    Same conclusion as t-test?
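The interval in the Excel printout can be rebuilt from the slope, its standard error, and the critical value 2.3060 used in the t test (alpha = 0.05, d.f. = 8). This is a sketch with illustrative variable names.

```python
b1 = 0.10977      # estimated slope
s_b1 = 0.03297    # standard error of the slope
t_crit = 2.3060   # t critical value for 95% confidence, d.f. = 8 (from the t test slide)

margin = t_crit * s_b1
lower, upper = b1 - margin, b1 + margin

# Prices are in $1000s, so multiply by 1000 for dollars per square foot.
print(f"95% CI for slope: ({lower:.5f}, {upper:.5f})")
print(f"in dollars per sq ft: ({lower * 1000:.2f}, {upper * 1000:.2f})")
print("interval contains 0:", lower <= 0 <= upper)
```

If the interval does not contain 0, a slope of zero is not plausible at this confidence level, which lines up with rejecting H0 in the two-tail t test.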

  • 7/24/2019 L11 Regression

    t Test for a Correlation Coefficient

    Hypotheses
      H0: $\rho = 0$ (no correlation between X and Y)
      H1: $\rho \ne 0$ (correlation exists)

    Test statistic (with n - 2 degrees of freedom)

    $t_{STAT} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$

    where
      $r = +\sqrt{r^2}$ if $b_1 > 0$
      $r = -\sqrt{r^2}$ if $b_1 < 0$

  • 7/24/2019 L11 Regression

    t-test For A Correlation Coefficient

    Is there evidence of a linear relationship between square feet and house price at the .05 level of significance?

    H0: $\rho = 0$ (No correlation)
    H1: $\rho \ne 0$ (correlation exists)
    $\alpha = .05$, df = 10 - 2 = 8

    $t_{STAT} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{0.762 - 0}{\sqrt{\dfrac{1 - 0.762^2}{10 - 2}}} = 3.329$

    Critical Value = ____
    Decision: ____
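The same statistic in Python, using Multiple R from the Excel output as r (positive because the slope b1 is positive). Variable names are illustrative.

```python
import math

r = 0.76211     # Multiple R from the Excel output; sign follows the sign of b1
rho_h0 = 0.0    # hypothesized population correlation under H0
n = 10          # sample size

t_stat = (r - rho_h0) / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t_stat, 3))
```

The statistic comes out the same as the t statistic for the slope, so the critical value and decision follow the slope test.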

  • 7/24/2019 L11 Regression

    Estimating Mean Values and Predicting Individual Values

    Goal: Form intervals around $\hat{Y}$ to express uncertainty about the value of Y for a given $X_i$

    [Figure: prediction line $\hat{Y} = b_0 + b_1 X_i$ on X and Y axes; at a given $X_i$, the confidence interval for the mean of Y, given $X_i$, and the prediction interval for an individual Y, given $X_i$, are drawn around $\hat{Y}$]
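The slides do not give the interval formulas, so the sketch below is an assumption: it uses the usual textbook forms, Y-hat ± t·S_YX·sqrt(h) for the mean of Y and Y-hat ± t·S_YX·sqrt(1 + h) for an individual Y, with h = 1/n + (X_i − X̄)² / SSX. The inputs are the house price values from earlier slides, and X = 2000 is the same example used for the point prediction.

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
b0, b1 = 98.24833, 0.10977   # coefficients from the Excel output
s_yx = 41.33032              # standard error of the estimate
t_crit = 2.3060              # t critical value, 95% confidence, d.f. = 8
x_new = 2000                 # house size to predict for

n = len(square_feet)
x_bar = sum(square_feet) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)

# h grows as x_new moves away from the mean of X (assumed textbook formula)
h = 1 / n + (x_new - x_bar) ** 2 / ssx
y_hat = b0 + b1 * x_new

ci_half = t_crit * s_yx * math.sqrt(h)       # half-width, mean of Y given X
pi_half = t_crit * s_yx * math.sqrt(1 + h)   # half-width, individual Y given X

print(f"point estimate: {y_hat:.2f}")
print(f"CI for mean Y:  {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"PI for indiv Y: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Under these formulas the prediction interval is always wider than the confidence interval, because an individual house varies around the mean in addition to the uncertainty in the mean itself.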

  • 7/24/2019 L11 Regression

    Suggested Procedures for a Regression Analysis

    1. Start with a scatter plot to observe possible relationship
    2. Run regression
    3. Perform residual analysis
       Plot the residuals vs. X to check for assumptions such as linearity & homoscedasticity
       Use a histogram to see if the residuals are normally distributed

  • 7/24/2019 L11 Regression

    Suggested Procedures for a Regression Analysis

    4. If there is violation of any assumption, use alternative methods or models
    5. If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
    6. Avoid making predictions or forecasts outside the relevant range

  • 7/24/2019 L11 Regression

    Review Questions 1-3

    1. In a regression analysis, the error term is a random variable with an expected value of ____
    2. In a regression analysis, if r2 = 1, SST = ____
    3. A regression analysis between sales (in $1000) and advertising (in $100) resulted in the following equation:

       Y-hat = 500 + 4X

       a) How to interpret b1?
       b) Based on the above equation, if advertising is $10,000, the point estimate for sales (in dollars) is ____

  • 7/24/2019 L11 Regression

    Review Question 4

    4. Shown below is a portion of an Excel printout for a regression analysis relating Y (quantity demanded) and X (unit price).

    ANOVA
                  df    SS
    Regression    1     5048.818
    Residual      46    3132.661
    Total         47    8181.479

                 Coefficients    Standard Error
    Intercept    80.390          3.102
    X            -2.137          0.248

    a) Perform a t test to determine if Y and X are related. Let $\alpha = 0.05$
    b) Compute $R^2$ and interpret its meaning

  • 7/24/2019 L11 Regression

    Review Question 5: Restaurant Ratings

    A regression model is developed to relate the price per person and the sum of the ratings for food, decor, and service

    1. Find $r^2$
    2. Calculate and interpret the regression coefficients
    3. At the 0.05 level of significance, is there evidence of a linear relationship between the price per person and the summated rating?
    4. Construct a 95% confidence interval for the slope
    5. Evaluate the usefulness of this model

  • 7/24/2019 L11 Regression


    Appendix

    Figure: Regression Model Assumptions

  • 7/24/2019 L11 Regression

    Figure: Regression Model Assumptions

  • 7/24/2019 L11 Regression

    Least Squares Method (Formulae for b0 and b1)

    $b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

    $b_0 = \bar{y} - b_1 \bar{x}$