TRANSCRIPT
Linear Regression
Hypothesis testing and Estimation
Assume that we have collected data on two variables X and Y. Let
(x1, y1) (x2, y2) (x3, y3) … (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
The Statistical Model
Each yi is assumed to be randomly generated from a normal distribution with
mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$ ($\alpha$, $\beta$ and $\sigma$ are unknown).
[Diagram: the distribution of yi centred at $\alpha + \beta x_i$ on the line $Y = \alpha + \beta X$, slope = $\beta$]
The Data / The Linear Regression Model
• The data falls roughly about a straight line.
[Scatter plot: data points scattered about the line $Y = \alpha + \beta X$]
The Least Squares Line
Fitting the best straight line
to “linear” data
Let Y = a + bX
denote an arbitrary equation of a straight line. Here a and b are known values. This equation can be used to predict, for each value of X, the value of Y.
For example, if X = xi (as for the ith case) then the predicted value of Y is:
$$\hat{y}_i = a + b x_i$$
The residual
$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$
can be computed for each case in the sample:
$$r_1 = y_1 - \hat{y}_1,\quad r_2 = y_2 - \hat{y}_2,\quad \ldots,\quad r_n = y_n - \hat{y}_n$$
The residual sum of squares (RSS) is a measure of the "goodness of fit" of the line Y = a + bX to the data:
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$
The optimal choice of a and b will result in the residual sum of squares
attaining a minimum.
If this is the case, then the line:
Y = a + bX
is called the Least Squares Line
The equation for the least squares line
Let
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Computing Formulae:
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
Then the slope of the least squares line can be shown to be:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
and the intercept of the least squares line can be shown to be:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
The Residual Sum of Squares
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \quad \text{(computing formula)}$$
Estimating σ, the standard deviation in the regression model:
$$s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2} = \sqrt{\frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2}} \quad \text{(computing formula)}$$
This estimate of σ is said to be based on n – 2 degrees of freedom.
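These computing formulas translate directly into code. Below is a minimal sketch in Python (numpy assumed; the helper name fit_least_squares is illustrative, not from the slides):

```python
import numpy as np

def fit_least_squares(x, y):
    """Slope b, intercept a, and residual SD s (n - 2 df), per the formulas above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    Sxx = np.sum(x**2) - np.sum(x)**2 / n           # computing formula for Sxx
    Syy = np.sum(y**2) - np.sum(y)**2 / n           # computing formula for Syy
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b = Sxy / Sxx                                   # slope of the least squares line
    a = y.mean() - b * x.mean()                     # intercept
    rss = Syy - Sxy**2 / Sxx                        # residual sum of squares
    s = np.sqrt(rss / (n - 2))                      # estimate of sigma
    return b, a, s
```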
Sampling distributions of the estimators
The sampling distribution of the slope of the least squares line:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
It can be shown that b has a normal distribution with mean and standard deviation
$$\mu_b = \beta \quad \text{and} \quad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
Thus
$$z = \frac{b - \beta}{\sigma_b} = \frac{b - \beta}{\sigma/\sqrt{S_{xx}}}$$
has a standard normal distribution, and
$$t = \frac{b - \beta}{s_b} = \frac{b - \beta}{s/\sqrt{S_{xx}}}$$
has a t distribution with df = n – 2.
(1 – α)100% Confidence Limits for slope β:
$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
where $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.
Testing the slope
$$H_0\!: \beta = \beta_0 \quad \text{vs} \quad H_A\!: \beta \neq \beta_0$$
The test statistic is:
$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}$$
which has a t distribution with df = n – 2 if H0 is true.
The Critical Region
Reject $H_0\!: \beta = \beta_0$ in favour of $H_A\!: \beta \neq \beta_0$ if
$$t < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}, \qquad \text{df} = n - 2$$
This is a two-tailed test. One-tailed tests are also possible.
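A hedged sketch of the confidence limits and t-test for the slope, reusing the fit_least_squares helper sketched earlier (scipy supplies the t critical value):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, beta0=0.0, alpha=0.05):
    """(1 - alpha)100% CI for beta and the t statistic for H0: beta = beta0."""
    b, a, s = fit_least_squares(x, y)
    x = np.asarray(x, float)
    n = len(x)
    se_b = s / np.sqrt(np.sum((x - x.mean())**2))   # s / sqrt(Sxx)
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (b - tcrit * se_b, b + tcrit * se_b)
    t = (b - beta0) / se_b                          # reject H0 if |t| > tcrit
    return ci, t, tcrit
```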
The sampling distribution of the intercept of the least squares line:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
It can be shown that a has a normal distribution with mean and standard deviation
$$\mu_a = \alpha \quad \text{and} \quad \sigma_a = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Thus
$$z = \frac{a - \alpha}{\sigma_a} = \frac{a - \alpha}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}$$
has a standard normal distribution, and
$$t = \frac{a - \alpha}{s_a} = \frac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}$$
has a t distribution with df = n – 2.
(1 – α)100% Confidence Limits for intercept α:
$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.
Testing the intercept
$$H_0\!: \alpha = \alpha_0 \quad \text{vs} \quad H_A\!: \alpha \neq \alpha_0$$
The test statistic is:
$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}}$$
which has a t distribution with df = n – 2 if H0 is true.
The Critical Region
Reject $H_0\!: \alpha = \alpha_0$ in favour of $H_A\!: \alpha \neq \alpha_0$ if
$$t < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}, \qquad \text{df} = n - 2$$
Example
The following data showed the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950. TABLE : Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.
Country (i)      Xi    Yi
Australia        48    18
Canada           50    15
Denmark          38    17
Finland          110   35
Great Britain    110   46
Holland          49    24
Iceland          23    6
Norway           25    9
Sweden           30    11
Switzerland      51    25
USA              130   20
[Scatter plot: death rates from lung cancer (1950) vs per capita consumption of cigarettes, with the 11 countries labelled]
$$\sum_{i=1}^{n} x_i^2 = 54404, \qquad \sum_{i=1}^{n} x_i y_i = 16914, \qquad \sum_{i=1}^{n} y_i^2 = 6018$$
Fitting the Least Squares Line
$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$
$$S_{xx} = 54404 - \frac{664^2}{11} = 14322.55$$
$$S_{yy} = 6018 - \frac{226^2}{11} = 1374.73$$
$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
Computing estimates of the slope (β), intercept (α) and standard deviation (σ):
$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.2284$$
$$a = \bar{y} - b\bar{x} = \frac{226}{11} - 0.2284 \times \frac{664}{11} = 6.756$$
$$s = \sqrt{\frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2}} = \sqrt{\frac{1374.73 - 3271.82^2/14322.55}{9}} = 8.35$$
95% Confidence Limits for slope β:
$$b \pm t_{.025}\,\frac{s}{\sqrt{S_{xx}}} = 0.2284 \pm 2.262 \times \frac{8.35}{\sqrt{14322.55}}$$
giving 0.0706 to 0.3862, where t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
95% Confidence Limits for intercept α:
$$a \pm t_{.025}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 6.756 \pm 2.262 \times 8.35\sqrt{\frac{1}{11} + \frac{(664/11)^2}{14322.55}}$$
giving –4.34 to 17.85, where t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
[Scatter plot with fitted least squares line: death rates from lung cancer (1950) vs per capita consumption of cigarettes]
Y = 6.756 + (0.228)X
95% Confidence Limits for slope β: 0.0706 to 0.3862
95% Confidence Limits for intercept α: –4.34 to 17.85
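As a check, these results can be reproduced with the helpers sketched earlier; the values should agree with the slides up to rounding:

```python
# Cigarette consumption (X) and lung cancer death rate (Y) for the 11 countries
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

b, a, s = fit_least_squares(x, y)      # b ~ 0.228, a ~ 6.756, s ~ 8.35
ci, t, tcrit = slope_inference(x, y)   # ci ~ (0.0706, 0.3862)
```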
Testing for a positive slope
$$H_0\!: \beta = 0 \quad \text{vs} \quad H_A\!: \beta > 0$$
The test statistic is:
$$t = \frac{b}{s/\sqrt{S_{xx}}}$$
The Critical Region: reject $H_0\!: \beta = 0$ in favour of $H_A\!: \beta > 0$ (at α = 0.05) if
$$t > t_{0.05} = 1.833, \qquad \text{df} = 11 - 2 = 9$$
This is a one-tailed test. Since
$$t = \frac{0.2284}{8.35/\sqrt{14322.55}} = 3.27 > 1.833$$
we reject $H_0\!: \beta = 0$ and conclude $H_A\!: \beta > 0$.
Confidence Limits for Points on the Regression Line
• The intercept is a specific point on the regression line.
• It is the y – coordinate of the point on the regression line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when x = x0.
• In this case the y-coordinate of the point on the regression line when x = x0 is $\alpha + \beta x_0$.
[Diagram: the point $(x_0,\ \alpha + \beta x_0)$ on the line $y = \alpha + \beta x$]
(1 – α)100% Confidence Limits for α + βx0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
Prediction Limits for new values of the Dependent variable y
• An important application of the regression line is prediction.
• Knowing the value of x (x0) what is the value of y?
• The predicted value of y when x = x0 is $\alpha + \beta x_0$.
• This in turn can be estimated by:
$$\hat{y}(x_0) = a + b x_0$$
The predictor $\hat{y}(x_0) = a + b x_0$
• gives only a single value for y.
• A more appropriate piece of information would be a range of values.
• A range of values that has a fixed probability of capturing the value for y.
• A (1 – α)100% prediction interval for y.
(1 – α)100% Prediction Limits for y when x = x0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
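The two interval formulas differ only by the extra 1 under the square root. A minimal sketch of both, again reusing the fit_least_squares helper:

```python
import numpy as np
from scipy import stats

def intervals_at(x, y, x0, alpha=0.05):
    """Confidence limits for alpha + beta*x0 and prediction limits for a new y."""
    b, a, s = fit_least_squares(x, y)
    x = np.asarray(x, float)
    n, xbar = len(x), x.mean()
    Sxx = np.sum((x - xbar)**2)
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    fit = a + b * x0
    half_conf = tcrit * s * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)
    half_pred = tcrit * s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)  # extra 1 for a new y
    return (fit - half_conf, fit + half_conf), (fit - half_pred, fit + half_pred)
```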
Example
In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance between the building that sent out the alarm and the closest fire hall
and
2. Y = cost of the damage (1000$)
The data was collected on n = 15 fires.
The Data
Fire   Distance   Damage
1      3.4        26.2
2      1.8        17.8
3      4.6        31.3
4      2.3        23.1
5      3.1        27.5
6      5.5        36.0
7      0.7        14.1
8      3.0        22.3
9      2.6        19.6
10     4.3        31.3
11     2.1        24.0
12     1.1        17.3
13     6.1        43.2
14     4.8        36.4
15     3.8        26.1
[Scatter Plot: Damage (1000$) vs Distance (miles)]
Computations
$$\sum_{i=1}^{n} x_i = 49.2, \quad \sum_{i=1}^{n} y_i = 396.2, \quad \sum_{i=1}^{n} x_i^2 = 196.16, \quad \sum_{i=1}^{n} y_i^2 = 11376.5, \quad \sum_{i=1}^{n} x_i y_i = 1470.65$$
Computations Continued
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{49.2}{15} = 3.28, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{396.2}{15} = 26.4133$$
Computations Continued
$$S_{xx} = 196.16 - \frac{49.2^2}{15} = 34.784$$
$$S_{yy} = 11376.5 - \frac{396.2^2}{15} = 911.517$$
$$S_{xy} = 1470.65 - \frac{(49.2)(396.2)}{15} = 171.114$$
Computations Continued
$$b = \frac{S_{xy}}{S_{xx}} = \frac{171.114}{34.784} = 4.92$$
$$a = \bar{y} - b\bar{x} = 26.4133 - 4.92 \times 3.28 = 10.28$$
$$s = \sqrt{\frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2}} = \sqrt{\frac{911.517 - 171.114^2/34.784}{13}} = 2.316$$
95% Confidence Limits for slope β:
$$b \pm t_{.025}\,\frac{s}{\sqrt{S_{xx}}} = 4.92 \pm 2.160 \times \frac{2.316}{\sqrt{34.784}}$$
giving 4.07 to 5.77, where t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.
95% Confidence Limits for intercept α:
$$a \pm t_{.025}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 10.28 \pm 2.160 \times 2.316\sqrt{\frac{1}{15} + \frac{3.28^2}{34.784}}$$
giving 7.21 to 13.35, where t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.
[Plot: Damage (1000$) vs Distance (miles) with the least squares line y = 4.92x + 10.28]
(1 – α)100% Confidence Limits for α + βx0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
95% Confidence Limits for α + βx0:
x0    lower    upper
1     12.87    17.52
2     18.43    21.80
3     23.72    26.35
4     28.53    31.38
5     32.93    36.82
6     37.15    42.44
[Plot: 95% confidence limits for α + βx0, Damage (1000$) vs Distance (miles)]
(1 – α)100% Prediction Limits for y when x = x0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
95% Prediction Limits for y when x = x0:
x0    lower    upper
1     9.68     20.71
2     14.84    25.40
3     19.86    30.21
4     24.75    35.16
5     29.51    40.24
6     34.13    45.45
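Both tables can be regenerated with the intervals_at helper sketched earlier (values should agree up to rounding):

```python
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

for x0 in range(1, 7):
    conf, pred = intervals_at(x, y, x0)
    print(x0, conf, pred)   # e.g. x0 = 1: conf ~ (12.87, 17.52), pred ~ (9.68, 20.71)
```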
[Plot: 95% prediction limits for y when x = x0, Damage (1000$) vs Distance (miles)]
Linear Regression Summary
Hypothesis testing and Estimation
(1 – α)100% Confidence Limits for slope β:
$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
where $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.
Testing the slope
The test statistic is:
$$H_0\!: \beta = \beta_0 \quad \text{vs} \quad H_A\!: \beta \neq \beta_0, \qquad t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}$$
which has a t distribution with df = n – 2 if H0 is true.
(1 – α)100% Confidence Limits for intercept α:
$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.
Testing the intercept
The test statistic is:
$$H_0\!: \alpha = \alpha_0 \quad \text{vs} \quad H_A\!: \alpha \neq \alpha_0, \qquad t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}}$$
which has a t distribution with df = n – 2 if H0 is true.
(1 – α)100% Confidence Limits for α + βx0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
(1 – α)100% Prediction Limits for y when x = x0:
$$(a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
Correlation
Definition. The statistic
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
is called Pearson's correlation coefficient.
Properties
1. –1 ≤ r ≤ 1; that is, |r| ≤ 1 and r² ≤ 1.
2. |r| = 1 (r = +1 or –1) if the points (x1, y1), (x2, y2), …, (xn, yn) lie along a straight line (positive slope for +1, negative slope for –1).
The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
The test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$
The Critical Region: reject H0 if $|t| > t_{\alpha/2}$ (df = n – 2).
This is a two-tailed critical region; the critical region could also be one-tailed.
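A small sketch of r and this test in Python (the function name correlation_test is illustrative):

```python
import numpy as np
from scipy import stats

def correlation_test(x, y, alpha=0.05):
    """Pearson's r and the t statistic for H0: X and Y are independent."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    Sxx = np.sum((x - x.mean())**2)
    Syy = np.sum((y - y.mean())**2)
    r = Sxy / np.sqrt(Sxx * Syy)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return r, t, abs(t) > tcrit    # True means reject H0
```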
Example
We return to the fire damage data analyzed above: X = the distance between the building and the closest fire hall (miles) and Y = the cost of the damage (1000$), for n = 15 fires. From the earlier computations,
$$S_{xx} = 34.784, \qquad S_{yy} = 911.517, \qquad S_{xy} = 171.114$$
The correlation coefficient
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{171.114}{\sqrt{34.784 \times 911.517}} = 0.961$$
The test for independence (zero correlation)
The test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{0.961\sqrt{13}}{\sqrt{1 - 0.961^2}} = 12.525$$
We reject H0 (independence) if |t| > t0.025 = 2.160 (df = 13). Since 12.525 > 2.160, H0: independence is rejected.
Relationship between Regression and Correlation
Recall
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
Also
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}\cdot\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}\, r = \frac{s_y}{s_x}\, r$$
since
$$s_x = \sqrt{\frac{S_{xx}}{n-1}} \quad \text{and} \quad s_y = \sqrt{\frac{S_{yy}}{n-1}}$$
Thus the slope of the least squares line is simply the ratio of the standard deviations (sy/sx) multiplied by the correlation coefficient.
The test for independence (zero correlation) uses the test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$
with H0: X and Y are independent vs HA: X and Y are correlated.
Note:
$$b = \sqrt{\frac{S_{yy}}{S_{xx}}}\, r \quad \text{and} \quad r = \sqrt{\frac{S_{xx}}{S_{yy}}}\, b$$
The two tests
1. The test for independence (zero correlation): H0: X and Y are independent vs HA: X and Y are correlated
2. The test for zero slope: H0: β = 0 vs HA: β ≠ 0
are equivalent.
1. The test statistic for independence:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{\dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}\sqrt{n-2}}{\sqrt{1 - \dfrac{S_{xy}^2}{S_{xx}\,S_{yy}}}} = \frac{S_{xy}\sqrt{n-2}}{\sqrt{S_{xx}\,S_{yy} - S_{xy}^2}}$$
2. Thus, since $b = S_{xy}/S_{xx}$ and $s = \sqrt{(S_{yy} - S_{xy}^2/S_{xx})/(n-2)}$,
$$t = \frac{b}{s/\sqrt{S_{xx}}}$$
the same statistic used for testing for zero slope.
Regression (in general)
In many experiments we would have collected data on a single variable Y (the dependent variable) and on p (say) other variables X1, X2, X3, ..., Xp (the independent variables). One is interested in determining a model that describes the relationship between Y (the response (dependent) variable) and X1, X2, …, Xp (the predictor (independent) variables).
This model can be used for
– Prediction
– Controlling Y by manipulating X1, X2, …, Xp
The Model is an equation of the form
$$Y = f(X_1, X_2, \ldots, X_p \mid \theta_1, \theta_2, \ldots, \theta_q) + \varepsilon$$
where θ1, θ2, …, θq are unknown parameters of the function f and ε is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation σ).
Examples:
1. Y = Blood Pressure, X = age
The model
$$Y = \alpha + \beta X + \varepsilon, \quad \text{thus} \quad \theta_1 = \alpha \ \text{and} \ \theta_2 = \beta.$$
This model is called the simple Linear Regression Model:
$$Y = \alpha + \beta X$$
[Plot: average of five best times for running the 100m vs year, 1930–2010]
2. Y = average of five best times for running the 100m, X = the year
The model
$$Y = \alpha e^{-\beta X} + \varepsilon, \quad \text{thus} \quad \theta_1 = \alpha \ \text{and} \ \theta_2 = \beta.$$
This model is called the exponential Regression Model:
$$Y = \alpha e^{-\beta X} + \varepsilon$$
3. Y = gas mileage (mpg) of a car brand
X1 = engine size
X2 = horsepower
X3 = weight
The model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
This model is called:
the Multiple Linear Regression Model
The Multiple Linear Regression Model
In Multiple Linear Regression we assume the following model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
This model is called the Multiple Linear Regression Model.
Here β0, β1, β2, …, βp are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
The importance of the Linear model
1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. – When fitting models to data one tries to find the
simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables.
– The linear model is sometimes the first model to be fitted and only abandoned if it turns out to be inadequate.
2. In many instance a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables.
– This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.
3. Many non-Linear models can be Linearized (put into the form of a Linear model by appropriately transforming the dependent variable and/or any or all of the independent variables). – This fact, that many non-linear models are linearizable, ensures the wide utility of the Linear model.
An Example
The following data comes from an experiment that was interested in investigating the source from which corn plants in various soils obtain their phosphorous.
– The concentration of inorganic phosphorous (X1) and the concentration of organic phosphorous (X2) were measured in the soil of n = 18 test plots.
– In addition, the phosphorous content (Y) of corn grown in the soil was also measured. The data is displayed below:
Inorganic       Organic         Plant Available
Phosphorous X1  Phosphorous X2  Phosphorous Y
0.4             53              64
0.4             23              60
3.1             19              71
0.6             34              61
4.7             24              54
1.7             65              77
9.4             44              81
10.1            31              93
11.6            29              93
12.6            58              51
10.9            37              76
23.1            46              96
23.1            50              77
21.6            44              93
23.1            56              95
1.9             36              54
26.8            58              168
29.9            51              99
Coefficients
Intercept   56.2510241   (β0)
X1          1.78977412   (β1)
X2          0.08664925   (β2)
Equation:
Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
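These coefficients can be checked with numpy's least squares; a sketch (the estimates should match the output above up to rounding):

```python
import numpy as np

x1 = [0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
      12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9]
x2 = [53, 23, 19, 34, 24, 65, 44, 31, 29,
      58, 37, 46, 50, 44, 56, 36, 58, 51]
y  = [64, 60, 71, 61, 54, 77, 81, 93, 93,
      51, 76, 96, 77, 93, 95, 54, 168, 99]

X = np.column_stack([np.ones(len(x1)), x1, x2])   # design matrix with constant
beta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
print(beta)   # ~ [56.251, 1.790, 0.087]
```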
The Multiple Linear Regression Model
In Multiple Linear Regression we assume the following model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
This model is called the Multiple Linear Regression Model.
Here β0, β1, β2, …, βp are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
Summary of the Statistics used in
Multiple Regression
The Least Squares Estimates:
$$\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$$
are the values that minimize
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi})\right)^2$$
The Analysis of Variance Table Entries
a) Adjusted Total Sum of Squares (SSTotal)
$$SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \text{d.f.} = n - 1$$
b) Residual Sum of Squares (SSError)
$$RSS = SS_{Error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{d.f.} = n - p - 1$$
c) Regression Sum of Squares (SSReg)
$$SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad \text{d.f.} = p$$
Note:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
i.e. SSTotal = SSReg + SSError
The Analysis of Variance Table
Source       Sum of Squares   d.f.    Mean Square                      F
Regression   SSReg            p       SSReg/p = MSReg                  MSReg/s²
Error        SSError          n-p-1   SSError/(n-p-1) = MSError = s²
Total        SSTotal          n-1
Uses:
1. To estimate σ² (the error variance): use s² = MSError to estimate σ².
2. To test the hypothesis H0: β1 = β2 = … = βp = 0. Use the test statistic
$$F = \frac{MS_{Reg}}{MS_{Error}} = \frac{MS_{Reg}}{s^2} = \frac{SS_{Reg}/p}{SS_{Error}/(n-p-1)}$$
Reject H0 if $F > F_{\alpha}(p, n-p-1)$.
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).
a) R² = the coefficient of determination = SSReg/SSTotal
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= the proportion of variance in Y explained by X1, X2, ..., Xp.
1 – R² = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp = SSError/SSTotal.
b) Ra² = "R² adjusted" for degrees of freedom
= 1 – [the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.]
$$R_a^2 = 1 - \frac{MS_{Error}}{MS_{Total}} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)} = 1 - \frac{(n-1)\,SS_{Error}}{(n-p-1)\,SS_{Total}} = 1 - \frac{(n-1)(1 - R^2)}{n-p-1}$$
c) R = √R² = the Multiple correlation coefficient of Y with X1, X2, ..., Xp
$$R = \sqrt{\frac{SS_{Reg}}{SS_{Total}}}$$
= the maximum correlation between Y and a linear combination of X1, X2, ..., Xp.
Comment: The statistics F, R², Ra² and R are equivalent statistics.
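Once the least squares fit is computed, every entry of the ANOVA table, along with F, R², and Ra², follows from the three sums of squares. A minimal sketch (anova_summary is an illustrative name):

```python
import numpy as np
from scipy import stats

def anova_summary(X, y):
    """X: n x (p+1) design matrix including the constant column; y: responses."""
    y = np.asarray(y, float)
    n, p = X.shape[0], X.shape[1] - 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    ss_total = np.sum((y - y.mean())**2)        # df = n - 1
    ss_error = np.sum((y - yhat)**2)            # df = n - p - 1
    ss_reg = ss_total - ss_error                # df = p
    s2 = ss_error / (n - p - 1)                 # MSError, estimates sigma^2
    F = (ss_reg / p) / s2                       # MSReg / MSError
    R2 = ss_reg / ss_total
    R2_adj = 1 - (n - 1) / (n - p - 1) * (1 - R2)
    sig = 1 - stats.f.cdf(F, p, n - p - 1)      # p-value of the F test
    return F, R2, R2_adj, sig
```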
Using Statistical Packages
To perform Multiple Regression
Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS
After starting the SPSS program the following dialogue box appears:
If you select Opening an existing file and press OK the following dialogue box appears
The following dialogue box appears:
If the variable names are in the file ask it to read the names. If you do not specify the Range the program will identify the Range:
Once you “click OK”, two windows will appear
One that will contain the output:
The other containing the data:
To perform any statistical Analysis select the Analyze menu:
Then select Regression and Linear.
The following Regression dialogue box appears
Select the Dependent variable Y.
Select the Independent variables X1, X2, etc.
If you select the Method “Enter”, all variables will be put into the equation.
There are also several other methods that can be used:
1. Forward selection
2. Backward Elimination
3. Stepwise Regression
Forward selection
1. This method starts with no variables in the equation
2. Carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. Adds the most significant.
4. Continues until all variables not in the equation have no significant effect on the dependent variable.
Backward Elimination
1. This method starts with all variables in the equation
2. Carries out statistical tests on variables in the equation to see which have no significant effect on the dependent variable.
3. Deletes the least significant.
4. Continues until all variables in the equation have a significant effect on the dependent variable.
Stepwise Regression (uses both forward and backward techniques)
1. This method starts with no variables in the equation
2. Carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any variables added earlier can now be deleted.
5. Continues until all variables not in the equation have no significant effect on the dependent variable.
All of these methods are procedures for attempting to find the best equation
The best equation is the equation that is the simplest (not containing variables that are not important) yet adequate (containing variables that are important)
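For illustration, here is a rough sketch of forward selection built on statsmodels OLS; using p-values with a 0.05 entry threshold is an assumption made for this sketch, not something the slides prescribe:

```python
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    """X: pandas DataFrame of candidate predictors; y: response variable."""
    selected = []
    while True:
        remaining = [c for c in X.columns if c not in selected]
        if not remaining:
            break
        # p-value of each candidate when added to the current equation
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:    # nothing left has a significant effect
            break
        selected.append(best)           # add the most significant variable
    return selected
```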
Once the dependent variable, the independent variables and the Method have been selected if you press OK, the Analysis will be performed.
The output will contain the following table
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .822(a)  .676       .673                4.46
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
R² and adjusted R² measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3% here).
R is the Multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc)
The next table is the Analysis of Variance Table
The F test is testing whether the regression coefficients of the predictor variables are all zero, i.e. whether none of the independent variables X1, X2, X3, etc. has any effect on Y.
ANOVA(b)
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   16098.158        3     5366.053      269.664   .000(a)
Residual     7720.836         388   19.899
Total        23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
b. Dependent Variable: MPG
The final table in the output
This gives the estimates of the regression coefficients, their standard errors and the t tests for testing whether they are zero. Note: Engine size has no significant effect on Mileage.
Coefficients(a)
Model 1      B           Std. Error   Beta    t        Sig.
(Constant)   44.015      1.272                34.597   .000
ENGINE       -5.53E-03   .007         -.074   -.786    .432
HORSE        -5.56E-02   .013         -.273   -4.153   .000
WEIGHT       -4.62E-03   .001         -.504   -6.186   .000
a. Dependent Variable: MPG
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficients)
The estimated equation from the table above is:
$$\text{Mileage} = 44.0 - \frac{5.53}{1000}\,\text{Engine} - \frac{5.56}{100}\,\text{Horse} - \frac{4.62}{1000}\,\text{Weight} + \text{Error}$$
The equation shows that Mileage decreases:
1. with increases in Engine Size (not significant, p = 0.432)
2. with increases in Horsepower (significant, p = 0.000)
3. with increases in Weight (significant, p = 0.000)
Logistic regression
Recall the simple linear regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where we are trying to predict a continuous dependent variable y from a continuous independent variable x.
This model can be extended to the Multiple linear regression model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
Here we are trying to predict a continuous dependent variable y from several continuous independent variables x1, x2, …, xp.
Now suppose the dependent variable y is binary.
It takes on two values “Success” (1) or “Failure” (0)
This is the situation in which Logistic Regression is used
We are interested in predicting y from a continuous independent variable x.
Example
We are interested in how the success (y) of a new antibiotic cream at curing “acne problems” depends on the amount (x) that is applied daily.
The values of y are 1 (Success) or 0 (Failure).
The values of x range over a continuum
The logistic Regression Model
Let p denote P[y = 1] = P[Success].
This quantity will increase with the value of x.
The ratio
$$\frac{p}{1-p}$$
is called the odds ratio.
This quantity will also increase with the value of x, ranging from zero to infinity.
The quantity
$$\ln\frac{p}{1-p}$$
is called the log odds ratio.
Example: odds ratio, log odds ratio
Suppose a die is rolled: Success = “roll a six”, p = 1/6.
The odds ratio:
$$\frac{p}{1-p} = \frac{1/6}{5/6} = \frac{1}{5}$$
The log odds ratio:
$$\ln\frac{p}{1-p} = \ln\frac{1}{5} = \ln(0.2) = -1.6094$$
The logistic Regression Model
The model assumes the log odds ratio is linearly related to x:
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x$$
i.e., in terms of the odds ratio:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
The logistic Regression Model
Solving for p in terms of x:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
$$p = e^{\beta_0 + \beta_1 x}(1 - p)$$
$$p + p\,e^{\beta_0 + \beta_1 x} = e^{\beta_0 + \beta_1 x}$$
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
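The resulting curve is simple to evaluate; a one-function sketch:

```python
import numpy as np

def logistic_p(x, b0, b1):
    """p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * np.asarray(x, float)
    return np.exp(z) / (1 + np.exp(z))
```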
[Graph: p vs x]
Interpretation of the parameter β0 (determines the intercept): when x = 0,
$$p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$$
[Graph: p vs x]
Interpretation of the parameter β1 (determines, along with β0, when p is 0.50):
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{2}$$
when
$$\beta_0 + \beta_1 x = 0, \quad \text{i.e.} \quad x = -\frac{\beta_0}{\beta_1}$$
Also
$$\frac{dp}{dx} = \frac{d}{dx}\left(\frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\right) = \beta_1\,\frac{e^{\beta_0 + \beta_1 x}\left(1 + e^{\beta_0 + \beta_1 x}\right) - e^{\beta_0 + \beta_1 x}\,e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \beta_1\,\frac{e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2}$$
When $x = -\beta_0/\beta_1$ (so that p = 1/2), $e^{\beta_0 + \beta_1 x} = 1$ and
$$\frac{dp}{dx} = \beta_1 \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{\beta_1}{4}$$
Thus β1/4 is the rate of increase in p with respect to x when p = 0.50.
[Graph: p vs x]
Interpretation of the parameter β1 (determines the slope when p is 0.50): slope = β1/4.
The data
The data for each case consist of
1. a value for x, the continuous independent variable
2. a value for y (1 or 0) (Success or Failure)
Total of n = 250 cases
case   x     y
1      0.8   0
2      2.3   1
3      2.5   0
4      2.8   1
5      3.5   1
6      4.4   1
7      0.5   0
8      4.5   1
9      4.4   1
10     0.9   0
11     3.3   1
12     1.1   0
13     2.5   1
14     0.3   1
15     4.5   1
16     1.8   0
17     2.4   1
18     1.6   0
19     1.9   1
20     4.6   1
⋮
230    4.7   1
231    0.3   0
232    1.4   0
233    4.5   1
234    1.4   1
235    4.5   1
236    3.9   0
237    0.0   0
238    4.3   1
239    1.0   0
240    3.9   1
241    1.1   0
242    3.4   1
243    0.6   0
244    1.6   0
245    3.9   0
246    0.2   0
247    2.5   0
248    4.1   1
249    4.2   1
250    4.9   1
Estimation of the parameters
The parameters are estimated by Maximum Likelihood estimation and require a statistical package such as SPSS
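For readers without SPSS, a hedged sketch of the same maximum likelihood fit using statsmodels; the ten (x, y) pairs below are just the first rows of the data above, standing in for all 250 cases:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([0.8, 2.3, 2.5, 2.8, 3.5, 4.4, 0.5, 4.5, 4.4, 0.9])
y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0])

fit = sm.Logit(y, sm.add_constant(x)).fit()   # maximum likelihood estimation
print(fit.params)   # estimates of beta0 and beta1
print(fit.bse)      # their standard errors
```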
Using SPSS to perform Logistic regression
Open the data file:
Choose from the menu:
Analyze -> Regression -> Binary Logistic
The following dialogue box appears
Select the dependent variable (y) and the independent variable (x) (covariate).
Press OK.
Here is the output
The Estimates and their S.E.
The parameter estimates:
            Estimate   S.E.
X           1.0309     0.1334
Constant    -2.0475    0.332
i.e. β̂1 = 1.0309 and β̂0 = -2.0475.
Interpretation of the parameter β0 (determines the intercept):
$$\text{intercept} = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-2.0475}}{1 + e^{-2.0475}} = 0.1143$$
Interpretation of the parameter β1 (determines, along with β0, when p is 0.50):
$$x = -\frac{\beta_0}{\beta_1} = \frac{2.0475}{1.0309} = 1.986$$
Another interpretation of the parameter β1: β1/4 is the rate of increase in p with respect to x when p = 0.50,
$$\frac{\beta_1}{4} = \frac{1.0309}{4} = 0.258$$
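All three interpretations follow directly from the estimates; a small sketch recomputing them:

```python
import numpy as np

b0, b1 = -2.0475, 1.0309
intercept_p   = np.exp(b0) / (1 + np.exp(b0))   # ~ 0.1143, p at x = 0
x_at_half     = -b0 / b1                        # ~ 1.986, x where p = 0.50
slope_at_half = b1 / 4                          # ~ 0.258, dp/dx at p = 0.50
```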
The Logistic Regression Model
The dependent variable y is binary: it takes on two values, “Success” (1) or “Failure” (0). We are interested in predicting y from a continuous independent variable x.
Let p denote P[y = 1] = P[Success]. This quantity increases with the value of x. The ratio p/(1 - p) is called the odds ratio; it also increases with the value of x, ranging from zero to infinity. The quantity ln(p/(1 - p)) is called the log odds ratio. The model assumes the log odds ratio is linearly related to x:
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x, \qquad \frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
In terms of p:
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
[The graph of p vs x]
The Multiple Logistic Regression model
Here we attempt to predict the outcome of a binary response variable Y from several independent variables X1, X2 , … etc
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
or
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p} \quad \text{and} \quad p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$
Multiple Logistic Regression: an example
In this example we are interested in determining the risk of infants (who were born prematurely) of developing BPD (bronchopulmonary dysplasia)
More specifically we are interested in developing a predictive model which will determine the probability of developing BPD from
X1 = gestational Age and X2 = Birthweight
For n = 223 infants in a prenatal ward the following measurements were determined:
1. X1 = gestational age (weeks),
2. X2 = birth weight (grams) and
3. Y = presence of BPD
The data (the first 26 of the n = 223 cases):
case   Gestational Age   Birthweight   presence of BPD
1      28.6              1119          1
2      31.5              1222          0
3      30.3              1311          1
4      28.9              1082          0
5      30.3              1269          0
6      30.5              1289          0
7      28.5              1147          0
8      27.9              1136          1
9      30                972           0
10     31                1252          0
11     27.4              818           0
12     29.4              1275          0
13     30.8              1231          0
14     30.4              1112          0
15     31.1              1353          1
16     26.7              1067          1
17     27.4              846           1
18     28                1013          0
19     29.3              1055          0
20     30.4              1226          0
21     30.2              1237          0
22     30.2              1287          0
23     30.1              1215          0
24     27                929           1
25     30.3              1159          0
26     27.4              1046          1
The results
Variables in the Equation (Step 1a)
                 B        S.E.    Wald     df   Sig.   Exp(B)
Birthweight      -.003    .001    4.885    1    .027   .998
GestationalAge   -.505    .133    14.458   1    .000   .604
Constant         16.858   3.642   21.422   1    .000   2.1E+07
a. Variable(s) entered on step 1: Birthweight, GestationalAge.
$$\ln\frac{p}{1-p} = 16.858 - 0.003\,BW - 0.505\,GA$$
$$\frac{p}{1-p} = e^{16.858 - 0.003\,BW - 0.505\,GA}$$
$$p = \frac{e^{16.858 - 0.003\,BW - 0.505\,GA}}{1 + e^{16.858 - 0.003\,BW - 0.505\,GA}}$$
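A small sketch evaluating the fitted risk equation; the example birth weight and gestational age are hypothetical inputs:

```python
import numpy as np

def bpd_risk(bw, ga):
    """Estimated P[BPD] for birth weight bw (grams) and gestational age ga (weeks)."""
    z = 16.858 - 0.003 * bw - 0.505 * ga
    return np.exp(z) / (1 + np.exp(z))

print(bpd_risk(1100, 28))   # risk for a hypothetical infant
```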
[Graph: risk of BPD vs birth weight (700 to 1700 g), with one curve for each of GA = 27, 28, 29, 30, 31, 32 weeks]
Non-Parametric Statistics