

Page 1: 22s:152 Applied Linear Regression Well-known …homepage.divms.uiowa.edu/~rdecook/stat3200/notes/ch2_4pp.pdfAn example of inappropriate removal of outliers ... King, B. (1998) Critique

22s:152 Applied Linear Regression

Chapter 2: Regression Analysis

————————————————————

Regression analysis

• a class of statistical methods for

– studying relationships between variables that can be measured

e.g. predicting blood pressure from age

– using known values of certain variables to predict the values of other variables for the same subjects

e.g. given a person’s age, cholesterol, and weight, predict blood pressure

Well-known Example:

Space Shuttle Challenger

On January 27, 1986, the night before a planned launch, a 3-hour discussion took place.

The discussion was about the forecasted low temperature for the next day of 31° F, and the effect of low temperature on O-ring performance. (O-rings seal joints.)

In their discussion they used the following plot, showing the relationship between the number of O-rings having some thermal distress and the temperature, to decide whether the shuttle should take off as planned.

[Figure: scatterplot of number of incidents versus temperature (x-axis from 50 to 85)]

The final decision was to launch the shuttle as planned.

- 7 astronauts were killed

- combustion gas leak through an O-ring was the cause of the accident

After the tragedy, a commission noted that a mistake in the analysis of the data was that the flights with zero incidents were left off because it was felt that these flights did not contribute any information about the temperature effect.

[Figure: scatterplot of number of incidents versus temperature (x-axis from 50 to 85), including the flights with zero incidents]

What may have helped in the decision making process?

- use of all the data (rather than using data conditional on the occurrence of an incident)

- quantification of the relationship between temperature and O-ring failure (perhaps as a conditional probability)

- prediction of the probability of O-ring failure at 31° F (logistic regression; Dalal et al. used this approach in their 1989 article)

Dalal, S.R., Fowlkes, E.B. and Hoadley, B. (1989). Risk analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. Journal of the American Statistical Association, v.84, 945-957.
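The idea of predicting a failure probability from temperature with logistic regression can be sketched as follows. This is a minimal illustration, not the analysis of Dalal et al.: the data are SIMULATED (they are not the real flight records), the "true" coefficients are hypothetical values chosen for the simulation, and the helper names `fit_logistic` and `predict` are invented for this example. The fit uses Newton-Raphson on the two-parameter log-likelihood.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood fit of P(y=1 | x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    xbar = sum(x) / len(x)
    xc = [xi - xbar for xi in x]  # centre x for numerical stability
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(xc, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w              # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:          # information matrix is singular: stop updating
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0 - b1 * xbar, b1     # intercept on the original (uncentred) scale


def predict(b0, b1, x):
    """Estimated probability of an incident at temperature x."""
    return sigmoid(b0 + b1 * x)


# SIMULATED data (hypothetical, for illustration only): 1 = at least one incident.
rng = random.Random(0)
true_b0, true_b1 = 5.0, -0.1      # assumed values used only to generate data
temps = [rng.uniform(50, 85) for _ in range(500)]
incident = [1 if rng.random() < sigmoid(true_b0 + true_b1 * t) else 0 for t in temps]

b0_hat, b1_hat = fit_logistic(temps, incident)
p31 = predict(b0_hat, b1_hat, 31.0)
p75 = predict(b0_hat, b1_hat, 75.0)
print(f"slope = {b1_hat:.3f}, P(incident | 31 F) = {p31:.2f}, P(incident | 75 F) = {p75:.2f}")
```

Note that the fit uses every simulated flight, including those with no incident. Also, 31° F lies well below the 50 to 85 range on the plot's temperature axis, so any such prediction is an extrapolation.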


‘Investing it: duffers need not apply’

New York Times, May 31, 1998

An example of inappropriate removal of outliers

- An investment compensation expert carried out a study purporting to show that the major companies whose C.E.O.’s had low golf scores had high performing stocks.

- The expert obtained data for golf scores from the journal Golf Digest and used his own data on the stock market performance of the companies of 51 chief executives.

- He created a Stock Rating which gave each company a stock rating based on how investors who held their stock did, with 100 being highest and 0 lowest.

[Figure: three scatterplots of stock rating (y-axis, 0 to 100) versus handicap (x-axis, 5 to 35)]

Panel 1, all data points: corr = −0.04

Panel 2, points considered outliers: 'outliers' marked with X

Panel 3, data in final analysis: 'outliers' removed, corr = −0.41

King, B. (1998) Critique of ‘Investing it: duffers need not apply.’ Chance News 7.06.
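How much a correlation can move when a few points are dropped can be sketched with a small calculation. The numbers below are HYPOTHETICAL toy values invented for this illustration; they are not the Golf Digest or stock rating data, and the function name `pearson` is just a helper for this sketch.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL (handicap, rating) pairs, made up for illustration only.
handicap = [5, 10, 15, 20, 25, 8, 28]
rating = [82, 68, 61, 49, 41, 20, 90]

# Suppose the last two points are declared 'outliers' and dropped.
kept_handicap = handicap[:5]
kept_rating = rating[:5]

r_all = pearson(handicap, rating)
r_kept = pearson(kept_handicap, kept_rating)
print(f"all points: r = {r_all:.2f}; after removal: r = {r_kept:.2f}")
```

In this toy set the full data show almost no linear association, while the reduced data show a strong negative one, so the conclusion depends entirely on which points were removed.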


Ch.2 Regression analysis...

(as stated in book p. 16)

examines the relationship between a quantitative dependent variable $Y$ and one or more quantitative independent variables, $X_1, \ldots, X_k$. (The author reserves the term regression for quantitative variables.)

Regression analysis traces the conditional distribution of $Y$, or some aspect of the distribution such as its mean, as a function of the $X$’s

Examples:

- General relationship between $X$ and $Y$ (where $\varepsilon$ represents a random error):

$Y = f(X) + \varepsilon$

Here $f(X)$ may be a linear or non-linear relationship.

Linear Models (linear in the parameters)

- Simple linear relationship:
Model the conditional mean response of a continuous variable using a linear relationship to a single continuous variable, assuming normal errors

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

Given $X$, $Y$ has a normal distribution with a mean (center) of $\beta_0 + \beta_1 X$ and a variance of $\sigma^2$.

Also written as: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$
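The simple linear model can be made concrete with a short simulation: draw Y from a normal distribution centered at β0 + β1X, then recover the parameters by least squares. This is a sketch under stated assumptions: the parameter values, the sample size, and the helper name `fit_simple_ols` are hypothetical choices for illustration, and the data are simulated.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares estimates (b0, b1) and the residual variance estimate s2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss / (n - 2)  # n - 2: two mean parameters were estimated


# Simulate Y | X ~ N(beta0 + beta1 * X, sigma^2) with assumed parameter values.
rng = random.Random(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = [rng.uniform(0, 10) for _ in range(200)]
y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]

b0_hat, b1_hat, s2_hat = fit_simple_ols(x, y)
print(f"b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f}, s2 = {s2_hat:.2f}")
```

The estimates should land near the values used to generate the data, with `s2_hat` estimating the common variance of the conditional distributions.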

Sketch of plot showing normal conditional distributions:


- Quadratic relationship:
Model the conditional mean response of a continuous variable as a quadratic relationship to a single continuous variable (this is still a linear model as it’s linear in the parameters)

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Multiple linear relationships:
Model the conditional mean response of a continuous variable as a linear relationship with each of two continuous variables (no interaction)

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

Mean response surface shown on next page...


Mean response surface (errors not shown):

[Figure: three-dimensional plot of the mean response surface; axis labels as extracted: x1, y, Z]

This surface is a plane in space.
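One way to see why both the quadratic and the two-predictor models above count as linear models is to write them in matrix form. The observation index $i = 1, \ldots, n$ and the lower-case $x$ notation are introduced here for illustration and do not appear in the original slides.

```latex
\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^\top,
\qquad
\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})
\]
\[
\text{Quadratic: }
\mathbf{X} =
\begin{pmatrix}
1 & x_1 & x_1^2 \\
\vdots & \vdots & \vdots \\
1 & x_n & x_n^2
\end{pmatrix}
\qquad
\text{Two predictors: }
\mathbf{X} =
\begin{pmatrix}
1 & x_{11} & x_{12} \\
\vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2}
\end{pmatrix}
\]
```

In both cases the mean of $\mathbf{Y}$ is a linear combination of the columns of $\mathbf{X}$ with weights $\boldsymbol{\beta}$; squaring $x$ only changes a column of the design matrix, not the way the parameters enter.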


Non-Linear Models

(not linear in the parameters)

- Specific relationship:

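$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Specific relationship:

$Y = f(X_1, X_2) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$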

Mean response surface (errors not shown):


Non-normality

The conditional distribution of $Y$ given $X$ does not have to be normal. BUT the validity of many of our common hypothesis tests depends on normality.

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim$ a right-skewed distribution

[Sketch: right-skewed error distribution]

- Might attain normality of errors through transformations ⇒ if so, common statistical tests are valid

- Could use the original skewed data and maximum likelihood methods for estimation (with a specified non-normal distribution)
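The transformation route can be illustrated with simulated data. The sketch below assumes one particular way skewness might arise, namely multiplicative lognormal errors, so that log(Y) has exactly normal errors; real data need not behave this way. The parameter values and the helper names `ols_residuals` and `skewness` are hypothetical choices for this example.

```python
import math
import random


def ols_residuals(x, y):
    """Residuals from a least-squares straight-line fit of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def skewness(values):
    """Moment-based sample skewness: about 0 for symmetric data, > 0 if right-skewed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# SIMULATED data: log(Y) = 1 + 0.5 * X + normal error (assumed model).
rng = random.Random(2)
x = [rng.uniform(0, 2) for _ in range(500)]
y = [math.exp(1.0 + 0.5 * xi + rng.gauss(0, 0.75)) for xi in x]

skew_raw = skewness(ols_residuals(x, y))                          # original scale
skew_log = skewness(ols_residuals(x, [math.log(v) for v in y]))   # after log transform
print(f"residual skewness: raw = {skew_raw:.2f}, log scale = {skew_log:.2f}")
```

On the original scale the residuals are clearly right-skewed; after the log transformation their skewness is close to zero, which is the situation in which the usual normal-theory tests are better justified.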


Nonparametric Regression

LOWESS (locally weighted scatterplot smoother)

[Figure: scatterplot of Prestige (y-axis, 20 to 80) versus Average Income, USD (x-axis, 0 to 25000)]

- The lowess smoother estimates the function $f$ in $Y_i = f(x_i) + \varepsilon_i$

- The predicted $Y_i$ for a given $x_i$ is determined by considering only ‘local’ points in a ‘window’ around $x_i$

- Often a simple linear regression is fit to the local points, and the prediction falls on this line

- Researcher chooses width of window
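The local-fitting idea in the bullets above can be sketched in a few lines. This is a simplified illustration rather than a full LOWESS implementation: it uses tricube weights and a window defined by a fraction of the data (`span`), and it omits the robustness iterations that LOWESS implementations typically add. The function names, the `span` parameter, and the simulated data are assumptions made for this example.

```python
import math
import random


def tricube(u):
    """Tricube weight: largest at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_predict(x0, x, y, span=0.5):
    """Predict at x0 from a weighted straight-line fit to the points near x0.

    span is the fraction of the data used to define the window width.
    """
    n = len(x)
    k = min(n, max(3, math.ceil(span * n)))        # number of points in the window
    h = sorted(abs(xi - x0) for xi in x)[k - 1]    # distance to the k-th nearest point
    if h == 0.0:                                   # all window points sit exactly at x0
        same = [yi for xi, yi in zip(x, y) if xi == x0]
        return sum(same) / len(same)
    h *= 1.0 + 1e-9                                # keep the k-th point just inside the window
    sw = swd = swy = swdd = swdy = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0                                # work with x centred at x0
        w = tricube(d / h)
        sw += w
        swd += w * d
        swy += w * yi
        swdd += w * d * d
        swdy += w * d * yi
    denom = sw * swdd - swd * swd
    if denom <= 1e-12 * sw * swdd:                 # no spread in x: use the weighted mean
        return swy / sw
    slope = (sw * swdy - swd * swy) / denom
    return (swy - slope * swd) / sw                # intercept = fitted value at x0


# SIMULATED noisy curve, for illustration only.
rng = random.Random(3)
xs = [i / 10 for i in range(100)]
ys = [math.sin(xi) + rng.gauss(0, 0.2) for xi in xs]
smooth = [local_linear_predict(x0, xs, ys, span=0.3) for x0 in xs]
print([round(v, 2) for v in smooth[:5]])
```

A larger `span` gives a wider window and a smoother, flatter curve; a smaller one follows the data more closely. That is the trade-off the researcher controls when choosing the window width.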


Other analyses

• The type of data will affect how the data is modeled and the choice of analysis

– Binary response (0/1) with covariate predictors:

Logistic regression

– Relationship between categorical/ordinal variables:

Contingency tables, chi-squared test (we won’t cover this in this class)

– Relationship between a quantitative dependent variable (Y) and qualitative predictor:

t-test or ANOVA

– Predicting a continuous response from both quantitative and qualitative variables:

Dummy-variable regression or ANCOVA

– Response is a count (Poisson distribution) and the Poisson distribution mean is dependent on the covariates:

Poisson regression
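For comparison with the normal-error models earlier in the chapter, the logistic and Poisson regression models in the list above are commonly written as follows for a single predictor $X$. The logit and log link functions shown are the usual default choices and are an assumption here, since the slides do not state them.

```latex
\[
\text{Logistic regression:}\quad
Y \mid X \sim \text{Bernoulli}\big(p(X)\big),
\qquad
\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X
\]
\[
\text{Poisson regression:}\quad
Y \mid X \sim \text{Poisson}\big(\mu(X)\big),
\qquad
\log \mu(X) = \beta_0 + \beta_1 X
\]
```

In both cases the conditional distribution of $Y$ given $X$ is not normal, and a function of its mean, rather than the mean itself, is modeled as linear in the parameters.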
