12 Simple Linear Regression and Correlation
Chapter 12, Stat 4570/5570
Material from Devore’s book (Ed 8), and Cengage


Page 1:

12 Simple Linear Regression and Correlation

Chapter 12 Stat 4570/5570

Material from Devore’s book (Ed 8), and Cengage

Page 2:

So far, we tested whether subpopulations’ means are equal.

What happens when we have multiple subpopulations?

For example, so far we can answer:

Is the fever-reduction effect of a certain drug the same for children vs. adults? Our hypothesis would be: H0 : µ1 = µ2

But what happens if we now wish to consider many age groups: infants (0-1), toddlers (1-3), pre-school (3-6), early school (6-9), … old age (65 and older)? I.e.:

H0 : µ1 = µ2 = µ3 = µ4 = … = µ15

How do we do this? What would some of the alternatives be?

Page 3:

What would some of the alternatives be?

Ha : µ1 < µ2 < µ3 < µ4 < … < µ15

This is doable, but it would be a cumbersome test.

Instead, we can postulate:

Ha : µ1 = µ2 – b = µ3 – 2b = … = µ15 – 14b

Then, H0: b = 0

In other words, we postulate a specific relationship between the means.

Page 4:

Modeling relations: Simple Linear Models

(relationship status: it’s not complicated)

Page 5:

The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is a linear relationship

y = β0 + β1x.

The set of pairs (x, y) for which y = β0 + β1x determines a straight line with slope β1 and y-intercept β0.

The objective of this section is to develop an equivalent linear probabilistic model.

If the two variables are probabilistically related, then for a fixed value of x, there is uncertainty in the value of the second variable.

Page 6:

The Simple Linear Regression Model

The variable whose value is fixed (by the experimenter) will be denoted by x and will be called the independent, predictor, or explanatory variable.

For fixed x, the second variable will be random; we denote this random variable and its observed value by Y and y, respectively, and refer to it as the dependent or response variable.

Usually observations will be made for a number of settings of the independent variable.

Page 7:

The Simple Linear Regression Model

Let x1, x2, …, xn denote values of the independent variable for which observations are made, and let Yi and yi, respectively, denote the random variable and observed value associated with xi.

The available bivariate data then consists of the n pairs (x1, y1), (x2, y2), …, (xn, yn).

A picture of this data, called a scatter plot, gives preliminary impressions about the nature of any relationship. In such a plot, each (xi, yi) is represented as a point plotted on a two-dimensional coordinate system.

Page 8:

Example 1

The order in which observations were obtained was not given, so for convenience the data are listed in increasing order of x values:

Thus (x1, y1) = (.40, 1.02), (x5, y5) = (.57, 1.52), and so on.

cont’d

Page 9:

Example 1

The scatter plot (OSA = y, palwidth = x):

cont’d

Page 10:

Example 1

• There is a strong tendency for y to increase as x increases. That is, larger values of OSA tend to be associated with larger values of width—a positive relationship between the variables.

• It appears that the value of y could be predicted from x by finding a line that is reasonably close to the points in the plot (the authors of the cited article superimposed such a line on their plot).

cont’d

Page 11:

A Linear Probabilistic Model

For the deterministic model y = β0 + β1x, the actual observed value of y is a linear function of x.

The appropriate generalization of this to a probabilistic model assumes that

the average of Y is a linear function of x

-- i.e., that for fixed x the actual value of the variable Y differs from its expected value by a random amount.

Page 12:

A Linear Probabilistic Model

Definition: The Simple Linear Regression Model

There are parameters β0, β1, and σ², such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

Y = β0 + β1x + ε    (12.1)

The quantity ε in the model equation is the ERROR -- a random variable, assumed to be symmetrically distributed with

E(ε) = 0 and V(ε) = σ².

Page 13:

A Linear Probabilistic Model

The variable ε is sometimes referred to as the random deviation or random error term in the model.

Without ε, any observed pair (x, y) would correspond to a point falling exactly on the line y = β0 + β1x, called the true (or population) regression line.

The inclusion of the random error term allows (x, y) to fall either above the true regression line (when ε > 0) or below the line (when ε < 0).

Page 14:

A Linear Probabilistic Model

The points (x1, y1), …, (xn, yn) resulting from n independent observations will then be scattered about the true regression line:

Page 15:

A Linear Probabilistic Model

On occasion, the appropriateness of the simple linear regression model may be suggested by theoretical considerations (e.g., there is an exact linear relationship between the two variables, with ε representing measurement error).

Much more frequently, the reasonableness of the model is indicated by the data -- a scatter plot exhibiting a substantial linear pattern.

Page 16:

A Linear Probabilistic Model

If we think of an entire population of (x, y) pairs, then µY|x∗ is the mean of all y values for which x = x∗, and σ²Y|x∗ is a measure of how much these values of y spread out about the mean value.

If, for example, x = age of a child and y = vocabulary size, then µY|5 is the average vocabulary size for all 5-year-old children in the population, and σ²Y|5 describes the amount of variability in vocabulary size for this part of the population.

Page 17:

A Linear Probabilistic Model

Once x is fixed, the only randomness on the right-hand side of the linear model equation is in the random error ε. Recall that its mean value and variance are 0 and σ², respectively, whatever the value of x. This implies that

µY|x∗ = E(β0 + β1x∗ + ε) = ?
σ²Y|x∗ = ?

Replacing x∗ in µY|x∗ by x gives the relation µY|x = β0 + β1x, which says that the mean value (expected value, average) of Y, rather than Y itself, is a linear function of x.

Page 18:

A Linear Probabilistic Model

The true regression line y = β0 + β1x is thus the line of mean values; its height for any particular x value is the expected value of Y for that value of x.

The slope β1 of the true regression line is interpreted as the expected (average) change in Y associated with a 1-unit increase in the value of x.

(This is equivalent to saying “the change in average (expected) Y associated with a 1-unit increase in the value of x.”)

The variance (amount of variability) of the distribution of Y values is the same at each different value of x (homogeneity of variance).

Page 19:

The variance parameter σ² determines the extent to which each normal curve spreads out about the regression line.

When errors are normally distributed…

[Figure: (a) distribution of ε; (b) distribution of Y for different values of x]

Page 20:

A Linear Probabilistic Model

When σ² is small, an observed point (x, y) will almost always fall quite close to the true regression line, whereas observations may deviate considerably from their expected values (corresponding to points far from the line) when σ² is large.

Thus, this variance can be used to tell us how good the linear fit is (we’ll explore this notion of model fit later).

Page 21:

Example 2

Suppose the relationship between applied stress x and time-to-failure y is described by the simple linear regression model with true regression line y = 65 – 1.2x and σ = 8.

Then for any fixed value x∗ of stress, time-to-failure has a normal distribution with mean value 65 – 1.2x∗ and standard deviation 8.

Roughly speaking, in the population consisting of all (x, y) points, the magnitude of a typical deviation from the true regression line is about 8.

Page 22:

Example 2

For x = 20, Y has mean value…?

What is P(Y > 50 when x = 20)? P(Y > 50 when x = 25)?

cont’d
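As a quick R check (a sketch; these numbers follow directly from the stated model, under which Y is normal with mean 65 – 1.2x and standard deviation 8):

# Mean of Y at x = 20 under the stated model
65 - 1.2 * 20                                                # 41
# P(Y > 50) at x = 20 and at x = 25
pnorm(50, mean = 65 - 1.2 * 20, sd = 8, lower.tail = FALSE)  # about .13
pnorm(50, mean = 65 - 1.2 * 25, sd = 8, lower.tail = FALSE)  # about .03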

Page 23:

Example 2

These probabilities are illustrated in the following figure:

cont’d

Page 24:

Example 2

Suppose that Y1 denotes an observation on time-to-failure made with x = 25 and Y2 denotes an independent observation made with x = 24.

Then Y1 – Y2 is normally distributed

with mean value E(Y1 – Y2) = (β0 + β1(25)) – (β0 + β1(24)) = β1 = –1.2,

variance V(Y1 – Y2) = σ² + σ² = 128,

and standard deviation √128 ≈ 11.31.

Page 25:

Example 2

The probability that Y1 exceeds Y2 is

P(Y1 – Y2 > 0) = P(Z > (0 – (–1.2))/11.31)

= P(Z > .11)

= .4562

That is, even though we expect Y to decrease when x increases by 1 unit, it is not unlikely that the observed Y at x + 1 will be larger than the observed Y at x.

cont’d

Page 26:

A couple of caveats with linear relationships…

“no linear relationship” does not mean “no relationship”

Here there is no linear relation between x and y, but a perfect quadratic one.

[Scatter plot: Y vs. X, a perfect quadratic (no linear) relationship]

Page 27:

A couple of caveats…

Relation not linear – but an ok approximation in this range of x

[Scatter plot with fitted line: Y2 vs. X2, a curved relation reasonably approximated by a line over this range of x]

Page 28:

A couple of caveats…

Too dependent on that one point out there

[Scatter plot with fitted line: Y3 vs. X3, fit pulled by a single outlying point]

Page 29:

A couple of caveats…

Totally dependent on that one point out there

[Scatter plot with fitted line: Y4 vs. X4, fit determined entirely by a single extreme point]

Page 30:

A couple of caveats…

Nice!

[Scatter plot with fitted line: Y1 vs. X1, a well-behaved linear relationship]
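These four panels mirror Anscombe’s famous quartet, which ships with R as the anscombe data set (the X1–X4/Y1–Y4 naming above matches it); a minimal sketch to reproduce the effect, assuming that data set is indeed the source:

# All four data sets give nearly the same fitted line and r^2,
# but only the first is well described by a line.
data(anscombe)
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i))
  abline(lm(y ~ x), col = "red")
}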

Page 31:

Estimating Model Parameters

Page 32:

Estimating Model Parameters

The values of β0, β1, and σ² will almost never be known to an investigator.

Instead, sample data consists of n observed pairs (x1, y1), …, (xn, yn), from which the model parameters and the true regression line itself can be estimated.

The data (pairs) are assumed to have been obtained independently of one another.

Page 33:

Estimating Model Parameters

That is, yi is the observed value of Yi, where Yi = β0 + β1xi + εi

and the n deviations ε1, ε2, …, εn are independent rv’s.

Independence of Y1, Y2, …, Yn follows from independence of the εi’s.

According to the model, the observed points will be distributed around the true regression line in a random manner.

Page 34:

Estimating Model Parameters

The figure shows a typical plot of observed pairs along with two candidates for the estimated regression line.

Page 35:

Estimating Model Parameters

The “best fit” line is motivated by the principle of least squares, which can be traced back to the German mathematician Gauss (1777–1855):

a line provides the best fit to the data if the sum of the squared vertical distances (deviations) from the observed points to that line is as small as it can be.

Page 36:

Estimating Model Parameters

The sum of squared vertical deviations from the points (x1, y1), …, (xn, yn) to the line y = b0 + b1x is then

f(b0, b1) = Σ [yi – (b0 + b1xi)]²

The point estimates of β0 and β1, denoted by β̂0 and β̂1, are called the least squares estimates – they are those values that minimize f(b0, b1).

That is, β̂0 and β̂1 are such that f(β̂0, β̂1) ≤ f(b0, b1) for any b0 and b1.

Page 37:

Estimating Model Parameters

The estimated regression line or least squares line is then the line whose equation is y = β̂0 + β̂1x.

The minimizing values of b0 and b1 are found by taking partial derivatives of f(b0, b1) with respect to both b0 and b1, equating them both to zero [analogously to f′(b) = 0 in univariate calculus], and solving the equations

∂f(b0, b1)/∂b0 = Σ 2(yi – b0 – b1xi)(–1) = 0
∂f(b0, b1)/∂b1 = Σ 2(yi – b0 – b1xi)(–xi) = 0

Page 38:

Estimating Model Parameters

Cancellation of the –2 factor and rearrangement gives the following system of equations, called the normal equations:

nb0 + (Σxi)b1 = Σyi

(Σxi)b0 + (Σxi²)b1 = Σxiyi

These equations are linear in the two unknowns b0 and b1.

Provided that not all xi’s are identical, the least squares estimates are the unique solution to this system.

Page 39:

Estimating Model Parameters

The least squares estimate of the slope coefficient β1 of the true regression line is

β̂1 = Sxy / Sxx    (12.2)

Shortcut formulas for the numerator and denominator of β̂1 are

Sxy = Σxiyi – (Σxi)(Σyi)/n and Sxx = Σxi² – (Σxi)²/n

Page 40:

Estimating Model Parameters

The least squares estimate of the intercept β0 of the true regression line is

β̂0 = ȳ – β̂1x̄    (12.3)

The computational formulas for Sxy and Sxx require only the summary statistics Σxi, Σyi, Σxi², and Σxiyi (Σyi² will be needed shortly).

In computing β̂0, use extra digits in β̂1 because, if x̄ is large in magnitude, rounding will affect the final answer. In practice, the use of a statistical software package is preferable to hand calculation and hand-drawn plots.

Page 41:

Example

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine.

Determination of this number for a biodiesel fuel is expensive and time-consuming.

The article “Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study” (J. of Automobile Engr., 2009: 565–583) included the following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels.

Page 42:

Example

The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil. Fit the simple linear regression model to this data.

cont’d

Page 43:

Example

Scatter plot with the least squares line superimposed.

cont’d

Page 44:

Estimating σ² and σ

Page 45:

Estimating σ² and σ

The parameter σ² determines the amount of spread about the true regression line. Two separate examples:

Page 46:

Estimating σ² and σ

An estimate of σ² will be used in confidence interval (CI) formulas and hypothesis-testing procedures presented in the next two sections.

Because the equation of the true line is unknown, the estimate is based on the extent to which the sample observations deviate from the estimated line.

Many large deviations (residuals) suggest a large value of σ², whereas deviations all of which are small in magnitude suggest that σ² is small.

Page 47:

Estimating σ² and σ

Definition

The fitted (or predicted) values ŷ1, ŷ2, …, ŷn are obtained by successively substituting x1, …, xn into the equation of the estimated regression line: ŷi = β̂0 + β̂1xi.

The residuals are the differences y1 – ŷ1, y2 – ŷ2, …, yn – ŷn between the observed and fitted y values.

Note – the residuals are estimates of the true errors, since they are deviations of the y’s from the estimated regression line, while the errors are deviations from the true regression line.

Page 48:

Estimating σ² and σ

In words, the predicted value ŷi is:
- The value of y that we would predict or expect when using the estimated regression line with x = xi;
- The height of the estimated regression line above the value xi for which the ith observation was made;
- The estimated mean for that population where x = xi.

The residual is the vertical deviation between the point (xi, yi) and the least squares line—a positive number if the point lies above the line and a negative number if it lies below the line.

Page 49:

Estimating σ² and σ

When the estimated regression line is obtained via the principle of least squares, the sum of the residuals should in theory be zero, if the error distribution is symmetric (and it is, due to our assumption).

Page 50:

Example

Relevant summary quantities (summary statistics) are

Σxi = 2817.9, Σyi = 1574.8, Σxi² = 415,949.85,

Σxiyi = 222,657.88, and Σyi² = 124,039.58,

from which x̄ = 140.895, ȳ = 78.74, Sxx = 18,921.8295, Sxy = 776.434.
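As a check, a minimal R sketch reproducing these quantities and the resulting least squares estimates from the sums alone (n = 20 here, since x̄ = Σxi/n):

n   <- 20
Sxy <- 222657.88 - (2817.9 * 1574.8) / n   # 776.434
Sxx <- 415949.85 - 2817.9^2 / n            # 18921.8295
b1  <- Sxy / Sxx                           # slope estimate, about 0.04103
b0  <- 78.74 - b1 * 140.895                # intercept estimate, about 72.96
c(b0 = b0, b1 = b1)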

Page 51:

Example

All predicted values (fits) and residuals appear in the accompanying table.

cont’d

Page 52:

Estimating σ² and σ

The error sum of squares (equivalently, residual sum of squares), denoted by SSE, is

SSE = Σ(yi – ŷi)²

and the estimate of σ² is

σ̂² = s² = SSE / (n – 2)

Page 53:

Estimating σ² and σ

The divisor n – 2 in s² is the number of degrees of freedom (df) associated with SSE and the estimate s².

This is because to obtain s², the two parameters β0 and β1 must first be estimated, which results in a loss of 2 df (just as µ had to be estimated in one-sample problems, resulting in an estimated variance based on n – 1 df).

Replacing each yi in the formula for s² by the rv Yi gives the estimator S².

It can be shown that S² is an unbiased estimator for σ².

Page 54:

Example, cont.

The residuals for the filtration rate–moisture content data were calculated previously.

The corresponding error sum of squares is?

The estimate of σ² is?
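A sketch of the answers in R, using the computational formula from the next slide together with the summary statistics and estimates above:

b0  <- 72.958547; b1 <- 0.04103377                 # estimates from this example
SSE <- 124039.58 - b0 * 1574.8 - b1 * 222657.88    # Σyi² − β̂0Σyi − β̂1Σxiyi, about 7.97
s2  <- SSE / (20 - 2)                              # estimate of σ², about 0.44
s   <- sqrt(s2)                                    # about 0.67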

Page 55:

Estimating σ² and σ

Computation of SSE from the defining formula involves much tedious arithmetic, because both the predicted values and residuals must first be calculated.

Use of the following computational formula does not require these quantities:

SSE = Σyi² – β̂0Σyi – β̂1Σxiyi

Page 56:

The Coefficient of Determination

Page 57:

The Coefficient of Determination

Different variability in observed y values:

Using the model to explain y variation:
(a) data for which all variation is explained;
(b) data for which most variation is explained;
(c) data for which little variation is explained.

Page 58:

The Coefficient of Determination

The error sum of squares SSE can be interpreted as a measure of how much variation in y is left unexplained by the model—that is, how much cannot be attributed to a linear relationship.

What would SSE be in the first plot?

A quantitative measure of the total amount of variation in observed y values is given by the total sum of squares

SST = Σ(yi – ȳ)²

Page 59:

The Coefficient of Determination

The total sum of squares is the sum of squared deviations about the sample mean of the observed y values – when no predictors are taken into account.

Thus the same number ȳ is subtracted from each yi in SST, whereas SSE involves subtracting each different predicted value ŷi from the corresponding observed yi.

In some sense, SST is as large as SSE can get when there is no regression model (i.e., the slope is 0).

Page 60:

The Coefficient of Determination

Just as SSE is the sum of squared deviations about the least squares line y = β̂0 + β̂1x, SST is the sum of squared deviations about the horizontal line at height ȳ (since then the vertical deviations are yi – ȳ):

Figure 12.11

Sums of squares illustrated: (a) SSE = sum of squared deviations about the least squares line; (b) SST = sum of squared deviations about the horizontal line

Page 61:

The Coefficient of Determination

Definition

The coefficient of determination, denoted by r², is given by

r² = 1 – SSE/SST

It is interpreted as the proportion of observed y variation that can be explained by the simple linear regression model (attributed to an approximate linear relationship between y and x).

The higher the value of r², the more successful is the simple linear regression model in explaining y variation.

Page 62:

The Coefficient of Determination

When regression analysis is done by a statistical computer package, either r² or 100r² (the percentage of variation explained by the model) is a prominent part of the output.

What do you do if r2 is small?

Page 63:

Example

The scatter plot of the iodine value–cetane number data in Figure 12.8 portends a reasonably high r² value.

Figure 12.8

Scatter plot for Example 4 with least squares line superimposed, from Minitab

Page 64:

Example

The coefficient of determination is then

r² = 1 – SSE/SST = 1 – (78.920)/(377.174) = .791

How do we interpret this value?

cont’d

Page 65:

The Coefficient of Determination

The coefficient of determination can be written in a slightly different way by introducing a third sum of squares—the regression sum of squares, SSR—given by

SSR = Σ(ŷi – ȳ)² = SST – SSE.

The regression sum of squares is interpreted as the amount of total variation that is explained by the model.

Then we have

r² = 1 – SSE/SST = (SST – SSE)/SST = SSR/SST,

the ratio of explained variation to total variation.
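A minimal R sketch of this decomposition (the data here are made up purely for illustration):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
SSE <- sum(residuals(fit)^2)             # unexplained variation
SST <- sum((y - mean(y))^2)              # total variation
SSR <- sum((fitted(fit) - mean(y))^2)    # explained variation
c(SSR + SSE, SST)                        # identity: SSR + SSE = SST
1 - SSE / SST                            # r^2; matches summary(fit)$r.squared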

Page 66:

Inference About the Slope Parameter β1

Page 67:

Example 10

The following data is representative of that reported in the article “An Experimental Correlation of Oxides of Nitrogen Emissions from Power Boilers Based on Field Data” (J. of Engr. for Power, July 1973: 165–170), x = burner-area liberation rate (MBtu/hr-ft²) and y = NOx emission rate (ppm).

There are 14 observations, made at the x values 100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, and 400, respectively.

Page 68:

Simulation experiment

Suppose that the slope and intercept of the true regression line are β1 = 1.70 and β0 = –50, with σ = 35.

Let’s fix the x values 100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, and 400.

We then generate a sample of random deviations from a normal distribution with mean 0 and standard deviation 35, and add each deviation to β0 + β1xi to obtain 14 corresponding y values.

LS calculations were then carried out to obtain the estimated slope, intercept, and standard deviation for this sample of 14 pairs (xi, yi).

cont’d

Page 69:

Simulation experiment

This process was repeated a total of 20 times, resulting in the values:

There is clearly variation in the values of the estimated slope and estimated intercept, as well as the estimated standard deviation.
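A minimal R sketch of this experiment (the seed is arbitrary, not from the slides):

set.seed(1)
x  <- c(100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400)
b0 <- -50; b1 <- 1.70; sigma <- 35
sims <- t(replicate(20, {
  y   <- b0 + b1 * x + rnorm(length(x), mean = 0, sd = sigma)
  fit <- lm(y ~ x)
  c(coef(fit), s = summary(fit)$sigma)   # intercept, slope, estimated sd
}))
head(sims)                               # one row per simulated sample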

cont’d

Page 70:

Simulation experiment: graphs of the true regression line and 20 least squares lines

Page 71:

Inferences About the Slope Parameter β1

The estimators are:

β̂1 = Sxy / Sxx = Σ(xi – x̄)(Yi – Ȳ) / Σ(xi – x̄)²

β̂0 = Ȳ – β̂1x̄

Page 72:

Inferences About the Slope Parameter β1

Invoking properties of a linear function of random variables, as discussed earlier, leads to the following results.

1. The mean value of β̂1 is …
2. The variance and standard deviation of β̂1 are …
3. The distribution of β̂1 is …

Page 73:

Inferences About the Slope Parameter β1

The variance of β̂1 equals the variance σ² of the random error term—or, equivalently, of any Yi—divided by Sxx = Σ(xi – x̄)². This denominator is a measure of how spread out the xi’s are about x̄.

Making observations at xi values that are quite spread out results in a more precise estimator of the slope parameter.

If the xi’s are spread out too far, a linear model may not be appropriate throughout the range of observation.

Page 74:

Inferences About the Slope Parameter β1

Theorem

The assumptions of the simple linear regression model imply that the standardized variable

T = (β̂1 – β1) / S_β̂1, where S_β̂1 = S / √Sxx,

has a t distribution with n – 2 df.

Page 75:

A Confidence Interval for β1

Page 76:

A Confidence Interval for β1

As in the derivation of previous CIs, we begin with a probability statement:

P(–tα/2,n – 2 < (β̂1 – β1)/S_β̂1 < tα/2,n – 2) = 1 – α

Manipulation of the inequalities inside the parentheses to isolate β1 and substitution of estimates in place of the estimators gives the CI formula

β̂1 ± tα/2,n – 2 · s_β̂1
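In R, confint() computes exactly this t-based interval; a sketch using the Hooke data that appears later in these slides:

x <- c(0, 2, 4, 6, 8, 10)
y <- c(439.00, 439.12, 439.21, 439.31, 439.40, 439.50)
fit <- lm(y ~ x)
confint(fit, "x", level = 0.95)            # built-in CI for the slope
b1  <- coef(summary(fit))["x", "Estimate"]
se1 <- coef(summary(fit))["x", "Std. Error"]
b1 + c(-1, 1) * qt(.975, df = 4) * se1     # the same interval by hand (n - 2 = 4 df)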

Page 77:

Example

Variations in clay brick masonry weight have implications not only for structural and acoustical design but also for design of heating, ventilating, and air conditioning systems.

The article “Clay Brick Masonry Weight Variation” (J. of Architectural Engr., 1996: 135–137) gave a scatter plot of y = mortar dry density (lb/ft³) versus x = mortar air content (%) for a sample of mortar specimens, from which the following representative data was read:

Page 78:

Example

The scatter plot of this data in Figure 12.14 certainly suggests the appropriateness of the simple linear regression model; there appears to be a substantial negative linear relationship between air content and density, one in which density tends to decrease as air content increases.

Scatter plot of the data from Example 11

cont’d

Page 79:

Hypothesis-Testing Procedures

Page 80:

Hypothesis-Testing Procedures

The most commonly encountered pair of hypotheses about β1 is H0: β1 = 0 versus Ha: β1 ≠ 0. When this null hypothesis is true, µY|x = β0, independent of x. Then knowledge of x gives no information about the value of the dependent variable.

Null hypothesis: H0: β1 = β10

Test statistic value: t = (β̂1 – β10) / s_β̂1

Page 81:

Hypothesis-Testing Procedures

Alternative Hypothesis | Rejection Region for a Level α Test

Ha: β1 > β10 | t ≥ tα,n – 2

Ha: β1 < β10 | t ≤ –tα,n – 2

Ha: β1 ≠ β10 | either t ≥ tα/2,n – 2 or t ≤ –tα/2,n – 2

A P-value based on n – 2 df can be calculated just as was done previously for t tests.

For H0: β1 = 0, the test statistic value is the t ratio t = β̂1 / s_β̂1.
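This t ratio and its two-sided P-value are what summary(lm(...)) reports; a sketch, assuming x and y hold the Hooke data from the R slides below:

tab <- coef(summary(lm(y ~ x)))                            # coefficient table
t_ratio <- tab["x", "Estimate"] / tab["x", "Std. Error"]   # 48.98
2 * pt(abs(t_ratio), df = 4, lower.tail = FALSE)           # 1.04e-06, as in the output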

Page 82:

Regression analysis in R

Page 83:

Regression analysis in R

Fitting Linear Models in R

Description

lm is used to fit linear models (regression)

Command: lm(formula, data, subset, ...)

Example: Model1 = lm(outcome ~ predictor)

Page 84:

Example: Regression analysis in R

Robert Hooke (England, 1635–1703) was able to assess the relationship between the length of a spring and the load placed on it. He just hung weights of different sizes on the end of a spring, and watched what happened. When he increased the load, the spring got longer. When he reduced the load, the spring got shorter. And the relationship was more or less linear.

Let b be the length of the spring with no load. When a weight of x kilograms is tied to the end of the spring, the spring stretches to a new length.

Page 85:

Example: Regression analysis in R

According to Hooke's law, the amount of stretch is proportional to the weight x. So the new length of the spring is

y = mx + b

In this equation, m and b are constants which depend on the spring.

Their values are unknown and have to be estimated using experimental data.

Page 86:

Hooke’s Law in R

The data below come from an experiment in which weights of various sizes were loaded on the end of a length of piano wire.

Weight (kg) | Length (cm)
0           | 439.00
2           | 439.12
4           | 439.21
6           | 439.31
8           | 439.40
10          | 439.50

The first column shows the weight of the load. The second column shows the measured length. With 20 pounds of load, this "spring" only stretched about 0.2 inch (10 kg is approximately 22 lb, 0.5 cm is approximately 0.2 in).

Piano wire is not very stretchy!

439 cm is about 14.4 feet (the room needed to have a fairly high ceiling).

Page 87:

Regression analysis in R

x = c(0, 2, 4, 6, 8, 10)
y = c(439.00, 439.12, 439.21, 439.31, 439.40, 439.50)
plot( x, y, xlab = 'Load (kg)', ylab = 'Length (cm)' )

[Scatter plot of Length (cm) vs. Load (kg)]

Page 88:

Regression analysis in R

Hooke.lm = lm( y ~ x )
summary(Hooke.lm)

Residuals:
        1         2         3         4         5         6
-0.010952  0.010762  0.002476  0.004190 -0.004095 -0.002381

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 4.390e+02  6.076e-03 72254.92  < 2e-16 ***
x           4.914e-02  1.003e-03    48.98 1.04e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008395 on 4 degrees of freedom
Multiple R-squared: 0.9983, Adjusted R-squared: 0.9979
F-statistic: 2399 on 1 and 4 DF, p-value: 1.04e-06

Page 89:

Regression analysis in R

coefficients(Hooke.lm)
#  (Intercept)            x
# 439.01095238   0.04914286

Hooke.coefs = coefficients(Hooke.lm)

Hooke.coefs[1]
# 439.011

Hooke.coefs[2]
# 0.04914286

# expected length for x = 5 kg
Hooke.coefs[1] + Hooke.coefs[2] * 5
# 439.2567

If we knew that line for certain, then the estimated standard deviation of the actual length around that expected value would be the residual standard error: 0.008395.

Page 90:

Regression analysis in R

coefficients(Hooke.lm)
#  (Intercept)            x
# 439.01095238   0.04914286

So, in Hookeʼs law, m ≈ 0.05 cm per kg and b ≈ 439.01 cm. The method of least squares estimates the length of the spring under no load to be 439.01 cm.

And each kilogram of load makes the spring stretch by an amount estimated as 0.05 cm.

Note that there is no question here about what is causing what: correlation is not causation, but in this controlled experimental setting, the causation is clear and simple. This is the exception more than the rule in statistical modeling.

Page 91:

Regression analysis in R

The method of least squares estimates the length of the spring under no load to be 439.01 cm. This is a bit longer than the measured length at no load (439.00 cm).

A statistician would trust the least squares estimate over the measurement. Why?

example taken from analytictech.com

Page 92:

Regression analysis in R

y.hat <- Hooke.coefs[1] + Hooke.coefs[2] * x
# 439.0110 439.1092 439.2075 439.3058 439.4041 439.5024

> predict(Hooke.lm)
# 439.0110 439.1092 439.2075 439.3058 439.4041 439.5024

e.hat <- y - y.hat

mean( e.hat )
# -2.842171e-14

cbind( y, y.hat, e.hat )
#           y    y.hat        e.hat
# [1,] 439.00 439.0110 -0.010952381
# [2,] 439.12 439.1092  0.010761905
# [3,] 439.21 439.2075  0.002476190
# [4,] 439.31 439.3058  0.004190476
# [5,] 439.40 439.4041 -0.004095238
# [6,] 439.50 439.5024 -0.002380952

Page 93:

Regression analysis in R

plot( x, y, xlab = 'Load (kg)', ylab = 'Length (cm)' )
lines( x, y.hat, lty = 1, col = 'red' )

[Scatter plot of Length (cm) vs. Load (kg) with the fitted least squares line]

Page 94:

Regression analysis in R

plot( y.hat, e.hat, xlab = 'Predicted y values', ylab = 'Residuals' )
abline( h = 0, col = 'red' )

This plot doesn't look good, actually.

[Plot of residuals vs. predicted y values, with a horizontal line at 0]

Page 95:

Regression analysis in R

qqnorm( e.hat, main = 'Residual Normal QQplot' )
abline( mean( e.hat ), sd( e.hat ), col = 'red' )

This plot looks good.

[Normal QQ plot of the residuals with reference line]

Page 96:

Inferences Concerning µY|x∗ and the Prediction of Future Y Values

Page 97:

Inference Concerning Mean of Future Y

Hooke.coefs = coefficients(Hooke.lm)
#  (Intercept)            x
# 439.01095238   0.04914286

# expected length for x = 5 kg
Hooke.coefs[1] + Hooke.coefs[2] * 5
# 439.2567

If we knew that line for certain, then the estimated standard deviation of the actual length around that expected value would be:

Residual standard error: 0.008395

But we donʼt know m and b for certain – weʼve just estimated them, and we know that their estimators are Normal random variables.

Page 98:

Inference Concerning Mean of Future Y

Let x∗ denote a specified value of the independent variable x.

Once the estimates β̂0 and β̂1 have been calculated, β̂0 + β̂1x∗ can be regarded either as a point estimate of µY|x∗ (the expected or true average value of Y when x = x∗) or as a prediction of the Y value that will result from a single observation made when x = x∗.

The point estimate or prediction by itself gives no information concerning how precisely µY|x∗ has been estimated or Y has been predicted.

Page 99:

Inference Concerning Mean of Future Y

This can be remedied by developing a CI for µY|x∗ and a prediction interval (PI) for a single Y value.

Before we obtain sample data, both β̂0 and β̂1 are subject to sampling variability—that is, they are both statistics whose values will vary from sample to sample.

Suppose, for example, that β0 = 439 and β1 = 0.05.

Then a first sample of (x, y) pairs might give β̂0 = 439.35, β̂1 = 0.048; a second sample might result in β̂0 = 438.52, β̂1 = 0.051; and so on.

Page 100:

Inference Concerning Mean of Future Y

It follows that Ŷ = β̂0 + β̂1x∗ itself varies in value from sample to sample, so it is a statistic.

If the intercept and slope of the population line are the aforementioned values 439 and 0.05, respectively, and x∗ = 5 kg, then this statistic is trying to estimate the value

439 + 0.05(5) = 439.25

The estimate from a first sample might be 439.35 + 0.048(5) = 439.59, from a second sample might be 438.52 + 0.051(5) = 438.775, and so on.

Page 101:

Inference Concerning Mean of Future Y

Consider the value x∗ = 10. Recall that because 10 is farther from the “center of the data” (x̄ = 5) than x∗ = 5 is, the estimated “y.hat” values there are more variable.

Page 102:

Inference Concerning Mean of Future Y

In the same way, inferences about the mean Y value β0 + β1x∗ are based on properties of the sampling distribution of the statistic β̂0 + β̂1x∗.

Substitution of the expressions for β̂0 and β̂1 into β̂0 + β̂1x∗, followed by some algebraic manipulation, leads to the representation of β̂0 + β̂1x∗ as a linear function of the Yi’s:

β̂0 + β̂1x∗ = Σ di Yi, where di = 1/n + (x∗ – x̄)(xi – x̄)/Sxx

The coefficients d1, d2, …, dn in this linear function involve the xi’s and x∗, all of which are fixed.

Page 103:

Inference Concerning Mean of Future Y

Application of the rules to this linear function gives the following properties.

Proposition

Let Ŷ = β̂0 + β̂1x∗, where x∗ is some fixed value of x. Then

1. The mean value of Ŷ is

E(Ŷ) = E(β̂0 + β̂1x∗) = β0 + β1x∗

Thus β̂0 + β̂1x∗ is an unbiased estimator for β0 + β1x∗ (i.e., for µY|x∗).

Page 104:

Inference Concerning Mean of Future Y

2. The variance of Ŷ is

V(Ŷ) = σ² [ 1/n + (x∗ – x̄)² / Sxx ]

And the standard deviation is the square root of this expression. The estimated standard deviation of β̂0 + β̂1x∗, denoted by s_Ŷ, results from replacing σ by its estimate s:

s_Ŷ = s √( 1/n + (x∗ – x̄)² / Sxx )

3. Ŷ has a normal distribution.

Page 105:

Inference Concerning Mean of Future Y

The variance of β̂0 + β̂1x∗ is smallest when x∗ = x̄ and increases as x∗ moves away from x̄ in either direction.

Thus the estimator of µY|x∗ is more precise when x∗ is near the center of the xi’s than when it is far from the values at which observations have been made. This will imply that both the CI and PI are narrower for an x∗ near x̄ than for an x∗ far from x̄.

Most statistical computer packages will provide both β̂0 + β̂1x∗ and s_Ŷ for any specified x∗ upon request.

Page 106:

Inference Concerning Mean of Future Y

Hooke.coefs = coefficients(Hooke.lm)
#  (Intercept)            x
# 439.01095238   0.04914286

> predict(Hooke.lm)
# 439.0110 439.1092 439.2075 439.3058 439.4041 439.5024

# expected length for x = 5 kg
Hooke.coefs[1] + Hooke.coefs[2] * 5
# 439.2567

> predict(Hooke.lm, se.fit=TRUE)
$fit
       1        2        3        4        5        6
439.0110 439.1092 439.2075 439.3058 439.4041 439.5024

$se.fit
[1] 0.006075862 0.004561497 0.003571111 0.003571111 0.004561497 0.006075862

Page 107:

Inference Concerning Mean of Future Y

Just as inferential procedures for β1 were based on the t variable obtained by standardizing β̂1, a t variable obtained by standardizing β̂0 + β̂1x∗ leads to a CI and test procedures here.

Theorem

The variable

T = (β̂0 + β̂1x∗ – (β0 + β1x∗)) / S_Ŷ

has a t distribution with n – 2 df.

Page 108:

Inference Concerning Mean of Future Y

A probability statement involving this standardized variable can now be manipulated to yield a confidence interval for µY|x∗.

A 100(1 – α)% CI for µY|x∗, the expected value of Y when x = x∗, is

β̂0 + β̂1x∗ ± tα/2,n – 2 · s_Ŷ

This CI is centered at the point estimate for µY|x∗ and extends out to each side by an amount that depends on the confidence level and on the extent of variability in the estimator on which the point estimate is based.

Page 109:

Inference Concerning Mean of Future Y

preds = predict(Hooke.lm, se.fit=TRUE)

$fit
439.0110 439.1092 439.2075 439.3058 439.4041 439.5024

$se.fit
0.00608 0.00456 0.00357 0.00357 0.00456 0.00608

> plot( x, y, xlab = 'Load (kg)', ylab = 'Length (cm)', cex = 2 )
> lines( x, y.hat, lty = 1, col = 'red' )
> preds = predict(Hooke.lm, se.fit=TRUE)
> CI.ub = preds$fit + preds$se.fit * qt(.975, 6 - 2)
> CI.lb = preds$fit - preds$se.fit * qt(.975, 6 - 2)
> lines( x, CI.ub, lty = 2, col = "red" )
> lines( x, CI.lb, lty = 2, col = "red" )

Page 110:

Inference Concerning Mean of Future Y

Page 111:

Example: concrete

Corrosion of steel reinforcing bars is the most important durability problem for reinforced concrete structures.

Carbonation of concrete results from a chemical reaction that also lowers the pH value by enough to initiate corrosion of the rebar.

Representative data on x = carbonation depth (mm) and y = strength (MPa) for a sample of core specimens taken from a particular building follows.

Page 112:

Example -- concrete cont’d

Page 113:

Example -- concrete

Let’s now calculate a 95% confidence interval for the mean strength of all core specimens having a carbonation depth of 45 mm. The interval is centered at

The estimated standard deviation of the statistic is

The 16 df t critical value for a 95% confidence level is 2.120, from which we determine the desired interval to be

cont’d

Page 114:

A Prediction Interval for a Future Value of Y

Page 115:

A Prediction Interval for a Future Value of Y

Rather than calculate an interval estimate for µY|x∗, an investigator may wish to obtain a range or an interval of plausible values for the value of Y associated with some future observation when the independent variable has value x∗.

Consider, for example, relating vocabulary size y to age of a child x. The CI with x∗ = 6 would provide a range that covers with 95% confidence the true average vocabulary size for all 6-year-old children.

Alternatively, we might wish an interval of plausible values for the vocabulary size of a particular 6-year-old child. How can you tell that a child is “off the chart”, for example?

Page 116:

A Prediction Interval for a Future Value of Y

What is the difference between a confidence interval and a prediction interval?

Page 117:

A Prediction Interval for a Future Value of Y

The error of prediction is Y − Ŷ, a difference between two random variables. Because the future value Y is independent of the observed Yi's, the variances add:

V(Y − Ŷ) = V(Y) + V(Ŷ) = σ² + σ²[1/n + (x∗ − x̄)²/Sxx]


A Prediction Interval for a Future Value of Y

What is the distribution of the prediction error Y − Ŷ?

It can be shown that the standardized variable

T = (Y − Ŷ) / ( s · sqrt(1 + 1/n + (x∗ − x̄)²/Sxx) )

has a t distribution with n – 2 df.


A Prediction Interval for a Future Value of Y

Manipulating to isolate Y between the two inequalities yields the following interval.

A 100(1 – α)% PI for a future Y observation to be made when x = x∗ is

ŷ ± t_{α/2, n−2} · s · sqrt(1 + 1/n + (x∗ − x̄)²/Sxx)


A Prediction Interval for a Future Value of Y

The interpretation of the prediction level 100(1 – α)% is similar to that of previous confidence levels: if the PI is used repeatedly, in the long run the resulting intervals will actually contain the observed y values 100(1 – α)% of the time.

As n → ∞, the width of the CI approaches 0, whereas the width of the PI does not (because even with perfect knowledge of β0 and β1, there will still be randomness in prediction).


Example -- concrete

Let's return to the carbonation depth-strength data example and calculate a 95% PI for the strength value that would result from selecting a single core specimen whose depth is 45 mm. The relevant quantities carry over from that example.

For a prediction level of 95% based on n – 2 = 16 df, the t critical value is 2.120, exactly what we previously used for a 95% confidence level.
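As a minimal R sketch (hypothetical names: assume the data sit in a data frame concrete with columns depth and strength), both the CI for the mean strength and the PI for a single new specimen at 45 mm come from predict():

> concrete.lm = lm(strength ~ depth, data = concrete)  # fit the SLR (assumed data)
> new.x = data.frame(depth = 45)                       # the x* of interest
> # 95% CI for the mean strength at depth 45 mm
> predict(concrete.lm, newdata = new.x, interval = "confidence")
> # 95% PI for the strength of one new specimen at depth 45 mm
> predict(concrete.lm, newdata = new.x, interval = "prediction")

Note that the PI is always wider than the CI at the same x*, since it carries the extra "1 +" term under the square root.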


Diagnostic Plots


Diagnostic Plots

The basic plots that many statisticians recommend for an assessment of model validity and usefulness are the following (an R sketch follows the list):

1. ei∗ (or ei) on the vertical axis versus xi on the horizontal axis

2. ei∗ (or ei) on the vertical axis versus the fitted values ŷi on the horizontal axis

3. ŷi on the vertical axis versus yi on the horizontal axis

4. A histogram of the standardized residuals
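A minimal sketch of these four plots in R, assuming a fitted model fit = lm(y ~ x) with hypothetical variables y and x; rstandard() returns the standardized residuals ei∗:

> fit = lm(y ~ x)             # hypothetical fitted SLR
> e.star = rstandard(fit)     # standardized residuals
> par(mfrow = c(2, 2))        # 2 x 2 grid of plots
> plot(x, e.star)             # 1. e* versus x
> plot(fitted(fit), e.star)   # 2. e* versus fitted values
> plot(y, fitted(fit))        # 3. fitted values versus observed y
> hist(e.star)                # 4. histogram of standardized residuals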


Example


Difficulties and Remedies


Difficulties and Remedies

Although we hope that our analysis will yield plots like these, quite frequently the plots will suggest one or more of the following difficulties:

1. A nonlinear relationship between x and y is appropriate.

2. The variance of ∈ (and of Y) is not a constant σ² but depends on x.

3. The selected model fits the data well except for a very few discrepant or outlying data values, which may have greatly influenced the choice of the best-fit function.


Difficulties and Remedies

4. The error term ∈ does not have a normal distribution.

5. When the subscript i indicates the time order of the observations, the ∈i’s exhibit dependence over time.

6. One or more relevant independent variables have been omitted from the model.


Difficulties and Remedies

Plots that indicate abnormality in data: (a) nonlinear relationship; (b) nonconstant variance.


Difficulties and Remedies

Plots that indicate abnormality in data (cont'd): (c) discrepant observation; (d) observation with large influence.


Difficulties and Remedies

Plots that indicate abnormality in data (cont'd): (e) dependence in errors; (f) variable omitted.


Difficulties and Remedies

If the residual plot exhibits a curved pattern, as in plot (a) above, then a nonlinear function of x may be used.


Difficulties and Remedies

The residual plot below suggests that, although a straight-line relationship may be reasonable, the assumption that V(Yi) = σ² for each i is of doubtful validity.

In that case, using more advanced methods, such as weighted least squares (WLS) or more flexible models, is recommended for inference.
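For instance, a hedged sketch: if the residual spread appears to grow roughly in proportion to x, one common (assumed, not unique) choice is WLS with weights proportional to 1/x²; lm() takes these through its weights argument:

> # WLS: lm() minimizes the weighted sum of squares sum(w * e^2)
> fit.wls = lm(y ~ x, weights = 1 / x^2)  # assumes SD of Y grows like x
> summary(fit.wls)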


Difficulties and Remedies

When plots or other evidence suggest that the data set contains outliers or points having large influence on the resulting fit, one possible approach is to omit these outlying points and re-compute the estimated regression equation.

This would certainly be correct if it were found that the outliers resulted from errors in recording data values or from experimental errors.

If no assignable cause can be found for the outliers, it is still desirable to report the estimated equation both with and without the outliers omitted.


Difficulties and Remedies

Yet another approach is to retain possible outliers but to use an estimation principle that puts relatively less weight on outlying values than does the principle of least squares.

One such principle is MAD (minimize absolute deviations), which selects b0 and b1 to minimize Σ | yi – (b0 + b1xi) |.

Unlike the least squares estimates, there are no nice formulas for the MAD estimates; their values must be found by an iterative computational procedure, as in the sketch below.
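One such iterative procedure is available in R through the quantreg add-on package (assumed installed): median regression, rq() with tau = 0.5, minimizes exactly this sum of absolute deviations. A sketch, assuming y and x exist:

> library(quantreg)               # install.packages("quantreg") if needed
> fit.lad = rq(y ~ x, tau = 0.5)  # minimizes sum |yi - (b0 + b1*xi)|
> coef(fit.lad)                   # the MAD (least absolute deviations) estimates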


Difficulties and Remedies

Such procedures are also used when it is suspected that the ∈i's have a distribution that is not normal but instead has "heavy tails" (making it much more likely than for the normal distribution that discrepant values will enter the sample); robust regression procedures are those that produce reliable estimates for a wide variety of underlying error distributions.
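As one illustration, a sketch of a widely used robust procedure: M-estimation via rlm() in the MASS package, which uses Huber weighting by default and so downweights large residuals (variable names again hypothetical):

> library(MASS)
> fit.rob = rlm(y ~ x)  # iteratively reweighted least squares
> summary(fit.rob)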


So far we have learned about LR…

1. How to fit SLR

2. How to do inference in SLR

3. How to check for assumption violations: we learned to use standardized residuals instead of ordinary residuals

4. We learned about outliers

5. We learned about the sneakier kind: outliers in the X space. They almost always have a small ordinary residual (that residual's variance is tiny, since their X is far away from the mean of all X's), so they won't be visible on the ordinary residual plot (but they may stand out on the standardized residual plot). They are called leverage points (points with high leverage).

6. If removing a leverage point results in a very different line, it is called an influential point.


Today we will talk about MLR

1. How to fit MLR: estimate the true linear relationship (regression surface) between the outcome and the predictors. Interpret the slope on X1 as the average change in Y when X1 increases by 1 unit, holding all other X's constant.

2. How to do inference for one slope, several slopes (is a subset of slopes significant, i.e., does this subset of variables "matter"?), or all slopes.

3. Same approach to checking for assumption violations (non-constant variance, non-linear patterns in residuals)

4. Outliers (outliers in Y space: those points with large residuals)

5. Must think about *any* outliers in the X space (leverage points); they may have a small ordinary residual.


Multiple Regression Analysis

Definition
The multiple regression model equation is

Y = β0 + β1x1 + β2x2 + ... + βkxk + ∈

where E(∈) = 0 and V(∈) = σ².

In addition, for purposes of testing hypotheses and calculating CIs or PIs, it is assumed that ∈ is normally distributed.

This is not a regression line any longer, but a regression surface.
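In R, fitting this model is the same lm() call with more terms on the right-hand side; a sketch with hypothetical names y, x1, x2, x3 in a data frame dat:

> fit.mlr = lm(y ~ x1 + x2 + x3, data = dat)
> summary(fit.mlr)  # t test for each slope; F test for all slopes at once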


Multiple Regression Analysis

The regression coefficient β1 is interpreted as the expected change in Y associated with a 1-unit increase in x1 while x2, ..., xk are held fixed.

Analogous interpretations hold for β2, ..., βk.

Thus, these coefficients are called partial or adjusted regression coefficients.

In contrast, the simple regression slope is called the marginal (or unadjusted) coefficient.
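The distinction is easy to see by comparing two fits (hypothetical names again): the marginal slope for x1 generally differs from its partial slope once x2 enters the model:

> coef(lm(y ~ x1, data = dat))       # marginal (unadjusted) slope for x1
> coef(lm(y ~ x1 + x2, data = dat))  # partial slope for x1, adjusted for x2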


Models with Interaction and Quadratic Predictors

For example, if an investigator has obtained observations on y, x1, and x2, one possible model is

Y = β0 + β1x1 + β2x2 + ∈.

However, other models can be constructed by forming predictors that are mathematical functions of x1 and/or x2.

For example, with x3 = x1^2 and x4 = x1x2, the model

Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ∈

also has the general MR form.


Models with Interaction and Quadratic Predictors

In general, it is not only permissible for some predictors to be mathematical functions of others but also often desirable, in the sense that the resulting model may be much more successful in explaining variation in y than any model without such predictors. This is still "linear regression", even though the relationship between outcome and predictors may not be linear.

For example, the model

Y = β0 + β1x + β2x^2 + ∈

is still MLR with k = 2, x1 = x, and x2 = x^2.
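In an R formula, such predictors can be built on the fly: I() protects arithmetic like squaring, and x1:x2 forms the product term (a sketch with assumed names):

> fit.quad = lm(y ~ x + I(x^2), data = dat)                 # quadratic model: still linear in the betas
> fit.full = lm(y ~ x1 + x2 + I(x1^2) + x1:x2, data = dat)  # x3 = x1^2, x4 = x1*x2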


Example -- travel time

The article "Estimating Urban Travel Times: A Comparative Study" (Trans. Res., 1980: 173–175) described a study relating the dependent variable y = travel time between locations in a certain city and the independent variable x2 = distance between locations.

Two types of vehicles, passenger cars and trucks, were used in the study.

Let x1 = 1 if the vehicle is a truck and x1 = 0 if it is a passenger car.


Example – travel time

One possible multiple regression model is

Y = β0 + β1x1 + β2x2 + ∈

The mean value of travel time depends on whether a vehicle is a car or a truck:

mean time = β0 + β2x2 when x1 = 0 (cars)

mean time = β0 + β1 + β2x2 when x1 = 1 (trucks)

cont’d


Example -- travel time

The coefficient β1 is the difference in mean times between trucks and cars with distance held fixed; if β1 > 0, then on average it will take trucks β1 longer than cars to traverse any particular distance.

A second possibility is a model with an interaction predictor:

Y = β0 + β1x1 + β2x2 + β3x1x2 + ∈

cont’d
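A sketch of fitting both models in R, assuming a data frame travel with columns time, distance, and a 0/1 indicator truck (all names hypothetical):

> fit.add = lm(time ~ truck + distance, data = travel)  # parallel lines: common slope
> fit.int = lm(time ~ truck * distance, data = travel)  # truck*distance expands to both main effects plus the interaction
> anova(fit.add, fit.int)                               # F test: does the interaction term matter?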


Example – travel time

Now the mean times for the two types of vehicles are

mean time = β0 + β2x2 when x1 = 0

mean time = β0 + β1 + (β2 + β3)x2 when x1 = 1

Does it make sense to have different intercepts for cars and trucks? (Think about what the intercept actually means.)

So, what would be a third model you could consider?

cont’d


Example – travel time

For each model, the graph of the mean time versus distance is a straight line for either type of vehicle, as illustrated in the figure.

Figure: regression functions for models with one dummy variable (x1) and one quantitative variable x2: (a) no interaction; (b) interaction.

cont’d