© 1998, Geoff Kuenning
Linear Regression Models
• What is a (good) model?
• Estimating model parameters
• Allocating variation
• Confidence intervals for regressions
• Verifying assumptions visually
Data Analysis Overview

[Figure: data-analysis flow. An experimental environment (prototype/real system, execution-driven simulation, trace-driven simulation, or stochastic simulation), driven by workload parameters, system configuration parameters, and factor levels, produces raw data; samples of the response y1, y2, ..., yn are collected at given predictor values x and factor levels.]

Regression models
• response var = f(predictors)
What Is a (Good) Model?
• For correlated data, model predicts response given an input
• Model should be equation that fits data
• Standard definition of “fits” is least-squares
  – Minimize squared error
  – While keeping mean error zero
  – Minimizes variance of errors
Least-Squared Error
• If \( \hat{y} = b_0 + b_1 x \), then the error in the estimate for \( x_i \) is \( e_i = y_i - \hat{y}_i \)
• Minimize the Sum of Squared Errors (SSE):
\[ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \]
• Subject to the constraint:
\[ \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0 \]
Estimating Model Parameters
• Best regression parameters are
\[ b_1 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \qquad b_0 = \bar{y} - b_1 \bar{x} \]
  where
\[ \bar{x} = \frac{1}{n}\sum x_i \quad\; \bar{y} = \frac{1}{n}\sum y_i \quad\; \sum xy = \sum x_i y_i \quad\; \sum x^2 = \sum x_i^2 \]
• Note the error in eq. 14.1 of the book!
Parameter Estimation Example

• Execution time of a script for various loop counts:

  Loops | 3    | 5    | 7    | 9    | 10
  Time  | 1.19 | 1.73 | 2.53 | 2.89 | 3.26

• \( \bar{x} = 6.8 \), \( \bar{y} = 2.32 \), \( \sum xy = 88.54 \), \( \sum x^2 = 264 \)
\[ b_1 = \frac{88.54 - 5(6.8)(2.32)}{264 - 5(6.8)^2} = 0.29 \]
• \( b_0 = 2.32 - (0.29)(6.8) = 0.35 \)
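The arithmetic above can be reproduced in a few lines. This is an illustrative Python sketch, not part of the original slides; note that the slides round b1 to 0.29 before computing b0 = 0.35, while unrounded values give b0 ≈ 0.32.

```python
# Least-squares fit of time = b0 + b1 * loops for the slide's data.
loops = [3, 5, 7, 9, 10]
times = [1.19, 1.73, 2.53, 2.89, 3.26]

n = len(loops)
x_bar = sum(loops) / n                              # 6.8
y_bar = sum(times) / n                              # 2.32
sum_xy = sum(x * y for x, y in zip(loops, times))   # 88.54
sum_x2 = sum(x * x for x in loops)                  # 264

b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

# b1 ≈ 0.2945 (slide rounds to 0.29); exact b0 ≈ 0.32,
# versus the slide's 0.35 computed from the rounded b1.
print(round(b1, 2), round(b0, 2))
```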
Graph of Parameter Estimation Example
[Figure: scatter plot of the five (loops, time) data points with the fitted line \( \hat{y} = 0.35 + 0.29x \); x axis from 3 to 10, y axis from 0 to 3]
Allocating Variation
• If no regression, best guess of y is \( \bar{y} \)
• Observed values of y differ from \( \bar{y} \), giving rise to errors (variance)
• Regression gives better guess, but there are still errors
• We can evaluate quality of regression by allocating sources of errors
The Total Sum of Squares
• Without regression, squared error is
\[
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
    = \sum_{i=1}^{n} y_i^2 - 2\bar{y}\sum_{i=1}^{n} y_i + n\bar{y}^2
    = \sum_{i=1}^{n} y_i^2 - 2n\bar{y}^2 + n\bar{y}^2
    = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2
    = SSY - SS0
\]
  where \( SSY = \sum y_i^2 \) and \( SS0 = n\bar{y}^2 \)
The Sum of Squares from Regression

• Recall that regression error is \( SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 \)
• Error without regression is SST
• So regression explains SSR = SST − SSE
• Regression quality measured by coefficient of determination:
\[ R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} \]
Evaluating Coefficient of Determination

• Compute \( SST = \sum y^2 - n\bar{y}^2 \)
• Compute \( SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy \)
• Compute \( R^2 = \dfrac{SST - SSE}{SST} \)
Example of Coefficient of Determination

• For previous regression example

  Loops | 3    | 5    | 7    | 9    | 10
  Time  | 1.19 | 1.73 | 2.53 | 2.89 | 3.26

  – \( \sum y = 11.60 \), \( \sum y^2 = 29.79 \), \( \sum xy = 88.54 \), \( n\bar{y}^2 = 5(2.32)^2 = 26.9 \)
  – SSE = 29.79 − (0.35)(11.60) − (0.29)(88.54) = 0.05
  – SST = 29.79 − 26.9 = 2.89
  – SSR = 2.89 − 0.05 = 2.84
  – R² = (2.89 − 0.05)/2.89 = 0.98
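As a check, an illustrative sketch (not from the slides) that reproduces this arithmetic using the slides' rounded coefficients b0 = 0.35 and b1 = 0.29:

```python
# Coefficient of determination for the loop-count regression.
x = [3, 5, 7, 9, 10]
y = [1.19, 1.73, 2.53, 2.89, 3.26]
n = len(y)
b0, b1 = 0.35, 0.29               # rounded coefficients, as on the slides

sum_y = sum(y)                              # 11.60
sum_y2 = sum(v * v for v in y)              # ≈ 29.79
sum_xy = sum(a * b for a, b in zip(x, y))   # ≈ 88.54
y_bar = sum_y / n                           # 2.32

sst = sum_y2 - n * y_bar ** 2               # ≈ 2.88
sse = sum_y2 - b0 * sum_y - b1 * sum_xy     # ≈ 0.05
r2 = (sst - sse) / sst                      # ≈ 0.98
print(round(sse, 2), round(sst, 2), round(r2, 2))
```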
Standard Deviation of Errors

• Variance of errors is SSE divided by degrees of freedom
  – DOF is n − 2 because we’ve calculated 2 regression parameters from the data
  – So variance (mean squared error, MSE) is SSE/(n − 2)
• Standard deviation of errors is square root:
\[ s_e = \sqrt{\frac{SSE}{n-2}} \]
Checking Degrees of Freedom

• Degrees of freedom always equate:
  – SS0 has 1 (computed from \( \bar{y} \))
  – SST has n − 1 (computed from data and \( \bar{y} \), which uses up 1)
  – SSE has n − 2 (needs 2 regression parameters)
  – So
\[ SST = SSY - SS0 = SSR + SSE \]
\[ n - 1 = n - 1 = 1 + (n - 2) \]
Example of Standard Deviation of Errors

• For our regression example, SSE was 0.05, so MSE is 0.05/3 = 0.017 and \( s_e = 0.13 \)
• Note high quality of our regression:
  – R² = 0.98
  – \( s_e = 0.13 \)
  – Why such a nice straight-line fit?
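The same numbers fall out of a two-line computation; an illustrative sketch:

```python
# Mean squared error and standard deviation of errors for the example:
# SSE = 0.05 with n = 5 points and 2 fitted parameters (DOF = n - 2 = 3).
sse = 0.05
n = 5
mse = sse / (n - 2)     # ≈ 0.017
se = mse ** 0.5         # ≈ 0.13
print(round(mse, 3), round(se, 2))
```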
Confidence Intervals for Regressions
• Regression is done from a single population sample (size n)
  – Different sample might give different results
  – True model is \( y = \beta_0 + \beta_1 x \)
  – Parameters \( b_0 \) and \( b_1 \) are really means taken from a population sample
Calculating Intervals for Regression Parameters

• Standard deviations of parameters:
\[ s_{b_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum x^2 - n\bar{x}^2}} \qquad s_{b_1} = \frac{s_e}{\sqrt{\sum x^2 - n\bar{x}^2}} \]
• Confidence intervals are \( b_i \mp t\, s_{b_i} \), where t has n − 2 degrees of freedom
Example of Regression Confidence Intervals

• Recall \( s_e = 0.13 \), n = 5, \( \sum x^2 = 264 \), \( \bar{x} = 6.8 \)
• So
\[ s_{b_0} = 0.13 \sqrt{\frac{1}{5} + \frac{(6.8)^2}{264 - 5(6.8)^2}} = 0.16 \]
\[ s_{b_1} = \frac{0.13}{\sqrt{264 - 5(6.8)^2}} = 0.023 \]
• Using a 90% confidence level, \( t_{0.95;3} = 2.353 \)
Regression Confidence Example, cont’d

• Thus, b0 interval is
\[ 0.35 \mp 2.353(0.16) = (-0.03, 0.73) \]
  – Not significant at 90%
• And b1 is
\[ 0.29 \mp 2.353(0.023) = (0.24, 0.34) \]
  – Significant at 90% (and would survive even a 99% test)
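An illustrative sketch (not from the slides) that recomputes the standard deviations and intervals; here \( s_{b_1} = 0.13/\sqrt{32.8} \approx 0.023 \), so the b1 interval comes out near (0.24, 0.34):

```python
# 90% confidence intervals for b0 and b1 in the loop-count example,
# using the slide's values: se = 0.13, n = 5, sum x^2 = 264, x_bar = 6.8.
se, n = 0.13, 5
sum_x2, x_bar = 264, 6.8
ssxx = sum_x2 - n * x_bar ** 2                   # 32.8

sb0 = se * (1 / n + x_bar ** 2 / ssxx) ** 0.5    # ≈ 0.16
sb1 = se / ssxx ** 0.5                           # ≈ 0.023

t = 2.353  # t quantile at the 0.95 level, 3 DOF
ci_b0 = (0.35 - t * sb0, 0.35 + t * sb0)   # contains 0: not significant
ci_b1 = (0.29 - t * sb1, 0.29 + t * sb1)   # excludes 0: significant
print(round(sb0, 2), round(sb1, 3), ci_b0, ci_b1)
```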
Confidence Intervals for Nonlinear Regressions

• For nonlinear fits using exponential transformations:
  – Confidence intervals apply to transformed parameters
  – Not valid to perform inverse transformation on intervals
Confidence Intervals for Predictions

• Previous confidence intervals are for parameters
  – How certain can we be that the parameters are correct?
• Purpose of regression is prediction
  – How accurate are the predictions?
  – Regression gives mean of predicted response, based on sample we took
Predicting m Samples
• Standard deviation for mean of future sample of m observations at \( x_p \) is
\[ s_{\hat{y}_{mp}} = s_e \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum x^2 - n\bar{x}^2}} \]
• Note deviation drops as \( m \to \infty \)
• Variance minimal at \( x_p = \bar{x} \)
• Use t-quantiles with n − 2 DOF for interval
Example of Confidence of Predictions

• Using previous equation, what is predicted time for a single run of 8 loops?
• Time = 0.35 + 0.29(8) = 2.67
• Standard deviation of errors \( s_e = 0.13 \):
\[ s_{\hat{y}_p} = 0.13 \sqrt{1 + \frac{1}{5} + \frac{(8 - 6.8)^2}{264 - 5(6.8)^2}} = 0.14 \]
• 90% interval is then
\[ 2.67 \mp 2.353(0.14) = (2.34, 3.00) \]
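An illustrative sketch (not from the slides) of the same prediction interval; the slides round the standard deviation to 0.14 before forming the interval, so the exact endpoints differ in the last digit:

```python
# 90% prediction interval for a single run (m = 1) at x_p = 8 loops.
se, n, m = 0.13, 5, 1
sum_x2, x_bar = 264, 6.8
b0, b1 = 0.35, 0.29
xp = 8

y_hat = b0 + b1 * xp                       # 2.67
ssxx = sum_x2 - n * x_bar ** 2             # 32.8
s_pred = se * (1 / m + 1 / n + (xp - x_bar) ** 2 / ssxx) ** 0.5   # ≈ 0.14

t = 2.353  # t quantile at the 0.95 level, 3 DOF
interval = (y_hat - t * s_pred, y_hat + t * s_pred)   # ≈ (2.33, 3.01)
print(round(y_hat, 2), round(s_pred, 2), interval)
```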
Verifying Assumptions Visually
• Regressions are based on assumptions:
  – Linear relationship between response y and predictor x
    • Or nonlinear relationship used in fitting
  – Predictor x nonstochastic and error-free
  – Model errors statistically independent
    • With distribution N(0,c) for constant c
• If assumptions are violated, the model is misleading or invalid
Testing Linearity
• Scatter plot x vs. y to see basic curve type
[Figure: four example scatter plots labeled Linear, Piecewise Linear, Outlier, and Nonlinear (Power)]
Testing Independence of Errors

• Scatter-plot residuals \( e_i \) versus predicted values \( \hat{y}_i \)
• Should be no visible trend
• Example from our curve fit:

[Figure: residuals (about −0.1 to 0.15) vs. predicted values (0 to 4) for the loop-count regression; no visible trend]
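The points behind this plot can be regenerated numerically; an illustrative sketch using the rounded coefficients b0 = 0.35, b1 = 0.29 from the earlier example:

```python
# Residuals e_i = y_i - (b0 + b1 * x_i) versus predicted values,
# using the slide's rounded coefficients.
x = [3, 5, 7, 9, 10]
y = [1.19, 1.73, 2.53, 2.89, 3.26]
b0, b1 = 0.35, 0.29

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for p, e in zip(predicted, residuals):
    print(f"y_hat = {p:.2f}  residual = {e:+.2f}")
# The residuals fall between about -0.1 and 0.15 with no visible trend,
# and they sum to roughly zero (exactly zero with unrounded coefficients).
```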
More on Testing Independence
• May be useful to plot error residuals versus experiment number
  – In previous example, this gives same plot except for x scaling
• No foolproof tests
Testing for Normal Errors
• Prepare quantile-quantile plot
• Example for our regression:

[Figure: quantile-quantile plot of the residuals (−0.15 to 0.15) against normal quantiles (−1.3 to 1.3); points lie roughly on a straight line]
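The normal quantiles on the plot's x axis can be computed with the standard library; an illustrative sketch, where the residual values are the ones derived earlier from the rounded coefficients:

```python
# Pair each sorted residual with a standard-normal quantile, as in a Q-Q plot.
from statistics import NormalDist

residuals = sorted([-0.03, -0.07, 0.15, -0.07, 0.01])
n = len(residuals)

# Quantile for the (i - 0.5)/n point of each of the n order statistics;
# for n = 5 these run from about -1.28 to 1.28, matching the plot's axis.
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for q, r in zip(quantiles, residuals):
    print(f"normal quantile = {q:+.2f}  residual = {r:+.2f}")
```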
Testing for Constant Standard Deviation
• Tongue-twister: homoscedasticity
• Return to independence plot
• Look for trend in spread
• Example:

[Figure: residuals (−0.1 to 0.15) vs. predicted values (0 to 4) again; no trend in the spread]
Linear Regression Can Be Misleading

• Regression throws away some information about the data
  – To allow more compact summarization
• Sometimes vital characteristics are thrown away
  – Often, looking at data plots can tell you whether you will have a problem
Example of Misleading Regression
      I           II           III           IV
  x     y     x     y      x     y      x     y
 10   8.04   10   9.14   10   7.46    8   6.58
  8   6.95    8   8.14    8   6.77    8   5.76
 13   7.58   13   8.74   13  12.74    8   7.71
  9   8.81    9   8.77    9   7.11    8   8.84
 11   8.33   11   9.26   11   7.81    8   8.47
 14   9.96   14   8.10   14   8.84    8   7.04
  6   7.24    6   6.13    6   6.08    8   5.25
  4   4.26    4   3.10    4   5.39   19  12.50
 12  10.84   12   9.13   12   8.15    8   5.56
  7   4.82    7   7.26    7   6.42    8   7.91
  5   5.68    5   4.74    5   5.73    8   6.89
What Does Regression Tell Us About These Data Sets?
• Exactly the same thing for each!
• N = 11
• Mean of y = 7.5
• Y = 3 + 0.5X
• Standard error of the slope estimate is 0.118
• All the sums of squares are the same
• Correlation coefficient = 0.82
• R² = 0.67
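These identical summary statistics can be verified directly; an illustrative sketch (not from the slides) that fits each of the four data sets:

```python
# Fit y = b0 + b1*x to each of Anscombe's four data sets and show that
# the summary statistics are (nearly) identical across all four.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
datasets = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = []
for x, y in datasets:
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xb * yb
    sxx = sum(a * a for a in x) - n * xb ** 2
    b1 = sxy / sxx
    b0 = yb - b1 * xb
    sst = sum((v - yb) ** 2 for v in y)
    sse = sum((v - b0 - b1 * u) ** 2 for u, v in zip(x, y))
    r2 = (sst - sse) / sst
    results.append((yb, b1, b0, r2))

for yb, b1, b0, r2 in results:
    # every set: mean y ≈ 7.5, slope ≈ 0.5, intercept ≈ 3, R² ≈ 0.67
    print(f"mean y = {yb:.3f}  slope = {b1:.3f}  intercept = {b0:.3f}  R^2 = {r2:.3f}")
```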
Now Look at the Data Plots
[Figure: scatter plots of data sets I–IV, each with the same fitted line Y = 3 + 0.5X. I is a well-behaved linear scatter; II follows a smooth nonlinear curve; III is linear except for one outlier; IV has all points at x = 8 except one at x = 19.]