Simple Linear Regression
In the previous lectures, we focused on a single random variable. In many applications, however, we work with a pair of variables, for example the distance traveled and the time spent driving, or a person's age and height. Generally, there are two types of relationship between a pair of variables: a deterministic relationship and a probabilistic relationship.
Deterministic relationship
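$s = s_0 + vt$
where s is the distance traveled, s_0 is the initial distance, v is the speed, and t is the time traveled.
[Figure: distance plotted against time is a straight line with slope v and intercept s_0.]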
Probabilistic Relationship
On many occasions we face a different situation: one variable is related to another, but not exactly, as in the following.
[Figure: scatter plot of height against age.]
Here we cannot predict a person's height from their age with certainty, as we could predict distance from time with $s = s_0 + vt$.
Linear Regression
Statistically, a relationship between two variables like the one shown above is characterized with a linear model:
$y = a + bx + \varepsilon$
Here x is the independent variable, y is the dependent variable, $\varepsilon$ is the error term, a is the intercept, and b is the slope.
[Figure: a line with intercept a and slope b in the x-y plane; the error is the vertical distance from a data point to the line.]
Least Squares Line
Given pairs of data for the independent and dependent variables, we can draw many lines through the scattered points.
[Figure: scatter plot of y against x with candidate lines drawn through the points.]
The least squares line is the line through the points that minimizes the vertical distances between the points and the line. More precisely, it minimizes the sum of the squared error terms $\varepsilon$.
Least Squares Method
For notational convenience, the line fitted through the points is written as
$\hat{y} = a + bx$
whereas the linear model we wrote before is
$y = a + bx + \varepsilon$
If we use the value on the line, ŷ, to estimate y, the difference is (y - ŷ). The difference is positive for points above the line and negative for points below it.
[Figure: fitted line ŷ = a + bx with a data point y; the vertical gap between the point and the line is (y - ŷ).]
If we simply add these differences up, the positive and negative values cancel. Therefore we square each difference and sum the squares. This sum is called the Error Sum of Squares (SSE):
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The constants a and b are estimated so that the error sum of squares is minimized, hence the name least squares.
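As a rough illustration of what "minimizing SSE" means, the sketch below computes SSE for three candidate lines on the food-spending data from the example later in this lecture. The helper name `sse` and the two alternative lines are hypothetical choices made only for this comparison; the first line uses the intercept and slope reported in that example's summary output.

```python
# Food-spending data from the worked example later in this lecture
# (x = income in thousand $, y = $ spent on food).
x = [6.5, 4.0, 2.5, 7.2, 8.1, 3.4, 5.5]
y = [81, 96, 93, 68, 63, 84, 71]


def sse(a, b, xs, ys):
    """Error Sum of Squares for the line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


# Intercept and slope as reported in the example's summary output.
a_fit, b_fit = 107.1008, -5.20715

# Two hypothetical alternative lines, chosen only for comparison.
candidates = {
    "fitted line": (a_fit, b_fit),
    "flatter line": (100.0, -4.0),
    "steeper line": (115.0, -6.5),
}

results = {name: sse(a, b, x, y) for name, (a, b) in candidates.items()}
for name, value in results.items():
    print(f"{name}: SSE = {value:.3f}")
```

Both alternative lines pass through the cloud of points, yet each gives a larger SSE than the fitted line, which is the sense in which the fitted line is "least squares".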
Estimating Regression Coefficients
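Minimizing SSE with respect to the regression coefficients a and b gives the following solutions:
$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$a = \bar{y} - b\bar{x}$
where x_i is the ith value of the independent variable, y_i is the value of the dependent variable corresponding to x_i, and x_bar and y_bar (written $\bar{x}$ and $\bar{y}$ in the formulas) are the mean values of x and y.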
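The two formulas translate directly into a few lines of code. The following is a minimal sketch in plain Python, applied to the food-spending data from the example later in this lecture; the function name `fit_line` and the intermediate names `s_xy` and `s_xx` are assumptions introduced here for illustration.

```python
# Food-spending data from the worked example later in this lecture.
x = [6.5, 4.0, 2.5, 7.2, 8.1, 3.4, 5.5]
y = [81, 96, 93, 68, 63, 84, 71]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(x, y)
print(f"slope b = {b:.4f}, intercept a = {a:.3f}")
```

The printed values should agree with the slope and intercept reported in the example (about -5.2071 and 107.101).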
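Interpretation of a and b
The constant b is the slope: it gives the change in y (the dependent variable) for a one-unit change in x (the independent variable). If b > 0, x and y are positively correlated, meaning y increases as x increases, and vice versa. If b < 0, x and y are negatively correlated, meaning y decreases as x increases. The constant a is the intercept, the value of ŷ at x = 0.
[Figure: two panels, each showing a line with intercept a in the x-y plane, one for b > 0 and one for b < 0.]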
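As a small worked check of the slope interpretation, compare the fitted line ŷ = a + bx at x and at x + 1:

```latex
\hat{y}(x + 1) - \hat{y}(x) = \bigl(a + b(x + 1)\bigr) - (a + bx) = b
```

With the slope from the example later in this lecture (b of about -5.2071, x in thousand $), each additional thousand dollars of income is associated with about $5.21 less spent on food in that sample.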
Correlation Coefficient
Although we now have a regression line describing the relationship between the dependent and independent variables, the line alone does not tell us how strong that relationship is. Consider the situations in the following graphs.
[Figure: two scatter plots of y against x with fitted lines, panels (a) and (b).]
The relationship between x and y in (a) is clearly stronger than in (b), even though the line in (b) is also the best-fit line. The statistic that characterizes the strength of the relationship is the correlation coefficient R, or its square R².
How Is R² Calculated?
[Figure: a data point y, its fitted value ŷ on the regression line, and the mean y_bar, with the vertical distances between them marked.]
$(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$
If we use y_bar to represent y, the error is (y - y_bar). However, we use ŷ to represent y, so the error is reduced to (y - ŷ). Thus (ŷ - y_bar) is the improvement. This is true for all points in the graph. To account for the total improvement, we would sum the improvements (ŷ - y_bar) over all points. Here we face the same situation as when calculating variance: positive and negative values cancel. So we square each difference and sum the squared differences over all points.
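To make the decomposition concrete, here is the arithmetic for one point from the example data later in this lecture (x = 8.1, y = 63), using the mean y_bar = 79.4286 and the fitted value ŷ = 64.9229 from that example's table, rounded to four decimals:

```latex
\begin{aligned}
y - \bar{y} &= 63 - 79.4286 = -16.4286 \\
\hat{y} - \bar{y} &= 64.9229 - 79.4286 = -14.5057 \\
y - \hat{y} &= 63 - 64.9229 = -1.9229 \\
(\hat{y} - \bar{y}) + (y - \hat{y}) &= -14.5057 + (-1.9229) = -16.4286
\end{aligned}
```

For this point, most of the deviation from the mean is captured by the regression line, and only a small residual remains.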
R Square
$R^2 = \frac{SSR}{SST}$
where the Regression Sum of Squares is
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
and the Total Sum of Squares is
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
We already calculated SSE (the Error Sum of Squares) while estimating a and b. In fact, the following relationship holds:
$SST = SSR + SSE$
R square indicates the proportion of the variance in y that is explained by the regression.
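The three sums of squares and R square can be computed in a few lines. The sketch below uses plain Python on the food-spending data from the example later in this lecture; the variable names are assumptions chosen for illustration, and the coefficients are recomputed from the least squares formulas rather than taken from rounded values.

```python
# Food-spending data from the worked example later in this lecture.
x = [6.5, 4.0, 2.5, 7.2, 8.1, 3.4, 5.5]
y = [81, 96, 93, 68, 63, 84, 71]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error

r_squared = ssr / sst

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
print(f"R square = {r_squared:.5f}")
```

Up to floating-point rounding, SSR + SSE reproduces SST, and R square comes out at about 0.741, in line with the example.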
A Simple Linear Regression Example
The following survey data show how much a family spends on food in relation to household income (x = income in thousand $, y = $ spent on food).
      x        y        x-x_bar   y-y_bar   (x-x_bar)(y-y_bar)  (x-x_bar)^2  y_hat      (y-y_bar)^2  (y_hat-y_bar)^2  (y-y_hat)^2
      6.5      81       1.185714  1.571429  1.863265306         1.40591837   73.254325  2.46938776   38.12130132      59.99548121
      4        96       -1.31429  16.57143  -21.77959184        1.72734694   86.2722    274.612245   46.83527158      94.63009284
      2.5      93       -2.81429  13.57143  -38.19387755        7.92020408   94.082925  184.183673   214.7501205      1.172726556
      7.2      68       1.885714  -11.4286  -21.55102041        3.55591837   69.60932   130.612245   96.41767056      2.589910862
      8.1      63       2.785714  -16.4286  -45.76530612        7.76020408   64.922885  269.897959   210.4148973      3.697486723
      3.4      84       -1.91429  4.571429  -8.751020408        3.6644898    89.39649   20.8979592   99.35942913      29.12210432
      5.5      71       0.185714  -8.42857  -1.565306122        0.0344898    78.461475  71.0408163   0.935272739      55.67360918
sum   37.2     556                          -135.7428571        26.0685714              953.714286   706.8339631      246.8814117
mean  5.31429  79.4286
slope      -5.2071
intercept  107.101
SST        953.714
SSR        706.834
SSE        246.881
SSR+SSE    953.715
R-square   0.74114
x y
6.5 81
4 96
2.5 93
7.2 68
8.1 63
3.4 84
5.5 71
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.860893
R Square 0.741137
Adjusted R Square 0.689364
Standard Error 7.026826
Observations 7
ANOVA
df SS MS F Significance F
Regression 1 706.8329 706.8328741 14.31523 0.012843
Residual 5 246.8814 49.37628233
Total 6 953.7143
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 107.1008 7.781132 13.76417115 3.63E-05 87.0988 127.1029 87.0988 127.1029
X Variable 1 -5.20715 1.37626 -3.783547373 0.012843 -8.74494 -1.66936 -8.74494 -1.66936
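Several entries in the summary output follow directly from the sums of squares. The sketch below recomputes them in plain Python, assuming the spreadsheet uses the standard textbook formulas for simple linear regression with n - 2 residual degrees of freedom; that assumption and all variable names are introduced here, not stated in the lecture. P-values and confidence limits need a t or F distribution and are left out.

```python
import math

x = [6.5, 4.0, 2.5, 7.2, 8.1, 3.4, 5.5]
y = [81, 96, 93, 68, 63, 84, 71]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # "X Variable 1" coefficient
a = y_bar - b * x_bar    # "Intercept" coefficient
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ssr / sst
multiple_r = math.sqrt(r_squared)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

mse = sse / (n - 2)            # residual mean square, df = n - 2
std_error = math.sqrt(mse)     # "Standard Error" of the regression
f_stat = (ssr / 1) / mse       # regression df = 1

# Standard errors and t statistics of the coefficients.
se_b = std_error / math.sqrt(s_xx)
se_a = std_error * math.sqrt(1 / n + x_bar ** 2 / s_xx)
t_b = b / se_b
t_a = a / se_a

print(f"Multiple R = {multiple_r:.6f}, Adjusted R Square = {adj_r_squared:.6f}")
print(f"Standard Error = {std_error:.6f}, F = {f_stat:.5f}")
print(f"slope: se = {se_b:.5f}, t = {t_b:.5f}")
print(f"intercept: se = {se_a:.5f}, t = {t_a:.5f}")
```

Under these assumptions the printed values should match the corresponding entries of the summary output to the precision shown there.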