Ms. Khatijahhusna Abd Rani, School of Electrical System Engineering, Sem II 2014/2015
TRANSCRIPT
SIMPLE LINEAR REGRESSION
• Regression analysis explores the relationship between a quantitative response variable and one or more explanatory variables.
• 1 explanatory/independent variable: simple linear regression (SLR)
• >1 explanatory/independent variable: multiple linear regression (MLR)
• 3 major objectives:
i. Description, e.g. to describe the effect of income on expenditure
ii. Control, e.g. to increase the export of rubber by controlling other factors such as price
iii. Prediction, e.g. to predict the price of houses based on lot size & location
1) A nutritionist studying weight loss programs might want to find out if reducing intake of carbohydrate can help a person reduce weight.
a) X is the carbohydrate intake (independent variable).
b) Y is the weight (dependent variable).
2) An entrepreneur might want to know whether increasing the cost of packaging his new product will have an effect on the sales volume.
a) X is the cost of packaging
b) Y is the sales volume
EXAMPLE 4.1
FIRST: PLOT YOUR DATA!
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable X and the dependent variable Y.
• Can we use a known value of temperature (X) to help predict the number of pairs of gloves (Y)?
THE QUESTION IS…
FITTING A LINE
USING THIS LINE FOR PREDICTION
1. Is it a good-fitting line?
2. Is a line a reasonable summary of the relationship between the variables?
Negative relationship: as the number of absences increases, the final grade decreases.
Positive relationship: as the number of cars rented increases, revenue tends to increase.
• Linear regression: we assume a linear relationship between X and Y.
E(Y|X) = β₀ + β₁X
where E(Y|X) is the expectation of Y for a given value of X, β₀ is the intercept, and β₁ is the slope.
ŷ = β̂₀ + β̂₁x
How are we going to estimate the two parameter values?
We usually use the method of least squares to estimate β̂₀ and β̂₁.
• Recall the assumed relationship between Y and X:
Yᵢ = β₀ + β₁Xᵢ + εᵢ
We use data to find the estimated regression line:
ŷ = β̂₀ + β̂₁x
How are we going to choose β̂₀ and β̂₁ wisely, so that we have a good regression line?
• β̂₀ and β̂₁ are chosen to minimize the sum of the squared residuals:
Σeᵢ² = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − β̂₀ − β̂₁Xᵢ)²
This is called the method of least squares.
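The least-squares estimates have the closed form β̂₁ = Sxy/Sxx and β̂₀ = ȳ − β̂₁x̄. A minimal Python sketch (the data points below are made up purely for illustration):

```python
# Least-squares estimates of intercept b0 and slope b1 for yhat = b0 + b1*x.

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals e_i = y_i - yhat_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xy = sum((x_i - x_bar)(y_i - y_bar)),  S_xx = sum((x_i - x_bar)^2)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# Made-up example data, not from the slides:
b0, b1 = least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```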
Assumptions About the Error Term
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.
• Solution:
Given n = 8, Σx = 598, Σy = 283, Σx² = 44824, Σy² = 10049, Σxy = 21091:
Sxx = Σx² − (Σx)²/n = 44824 − (598)²/8 = 123.5
Syy = Σy² − (Σy)²/n = 10049 − (283)²/8 = 37.875
Sxy = Σxy − (Σx)(Σy)/n = 21091 − (598)(283)/8 = −63.25
β̂₁ = Sxy/Sxx = −63.25/123.5 = −0.5121
β̂₀ = ȳ − β̂₁x̄ = 35.375 − (−0.5121)(74.75) = 73.6545
ŷ = 73.6545 − 0.5121x
ŷ = 73.6545 − 0.5121x
When X increases by 1 unit, Y decreases by 0.5121 units.
At x = 74: ŷ = 73.6545 − 0.5121(74) ≈ 36 pairs of gloves
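As a cross-check, Example 4.2 can be reproduced from the summary sums printed on the slide (a sketch; the variable names are mine):

```python
# Example 4.2: n = 8, sum x = 598, sum y = 283,
# sum x^2 = 44824, sum y^2 = 10049, sum xy = 21091.
n = 8
sx, sy = 598, 283
sxx_raw, syy_raw, sxy_raw = 44824, 10049, 21091

S_xx = sxx_raw - sx**2 / n      # 123.5
S_yy = syy_raw - sy**2 / n      # 37.875
S_xy = sxy_raw - sx * sy / n    # -63.25

b1 = S_xy / S_xx                # slope, approx -0.5121
b0 = sy / n - b1 * sx / n       # intercept, approx 73.65

y_hat_74 = b0 + b1 * 74         # predicted pairs of gloves at x = 74, approx 36
```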
• The coefficient of determination is a measure of the variation of the dependent variable (Y) that is explained by the regression line and the independent variable (X).
• The symbol for the coefficient of determination is r2 or R2
0 ≤ r² ≤ 1
COEFFICIENT OF DETERMINATION (R2)
• If r =0.90, then r2 =0.81. It means that 81% of the variation in the dependent variable (Y) is accounted for by the variations in the independent variable (X).
• The rest of the variation, 0.19 or 19%, is unexplained and called the coefficient of nondetermination.
• The formula for the coefficient of nondetermination is 1.00 − r².
• Relationship among SST, SSR, SSE:
SST = SSR + SSE
Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
where:
SST = total sum of squares = Σ(yᵢ − ȳ)²
SSR = sum of squares due to regression = Σ(ŷᵢ − ȳ)²
SSE = sum of squares due to error = Σ(yᵢ − ŷᵢ)²
The coefficient of determination is:
r² = SSR/SST = explained variation / total variation
which can also be computed as r² = Sxy²/(Sxx·Syy).
COEFFICIENT OF DETERMINATION (R2)
Refer Example 4.2:
r² = Sxy²/(Sxx·Syy) = (−63.25)²/((123.5)(37.875)) = 0.855
It means that 85.5% of the variation in the dependent variable (Y: number of pairs of gloves) is explained by the variations in the independent variable (X:temperature).
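This r² value can be checked directly from the S-values computed earlier (a quick sketch):

```python
# Coefficient of determination for Example 4.2 from the S-values.
S_xx, S_yy, S_xy = 123.5, 37.875, -63.25

r_squared = S_xy**2 / (S_xx * S_yy)       # approx 0.855
coef_nondetermination = 1.0 - r_squared   # approx 0.145, the unexplained share
```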
• Correlation measures the strength of a linear relationship between the two variables.
• Also known as Pearson’s product moment coefficient of correlation.
• The symbol for the sample coefficient of correlation is r; the population coefficient of correlation is ρ.
COEFFICIENT OF CORRELATION (r)
• Formula:
r = Sxy / √(Sxx·Syy)
Properties of r:
• −1 ≤ r ≤ 1
• Values of r close to 1 imply a strong positive linear relationship between x and y.
• Values of r close to −1 imply a strong negative linear relationship between x and y.
• Values of r close to 0 imply little or no linear relationship between x and y.
Refer Example 4.2: Number of pairs of gloves
Solution:
r = Sxy / √(Sxx·Syy) = −63.25 / √((123.5)(37.875)) = −0.92
or, since r² = 0.855,
r = −√0.855 = −0.92
Next, refer to the equation ŷ = 73.6545 − 0.5121x: the relationship is negative, since the sign of β̂₁ is negative.
Thus, there is a strong negative linear relationship between temperature (x) and the number of pairs of gloves (y).
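A sketch of both routes to r, showing that the sign is taken from the slope:

```python
import math

# Coefficient of correlation for Example 4.2.
S_xx, S_yy, S_xy = 123.5, 37.875, -63.25

r = S_xy / math.sqrt(S_xx * S_yy)   # approx -0.92; sign comes from S_xy

# Equivalent route: magnitude from r^2, sign from the slope b1 = S_xy / S_xx.
r_squared = 0.855
sign = -1.0 if S_xy / S_xx < 0 else 1.0
r_from_r2 = sign * math.sqrt(r_squared)
```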
(d)
• To determine whether X provides information in predicting Y, we proceed with testing the hypothesis.
TEST OF SIGNIFICANCE
Two tests are commonly used:
• F-test
• t-test
t-test
1. Hypotheses:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)
2. Significance level, α
3. Rejection region
Critical-value approach: reject H₀ if t_test > t_{α/2, n−2} or t_test < −t_{α/2, n−2}
P-value approach: reject H₀ if p-value < α
4. Test statistic:
t_test = β̂₁ / √(var(β̂₁)), where var(β̂₁) = (Syy − β̂₁Sxy) / ((n − 2)·Sxx)
5. Decision rule: reject H₀ if t_test > t_{α/2, n−2} or t_test < −t_{α/2, n−2}, or if p-value < α
6. Conclusion:
There is a significant relationship between variable X and Y.
Refer Example 4.2
1. Hypotheses:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)
2. Significance level, α = 0.05
3. Rejection region: reject H₀ if t_test > t_{0.025, 6} = 2.447 or t_test < −2.447
(e)
4. Test statistic:
var(β̂₁) = (Syy − β̂₁Sxy) / ((n − 2)·Sxx) = (37.875 − (−0.5121)(−63.25)) / (6 × 123.5) = 0.0074
t_test = β̂₁ / √(var(β̂₁)) = −0.5121 / √0.0074 = −5.953
5. Decision rule: Since |t_test| = 5.953 > 2.447 = t_{0.025, 6}, we reject H₀.
6. Conclusion: We conclude that the temperature is linearly related to the number of pairs of gloves produced.
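The t-test computation above can be sketched in a few lines, reusing the S-values from Example 4.2 (the critical value 2.447 is taken from the slides' t-table):

```python
import math

# t-test of H0: beta1 = 0 for Example 4.2.
n = 8
S_xx, S_yy, S_xy = 123.5, 37.875, -63.25
b1 = S_xy / S_xx                                  # approx -0.5121

# var(b1) = (S_yy - b1*S_xy) / ((n - 2) * S_xx)
var_b1 = (S_yy - b1 * S_xy) / ((n - 2) * S_xx)    # approx 0.0074
t_test = b1 / math.sqrt(var_b1)                   # approx -5.95

t_crit = 2.447                                    # t_{0.025, 6} from the t-table
reject_h0 = abs(t_test) > t_crit                  # True, so reject H0
```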
• We may also use the analysis of variance approach to test significance of regression.
• The ANOVA approach involves the partitioning of total variability in the response variable Y.
• SST (total sum of squares).
• If SST=0, all observations are the same.
• The greater SST is, the greater the variation among the Y observations.
ANOVA
• SSE (error sum of squares).
• If SSE = 0, all observations fall on the fitted regression line.
• The larger the SSE, the greater is the variation of the Y observations around the regression line.
ANOVA
• SSR (regression sum of squares)
• SSR is a measure of the variability of the Y's associated with the regression line.
• The larger SSR is relative to SST, the greater the effect of the regression relation in accounting for the total variation in the Y observations.
ANOVA
ANOVA (F-TEST)
1. Hypotheses:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)
2. Significance level, α = 0.05
3. Rejection region (critical-value approach): reject H₀ if F_test > f_{0.05, 1, 6} = 5.987
To calculate MSR and MSE, first compute the regression sum of squares (SSR) and the error sum of squares (SSE).
4. Test statistic:
SST = Syy = 37.875
SSR = β̂₁Sxy = (−0.5121)(−63.25) = 32.393
SSE = SST − SSR = 37.875 − 32.393 = 5.482
MSR = SSR/1 = 32.393 (1 d.f. for regression in SLR)
MSE = SSE/(n − 2) = 5.482/6 = 0.9136
F_test = MSR/MSE = 32.393/0.9136 = 35.46
5. Decision rule: Since F_test = 35.46 > 5.987 = f_{0.05, 1, 6}, we reject H₀.
6. Conclusion: We can conclude that there is a significant relationship between variable X and Y. Alternatively, we conclude that the regression model is significant.
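The ANOVA partition and F statistic for Example 4.2 can be sketched as follows (note that F_test equals t_test² from the earlier t-test, as expected in simple linear regression):

```python
# ANOVA F-test of H0: beta1 = 0 for Example 4.2.
n = 8
S_xx, S_yy, S_xy = 123.5, 37.875, -63.25
b1 = S_xy / S_xx              # approx -0.5121

SST = S_yy                    # total sum of squares, approx 37.875
SSR = b1 * S_xy               # regression sum of squares, approx 32.39
SSE = SST - SSR               # error sum of squares, approx 5.48

MSR = SSR / 1                 # 1 regression d.f. in simple linear regression
MSE = SSE / (n - 2)           # approx 0.914
F_test = MSR / MSE            # approx 35.46

f_crit = 5.987                # f_{0.05, 1, 6} from the F-table
reject_h0 = F_test > f_crit   # True, so reject H0
```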