research statistics
DESCRIPTION
Examining Relationship of Variables. RESEARCH STATISTICS. Jobayer Hossain Larry Holmes, Jr. November 6, 2008. Examining Relationship of Variables. Response (dependent) variable - measures the outcome of a study. - PowerPoint PPT PresentationTRANSCRIPT
RESEARCH STATISTICSRESEARCH STATISTICS
Jobayer Hossain
Larry Holmes, Jr
November 6, 2008
Examining Relationship of Variables
Examining Relationship of Examining Relationship of VariablesVariables
Response (dependent) variable - measures the outcome of a study.
Explanatory (Independent) variable - explains or influences the changes in a response variable
Outlier - an observation that falls outside the overall pattern of the relationship.
Positive Association - An increase in an independent variable is associated with an increase in a dependent variable.
Negative Association - An increase in an independent variable is associated with a decrease in a dependent variable.
ScatterplotScatterplot
Shows relationship
between two variables (age
and height in this case).
Reveals form, direction,
and strength of the
relationship.
Scatterplot - DemoScatterplot - Demo
MS-Excel: – Step 1: Select Chart wizard– Step 2: From the chart type select XY(Scatter)– Step 3: Select Chart from chart sub type and click next– Step 4: Data range: range for variable in the x axis (age in our
case), range for the variable in the y axis (hgt in our case) and click on next
– Step 5: Click on titles to write title and labels of the chart , click on axes and mark on Value (X) and Value (Y) axis , click on Gridlines and take all the mark off( if you want no gridlines), click on legend and put options of legend (if you don’t want legend) then click on Finish
Scatterplot - DemoScatterplot - Demo
SPSS:
– Step 1: Click on Graphs and select Scatter/Dot
– Step 2: Select simple scatter
– Step 3: Select variables for Y axis (usually response variable)
and X axis (explanatory variable) and click on Titles to write
titles in Line 1. After writing title click on ‘Continue’.
– Step 4: Click on ‘Ok’.
ScatterplotsScatterplots
60 80 100 120 140
4045
5055
Scatterplot of Age vs Height
Age
Hei
ght
60 80 100 120 140
4045
5055
Scatterplot of varibles X and Y
X
Y
6 8 10 12 14 16 18
3050
7090
Scaterplot of variables X1 and Y1
X1
Y1
Strong positive association
Points are scattered with a poor association
Strong negative association
Correlation of two variablesCorrelation of two variables
Correlation measures the direction and strength of the relationship between two quantitative variables. Suppose that we have data on variables x and y for n individuals. Then the correlation coefficient r between x and y is defined as,
Where, sx and sy are the standard deviations of x and y.
r is always a number between -1 and 1. Values of r near 0 indicate little or no linear relationship. Values of r near -1 or 1 indicate a very strong linear relationship.
The extreme values r=1 or r=-1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.
y
in
i x
i
s
yy
s
xx
nr
11
1
Correlation of two variablesCorrelation of two variables
Positive r indicates positive association i.e. association between
two variables in the same direction, and negative r indicates
negative association.
Scatterplot of Height and Age shows that these two variables
possess a strong, positive linear relationship. The correlation
coefficient of these two variables is 0.9829632, which is very
close to 1.
Correlation of two variablesCorrelation of two variables
60 80 100 120 140
4045
5055
Scatterplot of Age vs Height
Age
Hei
ght
60 80 100 120 140
4045
5055
Scatterplot of varibles X and Y
X
Y
6 8 10 12 14 16 18
3050
7090
Scaterplot of variables X1 and Y1
X1
Y1
r=0.02
r=-0.96r=0.98
Correlation of two variablesCorrelation of two variables
MS-Excel:
– Step 1: Click on function Wizard fx
– Step 2: Select function Correl
– Step 3: Input range Array 1 and Array 2 (for our example age and hgt) and click
on Ok.
Another option:
Check if the data of two variables are in a side by side position. If not, copy and paste
them side by side in a new worksheet and follow the following steps-
– Step 1: Select Data Analysis from the menu Tools
– Step 2: Select Correlation and type or select Input and Output range and click on
ok.
Correlation of two variablesCorrelation of two variables
SPSS: – Step 1: Select Correlate from the menu Analyze– Step2: Click on Bivariate and select the variables
(age and hgt) and click on Ok
Example 1: Calculating Correlation Coefficient Example 1: Calculating Correlation Coefficient of the variables Height and Age in our data of the variables Height and Age in our data
set1.set1. MS-Excel: Click on function Wizard fx > Correl > Input range Array 1: age Input
range Array 1: hgt > ok. It gives correlation coefficient, r=.98. or Tools > data analysis > correlation > range for variables age and hgt > ok. It gives you the following results.
age hgt
age 1
hgt 0.982 1
Example 1: Calculating Correlation Coefficient Example 1: Calculating Correlation Coefficient of the variables Height and Age in our data set1of the variables Height and Age in our data set1
SPSS: Analyze > Correlate > Bivariate > select age and hgt > ok. It gives the following result
age hgt
age Pearson Correlation
sig (- tailed)
N
1
60
.983
.000
60
Hgt Pearson Correlation
sig (- tailed)
N
.983
000
60
1
60
Strong Association but no CorrelationStrong Association but no Correlation
20 30 40 50 60
20
25
30
35
Scatter Plot
Speed
MP
GGas mileage of an auto mobile first increases than decreases as the speed increases like the following data: Speed 20 30 40 50 60
MPG 24 28 30 28 24
Scatter plot shows an strong association. But calculated, r = 0, why?
r = 0 ?It’s because the relationship is not linear and r measures the linear relationship between two variables.
Influence of an outlierInfluence of an outlier
Consider the following data set of two variables X and Y:
r = -0.237 After dropping the last pair,
r = 0.996
X 20 30 40 50 60 80 Y 24 28 30 34 37 15
X 20 30 40 50 60 Y 24 28 30 34 37
Simple Linear RegressionSimple Linear Regression
Regression refers to the value of a response variable as a function of the value of an explanatory variable.
A regression model is a function that describes the relationship between response and explanatory variables.
A simple linear regression has one explanatory variable and the regression line is straight.
The linear relationship of variable Y and X can be written as in the following regression model form
Y= b0 + b1X + e
where, ‘Y’ is the response variable, ‘X’ is the explanatory variable, ‘e’ is the residual (error), and b0 and b1 are two parameters. Basically, bo is the intercept and
b1 is the slope of a straight line y= b0 + b1X.
Simple Linear RegressionSimple Linear Regression
A simple regression line is fitted for height on age. The intercept is 31.019 and the slope (regression coefficient) is .1877.
Regresion line of Height on Age
Height = 0.1877Age + 31.019
0
10
20
30
40
50
60
70
0 50 100 150 200
Age (month)
Hei
gh
t (i
nch
)
Simple Linear Regression Assumptions:Simple Linear Regression Assumptions:
1. Response variable is normally distributed.
2. Relationship between the two variables is linear.
3. Observations of response variable are independent.
4. Residual error is normally distributed with mean 0
and constant standard deviation.
Simple Linear RegressionSimple Linear Regression
Estimating Parameters b0 and b1
– Least Square method estimates b0 and b1 by fitting a straight line through the data points so that it minimizes the sum of square of the deviation from each data point.
– Formula:
n
ii
n
iii
XX
YYXXb
1
2
11
)(
))((ˆ XbYb 10
ˆˆ
Simple Linear RegressionSimple Linear Regression
Fitted Least Square Regression line – Fitted Line:
– Where is the fitted / predicted value of ith observation (Yi) of the response variable.
– Estimated Residual:
– Least square method estimates b0 and b1 to minimize the summed error:
iY
ii XbbY 10ˆˆˆ
iii YYe ˆˆ
n
iie
1
2ˆ
Simple Linear RegressionSimple Linear Regression
Average Income versus % with a College Degree (by State)
22,000
22,500
23,000
23,500
24,000
24,500
25,000
25,500
26,000
15 15.5 16 16.5 17 17.5 18 18.5 19 19.5 20
Percentage of Population with College Degree or Higher
Ave
rag
e In
co
me
Le
vel (
$ p
er
yea
r)
e1 e3
e2
In this example, a regression line (red line) has been fitted to a series of observations (blue diamonds) and residuals are shown for a few observations (arrows).
Fitted Least Square Regression line
Simple Linear RegressionSimple Linear Regression
Interpretation of the Regression Coefficient and Intercept
– Regression coefficient (b1) reflects the average change in
the response variable Y for a unit change in the
explanatory variable X. That is, the slope of the regression
line. E.g.
– Intercept (b0) estimates the average value of the response
variable Y without the influence of the explanatory
variable X. That is, when the explanatory variable = 0.0.
Simple Linear RegressionSimple Linear Regression
MS-Excel:
– Step1: Select Data Analysis from the menu Tools
– Step2: Input Y range: range for the response (dependent) variable, Input X range:
range for the independent variable
– Step 3: Make selections of out puts you want and click on ok
SPSS:
– Step 1: Select Regression from the menu Analyze
– Step 2: Select Linear and then for Dependent: Select dependent variable and for
Independent(s): select Independent variable
– Step 3: Click on Statistics and select options, clik on Plots and select variables for
scatter plots (if you want), click on continue an d then ok
Example 2: MS-Excel Linear Regression Example 2: MS-Excel Linear Regression output: Height on ageoutput: Height on age
SUMMARY OUTPUT
Regression Statistics Multiple R 0.982963 R Square 0.966217 Adjusted R Square 0.965634 Standard Error 5.604001 Observations 60 ANOVA
df SS MS F Significance
F Regression 1 52095.1 52095.1 1658.825 2.28E-44 Residual 58 1821.48 31.40483 Total 59 53916.58
Coefficients Standard
Error t Stat P-value Lower 95% Upper 95%
Lower 95.0%
Upper 95.0%
Intercept -156.589 6.107669 -25.6381 2.48E-33 -168.815 -144.363 -168.815 -144.363 hgt 5.146693 0.126365 40.72867 2.28E-44 4.893746 5.399641 4.893746 5.399641
Example 2: SPSS Linear Regression Example 2: SPSS Linear Regression output: Height on ageoutput: Height on age
Model Summary
Model R R
Square
Adjusted R
Square Std. Error of the Estimate Change Statistics
R Square Change
F Change df1 df2
Sig. F Change
1 .983(a) .966 .966 1.0703 .966 1658.825 1 58 .000
a Predictors: (Constant), age Coefficients(a)
Unstandardized Coefficients
Standardized Coefficients
Model B Std. Error Beta t Sig.
(Constant) 31.019 .439 70.645 .000 1
age .188 .005 .983 40.729 .000
a Dependent Variable: hgt
Multiple RegressionMultiple Regression
Two or more independent variables to predict a single
dependent variable.
Multiple regression model of Y on p number of explanatory
variables can be written as,
Y = b0 + b1X1 + b2X2 +… +bpXp +e
where bi (i=1,2, …, p) is the regression coefficient of Xi
Multiple RegressionMultiple Regression Fitted Y is given by,
The estimated residual error is the same as that in the simple linear regression,
ii
pp
bb
XbXbXbbY
of estimate theis ˆ Where,
ˆ...ˆˆˆˆ22110
iii YYe ˆˆ
Multiple RegressionMultiple Regression
MS-Excel: Every thing is the same as for the simple regression except
we need to select multiple columns for the range of independent
variables. If the columns of independent variables are not side by side,
then we need to have them side by side. For that we can insert another
excel sheet and copy the data from original sheet and paste the data in the
new sheet. We first copy the dependent variable and paste that in one of
the columns and then copy independent variables and paste them in side
by side columns.
SPSS: Everything is the same as for the simple regression except we
select more than one independent variables.
Example 3: MS Excel Multiple Regression Example 3: MS Excel Multiple Regression output: Height on Age and PLUC.postoutput: Height on Age and PLUC.post
SUMMARY OUTPUT
Regression Statistics Multiple R 0.983103 R Square 0.966491 Adjusted R Square 0.965315 Standard Error 5.629974 Observations 60 ANOVA
df SS MS F Significance
F Regression 2 52109.88 26054.94 822.0105 9.27E-43 Residual 57 1806.706 31.6966 Total 59 53916.58
Coefficients Standard
Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -158.474 6.728226 -23.5536 4.65E-31 -171.947 -145.001 -171.947 -145.001 hgt 5.136708 0.127791 40.19623 1.6E-43 4.880811 5.392605 4.880811 5.392605 PLUC.post 0.194634 0.28509 0.68271 0.497556 -0.37625 0.765517 -0.37625 0.765517
Example 3: SPSS Multiple Regression Example 3: SPSS Multiple Regression output: Height on Age and PLUC.postoutput: Height on Age and PLUC.post
Model Summary
Model R R
Square Adjusted R
Square Std. Error of the Estimate Change Statistics
R Square Change
F Change df1 df2
Sig. F Change
1 .983(a) .966 .965
1.077191687455185
.966 818.970 2 57 .000
a Predictors: (Constant), PLUC.post, age Coefficients(a)
Unstandardized Coefficients
Standardized Coefficients
Model B Std. Error Beta t Sig.
(Constant) 31.330 .752 41.635 .000 age .188 .005 .985 40.196 .000
1
PLUC.post -.028 .055 -.013 -.511 .612
a Dependent Variable: hgt
Coefficient of Determination (Multiple Coefficient of Determination (Multiple R-squared)R-squared)
Total variation in the response variable Y is due to (i) regression of all variables in the model (ii) residual (error).
Total variation of y, SS (y) = SS(Regression) +SS(Residual)
The Coefficient of Determination is,
(Y) Total
Residual)(1
(Y) Total
)Regression(2
SS
SS
SS
SSR
Coefficient of DeterminationCoefficient of Determination (Multiple R-squared) (Multiple R-squared)
R2 lies between 0 and 1.
R2 = 0.8 implies that 80% of the total variation in the
response variable Y is due to the contribution of all
explanatory variables in the model. That is, the fitted
regression model explains 80% of the variance in the
response variable.
Example 4: Calculating coefficient of Example 4: Calculating coefficient of determination for the variables in example 3.determination for the variables in example 3.
MS-Excel: It’s in the MS-Excel output of Example 3.
SPSS: It’s in the SPSS output of Example 3.
QuestionsQuestions