the basic analysis using spss - bijay lal pradhan, ph.d. · the basic analysis using spss dr. bijay...
TRANSCRIPT
The basic analysis
using SPSS
Dr. Bijay Lal Pradhan Tribhuvan University
The basic analysis using SPSS
Today’s Learning
• Sampling procedure
• Point and interval estimation
• Hypothesis testing
• Correlation Analysis
• Regression Analysis
Descriptive & Inferential Statistics
Statistics
Descriptive Inferential
Tabular Graphical Numerical
Estimation Hypothesis Testing
Point Interval Parametric Non-Parametric
What
type of
data
?
1. Prepare frequency table
2. Compute mode
3. Compute median (ordinal)
4. Draw graphs
• Bar diagram
• Pie-chart
5. Chi-square test
1. Prepare frequency table (discrete)
2. Compute mean. Median and mode
3. Compute positional statistics
4. Compute SD, range etc.
5. Draw graphs.
• Steam-and-leaf plot (discrete).
• Box-Whisker plot.
• Histogram (continuous).
• Bar diagram (discrete).
6. Z, t, F & 2 tests
7. Transform into categorical.
Nominal or Ordinal Scale data
Univariate Data Analysis
Analysis of data of a single variable at a time is univariate
analysis. The suitable univariate data analysis methods by
scale of variables are listed below
Bivariate Data Analysis
Analysis of data of two variables at a time. The kinds of data
analysis are listed below.
Nominal Ordinal
1. Prepare two-way frequency tables
2. Compute row or column percentages
3. Draw charts and diagrams
4. Test hypotheses (chi-square test of independence)
Scale
1. Prepare two-way frequency tables
2. Draw Scatter diagram
3. Test hypotheses (chi-square, z, t, F tests)
4. Carry out correlation & simple regression analysis
Sampling
• Simple Random Sampling
– Using select cases
– Use of random numbers (Random Number Generator in SPSS and Excel)
• Selection of sample in Stratified random sampling in Excel
Estimation
• Point Estimation
• Interval estimation
– Confidence Interval
• (Analyse>descriptive statistics>explore>estimation)
Fundamental of Hypothesis Testing
• There two types of statistical inferences, Estimation and Hypothesis Testing
• Hypothesis Testing: A hypothesis is a claim (assumption) about one or more population parameters.
– Average price of a lunch in hetauda is μ = Rs 200
– The population mean monthly cell phone bill of this city is: μ = Rs 125
– The average number of TV sets in Homes is equal to three; μ = 2
• It Is always about a population parameter, not about a sample statistic
• Sample evidence is used to assess the probability that the claim about the population parameter is true
A. It starts with Null Hypothesis, H0
• We begin with the assumption that H0 is true and any difference between the sample statistic and true population parameter is due to chance and not a real (systematic) difference.
• Always contains “=” , “≤” or “” sign • May or may not be rejected
0H :μ 3 and X=2.79
B. Next we state the Alternative Hypothesis, H1
• Is the opposite of the null hypothesis – e.g., The average number of TV sets in
homes is not equal to 2 ( H1: μ ≠ 2 ) • Never contains the “=” , “≤” or “” sign • May or may not be proven • Is generally the hypothesis that the
researcher is trying to prove. Evidence is always examined with respect to H1, never with respect to H0.
• We never “accept” H0, we either “reject” or “not reject” it
A. Rejection Region Method: • Divide the distribution into rejection and non-
rejection regions
• Defines the unlikely values of the sample statistic if the null hypothesis is true, the critical value(s)
– Defines rejection region of the sampling
distribution
• Rejection region(s) is designated by , (level of
significance)
– Typical values are .01, .05, or .10
• is selected by the researcher at the beginning
• provides the critical value(s) of the test
H0: μ ≥ 12
H1: μ < 12
0
H0: μ ≤ 12
H1: μ > 12
a
a
Represents critical value
Lower-tail test
0 Upper-tail test
Two-tail test
Rejection region is shaded
/2
0
a /2 a H0: μ = 12
H1: μ ≠ 12
Rejection Region or Critical Value Approach:
Level of significance =
Non-rejection region
• P-Value Approach – • P-value=Max. Probability of (Type I Error), calculated from the
sample.
• Given the sample information what is the size of blue are?
H0: μ ≥ 12
H1: μ < 12
H0: μ ≤ 12
H1: μ > 12 0 Upper-tail test
Two-tail test 0
H0: μ = 12
H1: μ ≠ 12
0
Type I and II Errors:
• The size of , the rejection region, affects the risk of making different types of incorrect decisions.
Type I Error – Rejecting a true null hypothesis when it should NOT be rejected
– Considered a serious type of error
– The probability of Type I Error is
– It is also called level of significance of the test
Type II Error – Fail to reject a false null hypothesis that should have been
rejected
– The probability of Type II Error is β
Truth
Decision H0 true H0 false
Retain H0 Correct retention
Type II error
Reject H0 Type I error Correct rejection
α ≡ probability of a Type I error
β ≡ Probability of a Type II error
Two types of decision errors:
Type I error = erroneous rejection of true H0
Type II error = erroneous retention of false H0
• P-Value approach to Hypothesis Testing:
• That is to say that P-value is the smallest value of for which H0 can be rejected based on the
sample information • Convert Sample Statistic (e.g., sample mean) to
Test Statistic (e.g., Z statistic )
• Obtain the p-value from a table or computer
• Compare the p-value with
– If p-value < , reject H0
– If p-value , do not reject H0
P-value (Observed Significance Level)
• P-value - Measure of the strength of evidence the sample data provides against the null hypothesis:
P(Evidence This strong or stronger against H0 | H0 is true)
)(: obszZPpvalP
Test of Hypothesis for the Mean
The test statistic is:
n
S
μXt 1-n
σ Unknown
σ known
The test statistic is:
n
σ
μXZ
Steps to Hypothesis Testing
1. State the H0 and H1 clearly
2. Identify the test statistic (two-tail, one-tail, and type of test to be used)
3. Depending on the type of risk you are willing to take, specify the level of significance,
4. Find the decision rule, critical values, and rejection regions. If –CV<actual value (sample statistic) <+CV, then do not reject the H0
5. Collect the data and do the calculation for the actual values of the test statistic from the sample
Steps to Hypothesis testing, continued
Make statistical decision
Do not Reject H0 Reject H0
Conclude H0 may be true
Make management/business/administrative decision
Conclude H1 is “true”
(There is sufficient evidence of H1)
When do we use a two-tail test? when do we use a one-tail test?
• The answer depends on the question you are trying to answer.
• A two-tail is used when the researcher has no idea which
direction the study will go, interested in both direction. • (example: testing a new technique, a new product, a new theory and we don’t
know the direction)
• A new machine is producing 12 fluid once can of soft drink. The quality control manager is concern with cans containing too much or too little. Then, the test is a two-tailed test. That is the two rejection regions in tails is most likely (higher probability) to provide evidence of H1.
oz 12 :H
oz12:H
1
0
12
• One-tail test is used when the researcher is interested in the direction.
• Example: The soft-drink company puts a label on cans claiming they contain 12 oz. A consumer advocate desires to test this statement. She would assume that each can contains at least 12 oz and tries to find evidence to the contrary. That is, she examines the evidence for less than 12 0z.
• What tail of the distribution is the most logical (higher probability) to find that evidence? The only way to reject the claim is to get evidence of less than 12 oz, left tail.
oz 12 :H
oz12:H
1
0
12 14 11.5
Type of Hypothesis
• What to test • Significance of means
– Single mean test – Double mean test
• Dependent pairs • Independent pairs
– More than two mean test
Correlation and Regression
How do we measure association
between two variables?
1. For ordinal and nominal variable
• Odds Ratio (OR)
• Chi square test of independence of attributes
2. For scale variables
• Correlation Coefficient R
• Coefficient of Determination (R-Square)
Example
• A researcher believes that there is a linear relationship between BMI (Kg/m2) of pregnant mothers and the birth-weight (BW in Kg) of their newborn
• The following data set provide information on 15 pregnant mothers who were contacted for this study
BMI (Kg/m2)
Birth-weight (Kg)
20
2.7
30
2.9
50
3.4 45
3.0 10
2.2 30
3.1
40
3.3
25
2.3
50
3.5
20
2.5
10
1.5
55
3.8
60
3.7
50
3.1
35
2.8
Scatter Diagram
• Scatter diagram is a graphical method to display the relationship between two variables
• Scatter diagram plots pairs of bivariate observations (x, y) on the X-Y plane
• Y is called the dependent variable
• X is called an independent variable
0
0.5
1
1.5
2
2.5
3
3.5
4
0 10 20 30 40 50 60 70
Scatter diagram of BMI and Birthweight
Is there a linear relationship
between BMI and BW?
• Scatter diagrams are important for initial exploration of the relationship between two quantitative variables
• In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points
Simple Linear Regression
• Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory.
• An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares.
• Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.
Least-squares or regression line
• These vertical distances, i.e., the distance between y values and their corresponding estimated values on the line are called residuals
• The line which fits the best is called the regression line or, sometimes, the least-squares line
• The line always passes through the point defined by the mean of Y and the mean of X
Linear Regression Model
• The method of least-squares is available in most of the statistical packages (and also on some calculators) and is usually referred to as linear regression
• Y is also known as an outcome variable
• X is also called as a predictor
Estimated Regression Line
ˆˆˆ 0
ˆ. . . int
ˆ 0 . . .
y = + x = 1.775351 + 0. 330187 x
1.775351 is called y ercept
0. 330187 is called the slope
Application of Regression Line
This equation allows you to estimate BW of other newborns when the BMI is given.
e.g., for a mother who has BMI=40, i.e. X = 40 we predict BW to be
ˆˆˆ 0 (40) 3.096y = + x = 1.775351 + 0. 330187
Correlation Coefficient, R
• R is a measure of strength of the linear association between two variables, x and y.
• Most statistical packages and some hand calculators can calculate R
• For the data in our Example R=0.94
• R has some unique characteristics
Correlation Coefficient, R
• R takes values between -1 and +1
• R=0 represents no linear relationship between the two variables
• R>0 implies a direct linear relationship
• R<0 implies an inverse linear relationship
• The closer R comes to either +1 or -1, the stronger is the linear relationship
Coefficient of Determination
• R2 is another important measure of linear association between x and y (0 R2 1)
• R2 measures the proportion of the total
variation in y which is explained by x
• For example r2 = 0.8751, indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
Difference between Correlation and Regression
• Correlation Coefficient, R, measures the strength of bivariate association
• The regression line is a prediction equation that estimates the values of y for any given x
Limitations of the correlation coefficient
• Though R measures how closely the two variables approximate a straight line, it does not validly measures the strength of nonlinear relationship
• When the sample size, n, is small we also have to be careful with the reliability of the correlation
• Outliers could have a marked effect on R
• Causal Linear Relationship
Regression Analysis
• Click „Analyze,‟ „Regression,‟ then click
„Linear‟ from the main menu.
Regression Analysis
• For example let‟s analyze the model
• Put „Beginning Salary‟ as Dependent and „Educational Level‟ as
Independent.
edusalbegin 10
Click Click
Regression Analysis
• Clicking OK gives the result
Plotting the regression line
• Click „Graphs,‟ „Legacy Dialogs,‟
„Interactive,‟ and „Scatterplot‟ from the main menu.
Plotting the regression line
• Drag „Current Salary‟ into the vertical axis box
and „Beginning Salary‟ in the horizontal axis box.
• Click „Fit‟ bar. Make sure the Method is
regression in the Fit box. Then click „OK‟.
Click Set this to
Regression!
Is the model significant?
• r2 is the proportion of the variance in y that is explained by our regression model
• SE is also another measure check significance through
• F-statistic:
F (dfŷ,dfer) = sŷ
2
ser2
=......= r2 (n - 2)2
1 – r2
complicated rearranging
And we should know the significance of reg. coeff.
t = byx
S.E.
If all these satisfies than we can say model is Fit.