Statistical Analysis
DESCRIPTION
Statistical Analysis (5/6). Part of a series of six presentations introducing scientific research in cross-cultural studies using a quantitative approach.
TRANSCRIPT
Quantitative Research Methodologies (5/6):
Statistical Analysis
Prof. Dr. Hora Tjitra & Dr. He Quan
www.SinauOnline.com
2
Statistical Analysis Process
Data Preparation
... Cleaning and organizing the data for analysis; involves checking or logging the data in, checking the data for accuracy, entering the data into the computer, etc.
Descriptive Statistics
... Describing the data; used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures.
Inferential Statistics
... Testing hypotheses and models; investigating questions, models and hypotheses. In many cases, the conclusions from inferential statistics extend beyond the immediate data alone.
3
Data Preparation
Logging the Data
• mail survey returns
• coded interview data
• pretest or posttest data
• observational data

Checking the Data for Accuracy
• Are the responses legible/readable?
• Are all important questions answered?
• Are the responses complete?
• Is all relevant contextual information included (e.g., date, time, place)?

Developing a Database Structure
• variable name
• variable description
• variable format (number, date, text)
• instrument/method of collection
• date collected
• respondent or group
• variable location (in database)
• notes

Entering the Data into the Computer
• Double Entry: you enter the data once; then you use a special program that allows you to enter the data a second time and checks each second entry against the first.

Data Transformation
• missing values
• item reversals
• scale totals
• categories
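The double-entry check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the field names and records are hypothetical.

```python
# Minimal sketch of a double-entry accuracy check: the same records are
# keyed in twice and every second entry is compared against the first.
# The record contents below are hypothetical illustration data.
first_entry = [
    {"id": 1, "age": 34, "score": 7},
    {"id": 2, "age": 29, "score": 5},
]
second_entry = [
    {"id": 1, "age": 34, "score": 7},
    {"id": 2, "age": 92, "score": 5},  # typo: 29 keyed in as 92
]

def double_entry_check(a, b):
    """Return (row, field, first_value, second_value) for every mismatch."""
    mismatches = []
    for row, (ra, rb) in enumerate(zip(a, b)):
        for field in ra:
            if ra[field] != rb[field]:
                mismatches.append((row, field, ra[field], rb[field]))
    return mismatches

print(double_entry_check(first_entry, second_entry))
# → [(1, 'age', 29, 92)]
```

Every flagged tuple points at one entry to re-check against the original paper record.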
4
Descriptive Statistics (Univariate Analysis)
The Distribution: A summary of the frequency of individual values or ranges of values for a variable
Central Tendency: an estimate of the "center" of a distribution of values
Dispersion: the spread of the values around the central tendency
5
The Distribution
Frequency distributions can be depicted in two ways:
A table shows an age frequency distribution with five categories of age ranges defined.
A graph shows the frequency distribution; it is often referred to as a histogram or bar chart.
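For instance, a frequency distribution over five age ranges can be tabulated in a few lines; the ages below are hypothetical illustration data, not from the slides.

```python
from collections import Counter

ages = [21, 23, 25, 25, 28, 31, 31, 31, 34, 38, 42, 44]  # hypothetical data

def age_range(age, width=5):
    """Label the 5-year bin an age falls into, e.g. 31 -> '30-34'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

freq = Counter(age_range(a) for a in ages)
print(dict(freq))
# → {'20-24': 2, '25-29': 3, '30-34': 4, '35-39': 1, '40-44': 2}
```

The same counts drive both depictions: the table lists them, the histogram draws one bar per bin.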
6
Central Tendency
• The Mean or average is probably the most commonly used method of describing central tendency
• The Median is the score found at the exact middle of the set of values
• The Mode is the most frequently occurring value in the set of scores
2, 3, 5, 3, 4, 3, 6
M=(2+3+5+3+4+3+6)/7=26/7=3.71
Md: sorted scores are 2, 3, 3, 3, 4, 5, 6. The middle (4th) score falls among the three tied 3s, which span the interval 2.5-3.5 (split as 2.5-2.83, 2.83-3.16, 3.16-3.5); interpolating within that interval gives Md = 3.33
Mo = 3
Example
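The example above can be checked with Python's standard statistics module; median_grouped gives the interpolated median the slide computes, while median gives the simple middle score.

```python
import statistics

scores = [2, 3, 5, 3, 4, 3, 6]
print(statistics.mean(scores))            # 26/7 ≈ 3.71
print(statistics.median(scores))          # 3, the simple middle score
print(statistics.median_grouped(scores))  # ≈ 3.33, the interpolated median
print(statistics.mode(scores))            # 3
```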
7
Dispersion
• The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.
Standard Deviation
The Standard Deviation shows how spread out a set of scores is around the mean of the sample: (15, 20, 21, 20, 36, 15, 25, 15), M = 20.875

Deviations from the mean:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

Squared deviations:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
+15.125 * +15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

Sum of Squares = 350.875
Variance = 350.875 / 7 = 50.125 (dividing by n - 1 = 7)
Standard deviation = SQRT(50.125) = 7.0799
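The same computation in Python; statistics.variance and statistics.stdev divide by n - 1, matching the slide's arithmetic.

```python
from statistics import mean, variance, stdev

scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(mean(scores))               # 20.875
print(max(scores) - min(scores))  # range: 36 - 15 = 21
print(variance(scores))           # 50.125 (sum of squares 350.875 / 7)
print(stdev(scores))              # ≈ 7.0799
```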
8
Normal Distribution
Normal distributions are a family of distributions that have the same general shape: symmetric, with scores more concentrated in the middle than in the tails. They are described by two parameters: the mean (μ) and the standard deviation (σ).
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1:
• (-1, 1) ------------ 68.26%
• (-1.96, 1.96) --- 95%
• (-2.58, 2.58) --- 99%
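These coverage figures can be verified with statistics.NormalDist from the standard library:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

def coverage(k):
    """Probability mass between -k and +k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

print(round(coverage(1.00) * 100, 2))  # 68.27
print(round(coverage(1.96) * 100, 2))  # 95.0
print(round(coverage(2.58) * 100, 2))  # 99.01
```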
9
Hypothesis Testing / Significance Testing
A statistical procedure for discriminating between two statistical hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1).
• The philosophical basis for hypothesis testing lies in the fact that random variation pervades all aspects of life, and in the desire to avoid being fooled by what might be chance variation
• The alternative hypothesis typically describes some change or effect that you expect or hope to see confirmed by data. For example, new drug A works better than standard drug B. Or the accuracy of a new weapon targeting system is better than historical standards
• The null hypothesis embodies the presumption that nothing has changed, or that there is no difference
10
The Statistical Inference Decision Matrix
In reality, H0 is true and H1 is false: there is no relationship; there is no difference, no gain; our theory is wrong.
In reality, H0 is false and H1 is true: there is a relationship, there is a difference or gain; our theory is correct.

If we accept H0 and reject H1, we say "There is no relationship", "There is no difference, no gain", "Our theory is wrong":
• When H0 is in fact true: 1-α (e.g., .95), THE CONFIDENCE LEVEL. The odds of saying there is no relationship, difference, or gain when in fact there is none; the odds of correctly not confirming our theory. 95 times out of 100, when there is no effect, we'll say there is none.
• When H0 is in fact false: β (e.g., .20), TYPE II ERROR. The odds of saying there is no relationship, difference, or gain when in fact there is one; the odds of not confirming our theory when it's true. 20 times out of 100, when there is an effect, we'll say there isn't.

If we reject H0 and accept H1, we say "There is a relationship", "There is a difference or gain", "Our theory is correct":
• When H0 is in fact true: α (e.g., .05), TYPE I ERROR (SIGNIFICANCE LEVEL). The odds of saying there is a relationship, difference, or gain when in fact there is not; the odds of confirming our theory incorrectly. 5 times out of 100, when there is no effect, we'll say there is one. We should keep this small when we can't afford the risk of wrongly concluding that our program works.
• When H0 is in fact false: 1-β (e.g., .80), POWER. The odds of saying there is a relationship, difference, or gain when in fact there is one; the odds of confirming our theory correctly. 80 times out of 100, when there is an effect, we'll say there is. We generally want this to be as large as possible.
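As an illustrative sketch (not from the slides), a short simulation shows why α behaves as stated: when H0 is really true, a two-sided test at α = .05 rejects in roughly 5 trials out of 100. The sample size, trial count, and known-σ z-test setup here are assumptions chosen only for the demonstration.

```python
import random
from statistics import mean

# Simulate many samples drawn when H0 is true (population mean 0,
# known sigma = 1) and count how often a two-sided z-test at
# alpha = .05 falsely rejects H0 (a Type I error).
random.seed(42)          # fixed seed so the run is reproducible
n, trials = 30, 2000
critical = 1.96          # two-sided cutoff for alpha = .05

rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) / (1 / n ** 0.5)   # z = (mean - 0) / (sigma / sqrt(n))
    if abs(z) > critical:
        rejections += 1

type_i_rate = rejections / trials
print(type_i_rate)  # close to .05
```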
11
Examples
12
Selecting the Appropriate Statistical Test
The choice depends on the type of design (between-subject vs. within-subject), the number of independent variables, and the number of groups or levels of the independent variables.

One independent variable:
• Between-subject design, two groups → independent samples t-test
• Between-subject design, more than two groups → one-way analysis of variance
• Within-subject design, two groups or two levels of the independent variable → correlated t-test
• Within-subject design, more than two groups or more than two levels of the independent variable → repeated measures analysis of variance

Two independent variables (between-subject) → two-way analysis of variance
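The decision tree above can be encoded directly. The function name and string labels below are my own, chosen only to mirror the table:

```python
def choose_test(n_ivs: int, design: str, n_levels: int) -> str:
    """Map the slide's decision tree to a test name.
    design is 'between' or 'within' (used when there is one
    independent variable); n_levels is the number of groups/levels."""
    if n_ivs == 2:
        return "two-way analysis of variance"
    if design == "between":
        return ("independent samples t-test" if n_levels == 2
                else "one-way analysis of variance")
    return ("correlated t-test" if n_levels == 2
            else "repeated measures analysis of variance")

print(choose_test(1, "between", 2))  # → independent samples t-test
print(choose_test(1, "within", 3))   # → repeated measures analysis of variance
```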
13
Correlation
Correlation is a measure of the relation between two or more variables
The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data
Correlation coefficients can range from -1.00 to +1.00
14
The types of correlation
subject X Y Zx Zy ZxZy
A 1 4 -1.5 -1.5 2.25
B 3 7 -1.0 -1.0 1.00
C 5 10 -0.5 -0.5 0.25
D 7 13 0 0 0.00
E 9 16 0.5 0.5 0.25
F 11 19 1.0 1.0 1.00
G 13 22 1.5 1.5 2.25
N=7, X̄=7.0, Ȳ=13.0, Sx=4.0, Sy=6.0, ∑X=49, ∑Y=91, ∑Zx=∑Zy=0.0, ∑ZxZy=7.00, ∑X²=455, ∑Y²=1435
r = (∑ZxZy)/N = 7.00/7 = 1.00
r = (∑ZxZy)/N
Pearson r: used when data represent interval or ratio scales; assumes a linear relationship between the variables.
Spearman rs, the rank correlation coefficient: used with ordered or ranked data.
rs = 1 - [6(∑D²) / (N(N²-1))]
IQ rank Leadership rank D D2
1 4 -3 9
2 2 0 0
3 1 2 4
4 6 -2 4
5 3 2 4
6 5 1 1
∑D=0 ∑D2=22
N=6, rs = 1 - [6(∑D²)/(N(N²-1))] = 1 - [6*22/(6*35)] = 1 - .63 = .37
How to judge the significance of r?
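Both worked examples above can be reproduced with a short sketch; pstdev is the population standard deviation, which matches the slide's Sx = 4.0 and Sy = 6.0.

```python
from statistics import mean, pstdev

# Pearson r from z-scores, as in the first example
X = [1, 3, 5, 7, 9, 11, 13]
Y = [4, 7, 10, 13, 16, 19, 22]
n = len(X)
zx = [(x - mean(X)) / pstdev(X) for x in X]
zy = [(y - mean(Y)) / pstdev(Y) for y in Y]
r = sum(a * b for a, b in zip(zx, zy)) / n
print(round(r, 2))  # 1.0

# Spearman rs from the rank data in the second example
iq_rank   = [1, 2, 3, 4, 5, 6]
lead_rank = [4, 2, 1, 6, 3, 5]
d2 = sum((a - b) ** 2 for a, b in zip(iq_rank, lead_rank))
m = len(iq_rank)
rs = 1 - 6 * d2 / (m * (m ** 2 - 1))
print(d2, round(rs, 2))  # 22 0.37
```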
15
T-Test: Testing for Differences Between Two Groups

Group 1: 3, 2, 2, 1, 2, 2, 3, 2, 2, 1
Group 2: 2, 2, 3, 3, 3, 3, 4, 3, 4

Group 1: N=10, ∑X=20, ∑X²=44, X̄=2.0, SS=4.0, s=0.6667
Group 2: N=9, ∑X=27, ∑X²=85, X̄=3.0, SS=4.0, s=0.7071

t = -3.17
df = (10-1)+(9-1) = 17
p = .004
Report: t(17) = -3.17, p < .05
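The t value can be reproduced with the pooled-variance formula, using only the standard library:

```python
from math import sqrt
from statistics import mean, variance

g1 = [3, 2, 2, 1, 2, 2, 3, 2, 2, 1]
g2 = [2, 2, 3, 3, 3, 3, 4, 3, 4]
n1, n2 = len(g1), len(g2)

# pooled variance: (SS1 + SS2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
t = (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 2), df)  # -3.17 17
```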
16
ANOVA
A statistical procedure developed by R. A. Fisher that allows one to compare simultaneously the differences between two or more means.
One-way ANOVA ... comparing the effects of different levels of a single independent variable
Two-way ANOVA ... comparing simultaneously the effects of two independent variables
Between-groups variance ... estimate of the variance between group means
Within-groups variance ... estimate of the average variance within each group
Homogeneity of variance ... the variances of the groups are equivalent to each other
Basic concept
17
Work Example
Data (four groups, n = 4 each):
Group 1: 4, 2, 1, 3 → X.1 = 10, n1 = 4, mean = 2.5, ∑Xi1² = 30, s1² = 1.67
Group 2: 6, 3, 5, 4 → X.2 = 18, n2 = 4, mean = 4.5, ∑Xi2² = 86, s2² = 1.67
Group 3: 4, 5, 7, 6 → X.3 = 22, n3 = 4, mean = 5.5, ∑Xi3² = 126, s3² = 1.67
Group 4: 5, 8, 6, 5 → X.4 = 24, n4 = 4, mean = 6.0, ∑Xi4² = 150, s4² = 2.00
Overall: ∑Xij = 74, N = 16, grand mean X̄ = 4.62, ∑Xij² = 392

F max = 2.00/1.67 = 1.198 (largest group variance over smallest, checking homogeneity of variance)

Degrees of freedom:
df total = N-1 = 16-1 = 15
df between = k-1 = 4-1 = 3
df within = ∑(nj-1) = 12

F = MS between / MS within = 5.48

Result Table
Source          SS     df  MS     F
Between-groups  28.75   3  9.583  5.48
Within-groups   21.00  12  1.750
Total           49.75  15

SS total = SS between + SS within
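The result table can be reproduced from the raw data in a few lines:

```python
from statistics import mean

groups = [[4, 2, 1, 3], [6, 3, 5, 4], [4, 5, 7, 6], [5, 8, 6, 5]]
scores = [x for g in groups for x in g]
grand_mean = mean(scores)

# partition the total sum of squares into between- and within-groups parts
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
df_between = len(groups) - 1            # k - 1 = 3
df_within = len(scores) - len(groups)   # N - k = 12

F = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 2), round(ss_within, 2), round(F, 2))
# → 28.75 21.0 5.48
```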
18
MANOVA
MANOVA is a technique which determines the effects of independent categorical variables on multiple continuous dependent variables. It is usually used to compare several groups with respect to multiple continuous variables.
The main distinction between MANOVA and ANOVA is that several dependent variables are considered in MANOVA. While ANOVA tests for inter-group differences between the mean values of one dependent variable, MANOVA tests for differences between centroids, the vectors of the mean values of the dependent variables.
One important aspect of MANOVA is the interaction between factors.
19
Work Example
Source               SS      df   MS     F      Sig. of F
Target               235.20   3   78.40  74.59  .000
Device                86.47   2   43.23  41.13  .000
Light                 76.80   1   76.80  73.07  .000
Target*Device        104.20   6   17.37  16.52  .000
Target*Light          93.87   3   31.29  29.77  .000
Target*Device*Light  174.33   6   29.06  27.65  .000
Model                770.87  21   34.93  34.93  .000
Within+Residual      103.00  98    1.05
Total                873.87  119
Relationship between psychomotor test scores and the size of the target the subjects aimed at.
Target: T(1), T(2), T(3), T(4)
Device: D(1), D(2), D(3)
Light: L(1), L(2)
Interactions: Target*Light, Target*Device, Target*Device*Light
20
Factor Analysis
…a statistical technique used to reduce a set of variables to a smaller number of variables or factors. It examines the pattern of intercorrelations between the variables and determines whether there are subsets of variables (or factors) that correlate highly with each other but show low correlations with other subsets (or factors).
Variables: x1, x2, x3, x4, ... xm
Factors: z1, z2, z3, z4, ... zn (the b's below are the factor loadings)
x1 = b11 z1 + b12 z2 + b13 z3 + ... + b1n zn + e1
...
z1 = a11 x1 + a12 x2 + a13 x3 + ... + a1m xm
...
• Exploratory Factor Analysis
• Confirmatory Factor Analysis
21
Exploratory Factor Analysis (EFA)
Seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor.
This is the most common form of factor analysis. There is no prior theory and one uses factor loadings to intuit the factor structure of the data.
Assumptions of Exploratory Factor Analysis: no outliers, interval data, linearity, multivariate normality, orthogonality (for principal factor analysis).
22
Work Example
Measurement item                                                  Factor 1  Factor 2  Reliability
1. I am satisfied with my family life                               .808      .270      .846
2. My family life is going well                                     .791      .268      .846
3. My actual family life is close to my ideal family life           .736      .226      .846
4. I have obtained the important things I want in my family life    .709      .251      .846
5. I am dissatisfied with the communication between my spouse
   and me; my spouse does not understand me                         .672     -.105      .846
6. I dislike my spouse's personality and personal habits            .644     -.135      .846
7. I am very satisfied with the responsibilities each of us
   takes on in our marriage                                         .583      .239      .846
8. Overall, I am very satisfied with my current job                 .081      .846      .804
9. I get pleasure from my work                                      .100      .820      .804
10. I really have no interest in this job at all                    .052      .717      .804
11. Most people doing this job are satisfied with it                .158      .702      .804
12. Employees here often think about quitting                       .160      .551      .804
Variance explained (%)                                            29.999    25.366
Cumulative variance explained (%)                                 29.999    55.365
Factor name                                        Family satisfaction  Job satisfaction
23
Confirmatory Factor Analysis (CFA)
Seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is used to see if they load as predicted on the expected number of factors.
There are two approaches to confirmatory factor analysis:
• The traditional method: confirmatory factor analysis can be accomplished through any general-purpose statistical package that supports factor analysis.
• The SEM approach: confirmatory factor analysis can mean the analysis of alternative measurement (factor) models using a structural equation modeling package such as AMOS or LISREL.
A minimum requirement of confirmatory factor analysis is that one hypothesize beforehand the number of factors in the model, but usually the researcher will also posit expectations about which variables will load on which factors.
© Tjitra, 2010
Thank you! Any comments & questions are welcome.
Contact me at [email protected]
24
www.SinauOnline.com