Statistical Analysis
DESCRIPTION
Statistical Analysis (5/6). Part of a series of six presentations introducing scientific research in cross-cultural studies using a quantitative approach.
TRANSCRIPT
Quantitative Research Methodologies (5/6):
Statistical Analysis
Prof. Dr. Hora Tjitra & Dr. He Quan
www.SinauOnline.com
2
Statistical Analysis Process
Data Preparation
... Cleaning and organizing the data for analysis; involves checking or logging the data in, checking the data for accuracy, entering the data into the computer, etc.
Descriptive Statistics
... Describing the data; used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures.
Inferential Statistics
... Testing hypotheses and models; investigating questions, models and hypotheses. In many cases, the conclusions from inferential statistics extend beyond the immediate data alone.
3
Data Preparation
Logging the Data
• mail survey returns
• coded interview data
• pretest or posttest data
• observational data

Checking the Data for Accuracy
• Are the responses legible/readable?
• Are all important questions answered?
• Are the responses complete?
• Is all relevant contextual information included (e.g., date, time, place)?

Developing a Database Structure
• variable name
• variable description
• variable format (number, date, text)
• instrument/method of collection
• date collected
• respondent or group
• variable location (in database)
• notes

Entering the Data into the Computer
• Double Entry: you enter the data once; then you use a special program that allows you to enter the data a second time and checks each second entry against the first.

Data Transformation
• missing values
• item reversals
• scale totals
• categories
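The double-entry check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the field names and records are hypothetical.

```python
# Minimal sketch of a double-entry accuracy check: the same records are
# keyed in twice and every second entry is compared against the first.
# The record contents below are hypothetical illustration data.
first_entry = [
    {"id": 1, "age": 34, "score": 7},
    {"id": 2, "age": 29, "score": 5},
]
second_entry = [
    {"id": 1, "age": 34, "score": 7},
    {"id": 2, "age": 92, "score": 5},  # typo: 29 keyed in as 92
]

def double_entry_check(a, b):
    """Return (row, field, first_value, second_value) for every mismatch."""
    mismatches = []
    for row, (ra, rb) in enumerate(zip(a, b)):
        for field in ra:
            if ra[field] != rb[field]:
                mismatches.append((row, field, ra[field], rb[field]))
    return mismatches

print(double_entry_check(first_entry, second_entry))
# → [(1, 'age', 29, 92)]
```

Every flagged tuple points at one entry to re-check against the original paper record.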
4
Descriptive Statistics (Univariate Analysis)
The Distribution: A summary of the frequency of individual values or ranges of values for a variable
Central Tendency: an estimate of the "center" of a distribution of values
Dispersion: the spread of the values around the central tendency
5
The Distribution
Frequency distributions can be depicted in two ways:
A table shows an age frequency distribution with five categories of age ranges defined.
A graph shows the frequency distribution; it is often referred to as a histogram or bar chart.
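For instance, a frequency distribution over five age ranges can be tabulated in a few lines; the ages below are hypothetical illustration data, not from the slides.

```python
from collections import Counter

ages = [21, 23, 25, 25, 28, 31, 31, 31, 34, 38, 42, 44]  # hypothetical data

def age_range(age, width=5):
    """Label the 5-year bin an age falls into, e.g. 31 -> '30-34'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

freq = Counter(age_range(a) for a in ages)
print(dict(freq))
# → {'20-24': 2, '25-29': 3, '30-34': 4, '35-39': 1, '40-44': 2}
```

The same counts drive both depictions: the table lists them, the histogram draws one bar per bin.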
6
Central Tendency
• The Mean or average is probably the most commonly used method of describing central tendency
• The Median is the score found at the exact middle of the set of values
• The Mode is the most frequently occurring value in the set of scores
2, 3, 5, 3, 4, 3, 6
M=(2+3+5+3+4+3+6)/7=26/7=3.71
Md: sorted scores are 2, 3, 3, 3, 4, 5, 6. The middle (4th) score falls among the three tied 3s, which span the interval 2.5-3.5 (split as 2.5-2.83, 2.83-3.16, 3.16-3.5); interpolating within that interval gives Md = 3.33
Mo = 3
Example
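The example above can be checked with Python's standard statistics module; median_grouped gives the interpolated median the slide computes, while median gives the simple middle score.

```python
import statistics

scores = [2, 3, 5, 3, 4, 3, 6]
print(statistics.mean(scores))            # 26/7 ≈ 3.71
print(statistics.median(scores))          # 3, the simple middle score
print(statistics.median_grouped(scores))  # ≈ 3.33, the interpolated median
print(statistics.mode(scores))            # 3
```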
7
Dispersion
• The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.
Standard Deviation
The Standard Deviation shows how spread out a set of scores is around the mean of the sample: (15, 20, 21, 20, 36, 15, 25, 15), M = 20.875

Deviations from the mean:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

Squared deviations:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
+15.125 * +15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

Sum of Squares = 350.875
Variance = 350.875 / 7 = 50.125 (dividing by n - 1 = 7)
Standard deviation = SQRT(50.125) = 7.0799
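The same computation in Python; statistics.variance and statistics.stdev divide by n - 1, matching the slide's arithmetic.

```python
from statistics import mean, variance, stdev

scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(mean(scores))               # 20.875
print(max(scores) - min(scores))  # range: 36 - 15 = 21
print(variance(scores))           # 50.125 (sum of squares 350.875 / 7)
print(stdev(scores))              # ≈ 7.0799
```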
8
Normal Distribution
Normal distributions are a family of distributions that have the same general shape: symmetric, with scores more concentrated in the middle than in the tails. They are described by two parameters: the mean (μ) and the standard deviation (σ).
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1:
• (-1, 1) ------------ 68.26%
• (-1.96, 1.96) --- 95%
• (-2.58, 2.58) --- 99%
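These coverage figures can be verified with statistics.NormalDist from the standard library:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

def coverage(k):
    """Probability mass between -k and +k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

print(round(coverage(1.00) * 100, 2))  # 68.27
print(round(coverage(1.96) * 100, 2))  # 95.0
print(round(coverage(2.58) * 100, 2))  # 99.01
```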
9
Hypothesis Testing / Significance Testing
A statistical procedure for discriminating between two statistical hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1).
• The philosophical basis for hypothesis testing lies in the fact that random variation pervades all aspects of life, and in the desire to avoid being fooled by what might be chance variation
• The alternative hypothesis typically describes some change or effect that you expect or hope to see confirmed by data. For example, new drug A works better than standard drug B. Or the accuracy of a new weapon targeting system is better than historical standards
• The null hypothesis embodies the presumption that nothing has changed, or that there is no difference
10
The Statistical Inference Decision Matrix
In reality, H0 is true and H1 is false: there is no relationship; there is no difference, no gain; our theory is wrong.
In reality, H0 is false and H1 is true: there is a relationship, there is a difference or gain; our theory is correct.

If we accept H0 and reject H1, we say "There is no relationship", "There is no difference, no gain", "Our theory is wrong":
• When H0 is in fact true: 1-α (e.g., .95), THE CONFIDENCE LEVEL. The odds of saying there is no relationship, difference, or gain when in fact there is none; the odds of correctly not confirming our theory. 95 times out of 100, when there is no effect, we'll say there is none.
• When H0 is in fact false: β (e.g., .20), TYPE II ERROR. The odds of saying there is no relationship, difference, or gain when in fact there is one; the odds of not confirming our theory when it's true. 20 times out of 100, when there is an effect, we'll say there isn't.

If we reject H0 and accept H1, we say "There is a relationship", "There is a difference or gain", "Our theory is correct":
• When H0 is in fact true: α (e.g., .05), TYPE I ERROR (SIGNIFICANCE LEVEL). The odds of saying there is a relationship, difference, or gain when in fact there is not; the odds of confirming our theory incorrectly. 5 times out of 100, when there is no effect, we'll say there is one. We should keep this small when we can't afford the risk of wrongly concluding that our program works.
• When H0 is in fact false: 1-β (e.g., .80), POWER. The odds of saying there is a relationship, difference, or gain when in fact there is one; the odds of confirming our theory correctly. 80 times out of 100, when there is an effect, we'll say there is. We generally want this to be as large as possible.
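As an illustrative sketch (not from the slides), a short simulation shows why α behaves as stated: when H0 is really true, a two-sided test at α = .05 rejects in roughly 5 trials out of 100. The sample size, trial count, and known-σ z-test setup here are assumptions chosen only for the demonstration.

```python
import random
from statistics import mean

# Simulate many samples drawn when H0 is true (population mean 0,
# known sigma = 1) and count how often a two-sided z-test at
# alpha = .05 falsely rejects H0 (a Type I error).
random.seed(42)          # fixed seed so the run is reproducible
n, trials = 30, 2000
critical = 1.96          # two-sided cutoff for alpha = .05

rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) / (1 / n ** 0.5)   # z = (mean - 0) / (sigma / sqrt(n))
    if abs(z) > critical:
        rejections += 1

type_i_rate = rejections / trials
print(type_i_rate)  # close to .05
```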
11
Examples
12
Selecting the Appropriate Statistical Test
The choice depends on the type of design (between-subject vs. within-subject), the number of independent variables, and the number of groups or levels of the independent variables.

One independent variable:
• Between-subject design, two groups → independent samples t-test
• Between-subject design, more than two groups → one-way analysis of variance
• Within-subject design, two groups or two levels of the independent variable → correlated t-test
• Within-subject design, more than two groups or more than two levels of the independent variable → repeated measures analysis of variance

Two independent variables (between-subject) → two-way analysis of variance
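The decision tree above can be encoded directly. The function name and string labels below are my own, chosen only to mirror the table:

```python
def choose_test(n_ivs: int, design: str, n_levels: int) -> str:
    """Map the slide's decision tree to a test name.
    design is 'between' or 'within' (used when there is one
    independent variable); n_levels is the number of groups/levels."""
    if n_ivs == 2:
        return "two-way analysis of variance"
    if design == "between":
        return ("independent samples t-test" if n_levels == 2
                else "one-way analysis of variance")
    return ("correlated t-test" if n_levels == 2
            else "repeated measures analysis of variance")

print(choose_test(1, "between", 2))  # → independent samples t-test
print(choose_test(1, "within", 3))   # → repeated measures analysis of variance
```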
13
Correlation
Correlation is a measure of the relation between two or more variables
The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data
Correlation coefficients can range from -1.00 to +1.00
14
The types of correlation
subject X Y Zx Zy ZxZy
A 1 4 -1.5 -1.5 2.25
B 3 7 -1.0 -1.0 1.00
C 5 10 -0.5 -0.5 0.25
D 7 13 0 0 0.00
E 9 16 0.5 0.5 0.25
F 11 19 1.0 1.0 1.00
G 13 22 1.5 1.5 2.25
N=7, X̄=7.0, Ȳ=13.0, Sx=4.0, Sy=6.0, ∑X=49, ∑Y=91, ∑Zx=∑Zy=0.0, ∑ZxZy=7.00, ∑X²=455, ∑Y²=1435
r = (∑ZxZy)/N = 7.00/7 = 1.00
r = (∑ZxZy)/N
Pearson r: used when data represent interval or ratio scales; assumes a linear relationship between the variables.
Spearman rs, the rank correlation coefficient: used with ordered or ranked data.
rs = 1 - [6(∑D²) / (N(N²-1))]
IQ rank Leadership rank D D2
1 4 -3 9
2 2 0 0
3 1 2 4
4 6 -2 4
5 3 2 4
6 5 1 1
∑D=0 ∑D2=22
N=6, rs = 1 - [6(∑D²)/(N(N²-1))] = 1 - [6*22/(6*35)] = 1 - .63 = .37
How to judge the significance of r?
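Both worked examples above can be reproduced with a short sketch; pstdev is the population standard deviation, which matches the slide's Sx = 4.0 and Sy = 6.0.

```python
from statistics import mean, pstdev

# Pearson r from z-scores, as in the first example
X = [1, 3, 5, 7, 9, 11, 13]
Y = [4, 7, 10, 13, 16, 19, 22]
n = len(X)
zx = [(x - mean(X)) / pstdev(X) for x in X]
zy = [(y - mean(Y)) / pstdev(Y) for y in Y]
r = sum(a * b for a, b in zip(zx, zy)) / n
print(round(r, 2))  # 1.0

# Spearman rs from the rank data in the second example
iq_rank   = [1, 2, 3, 4, 5, 6]
lead_rank = [4, 2, 1, 6, 3, 5]
d2 = sum((a - b) ** 2 for a, b in zip(iq_rank, lead_rank))
m = len(iq_rank)
rs = 1 - 6 * d2 / (m * (m ** 2 - 1))
print(d2, round(rs, 2))  # 22 0.37
```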
15
T-Test: Testing for Differences Between Two Groups

Group 1: 3, 2, 2, 1, 2, 2, 3, 2, 2, 1
Group 2: 2, 2, 3, 3, 3, 3, 4, 3, 4

Group 1: N=10, ∑X=20, ∑X²=44, X̄=2.0, SS=4.0, s=0.6667
Group 2: N=9, ∑X=27, ∑X²=85, X̄=3.0, SS=4.0, s=0.7071

t = -3.17
df = (10-1)+(9-1) = 17
p = .004
Report: t(17) = -3.17, p < .05
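The t value can be reproduced with the pooled-variance formula, using only the standard library:

```python
from math import sqrt
from statistics import mean, variance

g1 = [3, 2, 2, 1, 2, 2, 3, 2, 2, 1]
g2 = [2, 2, 3, 3, 3, 3, 4, 3, 4]
n1, n2 = len(g1), len(g2)

# pooled variance: (SS1 + SS2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
t = (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 2), df)  # -3.17 17
```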
16
ANOVA
A statistical procedure developed by R. A. Fisher that allows one to compare simultaneously the differences between two or more means.
One-way ANOVA ... comparing the effects of different levels of a single independent variable
Two-way ANOVA ... comparing simultaneously the effects of two independent variables
Between-groups variance ... estimate of the variance between group means
Within-groups variance ... estimate of the average variance within each group
Homogeneity of variance ... the variances of the groups are equivalent to each other
Basic concept
17
Work Example
Data (four groups, n = 4 each):
Group 1: 4, 2, 1, 3 → X.1 = 10, n1 = 4, mean = 2.5, ∑Xi1² = 30, s1² = 1.67
Group 2: 6, 3, 5, 4 → X.2 = 18, n2 = 4, mean = 4.5, ∑Xi2² = 86, s2² = 1.67
Group 3: 4, 5, 7, 6 → X.3 = 22, n3 = 4, mean = 5.5, ∑Xi3² = 126, s3² = 1.67
Group 4: 5, 8, 6, 5 → X.4 = 24, n4 = 4, mean = 6.0, ∑Xi4² = 150, s4² = 2.00
Overall: ∑Xij = 74, N = 16, grand mean X̄ = 4.62, ∑Xij² = 392

F max = 2.00/1.67 = 1.198 (largest group variance over smallest, checking homogeneity of variance)

Degrees of freedom:
df total = N-1 = 16-1 = 15
df between = k-1 = 4-1 = 3
df within = ∑(nj-1) = 12

F = MS between / MS within = 5.48

Result Table
Source          SS     df  MS     F
Between-groups  28.75   3  9.583  5.48
Within-groups   21.00  12  1.750
Total           49.75  15

SS total = SS between + SS within
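The result table can be reproduced from the raw data in a few lines:

```python
from statistics import mean

groups = [[4, 2, 1, 3], [6, 3, 5, 4], [4, 5, 7, 6], [5, 8, 6, 5]]
scores = [x for g in groups for x in g]
grand_mean = mean(scores)

# partition the total sum of squares into between- and within-groups parts
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
df_between = len(groups) - 1            # k - 1 = 3
df_within = len(scores) - len(groups)   # N - k = 12

F = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 2), round(ss_within, 2), round(F, 2))
# → 28.75 21.0 5.48
```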
18
MANOVA
MANOVA is a technique which determines the effects of independent categorical variables on multiple continuous dependent variables. It is usually used to compare several groups with respect to multiple continuous variables.
The main distinction between MANOVA and ANOVA is that several dependent variables are considered in MANOVA. While ANOVA tests for inter-group differences between the mean values of one dependent variable, MANOVA tests for differences between centroids, the vectors of the mean values of the dependent variables.
One important aspect of MANOVA is the interaction between factors.
19
Work Example
Source               SS      df   MS     F      Sig. of F
Target               235.20   3   78.40  74.59  .000
Device                86.47   2   43.23  41.13  .000
Light                 76.80   1   76.80  73.07  .000
Target*Device        104.20   6   17.37  16.52  .000
Target*Light          93.87   3   31.29  29.77  .000
Target*Device*Light  174.33   6   29.06  27.65  .000
Model                770.87  21   34.93  34.93  .000
Within+Residual      103.00  98    1.05
Total                873.87  119
Relationship between psychomotor test scores and the size of the target the subjects aimed at.
Target: T(1), T(2), T(3), T(4)
Device: D(1), D(2), D(3)
Light: L(1), L(2)
Interactions: Target*Light, Target*Device, Target*Device*Light
20
Factor Analysis
…a statistical technique used to reduce a set of variables to a smaller number of variables or factors. It examines the pattern of intercorrelations between the variables and determines whether there are subsets of variables (or factors) that correlate highly with each other but show low correlations with other subsets (or factors).
Variables: x1, x2, x3, x4, ... xm
Factors: z1, z2, z3, z4, ... zn (the b's below are the factor loadings)
x1 = b11 z1 + b12 z2 + b13 z3 + ... + b1n zn + e1
...
z1 = a11 x1 + a12 x2 + a13 x3 + ... + a1m xm
...
• Exploratory Factor Analysis
• Confirmatory Factor Analysis
21
Exploratory Factor Analysis (EFA)
Seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor.
This is the most common form of factor analysis. There is no prior theory and one uses factor loadings to intuit the factor structure of the data.
Assumptions of Exploratory Factor Analysis: no outliers, interval data, linearity, multivariate normality, orthogonality (for principal factor analysis).
22
Work Example
Measurement item                                                  Factor 1  Factor 2  Reliability
1. I am satisfied with my family life                               .808      .270      .846
2. My family life is going well                                     .791      .268      .846
3. My actual family life is close to my ideal family life           .736      .226      .846
4. I have obtained the important things I want in my family life    .709      .251      .846
5. I am dissatisfied with the communication between my spouse
   and me; my spouse does not understand me                         .672     -.105      .846
6. I dislike my spouse's personality and personal habits            .644     -.135      .846
7. I am very satisfied with the responsibilities each of us
   takes on in our marriage                                         .583      .239      .846
8. Overall, I am very satisfied with my current job                 .081      .846      .804
9. I get pleasure from my work                                      .100      .820      .804
10. I really have no interest in this job at all                    .052      .717      .804
11. Most people doing this job are satisfied with it                .158      .702      .804
12. Employees here often think about quitting                       .160      .551      .804
Variance explained (%)                                            29.999    25.366
Cumulative variance explained (%)                                 29.999    55.365
Factor name                                        Family satisfaction  Job satisfaction
23
Confirmatory Factor Analysis (CFA)
Seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is used to see if they load as predicted on the expected number of factors.
There are two approaches to confirmatory factor analysis:
• The traditional method: confirmatory factor analysis can be accomplished through any general-purpose statistical package that supports factor analysis.
• The SEM approach: confirmatory factor analysis can mean the analysis of alternative measurement (factor) models using a structural equation modeling package such as AMOS or LISREL.
A minimum requirement of confirmatory factor analysis is that one hypothesize beforehand the number of factors in the model, but usually the researcher will also posit expectations about which variables will load on which factors.
© Tjitra, 2010
Thank you! Any comments & questions are welcome.
Contact me at [email protected]
24
www.SinauOnline.com