INFO 515 Lecture #8 1
Action Research: Data Manipulation and Crosstabs
INFO 515, Glenn Booker
INFO 515 Lecture #8 2
Parametric vs. Nonparametric
Statistical tests fall into two broad categories: parametric and nonparametric.
- Parametric methods:
  - Require data at higher levels of measurement (interval and/or ratio scales)
  - Are more mathematically powerful than nonparametric statistics
  - But often require more assumptions about the data, such as having a normal distribution or equal variances
INFO 515 Lecture #8 3
Parametric vs. Nonparametric
- Nonparametric methods:
  - Use nominal or ordinal scale data
  - Still allow us to test for a relationship, and its strength and direction (direction only if ordinal)
  - Often have easier prerequisites for being tested (e.g. no restrictions on the distribution)
- Ratio or interval scale data may be recoded to become nominal or ordinal data, and hence be used with nonparametric tests
INFO 515 Lecture #8 4
Significance and Association
... are useful for inferring population values from samples (inferential statistics).
- Significance establishes whether chance can be ruled out as the most likely explanation of differences
- Association shows the nature, strength, and/or direction of the relationship between two (or among three or more) variables
- You need to show significance before association is meaningful
INFO 515 Lecture #8 5
Common Tests of Significance
We've been introduced to three common tests of significance:
- z test (large samples of ratio or interval data)
- t test (small samples of ratio or interval data)
- F test (ANOVA)
Shortly we'll explore a fourth one: Pearson's chi-square (χ²), used for nominal or ordinal scale data.
{χ is the Greek letter chi, pronounced 'kye', rhymes with 'rye'}
INFO 515 Lecture #8 6
Common Measures of Association
- Association measures often range in value from -1 to +1 (but not always!)
- Absence of association between variables generally means a result of 0
- Examples:
  - Pearson's r (interval or ratio scale data)
  - Yule's Q (ordinal data in a 2x2 table)
  - Gamma (ordinal data, in a table larger than 2x2)
{A "2x2" table has 2 rows and 2 columns of data.}
INFO 515 Lecture #8 7
Common Measures of Association
Notice these are all for nominal scale data:
- Phi (φ, 'fee') (nominal data in a 2x2 table)
- Contingency Coefficient (nominal, table larger than 2x2)
- Cramer's V (nominal, larger than 2x2)
- Lambda (λ) (nominal data)
- Eta (η) (nominal data)
INFO 515 Lecture #8 8
Significance and Association
- Tests of significance and measures of association are often used together
- But you can have statistical significance without having association
INFO 515 Lecture #8 9
Significance and Association Examples
- Ratio data: you might use F to determine if there is a significant relationship, then use 'r' from a regression to measure its strength
- Ordinal data: you might run a chi-square to determine statistical significance in the frequencies of two variables, and then run a Yule's Q to show the relationship between the variables
INFO 515 Lecture #8 10
Crosstabs
A brief digression to introduce crosstabs before discussing nonparametric methods.
- A crosstab is a table, often used to display data sorted by two nominal or ordinal variables at once, to study the relationship between variables that each have a small number of possible answers
- Generally contains basic descriptive statistics, such as frequency counts and percentages
INFO 515 Lecture #8 11
Crosstabs
- Used to check the distribution of data, and as a foundation for more complex tests
- Look for gaps or sparse data (little or no contribution to the data set)
- Rule of thumb: put the independent variable in the columns and the dependent variable in the rows
INFO 515 Lecture #8 12
Percentages
- You can show both column and row percentages in crosstabs, rather than just frequency counts (or show both counts and percentages)
- Make sure percentages add to 100%!
- Raw frequency counts of variables don't always provide an accurate picture; unequal numbers of subjects in groups (N) can make the numbers appear skewed
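The row and column percentage bookkeeping above can be sketched in plain Python (outside SPSS); the 2x2 counts here are illustrative placeholders, not taken from a real survey:

```python
# Sketch: compute row and column percentages from raw crosstab counts.
# The 2x2 counts are illustrative placeholders.
counts = [[9, 7],
          [4, 6]]

row_totals = [sum(r) for r in counts]        # one total per row
col_totals = [sum(c) for c in zip(*counts)]  # one total per column

# Row percentages: each cell divided by its row total.
row_pct = [[100 * v / rt for v in r] for r, rt in zip(counts, row_totals)]
# Column percentages: each cell divided by its column total.
col_pct = [[100 * counts[i][j] / col_totals[j] for j in range(len(col_totals))]
           for i in range(len(counts))]

# Sanity check from the slide: every row of row percentages adds to 100%.
print([round(sum(r), 1) for r in row_pct])  # [100.0, 100.0]
```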
INFO 515 Lecture #8 13
Crosstabs Example
- Open data set "GSS91 political.sav"
- Use Analyze / Descriptive Statistics / Crosstabs...
- Set the Row(s) as "region", and the Column(s) as "relig"
- Note the default scope of an SPSS crosstab is to show frequency Counts, with row and column totals
INFO 515 Lecture #8 14
Crosstabs Example
REGION OF INTERVIEW * RS RELIGIOUS PREFERENCE Crosstabulation (Count)

REGION OF INTERVIEW   PROTESTANT  CATHOLIC  JEWISH  NONE  OTHER  Total
NEW ENGLAND                   34        49       0     6      1     90
MIDDLE ATLANTIC               88        86       8    13      6    201
E. NOR. CENTRAL              156        77       4    17      1    255
W. NOR. CENTRAL               92        24       0     3      3    122
SOUTH ATLANTIC               217        53       6    14      7    297
E. SOU. CENTRAL              104         4       0     3      1    112
W. SOU. CENTRAL               92        24       1     7      4    128
MOUNTAIN                      57        15       1     5      1     79
PACIFIC                      115        49      12    33      5    214
Total                        955       381      32   101     29   1498
INFO 515 Lecture #8 15
Crosstabs Example
Repeat the same example with percentages selected under the "Cells…" button to get detailed data in each cell:
- Percent within that region (Row)
- Percent within that religious preference (Column)
- Percent of total data set (divide by Total N)
Gets a bit messy to show this much!
INFO 515 Lecture #8 16
Crosstabs Example
REGION OF INTERVIEW * RS RELIGIOUS PREFERENCE Crosstabulation
(Row % = % within REGION OF INTERVIEW; Col % = % within RS RELIGIOUS PREFERENCE; Tot % = % of Total)

                          PROTESTANT  CATHOLIC  JEWISH    NONE   OTHER   Total
NEW ENGLAND      Count            34        49       0       6       1      90
                 Row %         37.8%     54.4%     .0%    6.7%    1.1%  100.0%
                 Col %          3.6%     12.9%     .0%    5.9%    3.4%    6.0%
                 Tot %          2.3%      3.3%     .0%     .4%     .1%    6.0%
MIDDLE ATLANTIC  Count            88        86       8      13       6     201
                 Row %         43.8%     42.8%    4.0%    6.5%    3.0%  100.0%
                 Col %          9.2%     22.6%   25.0%   12.9%   20.7%   13.4%
                 Tot %          5.9%      5.7%     .5%     .9%     .4%   13.4%
E. NOR. CENTRAL  Count           156        77       4      17       1     255
                 Row %         61.2%     30.2%    1.6%    6.7%     .4%  100.0%
                 Col %         16.3%     20.2%   12.5%   16.8%    3.4%   17.0%
                 Tot %         10.4%      5.1%     .3%    1.1%     .1%   17.0%
W. NOR. CENTRAL  Count            92        24       0       3       3     122
                 Row %         75.4%     19.7%     .0%    2.5%    2.5%  100.0%
                 Col %          9.6%      6.3%     .0%    3.0%   10.3%    8.1%
                 Tot %          6.1%      1.6%     .0%     .2%     .2%    8.1%
SOUTH ATLANTIC   Count           217        53       6      14       7     297
                 Row %         73.1%     17.8%    2.0%    4.7%    2.4%  100.0%
                 Col %         22.7%     13.9%   18.8%   13.9%   24.1%   19.8%
                 Tot %         14.5%      3.5%     .4%     .9%     .5%   19.8%
E. SOU. CENTRAL  Count           104         4       0       3       1     112
                 Row %         92.9%      3.6%     .0%    2.7%     .9%  100.0%
                 Col %         10.9%      1.0%     .0%    3.0%    3.4%    7.5%
                 Tot %          6.9%       .3%     .0%     .2%     .1%    7.5%
W. SOU. CENTRAL  Count            92        24       1       7       4     128
                 Row %         71.9%     18.8%     .8%    5.5%    3.1%  100.0%
                 Col %          9.6%      6.3%    3.1%    6.9%   13.8%    8.5%
                 Tot %          6.1%      1.6%     .1%     .5%     .3%    8.5%
MOUNTAIN         Count            57        15       1       5       1      79
                 Row %         72.2%     19.0%    1.3%    6.3%    1.3%  100.0%
                 Col %          6.0%      3.9%    3.1%    5.0%    3.4%    5.3%
                 Tot %          3.8%      1.0%     .1%     .3%     .1%    5.3%
PACIFIC          Count           115        49      12      33       5     214
                 Row %         53.7%     22.9%    5.6%   15.4%    2.3%  100.0%
                 Col %         12.0%     12.9%   37.5%   32.7%   17.2%   14.3%
                 Tot %          7.7%      3.3%     .8%    2.2%     .3%   14.3%
Total            Count           955       381      32     101      29    1498
                 Row %         63.8%     25.4%    2.1%    6.7%    1.9%  100.0%
                 Col %        100.0%    100.0%  100.0%  100.0%  100.0%  100.0%
                 Tot %         63.8%     25.4%    2.1%    6.7%    1.9%  100.0%
INFO 515 Lecture #8 17
Recoding
- An interval or ratio scaled variable, like age or salary, may have too many distinct values to use in a crosstab
- Recoding lets you combine values into a single new variable (also called collapsing the codes)
- Also helpful for creating histogram variables (e.g. ranges of age or income)
INFO 515 Lecture #8 18
Recoding Example
- Use Transform / Recode / Into Different Variables…
- Move "age" from the list into the Numeric Variable box
- Define the new Output Variable to have Name "agegroup" and Label "Age Group"
- Click the "Change" button to use "agegroup"
- Click on the "Old and New Values" button
INFO 515 Lecture #8 19
Recoding Example
- For the Old Value, enter a Range of 18 to 30; assign this to a New Value of 1; click on "Add"
- Repeat to define ages 31-50 as agegroup New Value 2, 51-75 as 3, and 76-200 as 4
- Click "Continue" and now a new variable exists as defined
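The SPSS recode above can be sketched as a small Python function (an illustration, not SPSS itself); ages outside the defined ranges map to None, standing in for SPSS's system-missing value:

```python
# Sketch of the recode: collapse ratio-scale "age" into the ordinal
# "agegroup" codes 1-4, mirroring the ranges in the SPSS dialog above.
def recode_age(age):
    if 18 <= age <= 30:
        return 1
    if 31 <= age <= 50:
        return 2
    if 51 <= age <= 75:
        return 3
    if 76 <= age <= 200:
        return 4
    return None  # outside all defined ranges (system-missing in SPSS)

print([recode_age(a) for a in (23, 47, 68, 81)])  # [1, 2, 3, 4]
```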
INFO 515 Lecture #8 20
Recoding Example
INFO 515 Lecture #8 21
Recoding Example
Now generate a crosstab with "agegroup" as the columns, and "region" as the rows:

REGION OF INTERVIEW * Age Group Crosstabulation (Count)

REGION OF INTERVIEW   1.00   2.00   3.00   4.00   Total
NEW ENGLAND             25     40     17      8      90
MIDDLE ATLANTIC         36     89     66     12     203
E. NOR. CENTRAL         56    115     71     13     255
W. NOR. CENTRAL         29     41     37     15     122
SOUTH ATLANTIC          66    115     95     21     297
E. SOU. CENTRAL         15     57     30     10     112
W. SOU. CENTRAL         38     55     27      8     128
MOUNTAIN                22     24     24      9      79
PACIFIC                 48    106     48     12     214
Total                  335    642    415    108    1500
INFO 515 Lecture #8 22
Second Recoding Example
- Prof. Yonker had a previous INFO515 class surveyed for their height (in inches) and desired salaries ($/yr)
- Rather than analyze ratio data with few frequencies larger than one, she recoded:
  - Heights into Dwarves, for people below average height, and Giants for those above
  - Desired salaries into Cheap and Expensive, again below and above average
INFO 515 Lecture #8 23
Second Recoding Example
The resulting crosstab was like this:

New Salary * New Height Crosstabulation
(each cell: Count; % within New Height)

New Salary           Dwarves    Giants    Total
Cheap      Count           9         7       16
           %           69.2%     53.8%    61.5%
Expensive  Count           4         6       10
           %           30.8%     46.2%    38.5%
Total      Count          13        13       26
           %          100.0%    100.0%   100.0%
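Since the earlier slides name Yule's Q as the association measure for an ordinal 2x2 table, here is a hedged sketch applying it to the crosstab above (with a, b the top-row cells and c, d the bottom-row cells):

```python
# Yule's Q for a 2x2 table: Q = (ad - bc) / (ad + bc).
def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

# Cells from the New Salary * New Height crosstab above.
q = yules_q(9, 7, 4, 6)
print(round(q, 3))  # 0.317
```

A positive Q here suggests a weak-to-moderate tendency for Giants to want Expensive salaries, though significance would still need to be checked first.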
INFO 515 Lecture #8 24
Pearson Chi Square Test
The Chi Square test measures how much observed (actual) frequencies (fo) differ from "expected" frequencies (fe).
- It is a nonparametric test, a.k.a. the Goodness of Fit statistic
- Does not require assumptions about the shape of the population distribution
- Does not require variables to be measured on an interval or ratio scale
INFO 515 Lecture #8 25
Chi Square Concept
The Chi Square test is like the ANOVA test:
- ANOVA tested whether there was a difference among several means, i.e. whether the means differ from each other in some way
- Chi square tests whether the frequency distribution is different from a random one: is there a significant difference among frequencies?
- It allows us to test for a relationship (but not its strength or direction, if there is one)
INFO 515 Lecture #8 26
Chi Square Null Hypothesis
- The null hypothesis is that the frequencies in cells are independent of each other (there is no relationship among them)
- Each case is independent of every other case; that is, the value of the variable for one individual does not influence the value for another individual
- Chi Square works better for small sample sizes (fewer than hundreds of samples)
- WARNING: almost any really large table will have a significant chi square
INFO 515 Lecture #8 27
Assumptions for Chi Square
- A random sample is the "expected" basis for comparison
- Each case can fall into only one cell
- No zero values are allowed for the observed frequency, fo
- No expected frequencies, fe, may be less than one
- At least 80% of expected frequencies, fe, should be greater than or equal to five (≥ 5)
INFO 515 Lecture #8 28
Expected Frequency
The expected frequency for a cell is based on the fraction of cases that would fall into it randomly, given the same general row and column proportions as the actual data set:

fe = (row total) * (column total) / N

So if 90 people live in New England, and 335 are in Age Group 1, from a total sample of 1500, then we would expect fe = 90*335/1500 = 20.1 people in that cell (see slide 21).
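The rule fe = (row total)(column total)/N can be checked directly; this sketch reproduces the 20.1 for the New England / Age Group 1 cell:

```python
# Expected frequency for one crosstab cell: fe = row total * column total / N.
def expected_frequency(row_total, col_total, n):
    return row_total * col_total / n

fe = expected_frequency(90, 335, 1500)  # New England row, Age Group 1 column
print(fe)  # 20.1
```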
INFO 515 Lecture #8 29
Expected Frequency
So the general formula for the expected frequency of a given cell is:

fe = (actual row total) * (actual column total) / N

Notice that this is NOT using the average expected frequency for every cell, which would be fe = N / [(# of rows)*(# of columns)].
INFO 515 Lecture #8 30
Calculating Chi Square
The Chi square value for each cell is the observed frequency minus the expected one, squared, divided by the expected frequency:

Chi square per cell = (fo - fe)² / fe

Sum this for all cells in the crosstab. For the cell on slide 28, the actual frequency was 25, so the Chi square for that cell is (25 - 20.1)²/20.1 = 1.195. Note: Chi square is always positive.
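Summing (fo - fe)²/fe over every cell of the region × agegroup table on slide 21 should reproduce the SPSS total; a plain-Python sketch:

```python
# Pearson chi-square for the region x agegroup crosstab (slide 21):
# chi2 = sum over all cells of (fo - fe)^2 / fe, with fe = row*col/N.
observed = [
    [25, 40, 17, 8],    # NEW ENGLAND
    [36, 89, 66, 12],   # MIDDLE ATLANTIC
    [56, 115, 71, 13],  # E. NOR. CENTRAL
    [29, 41, 37, 15],   # W. NOR. CENTRAL
    [66, 115, 95, 21],  # SOUTH ATLANTIC
    [15, 57, 30, 10],   # E. SOU. CENTRAL
    [38, 55, 27, 8],    # W. SOU. CENTRAL
    [22, 24, 24, 9],    # MOUNTAIN
    [48, 106, 48, 12],  # PACIFIC
]
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)  # 1500

chi2 = sum(
    (observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
    / (row_tot[i] * col_tot[j] / n)
    for i in range(len(observed)) for j in range(len(col_tot))
)
print(round(chi2, 2))  # 43.26 -- matches the SPSS output later in the lecture
```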
INFO 515 Lecture #8 31
Calculating Chi Square
- Page 36/37 of the Action Research handout has an example of the chi square calculation, where:
  - fo is the observed (actual) frequency
  - fe is the expected frequency, e.g. fe for the first cell is 20*30/60 = 10.0
- Chi square for each cell is (fo - fe)²/fe
- Sum chi square for all cells in the table
No comments about fe fi fo fum! Is that clear?!?!
INFO 515 Lecture #8 32
Interpreting Chi Square
- When the total Chi square is larger than the critical value, reject the null hypothesis
- See the Action Research handout, page 42/43, for critical Chi square (χ²) values
- Look up the critical value using the 'df' value, which is based on the number of rows and columns in the crosstab: df = (#rows - 1)(#columns - 1)
- For the example on slide 21, df = (9-1)(4-1) = 8*3 = 24
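The df rule and the table lookup can be sketched as follows; the critical values here are a few standard χ² entries at the 0.050 level keyed in by hand, standing in for the full table in the handout:

```python
# df for an r x c crosstab, plus a tiny hand-keyed excerpt of critical
# chi-square values at significance level 0.050.
CRITICAL_CHI2_05 = {1: 3.841, 8: 15.507, 24: 36.415}

def crosstab_df(n_rows, n_cols):
    return (n_rows - 1) * (n_cols - 1)

df = crosstab_df(9, 4)               # the slide 21 example
print(df, CRITICAL_CHI2_05[df])      # 24 36.415
```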
INFO 515 Lecture #8 33
Interpreting Chi Square
Or you can be lazy and use the old standby: if the significance is less than 0.050, reject the null hypothesis.
INFO 515 Lecture #8 34
Chi Square Example
- Open data set "GSS91 political.sav"
- Use Analyze / Descriptive Statistics / Crosstabs...
- Set the Row(s) as "region", and the Column(s) as "agegroup"
- Click on "Statistics…" and select the "Chi-square" test
Notice we're still using the Crosstab command!
INFO 515 Lecture #8 35
Chi Square Example
Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square           43.260(a)   24        .009
Likelihood Ratio             43.557      24        .009
Linear-by-Linear Association  1.062       1        .303
N of Valid Cases               1500

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.69.
INFO 515 Lecture #8 36
Chi Square Example
- Note that we correctly predicted the 'df' value of 24
- SPSS is ready to warn you if too many cells have an expected count below five, or any have expected counts below one
- The significance is below 0.050, indicating we reject the null hypothesis
- The total Chi square for all cells is 43.260
INFO 515 Lecture #8 37
Chi Square Example
- The critical Chi square value can be looked up on page 42/43 of Yonker
- For df = 24 and significance level 0.050, we get a critical Chi square of 36.415
- Since the actual Chi square (43.260) is greater than the critical value (36.415), reject the null hypothesis
- Chi square often shows significance falsely for large sample sizes (hence the earlier warning)
INFO 515 Lecture #8 38
Chi Square Example
What are the other tests? They don't apply here...
- The Likelihood Ratio test is specifically for log-linear models
- The Linear-by-Linear Association test is a function of Pearson's 'r', so it only applies to interval or ratio scale variables
- Notice that SPSS doesn't realize those tests don't apply, and blindly presents results for them…
INFO 515 Lecture #8 39
One-variable Chi square Test
- To check only one variable's distribution, there is another way to run Chi square
- The null hypothesis is that the variable is evenly distributed across all of its categories
- Hence the expected frequencies are equal for each category, unless you specify otherwise
- The expected range can also be specified
INFO 515 Lecture #8 40
Other Chi square Example
- Use Analyze / Nonparametric Tests / Chi-square… (NOT the Crosstab command here)
- Add "region" to the Test Variable List
- Now df is the number of categories in the variable, minus one: df = (# categories) - 1
- Significance is interpreted the same way
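The one-variable test can be sketched directly: with equal expected frequencies N/k, the region totals from the earlier agegroup crosstab reproduce SPSS's statistic.

```python
# One-variable (goodness-of-fit) chi-square for "region": the null
# hypothesis is equal frequencies, so fe = N / (number of categories).
observed = [90, 203, 255, 122, 297, 112, 128, 79, 214]  # nine regions
n = sum(observed)          # 1500
fe = n / len(observed)     # about 166.7 per region
chi2 = sum((fo - fe) ** 2 / fe for fo in observed)
df = len(observed) - 1
print(round(chi2, 3), df)  # 290.352 8
```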
INFO 515 Lecture #8 41
Other Chi square Example

Test Statistics: REGION OF INTERVIEW

Chi-Square(a)   290.352
df                    8
Asymp. Sig.        .000

a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 166.7.
INFO 515 Lecture #8 42
Other Chi square Example
- So in this case, the "region" variable has nine categories, for a df of 9-1 = 8
- The critical Chi square for df = 8 is 15.507, so the actual value of 290 shows these data are not evenly distributed across regions
- A significance below 0.050 still, in keeping with our fine long-established tradition, rejects the null hypothesis
INFO 515 Lecture #8 43
Whodunit?
- The chi-square value by itself doesn't tell us which of the cells are major contributors to the statistical significance
- We compute the standardized residual to address that issue
- It hints at which cells contribute a lot to the total chi square
INFO 515 Lecture #8 44
Residuals
- The Residual is the Observed value minus the Estimated (expected) value for some data point: Residual = fo - fe
- If the variable is evenly distributed, the Residuals should have a normal distribution
- Plots of residuals are sometimes used to check data normality (i.e. how normal is this data's distribution?)
INFO 515 Lecture #8 45
Standardized Residual
- The Standardized Residual is the Residual divided by the standard deviation of the residuals
- When the absolute value of the Standardized Residual for a cell is greater than 2, you may conclude that it is a major contributor to the overall chi-square value
- Analogous to the original t test, looking for |t| > 2
INFO 515 Lecture #8 46
Standardized Residual
- Extreme values of the Standardized Residual (e.g. minimum, maximum) can also help identify extreme data points
- The meaning of residual is the same for regression analysis, BTW, where residuals are an optional output
INFO 515 Lecture #8 47
Standardized Residual Example
- In the crosstab region-agegroup example, click "Cells…" and select Standardized Residuals
- In this case, the worst cell is the combination W. Nor. Central region - Age Group 4, which produced a standardized residual of 2.1
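As a check, SPSS computes the standardized residual for a crosstab cell as (fo - fe)/sqrt(fe); that formula reproduces the 2.1 flagged above:

```python
import math

# Standardized residual for the W. Nor. Central / Age Group 4 cell:
# (fo - fe) / sqrt(fe), with fe = row total * column total / N.
fo = 15
fe = 122 * 108 / 1500          # = 8.784
std_resid = (fo - fe) / math.sqrt(fe)
print(round(std_resid, 1))  # 2.1
```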
INFO 515 Lecture #8 48
Standardized Residual Example

REGION OF INTERVIEW * Age Group Crosstabulation
(each cell: Count; Std. Residual)

REGION OF INTERVIEW          1.00    2.00    3.00    4.00   Total
NEW ENGLAND      Count         25      40      17       8      90
                 Std. Res     1.1      .2    -1.6      .6
MIDDLE ATLANTIC  Count         36      89      66      12     203
                 Std. Res    -1.4      .2     1.3     -.7
E. NOR. CENTRAL  Count         56     115      71      13     255
                 Std. Res     -.1      .6      .1    -1.3
W. NOR. CENTRAL  Count         29      41      37      15     122
                 Std. Res      .3    -1.6      .6     2.1
SOUTH ATLANTIC   Count         66     115      95      21     297
                 Std. Res      .0    -1.1     1.4     -.1
E. SOU. CENTRAL  Count         15      57      30      10     112
                 Std. Res    -2.0     1.3     -.2      .7
W. SOU. CENTRAL  Count         38      55      27       8     128
                 Std. Res     1.8      .0    -1.4     -.4
MOUNTAIN         Count         22      24      24       9      79
                 Std. Res     1.0    -1.7      .5     1.4
PACIFIC          Count         48     106      48      12     214
                 Std. Res      .0     1.5    -1.5     -.9
Total            Count        335     642     415     108    1500
INFO 515 Lecture #8 49
Crosstab Statistics for 2x2 Table
2x2 tables appear so often that many tests have been developed specifically for them:
- Equality of proportions
- McNemar Chi-square
- Yates Correction
- Fisher Exact Test
INFO 515 Lecture #8 50
Crosstab Statistics for 2x2 Table
- Equality of proportions tests whether the proportion of one variable is the same for two different values of another variable (e.g. do homeowners vote as often as renters?)
- McNemar Chi-square tests for frequencies in a 2x2 table where the samples are dependent (such as pre-test and post-test results)
INFO 515 Lecture #8 51
Crosstab Statistics for 2x2 Table
- Yates Correction for Continuity: the chi-square is refined for small observed frequencies:
  Chi square per cell = ( |fo - fe| - 0.5 )² / fe
  These corrections are too conservative; don't use!
- Fisher Exact Test: assumes the row/column frequencies remain fixed, and computes all possible tables; gives a significance value like Chi square
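The corrected formula can be sketched on the 2x2 salary/height table from the earlier recoding example (shown only to illustrate the arithmetic, given the warning above):

```python
# Yates-corrected chi-square: sum over cells of (|fo - fe| - 0.5)^2 / fe.
counts = [[9, 7],   # Cheap:     Dwarves, Giants
          [4, 6]]   # Expensive: Dwarves, Giants
row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]
n = sum(row_tot)

chi2_yates = sum(
    (abs(counts[i][j] - row_tot[i] * col_tot[j] / n) - 0.5) ** 2
    / (row_tot[i] * col_tot[j] / n)
    for i in range(2) for j in range(2)
)
print(round(chi2_yates, 4))  # 0.1625
```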
INFO 515 Lecture #8 52
Nominal Measures of Association
These are used to test whether each measure is zero (the null hypothesis), using different scales:
- Phi
- Cramer's V
- Contingency Coefficient
All three are zero iff Chi square is zero. ("iff" is mathspeak for 'if and only if'.)
INFO 515 Lecture #8 53
Nominal Measures of Association
- The usual significance criterion is used for all three: if significance < 0.050, reject the null hypothesis, hence the association is significant
- Notice that direction is meaningless for nominal variables, so only the strength of an association can be determined
INFO 515 Lecture #8 54
Phi
- For a 2x2 table, Phi and Cramer's V are equal to Pearson's r
- Phi (φ) can be > 1 (for tables larger than 2x2), making it an unusual measure of association
- Phi = sqrt[ (Chi square) / N ]
- Phi = 0 means no association; Phi near or over 1 means strong association
INFO 515 Lecture #8 55
Cramer's V
- Cramer's V ≤ 1
- V = sqrt[ Chi Square / (N * (k - 1)) ], where k is the smaller of the number of columns or rows
- It is a better measure than the Contingency Coefficient for tables larger than 2x2
INFO 515 Lecture #8 56
Contingency Coefficient
- a.k.a. C, Pearson's C, or Pearson's Contingency Coefficient
- The most widely used measure based on chi-square
- Requires only nominal data
- C has a value of 0 when there is no association
INFO 515 Lecture #8 57
Contingency Coefficient
- C = sqrt[ Chi Square / (Chi Square + N) ]
- The maximum possible value of C is the square root of (the number of columns minus 1, divided by the number of columns): Cmax = sqrt( (#columns - 1) / #columns )

[Chart: Maximum Contingency Coefficient (Cmax) vs. Number of Columns, rising from about 0.7 at 2 columns toward 1 as the number of columns increases to 14.]
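The three chi-square-based measures above, plus Cmax, can be sketched together using the region × agegroup results (chi-square 43.260, N = 1500, 9 rows × 4 columns):

```python
import math

# Phi, Cramer's V, and the Contingency Coefficient C from one chi-square,
# using the region x agegroup example (chi2 = 43.260, N = 1500, 9x4 table).
chi2, n, rows, cols = 43.260, 1500, 9, 4
k = min(rows, cols)

phi = math.sqrt(chi2 / n)                       # sqrt(chi2 / N)
cramers_v = math.sqrt(chi2 / (n * (k - 1)))     # sqrt(chi2 / (N*(k-1)))
c = math.sqrt(chi2 / (chi2 + n))                # sqrt(chi2 / (chi2 + N))
c_max = math.sqrt((cols - 1) / cols)            # sqrt((#cols - 1)/#cols)

print(round(phi, 3), round(cramers_v, 3), round(c, 3), round(c_max, 3))
# 0.17 0.098 0.167 0.866
```

All three association values are small, which fits the earlier warning: with N = 1500, a chi-square can be statistically significant while the association itself is weak.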