Learn to Use the Phi Coefficient Measure and Test in R With Data From the Welsh Health Survey (Teaching Dataset) (2009) © 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.


Student Guide

Introduction

This example dataset introduces the Phi Coefficient, which allows researchers to measure and test the strength of association between two categorical variables, each of which has only two groups. This example describes the Phi Coefficient, discusses the assumptions underlying its validity, and shows how to compute and interpret it. We illustrate the Phi Coefficient measure and test using a subset of data from the 2009 Welsh Health Survey. Specifically, we measure and test the strength of association between sex and whether the respondent has visited the dentist in the last twelve months. The Phi Coefficient can be used in its own right to assess the strength of association between two categorical variables, each with only two groups. Typically, however, it is used in conjunction with Pearson’s Chi-Squared test of association in tabular analysis. Pearson’s Chi-Squared test tells us whether there is an association between two categorical variables, but it does not tell us how strong, or how important, this association is. The Phi Coefficient provides a measure of the strength of association, and it can also be used to test statistical significance, that is, whether the association can be distinguished from zero (no association).

This page provides links to this sample dataset and a guide to producing the Phi Coefficient test using statistical software.

What Is a Phi Coefficient?

The Phi Coefficient is a method for determining the strength of association between two categorical variables (e.g., sex, ethnicity, occupation), each of which is, or is measured as, binary; that is, each has only two groups (e.g., male/female or employed/unemployed). Also known as Pearson’s Phi Coefficient, the measure is designed for variables at the binary categorical level only. When used as a formal statistical test, one must, as always, first define the null hypothesis (H0) to be tested. In this case, the standard null hypothesis is that there is no association between the two variables. Even if the variables are not associated in truth, some non-zero association would be expected in a sample simply due to sampling error, i.e., random chance in sampling. The Phi Coefficient test conducted here is designed to help us determine whether the departure from zero association observed in the sample is large enough to declare the association statistically significantly different from zero. “Large enough” is typically defined as a test statistic with a level of statistical significance, or p-value, of less than .05, meaning that, if the null hypothesis were true, sample associations this large or larger would occur “just by random chance” in fewer than 5% of samples of this size. We would then “reject the null hypothesis (H0) of no association between the two variables” at the .05 level.

Calculating a Phi Coefficient

The Phi Coefficient is derived from Pearson’s Chi-Square statistic of tabular association. The modifications restrict the resulting statistic to a range of −1.0 to 1.0, analogous to (although not the same as) Pearson’s Correlation Coefficient. If the variables are not associated, then the Phi Coefficient value should be 0; perfect positive association yields a Phi Coefficient of 1, and perfect negative association yields −1. To illustrate, let’s imagine that we have surveyed 100 participants, whom we have categorised by whether they have children and whom we have asked whether they have a pet. Table 1 below shows the hypothetical results.

Table 1: Cross-Tabulation of Pet Ownership and Having a Child.

                                   Whether respondent has a pet
Whether respondent has a child     Yes            No             Total
Yes (n = 30)                       20 (66.7%)     10 (33.3%)     30
No (n = 70)                        10 (14.3%)     60 (85.7%)     70
Total                              30             70             100

The cross-tabulation suggests a possible positive association, as pet ownership appears to be greater amongst those who have children: 66.7% of respondents with children had a pet, compared with 14.3% of respondents without children. However, we do not yet know whether this is statistically significant. Table 1 is also known as a 2 × 2 contingency table. Two binary variables are considered positively associated if most of the data fall along the diagonal cells, that is, if the counts in the Yes/Yes and No/No cells (labelled a and d in Table 2) are larger than the counts in the off-diagonal cells (labelled b and c). Conversely, if most of the data fall in the off-diagonal cells, then the two variables are negatively associated. Table 2 below illustrates this, with each observed count labelled.

Table 2: Cross-Tabulation of Pet Ownership and Having a Child.

                                   Whether respondent has a pet
Whether respondent has a child     Yes               No                Total
Yes (n = 30)                       a = 20 (66.7%)    b = 10 (33.3%)    e = 30
No (n = 70)                        c = 10 (14.3%)    d = 60 (85.7%)    f = 70
Total                              g = 30            h = 70            100


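If we look at Table 2, we can see that a and d appear larger than b and c. However, to quantify the association we need to calculate the Phi Coefficient, using Equation 1.

Equation 1 presents the formula for the Phi Coefficient (using the data in counts), where a, b, c, and d are the cell counts and e, f, g, and h are the row and column totals labelled in Table 2:

$$\varphi = \frac{ad - bc}{\sqrt{efgh}} \tag{1}$$

Equation 2 presents the formula populated with data from the example:

$$\varphi = \frac{20 \times 60 - 10 \times 10}{\sqrt{30 \times 70 \times 30 \times 70}} = \frac{1200 - 100}{\sqrt{4410000}} = \frac{1100}{2100} \approx 0.52 \tag{2}$$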
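The arithmetic in Equations 1 and 2 can be checked with a few lines of code. The following is a minimal sketch in Python, offered for illustration rather than as the guide's own software procedure. The function name phi_coefficient and its argument order are assumptions made here; a, b, c, and d are the four cell counts from Table 2, and the totals e, f, g, and h are computed from them.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of counts laid out as [[a, b], [c, d]]."""
    e, f = a + b, c + d      # row totals
    g, h = a + c, b + d      # column totals
    return (a * d - b * c) / math.sqrt(e * f * g * h)

# Counts from Table 2: a = 20, b = 10, c = 10, d = 60
phi = phi_coefficient(20, 10, 10, 60)
print(round(phi, 2))  # 0.52
```

Note that the sketch fails with a division by zero if any row or column total is 0, which corresponds to a variable with only one observed group.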

We have calculated the Phi Coefficient to be 0.52. We can interpret this figure using the same scale as that for Pearson’s Correlation Coefficient. Table 3 presents the Phi Coefficient Scale.

Table 3: The Phi Coefficient Scale.

Phi Coefficient     Interpretation
−1.0 to −0.7        Strong negative association between the variables
−0.69 to −0.4       Medium negative association between the variables
−0.39 to −0.2       Weak negative association between the variables
−0.19 to −0.01      No or negligible association between the variables
0.00                No association between the two variables
0.01 to 0.19        No or negligible association between the variables
0.2 to 0.39         Weak positive association between the variables
0.4 to 0.69         Medium positive association between the variables
0.7 to 1.0          Strong positive association between the variables
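The scale in Table 3 amounts to a simple lookup on the size and sign of the coefficient. The sketch below, in Python for illustration only, shows one way to express it. The function name describe_phi is hypothetical, and the cut-offs at 0.2, 0.4, and 0.7 on the absolute value are an assumption that closes the small gaps between the printed bands (for example, between 0.19 and 0.2).

```python
def describe_phi(phi):
    """Return a Table 3 style label for a phi value between -1 and 1."""
    if not -1.0 <= phi <= 1.0:
        raise ValueError("phi must lie between -1 and 1")
    if phi == 0:
        return "No association"
    size = abs(phi)
    if size < 0.2:
        return "No or negligible association"
    if size < 0.4:
        strength = "Weak"
    elif size < 0.7:
        strength = "Medium"
    else:
        strength = "Strong"
    direction = "positive" if phi > 0 else "negative"
    return f"{strength} {direction} association"

print(describe_phi(0.52))  # Medium positive association
print(describe_phi(-0.3))  # Weak negative association
```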

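In our example, the Phi Coefficient value is 0.52, which we can interpret as a medium (positive) association between our variables. The scale in Table 3 describes only the strength of the association; statistical significance is assessed with the underlying Chi-Square statistic, which for a 2 × 2 table equals the sample size multiplied by the squared Phi Coefficient. Here that is 100 × (1100/2100)² ≈ 27.4 with 1 degree of freedom, which corresponds to p < .001. We can therefore reject the H0; in other words, there is a statistically significant association between the two variables. Moreover, the positive sign of the coefficient, together with the contingency table (Table 1), tells us that the association between having a child and owning a pet is a positive association.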
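A strength label on its own does not establish statistical significance. For a 2 × 2 table, the Chi-Square statistic equals n multiplied by the squared Phi Coefficient, with 1 degree of freedom, so a p-value can be recovered from the coefficient and the sample size. The following Python sketch illustrates this under that relationship; the function name phi_significance is hypothetical, and the p-value uses the fact that a Chi-Square variable with 1 degree of freedom is a squared standard normal variable.

```python
import math

def phi_significance(phi, n):
    """Chi-square statistic (1 df) and p-value implied by phi and sample size n."""
    chi_square = n * phi ** 2
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Pet ownership example: phi = 1100 / 2100 from 100 participants
chi_square, p_value = phi_significance(1100 / 2100, 100)
print(round(chi_square, 2))  # 27.44
print(p_value < 0.001)       # True
```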

Assumptions Behind the Method

All statistical tests rely on some underlying assumptions, and all are affected by the type of data that you have. The Phi Coefficient test can be run on its own to test the association between two variables. However, it is typically used as a post-test following a cross-tabulation and a Pearson’s Chi-Squared test, where it adds depth to the analysis by identifying the strength of association between the two variables.

Assumptions of the Phi Coefficient test

• Both variables must be categorical, each with two independent groups.

• There must be independence of observations, so there is no relationship between the groups or between the observations in each group.

• All expected counts should be greater than 1, and no more than 20% of expected counts should be less than 5. In a 2 × 2 table, a single cell is 25% of the cells, so in practice no expected count should be less than 5.


The first and second assumptions are not typically testable from the sample data and are related to the research design. The second assumption is only likely to be violated if the data were sampled by pairs rather than individuals (e.g., couples rather than individual persons). It is important to understand how your data were collected and categorized; this will help you avoid violating the first two assumptions. The third assumption can be tested easily in most statistical software programs.

Illustrative Example: Association Between Sex and Whether Respondent Visited the Dentist in the Last Twelve Months

This example presents a Phi Coefficient analysis using two variables from the 2009 Welsh Health Survey. Specifically, we test whether there is an association between sex and whether the respondent visited the dentist in the last twelve months.

Thus, this example addresses the following research question:

Does visiting the dentist in the last twelve months vary by an individual’s sex?

Stated in the form of a null hypothesis:

H0 = There will be no association between sex and whether the respondent has visited the dentist in the last twelve months.

It should be noted that this hypothesis is two-tailed.

The Data

This example uses a subset of data from the 2009 Welsh Health Survey. This extract includes 16,018 respondents, which is a large sample. It should be noted that the original dataset is larger still, but it has been “cleaned” to include only those who have responded to both our variables. The two variables we examine are:

• Respondent’s sex (sex).

• Whether respondent has visited the dentist in the last twelve months (denbi).

The first variable, respondent’s sex (sex), is coded 1 if male and 2 if female. Whether the respondent has visited the dentist in the last twelve months (denbi) is coded 0 if “no” and 1 if “yes.” We treat both variables as categorical, in line with common practice in social science research. In addition, both variables are binary. First, we should check our data to ensure that no expected counts are less than 5.

Table 4: Contingency Table for Sex and Whether the Respondent Visited the Dentist in the Last Twelve Months.

Whether the respondent has visited
the dentist in the last twelve months     Sex
                                          Male        Female      Total
No       Count                            2,329       2,058       4,387
         Expected count                   2,039.2     2,347.8     4,387.0
Yes      Count                            4,684       6,016       10,700
         Expected count                   4,973.8     5,726.2     10,700.0
Total    Count                            7,013       8,074       15,087
         Expected count                   7,013.0     8,074.0     15,087.0

We can see from Table 4 that no cells have an expected count less than 5, and all the expected counts are greater than 1. In addition, both our variables are categorical with only two groups each; therefore, the Phi Coefficient is appropriate for these data. Usually, the Phi Coefficient is run as a post-test to tell us something about the strength of an association that a Pearson’s Chi-Squared test has identified as significant.
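The expected counts in Table 4 follow from the row and column totals: each expected count is the row total multiplied by the column total, divided by the number of valid cases. The short Python sketch below reproduces them for illustration; the function name expected_counts and the nested-list layout (rows No and Yes, columns Male and Female) are assumptions made here.

```python
def expected_counts(table):
    """Expected counts under no association: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from Table 4 (rows: No, Yes; columns: Male, Female)
observed = [[2329, 2058], [4684, 6016]]
expected = expected_counts(observed)
print([[round(x, 1) for x in row] for row in expected])
# [[2039.2, 2347.8], [4973.8, 5726.2]]
print(min(min(row) for row in expected) >= 5)  # True
```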

Analysing the Data

Before conducting the Phi Coefficient test, we should first examine each variable in isolation. We start by presenting a frequency distribution of sex in Table 5, which shows that there are slightly more females (53.7%) than males (46.3%) in the sample.

Table 5: Frequency Distribution of Sex.

                   Frequency    Percent    Valid percent    Cumulative percent
Valid    Male      7,412        46.3       46.3             46.3
         Female    8,606        53.7       53.7             100.0
         Total     16,018       100.0      100.0

Table 6 shows the frequency distribution of denbi. Just under a third of the respondents who answered (29.1%) did not visit a dentist in the last twelve months, while 70.9% did. It should be noted that 931 respondents (5.8% of the sample) did not answer the question.

Table 6: Frequency Distribution of denbi.

                                Frequency    Percent    Valid percent    Cumulative percent
Valid      No                   4,387        27.4       29.1             29.1
           Yes                  10,700       66.8       70.9             100.0
           Total                15,087       94.2       100.0
Missing    No answer/refused    931          5.8
Total                           16,018       100.0

Tables 5 and 6 show the distribution of each of these variables by itself, but they cannot tell us whether the two variables are associated.

Calculating the Phi Coefficient and Conducting the Phi Coefficient Test

Tables 7 and 8 present the results of the Phi Coefficient analysis. Table 7 presents the Chi-Square test result; this statistic also underlies the Phi Coefficient. The result is statistically significant at p < .001 (shown as .000 in the table), so we conclude that the variables are associated with each other.

Table 7: Results of the Phi Coefficient Analysis: The Chi-Squared Result.

                      Value      df    Asymptotic significance
Pearson chi-square    108.477    1     .000
N of valid cases      15,087

Table 8: Results of the Phi Coefficient Analysis: Phi Coefficient Measure and Test.

                             Value     Approximate significance
Nominal by nominal    Phi    0.085     .000
N of valid cases             15,087
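The values in Tables 7 and 8 can be recovered from the observed counts in Table 4. The Python sketch below is an illustration only, not the guide's software procedure: it computes Pearson’s Chi-Square statistic without a continuity correction (an assumption that matches the reported value) and then takes the size of the Phi Coefficient as the square root of the Chi-Square statistic divided by the number of valid cases. The function name chi_square_and_phi is hypothetical.

```python
import math

def chi_square_and_phi(table):
    """Pearson chi-square (no continuity correction) and |phi| for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / n
            chi_square += (observed_count - expected_count) ** 2 / expected_count
    return chi_square, math.sqrt(chi_square / n)

# Observed counts from Table 4 (rows: No, Yes; columns: Male, Female)
observed = [[2329, 2058], [4684, 6016]]
chi_square, phi = chi_square_and_phi(observed)
print(round(chi_square, 2), round(phi, 3))  # 108.48 0.085
```

Because the square root discards the sign, the direction of the association still has to be read from the table or from Equation 1.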

Table 8 presents the Phi Coefficient value, which is 0.085; on the scale in Table 3, this indicates a negligible positive association between the variables. However, the p-value (p < .001) indicates a highly statistically significant association. Given the very low Phi Coefficient, we need to treat the p-value with caution: p-values are very sensitive to sample size, and in large samples (like our example) they often flag very small differences as significant. Strictly speaking, the test rejects the null hypothesis, because the association can be distinguished from zero. However, because the Phi Coefficient measures strength independently of sample size, we use it as the basis for our substantive conclusion: the association between sex and whether the respondent visited the dentist in the last twelve months is too weak to be of practical importance.
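The sensitivity of the p-value to sample size can be made concrete by holding the Phi Coefficient at 0.085 and varying only the number of cases. The Python sketch below is illustrative; the function name p_value_for_phi and the sample sizes 100 and 1,000 are hypothetical choices, and the calculation relies on the 2 × 2 relationship in which the Chi-Square statistic (1 degree of freedom) equals n multiplied by the squared Phi Coefficient.

```python
import math

def p_value_for_phi(phi, n):
    """p-value of the 1-df chi-square test implied by phi in a sample of size n."""
    return math.erfc(math.sqrt(n * phi ** 2 / 2))

# Same strength of association, different sample sizes
for n in (100, 1000, 15087):
    print(n, p_value_for_phi(0.085, n))
```

With 100 cases the same coefficient is far from significant at the .05 level, whereas with 15,087 cases the p-value is far below .001, even though the strength of association has not changed.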

Presenting Results

A Phi Coefficient test can be reported as follows:

“We used a subset of data from the 2009 Welsh Health Survey dataset to measure and test the association between sex and whether the respondent visited the dentist in the last twelve months. We tested the following null hypothesis:

H0 = There is no association between sex and whether the respondent has visited the dentist in the last twelve months.

The data included 16,018 adult respondents, of whom 15,087 answered the question about visiting the dentist. The association between sex and whether the respondent visited the dentist in the last twelve months was statistically significant but negligible in strength, φ = 0.085, p < .001. Although the null hypothesis of no association is formally rejected, the association is too weak to be substantively meaningful.”

Review

The Phi Coefficient is a statistical measure used to evaluate the strength of association between two dichotomous variables.

You should know:

• What types of variables are suited for a Phi Coefficient test.

• The basic assumptions underlying this statistical test.

• How to compute and interpret a Phi Coefficient test.

• How to report the results of a Phi Coefficient test.

Your Turn

You can download this sample dataset along with a guide showing how to produce a Phi Coefficient test using statistical software. The sample dataset also includes another variable called teethbi, which relates to how many teeth the respondent has. See whether you can reproduce the results presented here for the sex variable, and then try producing your own Phi Coefficient analysis substituting teethbi for sex in the analysis.
