
5/24/2023 P5510 – Coding 1

Coding Schemes

More than you ever wanted to know about comparing group means.

Data analysis involves two general activities . . .

1. Comparing means, asking whether two or more means are “significantly” different or not.

2. Assessing relations, asking whether values of one variable vary systematically with values of some other variable or group of variables.

There are a few other things that we do, but these two activities, comparing means and assessing relations, make up the majority of our data analytic efforts.

Historical aside – these two activities correspond to the lab vs field distinction in psychology.

As we learned last semester, comparing means is really assessing a specific kind of relation – that between values of a variable and group membership.

So, this means that data analysis actually involves only one thing: assessing relations between variables, typically between dependent and independent variables.

Reviewing the details of converting comparisons of means into assessments of relations

This involves creating a variable that represents group membership. Each value of that variable corresponds to a different group. Then regress the dependent variable onto that newly created variable.

With two groups, that’s easy. Just create a variable with ANY two values: 1 and 2, 0 and 1, or -1 and +1. We know which group a person is in by his/her value of this group coding variable.

We then regress the dependent variable onto this newly created variable.

If the correlation is 0, that means the dependent variable is NOT related to group membership.

If the correlation is not 0, then that means the dependent variable IS related to group membership.


Simple Example of moving from means comparison to regression/correlation

Comparing Means the traditional way . . .

Independent Samples Test (dependent variable: exam)

Levene's Test for Equality of Variances: F = 1.943, Sig. = .174

t-test for Equality of Means

                                                 Sig.         Mean         Std. Error   95% CI of the Difference
                               t        df       (2-tailed)   Difference   Difference   Lower      Upper
Equal variances assumed        -3.978   28       .000         -4.911       1.234        -7.439     -2.382
Equal variances not assumed    -4.019   27.997   .000         -4.911       1.222        -7.414     -2.408


Mean of group 1 (condit = 1) = 72.71

Mean of group 2 (condit = 2) = 77.62

The same analysis as the assessment of a relation

The variable condit represents group membership: 1 = group 1; 2 = group 2. I assess the correlation of exam scores with condit scores . . .

Correlations

                                 condit    exam
condit   Pearson Correlation     1         .601**
         Sig. (2-tailed)                   .000
         N                       30        30
exam     Pearson Correlation     .601**    1
         Sig. (2-tailed)         .000
         N                       30        30

**. Correlation is significant at the 0.01 level (2-tailed).

Using the relationship to compare means

Early on, statisticians discovered a formula that would convert the correlation between a dependent variable and group membership into a test statistic. Here it is . . .

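t = r / sqrt((1 - r^2) / (n - 2)) = .601 / sqrt(.639 / 28) = .601 / .151 = 3.979

Apart from its sign, which depends only on which group gets the larger code, this is the t of -3.978 from the independent samples test.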
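The conversion can be checked with a few lines of arithmetic. The sketch below is written in Python purely for illustration (the handout itself uses SPSS and Rcmdr), and the function name `t_from_r` is made up for this example.

```python
import math

def t_from_r(r, n):
    """Convert the correlation between a DV and a two-valued group code
    into a t statistic with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Values from the Correlations table: r = .601, N = 30
t = t_from_r(0.601, 30)
print(round(t, 3))  # about 3.979
```

The sign of t follows the sign of r, so recoding the groups as 2 and 1 instead of 1 and 2 would flip both.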

So this result gives hope that perhaps there might be a way to incorporate ALL KINDS OF GROUP MEAN COMPARISONS into regression.

The problem is that many times, we’ll want to compare the means of MORE THAN TWO groups.

Long story short (skipping the history of the development of group coding variables) . . . .

There is a way to represent membership in any number of groups so that the relation between such membership and other variables can be assessed.

The method involves creating what are called Group Coding Variables. Thanks, mathematicians.

Group Coding Variable

A variable added to a data matrix that represents group membership in such a way that the membership can be included in correlation or regression analyses.

If your interest is simply in the relationship, use a correlation analysis.

If your interest is in prediction or explanation, use regression analysis.



How many group coding variables should we have?

Hmm. If you have two groups, as in the above example, it appears that only one group coding variable would be required for two groups.

Mathematicians have determined that for more than 2 groups, you need more than 1 group coding variable. Specifically, the number of variables needed is determined by the number of possible unique differences between group means.

If you have three groups, you have 3 means. How many unique differences are possible?

Three total are possible: A-B A-C and B-C

Someone might say, “Since there are three total differences, they must all be unique.”

In fact, however, one of those differences is perfectly predictable from the other two.

For example, if A-B = 3 and A-C = 5, then B-C must be 2.

A-B = 3 implies that B = A-3
A-C = 5 implies that C = A-5
This means that B-C = (A-3) - (A-5) = A-3-A+5 = 2

So, there are only two unique differences between the means of three groups.

It can be shown that the number of unique differences will always be 1 less than the number of groups.

This leads to the following rule: If we have K groups, we’ll need K-1 group coding variables to represent all the differences between them.

So, in fact, the number of group coding variables needed to represent all the unique differences between K groups is always K-1.
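As a small illustration of the K-1 rule, the sketch below rebuilds every pairwise difference among K group means from just the K-1 differences involving the first group. It is written in Python for illustration only; the function name and the 0-based group indices are assumptions of this example, not part of the handout.

```python
from itertools import combinations

def all_pairwise_from_first(diffs_from_first):
    """diffs_from_first[i] = (mean of first group) - (mean of group i+1).
    Returns {(i, j): mean_i - mean_j} for every pair of the K groups."""
    # offsets[k] = mean of first group minus mean of group k (0 for the first group)
    offsets = [0] + list(diffs_from_first)
    return {(i, j): offsets[j] - offsets[i]
            for i, j in combinations(range(len(offsets)), 2)}

# A-B = 3 and A-C = 5, with A, B, C indexed 0, 1, 2
pairs = all_pairwise_from_first([3, 5])
print(pairs)  # {(0, 1): 3, (0, 2): 5, (1, 2): 2}
```

Three differences come out, but only two numbers went in, which is the sense in which only K-1 of them are unique.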



“Do we really need to know this?” you might ask.

Good question.

Actually, the real question is, “Do I really need to do all of my analyses using multiple regression?”, because group coding exists for the sake of multiple regression.

The answer is the Amazon.com or Walmart answer: multiple regression is an all-purpose analytic strategy, the Amazon.com or Walmart Supercenter of statistical techniques. It allows you to analyze both quantitative and qualitative independent variables at the same time.

You can use t-tests and the analysis of variance formulas to compare means, but you can’t include quantitative variables with t-tests and with analysis of variance. Multiple regression with group coding variables lets you do it all.

So, yes, we do need to have some familiarity with group coding schemes so that we can use MR for all of our analyses. OK, so the computer programmers can use MR techniques for us.

It turns out that many of the procedures in various statistical analyses found in SPSS and other statistical packages are based on multiple regression techniques, so being able to understand group coding schemes helps you take advantage of those procedures. These procedures are

REGRESSION
GLM
LOGISTIC REGRESSION (Regression with categorical dependent variables)
COX REGRESSION (Survival analysis)
Comparison of means using SEM techniques
Analysis of nonlinear relationships

These procedures are fundamental statistical procedures.

They all involve assessing relations.



Types of coding schemes – ways of creating group coding variables that allow you to compare means in Regression.

There are several ways of representing group membership. Each way is called a coding scheme.

The coding schemes are distinguished by the comparisons associated with each of the group-coding variables.

Dummy Variable Coding.

Each group-coding variable compares the mean of a group with the mean of a reference group.

Each group coding variable is made up of 0’s and 1’s.

Effects Coding.

Each group-coding variable compares the mean of a group with the unweighted mean of the means of all the groups.

Each group-coding variable is made up of 0’s, 1’s, and -1’s.

Contrast Coding.

Each group-coding variable compares the mean of one user-chosen set of groups with the mean of a second user-chosen set of groups.

Helmert Coding

Difference Coding

There are others.


Dummy Variable Coding (called Indicator Coding in SPSS)

(Yes, we covered this in PSY 5130. This is a review.)

Two groups

      DV1   Y
G1    1     Ys
G2    0     Ys    (G2 is called the reference group.)

The t value associated with the single group-coding variable, DV1, compares X-barG1 with X-barG2.

Three Groups

      DV1   DV2   Y
G1    1     0     Ys
G2    0     1     Ys
G3    0     0     Ys    (G3 is the reference group here.)

The t-value associated with the 1st group-coding variable, DV1, compares X-barG1 with X-barG3.
The t-value associated with the 2nd group-coding variable, DV2, compares X-barG2 with X-barG3.

Four Groups

      DV1   DV2   DV3   Y
G1    1     0     0     Ys
G2    0     1     0     Ys
G3    0     0     1     Ys
G4    0     0     0     Ys    (G4 is the reference group.)

The t-values compare means of G1, G2, or G3 with mean of G4.

Five Groups

      DV1   DV2   DV3   DV4   Y
G1    1     0     0     0     Ys
G2    0     1     0     0     Ys
G3    0     0     1     0     Ys
G4    0     0     0     1     Ys
G5    0     0     0     0     Ys    (G5 is the reference group.)

The t-values compare means of G1, G2, G3, G4 with mean of G5.

SPSS does this automatically in many procedures. It’s called INDICATOR coding by SPSS.
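The pattern in the tables above is mechanical enough to generate. The following is a minimal Python sketch, not SPSS’s own routine; the function name `dummy_codes` and the choice of the last group as the default reference are assumptions made for this illustration.

```python
def dummy_codes(n_groups, reference=None):
    """Return {group: [DV1, ..., DV(K-1)]} of 0/1 codes for groups 1..K.
    The reference group (default: the last group) is coded all 0's."""
    if reference is None:
        reference = n_groups
    coded = [g for g in range(1, n_groups + 1) if g != reference]
    return {g: [1 if g == c else 0 for c in coded]
            for g in range(1, n_groups + 1)}

for group, row in dummy_codes(4).items():
    print(f"G{group}", row)
# G1 [1, 0, 0]
# G2 [0, 1, 0]
# G3 [0, 0, 1]
# G4 [0, 0, 0]
```

Passing a different `reference` moves the all-zero row to that group, which changes which comparisons the t-values make.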


Regression Comparing Means of 3 Groups Using Dummy Variable Coding

The dependent variable is Job Satisfaction (JS) being compared across 3 jobs.

JS    JOB   DC1   DC2
 6    1     1     0
 7    1     1     0
 8    1     1     0
11    1     1     0
 9    1     1     0
 7    1     1     0
 7    1     1     0
 5    2     0     1
 7    2     0     1
 8    2     0     1
 9    2     0     1
10    2     0     1
 8    2     0     1
 9    2     0     1
 4    3     0     0
 3    3     0     0
 6    3     0     0
 5    3     0     0
 7    3     0     0
 8    3     0     0
 2    3     0     0

The following was NOT gotten from the Regression procedure. It was obtained from one of the procedures that lets you compute summary statistics for subgroups, such as the Means procedure.

The inability to report summary statistics for the groups whose means are being compared is an annoyance associated with using regression to assess the statistical significance of the difference between those means.


The group coded with all 0’s on the two dummy variables is the reference group.

The group coded with all 1’s on dummy variable 1 is represented by dummy variable 1 in the regression analysis below.

The group coded with all 1’s on dummy variable 2 is represented by dummy variable 2 in the regression analysis below.

Report: JS

JOB               Mean    N     Std. Deviation
1 Clerks          7.86    7     1.68
2 Receptionist    8.00    7     1.63
3 Mailroom        5.00    7     2.16
Total             6.95    21    2.25

Output of the Oneway procedure (an ANOVA procedure)

ANOVA: JS

                  Sum of Squares   df    Mean Square   F       Sig.
Between Groups    40.095           2     20.048        5.930   .011
Within Groups     60.857           18    3.381
Total             100.952          20

Output of the Regression procedure

Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       DC2, DC1(a)         .                   Enter

a. All requested variables entered.
b. Dependent Variable: JS

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .630(a)   .397       .330                1.84

a. Predictors: (Constant), DC2, DC1

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    40.095           2     20.048        5.930   .011(a)
Residual      60.857           18    3.381
Total         100.952          20

a. Predictors: (Constant), DC2, DC1
b. Dependent Variable: JS

What does the F tell us?

It tests the significance of the collection of differences between the means.

If the p-value is <= .05, then we can conclude that there are significant differences between the means.

If that’s all we’re interested in, we can stop.

The ANOVA F in Regression analysis is equal to the F test from One way analysis of variance.

This F, often the only statistic that a researcher is interested in, is an omnibus test, comparing all the means at once.
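The sums of squares and the omnibus F in both ANOVA tables can be recomputed directly from the 21 job satisfaction scores. The sketch below uses plain Python for illustration; the function name `oneway_f` is an assumption of this example and is not an SPSS or R function.

```python
js = {
    1: [6, 7, 8, 11, 9, 7, 7],   # Clerks
    2: [5, 7, 8, 9, 10, 8, 9],   # Receptionists
    3: [4, 3, 6, 5, 7, 8, 2],    # Mailroom
}

def oneway_f(groups):
    """Return (SS between, SS within, F, R squared) for a one-way layout."""
    scores = [y for ys in groups.values() for y in ys]
    grand = sum(scores) / len(scores)
    means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    ss_between = sum(len(ys) * (means[g] - grand) ** 2 for g, ys in groups.items())
    ss_within = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return ss_between, ss_within, f, r_squared

ssb, ssw, f, r2 = oneway_f(js)
print(round(ssb, 3), round(ssw, 3), round(f, 2), round(r2, 3))
```

The between-groups sum of squares is the regression sum of squares, and its share of the total is the R Square of .397 in the Model Summary.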


Other information from the regression analysis involving dummy variables.

Side dish comparisons.

Each Dummy variable represents a comparison of a group with the reference group.

The group coded with all 0’s is the reference group.

The group coded with 1’s on a variable is the group compared with the reference group by that variable.

What about Post Hoc comparisons?

Fagettaboutit. Post Hoc comparisons are not available from the regression analysis comparison of means.

As we’ll see, SPSS’s GLM procedure uses the regression analysis procedure, but GLM also uses auxiliary formulas to give us post hocs. So GLM uses a combination of regression procedures and special group comparison formulas developed for post hocs.


p-value for comparison of the group coded with 1’s on DC1 with the reference group.

p-value for comparison of the group coded with 1’s on DC2 with the reference group.

Each B equals the difference between a specific group mean and the reference group mean.

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t       Sig.
(Constant)    5.000     .695                                            7.194   .000
DC1           2.857     .983                .614                        2.907   .009
DC2           3.000     .983                .645                        3.052   .007

a. Dependent Variable: JS
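To see that each B really is a difference from the reference group mean, the regression can be refit by hand. The sketch below solves the normal equations in plain Python; the helper `ols` is a hypothetical stand-in written for this illustration and is not how SPSS or R computes the fit internally.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y.
    rows holds the predictor values per case; an intercept column is added here."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

js = [6, 7, 8, 11, 9, 7, 7, 5, 7, 8, 9, 10, 8, 9, 4, 3, 6, 5, 7, 8, 2]
job = [1] * 7 + [2] * 7 + [3] * 7
dummy = {1: (1, 0), 2: (0, 1), 3: (0, 0)}  # job 3 (Mailroom) is the reference group

b0, b1, b2 = ols([dummy[j] for j in job], js)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # 5.0 2.857 3.0
```

The constant is the Mailroom mean (5.00), and the two slopes are 7.86 - 5.00 and 8.00 - 5.00, matching the Coefficients table.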

The same example using Rcmdr

R -> Load packages -> Rcmdr
Data -> Import Data -> from text file, clipboard or URL . . .

View Data set

This is the dialog box that appears whenever you import data from a CSV file using Rcmdr.

Statistics -> Fit Models -> Linear regression . . .

> GroupCodingExamples2 <-
+   read.table("G:/MDBO/html2/p5510/Data Files/groupcodingJSexamples.csv",
+   header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

> RegModel.2 <- lm(js~dc1+dc2, data=GroupCodingExamples2)

> summary(RegModel.2)

Call:
lm(formula = js ~ dc1 + dc2, data = GroupCodingExamples2)

Residuals:
   Min     1Q Median     3Q    Max
-3.000 -1.000  0.000  1.000  3.143

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0000     0.6950   7.194 1.07e-06 ***
dc1           2.8571     0.9828   2.907  0.00940 **
dc2           3.0000     0.9828   3.052  0.00686 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.839 on 18 degrees of freedom
Multiple R-squared:  0.3972,  Adjusted R-squared:  0.3302
F-statistic:  5.93 on 2 and 18 DF,  p-value: 0.01051

Since this is just a multiple regression analysis, both SPSS AND RCMDR MUST give the same answer. If not, we are in big trouble.



Effect Coding (could be called Effects Coding)

Two groups

      EV1
G1    1
G2    -1

Three Groups

      EV1   EV2
G1    1     0
G2    0     1
G3    -1    -1    (G3 is the reference group here.)

The t-value associated with EV1 compares X-barG1 with the mean of all 3 group means.
The t-value associated with EV2 compares X-barG2 with the mean of all 3 group means.

Four Groups

      EV1   EV2   EV3
G1    1     0     0
G2    0     1     0
G3    0     0     1
G4    -1    -1    -1    (G4 is the reference group.)

The t-value for each EV compares a group mean with the mean of all group means.

Five Groups

      EV1   EV2   EV3   EV4
G1    1     0     0     0
G2    0     1     0     0
G3    0     0     1     0
G4    0     0     0     1
G5    -1    -1    -1    -1    (G5 is the reference group.)

The t-value for each EV compares a group mean with the mean of all group means.

SPSS calls this coding DEVIATION coding.
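Effect codes follow the same pattern as dummy codes, except that the last group gets -1 on every variable. A minimal Python sketch of that pattern follows; the function name `effect_codes` is an assumption of this example, and it always treats the last group as the -1 group.

```python
def effect_codes(n_groups):
    """Return {group: [EV1, ..., EV(K-1)]} for groups 1..K.
    Groups 1..K-1 get a single 1; the last group is coded -1 throughout."""
    k = n_groups
    codes = {g: [1 if g == c else 0 for c in range(1, k)] for g in range(1, k)}
    codes[k] = [-1] * (k - 1)
    return codes

codes = effect_codes(4)
for group, row in codes.items():
    print(f"G{group}", row)

# With one row per group, every column sums to zero
column_sums = [sum(col) for col in zip(*codes.values())]
print(column_sums)  # [0, 0, 0]
```

The zero column sums are what make the regression constant equal to the unweighted mean of the group means rather than the mean of one reference group.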


Regression Comparing Means of 3 Groups Using Effects Coding

JS    JOB   EC1   EC2
 6    1     1     0
 7    1     1     0
 8    1     1     0
11    1     1     0
 9    1     1     0
 7    1     1     0
 7    1     1     0
 5    2     0     1
 7    2     0     1
 8    2     0     1
 9    2     0     1
10    2     0     1
 8    2     0     1
 9    2     0     1
 4    3     -1    -1
 3    3     -1    -1
 6    3     -1    -1
 5    3     -1    -1
 7    3     -1    -1
 8    3     -1    -1
 2    3     -1    -1


The group coded with all -1’s on the two Effects variables is the reference group. It has no group-coding variable of its own, so it’s basically ignored in the following analyses.

The group coded with 1’s on Effects variable 1 is represented by Effects variable 1 in the regression analysis below.

The group coded with 1’s on Effects variable 2 is represented by Effects variable 2 in the regression analysis below.

Again, since the REGRESSION procedure does not recognize groups, I used a different procedure to get group means.

Report: JS

JOB               Mean    N     Std. Deviation
1 Clerks          7.86    7     1.68
2 Receptionist    8.00    7     1.63
3 Mailroom        5.00    7     2.16
Total             6.95    21    2.25

Output of the Oneway analysis of variance procedure

ANOVA: JS

                  Sum of Squares   df    Mean Square   F       Sig.
Between Groups    40.095           2     20.048        5.930   .011
Within Groups     60.857           18    3.381
Total             100.952          20

Output of the Regression procedure

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .630(a)   .397       .330                1.84

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    40.095           2     20.048        5.930   .011(a)
Residual      60.857           18    3.381
Total         100.952          20

a. Predictors: (Constant), EC2, EC1

The F tests the significance of the collection of differences between the means.

The ANOVA F in Regression analysis is equal to the F test from One way analysis of variance.

To reiterate, this is the omnibus test of “Are there ANY differences in the means?” that many researchers are interested in.

Why are the B values (and their ts) for the Dummy and Effect coding different?





The Effect Coding side dishes.

p-value for comparison of the group coded with 1’s on EC1 with the mean of means of all three groups.

p-value of the comparison of the group coded with 1’s on EC2 with the mean of means of all three groups.

Each B equals the difference between the mean of a group and the mean of the means of all groups.


Dummy coding compares each mean to the reference group mean.

So, dummy variables coding tells us which means are different from the reference group.

Effect coding compares each mean to the mean of all means.

So, effects coding tells us which means are “extreme” – different from the mean of all the groups.

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t        Sig.
(Constant)    6.952     .401                                            17.327   .000
EC1           .905      .567                .337                        1.594    .128
EC2           1.048     .567                .390                        1.846    .081

a. Dependent Variable: JS
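The Effect coding B values can be reproduced from the group means alone, which also shows why they differ from the Dummy coding values. The sketch below is plain Python arithmetic written for illustration; it does not refit the regression, and the variable names are made up for this example.

```python
scores = {
    1: [6, 7, 8, 11, 9, 7, 7],   # Clerks, coded 1 on EC1
    2: [5, 7, 8, 9, 10, 8, 9],   # Receptionists, coded 1 on EC2
    3: [4, 3, 6, 5, 7, 8, 2],    # Mailroom, coded -1 on both
}
means = {job: sum(ys) / len(ys) for job, ys in scores.items()}
mean_of_means = sum(means.values()) / len(means)  # unweighted mean of the group means

constant = mean_of_means
b_ec1 = means[1] - mean_of_means
b_ec2 = means[2] - mean_of_means
implied_job3 = means[3] - mean_of_means           # not printed by regression; equals -(b_ec1 + b_ec2)

print(round(constant, 3), round(b_ec1, 3), round(b_ec2, 3), round(implied_job3, 3))
# 6.952 0.905 1.048 -1.952
```

Under Dummy coding the same two groups are compared with the Mailroom mean of 5.00 instead of 6.952, so the B values are 2.857 and 3.000 there.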

The same example using Rcmdr

Statistics -> Fit Models -> Linear regression . . .

> RegModel.1 <- lm(js~ec1+ec2, data=Dataset2)

> summary(RegModel.1)

Call:
lm(formula = js ~ ec1 + ec2, data = Dataset2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9524     0.4012  17.327 1.13e-12 ***
ec1           0.9048     0.5674   1.594   0.1282
ec2           1.0476     0.5674   1.846   0.0814 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As was the case with Dummy Coding, we should get EXACTLY the same result from RCMDR as we got from SPSS, since it’s just regression.


Contrast Coding (new information)

Contrast coding involves the creation of K-1 comparisons by the analyst to assess whatever specific comparisons he/she may have in mind.

The problem with Dummy and Effects coding is that the specific comparisons made in them – each group vs. the reference group or each group mean vs. mean of all means may not be the comparisons desired by an analyst.

What if you have a specific comparison that you’d like to make, and it’s not one of those afforded by the Dummy Coding or Effect Coding schemes? Contrast Coding to the rescue.

In contrast coding, each specific comparison compares the mean of one collection of groups with the mean of a second collection of groups, each collection chosen by the analyst. Some of the groups can be left out of the comparison.

Examples first, then a how-to.

The general procedure is that you pick the specific comparisons you’re interested in. Then you make up additional comparisons so that you end up with K-1 comparisons.

2 Groups. Only one comparison is possible.

      C1     Y
G1    +.5    Ys
G2    -.5    Ys

C1 compares the mean of one group with the mean of the other. Boring!! The result would be the same as we would obtain using Dummy or Effects coding.

3 Groups. Suppose you wanted 1) to compare the mean of G1 with the mean of all the persons in G2 and G3

and 2) to compare the mean of G2 with the mean of G3, leaving out G1.

      C1      C2     Y
G1    .67     0      Ys
G2    -.33    .5     Ys
G3    -.33    -.5    Ys

(I don’t think that there are any other ways to contrast for 3 groups, other than switching Gs.)

C1 compares the mean of G1 with the mean of G2 and G3, treated as one group. So the t value for C1 will tell us whether there is a significant difference between the mean of G1 and the combined mean of G2 and G3.

C2 compares the mean of G2 with the mean of G3, ignoring G1. So the t value for C2 in the regression will tell us whether there is a significant difference between the mean of G2 and the mean of G3, ignoring G1 completely.

We may not be interested in this last comparison, but we have to create it so that we’ll end up with K-1 comparisons.


4 Groups. Suppose you wanted 1) to compare the mean of G1 with the mean of G2, G3, and G4,

2) to compare the mean of G2 with the mean of G3 and G4, omitting G1, and

3) to compare the mean of G3 with the mean of G4, omitting G1 and G2.

      C1      C2      C3
G1    .75     0       0
G2    -.25    .67     0
G3    -.25    -.33    .5
G4    -.25    -.33    -.5

C1 compares the mean of G1 with the mean of G2, G3 and G4, treated as one group.

This is the comparison we’re interested in. Alas, we have to create K-1 comparisons total.

C2 compares the mean of G2 with the mean of G3 and G4, treated as one group

C3 compares the mean of G3 with the mean of G4 leaving out G1 and G2.

We have to add C2 and C3 to the data set, even though we may not be interested in the results, just so that we have a total of K-1 comparisons.

Creating coefficients.

Rule

Create 3 collections of groups: Collection 1, Collection 2, and Collection 3.

Collection 1 will be the group(s) whose mean performance is expected to be highest.
Collection 2 will be the group(s) whose mean performance is expected to be lowest.
Collection 3 will be the group(s) that are not involved in this specific comparison.

Collection 1 Group(s) coefficient:

Number of groups in Collection 2 / Number of groups in Collections 1 and 2 combined.

Collection 2 Group(s) coefficient:

Minus Number of groups in Collection 1 / Number of groups in Collections 1 and 2 combined.

Collection 3 Group(s) coefficient:

Coefficients assigned to all groups in it are 0.

First, decide what specific comparison you wish to make. If you can’t decide, don’t do contrast coding – in the words of Jack Nicholson, “You can’t handle the contrast coding!!”.


Comparing 3 Group Means Using Contrast Coding

ID    JS    JOB   CC1      CC2
 1     6    1     .667     .000
 2     7    1     .667     .000
 3     8    1     .667     .000
 4    11    1     .667     .000
 5     9    1     .667     .000
 6     7    1     .667     .000
 7     7    1     .667     .000
 8     5    2     -.333    .500
 9     7    2     -.333    .500
10     8    2     -.333    .500
11     9    2     -.333    .500
12    10    2     -.333    .500
13     8    2     -.333    .500
14     9    2     -.333    .500
15     4    3     -.333    -.500
16     3    3     -.333    -.500
17     6    3     -.333    -.500
18     5    3     -.333    -.500
19     7    3     -.333    -.500
20     8    3     -.333    -.500
21     2    3     -.333    -.500

CC1 compares mean of Job 1 with the mean of Jobs 2 and 3.

CC2 compares the Mean of Job 2 with the mean of Job 3.


The rule for forming a contrast variable between two sets of groups is

1st set value = No. of groups in 2nd set / Total no. of groups in the two sets.

2nd set value = - No. of groups in 1st set / Total no. of groups in the two sets.

3rd value = 0 for all groups to be excluded.

So, 1st set Value of CC1 = 2 / 3 = .667

2nd set Value of CC1 = -1 / 3 = -.333

3rd set Value of CC1 = NA, since there is no 3rd set for CC1

1st set Value of CC2 = 1 / 2 = .5

2nd set Value of CC2 = -1 / 2 = -.5

3rd set Value of CC2 = 0 to exclude Job 1.

General Rule: V / (U+V) vs. -U / (U+V), where U is the number of groups in the 1st set and V is the number of groups in the 2nd set.
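The general rule translates directly into a short function. The Python sketch below is illustrative only; `contrast_coefficients` and its argument names are assumptions of this example, not something SPSS or R provides under that name.

```python
def contrast_coefficients(set1, set2, all_groups):
    """Coefficient for each group when comparing set1 with set2.
    With U = len(set1) and V = len(set2): groups in set1 get V/(U+V),
    groups in set2 get -U/(U+V), and every other group gets 0."""
    u, v = len(set1), len(set2)
    coefs = {}
    for g in all_groups:
        if g in set1:
            coefs[g] = v / (u + v)
        elif g in set2:
            coefs[g] = -u / (u + v)
        else:
            coefs[g] = 0.0
    return coefs

jobs = [1, 2, 3]
cc1 = contrast_coefficients({1}, {2, 3}, jobs)   # Job 1 vs. Jobs 2 and 3
cc2 = contrast_coefficients({2}, {3}, jobs)      # Job 2 vs. Job 3, Job 1 excluded
print({g: round(c, 3) for g, c in cc1.items()})  # {1: 0.667, 2: -0.333, 3: -0.333}
print({g: round(c, 3) for g, c in cc2.items()})  # {1: 0.0, 2: 0.5, 3: -0.5}
```

With this scaling the two nonzero values of a contrast always differ by exactly 1, which is why the B for the contrast can be read as a difference between means.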

Report: JS

JOB               Mean    N     Std. Deviation
1 Clerks          7.86    7     1.68
2 Receptionist    8.00    7     1.63
3 Mailroom        5.00    7     2.16
Total             6.95    21    2.25

Output of Oneway ANOVA procedure

ANOVA: JS

                  Sum of Squares   df    Mean Square   F       Sig.
Between Groups    40.095           2     20.048        5.930   .011
Within Groups     60.857           18    3.381
Total             100.952          20

Output of Regression procedure

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .630(a)   .397       .330                1.84

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    40.095           2     20.048        5.930   .011(a)
Residual      60.857           18    3.381
Total         100.952          20

a. Predictors: (Constant), CC2, CC1

The F tests the significance of the collection of differences between the means.

The ANOVA F in Regression analysis is equal to the F test from One way analysis of variance.





p-value for the comparison represented by the 1st Contrast variable.

p-value for the comparison represented by the 2nd Contrast variable.

Each B equals or is proportionate to the comparisons represented by the contrast coded variable.

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t        Sig.
(Constant)    6.952     .401                                            17.326   .000
CC1           1.357     .851                .292                        1.594    .128
CC2           3.000     .983                .559                        3.052    .007

a. Dependent Variable: JS
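The contrast-coded regression can also be refit by hand to confirm what each B represents. The Python sketch below solves the normal equations directly; `ols` is a hypothetical helper written for this illustration, and the codes are the rounded values (.667, -.333) used in the data set.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y.
    rows holds the predictor values per case; an intercept column is added here."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

js = [6, 7, 8, 11, 9, 7, 7, 5, 7, 8, 9, 10, 8, 9, 4, 3, 6, 5, 7, 8, 2]
job = [1] * 7 + [2] * 7 + [3] * 7
codes = {1: (0.667, 0.0), 2: (-0.333, 0.5), 3: (-0.333, -0.5)}  # (CC1, CC2) as entered

b0, b_cc1, b_cc2 = ols([codes[j] for j in job], js)
print(round(b0, 4), round(b_cc1, 4), round(b_cc2, 4))
```

B for CC1 is the Job 1 mean minus the average of the Job 2 and Job 3 means (7.857 - 6.5), and B for CC2 is 8 - 5. Because the rounded codes do not sum exactly to zero, the constant comes out a shade below the 6.9524 obtained with Effect coding; exact codes of 2/3 and -1/3 would give 6.9524.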

The same example using Rcmdr

Statistics -> Fit Models -> Linear regression . . .

Call:
lm(formula = js ~ cc1 + cc2, data = GCwithContrasts)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9519     0.4012  17.326 1.13e-12 ***
cc1           1.3571     0.8512   1.594  0.12824
cc2           3.0000     0.9828   3.052  0.00686 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These values MUST BE the same as those obtained using SPSS.

Example: Suppose we have 5 groups. Thanks to Brittany Sentell for suggesting this.

The effect of background information on salary offers

G1: Evaluations of a job applicant with no history of arrest.
G2: Evaluations of a job applicant with an arrest for DUI.
G3: Evaluations of a job applicant with an arrest for Marijuana possession.
G4: Evaluations of a job applicant with an arrest for domestic violence.
G5: Evaluations of a job applicant with an arrest for assault in a bar fight.

Dependent variable is salary offered to the employee.

1. Our first interest is in comparing the mean salary offered to persons without a history of arrest with those with some kind of history.

Collection 1: G1
Collection 2: G2, G3, G4, G5
Collection 3: Empty

Collection 1 Group(s) coefficient:

Number of groups in Collection 2 / Number of groups in both collections.

For the example: 4 / (4+1) = 4/5 = 0.80

Collection 2 Group(s) coefficient:

Minus Number of groups in Collection 1 / Number of groups in both collections.

For the example: -1 / (4+1) = -1/5 = -0.20

Collection 3 Group(s) coefficient:

Coefficients assigned to all groups in it are 0.

Contrast 1 is

      C1
G1    0.8
G2    -0.2
G3    -0.2
G4    -0.2
G5    -0.2


Example continued. Recall the group descriptions . . .

G1: Evaluations of a job applicant with no history of arrest.
G2: Evaluations of a job applicant with an arrest for DUI.
G3: Evaluations of a job applicant with an arrest for Marijuana possession.
G4: Evaluations of a job applicant with an arrest for domestic violence.
G5: Evaluations of a job applicant with an arrest for assault in a bar fight.

2. Suppose our second interest is in comparing the mean salary offered to those with nonviolent arrest records with those applicants whose arrest records involved some kind of violence.

Collection 1: G2, G3
Collection 2: G4, G5
Collection 3: G1

Collection 1 coefficient: 2 / (2+2) = 2/4 = 0.5
Collection 2 coefficient: -2 / (2+2) = -2/4 = -0.5 (note the minus sign)
Collection 3 coefficient: 0

So, now the two contrasts are

      C1      C2
G1    0.8     0
G2    -0.2    +.5
G3    -0.2    +.5
G4    -0.2    -.5
G5    -0.2    -.5

3. Now suppose our third interest is to compare the mean salary offered to those with DUI vs mean salary to those with arrest for Marijuana.

Collection 1: G2
Collection 2: G3
Collection 3: G1, G4, G5

Collection 1 coefficient: 1 / (1+1) = .5
Collection 2 coefficient: -1 / (1+1) = -.5
Collection 3 coefficient: 0

Contrasts

      C1      C2     C3
G1    0.8     0      0
G2    -0.2    +.5    +.5
G3    -0.2    +.5    -.5
G4    -0.2    -.5    0
G5    -0.2    -.5    0


4. Finally, suppose our 4th and last interest is to compare the mean salary offered to those with a domestic violence arrest with those with a simple assault arrest.

Collection 1: G4
Collection 2: G5
Collection 3: G1, G2, G3

Collection 1 coefficient: 1 / (1+1) = .5
Collection 2 coefficient: -1 / (1+1) = -.5
Collection 3 coefficient: 0

Contrasts

      C1     C2     C3     C4
G1    +.8    0      0      0
G2    -.2    +.5    +.5    0
G3    -.2    +.5    -.5    0
G4    -.2    -.5    0      +.5
G5    -.2    -.5    0      -.5
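The four contrasts can be generated from the collections rather than worked out by hand. The Python sketch below applies the Collection 1 / Collection 2 / Collection 3 rule to each comparison; the function name `contrast` and the dictionary layout are assumptions made for this illustration.

```python
def contrast(set1, set2, groups):
    """Collection 1 groups get V/(U+V), Collection 2 groups get -U/(U+V),
    all remaining (Collection 3) groups get 0; U = len(set1), V = len(set2)."""
    u, v = len(set1), len(set2)
    return [v / (u + v) if g in set1 else -u / (u + v) if g in set2 else 0.0
            for g in groups]

groups = ["G1", "G2", "G3", "G4", "G5"]
comparisons = {
    "C1": ({"G1"}, {"G2", "G3", "G4", "G5"}),  # no arrest vs. any arrest
    "C2": ({"G2", "G3"}, {"G4", "G5"}),        # nonviolent vs. violent arrests
    "C3": ({"G2"}, {"G3"}),                    # DUI vs. marijuana
    "C4": ({"G4"}, {"G5"}),                    # domestic violence vs. bar fight
}
table = {name: contrast(s1, s2, groups) for name, (s1, s2) in comparisons.items()}

for i, g in enumerate(groups):
    print(g, [round(table[c][i], 2) for c in table])
```

Each printed row should match the corresponding row of the C1 to C4 table above.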

So we have now created 4 contrasts, each one testing a hypothesis of interest to us . . .

C1: Do evaluators of job applications give different mean salaries to persons with no criminal background vs those persons with some criminal arrest record?

C2: Do evaluators of job applications give different mean salaries to persons with a nonviolent arrest record (DUI or marijuana) vs. those with an arrest record involving violence?

C3: Do evaluators of job applications give different mean salaries to persons with arrest records mentioning alcohol vs. those with arrest records mentioning marijuana?

C4: Do evaluators of job applications give different mean salaries to persons with arrest records mentioning domestic violence vs. those with arrest records mentioning simple assault?

Note that we could have done a simple one-way analysis of variance with post hoc tests.

But 1) the ANOVA+post hocs would not have directly addressed the questions posed in the contrasts, and

2) the a priori comparisons used here are more powerful – better able to detect small but real differences – than post hoc tests.

For these two reasons, the contrast method used here is preferred.

This approach is called by many texts the “Planned Comparison” approach.


Hypothetical data illustrating the Comparing Groups not Previously Compared Method shown above.

SPSS syntax to create the C variables and perform the regression . . .

recode group (1 = .8)(else = -.2) into C1.
recode group (1=0)(2,3=.5)(else=-.5) into C2.
recode group (1=0)(2=.5)(3=-.5)(else=0) into C3.
recode group (1,2,3=0)(4=.5)(5=-.5) into C4.
regression variables = y c1 c2 c3 c4 /dep=y /enter.

I don’t usually advise using a display such as the above to just present means. However, when you have a large collection of means, a visual display such as this might be useful in discovering a pattern of mean differences. For example, in the above, the display points out that the “No arrest” salary offers were higher than all of the arrest salary offers. The t-value for C1 will tell us whether the difference is significant.

Grp   C1     C2     C3     C4
1     +.8    0      0      0
2     -.2    +.5    +.5    0
3     -.2    +.5    -.5    0
4     -.2    -.5    0      +.5
5     -.2    -.5    0      -.5

Regression

Variables Entered/Removed(a)

Model   Variables Entered    Variables Removed   Method
1       C4, C3, C2, C1(b)    .                   Enter

a. Dependent Variable: y
b. All requested variables entered.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .668(a)   .447       .397                10.43936

a. Predictors: (Constant), C4, C3, C2, C1

ANOVA(a)

Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    3958.022         4     989.506       9.080   .000(b)
Residual      4904.112         45    108.980
Total         8862.134         49

a. Dependent Variable: y
b. Predictors: (Constant), C4, C3, C2, C1

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t        Sig.
(Constant)    75.086    1.476                                           50.859   .000
C1            20.136    3.691               .605                        5.456    .000
C2            8.271     3.301               .278                        2.505    .016
C3            -1.626    4.669               -.039                       -.348    .729
C4            -1.845    4.669               -.044                       -.395    .695


An advantage of planned comparisons such as these is that they do NOT require a significant overall F value to be interpreted.



Rules for creating contrasts . . . Skip in 2018

The first contrast is easy and may be the only contrast in which you are really interested.

Alas, the remaining contrasts are important for technical reasons and must also be created.

The contrasts must satisfy the following conditions . . .

1. Coefficients of each contrast must sum to zero.

2. Sum of products of coefficients of any pair of contrasts must be zero. This is called the orthogonality condition.

Two ways of meeting the orthogonality conditions for contrasts . . .

1) The compare-groups-that-were-not-previously-compared method.

Separate the positive coefficient groups from the negative coefficient groups in previous contrasts.

Form each subsequent contrast among those groups with equal coefficient value(s) on all previous contrasts. This is what was done in the above example . . .

      C1     C2     C3     C4
G1    +.8    0      0      0
G2    -.2    +.5    +.5    0
G3    -.2    +.5    -.5    0
G4    -.2    -.5    0      +.5
G5    -.2    -.5    0      -.5

Each new contrast – C2, then C3, then C4, is between groups whose coefficients on all previous contrasts were the same which means that these groups were not previously compared.
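Both conditions are easy to verify mechanically for the five-group example. The Python sketch below is illustrative; `check_contrasts` is a made-up name, and the sum-of-products check is applied to the coefficients alone, which is the condition as stated above.

```python
from itertools import combinations

contrasts = {
    "C1": [0.8, -0.2, -0.2, -0.2, -0.2],
    "C2": [0.0, 0.5, 0.5, -0.5, -0.5],
    "C3": [0.0, 0.5, -0.5, 0.0, 0.0],
    "C4": [0.0, 0.0, 0.0, 0.5, -0.5],
}

def check_contrasts(contrasts, tol=1e-9):
    """Return (all coefficients sum to zero, all pairs are orthogonal)."""
    sums_ok = all(abs(sum(c)) < tol for c in contrasts.values())
    orthogonal = all(
        abs(sum(a * b for a, b in zip(contrasts[p], contrasts[q]))) < tol
        for p, q in combinations(contrasts, 2)
    )
    return sums_ok, orthogonal

print(check_contrasts(contrasts))  # (True, True)
```

A small tolerance is used instead of an exact comparison with zero because values such as 0.8 and 0.2 are not stored exactly in floating point.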




2) The Factorial Comparisons Method – Skip in 2018

Contrast Codes that mimic Factorial Designs

Creating contrasts so they represent the comparisons automatically performed in analysis of factorial designs.

Example: 2 x 2 Factorial – the simplest factorial design

Recall from PSY 5100 that factorial designs are represented on paper as a two-way table

              Column Factor
              Col 1    Col 2
Row   Row 1   G1       G2
      Row 2   G3       G4

Recall also that in a factorial design, we automatically perform 3 tests

1. Test of the Row main effect: Mean of G1 and G2 vs. Mean of G3 and G4

2. Test of the Column main effect: Mean of G1 and G3 vs. Mean of G2 and G4

3. Test of the Interaction: Difference of G1 and G3 vs. Difference of G2 and G4

The three comparisons are automatically performed by GLM for factorial designs. They can also be created using contrast codes.

The Row Main Effect is coded by treating everyone in Row 1 (G1 and G2) as one collection and everyone in Row 2 (G3 and G4) as a 2nd collection. So, everyone in Row 1 (G1 and G2) gets one value (+.5 in the example), and everyone in Row 2 gets the other value (-.5).

The Column Main Effect is coded in the same way. Everyone in Column 1 (G1 and G3) gets one value. Everyone in Column 2 (G2 and G4) gets the other.

Finally, the Interaction is coded using the following rule: Each value of the interaction group coding variable is the product of the two main effect variable values involved in the interaction. So, the +.25 is the product of +.5 and +.5. The -.25 in the second line is the product of +.5 and -.5, and so forth.

      Row ME        Col ME
      Main effect   Main effect   Interaction
      GCV1          GCV2          GCV3
G1    +.5           +.5           +.25  (= .5 * .5)
G2    +.5           -.5           -.25  (= .5 * -.5)
G3    -.5           +.5           -.25  (= -.5 * .5)
G4    -.5           -.5           +.25  (= -.5 * -.5)




Table repeated from above . . . Skip in 2018

      Row ME        Col ME
      Main effect   Main effect   Interaction
      GCV1          GCV2          GCV3
G1    +.5           +.5           +.25  (= .5 * .5)
G2    +.5           -.5           -.25  (= .5 * -.5)
G3    -.5           +.5           -.25  (= -.5 * .5)
G4    -.5           -.5           +.25  (= -.5 * -.5)

The comparison that each coding scheme makes:

GCV1: Mean of G1 and G2 vs. mean of G3 and G4: (G1+G2) - (G3+G4)
GCV2: Mean of G1 and G3 vs. mean of G2 and G4: (G1+G3) - (G2+G4)
GCV3: Difference of G1 and G2 vs. difference of G3 and G4: (G1-G2) - (G3-G4)

The beauty of this is that all the conditions of contrast coding are met. The coefficients sum to 0. The coefficients are orthogonal. Life is good.

Contrast Coding a 2 x 3 Factorial Design.

Representation of the design:

             Column Factor
             Col1    Col2    Col3
Row   Row1   G1      G2      G3
      Row2   G4      G5      G6

Row main effect: Mean of G1, G2, G3 vs. mean of G4, G5, G6.

We’ll treat G1, G2, G3 as one group and compare its mean with the mean of G4, G5, and G6 treated as one group.

Column main effect: Mean of G1, G4 vs. mean of G2, G5 vs. mean of G3, G6. Note that there are THREE columns, so we’re going to have to create TWO contrasts to represent the comparisons between them.

Interaction: Difference of G1 and G4 compared with difference of G2 and G5 compared with difference of G3 and G6.


      CC1   CC2   CC3   CC4   CC5
G1
G2
G3
G4
G5
G6


Group Coding variables for a 2 x 3 factorial Skip in 2018

Row main effect: Only one contrast is required since there are only two rows to compare.

Column main effect: Two contrasts are required to carry the column main effect, since there are 3 columns whose means must be compared.

Interaction: Since there are two column main effect variables, there will be two interaction variables. Each coefficient is the product of one Row ME coefficient and one Col ME coefficient.

      Row ME   Col ME1   Col ME2   Interaction1   Interaction2
                                   Row x Col 1    Row x Col 2
G1     .5       .67       0         .33            0
G2     .5      -.33       .5       -.17            .25
G3     .5      -.33      -.5       -.17           -.25
G4    -.5       .67       0        -.33            0
G5    -.5      -.33       .5        .17           -.25
G6    -.5      -.33      -.5        .17            .25
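The table above can be rebuilt from just the row and column coefficients. This is a minimal sketch under the assumption that the rounded entries .67, .33 and .17 stand for the exact fractions 2/3, 1/3 and 1/6. The dictionary and variable names are made up for the illustration.

```python
from fractions import Fraction as F
from itertools import combinations

groups = ["G1", "G2", "G3", "G4", "G5", "G6"]
# Position of each group in the 2 x 3 layout (0-based row and column).
row_of = {"G1": 0, "G2": 0, "G3": 0, "G4": 1, "G5": 1, "G6": 1}
col_of = {"G1": 0, "G2": 1, "G3": 2, "G4": 0, "G5": 1, "G6": 2}

row_coef = [F(1, 2), F(-1, 2)]               # Row 1 vs. Row 2
col1_coef = [F(2, 3), F(-1, 3), F(-1, 3)]    # Col 1 vs. Col 2 and Col 3
col2_coef = [F(0), F(1, 2), F(-1, 2)]        # Col 2 vs. Col 3

codes = {
    "Row ME": [row_coef[row_of[g]] for g in groups],
    "Col ME1": [col1_coef[col_of[g]] for g in groups],
    "Col ME2": [col2_coef[col_of[g]] for g in groups],
}
# Each interaction code is the product of the Row ME code and one Col ME code.
codes["Row x Col 1"] = [r * c for r, c in zip(codes["Row ME"], codes["Col ME1"])]
codes["Row x Col 2"] = [r * c for r, c in zip(codes["Row ME"], codes["Col ME2"])]

sums_ok = all(sum(v) == 0 for v in codes.values())
orthogonal_ok = all(
    sum(a * b for a, b in zip(codes[x], codes[y])) == 0
    for x, y in combinations(codes, 2)
)

for i, g in enumerate(groups):
    print(g, "  ".join(f"{float(codes[name][i]):+.2f}" for name in codes))
print("All sum to zero:", sums_ok, "  All pairs orthogonal:", orthogonal_ok)
```

So the product rule keeps both contrast conditions satisfied in the 2 x 3 case as well, at least with equal group sizes.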

This is getting quite complicated, and only a few applications actually require reference to things like the interaction group coding variable. So, we won’t pursue it further here.

But now you know what GLM and other ANOVA programs do. They create contrast codes and then regress the dependent variable onto those contrast codes. It’s all regression.
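To make the "it's all regression" point concrete, here is a small sketch using the 2 x 2 codes from earlier. The cell means are hypothetical numbers invented for this illustration, and the groups are assumed to be the same size. Under those assumptions, and because the codes are mutually orthogonal, the least squares slope for each code reduces to the closed form used below, so no regression routine is needed.

```python
from fractions import Fraction as F

# Hypothetical cell means (illustration only), equal group sizes assumed.
means = {"G1": F(10), "G2": F(14), "G3": F(20), "G4": F(12)}

h = F(1, 2)
codes = {
    "row": {"G1": h, "G2": h, "G3": -h, "G4": -h},
    "col": {"G1": h, "G2": -h, "G3": h, "G4": -h},
}
# Interaction code = product of the two main effect codes.
codes["int"] = {g: codes["row"][g] * codes["col"][g] for g in means}

# With orthogonal, zero-sum codes and equal n, the intercept is the grand
# mean and each slope is sum(code * mean) / sum(code squared).
intercept = sum(means.values()) / len(means)
slopes = {
    name: sum(c[g] * means[g] for g in means) / sum(v * v for v in c.values())
    for name, c in codes.items()
}

# Predicted score for each group from the regression equation.
predicted = {
    g: intercept + sum(slopes[name] * codes[name][g] for name in codes)
    for g in means
}

print("Intercept (grand mean):", intercept)
for name, b in slopes.items():
    print(f"Slope for {name}: {b}")
print("Predicted values equal the group means:", predicted == means)
```

With these hypothetical means, the row slope is the Row 1 mean minus the Row 2 mean, the column slope is the Column 1 mean minus the Column 2 mean, and the interaction slope is (G1-G2) - (G3-G4). The regression equation reproduces every cell mean exactly, which is what an ANOVA program works with as well.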




Helmert coding

A sequential contrast coding scheme comparing

1) 1st K-1 groups compared with the Kth.
2) 1st K-2 groups compared with the (K-1)th; Kth = 0.
3) 1st K-3 groups compared with the (K-2)th; (K-1)th and Kth = 0.
. . .
K-1) 1st compared with the 2nd; all others = 0.

Three Groups
      H1     H2
G1   -.33   -.5
G2   -.33   +.5
G3    .67    0

Four Groups
      H1     H2     H3
G1   -.25   -.33   -.5
G2   -.25   -.33   +.5
G3   -.25   +.67    0
G4   +.75    0      0

Eight Groups
      H1      H2      H3      H4      H5     H6     H7
G1   -.125   -.143   -1/6    -1/5    -1/4   -1/3   -1/2
G2   -.125   -.143   -1/6    -1/5    -1/4   -1/3    1/2
G3   -.125   -.143   -1/6    -1/5    -1/4    2/3    0
G4   -.125   -.143   -1/6    -1/5     3/4    0      0
G5   -.125   -.143   -1/6     .800    0      0      0
G6   -.125   -.143    .833    0       0      0      0
G7   -.125    .857    0       0       0      0      0
G8    .875    0       0       0       0      0      0

Difference Coding

Same as Helmert except flipped around the horizontal axis: 1st vs. the rest, 2nd vs. the rest of the groups after it, 3rd vs. the rest of the groups after it, etc.
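Since these codes follow a fixed pattern, they can be generated for any number of groups. The sketch below follows the layout of the Helmert tables above. The function names are made up for this illustration, and reading "flipped around the horizontal axis" as reversing the order of the rows is an assumption.

```python
from fractions import Fraction as F


def helmert_codes(k):
    """Return k rows (groups G1..Gk) by k-1 columns (H1..H(k-1)).

    Column H1 compares the first k-1 groups with the kth, H2 compares the
    first k-2 groups with the (k-1)th, and so on, as in the tables above.
    """
    rows = [[F(0)] * (k - 1) for _ in range(k)]
    for j in range(k - 1):
        m = k - 1 - j  # number of leading groups pooled in this contrast
        for i in range(m):
            rows[i][j] = F(-1, m + 1)
        rows[m][j] = F(m, m + 1)
    return rows


def difference_codes(k):
    """Helmert rows in reverse order: 1st group vs. the rest, and so on."""
    return helmert_codes(k)[::-1]


for label, row in zip(("G1", "G2", "G3", "G4"), helmert_codes(4)):
    print(label, "  ".join(f"{float(c):+.2f}" for c in row))
```

Calling `helmert_codes(8)` gives the exact fractions behind the eight group table, for example -1/7 where the table shows -.143.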

I’ve used Helmert and Difference coding not because of any inherent interest in the contrasts, but simply to get some form of contrast codes for a set of groups.



Why contrasts?

Why should we even bother with doing contrast codes? Why not just do an overall ANOVA and use post hocs to compare individual group means?

Good question. The answer is power. The overall ANOVA F is like using a floodlight to look for bugs. It lights the whole area, but because the given amount of light has to cover a wide expanse, the amount of light at any given spot is small. If you’re mainly interested in whether there are bugs in one specific corner of the room, it’s much better to use a spotlight. That concentrates the light in one particular area, increasing your chances of finding the bug (the real difference, if it exists).

Consider the contrasts created above . . .

      C1    C2    C3    C4
G1   +.8     0     0     0
G2   -.2   +.5   +.5     0
G3   -.2   +.5   -.5     0
G4   -.2   -.5     0   +.5
G5   -.2   -.5     0   -.5

Suppose the only real difference between groups in the above was the difference between G1 and the average of the other groups, G2 through G5, which is the difference assessed by C1. Suppose that there were no differences among G2, G3, G4, and G5; their means just varied randomly.

The overall ANOVA F would look at ALL the differences, including the essentially zero differences between G2, G3, G4, and G5, and very likely conclude, “No differences OVERALL (on average) between the groups.” In fact, however, that would be wrong – there is ONE difference – that between G1 and the remaining groups. But it gets overshadowed by the predominance of nondifferences among G2, G3, G4, and G5.

Using contrasts forces a direct examination of specific differences.
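The floodlight idea can be put into numbers. The sketch below uses hypothetical group means, invented only for this illustration, in which G1 differs and G2 through G5 are identical. It compares the between-groups sum of squares, which the overall F spreads over 4 degrees of freedom, with the sum of squares for C1, which sits on 1 degree of freedom. Equal group sizes are assumed, and the group size is also a made-up value.

```python
from fractions import Fraction as F

# Hypothetical means (illustration only): G1 differs, G2-G5 are the same.
means = [F(60), F(50), F(50), F(50), F(50)]
n = 10  # hypothetical number of cases per group

# C1 coefficients from the table above: G1 vs. the average of G2-G5.
c1 = [F(4, 5), F(-1, 5), F(-1, 5), F(-1, 5), F(-1, 5)]

# Between-groups sum of squares used by the overall F (4 df).
grand_mean = sum(means) / len(means)
ss_between = n * sum((m - grand_mean) ** 2 for m in means)

# Sum of squares for the single contrast C1 (1 df).
contrast_value = sum(c * m for c, m in zip(c1, means))
ss_c1 = n * contrast_value ** 2 / sum(c * c for c in c1)

ms_between = ss_between / (len(means) - 1)  # spread over 4 df
ms_c1 = ss_c1 / 1                           # concentrated in 1 df

print("SS between groups:", ss_between, "  SS for C1:", ss_c1)
print("MS between groups:", ms_between, "  MS for C1:", ms_c1)
```

In this setup C1 accounts for all of the between-groups sum of squares, so its mean square, and therefore its F for any given error term, is four times the overall one. The critical F with 1 numerator degree of freedom is somewhat larger than the critical F with 4, but not by nearly enough to cancel that advantage.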



Final thoughts on the use of coding schemes . . .

1. Dummy Variable coding is useful when your research involves one group that is “special” such that the mean of scores in that group needs to be compared with the mean of every other group.

2. Effect coding is useful when you need a way to identify “deviant” groups. Picking a group that is clearly non-deviant as the reference group allows each t-test in the regression analysis to tell you whether the group that the t represents is deviant or not.

3. Contrast coding is useful when the automatic comparisons that are available are not right for you.

You may have multiple groups, but you’re not really interested in the “Are there any differences in any of the group means?” comparison that is represented by the overall F test.

Contrast coding lets you pick specific comparisons and test each of those specific comparisons.

It turns out that each of those specific tests will be more powerful than a test based on the “Are there any differences . . .?” overall F test that is automatically computed.

4. Remember that statistical packages such as SPSS create group coding variables behind the scenes, even though the output of the statistical procedures may not seem to have come from such schemes. Sometimes, as in the LOGISTIC REGRESSION and SURVIVAL procedures, for example, the use of group coding schemes is made obvious by SPSS.

5. You don’t have to specifically use group coding variables unless you want to.

