RUNNING HEAD: Reliability and Validity Coefficients
Reliability and Construct Validity Coefficients Adjusted for Correlation within Family Dyads
George J. Knafl,1* Jane K. Dixon,2 Margaret Grey,2 Jean P. O'Malley,1 Kathleen A. Knafl1
1 School of Nursing, Oregon Health & Science University
2 School of Nursing, Yale University
* Corresponding author, [email protected]
Acknowledgements
This work was supported in part by grant R01 NR08048 from the National Institute of Nursing
Research (NINR) of the National Institutes of Health (NIH), grant R01 AI057043 from the
National Institute of Allergy and Infectious Diseases of the NIH, and by Oregon Health &
Science University's Oregon Clinical and Translational Research Institute (OCTRI) through
grant UL1 RR024140 from the National Center for Research Resources (NCRR) of the NIH and
the NIH Roadmap for Medical Research.
ABSTRACT
The assessment of reliability and construct validity for scales used to evaluate families is
potentially limited because standard coefficients are not computed appropriately using the
combined responses for family members due to their possible intra-familial correlation. In this
paper, we describe and demonstrate the use of linear mixed modeling to compute generalizations
of such coefficients for item and scale responses for family dyads. These include generalizations
of coefficients for construct validity correlation, differential reliability index, internal consistency
reliability, item-total correlation, and test-retest reliability. Example analyses are presented using
survey responses from parents of a child with a chronic condition that focus on family
management of the child's condition.
Key Words: childhood chronic conditions, construct validity, dyadic data, family, reliability,
survey instruments
Reliability and Construct Validity Coefficients Adjusted for Correlation within Family Dyads
In order to address the family as a unit, family researchers collect measurements on multiple
family members, often on family dyads (e.g., mothers/fathers, parent/child, ill/healthy child).
Measurements for family members are likely to be correlated, and so their combined analysis
requires special methods accounting for intra-familial correlation (IFC) (Kashy & Snyder, 1995).
Hierarchical linear modeling (HLM), also called multilevel modeling, often is used to model
such measurements (Atkins, 2005; Raudenbush & Bryk, 2002) including approaches specialized
to family data (Cook & Snyder, 2005). These measurements commonly are based on summated
scales whose reliability for the complete study data requires assessment. Standard reliability
coefficients, however, are based on the assumption of independence across subjects, and so
should be computed only for each family member separately. Standard coefficients do not reflect
the composite reliability of combined responses for all family members, but commonly are
reported due to limitations of current methods. Recent examples include Martini, Grusec, and
Bernardini (2001) on satisfaction with help provided to older mothers by their adult daughters,
O'Leary and Vidair (2005) on child rearing for fathers and mothers, and Yeh et al. (2006) on
sexual satisfaction, marital quality, and marital instability for midlife couples.
The results reported in these and similar articles may provide an accurate assessment of
reliability, but that only can be verified through comparison with results that account for IFC
computed using responses for all surveyed family members. Even when results based upon
standard reliability coefficients do provide an accurate assessment, they do not provide an
indication of the strength of the relationship between family members and thus only provide a
partial assessment of the family, addressing individual family members but not the family unit.
As Kashy and Snyder (1995) point out in the context of couples, the couple together is more than
the aggregate of two individuals.
Reliability has been generalized in several ways. Wang (2002) used HLM (i.e., the method as
opposed to the software tool also called HLM) to generalize internal consistency reliability (ICR)
with intraclass correlations computed from models conditioned on properties of surveyed
individuals (in their case students). The properties they considered are fixed effects and so would
not account for IFC, but HLM could be used to introduce random family effects. However, HLM
with covariance structures generated by random effects or coefficients is a special case of linear
mixed modeling which also supports other, directly specified covariance structures, including the
ones presented here. Graham, Liu, and Jeziorski (2006) applied the reliability generalization of
Vacha-Haase (1998) to conduct a meta-analysis combining ICR values for multiple studies.
These methods combine reliability values computed separately from multiple samples, rather
than computing composite values from combined scale responses for family members within a
single sample. Confirmatory factor analysis (CFA) can be used to compute composite ICR
values for multiple factors combined together in a single CFA model (Hatcher, 1994). This
provides an alternative to Cronbach's alpha for computing reliability for factors combined
together rather than for associated scales one at a time, but does not generalize that coefficient to
address family data. Raykov and Shrout (2002) used structural equation modeling (SEM) to
generalize ICR to the case of scales with more than one underlying latent construct. While their
work does not address generalization to handle IFC, SEM could be used to generate analogous
models to the ones considered here. However, some popular SEM software tools, such as PROC
CALIS of the Statistical Analysis System (SAS), could not be used because they do not support
differing numbers of participating family members.
This paper describes how linear mixed modeling (Brown & Prescott, 1999) can be used to
compute reliability and construct validity coefficients addressing the family as a composite unit
for the special but common case of family dyads. Formulations are provided for generalized
coefficients accounting for IFC in item and scale responses from or about family dyads (e.g.,
from mothers and fathers about the family or from mothers about ill and healthy siblings). With
these methods, composite coefficients can be computed using combined responses for both
family members (as opposed to computing them with standard formulas one member at a time)
while accounting for the family unit through the correlation in responses within families. These
methods are demonstrated using data from an instrument development study that surveyed
parents of a child with a chronic condition on the family's management of the child's condition.
We begin by generalizing Cronbach's alpha for measuring ICR, the most commonly reported
reliability coefficient. The construct validity correlation (CVC) is generalized next. This requires
a somewhat more complex formulation than needed to generalize ICR. Then, several other
coefficients are generalized including the test-retest reliability (TRR), the item-total correlation
(ITC), and the differential reliability index (DRI) in that order. These are generalized using the
same kind of formulation as for the CVC. Formulations are presented in separate sections for
readers interested in the computational details. SAS macros to support these computations are
available from the authors.
Composite Internal Consistency Reliability (ICR)
ICR for a scale commonly is measured using Cronbach's alpha, computed using a well-
known formula combining variances for the scale and all of its items (e.g., see DeVellis, 2003, p.
35). These item and scale variances are computed using the standard approach based on the
assumption of independence across subjects, which will not hold in general for item and scale
responses from members of the same family. Consequently, ICR should be computed in this way
using responses for individual family members separately, not responses for all family members
combined. However, item and scale variances can be generalized using linear mixed modeling to
composite quantities accounting for IFC. These can be inserted into the standard Cronbach's
alpha formula to obtain composite ICR coefficients that account for IFC. Composite item-deleted
ICR coefficients can be computed in this way as well.
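The computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' SAS macros: the function name `cronbach_alpha` and its arguments are hypothetical, and the variances passed in are assumed to have been estimated already, either by the standard approach or as composite variances from a linear mixed model.

```python
def cronbach_alpha(item_variances, scale_variance):
    """Cronbach's alpha from item variances and the summated-scale variance.

    Uses the standard formula alpha = k/(k-1) * (1 - sum(item var) / scale var).
    The inputs may be standard or composite (IFC-adjusted) variance estimates;
    the formula itself is unchanged.
    """
    k = len(item_variances)
    if k < 2:
        raise ValueError("at least two items are required")
    if scale_variance <= 0:
        raise ValueError("scale variance must be positive")
    return (k / (k - 1)) * (1 - sum(item_variances) / scale_variance)


# Hypothetical values for illustration only (not taken from the FaMM data):
# four items with unit variance and a scale variance of 10.
example_alpha = cronbach_alpha([1.0, 1.0, 1.0, 1.0], 10.0)
```

A composite item-deleted coefficient would follow from the same function by dropping one item's variance from the list and supplying the composite variance of the correspondingly shortened scale.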
Generalizability theory (or G theory for short) can be used to extend Cronbach's alpha to
express ICR for a scale as a function of the number of items used in computing it, not just the
actual number of items (Shavelson & Webb, 1991). These G theory ICR extensions are
generalized here as well to composite coefficients.
ICR Formulation
General linear mixed models are used in modeling multiple measurements (Brown &
Prescott, 1999). Measurements for the same subject are treated as dependent while measurements
for different subjects are treated as independent. Each such model consists of a fixed component
modeling expected or mean values for the measurements and a random component modeling the
within-subject dependence in measurements. The fixed component can be any special case of the
general linear model (e.g., regression, analysis of variance, analysis of covariance). The random
component can be any covariance matrix for within-subject measurements (or, equivalently, the
variances for all measurements together with correlations for all pairs of measurements). This
covariance matrix can be generated through random effects and/or coefficients along with
independent, constant-variance error terms, and then the linear mixed model is equivalent to an
associated hierarchical linear or multilevel model (Singer, 1998). For example, a single random
intercept generates the covariance structure called compound symmetry because it has constant
variances and constant correlations for different measurements. On the other hand, there are
valid covariance structures that cannot be generated through random effects/coefficients, so that
linear mixed models are more general (Verbeke & Molenberghs, 2000). Parameters of linear
mixed models are estimated using full information maximum likelihood (Arminger & Sobel,
1990), allowing subjects to have partial sets of measurements without imputing values for
missing measurements. This is especially important for modeling family data which typically
involve families with differing numbers of participating members.
The specific linear mixed model used to generate composite variances for ICR computations
for family dyads is based on a pair of measurements, one for each family dyad member, for
either a scale or for one of its items. Families are the "subjects" of the model, rather than the
participating family members (the subjects of the study). The fixed component has different
mean values for the two family members along with a possible dyad effect (e.g., partnered versus
single) on the mean. The covariance matrix has two rows and two columns (and so dimension
2×2). The two diagonal entries correspond to variances, treated as different, for the two family
members. The two off-diagonal entries are equal to each other, and together with the variances
determine the IFC in responses within family dyads. This covariance matrix is called
unstructured since no structure has been imposed on it to reduce the number of covariance
parameters. For this 2×2 case, compound symmetry is the structured alternative with the
variances constrained to be equal.
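For readers who prefer symbols, the two covariance structures just described can be written out as follows. The notation (σ₁², σ₂², σ₁₂, ρ) is introduced here only for illustration and is not taken from the original formulation; subscripts 1 and 2 index the two dyad members.

```latex
% Unstructured 2x2 covariance matrix for a family dyad
\Sigma_{\mathrm{UN}} =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} \\
\sigma_{12} & \sigma_2^2
\end{pmatrix},
\qquad
\mathrm{IFC} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}

% Compound symmetry: the same matrix with equal variances
\Sigma_{\mathrm{CS}} =
\sigma^2
\begin{pmatrix}
1 & \rho \\
\rho & 1
\end{pmatrix},
\qquad
\mathrm{IFC} = \rho
```

The unstructured form has three covariance parameters (two variances and one covariance), while compound symmetry has two (one variance and one correlation).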
It is important to allow for different mean values for family members since the goal is to
compute separate variances for those family members, but it also may be important to account
for an additional dyad effect. For example, when the dyad consists of parents from one- or two-
parent families, it may be important to allow for a family type effect on mean values.
Alternately, for ill and healthy siblings, it may be important to allow for a gender effect.
Using an unstructured covariance matrix allows different composite variances to be
computed for each family member. These composite variances, for a scale and for its items, are
inserted into the standard Cronbach's alpha formula to compute the scale's composite ICR
coefficients, one for each family member, but based on the combined responses for both family
members. A single composite ICR for both family members combined could have been
computed instead using the constant variance parameter of the compound symmetry covariance
matrix. This has the advantage of providing a single measure of ICR, as would be used in non-
family settings, but ignores the individuality of family members. Families are composed of
individual family members, and so the assessment of reliability for a scale addressing multiple
family members should address the reliability for each of those individuals. On the other hand,
the family unit has dimensions not fully described by the individual properties of its members
alone. This is represented by the IFC, a dimension not addressed in standard ICR computations.
ICR Example Analysis
The data used in example analyses come from an instrument development study that
surveyed parents of children with a chronic condition on family management of the condition. A
total of 579 parents were interviewed, including 414 mothers and 165 fathers from 417 families,
with both parents responding from 162 families. These survey responses were used to develop
the Family Management Measure (FaMM) consisting of six scales representing attributes of
family management, including the Child Identity, Concern, Difficulty, Effort, Manageability, and
Parental Mutuality Scales (Knafl et al., in review). The Parental Mutuality Scale is appropriate
only for partnered parents, but the other scales apply to both partnered and single parents. For
example, the Concern Scale, measuring concern parents have about the condition, has 10 items
rated from 1 to 5, meaning strongly disagree to strongly agree (e.g., "we think about our child’s
condition all the time" and "because of the condition, we worry about our child’s future"). Larger
scores indicate higher levels of concern. The Manageability Scale measures how manageable the
condition is and has 12 items, also scored from 1 to 5 (e.g., "we often feel unsure about what to
do to take care of our child’s condition" and "when something unexpected happens with our
child’s condition, we usually know how to handle it"). Larger scores indicate higher levels of
manageability. Table 1 contains ICR results along with summary statistics for the Concern and
Manageability Scales computed from responses for mothers and fathers both separately and in
combination, allowing for an effect on scale means for family type (partnered versus single).
Composite ICR for mothers on concern was lower at .73 than for fathers at .77, but both were at
respectable levels (i.e., at least .70 but less than .80, DeVellis, 2003). Separate ICR values were
lower for mothers at .72 and also for fathers at .75. For manageability, composite ICR values
were also respectable for mothers at .71 and for fathers at .73, and the same as separate values in
both cases. ICR values computed separately can be lower than associated composite values in
some cases, but may not be too much lower. This suggests that the separate approach can
produce reasonably accurate ICR values, at least for cases like this with substantial numbers of
observations for each family member. However, IFCs (in this case, inter-parental correlations)
associated with the computation of composite summary statistics for concern and manageability
are .57 and .33, respectively, with both significantly nonzero (p < .01). Not accounting for this
correlation can ignore a distinct dimension of the family as a composite unit.
The estimated composite mean for concern was 26.9 for partnered mothers, 28.4 for single
mothers, and 26.9 for fathers and was not significantly different either for partnered mothers
compared to single mothers (p = .08) or for partnered mothers compared to fathers (p = .94). The
estimated composite mean for manageability was 49.3 for partnered mothers, 47.5 for single
mothers, and 49.4 for fathers and was significantly different for partnered mothers compared to
single mothers (p = .04) but not for partnered mothers compared to fathers (p = .76). Although it
may not always be crucial to allow for different means for partnered and single mothers, as for
concern, it may be in some cases, as for manageability. Consequently, allowance for an effect on
the mean for single versus partnered mothers was incorporated into analyses of reliability and
construct validity for all the FaMM scales reported by Knafl et al. (in review) except for the
Parental Mutuality Scale since it is not completed by single mothers.
Estimates for mean concern and manageability were not affected much by consideration of
IFC. The separate approach produced almost the same values as the composite approach.
Separate estimates for all partnered mothers and for fathers, however, cannot be compared using
standard tests. The standard two-sample t test is not appropriate because of the lack of
independence between responses within the same families. The standard paired t-test can be
appropriate, but only for responses from families with two participating parents. Differences
cannot be calculated for the other families with only participating mothers. The test generated by
the linear mixed model generalizes the standard paired t-test to account for responses from
families with either one or two participating members while accounting for correlation in
responses within families with two participating members. Separate and composite estimates of
the standard deviation were also almost the same, explaining why separate and composite ICR
values, which are computed from such standard deviation estimates, were also not too different.
G Theory Extension of ICR
Cronbach's alpha can be extended using G theory to express ICR for a scale as a function of
the number of items used in computing it, not just the actual number of items (Shavelson &
Webb, 1991). These alternative ICR values can be computed from the ICR value α for the actual
number of items (see the following section). Composite values for such generalized ICRs can be
generated with the same computation using composite values for α. Figure 1 portrays how
composite ICR for the Concern Scale would change with differing numbers of items of similar
effect to the 10 available items. One use for these curves would be to assess how many additional
items are needed to improve the scale's reliability to .80, considered the minimal very good level
by DeVellis (2003). Only two more items are needed to increase ICR for fathers to at least .80,
but five more are needed for mothers. This may be an unacceptable increase in subject burden,
especially if similar increases are needed for other FaMM scales as well.
G Theory Formulation
Using a one-facet crossed design in terms of persons and items (Shavelson & Webb, 1991),
the associated relative decision study generates an ICR coefficient ICR(I′) for an arbitrary
number I′ of items. This coefficient equals the standard Cronbach's alpha when I′ is set to the
observed number I of items (I = 10 for the Concern Scale), that is, ICR(I) = α. It also can be
expressed in the form
ICR(I′) = y / (y + 1 / I′) (1)
where y is the ratio of the person variance component to the residual variance
component (see equation 6.18, Shavelson & Webb, 1991). Set I′ = I in equation 1 and solve for y
giving
y = α / (I · (1 − α)) (2)
and then insert this into equation 1 to solve for ICR(I′) for any I′ in terms of α. The coefficients
ICR(I′) can be generalized to account for IFC by using composite ICR coefficients α in equation
2 to compute a composite y value and then using it in equation 1 to calculate composite values
ICR(I′) over a selection of possible values I′ for the number of items (e.g., as plotted in Figure
1).
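The two-step computation in equations 1 and 2 can be sketched as follows. This is an illustrative sketch only: the function names `variance_ratio`, `projected_icr`, and `items_needed` are hypothetical, and the inputs are assumed to be an ICR value α (standard or composite) and the observed number of items I.

```python
def variance_ratio(alpha, n_items):
    """Equation 2: y = alpha / (I * (1 - alpha))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return alpha / (n_items * (1.0 - alpha))


def projected_icr(alpha, n_items, new_n_items):
    """Equation 1 evaluated at a prospective number of items."""
    y = variance_ratio(alpha, n_items)
    return y / (y + 1.0 / new_n_items)


def items_needed(alpha, n_items, target):
    """Smallest number of items whose projected ICR reaches the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    new_n_items = 1
    # Projected ICR increases toward 1 as items are added, so this terminates.
    while projected_icr(alpha, n_items, new_n_items) < target:
        new_n_items += 1
    return new_n_items
```

With the rounded composite values reported for the Concern Scale (α = .77 for fathers, α = .73 for mothers, I = 10), `items_needed` returns 12 and 15 for a target of .80, which agrees with the two and five additional items noted for Figure 1.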
Composite Construct Validity Correlation (CVC)
The construct validity of a scale can be addressed by computing correlations between that
scale and other established measures related to the scale's underlying latent construct. Construct
validity for the scale is supported when these correlations are distinct and in expected directions,
thereby supporting hypothesized relationships. However, these correlations also are expected to
be limited in strength, indicating that underlying constructs are only related, not equivalent.
CVC Formulation
The standard CVC is computed using a general bivariate approach, treating the scale and the
established measure as correlated with different means and different variances. The associated
covariance matrix is unstructured with dimension 2×2.
The specific linear mixed model used to generalize this to a composite CVC for family dyads
is based on four measurements, the scale and the established measure for each of the two family
members. The fixed component is determined by allowing the scale and the established measure
each to have different mean values for the two family members along with a possible dyad effect
(e.g., partnered versus single). The covariance matrix has dimension 4×4. The first two rows and
columns correspond to the scale for the two family members. The last two rows and columns
correspond to the established measure for the two family members. The four diagonal elements
correspond to the variances for the four measurements, allowing them to have different values.
The off-diagonal elements together with the variances determine the correlations. There are six
possible correlations corresponding to the possible distinct pairs for the four measurements.
However, the model imposes structure on these correlations so that there are only two distinct
parameters, an IFC and a CVC. The correlation between scale values for the two family
members is set equal to the correlation between established measure values for the two family
members. This common value is the IFC parameter. Similarly, the correlation between scale
values and established measure values for the first family member is set equal to the correlation
between scale values and established measure values for the second family member. This
common value is the CVC parameter. The other correlations are determined by multiplying these
two parameters together. In other words, the correlation between scale values for one family
member and established measure values for the other family member is set equal to the product
of the IFC and CVC parameters. This 4×4 covariance matrix is a special case of what is called a
Kronecker product matrix (Schott, 2005). It is the product of two 2×2 unstructured covariance
matrices.
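The correlation pattern implied by this structure can be made concrete with a short sketch. It builds only the 4×4 correlation matrix (variances are omitted) and is not the PROC MIXED specification used in the analyses. The helper names `kronecker` and `dyad_correlation_matrix` are hypothetical, and the row order is assumed to be scale for member 1, scale for member 2, established measure for member 1, established measure for member 2.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def dyad_correlation_matrix(ifc, cvc):
    """4x4 correlation matrix determined by an IFC and a CVC parameter."""
    measures = [[1.0, cvc], [cvc, 1.0]]  # scale vs. established measure
    members = [[1.0, ifc], [ifc, 1.0]]   # first vs. second family member
    return kronecker(measures, members)


# Illustration with the composite estimates reported for the Concern Scale
# (CVC = -.32, IFC = .52); the cross-member, cross-measure entries are the product.
R = dyad_correlation_matrix(ifc=0.52, cvc=-0.32)
```

In `R`, entries `R[0][1]` and `R[2][3]` hold the IFC, entries `R[0][2]` and `R[1][3]` hold the CVC, and entries `R[0][3]` and `R[1][2]` hold their product.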
This model provides a common composite CVC parameter, rather than different ones, for the
two family members. While different CVC parameters would provide information on individual
family member differences, the latent construct underlying the scale would be the same for both
family members, as would the latent construct for the established measure. A common CVC for
both family members better reflects this homogeneity in latent constructs across family
members, and so seems more appropriate. Moreover, while an extended model with different
CVC parameters can be formulated, this extension is not supported by SAS PROC MIXED, the
software tool used here to estimate linear mixed models, and so is not readily computable. While
PROC MIXED does not support this one case, another popular statistical software tool, the
Statistical Package for the Social Sciences (SPSS), does not support Kronecker product
covariance matrices of any kind for linear mixed modeling. This model could be computed in
SAS using PROC NLMIXED, but the extra effort did not seem warranted. An alternate approach
supported by PROC MIXED would be to use an unstructured 4×4 covariance matrix with 6
different correlation parameters including different CVCs for mothers and fathers and different
IFCs associated with these two CVCs. There would also be two other correlations combining
aspects of the CVC and the IFC since they involve two different parents and two different scales.
This approach was judged to be too complex.
CVC Example
Several established measures were collected as part of the FaMM Study for construct validity
purposes including the Functional Status II (Stein & Jessop, 1990), providing a parent's
assessment of child functioning in the presence of a chronic condition. This scale is computed as
the sum of 14 items scored from 0 to 2. Larger scores indicate better child functional status which
would be expected to correspond to lower concern and higher manageability. Consequently,
construct validity for the Concern Scale would be supported by a negative CVC with child
functional status and for the Manageability Scale by a positive CVC.
Composite construct validity results for the Concern and Manageability Scales in terms of
child functional status are reported in Table 2 with composite ICR results. Child functional status
was missing for two parents, and so results used responses for 577 parents. Composite ICR for
child functional status was .79 for mothers and .81 for fathers, with IFC of .58 (p < .01). The
composite CVC between concern and child functional status was −.32 (p < .01), a distinct value
in the expected direction and of limited strength, thereby supporting construct validity for the
Concern Scale. The associated IFC was .52 (p < .01). Construct validity of the Manageability
Scale was supported by a positive composite CVC with child functional status of .32 (p < .01),
with associated IFC of .38 (p < .01). For both scales, the impact of the family unit on construct
validity was distinct as indicated by IFCs, demonstrating the importance of accounting for intra-
familial correlation.
Composite Test-Retest Reliability (TRR)
The standard TRR for a scale is computed as the correlation between responses to that scale
by the same individual at two points in time. The time points should be far enough apart so that
subjects cannot recall their original responses, but close enough so the scale's underlying latent
construct remains stable. Item TRR can be computed similarly. These computations use the same
general bivariate approach as used for the CVC, with different means and variances for the scale
at the two time points along with a temporal correlation parameter representing TRR.
These standard coefficients can be generalized to composite coefficients with the same direct
product covariance approach used for the CVC, providing composite TRR estimates for a scale
and its items adjusted for associated IFC. TRR sample sizes are smaller than for general scale
development purposes, so it may be important to avoid too complex a model. One natural way to
simplify the model is to allow variances to differ across times, but not across family members.
These two alternatives can be compared using penalized likelihood criteria (Sclove, 1987). We
use Schwarz's Bayesian information criterion (BIC) because its penalty accounts for the sample
size as well as the number of parameters, as opposed to the Akaike information criterion with
penalty only accounting for the number of parameters. Accounting for the size of the sample in
the penalty seems important for TRR since it typically is based on samples of limited size.
Alternate forms reliability based on two alternate forms for the items can be generalized
similarly.
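The difference between the two penalties mentioned above can be seen in a small sketch. The function names are hypothetical, and the textbook definitions are assumed: both criteria start from −2 times the maximized log likelihood, with a penalty of 2 per parameter for AIC and log(n) per parameter for BIC. Which quantity counts as the sample size n (for example, families rather than individual respondents) and which parameters are counted depend on the software, so those choices are left to the caller here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: penalty depends only on parameter count."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, sample_size):
    """Schwarz's Bayesian information criterion: penalty grows with log(n)."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


def prefer_reduced(bic_full, bic_reduced):
    """Smaller scores indicate better models; ties favor the simpler model."""
    return bic_reduced <= bic_full
```

Because log(n) exceeds 2 once n is 8 or more, BIC penalizes each extra variance parameter more heavily than AIC for any realistic TRR sample, and so leans toward the simpler of two candidate models.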
Composite TRR Formulation
The specific linear mixed model used to generalize the TRR coefficient has the same
structure as the model for the composite CVC. Measurements are either for a scale or for one of
its items, measured four ways corresponding to two family members at two time points, the test
and the retest, with different means possibly changing due to a dyad effect and with different
variances. The six possible correlations are determined by an IFC parameter, a temporal
correlation parameter representing the TRR, and products of these two parameters. Two
alternative versions of this model are considered here, the full model with all four variances
different and the reduced model with variances differing only across time points, not also across
family members. The reduced model is considered since the full model can be over-
parameterized for the kinds of sample sizes usually considered for TRR.
Each of these alternative models provides a common composite TRR parameter, rather than
different ones, for the two family members. Models with different parameters for each family
member would be preferable, but as for the CVC are not readily computable. An individual
family member perspective on TRR can be obtained from standard coefficients computed for
each family member separately, but there may not be sufficient numbers retested within each
family member type for meaningful results. The two alternatives for a common composite TRR
coefficient are more accurate because they are computed from responses for all retested subjects.
If at most one member per family is retested, a common TRR value can be computed using
the standard approach, but this would not account fully for the family as a unit composed of
multiple individuals. Composite coefficients account for this through the IFC. However, if the
IFC is not significantly different from zero, it could be more appropriate to compute the common
value for TRR using the standard approach applied to responses for all retested subjects, treating
them as independent across subjects even when for different members of the same family. This
latter approach may not be appropriate, though, unless the IFC is distinctly nonsignificant. For
example, Kashy and Snyder (1995) recommend using a conservative cutoff of .25 for the p-value
for the test of zero IFC before treating responses within families as independent. In any case, this
latter approach treats variances as different across time, but not across family members, and so is
generalized to account for IFC by the reduced linear mixed model.
Composite TRR Example
For the FaMM Study, 65 parents were retested within 2-4 weeks, including 41 mothers and
24 fathers from 44 families with both parents retested from 21 families. Table 3 contains TRR
results for the Concern and Manageability Scales. The composite TRR for the Concern Scale
was .87 using both the full and reduced models. The associated IFC was .34 for the full model
and .32 for the reduced model, with both significantly nonzero (both at p = .02), so it is
inappropriate to compute a composite value using the standard approach. The standard TRR
computed just for mothers was .89 compared to .84 just for fathers. The Concern Scale has very
good TRR levels from all perspectives.
For the Manageability Scale, composite TRR was at a reasonable level of .79 using both the
full and reduced models. The associated IFC was −.16 for the full model with p = .29 while it
was −.17, almost the same, with p = .27 for the reduced model. These p-values are large enough
(over .25) to compute an alternative composite value using the standard approach, which was
.78, about the same as for the other more general cases. Alternately, the standard TRR computed
just for mothers was .83 compared to .64 just for fathers. It is possible that TRR for the
Manageability Scale is at an undesirable level (i.e., at least .60 but less than .65, DeVellis, 2003)
for fathers, but at a very good level for mothers. However, these results also could be a
consequence of small subsample sizes, especially for fathers with only 24 retested. It seems quite
possible that the Manageability Scale has reasonable TRR levels as indicated by its three
alternative composite scores, a conclusion in this case not affected much by correlation in
responses within the same families. In any case, an effective evaluation of TRR for family scales
like the FaMM requires sufficiently large numbers of all family member types, not just a large
number of family members of any type as was collected in the FaMM Study.
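The decision rule discussed above can be sketched in a few lines. This is only an illustration, assuming the conservative .25 cutoff attributed to Kashy and Snyder (1995) and the p-values reported in Table 3; the function name and data layout are hypothetical and are not part of the SAS macros used for the reported analyses.

```python
# Conservative cutoff quoted in the text (Kashy & Snyder, 1995).
CONSERVATIVE_CUTOFF = 0.25


def may_treat_as_independent(ifc_p_value, cutoff=CONSERVATIVE_CUTOFF):
    """Return True when the p-value for the test of zero IFC exceeds the cutoff."""
    return ifc_p_value > cutoff


# p-values for the test of zero IFC, transcribed from Table 3.
ifc_p_values = {
    ("Concern", "full"): 0.02,
    ("Concern", "reduced"): 0.02,
    ("Manageability", "full"): 0.29,
    ("Manageability", "reduced"): 0.27,
}

for (scale, model), p in ifc_p_values.items():
    if may_treat_as_independent(p):
        verdict = "standard approach may be considered"
    else:
        verdict = "use the composite TRR"
    print(f"{scale:13s} {model:8s} p = {p:.2f}: {verdict}")
```

Under this rule only the Manageability Scale qualifies for the standard approach, which matches the treatment in the text.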
Composite TRR results differ little between the full and reduced models for both the
Concern and Manageability Scales. In such situations, either approach could be used, but
the reduced model seems preferable as more parsimonious. In other situations, it is possible that
estimating more variance parameters than is necessary might produce inferior TRR estimates,
suggesting again the reduced model may be preferable. However, there are objective ways to
compare these models. Table 3 provides such a comparison using BIC scores, with smaller
values indicating better models. The reduced model generates smaller BIC scores for both the
Concern and the Manageability Scales. The differences in scores are not large, indicating that
either model may be an acceptable choice, but the reduced model is still preferable as the
simpler model. The reduced model is thus both more parsimonious and more effective
for these scales. We recommend its use over the full model in general. Furthermore,
BIC scores are distinctly larger for separate TRR analyses than for composite TRR analyses
suggesting that the differences in separate TRR results for mothers and fathers, especially for the
Manageability Scale, are questionable.
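The BIC comparison described above amounts to picking the smallest score per scale and inspecting the gaps. A minimal sketch follows, using the BIC values transcribed from Table 3; the helper names are hypothetical and the code only compares reported scores, it does not refit any model.

```python
# BIC scores transcribed from Table 3 (smaller values indicate better models).
bic = {
    "Concern": {"separate": 780.8, "full": 771.2, "reduced": 768.8},
    "Manageability": {"separate": 823.2, "full": 817.4, "reduced": 814.5},
}


def best_model(scores):
    """Name of the model with the smallest BIC score."""
    return min(scores, key=scores.get)


def bic_gaps(scores):
    """Difference between each model's BIC and the smallest BIC, to one decimal."""
    smallest = min(scores.values())
    return {name: round(value - smallest, 1) for name, value in scores.items()}


for scale, scores in bic.items():
    print(scale, best_model(scores), bic_gaps(scores))
```

The gaps between the full and reduced models are small relative to the gaps for the separate analyses, which is the pattern the text relies on.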
Composite Item-Total Correlation (ITC)
The standard ITC is computed as the correlation between an item and the total of a scale with
that item deleted. The scale is corrected by subtracting the item before computing the correlation.
An uncorrected version can be computed similarly, but is considered less appropriate because
including the item in the scale can inflate the resulting ITC coefficient (DeVellis, 2003, p. 93).
The ITC also is called the discrimination index. Items with values of at least .30 are considered
discriminating and those with values over .20 but less than .30 moderately discriminating
(Nunnally & Bernstein, 1994). The standard ITC coefficient can be generalized to a composite
coefficient using the same direct product covariance approach used to generalize the CVC,
providing a composite ITC adjusted for associated IFC.
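For readers less familiar with the standard (single family member) ITC, the following sketch computes the corrected and uncorrected versions for one item. The response matrix is made-up illustrative data, not FaMM Study data, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_itc(responses, item):
    """Correlate an item with the scale total after subtracting that item."""
    item_scores = [row[item] for row in responses]
    rest_totals = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_totals)


def uncorrected_itc(responses, item):
    """Correlate an item with the scale total that still includes the item."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    return pearson(item_scores, totals)


# Hypothetical responses: one row per respondent, one column per item.
responses = [
    [1, 2, 3],
    [2, 2, 4],
    [3, 4, 4],
    [4, 5, 5],
    [5, 4, 3],
]

print(round(corrected_itc(responses, 0), 2), round(uncorrected_itc(responses, 0), 2))
```

In this toy example the uncorrected value exceeds the corrected one, illustrating the inflation that motivates the correction.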
Composite ITC Formulation
The specific linear mixed model used to generalize the ITC has the same form as the model
for the CVC. There are four measurements corresponding to the item and the corrected scale for
each of the two family members. These four measurements are treated as having different means,
possibly changing due to a dyad effect, and different variances. The six possible correlations are
determined by an IFC parameter, an ITC parameter, and products of these two parameters. This
model provides a common composite ITC parameter, rather than different ones, for the two
family members. A model with different parameters for each family member would be
preferable but, as for the CVC, is not readily computable. An individual family member
perspective on ITC can be obtained from standard coefficients computed for each family
member separately, but those address only the individual family members, not the family unit.
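The correlation structure described above can be illustrated by forming the direct (Kronecker) product of a 2 x 2 family member correlation matrix and a 2 x 2 item-by-scale correlation matrix. This is a sketch under stated assumptions: the ordering of the four measurements (member 1 item, member 1 corrected scale, member 2 item, member 2 corrected scale) and the helper names are hypothetical, only correlations (not the separate variances) are shown, and the exact parameterization used in the SAS macros is not reproduced here. The ITC and IFC values are those of the last item in Table 4.

```python
def kron(a, b):
    """Kronecker (direct) product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def composite_itc_correlations(ifc, itc):
    """4 x 4 correlation matrix implied by one IFC and one ITC parameter."""
    between_members = [[1.0, ifc], [ifc, 1.0]]
    item_with_scale = [[1.0, itc], [itc, 1.0]]
    return kron(between_members, item_with_scale)


# Composite ITC and IFC for the last item listed in Table 4.
R = composite_itc_correlations(ifc=0.38, itc=0.52)
for row in R:
    print([round(value, 4) for value in row])
```

The six off-diagonal correlations consist of the ITC (twice), the IFC (twice), and their product (twice), matching the description in the text.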
Composite ITC Example
Table 4 contains (corrected) ITCs, separate values for mothers and fathers as well as
composite values, along with associated IFCs for the 10 Concern Scale items. Items are listed in
increasing order of composite ITCs. Composite ITCs range from .24 to .52 with all significantly
nonzero (p < .01). IFCs range from .24 to .53 with all significantly nonzero (p < .01). Seven of
the Concern Scale items are discriminating while the other three are moderately discriminating,
suggesting that it is reasonable from this perspective to retain all the items in the scale. Separate
ITC values can be quite different for mothers and fathers (as much as .11 apart), and can be only
moderately discriminating for one parent when discriminating for the other parent. However,
when items are considered discriminating if discriminating for both parents and moderately
discriminating if moderately discriminating for one parent and moderately discriminating or
better for the other parent, the Concern Scale items are categorized in exactly the same way as by
the composite ITC. This suggests that it is sufficient to compute the composite coefficient,
producing a consistent categorization of items while also providing information on the family
unit.
Composite Differential Reliability Index (DRI)
An item's ITC can be combined with its correlation with a measure of social desirability to
compute its DRI, measuring how much of an effect it has on its scale beyond what is attributable
to its correlation with social desirability (Jackson, 1970; Streiner & Norman, 2003). A new
formulation is not required to generalize DRI to a composite value based on the combined
responses for two family members. It is computed from the composite ITC and the composite
correlation for an item with social desirability. This latter correlation is computed using the same
approach as for the composite CVC, but applied to an item rather than to a scale. These two
correlation values are combined using the standard formula for the DRI (e.g., see Streiner &
Norman, 2003, p. 87).
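As an illustration, the sketch below assumes the commonly stated form of the DRI, the square root of the squared ITC minus the squared item-social-desirability correlation. That form is an assumption here, not a quotation of the cited page, although it reproduces several Table 5 entries to two decimals. The function name is hypothetical.

```python
from math import sqrt


def differential_reliability_index(itc, sd_corr):
    """DRI assuming the form sqrt(ITC^2 - r_sd^2).

    itc     -- (composite) item-total correlation
    sd_corr -- (composite) correlation of the item with social desirability
    """
    remainder = itc ** 2 - sd_corr ** 2
    if remainder < 0:
        raise ValueError("social desirability correlation exceeds the ITC in magnitude")
    return sqrt(remainder)


# Composite ITC and social desirability correlation for two Table 5 items.
print(round(differential_reliability_index(0.29, 0.11), 2))
print(round(differential_reliability_index(0.30, 0.10), 2))
```

Because the tabled inputs are themselves rounded to two decimals, recomputed values can differ from the reported DRI in the last digit for some items.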
DRI Example
The short version of the Marlowe-Crowne Social Desirability Scale (Strahan & Gerbasi,
1972) was used in the FaMM Study to assess the effect of social desirability on responses to the
FaMM items and scales. This scale is computed as the sum of 10 items scored as either 0 or 1.
Larger scores indicate more of an influence of social desirability on other responses. Social
desirability was missing for one parent, and so results used responses for 578 parents. Table 5
contains composite DRI values for the Concern Scale items together with associated composite
ITCs (reproduced from Table 4) and composite item-social-desirability correlations used in their
computation. Items are listed in Table 5 in the same order as in Table 4, which is also
increasing order of composite DRI. Composite DRI ranges from .24 to .51 (Table 5), and so all Concern
Scale items are at least moderately discriminating in terms of DRI (using the same definition as
for the ITC). Associated composite item-social-desirability correlations are quite small, ranging
from −.10 to .11, while their associated IFCs are substantial, ranging from .22 to .49, with all of
these significantly nonzero (p < .01). Three of these item-social-desirability correlations are
significant (p < .05), but the composite correlation between concern and social desirability is .00
(p = .96), and so scale values are not influenced by social desirability. Also, social desirability
has undesirable composite ICR levels of .60 for mothers and .62 for fathers, possibly due to the
fact that the items are binary valued, with associated IFC of .33 (p < .01).
A convention does not appear to exist for what DRI values to consider acceptable, but a
small change from the ITC would indicate an acceptable impact of social desirability. For
example, composite DRI values for the Concern Scale items differ from associated composite
ITCs by at most .02, with decreases of at most 6.9%, indicating that social desirability has at
most a limited impact on the effect of items of the Concern Scale.
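The size of the change from composite ITC to composite DRI can be checked directly from the Table 5 values. The sketch below is only arithmetic on the tabled (two-decimal) numbers; the variable names are hypothetical.

```python
# (composite ITC, composite DRI) pairs transcribed from Table 5, in table order.
itc_dri_pairs = [
    (0.24, 0.24), (0.27, 0.26), (0.29, 0.27), (0.30, 0.28), (0.33, 0.32),
    (0.40, 0.39), (0.41, 0.40), (0.47, 0.47), (0.48, 0.47), (0.52, 0.51),
]

# Largest absolute drop and largest percent decrease from ITC to DRI.
max_drop = max(itc - dri for itc, dri in itc_dri_pairs)
max_percent_decrease = max(100 * (itc - dri) / itc for itc, dri in itc_dri_pairs)

print(round(max_drop, 2), round(max_percent_decrease, 1))
```

The largest percent decrease arises for the item with composite ITC .29 and composite DRI .27.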
Computation
Computations were conducted using the linear mixed modeling procedure PROC MIXED
and the matrix language procedure PROC IML available in SAS Version 9.1 (SAS Institute, Inc.,
Cary, NC). Measurements for some analyses needed normalizing for maximum likelihood
routines to converge. For example, when computing composite ITCs, item and corrected scale
values were divided by their observed maximum values, so that both normalized variables had
the same upper bound of 1. SAS macros for conducting these computations are available at
http://www.ohsu.edu/son/faculty/knafl/comprel.html (accessed ?/?/08).
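The normalization step described above is simple to reproduce. The sketch below is in Python rather than SAS, uses a hypothetical function name and made-up scores, and assumes the observed maximum is positive; it is not taken from the macros themselves.

```python
def normalize_by_observed_max(values):
    """Divide each value by the observed maximum so the upper bound is 1."""
    observed_max = max(values)
    if observed_max <= 0:
        raise ValueError("observed maximum must be positive")
    return [value / observed_max for value in values]


# Hypothetical item scores (1-5) and corrected scale scores for five respondents.
item_scores = [1, 3, 5, 4, 2]
corrected_scale_scores = [18, 27, 36, 30, 21]

normalized_item = normalize_by_observed_max(item_scores)
normalized_scale = normalize_by_observed_max(corrected_scale_scores)
print(max(normalized_item), max(normalized_scale))
```

Because correlations are unchanged by positive rescaling, this step should affect convergence of the likelihood routines rather than the resulting ITC estimates.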
Discussion
Standard reliability and construct validity coefficients are appropriately computed only for
one family member at a time because their formulations assume independence across subjects.
This paper formulates methods for calculating composite versions of these coefficients from
combined responses for family dyads, which account for the dyad as a unit through the intra-
familial correlation (IFC) in responses for that dyad. Composite coefficients provide more
comprehensive measures of reliability and construct validity based on responses for both
members of a family dyad combined. Their values may not differ much, however, from
those computed separately one family member at a time, especially if subsample sizes are not too
small. This was the case for reported internal consistency reliability (ICR) values, but not for
reported test-retest reliability values based on data for a much smaller set of subjects. Separate
values can provide useful information, but provide an incomplete assessment only addressing
family members individually, not the dyad as a unit. In our opinion, composite coefficients and
associated IFC values should be reported as standard practice, possibly along with separate
estimates if they provide useful information. The importance of accounting for IFC in these
contexts is supported by the example analyses for which most IFC values were significant and of
substantial strength. Even when the IFC is nonsignificant, not accounting for this possible
dependence ignores an important structural aspect of the relationship for couples (Kashy &
Snyder, 1995). That is, a couple together is more than the aggregate of its two members.
Whether or not composite ICR values are close to those computed separately for each family
member, associated IFC values represent an important aspect of the reliability of scales used to
assess family dyads. This is especially important in development of new scales. A nonsignificant
IFC for a scale intended for use with family dyads and computed from scale development survey
data with subjects purposely recruited to be broadly representative of those dyads might suggest
that the scale is an unacceptable family measure. Even if a new family scale has acceptable
levels of reliability for both members of the dyad, if that is achieved without a substantive IFC,
the scale represents a latent construct that reflects only individual family members, not the family
unit they form. For example, if one of the scales of the Family Management Measure had a
nonsignificant IFC for the survey responses used to develop those scales, its value in measuring a
dimension of the management of childhood chronic conditions by the family as a unit would be
questionable. The IFC is also important for assessing responses to established scales, but a
nonsignificant value in that case may be an interesting result suggesting a weak relationship for
the kinds of family members under study rather than a reliability problem for the associated
scale. For example, Knafl, Knafl, and McCorkle (2005) report an analysis of parental
assessments of school performance for a child with a chronic condition and for a healthy sibling.
There were two IFC dimensions, an inter-parental correlation between assessments by different
parents of the same child and an inter-sibling correlation between assessments of different
children by the same parent. The inter-sibling correlation was nonsignificant (p = .39), indicating
that each parent evaluated the school performance of each child independently from the school
performance of the other child, while the inter-parental correlation was significant (p < .01),
indicating that as expected each parent's assessment of each individual child's school
performance was strongly related to the assessment of the other parent for that same child.
Current guidelines for reporting research study results (Wilkinson & The Task Force on
Statistical Inference, 1999) treat ICR as a property of scale responses rather than of the scale
itself, and so recommend that ICR computed from observed responses be reported as common
practice. Analogously, reliability for scale responses from family research studies can be
considered a property of responses for all family members combined rather than responses for
individual types of family members separately. Composite ICR coefficients provide such a
more comprehensive assessment of reliability. However, their computation is not generally
supported in statistical software tools at this time.
Consequently, macros have been developed as part of this research that automate the
computation of composite reliability and construct validity coefficients within SAS, a popular
statistical software tool. These macros are available on the Internet for use by researchers with
access to SAS (when licensed to use PROC MIXED and PROC IML). Familiarity with using the
SAS programming interface is needed to run these macros, but a novice level would be
sufficient. Computation within other statistical software tools is feasible, but would require a
sophisticated level of familiarity with the programming interface of the tool to automate.
This work also applies to simple repeated measurement situations with responses to items
and scales repeated over two arbitrary conditions, like two treatments or two time points, or to
any kind of dyad. The generalization to any dyad is the same as for two family members.
However, further work is needed to extend standard reliability and construct validity coefficients
to arbitrary repeated measurement contexts. Generalization using similar approaches is possible
for repeated measurements over more than two conditions, like multiple family members,
treatments, or time points, but that is an issue for future research.
This work is also only a first step in generalizing psychometric methods to address survey
instruments used in family research. For example, standard factor analysis models assume
independence across subjects and so are appropriately applied only to item responses for one
member in each surveyed family. It would be preferable to factor analyze the complete set of
responses for all surveyed family members in combination while accounting for IFC in item
responses within families, but that is an issue for future research.
References
Arminger, G., & Sobel, M. E. (1990). Pseudo-maximum likelihood estimation of mean and
covariance structures with missing data. Journal of the American Statistical Association, 85,
195-203.
Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data:
Basic and advanced issues. Journal of Family Psychology, 19, 98-110.
Brown, H., & Prescott, R. (1999). Applied mixed models in medicine. New York: Wiley.
Cook, W. L., & Snyder, D. K. (2005). Analyzing nonindependent outcomes in couple therapy
using the Actor-Partner Interdependence Model. Journal of Family Psychology, 19, 133-141.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks,
CA: Sage.
Graham, J. M., Liu, Y. J., & Jeziorski, J. L. (2006). The Dyadic Adjustment Scale: A reliability
generalization meta-analysis. Journal of Marriage and Family, 68, 701-717.
Hatcher, L. (1994). A step-by-step approach to using SAS for factor analysis and structural
equation modeling. Cary, NC: SAS Institute.
Jackson, D. N. (1970). A sequential system for personality scale development. In C. D.
Spielberger (Ed.), Current topics in clinical and community psychology, vol. 2 (pp. 61-96).
New York: Academic Press.
Kashy, D. A., & Snyder, D. K. (1995). Measurement and data analytic issues in couples research.
Psychological Assessment, 7, 338-348.
Knafl, G. J., Dixon, J. K., O'Malley, J. P., Grey, M., Deatrick, J., Gallo, A., & Knafl, K. A. (in
review). Scale development supported by likelihood cross-validation.
Knafl, G. J., Knafl, K. A., & McCorkle, R. (2005). Mixed models incorporating intra-familial
correlation through spatial autoregressive methods. Research in Nursing and Health, 28,
348-356.
Martini, T. S., Grusec, J. E., & Bernardini, S. C. (2001). Effects of interpersonal control,
perspective taking, and attributions on older mothers' and adult daughters' satisfaction with
their helping relationship. Journal of Family Psychology, 15, 688-705.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York:
McGraw-Hill.
O'Leary, S. G., & Vidair, H. B. (2005). Marital adjustment, child-rearing disagreements, and
overreactive parenting: Predicting child behavior problems. Journal of Family Psychology,
19, 208-216.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and
interval estimation using a structural equation modeling approach. Structural Equation
Modeling, 9, 195-212.
SAS Institute Inc. (2004). SAS/STAT 9.1 user's guide. Cary, NC: SAS Institute Inc.
Schott, J. R. (2005). Matrix analysis for statistics (2nd ed.). Hoboken, NJ: J. Wiley & Sons.
Sclove, L. S. (1987). Application of model-selection criteria to some problems in multivariate
analysis. Psychometrika, 52, 333-343.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models,
and individual growth curves. Journal of Educational and Behavioral Statistics, 24, 323-355.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park:
Sage.
Stein, R., & Jessop, D. (1990). Functional Status II: A measure of child health status. Medical
Care, 28, 431-438.
Strahan, R., & Gerbasi, K. (1972). Short, homogeneous version of the Marlowe-Crowne Social
Desirability Scale. Journal of Clinical Psychology, 28, 191-193.
Streiner, D. L., & Norman, G. R. (2003). Health measurement scales: A practical guide to their
development and use. Oxford: Oxford University Press.
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error
affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-
20.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York:
Springer.
Wang, J. (2002). Reliability generalization: An HLM approach. Journal of Instructional
Psychology, 29, 213-218.
Wilkinson, L., & The Task Force on Statistical Inference. (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Yeh, H.-C., Lorenz, F. O., Wickrama, K. A. S., Conger, R. D., & Elder, G. H., Jr. (2006).
Relationships among sexual satisfaction, marital quality, and marital instability at midlife.
Journal of Family Psychology, 20, 339-343.
Table 1
Summary Statistics and Internal Consistency Reliability (ICR) for the Concern and Manageability Scales of the Family Management Measure [a]

                                Mean                              SD                 ICR
                                Mothers               Fathers     Mothers  Fathers   Mothers  Fathers
Scale          Type             Partnered  Single
Concern        composite [b]    26.9       28.4 [c]   26.9 [d]    6.8      7.2       .73      .77
               separate         26.8       28.4       27.1        6.6      7.0       .72      .75
Manageability  composite [e]    49.3       47.5 [f]   49.4 [g]    6.2      5.9       .71      .73
               separate         49.3       47.5       49.3        6.3      5.9       .71      .73

[a] Using responses from 579 parents, 414 mothers and 165 fathers, of 417 families including 162 families with both parents responding.
[b] The associated intra-familial correlation (IFC) was .57 and was significantly nonzero at p < .01.
[c] Not significantly different from the mean for partnered mothers at p = .08.
[d] Not significantly different from the mean for partnered mothers at p = .94.
[e] The associated IFC was .33 and was significantly nonzero at p < .01.
[f] Significantly different from the mean for partnered mothers at p = .04.
[g] Not significantly different from the mean for partnered mothers at p = .76.
Table 2
Support for Composite Construct Validity for the Concern and Manageability Scales of the Family Management Measure [a]

                    Hypothesized    Correlation with Child    Intra-Familial
Scale               Relationship    Functional Status [b]     Correlation
Concern [c]         −               −.32 [d]                  .52 [d]
Manageability [c]   +               .32 [d]                   .38 [d]

[a] Using responses from 577 parents, 413 mothers and 164 fathers, from 415 families including 160 families with both parents responding. Child functional status was missing for the other 2 parents.
[b] Composite internal consistency reliability (ICR) for child functional status was .79 for mothers and .81 for fathers with associated intra-familial correlation .58 (p < .01).
[c] Composite ICR reported in Table 1.
[d] All significantly nonzero at p < .01.
Table 3
Test-Retest Reliability (TRR) for the Concern and Manageability Scales of the Family Management Measure [a]

                Separate Analysis              Composite Analysis
                Separate TRR                   Full                 Reduced
Scale           Mothers  Fathers  BIC [b,c]    TRR       BIC [b]    TRR       BIC [b]
Concern         .89      .84      780.8        .87 [d]   771.2      .87 [e]   768.8
Manageability   .82      .64      823.2        .79 [f]   817.4      .79 [g]   814.5

[a] Using responses from 65 parents, 41 mothers and 24 fathers, of 44 families including 21 families with both parents responding, that were retested within 2-4 weeks of their initial interviews.
[b] BIC is the Bayesian information criterion with smaller values indicating better models.
[c] Computed using the combined data for mothers and fathers under the model equivalent to the separate models for each parent while treating IFC as zero.
[d] The associated intra-familial correlation (IFC) was .34 and was significantly nonzero at p = .02.
[e] The associated IFC was .32 and was significantly nonzero at p = .02.
[f] The associated IFC was −.16 and was not significantly different from zero at p = .29. When treated as zero, the composite TRR was .78.
[g] The associated IFC was −.17 and was not significantly different from zero at p = .27. When treated as zero, the composite TRR was .78.
Table 4
Composite Item-Total Correlation (ITC) for the Items of the Concern Scale of the Family Management Measure [a]

ITC                                    Intra-Familial
Separate             Composite [b]     Correlation [c]
Mothers   Fathers
.23       .33        .24               .34
.27       .35        .27               .35
.33       .25        .29               .30
.31       .40        .30               .47
.37       .33        .33               .39
.39       .48        .40               .34
.41       .52        .41               .53
.51       .43        .47               .31
.51       .46        .48               .24
.51       .59        .52               .38

[a] Using responses from 579 parents, 414 mothers and 165 fathers, of 417 families including 162 families with both parents responding.
[b] All composite ITCs were significantly nonzero at p < .01.
[c] All intra-familial correlations were significantly nonzero at p < .01.
Table 5
Composite Differential Reliability Index for the Items of the Concern Scale [a]

Composite       Composite      Composite Social
Differential    Item-Total     Desirability           Intra-Familial
Reliability     Correlation    Correlation [b]        Correlation [c]
Index                          value      p
.24             .24            −.03       .44         .23
.26             .27            .06        .16         .26
.27             .29            .11        .01         .22
.28             .30            .10        .02         .41
.32             .33            .07        .12         .33
.39             .40            −.06       .15         .28
.40             .41            −.03       .51         .49
.47             .47            −.03       .47         .23
.47             .48            −.10       .02         .23
.51             .52            −.04       .33         .32

[a] Using responses from 578 parents, 414 mothers and 164 fathers, of 417 families including 162 families with both parents responding. Social desirability was missing for the other 1 parent. Items are listed in the same order as in Table 4.
[b] Composite internal consistency reliability (ICR) for social desirability was .60 for mothers and .62 for fathers with associated intra-familial correlation .33 (p < .01). The composite correlation between social desirability and concern was .00 (p = .96) with intra-familial correlation (IFC) .44 (p < .01).
[c] All intra-familial correlations were significantly nonzero at p < .01.