RESEARCH REPORT January 2003 RR-03-02 Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions Research & Development Division Princeton, NJ 08541 Matthias von Davier

  • Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions

    Matthias von Davier

    Educational Testing Service, Princeton, NJ

    January 2003

  • Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

    Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541

  • Abstract

    Many large-scale assessment programs in education utilize “conditioning models” that

    incorporate both cognitive item responses and additional respondent background variables

    relevant for the population of interest. The set of respondent background variables serves as a

    predictor for the latent traits (proficiencies/abilities) and is used to obtain a conditional prior

    distribution for these traits. This is done by estimating a linear regression, assuming normality of

    the conditional trait distributions given the set of background variables. Multiple imputations, or

    plausible values, of trait parameter estimates are used in addition to or, better, on top of the

    conditioning model—as a computationally convenient approach to generating consistent

    estimates of the trait distribution characteristics for subgroups in complex assessments. This

    report compares, on the basis of simulated and real data, the conditioning method with a recently

    proposed method of estimating subgroup distribution statistics that assumes marginal normality.

    Study I presents simulated data examples where the marginal normality assumption leads to a

    model that produces appropriate estimates only if subgroup differences are small. In the presence

    of larger subgroup differences that cannot be fitted by the marginal normality assumption,

    however, the proposed method produces subgroup mean and variance estimates that differ

    strongly from the true values. Study II extends the findings on the marginal normality estimates

    to real data from large-scale assessment programs such as the National Assessment of

    Educational Progress (NAEP) and the National Adult Literacy Survey (NALS). The research

    presented in Study II shows differences between the two methods that are similar to the

    differences found in Study I. The consequences of relying upon the assumption of marginal

    normality in direct estimation are discussed.

    Key words: conditioning models, large-scale assessments, NAEP, NALS, direct estimation

  • Acknowledgements

    I would like to thank John Mazzeo for valuable comments on previous versions of this

    document, which improved both content and presentation. Any remaining errors are mine.

  • Introduction

    Large-scale assessments such as the National Assessment of Educational Progress

    (NAEP) estimate the distribution of academic achievement for policy relevant subgroups.

    Examples of estimates provided by large-scale assessment are means and percentages above cut

    points for the subgroups of interest. Many large-scale assessments such as NAEP use a sparse

    matrix sample design in which the number of cognitive items per respondent is kept relatively

    small. Using such designs allows the assessment to provide a broad coverage of the content

    domain while keeping the subjects’ testing time brief. This implies that individual ability

    estimates based on these kinds of assessments would have a large measurement error component,

    which has to be taken into account when reporting aggregate statistics for subgroups. Direct

    estimation procedures, by which these estimates are obtained without the generation of

    individual scores, have been the approach most commonly taken to address this analysis

    challenge. Typically, these procedures have made use of background variables along with the

    cognitive item responses to ensure a higher degree of accuracy in estimating subgroup

    characteristics compared to only using the cognitive responses. Moreover, matrix sampling

    makes it impossible to compare subjects—or groups of subjects—based on their observed item

    responses. Therefore, large-scale assessments using matrix sampling rely on item response

    theory (IRT) models (Lord & Novick, 1968; Rasch, 1960).

    To estimate the subgroup statistics of interest, ETS has employed since 1984 a particular

    approach of integrating achievement data (item responses) and background information, such as

    subgroup membership and additional student variables, into a hierarchical IRT model. This

    approach may be referred to as “direct estimation” because ETS estimates group statistics

    without the use of individual test scores. For the purposes of this report, I refer to this approach

    as ETS-DE. The core features of the ETS-DE approach include:

    1. A population model that assumes proficiencies are normally distributed conditional on a

    large number of background variables (grouping variables and other covariates). As a

    consequence, the marginal distribution (overall and for major reporting subgroups) is a

    mixture of normals.

    2. The generation of a posterior latent trait distribution of proficiency for each individual in the

    sample, which is based on an estimate of (1); a separately estimated set of IRT parameters

    that are treated as fixed and known; the cognitive item responses, the respondents’ group

  • membership; and other covariates. The mixture of these individual posterior distributions

    provides the estimate of the actual subgroup distributions.

    3. The integration over posterior distributions of examinees and some of the model parameters

    (the parameters of the population model defined later) in (1) to obtain estimates of means,

    percentages above achievement levels, etc.

    4. The use of normal approximations for the individual posteriors and a multiple-imputation

    approach (the so-called plausible values) to approximate the integration in (3). Imputations

    are used in conjunction with conditioning models based on both cognitive item responses and

    background information. The imputations are used as a mere convenience in order to

    simplify the integration in (3) and to provide data that can be used with standard tools by

    secondary analysts.

    Cohen and Jiang (1999) propose an alternative approach to direct estimation (which I

    refer to as CJ-DE in this report) of subpopulation characteristics that does not utilize additional

    background variables. Cohen and Jiang assume that CJ-DE provides consistent subgroup

    estimates without the use of background variables. The core features of CJ-DE include:

    1. A population model that assumes marginal normality, i.e., the ability distributions of all

    subgroups align in such a way that the joint distribution is normal.

    2. A measurement model for the categorical grouping variables that assumes an underlying

    continuous latent variable whose joint distribution with proficiency is normal.

    3. Use of a set of fixed/known IRT model parameters.

    4. Item responses that are used together with a single grouping variable only—the one used for

    reporting—i.e., no additional covariates like other reporting variables or their interactions are

    used in the population model.

    5. A direct calculational approach that bypasses the generation of individual posterior

    distributions and the generation of plausible values.

    Both approaches, ETS-DE and CJ-DE, may be referred to as “direct estimation” because they

    estimate group statistics without the use of individual test scores. ETS-DE uses a more general

    model, which includes grouping variables as well as additional background information and no

    specific assumption regarding the marginal proficiency distribution. CJ-DE includes the

    assumption of marginal normality and ignores all the additional background information other

  • than a single grouping variable. This report presents a comparison of ETS-DE and CJ-DE using

    simulated and real data.

    The ETS-DE Methodology

    For obtaining estimates of subpopulation distributions, ETS-DE involves a two-phase

    procedure that uses achievement data (item responses) and respondents’ background

    information. Key references for a more detailed outline of the conditioning model used by the

    ETS-DE method are Mislevy (1991), Mislevy, Beaton, Kaplan, and Sheehan (1992) and Thomas

    (1993, 2002). The two phases of the method, which sometimes are confused when discussed in

    secondary literature, are:

    1. Estimation of parameters for the conditioning, or population, model.

    2. Production of plausible values from individual posterior distributions given the model

    parameters, item responses, and background data.

    The Conditioning Model

    The method used for analyzing large-scale assessments at ETS uses both item responses

    and background information, sometimes numbering up to one hundred conditioning variables.

    Assume that there are k scales in the assessment and that each proficiency scale follows a

    unidimensional IRT model¹ with the usual assumption of conditional independence given θ_k, i.e.,

    $$P(x_{11}, \ldots, x_{J(K)K} \mid \theta) = \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k) \tag{1}$$

    The conditioning model combines the k-scale IRT model with a k-dimensional

    multivariate latent regression model in order to maximize the likelihood based on the posterior

    distribution of the latent trait θ = (θ_1, ..., θ_k):

    $$L(\theta \mid x, y) \propto f(\theta \mid x, y) \sim \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)\, \phi(\theta \mid y) \tag{2}$$

    where the prior φ(θ | y) is assumed to be normal with θ | y ~ N(Γ'y, Σ). The latent trait θ is

    unobserved and must be inferred from the observed item responses. The predictor y is a vector of

    individual values on a set of conditioning variables, Γ is a matrix of regression weights, and Σ is

    the residual variance-covariance matrix. Note that at ETS, three software programs are currently

    available to carry out the estimation: NGROUP, BGROUP, and CGROUP. All implementations

    are based on the EM (expectation-maximization) algorithm. In the E-step, the posterior

    distribution of θ given item responses and conditional on the background variables is computed

    for each individual. These estimates are then used in the M-step to obtain the regression weights

    Γ and the residual covariance matrix Σ. The approaches implemented in NGROUP, BGROUP,

    and CGROUP differ with respect to how each carries out the E-step:

    1. NGROUP assumes that the item likelihood ∏_{k=1..K} ∏_{j=1..J(k)} P(x_jk | θ_k) can be approximated by

    a multivariate normal distribution and has limited use. (It may be used only for generating

    starting values for CGROUP or with extremely long scales.)

    2. BGROUP does not assume any specific form of the item likelihood and uses a numerical

    quadrature in the E-step. To date, BGROUP has been shown to not be computationally

    feasible in more than two dimensions.

    3. CGROUP is designed to be computationally feasible for more than two dimensions (it uses a

    Laplace approximation in the E-step). CGROUP is used most frequently in NAEP since most

    subject areas have multiple scales and require reporting on a composite.
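The E-step and M-step described above can be illustrated with a small sketch for the one-dimensional case. This is not the NGROUP, BGROUP, or CGROUP code: the two-parameter logistic item response function, the equally spaced grid standing in for numerical quadrature, and all function names are assumptions made for illustration. Item parameters are treated as fixed and known, as in the report.

```python
import numpy as np

def irt_2pl(theta, a, b):
    """Response probability P(x = 1 | theta); the 2PL form is an assumption."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def e_step(x, a, b, prior_mean, prior_sd, n_nodes=81):
    """Posterior mean and variance of theta for one respondent on a fixed grid."""
    nodes = np.linspace(prior_mean - 6 * prior_sd, prior_mean + 6 * prior_sd, n_nodes)
    p = irt_2pl(nodes[:, None], a, b)                    # nodes x items
    lik = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)  # item likelihood per node
    prior = np.exp(-0.5 * ((nodes - prior_mean) / prior_sd) ** 2)
    post = lik * prior
    post /= post.sum()                                   # normalized posterior weights
    mean = float(np.sum(nodes * post))
    var = float(np.sum((nodes - mean) ** 2 * post))
    return mean, var

def em_latent_regression(X, Y, a, b, n_iter=30):
    """EM for theta | y ~ N(y'gamma, sigma^2) with known item parameters."""
    n, q = Y.shape
    gamma, sigma = np.zeros(q), 1.0
    for _ in range(n_iter):
        mu = Y @ gamma
        # E-step: individual posteriors given responses and background data
        stats = np.array([e_step(X[i], a, b, mu[i], sigma) for i in range(n)])
        post_mean, post_var = stats[:, 0], stats[:, 1]
        # M-step: regression weights and residual standard deviation
        gamma = np.linalg.lstsq(Y, post_mean, rcond=None)[0]
        resid = post_mean - Y @ gamma
        sigma = float(np.sqrt(np.mean(post_var + resid ** 2)))
    return gamma, sigma
```

The residual variance update adds the individual posterior variances to the squared residuals of the posterior means, which is how the measurement error of short tests enters the population model.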

    In NAEP and other large-scale assessments analyzed at ETS, the estimation of the

    conditioning model for multivariate latent traits is carried out with BGROUP and CGROUP.

    This report uses CGROUP as the basis for evaluating the differences in direct estimation

    between the conditional normality approach (ETS-DE, as implemented in CGROUP) and the

    marginal normality approach (CJ-DE, as implemented in the AM software, see below) since

    CGROUP has been the program most frequently used for NAEP analysis purposes.

    Plausible Values

    The second phase of the ETS-DE involves the production of plausible values, which

    provide a computationally tractable approach of integrating the posterior distributions of

    respondents to estimate the target statistics in subgroups of interest. Using plausible values

    provides a means for estimating the error in the estimates due to the proficiencies being latent

    (i.e., only indirectly observed) and the uncertainty about the regression parameters in the

  • population model. In addition, plausible values provide a set of quantities that researchers can

    use with commercial statistical software to conduct a wide variety of secondary analyses.

    The BGROUP, CGROUP, and NGROUP set of programs generate multiple imputations

    for each respondent based on the estimates of Γ and Σ and on the respondents’ background data y

    and the item responses x. These plausible values are drawn from the k-dimensional posterior

    N(E(θ|y,x), Σ(θ|y,x)). In other words, the approach assumes that θ given y and x is approximately normally distributed. This conditional normality is a less restrictive assumption

    compared to the marginal normality assumption, on which CJ-DE relies (Cohen & Jiang, 1999).

    The marginal distribution in the ETS-DE conditioning model is therefore rather flexible and is not

    limited to the normal distribution, but it is actually a mixture of the conditional posterior

    distributions for the given set of item responses and background variables.
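The mixture structure can be written out as a worked equation. The notation below is an illustrative assumption rather than the report's own: N_g denotes the number of respondents in subgroup g, and sampling weights are ignored for simplicity.

$$\hat{f}(\theta \mid g) = \frac{1}{N_g} \sum_{i:\, g_i = g} f(\theta \mid x_i, y_i)$$

On the population side, if the background vector y takes the value y_c with probability π_c, the conditionally normal model implies the marginal density

$$f(\theta) = \sum_{c} \pi_c\, \phi(\theta;\, \Gamma' y_c,\, \Sigma)$$

which is normal only in special cases. For example, a mixture of two equally weighted univariate normals with common standard deviation σ and means μ_1 and μ_2 is bimodal whenever |μ_1 − μ_2| > 2σ, so a sufficiently strong group effect yields a marginal distribution that no single normal can fit.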

    In order to carry the variability due to measurement and parameter estimation errors

    through all subsequent analyses, a number of plausible values has to be drawn for each

    respondent. As a rule of thumb, five to ten plausible values are drawn in most large-scale

    assessment analyses. These plausible values are aggregated to provide consistent estimates of

    group means, variances, and percentages above cut points for the subgroups defined by the

    reporting variables. Plausible values drawn from a population model that uses item responses and

    a large amount of background information are a valuable source for studying relationships

    between the proficiency scales and secondary variables.
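A minimal sketch of how plausible values might be drawn and aggregated for one subgroup mean is given below. It assumes a univariate normal approximation to each posterior, simple random sampling for the within-imputation variance, and the usual multiple-imputation combining rules (Rubin's rules); operational programs use more elaborate variance estimation, and the function names are hypothetical.

```python
import numpy as np

def draw_plausible_values(post_mean, post_sd, n_pv=5, rng=None):
    """Draw n_pv values per respondent from N(post_mean, post_sd^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(post_mean[:, None], post_sd[:, None],
                      size=(post_mean.size, n_pv))

def combine_group_mean(pv, in_group):
    """Group mean and standard error across plausible values."""
    sub = pv[in_group]                       # respondents in the group x n_pv
    n, m = sub.shape
    means = sub.mean(axis=0)                 # one estimate per plausible value
    u = sub.var(axis=0, ddof=1) / n          # sampling variance (assumes SRS)
    b = means.var(ddof=1)                    # between-imputation variance
    total_var = u.mean() + (1.0 + 1.0 / m) * b
    return float(means.mean()), float(np.sqrt(total_var))
```

The between-imputation term is what carries the uncertainty due to the latent nature of the proficiencies into the reported standard error; with a single plausible value it could not be estimated.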

    The CJ-DE Methodology

    Marginal normality based direct estimation, or CJ-DE (Cohen & Jiang, 1999), is a

    recently proposed method of estimating subgroup statistics based on a number of assumptions

    regarding a) the marginal distribution of the latent trait and b) its relation to a set of group

    indicator variables. The following studies use simulated and real data to compare the results from

    the ETS-DE and CJ-DE methods. The study of real data offers a determination as to whether CJ-

    DE yields estimates consistent with the results of more general models.

    The software package AM (Cohen, 1998) implements the CJ-DE approach and is

    available for the Windows operating system. The software provides modules for CJ-DE and

    additional modules for univariate and composite regressions of the latent trait on a number of

    predictors, which is referred to as marginal maximum likelihood (MML) regression in the AM

  • package. While the focus of this study is to compare CJ-DE with the ETS-DE conditioning

    approach, AM's MML regression was used to make sure that both software programs—AM and

    CGROUP—agree on the data structure. AM provides two procedures for CJ-DE that were

    developed “…to consistently estimate subpopulation distributions when the groups are defined

    by values of a [nominal or ordinal variable]” (Cohen, 1998; Cohen & Jiang, 1999). The AM

    modules implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table”

    (NT) in the software, depending on the scale level of the grouping variable. Both the OT and the

    NT modules assume that the latent trait θ is marginally normally distributed (Cohen, 1998;

    Cohen & Jiang, 1999), so that the estimates of a finite mixture of subgroup distributions have to

    fit this assumption.

    In contrast to this assumption, the conditional normality estimation—ETS-DE, which is

    used in NAEP's conditioning model and other large-scale assessment programs—does not rely

    on assuming a certain form of the trait parameters’ marginal distribution. The marginal

    distribution in the conditioning model is a mixture of normals. In addition, NAEP uses a

    multinomial distribution to approximate the marginal distribution of θ for item calibration

    (Yamamoto & Mazzeo, 1992), so that the item parameters used in the conditioning model are not

    based on a certain form of the marginal trait distribution.

    Central Assumptions Driving CJ-DE

    Cohen and Jiang (1999) propose to use the following approach in order to estimate

    subgroup statistics:

    a) Assume a latent trait θ ~ N(μ, σ). θ is usually unobserved and has to be inferred from the

    subjects’ responses to a number of items (x_1, ..., x_k).

    b) Assume that there are m groups, where the group membership g_i indicates the maximum

    outcome on a number of m unobserved variables, y_1, ..., y_m. That means the group membership

    of individual i equals k (g_i = k) if, for the unobserved variables, y_ki > y_li for all l ≠ k.

    c) Assume that for k = 1, ..., m, a linear relationship exists between θ and y_k (i.e., y_k = a_k + b_k θ + e_k)

    with mutually independent e_k. The error term e_k is assumed to be N(0,1), so that the conditional distribution of y_k given θ is N(a_k + b_k θ, 1).

    d) Assume that conditional on θ, the y_i are mutually independent, i.e.,

    $$P(y_i \in A,\, y_j \in B \mid \theta) = P(y_i \in A \mid \theta)\, P(y_j \in B \mid \theta) \tag{3}$$

  • Assumption (a) forces the ability distribution to be marginally normal. Assumption (c)

    also is very strong and “may not be true but is a common and powerful one” (Cohen & Jiang,

    1999). Assumptions (b) and (d) are used for defining the conditional density of

    $$f(g = k \mid \theta) = \int f(y_k \mid \theta)\, P(y_k > y_j\ \forall j \neq k \mid \theta)\, dy_k = \int f(y_k \mid \theta) \prod_{j \neq k} P(y_k > y_j \mid \theta)\, dy_k \tag{4}$$

    This conditional density, together with assumption (c) and the assumption of marginal normality

    (a), yields

    $$f(g = k, \theta) = \phi(z_\theta) \int \phi(y_k - a_k - b_k \theta) \prod_{j \neq k} P(y_k > y_j \mid \theta)\, dy_k \tag{5}$$

    where φ denotes the normal density and z_θ = (θ − μ)/σ. One more replacement uses the second

    part of assumption (c), namely that the error term e in the linear relation y_j = a_j + b_j θ + e is assumed

    to be N(0,1). This yields

    $$P(y_k - y_j > 0 \mid \theta) = P(a_j + b_j \theta + e < y_k) = P(e < y_k - a_j - b_j \theta) = \Phi(y_k - a_j - b_j \theta) \tag{6}$$

    where Φ denotes the normal distribution function. It follows that

    $$f(g = k, \theta) = \phi(z_\theta) \int \phi(y_k - a_k - b_k \theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j \theta)\, dy_k \tag{7}$$

    Finally, the conditional density of θ given group g=k is obtained by

    $$f(\theta \mid g = k) = \frac{\phi(z_\theta) \int \phi(y_k - a_k - b_k \theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j \theta)\, dy_k}{\int \phi(z_\theta) \int \phi(y_k - a_k - b_k \theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j \theta)\, dy_k\, d\theta} \tag{8}$$

    which is used to compute the conditional means and variances given subgroup g=k (see Cohen &

    Jiang, 1999). We may now define

    $$E(\theta^n \mid g = k) = \int \theta^n f(\theta \mid g = k)\, d\theta \tag{9}$$

    in order to obtain the conditional moments of θ. The parameters a_1, b_1, ..., a_m, b_m and μ, σ of

    f(θ | g=k) are estimated by maximizing the likelihood function based on the individual likelihood

    terms

    $$L(a_1, b_1, \ldots, a_m, b_m, \mu, \sigma \mid x, g = k) = \int p(x \mid \theta)\, f(g = k, \theta)\, d\theta \tag{10}$$

    for a subject in group g=k with observed responses x = (x_1, ..., x_j), and f(g=k, θ) as defined by

    Equation (7). The two approaches taken by ETS-DE and CJ-DE differ strongly with respect to

    the information incorporated in estimating subgroup characteristics. ETS-DE uses extensive

    background (conditioning) information including grouping variables in addition to the cognitive

    item responses. In contrast to that, CJ-DE only includes the grouping variable together with the

    item responses but draws on a number of strong assumptions regarding the shape of the marginal

    ability distribution and the relation between θ and the grouping variable. The following section

    presents examples of the differences found between both approaches with respect to recovering

    known subgroup characteristics of simulated data.
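The joint density in Equation (7) and the conditional moments in Equation (9) can be evaluated numerically. The following sketch uses simple equally spaced grids for both integrals; the grid ranges, function names, and the parameter values in the usage note are hypothetical, and this is not the AM implementation. The density includes the factor 1/σ so that it integrates to the group proportion; this constant cancels in Equation (8).

```python
import numpy as np
from math import erf, pi, sqrt

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)

_erf = np.vectorize(erf)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + _erf(z / sqrt(2.0)))

def joint_density(theta, k, a, b, mu, sigma, y_grid):
    """f(g = k, theta) on a theta grid, following Equation (7)."""
    yk = y_grid[:, None]                              # y grid x theta grid
    integrand = phi(yk - a[k] - b[k] * theta[None, :])
    for j in range(len(a)):
        if j != k:
            integrand = integrand * Phi(yk - a[j] - b[j] * theta[None, :])
    dy = y_grid[1] - y_grid[0]
    inner = integrand.sum(axis=0) * dy                # integral over y_k
    return phi((theta - mu) / sigma) / sigma * inner

def conditional_moment(n, k, a, b, mu, sigma, theta_grid, y_grid):
    """E(theta^n | g = k) as in Equation (9); the grid spacing cancels."""
    f = joint_density(theta_grid, k, a, b, mu, sigma, y_grid)
    return float(np.sum(theta_grid ** n * f) / np.sum(f))
```

As a usage example with made-up parameters, two groups with a = (0, 0), b = (1, −1), μ = 0, and σ = 1 give conditional means of equal size and opposite sign, and each conditional density is skewed even though the marginal distribution of θ is exactly normal by construction.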

    Study I: Simulation Results

    The examples presented in this section compare ETS-DE and CJ-DE based on simulated

    data where each simulee responds to a limited set of test items and is additionally characterized

    by a small set of background variables. The simulated data sets resemble some characteristics of

    NAEP, such as the number of items per subscale. Short subscales in NAEP typically consist of

    an average of 6 items across booklets; long subscales consist of approximately 12 items. The

    number of subscales or dimensionality of the latent trait, k=3 in the simulations, also is found in

    NAEP. The number of background variables in the simulation is smaller than what is typically

    used in NAEP’s conditioning approach. While NAEP’s conditioning model may include up to

    hundreds of background variables, the simulated data used in the present study limits the number

    of background variables to the three made-up variables, GROUP, SES, and GENDER. Four

    distinct data sets were simulated following a 2 x 2 design, varying:

    1. The number of items per subscale (6 versus 12 items).

    2. The dependency of the latent traits on the background variables: Setup (1) had a strong

    dependency leading to multimodal marginal trait distributions, while Setup (2) had a weak

    dependency resulting in unimodal, but possibly platykurtic marginals.

    Using two different linear models created the two levels of dependency of the latent traits on

    the background variables. Two different sets of regression weights were used to generate the

    three-dimensional trait parameters (θ_1, θ_2, θ_3). Each latent trait value θ_i for i in 1, 2, 3 was

    generated based on a linear model

    $$\theta_i = \gamma_1 y_{\mathrm{GENDER},i} + \gamma_2 y_{\mathrm{SES},i} + \gamma_3 y_{\mathrm{GROUP},i} + e_i \tag{11}$$

    incorporating fictitious GENDER, SES, and GROUP effects together with normally distributed

    residuals ei. GENDER, SES, and GROUP accounted for a varying percentage of variance for the

    three trait components (see regression results below). The trait variable (θ_1, θ_2, θ_3) and its

    component-wise linear relation to GENDER, SES, and GROUP were unaffected by additional

    fictitious design variables WEIGHT, STRATA, and CLUSTER. The latter variables have been

    included to check whether zero correlations are recovered in the same way by the regression

    modules of CJ-DE and ETS-DE.
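The effect of the two dependency levels on the marginal trait distribution can be sketched with a small simulation of the linear model in Equation (11) for a single trait. The coding of the background variables (integer categories, centered), the three SES categories, the sample size, and the weights in the usage note are assumptions for illustration; they are not the generating values used in the report.

```python
import numpy as np

def simulate_trait(weights, resid_sd, n=20000, seed=0):
    """Generate one trait as a linear function of GENDER, SES, GROUP plus noise."""
    rng = np.random.default_rng(seed)
    gender = rng.integers(0, 2, n)        # two categories
    ses = rng.integers(0, 3, n)           # three categories (hypothetical)
    group = rng.integers(0, 5, n)         # five categories
    y = np.column_stack([gender, ses, group]).astype(float)
    y -= y.mean(axis=0)                   # center the predictors
    theta = y @ np.asarray(weights, dtype=float) + rng.normal(0.0, resid_sd, n)
    return theta, y

def excess_kurtosis(v):
    """Sample excess kurtosis; negative values indicate a platykurtic shape."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)
```

For instance, weights of (1.68, 0.28, 0.19) with a residual standard deviation of about 0.43, loosely patterned on the Scale 1 TRUE column of Table 2, give a clearly bimodal marginal with strongly negative excess kurtosis. Weights of (1.0, 0.1, 0.1) with a residual standard deviation of 0.8 give a unimodal but mildly platykurtic marginal.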

    Setup 1, which includes one bimodal and one multimodal marginal, was included to

    examine how CJ-DE performs in situations where its marginal distribution assumptions are

    clearly violated. Setup 2 represents a more typical situation in which the marginal distributions

    are unimodal but more platykurtic than the normal (see Figure 1). Data were generated for the

    six-item test for both Setups 1 and 2; the item parameters used to generate the data are given in

    Appendixes A and B. However, only the six-item test is presented for Setup 2, since the pattern of

    results obtained for the two test lengths was similar in Setup 1.

  • Figure 1. Histograms of marginal distributions for Setups 1 (left) and 2 (right).

  • Figure 1 shows histograms with integrated density plots for Setup 1 (left column) and

    Setup 2 (right column), crossed by the three (from top to bottom row) simulated latent traits.

    Setup 1 on the left results in a clearly bimodal marginal for Dimension 1, whereas in Setup 2, the

    marginals are platykurtic or skewed, but not obviously multimodal.

    In Setup 1, the large proportion of variance of θ accounted for by the fictitious GROUP and

    GENDER variables produced bimodal (for GENDER) or multimodal marginal θ distributions. In Setup 2,

    the proportion of variance explained by the fictitious conditioning variables GENDER, SES, and

    GROUP was reduced, so that the resulting marginal θ distributions are unimodal but platykurtic.

    The marginal distribution of θ_1 is a mixture of two subpopulations where the mean difference

    between subgroups is due to the fictitious GENDER variable. θ_2 is a mixture of five normals

    with common variance but slightly different means due to the five-category variable GROUP.

    The third variable, θ_3, can be viewed as the “control dimension” in both setups (i.e., the subgroup

    distributions are all identical as there is no effect of the conditioning variables on latent trait θ_3).

    Setup 2 can be viewed as a less extreme, non-bimodal version of Setup 1 with higher

    intercorrelations between the θ variables. The data generated by both setups were analyzed with

    the ETS-DE and CJ-DE approaches to direct estimation. The results of both methods were

    compared to the true values obtained from analyzing the actual θ values used for generating the

    item responses. Tables 1a and 1b show the marginal correlations obtained from analyzing the

    simulees’ generating θ values, both for the 6- and the 12-item data sets.

    Table 1a

    Marginal θ Distributions in Setup 1, Correlations Between θ Dimensions

                 [,1]        [,2]        [,3]
    [1,]    1.0000000   0.3985606   0.1620800
    [2,]    0.3985606   1.0000000   0.1832677
    [3,]   –0.1620800   0.1832677   1.0000000

  • Table 1b

    Marginal θ Distributions in Setup 2, Correlations Between θ Dimensions

                 [,1]        [,2]        [,3]
    [1,]    1.0000000   0.6499676   0.5401054
    [2,]    0.6499676   1.0000000   0.7718106
    [3,]    0.5401054   0.7718106   1.0000000

    The following sections present results based on the generating true θ values on the one

    hand and the two approaches to direct estimation of subgroup statistics on the other. To clarify

    that the expected differences between CJ-DE and ETS-DE are the result of differences in model

    assumptions, the agreement of both software packages on the correlational structure of the

    simulated data was assessed. To check this, the recovery of regression weights and the residual

    variance covariance matrix of both AM (the software used for CJ-DE) and CGROUP (the

    software for ETS-DE) was analyzed.

    Regression Module Comparison

    The regression module comparison is a check of agreement between both programs using

    the same data. The regression of the three-dimensional latent trait θ on the variables INTER

    (explicit intercept), GENDER, SES, GROUP, STRATA, and CLUSTER was compared. The

    results in Table 2 are obtained by analyzing the generating θ vectors (the TRUE columns in the

    tables below) with standard regression procedures. The entries in the ETS-DE and MML

    columns stem from analyzing the item response data with the conditioning model incorporated in

    ETS-DE and with AM’s MML regression module. The MML regression module, however, is

    different from the direct estimation proposed by Cohen and Jiang (1999). The MML regression

    module closely resembles the regression part of the ETS-DE approach in the one-dimensional

    case and consequently should yield similar results when used with the same set of background

    variables. MML regression does not include the marginal normality assumptions used by CJ-DE.

    Table 2 shows the estimates of the linear model for the three-dimensional θ variable. The

    estimates show that the GENDER variable has the largest effect on θ_1, whereas the GROUP

    variable has the highest impact on θ_2 and the effects for θ_3 are close to zero for all methods, as

    expected.

    Table 2

    Regression Coefficients for the Six-item Simulated Data Set, Setup 1

                       Scale 1                 Scale 2                 Scale 3
    Effect        TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
    Constant    -3.460  -3.530  -3.480  -2.970  -2.590  -2.570   0.040   0.070   0.070
    CLUSTER     -0.030  -0.050  -0.050   0.000  -0.060  -0.060  -0.040  -0.050  -0.050
    STRATA      -0.010  -0.010  -0.010   0.000   0.000   0.000   0.010  -0.010  -0.010
    GENDER       1.680   1.760   1.740   0.430   0.340   0.340   0.000   0.020   0.030
    GROUP        0.190   0.220   0.220   0.600   0.640   0.640   0.060   0.050   0.050
    SES          0.280   0.280   0.280   0.270   0.210   0.210  -0.020   0.020   0.020

    MML regression and the regression that is part of ETS-DE agree closely on the estimates

    for this setup. Both ETS-DE and MML regression produce estimates close to those in the TRUE

    columns, even though the number of items per scale (six) is comparatively small (i.e., the inferences

    on θ used by ETS-DE and MML regression are subject to a rather large measurement error).

    Table 3 shows the respective results based on the 12-item data set.
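The degree of agreement claimed for Table 2 can be checked with a few lines of arithmetic. The sketch below uses only the Scale 1 columns of Table 2; the helper name is hypothetical.

```python
import numpy as np

effects = ["Constant", "CLUSTER", "STRATA", "GENDER", "GROUP", "SES"]

# Scale 1 columns of Table 2 (six-item data set, Setup 1)
true = np.array([-3.460, -0.030, -0.010, 1.680, 0.190, 0.280])
ets_de = np.array([-3.530, -0.050, -0.010, 1.760, 0.220, 0.280])
mml = np.array([-3.480, -0.050, -0.010, 1.740, 0.220, 0.280])

def max_abs_diff(u, v):
    """Effect with the largest absolute difference, and that difference."""
    i = int(np.argmax(np.abs(u - v)))
    return effects[i], float(abs(u[i] - v[i]))

between_programs = max_abs_diff(ets_de, mml)   # ETS-DE regression vs. MML regression
ets_vs_true = max_abs_diff(ets_de, true)
mml_vs_true = max_abs_diff(mml, true)
```

For Scale 1, the two programs differ most on the constant, while each program's largest departure from the TRUE column is on the GENDER weight.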

    Table 3

    Regression Coefficients for the 12-item Simulated Data Set, Setup 1

                       Scale 1                 Scale 2                 Scale 3
    Effect        TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
    Constant    -3.560  -3.540  -3.533  -2.940  -2.890  -2.883  -0.070  -0.030  -0.038
    CLUSTER      0.000   0.000   0.004  -0.010  -0.020  -0.016   0.040   0.040   0.043
    STRATA       0.000  -0.010  -0.008   0.000   0.000  -0.003  -0.010   0.000   0.000
    GENDER       1.700   1.720   1.718   0.470   0.510   0.505  -0.060  -0.080  -0.085
    GROUP        0.140   0.140   0.136   0.610   0.620   0.617  -0.050  -0.050  -0.047
    SES          0.300   0.290   0.290   0.220   0.180   0.183   0.080   0.040   0.045

    MML regression and ETS-DE recover the regression weights more closely if the number

    of items is doubled. Note that both methods also agree with the TRUE columns for Scale 3,

    where there is no impact on the latent variable, and as expected, all three columns show values

    close to zero. Table 4 shows the residual correlations and variances as they were obtained using

    the true θ values from the simulations as well as the corresponding values produced by the

    ETS-DE regression and MML regression algorithms.

    Table 4

    Residual Correlations With Variances in the Diagonal, Six-item Simulated Data Set

                      TRUE                   ETS-DE              MML regression
    Scale          1       2       3       1       2       3       1       2       3
    1          0.188  –0.025   0.203   0.199  –0.033   0.246   0.188  –0.025   0.249
    2                  0.194  –0.214           0.214  –0.289           0.209  –0.293
    3                          0.996                   1.127                   1.155

    Table 5 shows the results for the 12-item data set. ETS-DE and MML regression

    reproduce the residual correlations and variances in a very similar way, both for the 6-item and

    the 12-item data set in Setup 1.

    Table 5

    Residual Correlations With Variances in the Diagonal, 12-item Simulated Data Set

                      TRUE                   ETS-DE              MML regression
    Scale          1       2       3       1       2       3       1       2       3
    1          0.176  –0.036   0.186   0.183  –0.125   0.227   0.183  –0.133   0.231
    2                  0.196  –0.218           0.167  –0.143           0.167  –0.144
    3                          0.991                   0.960                   0.971

    Subgroup Distribution Recovery

    ETS-DE and CJ-DE implement two very different approaches to direct estimation. While

    ETS-DE assumes that the latent trait θ is conditionally normal given a vector of background

    data, CJ-DE assumes that the marginal latent distribution is normal, regardless of potentially

    large subgroup differences in complex samples. These two approaches are compared in this

    section with respect to the recovery of subgroup distributions. This analysis uses the example

    data previously introduced as Setup 1 (6 and 12 items) and Setup 2 (6 items).

    As shown in the previous section, the ETS-DE regression and the MML regression as

    implemented in the software packages CGROUP and AM agree on these data sets and reproduce

    the true regression parameters in a very similar way. In contrast, ETS-DE and CJ-DE incorporate

    different assumptions regarding the marginal distribution of the latent traits. Recall that the

    marginal distributions for Setup 1 are bimodal for Scale 1 and multimodal for Scale 2, because

    the background variable GENDER (two subgroups) explains a major part of the variance for

    Scale 1 whereas the background variable GROUP (five subgroups) has a strong impact on Scale

    2. It can be expected that the marginal normality assumption of CJ-DE, which is violated for

    Scales 1 and 2, will result in differences between subgroup mean estimates of ETS-DE and the

    true values on the one hand, and CJ-DE on the other hand.
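
    The bimodality referred to above can be checked numerically. The following sketch uses the TRUE Scale 1 subgroup means and standard deviations for GENDER as reported in Table 6a; the equal subgroup weights are an assumption made here, and the function names are illustrative only.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a finite normal mixture; components are (weight, mean, sd)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def mixture_moments(components):
    """Marginal mean and SD implied by subgroup weights, means, and SDs."""
    mean = sum(w * m for w, m, _ in components)
    var = sum(w * (s * s + (m - mean) ** 2) for w, m, s in components)
    return mean, math.sqrt(var)


# TRUE Scale 1 values (Female, Male) from Table 6a; equal weights are assumed.
scale1 = [(0.5, -0.849, 0.548), (0.5, 0.840, 0.527)]
marg_mean, marg_sd = mixture_moments(scale1)

dip = mixture_pdf(marg_mean, scale1)            # density between the subgroups
peak_female = mixture_pdf(-0.849, scale1)       # density at the Female mean
peak_male = mixture_pdf(0.840, scale1)          # density at the Male mean
single_normal = normal_pdf(marg_mean, marg_mean, marg_sd)  # same two moments

print(f"marginal mean = {marg_mean:.3f}, marginal SD = {marg_sd:.3f}")
print(f"mixture density: dip = {dip:.3f}, peaks = {peak_female:.3f}, {peak_male:.3f}")
print(f"single normal density at the marginal mean = {single_normal:.3f}")
```

    Under these assumptions the mixture has roughly the marginal mean and standard deviation listed in the ALL row of Table 6a, yet its density dips between the two subgroup means, whereas a single normal with the same two moments peaks exactly there.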

    Table 6a

    Subgroup Means and Standard Deviations for the Six-item Data Set, Setup 1

                   Mean                                   Standard deviation
    Scale  Group   TRUE     ETS-DE         CJ-DE          TRUE    ETS-DE  CJ-DE
    1      ALL     –0.004   0.027(.039)    -/-            1.001   1.040   -/-
    1      Female  –0.849   –0.852(.030)   –0.466(.074)   0.548   0.565   0.947
    1      Male    0.840    0.907(.040)    0.343(.116)    0.527   0.545   0.936
    2      ALL     –0.002   –0.032(.047)   -/-            0.997   0.960   -/-
    2      Female  –0.213   –0.205(.057)   –0.193(.088)   0.991   0.968   0.972
    2      Male    0.208    0.140(.058)    0.135(.113)    0.957   0.919   0.972
    3      ALL     0.014    –0.012(.045)   -/-            1.003   1.065   -/-
    3      Female  0.000    –0.024(.057)   –0.032(.058)   1.029   1.083   1.076
    3      Male    0.027    0.000(.064)    –0.003(.059)   0.975   1.045   1.08

    Note. The CJ-DE direct estimation results reported here are those closest to the true values among four trials run with AM’s “slog through” option. Rows with large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    Table 6a shows the TRUE values for the six-item data set in Setup 1 (i.e., the values

    obtained by analyzing the generating data) as well as the subgroup means and standard

    deviations as estimated by ETS-DE and CJ-DE. In addition, the values in parentheses next to the


  • subgroup mean estimates show the associated standard errors either computed with Rubin’s

    imputation formula in the case of ETS-DE or as given by the Taylor series estimates in the case

    of CJ-DE. The Taylor series estimates are produced by the CJ-DE direct estimation procedure, and

    Cohen and Jiang (1999) recommend them as appropriate standard error estimates for complex samples.

    Here, the Taylor series standard error estimates for Scales 1 and 2 are larger than the

    imputation-based estimates.
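
    For orientation, the standard form of Rubin's combining rules for multiple imputations can be sketched as follows. The report does not spell out how ETS-DE computes the individual components, so this is only the textbook version; the function name and the input values are hypothetical.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, sampling_variances):
    """Pool m imputation-based estimates with Rubin's combining rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)               # average over the imputations
    within = mean(sampling_variances)      # average sampling variance
    between = variance(estimates)          # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical subgroup means from five sets of plausible values, each with a
# hypothetical sampling variance (illustrative numbers, not from the report).
pv_means = [0.90, 0.91, 0.92, 0.89, 0.93]
pv_sampling_vars = [0.0012] * 5

pooled_mean, pooled_se = rubin_combine(pv_means, pv_sampling_vars)
print(f"pooled mean = {pooled_mean:.3f}, standard error = {pooled_se:.4f}")
```

    The between-imputation term is what distinguishes this standard error from one computed on a single set of plausible values: it adds the uncertainty that stems from the latent trait being unobserved.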

    Table 6b gives a more condensed overview of the same results. Instead of individual subgroup means, the table gives standardized mean differences

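    Z_{\text{ETS-DE}} = \frac{M_{\text{ETS-DE}} - \text{true}}{se(D_{\text{ETS-DE}})} \qquad (12)

    Z_{\text{CJ-DE}} = \frac{M_{\text{CJ-DE}} - \text{true}}{se(D_{\text{CJ-DE}})} \qquad (13)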

    as well as the variance ratio of estimated variance divided by true variance. se(D) stands for the

    standard error of the difference. Assuming the TRUE values to be fixed target statistics, the

    se(D) equals the standard error associated with the respective estimate given either by ETS-DE or

    CJ-DE. If the difference between the two estimates of a certain subgroup mean is standardized,

    se(D) equals the square root of the sum of the squared standard errors of the two statistics. The

    standardized mean differences between CJ-DE and TRUE should be ~N(0,1) if the CJ-DE model

    holds. The variance ratios given in Table 6b should be close to 1 if the approach recovers the

    values in the TRUE column.
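
    As a worked illustration of Equations 12 and 13 and of the variance ratio, the sketch below recomputes the CJ-DE entries for the Female subgroup on Scale 1 from the rounded values printed in Table 6a. The function names are illustrative, and small discrepancies from the published table are possible elsewhere because the published values were presumably computed from unrounded estimates.

```python
import math


def standardized_mean_difference(estimate, target, se_estimate, se_target=0.0):
    """(estimate - target) / se(D).

    With a fixed target (se_target = 0), se(D) is the standard error of the
    estimate; for two estimates it is the root of the summed squared errors.
    """
    se_d = math.sqrt(se_estimate ** 2 + se_target ** 2)
    return (estimate - target) / se_d


def variance_ratio(sd_estimate, sd_target):
    """Estimated variance divided by target variance, from the two SDs."""
    return (sd_estimate / sd_target) ** 2


# Scale 1, Female row of Table 6a: TRUE mean and SD, CJ-DE mean(se) and SD.
z_cj = standardized_mean_difference(-0.466, -0.849, 0.074)
ratio_cj = variance_ratio(0.947, 0.548)

# CJ-DE against ETS-DE: both are estimates, so both standard errors enter se(D).
z_cj_vs_ets = standardized_mean_difference(-0.466, -0.852, 0.074, 0.030)

print(round(z_cj, 3), round(ratio_cj, 3))  # 5.176 2.986
```

    Both rounded values agree with the CJ-DE entries for this row in Table 6b.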


  • Table 6b

    Subgroup Standardized Mean Differences and Variance Ratios for the Six-item Data Set, Setup 1

                   Standardized mean difference  Variance ratio
    Scale  Group   TRUE      ETS-DE    CJ-DE     TRUE      ETS-DE    CJ-DE
    1      Female  0.000     –0.099    5.176     1.000     1.063     2.986
    1      Male    0.000     1.693     –4.284    1.000     1.069     3.154
    2      Female  0.000     0.140     0.227     1.000     0.954     0.962
    2      Male    0.000     –1.178    –0.646    1.000     0.922     1.032
    3      Female  0.000     –0.419    –0.552    1.000     1.108     1.093
    3      Male    0.000     –0.425    –0.508    1.000     1.149     1.218

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    The differences from the expected values are as hypothesized; CJ-DE shows large

    differences for Scale 1, for which the marginal normality assumption does not hold. The absolute

    standardized mean differences are 5.176 for the female subgroup and 4.284 for the male

    subgroup. The variance ratios indicate that CJ-DE overestimates the subgroup variances by a

    factor of ~3 for Scale 1.

    Table 7a gives the mean and standard deviation for the 12-item data set in Setup 1 while

    Table 7b gives the standardized mean differences and variance ratio. Table 7b enables a direct

    comparison against the values 0 (zero) for the expected mean differences and 1 (one) for the

    expected variance ratio if the models behind the approaches fit the data.


  • Table 7a

    Subgroup Mean and Standard Deviation for GENDER for the 12-item Data Set, Setup 1

                   Mean                                   Standard deviation
    Scale  Group   TRUE     ETS-DE         CJ-DE          TRUE    ETS-DE  CJ-DE
    1      ALL     0.004    .016(.035)     -/-            1.008   0.995   -/-
    1      Female  –0.854   –.832(.029)    –.707(.072)    0.529   0.515   0.858
    1      Male    0.862    .865(.031)     .704(.063)     0.527   0.524   0.746
    2      ALL     0.000    .008(.039)     -/-            1.002   0.991   -/-
    2      Female  –0.234   –.252(.049)    –.255(.089)    0.978   0.959   1.003
    2      Male    0.234    .269(.050)     .248(.124)     0.971   0.952   1.003
    3      ALL     –0.004   –.003(.036)    -/-            0.996   0.993   -/-
    3      Female  0.031    .034(.047)     .045(.050)     0.989   0.967   0.988
    3      Male    –0.041   –.042(.052)    –.040(.047)    1.001   0.981   0.988

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    Table 7b

    Subgroup Standardized Mean Differences and Variance Ratios for the 12-item Data Set, Setup 1

                   Standardized mean difference  Variance ratio
    Scale  Group   TRUE      ETS-DE    CJ-DE     TRUE      ETS-DE    CJ-DE
    1      Female  0.000     0.760     2.042     1.000     0.948     2.631
    1      Male    0.000     0.097     –2.508    1.000     0.989     2.004
    2      Female  0.000     –0.367    –0.236    1.000     0.962     1.052
    2      Male    0.000     0.699     0.113     1.000     0.961     1.067
    3      Female  0.000     0.063     0.280     1.000     0.956     0.998
    3      Male    0.000     –0.020    0.021     1.000     0.960     0.974

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.


  • In the 12-item case, the current CJ-DE implementation does not converge with the default

    settings for Scale 1 but needs to be put into the “slog through” mode, and the number of

    iterations needs to be increased from 50 to 500. ETS-DE reproduces the subgroup means and

    standard deviations accurately also for the 12-item data set. As in the six-item case, the

    differences between subgroup standard deviations are not reproduced by CJ-DE.

    The second reporting variable with a strong impact on one of the latent trait components

    is GROUP, a variable with the categories 1 to 5. Table 8a shows the standardized mean differences

    for the six-item data from Setup 1. As in the above analysis with the grouping variable

    GENDER, the algorithm for CJ-DE needs to be put in the “slog through” mode in AM to

    converge in this example. The marginal normality assumption does not hold for Scales 1 and 2,

    the first and second component of the three dimensional latent trait in the example data. It can be

    expected that CJ-DE using the marginal normality assumption will not match the true subgroup

    means and variances as closely as ETS-DE does in the analysis of the GROUP reporting

    variable.

    The subgroup mean differences for CJ-DE in Scale 2 indicate two subgroups for which

    CJ-DE estimates deviate significantly from the true values. For Group 1, the absolute mean

    difference between CJ-DE and the true value is 4.45, and for Group 5, the absolute mean

    difference is 5.62. In contrast, the subgroup mean differences for ETS-DE and the true values are

    all in the expected range. Table 8a also shows that CJ-DE overestimates the subgroup variances

    for all subgroups on Scale 2 by a factor of between 1.77 and 2.54.


  • Table 8a

    Subgroup Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 1

                      Standardized mean differences Variance ratio
    Scale  Subgroups  TRUE      ETS-DE    CJ-DE     TRUE      ETS-DE    CJ-DE
    1      Group 1    0.0000    0.1928    0.7942    1.0000    1.0659    1.2226
    1      Group 2    0.0000    0.6372    0.0748    1.0000    0.9503    1.2346
    1      Group 3    0.0000    0.5593    –0.0784   1.0000    1.0092    1.3660
    1      Group 4    0.0000    0.8930    0.4622    1.0000    1.0661    1.4910
    1      Group 5    0.0000    0.6808    0.0856    1.0000    1.0927    1.2583
    2      Group 1    0.0000    0.4218    4.4504    1.0000    1.0483    2.1302
    2      Group 2    0.0000    0.1192    –1.0500   1.0000    0.8695    2.0350
    2      Group 3    0.0000    0.0802    –1.0467   1.0000    0.9320    2.1604
    2      Group 4    0.0000    –0.5602   –1.6938   1.0000    1.1380    2.5416
    2      Group 5    0.0000    –1.7368   –5.6242   1.0000    0.9781    1.7734
    3      Group 1    0.0000    0.5929    0.4584    1.0000    1.0153    0.9944
    3      Group 2    0.0000    –0.7004   –0.3818   1.0000    1.0841    1.1513
    3      Group 3    0.0000    –0.4016   –0.0861   1.0000    1.0557    1.0988
    3      Group 4    0.0000    –1.5818   –0.7226   1.0000    1.0774    1.1999
    3      Group 5    0.0000    –1.0241   –0.6721   1.0000    1.1177    1.1141

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    Simulation Results: Setup 2

    Truly multimodal distributions are rarely found in real data, even though results from

    large-scale assessments show variables that account for large differences in average achievement

    between subgroups. Setup 2 was designed to be a less extreme version of the same model used

    for Setup 1 and was made more realistic by allowing larger between-scale correlations as they

    can be found in many large-scale assessment programs. Analyses like the ones presented for

    Setup 1 were carried out with the six-item data set in Setup 2 in order to obtain additional results

    from this less extreme case. Table 8b shows the comparison of CJ-DE MML regression and

    ETS-DE regression estimates with the regression coefficients based on the true θ values.


  • Table 8b

    Regression Coefficients for the Six-item Simulated Data Set, Setup 2

    Scale 1 Scale 2 Scale 3

    Effect TRUE ETS-DE MML TRUE ETS-DE MML TRUE ETS-DE MML

    INTER -3.186 -3.162 -3.100 -2.953 -2.996 -2.973 0.123 0.138 0.132

    CLUSTER -0.033 -0.021 -0.018 0.000 -0.013 -0.014 -0.040 -0.080 -0.080

    STRATA -0.016 -0.020 -0.020 -0.005 -0.002 -0.002 -0.010 0.000 0.000

    GENDER 1.203 1.199 1.178 0.495 0.484 0.482 -0.014 0.015 0.018

    GROUP 0.285 0.267 0.258 0.537 0.558 0.556 0.053 0.094 0.096

    SES 0.394 0.382 0.373 0.313 0.330 0.327 0.000 -0.023 -0.024

    Both ETS-DE and MML regression reproduce the regression weights based on the true values

    closely for this data set. This indicates that AM’s MML regression and ETS-DE agree on the underlying

    relationship between the reporting variables and the latent trait variables, so that the basis on which CJ-

    DE marginal direct estimation and ETS-DE’s conditioning model are compared is the same.

    Table 9 shows the residual correlations and variances for the true θ residuals and for the

    estimates as obtained by MML regression and ETS-DE.

    Table 9

    Residual Correlations With Variances in the Diagonal, Six-Item Data Set, Setup 2

    Scale   TRUE                       ETS-DE                     MML
            1        2        3        1        2        3        1        2        3
    1       0.410    –0.077   0.200    0.437    –0.102   0.330    0.412    –0.104   0.332
    2                0.297    –0.228            0.349    –0.109            0.345    –0.106
    3                         0.996                      1.036                      1.069

    The two approaches reproduce the residual covariance matrix in a very similar way. The

    differences between ETS-DE and CJ-DE are even smaller than the small differences of the two

    approaches to the true values. The results on the regression part of ETS-DE and the MML

    regression module of AM give no indication that the basic relationships between the three latent


  • traits and the subgroup variables are represented differently by the two approaches to direct

    estimation.

    For the reporting variable GENDER in Setup 2, Table 10 shows the respective

    standardized mean differences and variance ratios.

    Table 10

    Standardized Mean Differences and Variance Ratios for GENDER for the Six-item Data Set, Setup 2

                   Standardized mean difference  Variance ratio
    Scale  Group   TRUE      ETS-DE    CJ-DE     TRUE      ETS-DE    CJ-DE
    1      Female  0.000     0.544     1.088     1.000     1.082     1.218
    1      Male    0.000     –0.450    –1.517    1.000     0.969     1.019
    2      Female  0.000     0.069     0.057     1.000     1.098     1.126
    2      Male    0.000     –0.045    –0.277    1.000     1.084     1.117
    3      Female  0.000     –0.641    –0.281    1.000     1.026     1.044
    3      Male    0.000     –0.164    –0.015    1.000     0.990     1.086

    The results for Setup 2 show smaller, but noticeable differences between the estimates of

    CJ-DE on the one hand, and the true values and ETS-DE on the other hand. The marginal

    distributions in Setup 2 deviate to a lesser extent from the CJ-DE normality assumption, so that the

    subgroup estimates of CJ-DE seem impacted less by a moderate model violation as compared to

    Setup 1. Table 11 shows the results for the reporting variable GROUP, which again could not be

    estimated by CJ-DE using the default options and which is the one with the strongest effect on

    Scale 2.

    As expected, the standardized mean differences between the true values and CJ-DE for

    Scale 2 are larger than the differences between the true values and ETS-DE. In addition, the

    variance ratios for CJ-DE range from about 1.48 to 1.92 for Scale 2, indicating that CJ-DE

    overestimates subgroup variances here.

    The results for both reporting variables GENDER and GROUP are similar with respect to

    where CJ-DE deviates from the TRUE values and the ETS-DE approach: The GENDER effect is

    largest for Scale 1, where CJ-DE deviates most when reporting GENDER subgroup means.


  • Similarly, for Scale 2, where the GROUP reporting variable has a strong effect on Latent Trait 2,

    CJ-DE deviates most when reporting on the GROUP subgroups.

    Table 11

    Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 2

                      Standardized mean difference  Variance ratio
    Scale  Subgroups  TRUE      ETS-DE    CJ-DE     TRUE      ETS-DE    CJ-DE
    1      Group 1    0.0000    0.0349    0.6512    1.0000    0.9608    0.9649
    1      Group 2    0.0000    –0.0490   –0.2745   1.0000    1.1590    1.2968
    1      Group 3    0.0000    0.3333    –0.0897   1.0000    0.8876    0.8820
    1      Group 4    0.0000    0.1477    –0.5000   1.0000    0.9061    0.9119
    1      Group 5    0.0000    –0.2414   –0.3534   1.0000    1.0628    1.1324
    2      Group 1    0.0000    –0.5567   3.2474    1.0000    1.1540    1.9239
    2      Group 2    0.0000    0.0300    0.0300    1.0000    1.0930    1.7859
    2      Group 3    0.0000    0.4177    –1.0127   1.0000    1.1113    1.6488
    2      Group 4    0.0000    0.6000    –1.2143   1.0000    1.2346    1.7322
    2      Group 5    0.0000    –0.3088   –3.4118   1.0000    1.1239    1.4805
    3      Group 1    0.0000    –0.3298   –0.6702   1.0000    0.9496    0.9625
    3      Group 2    0.0000    0.1038    0.6321    1.0000    1.1183    1.1314
    3      Group 3    0.0000    0.1954    0.3448    1.0000    1.0988    1.1816
    3      Group 4    0.0000    –0.7143   –0.5048   1.0000    1.0020    1.0563
    3      Group 5    0.0000    –0.8095   –0.4762   1.0000    0.8931    1.0000

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    Conclusions: Study I

    In the examples presented above, AM’s MML regression module yields similar results to

    what is found when using the regression results of the ETS-DE methodology. Regression

    coefficients, residual correlations, and variances are reproduced in much the same way as ETS-

    DE recovers these parameters. These results cannot be generalized as they are currently based on

    a few simulated data sets only. Nevertheless, all examples presented here indicate that both


  • software programs agree on the basic correlational relationships in the data as given by the AM

    MML regression module and ETS-DE’s regression estimates.

    In contrast to the close agreement of ETS-DE regression and AM’s MML regression, the

    AM module for CJ-DE—the marginal normality direct estimation approach—diverges from the

    ETS-DE results and the true values if the marginal distributions are non-normal. The exemplary

    data sets were constructed and simulated in a way to show where discrepancies can be expected,

    and the results so far match the expectations. Setup 1 was constructed to study how CJ-DE

    performs if marginal distributions are bimodal or multimodal, and CJ-DE did not converge with

    the default settings for the scales that violated the assumptions used in the marginal direct

    estimation approach. Setup 2 represents a “milder” version of model violation for CJ-DE and

    also shows that under this setup, where the multimodality of the marginal is less obvious, the CJ-

    DE estimates differ from the values produced by ETS-DE, the conditional normality direct

    estimation and the true values.

    Assuming that the latent trait is normally distributed across groups may lead to an

    inappropriate model because of strong monotonicity assumptions in the IRT model (note that

    IRT serves as the basis for both ETS-DE and CJ-DE). For the 1PL and 2PL IRT models as well

    as the (generalized) partial credit models, a simple statistic of the observed responses—the

    weighted sum of scores—is sufficient for estimating the latent trait. Even for the 3PL, the

    monotonicity of the success probability P(X=1|θ) in the latent trait θ and in the item parameters

    ensures a relationship between the observed distribution of the raw scores and the unobserved

    (but not arbitrary!) distribution of the latent trait. As an example, if a test is administered to two

    different samples that differ a lot in their ability distributions (e.g. a reading test taken by both a

    group of kindergarten students and a group of third graders), it seems unreasonable to assume a

    joint normal distribution. A model assuming marginal normality would force both distributions

    under one mode and produce biased estimates of differences between these two groups and other

    groups defined by additional reporting variables.
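
    The monotonicity argument can be illustrated with a small sketch of the 3PL response function in its usual logistic form (without the optional scaling constant). The item parameters below are hypothetical and serve only to show that the expected raw score rises with θ, which is what ties the observed score distribution to the latent trait distribution.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL success probability: discrimination a > 0, difficulty b, guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters (a, b, c); not taken from the report.
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.7, 0.15)]

# Expected raw score on a grid of theta values from -4 to 4.
thetas = [x / 10.0 for x in range(-40, 41)]
expected_scores = [sum(p_3pl(t, a, b, c) for a, b, c in items) for t in thetas]

# With a > 0 every item curve rises in theta, so the expected raw score does too.
is_increasing = all(lo < hi for lo, hi in zip(expected_scores, expected_scores[1:]))
print("expected raw score increasing in theta:", is_increasing)
```

    Because of this ordering, two samples with clearly separated raw score distributions cannot both be described by one unimodal latent distribution without distorting at least one of them.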

    The simulated data examples revealed effects of CJ-DE in the presence of non-normal

    marginal distributions: systematic deviations from the true values in the mean and in the variance

    estimates. In contrast, no indication of systematic differences between the true values and the

    ETS-DE approach was found in the examples analyzed here. From the perspective of data

    analysis, the differences in the subgroup mean estimates of CJ-DE are easier to detect, because in


  • extreme cases CJ-DE reports when it fails to converge. Nevertheless, when using AM’s “slog

    through” estimation option and increasing the number of iterations, there may be no indication of

    nonconvergence. The effects of CJ-DE when estimating subgroup variances are more difficult to

    detect, as this can only be accomplished by additional analysis using other, less restrictive

    methods.

    Study II: Comparing Marginal Direct Estimation and Conditional Direct Estimation

    Subgroup Statistics for NAEP and NALS Data

    Study I showed that the marginal direct estimation (CJ-DE) method relies strongly on the

    assumption that the latent trait is marginally normally distributed. The CJ-DE method as

    implemented in the AM software (Cohen, 1998) does not reproduce subgroup means and variances

    appropriately in cases where a significant part of subgroup differences is explained by the

    grouping variable of interest.

    The examples presented here help in studying consequences of this effect of marginal

    direct estimation in large-scale assessment data analysis. Assessments across a number of

    countries, states, regions, or other grouping variables cannot assume a certain form of marginal

    distribution of the trait across the groups (Yamamoto & Mazzeo, 1992). In addition, assuming

    that subgroup variances are homogenous (i.e., that the trait[s] vary to a similar degree within all

    groups) might be too restrictive to fit diverse populations. Data from large-scale assessment

    programs provide a source to study differences between CJ-DE and ETS-DE in a realistic data

    analysis setting. Using real data with operational reporting variables enables one to formulate

    expectations about whether certain variances should be equal or for which subgroups differences

    may be expected. This adds a different perspective to what was examined in Study I, where

    known parameters were compared with CJ-DE and ETS-DE estimates.

    NAEP Math Assessment, Grade 4

    As the first real data example, results were compared for ETS-DE and CJ-DE on data

    from an assessment given to a nationally representative sample of 13,855 students in the fourth

    grade for the National Assessment of Educational Progress (NAEP). The assessment,

    administered in 2000, used a sparse matrix sample design where examinees were given a 45-

    minute test of mathematics items consisting of a mixture of multiple choice and constructed


  • response items. The 173-item pool was divided into 13 blocks of items (separately-timed

    sections). The blocks were assembled into 26 booklets based on a BIB (balanced incomplete

    block) design (Braswell et al., 2001). Each booklet contained three blocks of items, which were

    classified into five content-area scales—numeracy and operations, measurement, geometry, data

    analysis, and algebra. A typical examinee answered from 6 to 12 items per scale. A multiscale

    IRT model estimated with PARSCALE was used to calibrate the IRT item parameters for each

    of the five scales.
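
    The counts given above (13 blocks, 26 booklets, three blocks per booklet) are exactly those of a balanced incomplete block design in which every pair of blocks shares one booklet. The sketch below builds one such design with a textbook cyclic construction; it is an illustration only, and no claim is made that it matches the operational booklet map documented in Braswell et al. (2001). The base booklets and variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13
# Two base booklets whose within-booklet differences cover 1..12 modulo 13.
BASE_BOOKLETS = [(0, 1, 4), (0, 2, 7)]

# Shift each base booklet cyclically through all 13 positions.
booklets = [
    tuple(sorted((block + shift) % N_BLOCKS for block in base))
    for base in BASE_BOOKLETS
    for shift in range(N_BLOCKS)
]

# How often each block and each pair of blocks occurs across booklets.
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(
    pair for booklet in booklets for pair in combinations(booklet, 2)
)

print("booklets:", len(booklets))
print("appearances per block:", sorted(set(block_counts.values())))
print("booklets shared by each pair:", sorted(set(pair_counts.values())))
```

    In this construction each block appears in six booklets and each of the 78 possible block pairs appears together in exactly one booklet, which is the balance property that allows covariances between all blocks to be estimated from a sparse matrix sample.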

    The following exhibits show results based on the ETS-DE methodology using 381

    background variables in addition to item responses in order to obtain subgroup estimates. The

    381 background variables are factor scores based on a principal component analysis that was

    conducted using the variables available from the background questionnaire (see Braswell et al.,

    2001, for details on the NAEP 2000 math assessment and the available background data). The

    operational NAEP 2000 item parameters were used in a five-dimensional run with CGROUP, the

    current software implementation of the multidimensional ETS-DE approach. The ETS-DE

    approach was found to work accurately in recovering subgroup means and variances in Study I

    and serves as a benchmark for CJ-DE, which has been proposed for use for subgroup reporting

    (Cohen & Jiang, 1999). In contrast to CJ-DE, the ETS-DE approach assumes conditional

    normality of the latent traits with a large set of background variables. Given that a large number

    of background variables are used that explain a significant portion of the latent trait variance, this

    approach is capable of modeling complex mixtures of abilities resulting in non-normal

    population and subgroup distributions. To compare the results of ETS-DE and CJ-DE, the

    operational data and NAEP 2000 math item parameters were imported into the software that

    implements CJ-DE.

    School Type

    The first reporting variable used in this comparison is School Type, which has three

    categories in NAEP—Public, Private, and Catholic. The subsequent tables offer a comparison

    between CJ-DE and ETS-DE, the benchmark, on the basis of standardized mean differences and

    variance ratios similar to the exhibits in the previous part of the report. Table 12a shows the

    reference values estimated by ETS-DE in the untransformed latent trait scale, not in the NAEP

    reporting scale. The untransformed latent trait scale is implicitly given by the item parameters as


  • calibrated with the PARSCALE software. PARSCALE defaults to the marginal latent trait

    moments M(θ)=0 and a standard deviation S(θ)=1.

    Table 12a

    ETS-DE Estimates of the Means and Standard Deviations in the Latent Trait (Theta) Scale for School Type Subgroups

    Mean Standard deviation

    Public Private Catholic Public Private Catholic

    NUM&OPER –0.047 0.430 0.368 1.021 0.913 0.842

    MEASURMT –0.053 0.480 0.402 1.060 0.928 0.897

    GEOMETRY –0.034 0.299 0.267 1.012 0.913 0.821

    DATA ANL –0.045 0.327 0.425 1.103 0.969 0.880

    ALGEBRA –0.047 0.429 0.358 1.081 0.969 0.886

    The Private and Catholic school categories have a mean that is about 0.35 to 0.52

    standard deviations higher than the one for Public schools, whereas the respective standard

    deviations for these subgroups are slightly lower than the subgroup standard deviation for the Public

    school category across all five scales of the NAEP math assessment. Table 12b gives the

    corresponding standardized mean differences and variance ratios. The table shows these values

    for the School Type subgroups, where the differences are formed by “CJ-DE minus ETS-DE”

    and the ratios are “CJ-DE divided by ETS-DE.”


  • Table 12b

    Standardized Mean Differences and Variance Ratios for School Type Subgroups

    Standardized mean difference Variance ratio

    Public Private Catholic Public Private Catholic

    NUM&OPER 0.047 –0.245 –0.737 0.931 1.129 1.338

    MEASURMT 0.039 0.137 –0.293 0.911 1.143 1.235

    GEOMETRY 0.074 0.658 –1.390 0.902 1.087 1.359

    DATA ANL 0.071 –0.420 –0.330 0.820 1.044 1.253

    ALGEBRA 0.149 –1.226 –0.439 0.832 1.010 1.211

    Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

    ETS-DE and CJ-DE provide quite similar subgroup mean estimates for most of the five

    scales in the three subgroups, but there are differences in the subgroup standard deviations

    reported by the two methods. The ETS-DE method reports that the Catholic school subgroup has

    a smaller standard deviation as compared to the Public school types on all five scales3, whereas

    the CJ-DE method reports more similar standard deviations across the three

    subgroups. In Study I, using simulated data examples, it was found that CJ-DE does not recover

    differences in subgroup standard deviations correctly. The ETS-DE method, however, was found

    to recover this type of subgroup heteroscedasticity in the simulated examples, and ETS-DE

    reflects differences between subgroup variances in the NAEP example reported here.

    Race/Ethnicity

    The next variable analyzed is Race/Ethnicity, which has four categories—WHI/AI/O

    (White, American Indian, Other), AFRAM (African American), HISPANIC (Hispanic

    American), and ASIAM (Asian American)—in the NAEP 2000 data. Table 13 below shows the

    subgroup mean differences between CJ-DE and ETS-DE and the corresponding variance ratios

    for this reporting variable.


  • Table 13

    Race/Ethnicity Subgroup Reports Generated Based on the NAEP 2000 Grade 4 Math Data 1

    Standardized mean difference Variance ratio

    WHI/AI/O AFRAM HISPANIC ASIAM WHI/AI/O AFRAM HISPANIC ASIAM

    NUM&OPER -0.259 0.482 0.348 -0.285 0.996 0.952 0.849 0.729

    MEASURMT -0.109 -0.172 0.459 0.091 0.955 0.935 0.830 0.738

    GEOMETRY -0.282 0.012 0.658 0.222 0.976 0.901 0.784 0.715

    DATA ANL -0.094 0.743 -0.254 -0.715 0.889 0.778 0.691 0.730

    ALGEBRA -0.305 0.490 0.477 -0.388 0.878 0.815 0.728 0.717

    Note. Large differences from the expected values given the more general model are printed in boldface.

    The subgroup mean differences indicate that the estimates of the two methods do not

    differ significantly from each other. CJ-DE reproduces the ETS-DE mean estimates satisfactorily

    for the race subgroup variable.

    The standard deviation estimates given by CJ-DE differ from what is reported by the

    ETS-DE method for the subgroups, AFRAM, HISPANIC, and ASIAM. The standard deviation

    estimates provided by CJ-DE are about 0.7 times the size of the respective ETS-DE estimate. In

    contrast to that, CJ-DE yields a standard deviation more similar to ETS-DE for the WHI/AI/O

    subgroup.

    Individualized Education Plan

    Table 14 shows the subgroup mean differences of CJ-DE estimates against the ETS-DE

    analysis and the corresponding variance ratios for the dichotomous grouping variable IEP

    (Individualized Education Plan). There is a large mean difference between the two subgroups

    IEP and non-IEP. The IEP group means are approximately 0.9 standard deviations smaller than

    the non-IEP group estimates across all five scales (see Appendix C, where the ETS-DE estimates

    for the reporting variable IEP are given).

    Based on the findings of Study I, it can be expected that CJ-DE mean estimates will not

    reflect the large difference between the IEP and the non-IEP subgroups. The standardized mean

    differences and variance ratios for the IEP reporting variable are given in Table 14.


  • Table 14

    IEP Subgroup Reports Based on the NAEP 2000 Grade 4 Math Data

    Standardized mean difference Variance ratio

    IEP Non-IEP IEP Non-IEP

    NUM&OPER 5.207 –0.556 0.841 1.025

    MEASURMT 4.145 –1.206 0.835 0.996

    GEOMETRY 4.832 0.042 0.830 1.002

    DATA ANL 4.783 0.275 0.685 0.921

    ALGEBRA 5.639 –0.998 0.621 0.961

    Note. Large differences from the expected values given the more general model are printed in boldface.

    The CJ-DE estimates show large differences from the IEP group means as provided by ETS-DE.

    CJ-DE reports consistently smaller mean differences between IEP and non-IEP subgroups,

    so that the corresponding mean difference between CJ-DE and ETS-DE is a large positive

    number. The same was found in Study I (see above) using simulated data when the absolute

    mean differences between subgroups are large. These results support the conjecture that CJ-DE

    direct estimation of subgroup mean differences deviates from more general models in the

    presence of large between-group differences. Compared to ETS-DE, CJ-DE slightly

    underestimates the IEP subgroup variances for the subscale categories—NUM&OPER,

    MEASURMT, and GEOMETRY. For the subgroup variances of ALGEBRA and DATA ANL,

    the CJ-DE estimates are only about 0.7 times the size of the corresponding ETS-DE estimates.

    National Adult Literacy Survey

    The second real data set used in this comparison is taken from the National Adult Literacy

    Survey (NALS) administered in 1992. This data set consists of 21,363 subjects and contains a

    sparse matrix sample of 713 items from three content domains of literacy—quantitative, prose,

    and document. NALS

    …measured literacy along three dimensions, prose literacy, document literacy, and

    quantitative literacy, designed to capture an ordered set of information-processing skills

    and strategies that adults use to accomplish a diverse range of literacy tasks. The literacy


  • scales make it possible to profile the various types and levels of literacy among different

    subgroups in our society (“Defining and measuring literacy,” n.d.).

    The exemplary comparisons presented here utilize the NALS main assessment data file

    and the operational item parameters, which were used with the CGROUP program, which is the

    current implementation of the ETS-DE approach. The same data and item parameters were

    imported into the implementation of the CJ-DE approach, the AM software.

    Similar to the preceding analyses, a number of policy-relevant grouping variables from

    the NALS data file were chosen to compare the subgroup distribution estimates as given by the

    ETS-DE and the CJ-DE approach. Table 15 shows variance ratios and standardized mean

    differences between the estimates of ETS-DE and CJ-DE for the grouping variable REGION

    with four subgroups.

    Table 15

    Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for REGION as Defined in the NALS 1992 Data

                Standardized mean difference      Variance ratio
    REGION      Prose  Document  Quantitative     Prose  Document  Quantitative
    MIDWEST    –0.560    –1.189        –0.418     1.012     0.981         0.964
    N-EAST     –0.154     0.193        –0.277     0.828     0.809         0.831
    SOUTH      –0.112    –0.275        –0.298     0.770     0.743         0.748
    WEST        0.771     1.029         0.892     0.717     0.708         0.740

    Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 are printed in boldface.

    The results indicate that all four subgroup mean estimates given by ETS-DE and CJ-DE

    agree relatively well. In contrast to the agreement between ETS-DE and CJ-DE for the means of

    the region subgroups, the variance estimates for the regions SOUTH and WEST given by CJ-DE

    are only about 0.75 times as large as the variance estimates given by ETS-DE.

    The next NALS reporting variable used in the comparison is BORN IN, which has five

    categories: USA, SPAN (Spanish-speaking world), EUROP, ASIA, and OTHER. Table 16

    shows the standardized mean differences and variance ratios of the CJ-DE estimates relative to the ETS-DE

    estimates for this reporting variable.

    Table 16

    Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for the Grouping Variable BORN IN as Defined in the NALS 1992 Data

                Standardized mean differences     Variance ratio
    BORN IN     Prose  Document  Quantitative     Prose  Document  Quantitative
    USA        –2.687    –2.899        –2.475     0.923     0.890         0.884
    SPAN        8.892     7.552         7.142     0.413     0.409         0.455
    EUROP       0.616     0.980         0.415     0.559     0.592         0.655
    ASIA        2.093     1.930         1.789     0.539     0.505         0.553
    OTHER       1.404     1.293         1.182     0.588     0.564         0.627

    Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.

    There are discrepancies between the subgroup mean estimates of CJ-DE and ETS-DE for

    the USA and SPAN subgroups. The CJ-DE estimates for USA are about 2.5 to 2.9 standard units

    lower than the ETS-DE estimates for the three literacy scales. The standardized differences

    between the CJ-DE mean estimates and the ETS-DE estimates for SPAN lie between 7.1 and 8.9

    across the three scales, indicating that the CJ-DE estimates differ significantly from the ETS-DE estimates.

    The variance ratios for four subgroups (SPAN, EUROP, ASIA, and OTHER) lie between 0.41

    and 0.66 across all three scales of the NALS data, indicating that the CJ-DE variance estimates are

    systematically smaller than the ETS-DE estimates in this case.

    The final comparison of CJ-DE and ETS-DE using the NALS data is based on

    the reporting variable “Years living in the USA.” This reporting variable has nine categories,

    ranging from “1–5 years in the USA” to “Ever live in the USA,” in 5- to 10-year intervals (see

    below). Table 17 shows the standardized mean differences between CJ-DE and ETS-DE and the

    variance ratios for the three literacy scales across the nine subgroups.

    Table 17

    Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for “Years Living in the USA” as Defined in NALS 1992 Data

                Standardized mean difference      Variance ratio
    Yrs in USA  Prose  Document  Quantitative     Prose  Document  Quantitative
    1–5        12.060    11.418         8.791     0.432     0.410         0.432
    6–10        2.762     2.725         2.584     0.545     0.527         0.572
    11+         4.121     4.581         3.361     0.427     0.434         0.471
    16+         2.863     2.333         2.740     0.440     0.436         0.486
    21+         1.461     2.035         2.208     0.520     0.518         0.565
    31+         0.777     1.151         0.913     0.558     0.602         0.609
    41+         0.187     0.370         0.104     0.467     0.473         0.499
    51+        –0.441    –0.035        –0.322     0.939     0.943         0.979
    Ever       –3.182    –3.471        –2.770     0.937     0.902         0.896

    Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.

    The subgroup mean estimates of CJ-DE are between 2.3 and 12 standardized units larger

    than the corresponding estimates given by ETS-DE for the subgroups “1–5 years in the USA,”

    “6–10,” “11+,” and “16+.” The mean estimate for subgroup “Ever live in the USA” is between

    2.8 and 3.5 standard units smaller for CJ-DE than for ETS-DE.

    The variance estimates from CJ-DE for the first seven subgroups, from “1–5” through

    “41+,” are systematically smaller than those ETS-DE reports. The variance ratio lies

    between 0.41 and 0.61 in these subgroups across all three scales. In contrast, the CJ-DE subgroup

    variance estimates for “Ever” and “51+” are close to what ETS-DE yields, as the variance ratio is

    close to 1. Note that the subgroups of US residents with a comparatively small number of years

    residing in the United States are the subgroups with a comparatively large difference from the total

    mean (see Appendix D). For these subgroups, CJ-DE yields estimates that deviate more from

    what is given by the more general ETS-DE approach, whereas subgroups closer to the total mean

    (“Ever” and “51+”) receive estimates that agree more closely with the ETS-DE approach.

    Conclusions: Study II

    The results reported in Study II show similarities with the results obtained in Study I,

    which used simulated data. In the case of simulated data, CJ-DE differs from the values obtained

    by ETS-DE and the true values obtained from analyzing the simulated proficiency values used

    for generating the response data. The assumption of marginal normality leads to discrepancies

    between CJ-DE and the true values in the presence of large subgroup mean differences and in

    cases where the subgroup variances are heteroscedastic. Recall Cohen and Jiang's (1999) direct

    estimation model, where the conditional density of $\theta$ given subgroup membership $g=k$ is derived

    based on the marginal normality assumption. This density depends on the marginal parameters

    $\mu$ and $\sigma$ and subgroup parameters $(a_1, b_1, \ldots, a_G, b_G)$. Essentially, the marginal normal density

    $\phi((\theta-\mu)/\sigma)$ acts as a prior for the conditional density

    $$f(\theta \mid g=k) = \frac{f(g=k \mid \theta)\,\sigma^{-1}\phi\big((\theta-\mu)/\sigma\big)}{\int f(g=k \mid \theta)\,\sigma^{-1}\phi\big((\theta-\mu)/\sigma\big)\,d\theta} \qquad (14)$$

    which prevents the conditional densities from fitting larger subgroup mean differences. This

    may explain why the CJ-DE standard deviation estimates are less variable across

    subgroups and why the restriction of the standard deviation is correlated with the distance of the

    corresponding subgroup mean from the total mean. A thorough analysis of the marginal direct

    estimation model (Cohen & Jiang, 1999) should reveal that this restriction of the parameter space

    is caused by the assumption of marginal normality. This assumption forces the mixture of

    subgroup distributions to fit under the unimodal normal distribution.
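
    The constraint described above can be checked numerically. The following is a minimal sketch, not the AM implementation: it assumes a multinomial-logit form for the membership probability f(g=k | θ) with hypothetical parameters a and b, and a standard normal marginal. Neither the functional form nor the parameter values are taken from Cohen and Jiang (1999). It evaluates the conditional densities on a grid and confirms that their weighted sum reproduces the marginal normal density, whatever values a and b take.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), i.e. sigma^-1 * phi((x - mu) / sigma)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

def membership_probs(theta, a, b):
    """Assumed (hypothetical) multinomial-logit form for f(g=k | theta)."""
    logits = a[:, None] + b[:, None] * theta[None, :]
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)

theta = np.linspace(-8.0, 8.0, 4001)   # integration grid
dtheta = theta[1] - theta[0]
a = np.array([0.0, -1.0, -2.0])        # hypothetical subgroup intercepts
b = np.array([0.0, 1.5, -2.0])         # hypothetical subgroup slopes

prior = normal_pdf(theta, mu=0.0, sigma=1.0)
joint = membership_probs(theta, a, b) * prior[None, :]  # numerator of the conditional density
share = joint.sum(axis=1) * dtheta                      # denominator: P(g=k)
cond = joint / share[:, None]                           # f(theta | g=k) on the grid

# The mixture of conditional densities must equal the marginal normal.
mixture = (share[:, None] * cond).sum(axis=0)
subgroup_means = (cond * theta[None, :]).sum(axis=1) * dtheta

print("subgroup shares:", np.round(share, 3))
print("subgroup means: ", np.round(subgroup_means, 3))
print("max |mixture - normal|:", np.abs(mixture - prior).max())
```

    In this sketch the subgroup densities can move apart only as far as the single normal envelope allows, which is the restriction the text attributes to the marginal normality assumption.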

    The conclusion in Study II, which uses real data from NAEP and NALS, corresponds

    closely to the findings of Study I, which compares CJ-DE and ETS-DE based on simulated data

    examples, even though in real data applications, the true values usually are unknown. In the

    presence of large subgroup mean differences, CJ-DE yields less extreme subgroup estimates than

    ETS-DE, which also was found in the comparison in Study I of both methods to the true values.

    Additionally, the variance estimates given by CJ-DE tend to be more similar across subgroups

    than the ETS-DE estimates, and more similar than the true values in Study I. The

    CJ-DE variance estimates seem to be increasingly restricted as the distance of the

    subgroup mean from the total mean increases.
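
    One heuristic for this pattern is the law of total variance. The identity below is a general fact about mixtures, not part of the CJ-DE derivation. The symbols $\pi_k$, $\mu_k$, and $\sigma_k^2$ (subgroup proportions, means, and variances) are notation assumed here for illustration:

    $$\sigma^2 = \sum_{k=1}^{G} \pi_k \sigma_k^2 + \sum_{k=1}^{G} \pi_k (\mu_k - \mu)^2, \qquad \mu = \sum_{k=1}^{G} \pi_k \mu_k$$

    If the subgroup distributions must add up to a single normal distribution with variance $\sigma^2$, then a larger between-group term leaves less room for the average within-group variance. Subgroup means and subgroup variances can therefore not vary independently under this restriction.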

    As noted in the introduction, CJ-DE uses a number of assumptions to derive a conditional

    subgroup density while maintaining the restriction of normality of the marginal density. This

    normal marginal assumption of the latent trait is believed to reflect common practice in large-

    scale assessment applications of IRT (see Cohen & Jiang, 1999). However, NAEP and other

    large-scale assessments do not rely on this assumption. Appendix E gives an example of how to

    avoid the assumptions of CJ-DE when using AM in order to estimate a less restrictive model

    with this software. The results of using simulated data and the results of using real data both

    show that these assumptions used in CJ-DE lead to discrepancies when analyzing complex

    samples where the assumptions are not met by the data. The operational ETS-DE approach

    places the normality assumption not on the marginal distribution but on the conditional distribution

    of the latent trait given the item responses and a large number of background variables. The

    conditioning approach utilized by ETS-DE is therefore more general and enables it to fit non-

    normal distributions, as the conditional means given the background model are not assumed to

    follow a specific distribution. In light of the systematic differences seen in both Study I and Study II,

    using methods such as CJ-DE, which rely on item responses only and replace valuable

    background information with a number of assumptions, does not seem defensible for the analysis

    of large-scale assessment data. This also holds for trend studies, where the assessment of change

    relies even more on maximizing the comparability of results and the accuracy of the mean and

    variance estimates obtained across time points and subgroups.

    References

    Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B., & Johnson, M. (2001).

    The nation’s report card: Mathematics 2000. Washington, DC: National Center for

    Education Statistics.

    Cohen, J. D. (1998). AM online help content—Preview. Washington, DC: American Institutes for

    Research.

    Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across normal

    populations. Journal of the American Statistical Association, 94(448), 1035-1044.

    Defining and measuring literacy. (n.d.). In National assessments of adult literacy. Retrieved

    December 6, 2002, from http://nces.ed.gov/naal/defining/defining.asp

    Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:

    Addison-Wesley.

    Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex

    samples. Psychometrika, 56(2), 177-196.

    Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population

    characteristics from sparse matrix samples of item responses. Journal of Educational

    Measurement, 29(2), 133-161.

    Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago:

    University of Chicago Press.

    Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with factored

    likelihood functions. Journal of Computational and Graphical Statistics, 2, 309-322.

    Thomas, N. (2002). The role of secondary covariates when estimating latent trait population

    distributions. Psychometrika, 67(1), 33-48.

    Yamamoto, K., & Mazzeo, J. (1992). Item response theory scale linkage in NAEP. Journal of

    Educational Statistics, 17(2), 155-173.

    Notes

    1 The item parameters in the k-scale IRT model are assumed to be known constants.

    2 The overall mean and standard deviation reported here are estimates by ETS-DE and the TRUE data; CJ-DE does not provide overall means and standard deviations.

    3 This indicates that the Catholic school category is more homogeneous as compared to the two other categories. The Public school category consistently has the largest standard deviations across all five scales.

    Appendix A

    Item Parameters of the Simulated Three-scale Six-item Data Set

    Scale Slope Difficulty

    Scale 1

    [1,] 1.0707435 –0.423607249

    [2,] 1.1946191 0.369087609

    [3,] 1.1356097 –0.008368651

    [4,] 1.1029780 –0.434542858

    [5,] 0.6926124 –0.320136837

    [6,] 0.8034373 0.817567985

    Scale 2

    [7,] 0.9617609 0.003169065

    [8,] 1.1004634 1.327405006

    [9,] 0.9115646 0.451618136

    [10,] 1.0574126 –2.053570652

    [11,] 1.1098851 0.006184470

    [12,] 0.8589135 0.265193973

    Scale 3

    [13,] 1.2621460 1.339141978

    [14,] 0.8917393 –0.220816527

    [15,] 0.9161605 0.758596816

    [16,] 0.9253288 –0.066838528

    [17,] 0.7505099 –0.099260362

    [18,] 1.2541155 –1.710823377

    Note. The guessing parameter was 0.1 for all items.

    Appendix B

    Item Parameters of the Simulated 3-scale 12-item Data Set

    Scale Slope Difficulty

    Scale 1

    [1,] 1.0301048 –0.40334405

    [2,] 1.0807597 –0.12162779

    [3,] 1.0250148 –0.29599706

    [4,] 0.8097633 0.13585131

    [5,] 1.0834746 –0.10137978

    [6,] 1.0881449 1.18682432

    [7,] 0.8241556 0.58488677

    [8,] 1.0754401 0.95989977

    [9,] 0.8284506 –1.57049425

    [10,] 1.0272207 –0.24556290

    [11,] 1.1410092 –0.53788298

    [12,] 0.9864615 0.40882663

    Scale 2

    [13,] 1.0131312 0.56203695

    [14,] 1.0604981 0.63205024

    [15,] 1.2831725 –0.41368560

    [16,] 1.1636971 –0.90477486

    [17,] 0.9043142 0.01714852

    [18,] 0.9837799 –0.84975192

    [19,] 1.0296239 0.63169027

    [20,] 1.2039188 0.04996556

    [21,] 0.6799550 0.77051519

    [22,] 0.9778539 –0.91851904

    [23,] 1.0707815 0.19213650

    [24,] 0.6292741 0.23118819

    Scale 3

    [25,] 1.1981095 0.24101286

    [26,] 1.0874208 –0.18829633

    [27,] 0.9684248 –0.58984308

    [28,] 0.8853709 –0.95740524

    [29,] 0.9017118 –0.19778461

    [30,] 1.0488593 –1.42372395

    [31,] 1.0086545 0.17463042

    [32,] 0.8052735 1.48726305

    [33,] 1.2051341 1.30940643

    [34,] 1.0667933 0.28232721

    [35,] 0.8616304 –0.32302987

    [36,] 0.9626173 0.18544311

    Note. The guessing parameter was 0.1 for all items.

    Appendix C

    ETS-DE Estimates for IEP Subgroup Means and Standard Deviations

                       Mean         Standard deviation
    Subscale       IEP  Non-IEP       IEP  Non-IEP
    NUM&OPER    –0.910    0.091     1.050    0.966
    MEASURMT    –0.841    0.054     1.100    1.016
    GEOMETRY    –0.844    0.076     1.043    0.958
    DATA ANL    –0.852    0.103     1.199    1.043
    ALGEBRA     –0.927    0.099     1.244    1.009

    Appendix D

    Means and Standard Deviations for ETS-DE Estimates for

    “Years Living in the USA” as Defined in NALS 1992 Data

                Mean                              Standard deviation
    Yrs in USA  Prose  Document  Quantitative     Prose  Document  Quantitative
    1–5        –1.287    –1.154        –1.043     1.549     1.584         1.548
    6–10       –1.181    –1.029        –0.987     1.325     1.364         1.305
    11+        –1.228    –1.094        –1.026     1.506     1.513         1.448
    16+        –0.900    –0.874        –0.777     1.507     1.514         1.447
    21+        –0.714    –0.691        –0.565     1.389     1.396         1.346
    31+        –0.463    –0.513        –0.351     1.349     1.298         1.298
    41+        –0.616    –0.749        –0.585     1.459     1.448         1.417
    51+         0.068    –0.139         0.062     1.033     1.032         1.018
    Ever        0.102     0.094         0.084     1.065     1.076         1.085

    Note. The subgroup means are reported as differences from the total mean.
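
    Because the variance ratios in Table 17 are defined as CJ-DE divided by ETS-DE, they can be combined with the ETS-DE standard deviations above to back out approximate CJ-DE standard deviations. The sketch below does this for the Prose scale only. The implied values are derived here from rounded table entries and are not reported in the source, so they should be read as rough approximations.

```python
import math

# Prose scale: (variance ratio from Table 17, ETS-DE SD from Appendix D).
# Values are copied from the rounded table entries.
prose = {
    "1-5":  (0.432, 1.549),
    "6-10": (0.545, 1.325),
    "11+":  (0.427, 1.506),
    "16+":  (0.440, 1.507),
    "21+":  (0.520, 1.389),
    "31+":  (0.558, 1.349),
    "41+":  (0.467, 1.459),
    "51+":  (0.939, 1.033),
    "Ever": (0.937, 1.065),
}

# variance ratio = var_CJ / var_ETS, so sd_CJ = sqrt(ratio) * sd_ETS
implied_cj_sd = {g: math.sqrt(ratio) * sd for g, (ratio, sd) in prose.items()}

ets_sds = [sd for _, sd in prose.values()]
ets_range = max(ets_sds) - min(ets_sds)
cj_range = max(implied_cj_sd.values()) - min(implied_cj_sd.values())

for g, sd in implied_cj_sd.items():
    print(f"{g:>5}: ETS-DE SD = {prose[g][1]:.3f}, implied CJ-DE SD = {sd:.3f}")
print(f"range of ETS-DE SDs:        {ets_range:.3f}")
print(f"range of implied CJ-DE SDs: {cj_range:.3f}")
```

    With these inputs, the implied CJ-DE standard deviations all lie roughly between 0.98 and 1.03, whereas the ETS-DE standard deviations range from about 1.03 to 1.55. This agrees with the observation that the CJ-DE variance estimates are more similar across subgroups.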

    Appendix E

    Using AIR’s AM Software for Secondary Analyses

    Studies I and II have shown that AM’s procedure for CJ-DE, a direct estimation approach

    relying on a marginal normality assumption, does not seem suitable for data where the normality

    of the latent trait across subgroups cannot be warranted. The CJ-DE approach has been

    developed “to consistently estimate subpopulation distributions when the groups are defined by

    values of a [nominal or ordinal variable]” (Cohen & Jiang, 1999). The two procedures

    implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table” (NT) in the

    AM software, depending on the grouping variable’s scale level. In contrast to the findings

    concerning CJ-DE, AM’s MML regression procedure reproduced the results of analyzing the

    true values—which served as the basis for the simulated data—quite well, in much the same way

    the ETS-DE approach does. In the simulated data examples with known true regression

    coefficients, ETS’s method and AM’s MML regression agreed closely when estimating

    regression parameters for the full conditioning model.

    AM’s MML regression module cannot be used “as is” for reporting purposes, because

    additional steps are necessary in order to produce subgroup statistics based on the regression

    results. The goal of this appendix is to explore ways to use MML regression and other modules

    of AM and to provide a guideline on how to put together analysis steps that can be used to get

    results with the AM software that resemble more closely the true values and the ETS-DE

    conditioning model estimates.

    AM was used in examples presented below in a multistep procedure for producing

    subgroup statistics without using AM’s CJ-DE modules. This step-by-step procedure lacks the

    convenience of the operational ETS-DE approach in that it requires manual concatenation of

    separate intermediate results produced by AM’s procedures. Therefore, the goal of the study

    presented here is not to provide an alternative to ETS-DE, but to test whether AM can be used

    for secondary analyses.

    The approaches taken by the ETS-DE conditioning model on the one hand and AM’s

    CJ-DE direct estimation module on the other hand differ strongly with

    respect to the information incorporated in estimating subgroup characteristics. ETS-DE uses

    extensive background (conditioning) information, including grouping variables in addition to the

    observed item responses. CJ-DE, in contrast, only includes one grouping variable at a time

    together with the item responses but draws on a number of strong assumptions regarding the

    shape of the marginal ability distribution and the relation between $\theta$ and the group indicator

    variable.

    Issues in Model Selection

    Assumptions about the population structure are central in the process of building a model

    for complex survey data. The question is which assumptions are appropriate for

    the comparison of multiple subgroups with respect to their means and variances.

    Figure E1. Subgroup distributions with normality assumption on the marginal level.

    In the case depicted in Figure E1, the overall distribution is assumed to be normal, and

    the sum of all subgroup distributions has to accommodate this shape. It follows that the shapes of

    the subgroup distributions are no longer free; they have to fit under the overall normal shape and

    their sum has to be equal to that shape. This assumption is central to CJ-DE and makes it

    inappropriate for more complex real data. A less restrictive assumption is that all subgroups are

    normally distributed and share the same variance but may vary with respect to their means and

    size. This assumption can be modeled by a regression with contrast coded subgroup indicators.

    This can be done in many software packages as well as in ETS-DE and AM. This drops the

    assumption of marginal normality and with it the main feature of CJ-DE as proposed by Cohen

    and Jiang (1999). The effect of relaxing the marginal normality assumption is illustrated in

    Figure E2.

    Figure E2. Subgroup distributions with normality assumption in all subgroup levels.

    This less restrictive assumption obviously allows a larger range of cases to be fitted as

    compared to CJ-DE. This approach can be taken by using AM’s MML means procedure, even

    though that procedure will not yield subgroup variance estimates. If only a few subgroups are

    used, the homoscedasticity assumption within subgroups limits the ability to fit more general

    marginal distributions. A useful extension would be to assume a separate variance for each

    subgroup. In AM, this assumption can be accommodated by using MML regression together

    with filtering the data as many times as there are subgroups. But even this limits the subgroup

    distributions to be normal, which still seems too restrictive if, for example, there is a

    strong indication that some subgroups are composites.
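
    The two modeling options discussed above can be contrasted in a small simulation. This is a minimal sketch under strong simplifications: it works with directly observed values y in place of the latent trait, so it is ordinary least squares rather than AM's MML regression, and the subgroup means, standard deviations, and sizes are hypothetical. It uses dummy (treatment) coding as one simple choice of subgroup indicators. Step 1 fits a regression on subgroup indicators with one pooled residual variance. Step 2 filters the data once per subgroup to obtain a separate variance for each.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical subgroup parameters (not from the report).
true_means = np.array([-1.0, 0.0, 0.5])
true_sds = np.array([1.5, 1.0, 0.8])
sizes = np.array([400, 1000, 600])

group = np.repeat(np.arange(3), sizes)
y = rng.normal(true_means[group], true_sds[group])  # observed stand-in for the trait

# Step 1: regression on dummy-coded indicators (group 0 is the reference),
# which assumes a single residual variance shared by all subgroups.
X = np.column_stack([np.ones(len(y)), group == 1, group == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
pooled_var = resid @ resid / (len(y) - X.shape[1])
reg_means = np.array([beta[0], beta[0] + beta[1], beta[0] + beta[2]])

# Step 2: filter the data once per subgroup to get subgroup-specific variances.
filtered_means = np.array([y[group == k].mean() for k in range(3)])
filtered_vars = np.array([y[group == k].var(ddof=1) for k in range(3)])

print("regression means:  ", np.round(reg_means, 3))
print("filtered means:    ", np.round(filtered_means, 3))
print("pooled variance:   ", round(float(pooled_var), 3))
print("filtered variances:", np.round(filtered_vars, 3))
```

    In this simplified setting both steps give the same subgroup means, but the pooled variance is a size-weighted average that masks the heteroscedasticity the filtered analysis recovers.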

    This is one of the reasons why