The Stata Journal

Volume 13 Number 2 2013

A Stata Press publication
StataCorp LP
College Station, Texas


The Stata Journal

Editors

H. Joseph Newton

Department of Statistics

Texas A&M University

College Station, Texas

[email protected]

Nicholas J. Cox

Department of Geography

Durham University

Durham, UK

[email protected]

Associate Editors

Christopher F. Baum, Boston College

Nathaniel Beck, New York University

Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy

Maarten L. Buis, WZB, Germany

A. Colin Cameron, University of California–Davis

Mario A. Cleves, University of Arkansas for Medical Sciences

William D. Dupont, Vanderbilt University

Philip Ender, University of California–Los Angeles

David Epstein, Columbia University

Allan Gregory, Queen’s University

James Hardin, University of South Carolina

Ben Jann, University of Bern, Switzerland

Stephen Jenkins, London School of Economics and Political Science

Ulrich Kohler, University of Potsdam, Germany

Frauke Kreuter, Univ. of Maryland–College Park

Peter A. Lachenbruch, Oregon State University

Jens Lauritsen, Odense University Hospital

Stanley Lemeshow, Ohio State University

J. Scott Long, Indiana University

Roger Newson, Imperial College, London

Austin Nichols, Urban Institute, Washington DC

Marcello Pagano, Harvard School of Public Health

Sophia Rabe-Hesketh, Univ. of California–Berkeley

J. Patrick Royston, MRC Clinical Trials Unit, London

Philip Ryan, University of Adelaide

Mark E. Schaffer, Heriot-Watt Univ., Edinburgh

Jeroen Weesie, Utrecht University

Ian White, MRC Biostatistics Unit, Cambridge

Nicholas J. G. Winter, University of Virginia

Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager

Lisa Gilmore

Stata Press Copy Editors

David Culwell and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go “beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com


Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                             U.S. and Canada    Elsewhere
Printed & electronic
  1-year subscription                             $ 98             $138
  2-year subscription                             $165             $245
  3-year subscription                             $225             $345
  1-year student subscription                     $ 75             $ 99
  1-year university library subscription          $125             $165
  2-year university library subscription          $215             $295
  3-year university library subscription          $315             $435
  1-year institutional subscription               $245             $285
  2-year institutional subscription               $445             $525
  3-year institutional subscription               $645             $765
Electronic only
  1-year subscription                             $ 75             $ 75
  2-year subscription                             $125             $125
  3-year subscription                             $165             $165
  1-year student subscription                     $ 45             $ 45

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Copyright © 2013 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.


Volume 13 Number 2 2013

The Stata Journal

Articles and Columns                                                       221

Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances
        D. M. Drukker, I. R. Prucha, and R. Raciborski  221

Creating and managing spatial-weighting matrices with the spmat command
        D. M. Drukker, H. Peng, I. R. Prucha, and R. Raciborski  242

A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables
        D. M. Drukker, I. R. Prucha, and R. Raciborski  287

A command for Laplace regression
        M. Bottai and N. Orsini  302

Importing U.S. exchange rate data from the Federal Reserve and standardizing country names across datasets
        B. Dicle, J. Levendis, and M. F. Dicle  315

Generating Manhattan plots in Stata
        D. E. Cook, K. R. Ryckman, and J. C. Murray  323

Semiparametric fixed-effects estimator
        F. Libois and V. Verardi  329

Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests
        T. Harris and J. W. Hardin  337

Extending the flexible parametric survival model for competing risks
        S. R. Hinchliffe and P. C. Lambert  344

Goodness-of-fit tests for categorical data
        R. Bellocco and S. Algeri  356

Standardizing anthropometric measures in children and adolescents with functions for egen: Update
        S. I. Vidmar, T. J. Cole, and H. Pan  366

Bonferroni and Holm approximations for Sidak and Holland–Copenhaver q-values
        R. B. Newson  379

Fitting the generalized multinomial logit model in Stata
        Y. Gu, A. R. Hole, and S. Knox  382

Speaking Stata: Creating and varying box plots: Correction
        N. J. Cox  398

Notes and Comments                                                         401

Stata tip 115: How to properly estimate the multinomial probit model with heteroskedastic errors
        M. Herrmann  401

Software Updates                                                           406


The Stata Journal (2013) 13, Number 2, pp. 221–241

Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances

David M. Drukker
StataCorp
College Station, TX
[email protected]

Ingmar R. Prucha
Department of Economics
University of Maryland
College Park, MD
[email protected]

Rafal Raciborski
StataCorp
College Station, TX
[email protected]

Abstract. We describe the spreg command, which implements a maximum likelihood estimator and a generalized spatial two-stage least-squares estimator for the parameters of a linear cross-sectional spatial-autoregressive model with spatial-autoregressive disturbances.

Keywords: st0291, spreg, spatial-autoregressive models, Cliff–Ord models, maximum likelihood estimation, generalized spatial two-stage least squares, instrumental-variable estimation, generalized method of moments estimation, prediction, spatial econometrics, spatial statistics

1 Introduction

Cliff–Ord (1973, 1981) models, which build on Whittle (1954), allow for cross-unit interactions. Many models in the social sciences, biostatistics, and geographic sciences have included such interactions. Following Cliff and Ord (1973, 1981), much of the original literature was developed to handle spatial interactions. However, space is not restricted to geographic space, and many recent applications use these techniques in other situations of cross-unit interactions, such as social-interaction models and network models; see, for example, Kelejian and Prucha (2010) and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still includes the adjective “spatial”, and we continue this tradition to avoid confusion while noting the wider applicability of these models. For texts and reviews, see, for example, Anselin (1988, 2010), Arbia (2006), Cressie (1993), Haining (2003), and LeSage and Pace (2009).

The simplest Cliff–Ord model only considers spatial spillovers in the dependent variable, with spillovers modeled by including a right-hand-side variable known as a spatial lag. Each observation of the spatial-lag variable is a weighted average of the values of the dependent variable observed for the other cross-sectional units. The matrix containing the weights is known as the spatial-weighting matrix. This model is frequently referred to as a spatial-autoregressive (SAR) model. A generalized version of this model also allows for the disturbances to be generated by a SAR process. The combined SAR model



with SAR disturbances is often referred to as a SARAR model; see Anselin and Florax (1995).1

In modeling the outcome for each unit as dependent on a weighted average of the outcomes of other units, SARAR models determine outcomes simultaneously. This simultaneity implies that the ordinary least-squares estimator will not be consistent; see Anselin (1988) for an early discussion of this point.

In this article, we describe the spreg command, which implements a maximum likelihood (ML) estimator and a generalized spatial two-stage least-squares (GS2SLS) estimator for the parameters of a SARAR model with exogenous regressors. For discussions of the ML estimator, see, for example, the above cited texts and Lee (2004) for the asymptotic properties of the estimator. For a discussion of the estimation theory for the implemented GS2SLS estimator, see Arraiz et al. (2010) and Drukker, Egger, and Prucha (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein.

Section 2 describes the SARAR model. Section 3 describes the spreg command. Section 4 provides some examples. Section 5 describes postestimation commands. Section 6 presents methods and formulas. The conclusion follows.

We use the notation that for any matrix A and vector a, the elements are denoted as a_ij and a_i, respectively.

2 The SARAR model

The spreg command estimates the parameters of the cross-sectional model (i = 1, . . . , n)

$$y_i = \lambda \sum_{j=1}^{n} w_{ij} y_j + \sum_{p=1}^{k} x_{ip}\beta_p + u_i$$

$$u_i = \rho \sum_{j=1}^{n} m_{ij} u_j + \varepsilon_i$$

or more compactly,

$$y = \lambda W y + X\beta + u \tag{1}$$

$$u = \rho M u + \varepsilon \tag{2}$$

1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973, 1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as SARAR models while still acknowledging the importance of the work of Cliff and Ord.


where

• y is an n × 1 vector of observations on the dependent variable;

• W and M are n × n spatial-weighting matrices (with 0 diagonal elements);

• Wy and Mu are n × 1 vectors typically referred to as spatial lags, and λ and ρ are the corresponding scalar parameters typically referred to as SAR parameters;

• X is an n × k matrix of observations on k right-hand-side exogenous variables (where some of the variables may be spatial lags of exogenous variables), and β is the corresponding k × 1 parameter vector;

• ε is an n × 1 vector of innovations.

The model in (1) and (2) is a SARAR with exogenous regressors. Spatial interactions are modeled through spatial lags. The model allows for spatial interactions in the dependent variable, the exogenous variables, and the disturbances.2

The spatial-weighting matrices W and M are taken to be known and nonstochastic. These matrices are part of the model definition, and in many applications, W = M. Let ỹ = Wy. Then

$$\widetilde{y}_i = \sum_{j=1}^{n} w_{ij} y_j$$

which clearly shows the dependence of y_i on neighboring outcomes via the spatial lag ỹ_i. By construction, the spatial lag Wy is an endogenous variable. The weights w_ij will typically be modeled as inversely related to some measure of proximity between the units. The SAR parameter λ measures the extent of these interactions. For further discussions of spatial-weighting matrices and the parameter space for the SAR parameter, see, for example, the literature cited in the introduction, including Kelejian and Prucha (2010); see Drukker et al. (2013) for more information about creating spatial-weighting matrices in Stata.

The innovations ε are assumed to be independent and identically distributed (IID) or independent but heteroskedastically distributed, where the heteroskedasticity is of unknown form. The GS2SLS estimator produces consistent estimates in either case when the heteroskedastic option is specified; see Kelejian and Prucha (1998, 1999, 2010), Arraiz et al. (2010), and Drukker, Egger, and Prucha (2013) for discussions and formal results. The ML estimator produces consistent estimates in the IID case but generally not in the heteroskedastic case; see Lee (2004) for some formal results for the ML estimator, and see Arraiz et al. (2010) for evidence that the ML estimator does not generally produce consistent estimates in the heteroskedastic case.

2. An extension of the model to a limited-information-systems framework with additional endogenous right-hand-side variables is considered in Drukker, Prucha, and Raciborski (2013), which discusses the spivreg command.


Because the model in (1) and (2) is a first-order SAR model with first-order SAR disturbances, it is also referred to as a SARAR(1,1) model, which is a special case of the more general SARAR(p,q) model. We refer to a SARAR(1,1) model as a SARAR model. When ρ = 0, the model in equations (1) and (2) reduces to the SAR model y = λWy + Xβ + ε. When λ = 0, the model in equations (1) and (2) reduces to y = Xβ + u with u = ρMu + ε, which is sometimes referred to as the SAR error model. Setting ρ = 0 and λ = 0 causes the model in equations (1) and (2) to reduce to a linear regression model with exogenous variables.

spreg requires that the spatial-weighting matrices M and W be provided in the form of an spmat object as described in Drukker et al. (2013). spreg gs2sls supports both general and banded spatial-weighting matrices; spreg ml supports general matrices only.

3 The spreg command

3.1 Syntax

spreg ml depvar [indepvars] [if] [in], id(varname) [noconstant level(#) dlmat(objname[, eig]) elmat(objname[, eig]) constraints(constraints) gridsearch(#) maximize_options]

spreg gs2sls depvar [indepvars] [if] [in], id(varname) [noconstant level(#) dlmat(objname) elmat(objname) heteroskedastic impower(q) maximize_options]

3.2 Options for spreg ml

id(varname) specifies a numeric variable that contains a unique identifier for each observation. id() is required.

noconstant suppresses the constant term in the model.

level(#) specifies the confidence level, as a percentage, for confidence intervals. Thedefault is level(95) or as set by set level.

dlmat(objname[, eig]) specifies an spmat object that contains the spatial-weighting matrix W to be used in the SAR term. eig forces the calculation of the eigenvalues of W, even if objname already contains them.

elmat(objname[, eig]) specifies an spmat object that contains the spatial-weighting matrix M to be used in the spatial-error term. eig forces the calculation of the eigenvalues of M, even if objname already contains them.

constraints(constraints); see [R] estimation options.


gridsearch(#) specifies the fineness of the grid used in searching for the initial values of the parameters λ and ρ in the concentrated log likelihood. The allowed range is [.001, .1]. The default is gridsearch(.1).

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init_specs); see [R] maximize. These options are seldom used. from() takes precedence over gridsearch(). An illustrative invocation follows this list.
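For instance, a finer initial grid for λ and ρ might be requested as follows; this is an illustrative call we add here, not output from the original article, using the data and spmat object introduced in section 4:

. spreg ml dui police nondui vehicles dry, id(id)
>      dlmat(ccounty) elmat(ccounty) gridsearch(.05)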

Options for spreg gs2sls

id(varname) specifies a numeric variable that contains a unique identifier for each observation. id() is required.

noconstant suppresses the constant term.

level(#) specifies the confidence level, as a percentage, for confidence intervals. Thedefault is level(95) or as set by set level.

dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix W to be used in the SAR term.

elmat(objname) specifies an spmat object that contains the spatial-weighting matrix M to be used in the spatial-error term.

heteroskedastic specifies that spreg use an estimator that allows the errors to be heteroskedastically distributed over the observations. By default, spreg uses an estimator that assumes homoskedasticity.

impower(q) specifies how many powers of W to include in calculating the instrument matrix H. The default is impower(2). The allowed values of q are integers in the set {2, 3, . . . , ⌈√n⌉}, where n is the number of observations. An example appears after this list.

maximize_options: iterate(#), [no]log, trace, gradient, showstep, showtolerance, tolerance(#), and ltolerance(#); see [R] maximize. from(init_specs) is also allowed, but because ρ is the only parameter in this optimization problem, only initial values for ρ may be specified.

3.3 Saved results

spreg ml saves the following in e():

Scalars
  e(N)             number of observations
  e(k)             number of parameters
  e(df_m)          model degrees of freedom
  e(ll)            log likelihood
  e(chi2)          χ²
  e(p)             significance
  e(rank)          rank of e(V)
  e(converged)     1 if converged, 0 otherwise
  e(iterations)    number of ML iterations

Macros
  e(cmd)           spreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(indeps)        names of independent variables
  e(title)         title in estimation output
  e(chi2type)      type of model χ² test
  e(vce)           oim
  e(technique)     maximization technique
  e(crittype)      type of optimization
  e(estat_cmd)     program used to implement estat
  e(predict)       program used to implement predict
  e(user)          name of likelihood-evaluator program
  e(estimator)     ml
  e(model)         lr, sar, sare, or sarar
  e(constant)      noconstant or hasconstant
  e(idvar)         name of ID variable
  e(dlmat)         name of spmat object used in dlmat()
  e(elmat)         name of spmat object used in elmat()
  e(properties)    b V

Matrices
  e(b)             coefficient vector
  e(Cns)           constraints matrix
  e(ilog)          iteration log
  e(gradient)      gradient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample

spreg gs2sls saves the following in e():

Scalars
  e(N)                  number of observations
  e(k)                  number of parameters
  e(rho_2sls)           initial estimate of ρ
  e(iterations)         number of generalized method of moments iterations
  e(iterations_2sls)    number of two-stage least-squares iterations
  e(converged)          1 if generalized method of moments converged, 0 otherwise
  e(converged_2sls)     1 if two-stage least squares converged, 0 otherwise

Macros
  e(cmd)                spreg
  e(cmdline)            command as typed
  e(estimator)          gs2sls
  e(model)              lr, sar, sare, or sarar
  e(het)                homoskedastic or heteroskedastic
  e(depvar)             name of dependent variable
  e(indeps)             names of independent variables
  e(title)              title in estimation output
  e(exogr)              exogenous regressors
  e(constant)           noconstant or hasconstant
  e(H_omitted)          names of omitted instruments in H matrix
  e(idvar)              name of ID variable
  e(dlmat)              name of spmat object used in dlmat()
  e(elmat)              name of spmat object used in elmat()
  e(estat_cmd)          program used to implement estat
  e(predict)            program used to implement predict
  e(properties)         b V

Matrices
  e(b)                  coefficient vector
  e(delta_2sls)         initial estimate of β and λ
  e(V)                  variance–covariance matrix of the estimators

Functions
  e(sample)             marks estimation sample


4 Example

In our examples, we use dui.dta, which contains simulated data on the number of arrests for driving under the influence for the continental U.S. counties.3 We use a normalized contiguity matrix taken from Drukker et al. (2013). In Stata, we type

. use dui

. spmat use ccounty using ccounty.spmat

to read the dataset into memory and to put the spatial-weighting matrix into the spmat object ccounty. This row-normalized spatial-weighting matrix was created in Drukker et al. (2013, sec. 2.4) and saved to disk in Drukker et al. (2013, sec. 11.4).

Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000 daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across counties, with darker colors representing higher values of the dependent variable. Spatial patterns of dui are clearly visible.

Figure 1. Hypothetical alcohol-related arrests for continental U.S. counties

3. The geographical location data came from the U.S. Census Bureau and can be found at ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by Powers and Wilson (2004).


Our explanatory variables include police (number of sworn officers per 100,000 DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit alcohol sale within their borders). In other words, in this illustration, X = [police, nondui, vehicles, dry, intercept].

We obtain the GS2SLS parameter estimates of the SARAR model parameters by typing

. spreg gs2sls dui police nondui vehicles dry, id(id)
>      dlmat(ccounty) elmat(ccounty) nolog

Spatial autoregressive model                      Number of obs   =       3109
(GS2SLS estimates)

------------------------------------------------------------------------------
         dui |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
dui          |
      police |  -.5591567   .0148772   -37.58   0.000    -.5883155    -.529998
      nondui |  -.0001128   .0005645    -0.20   0.842    -.0012193    .0009936
    vehicles |    .062474   .0006198   100.79   0.000     .0612592    .0636889
         dry |    .303046   .0183119    16.55   0.000     .2671553    .3389368
       _cons |   2.482489   .1473288    16.85   0.000      2.19373    2.771249
-------------+----------------------------------------------------------------
lambda       |
       _cons |   .4672164   .0051261    91.14   0.000     .4571694    .4772633
-------------+----------------------------------------------------------------
rho          |
       _cons |   .1932962   .0726583     2.66   0.008     .0508885    .3357038
------------------------------------------------------------------------------

Given the normalization of the spatial-weighting matrix, the parameter space for λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further discussions of the parameter space. The estimated λ is positive and significant, indicating moderate SAR dependence in dui. In other words, the dui alcohol-arrest rate for a given county is affected by the dui alcohol-arrest rates of the neighboring counties. This result may be because of coordination among police departments or because strong enforcement in one county leads some people to drink in neighboring counties.

The estimated ρ coefficient is positive, moderate, and significant, indicating moderate SAR dependence in the error term. In other words, an exogenous shock to one county will cause moderate changes in the alcohol-related arrest rate in the neighboring counties.

The estimated β vector does not have the same interpretation as in a simple linear model, because including a spatial lag of the dependent variable implies that the outcomes are determined simultaneously. We present one way to interpret the coefficients in section 5.


For comparison, we obtain the ML parameter estimates by typing

. spreg ml dui police nondui vehicles dry, id(id)
>      dlmat(ccounty) elmat(ccounty) nolog

Spatial autoregressive model                      Number of obs   =       3109
(Maximum likelihood estimates)                    Wald chi2(4)    =    62376.4
                                                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
         dui |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
dui          |
      police |  -.5593526    .014864   -37.63   0.000    -.5884854   -.5302197
      nondui |  -.0001214   .0005645    -0.22   0.830    -.0012279    .0009851
    vehicles |   .0624729   .0006195   100.84   0.000     .0612586    .0636872
         dry |   .3030522    .018311    16.55   0.000     .2671633    .3389412
       _cons |   2.490301   .1471885    16.92   0.000     2.201817    2.778785
-------------+----------------------------------------------------------------
lambda       |
       _cons |   .4671198   .0051144    91.33   0.000     .4570957    .4771439
-------------+----------------------------------------------------------------
rho          |
       _cons |   .1962348   .0711659     2.76   0.006     .0567522    .3357174
-------------+----------------------------------------------------------------
sigma2       |
       _cons |   .0859662   .0021815    39.41   0.000     .0816905    .0902418
------------------------------------------------------------------------------

There are no substantive differences between the two sets of parameter estimates.

5 Postestimation commands

The postestimation commands supported by spreg include estat, test, and predict; see help spreg postestimation for the full list. Most postestimation methods have standard interpretations; for example, a Wald test is just a Wald test.

Predictions from SARAR models require some additional explanation. Kelejian and Prucha (2007) consider different information sets and define predictors as conditional means based on these information sets. They also derive the mean squared errors of these predictors, provide some efficiency rankings based on these mean squared errors, and provide Monte Carlo evidence that the additional efficiencies obtained by using more information can be practically important.

One of the predictors that Kelejian and Prucha (2007) consider is based on the information set {X, W, M, w_i y}, where w_i denotes the ith row of W, which will be referred to as the limited-information predictor.4 We denote the limited-information predictor by limited in the syntax diagram below. Another estimator that Kelejian and Prucha (2007) consider is based on the information set {X, W, M}, which yields the reduced-form predictor. This predictor is denoted by rform in the syntax diagram below.

4. Kelejian and Prucha (2007) also consider a full-information predictor. We have postponed implementing this predictor because it is computationally more demanding; we plan to implement it in future work.


Kelejian and Prucha (2007) show that their limited-information predictor can be much more efficient than the reduced-form predictor.

In addition to the limited-information predictor and the reduced-form predictor, predict can compute two other observation-level quantities, which are not recommended as predictors but may be used in subsequent computations. These quantities are denoted by naive and xb in the syntax diagram below.

While prediction is frequently of interest in applied statistical work, predictions can also be used to compute marginal effects.5 A change to one observation in one exogenous variable potentially changes the predicted values for all the observations of the dependent variable because the n observations for the dependent variable form a system of simultaneous equations in a SARAR model. Below we use predict to calculate predictions that we in turn use to calculate marginal effects.

Various methods have been proposed to interpret the parameters of SAR models; see, for example, Anselin (2003); Abreu, De Groot, and Florax (2004); Kelejian and Prucha (2007); and LeSage and Pace (2009).

5.1 Syntax

Before using predict, we discuss its syntax.

predict [type] newvar [if] [in] [, rform | limited | naive | xb rftransform(matname)]

5.2 Options

rform, the default, calculates the reduced-form predictions.

limited calculates the Kelejian and Prucha (2007) limited-information predictor. This predictor is more efficient than the reduced-form predictor, but we call it limited because it is not as efficient as the Kelejian and Prucha (2007) full-information predictor, which we plan to implement in the future.

naive calculates λw_i y + x_i β for each observation.

xb calculates the linear prediction Xβ.

5. We refer to the effects of both infinitesimal changes in a continuous variable and discrete changes in a discrete variable as marginal effects. While some authors refer to “partial” effects to cover the continuous and discrete cases, we avoid the term “partial” because it means something else in a simultaneous-equations framework.


rftransform(matname) is a seldom-used option that specifies a matrix to use in computing the reduced-form predictions. This option is only useful when computing reduced-form predictions in a loop, when the option removes the need to recompute the inverse of a large matrix. See section 5.3 for an example that uses this option, and see section 6.3 for the details. rftransform() may only be specified with statistic rform.
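To illustrate the syntax, each statistic is requested by name after an spreg fit. This is an illustrative snippet we add here, not part of the original article; yr, yl, yn, and yx are arbitrary new variable names:

. predict yr, rform
. predict yl, limited
. predict yn, naive
. predict yx, xb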

5.3 Example

In this section, we discuss two marginal effects that measure how changes in the exogenous variables affect the endogenous variable. These measures use the reduced-form predictor ŷ = E(y|X, W, M) = (I − λW)⁻¹Xβ, which we discuss in section 6.3, where it is denoted as ŷ⁽¹⁾. The expression for the predictor shows that a change in a single observation on an exogenous variable will typically affect the values of the endogenous variable for all n units because the SARAR model forms a system of simultaneous equations.

Without loss of generality, we explore the effects of changes in the kth exogenous variable. Letting x_k = (x_{1k}, . . . , x_{nk})′ denote the vector of observations on the kth exogenous variable allows us to denote the dependence of ŷ on x_k by using the notation

$$\widehat{y}(x_k) = \left\{\widehat{y}_1(x_k), \ldots, \widehat{y}_n(x_k)\right\}$$

The first marginal effect we consider is

$$\frac{\partial \widehat{y}(x_k + \delta\, i)}{\partial \delta} = \frac{\partial \widehat{y}(x_{1k}, \ldots, x_{i-1,k},\, x_{ik} + \delta,\, x_{i+1,k}, \ldots, x_{nk})}{\partial \delta} = \frac{\partial \widehat{y}(x_k)}{\partial x_{ik}}$$

where i = [0, . . . , 0, 1, 0, . . . , 0]′ is the ith column of the identity matrix. In terminology consistent with that of LeSage and Pace (2009, 36–37), we refer to the above effect as the total direct impact of a change in the ith unit of x_k. LeSage and Pace (2009, 36–37) define the corresponding summary measure

$$n^{-1}\sum_{i=1}^{n} \frac{\partial \widehat{y}_i(x_k + \delta\, i)}{\partial \delta} = n^{-1}\sum_{i=1}^{n} \frac{\partial \widehat{y}_i(x_{1k}, \ldots, x_{i-1,k},\, x_{ik} + \delta,\, x_{i+1,k}, \ldots, x_{nk})}{\partial \delta} = n^{-1}\sum_{i=1}^{n} \frac{\partial \widehat{y}_i(x_k)}{\partial x_{ik}} \tag{3}$$

which they call the average total direct impact (ATDI). The ATDI is the average over i = {1, . . . , n} of the changes in the ŷ_i attributable to the changes in the corresponding x_{ik}. The ATDI can be calculated by computing ŷ(x_k), ŷ(x_k + δi), and the average of the difference of these vectors of predicted values, where δ is the magnitude by which x_{ik} is changed. The ATDI measures the average change in ŷ_i attributable to sequentially changing x_{ik} for a given k.


Sequentially changing x_{ik} for each i = {1, . . . , n} differs from simultaneously changing the x_{ik} for all n units. The second marginal effect we consider measures the effect of simultaneously changing x_{1k}, . . . , x_{nk} on a specific ŷ_i and is defined by

$$\frac{\partial \widehat{y}_i(x_k + \delta\, e)}{\partial \delta} = \frac{\partial \widehat{y}_i(x_{1k} + \delta, \ldots, x_{ik} + \delta, \ldots, x_{nk} + \delta)}{\partial \delta} = \sum_{r=1}^{n} \frac{\partial \widehat{y}_i(x_k)}{\partial x_{rk}}$$

where e = [1, . . . , 1]′ is a vector of 1s. LeSage and Pace (2009, 36–37) define the corresponding summary measure

$$n^{-1}\sum_{i=1}^{n} \frac{\partial \widehat{y}_i(x_k + \delta\, e)}{\partial \delta} = n^{-1}\sum_{i=1}^{n} \frac{\partial \widehat{y}_i(x_{1k} + \delta, \ldots, x_{ik} + \delta, \ldots, x_{nk} + \delta)}{\partial \delta} = n^{-1}\sum_{i=1}^{n} \sum_{r=1}^{n} \frac{\partial \widehat{y}_i(x_k)}{\partial x_{rk}} \tag{4}$$

which they call the average total impact (ATI). The ATI can be calculated by computing ŷ(x_k), ŷ(x_k + δe), and the average difference in these vectors of predicted values, where δ is the magnitude by which x_{1k}, . . . , x_{nk} is changed.

We now continue our example from section 4 and use the reduced-form predictor to compute the marginal effects of adding one officer per 100,000 DVMT in Elko County, Nevada. We begin by using the reduced-form predictor and the observed values of the exogenous variables to obtain predicted values for dui:

. predict y0
(option rform assumed)

Next we increase police by 1 in Elko County, Nevada, and calculate the reduced-form predictions:

. generate police_orig = police

. quietly replace police = police_orig + 1 if st==32 & NAME00=="Elko"

. predict y1
(option rform assumed)

Now we compute the difference between these two predictions:

. generate deltay = y1-y0

The output below lists the predicted difference and the level of dui for Elko County, Nevada:

. list deltay dui if (st==32 & NAME00=="Elko")

        +-----------------------+
        |    deltay         dui |
        |-----------------------|
  1891. | -.5654716   19.777429 |
        +-----------------------+

The predicted effect of the change would be a 2.9% reduction in dui in Elko County, Nevada.


Below we use four commands to summarize the changes and levels in the contiguous counties:

. spmat getmatrix ccounty W

. generate double elko_neighbor = .
(3109 missing values generated)

. mata: st_store(.,"elko_neighbor",W[1891,.]')

. summarize deltay dui if elko_neighbor>0

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
      deltay |         9   -.0203756    .0000364   -.0204239    -.020298
         dui |         9    21.29122      1.6468     19.2773    23.49109

In the first command, we use spmat getmatrix to store a copy of the normalized-contiguity spatial-weighting matrix in Mata memory; see Drukker et al. (2013, sec. 14) for a discussion of spmat getmatrix. In the second and third commands, we generate and fill in a new variable for which the ith observation is 1 if it contains information on a county that is contiguous with Elko County and is 0 otherwise. In the fourth command, we summarize the predicted changes and the levels in the contiguous counties. The mean predicted reduction is less than 0.1% of the mean level of dui in the contiguous counties.

In the output below, we get a summary of the levels of dui and a detailed summaryof the predicted changes for all the counties in the sample.

. summarize dui

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
         dui |      3109    20.84307    1.457163   15.01375    26.61978

. summarize deltay, detail

                           deltay
-------------------------------------------------------------
      Percentiles      Smallest
 1%       -.0007572     -.5654716
 5%               0     -.0204239
10%               0     -.0203991       Obs                3109
25%               0     -.0203991       Sum of Wgt.        3109

50%               0                     Mean           -.0002495
                        Largest         Std. Dev.       .0101996
75%               0             0
90%               0             0       Variance         .000104
95%               0             0       Skewness       -54.78661
99%               0             0       Kurtosis        3035.363

Less than 1% of the sample had any socially significant difference, with no change atall predicted for at least 95% of the sample.


In some of the computations below, we will use the matrix S = (I_n − λ̂W)⁻¹, where λ̂ is the estimate of the SAR parameter and W is the spatial-weighting matrix. In the output below, we use the W stored in Mata memory in an example above to compute S.

. spmat getmatrix ccounty W

. mata:
------------------------------------------------- mata (type end to exit) ---
: b = st_matrix("e(b)")

: lam = b[1,6]

: S = luinv(I(rows(W))-lam*W)

: end
------------------------------------------------------------------------------

We next compute the ATDI defined in (3). The output below shows an instructive (but slow) method to compute the ATDI. For each county in the data, we set police to be the original value for all the observations except the ith, which we set to police + 1. Then we compute the predicted value of dui for observation i and store this prediction in the ith observation of y1. (We use the rftransform() option to use the inverse matrix S computed above. Without this option, we would recompute the inverse matrix for each of the 3,109 observations, which would cause the calculation to take hours.) After computing the predicted values of y1 for each observation, we compute the differences in the predictions and compute the sample average.

. drop y1 deltay

. generate y1 = .
(3109 missing values generated)

. local N = _N

. forvalues i = 1/`N' {
  2.     quietly capture drop tmp
  3.     quietly replace police = police_orig
  4.     quietly replace police = police_orig + 1 in `i'
  5.     quietly predict tmp in `i', rftransform(S)
  6.     quietly replace y1 = tmp in `i'
  7. }

. generate deltay = y1-y0

. summarize deltay

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
      deltay |      3109   -.5633844    .0009144   -.5690784   -.5599785

. summarize dui

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
         dui |      3109    20.84307    1.457163   15.01375    26.61978

The estimated ATDI is −0.56, so in absolute value the estimated effect is 2.7% of the sample mean of dui.


As mentioned, the above method for computing the estimate of the ATDI is slow. LeSage and Pace (2009, 36–37) show that the estimate of the ATDI can also be computed as

$$\frac{\widehat{\beta}_k}{n}\,\mathrm{trace}(S)$$

where β̂_k is the kth component of β̂ and S = (I_n − λ̂W)⁻¹, which we computed above. Below we use this formula to compute the ATDI,

. mata: (b[1,1]/rows(W))*trace(S)
  -.5633844076

and note that the result is the same as above.

Now we estimate the ATI, which simultaneously adds one more police officer per 100,000 DVMT to each county. In the output below, we add 1 to police in each observation and then calculate the differences in the predictions. We then calculate the ATI defined in (4) by computing the sample average.

. drop y1 deltay

. quietly replace police = police_orig + 1

. predict y1
(option rform assumed)

. generate deltay = y1-y0

. summarize deltay

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
      deltay |      3109   -.6993675    .0309541   -.8945923   -.5801525

. summarize dui

    Variable |       Obs        Mean    Std. Dev.       Min         Max
-------------+----------------------------------------------------------
         dui |      3109    20.84307    1.457163   15.01375    26.61978

The absolute value of the estimated average total effect is about 3.4% of the sample mean of dui.

LeSage and Pace (2009, 36–37) show that the ATI is given by

$$\frac{\widehat{\beta}_k}{n} \sum_{i=1}^{n}\sum_{j=1}^{n} S_{i,j}$$

where β̂_k is the kth component of β̂ and S_{ij} is the (i, j)th element of S = (I_n − λ̂W)⁻¹. In the output below, we use the spmat getmatrix command discussed in Drukker et al. (2013) and a few Mata computations to show that the above expression yields the same value for the ATI as our calculations above.


. spmat getmatrix ccounty W

. mata:
------------------------------------------------- mata (type end to exit) ---
: b = st_matrix("e(b)")

: lam = b[1,6]

: S = luinv(I(rows(W))-lam*W)

: (b[1,1]/rows(W))*sum(S)
  -.6993674779

: end
------------------------------------------------------------------------------

In general, it is not possible to say whether the ATDI is greater than or less than the ATI. Using the expressions from LeSage and Pace (2009, 36–37), we see that

$$\mathrm{ATI} - \mathrm{ATDI} = \frac{\beta_k}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{i,j} - \frac{\beta_k}{n}\sum_{i=1}^{n} S_{i,i} = \frac{\beta_k}{n}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} S_{i,j}$$

which depends on the sum of the off-diagonal elements of S as well as on βk.
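Because b, W, and S are still in Mata memory from the session above, this identity can be checked directly. The following one-liner is our addition, not part of the original output; it computes β̂_k/n times the off-diagonal sum of S:

. mata: (b[1,1]/rows(W))*(sum(S)-trace(S))

The result should be about −.1359831, the difference between the ATI (−.6993674779) and the ATDI (−.5633844076) computed earlier.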

In the case at hand, one would expect the ATDI to be smaller than the ATI because the ATDI, unlike the ATI, does not incorporate the reinforcing effects of having all counties implement the change simultaneously.

6 Methods and formulas

6.1 ML estimator

Recall that the SARAR model under consideration is given by

$$y = \lambda W y + X\beta + u \tag{5}$$

$$u = \rho M u + \varepsilon \tag{6}$$

In the following, we give the log-likelihood function under the assumption that ε ∼ N(0, σ²I). As usual, we refer to the maximizer of the likelihood function when the innovations are not normally distributed as the quasi–maximum likelihood (QML) estimator. Lee (2004) gives results concerning the consistency and asymptotic normality of the QML estimator when ε is IID but not necessarily normally distributed. Violations of the assumption that the innovations ε are IID can cause the QML estimator to produce inconsistent results. In particular, this may be the case if the innovations ε are heteroskedastic, as discussed by Arraiz et al. (2010).

Likelihood function

The reduced form of the model in (5) and (6) is given by

$$y = (I - \lambda W)^{-1} X\beta + (I - \lambda W)^{-1}(I - \rho M)^{-1}\varepsilon$$
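The intermediate step, which we add here for completeness (it assumes I − λW and I − ρM are nonsingular, as is standard in this literature), is to solve (6) for u and substitute into (5):

$$u = (I - \rho M)^{-1}\varepsilon, \qquad (I - \lambda W)y = X\beta + u \;\Longrightarrow\; y = (I - \lambda W)^{-1}X\beta + (I - \lambda W)^{-1}(I - \rho M)^{-1}\varepsilon$$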


The unconcentrated log-likelihood function is

$$\ln L(y \,|\, \beta, \sigma^2, \lambda, \rho) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) + \ln\|I - \lambda W\| + \ln\|I - \rho M\| - \frac{1}{2\sigma^2}\left\{(I - \lambda W)y - X\beta\right\}^{T}(I - \rho M)^{T}(I - \rho M)\left\{(I - \lambda W)y - X\beta\right\} \tag{7}$$

We can concentrate the log-likelihood function by first maximizing (7) with respect to β and σ², yielding the maximizers

$$\widehat{\beta}(\lambda, \rho) = \left\{X^{T}(I - \rho M)^{T}(I - \rho M)X\right\}^{-1} X^{T}(I - \rho M)^{T}(I - \rho M)(I - \lambda W)y$$

$$\widehat{\sigma}^2(\lambda, \rho) = (1/n)\left\{(I - \lambda W)y - X\widehat{\beta}(\lambda, \rho)\right\}^{T}(I - \rho M)^{T}(I - \rho M)\left\{(I - \lambda W)y - X\widehat{\beta}(\lambda, \rho)\right\}$$

Substitution of the above expressions into (7) yields the concentrated log-likelihood function

$$L_c(y \,|\, \lambda, \rho) = -\frac{n}{2}\left\{\ln(2\pi) + 1\right\} - \frac{n}{2}\ln\left\{\widehat{\sigma}^2(\lambda, \rho)\right\} + \ln\|I - \lambda W\| + \ln\|I - \rho M\|$$

The QML estimates for the autoregressive parameters λ and ρ can now be computed by maximizing the concentrated log-likelihood function. Once we have obtained the QML estimates λ̂ and ρ̂, we can calculate the QML estimates for β and σ² as β̂ = β̂(λ̂, ρ̂) and σ̂² = σ̂²(λ̂, ρ̂).

Initial values

As noted in Anselin (1988, 186), poor initial starting values for ρ and λ in the concentrated likelihood may result in the optimization algorithm settling on a local, rather than the global, maximum.

To prevent this problem from happening, spreg ml performs a grid search to find suitable initial values for ρ and λ. To override the grid search, you may specify your own initial values in the option from(). A sketch of such a grid search appears below.
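To make the idea concrete, here is a minimal Mata sketch of a grid search over the concentrated log likelihood, following the formulas above. It is our illustration, not spreg ml's internal code; it assumes y, X, W, and M are already in Mata memory and that the determinants in (7) are positive over the grid:

mata:
// Concentrated log likelihood L_c(lambda, rho), following section 6.1
real scalar conc_ll(real scalar lam, real scalar rho,
    real colvector y, real matrix X, real matrix W, real matrix M)
{
    real matrix A, B, BX
    real colvector bhat, e
    real scalar n, s2

    n    = rows(y)
    A    = I(n) - lam*W                    // I - lambda*W
    B    = I(n) - rho*M                    // I - rho*M
    BX   = B*X
    bhat = invsym(BX'BX)*(BX'(B*A*y))      // beta-hat(lambda, rho)
    e    = B*(A*y - X*bhat)
    s2   = (e'e)/n                         // sigma2-hat(lambda, rho)
    return(-n/2*(ln(2*pi())+1) - n/2*ln(s2) + ln(det(A)) + ln(det(B)))
}

// coarse search over a grid with spacing .1, mimicking gridsearch(.1)
best = .
for (lam = -.9; lam <= .9; lam = lam + .1) {
    for (rho = -.9; rho <= .9; rho = rho + .1) {
        ll = conc_ll(lam, rho, y, X, W, M)
        if (best == . | ll > best) {
            best = ll
            lam0 = lam
            rho0 = rho
        }
    }
}
(lam0, rho0)    // grid maximizers, used as initial values
end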

6.2 GS2SLS estimator

For discussions of the generalized method of moments and instrumental-variable estimation approach underlying the GS2SLS estimator, see Arraiz et al. (2010) and Drukker, Egger, and Prucha (2013). The articles build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. For a detailed description of the formulas, see also Drukker, Prucha, and Raciborski (2013).

The GS2SLS estimator requires instruments. Kelejian and Prucha (1998, 1999) suggest using as instruments H the linearly independent columns of

$$X,\; WX,\; \ldots,\; W^{q}X,\; MX,\; MWX,\; \ldots,\; MW^{q}X$$


where q = 2 has worked well in Monte Carlo simulations over a wide range of reasonable specifications. The choice of those instruments provides a computationally convenient approximation of the ideal instruments; see Lee (2003) and Kelejian, Prucha, and Yuzefovich (2004) for further discussions and refined estimators. At a minimum, the instruments should include the linearly independent columns of X and MX. When there is a constant in the model and thus X contains a constant term, the constant term is only included once in H. A sketch of the construction appears below.
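As an illustration of this construction (our sketch, not spreg's internal code), the instrument matrix for q = 2 can be assembled in Mata as follows; X here is assumed to exclude the constant, which is appended once at the end, and any remaining collinear columns would still have to be dropped to keep only linearly independent columns:

mata:
// instruments for q = 2: the columns of (X, WX, W^2X, MX, MWX, MW^2X),
// with the constant included once
WX  = W*X
W2X = W*WX
H   = (X, WX, W2X, M*X, M*WX, M*W2X, J(rows(X), 1, 1))
end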

6.3 Spatial predictors

The spreg command provides for several unbiased predictors corresponding to different information sets, namely, {X, W, M} and {X, W, M, w_i y}, where w_i denotes the ith row of W; for a more detailed discussion and derivations, see Kelejian and Prucha (2007). Also in the following, x_i denotes the ith row of X and u_i denotes the ith element of u.

The unbiased predictor corresponding to information set {X, W, M} is given by

$$\widehat{y}^{(1)} = (I - \lambda W)^{-1} X\beta$$

and is called the reduced-form predictor. If λ = 0, then ŷ⁽¹⁾ = Xβ. This predictor can be calculated by specifying statistic rform to predict after spreg.

The rftransform() option specifies the name of a matrix in Mata memory that contains (I − λ̂W)⁻¹, that is, a matrix that transforms the model to its reduced form. This option is useful when computing many sets of reduced-form predictions from the same (I − λ̂W)⁻¹ because it alleviates the need to recompute the inverse matrix.

Assuming that the innovations ε are distributed N(0, σ²I), the unbiased predictor corresponding to information set {X, W, M, w_i y} is given by

$$\widehat{y}_i^{(2)} = \lambda w_i y + x_i \beta + \frac{\mathrm{cov}(u_i, w_i y)}{\mathrm{var}(w_i y)}\left\{w_i y - \mathrm{E}(w_i y)\right\}$$

where

$$\begin{aligned} \Sigma_u &= (I - \rho M)^{-1}(I - \rho M^{T})^{-1}\\ \Sigma_y &= (I - \lambda W)^{-1}\Sigma_u (I - \lambda W^{T})^{-1}\\ \mathrm{E}(w_i y) &= w_i (I - \lambda W)^{-1} X\beta\\ \mathrm{var}(w_i y) &= \sigma^2 w_i \Sigma_y w_i'\\ \mathrm{cov}(u_i, w_i y) &= \sigma^2 \sigma_{u_i} (I - \lambda W^{T})^{-1} w_i' \end{aligned}$$

and σ_{u_i} is the ith row of Σ_u.

We call this unbiased predictor the limited-information predictor because Kelejian and Prucha (2007) consider a more efficient predictor, the full-information predictor. The former can be calculated by specifying statistic limited to predict after spreg.


A further predictor considered in the literature is

$$\widehat{y}_i = \lambda w_i y + x_i \beta$$

However, as pointed out in Kelejian and Prucha (2007), this estimator is generally biased. While this biased predictor should not be used for predictions, it has uses as an intermediate computation, and it can be calculated by specifying statistic naive to predict after spreg.

The above predictors are computed by replacing the parameters in the predictionformula with their estimates.

7 Conclusion

After reviewing some basic concepts related to SARAR models, we presented the spreg ml and spreg gs2sls commands, which implement ML and GS2SLS estimators for the parameters of these models. We also discussed postestimation prediction. In future work, we would like to investigate further methods and commands for parameter interpretation.

8 Acknowledgment

We gratefully acknowledge financial support from the National Institutes of Health through the SBIR grants R43 AG027622 and R44 AG027622.

9 References

Abreu, M., H. L. F. De Groot, and R. J. G. M. Florax. 2004. Space and growth: A survey of empirical evidence and methods. Working Paper TI 04-129/3, Tinbergen Institute.

Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.

———. 2003. Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review 26: 153–166.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.

Anselin, L., and R. J. G. M. Florax. 1995. Small sample properties of tests for spatial dependence in regression models: Some further results. In New Directions in Spatial Econometrics, ed. L. Anselin and R. J. G. M. Florax, 21–74. Berlin: Springer.

Arbia, G. 2006. Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence. Berlin: Springer.


Arraiz, I., D. M. Drukker, H. H. Kelejian, and I. R. Prucha. 2010. A spatial Cliff–Ord-type model with heteroskedastic innovations: Small and large sample results. Journal of Regional Science 50: 592–614.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes: Models and Applications. London: Pion.

Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.

Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial autoregressive model with autoregressive disturbances and endogenous regressors. Econometric Reviews 32: 686–733.

Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2013. Creating and managing spatial-weighting matrices with the spmat command. Stata Journal 13: 242–286.

Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013. A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables. Stata Journal 13: 287–301.

Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge University Press.

Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics 17: 99–121.

———. 1999. A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40: 509–533.

———. 2007. The relative efficiencies of various predictors in spatial econometric models containing spatial lags. Regional Science and Urban Economics 37: 363–374.

———. 2010. Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.

Kelejian, H. H., I. R. Prucha, and Y. Yuzefovich. 2004. Instrumental variable estimation of a spatial autoregressive model with autoregressive disturbances: Large and small sample results. In Spatial and Spatiotemporal Econometrics, ed. J. P. LeSage and R. K. Pace, 163–198. New York: Elsevier.

Lee, L.-F. 2003. Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econometric Reviews 22: 307–335.

———. 2004. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72: 1899–1925.

LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton: Chapman & Hall/CRC.


Powers, E. L., and J. K. Wilson. 2004. Access denied: The relationship between alcohol prohibition and driving under the influence. Sociological Inquiry 74: 318–337.

Whittle, P. 1954. On stationary processes in the plane. Biometrika 41: 434–449.

About the authors

David Drukker is the director of econometrics at StataCorp.

Ingmar Prucha is a professor of economics at the University of Maryland.

Rafal Raciborski is an econometrician at StataCorp.


The Stata Journal (2013) 13, Number 2, pp. 242–286

Creating and managing spatial-weighting matrices with the spmat command

David M. Drukker
StataCorp
College Station, TX
[email protected]

Hua Peng
StataCorp
College Station, TX
[email protected]

Ingmar R. Prucha
Department of Economics
University of Maryland
College Park, MD
[email protected]

Rafal Raciborski
StataCorp
College Station, TX
[email protected]

Abstract. We present the spmat command for creating, managing, and storing spatial-weighting matrices, which are used to model interactions between spatial or, more generally, cross-sectional units. spmat can store spatial-weighting matrices in a general and banded form. We illustrate the use of the spmat command and discuss some of the underlying issues by using United States county and postal-code-level data.

Keywords: st0292, spmat, spatial-autoregressive models, Cliff–Ord models, spatial lag, spatial-weighting matrix, spatial econometrics, spatial statistics, cross-sectional interaction models, social-interaction models

1 Introduction

Building on Whittle (1954), Cliff and Ord (1973, 1981) developed statistical models that not only accommodate forms of cross-unit correlation but also allow for explicit forms of cross-unit interactions. The latter is a feature of interest in many social science, biostatistical, and geographic science models. Following Cliff and Ord (1973, 1981), much of the original literature was developed to handle spatial interactions. However, space is not restricted to geographic space, and many recent applications use spatial techniques in other situations of cross-unit interactions, such as social-interaction models and network models; see, for example, Kelejian and Prucha (2010) and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still includes the adjective "spatial", and we continue this tradition to avoid confusion while noting the wider applicability of these models. For texts and reviews, see, for example, Anselin (1988, 2010), Arbia (2006), Cressie (1993), Haining (2003), and LeSage and Pace (2009).

The models derived and discussed in the literature cited above model cross-unit interactions and correlation in terms of spatial lags, which may involve the dependent variable, the exogenous variables, and the disturbances. A spatial lag of a variable is defined as a weighted average of observations on the variable over neighboring units.

© 2013 StataCorp LP st0292


To illustrate, we consider the rudimentary spatial-autoregressive (SAR) model

$$ y_i = \lambda \sum_{j=1}^{n} w_{ij}\, y_j + \varepsilon_i, \qquad i = 1, \ldots, n $$

where $y_i$ denotes the dependent variable corresponding to unit $i$, the $w_{ij}$ (with $w_{ii} = 0$) are nonstochastic weights, $\varepsilon_i$ is a disturbance term, and $\lambda$ is a parameter. In the above model, the $y_i$ are determined simultaneously. The weighted average $\sum_{j=1}^{n} w_{ij} y_j$ on the right-hand side is called a spatial lag, and the $w_{ij}$ are called the spatial weights. It often proves convenient to write the model in matrix notation as

$$ y = \lambda W y + \varepsilon $$

where

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad
W = \begin{bmatrix}
0 & w_{12} & \cdots & w_{1,n-1} & w_{1n} \\
w_{21} & 0 & \cdots & w_{2,n-1} & w_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
w_{n-1,1} & w_{n-1,2} & \cdots & 0 & w_{n-1,n} \\
w_{n1} & w_{n2} & \cdots & w_{n,n-1} & 0
\end{bmatrix}, \qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

Again the $n \times 1$ vector $Wy$ is typically referred to as the spatial lag in $y$ and the $n \times n$ matrix $W$ as the spatial-weighting matrix. More generally, as indicated above, the concept of a spatial lag can be applied to any variable, including exogenous variables and disturbances, which, as can be seen in the literature cited above, provides for a fairly general class of Cliff–Ord types of models.
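To make the weighted-average interpretation concrete, the following minimal Mata sketch (our own illustration, not part of spmat) builds a small row-normalized W and computes the spatial lag Wy by direct multiplication:

    mata:
        // three units: unit 1 neighbors units 2 and 3; units 2 and 3
        // each neighbor only unit 1; each row sums to 1
        W = (0, .5, .5) \
            (1,  0,  0) \
            (1,  0,  0)
        y = (10 \ 20 \ 30)
        Wy = W*y        // spatial lag: (25, 10, 10)', each element a
                        // weighted average of the neighbors' y values
    end

spmat lag (section 8) performs this computation for a variable in the dataset.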

Spatial-weighting matrices allow us to conveniently implement Tobler's first law of geography, "everything is related to everything else, but near things are more related than distant things" (Tobler 1970, 236), which applies whether the space is geographic, biological, or social. The spmat command creates, imports, manipulates, and saves W matrices. The matrices are stored in a spatial-weighting matrix object (spmat object). The spmat object contains additional information about a spatial-weighting matrix, such as the identification codes of the cross-section units, and other items discussed below.1

The generic syntax of spmat is

spmat subcommand . . .

where each subcommand performs a specific task. Some subcommands create spmat objects from a Stata dataset (contiguity, idistance, dta), a Mata matrix (putmatrix), or a text file (import). Other subcommands save objects to a disk (save, export) or read them back in (use, import). Still other subcommands summarize spatial-weighting

1. We use the term "units" instead of "places" because spatial-econometric methods have been applied to many cases in which the units of analysis are individuals or firms instead of geographical places; for example, see Leenders (2002).


matrices (summarize); graph spatial-weighting matrices (graph); manage them (note, drop, getmatrix); and perform computations on them (lag, eigenvalues). The remaining subcommands are used to change the storage format of the spatial-weighting matrices inside the spmat objects. As discussed below, matrices stored inside spmat objects can be general or banded, with general matrices occupying much more space than banded ones. The subcommand permute rearranges the matrix elements, and the subcommand tobanded is used to store a matrix in banded form.

spmat contiguity and spmat idistance create the frequently used inverse-distance and contiguity spatial-weighting matrices; Haining (2003, 83) and Cliff and Ord (1981, 17) discuss typical formulations of weights matrices. The import and management capabilities allow users to create spatial-weighting matrices beyond contiguity and inverse-distance matrices. Section 17.4 provides some discussion and examples.

Drukker, Prucha, and Raciborski (2013a, 2013b) discuss Stata commands that implement estimators for SAR models. These commands use spatial-weighting matrices previously created by the spmat command discussed in this article.

Before we describe individual subcommands in detail, we illustrate how to obtain and transform geospatial data into the format required by spmat, and we address computational problems pertinent to spatial-weighting matrices.

1.1 From shapefiles into Stata format

Many applications use geospatial data frequently made available in the form of shapefiles. Each shapefile is a pair of files: the database file and the coordinates file. The database file contains data on the attributes of the spatial units, while the coordinates file contains the geographical coordinates describing the boundaries of the spatial units. In the common case where the units correspond to nonzero areas instead of points, the boundary data in the coordinates file are stored as a series of irregular polygons.

The vast majority of geospatial data comes in the form of ESRI or MIF shapefiles.2

There are user-written tools for translating shapefiles to Stata’s .dta format and formapping spatial data. shp2dta (Crow 2006) and mif2dta (Pisati 2005) translate ESRI

and MIF shapefiles to Stata datasets. shp2dta and mif2dta translate the two filesthat make up a shapefile to two Stata .dta files. The database file is translated tothe “attribute” .dta file, and the coordinates file is translated to the coordinates .dtafile.3,4

2. Refer to http://www.esri.com for details about the ESRI format and to http://www.pbinsight.com for details about the MIF format. The ESRI format is much more common.

3. shp2dta and mif2dta save the coordinates data in the format required by spmap (Pisati 2007), which graphs data onto maps.

4. We use the term "attribute" instead of "database" because "database" does not adequately distinguish between attribute data and coordinates data.


The code below illustrates the use of shp2dta and spmap (Pisati 2007) on the county boundaries data for the continental United States; Crow and Gould (2007) provide a broader introduction to shapefiles, shp2dta, and spmap.

shp2dta, mif2dta, and spmap use a common set of conventions for defining the polygons in the coordinates data translated from the coordinates file. Crow and Gould (2007) discuss these conventions.

We downloaded tl_2008_us_county00.dbf and tl_2008_us_county00.shp, which are the attribute file and the coordinates file, respectively, and which make up the shapefile for U.S. counties from the U.S. Census Bureau.5 We begin by using shp2dta to translate these files to the files county.dta and countyxy.dta.

. shp2dta using tl_2008_us_county00, database(county)
>     coordinates(countyxy) genid(id) gencentroids(c)

county.dta contains the attribute information from the attribute file in the shapefile, and countyxy.dta contains the coordinates data from the shapefile. The attribute dataset county.dta has one observation per county on variables such as county name and state code. Because we specified the option gencentroids(c), county.dta also contains the variables x_c and y_c, which contain the coordinates of the county centroids, measured in degrees. (See the help file for shp2dta for details and the x–y naming convention.) countyxy.dta contains the coordinates of the county boundaries in the long-form panel format used by spmap.6

Below we use use to read county.dta into memory and use destring (see [D] destring) to create a new, numeric state-code variable st from the original string state-identifying variable STATEFP. Next we use drop to drop the observations defining the coordinates of county boundaries in Alaska, Hawaii, and U.S. territories. Finally, we use rename to rename the variables containing coordinates of the county centroids and use save to save our changes into the county.dta dataset file.

. use county

. quietly destring STATEFP, generate(st)

. *keep continental US counties

. drop if st==2 | st==15 | st>56
(123 observations deleted)

. rename x_c longitude

. rename y_c latitude

. save county, replace
file county.dta saved

Having completed the translation and selected our subsample, we use spmap to draw the map, given in figure 1, of the boundaries in the coordinates dataset.

5. Actually, we downloaded tl_2008_us_county00.zip from ftp://ftp2.census.gov/geo/tiger/TIGER2008/, and this .zip file contained the two files named in the text.

6. Crow and Gould (2007), the shp2dta help file, and the spmap help file provide more information about the input and output datasets.


. spmap using countyxy, id(id)

Figure 1. County boundaries for the continental United States, 2000

1.2 Memory considerations

The spatial-weighting matrix for the n units is an n × n matrix, which implies that memory requirements increase quadratically with data size. For example, a contiguity matrix for the 31,713 U.S. postal codes (five-digit zip codes) is a 31,713 × 31,713 matrix, which requires 31,713 × 31,713 × 8/2^30 ≈ 7.5 gigabytes of storage space.
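This arithmetic is easy to verify interactively (a quick check of the figure above, not output reproduced from the original session); the command below displays roughly 7.49:

    . display 31713*31713*8/2^30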

Many users do not have this much memory on their machines. However, it is usually possible to store spatial-weighting matrices more efficiently. Drukker et al. (2011) discuss how to judiciously reorder the observations so that many spatial-weighting matrices can be stored as banded matrices, thereby using less space than general matrices.

This subsection describes banded matrices and the potential benefits of using banded matrices for storing spatial-weighting matrices. If you do not have large datasets, you may skip this section and all future references to banded matrices.


A banded matrix is a matrix whose nonzero elements are confined to a diagonal band that comprises the main diagonal, zero or more diagonals above the main diagonal, and zero or more diagonals below the main diagonal. The number of diagonals above the main diagonal that contain nonzero elements is the upper bandwidth, say, $b_U$. The number of diagonals below the main diagonal that contain nonzero elements is the lower bandwidth, say, $b_L$. An example of a banded matrix having an upper bandwidth of 1 and a lower bandwidth of 2 is

$$
\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}
$$

We can save a lot of space by storing only the elements in the diagonal band because the elements outside the band are 0 by construction. Using this information, we can efficiently store this matrix without any loss of information as

$$
\begin{bmatrix}
0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{bmatrix}
$$

The above matrix only contains the elements of the diagonals with nonzero elements. To store the elements in a rectangular array, we added zeros as necessary. The row dimension of the banded matrix is the upper bandwidth plus the lower bandwidth plus 1, or $b = b_U + b_L + 1$. We will use the $b \times n$ shorthand to refer to the dimensions of banded matrices.
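As an illustration of this storage scheme, the Mata sketch below converts a general banded matrix to its b × n form; the indexing rule is inferred from the example above and is not necessarily the layout spmat uses internally:

    mata:
        // store element W[i,j] of the band at B[bU+1+i-j, j], padding
        // with zeros outside the band, exactly as in the example above
        real matrix toband(real matrix W, real scalar bU, real scalar bL)
        {
            real scalar n, i, j
            real matrix B
            n = cols(W)
            B = J(bU + bL + 1, n, 0)
            for (i = 1; i <= n; i++) {
                for (j = max((1, i - bL)); j <= min((n, i + bU)); j++) {
                    B[bU + 1 + i - j, j] = W[i, j]
                }
            }
            return(B)
        }
    end

For the 10 × 10 example above, toband(W, 1, 2) reproduces the 4 × 10 array shown.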

Banded matrices require less storage space than general matrices. The spmat suite provides tools for creating, storing, and manipulating banded matrices. In addition, computing an operation on a banded matrix is much faster than on a general matrix.

Drukker et al. (2011) show that many spatial-weighting matrices have a banded structure after an appropriate reordering. In particular, a banded structure is often attained by sorting the data in an ascending order of the distance from a well-chosen place. In section 5, we will illustrate this method with data on U.S. counties and U.S. five-digit zip codes. In the case of U.S. five-digit zip codes, we show how to create a contiguity matrix with upper and lower bandwidths of 913. This allows us to store the data in a 1,827 × 31,713 matrix, which requires only 1,827 × 31,713 × 8/2^30 ≈ 0.43 gigabytes instead of the 7.5 gigabytes required for the general matrix.

We are now ready to describe the spmat subcommands.


2 Creating a contiguity matrix from geospatial data

2.1 Syntax

spmat contiguity objname [if] [in] using coordinates_file, id(varname) [options]

options                        Description
---------------------------------------------------------------------------
normalize(norm)                control the normalization method
rook                           require that two units share a common border
                                 instead of just a common point to be
                                 neighbors
banded                         store the matrix in the banded format
replace                        replace existing spmat object
saving(filename[, replace])    save the neighbor list to a file
nomatrix                       suppress creation of spmat object
tolerance(#)                   numerical tolerance used when determining
                                 whether units share a common border
---------------------------------------------------------------------------

2.2 Description

spmat contiguity computes a contiguity or normalized-contiguity matrix from a coordinates dataset containing a polygon representation of geospatial data. More precisely, spmat contiguity constructs a contiguity matrix or normalized-contiguity matrix from the boundary information in a coordinates dataset and puts it into the new spmat object objname.

In a contiguity matrix, contiguous units are assigned weights of 1, and noncontiguous units are assigned weights of 0. Contiguous units are known as neighbors.

spmat contiguity uses the polygon data in coordinates_file to determine the neighbors of each unit. The coordinates file must be a Stata dataset containing the polygon information in the format produced by shp2dta and mif2dta. Crow and Gould (2007) discuss the conventions used to represent polygons in the Stata datasets created by these commands.

2.3 Options

id(varname) specifies a numeric variable that contains a unique identifier for each observation. (shp2dta and mif2dta name this ID variable ID.) id() is required.

normalize(norm) specifies one of the three available normalization techniques: row, minmax, and spectral. In a row-normalized matrix, each element in row i is divided by the sum of row i's elements. In a minmax-normalized matrix, each element is divided by the minimum of the largest row sum and column sum of the matrix. In a spectral-normalized matrix, each element is divided by the modulus of the largest eigenvalue of the matrix. See section 2.5 for details.

rook specifies that only units that share a common border be considered neighbors (edge or rook contiguity). The default is queen contiguity, which treats units that share a common border or a single common point as neighbors. Computing rook-contiguity matrices is more computationally intensive than the default queen-contiguity computation.7

banded requests that the new matrix be stored in a banded form. The banded matrix is constructed without creating the underlying n × n representation.

replace permits spmat contiguity to overwrite an existing spmat object.

saving(filename[, replace]) saves the neighbor list to a space-delimited text file. The first line of the file contains the number of units and, if applicable, bands; each remaining line lists a unit identification code followed by the identification codes of units that share a common border, if any. You can read the file back into an spmat object with spmat import ..., nlist. replace allows filename to be overwritten if it already exists.

nomatrix specifies that the spmat object objname and spatial-weighting matrix W not be created. In conjunction with saving(), this option allows for creating a text file containing a neighbor list without allocating space for the underlying contiguity matrix.

tolerance(#) specifies the numerical tolerance used in deciding whether two units are edge neighbors. The default is tolerance(1e-7).

2.4 Examples

As discussed above, spatial-weighting matrices are used to compute weighted averages in which more weight is placed on nearby observations than on distant observations. While Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights matrices, contiguity and inverse-distance matrices are the two most common spatial-weighting matrices.

7. These definitions for rook neighbor and queen neighbor are commonly used; see, for example, Lai, So, and Chan (2009). (As many readers will recognize, the "rook" and "queen" terminology arises by analogy with chess, in which a rook may only move across sides of squares, whereas a queen may also move diagonally.)


In geospatial-type applications, researchers who want a contiguity matrix need to perform a series of complicated calculations on the boundary information in a coordinates dataset to identify the neighbors of each unit. spmat contiguity performs these calculations and stores the resulting weights matrix in an spmat object.

In contrast, some social-network datasets begin with a list of neighbors instead of the boundary information found in geospatial data. Section 15 discusses how to create a social-network matrix from a list of neighbors.

Example

We continue the example from section 1.1 and assume both of the Stata datasets created in section 1.1 are in the current working directory. After loading the attribute dataset into memory, we create the spmat object ccounty containing a normalized-contiguity matrix for U.S. counties by typing

. use county, clear

. spmat contiguity ccounty using countyxy, id(id) normalize(minmax)

We use spmat summarize, discussed in section 4, to summarize the contents of the spatial-weighting matrix in the ccounty object we created above:

. spmat summarize ccounty, links

Summary of spatial-weighting object ccounty

  Matrix        Description
  -----------------------------
  Dimensions    3109 x 3109
  Stored as     3109 x 3109

  Links
     total      18474
     min        1
     mean       5.942104
     max        14

The table shows basic information about the normalized contiguity matrix, including the dimensions of the matrix and its storage. The number of neighbors found is reported as 18,474, with each county having 6 neighbors on average.

2.5 Normalization details

In this section, we present details about the normalization methods.8 In each case, the normalized matrix $\widetilde{W} = (\widetilde{w}_{ij})$ is computed from the underlying matrix $W = (w_{ij})$, where the elements are assumed to be nonnegative; see, for example, Kelejian and Prucha (2010) for an introduction to the use and interpretation of these normalization methods.

8. The normalization methods are not restricted to contiguity matrices.


In a row-normalized matrix, the $(i, j)$th element of $\widetilde{W}$ becomes $\widetilde{w}_{ij} = w_{ij}/r_i$, where $r_i$ is the sum of the $i$th row of $W$. After row normalization, each row of $\widetilde{W}$ will sum to 1. Row normalizing a symmetric $W$ produces an asymmetric $\widetilde{W}$ except in very special cases. Kelejian and Prucha (2010) point out that normalizing by a vector of row sums needs to be guided by theory.

In a minmax-normalized matrix, the $(i, j)$th element of $\widetilde{W}$ becomes $\widetilde{w}_{ij} = w_{ij}/m$, where $m = \min\{\max_i(r_i), \max_i(c_i)\}$, with $\max_i(r_i)$ being the largest row sum of $W$ and $\max_i(c_i)$ being the largest column sum of $W$. Normalizing by a scalar preserves symmetry and the basic model specification.

In a spectral-normalized matrix, the $(i, j)$th element of $\widetilde{W}$ becomes $\widetilde{w}_{ij} = w_{ij}/v$, where $v$ is the largest of the moduli of the eigenvalues of $W$. As for the minmax norm, normalizing by a scalar preserves symmetry and the basic model specification.
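The three normalizations are simple to express in Mata. The sketch below is our own illustration of the definitions above, assuming a nonnegative matrix W with no zero rows is already in memory:

    mata:
        r = rowsum(W)
        Wrow  = W :/ r                             // row: each row of Wrow sums to 1
        m = min((max(rowsum(W)), max(colsum(W))))
        Wmm   = W / m                              // minmax: divide by the scalar m
        v = max(abs(eigenvalues(W)))               // largest modulus of the eigenvalues
        Wspec = W / v                              // spectral: divide by the scalar v
    end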

3 Creating an inverse-distance matrix from data

3.1 Syntax

spmat idistance objname cvarlist [if] [in], id(varname) [options]

where cvarlist is the list of coordinate variables.

options                         Description
---------------------------------------------------------------------------
dfunction(function[, miles])    specify the distance function
normalize(norm)                 specify the normalization method
truncmethod                     specify the truncation method
banded                          store the matrix in the banded format
replace                         replace an existing spmat object
---------------------------------------------------------------------------

where function is one of euclidean, rhaversine, dhaversine, or p; miles may only be specified with rhaversine or dhaversine; and truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(v).

3.2 Description

An inverse-distance spatial-weighting matrix is composed of weights that are inversely related to the distances between the units. spmat idistance uses the coordinate variables from the attribute data in memory and the specified distance measure to compute the distances between units, to create an inverse-distance spatial-weighting matrix, and to store the result in an spmat object.


3.3 Options

id(varname) specifies a numeric variable that contains a unique identifier for each observation. id() is required.

dfunction(function[, miles]) specifies the distance function. function may be one of euclidean (default), dhaversine, rhaversine, or the Minkowski distance of order p, where p is an integer greater than or equal to 1.

When the default dfunction(euclidean) is specified, a Euclidean distance measure is applied to the coordinate variable list cvarlist.

When dfunction(rhaversine) or dfunction(dhaversine) is specified, the haversine distance measure is applied to the two coordinate variables cvarlist. (The first coordinate variable must specify longitude, and the second coordinate variable must specify latitude.) The coordinates must be in radians when rhaversine is specified. The coordinates must be in degrees when dhaversine is specified. The haversine distance measure is calculated in kilometers by default. Specify dfunction(rhaversine, miles) or dfunction(dhaversine, miles) if you want the distance returned in miles.

When dfunction(p) (p is an integer) is specified, a Minkowski distance measure of order p is applied to the coordinate variable list cvarlist.

The formulas for the distance measures are discussed in section 3.5.

normalize(norm) specifies one of the three available normalization techniques: row, minmax, and spectral. In a row-normalized matrix, each element in row i is divided by the sum of row i's elements. In a minmax-normalized matrix, each element is divided by the minimum of the largest row sum and column sum of the matrix. In a spectral-normalized matrix, each element is divided by the modulus of the largest eigenvalue of the matrix. See section 2.5 for details.

truncmethod options specify one of the three truncation criteria. The values of the spatial-weighting matrix W that meet the truncation criterion will be changed to 0. Only apply truncation methods when supported by theory.

btruncate(b B) partitions the values of W into B equal-length bins and truncates to 0 entries that fall into bin b or below, b < B.

dtruncate(dL dU) truncates to 0 the values of W that fall more than dL diagonals below and dU diagonals above the main diagonal. Neither value can be greater than ⌊(cols(W)−1)/4⌋.9

vtruncate(v) truncates to 0 the values of W that are less than or equal to v.

See section 3.6 for more details about the truncation options.

9. This limit ensures that a cross product of the spatial-weighting matrix is stored more efficiently in banded form than in general form. The limit is based on the cross product instead of the matrix itself because the generalized spatial two-stage least-squares estimators use cross products of the spatial-weighting matrices.


banded requests that the new matrix be stored in a banded form. The banded matrix is constructed without creating the underlying n × n representation. Note that without banded, a matrix with truncated values will still be stored in an n × n form.

replace permits spmat idistance to overwrite an existing spmat object.

3.4 Examples

As discussed above, spatial-weighting matrices are used to compute weighted averages in which more weight is placed on nearby observations than on distant observations. Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights matrices, contiguity matrices, inverse-distance matrices, and combinations thereof.

In inverse-distance spatial-weighting matrices, the weights are inversely related to the distances between the units. spmat idistance provides several measures for calculating the distances between the units.

The coordinates may or may not be geospatial. Distances between geospatial units are commonly computed from the latitudes and longitudes of unit centroids.10 Social distances are frequently computed from individual-person attributes.

In much of the literature, the attributes are known as coordinates because the nomenclature has developed around the common geospatial case in which the attributes are map coordinates. For ease of use, spmat idistance follows this convention and refers to coordinates, even though coordinate variables specified in cvarlist need not be spatial coordinates.

The $(i, j)$th element of an inverse-distance spatial-weighting matrix is $1/d_{ij}$, where $d_{ij}$ is the distance between units $i$ and $j$ computed from the specified coordinates and distance measure. Creating spatial-weighting matrices with elements of the form $1/f(d_{ij})$, where $f(\cdot)$ is some function, is described in section 17.4.
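As a rough illustration of this construction (our own sketch, not spmat's internal code), the following Mata function builds an inverse-distance matrix from an n × q coordinate matrix by using Euclidean distance:

    mata:
        real matrix idistmat(real matrix C)
        {
            real scalar n, i, j
            real matrix W
            n = rows(C)
            W = J(n, n, 0)                      // wii = 0 by construction
            for (i = 1; i <= n; i++) {
                for (j = 1; j <= n; j++) {
                    if (i != j) {
                        W[i, j] = 1/sqrt(sum((C[i, .] - C[j, .]):^2))
                    }
                }
            }
            return(W)
        }
    end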

Example

county.dta from section 1.1 contains the coordinates of the centroids of each county, measured in degrees, in the variables longitude and latitude. To get a feel for the data, we create an unnormalized inverse-distance spatial-weighting matrix, store it in the spmat object dcounty, and summarize it by typing

10. The word "centroid" in the literature on geographic information systems differs from the standard term in geometry. In the geographic information systems literature, a centroid is a weighted average of the vertices of a polygon that approximates the center of the polygon; see Waller and Gotway (2004, 44–45) for the formula and some discussion.


. spmat idistance dcounty longitude latitude, id(id) dfunction(dhaversine)

. spmat summarize dcounty

Summary of spatial-weighting object dcounty

  Matrix        Description
  -----------------------------
  Dimensions    3109 x 3109
  Stored as     3109 x 3109

  Values
     min        0
     min>0      .0002185
     mean       .0012296
     max        1.081453

From the summary table, we can see that the centroids of the two closest counties lie within less than one kilometer of each other (1/1.081453), while the two most distant counties are 4,577 kilometers apart (1/0.0002185).

Below we compute a minmax-normalized inverse-distance matrix, store it in the spmat object dcounty2, and summarize it by typing

. spmat idistance dcounty2 longitude latitude, id(id) dfunction(dhaversine)
>     normalize(minmax)

. spmat summarize dcounty2

Summary of spatial-weighting object dcounty2

  Matrix        Description
  -----------------------------
  Dimensions    3109 x 3109
  Stored as     3109 x 3109

  Values
     min        0
     min>0      .0000382
     mean       .0002151
     max        .1892189

3.5 Distance calculation details

Specifying $q$ variables in the list of coordinate variables cvarlist implies that the units are located in a $q$-dimensional space. This space may or may not be geospatial. Let the $q$ variables in the list of coordinate variables cvarlist be $x_1, x_2, \ldots, x_q$, and denote the coordinates of observation $i$ by $(x_1[i], x_2[i], \ldots, x_q[i])$.

The default behavior of spmat idistance is to calculate the Euclidean distance between units $s$ and $t$, which is given by

$$ d_{st} = \sqrt{\sum_{j=1}^{q} \bigl( x_j[s] - x_j[t] \bigr)^2} $$

for observations $s$ and $t$.


The Minkowski distance of order $p$ is given by

$$ d_{st} = \left( \sum_{j=1}^{q} \bigl| x_j[s] - x_j[t] \bigr|^{p} \right)^{1/p} $$

for observations $s$ and $t$. When $p = 2$, the Minkowski distance is equivalent to the Euclidean distance.

The haversine distance measure is useful when the units are located on the surface of the earth and the coordinate variables represent the geographical coordinates of the spatial units. In such cases, we usually wish to calculate a spherical (great-circle) distance between the spatial units. This is accomplished by the haversine formula given by

$$ d_{st} = r \times c $$

where

$r$ is the mean radius of the Earth (6,371.009 km or 3,958.761 miles);

$c = 2 \arcsin\{\min(1, \sqrt{a})\}$;

$a = \sin^2(\varphi) + \cos(\varphi_1)\cos(\varphi_2)\sin^2(\lambda)$;

$\varphi = \frac{1}{2}(\varphi_2 - \varphi_1) = \frac{1}{2}\bigl(x_2[t] - x_2[s]\bigr)$;

$\lambda = \frac{1}{2}(\lambda_2 - \lambda_1) = \frac{1}{2}\bigl(x_1[t] - x_1[s]\bigr)$;

$x_1[s]$ and $x_1[t]$ are the longitudes of point $s$ and point $t$, respectively; and

$x_2[s]$ and $x_2[t]$ are the latitudes of point $s$ and point $t$, respectively.

Specify dfunction(dhaversine) to compute haversine distances from coordinates in degrees, and specify dfunction(rhaversine) to compute haversine distances from coordinates in radians. Both dfunction(dhaversine) and dfunction(rhaversine) by default use r = 6,371.009 to compute results in kilometers. To compute haversine distances in miles, with r = 3,958.761, instead specify dfunction(dhaversine, miles) or dfunction(rhaversine, miles).
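For reference, here is the formula written out as a Mata function; this is a sketch with a hypothetical name, haversine_km, assuming coordinates in degrees and returning kilometers, which matches dfunction(dhaversine):

    mata:
        real scalar haversine_km(real scalar lon1, real scalar lat1,
                                 real scalar lon2, real scalar lat2)
        {
            real scalar d, p1, p2, phi, lam, a
            d   = pi()/180                  // degrees to radians
            p1  = lat1*d
            p2  = lat2*d
            phi = (p2 - p1)/2
            lam = (lon2 - lon1)*d/2
            a   = sin(phi)^2 + cos(p1)*cos(p2)*sin(lam)^2
            return(6371.009*2*asin(min((1, sqrt(a)))))
        }
    end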


3.6 Truncation details

Unlike contiguity matrices, inverse-distance matrices cannot naturally yield a banded structure because the off-diagonal elements are never exactly 0. Consider an example in which we have nine units arranged on the real line with x denoting the unit locations.

. use truncex, clear

. list

       id      x
  1.    1      0
  2.    2      1
  3.    3      2
  4.    4    503
  5.    5    504
  6.    6    505
  7.    7   1006
  8.    8   1007
  9.    9   1008

The units are grouped into three clusters. The units belonging to the same cluster are close to one another, while the distance between the units belonging to different clusters is large. For real-world data, the units may represent, for example, cities in different states. We use spmat idistance to create the spmat object ex from the data:

. spmat idistance ex x, id(id)

The resulting spatial-weighting matrix of inverse distances is

$$
\begin{bmatrix}
0 & 1 & 0.5 & 0.00199 & 0.00198 & 0.00198 & 0.00099 & 0.00099 & 0.00099 \\
1 & 0 & 1 & 0.00199 & 0.00199 & 0.00198 & 0.001 & 0.00099 & 0.00099 \\
0.5 & 1 & 0 & 0.002 & 0.00199 & 0.00199 & 0.001 & 0.001 & 0.00099 \\
0.00199 & 0.00199 & 0.002 & 0 & 1 & 0.5 & 0.00199 & 0.00198 & 0.00198 \\
0.00198 & 0.00199 & 0.00199 & 1 & 0 & 1 & 0.00199 & 0.00199 & 0.00198 \\
0.00198 & 0.00198 & 0.00199 & 0.5 & 1 & 0 & 0.002 & 0.00199 & 0.00199 \\
0.00099 & 0.001 & 0.001 & 0.00199 & 0.00199 & 0.002 & 0 & 1 & 0.5 \\
0.00099 & 0.00099 & 0.001 & 0.00198 & 0.00199 & 0.00199 & 1 & 0 & 1 \\
0.00099 & 0.00099 & 0.00099 & 0.00198 & 0.00198 & 0.00199 & 0.5 & 1 & 0
\end{bmatrix}
$$


Theoretical considerations may suggest that the weights should actually be 0 below a certain threshold. For example, choosing the threshold value of 1/500 = 0.002 for our matrix results in the following structure:

$$
\begin{bmatrix}
0 & 1 & 0.5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0.5 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0.5 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0.5 & 1 & 0
\end{bmatrix}
$$

Now the matrix with the truncated values can be stored more efficiently in a banded form:

$$
\begin{bmatrix}
0 & 0 & 0.5 & 0 & 0 & 0.5 & 0 & 0 & 0.5 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\
0.5 & 0 & 0 & 0.5 & 0 & 0 & 0.5 & 0 & 0
\end{bmatrix}
$$

spmat idistance provides tools for truncating the values of an inverse-distance matrix and storing the truncated matrix in a banded form. Like spmat contiguity, spmat idistance is capable of creating a banded matrix without creating the underlying n × n representation of the matrix. The user must specify a theoretically justified truncation criterion for such an application.

Here we illustrate how one could apply each of the truncation methods mentioned in section 3.3 to our hypothetical inverse-distance matrix. The most natural way is to use value truncation. In the code below, we create a new spmat object ex1 with the values of W that are less than or equal to 1/500 set to 0.11 We also request that W be stored in banded form.

. spmat idistance ex1 x, id(id) banded vtruncate(1/500)

The same outcome can be achieved with bin truncation. In bin truncation, we find the maximum value in W, denoted by m, divide the interval (0, m] into B bins of equal length, and then truncate to 0 elements that fall into bins 1, . . . , b; see Bin truncation details below for a more technical description. In our hypothetical matrix, the largest element of W is 1. If we divide the values in W into three bins, the bins will be defined by (0, 1/3], (1/3, 2/3], and (2/3, 1]. The values we wish to round to 0 fall into the first bin.

11. vtruncate() accepts any expression that evaluates to a number.


In the code below, we create a new spmat object ex2 with the values of W that fall into the first bin set to 0. We also request that W be stored in banded form.

. spmat idistance ex2 x, id(id) banded btruncate(1 3)

Diagonal truncation is not based on value comparison; therefore, in general, we will not be able to replicate exactly the results obtained with bin or value truncation. In the code below, we create a new spmat object ex3 with the values of W that fall more than two diagonals below and above the main diagonal set to 0. We also request that W be stored in banded form.

. spmat idistance ex3 x, id(id) banded dtruncate(2 2)

The resulting matrix based on diagonal truncation is shown below. No values in W have been changed; instead, we copied the requested elements from W and stored them in banded form, padding the banded format with 0s when necessary (see section 1.2).

$$
\begin{bmatrix}
0 & 0 & 0.5 & 0.00199 & 0.00199 & 0.5 & 0.00199 & 0.00199 & 0.5 \\
0 & 1 & 1 & 0.002 & 1 & 1 & 0.002 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0.002 & 1 & 1 & 0.002 & 1 & 1 & 0 \\
0.5 & 0.00199 & 0.00199 & 0.5 & 0.00199 & 0.00199 & 0.5 & 0 & 0
\end{bmatrix}
$$

Diagonal truncation can be hard to justify on a theoretical basis. It can retain irrelevant neighbors, as in this example, or it can wipe out relevant ones. Its use should be limited to situations in which one has a good knowledge of the underlying structure of the spatial-weighting matrix. Bin or value truncation will generally be easier to apply.

A word of warning: while truncation leads to matrices that can be stored more efficiently, truncation should only be applied if supported by theory. Ad hoc truncation may lead to a misspecification of the model and subsequent inconsistent inference.

Bin truncation details

Formally, letting $m$ be the largest element in $W$, btruncate(b B) divides the interval $(0, m]$ into $B$ equal-length subintervals and sets elements in $W$ whose values fall in the $b$ smallest subintervals to 0. We partition the interval $(0, m]$ into $B$ intervals $(a_{kL}, a_{kU}]$, where $k = 1, \ldots, B$, $a_{kL} = (k-1)m/B$, and $a_{kU} = km/B$. We set $w_{ij} = 0$ if $w_{ij} \in (a_{kL}, a_{kU}]$ for $k \leq b$.
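In Mata, the rule reduces to a single comparison against the upper edge of bin b. A sketch, assuming the scalars b and B and a nonnegative matrix W are already in memory:

    mata:
        m = max(W)
        W = W :* (W :> b*m/B)    // zero out entries in bins 1, ..., b;
                                 // entries above b*m/B are kept unchanged
    end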


4 Summarizing an existing spatial-weighting matrix

4.1 Syntax

spmat summarize objname [, links detail {banded | truncmethod}]

where truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(v).

4.2 Description

spmat summarize reports summary statistics about the elements in the spatial-weighting matrix in the existing spmat object objname.

4.3 Options

links is useful when objname contains a contiguity or a normalized-contiguity matrix. Rather than the default summary of the values in the spatial-weighting matrix, links causes spmat summarize to summarize the number of neighbors.

detail requests a tabulation of links for a contiguity or a normalized-contiguity matrix. The values of the identifying variable with the minimum and maximum number of links will be displayed.

banded reports the bands for the matrix that already has a (possibly) banded structure but is stored in an n × n form.

truncmethods are useful when you want to see summary statistics calculated on a spatial-weighting matrix after some elements have been truncated to 0. spmat summarize with a truncmethod will report the lower and upper band based on a matrix to which the specified truncation criterion has been applied. (Note: No data are actually changed by selecting these options. These options only specify that spmat summarize calculate results as if the requested truncation criterion has been applied.)

btruncate(b B) partitions the values of W into B bins and truncates to 0 entries that fall into bin b or below.

dtruncate(dL dU) truncates to 0 the values of W that fall more than dL diagonals below and dU diagonals above the main diagonal. Neither value can be greater than ⌊(cols(W)−1)/4⌋.

vtruncate(v) truncates to 0 the values of W that are less than or equal to v.


4.4 Saved results

Only spmat summarize returns saved results.

Let Wc be a contiguity or normalized-contiguity matrix and Wd be an inverse-distance matrix.

spmat summarize saves the following results in r():

Scalars
  r(b)         number of rows in W
  r(n)         number of columns in W
  r(lband)     lower band, if W is banded
  r(uband)     upper band, if W is banded
  r(min)       minimum of Wd
  r(min0)      minimum element > 0 in Wd
  r(mean)      mean of Wd
  r(max)       maximum of Wd
  r(lmin)      minimum number of neighbors in Wc
  r(lmean)     mean number of neighbors in Wc
  r(lmax)      maximum number of neighbors in Wc
  r(ltotal)    total number of neighbors in Wc
  r(eig)       1 if object contains eigenvalues, 0 otherwise
  r(canband)   1 if object can be banded based on r(lband) and r(uband),
                 0 otherwise
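As a usage sketch (our own illustration, assuming the scalars are filled as documented above), the r() results can drive the banding decision in a do-file:

    . spmat summarize dcounty, vtruncate(1/250)
      (output omitted)
    . display "canband = " r(canband) ", bands = (" r(lband) ", " r(uband) ")"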

4.5 Examples

It is generally useful to know some summary statistics for the elements of your spatial-weighting matrices. In sections 2.4 and 3.4, we used spmat summarize to report summary statistics for spatial-weighting matrices.

Many spatial-weighting matrices contain many elements that are not 0 but are very small. At times, theoretical considerations such as threshold effects suggest that these small weights should be truncated to 0. In these cases, you might want to summarize the elements of the spatial-weighting matrix subject to different truncation criteria as part of some sensitivity analysis.

Example

In section 3.4, we stored an unnormalized inverse-distance spatial-weighting matrix in the spmat object dcounty. In this example, we find the summary statistics of the elements of a truncated version of this matrix.

For each county, we set to 0 the weights for counties whose centroids are farther than 250 km from the centroid of that county. Because we are operating on inverse distances, we specify 1/250 as the truncation criterion. The summary of the matrix calculated after applying the truncation criterion is reported in the Truncated matrix column. We can see that now the minimum nonzero value is reported as 0.004.


. spmat summarize dcounty, vtruncate(1/250)

Summary of spatial-weighting object dcounty

                Current matrix    Truncated matrix
  --------------------------------------------------
  Dimensions    3109 x 3109       3109 x 3109
  Bands                           (3098, 3098)

  Values
     min        0                 0
     min>0      .0002185          .004
     mean       .0012296          .0002583
     max        1.081453          1.081453

The Bands row reports the lower and upper bands with nonzero values. Those values tell us whether the matrix can be stored in banded form. As mentioned in section 3.3, neither value can be greater than ⌊(cols(W)−1)/4⌋. In our case, the maximum value for the bands is ⌊(3,109 − 1)/4⌋ = 777; therefore, if we truncated the values of the matrix according to our criterion, we would not be able to store the matrix in banded form.12

In section 5.1, we show how we can use the sorting tricks of Drukker et al. (2011) to store this matrix in banded form.

5 Examples of banded matrices

Thus far we have claimed that banded matrices are useful when handling spatial-weighting matrices, but we have not yet substantiated this point. To illustrate the usefulness of storing spatial-weighting matrices in a banded-matrix form, we revisit the U.S. counties data and introduce U.S. five-digit zip code data.

12. In practice, rather than calculating the maximum value for bands by hand, we would use the r(canband), r(lband), and r(uband) scalars returned by spmat summarize; see section 4.4 for details.


5.1 U.S. county data revisited

Recall that at the moment, we have the neighbor information stored in the spmat object ccounty. We use spmat graph, discussed in section 7, to produce an intensity plot of the n × n normalized contiguity matrix contained in the object ccounty by typing

. spmat graph ccounty, blocks(10)

Figure 2 shows that zero and nonzero entries are scattered all over the matrix. This pattern arises because the original shapefiles had the counties sorted in an order unrelated to their distances from a common point.

[Intensity plot omitted; the axes index block rows and columns, 1–311.]

Figure 2. Normalized contiguity matrix for unsorted U.S. counties

We can store the normalized contiguity matrix more efficiently if we generate a variable containing the distance from a particular place to all the other places and then sort the data in an ascending order according to this variable.13 We implement this method in four steps: 1) we sort the county data by the longitude and latitude of the county centroids contained in longitude and latitude, respectively, so that the first observation will be the corner observation of Curry County, OR; 2) we calculate the distance of each county from the corner county in the first observation; 3) we sort on the variable containing the distances calculated in step 2; and 4) we recompute and summarize the normalized contiguity matrix.

13. For best results, pick a place located in a remote corner of the map; see Drukker et al. (2011) for further details.


. sort longitude latitude

. generate double dist =
>     sqrt( (longitude-longitude[1])^2 + (latitude-latitude[1])^2 )

. sort dist

. spmat contiguity ccounty2 using countyxy, id(id) normalize(minmax) banded
>     replace

. spmat summ ccounty2, links

Summary of spatial-weighting object ccounty2

  Matrix        Description
  -----------------------------
  Dimensions    3109 x 3109
  Stored as     465 x 3109

  Links
     total      18474
     min        1
     mean       5.942104
     max        14

. spmat graph ccounty2, blocks(10)

Specifying the option banded in the spmat contiguity command caused the contiguity matrix to be stored as a banded matrix. The summary table shows that the contiguity information is now stored in a 465 × 3,109 matrix, which requires much less space than the original 3,109 × 3,109 matrix. Figure 3 clearly shows the banded structure.

[Intensity plot omitted; the axes index block rows and columns, 1–311.]

Figure 3. Normalized contiguity matrix for sorted U.S. counties


Similarly, we can re-create the dcounty object calculated on the sorted data and see whether the inverse-distance matrix can be stored in banded form after applying a truncation criterion.

. spmat idistance dcounty longitude latitude, id(id) dfunction(dhaversine)
>     vtruncate(1/250) banded replace

. spmat summ dcounty

Summary of spatial-weighting object dcounty

  Matrix        Description
  -----------------------------
  Dimensions    3109 x 3109
  Stored as     769 x 3109

  Values
     min        0
     min>0      .004
     mean       .0002583
     max        1.081453

We can see that the Values summary for this matrix and the matrix from section 4.5 is the same; however, the matrix in this example is stored in banded form.

5.2 U.S. zip code data

The real power of banded storage is unveiled when we lack the memory to store spatial data in an n × n matrix. We use the five-digit zip code level data for the continental United States.14 We have information on 31,713 five-digit zip codes, and as was mentioned in section 1.2, we need 7.5 gigabytes of memory to store the normalized contiguity matrix as a general matrix.

14. Data are from the U.S. Census Bureau at ftp://ftp2.census.gov/geo/tiger/TIGER2008/.


Instead, we repeat the sorting trick and call spmat contiguity with option banded, hoping that we will be able to fit the banded representation into memory.

. use zip5, clear

. *keep continental US zip codes

. drop if latitude > 49.5 | latitude < 24.5 | longitude < -124
(524 observations deleted)

. sort longitude latitude

. generate double dist =
>     sqrt( (longitude-longitude[1])^2 + (latitude-latitude[1])^2 )

. sort dist

. spmat contiguity zip5 using zip5xy, id(id) normalize(minmax) banded
warning: spatial-weighting matrix contains 131 islands

. spmat summarize zip5, links

Summary of spatial-weighting object zip5

  Matrix        Description
  -------------------------------
  Dimensions    31713 x 31713
  Stored as     1827 x 31713

  Links
     total      166906
     min        0
     mean       5.263015
     max        26

warning: spatial-weighting matrix contains 131 islands

The output from spmat summarize indicates that the normalized contiguity matrix is stored in a 1,827 × 31,713 matrix. This fits into less than half a gigabyte of memory! All we did to store the matrix in a banded format was change the sort order of the data and specify the banded option. We discuss storing an existing n × n spatial-weighting matrix in banded form in sections 18.1 and 18.2.

Having illustrated the importance of banded matrices, we return to documenting the spmat commands.

6 Inserting documentation into your spmat objects

6.1 Syntax

spmat note objname [ { : "text" [, replace] | , drop } ]

6.2 Description

spmat note creates and manipulates a note attached to the spmat object.


6.3 Options

replace causes spmat note to overwrite the existing note with a new one.

drop causes spmat note to clear the note associated with objname.

6.4 Examples

If you plan to use a spatial-weighting matrix outside a given do-file or session, you should attach some documentation to the spmat object.

spmat note stores the note in a string scalar; however, it is possible to store multiple notes in the scalar by repeatedly appending notes.

Example

We attach a note to the spmat object ccounty and then display it by typing

. spmat note ccounty : "Source: Tiger 2008 county files."

. spmat note ccounty
Source: Tiger 2008 county files.

As mentioned, we can have multiple notes:

. spmat note ccounty : "Created on 18jan2011."

. spmat note ccounty
Source: Tiger 2008 county files. Created on 18jan2011.

7 Plotting the elements of a spatial-weighting matrix

7.1 Syntax

spmat graph objname [, blocks([(stat)] p) twoway_options]

7.2 Description

spmat graph produces an intensity plot of the spatial-weighting matrix contained in the spmat object objname. Zero elements are plotted in white; the remaining elements are partitioned into bins of equal length and assigned gray-scale colors gs0–gs15 (see [G-4] colorstyle), with darker colors representing higher values.

7.3 Options

blocks([(stat)] p) specifies that the matrix be divided into blocks of size p and that block maximums be plotted. This option is useful when the matrix is large. To plot a statistic other than the default maximum, you can specify the optional stat argument. For example, to plot block medians, type blocks((p50) p). The supported statistics include those returned by summarize, detail; see [R] summarize for a complete list.

twoway_options are any options other than by(); they are documented in [G-3] twoway_options.

7.4 Examples

An intensity plot of a spatial-weighting matrix can reveal underlying structure. For example, if there is a banded structure to the spatial-weighting matrix, large amounts of memory may be saved.

See section 5.1 for an example in which we use spmat graph to reveal the banded structure in a spatial-weighting matrix.

8 Computing spatial lags

8.1 Syntax

spmat lag [type] newvar objname varname

8.2 Description

spmat lag uses a spatial-weighting matrix to compute the weighted average of a variable over neighboring units, known as the spatial lag of the variable.

More precisely, spmat lag uses the spatial-weighting matrix in the spmat object objname to compute the spatial lag of the variable varname and stores the result in the new variable newvar.

8.3 Examples

Spatial lags of the exogenous right-hand-side variables are frequently included in SAR models; see, for example, LeSage and Pace (2009).

Recall that a spatial lag is a weighted average of the variable being lagged. If x_spl denotes the spatial lag of the existing variable x, using the spatial-weighting matrix W, then the algebraic definition is x_spl = Wx.


The code below generates the new variable x_spl, which contains the spatial lag of x, using the spatial-weighting matrix W, which is contained in the spmat object ccounty:

. clear all

. use county

. spmat contiguity ccounty using countyxy, id(id) normalize(minmax)

. generate x = runiform()

. spmat lag x_spl ccounty x

We could now include both x and x_spl in our model.
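The result can also be checked against a direct Mata computation. The sketch below assumes that spmat getmatrix (one of the management subcommands listed in section 1) copies the weighting matrix into the Mata matrix W:

    . spmat getmatrix ccounty W
    . mata: st_store(., st_addvar("double", "x_chk"), W*st_data(., "x"))
    . assert reldif(x_spl, x_chk) < 1e-12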

9 Computing the eigenvalues of a spatial-weighting matrix

9.1 Syntax

spmat eigenvalues objname [, eigenvalues(vecname) replace]

9.2 Description

spmat eigenvalues calculates the eigenvalues of the spatial-weighting matrix contained in the spmat object objname and stores them in the object. The maximum-likelihood estimator implemented in the spreg ml command, as described in Drukker, Prucha, and Raciborski (2013b), uses the eigenvalues of the spatial-weighting matrix during the optimization process. If you are estimating several models by maximum likelihood with the same spatial-weighting matrix, computing and storing the eigenvalues in an spmat object will remove the need to recompute the eigenvalues.

9.3 Options

eigenvalues(vecname) stores the user-defined vector of eigenvalues in the spmat object objname. vecname must be a Mata row vector of length n, where n is the dimension of the spatial-weighting matrix in the spmat object objname.

replace permits spmat eigenvalues to overwrite existing eigenvalues in objname.

9.4 Examples

Putting the eigenvalues into the spmat object can dramatically speed up the computations performed by the spreg ml command; see Drukker, Prucha, and Raciborski (2013b) for details and references therein.


We can calculate the eigenvalues of the spatial-weighting matrix contained in the spmat object ccounty and store them in the same object by typing

. spmat eigenvalues ccounty

Calculating eigenvalues.... finished.
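If you have already computed the eigenvalues in Mata, the eigenvalues() option lets you store them in the object directly instead of having spmat recompute them. A sketch, assuming the matrix in ccounty is symmetric (a minmax-normalized contiguity matrix is, because the normalization divides every element by the same scalar):

. spmat getmatrix ccounty W
. mata: v = symeigenvalues(W)
. spmat eigenvalues ccounty, eigenvalues(v) replace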

10 Removing an spmat object from memory

10.1 Syntax

spmat drop objname

10.2 Description

spmat drop removes the spmat object objname from memory.

10.3 Examples

To drop the spmat object dcounty from memory, we type

. spmat drop dcounty
(note: spmat object dcounty not found)

11 Saving an spmat object to disk

11.1 Syntax

spmat save objname using filename [, replace]

11.2 Description

spmat save saves the spmat object objname to a file in a native Stata format.

11.3 Option

replace permits spmat save to overwrite filename.


11.4 Examples

Creating a spatial-weighting matrix, and perhaps its eigenvalues as well, can be a time-consuming process. If you are going to repeatedly use a spatial-weighting matrix, you probably want to save it to disk and read it back in for subsequent uses. spmat save will save the spmat object to disk for you. Section 12 discusses spmat use, which reads the object from disk into memory.

If you are going to save an spmat object to disk, it is a good practice to use spmat note to attach some documentation to the object before saving it. Section 6 discusses spmat note.

Just as with Stata datasets, you can save your spmat objects to disk and share them with other Stata users. The file format is platform independent. So, for example, a Mac user could save an spmat object to disk and email it to a coauthor, and the Windows-using coauthor could read in this spmat object by using spmat use.

We can save the information contained in the spmat object ccounty in the file ccounty.spmat by typing

. spmat save ccounty using ccounty.spmat

12 Reading spmat objects from disk

12.1 Syntax

spmat use objname using filename [, replace]

12.2 Description

spmat use reads into memory an spmat object from a file created by spmat save; see section 11 for a discussion of spmat save.

12.3 Option

replace permits spmat use to overwrite an existing spmat object.

12.4 Examples

As mentioned in section 11, creating a spatial-weighting matrix can be time consuming. When repeatedly using a spatial-weighting matrix, you might want to save it to disk with spmat save and read it back in with spmat use for subsequent uses.


In section 11, we saved the spmat object ccounty to the file ccounty.spmat. We now drop the existing ccounty object from memory and read it back in with spmat use:

. spmat drop ccounty

. spmat use ccounty using ccounty.spmat

. spmat note ccounty
Source: Tiger 2008 county files. Created on 18jan2011.

13 Writing a spatial-weighting matrix to a text file

13.1 Syntax

spmat export objname using filename [, noid nlist replace]

13.2 Description

spmat export saves the spatial-weighting matrix contained in the spmat object objname to a space-delimited text file. The matrix is written in a rectangular format with unique place identifiers saved in the first column. spmat export can also save lists of neighbors to a text file.

13.3 Options

noid causes spmat export not to save unique place identifiers, only matrix entries.

nlist causes spmat export to write the matrix in the neighbor-list format described in section 2.3.

replace permits spmat export to overwrite filename.

13.4 Examples

The main use of spmat export is to export a spatial-weighting matrix to a text file that can be read by another program. Long (2009, 336) recommends exporting all data to text files that will be read by future software as part of archiving one's research work.

Another use of spmat export is to review neighbor lists from a contiguity matrix. Here we illustrate how one can export the contiguity matrix in the neighbor-list format described in section 2.3.

. spmat export ccounty using nlist.txt, nlist


We call the Unix command head to list the first 10 lines of nlist.txt:15

. !head nlist.txt

3109
1 1054 1657 2063 2165 2189 2920 2958
2 112 2250 2277 2292 2362 2416 3156
3 2294 2471 2575 2817 2919 2984
4 8 379 1920 2024 2258 2301
5 6 73 1059 1698 2256 2886 2896
6 5 1698 2256 2795 2886 2896 3098
7 517 1924 2031 2190 2472 2575
8 4 379 1832 2178 2258 2987
9 413 436 1014 1320 2029 2166

The first line of the file indicates that there are 3,109 total spatial units. The second line indicates that the unit with identification code 1 is a neighbor of units with identification codes 1054, 1657, 2063, 2165, 2189, 2920, and 2958. The interpretation of the remaining lines is analogous to that for the second line.

14 Getting a spatial-weighting matrix from an spmat object

14.1 Syntax

spmat getmatrix objname [matname] [, id(vecname) eig(vecname)]

14.2 Description

spmat getmatrix copies the spatial-weighting matrix contained in the spmat object objname and stores it in the Mata matrix matname; see [M-0] intro for an introduction to using Mata. If specified, the vector of unique identifiers and the eigenvalues of the spatial-weighting matrix will be stored in Mata vectors.

14.3 Options

id(vecname) specifies the name of a Mata vector to contain IDs.

eig(vecname) specifies the name of a Mata vector to contain eigenvalues.

15. Users of other operating systems should open the file in a text editor.


14.4 Examples

If you want to make changes to an existing spatial-weighting matrix, you need to retrieve it from the spmat object, store it in Mata, make the desired changes, and store the new matrix back in the spmat object by using spmat putmatrix. (See section 17 for a discussion of spmat putmatrix.)

spmat getmatrix performs the first two tasks: it makes a copy of the spatial-weighting matrix from the spmat object and stores it in Mata.

As we discussed in section 3, spmat idistance creates a spatial-weighting matrix of the form 1/d_ij, where d_ij is the distance between units i and j. In section 17.4, we use spmat getmatrix in an example in which we change a spatial-weighting matrix to the form 1/exp(0.1 × d_ij) instead of just 1/d_ij.

15 Importing spatial-weighting matrices

15.1 Syntax

spmat import objname using filename [, noid nlist geoda idistance normalize(norm) replace]

15.2 Description

spmat import imports a spatial-weighting matrix from a space-delimited text file and stores it in a new spmat object.

15.3 Options

noid specifies that the first column of numbers in filename does not contain unique place identifiers and that spmat import should create and use the identifiers 1, . . . , n.

nlist specifies that the text file to be imported contain a list of neighbors in the format described in section 2.3.

geoda specifies that filename be in the .gwt or .gal format created by the GeoDa™ software.

idistance specifies that the file contains raw distances and that the raw distances should be converted to inverse distances. In other words, idistance specifies that the (i, j)th element in the file be d_ij and that the (i, j)th element in the spatial-weighting matrix be 1/d_ij, where d_ij is the distance between units i and j.

normalize(norm) specifies one of the three available normalization techniques: row, minmax, and spectral. In a row-normalized matrix, each element in row i is divided by the sum of row i's elements. In a minmax-normalized matrix, each element is divided by the minimum of the largest row sum and column sum of the matrix. In a spectral-normalized matrix, each element is divided by the modulus of the largest eigenvalue of the matrix. See section 2.5 for details.

replace permits spmat import to overwrite an existing spmat object.

15.4 Examples

One frequently needs to import a spatial-weighting matrix from a text file. spmat import supports three of the most common formats: simple text files, GeoDa™ text files, and text files that require minor changes such as converting from raw to inverse distances.

By default, the unique place-identifying variable is assumed to be stored in the first column of the file, but this can be overridden with the noid option.

In section 17.4, we provide an extended example that begins with using spmat import to import a spatial-weighting matrix.
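For instance, a space-delimited file that stores only matrix entries, with no identifier column, could be imported with the noid option. A sketch, in which wmat.txt and wobj are hypothetical names:

. spmat import wobj using wmat.txt, noid replace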

16 Obtaining a spatial-weighting matrix from a Stata dataset

16.1 Syntax

spmat dta objname varlist [if] [in] [, id(varname) idistance normalize(norm) replace]

16.2 Description

spmat dta imports a spatial-weighting matrix from the variables in a Stata dataset and stores it in an spmat object.

The number of variables in varlist must equal the number of observations because spatial-weighting matrices are n × n.

16.3 Options

id(varname) specifies that the unique place identifiers be contained in varname. The default is to create an identifying vector containing 1, . . . , n.

idistance specifies that the variables contain raw distances and that the raw distances be converted to inverse distances. In other words, idistance specifies that the ith observation on the jth variable be d_ij and that the (i, j)th element in the spatial-weighting matrix be 1/d_ij, where d_ij is the distance between units i and j.


normalize(norm) specifies one of the three available normalization techniques: row, minmax, and spectral. In a row-normalized matrix, each element in row i is divided by the sum of row i's elements. In a minmax-normalized matrix, each element is divided by the minimum of the largest row sum and column sum of the matrix. In a spectral-normalized matrix, each element is divided by the modulus of the largest eigenvalue of the matrix. See section 2.5 for details.

replace permits spmat dta to overwrite an existing spmat object.

16.4 Examples

People have created Stata datasets that contain spatial-weighting matrices. Given the power of infile and infix (see [D] infile (fixed format) and [D] infix (fixed format)), it is likely that more such datasets will be created. spmat dta imports these spatial-weighting matrices and stores them in an spmat object.

Here we illustrate how we can create an spmat object from a Stata dataset. The dataset schools.dta contains the distance in miles between five schools in the variables c1-c5. The unique school identifier is recorded in the variable id. In Stata, we type

. use schools, clear

. list

        id     c1     c2     c3     c4     c5

  1.   101      0    5.9   8.25   6.22   7.66
  2.   205    5.9      0   2.97   4.87   7.63
  3.   113   8.25   2.97      0   4.47      7
  4.   441   6.22   4.87   4.47      0   2.77
  5.   573   7.66   7.63      7   2.77      0

. spmat dta schools c*, id(id) idistance normalize(minmax)

17 Storing a Mata matrix in an spmat object

17.1 Syntax

spmat putmatrix objname [matname] [, id(varname | vecname) eig(vecname) idistance bands(l u) normalize(norm) replace]

17.2 Description

spmat putmatrix puts Mata matrices into an existing spmat object objname or into a new spmat object if the specified object does not exist. The optional unique place identifiers can be provided as a Mata vector or a Stata variable. The optional eigenvalues of the Mata matrix can be provided in a Mata vector.


17.3 Options

id(varname | vecname) specifies a Mata vector vecname or a Stata variable varname that contains unique place identifiers.

eig(vecname) specifies a Mata vector vecname that contains the eigenvalues of the matrix.

idistance specifies that the Mata matrix contains raw distances and that the raw distances be converted to inverse distances. In other words, idistance specifies that the (i, j)th element in the Mata matrix be d_ij and that the (i, j)th element in the spatial-weighting matrix be 1/d_ij, where d_ij is the distance between units i and j.

bands(l u) specifies that the Mata matrix matname be banded with l lower and u upper diagonals.

normalize(norm) specifies one of the three available normalization techniques: row, minmax, and spectral. In a row-normalized matrix, each element in row i is divided by the sum of row i's elements. In a minmax-normalized matrix, each element is divided by the minimum of the largest row sum and column sum of the matrix. In a spectral-normalized matrix, each element is divided by the modulus of the largest eigenvalue of the matrix. See section 2.5 for details.

replace permits spmat putmatrix to overwrite an existing spmat object.

17.4 Examples

spmat contiguity and spmat idistance create spatial-weighting matrices from raw data. This section describes situations in which we have the spatial-weighting matrix precomputed and simply want to put it in an spmat object. The spatial-weighting matrix can be any matrix that satisfies the conditions discussed, for example, in Kelejian and Prucha (2010).

In this section, we show how to create an spmat object from a text file by using spmat import and how to use spmat getmatrix and spmat putmatrix to generate an inverse-distance matrix according to a user-specified functional form.

The file schools.txt contains the distance in miles between five schools. We call the Unix command cat to print the contents of the file:

. !cat schools.txt

5
101 0 5.9 8.25 6.22 7.66
205 5.9 0 2.97 4.87 7.63
113 8.25 2.97 0 4.47 7
441 6.22 4.87 4.47 0 2.77
573 7.66 7.63 7 2.77 0


The school ID is recorded in the first column of the file, and column i records the distance from school i to all the other schools, including itself. We can use spmat import to create a spatial-weighting matrix from this file:

. spmat import schools using schools.txt, replace

The resulting spatial-weighting matrix is

$$\begin{bmatrix} 0 & 5.9 & 8.25 & 6.22 & 7.66 \\ 5.9 & 0 & 2.97 & 4.87 & 7.63 \\ 8.25 & 2.97 & 0 & 4.47 & 7.0 \\ 6.22 & 4.87 & 4.47 & 0 & 2.77 \\ 7.66 & 7.63 & 7.0 & 2.77 & 0 \end{bmatrix}$$

We now illustrate how to create a spatial-weighting matrix with the distance declining in an exponential fashion, exp(−0.1 d_ij), where d_ij is the original distance from school i to school j.

. spmat getmatrix schools x

. mata: x = exp(-.1:*x)

. mata: _diag(x,0)

. spmat putmatrix schools x, normalize(minmax) replace

Thus we read in the original distances, extract the distance matrix with spmat getmatrix, use Mata to transform the matrix entries according to our specifications, and reset the diagonal elements to 0. Finally, we use spmat putmatrix to put the transformed matrix into an spmat object. The resulting minmax-normalized spatial-weighting matrix is

$$\begin{bmatrix} 0 & 0.217 & 0.172 & 0.211 & 0.182 \\ 0.217 & 0 & 0.292 & 0.241 & 0.183 \\ 0.172 & 0.292 & 0 & 0.251 & 0.195 \\ 0.211 & 0.241 & 0.251 & 0 & 0.297 \\ 0.182 & 0.183 & 0.195 & 0.297 & 0 \end{bmatrix}$$

18 Converting general matrices into banded matrices

This section shows how to transform a spatial-weighting matrix stored as a general matrix in an spmat object into a banded format. If this topic is not of interest, you can skip this section.

The easy case is when the matrix already has a banded structure so that we can simply use spmat tobanded.

Now consider the more difficult case in which we have a spatial-weighting matrix stored in an spmat object and we would like to use the sorting method described in Drukker et al. (2011) to store this matrix in a banded format. This transformation requires 1) permuting the elements of the existing spatial-weighting matrix to correspond to a new row sort order and then 2) storing the spatial-weighting matrix in banded format. We accomplish step 1 by storing the new row sort order in a permutation vector, as explained below, and then by using spmat permute. We use spmat tobanded to perform step 2.

Note that most of the time, it is more convenient to sort the data as described in section 5.1 and to call spmat contiguity or spmat idistance with a truncation criterion. With very large datasets, spmat contiguity and spmat idistance will be the only choices because they are capable of creating banded matrices from data without first storing the matrices in a general form.

18.1 Permuting a spatial-weighting matrix stored in an spmat object

Syntax

spmat permute objname pvarname

Description

spmat permute permutes the rows and columns of the n × n spatial-weighting matrix stored in the spmat object objname. The permutation vector stored in pvarname contains a permutation of the integers {1, . . . , n}, where n is both the sample size and the dimension of W. That the value of the ith observation of pvarname is j specifies that we must move row j to row i in the permuted matrix. After moving all the rows as specified in pvarname, we move the columns in an analogous fashion. See Permutation details: Mathematics below for a more thorough explanation.

Examples

spmat permute is illustrated in the Examples section of section 18.2.

Permutation details: Mathematics

Let p be the permutation vector created from pvarname, and let W be the spatial-weighting matrix contained in the specified spmat object. The n × 1 permutation vector p contains a permutation of the integers {1, . . . , n}, where n is the dimension of W.

The permutation of W is obtained by reordering the rows and columns of W as specified by the elements of p. Each element of p specifies a row and column reordering of W. That element i of p is j (that is, p[i]=j) specifies that we must move row j to row i in the permuted matrix. After moving all the rows according to p, we move the columns analogously.


Here is an illustrative example. We have a matrix W, which is not banded:

. mata: W
[symmetric]
        1   2   3   4   5
  1     0
  2     1   0
  3     0   0   0
  4     0   1   0   0
  5     1   0   1   0   0

Suppose that we also have a permutation vector p that we could use to permute W to a banded matrix.

. mata: p
        1   2   3   4   5
  1     3   5   1   2   4

See Permutation details: An example below to see how we used the sorting trick of Drukker et al. (2011) to obtain this p. See Examples in section 18.2 for an example with real data.

The values in the permutation vector p specify how to permute (that is, reorder) the rows and the columns of W. Let's start with the rows. That 3 is element 1 of p specifies that row 3 of W be moved to row 1 in the permuted matrix. In other words, we must move row 3 to row 1.

Applying this logic to all the elements of p yields that we must reorder the rows of W by moving row 3 to row 1, row 5 to row 2, row 1 to row 3, row 2 to row 4, and row 4 to row 5. In the output below, we use Mata to perform this operation on W, store the result in A, and display A. If the Mata code is confusing, just check that A contains the described row reordering of W.

. mata: A = W[p,.]

. mata: A
        1   2   3   4   5
  1     0   0   0   0   1
  2     1   0   1   0   0
  3     0   1   0   0   1
  4     1   0   0   1   0
  5     0   1   0   0   0

Having reordered the rows, we reorder the columns in the analogous fashion. Operating on A, we move column 3 to column 1, column 5 to column 2, column 1 to column 3, column 2 to column 4, and column 4 to column 5. In the output below, we use Mata to perform this operation on A, store the result in B, and display B. If the Mata code is confusing, just check that B contains the reordering of A described above.


. mata: B = A[.,p]

. mata: B
[symmetric]
        1   2   3   4   5
  1     0
  2     1   0
  3     0   1   0
  4     0   0   1   0
  5     0   0   0   1   0

Note that B is the desired banded matrix. For Mata aficionados, typing W[p,p] would produce this permutation in one step.
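To confirm, we can compare the two-step result B with the one-step form; this uses the W, p, and B already in memory from the example above:

. mata: B == W[p,p]
  1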

For those whose intuition is grounded in linear algebra, here is the permutation-matrix explanation. The permutation vector p defines the permutation matrix E, where E is obtained by performing the row reordering described above on the identity matrix of dimension 5. Then the permuted form of W is given by E*W*E', as we illustrate below:

. mata: E = I(5)

. mata: E
[symmetric]
        1   2   3   4   5
  1     1
  2     0   1
  3     0   0   1
  4     0   0   0   1
  5     0   0   0   0   1

. mata: E = E[p,.]

. mata: E
        1   2   3   4   5
  1     0   0   1   0   0
  2     0   0   0   0   1
  3     1   0   0   0   0
  4     0   1   0   0   0
  5     0   0   0   1   0

. mata: E*W*E'
[symmetric]
        1   2   3   4   5
  1     0
  2     1   0
  3     0   1   0
  4     0   0   1   0
  5     0   0   0   1   0

See [M-1] permutation for further details on permutation vectors and permutation matrices.


Permutation details: An example

spmat permute requires that the permutation vector be stored in the Stata variable pvarname. Assume that we now have the unpermuted matrix W stored in the spmat object cobj. The matrix represents contiguity information for the following data:

. list

        id   distance
  1.    79       5.23
  2.    82      27.56
  3.   100          0
  4.   114       1.77
  5.   140      20.47

The variable distance measures the distance from the centroid of the place with id=100 to the centroids of all the other places. We sort the data on distance and generate the permutation vector p, which is just a running index 1, . . . , 5:

. sort distance

. generate p = _n

. list

        id   distance   p
  1.   100          0   1
  2.   114       1.77   2
  3.    79       5.23   3
  4.   140      20.47   4
  5.    82      27.56   5

We obtain our permutation vector by sorting the data back to the original order based on the id variable:

. sort id

. list

        id   distance   p
  1.    79       5.23   3
  2.    82      27.56   5
  3.   100          0   1
  4.   114       1.77   2
  5.   140      20.47   4

Now coding spmat permute cobj p will reorder the rows and columns of W in exactly the same way as the Mata code did above.
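That is, with the spmat object cobj in memory and the variable p just created, we type

. spmat permute cobj p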


18.2 Banding a spatial-weighting matrix

Syntax

spmat tobanded objname1 [objname2] [, truncmethod replace]

where truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(#).

Description

spmat tobanded stores an existing, general-format spatial-weighting matrix in a banded format. spmat tobanded has truncation options for inducing a banded structure in spatial-weighting matrices that are not already in banded form.

More precisely, spmat tobanded stores the spatial-weighting matrix in an spmat object in banded format.

Options

truncmethod specifies one of the three truncation criteria. The values of W that meet the truncation criterion will be changed to 0.

btruncate(b B) partitions the values of W into B bins and truncates to 0 entries that fall into bin b or below.

dtruncate(dL dU) truncates to 0 the values of W that fall more than dL diagonals below and dU diagonals above the main diagonal. Neither value can be greater than ⌊(cols(W)−1)/4⌋.

vtruncate(#) truncates to 0 the values of W that are less than or equal to #.

replace allows objname1 or objname2 to be overwritten if it already exists.

Examples

Sometimes, we have large spatial-weighting matrices that fit in memory, but they take up so much space that there is too little room to do anything else. In these cases, we are better off storing these spatial-weighting matrices in a banded format when possible.


spmat tobanded stores existing spatial-weighting matrices in a banded format. The two allowed syntaxes are

spmat tobanded objname1, replace

and

spmat tobanded objname1 objname2 [, replace]

The first syntax replaces the general-form spatial-weighting matrix in the spmat object objname1 with its banded form.

The second syntax stores the general-form spatial-weighting matrix in the spmat object objname1 in banded form in the spmat object objname2. You must specify replace if objname2 already exists.
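For example, to store a value-truncated, banded copy of an existing object under a new name, we could type something like the following sketch. The object names are placeholders, and the threshold is arbitrary: it sets all entries less than or equal to .001 to 0 before banding.

. spmat tobanded myobj mybanded, vtruncate(.001) replace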

We continue with the example from section 2.4, where we have the 3,109 × 3,109 normalized-contiguity matrix stored in the spmat object ccounty. In section 5.1, we showed that if we sort the data on a distance variable, we can call spmat contiguity again and get a banded matrix. Here we show that we can achieve the same result by 1) creating a permutation vector, 2) calling spmat permute, and 3) running spmat tobanded on the existing spmat object.

We begin by generating a permutation vector and storing it in the Stata variable p. Recall that we want the ith element of p to contain the observation number that it will have under the new sort order. This process is given in the code below and is analogous to the one discussed in the subsections Permutation details: Mathematics and Permutation details: An example in section 18.1. Because the data are already sorted by ID, we begin by sorting them by longitudes and latitudes of the centroids so that the first observation will contain a corner place. Next we generate the distance from the corner place. After sorting the data in ascending order of the distance to the corner observation, we generate our permutation vector p and finally put the data back in the original sort order.

. use county, clear

. generate p = _n

. sort longitude latitude

. generate double dist =
>     sqrt((longitude-longitude[1])^2 + (latitude-latitude[1])^2)

. sort dist


We can now use this permutation vector and spmat permute to perform the permutation, and we can finally call spmat tobanded to band the spatial-weighting matrix stored inside the spmat object ccounty. Note that the reported summary is identical to the one in section 5.1.

. spmat permute ccounty p

. spmat tobanded ccounty, replace

. spmat summarize ccounty, links

Summary of spatial-weighting object ccounty

Matrix          Description

Dimensions      3109 x 3109
Stored as        465 x 3109

Links
    total             18474
    min                   1
    mean           5.942104
    max                  14

(object contains eigenvalues)

19 Conclusion

We discussed the spmat command for creating, managing, importing, manipulating, and storing spatial-weighting matrix objects. In future work, we will consider additional subcommands for creating specific types of spatial-weighting matrices.

20 Acknowledgment

We gratefully acknowledge financial support from the National Institutes of Health through the SBIR grants R43 AG027622 and R44 AG027622.

21 References

Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.

Arbia, G. 2006. Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence. Berlin: Springer.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes: Models and Applications. London: Pion.

Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.


Crow, K. 2006. shp2dta: Stata module to convert shape boundary files to Stata datasets. Statistical Software Components S456718, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456718.html.

Crow, K., and W. Gould. 2007. FAQ: How do I graph data onto a map with spmap? http://www.stata.com/support/faqs/graphics/spmap-and-maps/.

Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial autoregressive model with autoregressive disturbances and endogenous regressors. Econometric Reviews 32: 686–733.

Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2011. Sorting induces a banded structure in spatial-weighting matrices. Working paper, Department of Economics, University of Maryland.

Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013a. A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables. Stata Journal 13: 287–301.

———. 2013b. Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances. Stata Journal 13: 221–241.

Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge University Press.

Kelejian, H. H., and I. R. Prucha. 2010. Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.

Lai, P.-C., F.-M. So, and K.-W. Chan. 2009. Spatial Epidemiological Approaches in Disease Mapping and Analysis. Boca Raton, FL: CRC Press.

Leenders, R. T. A. J. 2002. Modeling social influence through network autocorrelation: Constructing the weight matrix. Social Networks 24: 21–47.

LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton: Chapman & Hall/CRC.

Long, J. S. 2009. The Workflow of Data Analysis Using Stata. College Station, TX: Stata Press.

Pisati, M. 2005. mif2dta: Stata module to convert MapInfo Interchange Format boundary files to Stata boundary files. Statistical Software Components S448403, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s448403.html.

———. 2007. spmap: Stata module to visualize spatial data. Statistical Software Components S456812, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456812.html.


Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234–240.

Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken, NJ: Wiley.

Whittle, P. 1954. On stationary processes in the plane. Biometrika 41: 434–449.

About the authors

David Drukker is the director of econometrics at StataCorp.

Hua Peng is a senior software engineer at StataCorp.

Ingmar Prucha is a professor of economics at the University of Maryland.

Rafal Raciborski is an econometrician at StataCorp.


The Stata Journal (2013) 13, Number 2, pp. 287–301

A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables

David M. Drukker
StataCorp
College Station, TX
[email protected]

Ingmar R. Prucha
Department of Economics
University of Maryland
College Park, MD
[email protected]

Rafal Raciborski
StataCorp
College Station, TX
[email protected]

Abstract. We describe the spivreg command, which estimates the parameters of linear cross-sectional spatial-autoregressive models with spatial-autoregressive disturbances, where the model may also contain additional endogenous variables as well as exogenous variables. spivreg uses results and the literature cited in Kelejian and Prucha (1998, Journal of Real Estate Finance and Economics 17: 99–121; 1999, International Economic Review 40: 509–533; 2004, Journal of Econometrics 118: 27–50; 2010, Journal of Econometrics 157: 53–67); Arraiz et al. (2010, Journal of Regional Science 50: 592–614); and Drukker, Egger, and Prucha (2013, Econometric Reviews 32: 686–733).

Keywords: st0293, spivreg, spatial-autoregressive models, Cliff–Ord models, generalized spatial two-stage least squares, instrumental-variable estimation, generalized method of moments estimation, spatial econometrics, spatial statistics

1 Introduction

Building on the work of Whittle (1954), Cliff and Ord (1973, 1981) developed statistical models that accommodate forms of cross-unit interactions. The latter is a feature of interest in many social science, biostatistical, and geographic science models. A simple version of these models, typically referred to as spatial-autoregressive (SAR) models, augments the linear regression model by including an additional right-hand-side (RHS) variable known as a spatial lag. Each observation of the spatial-lag variable is a weighted average of the values of the dependent variable observed for the other cross-sectional units. Generalized versions of the SAR model also allow for the disturbances to be generated by a SAR process and for the exogenous RHS variables to be spatial lags of exogenous variables. The combined SAR model with SAR disturbances is often referred to as a SARAR model; see Anselin and Florax (1995).1

1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973, 1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as SARAR models while still acknowledging the importance of the work of Cliff and Ord.

© 2013 StataCorp LP st0293


In modeling the outcome for each unit as dependent on a weighted average of the outcomes of other units, SARAR models determine outcomes simultaneously. This simultaneity implies that the ordinary least-squares estimator will not be consistent; see Anselin (1988) for an early discussion of this point. Drukker, Prucha, and Raciborski (2013) discuss the spreg command, which implements estimators for the model when the RHS variables are a spatial lag of the dependent variable, exogenous variables, and spatial lags of the exogenous variables.

The model we consider allows for additional endogenous RHS variables. Thus the model of interest is a linear cross-sectional SAR model with additional endogenous variables, exogenous variables, and SAR disturbances. We discuss an estimator for the parameters of this model and the command that implements this estimator, spivreg. Kelejian and Prucha (1998, 1999, 2004, 2010) and the references cited therein derive the main results used by the estimator implemented in spivreg, with Drukker, Egger, and Prucha (2013) and Arraiz et al. (2010) producing some important extensions that are used in the code.

While SARAR models have a wide range of possible applications, following Cliff and Ord (1973, 1981), much of the original literature was developed to handle spatial interactions; see, for example, Anselin (1988, 2010), Cressie (1993), and Haining (2003). However, space is not restricted to geographic space, and many recent applications employ these techniques in other situations of cross-unit dependence, such as social-interaction models and network models; see, for example, Kelejian and Prucha (2010) and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still includes the adjective "spatial", and we continue this tradition to avoid confusion while noting the wider applicability of these models.

Section 2 defines the generalized SARAR model. Section 3 describes the spivreg command. Section 4 illustrates the estimation of a SARAR model on example data for U.S. counties. Section 5 describes postestimation commands. Section 6 presents methods and formulas. The conclusion follows.

2 The model

The model of interest is given by

y = Yπ + Xβ + λWy + u    (1)
u = ρMu + ε    (2)

where

• y is an n × 1 vector of observations on the dependent variable;

• Y is an n × p matrix of observations on p RHS endogenous variables, and π is the corresponding p × 1 parameter vector;


• X is an n × k matrix of observations on k RHS exogenous variables (where some of the variables may be spatial lags of exogenous variables), and β is the corresponding k × 1 parameter vector;

• W and M are n × n spatial-weighting matrices (with 0 diagonal elements);

• Wy and Mu are n × 1 vectors typically referred to as spatial lags, and λ and ρ are the corresponding scalar parameters typically referred to as SAR parameters;

• ε is an n × 1 vector of innovations.2

The model in equations (1) and (2) is a SARAR model with exogenous regressors and additional endogenous regressors. Spatial interactions are modeled through spatial lags, and the model allows for spatial interactions in the dependent variable, the exogenous variables, and the disturbances.

Because the model in equations (1) and (2) is a first-order SAR process with first-order SAR disturbances, it is also referred to as a SARAR(1,1) model, which is a special case of the more general SARAR(p, q) model. We refer to a SARAR(1,1) model as a SARAR model. Setting ρ = 0 yields the SAR model y = Yπ + Xβ + λWy + ε. Setting λ = 0 yields the model y = Yπ + Xβ + u with u = ρMu + ε, which is sometimes referred to as the SAR error model. Setting ρ = 0 and λ = 0 causes the model to reduce to a linear regression model with endogenous variables.
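In terms of the spivreg command described in section 3, these special cases correspond to which spatial-weighting matrices are supplied. A schematic sketch with placeholder names, in which W and M are spmat objects and y, x1, y2, and z1 are hypothetical variables:

. spivreg y x1 (y2 = z1), id(id) dlmat(W) elmat(M)   // SARAR model
. spivreg y x1 (y2 = z1), id(id) dlmat(W)            // SAR model (rho = 0)
. spivreg y x1 (y2 = z1), id(id) elmat(M)            // SAR error model (lambda = 0)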

The spatial-weighting matrices W and M are taken to be known and nonstochastic. These matrices are part of the model definition, and in many applications, W = M; see Drukker et al. (2013) for more about creating spatial-weighting matrices in Stata. Let $\overline{y} = Wy$, let $y_i$ and $\overline{y}_i$ denote the ith element of y and $\overline{y}$, respectively, and let $w_{ij}$ denote the (i, j)th element of W. Then

$$\overline{y}_i = \sum_{j=1}^{n} w_{ij}\,y_j$$

which clearly shows the dependence of $y_i$ on neighboring outcomes via the spatial lag $\overline{y}_i$. The weights $w_{ij}$ will typically be modeled as inversely related to some measure of distance between the units. The SAR parameter λ measures the extent of these interactions.

The innovations ε are assumed to be independent and identically distributed or independent but heteroskedastically distributed. The option heteroskedastic, discussed below, should be specified under the latter assumption.

The spivreg command implements the generalized method of moments (GMM) and instrumental-variable (IV) estimation strategy discussed in Arraiz et al. (2010) and Drukker, Egger, and Prucha (2013) for the above class of SARAR models. This estimation strategy builds on Kelejian and Prucha (1998, 1999, 2004, 2010) and the references cited therein. More in-depth discussions regarding issues of model specifications and estimation approaches can be found in these articles and the literature cited therein.

spivreg requires that the spatial-weighting matrices M and W be provided in the form of an spmat object as described in Drukker et al. (2013). Both general and banded spatial-weighting matrices are supported.

2. The variables and parameters in this model are allowed to depend on the sample size; see Kelejian and Prucha (2010) for further discussions. We suppress this dependence for notational simplicity. In allowing, in particular, the elements of X to depend on the sample size, we find that the specification is consistent with some of the variables in X being spatial lags of exogenous variables.

3 The spivreg command

3.1 Syntax

spivreg depvar [varlist1] (varlist2 = [varlist_iv]) [if] [in], id(varname)
    [dlmat(objname) elmat(objname) noconstant heteroskedastic impower(q)
    level(#) maximize_options]

3.2 Options

id(varname) specifies a numeric variable that contains a unique identifier for each observation. id() is required.

dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix W to be used in the SAR term.

elmat(objname) specifies an spmat object that contains the spatial-weighting matrix M to be used in the spatial-error term.

noconstant suppresses the constant term in the model.

heteroskedastic specifies that spivreg use an estimator that allows ε to be heteroskedastically distributed over the observations. By default, spivreg uses an estimator that assumes homoskedasticity.

impower(q) specifies how many powers of the matrix W to include in calculating the instrument matrix H. The default is impower(2). The allowed values of q are integers in the set {2, 3, . . . , ⌊√n⌋}.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.

maximize_options: iterate(#), [no]log, trace, gradient, showstep, showtolerance, tolerance(#), ltolerance(#), and from(init_specs); see [R] maximize for details. These options are seldom used.


3.3 Saved results

spivreg saves the following information in e():

Scalars
    e(N)                 number of observations
    e(k)                 number of parameters
    e(rho_2sls)          initial estimate of ρ
    e(iterations)        number of GMM iterations
    e(iterations_2sls)   number of 2SLS iterations
    e(converged)         1 if GMM stage converged, 0 otherwise
    e(converged_2sls)    1 if 2SLS stage converged, 0 otherwise

Macros
    e(cmd)               spivreg
    e(cmdline)           command as typed
    e(depvar)            name of dependent variable
    e(title)             title in estimation output
    e(properties)        b V
    e(estat_cmd)         program used to implement estat
    e(predict)           program used to implement predict
    e(model)             sarar, sar, sare, or lr
    e(het)               heteroskedastic or homoskedastic
    e(indeps)            names of independent variables
    e(exogr)             exogenous regressors
    e(insts)             instruments
    e(instd)             instrumented variables
    e(constant)          noconstant or hasconstant
    e(H_omitted)         names of omitted instruments in H matrix
    e(idvar)             name of ID variable
    e(dlmat)             name of spmat object used in dlmat()
    e(elmat)             name of spmat object used in elmat()

Matrices
    e(b)                 coefficient vector
    e(V)                 variance–covariance matrix of the estimators
    e(delta_2sls)        initial estimate of β and λ

Functions
    e(sample)            marks estimation sample

4 Examples

To provide a simple illustration, we use the artificial dataset dui.dta for the continental U.S. counties.3 The contiguity matrix for the U.S. counties is taken from Drukker et al. (2013). In Stata, we issue the following commands:

. use dui

. spmat use ccounty using ccounty.spmat

The spatial-weighting matrix is now contained in the spmat object ccounty. This minmax-normalized spatial-weighting matrix was created in section 2.4 of Drukker et al. (2013) and was saved to disk in section 11.4.

In the output above, we are just reading in the spatial-weighting-matrix object that was created and saved in Drukker et al. (2013).

3. The geographical county location data came from the U.S. Census Bureau and can be found at ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by Powers and Wilson (2004) and Levitt (1997).


Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000 daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across counties, with darker colors representing higher values of the dependent variable. Spatial patterns in dui are clearly visible.

Figure 1. Hypothetical alcohol-related arrests for continental U.S. counties

Our explanatory variables include police (number of sworn officers per 100,000 DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit alcohol sale within their borders). Because the size of the police force may be a function of dui arrest rates, we treat police as endogenous; that is, in this example, Y = (police). All other included explanatory variables, apart from the spatial lag, are taken to be exogenous; that is, X = (nondui, vehicles, dry, intercept). Furthermore, we assume the variable elect is a valid instrument, where elect is 1 if a county government faces an election and is 0 otherwise. Thus the instrument matrix H is based on Xf = (nondui, vehicles, dry, elect, intercept) as described above.


In Stata, we can estimate the SARAR model with endogenous variables by typing

. spivreg dui nondui vehicles dry (police = elect), id(id)
>     dlmat(ccounty) elmat(ccounty) nolog

Spatial autoregressive model                      Number of obs =      3109
(GS2SLS estimates)

         dui        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

dui
      police    -1.467068   .0434956   -33.73   0.000    -1.552318   -1.381818
      nondui    -.0004088   .0008344    -0.49   0.624    -.0020442    .0012267
    vehicles     .0989662   .0017653    56.06   0.000     .0955063    .1024261
         dry     .4553992   .0278049    16.38   0.000     .4009026    .5098958
       _cons     9.671655   .3682685    26.26   0.000     8.949862    10.39345

lambda
       _cons     .7340818    .013378    54.87   0.000     .7078614    .7603023

rho
       _cons     .2829313    .071908     3.93   0.000     .1419941    .4238685

Instrumented:  police
Instruments:   elect

Given the normalization of the spatial-weighting matrix, the parameter space for λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further discussions of the parameter space. The estimate of λ is positive, large, and significant, indicating strong SAR dependence in dui. In other words, the alcohol-related arrest rate for a given county is strongly affected by the alcohol-related arrest rates in the neighboring counties. One possible explanation for this may be coordination among police departments. Another may be that strong enforcement in one county may lead some people to drink in neighboring counties.

The estimated ρ is positive, moderate, and significant, indicating moderate spatial autocorrelation in the innovations.

The estimated β vector does not have the same interpretation as in a simple linear model, because including a spatial lag of the dependent variable implies that the outcomes are determined simultaneously.
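Had we been unwilling to assume homoskedastic innovations, we could refit the model with the heteroskedastic option; the syntax is otherwise unchanged:

. spivreg dui nondui vehicles dry (police = elect), id(id)
>     dlmat(ccounty) elmat(ccounty) heteroskedastic nolog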


5 Postestimation commands

5.1 Syntax

The syntax for predict after spivreg is

predict [type] newvar [if] [in] [, statistic]

where statistic is one of the following:

naive, the default, computes Yπ + Xβ + λWy, which should not be viewed as a predictor for y_i but simply as an intermediate calculation.

xb calculates Yπ + Xβ.

The predictor computed by the option naive will generally be biased; see Kelejian and Prucha (2007) for an explanation. Optimal predictors for the SARAR model with additional endogenous RHS variables corresponding to different information sets will be made available in the future. Optimal predictors for the SARAR model without additional endogenous RHS variables are discussed in Kelejian and Prucha (2007).
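For example, after the spivreg command fit in section 4, we could obtain both statistics by typing

. predict y_naive
. predict y_xb, xb

where y_naive holds the default naive calculation and y_xb holds Yπ + Xβ evaluated at the estimates.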

6 Methods and formulas

In this section, we give a detailed description of the calculations performed by spivreg. We first discuss the estimation of the general model as specified in (1) and (2), both under the assumption that the innovations ε are homoskedastic and under the assumption that the innovations ε are heteroskedastic of unknown form. We then discuss the two special cases ρ = 0 and λ = 0, respectively.

6.1 SARAR model

It is helpful to rewrite the model in (1) and (2) as

y = Zδ + u

u = ρMu + ε

where Z = (Y, X, Wy) and δ = (π′, β′, λ)′. In the following, we review the two-step GMM and IV estimation approach as discussed in Drukker, Egger, and Prucha (2013) for the homoskedastic case and in Arraiz et al. (2010) for the heteroskedastic case. Those articles build on and specialize the estimation theory developed in Kelejian and Prucha (1998, 1999, 2004, 2010). A full set of assumptions, formal consistency and asymptotic normality theorems, and further details and discussions are given in that literature.

The IV estimators of δ depend on the choice of a set of instruments, say, H. Suppose that in addition to the included exogenous variables X, we also have excluded exogenous variables Xe, allowing us to define Xf = (X, Xe). If we do not have excluded exogenous variables, then Xf = X. Following the above literature, the instruments H may then be taken as the linearly independent columns of

(Xf, WXf, . . . , W^q Xf, MXf, MWXf, . . . , MW^q Xf)

The motivation for the above instruments is that they are computationally simple while facilitating an approximation of the ideal instruments under reasonable assumptions. Taking q = 2 has worked well in Monte Carlo simulations over a wide range of specifications. At a minimum, the instruments should include the linearly independent columns of Xf and MXf, and the rank of H should be at least the number of variables in Z.4 For the following discussion, it proves convenient to define the instrument projection matrix P_H = H(H′H)^{−1}H′. When there is a constant in the model, it is only included once in H.
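As an illustration, the instrument matrix for the default q = 2 can be written down directly in Mata. This is only a schematic sketch: Xf, W, and M are assumed to already exist as Mata matrices, and collinear columns would still need to be dropped.

. mata: WX = W*Xf
. mata: W2X = W*WX
. mata: H = (Xf, WX, W2X, M*Xf, M*WX, M*W2X)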

The GMM estimators for ρ are motivated by quadratic moment conditions of the form

E(ε′A_s ε) = 0,   s = 1, . . . , S

where the matrices A_s satisfy tr(A_s) = 0. Specific choices for those matrices will be given below. We note that under heteroskedasticity, it is furthermore assumed that the diagonal elements of the matrices A_s are 0. This assumption simplifies the formula for the asymptotic variance–covariance (VC) matrix; in particular, it avoids the fact that the VC matrix must depend on third and fourth moments of the innovations in addition to second moments.

We next describe the steps involved in computing the GMM and IV estimators and an estimate of their asymptotic VC matrix. The second step operates on a spatial Cochrane–Orcutt transformation of the above model given by

y(ρ) = Z(ρ)δ + ε

with y(ρ) = (In − ρM)y and Z(ρ) = (In − ρM)Z.

Step 1a: Two-stage least-squares estimator

In the first step, we apply two-stage least squares (2SLS) to the untransformed model by using the instruments H. The 2SLS estimator of δ is then given by

$$\widetilde{\delta} = \left(\widehat{Z}'Z\right)^{-1}\widehat{Z}'y$$

where $\widehat{Z} = P_H Z$.

4. Note that if Xf contains spatially lagged variables, H will contain collinear columns and will not be full rank. In those cases, we drop collinear columns from H and return the names of omitted instruments in e(H_omitted).
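In Mata, the 2SLS step can be sketched in a few lines, assuming y, Z, and H from above are in memory as Mata matrices; quadcross(A,B) computes A′B in quad precision:

. mata: PH = H*invsym(quadcross(H,H))*H'
. mata: Zhat = PH*Z
. mata: delta = invsym(quadcross(Zhat,Zhat))*quadcross(Zhat,y)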


Step 1b: Initial GMM estimator of ρ

The initial GMM estimator of ρ is given by

$$\widetilde{\rho} = \arg\min_{\rho}\left[\left\{\widetilde{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix}-\widetilde{\gamma}\right\}'\left\{\widetilde{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix}-\widetilde{\gamma}\right\}\right]$$

where $\widetilde{u} = y - Z\widetilde{\delta}$ are the 2SLS residuals, $\overline{u} = M\widetilde{u}$,

$$\widetilde{\Gamma} = n^{-1}\begin{bmatrix}\widetilde{u}'(A_{1}+A_{1}')\overline{u} & -\overline{u}'A_{1}\overline{u}\\ \vdots & \vdots\\ \widetilde{u}'(A_{S}+A_{S}')\overline{u} & -\overline{u}'A_{S}\overline{u}\end{bmatrix}\quad\text{and}\quad \widetilde{\gamma} = n^{-1}\begin{bmatrix}\widetilde{u}'A_{1}\widetilde{u}\\ \vdots\\ \widetilde{u}'A_{S}\widetilde{u}\end{bmatrix}$$

Writing the GMM estimator in this form shows that we can calculate it by solving a simple nonlinear least-squares problem. By default, S = 2 and homoskedastic is specified. In this case,

$$A_{1} = \left[1+\left\{n^{-1}\operatorname{tr}(M'M)\right\}^{2}\right]^{-1}\left\{M'M-n^{-1}\operatorname{tr}(M'M)\,I_{n}\right\} \quad\text{and}\quad A_{2}=M$$

If heteroskedastic is specified, then by default,

$$A_{1} = M'M-\operatorname{diag}(M'M) \quad\text{and}\quad A_{2}=M$$

Step 2a: Generalized spatial two-stage least-squares estimator of δ

In the second step, we first estimate δ by 2SLS from the transformed model by using the instruments H, where the spatial Cochrane–Orcutt transformation uses $\widetilde{\rho}$. The resulting generalized spatial two-stage least-squares (GS2SLS) estimator of δ is now given by

$$\widehat{\delta}(\widetilde{\rho}) = \left\{\widehat{Z}(\widetilde{\rho})'Z(\widetilde{\rho})\right\}^{-1}\widehat{Z}(\widetilde{\rho})'y(\widetilde{\rho})$$

where $y(\widetilde{\rho})=(I_{n}-\widetilde{\rho}M)y$, $Z(\widetilde{\rho})=(I_{n}-\widetilde{\rho}M)Z$, and $\widehat{Z}(\widetilde{\rho})=P_{H}Z(\widetilde{\rho})$.

Step 2b: Efficient GMM estimator of ρ

The efficient GMM estimator of ρ corresponding to GS2SLS residuals is given by

$$\widehat{\rho} = \arg\min_{\rho}\left[\left\{\widehat{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix}-\widehat{\gamma}\right\}'\left\{\widehat{\Psi}^{\rho\rho}(\widetilde{\rho})\right\}^{-1}\left\{\widehat{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix}-\widehat{\gamma}\right\}\right]$$


where $\widehat{u} = y - Z\widehat{\delta}$ denotes the GS2SLS residuals, $\overline{u} = M\widehat{u}$,

$$\widehat{\Gamma} = n^{-1}\begin{bmatrix}\widehat{u}'(A_{1}+A_{1}')\overline{u} & -\overline{u}'A_{1}\overline{u}\\ \vdots & \vdots\\ \widehat{u}'(A_{S}+A_{S}')\overline{u} & -\overline{u}'A_{S}\overline{u}\end{bmatrix}\quad\text{and}\quad \widehat{\gamma} = n^{-1}\begin{bmatrix}\widehat{u}'A_{1}\widehat{u}\\ \vdots\\ \widehat{u}'A_{S}\widehat{u}\end{bmatrix}$$

and where $\widehat{\Psi}^{\rho\rho}(\widetilde{\rho})$ is an estimator for the VC matrix of the (normalized) sample moment vector based on GS2SLS residuals, say, $\Psi^{\rho\rho}$. The estimators $\widehat{\Psi}^{\rho\rho}(\widetilde{\rho})$ and $\Psi^{\rho\rho}$ differ for the cases of homoskedastic and heteroskedastic errors. When homoskedastic is specified, the r, s element of $\widehat{\Psi}^{\rho\rho}(\widetilde{\rho})$ is given by (r, s = 1, 2),

$$\begin{aligned}
\widehat{\Psi}^{\rho\rho}_{r,s}(\widetilde{\rho}) ={}& \left\{\widehat{\sigma}^{2}(\widetilde{\rho})\right\}^{2}(2n)^{-1}\operatorname{tr}\left\{(A_{r}+A_{r}')(A_{s}+A_{s}')\right\}\\
&+ \widehat{\sigma}^{2}(\widetilde{\rho})\,n^{-1}\widehat{a}_{r}(\widetilde{\rho})'\,\widehat{a}_{s}(\widetilde{\rho})\\
&+ n^{-1}\left[\widehat{\mu}^{(4)}(\widetilde{\rho})-3\left\{\widehat{\sigma}^{2}(\widetilde{\rho})\right\}^{2}\right]\operatorname{vec}_{D}(A_{r})'\operatorname{vec}_{D}(A_{s})\\
&+ n^{-1}\widehat{\mu}^{(3)}(\widetilde{\rho})\left\{\widehat{a}_{r}(\widetilde{\rho})'\operatorname{vec}_{D}(A_{s})+\widehat{a}_{s}(\widetilde{\rho})'\operatorname{vec}_{D}(A_{r})\right\}
\end{aligned} \tag{3}$$

where

$$\begin{aligned}
\widehat{a}_{r}(\widetilde{\rho}) &= \widehat{T}(\widetilde{\rho})\,\widehat{\alpha}_{r}(\widetilde{\rho})\\
\widehat{T}(\widetilde{\rho}) &= H\widehat{P}(\widetilde{\rho})\\
\widehat{P}(\widetilde{\rho}) &= \widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ(\widetilde{\rho})}\left\{\widehat{Q}_{HZ(\widetilde{\rho})}'\,\widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ(\widetilde{\rho})}\right\}^{-1}\\
\widehat{Q}_{HH} &= n^{-1}H'H\\
\widehat{Q}_{HZ(\widetilde{\rho})} &= n^{-1}H'Z(\widetilde{\rho})\\
Z(\widetilde{\rho}) &= (I-\widetilde{\rho}M)Z\\
\widehat{\alpha}_{r}(\widetilde{\rho}) &= -n^{-1}\left\{Z(\widetilde{\rho})'(A_{r}+A_{r}')\,\widehat{\varepsilon}(\widetilde{\rho})\right\}\\
\widehat{\varepsilon}(\widetilde{\rho}) &= (I-\widetilde{\rho}M)\widehat{u}\\
\widehat{\sigma}^{2}(\widetilde{\rho}) &= n^{-1}\widehat{\varepsilon}(\widetilde{\rho})'\,\widehat{\varepsilon}(\widetilde{\rho})\\
\widehat{\mu}^{(3)}(\widetilde{\rho}) &= n^{-1}\textstyle\sum_{i=1}^{n}\widehat{\varepsilon}_{i}(\widetilde{\rho})^{3}\\
\widehat{\mu}^{(4)}(\widetilde{\rho}) &= n^{-1}\textstyle\sum_{i=1}^{n}\widehat{\varepsilon}_{i}(\widetilde{\rho})^{4}
\end{aligned}$$

When heteroskedastic is specified, the r, s element of Ψρρ is estimated by

$$\widehat{\Psi}^{\rho\rho}_{r,s}(\widetilde{\rho}) = (2n)^{-1}\operatorname{tr}\left\{(A_{r}+A_{r}')\,\widehat{\Sigma}(\widetilde{\rho})\,(A_{s}+A_{s}')\,\widehat{\Sigma}(\widetilde{\rho})\right\} + n^{-1}\widehat{a}_{r}(\widetilde{\rho})'\,\widehat{\Sigma}(\widetilde{\rho})\,\widehat{a}_{s}(\widetilde{\rho}) \tag{4}$$

where $\widehat{\Sigma}(\widetilde{\rho})$ is a diagonal matrix whose ith diagonal element is $\widehat{\varepsilon}_{i}(\widetilde{\rho})^{2}$, and $\widehat{\varepsilon}(\widetilde{\rho})$ and $\widehat{a}_{r}(\widetilde{\rho})$ are as defined above. The last two terms in (3) do not appear in (4) because the $A_{s}$ matrices used in the heteroskedastic case have diagonal elements equal to 0.


Having computed the estimator $\widehat{\theta} = (\widehat{\delta}{}',\widehat{\rho})'$ in steps 1a, 1b, 2a, and 2b, we next compute a consistent estimator for its asymptotic VC matrix, say, Ω. The estimator is given by $n\widehat{\Omega}$, where

$$\widehat{\Omega} = \begin{pmatrix} \widehat{\Omega}^{\delta\delta} & \widehat{\Omega}^{\delta\rho}\\ \widehat{\Omega}^{\delta\rho\,\prime} & \widehat{\Omega}^{\rho\rho} \end{pmatrix}$$

$$\begin{aligned}
\widehat{\Omega}^{\delta\delta} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\delta}(\widehat{\rho})\,\widehat{P}(\widehat{\rho})\\
\widehat{\Omega}^{\delta\rho} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\rho}(\widehat{\rho})\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\left[\widehat{J}'\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\right]^{-1}\\
\widehat{\Omega}^{\rho\rho} &= \left[\widehat{J}'\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\right]^{-1}\\
\widehat{J} &= \widehat{\Gamma}\begin{pmatrix}1\\ 2\widehat{\rho}\end{pmatrix}
\end{aligned}$$

In the above, $\widehat{\Psi}^{\rho\rho}(\widehat{\rho})$ and $\widehat{P}(\widehat{\rho})$ are as defined in (3) and (4) with $\widetilde{\rho}$ replaced by $\widehat{\rho}$. The estimators $\widehat{\Psi}^{\delta\delta}(\widehat{\rho})$ and $\widehat{\Psi}^{\delta\rho}(\widehat{\rho})$ are defined as follows:

When homoskedastic is specified,

$$
\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\widehat{Q}_{HH} \\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\,n^{-1}H'\left\{\widehat{a}_{1}(\widehat{\rho}), \widehat{a}_{2}(\widehat{\rho})\right\} + \widehat{\mu}^{(3)}(\widehat{\rho})\,n^{-1}H'\left\{\mathrm{vec}_{D}(A_{1}), \mathrm{vec}_{D}(A_{2})\right\}
\end{aligned}
$$

When heteroskedastic is specified,

$$
\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= n^{-1}H'\widehat{\Sigma}(\widehat{\rho})H \\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= n^{-1}H'\widehat{\Sigma}(\widehat{\rho})\left\{\widehat{a}_{1}(\widehat{\rho}), \widehat{a}_{2}(\widehat{\rho})\right\}
\end{aligned}
$$

We note that the expression for $\widehat{\Omega}^{\rho\rho}$ has the simple form given above because the estimator in step 2b is the efficient GMM estimator.

6.2 SAR model without spatially correlated errors

Consider the case ρ = 0, that is, the case where the disturbances are not spatially correlated. In this case, only step 1a is necessary, and spivreg estimates δ by 2SLS using as instruments H the linearly independent columns of $\{X_{f}, WX_{f}, \ldots, W^{q}X_{f}\}$. The 2SLS estimator is given by

$$
\widehat{\delta} = (\widehat{Z}'Z)^{-1}\widehat{Z}'y
$$

where $\widehat{Z} = P_{H}Z$.


When homoskedastic is specified, the asymptotic VC matrix of $\widehat{\delta}$ can be estimated consistently by

$$
\widehat{\sigma}^{2}(\widehat{Z}'Z)^{-1}
$$

where $\widehat{\sigma}^{2} = n^{-1}\sum_{i=1}^{n}\widehat{u}_{i}^{2}$ and $\widehat{u} = y - Z\widehat{\delta}$ denotes the 2SLS residuals.

When heteroskedastic is specified, the asymptotic VC matrix of $\widehat{\delta}$ can be estimated consistently by the sandwich form

$$
(\widehat{Z}'Z)^{-1}\widehat{Z}'\widehat{\Sigma}\widehat{Z}(\widehat{Z}'Z)^{-1}
$$

where $\widehat{\Sigma}$ is the diagonal matrix whose $i$th element is $\widehat{u}_{i}^{2}$.

6.3 Spatially correlated errors without a SAR term

Consider the case λ = 0, that is, the case where there is no spatially lagged dependent variable in the model. In this case, we use the same formulas as in section 6.1 after redefining $Z = (Y, X)$, $\delta = (\pi', \beta')'$, and we take H to be composed of linearly independent columns of $(X_{f}, MX_{f})$.

6.4 No SAR term or spatially correlated errors

When the model does not contain a SAR term or spatially correlated errors, the 2SLS estimator provides consistent estimates, and we obtain our results by using ivregress (see [R] ivregress). When homoskedastic is specified, the conventional estimator of the asymptotic VC is used. When heteroskedastic is specified, the vce(robust) estimator of the asymptotic VC is used. When no endogenous variables are specified, we obtain our results by using regress (see [R] regress).
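For concreteness, with hypothetical variable names y and x, an endogenous regressor w, and outside instruments h1 and h2 (all invented for illustration), the two cases correspond to

    . ivregress 2sls y x (w = h1 h2)
    . ivregress 2sls y x (w = h1 h2), vce(robust)

where the first call uses the conventional VC estimator and the second the heteroskedasticity-robust one.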

7 Conclusion

We have described the spivreg command for estimating the parameters of a SARAR model with additional endogenous RHS variables. In the future, we plan to add options for optimal predictors corresponding to different information sets.

8 Acknowledgment

We gratefully acknowledge financial support from the National Institutes of Health through the SBIR grants R43 AG027622 and R44 AG027622.



About the authors

David Drukker is the director of econometrics at StataCorp.

Ingmar Prucha is a professor of economics at the University of Maryland.

Rafal Raciborski is an econometrician at StataCorp.


The Stata Journal (2013) 13, Number 2, pp. 302–314

A command for Laplace regression

Matteo Bottai
Unit of Biostatistics
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]

Nicola Orsini
Unit of Biostatistics and Unit of Nutritional Epidemiology
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]

Abstract. We present the new laplace command for estimating Laplace regression, which models quantiles of a possibly censored outcome variable given covariates. We illustrate laplace with an example from a clinical trial on survival in patients with metastatic renal carcinoma. We also report the results of a small simulation study.

Keywords: st0294, laplace, quantile regression, censored outcome, survival analysis, Kaplan–Meier

1 Introduction

Estimating percentiles for a time-to-event variable of interest conditionally on covariates may offer a useful complement to current approaches to survival analysis. For example, comparing survival across treatments or exposure levels in observational studies at various percentiles (for example, at the 50th or 10th percentiles) provides important insights. At the univariate level, this can be accomplished with the Kaplan–Meier estimator.

Laplace regression can be used to estimate the effect of risk factors and important predictors on survival percentiles while adjusting for other covariates. The user-written clad command (Jolliffe, Krushelnytskyy, and Semykina 2000) estimates conditional quantiles only when censoring times are fixed and known for all observations (Powell 1986), and its applicability is limited.

In this article, we present the laplace command for estimating Laplace regression (Bottai and Zhang 2010). In section 2, we describe the syntax and options. In section 3, we illustrate laplace with data from a randomized clinical trial. In section 4, we sketch the methods and formulas. In section 5, we present the results of a small simulation study.

© 2013 StataCorp LP st0294


2 The laplace command

2.1 Syntax

laplace depvar [indepvars] [if] [in] [, quantiles(numlist) failure(varname) sigma(varlist) reps(#) seed(#) tolerance(#) maxiter(#) level(#)]

by, statsby, and xi are allowed with laplace; see [U] 11.1.10 Prefix commands.

See [R] qreg postestimation for features available after estimation.

2.2 Options

quantiles(numlist) specifies the quantiles as numbers between 0 and 1; numbers larger than 1 are interpreted as percentages. The default is quantiles(0.5), which corresponds to the median.

failure(varname) specifies the failure event; the value 0 indicates censored observations. If failure() is not specified, all observations are assumed to be uncensored.

sigma(varlist) specifies the variables to be included in the scale parameter model. The default is constant only.

reps(#) specifies the number of bootstrap replications to be performed for estimating the variance–covariance matrix and standard errors of the regression coefficients.

seed(#) sets the initial value of the random-number seed used by the bootstrap. If seed() is specified, the bootstrapped estimates are reproducible (see [R] set seed).

tolerance(#) specifies the tolerance for the optimization algorithm. When the absolute change in the log likelihood from one iteration to the next is less than or equal to #, the tolerance() convergence criterion is met. The default is tolerance(1e-10).

maxiter(#) specifies the maximum number of iterations. When the number of iterations equals maxiter(), the optimizer stops, displays an x, and presents the current results. The default is maxiter(2000).

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.


2.3 Saved results

laplace saves the following in e():

Scalars
    e(N)             number of observations
    e(N_fail)        number of failures
    e(n_q)           number of estimated quantiles
    e(reps)          number of bootstrap replications

Macros
    e(cmd)           laplace
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(eqnames)       names of equations
    e(qlist)         requested quantiles
    e(vcetype)       title used to label Std. Err.
    e(properties)    b V
    e(predict)       program used to implement predict

Matrices
    e(b)             coefficient vector
    e(V)             variance–covariance matrix of the estimators

Functions
    e(sample)        marks estimation sample

3 Example: Survival in metastatic renal carcinoma

We illustrate the use of laplace with data from a clinical trial on 347 patients with metastatic renal carcinoma. The patients were randomly assigned to either interferon-α (IFN) or oral medroxyprogesterone (MPA) (Medical Research Council Renal Cancer Collaborators 1999). A total of 322 patients died during follow-up. The outcome of primary research interest is overall survival.

. use kidney_ca_l
(kidney cancer data)

. quietly stset months, failure(cens)

The numeric variable months represents the time to event or censoring, and the binary variable cens indicates the failure status (0 = censored, 1 = death).

3.1 Median survival

We estimate a Laplace regression model where the response variable is time to death or censoring (months) and the binary indicator for treatment (trt) is the only covariate. We specify the event status with the option failure(). The default percentile is the median (q50).


. laplace months trt, failure(cens)

Laplace regression                              No. of subjects =      347
                                                No. of failures =      322

                             Robust
      months       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

q50
         trt    3.130258   1.195938     2.62   0.009     .7862628    5.474254
       _cons     6.80548   .7188408     9.47   0.000     5.396578    8.214382

The estimated median survival in the MPA group is 6.8 months (95% confidence interval: [5.4, 8.2]). The difference (trt) in median survival between the treatment groups is 3.1 months (95% confidence interval: [0.8, 5.5]). Median survival among patients on IFN can be obtained with the postestimation command lincom.

. lincom _cons + trt

 ( 1)  [q50]trt + [q50]_cons = 0

      months       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         (1)    9.935738   .9557906    10.40   0.000     8.062423    11.80905

Percentiles of survival time by treatment group can also be obtained from the Kaplan–Meier estimate of the survivor function by using the command stci.

. stci, by(trt)

         failure _d:  cens
   analysis time _t:  months

             no. of
  trt      subjects         50%    Std. Err.     [95% Conf. Interval]

  MPA           175     6.80548    .8902896      4.86575     8.15342
  IFN           172    9.830137    .8982793       7.7589     11.7041

  total         347    7.956164    .5699226      6.90411      9.1726

The estimated median in the IFN group (9.8 months) differs slightly from the laplace estimate (9.9 months) shown above. The Kaplan–Meier curve in the IFN group is flat at the 50th percentile between 9.83 and 9.96 months of follow-up. The command stci shows the lower limit of this interval while laplace shows a middle value.


3.2 Multiple survival percentiles

When it is relevant to estimate multiple percentiles of the distribution of survival time, these can be specified with the option quantiles().

. laplace months trt, failure(cens) quantiles(25 50 75) rep(100) seed(123)

Laplace regression                              No. of subjects =      347
                                                No. of failures =      322

                           Bootstrap
      months       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

q25
         trt    1.509151   .8289345     1.82   0.069    -.1155312    3.133832
       _cons     2.49863    .399623     6.25   0.000     1.715384    3.281877

q50
         trt    3.130258   1.209658     2.59   0.010     .7593719    5.501145
       _cons     6.80548   .9100921     7.48   0.000     5.021732    8.589227

q75
         trt    3.663238   3.482536     1.05   0.293    -3.162407    10.48888
       _cons    15.87945   1.714295     9.26   0.000      12.5195    19.23941

The treatment effect is larger at higher percentiles of survival time. The difference between the two treatment groups at the 25th, 50th, and 75th percentiles is 1.5, 3.1, and 3.7 months, respectively. When bootstrap is requested, one can test for differences in treatment effects across survival percentiles with the postestimation command test.

. test [q25]trt = [q50]trt

 ( 1)  [q25]trt - [q50]trt = 0

           chi2(  1) =    2.59
         Prob > chi2 =    0.1076

We fail to reject the hypothesis that the treatment effects at the 25th and 50th survival percentiles are equal (p-value > 0.05).

Figure 1 shows the predicted percentiles from the 1st to the 99th in each treatment group. The difference of 3 months in median survival between groups is represented by the horizontal distance between the points A and B. Approximately 30% and 40% of the patients on MPA and IFN, respectively, are estimated to live longer than 12 months. The absolute difference of about 10% in the probability of surviving 12 months is represented by the vertical distance between the points C and D.


[Figure 1 appears here: survival percentiles (y axis, 0 to 100) against follow-up time in months (x axis, 0 to 60), with labeled points A, B, C, and D.]

Figure 1. Survival percentiles in the MPA (solid line) and IFN (dashed line) groups estimated with Laplace regression. The horizontal distance between the points A and B (3.1 months) indicates the difference in median survival between groups. The vertical distance between C and D (about 10%) indicates the difference in the proportion of patients estimated to survive 12 months.

3.3 Interactions between covariates

Royston, Sauerbrei, and Ritchie (2004) analyzed the same data and described how a continuous prognostic factor, white cell count (wcc), affects the treatment effect as measured by a relative hazard. We now perform a similar analysis by using Laplace regression for the median survival. We include as covariates the treatment indicator (trt), three equally sized classes of white cell counts (cwcc) by means of two indicator variables, and their interactions.


. xi: laplace months i.trt*i.cwcc, failure(cens)
i.trt             _Itrt_0-1         (naturally coded; _Itrt_0 omitted)
i.cwcc            _Icwcc_0-2        (naturally coded; _Icwcc_0 omitted)
i.trt*i.cwcc      _ItrtXcwc_#_#     (coded as above)

Laplace regression                              No. of subjects =      347
                                                No. of failures =      322

                                Robust
        months       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

q50
       _Itrt_1     8.01462   2.270786     3.53   0.000     3.563962    12.46528
      _Icwcc_1    2.262442   2.068403     1.09   0.274    -1.791554    6.316438
      _Icwcc_2   -2.496523   1.645959    -1.52   0.129    -5.722544    .7294982
 _ItrtXcwc_1_1   -5.737988   3.241483    -1.77   0.077    -12.09118    .6152021
 _ItrtXcwc_1_2   -7.751629   2.645534    -2.93   0.003    -12.93678   -2.566478
         _cons     6.90203   1.658547     4.16   0.000     3.651337    10.15272

The predicted median survival can be obtained with standard postestimation commands such as predict or adjust.

. adjust, by(trt cwcc) format(%2.0f) noheader

                   White Cell Counts
  treatment      Low    Medium    High

  MPA              7         9       4
  IFN             15        11       5

  Key:  Linear Prediction

The between-treatment-group difference in median survival varies from 8 months in the low white cell count category to 1 month in the high white cell count category. We test for interaction between treatment and white cell counts with the postestimation command testparm.

. testparm _ItrtX*

 ( 1)  [q50]_ItrtXcwc_1_1 = 0
 ( 2)  [q50]_ItrtXcwc_1_2 = 0

           chi2(  2) =    8.59
         Prob > chi2 =    0.0137

We reject the null hypothesis of equal treatment effect across categories of white cell counts (p = 0.0137). The treatment effect seems to be largest in patients with low white cell counts.

3.4 Laplace regression with uncensored data

Suppose all the values for the variable months were uncensored times at death. The laplace command can be used with uncensored observations by omitting the failure() option. In this case, laplace is simply an alternative to the standard quantile regression commands qreg and sqreg.


. qui laplace months trt

. adjust, by(trt) format(%3.2f) noheader

  treatment        xb

  MPA            6.77
  IFN            9.89

  Key:  xb  =  Linear Prediction

. qui qreg months trt

. adjust, by(trt) format(%3.2f) noheader

  treatment        xb

  MPA            6.77
  IFN            9.96

  Key:  xb  =  Linear Prediction

The number of observations in the MPA group is odd (175 patients), and the sample median survival is 6.77 months. The number of observations in the IFN group is even (172 patients), and the median is not uniquely defined. The two nearest values are 9.83 and 9.96 months. The command qreg picks the larger of the two, while laplace picks a value in between.

4 Methods and formulas

In this section, we follow the description provided by Bottai and Zhang (2010). Suppose we have a sample of size $n$. Let $t_{i}$, $i = 1, \ldots, n$, be a continuous outcome variable, $c_{i}$ be a continuous censoring variable, and $x_{i} = \{x_{1,i}, \ldots, x_{r,i}\}'$ and $z_{i} = \{z_{1,i}, \ldots, z_{s,i}\}'$ be two vectors of covariates. The sets of covariates contained in $x_{i}$ and $z_{i}$ may partially or entirely overlap. We assume that $c_{i}$ is independent of $t_{i}$ conditionally on the covariates. Suppose we observe $(y_{i}, d_{i}, x_{i}', z_{i}')$, with $y_{i} = \min(t_{i}, c_{i})$ and $d_{i} = I(t_{i} \le c_{i})$, where $I(A)$ denotes the indicator function of the event $A$. We assume that

$$
t_{i} = x_{i}'\beta_{p} + \exp(z_{i}'\sigma_{p})\varepsilon_{i}
\tag{1}
$$

where $\beta_{p} = \{\beta_{p,1}, \ldots, \beta_{p,r}\}'$ and $\sigma_{p} = \{\sigma_{p,1}, \ldots, \sigma_{p,s}\}'$ indicate the unknown parameter vectors, and $\varepsilon_{i}$ are independent and identically distributed error terms that follow a standard Laplace distribution, $f(\varepsilon_{i}) = p(1-p)\exp\{[I(\varepsilon_{i} \le 0) - p]\varepsilon_{i}\}$. For any given $p \in (0,1)$, the $p$-quantile of the conditional distribution of $t_{i}$ given $x_{i}$ and $z_{i}$ is $x_{i}'\beta_{p}$ because $P(t_{i} \le x_{i}'\beta_{p} \mid x_{i}, z_{i}) = p$.
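This follows because the $p$-quantile of the standard Laplace error is 0; a one-line check under the stated density is

$$
P(\varepsilon_{i} \le 0) = \int_{-\infty}^{0} p(1-p)\,e^{(1-p)\varepsilon}\,d\varepsilon = p(1-p)\cdot\frac{1}{1-p} = p
$$

so that $P(t_{i} \le x_{i}'\beta_{p} \mid x_{i}, z_{i}) = P(\varepsilon_{i} \le 0) = p$ under model (1).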

The command laplace estimates the $(r+s)$-dimensional parameter vector $\{\beta_{p}', \sigma_{p}'\}'$ by maximizing the Laplace likelihood function described by Bottai and Zhang (2010). It uses an iterative maximization algorithm based on the gradient of the log likelihood that generates a finite sequence of parameter values along which the likelihood increases. Briefly, from a current parameter value, the algorithm searches the positive semiline in the direction of the gradient for a new parameter value where the likelihood is larger. The algorithm stops when the change in the likelihood is less than the specified tolerance. Convergence is guaranteed by the continuity and concavity of the likelihood.

The asymptotic variance of the estimator $\widehat{\beta}_{p}$ for the parameter $\beta_{p}$ is derived by considering the estimating condition reported by Bottai and Zhang (2010, eq. 4), $S(\widehat{\beta}_{p}) = 0$, where

$$
S(\widehat{\beta}_{p}) = \sum_{i=1}^{n} \frac{x_{i}}{\exp(z_{i}'\widehat{\sigma}_{p})} \left\{ p - I(y_{i} \le x_{i}'\widehat{\beta}_{p}) - I(y_{i} \le x_{i}'\widehat{\beta}_{p})(1 - d_{i})\,\frac{p-1}{1 - F(y_{i}|x_{i})} \right\}
$$

with $F(y_{i}|x_{i}) = p\exp\{(1-p)(y_{i} - x_{i}'\widehat{\beta}_{p})/\exp(z_{i}'\widehat{\sigma}_{p})\}$. Following the standard asymptotic theory for method of moments estimators, $\widehat{\beta}_{p}$ approximately follows a normal distribution with mean $\beta_{p}^{*}$ and variance $V$, where $\beta_{p}^{*}$ indicates the expected value of $\widehat{\beta}_{p}$, $V = H(\widehat{\beta}_{p})^{-1}S(\widehat{\beta}_{p})'S(\widehat{\beta}_{p})H(\widehat{\beta}_{p})^{-1}$, and $H(\widehat{\beta}_{p}) = \partial S(\beta_{p})/\partial\beta_{p}'|_{\beta_{p}=\widehat{\beta}_{p}}$. The derivative in $H(\widehat{\beta}_{p})$ is evaluated numerically. Alternatively, the standard errors can be obtained with the bootstrap by specifying the reps() option.
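For the numerical evaluation of $H(\widehat{\beta}_{p})$ mentioned above, a central-difference scheme of the following kind suffices. The function S() below is a generic two-parameter placeholder, not the actual estimating function:

    mata:
    function S(b) return ((b[1]^2 + b[2]) \ (b[1]*b[2]))  // placeholder
    b = (1 \ 2)                       // evaluation point
    h = 1e-6                          // step size
    H = J(2, 2, .)
    for (j = 1; j <= 2; j++) {
        e    = J(2, 1, 0)
        e[j] = h
        H[., j] = (S(b + e) - S(b - e)) / (2*h)  // jth Jacobian column
    }
    H
    end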

5 Simulation

In this section, we present the setup and results of a small simulation study to assess the finite sample performance of the Laplace regression estimator under different data-generating mechanisms. We contrast the performance of Laplace with that of the Kaplan–Meier estimator, a standard, nonparametric, uniformly consistent, and asymptotically normal estimator of the survival function. To generate the survival estimates, we used the sts command.

We generated 500 samples from (1) in each of the six different simulation scenarios that arose from the combination of two sample sizes and three data-generating mechanisms. In each scenario, we estimated five percentiles (p = 0.10, 0.30, 0.50, 0.70, 0.90) with Laplace regression and the Kaplan–Meier estimator. The two sample sizes were n = 100 and n = 1,000. The three different data-generating mechanisms were obtained by changing the values of $z_{i}$, $\sigma_{p}$, and the censoring variable $c_{i}$. In all simulation scenarios, $x_{i} = (1, x_{1,i})'$, with $x_{1,i} \sim \mathrm{Bernoulli}(0.5)$, $\beta_{p} = (5, 3)'$, and $\varepsilon_{i}$ was a standard normal centered at the quantile being estimated.

In scenario number 1, $z_{i} = 1$, $\sigma_{p} = 1$, and the censoring variable was set equal to a constant $c_{i}$ = 1,000 for all individuals. In this scenario, no observations were censored, and Laplace regression was equivalent to ordinary quantile regression. In scenario number 2, $z_{i} = 1$, $\sigma_{p} = 1$, and the censoring variable was generated from the same distribution as the outcome variable $t_{i}$. This ensured an expected censoring rate of 50% in both covariate patterns ($x_{1,i} = 0, 1$). In scenario number 3, $z_{i} = (1, x_{1,i})'$ and $\sigma_{p} = (0.5, 0.5)'$. The censoring variable $c_{i}$ was generated from the same distribution as the outcome variable $t_{i}$. In this scenario, the standard deviation of $t_{i}$ was equal to 0.5 when $x_{1,i} = 0$ and equal to 1 when $x_{1,i} = 1$.
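As a sketch of how one such sample can be drawn (our reading of scenario 2 with n = 100 and p = 0.5; the centering and variable names are assumptions, not the authors' simulation code):

    * one simulated sample in the spirit of scenario 2 (assumed setup)
    clear
    set seed 1234
    set obs 100
    generate x1 = runiform() < .5                         // Bernoulli(0.5)
    generate t  = 5 + 3*x1 + (rnormal() - invnormal(.5))  // error p-quantile at 0
    generate c  = 5 + 3*x1 + (rnormal() - invnormal(.5))  // censoring, same law
    generate y  = min(t, c)
    generate d  = t <= c                                  // failure indicator
    laplace y x1, failure(d)                              // should recover (5, 3)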


The following table shows the observed relative mean squared error multiplied by 1,000 for the predicted quantile in the group x1,i = 1 in each combination of sample size (obs), data-generating scenario (scenario), and percentile (percentile) for Laplace (top entry) and Kaplan–Meier (bottom entry).

. table percentile scenario obs, contents(mean msel mean msekm) format(%4.3f)
> stubwidth(12)

                               obs and scenario
                         100                      1000
  percentile       1      2      3          1      2      3

          10   1.187  1.395  1.268      0.129  0.136  0.126
               1.233  1.496  1.320      0.132  0.140  0.132

          30   0.597  0.685  0.680      0.064  0.073  0.067
               0.606  0.792  0.831      0.064  0.078  0.075

          50   0.496  0.570  0.653      0.053  0.065  0.073
               0.505  0.860  0.941      0.053  0.074  0.075

          70   0.513  0.639  0.711      0.050  0.131  0.144
               0.518  1.329  1.050      0.050  0.113  0.094

          90   0.728  1.661  1.930      0.063  0.876  0.955
               0.731  1.835  1.701      0.063  0.478  0.450

The relative mean squared error was smaller for Laplace than for Kaplan–Meier at lower quantiles and with the smaller sample size.

Figure 2 shows the relative mean squared error of Laplace (x axis) and Kaplan–Meier (y axis) estimators of the quantile in group x1,i = 1 over all simulation scenarios.

The Laplace estimator had fewer extreme values than Kaplan–Meier. The overall concordance correlation coefficient (command concord) was 72.2%. After the 10% largest differences were excluded, the coefficient was 99.1%.


[Figure 2 appears here: relative MSE of Kaplan–Meier (y axis, 0 to 30) against relative MSE of Laplace (x axis, 0 to 20).]

Figure 2. Relative mean squared error of Laplace (x axis) and Kaplan–Meier (y axis) estimators of the percentiles in group x1,i = 1 over all simulation scenarios. The solid 45-degree line indicates the equal relative mean squared error of the two estimators.

The following two tables show the performance of the estimator of the asymptotic standard error for the regression coefficients βp,0 (first table) and βp,1 (second table). In each cell of each table, the top entry is the average estimated asymptotic standard error, and the bottom entry is the corresponding observed standard deviation across the simulated samples.

. table percentile scenario obs, contents(mean s0 mean ms0) format(%4.3f)
> stubwidth(12)

                               obs and scenario
                         100                      1000
  percentile       1      2      3          1      2      3

          10   0.237  0.228  0.131      0.076  0.077  0.039
               0.235  0.251  0.123      0.073  0.082  0.039

          30   0.185  0.200  0.098      0.059  0.062  0.031
               0.182  0.193  0.097      0.058  0.067  0.032

          50   0.176  0.194  0.097      0.056  0.060  0.030
               0.169  0.185  0.093      0.053  0.064  0.032

          70   0.188  0.198  0.098      0.059  0.064  0.032
               0.185  0.207  0.103      0.057  0.071  0.035

          90   0.225  0.227  0.114      0.077  0.076  0.038
               0.231  0.255  0.141      0.072  0.087  0.046


. table percentile scenario obs, contents(mean s1 mean ms1) format(%4.3f)
> stubwidth(12)

                               obs and scenario
                         100                      1000
  percentile       1      2      3          1      2      3

          10   0.349  0.353  0.276      0.109  0.110  0.087
               0.330  0.351  0.263      0.104  0.113  0.088

          30   0.277  0.292  0.232      0.084  0.089  0.070
               0.265  0.269  0.216      0.079  0.092  0.066

          50   0.255  0.279  0.219      0.080  0.086  0.068
               0.250  0.257  0.226      0.077  0.086  0.073

          70   0.272  0.293  0.227      0.084  0.090  0.070
               0.265  0.277  0.236      0.081  0.094  0.076

          90   0.337  0.339  0.246      0.109  0.108  0.085
               0.325  0.320  0.284      0.104  0.109  0.098

The estimated standard errors were similar to the observed standard deviation across all cells for both regression coefficients.

6 Acknowledgment

Nicola Orsini was partly supported by a Young Scholar Award from the Karolinska Institutet's Strategic Program in Epidemiology.

7 References

Bottai, M., and J. Zhang. 2010. Laplace regression with censored data. Biometrical Journal 52: 487–503.

Jolliffe, D., B. Krushelnytskyy, and A. Semykina. 2000. sg153: Censored least absolute deviations estimator: CLAD. Stata Technical Bulletin 58: 13–16. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 240–244. College Station, TX: Stata Press.

Medical Research Council Renal Cancer Collaborators. 1999. Interferon-α and survival in metastatic renal carcinoma: Early results of a randomised controlled trial. Lancet 353: 14–17.

Powell, J. L. 1986. Censored regression quantiles. Journal of Econometrics 32: 143–155.

Royston, P., W. Sauerbrei, and A. Ritchie. 2004. Is treatment with interferon-alpha effective in all patients with metastatic renal carcinoma? A new approach to the investigation of interactions. British Journal of Cancer 90: 794–799.


About the author

Matteo Bottai is a professor of biostatistics in the Unit of Biostatistics at the Institute of Environmental Medicine at Karolinska Institutet in Stockholm, Sweden.

Nicola Orsini is an associate professor of medical statistics and an assistant professor of epidemiology in the Unit of Biostatistics and the Unit of Nutritional Epidemiology at the Institute of Environmental Medicine at Karolinska Institutet in Stockholm, Sweden.


The Stata Journal (2013) 13, Number 2, pp. 315–322

Importing U.S. exchange rate data from the Federal Reserve and standardizing country names across datasets

Betul Dicle
New Orleans, LA
[email protected]

John Levendis
Loyola University New Orleans
New Orleans, LA
[email protected]

Mehmet F. Dicle
Loyola University New Orleans
New Orleans, LA
[email protected]

Abstract. fxrates is a command to import historical U.S. exchange rate data from the Federal Reserve and to calculate the daily change of the exchange rates. Because many cross-country datasets use different spellings and conventions for country names, we also introduce a second command, countrynames, to convert country names to a common naming standard.

Keywords: dm0069, fxrates, countrynames, exchange rates, country names, standardization, data management, historical data

1 Introduction

Economic and financial researchers must often convert between currencies to facilitate cross-country comparisons. We provide a command, fxrates, that downloads daily foreign exchange rates relative to the U.S. dollar from the Federal Reserve's database.

Working with multiple cross-country datasets, such as international foreign exchange rates, introduces a unique problem: variations in country names. They are often spelled differently or follow different grammatical conventions across datasets. For example, North Korea is often different among datasets; it could be "North Korea", "Korea, North", "Korea, Democratic People's Republic", or even "Korea, DPR". Likewise, "United States of America" is often "United States", "USA", "U.S.A.", "U.S.", or "US". A dataset may have country names in all caps. Country names could also have inadvertent leading or trailing spaces. Thus we provide a second command, countrynames, that renames many country names to follow a standard convention. The command is, of course, editable, so researchers may opt to use their own naming preferences.

© 2013 StataCorp LP dm0069


2 The fxrates command

2.1 Syntax

fxrates [namelist] [, period(2000 | 1999 | 1989) chg(ln | per | sper) save(filename)]

2.2 Options

namelist is a list of country abbreviations for the countries whose foreign exchange data you wish to download from the Federal Reserve's website. Exchange rates for all available countries will be downloaded if namelist is omitted. The list of countries includes the following:

    al  Australia                              ma  Malaysia
    au  Austria                                mx  Mexico
    be  Belgium                                ne  Netherlands
    bz  Brazil                                 nz  New Zealand
    ca  Canada                                 no  Norway
    ch  China, P.R.                            po  Portugal
    dn  Denmark                                si  Singapore
    eu  Economic and Monetary Union            sf  South Africa
        member countries                       ko  South Korea
    ec  European Union                         sp  Spain
    fn  Finland                                sl  Sri Lanka
    fr  France                                 sd  Sweden
    ge  Germany                                sz  Switzerland
    gr  Greece                                 ta  Taiwan
    hk  Hong Kong                              th  Thailand
    in  India                                  uk  United Kingdom
    ir  Ireland                                ve  Venezuela
    it  Italy
    ja  Japan

period(2000 | 1999 | 1989) specifies which block of dates to download. The Federal Reserve foreign exchange database is separated into three blocks: one ending in 1989, a second for 1990–1999, and a third for 2000 through the present. The default (obtained by omitting period()) is to download the three separate files and merge them automatically so that the user has all foreign exchange market data available. You can specify one or more periods. If you know which data range you wish to download, however, you can save time by specifying which of the three blocks to download. Specifying all three periods is equivalent to the default of downloading all the data.


chg(ln | per | sper) is the periodic return. Three different percent changes can be calculated for the adjusted closing price: natural log difference, percentage change, and symmetrical percentage change. Whenever one of these is specified, a new variable is created with the appropriate prefix: ln for the first-difference of logs method, per for the percent change, and sper for the symmetric percent change. (A hand computation of all three follows these options.)

save(filename) is the output filename. filename is created under the current working directory.
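For reference, the three changes can be reproduced by hand. In the sketch below, _fr is the French franc series created by fxrates, and the formulas (log difference, arithmetic percent change, and midpoint-based symmetric percent change) are our reading of the options, not code taken from the package:

    * hand computation of the three chg() definitions (assumed formulas)
    sort date
    generate ln_fr2   = ln(_fr/_fr[_n-1])
    generate per_fr2  = (_fr - _fr[_n-1])/_fr[_n-1]
    generate sper_fr2 = (_fr - _fr[_n-1])/((_fr + _fr[_n-1])/2)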

2.3 Using fxrates to import historical exchange rate data

Example

In this example, we use fxrates to import the entire daily exchange rate dataset from the Federal Reserve. Because we did not specify the countries, fxrates downloads data from all countries. Because we did not specify the period, fxrates defaults to downloading data for all available dates.

. fxrates
au does not have 00
be does not have 00
  (output omitted )
ve does not have 89

. summarize

    Variable         Obs        Mean    Std. Dev.        Min        Max

        date       10551     11403.8    4264.338       4018      18788
         _al       10145    .8764938    .2391958      .4828     1.4885
         _au        7013    15.21975    3.999031     9.5381    26.0752
         _be        7021    38.61327    8.036983      27.12       69.6
         _bz        4134       1.944    .6851041       .832      3.945
         _ca       10158    1.227691    .1689277      .9168     1.6128
         _ch        7592    6.110467    2.380834     1.5264     8.7409
         _dn       10151    6.702978     1.32041     4.6605    12.3725
         _eu        3131    1.197019    .1959648       .827      1.601
         _ec        4902    1.137739    .1776148      .6476     1.4557
  (output omitted )
         _sd       10151    6.633961       1.631      3.867     11.027
         _sz       10152    1.796176    .7191334      .8352      4.318
         _ta        6665    31.36318    4.002814     24.507       40.6
         _th        7571    30.99084    7.293934      20.36       56.1
         _uk       10152    1.779829    .3176284      1.052      2.644
         _ve        4127    1.528076    1.111678      .1697        4.3


Output such as "au does not have 00" indicates that there were no observations in a particular block of years (in this case, the 2000–present block) for the particular country. When this appears, it is most often the case that the currency has been discontinued, as when Austria started using the euro.

Example

In this second example, we download the exchange rates of the U.S. dollar versus the French franc, the German deutschmark, and the Hong Kong dollar for all the available dates.

. fxrates fr ge hk
fr does not have 00
ge does not have 00

. summarize

    Variable         Obs        Mean    Std. Dev.        Min        Max

        date       10550     11404.5    4263.934       4021      18788
         _fr        7021    5.673864    1.227564     3.8462      10.56
         _ge        7021    2.143872    .5509681     1.3565      3.645
         _hk        7652     7.63056    .5069678      5.127        8.7

Example

In this example, we download the exchange rate data for United States versus France, Germany, and Hong Kong. Because no period was specified, fxrates downloads the data from all available dates. We also specified that fxrates calculate the daily percent change, calculated in two different ways: as the log first-difference and as the arithmetic daily percent change. The log-difference percent change for each country is prefixed by ln; the arithmetic percent change for each country is prefixed by per.

. fxrates fr ge hk, chg(ln per)
fr does not have 00
ge does not have 00

. summarize

    Variable         Obs        Mean    Std. Dev.        Min        Max

        date       10550     11404.5    4263.934       4021      18788
         _fr        7021    5.673864    1.227564     3.8462      10.56
         _ge        7021    2.143872    .5509681     1.3565      3.645
         _hk        7652     7.63056    .5069678      5.127        8.7
       ln_fr        6743   -.0000122    .0061762  -.0416059   .0587457
      per_fr        6743    6.85e-06    .0061803  -.0407522   .0605055
       ln_ge        6743   -.0001283    .0064045  -.0414075   .0586776
      per_ge        6743   -.0001078    .0064049  -.0405619   .0604333
       ln_hk        7363     .000054    .0023756  -.0410614   .0653051
      per_hk        7363    .0000568    .0023914  -.0402298   .0674847


Example

In this final example, we download the U.S. dollar exchange rate versus the Japanese yen and the Mexican peso. We calculate the daily percent change by calculating the first-differences of natural logs for the data ending in 1999 (that is, for the data ending in 1989 plus the data from 1990 through 1999).

. fxrates ja mx, period(1999 1989) chg(ln)
mx does not have 89

. summarize

    Variable         Obs        Mean    Std. Dev.        Min        Max

        date        7565        9315     3057.56       4021      14609
         _ja        7267    195.5763    74.42725      81.12     358.44
         _mx        1541    7.258319    2.154108        3.1      10.63
       ln_ja        6980   -.0001587    .0063255   -.056302   .0625558
       ln_mx        1477    .0005535    .0132652  -.1796934   .1926843

3 The countrynames command

3.1 Syntax

countrynames countryvar

3.2 Description

The command countrynames changes the name of a country in a dataset to correspond to a more standard set of names. By default, countrynames creates a new variable, _changed, containing numeric codes that indicate which country names have been changed. A code of 0 indicates no change; a code of 1 indicates that the country's name has been changed. We recommend you run countrynames on both datasets whenever two different cross-country datasets are being merged. This minimizes the chance that a difference in names between datasets will prevent a proper merge from occurring. However, if you wish to keep a variable with the original names, you need to copy the variable to another variable. For example, before running countrynames country, you would need to type generate origcountry = country.
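Internally, the renaming amounts to a long list of rules. The fragment below is only an illustration of the idea in ordinary Stata code; the actual ado-file's rules and coding may differ:

    * illustrative renaming rules in the style of countrynames (sketch)
    replace country = trim(itrim(country))      // drop stray spaces
    generate byte _changed = 0
    foreach bad in "United States of America" "USA" "U.S.A." "U.S." "US" {
        replace _changed = 1 if country == "`bad'"
        replace country = "United States" if country == "`bad'"
    }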


3.3 Using the countrynames command to convert country names to a common naming standard

Example

In this example, we use two macroeconomic datasets that have countries named slightly differently. The first dataset is native to and shipped with Stata.

. sysuse educ99gdp, clear
(Education and GDP)

Though the dataset is very small, it suffices for our purposes. Notice the spelling of United States in this dataset.

. list

        country         public   private

  1.    Australia           .7        .7
  2.    Britain             .7        .4
  3.    Canada             1.5        .9
  4.    Denmark            1.5        .1
  5.    France              .9        .4
  6.    Germany             .9        .2
  7.    Ireland            1.1        .3
  8.    Netherlands          1        .4
  9.    Sweden             1.5        .2
 10.    United States      1.1       1.2

. save temp1.dta, replace
(note: file temp1.dta not found)
file temp1.dta saved

In fact, all the spellings in this dataset correspond with the preferred names listed in countrynames, so nothing is required of us here. We could run countrynames just to be on the safe side, but it would not have any effect. It is, however, good practice to run countrynames whenever merging datasets to maximize the chances that the two datasets use the same country names.

The second dataset, using World Health Organization data, is from Kohler and Kreuter (2005). The data are available from the Stata website.

. net from http://www.stata-press.com/data/kk/

(output omitted )

. net get data

(output omitted )

. use who2001.dta, clear


Notice how the United States is called United States of America in this dataset.

. list country

        country

  1.    Afghanistan
  2.    Albania
        (output omitted )
180.    United States of America
        (output omitted )
187.    Zambia
188.    Zimbabwe

We now run countrynames on this dataset to standardize the names of the countries. This will rename United States of America to United States, as it was in the first dataset.

. countrynames country

. list country _changed

        country          _changed

  1.    Afghanistan             0
  2.    Albania                 0
        (output omitted )
180.    United States           1
        (output omitted )
187.    Zambia                  0
188.    Zimbabwe                0

Notice that the generated variable, _changed, is equal to 1 for the United States entry; this indicates that its name was once something different.

Having run countrynames on both datasets, we have increased the chances that countries in both datasets follow the same naming convention. We are now safe to merge the datasets:

. drop _changed

. sort country

. merge 1:1 country using temp1.dta

    Result                           # of obs.

    not matched                            180
        from master                        179  (_merge==1)
        from using                           1  (_merge==2)

    matched                                  9  (_merge==3)


The merge results table above is important: It is the result of merging a dataset that used the countrynames command (master: who2001.dta) with a dataset that did not use the countrynames command (using: temp1.dta). If the dataset using the command includes a country name that is not renamed with countrynames, then it will appear in the merge results table.

. sort country

. list country

        country

  1.    Afghanistan
  2.    Albania
        (output omitted )
180.    United Arab Emirates
        (output omitted )
188.    Zambia
189.    Zimbabwe

3.4 How to edit preferred country names within the countrynames command

It is possible to add, remove, or change country name entries within the countrynames command. After opening the countrynames.ado file with a do-file editor (any text editor), you can delete country name entries, add new entries, or change spellings according to your preferences. Any changes made to the countrynames.ado file should be saved. The discard command will refresh the countrynames command with your updates along with all the ado installations to Stata. We recommend that you confirm updates to the countrynames command with a merge table.1
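In practice, the edit cycle is short (a sketch, assuming the ado-file sits in your working directory or adopath):

. doedit countrynames.ado
. discard

The discard command drops the cached copy of the program so that Stata reloads your edited version the next time countrynames is run.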

4 Reference

Kohler, U., and F. Kreuter. 2005. Data Analysis Using Stata. College Station, TX: Stata Press.

About the authors

Betul Dicle recently earned her PhD from the political science department of Louisiana State University.

John Levendis is an assistant professor of economics at Loyola University New Orleans.

Mehmet F. Dicle is an assistant professor of finance at Loyola University New Orleans.

1. Please note that if we ever do an update to our program, the user edits to the ado-file will be lost when users grab the updated ado-file.


The Stata Journal (2013) 13, Number 2, pp. 323–328

Generating Manhattan plots in Stata

Daniel E. Cook
University of Iowa
Iowa City, IA
[email protected]

Kelli R. Ryckman
University of Iowa
Iowa City, IA
[email protected]

Jeffrey C. Murray
University of Iowa
Iowa City, IA
[email protected]

Abstract. Genome-wide association studies hold the potential for discovering genetic causes for a wide range of diseases, traits, and behaviors. However, the incredible amount of data handling, advanced statistics, and visualization have made conducting these studies difficult for researchers. Here we provide a tool, manhattan, for helping investigators easily visualize genome-wide association studies data in Stata.

Keywords: st0295, manhattan, Manhattan plots, genome-wide association studies, single nucleotide polymorphisms

© 2013 StataCorp LP st0295

1 Introduction

The number of published genome-wide association studies (GWAS) has seen a staggering level of growth from 453 in 2007 to 2,137 in 2010 (Hindorff et al. 2011). These studies aim to identify the genetic cause for a wide range of diseases, including Alzheimer's (Harold et al. 2009), cancer (Hunter et al. 2007), and diabetes (Hayes et al. 2007), and to elucidate variability in traits, behavior, and other phenotypes. This is accomplished by looking at hundreds of thousands to millions of single nucleotide polymorphisms and other genetic features across upward of 10,000 individual genomes (Corvin, Craddock, and Sullivan 2010). These studies generate enormous amounts of data, which present challenges for researchers in handling data, conducting statistics, and visualizing data (Buckingham 2008).

One method of visualizing GWAS data is through the use of Manhattan plots, so called because of their resemblance to the Manhattan skyline. Manhattan plots are scatterplots, but they are graphed in a characteristic way. To create a Manhattan plot, you need to calculate p-values, which are generated through one of a variety of statistical tests. However, because of the large number of hypotheses being tested in a GWAS, local significance levels typically fall below p = 10^-5 (Ziegler, Konig, and Thompson 2008). Resulting p-values associated with each marker are -log10 transformed and plotted on the y axis against their chromosomal position on the x axis. Chromosomes lie end to end on the x axis and often include the 22 autosomal chromosomes and the X, Y, and mitochondrial chromosomes.

Manhattan plots are useful for a variety of reasons. They allow investigators to visualize hundreds of thousands to millions of p-values across an entire genome and to quickly identify potential genetic features associated with phenotypes. They also enable investigators to identify clusters of genetic features, which associate because of linkage disequilibrium. They can be used diagnostically, to ensure GWAS data are coded and formatted appropriately. Finally, they offer an easily interpretable graphical format to present signals with formal levels of significance. For these reasons, Manhattan plots are a common feature of GWAS publications.

While Manhattan plots are in essence scatterplots, formatting GWAS datasets for their generation can be difficult and time consuming. To help researchers in this process, we have developed a program executed through a new command, manhattan, that formats data appropriately for plotting and allows for annotation and customization options of Manhattan plots.

2 Data formatting

Following data cleaning and statistical tests, researchers are typically left with a dataset consisting of, at a minimum, a list of genetic features (string), p-values (real), chromosomes (integer), and their base pair location on a chromosome (integer). Using the manhattan command, a user specifies these variables. manhattan uses temporary variables to manipulate data into a format necessary for plotting. The program first identifies the number of chromosomes present and generates base pair locations relative to their distance from the beginning of the first chromosome as if they were laid end to end in numerical order. The format in which p-values are specified is detected and, if need be, log transformed. manhattan then calculates the median base pair location of each chromosome as locations to place labels. Labels are generated by using chromosome numbers except for the sex chromosomes and mitochondrial chromosomes, which define chromosomes 23, 24, and 25 with the X, Y, and M labels, respectively.
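The cumulative-position logic can be sketched in a few lines of ordinary Stata code. The variable names chr, bp, and pvalue follow the example dataset used below; this is an illustration of the idea, not manhattan's internal code.

    * sketch of the genome-wide x-axis position used in Manhattan plots
    preserve
    collapse (max) maxbp = bp, by(chr)
    sort chr
    generate double offset = sum(maxbp) - maxbp  // length of preceding chromosomes
    keep chr offset
    tempfile off
    save `off'
    restore
    merge m:1 chr using `off', nogenerate
    generate double xpos = bp + offset           // position along the genome
    generate double logp = -log10(pvalue)        // y axis: -log10 p-values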

Once data have been reformatted in manhattan, plots are generated. Additional options may require additional data manipulation. These options include spacing(), bonferroni(), and mlabel().

3 The manhattan command

3.1 Syntax

manhattan chromosome base-pair pvalue [if] [, options]

options are listed in section 3.2.


3.2 Options

    options               Description

    Plot options
      title(string)       display a title
      caption(string)     display a caption
      xlabel(string)      set x label; default is xlabel(Chromosome)
      width(#)            set width of plot; default is width(15)
      height(#)           set height of plot; default is height(5)

    Chromosome options
      x(#)                specify chromosome number to be labeled as X;
                            default is x(23)
      y(#)                specify chromosome number to be labeled as Y;
                            default is y(24)
      mito(#)             specify chromosome number to be labeled as M;
                            default is mito(25)

    Graph options
      bonferroni(h|v|n)   draw a line at the Bonferroni significance level;
                            label the line with a horizontal (h), vertical (v),
                            or no (n) label
      mlabel(var)         set a variable to use for labeling markers
      mthreshold(#|b)     set a -log(p-value) above which markers will be
                            labeled, or use b to set your threshold at the
                            Bonferroni significance level
      yline(#)            set the -log(p-value) at which to draw a line
      labelyline(h|v)     label the line specified with yline() by using a
                            horizontal (h) or vertical (v) label
      addmargin           add a margin to the left and right of the plot,
                            leaving room for labels

    Style options
      color1(color)       set first color of markers
      color2(color)       set second color of markers
      linecolor(color)    set the color of the Bonferroni line and label
                            or y line and label

4 Examples

The following examples were created using manhattan_gwas.dta, which is available as an ancillary file within the manhattan package. All the p-values were generated randomly; therefore, all genetic elements are in linkage equilibrium and are not linked.


4.1 Example 1

Below you will find a typical Manhattan plot generated with manhattan. Several options were specified in the generation of this plot. First, bonferroni(h) is used to specify that a line be drawn at the Bonferroni level of significance. The h indicates that the label should be placed horizontally, on the line. Next, mlabel(snp) is used to indicate that markers should be labeled with the variable snp, which contains the names of each marker. Additionally, mthreshold(b) is used to set a value at which to begin labeling markers. In this case, b is used to indicate that markers should be labeled at -log10(p-values) greater than the Bonferroni significance level. Finally, addmargin is used to add space on either side of the plot to prevent labels from running off the plot.

. manhattan chr bp pvalue, bonferroni(h) mlabel(snp) mthreshold(b) addmargin
p-values log transformed.
Bonferroni Correction -log10(p) = 5.2891339
Label threshold set to Bonferroni value.

[Manhattan plot appears here: -log10 p-values (y axis, 0 to 8) by chromosome (x axis, 1-22 and X), with the Bonferroni line drawn at 5.29 and the markers above it labeled snp_38620, snp_29084, snp_81068, snp_69797, snp_94406, snp_97994, snp_49775, and snp_63831.]

4.2 Example 2

Here yline(6.5) is used to draw a horizontal line at -log10(p-value) = 6.5, and labelyline(v) adds an axis label for the value of this line. Additionally, the variable used for marker labels is identified using mlabel(snp), and a threshold at which to begin adding labels to markers is given as the same value as the horizontal line by using mthreshold(6.5). Spacing is added between chromosomes with spacing(1) to keep labels on the x axis from running into one another. Finally, a margin is added on either side of the plot by using addmargin, because some of the marker labels would otherwise fall off the plot.

The colors of the markers are changed with color1(black) and color2(gray). The color of the line drawn with yline() has been changed to black by using linecolor(black).


. manhattan chr bp pvalue, yline(6.5) labelyline(v) mlabel(snp) mthreshold(6.5)
> spacing(1) addmargin color1(black) color2(gray) linecolor(black)
p-values log transformed.

[Manhattan plot appears here: -log10 p-values (y axis, 0 to 8) by chromosome (x axis, 1-22 and X), drawn in black and gray, with a horizontal line at 6.5 and the single marker above it labeled snp_97994.]

5 Conclusions

As the number of GWAS publications continues to grow, easier tools are needed for investigators to manipulate, perform statistics on, and visualize data. manhattan aims to provide an easier, more standard method by which to visualize GWAS data in Stata. We welcome help in the development of manhattan by users and hope to improve manhattan in response to user suggestions and comments.

6 Acknowledgments

This work was supported by the March of Dimes (1-FY05-126 and 6-FY08-260), the National Institutes of Health (R01 HD-52953, R01 HD-57192), and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (K99 HD-065786). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Eunice Kennedy Shriver National Institute of Child Health and Human Development.

7 References

Buckingham, S. D. 2008. Scientific software: Seeing the SNPs between us. Nature Methods 5: 903–908.

Corvin, A., N. Craddock, and P. F. Sullivan. 2010. Genome-wide association studies: A primer. Psychological Medicine 40: 1063–1077.

Harold, D., R. Abraham, P. Hollingworth, R. Sims, A. Gerrish, M. L. Hamshere, J. Singh Pahwa, V. Moskvina, K. Dowzell, A. Williams, N. Jones, C. Thomas, A. Stretton, A. R. Morgan, S. Lovestone, J. Powell, P. Proitsi, M. K. Lupton, C. Brayne, D. C. Rubinsztein, M. Gill, B. Lawlor, A. Lynch, K. Morgan, K. S. Brown, P. A. Passmore, D. Craig, B. McGuinness, S. Todd, C. Holmes, D. Mann, A. D. Smith, S. Love, P. G. Kehoe, J. Hardy, S. Mead, N. Fox, M. Rossor, J. Collinge, W. Maier, F. Jessen, B. Schurmann, H. van den Bussche, I. Heuser, J. Kornhuber, J. Wiltfang, M. Dichgans, L. Frolich, H. Hampel, M. Hull, D. Rujescu, A. M. Goate, J. S. K. Kauwe, C. Cruchaga, P. Nowotny, J. C. Morris, K. Mayo, K. Sleegers, K. Bettens, S. Engelborghs, P. P. De Deyn, C. Van Broeckhoven, G. Livingston, N. J. Bass, H. Gurling, A. McQuillin, R. Gwilliam, P. Deloukas, A. Al-Chalabi, C. E. Shaw, M. Tsolaki, A. B. Singleton, R. Guerreiro, T. W. Muhleisen, M. M. Nothen, S. Moebus, K.-H. Jockel, N. Klopp, H.-E. Wichmann, M. M. Carrasquillo, V. S. Pankratz, S. G. Younkin, P. A. Holmans, M. O'Donovan, M. J. Owen, and J. Williams. 2009. Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer's disease. Nature Genetics 41: 1088–1093.

Hayes, M. G., A. Pluzhnikov, K. Miyake, Y. Sun, M. C. Y. Ng, C. A. Roe, J. E. Below, R. I. Nicolae, A. Konkashbaev, G. I. Bell, N. J. Cox, and C. L. Hanis. 2007. Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies. Diabetes 56: 3033–3044.

Hindorff, L. A., J. MacArthur, A. Wise, H. A. Junkins, P. N. Hall, A. K. Klemm, and T. A. Manolio. 2011. A catalog of published genome-wide association studies. http://www.genome.gov/gwastudies/.

Hunter, D. J., P. Kraft, K. B. Jacobs, D. G. Cox, M. Yeager, S. E. Hankinson, S. Wacholder, Z. Wang, R. Welch, A. Hutchinson, J. Wang, K. Yu, N. Chatterjee, N. Orr, W. C. Willett, G. A. Colditz, R. G. Ziegler, C. D. Berg, S. S. Buys, C. A. McCarty, H. S. Feigelson, E. E. Calle, M. J. Thun, R. B. Hayes, M. Tucker, D. S. Gerhard, J. F. Fraumeni, Jr., R. N. Hoover, G. Thomas, and S. J. Chanock. 2007. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics 39: 870–874.

Ziegler, A., I. R. Konig, and J. R. Thompson. 2008. Biostatistical aspects of genome-wide association studies. Biometrical Journal 50: 8–28.

About the authors

Daniel E. Cook is a research assistant in the Department of Pediatrics at the University of Iowa. His research focuses on genetics and bioinformatics approaches.

Kelli K. Ryckman is an Associate Research Scientist in the Department of Pediatrics at the University of Iowa. Her research focuses on the genetics and metabolomics of maternal and fetal complications in pregnancy.

Jeffrey C. Murray received his medical degree from Tufts Medical School in Boston in 1978. He has been conducting research at the University of Iowa since 1984. The Murray laboratory is focused on identifying genetic and environmental causes of complex diseases, specifically premature birth and birth defects such as cleft lip and palate.


The Stata Journal (2013) 13, Number 2, pp. 329–336

Semiparametric fixed-effects estimator

François Libois
University of Namur
Centre for Research in the Economics of Development (CRED)
Namur, Belgium
[email protected]

Vincenzo Verardi
University of Namur
Centre for Research in the Economics of Development (CRED)
Namur, Belgium
and
Université Libre de Bruxelles
European Center for Advanced Research in Economics and Statistics (ECARES)
and Center for Knowledge Economics (CKE)
Brussels, Belgium
[email protected]

Abstract. In this article, we describe the Stata implementation of Baltagi and Li's (2002, Annals of Economics and Finance 3: 103–116) series estimator of partially linear panel-data models with fixed effects. After a brief description of the estimator itself, we describe the new command xtsemipar. We then simulate data to show that this estimator performs better than a fixed-effects estimator if the relationship between two variables is unknown or quite complex.

Keywords: st0296, xtsemipar, semiparametric estimations, panel data, fixed effects

1 Introduction

The objective of this article is to present a Stata implementation of Baltagi and Li's (2002) series estimation of partially linear panel-data models.

The structure of the article is as follows. Section 2 describes Baltagi and Li's (2002) fixed-effects semiparametric regression estimator. Section 3 presents the implemented Stata command (xtsemipar). Some simple simulations assessing the performance of the estimator are shown in section 4. Section 5 provides a conclusion.

© 2013 StataCorp LP st0296


2 Estimation method

2.1 Baltagi and Li's (2002) semiparametric fixed-effects regression estimator

Consider a general panel-data semiparametric model with distributed intercept of the type

$$y_{it} = x_{it}\theta + f(z_{it}) + \alpha_i + \varepsilon_{it}, \qquad i = 1, \dots, N;\ t = 1, \dots, T \text{ where } T \ll N \tag{1}$$

To eliminate the fixed effects $\alpha_i$, a common procedure, inter alia, is to differentiate (1) over time, which leads to

$$y_{it} - y_{it-1} = (x_{it} - x_{it-1})\theta + \{f(z_{it}) - f(z_{it-1})\} + \varepsilon_{it} - \varepsilon_{it-1} \tag{2}$$

An evident problem here is to consistently estimate the unknown function of $z$, $G(z_{it}, z_{it-1}) \equiv \{f(z_{it}) - f(z_{it-1})\}$. What Baltagi and Li (2002) propose is to approximate $f(z)$ by series $p^k(z)$ [and therefore approximate $G(z_{it}, z_{it-1}) = \{f(z_{it}) - f(z_{it-1})\}$ by $p^k(z_{it}, z_{it-1}) = \{p^k(z_{it}) - p^k(z_{it-1})\}$], where $p^k(z)$ are the first $k$ terms of a sequence of functions $[p_1(z), p_2(z), \dots]$. They then demonstrate the $\sqrt{N}$ normality for the estimator of the parametric component (that is, $\theta$) and the consistency at the standard nonparametric rate of the estimated unknown function [that is, $f(\cdot)$]. Equation (2) therefore boils down to

$$y_{it} - y_{it-1} = (x_{it} - x_{it-1})\theta + \{p^k(z_{it}) - p^k(z_{it-1})\}\gamma + \varepsilon_{it} - \varepsilon_{it-1} \tag{3}$$

which can be consistently estimated by using ordinary least squares. Having estimated $\theta$ and $\gamma$, we propose to fit the fixed effects $\alpha_i$ and go back to (1) to estimate the error component residual

$$u_{it} = y_{it} - x_{it}\theta - \alpha_i = f(z_{it}) + \varepsilon_{it} \tag{4}$$

The curve $f$ can be fit by regressing $u_{it}$ on $z_{it}$ by using some standard nonparametric regression estimator.
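To make the sequence of steps concrete, the following do-file sketch mimics the procedure with official Stata commands and hypothetical variables (outcome y, regressor x, nonparametric variable z, panel id, time t); mkspline stands in for the B-spline basis that xtsemipar builds via bspline, and lpoly for the final nonparametric fit:

xtset id t
mkspline zs = z, cubic nknots(5)              // spline basis for z
foreach v of varlist y x zs1-zs4 {
    bysort id (t): generate double D`v' = `v' - `v'[_n-1]
}
regress Dy Dx Dzs1-Dzs4, noconstant           // fit (3) by OLS
generate double uhat = y - _b[Dx]*x           // cf. (4); the estimated fixed
                                              // effects would also be removed
lpoly uhat z                                  // nonparametric fit of f(z)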

A typical example of a $p^k$ series is a spline, which is a piecewise polynomial with pieces defined by a sequence of knots $c_1 < c_2 < \cdots < c_k$, where they join smoothly.

The simplest case is a linear spline. For a spline of degree $m$, the polynomials and their first $m-1$ derivatives agree at the knots, so $m-1$ derivatives are continuous (see Royston and Sauerbrei [2007] for further details).

A spline of degree $m$ with $k$ knots can be represented as a power series:

$$S(z) = \sum_{j=0}^{m} \zeta_j z^j + \sum_{j=1}^{k} \lambda_j (z - c_j)^m_+ \quad\text{where}\quad (z - c_j)^m_+ = \begin{cases} (z - c_j)^m & \text{if } z > c_j \\ 0 & \text{otherwise} \end{cases}$$


The problem here is that successive terms tend to be highly correlated. A probably better representation of splines is a linear combination of a set of basic splines called ($k$th degree) B-splines, which are defined for a set of $k+2$ consecutive knots $c_1 < c_2 < \cdots < c_{k+2}$ as

$$B(z, c_1, \dots, c_{k+2}) = (k+1) \sum_{j=1}^{k+2} \left\{ \prod_{1 \le h \le k+2,\, h \ne j} (c_h - c_j) \right\}^{-1} (z - c_j)^k_+$$

B-splines are intrinsically a rescaling of each of the piecewise functions. The technicalities of this method are beyond the scope of this article, and we refer the reader to Newson (2000b) for further details.

We implemented this estimator in Stata under the command xtsemipar, which we describe below.

3 The xtsemipar command

The xtsemipar command fits Baltagi and Li's double series fixed-effects estimator in the case of a single variable entering the model nonparametrically. Running the xtsemipar command requires the prior installation of the bspline package developed by Newson (2000a).

The general syntax for the command is

xtsemipar varlist [if] [in] [weight], nonpar(varname)
    [generate([string1] string2) degree(#) knots1(numlist) nograph spline
    knots2(numlist) bwidth(#) robust cluster(varname) ci level(#)]

The first option, nonpar(), is required. It declares the variable that enters the model nonparametrically. None of the remaining options are compulsory. The user has the opportunity to recover the error component residual—the left-hand side of (4)—whose name can be chosen by specifying string2. This error component can then be used to draw any kind of nonparametric regression. Because the error component has already been partialled out from fixed effects and from the parametrically dependent variables, this amounts to estimating the net nonparametric relation between the dependent variable and the variable that enters the model nonparametrically. By default, xtsemipar reports one estimation of this net relationship. string1 makes it possible to reproduce the values of the fitted dependent variable. Note that the plot of residuals is recentered around its mean. The remaining part of this section describes options that affect this fit.

A key option in the quality of the fit is degree(). It determines the power of the B-splines that are used to consistently estimate the function resulting from the first difference of the $f(z_{it})$ and $f(z_{it-1})$ functions. The default is degree(4). If the nograph option is not specified—that is, the user wants the graph of the nonparametric fit of the variable in nonpar() to appear—degree() will also determine the degree of the local weighted polynomial fit used in the Epanechnikov kernel performed at the last-stage fit. If spline is specified, this last nonparametric estimation will also be estimated by the B-spline method, and degree() is then the power of these splines. knots1() and knots2() are both rarely used. They define a list of knots where the different pieces of the splines agree. If left unspecified, the number and location of the knots will be chosen optimally, which is the most common practice. knots1() refers to the B-spline estimation in (3). knots2() can only be used if the spline option is specified and refers to the last-stage fit. More details about B-splines can be found in Newson (2000b). The bwidth() option can only be used if spline is not specified. It gives the half-width of the smoothing window in the Epanechnikov kernel estimation. If left unspecified, a rule-of-thumb bandwidth estimator is calculated and used (see [R] lpoly for more details).

The remaining options refer to the inference. The robust and cluster() options correct the inference, respectively, for heteroskedasticity and for clustering of error terms. In the graph, confidence intervals can be displayed as a shaded area around the curve of fitted values by specifying the option ci. Confidence intervals are set to 95% by default; however, it is possible to modify them by setting a different confidence level through the level() option. This affects the confidence intervals both in the nonparametric and in the parametric part of the estimations.
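As a minimal illustration of this syntax, a typical session might look as follows. This is a sketch: the panel identifier id, time variable t, outcome y, parametric regressors x1 and x2, and nonparametric variable z are hypothetical names, and we assume the panel has already been declared with xtset.

. ssc install bspline
. xtset id t
. xtsemipar y x1 x2, nonpar(z) degree(4) ci generate(yhat uhat)

Here generate(yhat uhat) stores the fitted values of the dependent variable in yhat and the error component residual of (4) in uhat.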

4 Simulation

In this section, we show, by using some simple simulations, how xtsemipar behaves in finite samples. At the end of the section, we illustrate how this command can be extended to tackle some endogeneity problems.

In brief, the simulation setup is a standard fixed-effects panel of 200 individuals over five time periods (1,000 observations). For the design space, four variables, $x_1$, $x_2$, $x_3$, and $d$, are generated from a normal distribution with mean $\mu = (0, 0, 0, 0)$ and variance–covariance matrix (rows and columns ordered $x_1$, $x_2$, $x_3$, $d$)

$$\begin{pmatrix} 1 & 0.2 & 0.8 & 0 \\ 0.2 & 1 & 0.4 & 0.3 \\ 0.8 & 0.4 & 1 & 0.6 \\ 0 & 0.3 & 0.6 & 1 \end{pmatrix}$$

Variable $d$ is categorized in such a way that five individuals are identified by each category of $d$. In practice, we generate these variables in a two-step procedure where the $x$'s have two components. The first one is fixed for each individual and is correlated with $d$. The second one is a random realization for each time period.


Five hundred replications are carried out, and for each replication, an error term $e$ is drawn from an $N(0, 1)$. The dependent variable $y$ is generated according to the data-generating process (DGP): $y = x_1 + x_2 - (x_3 + 2x_3^2 - 0.25x_3^3) + d + e$. As is obvious from this estimation setting, multivariate regressions with individual fixed effects should be used if we want to consistently estimate the parameters. So we regress $y$ on the $x$'s by using three regression models (a code sketch follows the list):

1. xtsemipar, considering that $x_1$ and $x_2$ enter the model linearly and $x_3$ enters nonparametrically.

2. xtreg, considering that x1, x2, and x3 enter the model linearly.

3. xtreg, considering that $x_1$ and $x_2$ enter the model linearly, whereas $x_3$ enters the model parametrically with the correct polynomial form ($x_3^2$ and $x_3^3$).
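Under the same hypothetical names as above (outcome y, regressors x1, x2, x3, panel previously declared with xtset), the three fits might be coded as follows; this is our sketch, not the authors' simulation code:

. xtsemipar y x1 x2, nonpar(x3)            // model 1
. xtreg y x1 x2 x3, fe                     // model 2
. generate x3sq  = x3^2
. generate x3cub = x3^3
. xtreg y x1 x2 x3 x3sq x3cub, fe          // model 3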

Table 1 reports the bias and mean squared error (MSE) of coefficients associated with $x_1$ and $x_2$ for the three regression models. What we find is that Baltagi and Li's (2002) estimator performs much better than the usual fixed-effects estimator with linear control for $x_3$, in terms of both bias and efficiency. As expected, the most efficient and unbiased estimator remains the fixed-effects estimator with the appropriate polynomial specification. However, this specification is generally unknown. Figure 1 displays the average nonparametric fit of $x_3$ (plain line) obtained in the simulation with the corresponding 95% band. The true DGP is represented by the dotted line.

Table 1. Comparison between xtsemipar and xtreg

                                                            Bias x1    Bias x2     MSE x1     MSE x2
    xtsemipar with nonparametric control for x3             −0.0006    −0.0007    0.00536    0.00399
    xtreg with linear control for x3                        −0.2641    0.03752    0.07383    0.00462
    xtreg with 2nd- and 3rd-order polynomial
      control for x3                                        −0.0023    −0.0009    0.00410    0.00321


[Figure: average fit of $f(x_3)$ plotted against $x_3$ (range −4 to 4), showing the DGP, the 95% confidence interval, and the average fit; panel title: Nonparametric prediction of x3 based on splines.]

Figure 1. Average semiparametric fit of x3

If we want efficient and consistent estimates of parameters, estimations relying on the correct parametric specification are always better. Nevertheless, this correct form has to be known. It could be argued that a sufficiently flexible polynomial fit would be preferable to a semiparametric model. However, this is not the case. Indeed, let us consider the same simulation setting described above, but with the dependent variable $y$ created according to the new DGP $y = x_1 + x_2 + 3\sin(2.5x_3) + d + e$. Figure 2 reports the average nonparametric fit of $x_3$ in a black solid line, with a 95% confidence band around it. The dotted gray line represents the true DGP, which is quite close to the average fit estimated by xtsemipar using a fourth-order kernel regression with a bandwidth set to 0.2. The dashed gray line is the average fourth-order polynomial fixed-effects parametric fit. As is clear from this figure, xtsemipar provides a much better fit for this quite complex DGP. xtsemipar can also help identify the relevant parametric form and help applied researchers avoid some trial and error.


[Figure: $f(x_3)$ plotted against $x_3$ (range −4 to 4); legend: DGP; avg. fit − 4th-order Taylor exp.; confidence interval at 95%; avg. fit − 4th-order local polyn.; panel title: Nonparametric prediction of x3.]

Figure 2. Average semiparametric fit of x3

In much of the empirical research in applied economics, measurement errors, omitted variable bias, and simultaneity are common issues that can be solved through instrumental-variables estimation. Baltagi and Li (2002) extend their results to address these kinds of problems and establish the asymptotic properties for a partially linear panel-data model with fixed effects and possible endogeneity of the regressors. In practice, our estimator can be used within a two-step procedure to obtain consistent estimates of the βs. In the first stage, the right-hand side endogenous variable has to be regressed (and fit) by using (at least) one valid instrument. At this stage, the nonparametric variable linearly enters into the estimation procedure. In the second stage, the semiparametric fixed-effects panel-data model can be used to estimate the relation between the dependent variable and the set of regressors. The nonparametric variable now enters the model nonparametrically, exactly as explained before. If the instrument is valid, this procedure leads to consistent estimations.
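A hedged sketch of this two-step procedure, with hypothetical names (w is the endogenous right-hand-side variable, iv its instrument, x an exogenous regressor, and z the variable that enters nonparametrically in the second stage):

. xtreg w iv x z, fe                       // first stage: z enters linearly
. predict double what, xb
. xtsemipar y x what, nonpar(z)            // second stage

(Second-stage standard errors would need adjustment for the generated regressor.)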

Another problem can arise if the nonparametric variable is subject to endogeneity problems. In this case, we suggest, as the first step of the estimation procedure, using a control function approach as explained by Ahamada and Flachaire (2008). However, we believe that the technicalities associated with this method go well beyond the scope of this article.


5 Conclusion

In econometrics, semiparametric regression estimators are becoming standard tools for applied researchers. In this article, we presented Baltagi and Li's (2002) series semiparametric fixed-effects regression estimator. We then introduced the Stata program we created to put it into practice. Some simple simulations to illustrate the usefulness and the performance of the procedure were also shown.

6 Acknowledgments

We would like to thank Rodolphe Desbordes, Patrick Foissac, our colleagues at CRED and ECARES, and especially Wouter Gelade and Peter-Louis Heudtlass, who helped improve the quality of the article. The usual disclaimer applies. François Libois wishes to thank the ERC grant SSD 230290 for financial support. Vincenzo Verardi is an associate researcher at the FNRS and gratefully acknowledges their financial support.

7 References

Ahamada, I., and E. Flachaire. 2008. Économétrie Non Paramétrique. Paris: Economica.

Baltagi, B. H., and D. Li. 2002. Series estimation of partially linear panel data models with fixed effects. Annals of Economics and Finance 3: 103–116.

Newson, R. 2000a. bspline: Stata modules to compute B-splines parameterized by their values at reference points. Statistical Software Components S411701, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s411701.html.

———. 2000b. sg151: B-splines and splines parameterized by their values at reference points on the x-axis. Stata Technical Bulletin 57: 20–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 221–230. College Station, TX: Stata Press.

Royston, P., and W. Sauerbrei. 2007. Multivariable modeling with cubic regression splines: A principled approach. Stata Journal 7: 45–70.

About the authors

François Libois is a researcher and teaching assistant in economics at the University of Namur in the Centre for Research in the Economics of Development (CRED). His main research interests are new institutional economics with a special focus on development and environmental issues.

Vincenzo Verardi is a research fellow of the Belgian National Science Foundation (FNRS). He is a professor at the University of Namur and at the Université Libre de Bruxelles. His research interests include applied econometrics and development economics.


The Stata Journal (2013) 13, Number 2, pp. 337–343

Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests

Tammy Harris
Institute for Families in Society
Department of Epidemiology & Biostatistics
University of South Carolina
Columbia, SC
[email protected]

James W. Hardin
Institute for Families in Society
Department of Epidemiology & Biostatistics
University of South Carolina
Columbia, SC
[email protected]

Abstract. We present new Stata commands for carrying out exact Wilcoxon one-sample and two-sample comparisons of the median. Nonparametric tests are often used in clinical trials, in which it is not uncommon to have small samples. In such situations, researchers are accustomed to making inferences by using exact statistics. The ranksum and signrank commands in Stata provide only asymptotic results, which assume normality. Because large-sample results are unacceptable in many clinical trials studies, these researchers must use other software packages. To address this, we have developed new commands for Stata that provide exact statistics in small samples. Additionally, when samples are large, we provide results based on the Student's t distribution that outperform those based on the normal distribution.

Keywords: st0297, ranksumex, signrankex, exact distributions, nonparametric tests, median, Wilcoxon matched-pairs signed-rank test, Wilcoxon ranksum test

1 Introduction

Many statistical analysis methods are derived after making an assumption about the underlying distribution of the data (for example, normality). However, one may also consider nonparametric methods from which to draw statistical inferences where no assumptions are made about an underlying population or distribution. For the nonparametric equivalents to the parametric one-sample and two-sample t tests, the Wilcoxon signed-rank test (one sample) is used to test the hypothesis that the median difference between the absolute values of positive and negative paired differences is 0. The Wilcoxon Mann–Whitney ranksum test is used to test the hypothesis of a zero-median difference between two independently sampled populations.

© 2013 StataCorp LP st0297


We present Stata commands to evaluate both of these nonparametric statistical tests. This article is organized as follows. In section 2, we review the test statistics. In section 3, Stata syntax is presented for the new commands, followed by examples in section 4. A final summary is presented in section 5.

2 Nonparametric Wilcoxon tests

2.1 Wilcoxon signed-rank test

Let $X_i$ and $Y_i$ be continuous paired random variables from data consisting of $n$ observations, where observations are denoted as $X = (X_1, \dots, X_n)^T$ and $Y = (Y_1, \dots, Y_n)^T$. For these paired bivariate data, $(x_1, y_1), \dots, (x_n, y_n)$, the differences are calculated as $D_i = Y_i - X_i$. Before assigning ranks, we omit the subset of observations with absolute differences of 0 ($D_i = 0$). From this one sample of $n_r \le n$ nonzero differences, ranks $r_i$ are applied to the absolute differences $|D_i|$, where rank 1 is the smallest absolute difference and rank $n_r$ is the largest absolute difference.

We then test the hypothesis that $X_i$ and $Y_i$ are distributed interchangeably by using the signed-rank test statistic,

$$S = \sum_{i=1}^{n_r} r_i\, I(D_i > 0) - \frac{n_r(n_r + 1)}{4}$$

where $I(D_i > 0)$ is an indicator function that the $i$th difference is positive. Ranks of tied absolute differences are averaged for the relevant set of observations. The variance of $S$ is given by

$$V = \frac{1}{24}\, n_r(n_r + 1)(2n_r + 1) - \frac{1}{48} \sum_{j=1}^{m} t_j(t_j + 1)(t_j - 1)$$

where $t_j$ is the number of values tied in absolute value for the $j$th rank (Lehmann 1975) out of the $m$ unique assigned ranks; $m = n_r$ and $t_j = 1\ \forall j$ if there are no ties. The significance of $S$ is then computed one of two ways, contingent on sample size ($n_r$). If $n_r > 25$, the significance of $S$ can be based on the normal approximation (as is done in Stata's signrank command) or on Student's t distribution,

$$S \sqrt{\frac{n_r - 1}{n_r V - S^2}}$$

with $n_r - 1$ degrees of freedom (Iman 1974). When $n_r \le 25$, the significance of $S$ is computed from the exact distribution.

An algorithm for calculation of associated probabilities is the network algorithm of Mehta and Patel (1986). Many new improvements and modifications of that algorithm have been implemented in various applications to compute the exact p-value. Some include polynomial-time algorithms for permutation distributions (Pagano and Tritchler 1983), Mann–Whitney-shifted fast Fourier transform (FFT) (Nagarajan and Keich 2009), and decreased computation time for the network algorithm described in Requena and Martín Ciudad (2006). Comprehensive summaries for exact inference methods are published in Agresti (1992) and Waller, Turnbull, and Hardin (1995).

2.2 Wilcoxon Mann–Whitney ranksum test

Let $X$ be a binary variable (group 1 and group 2) and $Y$ be a continuous random variable from data consisting of $n$ observations where $Y = (Y_1, \dots, Y_n)^T$. Ranks are assigned to the data, 1 to $n$, smallest to largest, where tied ranks are given the average of the ranks. If $n > 25$, the (asymptotically normal) test statistic $Z$ is given by

$$Z = \frac{R_1 - n_1(n + 1)/2}{\sqrt{n_1 n_2 V_R / n}}$$

where $R_1$ is the sum of the ranks from group 1, $n_1$ is the sample size of group 1, $n_2$ is the sample size of group 2, and $V_R$ is the variance of the ranks. In Stata, group 1 is lesser in numeric value than group 2. However, if $n \le 25$, the normal approximation is not appropriate. In this situation, we calculate the exact test by using the approach outlined in the following section.

2.3 An exact method based on the characteristic function

Pagano and Tritchler (1983) present the basic methodology for computing distribution functions through Fourier analysis of the characteristic function. Superficially, this approach appears as complicated as the complete enumeration of results for the distributions of the Wilcoxon test statistics, but Fourier analysis via the FFT in the approach based on the characteristic function is calculated much faster.

Basically, if $X$ is a discrete random variable with a distribution function given by $P(X = j) = p_j$ for $j = 0, \dots, U$, then the complex-valued characteristic function is given by

$$\phi(\theta) = \sum_{j=0}^{U} p_j \exp(ij\theta)$$

where $i = \sqrt{-1}$ and $\theta \in [0, 2\pi)$. Because $X$ is defined on a finite integer lattice, the basic theorem in Fourier series is used to obtain the probabilities $p_j$. For any integer $Q > U$ and $j = 0, \dots, U$,

$$p_j = \frac{1}{Q} \sum_{k=0}^{Q-1} \phi\!\left(\frac{2\pi k}{Q}\right) \exp\!\left(\frac{-2\pi i j k}{Q}\right) \tag{1}$$

Thus knowing the characteristic function at $Q$ equidispersed points on the interval $[0, 2\pi)$ is equivalent to knowing it everywhere. Furthermore, the probabilities of the distribution are easily obtained from the characteristic function. We emphasize that the imaginary part of (1) is 0.


To allow tied ranks in the commands, we multiply all ranks by $L$ to ensure that the ranks and sums of ranks will be integers. This can be accomplished for our two statistics by setting $L = 2$. The ranges of the values of the two statistics are easily calculated so that we may choose $Q \ge U$. Defining $U$ as the largest possible value of our statistic (formed from the largest possible ranks), we can choose $\log_2 Q = \operatorname{ceiling}\{\log_2(U)\}$. We choose $Q$ to be a power of 2 because of the requirements of the FFT algorithm in Stata (Fourier analysis is carried out by using the Mata fft command).

Using $r_k$ to denote the rank of the $k$th observation, the characteristic function for the one-sample statistic $S_1$ is given by

$$\phi_1(-2\pi i j / Q) = \left\{ \exp(-2\pi i j / Q) \prod_{k=1}^{N} \cos(-2\pi j L r_k / Q) \right\}$$

while the characteristic function for $S_2$ is calculated by using the difference equation

$$\phi_2(j, k) = \exp(-2\pi i j L r_k / Q)\,\phi_2(j - 1, k - 1) + \phi_2(j, k - 1)$$
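To make the inversion concrete, here is a minimal Mata sketch (ours, not the authors' code) that recovers the exact null distribution of the uncentered signed-rank sum $W^+ = \sum r_i I(D_i > 0)$ for $n$ untied ranks by evaluating its characteristic function on a power-of-2 grid and applying (1); the sign convention assumed for Mata's fft() should be verified against the small check in the comments:

mata:
// Exact pmf of W+ = sum of ranks of positive differences, n untied ranks.
// phi(theta) = prod_k {1 + exp(i*theta*k)}/2; probabilities via (1).
real rowvector wplus_pmf(real scalar n)
{
    real scalar U, Q, k
    complex rowvector phi

    U   = n*(n + 1)/2                       // largest attainable W+
    Q   = 2^ceil(ln(U + 1)/ln(2))           // power of 2 with Q > U, for fft()
    phi = J(1, Q, C(1, 0))
    for (k = 1; k <= n; k++) {
        phi = phi :* (1 :+ exp(C(0, 1)*2*pi()*k*(0..(Q-1))/Q))/2
    }
    // (1): p_j = (1/Q) sum_k phi(2*pi*k/Q) exp(-2*pi*i*j*k/Q); we assume
    // fft() uses the exp(-2*pi*i*j*k/Q) convention -- verify with the check
    return((Re(fft(phi))/Q)[|1 \ U + 1|])   // probabilities for j = 0..U
}

// check: wplus_pmf(3) should return (1,1,1,2,1,1,1)/8
wplus_pmf(3)
end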

3 Stata syntax

Software accompanying this article includes the command files as well as supporting files for dialogs and help. Equivalent to the signrank command, the basic syntax for the new Wilcoxon signed-rank test command is

signrankex varname = exp [if] [in]

Equivalent to the ranksum command, the basic syntax for the new Wilcoxon Mann–Whitney ranksum test command is

ranksumex varname [if] [in], by(groupvar) [porder]

4 Example

In this section, we present real-world examples with the new nonparametric Wilcoxon test commands. In clinical trials, talinolol is used as a β blocker and is controlled by P-glycoprotein, which protects xenobiotic compounds. Eight healthy men between the ages of 22 and 26 were evaluated based on their serum-concentration time profiles of talinolol with kinetic profile differences. These differences were two enantiomers, S(–) talinolol and R(+) talinolol. The trial examined single intravenous (iv) and repeated oral talinolol profiles before and after rifampicin comedication. Area under the serum concentration time curves (AUC) was collected for each subject (see Zschiesche et al. [2002]). We compare AUC values of S(–) iv talinolol before and after comedication of rifampicin by using the Wilcoxon signed-rank test. The results are given below, where S is the Wilcoxon signed-rank test statistic.


. use signrank, clear

. signrankex iv_s_before = iv_s_after

Wilcoxon signed-rank test

        sign         obs   sum ranks    expected
    positive           8          36          18
    negative           0           0          18
        zero           0           0           0
         all           8          36          36

Ho: iv_s_before = iv_s_after
             S =  18.000
  Prob >= |S| =  0.0078

The results show there was a statistically significant difference (p-value = 0.0078) between iv S(–) talinolol before and after comedication of rifampicin. There were greater S(–) talinolol AUC values shown before rifampicin administration than after.
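As a quick check on the exact computation: with all eight nonzero differences positive and no ties, the observed outcome is the most extreme possible, so the two-sided probability is $2 \times (1/2)^8 = 2/256 = 0.0078$, in agreement with the reported value.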

For the Wilcoxon Mann–Whitney ranksum test example, we will use performance data (table 1) collected on rats' rotarod endurance (in seconds) from two treatment groups. The rats were randomly selected to be in the control group (received saline solvent) or the treatment group (received centrally acting muscle relaxant) (Bergmann, Ludbrook, and Spooren 2000).

Table 1. Rotarod endurance

    Treatment group                  Control group
    Endurance time (sec)   Rank      Endurance time (sec)   Rank
             22               2               300             15
            300              15               300             15
             75               3               300             15
            271               5               300             15
            300              15               300             15
             18               1               300             15
            300              15               300             15
            300              15               300             15
            163               4               300             15
            300              15               300             15
            300              15               300             15
            300              15               300             15


The results are given below.

. use ranksum, clear

. ranksumex edrce, by(trt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

         trt         obs    rank sum    expected
           0          12         180         150
           1          12         120         150
    combined          24         300         300

Exact statistics
Ho: edrce(trt==0) = edrce(trt==1)
          Prob <= 120 = 0.0186
          Prob >= 180 = 0.0186
    Two-sided p-value = 0.0373

The two-sided exact p-value of 0.0373 exhibits a statistically significant difference in average rotarod endurance between the groups of rats. We can also illustrate how to calculate this exact p-value manually by using the rat rotarod endurance data (table 1). In Conover (1999), the Wilcoxon Mann–Whitney ranksum test exact p-value is illustrated in terms of combinations (arrangements) of ranks. In this example, the number of arrangements of 12 of the ranks in the table having a sum less than or equal to 120 is the number of arrangements of choosing all 5 of the ranks less than 15 and 7 of the 19 tied ranks of 15; this is given by $\binom{5}{5}\binom{19}{7}$. The total number of ways to choose 12 of 24 ranks is given by $\binom{24}{12}$. Thus the p-value is

$$\text{p-value} = \frac{\binom{5}{5}\binom{19}{7}}{\binom{24}{12}} = \frac{50{,}388}{2{,}704{,}156} = 0.0186$$

where each of the new commands returns the p-value as well as the numerator and denominator of the exact fraction (see the return values in the previous example).
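This combinatorial arithmetic can be reproduced directly in Stata with the comb() function (our check, not output of the commands):

. display comb(5,5)*comb(19,7)/comb(24,12)   // = 0.0186 to 4 decimal places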

5 Summary

In this article, we introduced two supporting Stata commands for the exact nonparametric Wilcoxon signed-rank test and the Wilcoxon Mann–Whitney ranksum test. These one-sample and two-sample test statistics can be used to assess the difference in location (median difference) for small samples (exact distribution) and larger samples (Student's t distribution).


6 References

Agresti, A. 1992. A survey of exact inference for contingency tables. Statistical Science 7: 131–153.

Bergmann, R., J. Ludbrook, and W. P. J. M. Spooren. 2000. Different outcomes of the Wilcoxon–Mann–Whitney test from different statistics packages. American Statistician 54: 72–77.

Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley.

Iman, R. L. 1974. Use of a t-statistic as an approximation to the exact distribution of the Wilcoxon signed ranks test statistic. Communications in Statistics 3: 795–806.

Lehmann, E. L. 1975. Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River, NJ: Springer.

Mehta, C. R., and N. R. Patel. 1986. Algorithm 643: FEXACT: A FORTRAN subroutine for Fisher's exact test on unordered r × c contingency tables. ACM Transactions on Mathematical Software 12: 154–161.

Nagarajan, N., and U. Keich. 2009. Reliability and efficiency of algorithms for computing the significance of the Mann–Whitney test. Computational Statistics 24: 605–622.

Pagano, M., and D. Tritchler. 1983. On obtaining permutation distributions in polynomial time. Journal of the American Statistical Association 78: 435–440.

Requena, F., and N. Martín Ciudad. 2006. A major improvement to the network algorithm for Fisher's exact test in 2 × c contingency tables. Computational Statistics and Data Analysis 51: 490–498.

Waller, L. A., B. W. Turnbull, and J. M. Hardin. 1995. Obtaining distribution functions by numerical inversion of characteristic functions with applications. American Statistician 49: 346–350.

Zschiesche, M., G. L. Lemma, K.-J. Klebingat, G. Franke, B. Terhaag, A. Hoffmann, T. Gramatte, H. K. Kroemer, and W. Siegmund. 2002. Stereoselective disposition of talinolol in man. Journal of Pharmaceutical Sciences 91: 303–311.

About the authors

Tammy Harris is a PhD candidate in the Department of Epidemiology and Biostatistics and an affiliated researcher in the Institute for Families in Society at the University of South Carolina in Columbia, SC.

James W. Hardin is an associate professor in the Department of Epidemiology and Biostatistics and an affiliated faculty member in the Institute for Families in Society at the University of South Carolina in Columbia, SC.


The Stata Journal (2013) 13, Number 2, pp. 344–355

Extending the flexible parametric survival model for competing risks

Sally R. Hinchliffe
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

Paul C. Lambert
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

Abstract. Competing risks are present when the patients within a dataset could experience one or more of several exclusive events and the occurrence of any one of these could impede the event of interest. One of the measures of interest for analyses of this type is the cumulative incidence function. stpm2cif is a postestimation command used to generate predictions of the cumulative incidence function after fitting a flexible parametric survival model using stpm2. There is also the option to generate confidence intervals, cause-specific hazards, and two other measures that will be discussed in further detail. The new command is illustrated through a simple example.

Keywords: st0298, stpm2cif, survival analysis, competing risks, cumulative incidence, cause-specific hazard

1 Introduction

In survival analysis, if interest lies in the true probability of death from a particular cause, then it is important to appropriately account for competing risks. Competing risks occur when patients are at risk of more than one mutually exclusive event, such as death from different causes (Putter, Fiocco, and Geskus 2007). The occurrence of a competing event may prevent the event of interest from ever occurring. It therefore seems logical to conduct an analysis that considers these competing risks. The two main measures of interest for analyses of this type are the cause-specific hazard and the cumulative incidence function. The cause-specific hazard is the instantaneous risk of dying from a specific cause given that the patient is still alive at a particular time. The cumulative incidence function is the proportion of patients who have experienced a particular event at a certain time in the follow-up period. Several methods are already available to estimate this; however, it is not always clear which approach should be used.

In this article, we explain how to fit flexible parametric models using the stpm2 command by estimating the cause-specific hazard for each cause of interest in a competing-risks situation. The stpm2cif command is a postestimation command used to estimate the cumulative incidence function for up to 10 competing causes along with confidence intervals, cause-specific hazards, and two other useful measures.

© 2013 StataCorp LP st0298


2 Methods

If a patient is at risk from $K$ different causes, the cause-specific hazard, $h_k(t)$, is the risk of failure at time $t$ given that no failure from cause $k$ or any of the $K-1$ other causes has occurred. In a proportional hazards model, $h_k(t)$ is

$$h_k(t \mid Z) = h_{k,0}(t) \exp\left(\beta_k^T Z\right) \tag{1}$$

where $h_{k,0}(t)$ is the baseline cause-specific hazard for cause $k$, and $\beta_k$ is the vector of parameters for covariates $Z$. The cumulative incidence function, $C_k(t)$, can be derived from the cause-specific hazards through the equation

$$C_k(t) = \int_0^t h_k(u \mid Z) \prod_{k=1}^{K} S_k(u)\, du \tag{2}$$

where $\prod_{k=1}^{K} S_k(u) = \exp\left(-\int_0^t \sum_{k=1}^{K} h_k\right)$ is the overall survival function (Prentice et al. 1978).
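As a concrete illustration of (2), the cumulative incidence can be approximated numerically from predictions on a fine time grid. The sketch below uses hypothetical variable names (h1 for one cause-specific hazard, S for the overall survival function, t for analysis time) and official Stata's integ command; it is not necessarily how stpm2cif evaluates the integral internally:

. generate double integrand1 = h1*S
. integ integrand1 t, generate(CIF1) trapezoid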

Several programs are currently available in Stata that can compute the cumulative incidence function. The command stcompet calculates the function by using the Kaplan–Meier estimator of the overall survival function (Coviello and Boggess 2004). It therefore does not allow for the incorporation of covariate effects. A follow-on to stcompet is stcompadj, which fits the cumulative incidence function based on the Cox model or the flexible parametric regression model (Coviello 2009). However, it only allows one competing event, and because the regression models are built into the command internally, it does not allow users to specify their own options with stcox or stpm2. Finally, Fine and Gray's (1999) proportional subhazards model can be fit using stcrreg.

The flexible parametric model was first proposed by Royston and Parmar in 2002. The approach uses restricted cubic spline functions to model the baseline log cumulative hazard. It has an advantage over other well-known models such as the Cox model because it produces smooth predictions and can be extended to incorporate complex time-dependent effects, again through the use of restricted cubic splines. The Stata implementation of the model using stpm2 is described in detail elsewhere (Royston and Parmar 2002; Lambert and Royston 2009). Both the cause-specific hazard (1) and the overall survival function can be obtained from the flexible parametric model to give the integrand in (2). This can be done by fitting separate models for each of the $k$ causes, but this will not allow for shared parameters. It is possible to fit one model for all $k$ causes simultaneously by stacking the data so that each individual patient has $k$ rows of data, one for each of the $k$ causes. Table 1 illustrates how the data should look once they have been stacked (in the table, CVD stands for cardiovascular disease). Each patient can fail from one of three causes. Patient 1 is at risk from all three causes for 10 years but does not experience any of them and so is censored. Patient 2 is at risk from all three causes for eight years but then experiences a cardiovascular event. By expanding the dataset, one can allow for covariate effects to be shared across the causes, although it is possible to include covariates that vary for each cause.

Table 1. Expanding the dataset

    ID   Age   Time   Cause    Status
     1    50     10   Cancer        0
     1    50     10   CVD           0
     1    50     10   Other         0
     2    70      8   Cancer        0
     2    70      8   CVD           1
     2    70      8   Other         0

3 Syntax

stpm2cif newvarlist, cause1(varname # [varname # ...])
    cause2(varname # [varname # ...]) [cause3(varname # [varname # ...])
    ... cause10(varname # [varname # ...]) obs(#) ci mint(#) maxt(#)
    timename(newvar) hazard contmort conthaz]

The names specified in newvarlist coincide with the order of the causes inputted in the options.

3.1 Options

cause1(varname # [varname # ...]) ... cause10(varname # [varname # ...]) request that the covariates specified by the listed varname be set to # when predicting the cumulative incidence functions for each cause. cause1() and cause2() are required.

obs(#) specifies the number of observations (of time) to predict. The default is obs(1000). Observations are evenly spread between the minimum and maximum values of follow-up time.

ci calculates a 95% confidence interval for the cumulative incidence function and stores the confidence limits in CIF_newvar_lci and CIF_newvar_uci.

mint(#) specifies the minimum value of follow-up time. The default is set as the minimum event time from stset.

maxt(#) specifies the maximum value of follow-up time. The default is set as the maximum event time from stset.

Page 133: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. R. Hinchliffe and P. C. Lambert 347

timename(newvar) specifies the time variable generated during predictions for the cumulative incidence function. The default is timename(_newt). This is the variable for time that needs to be used when plotting curves for the cumulative incidence function and the cause-specific hazard function.

hazard predicts the cause-specific hazard function for each cause.

contmort predicts the relative contribution to total mortality.

conthaz predicts the relative contribution to hazard.

4 Example

Data were used on 506 patients with prostate cancer who were randomly allocated to treatment with diethylstilbestrol. The data have been used previously to illustrate the command stcompet (Coviello and Boggess 2004). Patients are classified as alive or having died from one of three causes: cancer (the event of interest), cardiovascular disease (CVD), or other causes. To use stpm2cif, the user must first expand the dataset:

. use prostatecancer

. expand 3
(1012 observations created)

. by id, sort: generate cause= _n

. generate cancer = cause==1

. generate cvd = cause==2

. generate other = cause==3

. generate treatcancer = treatment*cancer

. generate treatcvd = treatment*cvd

. generate treatother = treatment*other

. generate event = (cause==status)

The data have been expanded so that each patient has three rows of data, one for each cause, as shown in table 1. Three indicator variables have been created, one for each of the three competing causes, and interactions between treatment and the three causes have also been generated. The indicator variable event defines whether a patient has died and the cause of death. We now need to stset the data and run stpm2.


. stset time, failure(event)

     failure event:  event != 0 & event < .
obs. time interval:  (0, time]
 exit on or before:  failure

     1518  total obs.
        0  exclusions

     1518  obs. remaining, representing
      356  failures in single record/single failure data
  54898.8  total analysis time at risk, at risk from t = 0
                             earliest observed entry t = 0
                                  last observed exit t = 76

. stpm2 cancer cvd other treatcancer treatcvd treatother, scale(hazard)
>     rcsbaseoff dftvc(3) nocons tvc(cancer cvd other) eform nolog

Log likelihood = -1150.4866                     Number of obs   =       1518

              |    exp(b)   Std. Err.      z    P>|z|    [95% Conf. Interval]
 -------------+--------------------------------------------------------------
 xb           |
       cancer |  .2363179   .0275697   -12.37   0.000     .188015    .2970303
          cvd |  .1801868   .0238668   -12.94   0.000    .1389876    .2335983
        other |  .1008464    .018132   -12.76   0.000     .070895    .1434515
  treatcancer |  .6722196   .1096964    -2.43   0.015    .4882109    .9255819
     treatcvd |  1.188189   .2013301     1.02   0.309    .8524237     1.65621
   treatother |  .6345498   .1672676    -1.73   0.084    .3785199    1.063758
 _rcs_cancer1 |  3.501847     .44435     9.88   0.000    2.730788     4.49062
 _rcs_cancer2 |  .8842712   .0742915    -1.46   0.143    .7500191    1.042554
 _rcs_cancer3 |  1.046436   .0371625     1.28   0.201    .9760756    1.121868
    _rcs_cvd1 |  2.841936   .2619063    11.33   0.000    2.372299    3.404545
    _rcs_cvd2 |  .8772848   .0498866    -2.30   0.021    .7847607    .9807176
    _rcs_cvd3 |  1.008804   .0352009     0.25   0.802    .9421175     1.08021
  _rcs_other1 |  2.751505   .3563037     7.82   0.000    2.134738    3.546467
  _rcs_other2 |  .7962094   .0558593    -3.25   0.001    .6939208     .913576
  _rcs_other3 |  .9614597   .0512891    -0.74   0.461    .8660117    1.067428

By including the three cause indicators (cancer, cvd, and other) as both main effects and time-dependent effects (using the tvc() option), we have fit a stratified model with three separate baselines, one for each cause. For this reason, we have used the rcsbaseoff option together with the nocons option, which excludes the baseline hazard from the model. The interactions between treatment and the three causes have also been included in the model. This estimates a different treatment effect for each of the three causes. The hazard ratios (95% confidence intervals) for the treatment effect are 0.67 [0.49, 0.93], 1.19 [0.85, 1.66], and 0.63 [0.38, 1.06] for cancer, CVD, and other causes, respectively.

Now that we have run stpm2, we can run the new postestimation command stpm2cif to obtain the cumulative incidence function for each cause. Because we have two groups of patients, treated and untreated, we must run the command twice. This will give separate cumulative incidence functions for the treated and the untreated groups and for each of the three causes.


. stpm2cif cancer0 cvd0 other0, cause1(cancer 1) cause2(cvd 1) cause3(other 1)
>     ci hazard contmort conthaz maxt(60)

. stpm2cif cancer1 cvd1 other1, cause1(cancer 1 treatcancer 1)
>     cause2(cvd 1 treatcvd 1) cause3(other 1 treatother 1)
>     ci hazard contmort conthaz maxt(60)

The cause1() to cause3() options give the linear predictor for each of the three causes for which we want a prediction. The commands have generated six new variables containing the cumulative incidence functions. The untreated group members are denoted with a 0 at the end of the variable name, and the treated group members are denoted with a 1. These labels come from the input into newvarlist in the above command line. The six cumulative incidence functions are therefore labeled CIF_cancer0, CIF_cvd0, CIF_other0, CIF_cancer1, CIF_cvd1, and CIF_other1. Each of these variables has a corresponding high and low confidence bound, for example, CIF_cancer0_lci and CIF_cancer1_uci. These were created because the ci option was specified. The maxt() option has been specified to restrict the predictions for the cumulative incidence function to a maximum follow-up time of 60 months; this was done for illustrative purposes only.

By specifying the hazard option, we have generated cause-specific hazards that correspond with each of the cumulative incidence functions. These are labeled as h_cancer0, h_cvd0, h_other0, h_cancer1, h_cvd1, and h_other1. The options contmort and conthaz are the two additional measures mentioned previously. The contmort option produces what we have named the "relative contribution to the total mortality". This is essentially the cumulative incidence function for each specific cause divided by the sum of all the cumulative incidence functions. It can be interpreted as the probability that you will die from a particular cause given that you have died by time t. The conthaz option produces what we have named the "relative contribution to the overall hazard". This is similar to the last measure in that it is the cause-specific hazard for a particular cause divided by the sum of all the cause-specific hazards. It can be interpreted as the probability that you will die from a particular cause given that you die at time t.
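For example, these two measures for cancer in the treated group could be reproduced by hand from the predictions generated above (a sketch; the variable names cm_cancer1 and ch_cancer1 are ours, and stpm2cif computes the measures internally):

. generate double cm_cancer1 = CIF_cancer1/(CIF_cancer1 + CIF_cvd1 + CIF_other1)
. generate double ch_cancer1 = h_cancer1/(h_cancer1 + h_cvd1 + h_other1)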


If we plot the cumulative incidence functions for each cause against time, we can achieve plots as shown in figure 1.

[Figure: three panels (Cancer, CVD, Other) showing cumulative incidence (0 to 0.4) against months since randomization (0 to 60) for untreated and treated patients.]

Figure 1. Cumulative incidence of cancer, CVD, and other causes of death in treated and untreated patients with prostate cancer

The plots in figure 1 give the actual probabilities of dying from each cause, taking into account the competing causes. The treated group have a lower probability of dying from cancer or other causes compared with the untreated group, but have a higher probability of dying from CVD.

The model fit above is relatively simple because it only considers treatment as a predictor for the three causes of death. Age is an important factor when fitting the probability of death, so we shall now consider a model including age as a continuous variable with a time-dependent effect. Although the effect of age will most likely differ between the three causes of death, for demonstrative purposes, we will assume that the effect of age can be shared across all three causes. This is one of the main advantages of stacking the data as shown previously. The stpm2 command can be rerun to include age in both the variable list and the tvc() option. The three cause indicators (cancer, cvd, and other) remain as time-dependent effects with 3 degrees of freedom to maintain the stratified model with three separate baselines. Age is now included as a time-dependent effect with only 1 degree of freedom.


. stpm2 cancer cvd other treatcancer treatcvd treatother age, scale(hazard)
>     rcsbaseoff dftvc(cancer:3 cvd:3 other:3 age:1) nocons
>     tvc(cancer cvd other age) eform nolog

Log likelihood = -1140.8413                     Number of obs   =       1515

              |    exp(b)   Std. Err.      z    P>|z|    [95% Conf. Interval]
 -------------+--------------------------------------------------------------
 xb           |
       cancer |  .0146644   .0103431    -5.99   0.000    .0036804    .0584297
          cvd |  .0109487   .0078047    -6.33   0.000    .0027076    .0442727
        other |  .0061321   .0044357    -7.04   0.000    .0014856    .0253121
  treatcancer |  .6862214    .112055    -2.31   0.021    .4982751    .9450598
     treatcvd |  1.208279   .2048582     1.12   0.264    .8666626    1.684553
   treatother |  .6468979   .1705538    -1.65   0.099    .3858491    1.084561
          age |  1.039325    .009951     4.03   0.000    1.020004    1.059013
 _rcs_cancer1 |  15.00829   10.53069     3.86   0.000    3.793838    59.37229
 _rcs_cancer2 |   .897379   .0757066    -1.28   0.199    .7606152    1.058734
 _rcs_cancer3 |  1.046672   .0375211     1.27   0.203    .9756565    1.122858
    _rcs_cvd1 |  12.71949   9.099712     3.55   0.000    3.129737    51.69301
    _rcs_cvd2 |  .8897659   .0515221    -2.02   0.044    .7943039    .9967009
    _rcs_cvd3 |  1.013176   .0358712     0.37   0.712    .9452535    1.085979
  _rcs_other1 |  12.19435   8.773031     3.48   0.001    2.976976    49.95076
  _rcs_other2 |  .7976752   .0553103    -3.26   0.001    .6963126    .9137932
  _rcs_other3 |  .9682531   .0511771    -0.61   0.542    .8729684    1.073938
    _rcs_age1 |   .980301   .0092192    -2.12   0.034    .9623973    .9985378

As before, we can now use the stpm2cif command to obtain the cumulative incidence functions for cancer, CVD, and other causes. This time, we want to predict for ages 65 and 75 in both of the two treatment groups, so we will need to run the command four times.

. stpm2cif age65cancer0 age65cvd0 age65other0, cause1(cancer 1 age 65)
>     cause2(cvd 1 age 65) cause3(other 1 age 65) ci hazard contmort conthaz
>     maxt(60)

. stpm2cif age65cancer1 age65cvd1 age65other1,
>     cause1(cancer 1 treatcancer 1 age 65)
>     cause2(cvd 1 treatcvd 1 age 65) cause3(other 1 treatother 1 age 65)
>     ci hazard contmort conthaz maxt(60)

. stpm2cif age75cancer0 age75cvd0 age75other0, cause1(cancer 1 age 75)
>     cause2(cvd 1 age 75) cause3(other 1 age 75) ci hazard contmort conthaz
>     maxt(60)

. stpm2cif age75cancer1 age75cvd1 age75other1,
>     cause1(cancer 1 treatcancer 1 age 75)
>     cause2(cvd 1 treatcvd 1 age 75) cause3(other 1 treatother 1 age 75)
>     ci hazard contmort conthaz maxt(60)

The stpm2cif commands have generated 12 new variables for the cumulative incidence functions, labeled CIF_age65cancer0, CIF_age65cvd0, CIF_age65other0, CIF_age65cancer1, CIF_age65cvd1, CIF_age65other1, CIF_age75cancer0, CIF_age75cvd0, CIF_age75other0, CIF_age75cancer1, CIF_age75cvd1, and CIF_age75other1. A 65 next to age represents the prediction for those 65 years old; a 75 represents a prediction for those 75 years old.


Rather than plotting the cumulative incidence function as a line for each cause separately as we did previously, we display them by stacking them on top of each other. This produces a graph as shown in figure 2. To do this, we need to generate new variables that sum up the cumulative incidence functions. This is done for each of the two treatment groups and two ages. The code shown below is for the 65-year-olds in the treatment group only.

. generate age65treat1 = CIF_age65cancer1
(518 missing values generated)

. generate age65treat2 = age65treat1+CIF_age65cvd1
(518 missing values generated)

. generate age65treat3 = age65treat2+CIF_age65other1
(518 missing values generated)

. twoway (area age65treat3 _newt, sort fintensity(100))
>     (area age65treat2 _newt, sort fintensity(100))
>     (area age65treat1 _newt, sort fintensity(100)), ylabel(0(0.2)1, angle(0)
>     format(%3.1f)) ytitle("") xtitle("")
>     legend(order(3 "Cancer" 2 "CVD" 1 "Other") rows(1) size(small))
>     title("Treated") plotregion(margin(zero)) scheme(sj)
>     saving(treatedage65, replace)

[Figure: four stacked-area panels (Treated and Untreated, at ages 65 and 75) showing cumulative incidence (0 to 1) against months since randomization (0 to 60), broken down into Cancer, CVD, and Other.]

Figure 2. Stacked cumulative incidence of cancer, CVD, and other causes of death for those aged 65 and 75 in treated and untreated patients with prostate cancer


The results in figure 2 allow us to visualize the total probability of dying in both the treated and the untreated groups for those aged 65 and 75 and allow us to see how this is broken down by the specific causes. As expected, the total probability of death is higher for the oldest age in both treatment groups. The distribution of deaths across the three causes in each treatment group is roughly the same for both ages. Again we see that although the treatment reduces the total probability of death, it actually increases the probability of death from CVD.

Using a similar process to the one used above to obtain the stacked cumulative incidence plots, we can also produce stacked plots of the relative contribution to the total mortality and the relative contribution to the hazard. These graphs are shown in figures 3 and 4.

[Figure: four stacked-area panels (Treated and Untreated, at ages 65 and 75) showing the relative contribution to the total mortality (0 to 1) against months since randomization (0 to 60), broken down into Cancer, CVD, and Other.]

Figure 3. Relative contribution to the total mortality for those aged 65 and 75 in treated and untreated patients with prostate cancer

Figure 3 shows the relative contribution to the total mortality for those aged 65and 75 in the two treatment groups. If we focus on the 65-year-olds in the treatedgroup, the plot shows us that given a patient 65 years old is going to die by 40 monthsif treated, then the probability of dying from cancer is 0.39, the probability of dyingfrom CVD is 0.48, and the probability of dying from other causes is 0.13. However, if thepatient is untreated, then the same probabilities are 0.49, 0.34, and 0.17, respectively.

Page 140: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

354 Competing risks

0.0

0.2

0.4

0.6

0.8

1.0

0 20 40 60

Treated

0.0

0.2

0.4

0.6

0.8

1.0

0 20 40 60

Untreated

Age 65

0.0

0.2

0.4

0.6

0.8

1.0

0 20 40 60

Treated

0.0

0.2

0.4

0.6

0.8

1.0

0 20 40 60

Untreated

Age 75

Rela

tive

cont

ribut

ion

to th

e ov

eral

l haz

ard

Months since randomization

Cancer CVD Other

Figure 4. Relative contribution to the hazard for those aged 65 and 75 in treated anduntreated patients with prostate cancer

Figure 4 shows the relative contribution to the overall hazard for those aged 65 and 75in the two treatment groups. Again, if we focus on the 65-year-olds in the treated group,the plot shows us that given a patient 65 years old is going to die at 40 months if treated,then the probability of dying from cancer is 0.39, the probability of dying from CVD

is 0.45, and the probability of dying from other causes is 0.16. However, if the patientis untreated, then the same probabilities are 0.48, 0.32, and 0.20, respectively.

5 Conclusion

The new command stpm2cif provides an extension to the command stpm2 to enableusers to estimate the cumulative incidence function through the flexible parametricfunction. We hope that it will be a useful tool in medical research.

6 ReferencesCoviello, E. 2009. stcompadj: Stata module to estimate the covariate-adjusted cu-

mulative incidence function in the presence of competing risks. Statistical SoftwareComponents S457063, Department of Economics, Boston College.http://ideas.repec.org/c/boc/bocode/s457063.html.

Coviello, V., and M. Boggess. 2004. Cumulative incidence estimation in the presence ofcompeting risks. Stata Journal 4: 103–112.

Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistributionof a competing risk. Journal of the American Statistical Association 94: 496–509.

Page 141: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. R. Hinchliffe and P. C. Lambert 355

Lambert, P. C., and P. Royston. 2009. Further development of flexible parametricmodels for survival analysis. Stata Journal 9: 265–290.

Prentice, R. L., J. D. Kalbfleisch, A. V. Peterson, Jr., N. Flournoy, V. T. Farewell, andN. E. Breslow. 1978. The analysis of failure times in the presence of competing risks.Biometrics 34: 541–554.

Putter, H., M. Fiocco, and R. B. Geskus. 2007. Tutorial in biostatistics: Competingrisks and multi-state models. Statistics in Medicine 26: 2389–2430.

Royston, P., and M. K. B. Parmar. 2002. Flexible parametric proportional-hazards andproportional-odds models for censored survival data, with application to prognosticmodelling and estimation of treatment effects. Statistics in Medicine 21: 2175–2197.

About the authors

Sally Hinchliffe is a PhD student at the University of Leicester, UK. She is currently workingon developing a methodology for application in competing risks.

Paul Lambert is a reader in medical statistics at the University of Leicester, UK. His maininterest is in the development and application of methods in population-based cancer research.

Page 142: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

The Stata Journal (2013)13, Number 2, pp. 356–365

Goodness-of-fit tests for categorical data

Rino BelloccoUniversity of Milano–Bicocca

Milan, [email protected]

andKarolinska InstitutetStockholm, [email protected]

Sara AlgeriTexas A&M University

College Station, TX

[email protected]

Abstract. A significant aspect of data modeling with categorical predictors isthe definition of a saturated model. In fact, there are different ways of specifyingit—the casewise, the contingency table, and the collapsing approaches—and theystrictly depend on the unit of analysis considered.

The analytical units of reference could be the subjects or, alternatively, groupsof subjects that have the same covariate pattern. In the first case, the goal is topredict the probability of success (failure) for each individual; in the second case,the goal is to predict the proportion of successes (failures) in each group. Theanalytical unit adopted does not affect the estimation process; however, it doesaffect the definition of a saturated model. Consequently, measures and tests ofgoodness of fit can lead to different results and interpretations. Thus one mustcarefully consider which approach to choose.

In this article, we focus on the deviance test for logistic regression models.However, the results and the conclusions are easily applicable to other linear modelsinvolving categorical regressors.

We show how Stata 12.1 performs when implementing goodness of fit. In thissituation, it is important to clarify which one of the three approaches is imple-mented as default. Furthermore, a prominent role is played by the shape of thedataset considered (individual format or events–trials format) in accordance withthe analytical unit choice. In fact, the same procedure applied to different datastructures leads to different approaches to a saturated model. Thus one must at-tend to practical and theoretical statistical issues to avoid inappropriate analyses.

Keywords: st0299, saturated models, categorical data, deviance, goodness-of-fittests

1 Deviance test for goodness of fit

It is common to find applications of logistic regression models in categorical data anal-ysis. In particular, considering the simplest case of a binary outcome Y , the logisticregression model for the probability of success π {P (Y = 1)} is defined as

ln{

π(x)1 − π(x)

}= β0 + β1x1 + · · · + βpxp (1)

c© 2013 StataCorp LP st0299

Page 143: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

R. Bellocco and S. Algeri 357

where π(x) is the probability of success given the set of covariates x = (x1, . . . , xp). Con-sidering β = (β0, . . . , βp), the vector containing the unknown parameters in (1), underthe assumption of independent outcomes, we can obtain the corresponding maximumlikelihood estimates β by maximizing the following log-likelihood function:

n∑i=1

[yi ln{π(xi)} + (1 − yi) ln{1 − π(xi)}] (2)

where n is the total number of observations and yi is the observed outcome for theith subject. This situation is based on having subjects as analytical units; thus thedata layout presents one record for each individual considered in the dataset (individualformat).

When one works with categorical data, it is possible (and frequently more useful)to consider some groups of subjects as units of analysis. These groups correspondto the covariate patterns (that is, the specific combinations of predictor values xj).Thus it is possible to reshape the dataset so that each record will correspond to aparticular covariate pattern or profile (events–trials format), including the total numberof individuals and total number of successes (deaths, recoveries, etc.). In this case, thegoal is to predict the proportion of successes for each group. The quantity π will be thesame for any individual in the same group (Kleinbaum and Klein 2010), and we adoptthe binomial distribution as reference to model this probability. So if we rewrite thelog-likelihood function (2) in terms of covariate patterns, we obtain

K∑j=1

[sj ln{π(xj)} + (mj − sj) ln{1 − π(xj)}] (3)

where K is the total number of possible (observed) covariate patterns, sj represents thenumber of successes, mj is the number of total individuals, and π(xj) is the proportionof successes corresponding to the jth covariate pattern. Therefore, in spite of differ-ent structures, because the information contained is exactly the same, the parameterestimates from (2) and (3) are exactly the same.

Having defined the log-likelihood function, we can perform the assessment of good-ness of fit with different methods. In this article, we focus our attention on the likelihood-ratio test (LRT) based on the deviance statistics. The deviance statistic compares, interms of likelihood, the model being fit with the saturated model. The deviance statisticfor a generalized linear model (see Agresti [2007]) is defined as

G2 = 2[ln

{Ls

(β)}

− ln{

Lm

(β)}]

(4)

where ln{Lm(β)} is the maximized log likelihood of the model of interest and ln{Ls(β)}is the maximized log likelihood of the saturated model. This quantity can also beinterpreted as a comparison between the values predicted by the fitted model and thosepredicted by the most complete model. Evidence for model lack-of-fit occurs when thevalue of G2 is large (see Hosmer et al. [1997]).

Page 144: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

358 Goodness-of-fit tests for categorical data

It is generally accepted that this statistic, under specific conditions of regularity,converges asymptotically to a χ2 distribution with h degrees of freedom, where h is thedifference between the parameters in the saturated model and the parameters in themodel being fit:

G2 ∼ χ2(h)

Therefore, we use the deviance test to assess the following hypothesis

H0 : βh = 0

where βh is the vector containing the additional parameters of the saturated modelcompared with the model considered. So H0 is rejected when

G2 ≥ χ21−α

where α is the level of significance. If H0 cannot be rejected, we can safely conclude thatthe fitting of the model of interest is substantially similar to that of the most completedmodel that can be built (see section 2). We must clarify that the LRT can always beused to compare two nested models in terms of differences of deviances.

2 Definition of saturated model

A particular issue that is not carefully considered in categorical data analysis is thedefinition of a saturated model. In fact, according to Simonoff (1998), three differentspecifications are available and depend on the unit of analysis. In general, we can thinkof the saturated model as the model that leads to the perfect prediction of the outcomeof interest and represents the largest model we can fit. Thus it is used as a referencefor the assessment of the fitting of any other model of interest.

The traditional approach is the one that considers the saturated model as the modelthat gives a perfect fit of the data. So it assumes the subjects to be the analyticalunit and is identified with the casewise approach. This model contains the interceptand n− 1 covariates (where n is the total number of available observations as specifiedabove). Consequently, the maximum likelihood function in (2) is always equal to 0 (seeKleinbaum and Klein [2010]). So the deviance statistic shown in (4) result in

G2 = −2{

lnm(β)}

= −2n∑

i=1

[yi ln{π(xi)} + (1 − yi) ln{1 − π(xi)}]

= −2n∑

i=1

[yi ln

{π(xi)

1 − π(xi)

}+ ln{1 − π(xi)}

]This approach is generally followed in the case of continuous covariates whose values can-not be grouped into categorical values. In fact, in this situation, each covariate patternwill most likely correspond to one subject (n = K), and obviously, the most reasonableanalytical unit is the subject. However, in this case, the G2 goodness-of-fit statisticscannot be approximated to a χ2 distribution (see Kuss [2002] and Kleinbaum and Klein

Page 145: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

R. Bellocco and S. Algeri 359

[2010]). Thus even if statistical packages provide a p-value from a χ2 distribution, werecommend using the second or third approach—the contingency table or collapsingapproach—if all the regressors are or can be reduced to categorical variables.

These two approaches are based on groups of subjects as analytical units (and thedata layout will be in events–trials format). These groups correspond to the covariatepatterns, and individuals sharing the same covariate pattern are members of the samegroup. The saturated model is identified as the one with the intercept and K − 1regressors, where K is the number of all the possible covariate patterns; in other words,the model includes all possible main effects and all possible interaction effects (two-way, three-way, etc., until the maximum possible interaction order). The difference ofthe two situations is based on the covariate pattern specification. In the first one, thecovariate patterns are built by considering all the covariates available in the dataset. Inthe second one, the covariate patterns are based only on the variables specified in themodel of interest.

Clearly, if the model of interest includes all the variables available in the dataset,the two approaches coincide. Under these situations (if n �= K), the log likelihood ofthe saturated model is not equal to 0, and the G2 statistic is

G2 = 2[ln

{Ls

(β)}

− ln{Lm

(β)}]

= 2

⎛⎝ K∑j=1

[sj ln

{πs(xj)πm(xj)

}+ (mj − sj) ln

{1 − πs(xj)1 − πm(xj)

}]⎞⎠where πs(xj) is the proportion of successes for the jth covariate pattern predicted bythe saturated model and πm(xj) is the one predicted by the fitted model.

The collapsing approach has a main drawback: it uses different saturated modelscorresponding to different models of interest, complicating the comparison of their re-sults in terms of goodness of fit. On the other hand, the contingency approach mayrequire the listing of a high number of covariates. In this case, we could have manycovariate patterns with a small number of subjects, making the use of the χ2 approx-imation in the LRT for goodness of fit difficult once again. A possible remedy couldbe that the hypothetical saturated model in the contingency approach should be basedon variables identified through the corresponding directed acyclic graphs. In a causalinference framework, we could then use only the variables suggested by the d-separationalgorithm applied to the directed acyclic graph, which imposes the researcher to specifythe interrelationship among the variables (Greenland, Pearl, and Robins 1999).

3 Implementation of the LRT

In this section, we implement the LRT, and we show the results by considering the threedifferent saturated model specifications, using Stata 12.1. The data used in the analysesrefer to the Titanic disaster on 15 April 1912. Information on 2,201 persons is availableon three covariates: sex (male or female), economic status (first-class passenger, second-

Page 146: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

360 Goodness-of-fit tests for categorical data

class passenger, third-class passenger, or crew), and age (adult or child), which defines16 different covariate patterns (among which 14 were observed). The outcome of interestis either passenger’s survival (1 = survivor, 0 = deceased) or the number of survivorsand total number of passengers.

As anticipated above, two possible ways to represent these data can be consideredwith respect to the goal and the unit of analysis (Kleinbaum and Klein 2010):

• Individual-record format: One record for each subject considered with the infor-mation on survival (or death) contained in a binary variable (individ.txt).

• Events–trials format: One record for each covariate pattern with frequencies onsurvivors and total number of passengers (grouped.txt) available as follows:

age sex status survival n

1. Adult Male First 57 1752. Adult Male Second 14 1683. Adult Male Third 75 4624. Adult Male Crew 192 8625. Child Male First 5 5

6. Child Male Second 11 117. Child Male Third 13 488. Child Male Crew 0 09. Adult Female First 140 144

10. Adult Female Second 80 93

11. Adult Female Third 76 16512. Adult Female Crew 20 2313. Child Female First 1 114. Child Female Second 13 1315. Child Female Third 14 31

16. Child Female Crew 0 0

Clearly, with simple reshaping data procedures, available in Stata, it is possible toswap from one format to another. This will allow us to implement each of the threeapproaches summarized in the previous section.

3.1 The casewise approach

The glm procedures, available in Stata, use as the default saturated model definitionthe model with as many covariates as the number of records in the data file. Thus usingindivid.txt with subjects as analytical units, we can easily implement the deviancetest by considering the casewise definition of the saturated model, shown below:

Page 147: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

R. Bellocco and S. Algeri 361

. insheet using individ.txt, tab clear(5 vars, 2201 obs)

. generate male = sex=="Male"

. encode status, generate(econ_status)

. glm survival i.male i.econ_status, family(binomial) link(logit)

Iteration 0: log likelihood = -1116.4813Iteration 1: log likelihood = -1114.4582Iteration 2: log likelihood = -1114.4564Iteration 3: log likelihood = -1114.4564

Generalized linear models No. of obs = 2201Optimization : ML Residual df = 2196

Scale parameter = 1Deviance = 2228.91282 (1/df) Deviance = 1.014988Pearson = 2228.798854 (1/df) Pearson = 1.014936

Variance function: V(u) = u*(1-u) [Bernoulli]Link function : g(u) = ln(u/(1-u)) [Logit]

AIC = 1.017225Log likelihood = -1114.45641 BIC = -14672.97

OIMsurvival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status2 .8808128 .1569718 5.61 0.000 .5731537 1.1884723 -.0717844 .1709268 -0.42 0.675 -.4067948 .2632264 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

The LRT for goodness of fit can be obtained with the following code:

. scalar dev=e(deviance)

. scalar df=e(df)

. display "GOF casewise "" G^2="dev " df="df " p-value= " chiprob(df, dev)GOF casewise G^2=2228.9128 df=2196 p-value= .30705384

Thus the deviance statistic G2 is 2228.91 with 2196 (= 2201−5) degrees of freedom,and the p-value referred to the deviance test is 0.3071. We notice that as expected, theG2 corresponds to −2{lnm(β)} (= −2[−1114.46]). So in this case, the null hypothesiscannot be rejected, and the fit of the model of interest is not different from the fit ofthe saturated model.

3.2 The contingency table approach

The intuitive way of implementing the contingency table approach is to apply this sameprocedure (glm) on grouped.txt. In this situation, we want to estimate the proportionof successes; thus we need to redefine the outcome by specifying two new variables:the first is the total number of subjects in each category n, and the second is the totalnumber of events, survival.

Page 148: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

362 Goodness-of-fit tests for categorical data

In Stata, we also need to add the variable containing the number of trials, n, in thefamily() option:

. insheet using grouped.txt, tab clear(5 vars, 16 obs)

. generate male = sex=="Male"

. encode status, generate(econ_status)

. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)

Iteration 0: log likelihood = -91.841683Iteration 1: log likelihood = -89.026084Iteration 2: log likelihood = -89.019672Iteration 3: log likelihood = -89.019672

Generalized linear models No. of obs = 14Optimization : ML Residual df = 9

Scale parameter = 1Deviance = 131.4183066 (1/df) Deviance = 14.60203Pearson = 127.8463371 (1/df) Pearson = 14.20515

Variance function: V(u) = u*(1-u/n) [Binomial]Link function : g(u) = ln(u/(n-u)) [Logit]

AIC = 13.43138Log likelihood = -89.01967223 BIC = 107.6668

OIMsurvival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status2 .8808128 .1569718 5.61 0.000 .5731537 1.1884723 -.0717844 .1709268 -0.42 0.675 -.4067948 .2632264 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

Now we can obtain the deviance test statistic:

. scalar dev=e(deviance)

. scalar df=e(df)

. display "GOF contingency "" G^2="dev " df="df " p-value= " chiprob(df, dev)GOF contingency G^2=131.41831 df=9 p-value= 6.058e-24

The parameter estimates do not change as they do in the casewise approach. Butas expected, the deviance statistic (131.42) has significantly decreased; the degrees offreedom have changed (9 = 14 − 5); and the p-value for the deviance test will now letus reject the null hypothesis, implying that the model of interest is not as good as thesaturated model.

3.3 The collapsing approach

Both the casewise and contingency table approaches can be applied very easily byusing the procedures shown above, whereas the collapsing approach requires more effort.

Page 149: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

R. Bellocco and S. Algeri 363

Thus, concerning the grouped dataset and by using the egen command, we first generatea variable that allows us to identify all the possible covariate patterns referring just tothe variables male and econ status.

. insheet using grouped.txt, tab clear(5 vars, 16 obs)

. generate male = sex=="Male"

. encode status, generate(econ_status)

. egen trtp=group(male econ_status)

. list

age sex status survival n male econ_s~s trtp

1. Adult Male First 57 175 1 First 62. Adult Male Second 14 168 1 Second 73. Adult Male Third 75 462 1 Third 84. Adult Male Crew 192 862 1 Crew 55. Child Male First 5 5 1 First 6

6. Child Male Second 11 11 1 Second 77. Child Male Third 13 48 1 Third 88. Child Male Crew 0 0 1 Crew 59. Adult Female First 140 144 0 First 210. Adult Female Second 80 93 0 Second 3

11. Adult Female Third 76 165 0 Third 412. Adult Female Crew 20 23 0 Crew 113. Child Female First 1 1 0 First 214. Child Female Second 13 13 0 Second 315. Child Female Third 14 31 0 Third 4

16. Child Female Crew 0 0 0 Crew 1

Second, we collapse the data by using the variable obtained in the previous stepand applying it to the two variables introduced into the model of interest (male andecon status). In this way, we obtain a dataset where each record corresponds to acovariate pattern identified by the combination of the covariates in the model.

. collapse (sum) survival n (first) male econ_status, by(trtp)

. list

trtp survival n male econ_s~s

1. 1 20 23 0 12. 2 141 145 0 23. 3 93 106 0 34. 4 90 196 0 45. 5 192 862 1 1

6. 6 62 180 1 27. 7 25 179 1 38. 8 88 510 1 4

Page 150: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

364 Goodness-of-fit tests for categorical data

We continue as we did in the contingency table approach:

. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)

Iteration 0: log likelihood = -54.70349Iteration 1: log likelihood = -52.362699Iteration 2: log likelihood = -52.356281Iteration 3: log likelihood = -52.356281

Generalized linear models No. of obs = 8Optimization : ML Residual df = 3

Scale parameter = 1Deviance = 65.17983096 (1/df) Deviance = 21.72661Pearson = 60.87983277 (1/df) Pearson = 20.29328

Variance function: V(u) = u*(1-u/n) [Binomial]Link function : g(u) = ln(u/(n-u)) [Logit]

AIC = 14.33907Log likelihood = -52.35628099 BIC = 58.94151

OIMsurvival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status2 .8808128 .1569718 5.61 0.000 .5731537 1.1884723 -.0717844 .1709268 -0.42 0.675 -.4067948 .2632264 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

. scalar dev=e(deviance)

. scalar df=e(df)

. display "GOF contingency "" G^2="dev " df="df " p-value= " chiprob(df, dev)GOF contingency G^2=65.179831 df=3 p-value= 4.591e-14

By reshaping the data, we obtain the results according to the collapsing approachdefinition. As in the contingency table approach, we reject H0, but now the value ofthe deviance statistic has changed to 65.18 with 3 (= 8 − 5) degrees of freedom. Asexpected, estimates do not change.

4 Discussion

The casewise approach is often considered the standard for defining the saturated model.The reason is that the analysis is focused on subjects, and the saturated model, insteadof the fully parameterized model, is seen as the model that gives the “perfect fit” (seeKleinbaum and Klein [2010]). This fact does not affect the estimation process; however,it fatally compromises the inferential step in a goodness-of-fit evaluation where the χ2

approximation becomes questionable. The consideration of the other approaches canlead to different and meaningful results in terms of both descriptive and inferentialanalysis, but the problem is how to implement them in the right way with the statisticalpackage we are working on.

Page 151: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

R. Bellocco and S. Algeri 365

Considering Stata 12.1, we have noticed that in all cases, the default procedures forgoodness of fit consider the saturated model to be the one with as many covariates asthe number of records present in the dataset. Thus, using an individual data layout,we obtain results relative to the casewise saturated model, where the analytical unitsare subjects. However, when considering an events–trials data format, we assess thegoodness of fit based on the contingency table approach, where the unit of analysisis the covariate pattern defined by the possible values of all the independent variablesin the dataset. The less intuitive implementation is the one based on the collapsingapproach, which uses the covariate patterns defined by the variables involved in themodel. One simple solution could be to build a new dataset containing only thesevariables, like we did with the useful commands egen and collapse, which are veryhelpful in showing how the collapsing approach works.

5 ReferencesAgresti, A. 2007. An Introduction to Categorical Data Analysis. 2nd ed. Hoboken, NJ:

Wiley.

Greenland, S., J. Pearl, and J. M. Robins. 1999. Causal diagrams for epidemiologicresearch. Epidemiology 10: 37–48.

Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. A comparison ofgoodness-of-fit tests for the logistic regression model. Statistics in Medicine 15: 965–980.

Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rded. New York: Springer.

Kuss, O. 2002. Global goodness-of-fit tests in logistic regression with sparse data.Statistics in Medicine 21: 3789–3801.

Simonoff, J. S. 1998. Logistic regression, categorical predictors, and goodness-of-fit: Itdepends on who you ask. American Statistician 52: 10–14.

About the authors

Rino Bellocco is an associate professor of biostatistics in the Department of Statistics andQuantitative Methods at the University of Milano–Bicocca, Italy, and in the Department ofMedical Epidemiology and Biostatistics at the Karolinska Institutet, Sweden.

Sara Algeri is a statistician and currently a PhD student at Texas A&M University in CollegeStation, TX. She obtained both her bachelor’s and her master’s degrees from the Universityof Milano–Bicocca, Italy. In the last year of her master’s studies, she worked at Mount SinaiSchool of Medicine in New York as a visiting master’s student. This experience has been crucialin developing her interest in biostatistics and clinical trials. Her current research mainly focuseson longitudinal data analysis, Bayesian statistics, and statistical applications in genetics.

Page 152: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

The Stata Journal (2013)13, Number 2, pp. 366–378

Standardizing anthropometric measures inchildren and adolescents with functions for

egen: Update

Suzanna I. VidmarClinical Epidemiology and Biostatistics UnitMurdoch Childrens Research Institute and

University of Melbourne Department of PaediatricsRoyal Children’s Hospital

Melbourne, [email protected]

Tim J. ColeMRC Centre of Epidemiology for Child Health

UCL Institute of Child HealthLondon, UK

[email protected]

Huiqi PanMRC Centre of Epidemiology for Child Health

UCL Institute of Child HealthLondon, UK

[email protected]

Abstract. In this article, we describe an extension to the egen functions zanthro()and zbmicat() (Vidmar et al., 2004, Stata Journal 4: 50–55). All functionality ofthe original version remains unchanged. In the 2004 version of zanthro(), z scorescould be generated using the 2000 U.S. Centers for Disease Control and PreventionGrowth Reference and the British 1990 Growth Reference. More recent growthreferences are now available. For measurement-for-age charts, age can now be ad-justed for gestational age. The zbmicat() function previously categorized childrenaccording to body mass index (weight/height2) as normal weight, overweight, orobese. “Normal weight” is now split into normal weight and three grades of thin-ness. Finally, this updated version uses cubic rather than linear interpolation tocalculate the values of L, M, and S for the child’s decimal age between successiveages (or length/height for weight-for-length/height charts).

Keywords: dm0004 1, zanthro(), zbmicat(), z scores, LMS, egen, anthropometricstandards

c© 2013 StataCorp LP dm0004 1

Page 153: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 367

1 Introduction

Comparison of anthropometric data from children of different ages is complicated bythe fact that children are still growing. We cannot directly compare the height of a5-year-old with that of a 10-year-old. Clinicians and researchers are often interested indetermining how a child compares with other children of the same age and sex: Is thechild taller, shorter, or about the same height as the average for his or her age and sex?

The growth references available to zanthro() tabulate values obtained by the LMS

method, developed by Cole (1990) and Cole and Green (1992). The LMS values are usedto transform raw anthropometric data, such as height, to standard deviation scores (zscores). These are standardized to the reference population for the child’s age and sex (orfor length/height and sex). Two sets of population-based reference data that were widelyused at the time zanthro() was initially developed are the 2000 Centers for Disease Con-trol and Prevention (CDC) Growth Reference in the United States (Kuczmarski et al.2000) and the British 1990 Growth Reference (Cole, Freeman, and Preece 1998). Sincethen, the following population-based reference data have been released and are nowavailable in zanthro(): the WHO Child Growth Standards, the WHO Reference 2007,the UK-WHO Preterm Growth Reference, and the UK-WHO Term Growth Reference.

1.1 WHO Child Growth Standards

The WHO Child Growth Standards (World Health Organization 2006, 2007) are theresult of the Multicentre Growth Reference Study (MGRS) undertaken by the WorldHealth Organization (WHO) between 1997 and 2003. They replace the 1977 NationalCenter for Health Statistics (NCHS)/WHO Growth Reference created by the U.S. NCHS

and WHO. The 1977 reference underestimated levels of low weight-for-age (underweight)for breast-fed infants. A number of specific limitations were noted by the WHO workinggroup in 1995: “1) the sample was limited to Caucasian infants from mostly middle-income families; 2) data were collected every three months rather than monthly, whichlimited the accuracy of developing the growth curve, particularly from 0–6 monthsof age; and 3) the majority of the infants in the sample were bottle-fed, and if theywere breast-fed it was only for a short duration (typically less than three months)”(Binagwaho, Ratnayake, and Smith Fawzi 2009).

WHO concluded that new growth curves were necessary, a recommendation endorsedby the World Health Assembly. The MGRS collected primary growth data and relatedinformation on 8,440 healthy breast-fed infants and young children in Brazil, Ghana,India, Norway, Oman, and the United States. It combined a longitudinal follow-up frombirth to 24 months and a cross-sectional survey of children aged 18–71 months. “TheMGRS is unique in that it was purposely designed to produce a standard by selectinghealthy children living under conditions likely to favor the achievement of their fullgenetic growth potential” (World Health Organization 2006). The WHO Child GrowthStandards can be used to assess the growth and development of children from 0–5 years.

Page 154: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

368 Standardizing anthropometric measures: Update

1.2 WHO Reference 2007

The WHO Reference 2007 (de Onis et al. 2007) is a modification of the 1977 NCHS/WHO

Growth Reference for children and adolescents aged 5–19 years. It was merged with datafrom the cross-sectional sample of children aged 18–71 months to smooth the transitionat 5 years. The WHO Reference 2007 can be used for children and adolescents aged 5–19years. It complements the WHO Child Growth Standards.

1.3 UK-WHO Growth References

In 2007, the Scientific Advisory Committee on Nutrition recommended that a modifiedWHO chart be adopted in the UK. Two composite UK-WHO data files (Cole et al. 2011),one for preterm and the other for term births, were launched in May 2009. Bothcomprise three sections:

• A birth section based on the British 1990 Growth Reference. Acknowledgmentstatements for these data should specify the data source as “British 1990 GrowthReference, reanalyzed 2009”.

• A postnatal section from 2 weeks to 4 years copied from the WHO Child GrowthStandards.

• The 4–20 years section from the British 1990 Growth Reference.

Term infants are those born at 37 completed weeks’ gestation and beyond. The UK-WHO Term Growth Reference can be used for these infants. For infants born before 37completed weeks’ gestation, the UK-WHO Preterm Growth Reference can be used, withgestationally corrected age.

1.4 Additional growth charts available in zanthro()

The WHO and composite UK-WHO growth data are now available in zanthro(). Inaddition, two new measurements have been added to the British 1990 Growth Refer-ence: waist-for-age and percentage body fat–for-age (based on Tanita body compositionanalyzer/scales).

Page 155: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 369

1.5 Categorizing children into grades of thinness and overweight

Body mass index (BMI) cutoffs are used to define categories of thinness (Cole et al.2007) and overweight (Cole et al. 2000) in children and adolescents aged 2–18 years.BMI data were obtained from nationally representative surveys of children in Brazil,Great Britain, Hong Kong, the Netherlands, Singapore, and the United States. Thethinness cutoffs correspond to equivalent adult BMI cutoff points endorsed by WHO of16, 17, and 18.5 kg/m2. zbmicat() now categorizes children into these three thinnessgrades as well as normal weight, overweight, or obese according to international cutoffpoints.

1.6 Comparison with LMSgrowth

LMSgrowth is a Microsoft Excel add-in to convert measurements to and from UnitedStates, UK, WHO, and composite UK-WHO reference z scores. It can be downloadedvia this link: http://www.healthforallchildren.com/index.php/shop/product/Software/Gr5yCsMCONpF39hF/0. z scores can only be calculated for the range of ages withina growth chart. Where the age ranges for the two programs do not overlap (table 1),only one of zanthro() and LMSgrowth will generate z scores.

Table 1. Differences in age ranges for zanthro() and LMSgrowth

Chart Age range for zanthro() Age range for LMSgrowth

CDC: Weight 0–20 years 0–19.96 yearsCDC: Height 2–20 years 1.96–19.96 yearsCDC: BMI 2–20 years 1.96–19.96 years

Each growth reference is summarized by three numbers, called L, M, and S, whichrepresent the skewness, median, and coefficient of variation of the measurement asit changes with age (or length and height). For age, L, M, and S values are generallytabulated at monthly intervals. For length and height, these parameters are tabulated at0.5 or 1 cm intervals. Where a child’s age, length, and height occur within these intervals,values of L, M, and S are obtained via cubic interpolation except at the endpoints ofthe charts, where linear interpolation is used. (The BMI cutoff points are tabulated at 6monthly intervals. The cutoff point where a child’s age occurs within these intervals isalso obtained via cubic interpolation—or linear interpolation from 2–2.5 years and 17.5–18 years.) Minor discrepancies in the z scores calculated by zanthro() and LMSgrowthwill be caused by different segment lengths and methods of interpolation for a segment(figure 1). Even so, the z scores generated by the two programs should agree within onedecimal place.

Page 156: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

370 Standardizing anthropometric measures: Update

CDC: WeightAge (years) 0 0.04 . . . 19.88 19.96 20Age (months) 0 0.5 . . . 238.5 239.5 240zanthro() + + . . . + + +︸ ︷︷ ︸ ︸ ︷︷ ︸

linear linearLMSgrowth + + . . . + + -︸ ︷︷ ︸ ︸ ︷︷ ︸

linear linear

CDC: Height and BMIAge (years) 1.96 2 2.04 . . . 19.88 19.96 20Age (months) 23.5 24 24.5 . . . 238.5 239.5 240zanthro() - + + . . . + + +︸ ︷︷ ︸ ︸ ︷︷ ︸

linear linearLMSgrowth + - + . . . + + -︸ ︷︷ ︸ ︸ ︷︷ ︸

linear linear

Figure 1. Use of linear interpolation for charts with different age ranges

2 Syntax

egen[type

]newvar = zanthro(varname,chart,version)

[if

] [in

],

xvar(varname) gender(varname) gencode(male=code, female=code)[ageunit(unit) gestage(varname) nocutoff

]egen

[type

]newvar = zbmicat(varname)

[if

] [in

], xvar(varname)

gender(varname) gencode(male=code, female=code)[ageunit(unit)

]by cannot be used with either of these functions.

Page 157: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 371

3 Functions

zanthro(varname,chart,version) calculates z scores for anthropometric measures inchildren and adolescents according to United States, UK, WHO, and composite UK-WHO reference growth charts. The three arguments are the following:

varname is the variable name of the measure in your dataset for which z scores arecalculated (for example, height, weight, or BMI).

chart; see tables 3–7 for a list of valid chart codes.

version is US, UK, WHO, UKWHOpreterm, or UKWHOterm. US calculates z scores by usingthe 2000 CDC Growth Reference; UK uses the British 1990 Growth Reference; WHOuses the WHO Child Growth Standards and WHO Reference 2007 composite datafiles as the reference data; and UKWHOpreterm and UKWHOterm use the British andWHO Child Growth Standards composite data files for preterm and term births,respectively.

zbmicat(varname) categorizes children and adolescents aged 2–18 years into three thin-ness grades—normal weight, overweight, and obese—by using BMI cutoffs (table 2).BMI is in kg/m2. This function generates a variable with the following values andlabels:

Table 2. Values and labels for grades of thinness and overweight

Value Grade/Label BMI range at 18 years

-3 Grade 3 thinness <16-2 Grade 2 thinness 16 to <17-1 Grade 1 thinness 17 to <18.50 Normal wt 18.5 to <251 Overweight 25 to <302 Obese 30+

Note that since the previous version of zbmicat(), the value label for BMI categoryhas been changed from 1 = Normal wt, 2 = Overweight, and 3 = Obese.

4 Options

xvar(varname) specifies the variable used (along with gender) as the basis for stan-dardizing the measure of interest. This variable is usually age but can also be lengthor height when the measurement is weight; that is, weight-for-age, weight-for-length,and weight-for-height are all available growth charts.

gender(varname) specifies the gender variable. It can be string or numeric. The codesfor male and female must be specified by the gencode() option.

Page 158: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

372 Standardizing anthropometric measures: Update

gencode(male=code, female=code) specifies the codes for male and female. The gen-der can be specified in either order, and the comma is optional. Quotes around thecodes are not allowed, even if the gender variable is a string.

ageunit(unit) gives the unit for the age variable and is only valid for measurement-for-age charts; that is, omit this option when the chart code is wl or wh (see section 5).The unit can be day, week, month, or year. This option may be omitted if the unitis year, because this is the default. Time units are converted as follows:

1 year = 12 months = 365.25/7 weeks = 365.25 days1 month = 365.25/84 weeks = 365.25/12 days1 week = 7 days

Note: Ages cannot be expressed to full accuracy for all units. The consequence ofthis will be most apparent at the extremes of age in the growth charts, where zscores may be generated when the age variable is in one unit and missing for someof those same ages when they have been converted to another unit.

gestage(varname) specifies the gestational age variable in weeks. This option enablesage to be adjusted for gestational age. The default is 40 weeks. If gestational age isgreater than 40 weeks, the child’s age will be corrected by the amount over 40 weeks.A warning will be given if the gestational age variable contains a nonmissing valueover 42. As with the ageunit() option, this option is only valid for measurement-for-age charts.

nocutoff forces calculation of all z scores, allowing for extreme values in your dataset.By default, any z scores with absolute values greater than or equal to 5 (that is,values that are 5 standard deviations or more away from the mean) are set to missing.

The decision to have a default cutoff at 5 standard deviations from the mean wasmade as a way of attempting to capture extreme data entry errors. Apart from thisand setting to missing any z scores where the measurement is a nonpositive number,these functions will not automatically detect data errors. As always, please checkyour data!

5 Growth charts

Growth charts available in zanthro() are presented in tables 3–7. Note: Where xvar()is outside the permitted range, zanthro() and zbmicat() return a missing value.

Page 159: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 373

Table 3. 2000 CDC Growth Charts, version US

chart Description Measurement unit xvar() range

la length-for-age cm 0–35.5 monthsha height-for-age cm 2–20 yearswa weight-for-age kg 0–20 yearsba BMI-for-age kg/m2 2–20 yearshca head circumference–for-age cm 0–36 monthswl weight-for-length kg 45–103.5 cmwh weight-for-height kg 77–121.5 cm

Table 4. British 1990 Growth Charts, version UK

chart Description Measurement unit xvar() range

ha length/height-for-age cm 0–23 yearswa weight-for-age kg 0–23 yearsba BMI-for-age kg/m2 0–23 yearshca head circumference–for-age cm Males: 0–18 years

Females: 0–17 yearssha sitting height–for-age cm 0–23 yearslla leg length–for-age cm 0–23 yearswsa waist-for-age cm 3–17 yearsbfa body fat–for-age % 4.75–19.83 years

Length/height and BMI growth data are available from 33 weeks gestation. Weight andhead circumference growth data are available from 23 weeks gestation.

Page 160: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

374 Standardizing anthropometric measures: Update

Table 5. WHO Child Growth Charts and WHO Reference 2007 Charts, version WHO

chart Description Measurement unit xvar() range

ha length/height-for-age cm 0–19 yearswa weight-for-age kg 0–10 yearsba BMI-for-age kg/m2 0–19 yearshca head circumference–for-age cm 0–5 yearsaca arm circumference–for-age cm 0.25–5 yearsssa subscapular skinfold–for-age mm 0.25–5 yearstsa triceps skinfold–for-age mm 0.25–5 yearswl weight-for-length kg 45–110 cmwh weight-for-height kg 65–120 cm

Table 6. UK WHO Preterm Growth Charts, version UKWHOpreterm

chart Description Measurement unit xvar() range

ha length/height-for-age cm 0–20 yearswa weight-for-age kg 0–20 yearsba BMI-for-age kg/m2 0.038–20 yearshca head circumference–for-age cm Males: 0–18 years

Females: 0–17 years

Length/height growth data are available from 25 weeks gestation. Weight and headcircumference growth data are available from 23 weeks gestation.

Table 7. UK WHO Term Growth Charts, version UKWHOterm

chart Description Measurement unit xvar() range

ha length/height-for-age cm 0–20 yearswa weight-for-age kg 0–20 yearsba BMI-for-age kg/m2 0.038–20 yearshca head circumference–for-age cm Males: 0–18 years

Females: 0–17 years

Length/height, weight, and head circumference growth data are available from 37 weeksgestation.

Page 161: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 375

6 Examples

Below is an illustration with data on a set of British newborns. The British 1990 GrowthReference is used; the variable sex is coded male = 1, female = 2; and the variablegestation is “completed weeks gestation”.

. use zwtukeg

. list, noobs abbreviate(9)

sex ageyrs weight gestation

1 .01 3.53 382 .073 5.05 402 .115 4.68 421 .135 4.89 362 .177 2.75 28

To compare the weight of the babies in this sample, for instance, with respect tosocioeconomic grouping, we can convert weight to standardized z scores. The z scoresare created using the following command:

. egen zwtuk = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)> gencode(male=1, female=2)(Z values generated for 5 cases)(gender was assumed to be coded male=1, female=2)(age was assumed to be in years)

In the command above, we have assumed all are term births. If some babies are bornprematurely, we can adjust for gestational age as follows.

. egen zwtuk_gest = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)> gencode(male=1, female=2) gestage(gestation)(Z values generated for 5 cases)(gender was assumed to be coded male=1, female=2)(age was assumed to be in years)

Here are the results for both of the above commands:

. list, noobs abbreviate(10)

sex ageyrs weight gestation zwtuk zwtuk_gest

1 .01 3.53 38 -.2731358 .63294742 .073 5.05 40 1.696552 1.6965522 .115 4.68 42 .2438011 -.41624391 .135 4.89 36 -.3017253 1.2048232 .177 2.75 28 -4.707812 -.2137314

Note that at gestation = 40 weeks, the z score is the same whether or not thegestage() option is used. The formula for gestationally corrected age is

actual age + (gestation at birth − 40)

where “actual age” and “gestation at birth” are in weeks.

Page 162: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

376 Standardizing anthropometric measures: Update

Gestational age may be recorded as weeks and days, as in the following example:

gestwks gestdays

38 340 642 036 228 1

These variables first need to be combined into a single gestation variable, which canthen be used with the gestage() option:

. generate gestation = gestwks + gestdays/7

Here we use the UK-WHO Term Growth Reference for term babies:

. egen zwtukwho = zanthro(weight,wa,UKWHOterm), xvar(ageyrs) gender(sex)> gencode(male=1, female=2)(Z values generated for 5 cases)(gender was assumed to be coded male=1, female=2)(age was assumed to be in years)

Here we use the UK-WHO Preterm Growth Reference for preterm babies, adjustingfor gestational age:

. egen zwtukwho_pre = zanthro(weight,wa,UKWHOpreterm), xvar(ageyrs) gender(sex)> gencode(male=1, female=2) gestage(gestation)(Z values generated for 5 cases)(gender was assumed to be coded male=1, female=2)(age was assumed to be in years)

Note: Where the gestationally corrected age is from 37 to 42 weeks, the UK-WHO

preterm and term growth charts generate different z scores. For example, the gestation-ally corrected age of a 2-week-old baby girl who was born at 37 weeks gestation is 39weeks. If her weight is 3.34 kg, the following z scores are generated using the UK-WHO

preterm and term growth charts:

. use zwtukwhoeg, clear

. egen zpreterm = zanthro(weight,wa,UKWHOpreterm), xvar(agewks) gender(sex)> gencode(male=1, female=2) ageunit(week) gestage(gestation)(Z value generated for 1 case)(gender was assumed to be coded male=1, female=2)(age was assumed to be in weeks)

. egen zterm = zanthro(weight,wa,UKWHOterm), xvar(agewks) gender(sex)> gencode(male=1, female=2) ageunit(week) gestage(gestation)(Z value generated for 1 case)(gender was assumed to be coded male=1, female=2)(age was assumed to be in weeks)

Page 163: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

S. I. Vidmar, T. J. Cole, and H. Pan 377

. list, noobs abbreviate(9)

sex weight agewks gestation zpreterm zterm

2 3.34 2 37 .2403702 -.0422754

To determine the proportion of children who are thin, normal weight, overweight,and obese, we can categorize each child by using the following command:

. use zbmicateg, clear

. egen bmicat = zbmicat(bmi), xvar(ageyrs) gender(sex)> gencode(male=1, female=2)./zbmicat.dta(BMI categories generated for 10 cases)(gender was assumed to be coded male=1, female=2)(age was assumed to be in years)

Here are the results:

. list, noobs

sex ageyrs bmi bmicat

1 5.95 13.01 Grade 2 thinness1 9.46 16.43 Normal wt2 6.71 20.62 Obese2 6.89 13.45 Grade 1 thinness2 8.63 18.96 Overweight

1 8.48 17.45 Normal wt1 7.08 15.65 Normal wt2 7.56 11.54 Grade 3 thinness2 9.78 19.56 Normal wt2 8.25 20.58 Overweight

7 Acknowledgment

This work was supported by the Victorian Government’s Operational InfrastructureSupport Program.

8 ReferencesBinagwaho, A., N. Ratnayake, and M. C. Smith Fawzi. 2009. Holding multilateral

organizations accountable: The failure of WHO in regards to childhood malnutrition.Health and Human Rights 10(2): 1–4.

Cole, T. J. 1990. The LMS method for constructing normalized growth standards.European Journal of Clinical Nutrition 44: 45–60.

Page 164: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

378 Standardizing anthropometric measures: Update

Cole, T. J., M. C. Bellizzi, K. M. Flegal, and W. H. Dietz. 2000. Establishing a standarddefinition for child overweight and obesity worldwide: International survey. BritishMedical Journal 320: 1240–1243.

Cole, T. J., K. M. Flegal, D. Nicholls, and A. A. Jackson. 2007. Body mass indexcut offs to define thinness in children and adolescents: International survey. BritishMedical Journal 335: 194–201.

Cole, T. J., J. V. Freeman, and M. A. Preece. 1998. British 1990 growth reference cen-tiles for weight, height, body mass index and head circumference fitted by maximumpenalized likelihood. Statistics in Medicine 17: 407–429.

Cole, T. J., and P. J. Green. 1992. Smoothing reference centile curves: The LMS methodand penalized likelihood. Statistics in Medicine 11: 1305–1319.

Cole, T. J., A. F. Williams, and C. M. Wright. 2011. Revised birth centiles for weight,length and head circumference in the UK-WHO growth charts. Annals of HumanBiology 38: 7–11.

Kuczmarski, R. J., C. L. Ogden, L. M. Grummer-Strawn, K. M. Flegal, S. S. Guo,R. Wei, Z. Mei, L. R. Curtin, A. F. Roche, and C. L. Johnson. 2000. CDC growthcharts: United States. Advance Data 314: 1–27.

de Onis, M., A. W. Onyango, E. Borghi, A. Siyam, C. Nishida, and J. Siekmann. 2007.Development of a WHO growth reference for school-aged children and adolescents.Bulletin of the World Health Organization 85: 660–667.

Vidmar, S., J. Carlin, K. Hesketh, and T. Cole. 2004. Standardizing anthropometricmeasures in children and adolescents with new functions for egen. Stata Journal 4:50–55.

World Health Organization. 2006. WHO child growth standards: Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age:Methods and development. Geneva: World Health Organization.

———. 2007. WHO child growth standards: Head circumference-for-age, armcircumference-for-age, triceps skinfold-for-age and subscapular skinfold-for-age:Methods and development. Geneva: World Health Organization.

About the authors

Suzanna I. Vidmar is a senior research officer in the Clinical Epidemiology and BiostatisticsUnit at the Murdoch Childrens Research Institute and University of Melbourne Departmentof Paediatrics at the Royal Children’s Hospital in Melbourne, Australia.

Tim J. Cole is a professor of medical statistics in the MRC Centre of Epidemiology for ChildHealth at the UCL Institute of Child Health in London, UK, and has published widely on theanalysis of human growth data.

Huiqi Pan is a statistical programmer in the MRC Centre of Epidemiology for Child Health atthe UCL Institute of Child Health in London, UK.

Page 165: The Stata Journal - Mahidol University · (2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited therein. Section 2 describes the SARAR model. Section

The Stata Journal (2013)13, Number 2, pp. 379–381

Bonferroni and Holm approximations for Sidakand Holland–Copenhaver q-values

Roger B. NewsonNational Heart and Lung Institute

Imperial College LondonLondon, UK

[email protected]

Abstract. I describe the use of the Bonferroni and Holm formulas as approxi-mations for Sidak and Holland–Copenhaver formulas when issues of precision areencountered, especially with q-values corresponding to very small p-values.

Keywords: st0300, parmest, qqvalue, smileplot, multproc, multiple-test procedure,familywise error rate, Bonferroni, Sidak, Holm, Holland, Copenhaver

1 Introduction

Frequentist q-values for a range of multiple-test procedures are implemented in Stata byusing the package qqvalue (Newson 2010), downloadable from the Statistical SoftwareComponents (SSC) archive. The Sidak q-value for a p-value p is given by qsid = 1 −(1 − p)m, where m is the number of multiple comparisons (Sidak 1967). It is a lessconservative alternative to the Bonferroni q-value, given by qbon = min(1,mp). However,the Sidak formula may be incorrectly evaluated by a computer to 0 when the input p-value is too small to give a result lower than 1 when subtracted from 1, which is thecase for p-values of 10−17 or less, even in double precision. q-values of 0 are logicallypossible as a consequence of p-values of 0, but in this case, they may be overliberal. Thisliberalism may possibly be a problem in the future, given the current technology-driventrend of exponentially increasing multiple comparisons and the human-driven problemof ingenious data dredging. I present a remedy for this problem and discuss its use incomputing q-values and discovery sets.

2 Methods for q-values

The remedy used by the SSC packages qqvalue and parmest (Newson 2003) is to substitute the Bonferroni formula for the Sidak formula for such small p-values. This works because the Bonferroni and Sidak q-values converge in ratio as p tends to 0. To prove this, I show that for 0 ≤ p < 1,

\[
\frac{dq_{\mathrm{bon}}}{dp} = m \qquad\text{and}\qquad \frac{dq_{\mathrm{sid}}}{dp} = m(1-p)^{m-1}
\]

and that the Sidak/Bonferroni ratio of these derivatives is (1 − p)^{m−1}, which is 1 if p = 0. By L'Hôpital's rule, it follows that the ratio q_sid/q_bon also tends to 1 as p tends to 0.
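The convergence in ratio is easy to check numerically for a p-value small enough to be interesting but large enough to avoid the underflow; a quick sketch with the arbitrary values p = 1e-8 and m = 100:

. display (1 - (1 - 1e-8)^100) / min(1, 100*1e-8)
.9999995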

© 2013 StataCorp LP st0300



A similar argument shows that the same problem exists with the q-values output by the Holland–Copenhaver procedure (Holland and Copenhaver 1987). If the m input p-values, sorted in ascending order, are denoted p_i for i from 1 to m, then the Holland–Copenhaver procedure is defined by the formula

\[
s_i = 1 - (1 - p_i)^{m-i+1}
\]

where s_i is the ith s-value. (In the terminology of Newson [2010], s-values are truncated at 1 to give r-values, which are in turn input into a step-down procedure to give the eventual q-values.) The remedy used by qqvalue here is to substitute the s-value formula from the procedure of Holm (1979), which is

\[
s_i = (m - i + 1)\, p_i
\]

whenever 1 − p_i is evaluated as 1. This also works because the two s-value formulas converge in ratio as p_i tends to 0. Note that the Holm procedure is derived from the Bonferroni procedure by using the same step-down method as is used to derive the Holland–Copenhaver procedure from the Sidak procedure.
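The s-value underflow can be reproduced in the same way; a minimal sketch with the arbitrary values m = 10 and p_1 = 1e-19:

. display 1 - (1 - 1e-19)^10
0

. display (10 - 1 + 1)*1e-19
1.000e-18

The first result is the Holland–Copenhaver s-value, which underflows to 0; the second is the corresponding Holm s-value, which does not.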

3 Methods for discovery sets

The SSC package smileplot (Newson and the ALSPAC Study Team 2003) also implements a range of multiple-test procedures by using two commands, multproc and smileplot. However, instead of outputting q-values, smileplot outputs a corrected critical p-value threshold and a corresponding discovery set, defined as the subset of input p-values at or below the corrected critical p-value. The Sidak-corrected critical p-value corresponding to an uncorrected critical p-value p_unc is given by c_sid = 1 − (1 − p_unc)^{1/m} and may be overconservative if wrongly evaluated to 0. In this case, the quantity that might be wrongly computed as 1 is (1 − p_unc)^{1/m}. When this happens, smileplot substitutes the Bonferroni-corrected critical p-value c_bon = p_unc/m. However, this is a slightly less elegant remedy in this case because the quantity (1 − p_unc)^{1/m} is usually evaluated to 1 because m is large and not because p_unc is small.

To study the behavior of the Bonferroni approximation for large m, we define λ = 1/m and note that

\[
\frac{dc_{\mathrm{bon}}}{d\lambda} = p_{\mathrm{unc}} \qquad\text{and}\qquad \frac{dc_{\mathrm{sid}}}{d\lambda} = -\ln(1 - p_{\mathrm{unc}})(1 - p_{\mathrm{unc}})^{\lambda}
\]

implying (by L'Hôpital's rule) that in the limit, as λ tends to 0, the Sidak/Bonferroni ratio of the two derivatives (and therefore of the two corrected thresholds) tends to −ln(1 − p_unc)/p_unc. This quantity is not as low as 1 but is 1.150728, 1.053605, 1.025866, and 1.005034 if p_unc is 0.25, 0.10, 0.05, and 0.01, respectively. Therefore, the Bonferroni approximation in this case is still slightly conservative for a very large number of multiple comparisons over a range of commonly used uncorrected critical p-values, but is less conservative than the value of 0, which would otherwise be computed.
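These limiting ratios can be verified directly at the Stata prompt; for example, for p_unc = 0.05:

. display -ln(1 - 0.05)/0.05
1.0258659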

This argument is easily generalized to the Holland–Copenhaver procedure. In this case, smileplot initially calculates a vector of m candidate critical p-value thresholds by using the formula



\[
c_i = 1 - (1 - p_{\mathrm{unc}})^{1/(m-i+1)}
\]

for i from 1 to m and selects the corrected critical p-value corresponding to a given uncorrected critical p-value from these candidates by using a step-down procedure. If the quantity (1 − p_unc)^{1/(m−i+1)} is evaluated as 1, then smileplot substitutes the corresponding Holm critical p-value threshold

\[
c_i = p_{\mathrm{unc}}/(m - i + 1)
\]

which again is conservative as m − i + 1 becomes large (corresponding to the smallest p-values from a large number of multiple comparisons), but is less conservative than the value of 0, which would otherwise be computed.

Newson (2010) argues that q-values are an improvement on discovery sets because, given the q-values, different members of the audience can apply different input critical p-values and derive their own discovery sets. The technical issue of precision presented here may be one more minor reason for preferring q-values to discovery sets.

4 Acknowledgment

I would like to thank Tiago V. Pereira of the University of Sao Paulo in Brazil for drawing my attention to this issue of precision with the Sidak and Holland–Copenhaver procedures.

5 References

Holland, B. S., and M. D. Copenhaver. 1987. An improved sequentially rejective Bonferroni test procedure. Biometrics 43: 417–423.

Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65–70.

Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata Journal 3: 245–269.

Newson, R., and the ALSPAC Study Team. 2003. Multiple-test procedures and smile plots. Stata Journal 3: 109–132.

Newson, R. B. 2010. Frequentist q-values for multiple-test procedures. Stata Journal 10: 568–584.

Sidak, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626–633.

About the author

Roger B. Newson is a lecturer in medical statistics at Imperial College London, UK, working principally in asthma research. He wrote the parmest, qqvalue, and smileplot Stata packages.


The Stata Journal (2013) 13, Number 2, pp. 382–397

Fitting the generalized multinomial logit model in Stata

Yuanyuan Gu
Centre for Health Economics Research and Evaluation
University of Technology, Sydney
Sydney, Australia

[email protected]

Arne Risa Hole
Department of Economics
University of Sheffield
Sheffield, UK

[email protected]

Stephanie Knox
Centre for Health Economics Research and Evaluation
University of Technology, Sydney
Sydney, Australia

[email protected]

Abstract. In this article, we describe the gmnl Stata command, which can be used to fit the generalized multinomial logit model and its special cases.

Keywords: st0301, gmnl, gmnlpred, gmnlcov, generalized multinomial logit, scale heterogeneity multinomial logit, maximum simulated likelihood

1 Introduction

Explaining variations in the behaviors of individuals is of central importance in choice analysis. For the last decade, the most popular explanation has been preference or taste heterogeneity; that is, some individuals care more about particular product attributes than do others. This assumption is most naturally represented via random parameter models, among which the mixed logit (MIXL) model has become the standard to use (McFadden and Train 2000).

Recently, however, a group of researchers (for example, Louviere et al. [1999], Louviere et al. [2002], Louviere and Eagle [2006], and Louviere et al. [2007]) has argued that in most choice contexts, much of the preference heterogeneity may be better described as "scale" heterogeneity; that is, with attribute coefficients fixed, the scale of the idiosyncratic error term is greater for some consumers than it is for others. Because the scale of the error term is inversely related to the error variance, this argument implies that choice behavior is more random for some consumers than it is for others. Although the scale of the error term in discrete choice models cannot be separately identified from the attribute coefficients, it is possible to identify relative scale terms across consumers. Thus the statement that all heterogeneity is in the scale of the error term "is observationally equivalent to the statement that heterogeneity takes the form of the vector of utility weights being scaled up or down proportionately as one 'looks' across consumers" (Fiebig et al. 2010). These arguments have led to the scale heterogeneity multinomial logit (S-MNL) model, a much more parsimonious model specification than MIXL.

© 2013 StataCorp LP st0301



To accommodate both preference and scale heterogeneity, Fiebig et al. (2010) developed a generalized multinomial logit (G-MNL) model that nests MIXL and S-MNL. Their research also shows that the two sources of heterogeneity often coexist but that their importance varies in different choice contexts.

In this article, we will describe the gmnl Stata command, which can be used to fit the G-MNL model and its special cases. The command is a generalization of the mixlogit command developed by Hole (2007). We will also present an empirical example that demonstrates how to use gmnl, and we will discuss related computational issues.

2 The G-MNL model and its special cases

We assume a sample of N respondents with the choice of J alternatives in T choice situations.1 Following Fiebig et al. (2010), the G-MNL model gives the probability of respondent i choosing alternative j in choice situation t as

\[
\Pr(\text{choice}_{it} = j \mid \beta_i) = \frac{\exp(\beta_i' x_{itj})}{\sum_{k=1}^{J} \exp(\beta_i' x_{itk})} \qquad (1)
\]
\[
i = 1, \ldots, N; \quad t = 1, \ldots, T; \quad j = 1, \ldots, J
\]

where x_itj is a vector of observed attributes of alternative j and β_i is a vector of individual-specific parameters defined as

\[
\beta_i = \sigma_i \beta + \{\gamma + \sigma_i (1 - \gamma)\}\eta_i \qquad (2)
\]

The specification of β_i in (2) is central to G-MNL and differentiates it from previous heterogeneity models. It depends on a constant vector β, a scalar parameter γ, a random vector η_i distributed MVN(0, Σ), and σ_i, the individual-specific scale of the idiosyncratic error.

In Fiebig et al. (2010), γ is constrained to be between 0 and 1. In extreme cases, γ = 1 leads to G-MNL-I: β_i = σ_i β + η_i, and γ = 0 leads to G-MNL-II:2 β_i = σ_i(β + η_i). To understand the difference between these two models, Fiebig et al. (2010) describe them with a single equation: β_i = σ_i β + η*_i, where σ_i captures scale heterogeneity and η*_i captures residual preference heterogeneity. Through this, we can see that in G-MNL-I, the standard deviation of η*_i is independent of the scaling of β, whereas in G-MNL-II, it is proportional to σ_i.

However, an article by Keane and Wasi (forthcoming) points out that γ < 0 or γ > 1 still permits sensible behavioral interpretations, and thus there is no reason to impose the constraint. We follow their advice and allow γ to take any value.

1. We could also consider a different number of alternatives and choice situations for each respondent; for example, see Greene and Hensher (2010). The gmnl command can handle both of these cases.

2. Greene and Hensher (2010) call this the “scaled mixed logit model”.



Three useful special cases of G-MNL are the following:

• MIXL: β_i = β + η_i (when σ_i = 1)

• S-MNL: β_i = σ_i β (when var(η_i) = 0)

• Standard multinomial logit: β_i = β (when σ_i = 1 and var(η_i) = 0)

The gmnl command includes an option for fitting MIXL models, but we recommend that mixlogit be used for this purpose because it is usually faster.

To complete the model specification, we need to choose a distribution for σ_i. Although any distribution defined on the positive real line is a theoretical possibility, Fiebig et al. (2010) assume that σ_i is distributed lognormal with standard deviation τ and mean σ + θz_i, where σ is a normalizing constant and z_i is a vector of characteristics of individual i that can be used to explain why σ_i differs across people.

3 Maximum simulated likelihood

The log likelihood for G-MNL is given by

\[
LL(\beta, \gamma, \tau, \theta, \Sigma) = \sum_{i=1}^{N} \ln\left\{ \int \prod_{t=1}^{T} \prod_{j=1}^{J} \Pr(\text{choice}_{it} = j \mid \beta_i)^{y_{itj}} \, p(\beta_i \mid \beta, \gamma, \tau, \theta, \Sigma)\, d\beta_i \right\} \qquad (3)
\]

where y_itj is the observed choice variable, Pr(choice_it = j | β_i) is given by (1), and p(β_i | β, γ, τ, θ, Σ) is implied by (2).

Maximizing the log likelihood in (3) directly is rather difficult because the integral does not have a closed-form representation and so must be evaluated numerically. We choose to approximate it with simulation (see Train [2009], for example). The simulated likelihood is

\[
SLL(\beta, \gamma, \tau, \theta, \Sigma) = \sum_{i=1}^{N} \ln\left\{ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T} \prod_{j=1}^{J} \Pr(\text{choice}_{it} = j \mid \beta_i^{[r]})^{y_{itj}} \right\}
\]
\[
\beta_i^{[r]} = \sigma_i^{[r]} \beta + \left\{ \gamma + \sigma_i^{[r]} (1 - \gamma) \right\} \eta_i^{[r]}
\]
\[
\sigma_i^{[r]} = \exp(\sigma + \theta z_i + \tau \nu^{[r]})
\]

where η_i^{[r]} is a vector generated from MVN(0, Σ) and ν^{[r]} is a N(0, 1) scalar. η_i^{[r]} and ν^{[r]} are generated using Halton draws (Halton 1964) and pseudorandom draws, respectively. When testing the code, we found that this combination works better than using Halton draws to generate all the random terms.
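Readers who want to experiment with the two kinds of draws outside gmnl can do so in Mata; the following is a minimal sketch, not the gmnl internals, and the dimensions 500 and 3 are arbitrary:

. mata:
: U = halton(500, 3)            // Halton draws in 3 dimensions, uniform on (0,1)
: eta = invnormal(U)            // transformed to standard normal draws for eta
: nu = rnormal(500, 1, 0, 1)    // pseudorandom standard normal draws for nu
: end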

Following Fiebig et al. (2010), we set the normalizing constant σ as
\[
\sigma = -\ln\left\{ \frac{1}{N} \sum_{i=1}^{N} \exp(\tau \nu_i^{[r]}) \right\}
\]
where ν_i^{[r]} is the rth draw for the ith person. We also draw ν from a truncated normal with truncation at ±2.



4 The gmnl command

4.1 Syntax

gmnl is implemented as a gf0 ml evaluator. The Halton draws used in the estimation process are generated using the Mata function halton() (Drukker and Gates 2006). The generic syntax for the command is as follows:

gmnl depvar [varlist] [if] [in], group(varname) [rand(varlist) id(varname)
    corr nrep(#) burn(#) gamma(#) scale(matrix) het(varlist) mixl seed(#)
    level(#) constraints(numlist) vce(vcetype) maximize_options]

The command gmnlpred can be used following gmnl to obtain predicted probabilities. The predictions are available both in and out of sample; type gmnlpred ... if e(sample) ... if predictions are wanted for the estimation sample only.

gmnlpred newvar [if] [in] [, nrep(#) burn(#) ll]

The command gmnlcov can be used following gmnl to obtain the elements in the coefficient covariance matrix along with their standard errors. This command is only relevant when the coefficients are specified to be correlated; see the corr option below. gmnlcov is a wrapper for nlcom (see [R] nlcom).

gmnlcov [, sd]

The command gmnlbeta can be used following gmnl to obtain the individual-level parameters corresponding to the variables in the specified varlist by using the method proposed by Revelt and Train (2000) (see also Train [2009, chap. 11]). The individual-level parameters are stored in a data file specified by the user. As with gmnlpred, the predictions are available both in and out of sample; type gmnlbeta ... if e(sample) ... if predictions are wanted for the estimation sample only.

gmnlbeta varlist [if] [in], saving(filename) [replace nrep(#) burn(#)]

4.2 gmnl options

group(varname) specifies a numeric identifier variable for the choice occasions. group() is required.

rand(varlist) specifies the independent variables whose coefficients are random (normally distributed). The variables immediately following the dependent variable in the syntax are specified to have fixed coefficients.



id(varname) specifies a numeric identifier variable for the decision makers. This option should be specified only when each individual performs several choices, that is, when the dataset is a panel.

corr specifies that the random coefficients be correlated. The default is that they are independent. When the corr option is specified, the estimated parameters are the means of the (fixed and random) coefficients plus the elements of the lower-triangular matrix L, where the covariance matrix for the random coefficients is given by Σ = LL′. The estimated parameters are reported in the following order: the means of the fixed coefficients, the means of the random coefficients, and the elements of the L matrix. The gmnlcov command can be used postestimation to obtain the elements in the Σ matrix along with their standard errors.

If the corr option is not specified, the estimated parameters are the means of the fixed coefficients and the means and standard deviations of the random coefficients, reported in that order. The sign of the estimated standard deviations is irrelevant. Although in practice the estimates may be negative, interpret them as being positive.

The sequence of the parameters is important to bear in mind when specifying starting values.

nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).

burn(#) specifies the number of initial sequence elements to drop when creating the Halton sequences. The default is burn(15). Specifying this option helps reduce the correlation between the sequences in each dimension. Train (2009, 227) recommends that # should be at least as large as the largest prime number used to generate the sequences. If there are K random coefficients, gmnl uses the first K primes to generate the Halton draws.

gamma(#) constrains the gamma parameter to the specified value in the estimations.

scale(matrix) specifies a matrix whose elements indicate whether their corresponding variable will be scaled (1 = scaled and 0 = not scaled). The matrix should have one row, and the number of columns should be equal to the number of explanatory variables in the model.

het(varlist) specifies the variables in the zi vector (if any).

mixl specifies that a mixed logit model should be estimated instead of a G-MNL model.

seed(#) specifies the seed. The default is seed(12345).

level(#); see [R] estimation options.

constraints(numlist); see [R] estimation options.

vce(vcetype); vcetype may be oim, robust, cluster clustvar, or opg; see [R] vce_option.

maximize_options: difficult, technique(algorithm_spec), iterate(#), trace, gradient, showstep, hessian, tolerance(#), ltolerance(#), gtolerance(#), nrtolerance(#), and from(init_specs); see [R] maximize.



4.3 gmnlpred options

nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).

burn(#) specifies the number of initial sequence elements to drop when creating the Halton sequences. The default is burn(15). Specifying this option helps reduce the correlation between the sequences in each dimension. Train (2009, 227) recommends that # should be at least as large as the largest prime number used to generate the sequences. If there are K random coefficients, gmnl uses the first K primes to generate the Halton draws.

ll estimates individual log likelihoods.

4.4 gmnlcov option

sd reports the standard deviations of the correlated coefficients instead of the covariance matrix.

4.5 gmnlbeta options

saving(filename) saves individual-level parameters to filename. saving() is required.

replace overwrites filename.

nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).

burn(#) specifies the number of initial sequence elements to drop when creating the Halton sequences. The default is burn(15). Specifying this option helps reduce the correlation between the sequences in each dimension. Train (2009, 227) recommends that # should be at least as large as the largest prime number used to generate the sequences. If there are K random coefficients, gmnl uses the first K primes to generate the Halton draws.

5 Computational issues

As in any model estimated using maximum simulated likelihood, parameter estimates of G-MNL would depend on four factors: the random-number seed, number of draws, starting values, and optimization method. If the four factors are fixed, the same maximum likelihood estimates would be obtained at each simulation.



To have a good approximation of the likelihood, we must use a reasonable number of draws: the more draws used, the better the accuracy. However, a larger number of draws almost surely leads to a longer computation time. To determine the minimum number of draws for a desirable level of accuracy remains a theoretical challenge, but empirically, we may run the gmnl command several times with an increasing number of draws in each run (with the other three factors fixed) until the estimates stabilize. We should mention that too few draws may lead to serious convergence problems: a good starting point is 500 draws.

Starting values are crucial to achieve convergence, especially for the full model, that is, G-MNL with correlated random parameters (G-MNL correlated). If we see optimization as climbing a hill, then where we start climbing is one of the major factors that decide how long it will take to reach the top or if we can ever get there in the end. If we start from the bottom where the territory is often flat, the direction guidance (that is, the first-order derivatives of the likelihood) may not function well and lead us farther away from the top. For this reason, the default starting values based on the multinomial logit estimates may not be the best, and we often need to choose our own set of starting values.

To estimate "G-MNL correlated", we have tried four different sets of starting values: G-MNL uncorrelated, MIXL uncorrelated, MIXL correlated, and G-MNL correlated with γ fixed as 0. In most cases, these four sets of starting values would all lead to convergence, but the speed might be very different. In any case, we should not be content with only one set of starting values because even if the model converges, it is not guaranteed that we have reached the global maximum. We suggest running the routine multiple times, each with different starting values, and reporting the estimates from the run that obtains the largest likelihood.
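One way to automate this advice is a simple do-file loop; the following sketch assumes that conformable starting-value vectors start1 through start4 have already been defined as matrices and reuses the model specification from section 6:

forvalues s = 1/4 {
    quietly gmnl y knowgp malegp cost, group(gid) id(rid) ///
        rand(dummy_test testdue drrec) nrep(500) scale(scale) ///
        corr gamma(0) from(start`s', copy)
    display "starting values `s': log likelihood = " e(ll)
}

The run with the largest log likelihood would then be refit and reported.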

The choice of optimization method is another important factor that affects model convergence. Stata allows four options: Newton–Raphson (default), Berndt–Hall–Hall–Hausman, Davidon–Fletcher–Powell, and Broyden–Fletcher–Goldfarb–Shanno. Chapter 8 in Train (2009) describes these methods in detail and concludes that Broyden–Fletcher–Goldfarb–Shanno usually performs better than the others. With mixlogit and gmnl, however, we have found that Newton–Raphson often works best in the sense that it is more likely to converge than the alternative algorithms. The only problem with Newton–Raphson is that it can be very slow when there are a lot of parameters to estimate.

Finally, we have found that in some cases, different computers can give different results if there are several random parameters in the model and γ is freely estimated. This can happen when the model is numerically unstable and different numbers of processors are used during estimation.



6 Empirical example

We will now present some examples that demonstrate how the gmnl command can be used to fit the different models described in section 2. We will start by fitting some relatively simple models, and then we will build up the complexity gradually. The data used in the examples come from a stated preference study on Australian women who were asked to choose whether to have a pap smear test; see Fiebig and Hall (2004). There were 79 women in the sample, and each respondent was presented with 32 scenarios. Thus in terms of the model structure described in section 2, N = 79, T = 32, and J = 2. The dataset also contains five attributes, which are described in table 1. Besides these five attributes, an alternative-specific constant (ASC) will be used to measure intangible aspects of the pap smear test not captured by the design attributes (some women would choose or not choose the test just because of these intangible aspects no matter what attributes they are presented with).

Table 1. Pap smear test data. Definition of variables.

Variable   Definition
knowgp     1 if the general practitioner is known to the patient; 0 otherwise
malegp     1 if the general practitioner is male; 0 if the general practitioner is female
testdue    1 if the patient is due or overdue for a pap smear test; 0 otherwise
drrec      1 if the general practitioner recommends that the patient have a pap smear test; 0 otherwise
cost       cost of test (unit: 10 Australian dollar)

To give an impression of how the data are structured, we have listed the first six observations below. Each observation corresponds to an alternative, and the dependent variable y is 1 for the chosen alternative in each choice situation and 0 otherwise. gid identifies the alternatives in a choice situation; rid identifies the choice situations faced by a given individual; and the remaining variables are the alternative attributes described in table 1 and the ASC (dummy_test). In the listed data, the same individual faces three choice situations.



. use paptest.dta

. generate cost = papcost/10

. list y dummy_test knowgp malegp testdue drrec cost gid rid in 1/6, sep(2)
> abb(10)

     y   dummy_test   knowgp   malegp   testdue   drrec   cost   gid   rid
  1. 0            1        1        0         0       0      2     1     1
  2. 1            0        0        0         0       0      0     1     1

  3. 0            1        1        0         0       1      2     2     1
  4. 1            0        0        0         0       0      0     2     1

  5. 0            1        0        1         0       1      2     3     1
  6. 1            0        0        0         0       0      0     3     1

We start by fitting a relatively simple S-MNL model with a fixed (nonrandom) ASC. Fiebig et al. (2010) have pointed out that ASCs should not be scaled, because they are fundamentally different from observed attributes. We can fit the model with a fixed ASC by using the scale() option of gmnl as described below.

. /*S-MNL with fixed ASC*/
. matrix scale = (0,1,1,1,1,1)
. gmnl y dummy_test knowgp malegp testdue drrec cost, group(gid) id(rid)
> nrep(500) scale(scale)

Iteration 0:  log likelihood = -1452.649  (not concave)
  (output omitted)
Iteration 7:  log likelihood = -1123.7542

Generalized multinomial logit model             Number of obs   =        5056
                                                Wald chi2(6)    =      238.93
Log likelihood = -1123.7542                     Prob > chi2     =      0.0000

                              (Std. Err. adjusted for clustering on rid)
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  dummy_test |  -1.938211     .13976   -13.87   0.000    -2.212136   -1.664286
      knowgp |   1.811842   .4322376     4.19   0.000     .9646719    2.659012
      malegp |  -.9527219    .319577    -2.98   0.003    -1.579081   -.3263625
     testdue |   5.305355   1.229367     4.32   0.000     2.895841    7.714869
       drrec |   2.656325   .6021513     4.41   0.000      1.47613     3.83652
        cost |   .0043925   .0932526     0.05   0.962    -.1783792    .1871643
-------------+----------------------------------------------------------------
        /tau |   1.458027   .1726576     8.44   0.000     1.119624     1.79643
------------------------------------------------------------------------------
The sign of the estimated standard deviations is irrelevant: interpret them as
being positive

To avoid scaling the ASC, we create a matrix whose elements indicate whether their corresponding variable will be scaled (1 = scaled and 0 = not scaled). Here the "scale" matrix defined as (0, 1, 1, 1, 1, 1) corresponds to the variables in the order in which they are specified in the model (dummy_test, knowgp, etc.). Therefore, among these six variables, only dummy_test (that is, the ASC) is not scaled.



We should mention that the number of observations reported in the table, 5,056, is N × T × J, that is, the total number of choices times the number of alternatives. For most purposes, such as computing information criteria, it is more appropriate to use the total number of choices (N × T); therefore, we do not recommend that you use the estat ic command after gmnl.
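If an information criterion is wanted on that basis, it can be computed by hand after estimation. The following one-line sketch assumes, as for most ml-based commands, that e(ll) and e(k) hold the log likelihood and the number of estimated parameters:

. display "BIC = " -2*e(ll) + e(k)*ln(79*32)    // 79*32 = N x T, the number of choices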

We then let dummy_test be random, which leads to our second model: S-MNL with random ASC.3

. /*S-MNL with random ASC*/
. matrix scale = (1,1,1,1,1,0)
. gmnl y knowgp malegp testdue drrec cost, group(gid) id(rid) rand(dummy_test)
> nrep(500) scale(scale) gamma(0)

Iteration 0:  log likelihood = -1431.8448  (not concave)
  (output omitted)
Iteration 8:  log likelihood = -1061.7787

Generalized multinomial logit model             Number of obs   =        5056
                                                Wald chi2(6)    =      111.46
Log likelihood = -1061.7787                     Prob > chi2     =      0.0000

                              (Std. Err. adjusted for clustering on rid)
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Mean         |
      knowgp |   .6263819   .1431434     4.38   0.000     .3458261    .9069378
      malegp |  -1.350731   .2024933    -6.67   0.000     -1.74761   -.9538514
     testdue |   2.954924   .2950128    10.02   0.000      2.37671    3.533139
       drrec |   .7730114   .1608242     4.81   0.000     .4578018    1.088221
        cost |  -.1701498   .0585679    -2.91   0.004    -.2849408   -.0553588
  dummy_test |  -.7052151   .3578936    -1.97   0.049    -1.406674   -.0037565
-------------+----------------------------------------------------------------
SD           |
  dummy_test |   2.660664   .2798579     9.51   0.000     2.112152    3.209175
-------------+----------------------------------------------------------------
        /tau |   .9255689   .1032721     8.96   0.000     .7231593    1.127978
------------------------------------------------------------------------------
The sign of the estimated standard deviations is irrelevant: interpret them as
being positive

3. Strictly speaking, this is not an S-MNL but a parsimonious form of G-MNL; that is, we model ASCs using preference heterogeneity but model other attributes using scale heterogeneity. This specification of G-MNL has been used by Fiebig et al. (2011) and Knox et al. (2013).



Comparing "S-MNL with fixed ASC" with "S-MNL with random ASC", we can see that the latter model improved the model fit by adding one more parameter, the standard deviation of dummy_test, which is statistically significant.4 The improvement in fit is not surprising because the random ASC captures preference heterogeneity and allows for correlation across choice situations because of the panel nature of the data. The parameter estimates between the two models are somewhat different, but they cannot be compared directly because of differences in scale across models, as indicated by the estimate of τ. Instead, we should run the gmnlpred command to compare the predicted probabilities. We shall demonstrate how to do predictions after fitting the full G-MNL model.

The third example is a G-MNL model in which dummy_test, testdue, and drrec are given random coefficients. For the moment, the coefficients are specified to be uncorrelated; that is, the off-diagonal elements of Σ are all 0. To speed up the estimation, we constrain γ to 0 by using the gamma(0) option, which implies that the fitted model is a G-MNL-II (or "scaled mixed logit").

. /*G-MNL with uncorrelated random coefficients*/
. matrix scale = (1,1,1,0,1,1)
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
> nrep(500) scale(scale) gamma(0)

Iteration 0:   log likelihood = -1414.4896  (not concave)
  (output omitted)
Iteration 19:  log likelihood = -991.41088

Generalized multinomial logit model             Number of obs   =        5056
                                                Wald chi2(6)    =       67.25
Log likelihood = -991.41088                     Prob > chi2     =      0.0000

                              (Std. Err. adjusted for clustering on rid)
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Mean         |
      knowgp |   .9123367   .1748867     5.22   0.000     .5695651    1.255108
      malegp |  -2.742707   .3543753    -7.74   0.000     -3.43727   -2.048144
        cost |  -.1419785   .0637313    -2.23   0.026    -.2668895   -.0170675
  dummy_test |  -.4904328   .2685654    -1.83   0.068    -1.016811    .0359457
     testdue |    5.79628   .8667601     6.69   0.000     4.097462    7.495099
       drrec |   1.492487   .2652157     5.63   0.000     .9726734      2.0123
-------------+----------------------------------------------------------------
SD           |
  dummy_test |   2.988542   .3213189     9.30   0.000     2.358769    3.618316
     testdue |   3.166774   .4859329     6.52   0.000     2.214363    4.119185
       drrec |   1.356382    .194595     6.97   0.000      .974983    1.737781
-------------+----------------------------------------------------------------
        /tau |   1.177626    .115934    10.16   0.000     .9503993    1.404852
------------------------------------------------------------------------------
The sign of the estimated standard deviations is irrelevant: interpret them as
being positive

. *Save coefficients for later use

. matrix b = e(b)

4. Note that we constrain γ to 0 by using the gamma(0) option. This is to prevent gmnl from attempting to estimate the gamma parameter, because it is not identified in this model.



The square roots of the diagonal elements of Σ are estimated and shown in the block under SD. All the standard deviations are significantly different from 0, which suggests the presence of substantial preference heterogeneity in the data.

In the last example, we allow the random coefficients of dummy_test, testdue, and drrec to be correlated, which implies that the off-diagonal elements of Σ will not be fixed as zeros. Instead of using the default starting values, we use the parameters from the previous model, setting the starting values for the off-diagonal elements of Σ to 0.

. *Starting values
. matrix start = (b[1,1..7],0,0,b[1,8],0,b[1,9..10])

. /*G-MNL with correlated random coefficients*/
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
> nrep(500) from(start,copy) scale(scale) corr gamma(0)

Iteration 0:  log likelihood = -991.41088  (not concave)
  (output omitted)
Iteration 8:  log likelihood = -987.7783

Generalized multinomial logit model             Number of obs   =        5056
                                                Wald chi2(6)    =       57.86
Log likelihood = -987.7783                      Prob > chi2     =      0.0000

                              (Std. Err. adjusted for clustering on rid)
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      knowgp |   1.016481   .1871831     5.43   0.000     .6496085    1.383353
      malegp |  -3.082839   .4511159    -6.83   0.000     -3.96701   -2.198668
        cost |  -.1506127   .0695989    -2.16   0.030    -.2870241   -.0142012
  dummy_test |  -.5499832   .2755279    -2.00   0.046    -1.090008   -.0099584
     testdue |   6.372514   .9780339     6.52   0.000     4.455603    8.289425
       drrec |   2.488203   .5429958     4.58   0.000     1.423951    3.552455
-------------+----------------------------------------------------------------
        /l11 |   2.749374   .3800353     7.23   0.000     2.004518    3.494229
        /l21 |   -.155092   .2641671    -0.59   0.557      -.67285    .3626659
        /l31 |   1.116604   .3198175     3.49   0.000     .4897736    1.743435
        /l22 |   3.423451   .5413485     6.32   0.000     2.362427    4.484474
        /l32 |    .719799   .3048254     2.36   0.018     .1223522    1.317246
        /l33 |   1.448777   .2470165     5.87   0.000     .9646339    1.932921
        /tau |   1.250552   .1531374     8.17   0.000     .9504077    1.550695
------------------------------------------------------------------------------



The six parameters from l11 to l33 are the elements of the lower-triangular matrix L, the Cholesky factorization of Σ (Σ = LL′). Given the estimate of L, we may recover Σ and the standard deviations of the random coefficients by using gmnlcov:

. gmnlcov

       v11: [l11]_b[_cons]*[l11]_b[_cons]
       v21: [l21]_b[_cons]*[l11]_b[_cons]
       v31: [l31]_b[_cons]*[l11]_b[_cons]
       v22: [l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons]
       v32: [l31]_b[_cons]*[l21]_b[_cons] + [l32]_b[_cons]*[l22]_b[_cons]
       v33: [l31]_b[_cons]*[l31]_b[_cons] + [l32]_b[_cons]*[l32]_b[_cons] +
            [l33]_b[_cons]*[l33]_b[_cons]

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         v11 |   7.559055   2.089718     3.62   0.000     3.463284    11.65483
         v21 |  -.4264059   .7303392    -0.58   0.559    -1.857844    1.005033
         v31 |   3.069963   1.026024     2.99   0.003     1.058993    5.080933
         v22 |   11.74407   3.674788     3.20   0.001     4.541615    18.94652
         v32 |    2.29102   1.166262     1.96   0.049     .0051882    4.576851
         v33 |   3.863872   1.571455     2.46   0.014     .7838763    6.943867
------------------------------------------------------------------------------

. gmnlcov, sd

  dummy_test: sqrt([l11]_b[_cons]*[l11]_b[_cons])
     testdue: sqrt([l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons])
       drrec: sqrt([l31]_b[_cons]*[l31]_b[_cons] +
>             [l32]_b[_cons]*[l32]_b[_cons] + [l33]_b[_cons]*[l33]_b[_cons])

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  dummy_test |   2.749374   .3800353     7.23   0.000     2.004518    3.494229
     testdue |   3.426962   .5361583     6.39   0.000     2.376111    4.477813
       drrec |   1.965673   .3997244     4.92   0.000     1.182228    2.749119
------------------------------------------------------------------------------
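The relation Σ = LL′ can also be checked by hand in Mata; the following sketch retypes the estimated elements of L from the output above, and the product agrees with v11 through v33 up to rounding:

. mata:
: L = (2.749374, 0, 0 \ -.155092, 3.423451, 0 \ 1.116604, .719799, 1.448777)
: L*L'
: end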

There are other useful postestimation commands besides gmnlcov. For example, to generate predicted probabilities, we may use the gmnlpred command:

. gmnlpred p_hat, nrep(500)

. list rid gid y p_hat in 1/4

     rid   gid   y        p_hat
  1.   1     1   0    .51986608
  2.   1     1   1    .48013392
  3.   1     2   0    .63789177
  4.   1     2   1    .36210823



Moreover, if we are also interested in estimating individual log likelihoods, we may use the ll option of gmnlpred:

. gmnlpred loglik, nrep(500) ll

. list rid loglik in 1

rid loglik

1. 1 -22.560377

Finally, we may use the gmnlbeta command to calculate individual-level parameters (Revelt and Train 2000):

. gmnlbeta dummy_test testdue drrec, nrep(500) saving(beta) replace
file beta.dta saved

. use beta.dta, clear

. list rid dummy_test testdue drrec in 1/4

     rid   dummy_test      testdue        drrec
  1.   1   -1.5343953    1.1355702    .36138015
  2.   2   -3.1375251    .93454325    .06251802
  3.   3   -3.7519709    6.2841202    1.3722745
  4.   4    .83317825    1.5324372    .88101046

A file is now created and saved as the beta.dta dataset, which contains all the estimated individual β's.

7 Conclusion

In this article, we described the gmnl Stata command, which can be used to fit the G-MNL model and its variants. As pointed out in Fiebig et al. (2010), G-MNL is very flexible and nests a rich family of model specifications. In the previous sections, we demonstrated several important models, which are summarized (along with some other useful specifications) below in table 2. This list does not exhaust all the possible models that the gmnl routine can estimate. One example is the type of model considered in Fiebig et al. (2011) and Knox et al. (2013), which includes interaction terms between sociodemographic variables and ASCs.

Finally, a word of warning: While we have found that the gmnl command can be used successfully to implement a range of model specifications, analysts need to bear in mind that estimation times can be substantial when fitting complex models with large datasets. As discussed in section 5, it may also be necessary to experiment with alternative starting values, number of draws, and estimation algorithms to achieve convergence.



Table 2. Special cases of G-MNL and their Stata commands

Model                                Command
MIXL                                 gmnl y, group(csid) id(id) rand(x) mixl
S-MNL                                gmnl y x, group(csid) id(id)
S-MNL + fixed ASC                    gmnl y asc x, group(csid) id(id) scale(scale)
S-MNL + random ASC                   gmnl y x, group(csid) id(id) rand(asc) scale(scale) gamma(0)
G-MNL (uncorrelated)                 gmnl y, group(csid) id(id) rand(x)
G-MNL (correlated)                   gmnl y, group(csid) id(id) rand(x) corr
G-MNL (uncorrelated) + fixed ASC     gmnl y asc, group(csid) id(id) rand(x) scale(scale)
G-MNL (correlated) + fixed ASC       gmnl y asc, group(csid) id(id) rand(x) scale(scale) corr
G-MNL (uncorrelated) + random ASC    gmnl y, group(csid) id(id) rand(asc x) scale(scale)
G-MNL (correlated) + random ASC      gmnl y, group(csid) id(id) rand(asc x) scale(scale) corr

8 Acknowledgments

We are grateful to a referee and to Kristin MacDonald of StataCorp for helpful comments. The research of Yuanyuan Gu and Stephanie Knox was partially supported by a Faculty of Business Research Grant at the University of Technology in Sydney.

9 References

Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata Journal 6: 214–228.

Fiebig, D. G., and J. Hall. 2004. Discrete choice experiments in the analysis of health policy. Productivity Commission Conference: Quantitative Tools for Microeconomic Policy Analysis 6: 119–136.

Fiebig, D. G., M. P. Keane, J. Louviere, and N. Wasi. 2010. The generalized multinomial logit model: Accounting for scale and coefficient heterogeneity. Marketing Science 29: 393–421.

Fiebig, D. G., S. Knox, R. Viney, M. Haas, and D. J. Street. 2011. Preferences for new and existing contraceptive products. Health Economics 20 (Suppl.): 35–52.

Greene, W. H., and D. A. Hensher. 2010. Does scale heterogeneity across individuals matter? An empirical assessment of alternative logit models. Transportation 37: 413–428.

Halton, J. H. 1964. Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM 7: 701–702.

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401.



Keane, M., and N. Wasi. Forthcoming. Comparing alternative models of heterogeneity in consumer choice behavior. Journal of Applied Econometrics.

Knox, S. A., R. C. Viney, Y. Gu, A. R. Hole, D. G. Fiebig, D. J. Street, M. R. Haas, E. Weisberg, and D. Bateson. 2013. The effect of adverse information and positive promotion on women's preferences for prescribed contraceptive products. Social Science and Medicine 83: 70–80.

Louviere, J., and T. Eagle. 2006. Confound it! That pesky little scale constant messes up our convenient assumptions. In Proceedings of the Sawtooth Software Conference, 211–228. Sequim, WA: Sawtooth Software.

Louviere, J. J., R. J. Meyer, D. S. Bunch, R. T. Carson, B. Dellaert, W. M. Hanemann, D. Hensher, and J. Irwin. 1999. Combining sources of preference data for modeling complex decision processes. Marketing Letters 10: 205–217.

Louviere, J. J., D. Street, R. Carson, A. Ainslie, J. R. Deshazo, T. Cameron, D. Hensher, R. Kohn, and T. Marley. 2002. Dissecting the random component of utility. Marketing Letters 13: 177–193.

Louviere, J. J., D. Street, L. Burgess, N. Wasi, T. Islam, and A. A. J. Marley. 2007. Modeling the choices of individual decision-makers by combining efficient choice experiment designs with extra preference information. Journal of Choice Modelling 1: 128–163.

McFadden, D., and K. Train. 2000. Mixed MNL models for discrete response. Journal of Applied Econometrics 15: 447–470.

Revelt, D., and K. Train. 2000. Customer-specific taste parameters and mixed logit: Households' choice of electricity supplier. Working Paper No. E00-274, Department of Economics, University of California, Berkeley.

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge: Cambridge University Press.

About the authors

Yuanyuan Gu is a research fellow at the Centre for Health Economics Research and Evaluation, University of Technology in Sydney. His recent research focuses on discrete choice modeling with applications in health economics.

Arne Risa Hole is a senior lecturer in economics at the University of Sheffield in the UK. His research interests lie in the area of applied microeconometrics, with a focus on health and labor economics. Since obtaining his PhD, he has been particularly interested in stated preference methods and the econometric analysis of discrete choice data.

Stephanie Knox is a research fellow at the Centre for Health Economics Research and Evaluation, University of Technology in Sydney. Her research interests include the design and application of stated preference methods in the area of health service research.


The Stata Journal (2013) 13, Number 2, pp. 398–400

Speaking Stata: Creating and varying box plots: Correction

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK

[email protected]

A previous article (Cox 2009) discussed the creation of box plots from first principles, particularly when a box plot is desired that graph box or graph hbox cannot provide.

This update reports and corrects an error in my code given in that article. The problems are centered on page 484. The question is how to calculate the positions of the ends of the so-called whiskers.

To make this more concrete, the article’s example starts with

. sysuse lifeexp

. egen upq = pctile(lexp), by(region) p(75)

. egen loq = pctile(lexp), by(region) p(25)

. generate iqr = upq - loq

and that holds good.

Given interquartile range (IQR), the position of the end of the upper whisker is that of the largest value not greater than the upper quartile + 1.5 IQR. Similarly, the position of the end of the lower whisker is that of the smallest value not less than the lower quartile − 1.5 IQR.

The problem lines are on page 484:

. egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region)

. egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region)

This code works correctly if there are no values beyond where the whiskers should end. Otherwise, it yields upper quartile + 1.5 IQR as the position of the upper whisker, but this position will be correct only if there are values equal to that. Commonly, that position will be too high. A similar problem applies to the lower whisker, which commonly will be too low.

More careful code might be

. egen upper2 = max(lexp / (lexp < upq + 1.5 * iqr)), by(region)

. egen lower2 = min(lexp / (lexp > loq - 1.5 * iqr)), by(region)

That division / may look odd if you have not seen it before in similar examples. But it is very like a common kind of conditional notation often seen,

© 2013 StataCorp LP gr0039



max(argument | condition)

or

min(argument | condition)

where we seek the maximum or minimum of some argument, restricting attention tocases in which a specified condition is satisfied, or true.

The connection is given in this way. Divide an argument by a logical expression that evaluates to 1 when the expression is true and 0 otherwise. The result is that the argument remains unchanged on division by 1 but evaluates as missing on division by 0. In any context where Stata ignores missings, that is what is wanted. True cases are included in the computation, and false cases are excluded.
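A minimal illustration with the auto dataset shipped with Stata (the variable name pmax_foreign is ours):

. sysuse auto, clear
. egen pmax_foreign = max(price/foreign)
. summarize price if foreign == 1

Here price/foreign is missing whenever foreign is 0, so the egen function sees only the foreign cars, and the maximum reported by summarize matches pmax_foreign.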

This "divide by zero" trick appears not to be widely known. There was some publicity within a later article (Cox 2011).

Turning back to the box plots, we will see what the difference is in our example.

. tabdisp region, c(upper upper2 lower lower2)

------------------------------------------------------------
      Region |     upper     upper2      lower       lower2
-------------+----------------------------------------------
Eur & C.Asia |        79         79         65           65
        N.A. |        79         79       58.5           64
        S.A. |        75         75         63           67
------------------------------------------------------------

Here upper2 and lower2 are from the more careful code just given, and upper and lower are from the code in the 2009 column. The results can be the same but need not be.

Checking Stata’s own box plot

. graph box lexp, over(region) yli(75 79 64 65 67)

shows consistency with the corrected code.

Thanks to Sheena G. Sullivan, UCLA, who identified the problem on Statalist (http://www.stata.com/statalist/archive/2013-03/msg00906.html).

1 References

Cox, N. J. 2009. Speaking Stata: Creating and varying box plots. Stata Journal 9: 478–496.

———. 2011. Speaking Stata: Compared with ... Stata Journal 11: 305–314.



About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 15 commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor of the Stata Journal.


The Stata Journal (2013) 13, Number 2, pp. 401–405

Stata tip 115: How to properly estimate the multinomial probit model with heteroskedastic errors

Michael Herrmann
Department of Politics and Public Administration
University of Konstanz
Konstanz, Germany
[email protected]

Models for multinomial outcomes are frequently used to analyze individual decision making in consumer research, labor market research, voting, and other areas. The multinomial probit model provides a flexible approach to analyzing decisions in these fields because it does not impose some of the restrictive assumptions inherent in the often used conditional logit approach. In particular, multinomial probit relaxes 1) the assumption of independent error terms, allowing for correlation in individual choices across alternatives, and 2) the assumption of identically distributed errors, allowing unobserved factors to affect the choice of some alternatives more strongly than others (that is, heteroskedasticity).

By default, asmprobit relaxes both the assumptions of independence and homoskedasticity. To avoid overfitting, however, the researcher may sometimes wish to relax these assumptions one at a time.1 A seemingly straightforward solution would be to rely on the options stddev() and correlation(), which allow the user to set the structure for the error variances and their covariances, respectively (see [R] asmprobit).

When doing so, however, the user should be aware that specifying std(het) and corr(ind) does not actually fit a pure heteroskedastic multinomial probit model. With J outcome categories, if errors are independent, J − 1 error variances are identified (see below). Instead, Stata estimates J − 2 error variances and, hence, imposes an additional constraint, which causes the model to be overidentified. As a result, the estimated model is not invariant to the choice of base and scale outcomes; that is, changing the base or scale outcome leads to different values of the likelihood function.

To properly estimate a pure heteroskedastic model, the user needs to define the structure of the error variances manually. This is easy to accomplish using the pattern or fixed option. The following example illustrates the problem and shows how to estimate the model correctly.

1. Another reason to relax them one at a time is that heteroskedasticity and error correlation cannot be distinguished from each other in the default specification. That is, one cannot simply look at the estimated covariance matrix of the errors and see whether the errors are heteroskedastic, correlated, or both. What Stata estimates is the normalized covariance matrix of error differences, whose elements do not allow one to draw any conclusions on the covariance structure of the errors themselves.

© 2013 StataCorp LP st0302



Consider an individual's choice of travel mode with the alternatives being air, train, bus, and car and predictor variables, including general cost of travel, terminal time, household income, and traveling group size. One might suspect the choice of some alternatives to be driven more by unobserved factors than the choice of others. For example, there might be more unobserved reasons related to an individual's decision to travel by plane than by train, bus, or car. Allowing the error variances associated with the alternatives to differ, we fit the following model:

. use http://www.stata-press.com/data/r12/travel

. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog

Alternative-specific multinomial probit         Number of obs   =         840
Case variable: id                               Number of cases =         210

Alternative variable: mode                      Alts per case: min =        4
                                                               avg =      4.0
                                                               max =        4

Integration sequence:      Hammersley
Integration points:               200           Wald chi2(8)    =       71.57
Log simulated-likelihood = -181.81521           Prob > chi2     =      0.0000

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mode         |
  travelcost |   -.012028   .0030838    -3.90   0.000    -.0180723   -.0059838
    termtime |   -.050713   .0071117    -7.13   0.000    -.0646517   -.0367743
-------------+----------------------------------------------------------------
air          |  (base alternative)
-------------+----------------------------------------------------------------
train        |
      income |    -.03859   .0093287    -4.14   0.000    -.0568739   -.0203062
   partysize |   .7590228    .190438     3.99   0.000     .3857711    1.132274
       _cons |  -.9960951   .4750053    -2.10   0.036    -1.927088   -.0651019
-------------+----------------------------------------------------------------
bus          |
      income |  -.0119789   .0081057    -1.48   0.139    -.0278658     .003908
   partysize |   .5876645   .1751734     3.35   0.001     .2443309     .930998
       _cons |  -1.629348   .4803384    -3.39   0.001    -2.570794   -.6879016
-------------+----------------------------------------------------------------
car          |
      income |   -.004147   .0078971    -0.53   0.599     -.019625      .011331
   partysize |   .5737318    .163719     3.50   0.000     .2528485     .8946151
       _cons |  -3.903084    .750675    -5.20   0.000     -5.37438    -2.431788
-------------+----------------------------------------------------------------
  /lnsigmaP1 |  -1.097572   .7967201    -1.38   0.168    -2.659115     .4639704
  /lnsigmaP2 |  -.3906271   .3468426    -1.13   0.260    -1.070426     .2891719
-------------+----------------------------------------------------------------
      sigma1 |          1  (base alternative)
      sigma2 |          1  (scale alternative)
      sigma3 |   .3336802   .2658497                      .0700102     1.590376
      sigma4 |   .6766324   .2346849                      .3428624     1.335321
------------------------------------------------------------------------------
(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)



As can be seen, two of the four error variances are set to one. These are the base and scale alternatives. While choosing a base and scale alternative is necessary to identify the model, the problem here is that because errors are uncorrelated, fixing the variance of the base alternative is not necessary to identify the model. As a result, an additional constraint is imposed, which leads to a different model structure depending on the choice of base and scale alternatives. For example, changing the base alternative to car produces a different log likelihood:

. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog base(4)

. display e(ll)
-181.58795

To properly estimate an unconstrained heteroskedastic model, one needs to define a vector of variance terms in which one element (the scale alternative) is fixed and pass this vector on to the estimation command. For example, to set the error variance of the second alternative to unity, define a vector of missing values, stdpat, whose second element is 1, and then call this vector from inside asmprobit using the option std(fixed) (see [R] asmprobit for details):



. matrix define stdpat = (.,1,.,.)

. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(fixed stdpat) corr(ind) nolog base(1)

Alternative-specific multinomial probit         Number of obs   =         840
Case variable: id                               Number of cases =         210

Alternative variable: mode                      Alts per case: min =        4
                                                               avg =      4.0
                                                               max =        4

Integration sequence:      Hammersley
Integration points:               200           Wald chi2(8)    =       26.84
Log simulated-likelihood = -180.01839           Prob > chi2     =      0.0008

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mode         |
  travelcost |  -.0196389   .0067143    -2.92   0.003    -.0327988    -.006479
    termtime |  -.0664153   .0140353    -4.73   0.000     -.093924   -.0389065
-------------+----------------------------------------------------------------
air          |  (base alternative)
-------------+----------------------------------------------------------------
train        |
      income |  -.0498732   .0154884    -3.22   0.001      -.08023   -.0195165
   partysize |   1.126922   .3651321     3.09   0.002     .4112761    1.842568
       _cons |  -1.072849    .680711    -1.58   0.115    -2.407018     .2613198
-------------+----------------------------------------------------------------
bus          |
      income |  -.0210642   .0139892    -1.51   0.132    -.0484826     .0063542
   partysize |   .8678651   .3179559     2.73   0.006      .244683     1.491047
       _cons |  -1.831363   .7345686    -2.49   0.013    -3.271091    -.3916349
-------------+----------------------------------------------------------------
car          |
      income |   -.010205   .0131711    -0.77   0.438    -.0360199       .01561
   partysize |   .8708577   .3202671     2.72   0.007     .2431458      1.49857
       _cons |  -4.971594   1.261002    -3.94   0.000    -7.443112    -2.500075
-------------+----------------------------------------------------------------
  /lnsigmaP1 |    .558377   .3076004     1.82   0.069    -.0445087     1.161263
  /lnsigmaP2 |    -1.0078   1.116358    -0.90   0.367    -3.195822     1.180223
  /lnsigmaP3 |  -.0158072   .3593511    -0.04   0.965    -.7201225     .6885081
-------------+----------------------------------------------------------------
      sigma1 |   1.747833   .5376342                      .9564673     3.193964
      sigma2 |          1  (scale alternative)
      sigma3 |   .3650213   .4074946                      .0409329     3.255099
      sigma4 |   .9843171   .3537155                      .4866926     1.990743
------------------------------------------------------------------------------
(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)

Now the model is properly normalized, and the user may verify that changing either the scale alternative (that is, changing the location of the 1 in stdpat) or the base alternative leaves results unchanged. Note that while, in theory, the only restriction necessary to identify the heteroskedastic probit model is to fix one of the variance terms, in the Stata implementation of the model, the base and scale outcomes must be different. That is, Stata does not allow the same alternative to be the base outcome and the scale outcome. However, this is more of an inconvenience than a restriction: such a model would be equivalent to one in which the base and scale outcomes differed.
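For instance, a quick check of this invariance (a sketch under the assumption that the data above are still in memory; stdpat2 is an illustrative name) is to move the fixed variance to the third alternative and switch the base, then compare the log likelihood with the one reported above:

. matrix define stdpat2 = (.,.,1,.)
. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(fixed stdpat2) corr(ind) nolog base(2)
. display e(ll)

Up to optimization and simulation tolerance, this should reproduce the log simulated-likelihood of -180.01839 from the properly normalized model.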


Finally, to show that independence of errors indeed implies J − 1 estimable error variances, we must verify that the error variances can be calculated directly from the variance and covariance parameters of the normalized error differences. Only the latter are identified and, hence, estimable (Train 2009). Suppose, without loss of generality, J = 3, and let j = 1 be the base outcome.

Following the normalization approach advocated by Train (2009, 100f.), the normalized covariance matrix of error differences is given by

\[
\Omega_1^* =
\begin{pmatrix}
1 & \theta_{23}^* \\
\theta_{23}^* & \theta_{33}^*
\end{pmatrix}
\]

with elements $\theta^*$ relating to the actual error variances $\sigma_{jj}$ and covariances $\sigma_{ij}$ as follows:

\[
\theta_{23}^* = \frac{\sigma_{23} + \sigma_{11} - \sigma_{12} - \sigma_{13}}{\sigma_{22} + \sigma_{11} - 2\sigma_{12}}, \qquad
\theta_{33}^* = \frac{\sigma_{33} + \sigma_{11} - 2\sigma_{13}}{\sigma_{22} + \sigma_{11} - 2\sigma_{12}}
\]

Under independence, $\sigma_{ij} = 0$ for $i \neq j$. Fixing $\sigma_{22} = 1$ (that is, choosing j = 2 as the scale outcome) yields $\theta_{23}^* = \sigma_{11}/(1 + \sigma_{11})$ and $\theta_{33}^* = (\sigma_{33} + \sigma_{11})/(1 + \sigma_{11})$. Obviously, $\sigma_{11}$ can be calculated from $\theta_{23}^*$, and subsequent substitution produces $\sigma_{33}$ from $\theta_{33}^*$. The same is true if we choose to fix either $\sigma_{11}$ or $\sigma_{33}$ because in each case, we would obtain two equations in two unknowns. Similar conclusions follow when there are four or more outcome categories. Thus, with independent errors, J − 1 variance parameters are estimable.
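Written out, the inversion is a direct rearrangement of the two expressions above:

\[
\sigma_{11} = \frac{\theta_{23}^*}{1 - \theta_{23}^*}, \qquad
\sigma_{33} = \theta_{33}^*\,(1 + \sigma_{11}) - \sigma_{11} = \frac{\theta_{33}^* - \theta_{23}^*}{1 - \theta_{23}^*}
\]

so both error variances are recovered exactly from the two identified parameters of $\Omega_1^*$.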

Reference

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge: Cambridge University Press.


The Stata Journal (2013) 13, Number 2, p. 406

Software Updates

st0210_1: Making spatial analysis operational: Commands for generating spatial-effect variables in monadic and dyadic data. E. Neumayer and T. Plumper. Stata Journal 10: 585–605.

Changes affecting all ado-files: Users are no longer required to have mmerge.ado installed.

Changes affecting spdir.ado: A bug was fixed that affected row-standardized spatial-effect variables and spatial-effect variables with additive link functions. Also, users can now choose from a larger variety of link functions.

Changes affecting spundir.ado: A bug was fixed that affected row-standardized spatial-effect variables and spatial-effect variables with additive link functions.

Changes affecting spagg.ado and spspc.ado: A bug was fixed that affected row-standardized spatial-effect variables. Also, users can now choose from a larger variety of link functions; this is achieved by introducing a compulsory link-function choice, which replaces the previously existing default choice of the link function and the reverse W option.

st0220_1: eq5d: A command to calculate index values for the EQ-5D quality-of-life instrument. J. M. Ramos-Goni and O. Rivero-Arias. Stata Journal 11: 120–125.

Five additional country-specific value sets recently published for France, Italy, South Korea, Thailand, and Canada have been included using the N3 model methodology. These additional value sets are also acknowledged by the EuroQol group on its website as available or "in development" value sets.

A fix has been made to correct a software bug that occurred when if was used while estimating the predicted EQ-5D with the saving() option.

© 2013 StataCorp LP up0040