Census 2000 Topic Report No. 7 TR-7 Issued March 2004 Data Processing in Census 2000 U.S. Department of Commerce Economics and Statistics Administration U.S. CENSUS BUREAU Census 2000 Testing, Experimentation, and Evaluation Program


Census 2000 Topic Report No. 7

TR-7

Issued March 2004

Data Processing in Census 2000

U.S. Department of Commerce

Economics and Statistics Administration

U.S. CENSUS BUREAU

Census 2000 Testing, Experimentation, and Evaluation Program


Acknowledgments

The Census 2000 Evaluations Executive Steering Committee provided oversight for the Census 2000 Testing, Experimentation, and Evaluations (TXE) Program. Members included Cynthia Z. F. Clark, Associate Director for Methodology and Standards; Preston J. Waite, Associate Director for Decennial Census; Carol M. Van Horn, Chief of Staff; Teresa Angueira, Chief of the Decennial Management Division; Robert E. Fay III, Senior Mathematical Statistician; Howard R. Hogan, (former) Chief of the Decennial Statistical Studies Division; Ruth Ann Killion, Chief of the Planning, Research and Evaluation Division; Susan M. Miskura, (former) Chief of the Decennial Management Division; Rajendra P. Singh, Chief of the Decennial Statistical Studies Division; Elizabeth Ann Martin, Senior Survey Methodologist; Alan R. Tupek, Chief of the Demographic Statistical Methods Division; Deborah E. Bolton, Assistant Division Chief for Program Coordination of the Planning, Research and Evaluation Division; Jon R. Clark, Assistant Division Chief for Census Design of the Decennial Statistical Studies Division; David L. Hubble, (former) Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; Fay F. Nash, (former) Assistant Division Chief for Statistical Design/Special Census Programs of the Decennial Management Division; James B. Treat, Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; and Violeta Vazquez of the Decennial Management Division.

As an integral part of the Census 2000 TXE Program, the Evaluations Executive Steering Committee chartered a team to develop and administer the Census 2000 Quality Assurance Process for reports. Past and present members of this team include: Deborah E. Bolton, Assistant Division Chief for Program Coordination of the Planning, Research and Evaluation Division; Jon R. Clark, Assistant Division Chief for Census Design of the Decennial Statistical Studies Division; David L. Hubble, (former) Assistant Division Chief for Evaluations and James B. Treat, Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; Florence H. Abramson, Linda S. Brudvig, Jason D. Machowski, and Randall J. Neugebauer of the Planning, Research and Evaluation Division; Violeta Vazquez of the Decennial Management Division; and Frank A. Vitrano (formerly) of the Planning, Research and Evaluation Division.

The Census 2000 TXE Program was coordinated by the Planning, Research and Evaluation Division: Ruth Ann Killion, Division Chief; Deborah E. Bolton, Assistant Division Chief; and Randall J. Neugebauer and George Francis Train III, Staff Group Leaders. Keith A. Bennett, Linda S. Brudvig, Kathleen Hays Guevara, Christine Louise Hough, Jason D. Machowski, Monica Parrott Jones, Joyce A. Price, Tammie M. Shanks, Kevin A. Shaw, George A. Sledge, Mary Ann Sykes, and Cassandra H. Thomas provided coordination support. Florence H. Abramson provided editorial review.

This report was prepared by Nicholas S. Alberti of the Decennial Statistical Studies Division. The following authors and project managers prepared Census 2000 experiments and evaluations that contributed to this report:

Decennial Statistical Studies Division:
Nicholas S. Alberti
Stephanie K. Baumgardner
Nathan A. Carter
Kimball T. Jonas
Miriam D. Rosenthal
Michael C. Tenebaum
Mark A. Viator

Planning, Research and Evaluation Division:
James B. Treat

Greg Carroll and Everett L. Dove of the Administrative and Customer Services Division, and Walter C. Odom, Chief, provided publications and printing management, graphic design and composition, and editorial review for print and electronic media. General direction and production management were provided by James R. Clark, Assistant Division Chief, and Susan L. Rappa, Chief, Publications Services Branch.


U.S. Department of Commerce
Donald L. Evans, Secretary

Vacant, Deputy Secretary

Economics and Statistics Administration
Kathleen B. Cooper, Under Secretary for Economic Affairs

U.S. CENSUS BUREAU
Charles Louis Kincannon, Director

Census 2000 Topic Report No. 7
Census 2000 Testing, Experimentation, and Evaluation Program

DATA PROCESSING IN CENSUS 2000

TR-7

Issued March 2004


Suggested Citation

Nicholas S. Alberti, Census 2000 Testing, Experimentation, and Evaluation Program Topic Report No. 7, TR-7, Data Processing in Census 2000, U.S. Census Bureau, Washington, DC 20233

Economics and Statistics Administration

Kathleen B. Cooper, Under Secretary for Economic Affairs

U.S. CENSUS BUREAU

Charles Louis Kincannon, Director

Hermann Habermann, Deputy Director and Chief Operating Officer

Cynthia Z. F. Clark, Associate Director for Methodology and Standards

Preston J. Waite, Associate Director for Decennial Census

Teresa Angueira, Chief, Decennial Management Division

Ruth Ann Killion, Chief, Planning, Research and Evaluation Division

For sale by the Superintendent of Documents, U.S. Government Printing Office

Internet: bookstore.gpo.gov Phone: toll-free 866-512-1800; DC area 202-512-1800

Fax: 202-512-2250 Mail: Stop SSOP, Washington, DC 20402-0001


Contents

Foreword

1. Background
1.1 Introduction
1.2 Data processing background

2. Methods
2.1 Decennial Management Division (DMD) operational assessments
2.2 Census 2000 evaluations

3. Limitations
3.1 General limitations
3.2 Evaluation L.2 - Operational Analysis of the Decennial Response File Linking and Setting of Housing Unit Status and Expected Household Size
3.3 Evaluation L.4 - Census Unedited file creation

4. Results
4.1 Decennial Response File - Phase 1 (DRF1)
4.2 Decennial Response File - Phase 2 (DRF2)
4.3 Hundred percent census unedited file for housing units
4.4 Non-ID addresses processing
4.5 Group quarters processing

5. Conclusions
5.1 Processing systems design architecture
5.2 Development
5.3 Quality Assurance processes
5.4 Non-ID processing
5.5 Count imputation of housing unit status
5.6 Primary selection

6. Recommendations

References

LIST OF TABLES

Table 1. Number of response records comprising a return

Table 2. Housing unit status by type of return

Table 3. Formation of PSA households at addresses with two returns

Table 4. Combinations of return types for PSA households with two returns

Table 5. Source of housing unit status for DMAF addresses


Foreword

The Census 2000 Testing, Experimentation, and Evaluation Program provides measures of effectiveness for the Census 2000 design, operations, systems, and processes and provides information on the value of new or different methodologies. By providing measures of how well Census 2000 was conducted, this program fully supports the Census Bureau’s strategy to integrate the 2010 planning process with ongoing Master Address File/TIGER enhancements and the American Community Survey. The purpose of the report that follows is to integrate findings and provide context and background for interpretation of related Census 2000 evaluations, experiments, and other assessments to make recommendations for planning the 2010 Census. Census 2000 Testing, Experimentation, and Evaluation reports are available on the Census Bureau’s Internet site at: www.census.gov/pred/www/.


1. Background

1.1 Introduction

The Data Processing Topic Report provides a synthesis of the results, recommendations and lessons learned from the Census 2000 post-data capture data processing. Census 2000 incorporated varied methods of data collection and data capture. Once census data were collected and captured, a collection of interdependent processes was implemented to organize and integrate the data. These processes accomplished the tasks of organizing and integrating individual responses to the census, editing and coding data, integrating geographic data with census response data, identifying and geocoding addresses added through enumeration activities, identifying the best data to represent each household and group quarters, and determining the final Census housing universe and ultimately the Census population based on all available data.

1.2 Data processing background

1.2.1 Non-ID processing

Most of the information in this section comes from Medina (2001).

Every address in the census has a unique identifier, the Master Address File (MAF) identification (ID) number. This ID number is used to link each census response to its address. Most census addresses were assigned an ID number prior to the census enumeration operations and most census responses had a MAF ID preprinted on the questionnaire. However, some operations generated responses without a preassigned MAF ID. The Non-ID operation attempted to assign a MAF ID to those responses.

Response records without a MAF ID were divided into three groups. A description of these groups and the Non-ID processes follows:

• Type A Records - This group included housing unit addresses for responses from the Be Counted program, the Telephone Questionnaire Assistance (TQA) operation and Service-Based Enumeration (SBE) operation, Usual Home Elsewhere (UHE)1 addresses provided on Group Quarters (GQ) questionnaires (GQ/UHE addresses) and UHE addresses provided on enumerator questionnaire responses in response to the In-Mover and Whole Household UHE coverage improvement probes.

• Type B Records - This group included responses from the Be Counted program that indicated the respondent had no usual home on April 1, 2000. These responses could be included in the GQ universe if the Geography Division (GEO) identified the Local Census Office (LCO) geography that contained the address.

• Type C Records - This group included housing unit addresses that were added to the census through the Update/Leave (U/L), Urban Update/Leave (UU/L), Nonresponse Followup (NRFU), Coverage Improvement Followup (CIFU), Transient-night (T-night), or GQ enumeration programs.

1 A Usual Home Elsewhere address is a Census Day address reported by a respondent when it is different from the address at which they are interviewed.

The Decennial Systems and Contracts Management Office (DSCMO) identified the Non-ID records and forwarded them to the GEO for processing. The GEO provided a Census ID number (MAF ID) for each address it could either match or geocode. It updated the MAF with any new housing unit addresses found among the Non-ID responses. The GEO forwarded the results of the Non-ID process to the DSCMO, which added the new addresses to the Decennial Master Address File (DMAF).

Type A record processing

The GEO conducted an automated match of city-style (i.e., house number and street name) and non-city-style addresses to the MAF. It also carried out an automated process to geocode city-style addresses that could not be matched to the MAF in the automated process.

The GEO also carried out an interactive telephone and computer assisted operation at the National Processing Center (NPC) to match and geocode records that could not be matched or geocoded in the automated processes. If the initial attempt to clerically match or geocode an address failed, the address was compared to a commercial database of addresses in order to obtain a telephone number and/or correct any deficiencies in the address. If appropriate, a second attempt was made to clerically match or geocode the address based on updated address information. If still unsuccessful, the clerical staff used the telephone number to contact the respondent and correct any errors in the address information. If corrections were made, another attempt was then made to match or geocode the address.
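The escalation described above for Type A records can be read as an ordered chain of attempts that stops at the first success. The following is a minimal sketch of that control flow only, not the GEO's software: the function names, the toy MAF lookup table and the address fields are all hypothetical, and the real matching and geocoding steps were far more involved.

```python
from typing import Callable, List, Optional, Tuple

# A step takes an address record and returns a MAF ID, or None on failure.
Step = Callable[[dict], Optional[str]]

# Toy stand-in for the MAF: normalized street address -> MAF ID (hypothetical).
TOY_MAF = {"12 ELM ST": "MAF0001"}


def automated_match(address: dict) -> Optional[str]:
    """Hypothetical stand-in for the automated match against the MAF."""
    return TOY_MAF.get(address["street"].upper())


def clerical_match_after_correction(address: dict) -> Optional[str]:
    """Hypothetical stand-in for a clerical retry using corrected address data."""
    corrected = address.get("corrected_street")
    return TOY_MAF.get(corrected.upper()) if corrected else None


STEPS: List[Tuple[str, Step]] = [
    ("automated match", automated_match),
    ("clerical match after correction", clerical_match_after_correction),
]


def resolve_type_a(address: dict, steps: List[Tuple[str, Step]] = STEPS):
    """Try each step in order and stop at the first one that yields a MAF ID."""
    for name, step in steps:
        maf_id = step(address)
        if maf_id is not None:
            return maf_id, name
    return None, None  # unresolved after every attempt
```

Keeping the steps in a list makes the order of escalation explicit and lets further attempts (for example, a retry after telephone contact) be appended without changing the loop.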

New addresses (i.e., those not matched to the MAF) that could be geocoded were added to the Decennial Master Address File (DMAF). Census plans specified that the existence of new Type A addresses added to the DMAF through the Non-ID process would be confirmed by the Field Verification (FV) operation. Enumerators visited the location of the new addresses in the FV operation to determine whether or not the address existed as a Census housing unit on April 1, 2000.

Type B record processing

The Type B addresses were geocoded only to the geographic area of the LCO, since the only geographic information collected was the place and county where the person without a usual residence stayed on Census Day. New Type B address locations geocoded to the LCO geography were added to the DMAF.

Type C record processing

The GEO assigned a MAF ID to all Type C addresses. The GEO first attempted to match the address to an existing address on the MAF. If no match was found and the address could be geocoded, it was added to the DMAF.

1.2.2 Decennial response file processing

The Decennial Response File (DRF) processing merged, organized and edited various data response files produced from the paper data capture processes and the Internet and telephone data collection processes. Each response for a Census address was represented on the DRF by a return level record (housing unit level data). The DRF contained one person level record for each person reported on census questionnaires, each of which was linked to the appropriate return level record. There could be more than one response for an address and thus more than one return level record and associated set of person level records.
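The relationship among addresses, return level records and person level records can be pictured with a small data model. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect the actual DRF record layouts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonRecord:
    """One person reported on a questionnaire (fields are hypothetical)."""
    name: str


@dataclass
class ReturnRecord:
    """One response for an address, carrying its linked person records."""
    return_id: str
    maf_id: str  # the address this response belongs to
    persons: List[PersonRecord] = field(default_factory=list)


def returns_by_address(returns: List[ReturnRecord]) -> Dict[str, List[ReturnRecord]]:
    """Group returns by MAF ID; an address may have more than one return."""
    grouped: Dict[str, List[ReturnRecord]] = {}
    for ret in returns:
        grouped.setdefault(ret.maf_id, []).append(ret)
    return grouped
```

An address with both a mail return and an enumerator return would appear here as one MAF ID mapped to two `ReturnRecord` objects, each with its own set of person records.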

The DRF processing encompassed the following major tasks: merging response data from the data capture output databases with the DMAF to create an initial database of in-scope responses, reformatting and editing the in-scope response data, assigning a housing unit status to each return level record, and selecting the response data that would be accepted as the response for each address in subsequent processing.

All of this was carried out in two phases. Each phase is discussed in a separate section below.

1.2.2.1 Decennial response file - Phase 1 (DRF1)

Much of the information in this section comes from Fowler (2003).

The Decennial Response File - Phase 1 (DRF1) process included the following functions:

• standardizing data formats

• merging data capture response data with the DMAF to create an initial file of in-scope responses

• interfacing with the Non-ID process and DMAF updates

• interfacing with the Automated General Coding operations and the Coverage Edit Followup operation

• initial editing of the response data to blank illegal characters

• identifying data defined person records2 and formatting generational name suffixes

The primary inputs to the DRF1 processing were the Decennial Capture System 2000 (DCS2000) output files of data capture response records transmitted to the DSCMO. The DRF input files also included data from the automated TQA response file and response data received via the Internet. These files contained records for 82 different types of questionnaires in 15 different formats. The data on these files were reformatted into one standard format for all questionnaire types. This process created the normalized data capture files. The normalized data capture files were then merged with the DMAF. The product of this merge was the initial DRF1, a database consisting of a collection of 559 files containing data response records for census housing units. There was one DRF1 file for each LCO. Only data response records for addresses represented on the DMAF at the time of the merge operation were included on the DRF1.
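The merge step can be sketched as a filter plus a partition: keep only responses whose address is on the DMAF, and file each kept response under its LCO. The sketch below assumes, purely for illustration, that normalized records are dictionaries with a `maf_id` key and that the DMAF can be reduced to a mapping from MAF ID to LCO code; neither is taken from the actual file specifications.

```python
from collections import defaultdict
from typing import Dict, List


def build_drf1(normalized_records: List[dict], dmaf: Dict[str, str]) -> Dict[str, List[dict]]:
    """Merge normalized response records with the DMAF.

    dmaf maps MAF ID -> LCO code (hypothetical layout). The result maps each
    LCO code to the in-scope response records for that LCO, mirroring the
    one-file-per-LCO organization described in the text.
    """
    drf1 = defaultdict(list)
    for record in normalized_records:
        lco = dmaf.get(record["maf_id"])
        if lco is None:
            # Address not on the DMAF at merge time: out of scope.
            continue
        drf1[lco].append(record)
    return dict(drf1)
```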


2 Data defined person records are records with at least minimum data for two or more of the 100 percent census data items - name, gender, relationship, age/date of birth, race and Hispanic origin.
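The definition in this footnote amounts to a count over the six 100 percent items. The sketch below is illustrative: the key names are hypothetical, and because the report does not spell out here what counts as "minimum data" for an item, the code simply treats any non-blank value as sufficient, which is an assumption.

```python
# The six 100 percent items named in the footnote (key names are hypothetical).
HUNDRED_PERCENT_ITEMS = (
    "name", "gender", "relationship", "age_or_dob", "race", "hispanic_origin",
)


def has_minimum_data(value) -> bool:
    """Assumption: any non-blank value counts as 'minimum data' for an item."""
    return value is not None and str(value).strip() != ""


def is_data_defined(person: dict) -> bool:
    """True when at least two of the six items carry minimum data."""
    answered = sum(1 for item in HUNDRED_PERCENT_ITEMS
                   if has_minimum_data(person.get(item)))
    return answered >= 2
```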


1.2.2.2 Decennial response file - Phase 2 (DRF2)

Much of the information in this section comes from Rosenthal (2003).

The second stage of the DRF processing consisted of the following steps:

• reformatting data

• linking response records (e.g., linking response records representing enumerator forms with those representing the corresponding continuation forms)

• determining the housing unit status and household size implied by the data on each housing unit return

• running the Primary Selection Algorithm process which selected the most appropriate housing unit return(s) for each address

Linking response records

The linking of response records refers to the process of combining response records to form housing unit returns. Each housing unit return could be made up of one or more response records (i.e., the data capture records for census forms). When response records were linked, the resulting housing unit return included the demographic data for all persons from the linked response records. One response record was identified as the parent record which contributed the housing unit level data to the housing unit return that was formed.

This linking component of the DRF2 was primarily aimed at linking the response records for the Simplified Enumerator Questionnaire (Forms D-1(E), D-2(E)) with the response records for enumerator continuation forms (Forms D-1(E)Supp and D-2(E)Supp). The Simplified Enumerator Questionnaire provided space to list data for five persons. At a household with six or more persons, an enumerator used a continuation form to complete the enumeration of persons in the household.

The control information on this continuation form identified the Census address with which the form was associated but did not indicate the questionnaire to which it belonged. When there was only one response for an address, the linking of the response record for the enumerator questionnaire to the response record of the appropriate continuation form was a simple matter. When there were response records for two or more enumerator questionnaires and one or more continuation forms, the linking was not trivial. We used information about the household size reported on the enumerator questionnaires and the number of persons enumerated on the continuation form(s) to identify the most likely linkage between response records.
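One way to picture the "most likely linkage" idea is to compare, for each enumerator questionnaire at the address, the number of persons still unaccounted for with the number of persons on the continuation form. The scoring rule below is a hypothetical illustration of that idea and is not the criterion actually used in the DRF2 processing, which this section does not specify; the dictionary keys are likewise assumed.

```python
from typing import List, Optional


def best_parent(questionnaires: List[dict], continuation_persons: int) -> Optional[str]:
    """Pick the questionnaire most likely to own a continuation form.

    Each questionnaire dict has 'id', 'reported_size' (household size reported
    on the form) and 'persons_listed' (persons captured on the form itself).
    Hypothetical rule: prefer the questionnaire whose shortfall
    (reported_size - persons_listed) is closest to the number of persons
    on the continuation form.
    """
    candidates = [q for q in questionnaires
                  if q["reported_size"] > q["persons_listed"]]
    if not candidates:
        return None  # no questionnaire appears to need a continuation form

    def gap(q: dict) -> int:
        shortfall = q["reported_size"] - q["persons_listed"]
        return abs(shortfall - continuation_persons)

    return min(candidates, key=gap)["id"]
```

For example, with one questionnaire reporting seven persons but listing five, and another reporting three and listing three, a continuation form with two persons would be attached to the first.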

We also designed the linking process to handle cases where enumerators and mail respondents used questionnaires other than a continuation form to complete the enumeration of a large household and cases where the enumerators used the continuation form to complete the enumeration begun on a mail return. Criteria similar to those used to link response records for enumerator questionnaires to continuation forms were used to link questionnaires in these other types of cases.

When two response records were linked on the DRF2, the questionnaire to which the continuation form was linked was designated as the parent record. The response record for the continuation form was deleted from the file and the control information on the associated person records was updated so that the person records would be associated with the parent record.
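The bookkeeping described above (delete the continuation record, re-point its person records at the parent) can be sketched as follows. The record layout, with an `id` on each response record and a `return_id` on each person record, is a hypothetical simplification of the DRF2 control information.

```python
from typing import List, Tuple


def link_continuation(returns: List[dict], persons: List[dict],
                      parent_id: str, continuation_id: str) -> Tuple[List[dict], List[dict]]:
    """Fold a continuation form into its parent record.

    Returns new lists: the continuation response record is dropped, and each
    person record that pointed at it now points at the parent record.
    """
    kept_returns = [r for r in returns if r["id"] != continuation_id]
    relinked_persons = [
        {**p, "return_id": parent_id} if p["return_id"] == continuation_id else p
        for p in persons
    ]
    return kept_returns, relinked_persons
```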

Assigning housing unit status and household size

Once the DRF2 linking process was completed, the housing unit status and household size of each return were assigned. The information used to assign the housing unit status and household size included the number of person records associated with a return, the number of names listed in the questionnaire roster, the household size reported by the respondent, and the occupancy status and household size summary information completed by Census enumerators. These data were compared within a return for consistency and examined for sufficiency.

If the data were sufficient and reasonably consistent, the housing unit status and household size were resolved according to a prespecified set of rules. Small inconsistencies in the data were permitted as long as there was convincing evidence of the status and household size. In each case the status was supported by more than one of the response data items. For example, the number of persons enumerated on a mail return could differ from the household size reported by the respondent. In such a case the housing unit was clearly occupied. If the household size reported by the respondent was five or fewer, the household size assigned was the maximum of the values indicated by the questionnaire response data.
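The mail return example above can be written out as a small rule. This sketch covers only the case the text states (reported household size of five or fewer); what happened when the reported size exceeded five is not described in this section, so the function returns `None` there rather than guessing. The function and parameter names are hypothetical.

```python
from typing import Optional


def resolve_mail_household_size(persons_enumerated: int,
                                reported_size: int) -> Optional[int]:
    """Household size for an occupied mail return, per the stated example.

    When the respondent-reported size is five or fewer, take the maximum of
    the two values. Other cases are left unresolved (None) because the rule
    for them is not given in this section.
    """
    if persons_enumerated < 1 or reported_size < 1:
        return None  # the example assumes an evidently occupied unit
    if reported_size <= 5:
        return max(persons_enumerated, reported_size)
    return None
```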

Each return was given one of the following six occupancy statuses: Occupied, Vacant, Delete, Occupied/Unresolved Household Size, Occupancy Status Unknown and Status Undetermined.

The Occupied and Vacant statuses were assigned when the return clearly showed that the address was an occupied or vacant housing unit. If the data clearly showed that the address was not a Census housing unit or was nonexistent, the Delete status was assigned.

Each housing unit return assigned a status of Occupied was assigned a household size based on all available information including the total number of person records associated with the return, the number of names on the questionnaire roster, the respondent reported household size, and the household size reported by an enumerator in the Interviewer Summary section of the Simplified Enumerator Questionnaire.

The remaining three statuses (Occupied/Unresolved Household Size, Occupancy Status Unknown and Status Undetermined) were assigned whenever the housing unit return contained insufficient or conflicting data about housing unit status. They indicate the various levels of an unresolved housing status and/or household size.

• Occupied with Unresolved Household Size - This status was assigned whenever the response data indicated the address was occupied but the information on household size was insufficient.

• Occupancy Status Unknown - This status was assigned whenever the information on household status indicated that the address was a Census housing unit but, due to conflicts or deficiencies in the response data, we could not determine whether the housing unit was occupied or vacant.

• Status Undetermined - This status was assigned whenever we could not determine whether or not the address was a housing unit because the response data provided extremely conflicting information or were extremely deficient.

Primary selection algorithm

Most of the information in this section comes from Baumgardner (2002).

The Primary Selection Algorithm (PSA) process was designed to resolve the receipt of multiple responses to Census 2000 for an individual housing unit address. More than one response could be received for an address because there were several ways to respond to Census 2000. These included responding via mail, responding via the Internet or telephone, completing a Be Counted Form (BCF), or being enumerated by a census enumerator as part of the List Enumerate (LE), NRFU, CIFU or GQ enumeration operations.

The PSA operated on housing unit returns after the housing unit status and household size were assigned. When the DRF2 contained multiple housing unit returns for a census address, the PSA selected the housing unit return(s) along with the associated set(s) of person level records that best represented the household at that address.

The PSA formed PSA households by combining housing unit returns which represented the same census household. When multiple returns were present for a single address, the PSA matched the names of household members across the returns (within an address) to determine which responses represented the same household. Names and demographic characteristics were used to identify duplicate records for the same person. The presence of duplicate household members across two returns was evidence that two census responses were completed for the same household. When duplicate household members were identified, the two housing unit returns were combined to form one PSA household.

A PSA household consisted of just one return if there was only one return for the address or if a return had no household members in common with any other returns at the address. It consisted of two or more returns if matching person records were found across the returns. All returns with a status of Vacant were combined to form one PSA household.
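The grouping rule above (returns that share a person belong together, and all vacant returns belong together) is a connected-components problem, which can be sketched with a small union-find structure. This is an illustration under simplifying assumptions and not the PSA itself: person matching is reduced to exact comparison of normalized names, whereas the PSA used names and demographic characteristics, and the input layout is hypothetical. The function is meant to be called with the returns for a single address.

```python
from typing import Dict, List, Set


def form_psa_households(returns: List[dict]) -> List[Set[str]]:
    """Group the returns at ONE address into PSA households.

    Each return dict has 'id', 'status' and 'persons' (a list of name strings).
    Returns a list of sets of return ids, one set per PSA household.
    """
    parent: Dict[str, str] = {r["id"]: r["id"] for r in returns}

    def find(x: str) -> str:
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    def normalize(name: str) -> str:
        # Simplified person match: case- and whitespace-insensitive name.
        return " ".join(name.upper().split())

    first_return_for_person: Dict[str, str] = {}
    first_vacant = None
    for r in returns:
        if r["status"] == "Vacant":
            if first_vacant is None:
                first_vacant = r["id"]
            else:
                union(r["id"], first_vacant)  # all vacant returns combine
        for name in r["persons"]:
            key = normalize(name)
            if key in first_return_for_person:
                union(r["id"], first_return_for_person[key])  # shared person
            else:
                first_return_for_person[key] = r["id"]

    households: Dict[str, Set[str]] = {}
    for r in returns:
        households.setdefault(find(r["id"]), set()).add(r["id"])
    return list(households.values())
```

Because membership is transitive under union-find, a return that shares one person with a second return and a different person with a third places all three in the same group.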

When a PSA household was formed from two or more returns, one of the returns was designated as the Basic Return. All of the data on the Basic Return were associated with the PSA household while only the demographic data for selected household members from the other return(s) were associated with the PSA household in subsequent data processing.

Once the PSA households were formed, selection criteria were applied to all PSA households at an address to identify the most appropriate PSA household to represent the enumeration at that address. The selected PSA household was designated as the Primary PSA Household. At each address, persons from questionnaires with a Respondent Provided Address (RPA)3 could be added to the Primary PSA Household if they were not matched in the earlier steps of the PSA. The data that made up the Primary PSA Household were then used as the input to the Hundred Percent Census Unedited File (HCUF) and other subsequent data processing activities.

3 BCFs, and GQ questionnaires with a GQ/UHE address.

1.2.3 Hundred percent census unedited file for housing units

Much of the information in this section comes from Jonas (2003).

The processing of housing unit data and group quarters data was conducted independently on parallel tracks. The HCUF processing of housing unit data is discussed below. The processing of group quarters data is discussed separately in Section 1.2.4, below. Once both processes were completed, the data were combined to form the final HCUF.

The preliminary Census housing unit universe was determined through the HCUF process applied to potential housing unit addresses on the DMAF. The HCUF housing unit universe was preliminary because it contained records representing duplicate housing units. Potential duplicate HCUF records were identified and retained on the HCUF. The final determination of duplicate records was completed after the HCUF processing. The duplicate records were subsequently removed during the processing of the Hundred Percent Census Edited File (HCEF).

The HCUF process for housing unit data brought together the data from the DMAF and the DRF2. The process determined which addresses on the DMAF represented a census housing unit. The occupancy status of each housing unit was then assigned and the size of the household in each occupied unit was determined.

1.2.3.1 Identifying ‘Kills’

The first step in the HCUF processing for housing unit data was to identify addresses on the DMAF that did not represent a housing unit. The addresses eliminated from the housing unit universe at this stage of processing were referred to as ‘Kills’. The determination of ‘Kills’ was based on the source of addresses, the status of addresses in the U.S. Postal Service Delivery Sequence Files, results of the Local Update of Census Addresses (LUCA), the postal delivery statuses and results of field address listing and housing unit enumeration operations.

1.2.3.2 Integrating the DMAF and DRF2

The next step in the HCUF process was to bring together data for the remaining DMAF addresses and DRF2 data. At this stage of the process one of the following housing unit statuses was assigned to each address on the DMAF: Occupied, Vacant, Delete, Occupied with Unresolved Household Size, Occupancy Status Unknown, and Status Unknown. These statuses correspond to those assigned to individual returns during the DRF2 processing. The latter three categories represent addresses with incomplete information on housing status and/or household size.

The source of the data that determined the housing status and household size of occupied units could be the status and household size assigned to the DRF2 housing unit return selected by the PSA as the Primary PSA Household, or the results of the Nonresponse Followup, the Coverage Improvement Followup or the Field Verification Followup operations as recorded on the DMAF.

1.2.3.3 Count imputation

In the next step, the count imputation process imputed the housing status and/or household size to addresses as necessary. Imputation processes were conducted independently for the three groups of addresses with incomplete information.

• Occupied with Unresolved Household Size (Household Size Imputation) - A household size of one or greater was imputed to housing units at these addresses.

• Occupancy Status Unknown (Occupancy Imputation) - A housing unit status of Occupied or Vacant was imputed to these addresses. Addresses in the Occupancy Status Unknown category were determined to exist as a housing unit but it could not be determined if the unit was occupied or vacant. A household size was imputed to households given an imputed status of Occupied.

• Status Unknown (Status Imputation) - The imputed status of these units could be Occupied, Vacant or Delete. Status Unknown was assigned whenever we could not determine whether or not the address existed as a Census housing unit because of the lack of sufficient data. Addresses assigned an imputed status of Delete were eliminated from the HCUF.

The imputation method used to impute housing status and household size to addresses from each group was a nearest neighbor hot deck imputation method. Donor households were identified for each group from among the addresses with a resolved status and household size. The nearest neighbor housing unit in the donor pool was selected to fill in the housing status and/or household size for the address with incomplete information. The imputation for each group was carried out independently within each LCO.
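The nearest neighbor hot deck idea described above can be illustrated with a minimal sketch. This is not the production imputation software: the `Address` record, the integer `position` used as a stand-in for geographic nearness, and the `DONOR_STATUSES` table are all assumptions made for illustration. Only the three imputation groups, the use of resolved addresses as donors, and the restriction to a single LCO come from the text.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Address:
    lco: str                       # Local Census Office; imputation ran within each LCO
    position: int                  # hypothetical sort key standing in for geographic nearness
    status: Optional[str]          # "Occupied", "Vacant", "Delete", or None if unresolved
    household_size: Optional[int]  # None if unresolved


# Statuses a donor may pass on in each imputation group (assumed encoding
# of the three groups described in the text).
DONOR_STATUSES = {
    "household_size": {"Occupied"},
    "occupancy": {"Occupied", "Vacant"},
    "status": {"Occupied", "Vacant", "Delete"},
}


def impute(recipient: Address, group: str, donors: List[Address]) -> Address:
    """Copy status and household size from the nearest eligible donor in the same LCO."""
    pool = [
        d for d in donors
        if d.lco == recipient.lco and d.status in DONOR_STATUSES[group]
    ]
    if not pool:
        return recipient  # no donor available; leave the record unresolved
    donor = min(pool, key=lambda d: abs(d.position - recipient.position))
    return replace(recipient, status=donor.status, household_size=donor.household_size)


donors = [
    Address("LCO-A", 10, "Occupied", 3),
    Address("LCO-A", 20, "Vacant", 0),
    Address("LCO-B", 17, "Occupied", 5),
]
print(impute(Address("LCO-A", 18, None, None), "occupancy", donors))
print(impute(Address("LCO-A", 18, "Occupied", None), "household_size", donors))
```

In this sketch the donor in LCO-B is never used for an LCO-A address, even though it is closer by `position`, which mirrors the statement that imputation was carried out independently within each LCO.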

1.2.3.4 Duplicate Delete Operation

The final step in the HCUF process for housing units was the identification of duplicate addresses. The operations to identify duplicate addresses were designed and conducted in the summer and fall of 2000 to correct a potential overcount of housing units. Two primary methods were used to identify potential duplicate addresses: 1) address matching based on characteristics of the address derived from MAF data and 2) person matching based on name and date of birth.

The identification of duplicate addresses was carried out in two phases in order to meet the schedule for the Accuracy and Coverage Evaluation (A.C.E.). In Phase 1 a provisional list of duplicate addresses was identified. In this phase, address and person matching were carried out independently. Matching addresses were paired. Addresses with one or more exact person matches and similar households were paired. After identifying Kills and addresses given a status of Delete, one address was selected from each remaining pair. The addresses not selected were considered provisional deletions.

No addresses were eliminated from the HCUF as a result of Phase 1 of the duplicate identification process. Phase 2 was implemented after the creation of the HCUF and the results of that phase were used to identify which provisional deletions from Phase 1 would be retained on the HCEF. In Phase 2 additional information on address matching and person matching was combined to decide which of the provisional deletions to reinstate. Additional person matching used a modified version of the Census Bureau’s probabilistic matching methods. At the completion of Phase 2, a total of 1,392,686 HCUF addresses were identified as duplicate addresses and not retained on the HCEF.
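The Phase 1 person matching step can be sketched as follows. This is an illustrative sketch only: the function names, the name normalization, and the rule of keeping the first address of each pair are assumptions, and the "similar households" test and the address matching based on MAF characteristics are not modeled. What comes from the text is the pairing of addresses that share at least one exact person match on name and date of birth, the exclusion of Kills and Delete addresses, and the treatment of the unselected address as a provisional deletion.

```python
from itertools import combinations


def person_key(person):
    """Normalize a (name, date_of_birth) tuple so exact comparison ignores case and spacing."""
    name, dob = person
    return (" ".join(name.upper().split()), dob)


def person_match_pairs(households):
    """Pair addresses that share at least one exact person match.

    households: dict mapping an address id to a list of (name, date_of_birth).
    """
    keys = {addr: {person_key(p) for p in people} for addr, people in households.items()}
    pairs = set()
    for a, b in combinations(sorted(keys), 2):
        if keys[a] & keys[b]:
            pairs.add((a, b))
    return pairs


def provisional_deletions(pairs, excluded):
    """Select one address per pair; the other becomes a provisional deletion.

    excluded: address ids already identified as Kills or given a status of Delete.
    Keeping the first id of each pair is a placeholder rule, not the Census rule.
    """
    deletions = set()
    for a, b in sorted(pairs):
        if a in excluded or b in excluded:
            continue  # pair no longer needs resolving
        deletions.add(b)
    return deletions


households = {
    "A1": [("Ann Lee", "1970-01-02"), ("Bo Lee", "1999-05-06")],
    "A2": [("ann  lee", "1970-01-02")],
    "A3": [("Cy Day", "1980-03-04")],
}
pairs = person_match_pairs(households)
print(pairs, provisional_deletions(pairs, excluded=set()))
```

The all-pairs comparison is quadratic and is used here only for clarity; a production process at census scale would need blocking or indexing.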

1.2.4 Group quarters

Most of the information in this section comes from Jonas (2002).

The report covers two aspects of the processing of data for GQ. The first aspect addresses how the population count for GQs was determined and the second addresses the processing of GQ questionnaires with a reported UHE address.

The GQ response data were processed separately from the housing unit data until the final step of the HCUF processing when data from both universes were collected on the same file for the first time.

The Census Bureau relied heavily on the number of GQ questionnaires completed and captured by the DCS2000 to determine the population of each GQ. Individual GQ questionnaires were not tracked during the enumeration processes. Clerical counts of the number of questionnaires at several points of field processing and a count of records by the DCS2000 were recorded. The count of the number of questionnaires was recorded at five points in the post-enumeration processing:

• By the enumerator immediately following the enumeration of a GQ

• By the LCO staff when the questionnaires were received

• By the LCO staff when the questionnaires were shipped to the NPC

• By the NPC staff when the questionnaires were received

• By the DCS2000 during the data capture of questionnaires

The counts listed above formed the basis for determining the final population count for each GQ. Other processes that contributed data to determining the population count for GQ were:

• The results of telephone followup interviews with GQ establishments that initially refused to be enumerated. No questionnaires were returned for these GQs. The Census Day population of each GQ was ascertained by the followup interview

• Identification of BCF questionnaires with a GQ address

• Identification of housing unit questionnaires with a reported UHE address for a GQ

• Unduplication of persons at SBE Facilities

• Identification of GQ questionnaires with a reported UHE address for a housing unit

Although residents of all types of GQs were allowed to report UHE (i.e., a Census Day residence other than the GQ at which they were enumerated), only questionnaire data for eligible UHE responses were to be sent to the Non-ID processes. Only persons with eligible UHE responses could be removed from the GQ universe and included in the housing unit universe. Eligibility was determined by the type of GQ from which the questionnaire was received and the response to a screening question which identified a person’s primary residence.


The information for this topic report was obtained primarily from the Decennial Management Division (DMD) Operational Assessments and the Census 2000 Evaluation studies that address topics associated with Census 2000 data processing. The major resources for information come from the documents discussed below. For a complete list of resources, see the list of references at the end of the document.

2.1 Decennial Management Division operational assessments

The purpose of the operational assessments was to document the successes and lessons learned from the planning and implementation of Census 2000 data processing operations. They provided recommendations for consideration by the research and development and working groups that will plan similar operations in the future.

The primary source information came from a combination of inputs by all divisions responsible for decennial operations planning and implementation. The recommendations reflect the opinions of the contributors to the assessments and do not necessarily represent the official position of the Census Bureau. The development of the recommendations focused on the individual operation, without an attempt to assess the implications across the entire census process.

The DMD operational assessments referenced in this topic report are:

• Assessment Report for Decennial Response Files DRF1, DRF2, PSA

• Assessment Report for Non-ID Questionnaire Processing (including BCF/TQA Field Verification)

2.2 Census 2000 evaluations

Results from the following Census 2000 Evaluation studies contributed to this topic report:

• Group Quarters Enumeration, E.5

• Evaluation of Nonresponse Followup - Whole Household Usual Home\Elsewhere Probe, I.2

• Operational Analysis of the Decennial Response File Linking and Setting of Housing Unit Status and Expected Household Size, L.2

• Analysis of the Primary Selection Algorithm, L.3.a

• Resolution of Multiple Census Returns Using a Re-interview, L.3.b

• Census Unedited File Creation, L.4

2.2.1 Evaluations E.5, L.2, L.3.a, L.4, I.2

These evaluations provide descriptive statistics summarizing the outcome of the Census processes. The results were derived from data files available as products of the Census 2000 processes. These files include:

• Decennial Response File

• Hundred Percent Unedited Census File

• Special Place/Group Quarter Control File

• Normalized file of matching and geocoding results for enumerator questionnaires from the Non-ID processing

2.2.2 Evaluation L.3.b

The evaluation relies on data from a national sample of addresses affected by the PSA. A post-census interview was conducted at each sample address with someone familiar with the household(s) enumerated on Census questionnaire(s) captured for the address. The residency status of each household member was ascertained through the interview. The data collected were used to judge the accuracy of the decision made by PSA with respect to selection of household members.


2. Methods


3.1 General limitations

The lack of documentation for the Census data processing procedures was an impediment not only for this topic report but for many of the evaluations. The lack of documentation in some cases limited the scope of the evaluations and the details of what we can learn about the processes. It put the evaluations and assessments at risk of conveying false conclusions.

3.2 Evaluation L.2 - Operational Analysis of the Decennial Response File Linking and Setting of Housing Unit Status and Expected Household Size

This evaluation discusses a possible coding error with respect to the Simplified Enumerator Questionnaire Interview Summary Item B field on the DRF. The schedule did not allow time to investigate the true extent of this error by visually examining the responses on the questionnaire images. The evaluation could only assess the extent of the effect of the error based on related data.

3.3 Evaluation L.4 - Census Unedited File creation

An exact tally of the “Kill” addresses was not possible. The DMAF did not provide the information necessary to identify which addresses were identified as a “Kill”. As part of quality assurance on the procedures to identify the “Kills”, the Decennial Statistical Studies Division (DSSD) implemented software to independently verify the identification. At the time of production the results of the DSSD identification matched exactly the DSCMO production results. The DSSD software was subsequently applied to the current DMAF in order to identify the “Kill” addresses for the evaluation L.4. However, there were approximately 14,000 addresses not identified as a “Kill” that are strongly suspected to be “Kills”. This conclusion is based on the data for the results of the field followup operations and the fact that they were not included in the Census. These addresses were assumed to be “Kills” for the purpose of the evaluation.

3. Limitations


4.1 Decennial response file processing - overall operational issues

Much of the information in this section comes from Fowler (2003).

The construction of the Decennial Response File (DRF) is considered a successful element of the Census 2000 data processing operations. The DSCMO successfully collected and processed response data to produce a complete and integrated DRF. The completion of the DRF process was the result of a successful integration of many diverse processing components employed to edit and identify a complete set of response data for each housing unit in Census 2000.

4.1.1 Requirements development

Complete documentation of requirements was not developed for the DRF processing or related Quality Assurance (QA) and testing procedures. Requirements were developed for some individual components of the DRF process but not for others.

• Requirements were developed for the process of reformatting data capture input into a standard format (file normalization), the Coverage Edit Followup processing and coding extraction, the linking of continuation forms, the determination of housing status and household size, and the PSA process.

• Requirements were lacking for the other critical steps, including interfaces with Non-ID questionnaire processing and the initial edits of the response data to blank illegal characters, identify data defined persons, and format generational name suffixes.

Late changes to requirements were a challenge for the DSCMO to review and implement. Two examples of changes to requirements that occurred late in the software development cycle were the decision to include Large Households in the Coverage Edit Followup and changes to DRF input data from the Data Capture Audit Resolution processes.

The technical requirements of the questionnaire designs were extensive and the demographic area struggled to finalize questionnaire formats and content within the allotted time frame. A total of 82 different form types were designed for Census 2000, which resulted in 15 different formats for data capture. The large number of different form types and file formats required greatly increased the schedule and complexity of designing procedures for reformatting the data capture file input into a standard format and for designing the initial edit of the response data.

The lack of timely development of requirements for questionnaire data capture output and the Non-ID process output hindered the review of documentation and the planning of file testing and QA procedures.

The PSA was the most successfully developed component of the DRF processes. Sufficient staff was dedicated to the development of PSA requirements. The requirements were adequately developed and documented.

4.1.2 Quality assurance

Appropriate quality assurance safeguards were undertaken and documented in the development of the PSA software, but were not developed or documented as well for the other components of the DRF processing.

Formal walkthroughs of all PSA specifications were conducted to examine the completeness and identify necessary modifications. The PSA software underwent a very thorough formal testing program. The initial contractor hired to assist in the development of the testing process contributed greatly to the process because he had an abundant knowledge of software testing and a sufficient understanding of the PSA and the DRF process. Midway through the development of software testing the initial contractor was replaced by a contractor who did not have an adequate knowledge of the PSA and the DRF. This became problematic and the remainder of the software testing development was completed by Census Bureau staff. This effort was successful but introduced risks to the success of the testing plan and redirected scarce resources away from other critical operations.

The software testing of other components of the DRF was much more informal. Due to the lack of staff resources and time, there was no formal QA or testing plan for any other component of the DRF. Informal testing was conducted for the processes of linking continuation forms, and determining household status and expected household size. During these steps, problems with related DRF processes were discovered and corrected.

4. Major Findings

4.1.3 Documentation

Complete documentation of requirements for all DRF components was not accomplished. In some instances it was not evident which staff was primarily responsible for developing requirements. Some of these requirements were not listed on planning schedules and the deliverables were not tracked. The documentation and review of these requirements were not adequate for ensuring their completeness and appropriateness to the task.

4.1.4 Scheduling

The DRF development compensated successfully for late-developing data requirements.

Due to scheduling constraints, it was necessary to complete the DRF processing before all inputs were available. More than 207,000 housing unit addresses on the DMAF were not included on the DRF. These addresses were added to the DMAF after the DRF1 process merged the data capture file with the DMAF. The housing unit status was subsequently imputed for these addresses in the HCUF process.

4.1.5 Recommendations

These recommendations come from Fowler (2003).

• Design and implement a formal process to develop complete DRF requirements. Planning is needed to ensure sufficient staffing resources are available to adequately plan and develop the requirements.

• Reduce the number of different questionnaire formats to reduce the number of different output formats. This will simplify the data processing.

• Complete the final forms design much earlier in the planning cycle, desirably prior to the Dress Rehearsal. This will allow sufficient time to develop, document and test specifications for DRF1 processing.

• Improve the planning of software testing and quality assurance (QA) procedures. The testing and QA procedures implemented for the PSA are an example of what we should aim to achieve.

• Reduce the risk associated with placing large responsibilities on a few staff members or one contractor by developing inter-divisional teams. Teams, made up of representatives from various divisions, should develop an early and cooperative partnership. They can ensure that adequate requirements, staffing assessments, and documentation are developed. The structure used for development of the PSA requirements and software is a good example of effective program development.

4.2 Decennial response file processing - Phase 2 (DRF2) processes

4.2.1 Linking returns

The information in this section comes from Rosenthal (2003).

The DRF included about 1.5 million enumerator continuation returns. All but about 2.3 percent could be linked to a Simplified Enumerator Questionnaire (SEQ). At the completion of the linking process, 33,472 continuation returns remained unlinked to another response record (a parent record).

Few (126) enumerator continuation forms from List Enumerate (LE) areas were included on the DRF. The DCS2000 captured many more continuation forms from LE areas but they were not included on the DRF due to an undocumented processing error. The impact of this error on the population coverage is mitigated by the fact that the total household size was recorded by enumerators on most SEQs.

The linking process resulted in 1,387,085 DRF returns that were made up of more than one response record.

The linking process looked for SEQs, mail return questionnaires and BCF questionnaires that were used in the place of enumerator continuation forms and attempted to link any of those found to a parent record. Few were found to be used this way.

After the linking process was completed, there were 129,389,529 housing unit returns on the DRF2, excluding returns that were ineligible to receive a status. Returns that were ineligible for status assignment include 197,091 blank returns and 696,691 replaced SEQ returns. Replaced SEQ returns are returns replaced by another SEQ as a result of a field operations quality assurance check of enumerators’ work. Table 1 shows the number of eligible returns after the linking process was completed. For the results shown in this table, addresses are grouped by the number of response records that contributed to a return.
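As a quick consistency check, the counts reported in Table 1 can be reconciled with the totals quoted in the text. The short calculation below uses only figures stated in this section; the variable names are ours.

```python
# Returns on the DRF2 grouped by the number of response records per return (Table 1).
returns_by_records = {
    "1": 128_002_444,
    "2": 1_347_977,
    "3 or more": 39_108,
}

# All eligible returns after linking.
total_returns = sum(returns_by_records.values())

# Returns made up of more than one response record.
multi_record = returns_by_records["2"] + returns_by_records["3 or more"]

# Share of eligible returns that required linking, in percent.
multi_share = 100 * multi_record / total_returns

print(total_returns, multi_record, round(multi_share, 2))
```

The total agrees with the 129,389,529 eligible returns, and the sum of the two multi-record rows agrees with the 1,387,085 linked returns cited above, or roughly 1.07 percent of all eligible returns.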


4.2.2 Assigning housing unit status and household size

Table 2 below shows the housing unit statuses assigned to the 129,389,529 returns on the DRF.

• About 62.7 percent of the returns were self-response returns, which include all mail back questionnaire returns, Internet responses and responses received through the TQA operation.

• About 36.4 percent of the returns were SEQ returns.

• The DRF2 also included 604,954 returns for individuals enumerated at a GQ claiming a usual home elsewhere in the housing unit universe (0.5 percent of the DRF returns), and 604,713 BCF returns (0.5 percent of the DRF returns). The GQ returns were all assigned a status of Occupied/Unresolved Household Size and the BCF returns were all assigned a status of Occupied.

4.2.2.1 Unresolved housing unit status among enumerator response returns

The key data items from enumerator returns used to assign the housing unit status are the following:

• The number of persons enumerated on the questionnaire

• The value of the respondent reported household size

• The response to the SEQ Interview Summary Item A (Status on April 1, 2000)

• The response to the SEQ Interview Summary Item B (POP on April 1, 2000)

• The response to the SEQ Interview Summary Item C (Type of vacant unit)

Occupied/unresolved household size

On most of the 329,895 enumerator returns assigned the Occupied/Unresolved Household Size status, the enumerator clearly indicated that the unit was occupied but that the household size could not be determined. Almost 72 percent of these returns had an Interview Summary Item A value of ‘Occupied’ and an Interview Summary Item B value of ‘POP Unknown’.

Responses on nearly all of the remaining returns clearly indicated that the housing unit was occupied but the return lacked sufficient data on the size of the household. Almost all (98 percent) of these returns had an Interview Summary Item A value of ‘Occupied’ and an Interview Summary Item B value of 1 to 97, but there were no persons enumerated on the questionnaire and the respondent reported household size was 0 or missing. The responses to Interview Summary Item B were numeric values captured by an optical character recognition methodology. This method can introduce error in the response captured, and there were no data that confirmed the household size captured for the Interview Summary Item B.

Table 1. Number of Response Records Comprising a Return

Response Records Per Return          Number of Returns   Percent
1 (No continuation forms)            128,002,444         98.9
2 (1+ continuation forms)            1,347,977           1.04
3 or more (2+ continuation forms)    39,108              0.03
Total Returns                        129,389,529         100.00

Source: Rosenthal (2003) Table 4.

Table 2. Housing Unit Status by Type of Return

Type of Return: Number (Column Percent)

Housing Unit Status                  Total                Self Response        Enumerator Response   GQ/BCF
Occupied                             112,150,512 (86.7)   81,080,662 (100.0)   30,465,137 (64.7)     604,713 (50.0)
Vacant                               14,141,843 (10.9)    18,504 (0.0)         14,123,339 (30.0)     0 (0.0)
Delete                               1,778,824 (1.4)      0 (0.0)              1,778,824 (3.8)       0 (0.0)
Occupied/Unresolved Household Size   934,849 (0.7)        0 (0.0)              329,895 (0.7)         604,954 (50.0)
Occupancy Status Unknown             329,804 (0.3)        538 (0.0)            329,266 (0.7)         0 (0.0)
Status Undetermined                  53,697 (0.0)         0 (0.0)              53,697 (0.0)          0 (0.0)
Total                                129,389,529          81,099,704           47,080,158            1,209,667

Source: Rosenthal (2003)

Occupancy status unknown

The responses to as many as 258,963 (78.5 percent) of the returns given a housing status of Occupancy Status Unknown may have been incorrectly coded in the DRF2 during a data editing process. Responses of ‘0’ to Interview Summary Item B were incorrectly recoded as ‘missing’ by the edits of the DRF2 data. This error caused many returns to be erroneously given a status of Occupancy Status Unknown instead of a status of Vacant.

It is impossible to know for sure how many of the 258,963 returns were affected by the coding error. However, there is convincing evidence that nearly all of them were. More than 94 percent of these returns had a response to Interview Summary Item C, which allowed the enumerator to report the type of vacancy of the address. Interview Summary Item C was only filled for vacant units. This suggests that enumerators took care to fill Interview Summary Item B with ‘0’ as well as filling Interview Summary Item C for almost all of these cases.

The DRF2 returns potentially affected by the coding error were a large portion of the addresses placed in the Occupancy Imputation category during the HCUF processing and given an imputed housing unit status. They accounted for about 74 percent of the 195,245 housing units on the HCUF placed in the Occupancy Imputation category. As such, the coding error contributed to an undercount of vacant housing units and an overcount of occupied housing units.

Status undetermined

Almost 91 percent of the 53,697 returns given a housing status of Status Undetermined had no persons enumerated on the questionnaire and an Interview Summary Item A value of ‘Delete’, but they had a conflicting value in Interview Summary Item B. The Interview Summary Item B response for these cases was 1 to 97, or POP Unknown.
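The response patterns described in this section can be summarized in a small sketch. It recognizes only the patterns the text associates with the Occupied/Unresolved Household Size and Status Undetermined statuses; it is not the DRF2 status assignment logic, whose full rules are not reproduced in this report. The function name, argument names, and value encodings are assumptions made for illustration.

```python
POP_UNKNOWN = "POP Unknown"  # assumed encoding of the Item B 'POP Unknown' response


def describe_pattern(item_a, item_b, persons_enumerated, reported_size):
    """Return the unresolved status suggested by a response pattern, or None.

    item_a: Interview Summary Item A value ("Occupied", "Vacant", "Delete")
    item_b: Interview Summary Item B value (an int, POP_UNKNOWN, or None if missing)
    persons_enumerated: number of persons listed on the questionnaire
    reported_size: respondent reported household size (None if missing)
    """
    b_is_count = isinstance(item_b, int) and 1 <= item_b <= 97
    no_roster = persons_enumerated == 0

    # Enumerator marked the unit occupied but could not determine the population.
    if item_a == "Occupied" and item_b == POP_UNKNOWN:
        return "Occupied/Unresolved Household Size"

    # Item B holds a count, but nothing else on the return confirms it.
    if item_a == "Occupied" and b_is_count and no_roster and reported_size in (0, None):
        return "Occupied/Unresolved Household Size"

    # Item A says Delete, yet Item B implies someone may live there.
    if item_a == "Delete" and no_roster and (b_is_count or item_b == POP_UNKNOWN):
        return "Status Undetermined"

    return None  # pattern not covered by this sketch


print(describe_pattern("Occupied", POP_UNKNOWN, 0, None))
print(describe_pattern("Delete", 2, 0, None))
```

Patterns leading to Occupancy Status Unknown are left out on purpose, because the text describes them only through the suspected coding error rather than as a rule.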

4.2.2.2 Resolution of housing unit status by NRFU vs. CIFU

About 2.3 percent of the enumerator returns received from the CIFU were assigned one of the three unresolved housing unit statuses discussed above, while only 1.34 percent of the NRFU enumerator returns had one of these statuses.

The rate at which enumerator returns were assigned an Occupied/Unresolved Household Size status was twice as large for CIFU compared to NRFU, 1.3 percent vs. 0.6 percent, respectively.

There were more than 4.8 million addresses visited in both the NRFU and CIFU operations. However, data on only 4,233 of these addresses visited in both operations were so insufficient that they resulted in the assignment of an unresolved housing unit status. This shows that a census enumerator completed an enumeration at all but a small number of housing units included in the NRFU.

4.2.2.3 Proxy responses for enumerator returns

The respondents for about 17.4 percent of the occupied enumerator returns were proxy respondents (i.e., the respondent did not belong to the household enumerated on the return). This rate does not include cases for which the type of respondent is unknown. Vacant and Delete returns are, by default, all said to have a proxy respondent because there is no household respondent in these cases.

The returns given an Occupied/Unresolved Household Size housing unit status had a much higher rate of proxy respondents (30.8 percent), as would be expected.

The returns in the two other unresolved housing unit status categories (Occupancy Status Unknown, Status Undetermined) had a very high rate of proxy respondents. About 76.5 percent of the respondents for these returns were proxy respondents. This high rate is misleading because more than 67 percent of these returns may have been erroneously given an Occupancy Status Unknown instead of Vacant, as noted in an earlier discussion. Respondents for vacant housing units are expected to be proxy respondents, thus the high rate of proxy respondents for these cases.

4.2.2.4 Setting of expected household size

There was a high level of consistency on each return among the key data items used to assign the expected household size. Among both mail and enumerator returns there was agreement between key items for over 93 percent of the returns. On mail returns the key items were the respondent provided household size and the number of persons enumerated on the return. The key items on enumerator returns were the Interview Summary Item B (POP on April 1, 2000) and the number of persons enumerated on the return.

4.2.3 Primary selection algorithm

Most of these results come from Baumgardner (2002).

The PSA was applied to 127,610,705 eligible returns at 118,360,443 census addresses.

• About 7.6 percent of the addresses on the DRF2 had two or more returns, with 7.4 percent of all addresses having just two returns.

• There were another 158,530 addresses that had only returns not eligible for the PSA. The returns ineligible for the PSA included blank returns, those assigned the status of Delete and enumerator returns that are unusable because they were replaced by another SEQ as a result of a quality assurance check of enumerators’ work.

Formation and selection of PSA households at addresses with two or more returns

A total of 11,426,952 PSA households were formed at 8,960,245 addresses with two or more eligible returns.

Only one PSA Household was formed at 6,564,116 (73.3 percent) of these addresses.

• About 40.4 percent of these PSA Households were vacant housing units.

Two or more PSA Households were formed at 2,396,129 addresses.

• These addresses make up just 2.0 percent of all DRF2 addresses.

• Three or more PSA Households were formed at just 46,141 addresses.

• A total of 1,235,327 addresses (51.6 percent) had one vacant PSA Household and one or more non-vacant PSA Households. The PSA selected the vacant PSA household over the non-vacant household(s) at only 194,596 of these addresses.

In order to identify the Primary PSA Household, the PSA applied, sequentially, an ordered list of seven criteria. Two of these criteria were the POP Count Status criterion, which caused a return with a resolved housing unit status to be selected over returns with an unresolved housing unit status, and the Return Type criterion, which selected the return that was the highest in a hierarchy of questionnaire form types.

• The Return Type criterion was

the criterion that selected the

Primary Household about 74

percent of the time.

• The POP Count Status criterion

was the selection criterion about

16 percent of the time.

• No other criterion triggered the

selection of more than 3.2 per-

cent of the Primary Households.
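The sequential use of an ordered list of criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PSA software: the report names just two of the seven criteria (POP Count Status and Return Type), so the field names, the numeric form-type ranking, the relative order of the two criteria, and the function name select_primary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Household:
    """A candidate PSA household (all fields are illustrative assumptions)."""
    label: str
    status_resolved: bool  # True if the housing unit status is resolved
    form_rank: int         # hypothetical form-type hierarchy; 1 is highest


# Each criterion maps a household to a score; a higher score is preferred.
Criterion = Tuple[str, Callable[[Household], int]]

CRITERIA: List[Criterion] = [
    ("POP Count Status", lambda h: 1 if h.status_resolved else 0),
    ("Return Type", lambda h: -h.form_rank),
    # The other five criteria are not described in the report and are omitted.
]


def select_primary(candidates: List[Household]) -> Tuple[Household, Optional[str]]:
    """Apply the criteria in order and stop at the first one that decides."""
    if not candidates:
        raise ValueError("at least one candidate household is required")
    remaining = list(candidates)
    if len(remaining) == 1:
        return remaining[0], None  # nothing to decide
    for name, score in CRITERIA:
        best = max(score(h) for h in remaining)
        remaining = [h for h in remaining if score(h) == best]
        if len(remaining) == 1:
            return remaining[0], name
    # Still tied after the listed criteria: keep the first (an assumption).
    return remaining[0], None
```

Recording the name of the deciding criterion alongside the selected household is what would make tallies such as "about 74 percent of the time" possible.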

Formation of primary selection algorithm households at addresses with two returns

About 97.3 percent of the addresses with two or more returns had just two eligible returns. Among these 8,716,359 addresses, the two returns were combined to form one PSA household 74 percent of the time.

Table 3 shows the results of forming PSA households at those addresses with just two eligible returns.

Table 3. Formation of PSA Households at Addresses with Two Returns

Outcome of PSA Household Formation                            Number    Percent of total

One PSA Household Formed                                   6,463,756                74.1
  Both returns are Vacant                                  2,634,322                30.2
  The Basic Return contains all of the persons
    on the other return                                    3,469,789                38.8
  The returns have person(s) in common but the Basic
    Return does not contain all of the persons on the
    other return                                             359,645                 4.1

Two PSA Households Formed                                  2,252,603                25.8
  One Non-Vacant [4] and One Vacant                        1,162,675                13.3
  Both Non-Vacant                                          1,089,928                12.5

Total                                                      8,716,359

Source: Baumgardner (2003) Table 14

[4] The term Non-Vacant includes Occupied returns and returns with an unresolved housing unit status.
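The outcome rows of Table 3 can be read as a small decision rule for an address with two eligible returns. The sketch below is a simplified illustration, not the PSA's actual person matching: it assumes each return is reduced to a set of person identifiers (an empty set stands for a Vacant return, and returns with an unresolved status are not modeled), that persons match only on identical identifiers, and it uses the hypothetical function name classify_pair.

```python
from typing import FrozenSet


def classify_pair(basic: FrozenSet[str], other: FrozenSet[str]) -> str:
    """Return the Table 3 outcome row for two returns at one address.

    basic and other are the sets of person identifiers on the Basic Return
    and on the other return; an empty set stands for a Vacant return.
    """
    if not basic and not other:
        return "One PSA Household: both returns are Vacant"
    if not basic or not other:
        return "Two PSA Households: one Non-Vacant and one Vacant"
    if other <= basic:
        # Every person on the other return also appears on the Basic Return.
        return "One PSA Household: Basic Return contains all persons on the other return"
    if basic & other:
        return "One PSA Household: persons in common, Basic Return incomplete"
    return "Two PSA Households: both Non-Vacant"
```

For example, classify_pair(frozenset({"p1"}), frozenset({"p2"})) falls in the "both Non-Vacant" row because the two returns share no persons.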


There were 3,829,434 addresses at which only one occupied PSA household was formed.

• Only about 9 percent of these are cases where the Basic Return did not contain all of the persons enumerated on the other return.

• An estimated 82 percent of these PSA Households were correctly formed, i.e., each of the returns that made up the household had at least one census resident.

There are 1,089,928 addresses where two returns for an occupied household could not be combined into one PSA household.

• At an estimated 38 percent of the addresses with two occupied returns and two PSA Households, both returns represented the same household. That is, there were residents in both returns.

• The PSA had no chance of combining an estimated 75.1 percent of these by matching persons because there were no duplicate persons between the two forms.

• The PSA performed matching and failed to identify duplicate persons at an estimated 16 percent of these addresses.

• At an estimated 58 percent of the addresses with two occupied returns and two PSA Households there were census residents on only one return.

• PSA chose the correct PSA Household in about 65 percent of these cases.

Primary selection algorithm household formed from two returns

A total of 6,561,984 PSA Households (57.4 percent of all PSA Households) comprised two returns. Table 4 below shows the combination of return types for these PSA households.

The large proportion of PSA households that contain two enumerator returns is primarily the result of the CIFU operation design. Most addresses that were determined to be vacant by the NRFU were included in the CIFU. Whenever CIFU determined that one of these was occupied or vacant, at least two enumerator returns were captured for the census address.

The large proportion of PSA households made up of a mail and an enumerator return is primarily due to mail returns that were received after the identification of the NRFU universe.

Only a small proportion (14 percent) of the PSA households with two mail returns is the result of a request for a foreign language questionnaire. Mail returns include paper mail back returns, internet responses and TQA reverse-Computer Assisted Telephone Interview (CATI) responses. It appears that the remainder of these PSA households represent duplicate attempts by respondents to report their households using two of these three methods.

The PSA automatically adds individuals from some returns for an RPA address to the Primary PSA household. Evaluation L.3.b. estimated that at least 60 percent of the individuals added from RPAs in this fashion were correctly included in the Census.

4.2.4 Recommendations

These recommendations come from Rosenthal (2003) and Baumgardner (2002).

4.2.4.1 Linking returns

Attempt to link only enumerator questionnaires and enumerator continuation forms, if these forms are used in the future. Doing so would greatly simplify the requirements of the linking process while causing negligible loss of data and possibly no effect on population count.

Ensure that all continuation forms are included on the DRF.

Table 4. Combinations of Return Types for PSA Households with Two Returns

Combination of Returns              Number of Addresses    Percent of Addresses

Mail/Mail                                       196,751                     3.0
Mail/Enumerator                               2,732,392                    41.6
Enumerator/Enumerator                         2,845,843                    43.4
Mail/RPA [5] & Enumerator/RPA                   782,906                    11.9
RPA/RPA                                           4,092                     0.1

Total                                         6,561,984

Source: Baumgardner (2003) Table 9

[5] RPA (Respondent Provided Address) returns: these include GQ returns with a Usual Home Elsewhere housing unit address, BCF returns, and NRFU returns for a Whole Household Usual Elsewhere and In-Mover address.


4.2.4.2 Determining housing unit status and household size

Construct more comprehensive instructions for enumerators to assist them in completing questionnaires for complicated and unusual interviews.

Pursue the use of computer assisted personal interviews through the use of hand held computer devices.

4.2.4.3 Primary selection algorithm

Define a simpler PSA process that relies more on type of return, source of return, and status of return and less on person matching. Person matching added much complexity to the process but affected the selection of a Primary PSA household at only a very small number of addresses.

Plan for the PSA to address only those combinations of returns that can be predicted by the design of the census operations. The PSA was robust and was designed to handle a large variety of cases including combinations of returns not anticipated by the design of census operations. Few combinations of returns occurred that were not anticipated by the census design.

Conduct research on the feasibility of integrating the PSA with processes to identify duplicate addresses and persons duplicated at more than one address.

Pursue methods to reduce the inclusion of mail response households in the NRFU. Mail returns were received for more than 3.4 million addresses after the identification of the NRFU universe. As a result, the DRF had both a mail return and an enumerator questionnaire for these addresses.

4.3 Hundred percent census unedited file for housing units

Most of these results come from Jonas (2003).

4.3.1 The hundred percent census unedited file processing

The first step in the Hundred Percent Census Unedited File (HCUF) processing for housing unit data was to identify addresses on the DMAF that did not represent a census housing unit. The addresses eliminated from the housing unit universe at this stage of processing were referred to as ‘Kills’. ‘Kills’ were almost entirely address locations confirmed not to be housing units by the many address development operations in the census.

• There are 127,828,778 addresses on the DMAF of which 9,057,195 (7.1 percent) were identified as a “Kill”.

The DMAF addresses that remained in-scope after the ‘Kills’ were identified were merged with the DRF2. At this stage of processing one of the following housing unit statuses was assigned to each address on the DMAF: Occupied, Vacant, Delete, Occupied with Unresolved Household Size, Occupancy Status Unknown, and Status Unknown. These statuses correspond to those assigned to individual census returns during the DRF2 processing.

Table 5 shows the final status assigned to each address that was in-scope at this stage.

Table 5. Source of Housing Unit Status for DMAF Addresses

                                                   Source of Status Data
                                                                 Respondent
                                             Self   Enumerator     Provided
Housing Unit Status                      Response     Response     Response      No Data        Total  Percent

Resolved Occupancy Status:
  Occupied                             80,781,126   26,636,881      197,778            0  107,615,785     90.6
  Vacant                                   16,277   10,438,871            0            0   10,455,148      8.8
  Delete                                        0        8,653            0            1        8,654      0.0

Occupied/Unresolved Household Size              0      169,902       30,232            0      200,134      0.2
  (Household Size Imputation)

Unresolved Occupancy Status:
  Occupancy Status Unknown                    506      194,739            0            0      195,245      0.2
    (Occupancy Imputation)
  Status Undetermined                           0       45,113           27      251,477      296,617      0.2
    (Status Imputation)

Total                                  80,797,909   37,494,159      228,037      251,478  118,771,583    100.0

Percent                                      68.0         31.6          0.2          0.2        100.0

Source: Jonas (2003) Table 2
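The cells of Table 5 can be cross-checked with a short script. The sketch below only re-adds the published figures and compares the grand total with the DMAF address count less the ‘Kills’; the variable names are illustrative, and percentages computed this way may differ from the printed ones in the last digit because of rounding.

```python
# Rows of Table 5: (Self, Enumerator, Respondent Provided, No Data).
ROWS = {
    "Occupied": (80_781_126, 26_636_881, 197_778, 0),
    "Vacant": (16_277, 10_438_871, 0, 0),
    "Delete": (0, 8_653, 0, 1),
    "Occupied/Unresolved Household Size": (0, 169_902, 30_232, 0),
    "Occupancy Status Unknown": (506, 194_739, 0, 0),
    "Status Undetermined": (0, 45_113, 27, 251_477),
}

# Sum across each row and down each column.
row_totals = {status: sum(cells) for status, cells in ROWS.items()}
column_totals = tuple(sum(col) for col in zip(*ROWS.values()))
grand_total = sum(row_totals.values())

# In-scope addresses are the DMAF addresses left after removing the 'Kills'.
DMAF_ADDRESSES = 127_828_778
KILLS = 9_057_195
in_scope = DMAF_ADDRESSES - KILLS

for status, total in row_totals.items():
    print(f"{status:<36}{total:>12,}{100 * total / grand_total:>7.1f}")
print(f"{'Total':<36}{grand_total:>12,}")
print("Column totals:", column_totals)
print("Matches DMAF minus Kills:", grand_total == in_scope)
```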


The source of the status for each address could be either the status assigned to the DRF return selected by the PSA or the DMAF data on the outcome of the NRFU, CIFU or Field Verification (FV) operations. The categories for the source of the data on which the housing unit status was assigned are defined below:

• Self Response - the source was a DRF2 return for a paper mail return questionnaire, internet response or a reverse CATI response.

• Enumerator Response - the source was a DRF2 return for a Simple Enumerator Questionnaire return, an enumerator continuation form, or the DMAF data on the outcome of the NRFU, CIFU or FV.

• Respondent Provided Address - the source was a DRF2 return for a paper BCF return or a GQ return.

• No Data - This source indicates that there was no return on the DRF2 for the address, and that there were no data on the DMAF from the NRFU, CIFU or FV operations.
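The four source categories amount to a lookup on the kind of record that supplied the status. A minimal sketch follows, assuming hypothetical record-type labels (the strings below are illustrative, not codes from the DRF2 or DMAF) and assuming that a DRF2 return, when present, takes precedence over DMAF followup data; the report does not state the precedence.

```python
from typing import Optional

# Hypothetical labels for the kind of DRF2 return that supplied the status.
SELF_RESPONSE = {"paper_mail_return", "internet_response", "reverse_cati"}
ENUMERATOR = {"simple_enumerator_questionnaire", "enumerator_continuation_form"}
RESPONDENT_PROVIDED = {"paper_bcf_return", "gq_return"}


def source_of_status(drf2_return_type: Optional[str],
                     has_dmaf_followup_data: bool) -> str:
    """Map the data available for an address to a Table 5 source category.

    drf2_return_type is None when no return exists on the DRF2;
    has_dmaf_followup_data is True when the DMAF holds NRFU, CIFU or FV
    outcome data for the address.
    """
    if drf2_return_type in SELF_RESPONSE:
        return "Self Response"
    if drf2_return_type in RESPONDENT_PROVIDED:
        return "Respondent Provided Address"
    if drf2_return_type in ENUMERATOR or has_dmaf_followup_data:
        # DMAF outcome data also counts as an Enumerator Response.
        return "Enumerator Response"
    return "No Data"
```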

During the imputation of housing unit status to addresses in the Status Imputation category, a total of 47,126 addresses were given a housing unit status of Delete and removed from the HCUF.

No Data for a Housing Unit - There were no data for almost 85 percent of the addresses assigned a housing unit status of Status Unknown.

• Over 82 percent of these are addresses added to the DMAF from updates that occurred in August 2000 or later.

Those addresses added late in the processing schedule were added to the DMAF after the merge of the DMAF and DRF2 data. Thus, the data capture responses for these addresses were not included on the DRF2. None of these were addresses pre-assigned to the NRFU, CIFU or FV operations although a large proportion of these were added to the Census during these operations. Since the results of these followup operations were only recorded on the DMAF for addresses pre-assigned to the operation, no data on housing unit status are available on the DMAF for these addresses.

4.3.2 Duplicate delete processing

The Duplicate Delete process identified 1,392,686 duplicate housing units that were deleted at the time of the HCEF creation.

• About 2.9 percent of the deleted housing units had a Vacant housing unit status.

• About 0.6 percent of the deleted housing units had a pre-imputation housing unit status of Occupancy Status Unknown or Status Unknown.

4.3.3 Recommendation

• Reexamine the timing and coordination of data processing operations in order to ensure that the responses captured for all addresses can be included in the final census files (Jonas, 2003).

4.4 Non-ID addresses processing

Most of these results come from Medina (2001).

4.4.1 Non-ID processing results

The geocoding and matching operations were effective and efficient processes.

• The clerical geocoding operation utilizing the Interactive Mapping and Geocoding System, an interactive clerical matching and geocoding operation which involves calling respondents while allowing clerical staff to simultaneously see both the MAF and TIGER databases, is a viable and effective operation.

• The clerical staff was able to match and geocode addresses at a much faster rate than estimated. These faster production rates significantly contributed to lower staffing levels than planned for this clerical operation.

• The automated non-city-style matching was an effective tool for reducing the workload for clerical matching and geocoding. Based on the clerical geocoding rate, the matching saved approximately 300 person-days of clerical staff time.

The workload of Type A and Type B records for the Non-ID process was more than two times as large as it should have been. Almost 2.3 million of the 4.2 million Type A and Type B Non-ID cases were included in error. The DSCMO did not apply the filter to exclude ineligible GQ UHE returns from the Non-ID process prior to identifying returns that required the assignment of a MAFID through the Non-ID process. As a result, 2,281,712 GQ returns were erroneously included in the Non-ID process while only 659,566 GQ returns were legitimately included.

The GEO received more than one million records too late to be appropriately processed in subsequent Census 2000 operations. Although we do not have direct measurements of the errors caused by the tardy transfer of records, an examination of how these records were treated in Census processes illustrates the potential for serious coverage errors and loss of data.


• The DSCMO delivered over 830,000 Type A and Type B records to the GEO after the June 14, 2000 processing cut-off date for identifying Type A addresses that should be included in the FV operation. The records were received too late for the Non-ID process to complete them prior to the deadline for identifying the FV universe.

Any of these records for eligible GQ UHE addresses that could be geocoded but not matched to the MAF were eligible for the FV operation. Since they were processed too late to be included in the FV, they were added to the DMAF without verification.

• More than 78,000 such addresses were added to the Census without having been included in the FV operations. Most of these addresses were obtained during the NRFU interview through the UHE and In-Mover probes.

• Most of these addresses were obtained through the NRFU UHE probe. A total of approximately 55,000 addresses obtained through this probe should have been sent to FV. Only one percent of these were sent to the FV. The remainder were added to the Census without verification (Viator, 2003).

Most of the added addresses obtained through the NRFU UHE probe are suspicious additions to the Census 2000. More than 70 percent of the approximately 54,000 addresses obtained through this UHE probe were reported to be vacant on April 1, 2000. A report of a vacant housing unit is contrary to the concept of the usual home elsewhere.

Furthermore, the FV operation found that about half of the UHE addresses that were included in the field operation were not housing units. This rate is consistent with the overall FV operation, which found that only about one-half of the cases processed were census housing units.

• The GEO received over 6,800 Type A and Type B records from the DSCMO after the August 4, 2000 processing cut-off date for the August 15, 2000 delivery of MAF updates to the DMAF. Responses for addresses added to the MAF after this update were not included on the DRF.

• The GEO received more than 124,000 UHE addresses for Individual Census Questionnaires (ICQs) and Shipboard Census Reports (SCRs) after the July 23, 2000 suspension of the clerical geocoding processing. As a result, no attempt was made to clerically geocode any of these addresses that failed the automated matching and geocoding process. Thus, new addresses could not be identified and included in the Census.

• More than 207,000 Type C records were processed too late for their response data to be included on the DRF or included in the FV process. All of the Type C addresses identified as having been processed too late were addresses that were new to the MAF. The DMAF was updated with these new addresses after the DRF2 was created.
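The cut-off dates cited in this section can be collected into a small lookup that shows which downstream steps a late record would miss. This is an illustrative sketch only: the dates come from the text, but the labels, the function name missed_steps, and the simplification that every cut-off applies to every record are assumptions (in the report, each cut-off applied to particular record types).

```python
from datetime import date
from typing import List

# Cut-off dates reported in the text; the labels are paraphrases.
CUTOFFS = [
    (date(2000, 6, 14), "identification of the FV universe"),
    (date(2000, 7, 23), "clerical geocoding"),
    (date(2000, 8, 4), "August 15, 2000 delivery of MAF updates to the DMAF"),
]


def missed_steps(received: date) -> List[str]:
    """List the steps whose cut-off had already passed when a record arrived."""
    return [label for cutoff, label in CUTOFFS if received > cutoff]
```

Under these assumptions, a record received on July 1, 2000 would miss only the identification of the FV universe, while one received in September 2000 would miss all three steps.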

4.4.2 Recommendation

• Continue to refine and test the Interactive Matching and Geocoding System (IMAGS) software as a product to clerically match and geocode addresses (Medina, 2001).

4.5 Group quarters processing

Most of these results come from Jonas (2002).

4.5.1 Resolution of missing data

The GQ processing successfully dealt with difficulties surrounding a potentially large amount of missing data. In May 2000, the NPC reported that a large number of GQ questionnaires did not have GQ Identification (ID) numbers on them and/or had no associated control sheet. Procedures were quickly designed and implemented by Census Bureau Headquarters staff to clerically review these questionnaires and, if possible, identify them with the correct GQ. An estimated 700,000 questionnaires were reviewed during this operation. This operation appears to have been highly successful although no official accounting was made of the outcome of this review.

A unique barcode and number were printed on each GQ questionnaire in the Census 2000. However, the barcode was not used to track GQ questionnaires from enumeration through data capture. Census enumerators were required to transcribe the 14-digit GQ identification number to each GQ questionnaire. When this was not done or was done incorrectly it was difficult and sometimes impossible to identify the GQ at which the respondent was enumerated.

After GQ questionnaires were captured by the DCS2000, the counts of captured questionnaires were examined by an inter-divisional team of staff knowledgeable of the GQ enumeration and processing operations. This review was not originally part of the design for GQ processing. The team found that the data capture was incomplete in several ways:

• No questionnaires were received for a number of GQs which were believed to have refused our attempts to enumerate them.

• The count of questionnaires for a number of GQs was far less than projected by pre-enumeration operations.

• A number of GQs had a higher count of questionnaires sent to NPC by the LCOs than were captured by the DCS2000. (A portion of the missing questionnaires can be attributed to the missing GQ ID numbers on some forms and our inability to identify them with the appropriate GQ.)

A previously unplanned telephone followup operation was implemented to address the first two of the count deficiencies described above. This followup ascertained the Census Day population count for GQs but did not collect the demographic data of residents.

• A total count of 101,598 persons was added to the GQ population as a result of this followup. This was 1.3 percent of the total Census 2000 GQ population.

• About 4.4 percent of GQ residents at hospitals were enumerated by this followup.

The DSSD designed a procedure to derive a count of the expected number of persons enumerated at GQs to mitigate the problems posed by the last of the three count discrepancies. When the aggregate count of forms shipped to the NPC for a Special Place was higher than the aggregate count of forms captured, the difference in these two counts was allocated to the GQs within the Special Place proportional to the differences in the two counts for each GQ.
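The allocation rule described above can be illustrated with a short Python sketch. It is not the DSSD procedure itself: the report does not say how fractional persons were rounded or how GQs with no shortfall were treated, so the largest-remainder rounding, the restriction of weights to GQs with a positive difference, and all names below are assumptions.

```python
from typing import Dict


def allocate_shortfall(shipped: Dict[str, int],
                       captured: Dict[str, int]) -> Dict[str, int]:
    """Allocate a Special Place's missing forms to its GQs.

    shipped and captured map each GQ identifier to the number of forms
    shipped to the NPC and the number captured. Returns the number of
    persons added to each GQ.
    """
    total_gap = sum(shipped.values()) - sum(captured.get(gq, 0) for gq in shipped)
    if total_gap <= 0:
        # No shortfall at the Special Place level, so nothing is allocated.
        return {gq: 0 for gq in shipped}

    # Weight each GQ by its own positive difference (assumption).
    gaps = {gq: max(shipped[gq] - captured.get(gq, 0), 0) for gq in shipped}
    weight_sum = sum(gaps.values())  # always >= total_gap, hence positive here

    # Integer largest-remainder rounding so the shares sum to total_gap.
    added = {gq: total_gap * gaps[gq] // weight_sum for gq in shipped}
    remainder = {gq: total_gap * gaps[gq] % weight_sum for gq in shipped}
    leftover = total_gap - sum(added.values())
    for gq in sorted(shipped, key=lambda g: remainder[g], reverse=True)[:leftover]:
        added[gq] += 1
    return added
```

For instance, with shipped counts of 10 forms at each of three GQs and captured counts of 7, 8 and 12, the Special Place shortfall is 3 forms, and this sketch assigns 2 to the first GQ, 1 to the second and none to the third.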

Collectively, all these operations added about 200,000 persons to the Census 2000 GQ population of 7,825,407. As a result, it was necessary to impute demographic data for 2.6 percent of the Census 2000 GQ population.

4.5.2 Processing of responses with a usual home elsewhere address

The GQ processing successfully recovered from the erroneous routing of returns for GQ residents reporting UHE addresses to the Non-ID process. The GQ responses sent to the Non-ID process could be removed from the GQ universe and placed in the housing unit universe if the UHE address was confirmed to be a housing unit. It was therefore important that we successfully identified ineligible GQ UHE responses, thereby preventing them from being erroneously removed from the GQ universe.

• A total of 659,566 responses with a UHE housing unit address were correctly removed from the GQ universe.

• A total of 150,315 responses were incorrectly removed from the GQ universe because they were incorrectly identified as having a UHE address.

• GQ processing erroneously sent nearly 2.3 million GQ responses to the Non-ID process.

• There were 1,892,742 responses with a UHE address collected from those types of GQs that made them ineligible to be sent to the Non-ID process.

• There were 388,970 responses that were incorrectly identified as having a UHE address.
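The counts in this section reconcile with the Non-ID workload figures reported in Section 4.4.1. The short calculation below is a worked check that uses only numbers from the text; the variable names are illustrative, and the 4.2 million workload is the rounded figure given in the report.

```python
# Components of the GQ responses erroneously sent to the Non-ID process.
ineligible_gq_type = 1_892_742   # UHE address, but GQ type ineligible for Non-ID
misidentified_uhe = 388_970      # incorrectly identified as having a UHE address
legitimately_sent = 659_566      # GQ returns legitimately included in Non-ID

erroneously_sent = ineligible_gq_type + misidentified_uhe
total_gq_sent = erroneously_sent + legitimately_sent

NON_ID_WORKLOAD = 4_200_000      # rounded Type A and Type B workload
share_in_error = erroneously_sent / NON_ID_WORKLOAD

print(f"Erroneously sent: {erroneously_sent:,}")
print(f"All GQ responses sent to Non-ID: {total_gq_sent:,}")
print(f"Share of workload in error: {share_in_error:.0%}")
```

The two error components add up to the 2,281,712 GQ returns cited in Section 4.4.1, which is a little over half of the rounded Type A and Type B workload.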

4.5.3 Recommendations

These recommendations come from Jonas (2002).

• Track GQ questionnaires throughout the operation, from enumeration through data capture. A unique barcode and number printed on each GQ questionnaire can be used for this purpose using existing products to record and manage a database of these identification numbers.

• Institute more effective software quality assurance programs involving representatives from more than one division.

5. Conclusions

5.1 Processing systems design architecture

• No critical failures of the processing system design were reported during the implementation of the census. The design of the census processing systems for housing unit responses was adequate for the required tasks. The processing systems handled millions of census responses from a large variety of data collection methods and data collection systems. Enumeration results, demographic data, housing unit data, and geographic information were successfully integrated, on time, into the critical Census databases such as the DRF, the HCUF and the DMAF.

5.2 Development and documentation of requirements

• A common thread in all of the studies was the need for well documented processing requirements. It has been suggested that more time be allocated to the development and design of requirements.

According to Fowler (2003) the requirements development for the DRF1 and DRF2 was lacking. Complete integrated requirements development was not available to address all components as necessary to produce adequate DRF1 and DRF2 documentation, and design quality assurance processes and software testing. Instead, requirement documents were produced piecemeal, which resulted in processing complications.

The DRF1 requirements documentation existed for some components of the file creation process, such as creation of the normalized data response files, the interface with the Coverage Edit Followup (CEFU) operation and coding extraction, but did not exist for other critical steps such as interfaces with the Non-ID process, and the editing and coding of the response data. Late changes to requirements to address the inclusion of large households in the CEFU and changes to the Data Capture Audit resolution process were challenging for DSCMO to implement. There are no documented software testing or QA processes for these operations.

The steps undertaken to develop and implement requirements for the Primary Selection Algorithm (PSA) were an exemplary and successful process. Considerable staff resources were devoted to the development of requirements and software for the PSA, which was applied to less than ten percent of the Census 2000 housing unit addresses. As stated by Fowler (2003), the development of requirements for the PSA was adequate to the task. The dedication of sufficient staff to this task contributed to the successful development of requirements. The requirements for the PSA software consisted of a very complex set of criteria and person matching. The timely development and documentation of these requirements allowed for the design and implementation of more complete and effective software testing and QA procedures.

Processing requirements were also fully developed for the processes of linking questionnaire and continuation form response data, and assigning housing unit status and household size. However, the software testing for these processes was much more informal than for the PSA. Insufficient resources and the lateness of requirements development did not allow adequate time to develop formal software testing and QA procedures.

• A complete identification of the requirements needed was not accomplished prior to the implementation of census operations. Some requirements documents were not listed in the Master Activity Schedule. The responsibility for these requirement specifications was not assigned with sufficient time to complete all steps of a full process development and the deliverables could not be tracked.

There are several examples of processes for which the need for requirements and the assignment of the responsibility for specifying requirements were not identified in a timely manner. These include the DRF2 processing (requirements for linking response records and requirements for assigning housing unit status), and the CUF processing (requirements for identifying ‘Kills’, integrating of the DMAF and DRF2).

5.3 Quality assurance processes

• The lack of quality process control and quality assurance software testing put many of the processing steps at risk. The overall success of the Census 2000 processing is apparent from the studies that contributed to this report. However, some missteps did occur that had important effects on Census data. These may have been preventable through more formal and thorough quality assurance and control procedures.

• A well designed and formal QA program was carried out for the PSA. The PSA used inter-divisional teams to develop requirements and software testing procedures. Formal walkthrough and testing were conducted for all software components.

Primary responsibility for the design of testing procedures for the PSA software was initially given to contractors. This worked well in the beginning because the contractor had expertise in software testing and had gained sufficient knowledge of the PSA process through the Census 2000 Dress Rehearsal experience. Midway through the development process, this contractor left and was replaced by others who did not have an adequate understanding of the PSA software requirements. From that point on, it became necessary for Census Bureau staff from the DSSD to assume primary responsibility for the design of the software testing. By this stage of development the Census Bureau staff had gained sufficient knowledge of the principles of software testing. The software testing was successful but the transition of responsibilities put the testing process at great risk.

5.4 Non-ID process

• The geocoding and matching operations of the Non-ID process were effective and efficient processes, yet Non-ID processing was not completed in time for the data from all cases to be integrated into the DRF2 and the HCUF.

It appears that the Non-ID process was overwhelmed with the enormous number of GQ UHE addresses erroneously included in the process by the DSCMO. Jonas (2002) reported that the GQ processing operation erroneously sent nearly 2.3 million UHE addresses to the Non-ID process. This was over half of the Non-ID workload. It appears that due to this large workload, many Non-ID records were delivered to the GEO after important processing cutoff dates (Medina, 2001). Medina (2001) points out that the GEO received more than 830,000 Type A and B addresses from the DSCMO later than the June 14, 2000 processing cutoff date for the identification of the FV workload. Many of these were delivered to the GEO after the cutoff for inclusion of the response data into the DRF2. Although Medina (2001) does not discuss Type C addresses in the Non-ID processes, the late delivery of such a large number of Type A and B records almost certainly delayed the processing of Type C records.

The need to adhere to a tight processing schedule meant that other processing took priority over the processing of Non-ID records by the DSCMO. All Non-ID records were ultimately delivered by the DSCMO to the GEO, processed by the GEO and returned to the DSCMO processing queue. All Non-ID addresses geocoded by the GEO were included in the DMAF.

The completion of Non-ID processing late in the Census schedule had two important impacts on the Census enumerations. The first is that the response data from more than 207,000 housing units were not included in the Census. Some of these housing units were deleted from the Census by the count imputation process. The second is that it may have added many nonexistent housing units to the Census. The late timing of Non-ID processing meant that more than 78,000 housing units were added to the census without having been verified by the FV operation. An evaluation of the FV found that only about half of the addresses processed by the FV were verified as valid housing units.

5.5 Count imputation of housing unit status

• It was noted by Fay (2001) that the Census 2000 experienced a higher rate of whole person imputations than the 1990 Census. The count imputation process accounted for 0.42 percent of the total Census 2000 population, a rate several times higher than experienced in the 1990 Census.

Count imputation included the three categories of whole household imputation in which a housing unit status and/or household size were imputed to census addresses. These three categories include a total of 691,996 addresses: 1) 200,134 for Household Size Imputation, 2) 195,245 for Occupancy Imputation and 3) 296,617 for Status Imputation.

It appears that processing errors caused the missing data for a significant portion of the addresses in the latter two imputation categories. These errors may have tripled the number of Census housing units for which the occupancy status was imputed. How the number of addresses in each of these two categories was affected by deviations from specifications is discussed below.

Status Imputation - The Status Imputation category includes a total of 251,477 addresses for which there was no return on the DRF. More than 80 percent (207,283) of these addresses have been identified as Type C addresses processed very late by the Non-ID process and added to the DMAF in August 2000 or later. These were part of the updates to the DMAF that occurred as late as November 2000. These updates to the DMAF occurred so late that the response data for these added addresses could not be included on the DRF2. As a result it was necessary to impute the housing unit status and household size for these addresses.

Occupancy Imputation - The Occupancy Imputation category includes 145,367 Census addresses which are suspected to be vacant housing units but were included in this imputation category. Rosenthal (2003) states that responses of ‘0’ to the Interview Summary Item B (POP on April 1, 2000) were mistakenly coded as a blank entry. This processing error occurred on as many as 258,963 returns assigned the housing status of Occupancy Status Unknown. Almost 95 percent of these returns had the Interview Summary Item C (Type of Vacancy) filled. Enumerators were instructed to fill Interview Summary Item C only for vacant housing units. This suggests that enumerators took care to enter ‘0’ in Interview Summary Item B as well as filling Interview Summary Item C for almost all of the 258,963 responses with a suspected error. The PSA chose these returns as the Primary PSA Household at 145,367 addresses included on the HCUF before the imputation of housing unit status.
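
The miscoding described above can be illustrated with a small sketch. The field names (`item_b_pop`, `item_c_vacancy_type`), the status labels and all of the functions below are hypothetical; they are not taken from the Census 2000 processing specifications, and the status rule is a deliberate simplification. The sketch contrasts a coding step that treats a response of ‘0’ as blank with one that preserves it, and shows how a filled Item C can flag returns that were probably vacant.

```python
from typing import Optional


def code_item_b_buggy(raw: str) -> Optional[int]:
    """Mimic the suspected error: a response of '0' is coded as blank."""
    value = raw.strip()
    if not value or value == "0":  # bug: '0' is lumped in with blank
        return None
    return int(value)


def code_item_b_fixed(raw: str) -> Optional[int]:
    """Keep '0' as a valid population count; only empty input is blank."""
    value = raw.strip()
    return int(value) if value else None


def housing_status(item_b_pop: Optional[int]) -> str:
    """Assign a hypothetical status label from the coded Item B value."""
    if item_b_pop is None:
        return "OCCUPANCY_UNKNOWN"
    return "VACANT" if item_b_pop == 0 else "OCCUPIED"


def probably_vacant(item_b_pop: Optional[int], item_c_vacancy_type: str) -> bool:
    """Flag returns with a blank Item B but a filled Item C (Type of Vacancy)."""
    return item_b_pop is None and bool(item_c_vacancy_type.strip())


raw_b, raw_c = "0", "For rent"
print(housing_status(code_item_b_buggy(raw_b)))          # OCCUPANCY_UNKNOWN
print(housing_status(code_item_b_fixed(raw_b)))          # VACANT
print(probably_vacant(code_item_b_buggy(raw_b), raw_c))  # True
```

Under these assumptions, the same enumerator response ends up in the imputation universe only when the ‘0’ is lost, which is the pattern the filled Item C entries point to.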

5.6 Primary selection algorithm

• For addresses with two returns, the outcome of the PSA could have differed for fewer than an estimated 500,000 addresses had it not included a within-address person matching function. This number includes an estimated 104,000 addresses with two returns for the same household that the PSA failed to match. This does not imply that the results would have necessarily been different or that they would have been less correct.

More than ninety-seven percent of all addresses with multiple returns had just two returns. One of the following situations exists for all of the addresses with two returns: 1) all Vacant returns, 2) one Vacant return and one Occupied return, 3) a Basic Return that included all of the persons on the other return, 4) two returns for two different households, or 5) two returns with matching persons and each having unique persons not found on the other return. Only for the last scenario could the person matching have affected the accuracy of enumerations resulting from the PSA outcome.
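
The five situations listed above can be expressed as a simple classification of a pair of returns. The following Python sketch is illustrative only and is not the PSA itself: the `Return` class, its fields and the `classify_pair` function are hypothetical, persons are represented by opaque identifiers that are assumed to be already matched, and situation 3 is simplified to "one return's persons are all found on the other" without checking which return is the Basic Return.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Return:
    """A census return for an address (hypothetical, simplified structure)."""
    occupied: bool
    persons: FrozenSet[str] = frozenset()  # assumed pre-matched person IDs


def classify_pair(a: Return, b: Return) -> int:
    """Return the situation number (1-5) for an address with two returns."""
    if not a.occupied and not b.occupied:
        return 1  # all Vacant returns
    if a.occupied != b.occupied:
        return 2  # one Vacant return and one Occupied return
    if a.persons <= b.persons or b.persons <= a.persons:
        return 3  # one return includes all persons on the other
    if a.persons.isdisjoint(b.persons):
        return 4  # two returns for two different households
    return 5      # matching persons, and each return has unique persons


def matching_could_matter(a: Return, b: Return) -> bool:
    """Only situation 5 depends on the person matching outcome."""
    return classify_pair(a, b) == 5


first = Return(True, frozenset({"p1", "p2"}))
second = Return(True, frozenset({"p1", "p3"}))
print(classify_pair(first, second))          # 5
print(matching_could_matter(first, second))  # True
```

In this simplified view, person matching changes the selection only when the two returns share some persons and each also lists persons missing from the other.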

6. Recommendations

• Ensure timely and complete documentation of processing requirements and design. Timely processing requirements will reduce the time required for software development, allow for adequate software testing, and allow for the design and implementation of quality assurance processes. The timely development of documentation will allow for complete and accurate assessments of the interdependencies between procedures. Preferably, complete documentation would be achieved for a Census dress rehearsal. In most cases the final Census documentation would evolve directly from the dress rehearsal documentation, reflecting small changes based on lessons learned in a dress rehearsal. Having full documentation completed for the dress rehearsal would ensure that the impact of modifications to one procedure on related activities could be adequately evaluated.

• Identify the staffs with the critical skills and knowledge needed to develop all requirements early in the development process.

• Ensure that effective QA procedures are in place for all census data processing operations. Allow flexibility in the standards on which the QA procedures are based. The scope of QA procedures and resources required to implement them should be commensurate with the level of risks associated with processing errors.

• Incorporate the use of interactive geocoding software such as the Interactive Matching and Geocoding Systems (IMAGS) software again for the 2010 Census. Continue to develop and test software such as this as a product to clerically match and geocode addresses. The use of this software resulted in an efficient clerical geocoding operation with respect to timing and outcome. The process took much less time than expected.

The addresses geocoded using this software were addresses that could not be geocoded by an automated process. Yet, the proportion of clerically geocoded addresses found by FV in the block to which they were coded was similar to the proportion for addresses that could be geocoded by the completely automated process. This indicates that an adequate level of accuracy was achieved for the clerical geocoding process using the interactive geocoding software.

• Eliminate the within-address person matching function from the PSA. Define a simpler within-address return selection process that relies more on type of return and status of return and less on person matching. In Census 2000, the elimination of within-address person matching would not have significantly degraded coverage or the quality of Census data. Extensive resources in Census 2000 were devoted to designing and implementing a PSA that relied on within-address person matching. In the 2010 Census, these resources may be better directed toward the unduplication of persons across addresses.

References

Baumgardner, Stephanie (2002), Analysis of the Primary Selection Algorithm, Census 2000 Evaluation L.3.a, November 2002.

Baumgardner, Stephanie (2003), Resolution of Multiple Census Returns Using a Re-interview, Census 2000 Evaluation L.3.b, September 2003.

Carter, Nathan (2002), Be Counted Campaign for Census 2000, Census 2000 Evaluation A.3, September 2002.

Coan, Edmund (2000), Program Master Plan: Census 2000 Decennial Response File Program, Census 2000 Informational Memorandum No. 85, December 2000.

Fay, Robert (2001), The 2000 Housing Unit Duplication Operations and Their Effect on the Accuracy of the Population Count, Joint Statistical Meetings Proceedings 2001, August 2001.

Fowler, Charles (2003), Assessment Report for Decennial Response Files DRF1, DRF2, PSA, Census 2000 Informational Memorandum No. 137, May 2003.

Griffin, Richard (2001), Census 2000 - Missing Housing Unit Status and Population Data, DSSD Census 2000 Procedures and Operations Memorandum Series B-17, February 2001.

Jonas, Kim (2002), Group Quarters Enumeration, Census 2000 Evaluation E.5, September 2002.

Jonas, Kim (2003), Census Unedited File Creation, Census 2000 Evaluation L.4, August 2003.

Medina, Karen (2001), Assessment Report for Non-ID Questionnaire Processing (including BCF/TQA Field Verification), September 2003.

Nash, Fay (2001), Analysis of Census Imputations, Executive Steering Committee For A.C.E. Policy II Report No. 21, September 2001.

Pike, A. Edward (2001), Program Master Plan: Census 2000 Matching/Geocoding Non-Master Address File (MAF) Identification (ID) Questionnaires, Census 2000 Informational Memorandum No. 98, April 2001.

Rosenthal, Miriam (2003), Operational Analysis of the Decennial Response File Linking and Setting of Housing Unit Status and Expected Household Size, Census 2000 Evaluation L.2, June 2003.

Tenebaum, Michael (2001), Assessment of Field Verification, Census 2000 Evaluation H.2, July 2001.

Viator, Mark; Alberti, Nicholas (2003), Evaluation of Nonresponse Followup - Whole Household Usual Home Elsewhere Probe, Census 2000 Evaluation I.2, February 2003.

Census 2000 Topic Report No. 6 TR-6

Issued February 2004

Evaluations of the Census 2000 Partnership and Marketing Program

U.S. Department of Commerce Economics and Statistics Administration

U.S. CENSUS BUREAU

Census 2000 Testing, Experimentation, and Evaluation Program