
Page 1: Data Processing at ICPSR

Data Processing at ICPSR

Peggy Overcashier
Senior Systems Analyst, ICPSR

CESSDA Expert Seminar
Neuchâtel, Switzerland

September 9, 2004

Page 2: Data Processing at ICPSR

What is ICPSR?

• Membership organization founded in 1962
  – Over 500 colleges and universities
• 2004-2005 budget approximately $10 million (USD) (8.2 million EUR; 12.5 million CHF)
  – 30% from membership fees
  – 70% from grants and contracts
• Around 100 employees; 40 data processing staff
• World’s largest archive of computer-readable social science data
  – About 7,000 titles and 140,000 data files
  – Close to 300 data files available for online analysis

Page 3: Data Processing at ICPSR

Two Kinds of Archival Holdings:

• General Archive Holdings are funded with member dues and are available only to members
• Special Topic Archives are supported by foundations or federal agencies and holdings are available to all
  – Aging
  – Child Care and Early Education
  – Criminal Justice
  – Demographic Research
  – Education
  – Health and Medical Care
  – Substance Abuse and Mental Health

Page 4: Data Processing at ICPSR

Topics

1. What we do today and how
  – Current ICPSR processing pipeline
2. Development of aids to efficient and accurate processing
  – Automated scripts and tools
  – Semi-automated techniques
3. Where we’re headed
  – ICPSR process improvement initiative

Page 5: Data Processing at ICPSR

Current ICPSR Processing Pipeline

Page 6: Data Processing at ICPSR

Acquire Study Data & Documents

Processor

• Scan deposited electronic files for viruses
• Inventory files and documentation received
• Verify that electronic files open, are readable
• Prepare acquisition form (text)
• Transmit original data and documentation for preservation
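
The inventory and readability checks listed above could be sketched roughly as follows. This is an illustration only, not ICPSR's actual tooling: the function name `inventory_deposit` and the tuple layout are assumptions made for the example, and "readable" here means only that the file can be opened and a byte read from it.

```python
import os


def inventory_deposit(root):
    """Walk a deposit directory and list every file found.

    Returns a sorted list of (relative_path, size_in_bytes, readable)
    tuples. The name and return layout are hypothetical.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # A file counts as readable if it opens and one byte
                # can be read without an operating-system error.
                with open(path, "rb") as handle:
                    handle.read(1)
                readable = True
            except OSError:
                readable = False
            records.append(
                (os.path.relpath(path, root), os.path.getsize(path), readable)
            )
    return sorted(records)
```

A processor could paste the resulting list into the acquisition form, or compare it with the depositor's own file list to spot anything missing.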

Page 7: Data Processing at ICPSR

Plan Processing

Processor

• In consultation with processing supervisor
• Determine processing level (routine, intensive)
• Initial review of files
  – Potential disclosure risks
  – Completeness of variable-level metadata
  – Wild/undocumented codes
• Discuss identified problems/solutions
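
A check for wild (undocumented) codes during the initial review might look something like the sketch below. It assumes the data are available as a list of dictionaries and the codebook as a mapping from variable name to its documented codes. Both structures, and the name `find_wild_codes`, are hypothetical.

```python
def find_wild_codes(records, codebook):
    """Return {variable: set of values} for values absent from the codebook.

    records  -- list of dicts, one per case (hypothetical layout)
    codebook -- dict mapping variable name to its documented codes
    """
    wild = {}
    for row in records:
        for variable, value in row.items():
            # Only variables that have documented codes are checked.
            if variable in codebook and value not in codebook[variable]:
                wild.setdefault(variable, set()).add(value)
    return wild


cases = [{"SEX": 1}, {"SEX": 2}, {"SEX": 7}]
documented = {"SEX": {1: "Male", 2: "Female", 9: "Missing"}}
print(find_wild_codes(cases, documented))
```

Here the value 7 is reported for SEX because it does not appear among the documented codes, which is the kind of problem the processor would raise with the supervisor or depositor.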

Page 8: Data Processing at ICPSR

Build the Dataset

Processor

• Resolve problems
• Eliminate identified disclosure risks
  – Routine handling vs. full disclosure analysis
• Build dataset, typically in SPSS or SAS
  – Recode
  – Add and/or delete variables
  – Fill in missing metadata
  – Identify missing values
  – Check full frequencies and/or descriptives
• Convert data to ASCII with Data Definition Statements (archival format)
  – Tools used historically have been buggy
  – New in-house conversion tool ready for release
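
To illustrate the idea of ASCII data paired with data definition statements, here is a minimal sketch that writes fixed-width records and a matching SPSS `DATA LIST` command from one shared column layout. It is not the in-house conversion tool mentioned above; the function names, the layout structure, and the file name `study.txt` are all assumptions.

```python
def to_fixed_width(records, layout):
    """Render each record as one fixed-width ASCII line.

    layout -- list of (variable_name, width) pairs (hypothetical structure)
    """
    lines = []
    for row in records:
        parts = []
        for name, width in layout:
            text = str(row[name])
            if len(text) > width:
                raise ValueError(f"{name}={text!r} exceeds width {width}")
            parts.append(text.rjust(width))  # right-justify within the column
        lines.append("".join(parts))
    return lines


def spss_data_list(layout, data_file):
    """Build an SPSS DATA LIST command with column ranges from the layout."""
    specs = []
    start = 1
    for name, width in layout:
        end = start + width - 1
        specs.append(f"  {name} {start}-{end}")
        start = end + 1
    return f"DATA LIST FILE='{data_file}' FIXED /\n" + "\n".join(specs) + "."


layout = [("AGE", 3), ("SEX", 1)]
print(to_fixed_width([{"AGE": 34, "SEX": 1}], layout))
print(spss_data_list(layout, "study.txt"))
```

Deriving both outputs from the same layout keeps the column positions in the setup file from drifting away from the data, which is one plausible source of the bugs mentioned above.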

Page 9: Data Processing at ICPSR

Build the Document Set (I)

Processor

• Gather existing pieces of documentation
  – Methodology
  – Other information received from depositor
• Assess what other documentation needs to be included in final products
• Hand off to Electronic Document Conversion unit for conversion to PDF or hold until documentation set is completely assembled

Page 10: Data Processing at ICPSR

Build the Study Description

Processor

• Gather and document study-level metadata
• Write study summary
• Enter into study description form (text)
• Submit to editing staff

Page 11: Data Processing at ICPSR

Set Up Study for Online Analysis

Processor

• Optional, at discretion of archive
• Assess for potential problems in online analysis
  – Multiple weights
  – Outliers
  – Multiple linkable files
• Prepare question text file in SDA native format (DDL)
• Configure for online analysis system
  – Automated test setup; administrators

*
name = PREGNANT
text = The next questions are about your health and health care.

Are you currently pregnant?
*

Page 12: Data Processing at ICPSR

Build the Document Set (II)

Processor

• Generate frequencies, descriptive statistics for codebook
• Document variable-level metadata
• Add processor notes
• Source documents typically in Word, sometimes WordPerfect, ASCII, PDF, other
• Create additional documents as needed
• Hand off to Electronic Document Conversion unit for conversion to PDF
• The two document steps are frequently combined into one

Page 13: Data Processing at ICPSR

Package the Study

Processor

• Make sure all files handed off have been returned and reviewed
  – study description
  – PDF documentation
• Test all data files and data definition statements
  – SAS, SPSS, (Stata)
  – UNIX, Windows
• Prepare turnover form (text)
• Create turnover directory, move all study files
• Quality control check by another processor
• Turn over study files for preservation and dissemination

Page 14: Data Processing at ICPSR

Tool Development

• Skill/knowledge set: Programmer vs. Data Processor
• Programming skills required for some tools
  – Fully-automated scripts
  – Web-based forms
• Creativity and software knowledge required for others
  – Semi-automated techniques
  – Use of existing software in non-conventional ways

Page 15: Data Processing at ICPSR

Tools: Semi-automated Methods

Regular Expressions

• Search and replace using patterns rather than literal strings
• Multi-Edit, TextPad: Windows-based editors
  – Capable of regular expressions
  – Can save files with UNIX formatting
• Extract syntax from existing documentation
  – Value labels
  – Question text
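
As an illustration of extracting value labels with a pattern rather than a literal string, the sketch below turns codebook lines such as `1 = Yes` into an SPSS `VALUE LABELS` command. The slide describes doing this kind of work inside a text editor; Python's `re` module is swapped in here only to make the pattern concrete. The input line formats, the pattern, and the function name are assumptions.

```python
import re

# Matches lines like "1 = Yes", "2) No" or "8. Don't know":
# a numeric code, a separator, then the label text.
CODE_LINE = re.compile(r"^\s*(\d+)\s*[=.)]\s*(.+?)\s*$")


def value_labels_syntax(variable, doc_lines):
    """Build a VALUE LABELS command from documentation lines (hypothetical helper)."""
    pairs = []
    for line in doc_lines:
        match = CODE_LINE.match(line)
        if match:
            code, label = match.groups()
            # Double any apostrophe so it survives inside a quoted label.
            label = label.replace("'", "''")
            pairs.append(f"  {code} '{label}'")
    return f"VALUE LABELS {variable}\n" + "\n".join(pairs) + "."


codebook_text = [
    "Are you currently pregnant?",
    "1 = Yes",
    "2 = No",
    "8. Don't know",
]
print(value_labels_syntax("PREGNANT", codebook_text))
```

The question text line is skipped because it does not start with a numeric code. In an editor, the same pattern would be used in a search-and-replace with the two captured groups rearranged in the replacement string.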

Page 16: Data Processing at ICPSR

Tools: Semi-automated Methods

Excel for Text Editing

• VLOOKUP
• List management
  – Variable disposition
• Merging related information from multiple sources
• Running counts
• Remapping metadata to new variable names
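
The lookup-and-remap idea behind VLOOKUP can be shown outside a spreadsheet as well. The sketch below, which is only an analogy to the Excel technique named above, carries variable labels over to new variable names using a rename table and reports any old names that have no match. The function name and the sample names are hypothetical.

```python
def remap_labels(labels_by_old_name, rename_map):
    """Attach existing labels to new variable names.

    labels_by_old_name -- dict: old variable name -> label
    rename_map         -- dict: old variable name -> new variable name
    Returns (remapped, unmatched), where unmatched lists old names that
    were not found in the rename table (like #N/A from VLOOKUP).
    """
    remapped = {}
    unmatched = []
    for old_name, label in labels_by_old_name.items():
        new_name = rename_map.get(old_name)
        if new_name is None:
            unmatched.append(old_name)
        else:
            remapped[new_name] = label
    return remapped, unmatched


labels = {"Q1": "Currently pregnant", "Q2": "Age at interview", "Q9": "Region"}
renames = {"Q1": "PREGNANT", "Q2": "AGE"}
print(remap_labels(labels, renames))
```

Keeping the unmatched list explicit serves the same purpose as scanning a spreadsheet column for failed lookups: it shows which variables still need a disposition.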

Page 17: Data Processing at ICPSR

Tools: Semi-automated Methods

SDA Conversion Utilities

• Documentation
  – Frequency, descriptive statistics with variable-level metadata, question text embedded
  – Can include introductory materials, links to external documents
  – ASCII→PDF, XML, HTML
• DDS conversion
  – SPSS, SAS, Stata, XML
• Prepare metadata for variable-level search

Page 18: Data Processing at ICPSR

A Little More Technical

• Macros
  – Automate repetitious sequences of commands, keystrokes
  – Recordable in many applications
• Variable Arrays
  – Pre-define groups of variables on which the same data transformations will be performed
• Loops
  – Repeatedly run a single set of commands as long as a condition is true
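
The combination of a variable array and a loop can be sketched as below: a predefined group of variables receives the same transformation, here setting declared missing codes to `None`. In SPSS or SAS this would be done with the package's own array and loop constructs; the Python version, the function name, and the sample variables are assumptions used only to show the pattern.

```python
def recode_missing(records, variables, missing_codes):
    """Apply one recode to every variable in a predefined group.

    records       -- list of dicts, one per case (modified in place)
    variables     -- the "array": names that share the transformation
    missing_codes -- values to be treated as missing
    """
    for row in records:                # loop over cases
        for variable in variables:     # loop over the variable group
            if row[variable] in missing_codes:
                row[variable] = None
    return records


health_items = ["Q1", "Q2", "Q3"]      # hypothetical variable group
cases = [{"Q1": 1, "Q2": 9, "Q3": 2, "AGE": 9}]
print(recode_missing(cases, health_items, {8, 9}))
```

AGE keeps its value of 9 because it is not part of the group, which is the point of defining the group once instead of repeating the recode for each variable by hand.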

Page 19: Data Processing at ICPSR

Tools: A Few UNIX Script Examples

• Automated QC script
• Batch-test Data Definition Statements in UNIX
• Disclosure analysis and processing system for the Treatment Episode Data Set
• Web-based XML generator for Quick Tables configuration files
• Hermes: automated batch production system
  – Early implementation of process improvement recommendations
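
The slides do not show what the automated QC script checks, so the following is only a guess at one simple kind of check: confirming that a turnover directory contains at least one file of each expected type. The suffix list, the function name, and the idea of checking by file extension are all assumptions, and Python is used here in place of a UNIX shell script.

```python
import os

# Hypothetical expectation: ASCII data, SPSS and SAS setup files, PDF docs.
REQUIRED_SUFFIXES = (".txt", ".sps", ".sas", ".pdf")


def missing_components(directory, required=REQUIRED_SUFFIXES):
    """Return the required suffixes that no file in the directory carries."""
    present = {
        os.path.splitext(name)[1].lower() for name in os.listdir(directory)
    }
    return [suffix for suffix in required if suffix not in present]
```

An empty list would mean every expected file type is present. A check like this could run before the second processor's manual quality control review, not in place of it.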

Page 20: Data Processing at ICPSR
Page 21: Data Processing at ICPSR

Process Improvement at ICPSR

• Begun in spring 2003
• 4 distinct phases
  – Mapping the current pipeline
  – Designing the future
  – External review
  – Planning and implementation

Page 22: Data Processing at ICPSR

Phase 1: Map Current Processing Pipeline

• Consultant interviewed groups and individuals
• Drew and refined process maps
• General agreement that the story and pictures were correct before proceeding

Page 23: Data Processing at ICPSR

Process Mapping: Insider’s View

Overview

More detailed, with processing milestones

Very detailed, covers a corridor wall

Page 24: Data Processing at ICPSR

Phase 2: Designing the Future

• Internal Process Improvement Committee formed
• Brainstorming
• “Evolutionary” vs. “Revolutionary” ideas
• Formal reports and recommendations

Page 25: Data Processing at ICPSR

Process Improvement: Guiding Principles

• Automation
• Standardization
• Centralization
• Quality Control
• Version Control
• Focus on the User
• Electronic Collection Management
• Staff Development and Career Path Expansion

Page 26: Data Processing at ICPSR

Future Processing Framework

Page 27: Data Processing at ICPSR

Characteristics

• More linear
• Integrated; steps connected
• Automated milestone tracking
• Metadata migrates to database
  – Eliminate rekeying
  – Single authoritative source

Page 28: Data Processing at ICPSR

Phase 3: External Review Committee

• Outside experts reviewed reports
• Met with individuals and small groups of staff
• Endorsed the PIC’s recommendations
• Additional recommendations provided
• Formal report written

Page 29: Data Processing at ICPSR

Phase 4: Planning and Implementation

• Communication with staff
  – PIC Web site
  – PIC/staff information sessions
• Implementation manager hired
• Implementation plans developed for several recommendations
• PIC reconstituted as a standing committee
  – Review new process improvement suggestions
  – Provide input for implementation plan

Page 30: Data Processing at ICPSR

Some Improvements in Development

• Automated batch production of enhanced suite of products
  – Hermes for current and future releases
  – Retrofit project for previous releases
• Web-based forms (acquisition, study description, turnover)
  – Replace text forms
  – Eliminate rekeying
• Automated processing milestone tracking

Page 31: Data Processing at ICPSR

Issues Under Consideration

• “Ready-to-go” files
  – How to handle missing data by default (SAS, Stata)
  – How to best provide SAS formats
• Development of standardized bibliographic citation for online analyses
• Archival vs. distribution formats
• How to handle qualitative data
• New formats (e.g., video, audio files)
• Development of best practices, automated tools for disclosure analysis

Page 32: Data Processing at ICPSR

For more information:

Peggy Overcashier

[email protected]

+1 734 615 9529