
SDA: a tool for teaching and research with microdata

Laine Ruus

University of Toronto. Data Library Service

2007/05/17


What this poster covers:

• Introduction

• Demo of main SDA capabilities

• Advantages and disadvantages for teaching and research

• Common questions about SDA

SDA@UT is brought to you by:

• University of California, Berkeley. Computer-assisted Survey Methods Program (CSM) – writes and supports the server-side software

• University of Toronto. Centre for Computing in the Humanities and Social Sciences (CHASS) – provides the hardware, buys the software, and provides system support wetware

• University of Toronto. Libraries – provides the budget to purchase the data, and care, feeding and user support wetware

Our experience with SDA

• CHASS installed SDA in the fall of 2004

• At last count, we have 600+ data files in SDA

• Some have only the metadata that was generated from the original syntax files (SAS/SPSS/Stata), but a number also have full question text

• Most are microdata, but a few are aggregate statistics (census files)

• A number of voracious data users now expect to find the latest microdata released by Stat Can in SDA

SDA@UT monthly usage

[Chart: pages per month, May through April, for 2004-2005, 2005-2006 and 2006-2007; vertical axis from 0 to 40,000 pages]

How SDA is used

• Descriptive statistics (70%)

• Inferential statistics (5%)

• Data management (12%)

• Downloading (13%)

Review of main SDA utilities

• Frequencies, weighted & unweighted

• Crosstabulations

• Comparison of means (ANOVA)

• Correlations

• Regressions

• Logit/probit regressions
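To make the difference between weighted and unweighted frequencies concrete, here is a minimal Python sketch. It is not SDA code: the function name, the toy variable and the weights are all hypothetical, and it only illustrates the arithmetic behind the two kinds of distribution.

```python
from collections import defaultdict


def frequencies(values, weights=None):
    """Return {category: percent} for a categorical variable.

    With weights=None every case counts once (unweighted);
    otherwise each case contributes its survey weight.
    """
    if weights is None:
        weights = [1.0] * len(values)
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {cat: 100.0 * t / grand_total for cat, t in totals.items()}


# Hypothetical toy sample: four respondents, the men carry larger weights.
sex = ["F", "F", "M", "M"]
wt = [1.0, 1.0, 3.0, 3.0]

unweighted = frequencies(sex)    # F 50%, M 50%
weighted = frequencies(sex, wt)  # F 25%, M 75%
```

The unweighted table describes the sample; the weighted table estimates the population the sample was drawn from.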

Tips & tricks

• Have we not gotten around to coding the missing values?

• Want to include missing values in your cross-tabulation, or other analysis?

• Collapsing uniform categories of continuous variables on the fly

• Recoding variables on the fly
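The two on-the-fly operations named above, collapsing a continuous variable into uniform categories and recoding, can be sketched in Python as follows. This is an illustration of the idea only, not SDA's own syntax; the function names, the age values and the group labels are hypothetical.

```python
def collapse_uniform(value, width, start=0):
    """Map a continuous value to the lower bound of its fixed-width interval."""
    return start + ((value - start) // width) * width


def recode(value, mapping, default=None):
    """Return the new code for the first (low, high) range containing value.

    mapping is a list of ((low, high), new_code) pairs with inclusive bounds.
    """
    for (low, high), new_code in mapping:
        if low <= value <= high:
            return new_code
    return default


# Hypothetical ages of four respondents.
ages = [19, 34, 35, 61]

# Collapse into uniform 10-year bands, labelled by lower bound.
age_bands = [collapse_uniform(a, 10) for a in ages]  # [10, 30, 30, 60]

# Recode into two user-defined groups.
mapping = [((0, 34), "under 35"), ((35, 120), "35 and over")]
age_groups = [recode(a, mapping) for a in ages]
```

Collapsing needs only an interval width, while recoding lets the ranges be uneven, which is why the two are listed as separate tricks.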

Tips & tricks (2)

• Computing percentages in aggregate data?

• Dummy coding variables in regressions

• Defining an interaction on the fly
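For readers new to dummy coding and interaction terms, the following Python sketch shows what the two derived variables look like before they enter a regression. It does not reproduce SDA's syntax; the helper names and the toy data are hypothetical.

```python
def dummy(values, category):
    """Return a 0/1 indicator: 1 where the case falls in the given category."""
    return [1 if v == category else 0 for v in values]


def interaction(a, b):
    """Return the case-by-case product of two numeric variables."""
    return [x * y for x, y in zip(a, b)]


# Hypothetical toy data: region and years of schooling for four respondents.
region = ["east", "west", "west", "east"]
years_school = [12, 16, 10, 14]

west = dummy(region, "west")                     # [0, 1, 1, 0]
west_x_school = interaction(west, years_school)  # [0, 16, 10, 0]
```

Entering `west`, `years_school` and `west_x_school` together as predictors lets the effect of schooling differ between the two regions.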

Advantages for teaching:

• Stable environment, 24x7 access

• Very easy to explain to novice users

• Reduces/eliminates need for computer labs or statistical software

• Teach statistics rather than software

• Students get hands on data quickly

• Switch easily between weighted and unweighted distributions

Advantages for teaching (2):

• Measures of association and tests of significance comparable to SAS

• Design effects, where cluster/sample variables available

• Interactive demonstration of statistical concepts

• Share recoded variables

• Can quickly mount additional data to fulfill your teaching needs

Advantages for research:

• Stable environment, 24x7 access

• Access to latest available version of the data

• Basic exploratory data analysis: eg are there enough cases for my subset?

• Download data and import to SAS/SPSS/Stata on own workstation

• Share recoded variables

• Integrated variable descriptions (selected data files)

Advantages for data management:

• Creates metadata from SAS/SPSS/Stata syntax or DDI format xml files

• Very easy and fast to import files with good syntax files

• Control over what users can and cannot do

• Outputs include SAS/SPSS/Stata syntax or DDI format xml files

• Overhead: size of uncompressed data + about 50%

Disadvantages of SDA:

• Can’t search for variables/values within/between data files (yet) – at least, not at UT/CHASS

• Can’t download created/recoded variables – coming in spring 2009

• No random sampling function. See <http://www.chass.utoronto.ca/datalib/caq/sda.htm>

• Graphics minimal, eg no stem-and-leaf, box-plots etc

• Can only output to Word/Excel from IE, not from Netscape/Mozilla/Firefox

• Doesn’t output SAS/SPSS/Stata system/export files

• Little support for Study/File level metadata (DDI)

• No support for nCubes (DDI 2)

Common questions from researchers & students:

• When to weight versus not to weight

• Does it only do cross-tabs?

• But I want the raw data, not a cross-tabulation!

• Why can’t I get a cross-tab of this [eg continuous income] variable?

• Differences between syntax, data, and system files

An application we wouldn’t have tackled without SDA:

• Q: I need the average expenditure on eye care in Canada by age group of household head for as long a time-period as possible.

• A: Once we explained SDA, the student generated these statistics from each of the FAMEX/SHS files, 1969-2004, in under 30 minutes. (He knew only Stata.)
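The computation behind this request, a mean by group repeated across many survey files, can be sketched in Python as below. This is not what the student ran in SDA. It assumes the means are weighted, which the poster does not state, and the records, variable names and age groups are entirely hypothetical rather than real FAMEX/SHS values.

```python
from collections import defaultdict


def weighted_mean_by_group(rows, group_key, value_key, weight_key):
    """Return {group: weighted mean of value_key} over a list of dict records."""
    sums = defaultdict(float)
    weights = defaultdict(float)
    for row in rows:
        sums[row[group_key]] += row[value_key] * row[weight_key]
        weights[row[group_key]] += row[weight_key]
    return {group: sums[group] / weights[group] for group in sums}


# Hypothetical stand-ins for two survey files, keyed by survey year.
files = {
    1969: [
        {"age_group": "under 45", "eye_care": 20.0, "weight": 1.0},
        {"age_group": "under 45", "eye_care": 40.0, "weight": 3.0},
        {"age_group": "45 and over", "eye_care": 60.0, "weight": 2.0},
    ],
    2004: [
        {"age_group": "under 45", "eye_care": 100.0, "weight": 2.0},
        {"age_group": "45 and over", "eye_care": 200.0, "weight": 1.0},
        {"age_group": "45 and over", "eye_care": 100.0, "weight": 1.0},
    ],
}

# One table of group means per survey year builds the time series.
series = {
    year: weighted_mean_by_group(rows, "age_group", "eye_care", "weight")
    for year, rows in files.items()
}
```

The same loop run once per file is what makes the task quick once every file sits behind one interface.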

Functions we know to be coming in SDA

• Within and between file variable searching

• Will allow users to load own data files (Archiver in SDA 3.1) – we have not played with this yet

Questions:

• Question 1: Where will I find the SDA server at University of Toronto?

• Answer 1: The URL is:

http://www.chass.utoronto.ca/datalib/

Select ‘Microdata analysis and extraction’


Questions (cont’d):

• Question 2: How are files chosen to be mounted on the SDA server at UT?

• Answer 2: All significant Canadian microdata files, eg by Statistics Canada as released by DLI; other files based on faculty/student requests

Questions (cont’d):

• Question 3: My research is being done collaboratively with a colleague at another Canadian university. Can my colleague get access to SDA?

• Answer 3: SDA is available as a subscription service to other Canadian DLI-member universities and colleges. Current subscribers include: U of Victoria, Ryerson U, and Memorial U
