SDA: a tool for teaching and research with microdata
Laine Ruus
University of Toronto. Data Library Service
2007/05/17
What this poster covers:
• Introduction
• Demo of main SDA capabilities
• Advantages and disadvantages for teaching and research
• Common questions about SDA
SDA@UT is brought to you by:
• University of California, Berkeley. Computer-assisted Survey Methods Program (CSM) – writes and supports the server-side software
• University of Toronto. Centre for Computing in the Humanities and Social Sciences (CHASS) – provides the hardware, buys the software, and provides system support wetware
• University of Toronto. Libraries – provides the budget to purchase the data, and care, feeding and user support wetware
Our experience with SDA
• CHASS installed SDA in the fall of 2004
• At last count, we have 600+ data files in SDA
• Some have only the metadata that was generated from the original syntax files (SAS/SPSS/Stata), but a number also have full question text.
• Most are microdata, but a few are aggregate statistics (census files)
• A number of voracious data users now expect to find the latest microdata released by Stat Can in SDA
SDA@UT monthly usage
[Chart: monthly usage in pages (axis 0 to 40,000), May through April, with series for 2004-2005, 2005-2006, and 2006-2007]
How SDA is used
Descriptive statistics (70%)
Inferential statistics (5%)
Data management (12%)
Downloading (13%)
Review of main SDA utilities
• Frequencies, weighted & unweighted
• Crosstabulations
• Comparison of means (ANOVA)
• Correlations
• Regressions
• Logit/probit regressions
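To illustrate the first item, weighted versus unweighted frequencies, here is a minimal Python sketch. It is not SDA's own syntax or output; the records, weights, and function names are hypothetical and only show why the two distributions can differ.

```python
from collections import Counter, defaultdict

# Hypothetical survey records: (response category, sampling weight).
records = [
    ("yes", 1.5),
    ("no", 0.5),
    ("yes", 2.0),
    ("no", 1.0),
]

def unweighted_freq(rows):
    """Count the sample cases in each category."""
    return Counter(category for category, _ in rows)

def weighted_freq(rows):
    """Sum the weights in each category to estimate population counts."""
    totals = defaultdict(float)
    for category, weight in rows:
        totals[category] += weight
    return dict(totals)

def percentages(freq):
    """Convert a frequency table to percentages of its total."""
    total = sum(freq.values())
    return {category: 100.0 * n / total for category, n in freq.items()}

print(percentages(unweighted_freq(records)))  # yes and no are 50% each
print(percentages(weighted_freq(records)))    # yes is 70%, no is 30%
```

In this toy example the sample splits evenly, but the weighted estimate does not, which is the distinction students need to see when switching between the two.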
Tips & tricks
• Have we not gotten around to coding the missing values?
• Want to include missing values in your cross-tabulation, or other analysis?
• Collapsing uniform categories of continuous variables on the fly
• Recoding variables on the fly
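The last two tips, collapsing a continuous variable into uniform categories and recoding, can be sketched in Python as follows. This is an illustration of the idea only, not SDA's recode syntax; the function names, the marital-status codes, and the assumption of integer-valued variables are all hypothetical.

```python
def collapse(value, width, start=0):
    """Return the (low, high) bounds of the uniform-width interval
    that contains an integer value, with intervals beginning at start."""
    low = start + ((value - start) // width) * width
    return (low, low + width - 1)

# Hypothetical recode: four original codes merged into two categories.
MARITAL_RECODE = {1: "married", 2: "married", 3: "single", 4: "single"}

def recode(code, mapping, default="other"):
    """Map an original code to its new category; unlisted codes get default."""
    return mapping.get(code, default)

print(collapse(37, 10))            # (30, 39): age 37 falls in the 30-39 band
print(recode(2, MARITAL_RECODE))   # married
```

Collapsing in this way is also what makes a cross-tabulation of a continuous variable such as income readable: the table gets one row per band rather than one row per distinct value.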
Tips & tricks (2)
• Computing percentages in aggregate data?
• Dummy coding variables in regressions
• Defining an interaction on the fly
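As a rough illustration of dummy coding and interaction terms, the Python sketch below shows what these transformations do to a single case. It does not reproduce how SDA specifies them; the variable names, categories, and helper functions are hypothetical.

```python
def dummy_code(value, categories, reference):
    """Return one 0/1 indicator for each category except the reference."""
    return {
        f"is_{category}": int(value == category)
        for category in categories
        if category != reference
    }

def interaction(x, z):
    """An interaction term is the product of the two predictors."""
    return x * z

# Hypothetical case: a respondent aged 40 living in the west region.
regions = ["east", "west", "north"]
dummies = dummy_code("west", regions, reference="east")
print(dummies)                                # {'is_west': 1, 'is_north': 0}
print(interaction(dummies["is_west"], 40))    # 40
```

The reference category gets no indicator of its own, so its effect is absorbed into the regression intercept and the other coefficients are read relative to it.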
Advantages for teaching:
• Stable environment, 24x7 access
• Very easy to explain to novice users
• Reduces or eliminates the need for computer labs or statistical software
• Teach statistics rather than software
• Students get hands on data quickly
• Switch easily between weighted and unweighted distributions
Advantages for teaching (2):
• Measures of association and tests of significance comparable to SAS
• Design effects, where cluster/sample variables are available
• Interactive demonstration of statistical concepts
• Share recoded variables
• Can quickly mount additional data to fulfill your teaching needs
Advantages for research:
• Stable environment, 24x7 access
• Access to latest available version of the data
• Basic exploratory data analysis, e.g. are there enough cases for my subset?
• Download data and import to SAS/SPSS/Stata on own workstation
• Share recoded variables
• Integrated variable descriptions (selected data files)
Advantages for data management:
• Creates metadata from SAS/SPSS/Stata syntax or DDI format xml files
• Very easy and fast to import files with good syntax files
• Control over what users can and cannot do
• Outputs include SAS/SPSS/Stata syntax or DDI format xml files
• Overhead: size of uncompressed data + about 50%
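The overhead rule of thumb in the last bullet can be worked through as simple arithmetic. The Python sketch below assumes the 50% figure applies uniformly; the function name and the 200 MB file are hypothetical.

```python
def estimated_sda_size_mb(uncompressed_mb, overhead=0.5):
    """Estimate disk space for a file in SDA: the uncompressed data
    plus a fractional overhead (about 50% per the rule of thumb above)."""
    return uncompressed_mb * (1 + overhead)

# A hypothetical 200 MB uncompressed file would need roughly 300 MB.
print(estimated_sda_size_mb(200))  # 300.0
```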
Disadvantages of SDA:
• Can’t search for variables/values within/between data files (yet) – at least, not at UT/CHASS
• Can’t download created/recoded variables – coming in spring 2009
• No random sampling function. See <http://www.chass.utoronto.ca/datalib/caq/sda.htm>
• Graphics minimal, e.g. no stem-and-leaf, box-plots etc.
• Can only output to Word/Excel from IE, not from Netscape/Mozilla/Firefox
• Doesn’t output SAS/SPSS/Stata system/export files
• Little support for Study/File level metadata (DDI)
• No support for nCubes (DDI 2)
Common questions from researchers & students:
• When to weight versus not to weight
• Does it only do cross-tabs?
• But I want the raw data, not a cross-tabulation!
• Why can’t I get a cross-tab of this [e.g. continuous income] variable?
• Differences between syntax, data, and system files.
An application we wouldn’t have tackled without SDA:
• Q: I need the average expenditure on eye care in Canada by age group of household head for as long a time-period as possible.
• A: Once we explained SDA, the student generated this statistic from each of the FAMEX/SHS files, 1969-2004, in under 30 minutes. (He knew only Stata.)
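The calculation behind this request, a weighted mean by group repeated for each survey year, can be sketched in Python as below. This is not how SDA computes it and not the student's actual output; the years shown, the age groups, the amounts, and the weights are invented for illustration.

```python
from collections import defaultdict

# Hypothetical extracts, one per survey year:
# each record is (age group of household head, expenditure, weight).
files = {
    1969: [("under 35", 10.0, 1.0), ("under 35", 20.0, 3.0),
           ("35 and over", 30.0, 2.0)],
    2004: [("under 35", 100.0, 1.0), ("35 and over", 200.0, 1.0),
           ("35 and over", 400.0, 1.0)],
}

def weighted_means(rows):
    """Weighted mean expenditure for each group in one file."""
    sums = defaultdict(float)
    weights = defaultdict(float)
    for group, value, weight in rows:
        sums[group] += value * weight
        weights[group] += weight
    return {group: sums[group] / weights[group] for group in sums}

# Repeat the same comparison of means for every file, in year order.
table = {year: weighted_means(rows) for year, rows in sorted(files.items())}
for year, means in table.items():
    print(year, means)
```

The point of the example is the repetition: the same small table is produced once per file, which is why the task is fast when every file is already mounted in one interface.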
Functions we know to be coming in SDA
• Within and between file variable searching
• Will allow users to load own data files (Archiver in SDA 3.1) – we have not played with this yet
Questions:
• Question 1: Where will I find the SDA server at University of Toronto?
• Answer 1: The URL is:
http://www.chass.utoronto.ca/datalib/
Select ‘Microdata analysis and extraction’
Questions (cont’d):
• Question 2: How are files chosen to be mounted on the SDA server at UT?
• Answer 2:
• All significant Canadian microdata files, e.g. by Statistics Canada as released by DLI
• Other files based on faculty/student requests
Questions (cont’d):
• Question 3: My research is being done collaboratively with a colleague at another Canadian university. Can my colleague get access to SDA?
• Answer 3: SDA is available as a subscription service to other Canadian DLI-member universities and colleges. Current subscribers include: U of Victoria, Ryerson U, and Memorial U