An introduction to the large-scale Government Surveys &
Samples of Anonymised Records
Jo Wathan
ESDS (Government) & SARs support team
CCSR, University of Manchester
Today
• What data is available?
• What is it like?
• Considerations when using the data
• How are they used in research?
• How do you access them?
• Resources & Support
Why should you want to know?
Because the data are...
• Very cost effective: data free of charge to academic researchers
• Saves time: no need to conduct survey
• Access to high quality, well documented data
• Can provide nationally representative data
– allows generalisation to population
• Allows historical and geographical comparisons to be made
• ESRC funded data support services
What data am I talking about?
• UK is particularly rich in microdata which is available for secondary analysis
• Today focus on cross-sectional microdata from government surveys and the Census
– Samples of Anonymised Records
– ESDS Government Surveys (e.g. LFS, GHS)
• Other major sources:
– Longitudinal data (e.g. LS, BHPS)
– International microdata (e.g. ESS)
– ESDS core function/UK Data Archive
– Aggregate data
The Samples of Anonymised Records (SARs)
• Microdata samples from Census 1991 & 2001
• Available for the first time after research into the confidentiality risk
• More flexible than conventional aggregate tables
SAR Files                        | Individual       | Household                                 | Small Area Microdata
1991 (GB/NI)                     | 2% with SAR area | 1% with Region                            | -
2001 licensed data               | 3% with GOR (UK) | 1% England & Wales only (special license) | 5% with LA/UA/PC
2001 Controlled Access Microdata | 3% with LA/UA/PC | 1% with LA/UA/PC                          | -
What’s in the SARs?
• UK Census Microdata
• Census has high response rate because compulsory
– 1991 only enumerated cases in data
– 2001 missing people are ‘imputed’
• Census topics only – brief self-completion form
– Accommodation, transport, socio-economic characteristics, ethnicity, religion, health
• Anonymised and data limited to ensure confidentiality
– Most restrictive in the end user licence files for 2001, e.g. less geography in the individual and household files, age banded
– Unusual cases perturbed
• Extremely large sample sizes!
ESDS Government Surveys
• General Household Survey
• Labour Force Survey
• Family Resources Survey
• Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey)
• ONS Omnibus Survey
• National Travel Survey
• Time Use Survey
• British Crime Survey/Scottish Crime Survey
• British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People’s Social Attitudes
• Health Survey for England/Wales/Scotland
• Survey of English Housing (England only)
What are ESDS Government data like?
• ‘Nationally’ representative survey microdata
• Large sample sizes (but smaller than the SARs)
• Identifying information is removed
• Most are conducted on an annual basis
• Continuous surveys – always up-to-date
• Cross-sectional (although the LFS has a 5-quarter panel element)
• Specialist topic surveys – more depth than the Census
All of these microdata are:
• Individual information akin to the sort of data you would collect if you were conducting your own survey
• Need to be analysed in an appropriate software package (like SPSS or Stata)
• Cross-sectional snapshots (exception: the LFS is actually 5 snapshots per address!)
• Good quality, collected by a professional data collection organisation
– Office for National Statistics
– National Centre for Social Research
• Collected for policy purposes
• Has good quality documentation & support services
Thinking about using the data?
1. What is your research question?
2. What evidence do you need to answer your research question?
3. Is the evidence you need already available?
• check the literature and published reports.
4. Is cross-sectional secondary microdata appropriate for your research question?
• Is your question quantitative?
• Do you need to follow individuals over time?
5. Is data available?
Locating and assessing data
• Locating data:
– What data is available for my topic?
– Are the variables I need available?
• Assessing data for analysis:
– What population is the sample drawn from?
– What sampling scheme was used?
– Do I need to weight?
What datasets cover my topic?
• Question Bank http://qb.soc.surrey.ac.uk
– has topic guides and a search engine across questionnaires
• Census topics:
– Limited due to legislation, scale & self-completion;
– View the codebooks to see what data is in which files on SARs web pages
• Finding topics in surveys:
– Much wider range of topics from large number of different sources
– ESDS Government topic guides on employment, health, social capital, Scotland
– ESDS/UK Data Archive search engine
What variables are available for my topic?
• To understand the variables you have available
– View the documentation/user guide
– A list of variables & codings should be available
– Information on how derived variables were created should be available
– Double check in the dataset!
What do the variables mean?
Unless...
• you can track your variable back to the question(s) asked on the questionnaire
• know who the questions were asked of
• and what was done with the raw data to turn it into the final data...
You don’t understand the data
Routeing in the documentation: GHS
Variable Name : ECSTILO
Variable Label : Economic status (harmonised)
Topic : Employment
Population : Adults
Hhld/indiv.level : Individual
Range : 1 to 10
Missing values : -6, -8

1 'Working (incl Unpaid FW'
2 'Gov sch with emp'
3 'Gov sch at coll'
4 'Unemployed (ILO)'
5 'Other Unemployed'
7 'Retired'
6 'Perm unable to work'
8 'Keeping house'
9 'Student'
10 'Other inactive'
-8 'NA, ECSTA not known'
-6 'Child/No int'.
Derived variables
DO IF SCHEDTYP = 3 OR AGE LT 16.
+ COMPUTE ECSTILO = -6.
ELSE.
+ DO IF DVILO3A = 1.
+   DO IF SCHEMEET = 1.
+     DO IF TRN = 1.
+       COMPUTE ECSTILO = 2.
+     ELSE IF TRN = 2.
+       COMPUTE ECSTILO = 3.
+     END IF.
+   ELSE.
+     COMPUTE ECSTILO = 1.
+   END IF.
+ ELSE IF DVILO3A = 2.
+   COMPUTE ECSTILO = 4.
+ ELSE IF DVILO3A = 3.
+   DO IF YINACT = 1.
+     COMPUTE ECSTILO = 9.
+   ELSE IF YINACT = 2.
+     COMPUTE ECSTILO = 8.
+   ELSE IF YINACT = 3.
+     COMPUTE ECSTILO = 10.
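The derivation syntax above is a nested decision rule, and it can help to restate it in another form when checking how a derived variable was built. The following is a rough Python sketch of only the branches visible in the excerpt. The function name `derive_ecstilo` and the choice to return `None` where the excerpt sets no value (or is cut off) are assumptions made for illustration, not part of the GHS documentation.

```python
from typing import Optional


def derive_ecstilo(schedtyp: int, age: int, dvilo3a: int,
                   schemeet: Optional[int] = None,
                   trn: Optional[int] = None,
                   yinact: Optional[int] = None) -> Optional[int]:
    """Mirror the visible branches of the ECSTILO derivation.

    Returns None where the excerpt computes no value or is cut off
    (an assumption made for this sketch only).
    """
    # SCHEDTYP = 3 or AGE < 16 gives the missing code -6 ('Child/No int').
    if schedtyp == 3 or age < 16:
        return -6
    if dvilo3a == 1:
        if schemeet == 1:
            if trn == 1:
                return 2          # 'Gov sch with emp'
            if trn == 2:
                return 3          # 'Gov sch at coll'
            return None           # no value computed in the excerpt
        return 1                  # 'Working (incl Unpaid FW'
    if dvilo3a == 2:
        return 4                  # 'Unemployed (ILO)'
    if dvilo3a == 3:
        # Only YINACT values 1 to 3 appear before the excerpt ends.
        return {1: 9, 2: 8, 3: 10}.get(yinact)
    return None


# A child is routed straight to the missing code, whatever else is recorded.
print(derive_ecstilo(schedtyp=1, age=12, dvilo3a=1))
```

Tracing a handful of cases through a rule like this is one way to "double check in the dataset" that a derived variable means what its label suggests.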
The population base: nation
• Most large scale surveys seek to be nationally representative but what is a nation?
– Labour Force Survey = UK
– General Household Survey = GB (but strange things can happen North of the Caledonian Canal)
– Health Survey for England = England
– Not always apparent from the name
– Increase of country-specific surveys following devolution
• Over 80% of the population live in England (9% Scotland, 5% Wales, 3% NI) so surveys designed for UK wide analyses will not generally have large enough samples to analyse separate countries
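To make the arithmetic behind that last point concrete, here is a small sketch. The country shares are those quoted above, with England taken as the remainder. The overall sample size of 10,000 is a hypothetical figure chosen only for illustration, and proportionate sampling is assumed.

```python
# Population shares quoted above; England is taken as the remainder.
shares = {"Scotland": 0.09, "Wales": 0.05, "NI": 0.03}
shares["England"] = 1 - sum(shares.values())


def expected_country_samples(total_n, country_shares):
    """Expected respondents per country under proportionate sampling."""
    return {country: round(total_n * share)
            for country, share in country_shares.items()}


# Hypothetical UK-wide sample of 10,000 respondents.
expected = expected_country_samples(10_000, shares)
for country, n in sorted(expected.items(), key=lambda item: -item[1]):
    print(f"{country}: about {n} respondents")
```

Under these assumptions the smaller countries end up with a few hundred respondents each, before any further subsetting by age, sex or employment status.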
Population base: type of survey
• Most large scale surveys are household surveys: they interview 1+ person in private households
– This will exclude people in institutions
– Has knock-on effects for particular topics: health, age etc.
• Surveys tend to gather limited information about children
– May only relate to their existence, age and relationships to other household members
– There may also be other age restrictions on all or part of the survey
Population base - setting
• You may need to subset to obtain a reasonable database
– SARs 1991 could double count visitors (at place of residence AND location on Census night)
– SARs 2001 can double count students (at place of term-time residence AND parental address)
– Need to subset to prevent double counting
The sampling strategy will affect your results
• Few data sources approximate simple random sampling – the SARs do
• Stratification increases the precision of estimates – the Labour Force Survey is stratified
• Clustering reduces the precision of estimates – e.g. the General Household Survey
• Many major surveys use stratification and clustering
• Guidance should be available in the documentation
• PEAS website
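One common way to express how clustering reduces precision is Kish's approximate design effect, deff = 1 + (m - 1) x rho, where m is the average number of respondents per cluster and rho is the intra-cluster correlation. The sketch below is illustrative only: the sample size, cluster size and rho are made-up values, not figures from any of the surveys named above, and stratification is ignored.

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for a clustered sample."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of the simple random sample giving the same precision."""
    return n / deff


# Hypothetical values: 10,000 respondents, 20 per cluster, rho = 0.05.
deff = design_effect(avg_cluster_size=20, icc=0.05)
n_eff = effective_sample_size(10_000, deff)
print(f"deff = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

A design effect above 1 means standard errors calculated as if the data were a simple random sample will be too small, which is why the survey documentation should be checked before analysis.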
Disproportionate sampling
• The British Social Attitudes survey takes only 1 person per household
– If left like this the chance of selection in the sample would be inversely proportional to the size of one’s household
• Over-sampling in order to obtain satisfactory sample sizes for minority groups (often referred to as ‘boosts’)
– Health Survey for England has done this with ethnic minorities
Weighting can be used to prevent bias from disproportionate sampling

Number in household including R? (Q37)

Number | Weighted frequency | Weighted % of all | Unweighted frequency | Unweighted % of all
1      | 759.2              | 17.1              | 1326                 | 29.9
2      | 1608.4             | 36.3              | 1522                 | 34.3
3      | 838.3              | 18.9              | 671                  | 15.1
4      | 774.6              | 17.5              | 596                  | 13.4
5      | 311.3              | 7                 | 232                  | 5.2
6      | 91.4               | 2.1               | 57                   | 1.3
7      | 31.4               | 0.7               | 16                   | 0.4
8      | 13.8               | 0.3               | 9                    | 0.2
9      | 1.1                | 0                 | 1                    | 0
10     | 1.7                | 0                 | 1                    | 0
12     | 1.1                | 0                 | 1                    | 0
Total  | 4432.1             | 100               | 4432                 | 100

Dataset: British Social Attitudes Survey, 2003
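The idea behind the weighted columns can be sketched as follows. When one person is selected per household, a respondent's chance of selection is inversely proportional to household size, so a simple design weight is the household size itself, rescaled so that the weights sum to the sample size. The six respondents below are invented for illustration, and the published BSA weights may be constructed differently, so consult the survey documentation before relying on this.

```python
from collections import Counter

# Hypothetical household sizes reported by six respondents (one per household).
household_sizes = [1, 1, 1, 2, 2, 4]


def design_weights(sizes):
    """Weight each respondent by household size, scaled to sum to n."""
    n = len(sizes)
    total = sum(sizes)
    return [size * n / total for size in sizes]


def percent_by_size(sizes, weights):
    """Weighted percentage of respondents at each household size."""
    totals = Counter()
    for size, weight in zip(sizes, weights):
        totals[size] += weight
    grand_total = sum(totals.values())
    return {size: 100 * value / grand_total
            for size, value in sorted(totals.items())}


weights = design_weights(household_sizes)
unweighted = percent_by_size(household_sizes, [1] * len(household_sizes))
weighted = percent_by_size(household_sizes, weights)

for size in unweighted:
    print(f"size {size}: unweighted {unweighted[size]:.1f}%, "
          f"weighted {weighted[size]:.1f}%")
```

As in the table, the share of one-person households falls once weights are applied, because people living alone are over-represented in an unweighted one-per-household sample.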
Non-response trends – another reason for weighting
Source: Barton in ESDS weighting guide http://www.esds.ac.uk/government/docs/weighting.pdf
Imputation: 2001 SARs

Ethnic group  | Not ONC imputed (%) | ONC imputed (%)
White         | 94.8                | 5.2
Mixed         | 91.5                | 8.5
Asian         | 84.6                | 15.4
Black         | 76.5                | 13.5
Chinese/Other | 85.6                | 14.4
All           | 93.8                | 6.2
Exercise
Suggest datasets which would fulfil the following criteria, for a range of employment projects:
1. A large up-to-date UK dataset with extensive questions on employment and training
2. The maximum possible sample size for a single time point to allow minority groups to be distinguished in analysis.
3. Any 1960s employment microdata
4. A dataset with extensive questions on income from sources other than just earnings
5. A dataset which could be used to look at attitudes to work
What would you use the data for?
• Straightforward secondary analysis
– To assess theoretical accounts
– To quantify characteristics or behaviours
– To challenge official views
– To apply alternative definitions
• Context to your own primary research
– Your research could be quantitative or qualitative
– To assess the national context of an area study
– To assess whether your sample is typical
– To assess the scale of behaviours
Practical research uses of the data
• Looking at change over time
• Look at sub-populations
• Using the flexibility of the data to look at alternative definitions
• Looking within households
Secondary analysis: change for subpopulations
[Figure: Smoking and social class, men. Line chart of % (axis 0 to 45) by year, 1994 to 2001, with series for all, sc I&II and sc IV&V. Source: HSE; Marmot, M (2003)]
Using successive cross-sectional data over time
Pros…
• Reasonable amount of comparability
• Can pool years/quarters
• Data is representative at each time point
• Good at looking at impacts on groups
Cons…
• Limits to continuity in the data (e.g. ethnic)
• Cannot establish individual change
Looking at small populations
• Many surveys with 10k+ respondents
– Permits minority groups to be represented
– For rare subpopulations the sample size may be too small… can consider combining years if appropriate
• Largest sample sizes available from the Samples of Anonymised Records
– The Small Area Microdata file contains nearly 3 million records!
Combining datasets to increase sample size
Survey data is subject to sampling error!
Example: Pregnancy and Employment
• Using 1998-99 General Household Survey data alone there are only 168 pregnant women aged 16-49
• 95% confidence interval for % of pregnant women economically inactive: 34.2 – 49.1%
• Combined 3 years’ data to obtain sample of 465 pregnant women
• Confidence interval using 3 years’ data: 34.9 – 43.9%
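The effect of sample size on the intervals quoted above can be approximated with the usual normal-approximation formula for a proportion, p ± 1.96 x sqrt(p(1 - p)/n). This is a sketch under stated assumptions: the point estimates are not given in the example, so they are taken here as the midpoints of the quoted intervals, and the calculation treats the data as a simple random sample, ignoring any design effect from the GHS's clustered sample.

```python
import math


def proportion_ci(p: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed point estimates: midpoints of the intervals quoted in the example.
one_year = proportion_ci(0.4165, 168)
three_years = proportion_ci(0.394, 465)

for label, (low, high) in [("1 year, n=168", one_year),
                           ("3 years, n=465", three_years)]:
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Nearly tripling the sample narrows the interval noticeably but by less than half, because the standard error shrinks with the square root of the sample size.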
Using the flexibility of the data to look at alternative definitions
What are ‘hours worked’?
• Is it just paid work? Or unpaid as well?
• Hours usually worked, or actually worked last week?
• In main job, or in any job?
• What about students?
• Overtime – paid?
• Overtime – unpaid?
• Lunch hours?
• Do non-workers work zero hours or should they be excluded?
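With microdata, each of these choices becomes a small change to the calculation and not a different published table. The sketch below shows how the mean can move with the definition. The three respondent records and their field names are invented for illustration and do not correspond to variables in any particular survey.

```python
# Hypothetical respondent records; field names are invented for this sketch.
people = [
    {"in_work": True, "paid_hours": 40.0, "unpaid_overtime": 5.0},
    {"in_work": True, "paid_hours": 20.0, "unpaid_overtime": 0.0},
    {"in_work": False, "paid_hours": 0.0, "unpaid_overtime": 0.0},
]


def mean_hours(records, include_unpaid=False, include_non_workers=False):
    """Mean weekly hours under a chosen definition of 'hours worked'."""
    selected = [r for r in records if include_non_workers or r["in_work"]]
    hours = [
        r["paid_hours"] + (r["unpaid_overtime"] if include_unpaid else 0.0)
        for r in selected
    ]
    return sum(hours) / len(hours)


paid_only = mean_hours(people)
with_unpaid = mean_hours(people, include_unpaid=True)
with_non_workers = mean_hours(people, include_non_workers=True)

print(f"Paid hours, workers only: {paid_only}")
print(f"Paid plus unpaid overtime, workers only: {with_unpaid}")
print(f"Paid hours, non-workers counted as zero: {with_non_workers}")
```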
Hierarchical data: conceptually
Household 1: North West, social rented
– Person 1: HoH, female, 28, GCSE, P/T work, no LTILL
– Person 2: son of HoH, male, 12, N/A, N/A, no LTILL
Household 2: Wales, owner occupier
– Person 1: HoH, male, 33, degree, F/T employee, no LTILL
– Person 2: spouse of HoH, female, 31, degree, P/T employee, no LTILL
– Person 3: parent of HoH, female, 72, no quals, econ inactive, LTILL
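A hierarchical file like this lets you derive household-level measures from person records, for example the number of household members in work. The sketch below encodes the two example households as plain Python records. The field names are invented for illustration, and the allocation of the five people to the two households (two in the first, three in the second) is an assumption based on the order in which they are listed.

```python
households = [
    {"id": 1, "region": "North West", "tenure": "Social rented"},
    {"id": 2, "region": "Wales", "tenure": "Owner occupier"},
]

# One record per person, linked to a household by the "hh" key.
persons = [
    {"hh": 1, "relation": "HoH", "sex": "F", "age": 28, "activity": "P/T work"},
    {"hh": 1, "relation": "Son of HoH", "sex": "M", "age": 12, "activity": None},
    {"hh": 2, "relation": "HoH", "sex": "M", "age": 33, "activity": "F/T employee"},
    {"hh": 2, "relation": "Spouse of HoH", "sex": "F", "age": 31, "activity": "P/T employee"},
    {"hh": 2, "relation": "Parent of HoH", "sex": "F", "age": 72, "activity": "Econ inactive"},
]

IN_WORK = {"P/T work", "F/T employee", "P/T employee"}


def household_summary(household_records, person_records):
    """Attach person-derived counts to each household record."""
    summary = {}
    for household in household_records:
        members = [p for p in person_records if p["hh"] == household["id"]]
        summary[household["id"]] = {
            "region": household["region"],
            "size": len(members),
            "n_in_work": sum(p["activity"] in IN_WORK for p in members),
            "n_children": sum(p["age"] < 16 for p in members),
        }
    return summary


summary = household_summary(households, persons)
for hh_id, row in summary.items():
    print(hh_id, row)
```

The same pattern, counting members in work and members under 16 within each household, is the kind of step needed to identify workless households or children living in them.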
Workless households (source: FES, various years 1968-1996)
[Figure: line chart of percentage (of present working age hoh), axis 0 to 25, by year, 1968 to 1996, with two series: workless households; children in workless households]
Source: Richard Dickens, Paul Gregg and Jonathan Wadsworth (2000) ‘New Labour and the Labour Market’, CMPO Working Paper Series 00/19, Table 5
Finding out about what’s been/being done with the data
• User meetings
– General Household Survey
– Labour Force Survey
– Health Surveys
– Samples of Anonymised Records
• ESDS Government
– Publications database
– Usage pages
Accessing & Support Services
• The data teams:
– ESDS Government
– SARs team at CCSR
• Registering to use the data
• Special licence and CAM data
• Getting support
SARs Data team
• CENSUS MICRODATA SUPPORT
• http://www.ccsr.ac.uk/sars
• Register for the data
• Access SARs documentation for all SARs datasets
• Explore data online or download datasets in SPSS, Stata, or tab delimited form for:
– 1991 data, 2001 Individual licensed file, 2001 Small Area Microdata
• Information about 2001 Special Licence Household SAR – link to UK Data Archive for download
ESDS Government
• MAJOR CROSS-SECTIONAL UK SURVEYS
• http://www.esds.ac.uk/government
• Survey pages
• Introductory guides and resources including topic guides, weighting guide, software guides
• Links to relevant external resources
• Links to the UK Data Archive to:
– Register for the data
– Download the data in Stata, SPSS etc.
– Explore the data online in Nesstar
– Access documentation
The licence
• All users need to be licensed
• Academics complete the licence as part of the Census Registration System process
• Non-academic users contact UK Data Archive (Surveys) or CCSR (SARs) to arrange registration – charges may apply
• Cannot pass the data to an unlicensed user
• Cannot attempt to identify an individual
The licence – good practice
• Keep your data password protected
• Destroy your data when you have finished using it
• Remove files before passing on your PC to someone else
• Tell the data team about your publications
• Tell the data team if you leave your institution
Special licence files
• Special licence is a new way of making more detailed data available to social researchers
– Annual Population Survey data
– Household SAR 2001
• Full & legally binding paper registration process – requires institutional signature & ONS approval
• Must agree to extensive data stewardship conditions
Controlled Access Microdata
• SARs Controlled Access Microdata designed for professional researchers who have no other data options open to them
• Access in safe setting only at ONS site
• Specification on SARs website
• Individual file and Household file
• Files contain much more detail; e.g.
– Individual year of age (topcoded at 95)
– Full coding on country of birth
– SOC Unit Group
– Local authority geography
– Index of Deprivation for SOAs
– Index of Deprivation for migrant’s last address
• Further information and appropriate forms at http://www.statistics.gov.uk/census2001/sar_cams.asp
• Contact [email protected] for more details
User support
SARs:
helpdesk email: [email protected]
tel: (0161) 275 4262
SARS jiscmail list
http://www.ccsr.ac.uk/sars
ESDS Government:
helpdesk email: [email protected]
tel: (0161) 275 1980
ESDS-Govsurveys jiscmail list
http://www.esds.ac.uk/government