The importance of data management

Paul Lambert, 31st January 2012. Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node, www.dames.org.uk


Page 1: The importance of data management

Paul Lambert, 31st January 2012

Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node www.dames.org.uk

The importance of data management

DAMES, 31/JAN/2012, T1

Page 2: The importance of data management

Today’s session (2V1/2V3)


Page 3: The importance of data management

‘Data Management through e-Social Science’

DAMES – www.dames.org.uk

ESRC funded research Node
• Funded 2008-11, with ongoing work into 2012 with the NeISS (www.neiss.org.uk) and ‘eStat’ (www.bristol.ac.uk/cmm/research/estat/) projects

Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Computer science research on secure data models; metadata and linking data; workflows
• Programme of case studies and provisions

Page 4: The importance of data management

‘Data management’ means…

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]

Usually performed by social scientists themselves

Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables

Usually a substantial component of the work process

Here we differentiate from archiving / controlling data itself

Page 5: The importance of data management

Some components…

Manipulating data
• Recoding categories / ‘operationalising’ variables

Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)

Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions

Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’

Cleaning data
• ‘missing values’; implausible responses; extreme values
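The ‘cleaning data’ component can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and field names are invented, negative codes are assumed to mark missing values (in the style of the -9 and -7 codes used in many survey files), and the plausible age range is a hypothetical choice.

```python
# Hypothetical survey records; field names and values are invented for illustration.
records = [
    {"pid": 1, "age": 34, "income": 21000},
    {"pid": 2, "age": -9, "income": 18500},   # -9: missing or wild
    {"pid": 3, "age": 212, "income": 30500},  # implausible age
    {"pid": 4, "age": 51, "income": -7},      # -7: proxy respondent
]

MISSING_CODES = {-9, -7}   # assumed missing-value codes
AGE_RANGE = (16, 110)      # assumed plausible range for adult respondents


def clean(record):
    """Return a cleaned copy of one record plus notes on what was changed."""
    cleaned, notes = dict(record), []
    # Step 1: turn declared missing-value codes into None.
    for field in ("age", "income"):
        if cleaned[field] in MISSING_CODES:
            notes.append(f"{field}: code {cleaned[field]} set to missing")
            cleaned[field] = None
    # Step 2: treat ages outside the plausible range as missing.
    age = cleaned["age"]
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        notes.append(f"age: implausible value {age} set to missing")
        cleaned["age"] = None
    return cleaned, notes


cleaned_records = []
for rec in records:
    cleaned, notes = clean(rec)
    cleaned_records.append(cleaned)
    for note in notes:
        print(f"pid {rec['pid']}: {note}")
```

Returning the notes alongside the cleaned record keeps a record of every change, which is the kind of paper trail the talk goes on to argue for.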


Page 6: The importance of data management


Example – recoding data

Count: Highest educational qualification (rows) by educ4 (columns)

educ4 categories: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification    -9.00    1.00    2.00    3.00    4.00    Total
-9 Missing or wild                     323       0       0       0       0      323
-7 Proxy respondent                    982       0       0       0       0      982
1 Higher Degree                          0     425       0       0       0      425
2 First Degree                           0    1597       0       0       0     1597
3 Teaching QF                            0       0     340       0       0      340
4 Other Higher QF                        0       0    3434       0       0     3434
5 Nursing QF                             0       0     161       0       0      161
6 GCE A Levels                           0       0       0    1811       0     1811
7 GCE O Levels or Equiv                  0       0       0       0    2518     2518
8 Commercial QF, No O Levels             0       0       0     331       0      331
9 CSE Grade 2-5, Scot Grade 4-5          0       0       0       0     421      421
10 Apprenticeship                        0       0       0     257       0      257
11 Other QF                            102       0       0       0       0      102
12 No QF                                 0       0       0       0    2787     2787
13 Still At School No QF               138       0       0       0       0      138
Total                                 1545    2022    3935    2399    5726    15627
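The recode shown in the cross-tabulation above can be written down as an explicit mapping. The Python sketch below is one possible way to do so, not the syntax used in the talk: the mapping and counts are read off the cross-tabulation (assuming the rows of counts follow the order of the category labels), and the names QUAL_TO_EDUC4, QUAL_COUNTS and recode are hypothetical.

```python
from collections import Counter

# Mapping from highest-qualification code to the four-category 'educ4' measure,
# as read off the cross-tabulation (-9 = missing on educ4).
QUAL_TO_EDUC4 = {
    -9: -9, -7: -9, 11: -9, 13: -9,   # missing, proxy, other QF, still at school
    1: 1, 2: 1,                       # 1 Degree
    3: 2, 4: 2, 5: 2,                 # 2 Diploma
    6: 3, 8: 3, 10: 3,                # 3 Higher school or vocational
    7: 4, 9: 4, 12: 4,                # 4 School level or below
}

# Number of cases in each original category (row totals of the cross-tabulation).
QUAL_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode(code):
    """Return the educ4 category for one highest-qualification code."""
    return QUAL_TO_EDUC4[code]


# Rebuild the educ4 distribution from the original categories.
educ4_counts = Counter()
for code, n in QUAL_COUNTS.items():
    educ4_counts[recode(code)] += n

for category in (-9, 1, 2, 3, 4):
    print(category, educ4_counts[category])
```

Keeping the mapping in a single named table means the recode can be cited, checked against the column totals, and swapped for an alternative coding in a sensitivity analysis.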

Page 7: The importance of data management

Example - Linking data (on related adults in the BHPS)

Outcome A: Used health services in last year (Y=43%). Outcome B: GHQ score.

                      A:indv    A:cp    A:hh  A:xhid    B:indv    B:cp    B:hh  B:xhid
Female                  0.63    0.77    0.69    0.65      1.36    1.36    1.36    1.53
Age                     0.02    0.03    0.02    0.02      0.13    0.13    0.14    0.14
Ln(household inc.)     -0.09   -0.14   -0.12   -0.11     -0.63   -0.62   -0.63   -0.62
Constant               -0.65   -0.67   -0.59   -0.55      12.9    12.8    12.6    12.6
ICC L2% (VC)               0     6.3     8.8     7.9         0    22.9    15.8     7.8
Mean cluster size          1     1.4     1.8     4.6         1     1.4     1.8     4.5
L2:sd(cons)                -    0.61    0.51    0.53         -    2.54    1.91    1.15
L2:sd(fem)                 -    2.00    0.82    0.00         -    2.81    2.32    1.64
L1:sd(cons)             1.81    1.81    1.81    1.81      5.40    4.30    4.76    5.28
-Log-like (-40k)        9648    9625    9624    9632      3529    3383    3410    3512

Rows with four values only (the transcript does not show which outcome they belong to):
Age-squared(*100)      -0.12   -0.13   -0.13   -0.13
Cohabiting             -0.58   -0.58   -0.54   -0.59

Page 8: The importance of data management

‘The significance of data management for social survey research’

The data manipulations described above are a major component of the social survey research workload

Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories; dealing with missing records

Post-release manipulations performed by researchers
• Re-coding measures into simple categories
• All serious researchers perform extended post-release management (and have the scars to show for it)

We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently

So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

Page 9: The importance of data management

..being more effective probably involves..

Knowing about, using and citing previous standard measures/strategies

Effective documentation/dissemination of information on the approach used

Being proactive (not just relying on the most convenient measure to hand)

Trying a few alternatives – sensitivity analysis


Page 10: The importance of data management

‘Documentation’ (and its dissemination) is probably the key…

By documentation we mean the ‘paper trail’ (such as data & syntax files during secondary survey research)

For scientists, this is the log book / journal / laboratory notebook

For social sciences, there are few agreed standards


Image of Alexander Graham Bell’s 1876 notebook, taken from: http://sandacom.wordpress.com/2010/03/11/the-face-rings-a-bell/

Effective documentation is possible, but requires some effort (e.g. Long, 2009)

Page 11: The importance of data management


..good levels of documentation are not engrained in the social sciences!


“…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” (p142 of Scott, J. (2005) ‘Some principal concerns in the shaping of sociology’, in Halsey, A.H. and Runciman, W. (eds) British Sociology: Seen from without and within. London: British Academy)

...Yet, ‘documentation for replication’ is a reasonable expectation for a scientific model of research

(e.g. Steuer, Dale, Freese)…

Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.

Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.

Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods & Research, 36(2), 153-171.

Page 12: The importance of data management


A bit of focus…

Most of the DAMES applications aim to facilitate one of two data management activities, their documentation, and the dissemination of that documentation:

1) Variable constructions
o Coding and re-coding values

2) Linking datasets
o Internal and external linkages
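As an illustration of the second activity, the Python sketch below links individual records first to their household (an internal linkage) and then to region-level indicators (an external, micro-to-macro linkage). It is a minimal sketch under stated assumptions: all identifiers, field names and values are invented, and the function name link is hypothetical.

```python
# Hypothetical individual (micro), household and region (macro) data;
# all names and values are invented for illustration.
individuals = [
    {"pid": 101, "hid": 1, "ghq": 11},
    {"pid": 102, "hid": 1, "ghq": 14},
    {"pid": 103, "hid": 2, "ghq": 9},
    {"pid": 104, "hid": 3, "ghq": 12},   # household 3 is not in the household file
]
households = {
    1: {"region": "North", "hh_income": 28000},
    2: {"region": "South", "hh_income": 41000},
}
regions = {
    "North": {"unemp_rate": 8.1},
    "South": {"unemp_rate": 5.4},
}


def link(individuals, households, regions):
    """Attach household and region data to each individual.

    Returns the linked records and the ids of individuals with no household match,
    so that failed linkages are reported rather than silently dropped.
    """
    linked, unmatched = [], []
    for person in individuals:
        household = households.get(person["hid"])
        if household is None:
            unmatched.append(person["pid"])
            continue
        macro = regions.get(household["region"], {})
        linked.append({**person, **household, **macro})
    return linked, unmatched


linked, unmatched = link(individuals, households, regions)
print(len(linked), "linked;", "unmatched pids:", unmatched)
```

Reporting the unmatched identifiers is a deliberate choice: the number of cases lost at each linkage is part of the documentation that a replication would need.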


Page 13: The importance of data management


‘Documentation for replication’ supports replication of..

Your own analysis (in response to comments, revisions, requests for access)

Others’ analysis
• To build upon – cumulative science
• To critique / cross-examine

In secondary survey research
• Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage)
• New analysis feasible – variable operationalisations; new statistical methods

Most documentation requirements are achieved by effective use of software (‘syntax’ programming)

See our training workshops, www.dames.org.uk/workshops


Page 14: The importance of data management


Keep clear records of your DM activities!

Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
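The talk points to annotated SPSS and Stata syntax files; the same idea can be sketched in Python. The example below is an assumption-laden illustration, not one of the DAMES workshop files: each data management step is given a plain-language description and recorded in a trail together with case counts and a short hash of its output. The helper names fingerprint and step, and the toy data, are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Short hash of the data, so a later run can confirm it produced the same output."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


trail = []  # the 'paper trail': one entry per data management step


def step(description, rows, func):
    """Apply one documented data management step and record it in the trail."""
    result = func(rows)
    trail.append({
        "step": len(trail) + 1,
        "description": description,
        "rows_in": len(rows),
        "rows_out": len(result),
        "output_hash": fingerprint(result),
    })
    return result


# Hypothetical input data, invented for illustration.
raw = [{"pid": 1, "age": 34}, {"pid": 2, "age": -9}, {"pid": 3, "age": 51}]

data = step("Drop cases with missing age (code -9)", raw,
            lambda rows: [r for r in rows if r["age"] != -9])
data = step("Derive age group: 1 = under 40, 2 = 40 and over", data,
            lambda rows: [{**r, "agegrp": 1 if r["age"] < 40 else 2} for r in rows])

# The trail can be saved alongside the script and the data.
print(json.dumps(trail, indent=2))
```

Saved with the script, the printed trail plays the role of the annotated log: it shows what was done, in what order, and how many cases each step affected.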

Syntax Examples: www.dames.org.uk/workshops



Page 16: The importance of data management

We’ve written a guide for researchers...

‘Software Session 1: Documentation and workflows with popular software packages’ (www.dames.org.uk/workshops/stir10/docs_workflows_2010.html)

Dozens of sample command files in SPSS, Stata and R from DAMES Node workshops at www.dames.org.uk

Page 17: The importance of data management


For data distributors, the provision of systematic metadata is also beneficial

Example of DDI format metadata

(see also talk 5)
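The slide's own DDI example is not preserved in this transcript. As a rough indication of what variable-level metadata of this kind looks like, the Python sketch below builds a heavily simplified fragment in the style of DDI Codebook for the educ4 measure from the recoding example. It is an assumption-based illustration: the variable ID and variable label are invented, the XML namespace and most required elements are omitted, and the output has not been validated against the DDI schema.

```python
import xml.etree.ElementTree as ET

# Category labels for educ4, taken from the recoding example earlier in the talk.
categories = {
    1: "Degree",
    2: "Diploma",
    3: "Higher school or vocational",
    4: "School level or below",
}

# Simplified DDI Codebook-style structure: codeBook > dataDscr > var > catgry.
code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")
var = ET.SubElement(data_dscr, "var", {"ID": "V1", "name": "educ4"})  # ID is hypothetical
ET.SubElement(var, "labl").text = "Highest educational qualification (4 categories)"  # assumed label

for value, label in categories.items():
    catgry = ET.SubElement(var, "catgry")
    ET.SubElement(catgry, "catValu").text = str(value)
    ET.SubElement(catgry, "labl").text = label

if hasattr(ET, "indent"):   # pretty-printing helper, available from Python 3.9
    ET.indent(code_book)

ddi_xml = ET.tostring(code_book, encoding="unicode")
print(ddi_xml)
```

Because the category values and labels sit in a structured, machine-readable form, tools such as data catalogues can display and search them without opening the data file itself.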


Page 19: The importance of data management


NESSTAR


Page 20: The importance of data management

What more is needed for good data management?

1) Good standards in the operationalisation of variables

See yesterday’s workshop sessions (www.dames.org.uk)

Most options have already been studied!

Using GEODE/GEMDE/GEEDE to facilitate sensitivity analysis and comparisons of alternative plausible measures
• Collect documentation/metadata on specialist records
• Promote more effective measurement options, e.g. effect proportional scaling; replication of measures used before; derivation of recommended standards

Page 21: The importance of data management

DAMES ‘GESDE’ tools: online services for data coordination/organisation

Tools for handling variables in social science data

Recoding measures; standardisation / harmonisation; Linking; Curating

Page 22: The importance of data management

[Figure: two bar-chart panels, Britain and Sweden, showing ‘Increase in R-squared’ and ‘Increase in BIC’ for each occupation-based measure: ES5, ES2, E9, E6, E5, E3, E2, G13, G11, G10, G7, G5, G3, G2, K4, R7, WR, WR9, O17, O8, O4, MN, I9, I99, CM, CF, CM2, CF2, CG, ISEI, SIOP, AWM, WG1, WG2, WG3, GN1]

Source: BHPS and LNU 1991, adults aged 23-55 in work in 1991, N=4536 Britain, 2504 Sweden. Model 1: Health = quadratic age + gender + age*gender; Model 2: Health = (Model 1) + classification. Graph shows improvement in Pseudo R2 for logistic regression, Model 2 vs Model 1, plus scaled BIC statistic ((Model 2 BIC - Model 1 BIC) / Model 1 BIC), cropped at 2*r2. Unweighted data.

Predictors of ‘poor health’ in Sweden (comparison of different occupation-based measures, from DAMES, TP 2011-1)

Page 23: The importance of data management

What more is needed for good data management?

2) Incentives/disincentives

Arguably, good data management is penalised at present (‘Don’t get it right, get it published’)

Few formalised requirements of documentation or data management activity (cf. metadata publishing standards such as DDI)

Citation rankings might incentivise here (citation of your do files..)

Prospects are probably rather bleak for good science..!!

Page 24: The importance of data management

Summary

The ‘significance’ of DM is about how much better research might be if we did things more effectively…

Can (try to) provide data oriented facilities supporting improved data management

May also need a cultural change in expectations…
