

Open Science and Data sharing: the DataFirst experience

Martin Wittenberg, DataFirst

26 October 2017

Open Science

Overview
• Introduction
• Data and the research ecosystem
• The problem of measurement in the social sciences
• Difficulties with sharing data
• Why sharing data is essential
• The role of a data platform like DataFirst

Introduction
• I’m an economist trying to understand what has happened to South Africa since the end of apartheid
  – Particularly in relation to wages, employment, inequality, service delivery
• Data and data quality are key
• I also direct DataFirst, which is an organisation based at UCT dedicated to making it easier for researchers to access social science microdata
• www.datafirst.uct.ac.za
• https://sites.google.com/site/martinwwittenberg/home

Data and the research ecosystem

• Data doesn’t just appear
• The value and meaning of data arise from how it emerges within the

Data and the research ecosystem

Theory
• e.g. how markets work
Application
• e.g. the impact of imposing a minimum wage in 2018
Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns

Measurement
• Sometimes for research purposes
• But also incidental to other purposes
  – e.g. tax data, satellite “night light” data
• Understand context, rules and procedures used
  – Sampling theory
  – Measurement instrument (e.g. questionnaire)
  – Fieldwork practice
  – Post-fieldwork data capture & processing
  – Imputations for missing values

Measurement in the social sciences

• Crucial to also understand what you are not seeing
  – Non-response
• In the social sciences the subjects of research often have an interest in the outcome
  – Choose what to report
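A minimal sketch of why non-response matters, using made-up numbers rather than any real survey: it assumes, purely for illustration, that everyone earning at or above a hypothetical threshold declines to report earnings, and compares the mean of what is observed with the mean of the full population.

```python
from statistics import mean

# Hypothetical monthly earnings for a tiny "population".
# These values are illustrative only, not taken from any dataset.
population = [2_000, 3_000, 4_000, 5_000, 6_000, 50_000]


def respondents(incomes, refusal_threshold):
    """Assumed response rule: anyone earning at or above the threshold
    does not answer the earnings question."""
    return [y for y in incomes if y < refusal_threshold]


true_mean = mean(population)
observed = respondents(population, refusal_threshold=20_000)
observed_mean = mean(observed)
non_response_rate = 1 - len(observed) / len(population)

print(f"true mean:         {true_mean:.0f}")
print(f"observed mean:     {observed_mean:.0f}")
print(f"non-response rate: {non_response_rate:.0%}")
```

In this toy case a single missing high earner is enough to pull the observed mean far below the true one, which is why knowing who is not in the data matters as much as knowing who is.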


An example from my research
Compare earnings in tax data and surveys
• Wages of employees

Blog post at http://www.econ3x3.org/

Measurement issues
The picture when looking at earnings from self-employment (business profits)
Why?
• Penalties for not reporting
• But accurate reporting means paying more tax

Data within the research ecosystem

• In summary, data is not useful for research unless
  – We know where it has come from
  – What sort of errors/biases are likely to be involved in the measurement process
• AND
  – People who are working on applied questions know that it exists/can be accessed

Difficulties with sharing data
• One of the challenges of sharing data is to provide enough information about
  – Context
  – Measurement process
  (Metadata)

• Plus the data must be stored in a way that it is “discoverable”

• All of this costs time and effort
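As a rough illustration of what "enough information" could look like in practice, here is a minimal sketch of a metadata record with a check for undocumented items. The class, the field names and the example values are all hypothetical, chosen to mirror the measurement items listed earlier in the talk; they are not DataFirst's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record (illustrative field names)."""
    title: str = ""
    producer: str = ""
    context: str = ""          # why and when the data were collected
    sampling: str = ""         # sampling design
    instrument: str = ""       # e.g. questionnaire
    fieldwork: str = ""        # fieldwork practice
    processing: str = ""       # post-fieldwork data capture & processing
    imputation: str = ""       # how missing values were imputed
    access_protocol: str = ""  # how researchers can obtain access


def missing_fields(record):
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Placeholder values only; this is not a real dataset.
record = DatasetMetadata(
    title="Example household survey",
    producer="Example statistics office",
    sampling="Two-stage stratified cluster sample",
)

print(missing_fields(record))
```

A check like this makes the cost visible: every empty field is documentation that someone still has to write before the data can be used well by others.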


Other difficulties
• Fear of getting scooped with one’s own data
• Fear of someone else finding a path-breaking application of the data that one hadn’t thought of
• Fear of problems/errors in the measurement process being exposed
• Confidentiality/privacy of respondents
  – Ethics clearance

How might one deal with these?

• Getting scooped
  – Delay public release
• “Important Science” vs “Mere data gathering”
  – Underlying issue is really one of skill
  – Response is often “data squatting”/rent extraction
  – A more creative response is to find ways to get training programmes up around the data

Issues with sharing, cont.
• Exposing problems with the measurement process
  – Becomes more critical if these data are the only ones available
  – Reality is that there is no 100% clean dataset
  – Provided that there is still a detectable “signal” in the data, it can still be used for science
    • It becomes easier to “fix” the problems if they are openly acknowledged

Issues with sharing, cont.

• Confidentiality
  – “Open science” doesn’t mean that the data has to be available on the web for anyone
  – Key issue is that there have to be transparent protocols for access
  – e.g. “Secure Labs” as recently established in DataFirst

Why sharing is essential
• Proper science
  – Can only be done if results can be replicated
  – Errors in analysis/measurement exposed
• New insights
  – It is impossible for one team to be on top of all the ways in which a dataset could be used
  – Making data available allows some of the best and brightest people in the world to think about your issues/problems
    • e.g. many of our insights into the impact and effectiveness of South Africa’s old age pension system came from American academics
  – Of course some garbage is likely to be generated in the process too

Why sharing is essential, cont.

• Improvement in skills
  – South African quantitative social scientists of my generation learned most of what we know from seeing international economists (notably Nobel prize winner Angus Deaton) work on our data
    • He showed that there are fascinating questions to be answered
    • He made his code available

How do we make sharing more successful?

• This is really a question not only about the incentives to researchers and research organisations

• But also about institutions that can facilitate this process

• Organisations like DataFirst play an important role here


The issue is really how to strengthen the links

Theory
• e.g. how markets work
Application
• e.g. the impact of imposing a minimum wage in 2018
Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns

Dissemination

[Diagram: Data Producer, Skilled user; links: Dissemination, Feedback]

Replicability of results

[Diagram: Data, Analysis, Published Paper, Review/Replication, Follow-up; actors: Skilled Researcher, Reader]

Best practice data production

[Diagram: Data Producer, Methodological Research; links: “Best practice”, Practical Issues, Feedback]

Best practice data analysis


How can we strengthen these loops?

• These are not “add-ons”; they are an integral part of a successful science infrastructure
  – Like libraries, research clouds etc.
  – Need to be supported:
    • Financially
    • Mandates for sharing data, particularly if public funds have been used in collecting them