research data management for econometrics

Econometrics of Panel Data and Network Analysis Research Data Management Module 1 Dr. Peter Löwe Berlin, 03. 08. 2017


Econometrics of Panel Data and Network Analysis

Research Data Management

Module 1

Dr. Peter Löwe

Berlin, 03.08.2017

Agenda

1. Why bother: A crisis, horror stories & a Panda-Oncologist

2. Size is relative: Doctor House, Big Data, and a long tail

3. Reality Check: Doing science in the 21st century

4. Research Data Management according to Gollum and XKCD

5. Persistent Identifiers: Digital dog tags for everything and everyone!

6. Research Data Repositories & good reads

7. Conclusion: Culture change & happy Pandas


Peter Löwe, 2017-08-02, Research Data Management: Module 1

1 Today's menu

• Why Research Data Management matters and how it should work (perfect world)
• How stuff currently works (state of the art)
• How stuff will work soon (outlook)
• How to get started (self help)

1 Drivers for Research Data Management

https://www.kent.ac.uk/library/research/data-management/manage.html

Why you should care (internal motivation)

• Increase the efficiency of your research process
• Avoid losing data
• Enable data re-use and sharing

Why you are going to care (external motivation)

• Meet the requirements of research funders and your institute
• Comply with the policies of a growing number of journal publishers on making the data underlying publications available
• Increase your visibility (citations)

1 Research Data includes


• Questionnaires/surveys

• Raw experimental data

• Analysed data

• Databases

• Simulations and research code (software)

• Audio-visual materials

• Laboratory and field notes

• Clinical data, including clinical records

• Images and photographs

1 The Research Data Spectrum


Physical → Digital

• Handwritten letters → Scanned & OCR version
• Images or photos → Scanned digital version
• Soil samples → Analysed result of samples
• Tissue samples → Analysed result of samples
• Archeological dig sites → 3D models of the dig site
• … → …

1 Issue: The Reproducibility Crisis

Nature 533, 452–454 (26 May 2016) doi:10.1038/533452a

https://www.slideshare.net/AustralianNationalDataService/research-data-management-in-practice-ria-data-management-workshop-brisbane-2017

• A methodological crisis in science
• The phrase was coined in the early 2010s as part of a growing awareness of the problem
• 2016: poll of 1,500 scientists
• 70% of them had failed to reproduce at least one other scientist's experiment
• Results of many scientific studies are difficult or impossible to replicate on subsequent investigation

https://en.wikipedia.org/wiki/Replication_crisis

1 Data Sharing and Management Snafu in 3 Short Acts


[Snafu: „Situation normal, all f***ed up“]

1 Video


1 Discussion

Have you encountered something similar?

How would you deal with such a situation?

Where do you store your data?

How much data would you lose if your laptop was stolen?

1 Reproducibility decreases over time due to increasing data loss

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

“In their parents' attic, in boxes in the garage, or stored on now-defunct

floppy disks — these are just some of the inaccessible places in which

scientists have admitted to keeping their old research data. Such practices

mean that data are being lost to science at a rapid rate, a study has now

found.”

1 Night of the Living Data


http://www.eweek.com/database/5-data-management-horror-stories-to-avoid

1 Self-help Groups


1 Way Out: Keep Science FAIR (perfect world)

Principles to ensure research data is FAIR: Findable, Accessible, Interoperable, Reusable

“The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data”

“FAIRness is a prerequisite for proper data management and data stewardship”

Mark D. Wilkinson et al. The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data (2016). DOI: 10.1038/sdata.2016.18

https://www.force11.org/node/6062

2 Data Storage Evolution

https://www.nimbushosting.co.uk/evolution-data-storage/

[Figure: timeline of storage media from "Ancient times" to "We are here"]

https://villagevoice.freetls.fastly.net/wp-content/uploads/2014/08/beatleboys560.jpg

2 Life Expectancy of Digital Storage Media

http://www.zeit.de/wissen/2013-10/s37-infografik-speichermedien.pdf

https://homsum.files.wordpress.com/2014/04/dr_house_hugh_laurie_desktop_1152x864_wallpaper-83467.jpg

2 Life Expectancy of Digital Storage Media

Storage capacity grows, but not the lifespan

Average lifespan: about 10 to 30 years

2 Big Data Buzzwords: The Four V‘s


2 Size is not everything: Big Data and the Long Tail of Science

http://www.nature.com/neuro/journal/v17/n11/full/nn.3838.html

Big data from small data: data-sharing in the 'long tail' of neuroscience

Long Tail of Science

• {Astro|Nuclear}-physics,
• Genome studies,
• Remote Sensing

Overall amount continues to increase due to „Big Data“ (Volume | Velocity)

3 Data-driven Science


http://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html

Paradigms of Science:

1. empirical,

2. theoretical,

3. computational,

4. data-driven

3 The Fourth Paradigm

"It's the data, stupid"

Dr Gray's call-to-arms was [..] “to have a world in which

• all of the science literature is online,
• all of the science data is online, and they
• interoperate with each other.”

3 Innovation in Science travels at different velocities

• Science in general is affected by digital innovation
• Every field of science is different, but some are further ahead in embracing different aspects of change.
• Exchange of lessons learned across disciplines is needed.

http://i.quoteaddicts.com/media/q1/1487862.png

3 The Lifecycle of a Scientific Idea (Elegant High Level Perspective)

Influenced by computer-driven Science and „Big Data“?

The Lifecycle of a Scientific Idea: Reality check

1. Formulate a theory
2. Gather data
3. Learn about data storage
4. Learn about data movement protocols
5. Lose data
6. Check out of rehab
7. Learn about backup and replication
8. Gather data
9. Learn about versioning
10. Start preliminary analysis
11. Buy a newer laptop
12. Buy more memory
13. Buy a desktop with more memory
14. Buy a bigger monitor & GPUs “for work”
15. Google “250GB Excel Spreadsheet”
16. Learn about batch processing
17. Learn about batch schedulers
18. Learn about patience.
19. Learn more about data storage
20. Learn about distributed systems.
21. Go back through notes to remember the science question.
22. Learn R & Python
23. Learn linux admin
24. Finish preliminary analysis.
25. Grow a ponytail
26. Write a paper.
27. Learn about data publishing
28. Learn about reproducibility
29. Plot the death of your advisor/dept. head
30. Apply for grants & research allocations on public systems
31. Wait to apply next time
32. Finish analyzing data
33. Reformulate your theory
34. Goto 1

Source: John Fonner (2016) Jupyter Ascending, http://bit.ly/2vmTwCR

Reality Check:

Science is green, IT & the rest is blue, data-wrangling is red

Many data-wrangling challenges!

4 Data Wrangling: Research Data Management (RDM)

http://www.oclc.org/content/dam/research/images/publications/rdm-framework-4-with-cc.png

Today's menu

YOU

Infrastructure (is there one yet?)

4 RDM: Responsibilities before, during and after a research project

data/assets/pdf_file/0009/394056/research-data-management-in-practice.pdf

YOU

4 Data Curation Continuum

Personal domain → (Transfer) → Group domain → (Transfer) → Persistent domain → (Publication) → Access domain

The Data Curation Continuum is divided into four domains of responsibility. During each data transfer, the existing metadata are enriched with further elements. (After Klump, 2009)

Phases: Pre Research, Research, Post Research

4 Pre Research: Institutional Requirements

Institutional Data Management Framework:

• Institutional Policy and Procedures
• Support services - people and other means of providing advice and support
• IT Infrastructure - the hardware, software and other facilities
• Metadata management - so that data records can be meaningful and fit for purpose

4 Pre Research: Data Management Plan (perfect world)


data organisation and storage;

metadata standards and guidelines;

backups;

archiving for long-term preservation;

version control and derived data products;

data sharing or publishing intentions, including licensing;

ensuring security of confidential data;

data synchronisation; and

governance, roles and responsibilities.

4 Documentation 101

a) Document your data sets.

b) Ask your data repository how to document correctly (metadata!)

c) If you do not document, you are wasting an opportunity to receive credit by citation and reuse

d) Not to be missed:

• Topic (keywords, controlled vocabulary, abstract)
• Observation unit (counties, people, etc.)
• Data basis (random sampling, complete survey, etc.)
• Sampling method
• Extent
• Access: limitations, embargo, point of contact (POC)

4 Metadata 101


Metadata (structured data about the data)

• Who collected the data?

• Who funded the research project?

• When (and where) was it collected?

• Instruments and setting for collecting the data?

• Title of the dataset

• Methods used to process the data

• Etc. etc.
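The questions above can be captured as a small structured record stored next to the data. The following is only a sketch: the field names and example values are made up for illustration and do not follow any particular metadata standard, so check with your repository which schema it expects.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetMetadata:
    """Answers to the metadata questions; field names are illustrative only."""
    title: str
    collected_by: list
    funded_by: str
    collected_when: str
    collected_where: str
    instruments: str
    processing_methods: list = field(default_factory=list)


def missing_fields(meta):
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(meta).items() if not value]


def to_json(meta):
    """Serialise the record as human-readable JSON (an open text format)."""
    return json.dumps(asdict(meta), indent=2, ensure_ascii=False)


# Made-up example values, not taken from a real dataset.
record = DatasetMetadata(
    title="Example household panel, wave 1",
    collected_by=["A. Researcher"],
    funded_by="Example Funding Agency",
    collected_when="2016-03",
    collected_where="Berlin",
    instruments="Paper questionnaire",
    processing_methods=[],
)

print(missing_fields(record))
print(to_json(record))
```

Here `missing_fields` flags that the processing methods have not been documented yet, which is the kind of gap that is easy to close during the project and hard to close afterwards.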

4 Appropriate File Formats

• Open and non-proprietary

• Human-readable, non-binary

• Patent-free

• ISO standards

• Textual data: XML, TXT, HTML, PDF/A (Archival PDF)

• Tabular data (spreadsheets): CSV

• Databases: XML, CSV

• Images: TIFF, PNG, JPEG*

• Audio: FLAC, WAV, MP3
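As an illustration of the CSV recommendation for tabular data, the sketch below writes a small table with Python's standard `csv` module. The column names and values are invented for the example; a real export would also need the variables documented in a readme or codebook.

```python
import csv
import io

# Invented example rows; in practice these would come from your analysis tool.
rows = [
    {"person_id": 1, "year": 2015, "income_eur": 32000},
    {"person_id": 1, "year": 2016, "income_eur": 33500},
    {"person_id": 2, "year": 2015, "income_eur": 41000},
]


def write_csv(rows, handle):
    """Write a list of dictionaries as CSV with a header line."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; for a file, use
# open("panel.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The result is plain text that can be opened without the software that produced it, which is the point of the "open and non-proprietary" criterion.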

4 Include a Manifest / readme File!
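One possible way to produce a manifest is to list every file of a dataset with its size and a checksum, so that a later reader can verify that nothing is missing or altered. This is a sketch only: the line layout and the file name `MANIFEST.txt` are assumptions, not a format required by any repository.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Return one line per file: checksum, size in bytes, relative path."""
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        if path.name == "MANIFEST.txt":  # hypothetical manifest name, skipped
            continue
        rel = path.relative_to(data_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {path.stat().st_size}  {rel}")
    return "\n".join(lines) + "\n"


# Demonstration on a throwaway directory with two invented files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data.csv").write_bytes(b"x,y\n1,2\n")
    (root / "docs").mkdir()
    (root / "docs" / "readme.txt").write_bytes(b"hello")
    demo_manifest = build_manifest(root)

print(demo_manifest)
```

The manifest complements the readme: the readme explains what the files mean, the manifest states which files belong to the dataset.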

4 Data Life Cycle: Personal Domain Perspective

http://cdn.ttgtmedia.com/informationsecurity/images/vol4iss7/ism_v4i7_f4_DataLifecycle.gif

The most critical stage in the research data lifecycle is the completion of the research project. In most cases there is no follow-up funding to maintain the research data. Also, the scientist has to focus on the next project.

4 Publishing and Sharing Data

Publishing and Sharing data ≠ Open Access to data

• “Open” and “Closed” are relative concepts.
• “Closed” ≈ conditional access based on individual permission
• “Closed” ≈ conditional access based on roles

Metadata | Research Data
Open     | Open
Open     | Closed
Closed   | Open
Closed   | Closed

4 Continual data curation across domains


4 Data Curation Continuum: Visibility and Circulation

Personal domain → (Transfer) → Group domain → (Transfer) → Persistent domain → (Publication) → Access domain

Low visibility → High visibility

4 Data Delay Strategies?

https://www.explainxkcd.com/wiki/index.php/1805:_Unpublished_Discoveries

4 The Grant Cycle according to XKCD (and Machiavelli?)

http://phdcomics.com/comics/archive.php?comicid=1431

4 The Reputation Economy

Open Access to Data:

• Science has become a reputation economy
• The fundamental difference between disciplines is the trade-off between reputation and collaboration at points of the reputation economy where changes in the form of capital occur.
• Sharing data as a form of collaboration must be balanced by a similar gain in reputation.
• […] collaborative disciplines enforce data sharing as a social norm where non-compliance will result in some form of penalty […]

4 Research Parasites Paradigm: Open Access for Data is evil

https://media.tenor.com/images/236ee382fdf16973567dc3bb44c21b51/tenor.gif

Lego Gollum

4 Alternative Paradigm: Sharing the fire of the Open Data „torch“

4 A Solution for the Crisis: Open Science enables Reproducible Science

https://en.wikipedia.org/wiki/Open_science#/media/File:Open_Science_-_Prinzipien.png

Benefits:

• Greater availability and accessibility of publicly funded scientific research outputs;
• Possibility for rigorous peer-review processes;
• Greater reproducibility and transparency of scientific works;
• Greater impact of scientific research.

Open Science is the movement to make scientific research and data accessible to all

4 Reality check: Gollum (still) beats Prometheus by 10:1

https://s-media-cache-ak0.pinimg.com/originals/21/94/ed/2194ed6879d5bfd93679326508d382cd.jpg

• Gift culture still prevails
• It's not the technology
• It's not the generational change
• How to trigger cultural change?

Science Technology Medicine (STM), 2006-2016:

~ 30 million papers published
~ 3 million data publications
(Klump 2017)

Ratio: 10:1

4 Paradigm Change induced by Funding Agencies: Watering hole approach instead of stick & carrot

http://i.dailymail.co.uk/i/pix/2016/01/14/17/3025C04C00000578-3398562-image-a-16_1452793763082.jpg

Carrot & stick did not work

Control the watering hole: Works (for now)

4 FAIR principles: As guidelines

https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg

http://www.macs.hw.ac.uk/~ajg33/wp-content/uploads/2016/03/FAIR-Article-Poster.jpg

“The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data”

5 Technical Requirement for FAIR

• Easy and permanent access to research data via the internet
• Enhanced discovery, retrieval and management of data to enable data reuse and verification of research results

5 Benefits of Citation

• Including citable data in related publications increases the citation rate of those publications
• Only cited data can be counted and tracked (in a similar manner to journal articles) to measure impact
• Routine citation of data will assist in gaining acknowledgement of data as a first class research output
• Citations for published data can be included in CVs along with journal articles, reports and conference papers

5 Technical Challenge: Unbreakable internet-based Citation

Stable linking needed

• Data will move, URL links to webpages will break.
• Unbreakable alternative needed!

5 Digital Object Identifiers (DOI)

• International DOI Foundation was founded in 1998.
• The DOI system offers long-term persistence and accessibility of data.
• Based on the Handle system.
• In May 2012 the DOI system was published as ISO standard 26324.
• Part of the quality control is mandatory metadata for each object registered with a DOI.

5 What is a DOI?

DOI: Acronym for "digital object identifier".

A DOI name is an identifier (not a location) of an entity on digital networks.

What you see: alphanumeric string (never changes)

Associated with: location (such as URL)

Accompanied by: who, what, when… (metadata)
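The split between identifier and location can be sketched in a few lines: the DOI string stays fixed, and a resolver such as https://doi.org/ redirects it to wherever the object currently lives. The helper names below are hypothetical, and the regular expression is a common heuristic for DOI syntax, not the full definition in the ISO standard.

```python
import re
from urllib.parse import quote

# Heuristic pattern: prefix "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Strip common prefixes ("doi:", resolver URLs) and return the bare DOI."""
    text = raw.strip()
    text = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", text,
                  flags=re.IGNORECASE)
    if not DOI_PATTERN.match(text):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return text


def resolver_url(raw):
    """Build the resolver link; the location behind it may change over time."""
    return "https://doi.org/" + quote(normalize_doi(raw), safe="/")


# DOIs taken from this slide deck.
print(normalize_doi("http://dx.doi.org/10.4232/10.fisuzida2014.1"))
print(resolver_url("doi:10.5684/soep.v29"))
```

Storing only the bare DOI and building the link on demand keeps citations stable even if the preferred resolver address changes.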

5 DataCite Metadata Schema: Mandatory properties

Part of the quality control is mandatory metadata for each object registered with a DOI:

• Identifier (with type attribute)

• Creator (with type and nameIdentifier attributes)

• Title (with optional type attribute)

• Publisher

• PublicationYear
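A check for the five properties listed above could look like the sketch below. The flat dictionary and its key names are a simplification chosen for illustration; they are not the DataCite XML or JSON format, and the current schema version should be consulted for the authoritative list of mandatory properties.

```python
# The five properties named on the slide, as hypothetical dictionary keys.
MANDATORY = ("identifier", "creators", "title", "publisher", "publicationYear")


def missing_mandatory(record):
    """Return the mandatory keys that are absent or empty in the record."""
    return [key for key in MANDATORY if not record.get(key)]


# Values taken from the Zenodo reference cited later in this deck.
complete = {
    "identifier": "10.5281/zenodo.571808",
    "creators": ["Rueda, Laura"],
    "title": "Introduction to DataCite",
    "publisher": "Zenodo",
    "publicationYear": 2017,
}

# An incomplete record to show what the check reports.
incomplete = {
    "identifier": "10.5281/zenodo.571808",
    "creators": ["Rueda, Laura"],
    "title": "Introduction to DataCite",
}

print(missing_mandatory(complete))
print(missing_mandatory(incomplete))
```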

5 DOI is a quality label for data

Datasets with a DOI have to be:

• Stable (i.e. not going to be modified)
• Complete (i.e. not going to be updated)
• Permanent – by assigning a DOI we're committing to make the dataset available for posterity
• Good quality – by assigning a DOI it is receiving the data centre's stamp of approval, saying that it's complete and all the metadata is available

DOI: Seal of Approval

5 DOI for Research Data


https://support.datacite.org/docs/doi-basics

5 DOI Citation Examples

Fahrenberg, Jochen (2010): Freiburger Beschwerdenliste FBL. Primärdaten der Normierungsstichprobe 1993. Version 1.0.0. ZPID - Leibniz-Zentrum für Psychologische Information und Dokumentation. Dataset. http://doi.org/10.5160/psychdata.fgjn05an08

Rattinger, Hans; Roßteutscher, Sigrid; Schmitt-Beck, Rüdiger; Weßels, Bernhard (2012): Wahlkampf-Panel (GLES 2009). Version: 3.0.0. GESIS Datenarchiv. Dataset. doi:10.4232/1.11131.

Schupp, Jürgen; Kroh, Martin; Goebel, Jan; Bartsch, Simone; Giesselmann, Marco et al. (2013): Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012. Version: 29. SOEP - Sozio-oekonomisches Panel. Dataset. doi:10.5684/soep.v29.
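The examples share a pattern (creators, year, title, version, publisher, resource type, DOI) that can be assembled from the metadata. The function below is a sketch that mirrors the layout of the GESIS example; it is an assumption for illustration, and repositories or journals may prescribe a different citation style.

```python
def format_citation(creators, year, title, version, publisher, doi):
    """Assemble a data citation in the layout of the GESIS example above."""
    names = "; ".join(creators)
    return (f"{names} ({year}): {title}. Version: {version}. "
            f"{publisher}. Dataset. doi:{doi}")


# Values copied from the Wahlkampf-Panel example.
citation = format_citation(
    creators=["Rattinger, Hans", "Roßteutscher, Sigrid",
              "Schmitt-Beck, Rüdiger", "Weßels, Bernhard"],
    year=2012,
    title="Wahlkampf-Panel (GLES 2009)",
    version="3.0.0",
    publisher="GESIS Datenarchiv",
    doi="10.4232/1.11131",
)
print(citation)
```

Generating the citation from the registered metadata, instead of typing it by hand, keeps the version number and DOI consistent with the repository record.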

5 DOI System Architecture


5 DataCite Services


Search.datacite.org

5 Upcoming: Search DOI-registered datasets by ORCID

Find any DOI-registered publication by ORCID

http://dashboard.project-thor.eu

Example: Löwe / Loewe / Lowe?

Which of the four Peter Löwes?

6 Data Curation Continuum: Research Data Repositories

Personal domain → (Transfer) → Group domain → (Transfer) → Persistent domain → (Publication) → Access domain

Low visibility → High visibility

6 re3data: Registry of Research Data Repositories

1,500 research data repositories described by tags

6 re3data: Search options


6 Research Data Repository (RDR) Development and Services

Currently, DFG funds two RDR-related projects:

1. SowiDataNet: addressing the social sciences
2. RADAR: addressing the long tail of science

Technology and metadata are compatible.

RADAR is a service offering by FIZ Karlsruhe (testing phase)

Near future:

• SowiDataNet will become a service offering (GESIS)
• Datorium will merge with SowiDataNet

6 RADAR: Research Data Repository Services


Van den Broel K, Furtado F, Engel T (2015): RADAR – A Research Data Repository for the “Long-Tail of Science”

6 RADAR: Research Data Repositories Roles & Responsibilities

6 Datorium.gesis.org: Repository for Social Science and Economic Science

6 Datorium: Data Set Description


6 Datorium: Terms of Access


4 Where NOT to „publish“ your Data


Required:

Professional repositories which enable

• long term access,

• search,

• retrieval,

• thorough metadata

6 Alternative (Self help): All-purpose Repositories

Rueda, Laura. (2017, May). Introduction to DataCite. Zenodo. http://doi.org/10.5281/zenodo.571808

6 OPENAIRE: RDM on the European Level

https://www.openaire.eu/

https://www.slideshare.net/OpenAIRE_eu/enabling-better-science-results-and-vision-of-the-openaire-infrastructure-and-rda-data-publishing-working-group-55075375

6 Adoption of Open Science in Europe


https://www.fosteropenscience.eu/

6 Forschungsdaten in den Sozial- und Wirtschaftswissenschaften (Research Data in the Social and Economic Sciences)

http://dx.doi.org/10.4232/10.fisuzida2014.1

http://auffinden-zitieren-dokumentieren.de

6 Handbuch Forschungsdatenmanagement (Handbook of Research Data Management)

ISBN 978-3-88347-283-6 PDF: http://bit.ly/2uPJdaf

6 Rat für Sozial- und Wirtschaftsdaten (German Data Forum) / DFG

http://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/basisinformationen_forschungsdatenmanagement.pdf

6 WIKI: FORSCHUNGSDATEN.ORG


http://www.forschungsdaten.org

6 RESEARCH DATA ALLIANCE


https://www.rd-alliance.org/

6 Data Carpentry Workshops


http://www.datacarpentry.org/

7 AUSTRALIAN NATIONAL DATA SERVICE (ANDS)


7 Wise Advice

https://nicolahemmings.wordpress.com/2016/04/05/mistakes-ive-made-as-an-early-career-researcher/

Mistakes I’ve made as an early career researcher

APRIL 5, 2016

Nicola Hemmings (post-doc, University of Sheffield)

Failing to organise my data adequately (circa 2007).

“Prepare your datasets like you would if you were giving them to a stranger who knew nothing about them. Label, annotate and meticulously file your R scripts. Incorporate read-me files into everything and write them for the monkey that will be you in five years, when you return to your data and/or analyses for some unforeseen but vitally important reason. Don’t get this wrong. You will regret it.”

7 Back to the start: Snafu? Things are getting better

• This film is scientific nontextual information
• It is available on the AV-portal of TIB Hannover, a data portal for scientific audiovisual content.
• DOI-link: https://doi.org/10.5446/31036

Thank you for your attention.

DIW Berlin — Deutsches Institut

für Wirtschaftsforschung e.V.

Mohrenstraße 58, 10117 Berlin

www.diw.de

Editor: Peter Löwe ([email protected])

http://dilbert.com/strip/2010-08-24

Based on the works of

• Paul Wong (2017) ANDS, Research Integrity Advisor Data Management Workshop

• 3TU.Datacentre (2014): Data citation and DOIs

• and others
