research data management for econometrics
Econometrics of Panel Data and Network Analysis
Research Data Management
Module 1
Dr. Peter Löwe
Berlin, 03.08.2017
Agenda
1. Why bother: A crisis, horror stories & a Panda-Oncologist
2. Size is relative: Doctor House, Big Data, and a long tail
3. Reality Check: Doing science in the 21st century
4. Research Data Management according to Gollum and XKCD
5. Persistent Identifiers: Digital dog tags for everything and everyone!
6. Research Data Repositories & good reads
7. Conclusion: Culture change & happy Pandas
Peter Löwe 2017-08-02 Research Data Management: Module 1
1 Today's menu
• Why Research Data Management matters and how it
should work (perfect world)
• How stuff currently works (state of the art)
• How stuff will work soon (outlook)
• How to get started (self help)
1 Drivers for Research Data Management
https://www.kent.ac.uk/library/research/data-management/manage.html
Why you should care (internal motivation)
• Increase the efficiency of your research process
• Avoid losing data
• Enable data re-use and sharing
Why you are going to care (external motivation)
• Meet the requirements of research funders and your institute
• Comply with the policies of a growing number of journal publishers on making the data underlying publications available
• Increase your visibility (citations)
1 Research Data includes
• Questionnaires/surveys
• Raw experimental data
• Analysed data
• Databases
• Simulations and research code (software)
• Audio-visual materials
• Laboratory and field notes
• Clinical data, including clinical records
• Images and photographs
1 The Research Data Spectrum
Physical → Digital
• Handwritten letters → Scanned & OCR version
• Images or photos → Scanned digital version
• Soil samples → Analysed result of samples
• Tissue samples → Analysed result of samples
• Archaeological dig sites → 3D models of the dig site
• …
1 Issue: The Reproducibility Crisis
Nature 533, 452–454 (26 May 2016) doi:10.1038/533452a
https://www.slideshare.net/AustralianNationalDataService/research-data-management-in-practice-ria-data-management-workshop-brisbane-2017
• A methodological crisis in science
• The phrase was coined in the early 2010s as part of a growing awareness of the problem
• 2016: poll of 1,500 scientists
• 70% of them had failed to reproduce at least one other scientist's experiment
• Results of many scientific studies are difficult or impossible to replicate on subsequent investigation
https://en.wikipedia.org/wiki/Replication_crisis
1 Data Sharing and Management Snafu in 3 Short Acts
[Snafu: „Situation normal, all f***ed up“]
1 Discussion
Have you encountered something similar?
How would you deal with such a situation?
Where do you store your data?
How much data would you lose if your laptop was stolen?
1 Reproducibility decreases over time due to increasing data loss
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
“In their parents' attic, in boxes in the garage, or stored on now-defunct
floppy disks — these are just some of the inaccessible places in which
scientists have admitted to keeping their old research data. Such practices
mean that data are being lost to science at a rapid rate, a study has now
found.”
1 Night of the Living Data
http://www.eweek.com/database/5-data-management-horror-stories-to-avoid
1 Way Out: Keep Science FAIR (perfect world)
Principles to ensure research data is FAIR:
Findable, Accessible, Interoperable, Reusable
“The problem the FAIR Principles address is the lack of widely shared, clearly
articulated, and broadly applicable best practices around the publication of scientific
data”
“FAIRness is a prerequisite for proper data management and data stewardship”
Mark D. Wilkinson et al. The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data (2016). DOI: 10.1038/sdata.2016.18
https://www.force11.org/node/6062
2 Data Storage Evolution
https://www.nimbushosting.co.uk/evolution-data-storage/
https://villagevoice.freetls.fastly.net/wp-content/uploads/2014/08/beatleboys560.jpg
Ancient times → We are here
2 Life Expectancy of Digital Storage Media
http://www.zeit.de/wissen/2013-10/s37-infografik-speichermedien.pdf
https://homsum.files.wordpress.com/2014/04/dr_house_hugh_laurie_desktop_1152x864_wallpaper-83467.jpg
2 Life Expectancy of Digital Storage Media
Storage capacity grows, but not the lifespan
Average lifespan: about 10 to 30 years
2 Size is not everything: Big Data and the Long Tail of Science
http://www.nature.com/neuro/journal/v17/n11/full/nn.3838.html
Big data from small data: data-sharing in the 'long tail' of neuroscience
Long Tail of Science
• {Astro|Nuclear} physics,
• Genome studies,
• Remote Sensing
Overall amount continues to increase due to „Big Data“ (Volume | Velocity)
3 Data-driven Science
http://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html
Paradigms of Science:
1. empirical,
2. theoretical,
3. computational,
4. data-driven
3 The Fourth Paradigm
"It's the data, stupid"
Dr Gray's call-to-arms was [..] “to have a world in which
• all of the science literature is online,
• all of the science data is online, and they
• interoperate with each other.”
3 Innovation in Science travels at different velocities
• Science in general is affected by digital innovation
• Every field of science is different
• Some fields are further ahead in embracing particular aspects of change.
• An exchange of lessons learned across disciplines is needed.
http://i.quoteaddicts.com/media/q1/1487862.png
3 The Lifecycle of a Scientific Idea (Elegant High-Level Perspective)
Influenced by computer-driven Science and „Big Data“?
3 The Lifecycle of a Scientific Idea: Reality check
1. Formulate a theory
2. Gather data
3. Learn about data storage
4. Learn about data movement protocols
5. Lose data
6. Check out of rehab
7. Learn about backup and replication
8. Gather data
9. Learn about versioning
10. Start preliminary analysis
11. Buy a newer laptop
12. Buy more memory
13. Buy a desktop with more memory
14. Buy a bigger monitor & GPUs “for work”
15. Google “250GB Excel Spreadsheet”
16. Learn about batch processing
17. Learn about batch schedulers
18. Learn about patience
19. Learn more about data storage
20. Learn about distributed systems
21. Go back through notes to remember the science question
22. Learn R & Python
23. Learn Linux admin
24. Finish preliminary analysis
25. Grow a ponytail
26. Write a paper
27. Learn about data publishing
28. Learn about reproducibility
29. Plot the death of your advisor/dept. head
30. Apply for grants & research allocations on public systems
31. Wait to apply next time
32. Finish analyzing data
33. Reformulate your theory
34. Goto 1
Source: John Fonner (2016) Jupyter Ascending, http://bit.ly/2vmTwCR
3 Reality Check: Science is green, IT & the rest is blue, data-wrangling is red
Many data-wrangling challenges!
4 Data Wrangling: Research Data Management (RDM)
http://www.oclc.org/content/dam/research/images/publications/rdm-framework-4-with-cc.png
Today's menu
YOU
Infrastructure (is there one yet?)
4 RDM: Responsibilities before, during and after a research project
data/assets/pdf_file/0009/394056/research-data-management-in-practice.pdf
YOU
4 Data Curation Continuum
Personal domain → Transfer → Group domain → Transfer → Persistent domain → Publication → Access domain
Pre Research → Research → Post Research
The Data Curation Continuum is divided into four domains of responsibility. With each data transfer, the existing metadata are enriched with further elements. (After Klump, 2009)
4 Pre Research: Institutional Requirements
Institutional Policy and Procedures
Support services - people and other means of providing advice and support
IT Infrastructure - the hardware, software and other facilities
Metadata management - so that data records can be meaningful and fit for purpose
Institutional Data Management Framework
4 Pre Research: Data Management Plan (perfect world)
data organisation and storage;
metadata standards and guidelines;
backups;
archiving for long-term preservation;
version control and derived data products;
data sharing or publishing intentions, including licensing;
ensuring security of confidential data;
data synchronisation; and
governance, roles and responsibilities.
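Backups and long-term archiving only help if silent corruption is noticed. The following is a minimal sketch, assuming a project folder on a local disk: it records a SHA-256 checksum per file and re-checks the files later. The function names `build_manifest` and `verify_manifest` are illustrative and do not belong to any repository's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below `root` (relative path) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose content changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Storing the manifest next to each backup copy makes it possible to tell which copy is still intact before restoring from it.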
4 Documentation 101
a) Document your data sets.
b) Ask your data repository how to document correctly (metadata!)
c) If you do not document, you're wasting an opportunity to receive credit by citation and reuse
d) Not to be missed:
Topic (keywords, controlled vocabulary, abstract)
Observation unit (counties, people, etc.)
Database (random sampling, complete survey, etc.)
Sampling method
Extent
Access: limitations, embargo, point of contact (POC)
4 Metadata 101
Metadata (structured data about the data)
• Who collected the data?
• Who funded the research project?
• When (and where) was it collected?
• Instruments and setting for collecting the data?
• Title of the dataset
• Methods used to process the data
• Etc. etc.
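The questions above can be captured as a small structured record stored next to the data file. This is only a sketch: the field names and the example values are hypothetical and do not follow any particular metadata standard, so a repository's own schema should take precedence.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetMetadata:
    """Answers to the 'who, when, where, how' questions for one dataset."""
    title: str
    collected_by: list[str]
    funded_by: str
    collected_when: str          # e.g. an ISO 8601 date or date range
    collected_where: str
    instruments: str
    processing_methods: list[str] = field(default_factory=list)


def to_sidecar_json(meta: DatasetMetadata) -> str:
    """Serialise the record as human-readable JSON for a sidecar file."""
    return json.dumps(asdict(meta), indent=2, ensure_ascii=False)


# Hypothetical example values, for illustration only.
example = DatasetMetadata(
    title="Household survey, wave 1",
    collected_by=["A. Researcher"],
    funded_by="Example Funding Agency",
    collected_when="2017-01/2017-03",
    collected_where="Berlin",
    instruments="Paper questionnaire",
    processing_methods=["double data entry", "range checks"],
)
print(to_sidecar_json(example))
```

Writing the output to a file such as `survey_wave1.metadata.json` (a made-up name) keeps the description in an open, text-based format.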
4 Appropriate File Formats
• Open and non-proprietary
• Human-readable, non-binary
• Patent-free
• ISO standards
• Textual data: XML, TXT, HTML, PDF/A (Archival PDF)
• Tabular data (spreadsheets): CSV
• Databases: XML, CSV
• Images: TIFF, PNG, JPEG*
• Audio: FLAC, WAV, MP3
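As a hedged illustration of the CSV recommendation, the sketch below writes a small table to CSV text and reads it back using only the Python standard library. The column names and values are invented for the example.

```python
import csv
import io


def rows_to_csv(header: list[str], rows: list[list[object]]) -> str:
    """Render tabular data as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text: str) -> list[list[str]]:
    """Read CSV text back; every value comes back as a string."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Invented example data: one value contains a comma and gets quoted.
text = rows_to_csv(
    ["county", "year", "income"],
    [["A", 2016, 31000.5], ["B, North", 2016, 28750.0]],
)
print(text)
```

The round trip shows one trade-off of the format: CSV is open and human-readable, but it carries no type information, so units and data types belong in the accompanying documentation.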
4 Data Life Cycle: Personal Domain Perspective
http://cdn.ttgtmedia.com/informationsecurity/images/vol4iss7/ism_v4i7_f4_DataLifecycle.gif
The most critical stage in the research data lifecycle is the completion of the research project. In most cases there is no follow-up funding to maintain the research data, and the scientist has to focus on the next project.
4 Publishing and Sharing Data
Publishing and Sharing data ≠ Open Access to data
• “Open” and “Closed” are relative concepts.
• “Closed” ≈ conditional access based on individual permission
• “Closed” ≈ conditional access based on roles
Metadata    Research Data
Open        Open
Open        Closed
Closed      Open
Closed      Closed
4 Data Curation Continuum: Visibility and Circulation
Personal domain → Transfer → Group domain → Transfer → Persistent domain → Publication → Access domain
Low visibility → High visibility
4 Data Delay Strategies?
https://www.explainxkcd.com/wiki/index.php/1805:_Unpublished_Discoveries
4 The Grant Cycle according to XKCD (and Machiavelli?)
http://phdcomics.com/comics/archive.php?comicid=1431
4 The Reputation Economy
Open Access to Data:
• Science has become a reputation economy
• The fundamental difference between disciplines is the trade-off between reputation and collaboration at points of the reputation economy where changes in the form of capital occur.
• Sharing data as a form of collaboration must be balanced by a similar gain in reputation.
• […] collaborative disciplines enforce data sharing as a social norm where non-compliance will result in some form of penalty […]
4 Research Parasites Paradigm: Open Access for Data is evil
https://media.tenor.com/images/236ee382fdf16973567dc3bb44c21b51/tenor.gif
Lego Gollum
4 Alternative Paradigm: Sharing the fire of the Open Data „torch“
4 A Solution for the Crisis: Open Science enables Reproducible Science
https://en.wikipedia.org/wiki/Open_science#/media/File:Open_Science_-_Prinzipien.png
Benefits:
• Greater availability and accessibility of publicly funded scientific research outputs;
• Possibility for rigorous peer-review processes;
• Greater reproducibility and transparency of scientific works;
• Greater impact of scientific research.
Open Science is the movement to make scientific research and data accessible to all
4 Reality check: Gollum (still) beats Prometheus by 10:1
https://s-media-cache-ak0.pinimg.com/originals/21/94/ed/2194ed6879d5bfd93679326508d382cd.jpg
• Gift culture still prevails
• It's not the technology
• It's not the generational change
• How to trigger cultural change?
Science, Technology, Medicine (STM), 2006-2016:
~ 30 million papers published
~ 3 million data publications
(Klump 2017)
Ratio: 10:1
4 Paradigm Change induced by Funding Agencies: Watering hole approach instead of stick & carrot
http://i.dailymail.co.uk/i/pix/2016/01/14/17/3025C04C00000578-3398562-image-a-16_1452793763082.jpg
Carrot & stick did not work
Control the watering hole: Works (for now)
4 FAIR principles: As guidelines
https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg
http://www.macs.hw.ac.uk/~ajg33/wp-content/uploads/2016/03/FAIR-Article-Poster.jpg
“The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data”
5 Technical Requirement for FAIR
• Easy and permanent access to research data via the internet
• Enhanced discovery, retrieval and management of data to enable data reuse and verification of research results
5 Benefits of Citation
• Including citable data in related publications increases
the citation rate of those publications
• Only cited data can be counted and tracked (in a similar
manner to journal articles) to measure impact
• Routine citation of data will assist in gaining
acknowledgement of data as a first class research output
• Citations for published data can be included in CVs along
with journal articles, reports and conference papers
5 Technical Challenge: Unbreakable internet-based Citation
Stable linking needed
• Data will move, URL links to web pages will break.
• Unbreakable alternative needed!
5 Digital Object Identifiers (DOI)
• International DOI Foundation was founded in 1998.
• The DOI system offers long-term persistence and
accessibility of data.
• Based on the Handle system.
• In May 2012 the DOI System ISO Standard 26324 was
published.
• Part of the quality control is mandatory metadata for
each object registered with a DOI.
5 What is a DOI ?
DOI: Acronym for “digital object identifier”.
A DOI name is an identifier (not a location) of an entity on digital networks.
What you see: alphanumeric string (never changes)
Associated with: location (such as URL)
Accompanied by: who, what, when… (metadata)
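Because the DOI name is the stable part and the resolver address is only a way of reaching it, the same DOI shows up in several spellings (`doi:…`, `http://doi.org/…`, `http://dx.doi.org/…`). The sketch below reduces these spellings to the bare DOI name and builds one resolver URL. The prefix list and the simple validity check are assumptions made for this example, not a complete DOI syntax check.

```python
RESOLVER = "https://doi.org/"
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Strip resolver/scheme prefixes and return the bare DOI name."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    text = text.strip()
    # Rough plausibility check only: DOI names start with "10." and
    # separate prefix and suffix with a slash.
    if not text.startswith("10.") or "/" not in text:
        raise ValueError(f"not a DOI name: {raw!r}")
    return text


def doi_url(raw: str) -> str:
    """Return the resolver URL for a DOI given in any common spelling."""
    return RESOLVER + normalise_doi(raw)


print(doi_url("doi:10.1038/sdata.2016.18"))
```

Storing only the bare name and generating the link on demand keeps citations valid even if the preferred resolver address changes.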
5 DataCite Metadata Schema: Mandatory properties
Part of the quality control is mandatory metadata for each object registered with a DOI:
• Identifier (with type attribute)
• Creator (with type and nameIdentifier attributes)
• Title (with optional type attribute)
• Publisher
• PublicationYear
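A simple pre-submission check for the five properties listed above could look like the following sketch. The dictionary keys are hypothetical names chosen for this example, not the element names of the DataCite XML schema, and the current schema version should be consulted for the authoritative list. The example record uses the Zenodo citation given later in this module.

```python
# Keys are illustrative stand-ins for the properties named on the slide.
MANDATORY = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory properties that are absent or empty."""
    return [key for key in MANDATORY if not record.get(key)]


record = {
    "identifier": "10.5281/zenodo.571808",
    "creators": ["Rueda, Laura"],
    "title": "Introduction to DataCite",
    "publisher": "Zenodo",
    "publication_year": 2017,
}
print(missing_mandatory(record))

incomplete = {k: v for k, v in record.items() if k != "publisher"}
print(missing_mandatory(incomplete))
```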
5 DOI is a quality label for data
Datasets with a DOI have to be:
Stable (i.e. not going to be modified)
Complete (i.e. not going to be updated)
Permanent – by assigning a DOI we’re committing to make the dataset available for posterity
Good quality – by assigning a DOI it is receiving the data centre’s stamp of approval, saying that it’s complete and all the metadata is available
DOI: Seal of Approval
5 DOI for Research Data
https://support.datacite.org/docs/doi-basics
5 DOI Citation Examples
Fahrenberg, Jochen (2010): Freiburger Beschwerdenliste FBL. Primärdaten der Normierungsstichprobe 1993. Version 1.0.0. ZPID - Leibniz-Zentrum für Psychologische Information und Dokumentation. Dataset. http://doi.org/10.5160/psychdata.fgjn05an08
Rattinger, Hans; Roßteutscher, Sigrid; Schmitt-Beck, Rüdiger; Weßels, Bernhard (2012): Wahlkampf-Panel (GLES 2009). Version: 3.0.0. GESIS Datenarchiv. Dataset. doi:10.4232/1.11131
Schupp, Jürgen; Kroh, Martin; Goebel, Jan; Bartsch, Simone; Giesselmann, Marco et al. (2013): Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012. Version: 29. SOEP - Sozio-oekonomisches Panel. Dataset. doi:10.5684/soep.v29
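The citation examples share one pattern: creators, year, title, version, publisher, resource type, DOI. The sketch below assembles such a string from its parts. The function name and the exact punctuation are assumptions modelled on the first example, and a repository or journal may prescribe a different style.

```python
from typing import Optional


def format_data_citation(
    creators: list[str],
    year: int,
    title: str,
    publisher: str,
    doi: str,
    version: Optional[str] = None,
) -> str:
    """Build 'Creators (Year): Title. Version X. Publisher. Dataset. doi:...'."""
    parts = [f"{'; '.join(creators)} ({year}): {title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(f"Dataset. doi:{doi}")
    return " ".join(parts)


citation = format_data_citation(
    creators=[
        "Rattinger, Hans",
        "Roßteutscher, Sigrid",
        "Schmitt-Beck, Rüdiger",
        "Weßels, Bernhard",
    ],
    year=2012,
    title="Wahlkampf-Panel (GLES 2009)",
    publisher="GESIS Datenarchiv",
    doi="10.4232/1.11131",
    version="3.0.0",
)
print(citation)
```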
5 Upcoming: Search DOI-registered datasets by ORCID
Find any DOI-registered publication by ORCID
http://dashboard.project-thor.eu
Example: Löwe / Loewe / Lowe?
Which of the four Peter Löwes?
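An ORCID iD sidesteps the spelling problem because it identifies the person, not the name. Its last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm, so typing errors can be caught locally. The following is a minimal sketch of that check; it tests only length and checksum, not whether the iD is actually registered. The sample iD is the fictitious example record used in ORCID's own documentation.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length and checksum after removing the hyphens."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
```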
6 Data Curation Continuum: Research Data Repositories
Personal domain → Transfer → Group domain → Transfer → Persistent domain → Publication → Access domain
Low visibility → High visibility
6 re3data: Registry of Research Data Repositories
1,500 research data repositories described by tags:
6 Research Data Repository (RDR) Development and Services
Currently, DFG funds two RDR-related Projects:
1. SowiDataNet: addressing the social sciences
2. RADAR: addressing the long tail of Science
Technology and Metadata are compatible.
RADAR is a service offering by FIZ Karlsruhe (testing phase)
Near future:
• SowiDataNet will become a service offering (GESIS)
• Datorium will merge with SowiDataNet
6 RADAR: Research Data Repository Services
Van den Broel K, Furtado F, Engel T (2015): RADAR – A Research Data Repository for the “Long-Tail of Science”
6 RADAR: Research Data Repositories Roles & Responsibilities
6 Datorium.gesis.org: Repository for Social Science and Economic Science
4 Where NOT to „publish“ your Data
Required:
Professional repositories which enable
• long term access,
• search,
• retrieval,
• thorough metadata
6 Alternative (Self help): All-purpose Repositories
Rueda, Laura. (2017, May). Introduction to DataCite. Zenodo.
http://doi.org/10.5281/zenodo.571808
6 OPENAIRE: RDM on the European Level
https://www.openaire.eu/
https://www.slideshare.net/OpenAIRE_eu/enabling-better-science-results-and-vision-of-the-openaire-infrastructure-and-rda-data-publishing-working-group-55075375
6 Adoption of Open Science in Europe
https://www.fosteropenscience.eu/
6 Forschungsdaten in den Sozial- und Wirtschaftswissenschaften
http://dx.doi.org/10.4232/10.fisuzida2014.1
http://auffinden-zitieren-dokumentieren.de
6 Handbuch Forschungsdatenmanagement
ISBN 978-3-88347-283-6 PDF: http://bit.ly/2uPJdaf
6 Rat für Sozial- und Wirtschaftsdaten / DFG
http://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/basisinformationen_forschungsdatenmanagement.pdf
6 WIKI: FORSCHUNGSDATEN.ORG
http://www.forschungsdaten.org
6 RESEARCH DATA ALLIANCE
https://www.rd-alliance.org/
6 Data Carpentry Workshops
http://www.datacarpentry.org/
7 Wise Advice
https://nicolahemmings.wordpress.com/2016/04/05/mistakes-ive-made-as-an-early-career-researcher/
Mistakes I’ve made as an early career researcher
APRIL 5, 2016
Nicola Hemmings (post-doc, University of Sheffield)
Failing to organise my data adequately (circa 2007).
“Prepare your datasets like you would if you were giving them to a
stranger who knew nothing about them. Label, annotate and
meticulously file your R scripts. Incorporate read-me files into everything
and write them for the monkey that will be you in five years, when you
return to your data and/or analyses for some unforeseen but vitally
important reason. Don’t get this wrong. You will regret it.”
7 Back to the start: Snafu? Things are getting better
• This film is scientific non-textual information
• It is available on the AV-portal of TIB Hannover, a data portal for scientific audiovisual content.
• DOI-link: https://doi.org/10.5446/31036
Thank you for your attention.
DIW Berlin — Deutsches Institut
für Wirtschaftsforschung e.V.
Mohrenstraße 58, 10117 Berlin
www.diw.de
Editor: Peter Löwe ([email protected])
http://dilbert.com/strip/2010-08-24
Based on the works of
• Paul Wong (2017) ANDS, Research Integrity Advisor Data Management Workshop
• 3TU.Datacentre (2014): Data citation and DOIs
• and others