real-world data challenges: moving towards richer data ecosystems
| 1
Anita de Waard (ORCID 0000-0002-9034-4119)
VP Research Data Collaborations
Elsevier RDM [email protected]
Big Data PI Meeting, March 16, 2016
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
| 2
[Figure: Planned Earth System Grid System Evolution]
Generations: ESG-I (1999-2001), ESG-II (2001-2006), ESG-CET (2006-2011), ESGF (2011-2020), ESGF-VL (2020-)
Capability status: Prototype capabilities, Usable capabilities, Future capabilities
Planned Earth System Grid System Data Archival: Model Intercomparison Projects; Remote Sensing, In Situ, Climatology, Diagnostics, Ecosystem, Hydrology, Biology, etc.
Capabilities: In situ; Analysis; Data Management; Distributed Search; Federation; Distributed Computation; Provenance Capture; Authentication & Authorization; Network; Analytical Modeling; Dynamic Resources; Data Transfer; Long-tail Publication; Data Citation; Machine Learning; Workflow; Quality Control & Assurance; Metrics; User Notification; User Interface; Version Control; Replication
Data volume: Petabytes (10^15) to Exabytes (10^18)
Timeline: 1999 Centralized Archive; 2017 Distributed Data Ecosystem; 2022 Virtual Laboratory
Source: Dean Williams, Lawrence Livermore/ESGF, March 1st 2017
Trend # 1: Repositories are becoming virtual labs
| 3
Trend # 2: Scientists are Moving ‘Beyond Downloads’
| 4
Trend # 3: Computers are scientists, too!
“intelligent systems for computer-aided discovery can complement and integrate
into the insight generation loop in scalable ways…”
Source: Computer-Aided Discovery: Toward Scientific Insight Generation with Machine Support, http://ieeexplore.ieee.org/abstract/document/7515118/
“This work combines time series Principal Component Analysis with InSAR to constrain
the space of possible model explanations on current empirical data sets and achieve a better
identification of deformation patterns”
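To make the idea of time series Principal Component Analysis concrete, here is a minimal sketch in plain Python. It is not the method of the cited paper: the displacement series is synthetic, the pixel amplitudes are invented, and the leading component is found with simple power iteration rather than a full decomposition. All function and variable names are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so PCA describes variation, not offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix between columns (one column per pixel)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) of the dominant mode."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

# Synthetic (assumed) displacement series: 12 epochs (rows) x 4 pixels (columns).
# Each pixel follows the same linear trend with a different amplitude,
# plus a small deterministic wiggle standing in for noise.
amplitudes = [1.0, 2.0, 0.5, 0.0]
series = [[a * t + 0.05 * math.sin(3 * t + k) for k, a in enumerate(amplitudes)]
          for t in range(12)]

centered = center_columns(series)
cov = covariance(centered)
eigenvalue, component = leading_component(cov)

# Fraction of total variance captured by the first principal component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("component weights:", [round(x, 3) for x in component])
print("explained variance fraction:", round(explained, 4))
```

In this toy case the first component weights the pixels roughly in proportion to their trend amplitudes, which is the sense in which PCA can isolate a shared deformation pattern from many per-pixel time series.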
| 5
Raising many technical/organisational/policy questions:
• Is Long-Tail Data + Semantics = Big Data?
• Is Data Science a field, or a skill? (A department, or a class?)
• Are supercomputing centers research departments or bits of infrastructure? (And if infrastructure, are they part of IT? (“Oh, no, anything but that!”))
• Are repositories places to store outputs, or places where science is conducted?
• If so, how are repositories and HPC’s recognised and rewarded?
• How can we keep track of (micro)provenance of parts of data sets?
• Should we explore Blockchain technology for this? (“Oh no, anything but that!”)
• Is a piece of software part of the University’s Research Outputs?
• If so, how do we reward brilliant coders who blog, but don’t write?
• How do we reward (virtual) collaboration?
• Why won’t those damn scientists share their data?
• Who will own the Data Science Cloud: Amazon? Or the joint HPC’s (NDS??) Is NIH Data Commons the Model? Or is this a free for all? What is the role of commercial parties?
• Is data curation/stewardship a part of science, or a glorified administrator's job?
• What is the role of libraries, in all this?
• And why the hell is a publisher talking about it?
| 6
[Figure: research data management workflow and the systems that support it]
Research steps: Find topic; Identify gaps; Plan & fund; Discover data, people, methods & protocols; Collect, analyze & visualize; Store, preserve & share; Publish; Prepare, reproduce, re-use & benchmark
Systems: Data Management Plans; Lab ELN(s); Faculty LIMS; Data center; Inst. Data Repositorie(s); Domain-specific Repositories; Data Journal; Journal; Data search; General search; Link to article
Goals: Plan each step from experiment to publish; Metadata, methods & protocols ready for preservation and publishing; Publish data (under embargo); Secure discoverability in & outside the institution
What Elsevier is Interested in: Supporting RDM Networks
| 7
Biological pathways extracted via semantic text mining:
• A upregulates B
• B upregulates C
• C increases disease D
(A → B → C → D)
Normalizing vocabularies required: proteins, diseases, drugs, chemicals
Bioactivities through text analysis: IC50 6.3nM, kinase binding assay 10mM concentration
Chemical structures and properties
Vocabularies and identifiers: InChI, Name; NCBI, UniProt; EMTREE; ReaxysTree, Structures
What Elsevier is Interested in: Knowledge Graphs in Life Science
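As an illustration of how extracted statements such as "A upregulates B" can be chained into a pathway, here is a minimal sketch of a triple store with path search. It assumes the entities have already been normalized to shared identifiers; the placeholder names A to D come from the slide, while the function names and data layout are hypothetical and not a description of any Elsevier system.

```python
from collections import defaultdict

# Statements as (subject, relation, object) triples, as text mining might emit
# them after vocabulary normalization (assumed to have happened upstream).
triples = [
    ("A", "upregulates", "B"),
    ("B", "upregulates", "C"),
    ("C", "increases", "D"),
]

def build_graph(triples):
    """Index triples by subject so outgoing relations can be followed."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def find_path(graph, start, goal):
    """Depth-first search; returns the list of triples linking start to goal,
    or None when no chain of statements connects them."""
    stack = [(start, [])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for relation, target in graph.get(node, []):
            stack.append((target, path + [(node, relation, target)]))
    return None

graph = build_graph(triples)
path = find_path(graph, "A", "D")
for subject, relation, obj in path:
    print(subject, relation, obj)
```

Connecting A to disease D only works if every source uses the same identifier for B and C, which is why the normalizing vocabularies matter.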
| 8
What Elsevier is Interested in: Knowledge Graphs in Research
| 9
Thank you!
Links to things we’re involved with:
• https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
• https://www.elsevier.com/about/open-science/research-data
• https://www.hivebench.com
• https://data.mendeley.com/
• https://datasearch.elsevier.com/
• https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking
• http://www.journals.elsevier.com/softwarex/
• https://www.elsevier.com/physical-sciences/earth-and-planetary-sciences/the-2015-international-data-rescue-award-in-the-geosciences
• https://rd-alliance.org/groups/rdawds-publishing-data-services-wg.html
• https://www.force11.org/
• http://www.nationaldataservice.org/
• https://rd-alliance.org/
Anita de Waard, [email protected]