cern open data and data analysis knowledge preservation · 2015. 4. 30. · cern open data and data...

26
CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 · 21–23 April 2015 · Jasná, Slovakia @tiborsimko · @inveniosoftware 1 / 26

Upload: others

Post on 18-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CERN Open Data and Data AnalysisKnowledge Preservation

Tibor Šimko

Digital Library 2015 · 21–23 April 2015 · Jasná, Slovakia

@tiborsimko · @inveniosoftware 1 / 26

Page 2: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

1

Invenio

@tiborsimko · @inveniosoftware 2 / 26

Page 3: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

What is Invenio?digital library and document repository software

– mature platform: first public release in 2002– rich data: articles, books, notes, photos, videos, software, data

some Invenio-based services at CERN:

co-developed by an international collaboration

participating in EU FP7/H2020 projects

@tiborsimko · @inveniosoftware 3 / 26

Page 4: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

“Data behind plots”

@tiborsimko · @inveniosoftware 4 / 26

Page 5: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

“Code you can cite”automated GitHub↔ Zenodo bridgepush new release to GitHub→ automatic archival on Zenodosoftware preserved, minted with a DOI, made citable

https://guides.github.com/activities/citable-code

@tiborsimko · @inveniosoftware 5 / 26

Page 6: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Code↔ Data↔ Paper

link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE)example: hep-ex/0011057, arXiv:1401.0080

@tiborsimko · @inveniosoftware 6 / 26

Page 7: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

2

Data Analysis

@tiborsimko · @inveniosoftware 7 / 26

Page 8: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CERN LHC Experiments

@tiborsimko · @inveniosoftware 8 / 26

Page 9: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Large Scale Solutions

Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NICGrid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre

@tiborsimko · @inveniosoftware 9 / 26

Page 10: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Preserve an Analysis?

@tiborsimko · @inveniosoftware 10 / 26

Page 11: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Big Data?

data scale knowledgeraw ∼GB / sec calibration, conditioningreconstructed ∼PB / year filtering, selectionreduced ∼TB / analysis user code, physics objectspublication ∼GB / analysis correlation, data behind plots

filteringinput...

code

output ...

Analysis Train

@tiborsimko · @inveniosoftware 11 / 26

Page 12: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Knowledge Capture

@tiborsimko · @inveniosoftware 12 / 26

Page 13: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

System Architecture

Analysis

analysis-preservation.cern.ch

file storage abstraction layer

CASTORCephBoxAFS Drive EOS S3

GitHub

SVN

TWiki

SharePoint

INSPIRE

CDS

...

CADI

@tiborsimko · @inveniosoftware 13 / 26

Page 14: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Knowledge Representationrecord format: extended MARC21

– “technical” metadata: beyond bytes

e.g. 256 �computer file characteristics�

$a characteristics $e events $t text

$b bytes $f files ...

– “knowledge” metadata: semantics

e.g. 505 �formatted contents note� CSV column information

$t title $g miscellaneous

internal format: JSON

JSON

MARC21

EAD

schema model

@tiborsimko · @inveniosoftware 14 / 26

Page 15: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

3

Open Data

@tiborsimko · @inveniosoftware 15 / 26

Page 16: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Opening Up

Data policies:restricted→ embargo period→ open“[...] Data with high abstraction, such as AOD, will be conditionally made publiclyavailable after an embargo period of 5 years after publication for 10% of the dataand 10 years for 100% of the data [...]” —ALICE Data Policy

Challenges:audience:

– data miners– citizen scientists– high-school students– general public

computing:– exploring in the browser– specialised VMs

@tiborsimko · @inveniosoftware 16 / 26

Page 17: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CERN Open Data Portal

@tiborsimko · @inveniosoftware 17 / 26

Page 18: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Education

@tiborsimko · @inveniosoftware 18 / 26

Page 19: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Visualise detector events

@tiborsimko · @inveniosoftware 19 / 26

Page 20: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Basic histogramming

@tiborsimko · @inveniosoftware 20 / 26

Page 21: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Research

@tiborsimko · @inveniosoftware 21 / 26

Page 22: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CMS Primary Datasets 2010

@tiborsimko · @inveniosoftware 22 / 26

Page 23: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CernVM Virtual Machine

@tiborsimko · @inveniosoftware 23 / 26

Page 24: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

Open Data? Who cares?

82,000 distinct users visited the site21,000 distinct users viewed data records16,000 distinct users used event display

3,000 distinct users used histogramming@tiborsimko · @inveniosoftware 24 / 26

Page 25: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

7

Conclusions

@tiborsimko · @inveniosoftware 25 / 26

Page 26: CERN Open Data and Data Analysis Knowledge Preservation · 2015. 4. 30. · CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21–23 April

CERN (Open) Data

Capturing and disseminating knowledgeof data, code, platform, processes

to enable future data reuse

(Open) Data Analysis Preservation Frameworkhttp://opendata.cern.ch/

CERN IT J. Cowton, P. Fokianos, J. Kuncar, T. Smith, T. Šimko

CERN Library S. Dallmeier-Tiessen, P. Herterich, L. Rueda

ALICE M. Gheata, C. Grigoras

ATLAS K. Cranmer, L. Heinrich, D. Rousseau, F. Socher

CMS A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero

LHCb S. Amerio, B. Couturier, A. Trisovic

CERN CernVM J. Blomer CERN EOS L. Mascetti DASPOS M. Hildreth DPHEP F. Berghaus

@tiborsimko · @inveniosoftware 26 / 26