Data Handling for LHC: Plans and Reality

Tony Cass
Leader, Database Services Group
Information Technology Department
11th July 2012
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary
ATLAS (slide courtesy of Emily Nurse)
We are looking for rare events!
number of events = Luminosity × Cross section
2010 Luminosity: 45 pb⁻¹
Higgs (mH = 120 GeV): 17 pb → ~750 events
Total: 70 billion pb → ~3 trillion events! (N.B. only a very small fraction saved!)
e.g. potentially ~1 Higgs in every 300 billion interactions!
~250x more events to date
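To make the rare-event arithmetic concrete, here is a minimal sketch of the N = L × σ calculation quoted above (45 pb⁻¹ of 2010 luminosity, a 17 pb Higgs cross section and a ~70 billion pb total cross section); the counts are order-of-magnitude only.

```python
# Expected event counts from N = L (integrated luminosity) x sigma (cross section).
# Figures are the ones on the slide above; results are order-of-magnitude only.
integrated_luminosity_pb_inv = 45.0   # 2010 dataset, in inverse picobarns
sigma_higgs_pb = 17.0                 # Higgs production at mH = 120 GeV
sigma_total_pb = 70e9                 # total cross section, ~70 billion pb

n_higgs = integrated_luminosity_pb_inv * sigma_higgs_pb   # ~750 events
n_total = integrated_luminosity_pb_inv * sigma_total_pb   # ~3 trillion events

print(f"Higgs events produced: ~{n_higgs:.0f}")
print(f"All interactions:      ~{n_total:.2e}")
```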
So the four LHC Experiments…
… generate lots of data …
The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors
… generate lots of data … reduced by online computers to a few hundred “good” events per second, which are recorded on disk and magnetic tape at 100-1,000 MegaBytes/sec: ~15 PetaBytes per year for all four experiments.
• Current forecast: ~23-25 PB / year, 100-120M files / year
  – ~20-25K 1 TB tapes / year
• Archive will need to store 0.1 EB in 2014, ~1 billion files in 2015
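As a rough cross-check of the volumes above, a back-of-the-envelope sketch; the per-event size and yearly live time are assumptions for illustration, not figures from the slide.

```python
# Back-of-the-envelope yearly data volume. The event size and live time are
# assumed values chosen only to show the arithmetic, not quoted figures.
events_per_second = 300          # "a few hundred good events per second"
event_size_mb = 1.0              # assumed average event size (MB)
live_seconds_per_year = 1.0e7    # assumed accelerator live time per year (s)
experiments = 4

rate_mb_s = events_per_second * event_size_mb
volume_pb = rate_mb_s * live_seconds_per_year * experiments / 1e9   # MB -> PB

print(f"Recording rate per experiment: ~{rate_mb_s:.0f} MB/s")
print(f"Yearly volume, all experiments: ~{volume_pb:.0f} PB")   # same ballpark as ~15 PB/year
```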
[Chart: CASTOR data written, 01/01/2010 to 29/6/2012, in PB (scale 0-60 PB), broken down by experiment: USER, NTOF, NA61, NA48, LHCB, COMPASS, CMS, ATLAS, AMS, ALICE]
ATLAS Z→μμ event from 2012 data with 25 reconstructed vertices
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary
What is the technique?
Break up a Massive Data Set …
What is the technique?
… into lots of small pieces and distribute them around the world …
What is the technique?
… analyse in parallel …
What is the technique?
… gather the results …
What is the technique?
… and discover the Higgs boson!
Nice result, but… is it novel?
Is it Novel?
Maybe not novel as such, but the implementation is Terascale computing that is widely appreciated!
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary
The Grid
• Timely Technology!
• The WLCG project was deployed to meet LHC computing needs.
• The EDG and EGEE projects organised development in Europe (OSG and others in the US).
Grid Middleware Basics
• Compute Element
  – Standard interface to local workload management systems (batch scheduler)
• Storage Element
  – Standard interface to local mass storage systems
• Resource Broker
  – Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability.
Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
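A toy sketch of the Resource Broker's matching step (illustrative only, not the interface of any of the implementations named above): filter sites on data and CPU availability, then rank the survivors.

```python
# Toy resource-broker matching: pick a site that holds the requested dataset
# and has free CPU, preferring the shortest local queue. Purely illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    name: str
    datasets: set       # datasets available at the site's Storage Element
    free_slots: int     # free execution slots at the site's Compute Element
    queued_jobs: int    # jobs already waiting in the local batch system

def broker(job_dataset: str, sites: list) -> Optional[str]:
    candidates = [s for s in sites if job_dataset in s.datasets and s.free_slots > 0]
    if not candidates:
        return None                                   # no site can run the job right now
    return min(candidates, key=lambda s: s.queued_jobs).name

sites = [
    Site("CERN", {"run2010A", "run2010B"}, free_slots=12, queued_jobs=40),
    Site("FNAL", {"run2010B"},             free_slots=3,  queued_jobs=5),
    Site("RAL",  {"run2010A"},             free_slots=0,  queued_jobs=0),
]
print(broker("run2010B", sites))   # -> FNAL
```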
Job Scheduling in Practice
• Issue
  – Grid sites generally want to maintain a high average CPU utilisation; this is easiest if there is a local queue of work to select from when another job ends.
  – Users are generally interested in turnround times as well as job throughput. Turnround times are shorter if jobs are held centrally until a processing slot is known to be free at a target site.
• Solution: pilot job frameworks.
  – Per-experiment code submits a job which, when it is allocated an execution slot at a site, chooses a work unit to run from a per-experiment queue.
• Pilot job frameworks separate out
  – site responsibility for allocating CPU resources from
  – experiment responsibility for allocating priority between different research sub-groups.
… But note: pilot job frameworks talk directly to the CEs, and we have moved away from a generic solution to one that has a specific framework per VO (although these can be shared in principle).
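A minimal sketch of the pilot pattern described above (queue mechanics and names are invented for illustration; real per-VO frameworks are far richer): the site batch system only ever sees the pilot, and the experiment's central queue decides what the pilot actually runs.

```python
# Toy pilot job: the site batch system starts this in an ordinary execution
# slot; the pilot then pulls real work units from the experiment's central
# task queue until none are left. Purely illustrative.
import queue

central_task_queue = queue.Queue()
for task in ["reco_run1234", "analysis_user_a", "analysis_user_b"]:
    central_task_queue.put(task)

def run_payload(task: str) -> None:
    print(f"running {task} on this worker node")

def pilot() -> None:
    # A real pilot would first validate the local environment (software,
    # scratch space, ...) before pulling work.
    while True:
        try:
            task = central_task_queue.get_nowait()
        except queue.Empty:
            break              # no more work: release the execution slot
        run_payload(task)

pilot()
```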
Data Issues
• Reception and long-term storage
• Delivery for processing and export
• Distribution
• Metadata distribution
[Diagram: data flows of 1430 MB/s, 700 MB/s, 2600 MB/s, 700 MB/s and 420 MB/s (3600 MB/s and >4000 MB/s in parentheses)]
Scheduled work only – and we need the ability to support 2x for recovery!
(Mass) Storage Systems
• After evaluation of commercial alternatives in the late 1990s, two tape-capable mass storage systems have been developed for HEP:
  – CASTOR: an integrated mass storage system
  – dCache: a disk pool manager that interfaces to multiple tape archives (Enstore @ FNAL, IBM’s TSM)
• dCache is also used as a basic disk storage manager at Tier2s, along with the simpler DPM
A Word About Tape
• Our data set may be massive, but… it is made up of many small files…
[Chart: CERN Archive file size distribution, in %, binned from <10K to >2G; ~195 MB average, only increasing slowly after LHC startup!]
• …which is bad for tape speeds:
[Chart: drive write performance vs file size for the CASTOR tape format (ANSI AUL), IBM and SUN drives; average write speed < 40 MB/s (cf native drive speeds of 120-160 MB/s), with only small increases with new drive generations]
Tape Drive Efficiency
So we have to change tape writing policy…
[Chart: drive write performance (MB/s) vs file size (MB), buffered vs non-buffered tape marks: CASTOR present (3 syncs/file), CASTOR new (1 sync/file), CASTOR future (1 sync / 4 GB)]
[Chart: average drive performance (MB/s) for CERN Archive files with 3 syncs/file, 1 sync/file and 1 sync / 4 GB]
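A minimal model of why per-file tape-mark syncs hurt average speed for small files. The native streaming speed and per-sync cost below are assumptions chosen to show the qualitative effect, not measured CASTOR figures.

```python
# Effective tape write speed when every sync (flushed tape mark) stalls the drive.
# Assumed figures: 120 MB/s native streaming speed, ~3 s lost per sync.
NATIVE_MB_S = 120.0    # assumed native drive speed
SYNC_COST_S = 3.0      # assumed time lost per flushed tape mark

def effective_speed(file_size_mb: float, syncs_per_file: float) -> float:
    """Average MB/s achieved while writing one file of the given size."""
    streaming_time = file_size_mb / NATIVE_MB_S
    return file_size_mb / (streaming_time + syncs_per_file * SYNC_COST_S)

for size_mb in (100, 195, 1000, 4000):                 # 195 MB ~ the CERN archive average
    present = effective_speed(size_mb, 3)              # "CASTOR present": 3 syncs/file
    new     = effective_speed(size_mb, 1)              # "CASTOR new": 1 sync/file
    future  = effective_speed(size_mb, size_mb/4096)   # "CASTOR future": ~1 sync per 4 GB
    print(f"{size_mb:5d} MB: {present:5.1f} -> {new:5.1f} -> {future:6.1f} MB/s")
```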
Storage vs Recall Efficiency
• Efficient data acceptance:
  – Have lots of input streams, spread across a number of storage servers,
  – wait until the storage servers are ~full, and
  – write the data from each storage server to tape.
  – Result: data recorded at the same time is scattered over many tapes.
• How is the data read back?
  – Generally, files are grouped by time of creation.
  – How to optimise for this? Group files on to a small number of tapes.
• Ooops…
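A toy illustration of the conflict on this slide (stream and tape counts are invented): if files arriving close together in time are spread across many storage servers and each server flushes to its own tapes, recalling one time window later touches almost every tape.

```python
# Toy model: files arriving in time order are spread round-robin over several
# storage servers, each writing to its own tape; recalling a contiguous time
# window then needs a mount of almost every tape. Numbers are illustrative.
N_FILES = 1000
N_SERVERS = 20           # assume one tape per storage server

tape_of_file = {f: f % N_SERVERS for f in range(N_FILES)}   # round-robin striping

recall_window = list(range(100, 160))                       # 60 consecutive files
tapes_needed = {tape_of_file[f] for f in recall_window}
print(f"Recalling {len(recall_window)} files needs {len(tapes_needed)} tape mounts")
# -> 20 mounts; grouping by creation time on a few tapes would need only 1 or 2.
```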
Keep users away from tape
CASTOR & EOS
Data Distribution
• The LHC experiments need to distribute millions of files between the different sites.
• The File Transfer System automates this
  – handling failures of the underlying distribution technology (gridftp),
  – ensuring effective use of the bandwidth with multiple streams, and
  – managing the bandwidth use
    • ensuring ATLAS, say, is guaranteed 50% of the available bandwidth between two sites if there is data to transfer
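A toy sketch of the share-based scheduling idea above (the shares and bookkeeping are invented; this is not the FTS implementation): when choosing the next queued transfer on a channel, start one for the VO furthest below its configured share.

```python
# Toy channel scheduler: each VO gets a share of a site-to-site channel; the
# next transfer started belongs to the VO furthest below its share.
shares = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}    # invented channel shares
active = {"atlas": 4, "cms": 1, "lhcb": 1}          # transfers currently running
queued = {"atlas": ["f1", "f2"], "cms": ["f3"], "lhcb": []}

def next_vo():
    total = sum(active.values()) or 1
    candidates = [vo for vo in shares if queued[vo]]
    if not candidates:
        return None
    # deficit: how far the VO is below its fair share of the running transfers
    return max(candidates, key=lambda vo: shares[vo] - active[vo] / total)

vo = next_vo()
print(vo, queued[vo][0])   # -> cms f3 (ATLAS is already above its 50% share)
```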
Data Distribution
• FTS uses the Storage Resource Manager as an abstract interface to the different storage systems
  – A Good Idea™, but this is not (IMHO) a complete storage abstraction layer and anyway cannot hide fundamental differences in approaches to MSS design
    • Lots of interest in the Amazon S3 interface these days; this doesn’t try to do as much as SRM, but HEP should try to adopt de facto standards.
• Once you have distributed the data, a file catalogue is needed to record which files are available where.
  – LFC, the LCG File Catalogue, was designed for this role as a distributed catalogue to avoid a single point of failure, but other solutions are also used
    • And as many other services rely on CERN, the need for a distributed catalogue is no longer (seen as…) so important.
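To make the catalogue role concrete, a minimal sketch of what such a catalogue records: a mapping from a logical file name to the sites and physical replicas that hold it. The entries, hostnames and functions are invented for illustration and are not the LFC interface.

```python
# Minimal replica catalogue: logical file name -> set of (site, physical replica).
# Invented example; real catalogues also track checksums, ownership, metadata...
from collections import defaultdict

catalogue = defaultdict(set)

def register(lfn: str, site: str, pfn: str) -> None:
    catalogue[lfn].add((site, pfn))

def replicas(lfn: str) -> list:
    return sorted(catalogue[lfn])

register("/grid/atlas/run1234/evt.root", "CERN", "root://se.site-a.example//atlas/run1234/evt.root")
register("/grid/atlas/run1234/evt.root", "RAL",  "srm://se.site-b.example//atlas/run1234/evt.root")
print(replicas("/grid/atlas/run1234/evt.root"))
```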
Looking more widely — I
• Only a small subset of the data distributed is actually used
• Experiments don’t know a priori which datasets will be popular
  – CMS sees 8 orders of magnitude in access rates between the most and least popular
Dynamic data replication: create copies of popular datasets at multiple sites.
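A toy sketch of the dynamic replication idea (dataset names and thresholds are invented): monitor accesses per dataset, add replicas for datasets whose recent popularity crosses a threshold, and trim copies of unpopular ones.

```python
# Toy dynamic data replication: add replicas for "hot" datasets and reclaim
# space from "cold" ones. All numbers are purely illustrative.
accesses_last_week = {"dsA": 120_000, "dsB": 450, "dsC": 3}
replicas = {"dsA": 2, "dsB": 2, "dsC": 2}            # current replica counts

HOT, COLD = 10_000, 10                               # invented popularity thresholds
MAX_REPLICAS, MIN_REPLICAS = 6, 1

for ds, accesses in accesses_last_week.items():
    if accesses > HOT and replicas[ds] < MAX_REPLICAS:
        replicas[ds] += 1        # copy the popular dataset to another site
    elif accesses < COLD and replicas[ds] > MIN_REPLICAS:
        replicas[ds] -= 1        # drop a copy of the unpopular dataset

print(replicas)                  # -> {'dsA': 3, 'dsB': 2, 'dsC': 1}
```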
Looking more widely — II
Fibre cut during tests in 2009: capacity reduced, but alternative links took over
[Diagram: the MONARC model (2000): CERN (n·10^7 MIPS, m PByte robot), regional centres such as FNAL (4·10^7 MIPS, 110 TByte robot) and universities (n·10^6 MIPS, m TByte robot) linked by 622 Mbit/s (and N × 622 Mbit/s) connections, with desktops at the edges]
• Network capacity is readily available…
• … and it is reliable:
• So let’s simply copy data from another site if it is not available locally
  – rather than recalling from tape or failing the job.
• Inter-connectedness is increasing with the design of LHCOne to deliver (multi-) 10Gb links between Tier2s.
Metadata Distribution
• Conditions data is needed to make sense of the raw data from the experiments
  – Data on items such as temperatures, detector voltages and gas compositions is needed to turn the ~100M pixel image of the event into a meaningful description in terms of particles, tracks and momenta.
• This data is in an RDBMS, Oracle at CERN, and presents interesting distribution challenges
  – One cannot tightly couple databases across the loosely coupled WLCG sites, for example…
  – Oracle Streams technology was improved to deliver the necessary performance, and http caching systems were developed to address the need for cross-DBMS distribution.
Average Streams Throughput (LCR/s):

| Row size | Oracle 10g | Oracle 11gR2 | Oracle 11gR2 (optimized) |
|----------|------------|--------------|--------------------------|
| 100 B    | 4600       | 37000        | 40000                    |
| 500 B    | 2800       | 30000        | 40000                    |
| 1000 B   | 1700       | 25000        | 34000                    |
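To illustrate the http-caching approach to conditions distribution mentioned above (a generic read-through cache sketch, not the API of any production system; names and payloads are invented):

```python
# Generic read-through cache for conditions data: a site-local cache answers
# repeated queries, and only misses go back to the central database service.
cache = {}   # (run, tag) -> conditions payload held at the site

def fetch_from_central_db(run: int, tag: str) -> dict:
    print(f"central DB query for run={run} tag={tag}")     # expensive round trip
    return {"run": run, "tag": tag, "hv_volts": 1530.0, "temperature_c": 19.2}

def get_conditions(run: int, tag: str) -> dict:
    key = (run, tag)
    if key not in cache:                  # miss: one query to the central database
        cache[key] = fetch_from_central_db(run, tag)
    return cache[key]                     # hit: served locally, no extra DB load

get_conditions(201234, "COND-2012-01")    # first job at the site hits the DB
get_conditions(201234, "COND-2012-01")    # later jobs are served from the cache
```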
Job Execution Environment
• Jobs submitted to sites depend on large, rapidly changing libraries of experiment-specific code
  – Major problems ensue if updated code is not distributed to every server across the grid (remember, there are x0,000 servers…)
  – Shared filesystems can become a bottleneck if used as a distribution mechanism within a site.
• Approaches
  – The pilot job framework can check to see if the execution host has the correct environment…
  – A global caching file system: CernVM-FS.
[Figure, 2011: ATLAS today, 22/1.8M files; 921/115 GB]
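One idea behind a global caching file system like CernVM-FS is that software files are addressed by a hash of their content, so files that are unchanged between releases are stored, transferred and cached only once. Below is a minimal, hedged sketch of that de-duplication idea (paths and release contents are invented; this is not the CernVM-FS implementation).

```python
# Minimal content-addressed software store: identical files shared between
# releases are kept (and would be downloaded) only once. Illustrative only.
import hashlib

store = {}      # content hash -> file bytes (the de-duplicated object store)
catalog = {}    # (release, path) -> content hash (what a release's tree points at)

def publish(release: str, path: str, content: bytes) -> None:
    digest = hashlib.sha1(content).hexdigest()
    store.setdefault(digest, content)        # stored once, however many releases use it
    catalog[(release, path)] = digest

publish("17.0.1", "lib/libCore.so", b"core-v1")
publish("17.0.2", "lib/libCore.so", b"core-v1")    # unchanged between releases
publish("17.0.2", "lib/libReco.so", b"reco-v2")    # new in this release

print(len(catalog), "catalog entries,", len(store), "stored objects")   # 3 entries, 2 objects
```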
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary
Towards the Future
• Learning from our mistakes
  – We have just completed a review of WLCG operations and services, based on 2+ years of operations, with the aim to simplify and harmonise during the forthcoming long shutdown.
  – Key areas to improve are data management & access and exploiting many/multi-core architectures, especially with use of virtualisation.
• Clouds
• Identity Management
Integrating With The Cloud?
[Diagram: a Central Task Queue, Sites A, B and C, a commercial cloud and a Shared Image Repository (VMIC); labelled flows include the User, a VO service making instance requests, payload pull from the task queue, an image maintainer, and cloud bursting]
Slide courtesy of Ulrich Schwickerath
Grid Middleware Basics
• Compute Element
  – Standard interface to local workload management systems (batch scheduler)
• Storage Element
  – Standard interface to local mass storage systems
• Resource Broker
  – Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability.
Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
None of this works without…
Trust!
One step beyond?
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary
Summary
• WLCG has delivered the capability to manage and distribute the large volumes of data generated by the LHC experiments
  – and the excellent WLCG performance has enabled physicists to deliver results rapidly.
• HEP datasets may not be the most complex or (any longer) massive, but in addressing the LHC computing challenges, the community has delivered
  – the world’s largest computing Grid,
  – practical solutions to requirements for large-scale data storage, distribution and access, and
  – a global trust federation enabling world-wide collaboration.
Thank You!
And thanks to Vlado Bahyl, German Cancio, Ian Bird, Jakob Blomer, Eva Dafonte Perez, Fabiola Gianotti, Frédéric Hemmer, Jan Iven, Alberto Pace and Romain Wartel of CERN, Elisa Lanciotti of PIC and K. De, T. Maeno, and S. Panitkin of ATLAS for various unattributed graphics and slides.