Data Handling for LHC: Plans and Reality
1
Data Handling for LHC: Plans and Reality
Tony Cass, Leader, Database Services Group
Information Technology Department
11th July 2012
2
Outline
• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
– In outline
– In more detail
• Towards the Future
• Summary
55
We are looking for rare events! (slide courtesy of Emily Nurse, ATLAS)
number of events = Luminosity × Cross section
2010 Luminosity: 45 pb⁻¹
Higgs (mH = 120 GeV): 17 pb → ~750 events
70 billion pb → ~3 trillion events!* (*N.B. only a very small fraction saved!)
e.g. potentially ~1 Higgs in every 300 billion interactions!
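As a sanity check, the formula on the slide can be evaluated directly with the quoted numbers (45 pb⁻¹ of 2010 luminosity and a 17 pb Higgs cross section); a minimal sketch:

```python
# Expected events = integrated luminosity × cross section (pb⁻¹ × pb).
def expected_events(luminosity_inv_pb, cross_section_pb):
    return luminosity_inv_pb * cross_section_pb

# 2010 numbers from the slide.
higgs = expected_events(45, 17)     # 765, i.e. the ~750 events quoted
total = expected_events(45, 70e9)   # ~3.15e12, i.e. the ~3 trillion quoted
print(higgs, total)
```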
7
~250x more events to date
22
So the four LHC Experiments…
23
… generate lots of data …
The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors
24
… generate lots of data …
reduced by online computers to a few hundred “good” events per second,
which are recorded on disk and magnetic tape at 100-1,000 MegaBytes/sec
~15 PetaBytes per year for all four experiments
• Current forecast ~23-25 PB / year, 100-120M files / year
– ~20-25K 1 TB tapes / year
• Archive will need to store 0.1 EB in 2014, ~1 billion files in 2015
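A quick back-of-envelope check of these forecast numbers (assuming 1 TB tapes and a full calendar year of running, as quoted above):

```python
# Back-of-envelope check of the archive forecast.
PB = 1e15
TB = 1e12
SECONDS_PER_YEAR = 365 * 24 * 3600

yearly_volume = 25 * PB                      # upper forecast, ~25 PB/year
tapes = yearly_volume / (1 * TB)             # with 1 TB tapes: 25,000/year
avg_rate = yearly_volume / SECONDS_PER_YEAR  # sustained average, bytes/s
print(int(tapes), round(avg_rate / 1e6))     # 25000 tapes, ~793 MB/s
```

The sustained average of ~0.8 GB/s sits comfortably inside the quoted 100-1,000 MB/s recording range, which is why the peaks, not the average, drive the design.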
[Chart: CASTOR data written, 01/01/2010 to 29/6/2012, in PB (0-60 scale), broken down by experiment: ALICE, AMS, ATLAS, CMS, COMPASS, LHCB, NA48, NA61, NTOF, USER]
[Event display: ATLAS Z→μμ event from 2012 data with 25 reconstructed vertices]
26
What is the technique? Break up a massive data set …
27
What is the technique? … into lots of small pieces and distribute them around the world …
28
What is the technique? … analyse in parallel …
29
What is the technique? … gather the results …
30
What is the technique? … and discover the Higgs boson:
Nice result, but… is it novel?
31
Is it Novel? Maybe not novel as such, but the implementation is Terascale computing that is widely appreciated!
34
The Grid
• Timely Technology!
• The WLCG project deployed to meet LHC computing needs.
• The EDG and EGEE projects organised development in Europe. (OSG and others in the US.)
35
Grid Middleware Basics
• Compute Element
– Standard interface to local workload management systems (batch scheduler)
• Storage Element
– Standard interface to local mass storage systems
• Resource Broker
– Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability.
Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
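The Resource Broker's matching step can be sketched as below; the site names, record fields and ranking rule are illustrative assumptions for this sketch, not real WLCG configuration:

```python
# Hypothetical sketch of a Resource Broker: keep only sites that hold the
# job's input dataset and have free CPU slots, then rank by free capacity.
def broker_rank(job, sites):
    candidates = [
        s for s in sites
        if job["dataset"] in s["datasets"] and s["free_slots"] > 0
    ]
    # Prefer the site with the most free slots (shortest expected queue).
    return sorted(candidates, key=lambda s: -s["free_slots"])

sites = [
    {"name": "Tier1-A", "datasets": {"ds1"}, "free_slots": 10},
    {"name": "Tier1-B", "datasets": {"ds1", "ds2"}, "free_slots": 40},
    {"name": "Tier2-C", "datasets": {"ds2"}, "free_slots": 5},
]
job = {"dataset": "ds1", "cpu_hours": 2}
print([s["name"] for s in broker_rank(job, sites)])  # ['Tier1-B', 'Tier1-A']
```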
36
Job Scheduling in Practice
• Issue
– Grid sites generally want to maintain a high average CPU utilisation; easiest to do this if there is a local queue of work to select from when another job ends.
– Users are generally interested in turnround times as well as job throughput. Turnround is reduced if jobs are held centrally until a processing slot is known to be free at a target site.
• Solution: Pilot job frameworks.
– Per-experiment code submits a job which chooses a work unit to run from a per-experiment queue when it is allocated an execution slot at a site.
• Pilot job frameworks separate out
– site responsibility for allocating CPU resources from
– experiment responsibility for allocating priority between different research sub-groups.
36
… But note: pilot job frameworks talk directly to the CEs, and we have moved away from a generic solution to one that has a specific framework per VO (although these can be shared in principle).
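The pilot-job idea can be sketched in a few lines: the site batch system runs a generic pilot, and the work unit is chosen from a per-experiment queue only once the slot is actually allocated. The queue contents and priority scheme here are illustrative:

```python
# Hedged sketch of a pilot job: late binding of work to execution slots.
from collections import deque

experiment_queue = deque([
    {"task": "reco_run_1001", "priority": 1},
    {"task": "user_analysis_42", "priority": 2},
])

def pilot(queue):
    """Runs when the site batch system allocates us a slot."""
    if not queue:
        return None                    # nothing pending: pilot exits quietly
    work = min(queue, key=lambda w: w["priority"])
    queue.remove(work)                 # experiment decides priority, not site
    return work["task"]

print(pilot(experiment_queue))  # 'reco_run_1001'
print(pilot(experiment_queue))  # 'user_analysis_42'
print(pilot(experiment_queue))  # None
```

This is what separates the two responsibilities on the slide: the site only ever schedules anonymous pilots, while the experiment's queue ordering decides which sub-group's work actually runs.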
37
Data Issues
• Reception and long-term storage
• Delivery for processing and export
• Distribution
• Metadata distribution
[Diagram: data flow rates of 1430 MB/s, 700 MB/s, 2600 MB/s, 700 MB/s and 420 MB/s, with peaks of (3600 MB/s) and (>4000 MB/s)]
Scheduled work only – and we need ability to support 2x for recovery!
38
(Mass) Storage Systems
• After evaluation of commercial alternatives in the late 1990s, two tape-capable mass storage systems have been developed for HEP:
– CASTOR: an integrated mass storage system
– dCache: a disk pool manager that interfaces to multiple tape archives (Enstore @ FNAL, IBM’s TSM)
• dCache is also used as a basic disk storage manager at Tier2s, along with the simpler DPM
39
A Word About Tape
• Our data set may be massive, but… it is made up of many small files…
[Chart: CERN Archive file size distribution in %, by size bin: <10K, 10K-100K, 100K-1M, 1M-10M, 10M-100M, 100M-500M, 500M-1G, 1G-2G, >2G]
~195 MB average, only increasing slowly after LHC startup!
…which is bad for tape speeds:
[Chart: drive write speed (KB/s) vs file size (MB), CASTOR tape format (ANSI AUL), IBM AUL and SUN AUL drives]
Average write drive speed: < 40 MB/s (cf. native drive speeds: 120-160 MB/s). Small increases with new drive generations.
40
Tape Drive Efficiency
So we have to change tape writing policy…
[Chart: drive write performance (MB/s) vs file size (MB), buffered vs non-buffered tape marks – CASTOR present (3 sync/file), CASTOR new (1 sync/file), CASTOR future (1 sync / 4GB)]
[Chart: average drive performance (MB/s) for CERN Archive files under the 3 sync/file, 1 sync/file and 1 sync / 4GB policies]
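A simple model shows why per-file tape marks hurt so much: each sync stalls the drive for a fixed time, so small files never reach streaming speed. The native speed and per-sync cost below are assumed round numbers for illustration, not measured CASTOR figures:

```python
# Toy model: effective tape speed = bytes written / (streaming time + sync
# stalls). Assumed: ~130 MB/s native drive speed, ~3 s lost per tape mark.
def effective_speed(file_mb, native_mb_s=130.0, syncs_per_file=3,
                    sync_cost_s=3.0):
    write_time = file_mb / native_mb_s + syncs_per_file * sync_cost_s
    return file_mb / write_time

small = effective_speed(195)                    # ~195 MB files, 3 sync/file
buffered = effective_speed(195, syncs_per_file=195 / 4096)  # ~1 sync / 4 GB
print(round(small, 1), round(buffered, 1))
```

With these assumptions the average-size archive file writes at well under 40 MB/s under the old 3-sync-per-file policy, but approaches native speed once tape marks are buffered to roughly one per 4 GB, which is the shape of the chart above.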
43
Storage vs Recall Efficiency
43
• Efficient data acceptance:
– Have lots of input streams, spread across a number of storage servers,
– wait until the storage servers are ~full, and
– write the data from each storage server to tape.
– Result: data recorded at the same time is scattered over many tapes.
• How is the data read back?
– Generally, files grouped by time of creation.
– How to optimise for this? Group files on to a small number of tapes.
• Ooops…
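The tension can be made concrete with a toy simulation: files flushed to tape per storage server versus files grouped by creation time, then a recall of one day's data. The file counts and server count are illustrative:

```python
# Toy comparison of tape-write policies: how many tapes must be mounted to
# recall one day's files? (10 storage servers, 100 files per day, invented.)
def tapes_to_mount(files, tape_of):
    return len({tape_of(f) for f in files})

files = [{"id": i, "day": i // 100, "server": i % 10} for i in range(1000)]
day0 = [f for f in files if f["day"] == 0]

scattered = tapes_to_mount(day0, lambda f: f["server"])  # one tape per server
grouped = tapes_to_mount(day0, lambda f: f["day"])       # one tape per day
print(scattered, grouped)  # 10 1
```

The write-efficient policy touches every server's tape for a single day's recall; grouping by creation time makes the same recall a single mount, at the cost of less efficient acceptance.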
44
Keep users away from tape
44
45
CASTOR & EOS
47
Data Distribution
• The LHC experiments need to distribute millions of files between the different sites.
• The File Transfer Service (FTS) automates this
– handling failures of the underlying distribution technology (gridftp)
– ensuring effective use of the bandwidth with multiple streams, and
– managing the bandwidth use
• ensuring ATLAS, say, is guaranteed 50% of the available bandwidth between two sites if there is data to transfer
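The bandwidth-share idea can be sketched as a simple allocation function; the VO names, the share values and the rule of redistributing idle VOs' shares are illustrative assumptions, not FTS's actual algorithm:

```python
# Sketch of per-VO bandwidth shares on one link: each VO has a guaranteed
# fraction, and shares of VOs with nothing to transfer are redistributed.
def allocate(link_mb_s, shares, active):
    """shares: VO -> guaranteed fraction; active: VOs with pending data."""
    live = {vo: shares[vo] for vo in active}
    total = sum(live.values())
    return {vo: link_mb_s * frac / total for vo, frac in live.items()}

shares = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}
print(allocate(1000, shares, {"atlas", "cms", "lhcb"}))
print(allocate(1000, shares, {"atlas", "lhcb"}))  # CMS idle: share is split
```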
48
Data Distribution
• FTS uses the Storage Resource Manager as an abstract interface to the different storage systems
– A Good Idea™, but this is not (IMHO) a complete storage abstraction layer, and anyway it cannot hide fundamental differences in approaches to MSS design
• Lots of interest in the Amazon S3 interface these days; this doesn’t try to do as much as SRM, but HEP should try to adopt de facto standards.
• Once you have distributed the data, a file catalogue is needed to record which files are available where.
– LFC, the LCG File Catalogue, was designed for this role as a distributed catalogue to avoid a single point of failure, but other solutions are also used
• And as many other services rely on CERN, the need for a distributed catalogue is no longer (seen as…) so important.
49
Looking more widely — I
49
• Only a small subset of data distributed is actually used
• Experiments don’t know a priori which dataset will be popular
– CMS has 8 orders of magnitude in access between most and least popular
• Dynamic data replication: create copies of popular datasets at multiple sites.
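Dynamic replication can be sketched as a popularity counter that triggers extra replicas past a threshold; the threshold value and the bookkeeping are illustrative, not any experiment's actual policy:

```python
# Sketch of popularity-driven replication: count accesses per dataset and
# request another copy each time a (made-up) threshold is crossed.
from collections import Counter

class Replicator:
    def __init__(self, threshold=100):
        self.threshold = threshold
        self.accesses = Counter()
        self.replicas = {}               # dataset -> number of copies

    def record_access(self, dataset):
        self.accesses[dataset] += 1
        wanted = 1 + self.accesses[dataset] // self.threshold
        if wanted > self.replicas.get(dataset, 1):
            self.replicas[dataset] = wanted  # trigger a copy to a new site

r = Replicator(threshold=100)
for _ in range(250):
    r.record_access("popular_ds")
r.record_access("rare_ds")
print(r.replicas.get("popular_ds", 1), r.replicas.get("rare_ds", 1))  # 3 1
```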
50
Looking more widely — II
50
[Diagram: MONARC 2000 network model: 622 Mbit/s links connecting CERN (n·10⁷ MIPS, m PByte robot), FNAL (4·10⁷ MIPS, 110 TByte robot), universities (n·10⁶ MIPS, m TByte robot) and desktops]
Fibre cut during tests in 2009: capacity reduced, but alternative links took over.
• Network capacity is readily available…
• … and it is reliable:
• So let’s simply copy data from another site if it is not available locally
– rather than recalling from tape or failing the job.
• Inter-connectedness is increasing with the design of LHCONE to deliver (multi-) 10Gb links between Tier2s.
51
Metadata Distribution
• Conditions data is needed to make sense of the raw data from the experiments
– Data on items such as temperatures, detector voltages and gas compositions is needed to turn the ~100M pixel image of the event into a meaningful description in terms of particles, tracks and momenta.
• This data is in an RDBMS, Oracle at CERN, and presents interesting distribution challenges
– One cannot tightly couple databases across the loosely coupled WLCG sites, for example…
– Oracle Streams technology improved to deliver the necessary performance, and http caching systems developed to address the need for cross-DBMS distribution.

Average Streams Throughput (LCR/s):
                          row size = 100B   row size = 500B   row size = 1000B
Oracle 10g                     4600              2800              1700
Oracle 11gR2                  37000             30000             25000
Oracle 11gR2 (optimized)      40000             40000             34000
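The http-caching approach to conditions data can be sketched as a read-through cache in front of the central database; the run number and conditions payload below are made up, and the backend function is a stub standing in for an Oracle round trip:

```python
# Sketch of a read-through conditions cache: repeated queries for the same
# run's conditions hit the central database only once.
backend_queries = []

def fetch_from_db(run):
    backend_queries.append(run)          # stands in for an Oracle round trip
    return {"run": run, "hv": 1530.0, "temp_k": 291.2}  # invented payload

cache = {}

def get_conditions(run):
    if run not in cache:                 # miss: go to the central database
        cache[run] = fetch_from_db(run)
    return cache[run]

get_conditions(204564)
get_conditions(204564)                   # served from cache
print(len(backend_queries))              # 1: a single database round trip
```

Since conditions for a finished run never change, the entries never need invalidating, which is what makes plain http caching such a good fit here.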
52
Job Execution Environment
• Jobs submitted to sites depend on large, rapidly changing libraries of experiment-specific code
– Major problems ensue if updated code is not distributed to every server across the grid (remember, there are x0,000 servers…)
– Shared filesystems can become a bottleneck if used as a distribution mechanism within a site.
• Approaches
– Pilot job framework can check to see if the execution host has the correct environment…
– A global caching file system: CernVM-FS.
[Chart, 2011: ATLAS today – 22/1.8M files; 921/115 GB]
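The CernVM-FS idea, in miniature: software files are addressed by content hash, so site caches can fetch, verify and deduplicate them. The dict-based "remote repository" below is a stand-in for the real http-served repository:

```python
# Sketch of a content-addressed software cache (CernVM-FS in miniature):
# publish files by hash, fetch on first use, verify integrity from the hash.
import hashlib

remote_repo = {}                          # hash -> file content ("server")

def publish(content: bytes) -> str:
    digest = hashlib.sha1(content).hexdigest()
    remote_repo[digest] = content
    return digest

local_cache = {}

def open_cached(digest: str) -> bytes:
    if digest not in local_cache:         # first use at this site: download
        local_cache[digest] = remote_repo[digest]
    data = local_cache[digest]
    assert hashlib.sha1(data).hexdigest() == digest  # integrity check
    return data

h = publish(b"experiment library v42")
print(open_cached(h) == b"experiment library v42")  # True
```

Because content is immutable under its hash, caches at every tier can be layered freely; only the small catalogue mapping paths to hashes needs to be kept up to date.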
54
Towards the Future
• Learning from our mistakes
– We have just completed a review of WLCG operations and services based on 2+ years of operations, with the aim to simplify and harmonise during the forthcoming long shutdown.
– Key areas to improve are data management & access and exploiting many/multi-core architectures, especially with use of virtualisation.
• Clouds
• Identity Management
57
Integrating With The Cloud?
[Diagram: a Central Task Queue dispatches payloads to Sites A, B and C and, via cloud bursting, to a commercial cloud; a VO service makes instance requests; virtual machine images come from a Shared Image Repository (VMIC) kept by an image maintainer; running instances pull payloads on behalf of the user]
Slide courtesy of Ulrich Schwickerath
60
Grid Middleware Basics
• Compute Element
– Standard interface to local workload management systems (batch scheduler)
• Storage Element
– Standard interface to local mass storage systems
• Resource Broker
– Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability.
Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
None of this works without…
61
Trust!
62
One step beyond?
64
Summary
• WLCG has delivered the capability to manage and distribute the large volumes of data generated by the LHC experiments
– and the excellent WLCG performance has enabled physicists to deliver results rapidly.
• HEP datasets may not be the most complex or (any longer) massive, but in addressing the LHC computing challenges, the community has delivered
– the world’s largest computing Grid,
– practical solutions to requirements for large-scale data storage, distribution and access, and
– a global trust federation enabling world-wide collaboration.
65
Thank You!
And thanks to Vlado Bahyl, German Cancio, Ian Bird, Jakob Blomer, Eva Dafonte Perez, Fabiola Gianotti, Frédéric Hemmer, Jan Iven, Alberto Pace and Romain Wartel of CERN, Elisa Lanciotti of PIC and K. De, T. Maeno, and S. Panitkin of ATLAS for various unattributed graphics and slides.