Oracle Tech Day November 2004
Building the world’s largest Scientific Grid
Jamie Shiers, Database Group, CERN, Geneva, Switzerland
Agenda
• The Need for a World-Wide Grid
• An Overview of the World’s Largest Scientific Grid
• The role of the Database Group in the above
• The role of the CERN openlab for DataGrid Applications
• The role of Enterprise Grids
• Summary and Conclusions
The Requirements
The Large Hadron Collider at CERN (LHC)
The European Organisation for Nuclear Research
The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 20 European countries (the member states) + others (Russia, US, Canada, India, …)
• ~1 BSF (billion Swiss francs) budget: operation + new accelerators
• 2000 staff + 6000 users (researchers) from all over the world
New accelerator: the Large Hadron Collider (LHC)
[Aerial view: the 27 km LHC ring near Geneva, with the airport and the CERN Computer Centre marked]
The LHC machine
• Two counter-circulating proton beams
• Collision Energy 7 + 7 TeV
• 27 km of magnets with a field of 8.4 Tesla
• Superfluid helium cooled to 1.9 K
• The world’s largest superconducting structure!
The ATLAS Detector
• The ATLAS collaboration is
– ~2000 physicists from…
– ~150 universities and labs
– from ~35 countries
– distributed resources
– remote development
• The ATLAS detector is
– 26 m long
– stands 20 m high
– weighs 7000 tons
– has 200 million read-out channels
• One of 4 LHC experiments
– ALICE, ATLAS, CMS, LHCb
LHC: Higgs Decay into 4 muons
[Event display: all charged tracks with pt > 2 GeV; reconstructed tracks with pt > 25 GeV (+30 minimum bias events)]
• Selectivity: 1 in 10^13
– 1 person in a thousand world populations
– a needle in 20 million haystacks
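The analogies above are rough arithmetic; a quick check in Python (the 2004 world population and the straw count per haystack are assumptions, not slide figures):

```python
# Rough check of the 1-in-10^13 selectivity analogies.
world_pop_2004 = 6.4e9                 # assumed ~6.4 billion people in 2004
print(f"{1000 * world_pop_2004:.1e}")  # 6.4e12: a thousand world populations
straws_per_haystack = 5e5              # assumed ~half a million straws each
print(f"{20e6 * straws_per_haystack:.1e}")  # 1.0e13: 20 million haystacks
```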
LHC data
• 40 million collisions per second
• After filtering, 100-200 collisions of interest per second
• 1-10 Megabytes of data digitised for each collision = recording rate of 0.1-1 Gigabytes/sec
• 10^10 collisions recorded each year = ~15 Petabytes/year of data (arithmetic sketched below)
[The four LHC experiments: CMS, LHCb, ATLAS, ALICE]
• 1 Megabyte (1 MB): a digital photo
• 1 Gigabyte (1 GB) = 1000 MB: a DVD movie
• 1 Terabyte (1 TB) = 1000 GB: world annual book production
• 1 Petabyte (1 PB) = 1000 TB: annual data production of one LHC experiment
• 1 Exabyte (1 EB) = 1000 PB: world annual information production
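The recording rate and yearly volume follow from simple arithmetic; a minimal Python sketch (the pairing of event sizes with rates and the ~1.5 MB average event size are assumptions chosen to reproduce the slide's 0.1-1 GB/s and ~15 PB/year figures):

```python
# Back-of-envelope check of the LHC data rates quoted above.
MB = 1e6  # bytes

rate_low = 1 * MB * 100     # 1 MB/event at 100 Hz
rate_high = 10 * MB * 100   # 10 MB/event at 100 Hz
print(f"recording rate: {rate_low / 1e9:.1f}-{rate_high / 1e9:.1f} GB/s")

# 10^10 recorded collisions/year at an assumed ~1.5 MB average:
volume_pb = 1e10 * 1.5 * MB / 1e15
print(f"yearly volume: ~{volume_pb:.0f} PB")
```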
Agenda
• The Need for a World-Wide Grid
• An Overview of the World’s Largest Scientific Grid
• The role of the Database Group in the above
• The role of the CERN openlab for DataGrid Applications
• The role of Enterprise Grids
• Summary and Conclusions
The Solution
The LHC Computing Grid (LCG)
LCG Project Goals
• To prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors
• Applications development environment, common tools and frameworks
• Build and operate the LHC computing service
• The Grid is just a tool towards achieving this goal
LCG-2/EGEE-0 Status, 24-09-2004
Total: 78 sites, ~9000 CPUs, 6.5 PB
[World map of the participating sites, including Cyprus]
Building a Grid: the virtual LHC Computing Centre
[Diagram: the Grid joins collaborating computer centres into per-experiment virtual organisations, e.g. the ATLAS Virtual Organisation and the CMS Virtual Organisation]
[Map of the LCG hierarchy: Tier-1 centres (RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL, TRIUMF), Tier-2 and smaller centres (USC, NIKHEF, Krakow, CIEMAT, Rome, Taipei, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC), down to desktops and portables]
LHC Computing Model
• Tier-0 – CERN
– Filter raw data
– Reconstruct event summary data (ESD)
– Record raw data and ESD
– Distribute raw data and ESD to the Tier-1s
• Tier-1
– Data-heavy analysis
– Re-processing of raw data and ESD
– National and regional support
• Tier-2
– End-user analysis: batch and interactive
(a toy sketch of the Tier-0 duties follows)
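As a toy illustration of the Tier-0 duties listed above (all names and the trivial "reconstruction" are invented for this sketch; the real processing chain is far more elaborate):

```python
from dataclasses import dataclass, field

@dataclass
class Tier1Centre:
    name: str
    raw: list = field(default_factory=list)   # copies of raw data
    esd: list = field(default_factory=list)   # event summary data

def reconstruct(raw_event):
    """Stand-in for reconstruction: derive a small ESD summary."""
    return {"id": raw_event["id"], "n_bytes": len(raw_event["payload"])}

def tier0_process(raw_event, tape_archive, tier1s):
    """Tier-0: reconstruct ESD, record raw + ESD, distribute both."""
    esd = reconstruct(raw_event)
    tape_archive.append((raw_event, esd))   # record at CERN
    for t1 in tier1s:                       # distribute to the Tier-1s
        t1.raw.append(raw_event)
        t1.esd.append(esd)

archive = []
tier1s = [Tier1Centre("RAL"), Tier1Centre("FZK"), Tier1Centre("BNL")]
tier0_process({"id": 1, "payload": b"\x00" * 1024}, archive, tier1s)
print(tier1s[0].name, len(tier1s[0].raw), tier1s[0].esd)
```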
[Map: data distribution from CERN to the Tier-1 and Tier-2 centres, ~70 Gbits/sec in aggregate]
Current estimates of computing resources needed at major LHC centres
First full year of data: 2008

                                        Processing      Disk   Mass storage
                                        (M SI2000**)    (PB)   (PB)
CERN                                         20            5        20
Major data handling centres (Tier-1)         45           20        18
Other large centres (Tier-2)                 40           12         5
Totals                                      105           37        43

** A current fast processor is ~1K SI2000 (conversion sketched below)
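To put the SI2000 units in scale, a quick conversion using the footnote's rule of thumb (the dual-CPU box and the 2008 speed-up are assumptions, not slide figures):

```python
# Convert M SI2000 into rough processor counts, using the footnote:
# one current (2004) fast processor is ~1K SI2000.
msi2000 = {"CERN": 20, "Tier-1 total": 45, "Tier-2 total": 40}

for centre, m in msi2000.items():
    cpus = m * 1e6 / 1e3    # M SI2000 -> 2004-vintage processors
    print(f"{centre}: ~{cpus:,.0f} CPUs (~{cpus / 2:,.0f} dual-CPU boxes)")

# CERN comes out at ~20,000 CPUs, i.e. ~10,000 dual-CPU boxes at 2004
# speeds; reconciling that with "7,000 boxes in 2008" implies ~1.4x
# faster processors by then (an assumption).
```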
2004 Data Challenges
• Large-scale tests of the experiments' computing models, processing chains, grid technology readiness and operating infrastructure
• The big challenge for this year is data: file catalogue, replica management, database access, integrating mass storage
• Grid Operations Centre at RAL; User Support Centre at FZK; planning for a second operations & support centre in Taipei
Experiences during the data challenges
Data Challenges – ALICE
• Phase I: 120k Pb+Pb events produced in 56k jobs; 1.3 million files (26 TB) in Castor@CERN; total CPU: 285 MSI2k hours (one 2.8 GHz PC working for 35 years); ~25% produced on LCG-2
• Phase II (underway): 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours of CPU; ~15% on LCG-2
Data Challenges – ATLAS
[Pie chart, ATLAS DC2 CPU usage: 41% / 30% / 29% split across LCG, NorduGrid and Grid3]
• Phase I: 7.7 million events fully simulated (Geant 4) in 95,000 jobs; 22 TB; total CPU: 972 MSI2k hours; >40% produced on LCG-2 (used LCG-2, Grid3, NorduGrid)
[Pie chart, ATLAS DC2 on LCG in September: distribution of jobs over ~30 sites: at.uibk, ca.triumf, ca.ualberta, ca.umontreal, ca.utoronto, ch.cern, cz.golias, cz.skurut, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.infn.cnaf, it.infn.lnl, it.infn.mi, it.infn.na, it.infn.roma, it.infn.to, it.infn.lnf, jp.icepp, nl.nikhef, pl.zeus, ru.msu, tw.sinica, uk.bham, uk.ic, uk.lancs, uk.man, uk.rl]
Data Challenges – CMS
• ~30 M events produced; 25 Hz reached (though only once, for a full day)
• Exercised RLS, Castor, control systems, T1 storage, …
• Not a CPU challenge, but a full-chain demonstration
• Pre-challenge production in 2003/04: 70 M Monte Carlo events (30 M with Geant-4) produced; classic and grid (CMS/LCG-0, LCG-1, Grid3) productions
Data Challenges – LHCb
[Plot: production rate over time: DIRAC alone; LCG in action (1.8×10^6 events/day); LCG paused; LCG restarted (3-5×10^6 events/day)]
• Phase I: 186 M events, 61 TB; total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites); up to 5600 concurrent running jobs in LCG-2
• This is 5-6 times what was possible at CERN alone
Data challenges – Summary
• Quite possibly the first time such a set of large-scale grid productions has been done
– Significant efforts invested on all sides – very fruitful collaborations
– Middleware is actually quite stable now
– But the single largest issue is the lack of stable operations
• Close to 500 TB (half a PB…) of data stored
Preparing for 7,000 boxes in 2008
LCG Summary
• The LHC Computing Grid is real and is running production
• From a ‘single world-wide Grid’ to a ‘federation of Grids’
• See also Economist, October 7th 2004
Agenda
• The Need for a World-Wide Grid
• An Overview of the World’s Largest Scientific Grid
• The role of the Database Group in the above
• The role of the CERN openlab for DataGrid Applications
• The role of Enterprise Grids
• Summary and Conclusions
The Role of the Database Group
CERN Database Group
• Provides support for Oracle-based solutions across the whole spectrum of the laboratory's activities:
– Internal e-business applications
• uses, inter alia, Oracle HR
– Technical infrastructure for the laboratory
• accelerator + detector design, construction and operation
– Physics-related services
• will be used in real-time mode for detector monitoring and calibration
• also for storing some fraction of the scientific data
• CERN has been an Oracle customer for more than 20 years!
– http://cern.ch/db/
DB Group – Physics Activities
• Develop and maintain physics-related applications
– POOL persistency framework for storing physics data
– Conditions DB for the conditions of the massive detectors themselves
• These activities are part of the LCG Applications Area…
• General database and application server services
– Currently at the level of 10-20 TB
– Essentially all based on Intel / Linux
• Core Grid services
– Includes the LCG File Catalog
– Used to schedule jobs (where the data is) and…
– For running jobs to access the data (see the sketch below)
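A minimal sketch of the "schedule jobs where the data is" idea behind the LCG File Catalog: a catalogue maps each logical file name to the sites holding a replica, and the scheduler prefers a site that already has the inputs. The catalogue contents and API here are invented for illustration:

```python
# Toy replica catalogue: logical file name -> sites holding a copy.
catalog = {
    "lfn:/grid/atlas/dc2/evt_0001.root": {"ch.cern", "uk.rl", "fr.in2p3"},
    "lfn:/grid/atlas/dc2/evt_0002.root": {"de.fzk", "fr.in2p3"},
}

def schedule(job_inputs, free_slots):
    """Pick the site with free slots that holds the most input replicas."""
    def replicas_at(site):
        return sum(1 for lfn in job_inputs if site in catalog.get(lfn, ()))
    candidates = [s for s, free in free_slots.items() if free > 0]
    return max(candidates, key=replicas_at)

inputs = ["lfn:/grid/atlas/dc2/evt_0001.root",
          "lfn:/grid/atlas/dc2/evt_0002.root"]
print(schedule(inputs, {"ch.cern": 4, "fr.in2p3": 2, "de.fzk": 0}))
# -> fr.in2p3 (free slots and both replicas local)
```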
Physics Activities - Futures
• Re-engineering all DB services for physics on Oracle 10g RAC
• Goals are:
– Isolation: 10g 'services' and / or physical separation
– Scalability: in both database processing power and storage
– Reliability: automatic failover in case of problems
– Manageability: significantly easier to administer than now
• Will revisit this under ‘Enterprise Grids’ later…
CERN & Oracle
• Share a common vision regarding the future of high-performance computing
– Widespread use of commodity dual-processor PCs running Linux
– Focus on Grid computing
• CERN has managed to influence Oracle products. Oracle 10g features:
– Support for native IEEE floats & doubles;
– Support for “Ultra large” Databases (ULDB);
– Cross-platform transportable tablespaces.
Agenda
• The Need for a World-Wide Grid
• An Overview of the World’s Largest Scientific Grid
• The role of the Database Group in the above
• The role of the CERN openlab for DataGrid Applications
• The role of Enterprise Grids
• Summary and Conclusions
CERN openlab
• The CERN openlab for DataGrid applications is a framework for evaluating and integrating cutting-edge technologies or services in partnership with industry, focusing on potential solutions for the LCG
• The openlab invites industry members to join and contribute systems, resources or services, and to carry out with CERN large-scale, high-performance evaluations of their solutions in an advanced integrated environment
• CERN – Oracle focus:
– Areas that will lead to tangible benefits in the short-to-medium term
– Also look at longer-term, strategic issues
– Not limited to physics-specific work: solutions preferably of general interest!
openlab - achievements
• DataGuard
– Typically viewed as 'disaster protection' (which does happen)
– Also suitable for handling scheduled interventions
• the primary cause of interventions in our Grid is O/S security patches
• we cannot afford for a critical Grid component to be down (it impacts the whole Grid!)
• Streams
– Often viewed as 'some sort of replication technique' (true)
– Has great potential in handling upgrades in a quasi-transparent manner (sketched below)
– An openlab fellow has demonstrated transparent upgrades:
• from one version of Oracle to another (e.g. 9i to 10g)
• from one platform to another (e.g. from Solaris to Intel)
• We have sufficient confidence in this technique that we will use it in production for critical services, e.g. the network DB, together with RAC
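A generic sketch of that quasi-transparent upgrade pattern: bulk-copy the data, apply the captured change stream until the new server has caught up, then repoint clients. The DB class is a toy stand-in, not an Oracle API; Streams performs the capture/propagate/apply steps internally:

```python
class DB:
    """Toy database: an ordered change log stands in for the data."""
    def __init__(self, version):
        self.version, self.log = version, []
    def write(self, change):
        self.log.append(change)
    def changes_since(self, pos):
        return self.log[pos:]

old = DB("9i")
for i in range(5):
    old.write(f"txn-{i}")

new = DB("10g")
new.log = list(old.log)                       # 1. initial bulk copy
old.write("txn-5")                            # writes continue meanwhile
new.log += old.changes_since(len(new.log))    # 2. apply stream until caught up
                                              #    (real systems loop here)
assert new.log == old.log                     # 3. in sync: repoint clients
print("clients now use", new.version)
```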
Physics Analysis
[Data-flow diagram (les.robertson@cern.ch): the detector feeds the event filter (selection & reconstruction), producing raw data; event reprocessing turns raw data into event summary data; event simulation feeds in alongside; batch physics analysis extracts analysis objects by physics topic, which serve interactive physics analysis]
[Data pyramid, Tier-0/Tier-1: RAW 1 PB/yr (1 PB/s prior to reduction!), ESD 100 TB/yr, AOD 10 TB/yr, TAG 1 TB/yr; access patterns run from sequential (RAW, few users) to random (TAG, many users)]
openlab – future focus
• Ultra-large scientific databases for end-user analysis
– An area that is not well understood in the current LCG
– Exploitation of native floats; low-selectivity server-side queries
– Joint work with other openlab partners!
• World-wide monitoring / deployment
– Extensive use of Enterprise Grid Control to handle deployment of core DB and iAS services
– Monitoring, capacity planning, patch deployment, backup / recovery etc.
• Further development of quasi non-stop services using Oracle Streams, 10g DataGuard + 10g RAC etc.
• All with a focus on early production deployment of successes!
Agenda
• The Need for a World-Wide Grid
• An Overview of the World’s Largest Scientific Grid
• The role of the Database Group in the above
• The role of the CERN openlab for DataGrid Applications
• The role of Enterprise Grids
• Summary and Conclusions
Enterprise Grids
The Role of Enterprise Grids in Scientific Grids
Grid - Component Services
• A Grid such as the LCG is built upon a large number of component services and applications
• Traditional wisdom:
– Hand-tailor each service according to its specific needs
– Hard limits in terms of scalability and capacity
– A maintenance nightmare!
• Alternate approach:
– Build services in a standard way out of common building blocks
– Layer them upon an Enterprise Grid
– Scalable; configurable; manageable
CERN DB Physics Services
• Currently being re-engineered on an Enterprise Grid
– 24 PCs (now), expanding to 36 / 48 by end 2005
• dual processor, 4 GB memory, mirrored system disk, RHEL 3.0
– Redundant (dual) 64-port SAN infrastructure
– Some 50 TB of mirrored SAN storage
• Based on Oracle 10g RAC and 10g Services (a toy placement sketch follows)
– Hardware on order; installation expected before Christmas 2004
– Watch this space!
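A toy sketch of the "10g Services" idea of isolation plus failover on such a cluster: each service lists preferred nodes and failover candidates, so workloads stay separated in normal running but survive a node loss. Service and node names are invented, and real RAC manages this placement itself:

```python
def place_services(services, alive):
    """Map each service to its preferred nodes, falling back on failure."""
    placement = {}
    for svc, (preferred, fallback) in services.items():
        nodes = [n for n in preferred if n in alive] \
                or [n for n in fallback if n in alive]
        placement[svc] = nodes or ["UNPLACED"]
    return placement

services = {
    "physics_conditions": (["node1"], ["node2"]),
    "lcg_file_catalog":   (["node2"], ["node3", "node4"]),
    "pool_collections":   (["node3", "node4"], ["node1"]),
}

print(place_services(services, {"node1", "node2", "node3", "node4"}))
print(place_services(services, {"node1", "node3", "node4"}))  # node2 down
```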
Summary and Conclusions
What are Grids all about?
• Grids are about sharing and pooling of resources
• We all know that when we can and do work together, we achieve much more than if we work alone
– CERN is a classic example of this at a world-wide scale!
Two other examples (from CHEP '04 in Interlaken, CH):
1. Resilience
• Security of valuable data
• Continuity in case of major disruption
2. Expedience
• Access to additional resources
• Engagement of distributed communities
The Grid – Disruptive Technology?
• From OracleWorld San Francisco press panel Sep 2003:
[ Work on LHC Computing started around 1992 ]
• “What would happen if something came along that would change everything? Like the Web. We would simply have to take it into account”
• “We believe that thing has come along, and that thing is the grid.”
• “We are actively involved in making it happen, and it is the underlying cornerstone of our computing model”
The Grid is unstoppable...