TRANSCRIPT
Proposal to form a Large Cluster SIG
Alan Silverman
2nd Nov 2000
HEPiX – Jefferson Lab
Overview of the talk
Why – the rationale behind CERN's proposal for this SIG
Who – who is or might be interested
What – what could such a SIG do
When – what is the timescale for setup and first actions
My Given Mandate
“There is an emerging consensus that an important part of the analysis of LHC data will be performed in "Regional Computing Centres", closely integrated with each other and with the CERN facility to provide as far as possible a single computing environment.”
“It is proposed that we start within HEPIX a special interest group on Large Scale Cluster Management to share ideas and experience between the labs involved in regional centre computing, with a view to minimising the number of overlapping developments and maximising the degree of standardisation of the environment.”
Parallel Developments
Monitoring - PEM (CERN), NGOP (FNAL), GMS (IN2P3)
Software certification in progress, now or soon, at 3-4 labs on Solaris 8 and Linux 7.x
Software installation projects - CERN, DESY
Remedy trouble ticket workflows - SLAC, CERN, FNAL
Kerberos 5 - CERN (CLASP), FNAL, DESY, …
GRIDs - European DataGrid, PPDG and GriPhyN
February 10, 2000: Distributed Data Access and Analysis for HENP Experiments – Harvey B. Newman (CIT)
MONARC General Conclusions on LHC Computing
Following discussions of computing and network requirements, technology evolution and projected costs, support requirements etc.:
The scale of LHC “Computing” requires a worldwide effort to accumulate the necessary technical and financial resources
A distributed hierarchy of computing centres will lead to better use of the financial and manpower resources of CERN, the Collaborations, and the nations involved, than a highly centralized model focused at CERN
The distributed model also provides better use of physics opportunities at the LHC by physicists and students
At the top of the hierarchy is the CERN Center, with the ability to perform all analysis-related functions, but not the ability to do them completely
At the next step in the hierarchy is a collection of large, multi-service “Tier1 Regional Centres”, each with 10-20% of the CERN capacity devoted to one experiment
There will be Tier2 or smaller special purpose centers in many regions
February 10, 2000: Distributed Data Access and Analysis for HENP Experiments – Harvey B. Newman (CIT)
MONARC Architectures WG: Regional Centre Facilities & Services
Regional Centres should provide:
All technical and data services required to do physics analysis
All Physics Objects, Tags and Calibration data
Significant fraction of raw data
Caching or mirroring calibration constants
Excellent network connectivity to CERN and the region’s users
Manpower to share in the development of common validation and production software
A fair share of post- and re-reconstruction processing
Manpower to share in ongoing work on Common R&D Projects
Excellent support services for training, documentation, troubleshooting at the Centre or remote sites served by it
Service to members of other regions
Long Term Commitment for staffing, hardware evolution and support for R&D, as part of the distributed data analysis architecture
Richard P. Mount, CHEP 2000: Data Analysis for SLAC Physics
Complexity
• BaBar (and CDF, D0, RHIC, LHC) is driven to systems with ~1000 boxes performing tens of functions
• How to deliver reliable throughput with hundreds of users?
– Instrument heavily
– Build huge test systems
– “Is this a physics experiment or a computer science experiment?”
Richard P. Mount, CHEP 2000: Data Analysis for SLAC Physics
Personnel Issues
• Is the SLAC equipment/personnel ratio a good model?
• SLAC-SCS staff are:
– smart
– motivated
– having fun
– (unofficially) on call 24 x 7
– in need of reinforcements
European DataGRID WP4 – Fabric Management
The objective of the fabric management work package (WP4) is to develop new automated system management techniques that will enable the deployment of very large computing fabrics constructed from mass market components with reduced systems administration and operations costs.
The fabric must support an evolutionary model that allows the addition and replacement of components, and the introduction of new technologies, while maintaining service. The fabric management must be demonstrated in the project in production use on several thousand processors, and be able to scale to tens of thousands of processors.
Who might be concerned
The various GRID projects – only the European DataGRID seems to mention the basic computing fabric as an issue.
CERN
LHC experiment Tier 1 sites
LHC Tier 2 sites? FNAL?
FNAL Run II remote sites (soon in production)
BNL RHIC and remote sites (in production)
SLAC BaBar and remote sites (in production)
Basically – all the traditional HEPiX attendees
What could a SIG do?
First, promote appropriate sessions at future HEPiX meetings; perhaps even special meetings
Make sure each site knows what relevant work is in progress (produce some form of list of work in progress?)
Be aware of and promote collaboration; perhaps share parts of projects
Be open to the possibility of people exchanges
Some possible concrete examples
These came from my first discussions last week at FNAL (thanks to Lisa and Dane and many others) and the site reports
Certification of future versions of Linux and Solaris
Security (Kerberos 5), single-site sign-on, common authorisation files, password coordination (Jlab’s password utility)
Kickstart for clusters? …
More examples
A workshop to write the definitive guide to building and running a cluster – how to choose/select/test the hardware; software installation and upgrade tools; performance management, logging, accounting, alarms, security, etc.
Add a note on what exists and what might scale to large clusters.
Maintain this. For example … (the following slides are from Chuck Boeheim)
Rack Density, Packaging
Shopping for >= 2 CPU/RU
Per-unit costs for wiring and power become significant
Cooling of areas becomes a significant problem (machine room was designed for water-cooled mainframes)
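As a rough illustration of why per-rack power and cooling become a concern: a back-of-the-envelope sketch in Python, where the node count and per-node wattage are assumptions, not figures from the talk.

# Back-of-the-envelope rack power estimate.
# Node count and wattage are assumed figures for illustration only.
NODES_PER_RACK = 40      # 1U dual-CPU boxes in a 40 RU rack (assumption)
WATTS_PER_NODE = 150     # assumed draw per dual-CPU 1U node

rack_watts = NODES_PER_RACK * WATTS_PER_NODE
print(f"CPUs per rack: {NODES_PER_RACK * 2}")
print(f"Rack power   : {rack_watts / 1000:.1f} kW, plus roughly the same again to cool it")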
Console Management
Use console servers that gather 512 lines per server
Provide SSL and SSH support for staff to connect from anywhere, anytime
Automatic monitoring of all console traffic (sketch below)
Power management from console
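A minimal sketch of automatic console-traffic monitoring, assuming the console servers aggregate output into per-host log files; the log path and alarm patterns are illustrative, not taken from the talk.

import re
import time
from pathlib import Path

# Alarm patterns are purely illustrative.
ALARMS = [re.compile(p) for p in (r"panic", r"Oops", r"read-only file system")]

def follow(logfile: Path):
    """Yield lines appended to a console log, tail -f style."""
    with logfile.open() as f:
        f.seek(0, 2)               # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(1)

def watch(logfile: Path):
    """Print an alarm for every console line matching a known bad pattern."""
    for line in follow(logfile):
        if any(p.search(line) for p in ALARMS):
            print(f"ALARM {logfile.stem}: {line.rstrip()}")

if __name__ == "__main__":
    watch(Path("/var/consoles/node001.log"))   # hypothetical log location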
Installations
Using Solaris Jumpstart, one person can install 100s of systems per day (sketch below)
Trying to get to the same point with Linux
PXE protocol is not up to the task; still need boot floppies
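The hundreds-of-installs-per-day figure comes from driving the installer entirely from generated per-host profiles rather than touching each box. A hedged sketch of that idea follows; the host table, output directory and profile template are invented for illustration and are not real Jumpstart or Kickstart syntax.

from pathlib import Path

# Hypothetical host table: name -> (IP address, boot disk)
HOSTS = {
    "node001": ("192.168.1.1", "sda"),
    "node002": ("192.168.1.2", "sda"),
}

# Placeholder profile format, not actual Jumpstart/Kickstart syntax.
TEMPLATE = """\
# auto-generated install profile for {name}
network --ip {ip} --hostname {name}
bootdisk {disk}
packages base cluster-node
"""

def write_profiles(outdir: Path) -> None:
    """Emit one install profile per host for the network installer to serve."""
    outdir.mkdir(parents=True, exist_ok=True)
    for name, (ip, disk) in HOSTS.items():
        (outdir / f"{name}.cfg").write_text(
            TEMPLATE.format(name=name, ip=ip, disk=disk))

if __name__ == "__main__":
    write_profiles(Path("profiles"))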
Monitoring
Console monitoring
Ranger
Ping (sketch below)
Switch port reports
Mail summarizer
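A minimal sketch of the "Ping" item: sweep a node list with one ICMP echo each and report what is down. The node names and Linux-style ping flags are assumptions.

import subprocess

NODES = [f"node{i:03d}" for i in range(1, 6)]   # illustrative node names

def is_up(host: str) -> bool:
    """One ICMP echo with a 2-second timeout, via the system ping binary (Linux flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

if __name__ == "__main__":
    down = [h for h in NODES if not is_up(h)]
    print(f"{len(NODES) - len(down)}/{len(NODES)} nodes up; down: {down or 'none'}")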
Cluster = Amplifier
One mistake generated 4000 emails per hour
Use mail summarizer to intercept (sketch below)
Need to give it its own mail server!
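A toy sketch of the mail-summarizer idea: collapse a flood of near-identical alert mails into one digest line per subject instead of forwarding every message. The input format is an assumption.

from collections import Counter

def summarize(messages):
    """messages: iterable of (subject, body) pairs taken off the alert mail stream."""
    counts = Counter(subject for subject, _ in messages)
    for subject, n in counts.most_common():
        print(f"{n:5d} x {subject}")

if __name__ == "__main__":
    # 4000 near-identical alarms collapse to a couple of digest lines.
    flood = [("disk full on node042", "...")] * 4000 + [("fan failure on node007", "...")]
    summarize(flood)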
When
Since last week actually (an information-gathering visit to FNAL, a CMS Tier 1 Centre)
Various discussions this week (and next week at BNL, an ATLAS Tier 1 Centre)
A half or full day session at the next and all future HEPiX meetings on cluster subjects
From now to then, information gathering. Please send me information about possibly relevant work in progress.