cn - fhe - jun 94-1
Physics Analysis on RISC Machines: Experiences at CERN
SACLAY
20 June 1994
Frédéric Hemmer, Computing & Networks Division
CERN, Geneva, Switzerland
cn - fhe - jun 94-2
CERN - The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 19 European countries
• SFR 950M budget - operation + new accelerators
• 3,000 staff
• Experiments conducted by a small number of large collaborations:
  400 physicists, 50 institutes, 18 countries, using experimental apparatus costing 100s of MSFR
cn - fhe - jun 94-3
Computing at CERN
• computers are everywhere
• embedded microprocessors
• 2,000 personal computers
• 1,400 scientific workstations
• RISC clusters, even mainframes
• estimate 40 MSFR per year (+ staff)
cn - fhe - jun 94-4
Central Computing Services
• 6,000 users
• Physics data processing traditionally:
mainframes + batch
emphasis on:
reliability, utilisation level
• Tapes: 300,000 active volumes; 22,000 tape mounts per week
cn - fhe - jun 94-5
Application Characteristics
• inherent coarse grain parallelism (at event or job level)
• Fortran
• modest floating point content
• high data volumes
– disks
– tapes, tape robots
• moderate, but respectable, data rates - a few MB/sec per fast RISC cpu
Obvious candidate for RISC clusters
A major challenge
cn - fhe - jun 94-6
CORE - Centrally Operated Risc Environment
• Single management domain
• Services configured for specific applications and groups,
  but common system management
• Focus on data - external access to tape and disk services
  from the CERN network, or even outside CERN
[Diagram: CORE Physics Services - equipment installed or on order January 1994 (les robertson /cn)]
Services attached to the CERN network: CSF - Simulation Facility; PIAF - Interactive Analysis Facility; SPARCstations; home directories & registry; Central Data Services (Shared Disk Servers, Shared Tape Servers); SHIFT - data intensive services; Scalable Parallel Processors; consoles & monitors.
Equipment shown: 25 H-P 9000-735 and H-P 9000-750; 5 H-P 9000-755 with 100 GB RAID disk; 8-node SPARCcenter and 32-node Meiko CS-2 (early 1994); processors: 24 SGI, 11 DEC Alpha, 9 H-P, 2 SUN, 1 IBM; embedded disk: 1.1 TeraBytes; 6 SGI, DEC, IBM disk servers (260 GBytes); 7 IBM, SUN servers; 3 tape robots, 21 tape drives, 6 EXABYTEs; SPARCservers with Baydel RAID disks and a tape juke box.
cn - fhe - jun 94-12
CSF - Central Simulation Facility
• second generation, joint project with H-P
[Diagram: interactive host; shared, load-balanced job queues; H-P 750; tape servers; Ethernet and FDDI networks]
• 25 H-P 735s - 48 MB memory, 400 MB disk
• one job per processor
• generates data on local disk
• staged out to tape at end of job
• long jobs (4 to 48 hours)
• very high cpu utilisation: >97%
• very reliable: >1 month MTBI
cn - fhe - jun 94-13
SHIFT - Scalable, Heterogeneous, Integrated Facility
• Designed in 1990
• fast access to large amounts of disk data
• good tape support
• cheap & easy to expand
• vendor independent
• mainframe quality
• First implementation in production within 6 months
cn - fhe - jun 94-14
Design choices
• Unix + TCP/IP
• system-wide batch job queues
  “single system image”
  target Cray style & service quality
• pseudo distributed file system - assumes no read/write file sharing
• distributed tape staging model (disk cache of tape files)
– the tape access primitives are
copy disk file to tape
copy tape file to disk
cn - fhe - jun 94-15
The Software Model
[Diagram: disk servers, cpu servers, stage servers, tape servers and queue servers connected by an IP network]
Define functional interfaces -- scalable, heterogeneous, distributed
cn - fhe - jun 94-16
Basic Software
• Unix Tape Subsystem (multi-user, labels, multi-file operation)
• Fast Remote File Access System
• Remote Tape Copy System
• Disk Pool Manager
• Tape Stager
• Clustered NQS batch system
• Integration with standard I/O packages - FATMEN, RZ, FZ, EPIO, ..
• Network Operation
• Monitoring
cn - fhe - jun 94-17
Unix Tape Control
• tape daemon
– operator interface / robot interface
– tape unit allocation / deallocation
– label checking, writing
cn - fhe - jun 94-18
Remote Tape Copy System
• selects a suitable tape server
• initiates the tape-disk copy
tpread -v CUT322 -g SMCF -q 4,6 pathname
tpwrite -v IX2857 -q 3-5 file3 file4 file5
tpread -v UX3465 `sfget -p opaldst file34`
cn - fhe - jun 94-19
Remote File Access System - RFIO
high performance, reliability (improve on NFS)
• C I/O compatibility library
Fortran subroutine interface
• rfio daemon started by open on remote machine
• optimised for specific networks
• asynchronous operation (read ahead)
• optional vector pre-seek - ordered list of the records which will probably be read next
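To make the C I/O compatibility idea concrete, here is a minimal sketch of a client reading a SHIFT file through RFIO rather than NFS. It assumes entry points rfio_open, rfio_read and rfio_close that mirror their POSIX counterparts (a real program would take the prototypes from the library's own header), and it reuses the disk-pool pathname from the Disk Pool Management example later in this talk purely as an illustration.

    /* Hedged sketch: read a remote disk-pool file through the RFIO
     * C I/O compatibility interface.  The prototypes below are written
     * out by hand for the sketch; a real program would include the
     * header shipped with the RFIO library. */
    #include <stdio.h>
    #include <fcntl.h>

    extern int rfio_open(const char *path, int flags, int mode);
    extern int rfio_read(int fd, char *buf, int nbytes);
    extern int rfio_close(int fd);

    int main(void)
    {
        char buf[32768];            /* 32 KB records, as in the FDDI tests */
        int  fd, n;

        fd = rfio_open("/shift/shd01/data6/ws/panzer/file26", O_RDONLY, 0);
        if (fd < 0) {
            perror("rfio_open");
            return 1;
        }
        while ((n = rfio_read(fd, buf, sizeof(buf))) > 0)
            ;                       /* process one record here */
        rfio_close(fd);
        return 0;
    }

The Fortran subroutine interface mentioned above provides the equivalent calls for Fortran programs.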
cn - fhe - jun 94-20
[Diagram: a disk pool spanning nodes sgi1, dec24 and sun5]
a disk pool is a collection of Unix file systems, possibly on several nodes, viewed as a single chunk of allocatable space
cn - fhe - jun 94-21
Disk Pool Management
• allocation of files to pools - pools can be public or private
• and to filesystems - capacity management
• name server
• garbage collection - pools can be temporary or permanent
• example:
  sfget -p opaldst file26
  may create a file like:
  /shift/shd01/data6/ws/panzer/file26
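A small sketch of how a job might combine the Disk Pool Manager with ordinary file access: it runs sfget through popen() and uses the printed pathname, on the assumption - suggested by the backquoted sfget in the tape-copy examples - that sfget writes the allocated path to standard output. Pool and file names are taken from the example above.

    /* Hedged sketch: allocate (or locate) a file in the "opaldst" pool via
     * sfget, then open the returned path with the ordinary C library. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char path[256];
        FILE *p, *f;

        p = popen("sfget -p opaldst file26", "r");
        if (p == NULL || fgets(path, sizeof(path), p) == NULL) {
            fprintf(stderr, "sfget failed\n");
            return 1;
        }
        pclose(p);
        path[strcspn(path, "\n")] = '\0';   /* strip the trailing newline */

        f = fopen(path, "w");               /* e.g. /shift/shd01/data6/ws/panzer/file26 */
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        fputs("event data ...\n", f);       /* the job's output goes here */
        fclose(f);
        return 0;
    }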
cn - fhe - jun 94-22
Tape Stager
• implements a disk cache of magnetic tape files
• integrates: Remote Tape Copy System & Disk Pool Management
• queues concurrent requests for the same tape file
• provides full error recovery - restage and/or operator control on hardware/system error;
  initiates garbage collection if disk full
• supports disk pools & single (private) file systems
• available from any workstation
cn - fhe - jun 94-23
Tape Stager
[Diagram: a user job on a cpu server issues "stagein tape, file"; the stage control calls "sfget file" to allocate a disk pool file and "tpread tape, file" (rtcopy) on a tape server to copy the tape file onto a disk server; the job then accesses the file via RFIO. There is an independent stage control for each disk pool.]
cn - fhe - jun 94-24
SHIFT Status - equipment installed or on order January 1994

group     configuration                                        -- capacity --
                                                               cpu (CU*)  disk (GB)
OPAL      SGI Challenge 4-cpu + 8-cpu (R4400 - 150 MHz),          290        590
          two SGI 340S 4-cpu (R3000 - 33 MHz)
ALEPH     SGI Challenge 4-cpu (R4400 - 150 MHz),                  216        200
          eight DEC 9000-400
DELPHI    Two H-P 9000/735                                         52        200
L3        SGI Challenge 4-cpu (R4400 - 150 MHz)                    80        300
ATLAS     H-P 9000/755                                             26         23
CMS       H-P 9000/735                                             26         23
SMC       SUN SPARCserver 10, 4/630                                22          4
CPLEAR    DEC 3000-300 AXP, 500 AXP                                29         10
CHORUS    IBM RS/6000-370                                          15         15
NOMAD     DEC 3000-500 AXP                                         19         15
Totals                                                            775       1380

(for comparison, CERN IBM mainframe: 120 CU, 600 GB)
* CERN-Units: one CU equals approx. 4 SPECints
cn - fhe - jun 94-25
Current SHIFT Usage
• 60% cpu utilisation
• 9,000 tape mounts per week, 15% write - still some way from holding the active data on disk
• MTBI - cpu and disk servers: 400 hours for an individual server
• MTBF for disks: 160K hours
maturing service, but does not yet surpass the quality of the mainframe
cn - fhe - jun 94-26
CORE Networking
[Diagram:
  UltraNet 1 Gbps backbone - 6 MBytes/sec sustained - SHIFT cpu servers, SHIFT disk servers, IBM mainframe
  FDDI + GigaSwitch - 2-3 MBytes/sec sustained - SHIFT tape servers
  Ethernet + Fibronics hubs - aggregate 2 MBytes/sec sustained - simulation service, home directories
  connection to CERN & external networks]
cn - fhe - jun 94-27
FDDI Performance (September 1993)
100 MByte disk file read/written sequentially using 32 KB records
client: H-P 735   server: SGI Crimson, SEAGATE Wren 9 disk

system     read          write
NFS        1.6 MB/sec    300 KB/sec
RFIO       2.7 MB/sec    1.7 MB/sec
cn - fhe - jun 94-28
PIAF - Parallel Interactive Data Analysis Facility
(R.Brun, A.Nathaniel, F.Rademakers CERN)
• the data is “spread” across the interactive server cluster
• the user formulates a transaction on his personal workstation
• the transaction is executed simultaneously on all servers
• the partial results are combined and returned to the user’s workstation
cn - fhe - jun 94-29
PIAF Architecture
[Diagram: a PIAF client and display manager on the user's personal workstation talk to a PIAF server, which drives multiple PIAF workers; together the server and workers form the PIAF Service]
cn - fhe - jun 94-30
Scalable Parallel Processors
• embarrassingly parallel application - therefore in competition with workstation clusters
• SMPs and SPPs should do a better job for SHIFT than loosely coupled clusters
• computing requirements will increase by three orders of magnitude over the next ten years
• R&D project started, funded by ESPRIT - GPMIMD2: 32-processor Meiko CS-2, 25 man-years development
cn - fhe - jun 94-31
Conclusion
• Workstation clusters have replaced mainframes at CERN for physics data processing
• For the first time, we see computing budgets come within reach of the requirements
• Very large, distributed & scalable disk and tape configurations can be supported
• Mixed manufacturer environments work, and allow smooth expansion of the configuration
• Network performance is the biggest weakness in scalability
• Requires a different operational style & organisation from mainframe services
cn - fhe - jun 94-32
Operating RISC machines
• SMPs easier to manage
• SMPs require less manpower
• Distributed management not yet robust
• Network is THE problem
• Much easier than mainframes, and
• ... cost effective