Sept 12 2001 Wyatt Merritt DØ Collaboration Meeting Plenary Session
The DØ Computing Model
Overview
• The picture
• Planning history
• Status of acquisitions
• Performance
• More detail:
  • On the current operation
  • On the R & D
• General status
• Future plan
Overview
• The data handling system: SAM, ENSTORE, robot(s)
• The offline user computing systems:
  • dØmino - O(20 TB) disk
  • Linux analysis server(s) - O(2 TB) disk
  • Linux development machines - O(0.2 TB): build cluster, ClueDØ, remote Linux machines
  • Non-development desktops
• Associated systems: Fermilab production farm (raw data reconstruction), remote production farms (simulation), database servers
[Diagram: system overview - the detector feeds the tape robot ("high bandwidth into robot"); dØmino (27 TB) sits on a high-speed network with the Linux farms, database servers, a Linux compute server, lxbld, Analysis Cluster 1 (~1 TB), and the ClueDØ server (~0.2 TB); NT desktops and ClueDØ connect to the system; link rates of 12.5 Mb/s and 150 Mb/s appear in the figure; Monte Carlo is handled remotely.]
Planning history
• Original plan: January ‘97; DØ internal review: February ‘97
• External review: von Rüden Committee - Mar ‘97, Oct ‘97, Jun ‘98, Jan ‘99, Jun ‘99
• Funding profile (DMNAG - joint with CDF) approved ‘97
• Plan updates: January ‘99 for VR IV; Global Computing Model reports (‘98-’99) [addition of analysis servers to the plan]
• Plan implementation ‘97-‘01: Run II Computing and Software Project (co-leaders) + Computing Planning Board
Status of acquisitions
• Analysis cpu:
  • dØmino: 192-processor O2000 - complete (except for the memory addition)
  • Desktops: responsibility of institutions
  • Analysis clusters/servers: 1 purchased of (6?)
• Reconstruction cpu: 200 processors acquired of 400 planned
  [40 Hz cap @ current reco cpu performance; 80 Hz @ target reco performance]
• Disk storage: 30 TB total - complete (plan was 15 TB); see allocation slide
Disk space in the offline systems

Total available disk space: 30 TB. 3 TB are on D0test, d0lxac1, and d0lxbld; 27 TB are on D0MINO.

Disk space on D0MINO (all units are TB):

  Area                                Available   Allocated   Used
  Scratch, releases & other config.       1           1         1
  SAM cache                               6           6      variable
  DST/mDST                               12          12      variable
  Project disks                           4          2.6      ~2.0?
  Tmp (group space)                       2          0.9        ?
  Contingency                             2           -         -
  TOTAL                                  27         22.5
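The D0MINO disk figures can be cross-checked with a few lines of arithmetic; this sketch simply re-adds the quoted per-area numbers ("variable"/"?" entries are omitted):

```python
# Cross-check of the D0MINO disk-space figures (units: TB).
# Numbers are copied from the allocation table above.
available = {
    "scratch, releases & other config.": 1,
    "SAM cache": 6,
    "DST/mDST": 12,
    "project disks": 4,
    "tmp (group space)": 2,
    "contingency": 2,
}
allocated = {
    "scratch, releases & other config.": 1,
    "SAM cache": 6,
    "DST/mDST": 12,
    "project disks": 2.6,
    "tmp (group space)": 0.9,
}
total_available = sum(available.values())
total_allocated = sum(allocated.values())
print(total_available)   # matches the 27 TB quoted for D0MINO
print(round(total_allocated, 1))
```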
Status of acquisitions, cont’d
• Robotic tape storage:
  • 1 ADIC robot (750 TB capacity) - complete
  • 18 Mammoth II tape drives - will be retired
  • 6 LTO drives - now
  • 2 STK robots (600 TB capacity) - FY02
  • 9 STK 9940 drives - FY02
  • Post-shutdown stopgap: use existing STKen with 4 drives
• Database servers - complete: 2 SUN systems with 600 GB disk
Performance
• Farm production stats
• dØmino cpu & memory stats
• AC1 cpu & memory stats
• SAM & encp stats
• Disk usage stats
• Conclusion - chief needs:
  • More memory for dØmino
  • More reliable tape drives
  • More farm nodes
  • More Linux cpu
• Open question: DB server upgrades?
Farm Production Statistics
• See the web link from Main DØ Computing for weekly reports
• Week of 08/31 - 09/06:
  • 800,000 events processed, of which 140,000 were from data collected in that week
  • 1.9 M events collected in that week
• Problems in this week:
  • encp problem (code change from ENSTORE)
  • disk failure on dØbbin (the farm IO server)
  • several other problems as well...
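The weekly numbers above can be turned into a quick throughput estimate; this sketch uses only the event counts quoted on the slide:

```python
# Farm throughput vs. data taking for the week of 08/31 - 09/06
# (all event counts are from the slide above).
processed_total = 800_000      # events reconstructed on the farm that week
processed_from_week = 140_000  # of those, events taken during that same week
collected = 1_900_000          # events collected by the detector that week

fraction_of_new_data = processed_from_week / collected
print(f"{fraction_of_new_data:.1%} of that week's new data reconstructed")

avg_rate_hz = processed_total / (7 * 24 * 3600)
print(f"average farm rate over the week: {avg_rate_hz:.1f} Hz")
```

The ~1.3 Hz weekly average sits below the ~2-3 Hz farm limit quoted later in the talk, consistent with the downtime from the problems listed above.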
The Current Operation
• Code release model
• Mapping activities to systems
• ClueD0 operation
• Remote farm operation
• Role of the ORB
The code release model
• Weekly test releases
• Production releases every three months
• Weekly subsystem coordinators meeting:
  • Minutes to the d0rug mailing list
  • Rules for interface changes
  • Schedules for big disruptive changes (e.g. the switch to KAI 4.0)
Mapping activities to systems
• Code development: your Linux box, if possible; d0mino is the backup solution
• Large sample processing: a SAM station - d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)
• Small sample processing: create a derived dataset on a SAM station, transfer it to your desktop
• Office work / web browsing: use your desktop!
• Remote users: a new position to address their needs
Mapping activities to systems
• Disk usage:
  • Home areas - backed up; you can ask for up to 250 MB (possibly more for good reason), BUT they are NFS-mounted - don’t use them for data files!
  • TMP areas - not backed up. Code development and/or data files, allocated per institution; 37 institutions are using it so far. A good place to start if you are not working with a well-defined project.
  • PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools (commissioning, algorithm development, simulation), plus physics and ID groups and some smaller projects.
• Web pages: DØ Main Computing (SAM Data Handling section) --> general description of where data samples are stored in our system
ClueD0 Operation
• The current population: 111 nodes with 138 CPUs, a total memory of 37 GB, and 396 users
• Rules for joining and policies can be found at:
  http://www-clued0.fnal.gov/clued0/
  http://www-clued0.fnal.gov/clued0/policies.html
• Current difficulties from the lack of Red Hat 7.1 builds are being actively worked on
Monte Carlo Production Status
• Current software - mcp07:
  • p07.00.05a: generator, DØgstar, DØsim
  • p08.12.00: DØreco, recoanalyze
  • 950 k events generated at reco level
  • Run IIb simulation is a major effort
  • Will move to p08.13.00 to remove a memory leak
• Future releases - p09.10.00:
  • Problem running DØgstar under investigation
  • Plate level available
  • p10 certification will be available by the end of the month
The Offline Resources Board
• Charge: allocate offline resources according to the experiment’s priorities:
  • Project & tmp disk
  • Sample priorities for simulation on remote farms
  • Partitions in SAM cache
  • Batch queues
• Chair: Nick Hadley
• Web page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
• Institutions which have no tmp disk allocation and have active users: email hadley@fnal.gov - 18 GB will be allocated
R & D
• Analysis clusters - one in service
• ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed
• Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests
• Remote farms for raw data reconstruction and analysis
• Remote desktop analysis
Institutional contributions
• Desktop seats
• Backup tapes
• Remote simulation capacity
• Disk for dØmino via budget code - issues:
  • How to allocate between project & tmp?
  • Lifetime for a contribution?
  • Unit of contribution: 1 rack of disk
• Analysis cluster for Feynman via budget code - similar issues
• Analysis cluster for ClueDØ - all the above issues plus SAM bandwidth, networking, sysadmin, ...
General Status - Where are the limits/problems?
• Online:
  • Max rate tested: 40 Hz to tape
  • Max rate sustained for a shift, to date: ~25 Hz to tape
  • Max rate expected with the next iteration: 60 Hz to tape
  • Final limitation: tape budget (FY02 = ~400 TB)
• Running p10 on the farms:
  • Processes raw data @ 23 sec/event
  • Thanks to the Alg Group - worked out of the box on raw data
  • Limits: ~2-3 Hz with current nodes & current reco cpu performance
  • Output size: HUGE - writing too much tape, breaking the DB model, using more than the allocated network and disk resources all down the line
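To see why the tape budget becomes the final limitation, the FY02 allocation can be converted into an implied per-event tape budget. This is a rough sketch, not from the slides: the 50% duty factor is an assumed value, and the whole budget is charged to raw data.

```python
# Rough arithmetic behind "final limitation: tape budget".
# ASSUMPTIONS (not from the slides): a 50% duty factor for data taking,
# and the entire FY02 tape budget charged to raw data.
tape_budget_bytes = 400e12    # FY02 tape budget, ~400 TB (from the slide)
rate_hz = 60                  # rate expected with the next online iteration
duty_factor = 0.5             # assumed fraction of the year spent taking data

seconds_per_year = 365.25 * 24 * 3600
events_per_year = rate_hz * seconds_per_year * duty_factor
bytes_per_event = tape_budget_bytes / events_per_year
print(f"implied budget: ~{bytes_per_event / 1e3:.0f} kB/event")
```

Under these assumptions the budget works out to a few hundred kB per event, which is why the "HUGE" reco output size noted above breaks the model.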
Expected Farm Performance

  Configuration                   @ current cpu perf   @ target cpu perf
  Existing farm                         3 Hz                 6 Hz
  + FY01 purchase (32 nodes)            5 Hz                10 Hz
  + FY02 purchase (200 nodes)          36 Hz                72 Hz
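The rates above scale simply with CPU count and per-event time. A sketch of the model, under stated assumptions: the 23 s/event figure is quoted earlier in the talk, but "target" performance is assumed here to mean a 2x speedup, and the CPU-equivalent counts are back-derived from the table rather than quoted figures.

```python
# Simple farm throughput model: rate (Hz) = n_cpu_equivalents / sec_per_event.
# 23 s/event at current reco performance is quoted earlier in the talk;
# the "target" column is ASSUMED to mean a 2x speedup, and the CPU-equivalent
# counts below are back-derived from the table (e.g. 3 Hz * 23 s ~= 69),
# not quoted figures.
sec_current = 23.0
sec_target = sec_current / 2.0   # assumed 2x reco speedup

def rate_hz(n_cpu_equiv, sec_per_event):
    """Aggregate event rate of n CPU-equivalents, each taking sec_per_event."""
    return n_cpu_equiv / sec_per_event

configs = {
    "existing farm": 69,                  # ~3 Hz * 23 s
    "+ FY01 (32 nodes)": 69 + 46,         # inferred ~46 CPU-equivalents added
    "+ FY02 (200 nodes)": 69 + 46 + 713,  # inferred ~713 CPU-equivalents added
}
for label, n in configs.items():
    print(f"{label}: {rate_hz(n, sec_current):.0f} Hz / "
          f"{rate_hz(n, sec_target):.0f} Hz")
```

The model also makes the earlier bracketed remark concrete: halving the per-event time doubles every row.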
General Status - Where are the limits/problems?
• SAM/ENSTORE status:
  • Working for many months with servers on automatic recovery
  • Not all features complete (pick events)
  • 5 GB interfaces can deliver 150 MB/sec to dØmino
• Robot status:
  • Design rates met, but robustness severely limited by the Mammoth II drive error rate - plan switchover by the end of the shutdown
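At the quoted 150 MB/s aggregate into dØmino, staging a sizable sample is an hours-scale operation. A quick sketch; the 1 TB dataset size is an arbitrary illustrative choice, not a figure from the slides:

```python
# Time to stage a dataset into dØmino at the quoted aggregate SAM/ENSTORE rate.
rate_mb_s = 150        # MB/s into dØmino (from the slide)
dataset_tb = 1.0       # ASSUMED example dataset size, not from the slides

seconds = dataset_tb * 1e6 / rate_mb_s   # 1 TB = 1e6 MB
print(f"~{seconds / 3600:.1f} hours to stage {dataset_tb:.0f} TB")
```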
Future Plan
• Major purchases still in FY02:
  • New robot and reliable drives
  • New farm nodes
  • More memory for dØmino
  • *Some* Linux cpu
• Continue R&D for Linux analysis strategies - hope to establish the effectiveness and practicality of the three proposed models: AC, CS, AC@DØ
• Operational improvements:
  • SAM personnel @ DØ
  • RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, and output-size issues