
Page 1: AMOD Report  December 3-9, 2012

AMOD Report December 3-9, 2012

Torre Wenaus

December 11, 2012

Page 2: AMOD Report  December 3-9, 2012

Activities

• Data-taking until the 6th, the likely end of 2012 pp physics running

• Bulk reprocessing mostly done

• ~1.3M production jobs (group, MC, validation, reprocessing)

• ~2.5M analysis jobs

• ~610 analysis users

Page 3: AMOD Report  December 3-9, 2012

Production & Analysis

Sustained activity, production and analysis

Fluctuating analysis workload: ~6k min (!) to ~34k max

Page 4: AMOD Report  December 3-9, 2012

Data transfer - source

Page 5: AMOD Report  December 3-9, 2012

Data transfer - destination

Page 6: AMOD Report  December 3-9, 2012

Data transfer - activity

Page 7: AMOD Report  December 3-9, 2012

T0 export tailing off at end of week with end of pp datataking

Page 8: AMOD Report  December 3-9, 2012

Reprocessing (yellow) tailing off

Page 9: AMOD Report  December 3-9, 2012

Tier 0, Central Services

• Tue pm: T0 LSF: slow LSF job dispatching; ALARM ticket at 23:46. Promptly answered: a reconfiguration run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Experts are looking at why the reconfig took so long. Ticket closed. GGUS:89202

• Sat am: CERN-PROD: ALARM: ATLAS web server down; response in 10 min, resolution in ~30 min. Due to a power outage. Closed. GGUS:89334

• Weekend: CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS (next slide). GGUS:89328

• During week, a few cases (besides alarm ticket) of T0 bsub time spiking to ~6-8 sec for <~1hr

Page 10: AMOD Report  December 3-9, 2012

EOSATLAS availability lapses

Page 11: AMOD Report  December 3-9, 2012

ADC

• Tue pm: Security ticket to ATLAS VOSupport: ATLAS creating world-writable directories. In the PanDA pilot, one directory-creation case (the job recovery directory) was missed when setting access to 770. Fixed in pre-production code. GGUS:89182
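The fix amounts to creating the recovery directory with mode 770 rather than world-writable. A minimal Python sketch (the helper name is ours, not the pilot's): note that the mode argument of `os.makedirs` is filtered through the process umask, so an explicit `chmod` is needed to guarantee the final bits.

```python
import os
import stat

def make_group_private_dir(path):
    """Create a directory with mode 770 (rwxrwx---), in the spirit of the
    pilot fix above; the function name is hypothetical, not pilot code.

    os.makedirs' mode argument is masked by the process umask, so an
    explicit chmod guarantees no world access remains."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)  # 0o770: owner+group only
    return oct(stat.S_IMODE(os.stat(path).st_mode))
```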

• Tue: A recurrence of the problem in which a corrupt dCache library (libdcap.so) was disseminated by the software installation, causing ANALY jobs to fail at all sites using dCache. Fixed promptly, with a new check added to prevent recurrence.
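A corruption check of this kind can be as simple as comparing the installed file against a reference digest before it is used. A hedged sketch of the idea, not the actual check added to the installation system:

```python
import hashlib

def file_sha256(path):
    """Stream a file through SHA-256 (avoids loading a large library
    into memory all at once)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def library_ok(path, expected_sha256):
    """Return True only if the installed library matches the reference
    digest; a missing or unreadable file also fails the check."""
    try:
        return file_sha256(path) == expected_sha256
    except OSError:
        return False
```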

• Tue: MuonCalibration-17.2.7.4.1 not found at the ANALY_MPPMU calibration site; resolved by AleDG/AleDS/Alden. Some confusion over the source of the celist (it is AGIS).

• Bulk ESD lifetime changed from 4 weeks to 3 weeks (Ueda)

• Case of duplicate GUIDs, analysis ongoing

• Thu: MUON_CALIBDISK close to full at INFN-NAPOLI; a deletion run freed sufficient space

• Weekend: SARA DATADISK filled up (next slide)

Page 12: AMOD Report  December 3-9, 2012

T1 DATADISK space full

• At SARA over the weekend: a ~monotonic decline in available space over the past week reached its end

• Sat pm: Taken out of T0 export at 10TB free

• DDM auto blacklisting didn’t kick in – when was it supposed to? 1TB? Very low…

• Mon am: Manually blacklisted
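An auto-blacklisting trigger is more robust if it considers the recent fill rate as well as an absolute floor: 1 TB of headroom disappears quickly at T0-export rates. An illustrative sketch of such a rule (the thresholds are invented for the example, not actual DDM settings):

```python
def should_blacklist(free_tb, fill_rate_tb_per_day, floor_tb=10.0, min_days=2.0):
    """Blacklist a space token when free space is at/below an absolute
    floor, or would be exhausted within `min_days` at the recent fill
    rate. Threshold values here are illustrative only."""
    if free_tb <= floor_tb:
        return True
    if fill_rate_tb_per_day > 0 and free_tb / fill_rate_tb_per_day < min_days:
        return True
    return False
```

With a 10 TB floor the SARA case above would have been caught at the point it was manually taken out of T0 export, and a fast fill rate triggers the action even earlier.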

Page 13: AMOD Report  December 3-9, 2012

Tier 1 Centers

• Mon am: IN2P3: the regular SRM hangups thought to have been fixed with a dCache patch for the long-proxies problem (GGUS:88984) were not in fact fixed; they recurred Tuesday. The site then put in a cron job to detect when an SRM restart is needed and perform it; the server has not needed a restart since. Investigations ongoing. GGUS:89111
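A watchdog of the kind IN2P3 deployed can be sketched as a periodic probe of the SRM endpoint followed by a service restart when it stops answering. The port number, service name, and restart command below are assumptions for illustration, not IN2P3's actual cron:

```python
import socket
import subprocess

def srm_responsive(host, port=8443, timeout=10.0):
    """Probe the SRM port; a hung daemon typically stops accepting
    connections within the timeout. Port 8443 is an assumed default."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_restart(host, service="srm-server"):
    """Cron-driven watchdog in the spirit of the IN2P3 workaround.
    The service name and systemctl restart are hypothetical."""
    if srm_responsive(host):
        return "ok"
    subprocess.run(["systemctl", "restart", service], check=False)
    return "restarted"
```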

• Mon am: RAL: Failures in input file staging, high FTS error rate. Restarted the stager and rebalanced the database which solved it. Closed. GGUS:89141

• Tue pm: Taiwan-LCG2: many job failures due to insufficient space on local disk. The site increased the maximum job workdir size in schedconfig. Ticket closed, but the problem recurred Thu am; new ticket opened. The site reduced job slots on WNs with small disks. Ticket on hold for observation. GGUS:89200, GGUS:89253
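A schedconfig workdir limit only helps if the node also verifies that its scratch area can actually hold a maximal workdir before accepting a job. An illustrative admission check (the parameter names are ours, loosely echoing the schedconfig-style limit mentioned above):

```python
import shutil

def can_accept_job(scratch_path, maxwdir_mb, safety_mb=1024):
    """Admit a job only if the scratch area has room for the configured
    maximum workdir size plus a safety margin. Names and margin are
    illustrative, not actual pilot/schedconfig parameters."""
    free_mb = shutil.disk_usage(scratch_path).free // (1024 * 1024)
    return free_mb >= maxwdir_mb + safety_mb
```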

• Wed am: FZK TAPE: T0 export resumed after resolution of last week's ticket. Some timeout failures since, but not persistent. Closed. GGUS:88877

Page 14: AMOD Report  December 3-9, 2012

Tier 1 Centers (2)

• Thu pm: SARA: T0 export failures, quick site response and resolution, "we were overloaded with requests from jobs from another cluster. This has been blocked now..." which solved the problem. Closed. GGUS:89289

• Sat am, through the weekend: FZK-LCG2: persistent <8% job failure rate due to timeouts saving files to the local SE, logged on the ticket reopened 2/12. Mon am update: the site cancelled some long-standing inactive transfers on the ATLAS write-buffer pools. GGUS:89110

• Sat am: Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. Closed. GGUS:89332

• Sat pm: PIC: failing source transfers. Cured with SRM restart. Site is checking what caused the SRM failures. GGUS:89338

Page 15: AMOD Report  December 3-9, 2012

Other

• GGUS experts were unable to reproduce last week's issue, in which clicking ‘back’ twice after creating a ticket creates another one (observed in Firefox)

• Coming:

– PIC capacity at ~65% Dec 10-21 to save electricity

– Several downtimes this week (Dec 10+)

• Sites: please make clear in GOC downtime notices the scope/impact of the downtime

• With regular space issues, as well as occasional hardware and other problems, exclusion from T0 export is quite common. It would be nice to have monitoring of inclusion/exclusion status, and a simplified, safer inclusion/exclusion procedure.
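A minimal version of the requested status monitoring only needs to record state transitions per site, so shifters can see when a site went in or out of T0 export. A toy sketch of the bookkeeping (function and site names are just examples, not an existing ADC tool):

```python
def record_transition(history, site, included):
    """Record a T0-export inclusion change for `site` only when the state
    actually flips; `history` maps site -> list of successive states.
    A real monitor would timestamp, persist, and alert on each flip."""
    states = history.setdefault(site, [])
    if not states or states[-1] != included:
        states.append(included)
    return history
```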

• Noticed shifters paying attention to a site they shouldn’t need to (UTD-HEP)… how to prevent this?

– https://savannah.cern.ch/support/?133697

Page 16: AMOD Report  December 3-9, 2012

Thanks

• Thanks to all shifters and helpful experts!