CMS Issues
Background – RAL Infrastructure
[Diagram: Common Layer, Instance Headnodes and Diskservers (x20). Boxes: TM, Nsd, Xrd-mgr | TM, Rhd, stagerd, TGW | Cupv, Vmgr, Vdqm, nsd | Cupv, Vmgr, nsd]
CASTOR 2.1.14-15, XROOT 3.3.3-1
Background – xroot infrastructure
[Diagram: Diskservers (x20) | Xroot manager (3.3.3-1) | Xroot redirector (4.X) | European redirectors | Global redirectors | Local WNs | The Grid]
The Problem…s
• Pileup workflow
– Local jobs had 95% failure rate
– Jobs that managed to run had only 30% efficiency
• AAA failure
– Despite being the second site to integrate into AAA
– 100% failure for periods of 30 minutes to several days
Tackling the Problems
Pileup Broken Down
• Data accessed through xroot
• >95% of data at RAL
• Two problems in one
– Slow opening times (15->600 secs)
– Slow transfer rates
– 100% CPU WIO
Slow Opening Times
• No obvious place
– Delays at all phases
– Almost all DB time spent in SubRequestToDo
Solution 1 (aka Go Faster Stripes Solution)
Database Surgery
• DBMS_ALERT suspected to add to delays under load
– Modified DB code to sleep for 50 ms (limiting rate to 20 per second for SubRequestToDo)
• Tested on preprod (functionally)
– Improved open time from 3-15 secs to 0-5 secs
• Deployed on all instances
• Made NO difference for CMS problem
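The arithmetic behind the sleep is simple: a loop that sleeps 50 ms between empty checks runs at most 1/0.05 = 20 times per second. The sketch below illustrates that idea in Python only. It is not the CASTOR database code, and `poll_for_work`, `fetch` and `max_polls_per_second` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional

SLEEP_SECONDS = 0.05  # 50 ms, the figure quoted on the slide


def max_polls_per_second(sleep_seconds: float) -> float:
    """Upper bound on loop iterations per second when each one sleeps."""
    return 1.0 / sleep_seconds


def poll_for_work(fetch: Callable[[], Optional[str]],
                  max_polls: int,
                  sleep_seconds: float = SLEEP_SECONDS) -> Optional[str]:
    """Poll `fetch` until it returns work, sleeping between empty polls.

    A fixed-interval poll stands in for a blocking wait on a notification.
    `fetch` is a hypothetical stand-in for the database call.
    """
    for _ in range(max_polls):
        work = fetch()
        if work is not None:
            return work
        time.sleep(sleep_seconds)
    return None
```

The trade-off in this pattern is a bounded polling rate in exchange for up to one sleep interval of added latency per request.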
Solution 2 (aka The Heart Bypass Solution)
Bypassing Scheduler
• Modified xroot to disable scheduling
• RISK
– Nothing restricting access to disk server
– ONLY applied to CMS
• RESULT
– Open times reduced to 1-30 seconds
– WIO still flatlining at 100%
• ‘SUCCESS’
Improving IO
• Difficult to test
– Could not generate artificially
– Needed pileup workflow to be executing
• Testing on production ;)
• Did ‘the usual’
– Reducing allowed connections
– Throttling batch jobs
Solution 3 (aka The Don’t Do This Solution)
• Change UNIX I/O scheduler
– Now easy and can be done in-situ
• Four schedulers (plus options)
– Cfq (default), anticipatory, deadline, noop
– Plus associated config
• Switched to noop
– WIO dropped to 60%
– Network rate increased 4x
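On Linux kernels that offer these schedulers, the choice is exposed per block device in sysfs as a line such as `noop anticipatory deadline [cfq]`, with the active one in brackets. The sketch below shows how an in-situ switch can be done under that assumption. The device name `sda` in the usage note and the helper names are illustrative, not taken from the slides, and writing to sysfs needs root.

```python
from pathlib import Path
from typing import List, Tuple


def parse_scheduler_line(line: str) -> Tuple[str, List[str]]:
    """Split a sysfs scheduler line into (active, available)."""
    active = ""
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # brackets mark the active scheduler
            active = name
        available.append(name)
    return active, available


def scheduler_path(device: str) -> Path:
    """sysfs file that holds the scheduler choice for one block device."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device: str, name: str) -> None:
    """Select an I/O scheduler for one device (needs root privileges)."""
    path = scheduler_path(device)
    _, available = parse_scheduler_line(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    path.write_text(name)
```

A call such as `set_scheduler("sda", "noop")` takes effect immediately and does not persist across reboots. The slide title is a reminder to treat this as a diagnostic step rather than a recommendation.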
XROOT Problems
Observations
• Random Failures (or more correctly random successes)
• Local access was OK (if slow – see previous)
• Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug
Investigating the Problem
• Set up parallel infrastructure
– Replicated manager, RAL redirector and European redirector
• Immediately saw the same issue…
Causes of Failure…
• Caching!
– Cmsd and xrootd timed out at different times
– Xroot can return ENOENT, but later cmsd gets a response, and subsequent accesses work
– If cmsd doesn’t get a response, all future requests get ENOENT
• But why the slow response…?
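The interaction of the two timeouts can be illustrated with a toy model. This is only a sketch of the behaviour described on the slide, not xrootd or cmsd code. The class name, the two timeout fields and the values used in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

ENOENT = "ENOENT"
OK = "OK"


@dataclass
class ToyRedirector:
    client_timeout: float  # hypothetical: how long the xrootd side waits
    cache_timeout: float   # hypothetical: how long the cmsd side waits

    def outcomes(self, response_delay: float) -> Tuple[str, str]:
        """Return (first request, later requests) for one file lookup.

        `response_delay` is how long the backend takes to answer.
        """
        # The first client sees ENOENT if the answer misses the short timeout.
        first = OK if response_delay <= self.client_timeout else ENOENT
        # Later clients depend on whether the answer ever reached the cache.
        later = OK if response_delay <= self.cache_timeout else ENOENT
        return first, later
```

With `ToyRedirector(client_timeout=5, cache_timeout=30)`, a 10 second response gives `("ENOENT", "OK")`, the "random success" pattern, while a 60 second response gives `("ENOENT", "ENOENT")` for every request.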
Log Mining…
• Each log looked like performance was good
• Part of problem
– Time resolution in xroot 3.3.X
– And logging generally
• Finally found delays in ‘local’ nsd
– Processing time was good
– But delays in servicing requests
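The distinction between processing time and servicing delay can be made explicit once three timestamps per request are available. The sketch below assumes such timestamps have already been extracted from the logs. The record layout and field names are hypothetical and do not reflect the actual nsd log format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RequestTimes:
    received: float  # when the request reached the daemon (seconds)
    started: float   # when the daemon began processing it
    finished: float  # when the reply was sent


def mean_wait_and_processing(
        records: Iterable[RequestTimes]) -> Tuple[float, float]:
    """Return (mean wait before service, mean processing time)."""
    rows = list(records)
    wait = mean(r.started - r.received for r in rows)
    processing = mean(r.finished - r.started for r in rows)
    return wait, processing
```

A log that records only `started` and `finished` will report good performance even when the wait before service dominates, which matches the pattern described above.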
Solution – RAL Infrastructure
[Diagram labels: TM, Nsd, Xrd-mgr | TM, Rhd, stagerd, TGW | Nsd, Xrd-mgr | Diskservers (x20) | Xroot redirector (4.X) | Local WNs | Remote WNs | RAL | EU Redirectors | Global Redirectors | The Grid]
CASTOR 2.1.14-15, XROOT 3.3.6-1