David Cameron, Claire Adam Bourdarios, Andrej Filipcic, Eric Lancon, Wenjing Wu
ATLAS Computing Jamboree, 3 December 2014
Volunteer Computing
Volunteer Computing @ CERN
• 2004: LHC@Home Sixtrack
• 2011: LHC@Home Test4Theory
• 2014: ATLAS@Home, CMS@Home, Beauty@Home (LHCb)
ATLAS@Home
• Why use volunteer computing for ATLAS?
  – It’s free! (almost)
  – Public outreach
• Considerations
  – Low priority jobs with high CPU-I/O ratio
    • Non-urgent Monte Carlo simulation
  – Need virtualisation for ATLAS sw environment
    • CERNVM image and CVMFS
  – No grid credentials or access on volunteer hosts
    • ARC middleware for data staging
  – The resources should look like a regular Panda queue
    • ARC Control Tower
Initial ATLAS@Home Architecture
[Architecture diagram. Labelled components: ARC Control Tower, Panda Server, ARC CE, Session Directory, BOINC LRMS Plugin, BOINC server, Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB, proxy cert, BOINC PQ, CERN]
Current ATLAS@Home Setup
[Architecture diagram. Labelled components: ARC Control Tower, Panda Server, ARC CE, BOINC server (vLHC@Home), Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB on demand, BOINC PQ, Shared NFS]
ATLAS@Home History
• Test server with ARC CE and BOINC server with ATLAS@Home app ran in Beijing from January
  – http://gilda117.ihep.ac.cn
  – Volunteers found it somehow…
• In July volunteers were moved to CERN server with ARC CE + BOINC
  – http://arc-boinc-01.cern.ch (alias atlasathome.cern.ch)
  – CERN IT provided 1TB NFS space for job input/output
• At the same time ATLAS@Home became an official BOINC project
• In early October the BOINC server was changed to a vLHC@Home server run by CERN IT
  – Volunteers + credit moved too
• A parallel test setup with separate ARC CE and BOINC server exists for testing
BOINC jobs
• Real simulation tasks
  – mc12_8TeV.117079.PowhegPythia_P2011C_ttbar_nonallhad_mtt_2000p.simul.e2940_s1773
  – Full athena jobs
  – 50 events/job
• Runs in CERNVM with pre-cached software
• But some data still needs to be downloaded at runtime
  – Conditions data from squid/frontier
• Image is 1.1GB (500MB compressed) and downloaded only once
• Input files (data file + small scripts) are 1-100MB
• Output is ~100MB
• VM memory is now 2GB (was 1GB initially, but jobs are now more complex)
• Jobs take from a few hours up to a few days on a fast (single) core
• Validation
  – Per work unit: check that the output is produced (only that the file exists, the content is not checked)
  – Physics validation comparing results to regular Grid task
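The per-work-unit check above can be pictured with a minimal Python sketch. This is not the project's validator, which these slides do not show; the function name, the result directory and the output file name are all hypothetical. It only mirrors the stated rule: the output file must exist, and its content is not inspected.

```python
from pathlib import Path


def validate_work_unit(result_dir: Path, output_name: str) -> bool:
    """Return True if the expected output file exists in result_dir.

    Hypothetical helper: mirrors the slide's rule that only the
    existence of the output file is checked, not its content.
    """
    return (result_dir / output_name).is_file()
```

An existence-only check is cheap but cannot catch a corrupt or truncated output, which is presumably why the slides list a separate physics validation against a regular Grid task.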
How does it work for volunteers?
• Install BOINC client and VirtualBox
  – Linux, Mac and Windows supported
  – Currently 80% of hosts have Windows
• In BOINC client choose ATLAS@Home and create an account
• That’s it!
Issues with jobs
• The majority of volunteers (~80%) never complete a single job
  – Not powerful enough resources, entry barrier is too high
    • Requires 64-bit, at least 4GB, decent bandwidth, installing VirtualBox
    • ATLAS@home is the hardest BOINC project to run (quote from volunteer)
  – Unreliable system/failing jobs also push people away
    • The worst thing for volunteers is to use CPU and not be given credit
  – BUT the normal retention rate of a project is 10%
• More problems
  – Virtualisation/VMwrapper causes a lot of problems (memory, jobs not finishing, unstable)
  – Firewall issues accessing conditions data through squids
    • We are working on ways to cache this data in the image to avoid network access from the job
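The entry barrier listed above can be written down as a simple host check. The sketch below is illustrative only: the `Host` class and `unmet_requirements` function are hypothetical names, "at least 4GB" is read here as host memory (an assumption), and "decent bandwidth" is left out because the slides give no number for it.

```python
from dataclasses import dataclass
from typing import List

MIN_MEMORY_GB = 4  # "at least 4GB" from the slide, assumed to mean host memory


@dataclass
class Host:
    """Hypothetical description of a volunteer machine."""
    is_64bit: bool
    memory_gb: float
    has_virtualbox: bool


def unmet_requirements(host: Host) -> List[str]:
    """List the slide's requirements that this host does not meet."""
    problems = []
    if not host.is_64bit:
        problems.append("64-bit system required")
    if host.memory_gb < MIN_MEMORY_GB:
        problems.append(f"at least {MIN_MEMORY_GB}GB memory required")
    if not host.has_virtualbox:
        problems.append("VirtualBox not installed")
    return problems
```

For example, `unmet_requirements(Host(is_64bit=True, memory_gb=2, has_virtualbox=True))` reports only the memory requirement.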
Volunteer growth
Currently >12000 volunteers, 1000 active
300 new volunteers/week
Einstein@Home: 300k volunteers, 47k active
Seti@Home: 5 million volunteers, 150k active
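To compare the three projects, the active fraction can be worked out from the numbers above. A small Python sketch, taking the quoted figures at face value; ATLAS@Home is quoted as ">12000" volunteers, so its fraction is an upper bound.

```python
# (total volunteers, active volunteers) as quoted on the slide
projects = {
    "ATLAS@Home": (12_000, 1_000),       # ">12000" total, so fraction is an upper bound
    "Einstein@Home": (300_000, 47_000),
    "Seti@Home": (5_000_000, 150_000),
}


def active_fraction(total: int, active: int) -> float:
    """Fraction of registered volunteers that are currently active."""
    return active / total


for name, (total, active) in projects.items():
    print(f"{name}: {active_fraction(total, active):.1%} active")
```

This gives roughly 8.3% for ATLAS@Home, 15.7% for Einstein@Home and 3.0% for Seti@Home.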
Job statistics
• Continuous 2000-3000 running jobs
• Almost 300k completed jobs
• 500k CPU hours
• 14M events
• 50% CPU efficiency
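As a back-of-the-envelope cross-check, the figures above can be combined with the 50 events/job quoted earlier. The sketch assumes the rounded slide numbers are exact, that "500k CPU hours" covers the completed jobs, and that CPU efficiency means CPU time divided by wall-clock time; none of these is stated explicitly in the slides.

```python
EVENTS_PER_JOB = 50          # from the job description
total_events = 14_000_000    # "14M events"
completed_jobs = 300_000     # "almost 300k completed jobs" (rounded up)
cpu_hours = 500_000          # "500k CPU hours"
cpu_efficiency = 0.5         # assumed to be CPU time / wall-clock time

# Number of jobs needed to produce the quoted events
jobs_implied_by_events = total_events // EVENTS_PER_JOB

# Average CPU time per completed job, and the wall-clock time it implies
cpu_hours_per_job = cpu_hours / completed_jobs
wall_hours_per_job = cpu_hours_per_job / cpu_efficiency

print(jobs_implied_by_events)          # 280000
print(round(cpu_hours_per_job, 2))     # 1.67
print(round(wall_hours_per_job, 2))    # 3.33
```

The 280k jobs implied by the event count agrees with "almost 300k completed jobs". The per-job averages are only indicative, since they rest on rounded totals.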
Standard BOINC webpage
• http://atlasathome.cern.ch
• Technical info on how to join
• Message boards
• Jobs/results
• Job statistics
ATLAS@Home public outreach page
• https://atlasphysathome.cern.ch
• Designed by Claire using Drupal
• Entry point for the public to find out what they are contributing to
• Many links to existing outreach pages
Screensaver
• Many BOINC projects run as “screensavers”
• Working with Riccardo-Maria Bianchi from ATLAS event display VP1 to make ATLAS@Home screensaver
  – Show pre-configured event displays as events are produced to show people what they are running
• This can help motivate people to look more into the physics details
Lessons Learned and Future
• It takes a lot of effort to run ATLAS@Home
  – In the interaction with volunteers
    • Some volunteers are extremely competent and knowledgeable and help others
  – Maintaining and improving the system workflow
• The number of running jobs has reached a plateau
  – We are exploring scaling options with CERN IT (Ceph, multiple apache servers etc)
  – Not enough people joining
    • But we deliberately haven’t advertised too much, so as to ramp up slowly
• The major problems are caused by vboxwrapper
• BOINC developers very enthusiastic to help us
  – They give us fixes/new features in days
• We have a few more things to fix before ATLAS@Home can move out of beta
  – New manpower starting now will help greatly
• We want to push ATLAS@home internally inside ATLAS
  – eg now available as part of NICE, to put on CERN administrative PCs
Stop press! http://cds.cern.ch/journal/CERNBulletin/2014/49/News%20Articles/1971985?ln=en
ATLAS@Home potential
• It is not possible to run just any ATLAS job on ATLAS@home
  – See earlier considerations about I/O, unreliability etc
• But ~50% of jobs could feasibly run on this platform
• The high entry barrier may limit general public participation
• Can it replace small Grid sites?
  – For example a CPU-only T3 site or small university cluster
  – Instead of setting up all the Grid infrastructure just install BOINC on the worker nodes
  – Standard Grid accounting in APEL is provided by ARC CE
Thanks
• Thanks to our CERN IT colleagues in LHC@Home for providing the BOINC infrastructure and storage space
• .. and please join us! http://atlasathome.cern.ch