TRANSCRIPT

Guillimin HPC Users Meeting
November 10, 2016
McGill University / Calcul Québec / Compute Canada
Montréal, QC Canada

• Please be kind to your fellow user meeting attendees
• Please limit yourself to two slices of pizza per person to start
• And please recycle your pop cans
• Thank you!

Outline

• Compute Canada News
• System Status
• Software Updates
• Training News
• Special Topic
  • CernVM File System (CVMFS)

Compute Canada News

• (Reminder) 2017 Resource Allocation Competitions
• More information here: https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/
• Competition Information Sessions (slides): https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/#2017

Process | Opens | Due
Fast Track application (by invitation only) | Early October | Nov 9, 2016
RRG and RPP full application | Early October | Nov 24, 2016
Announcement & implementation of awards | Early March 2017 | Mid April 2017

Compute Canada News

• When should you request Resources for Research Groups (RRG) on Guillimin?
  • If you need more than 2 times the default CPU allocation
    • RAS default priority level: 30 core-years
    • Resource name: Guillimin - phase 1 or phase 2
  • If you need more than 1 GPU-year or 1 MIC-year
    • RAS provides only a small fraction of a GPU or MIC
  • If you need more than 5 times the default storage allocation
    • RAS default project space allocation: 1 TB (up to 5 TB on demand, but not guaranteed)
    • Resource name: DataSTAR, total 4 PB for RAC 2017
  • If you need storage space on tape (archive or backup)
    • RAS: only home directories are saved in the backup system
    • Resource name: DataSTAR, Guillimin - phase 2

System Status

• November 1: downtime for scheduled site-wide power outage
  • Clean shutdown of VMs on GPFS, nodes, and GPFS
  • Maintenance of the UPS system
  • Clean shutdown of remaining VMs and services
• November 2: brought services back online
  • Tested: GPFS, job scheduling, nodes
  • Tried to update, but rolled back: Matlab license
  • Reopened access to login nodes and the scheduler
• November 4: fixed an LDAP issue affecting account creation
  • Increased the number of lock files
• November 8: InfiniBand issue caused a GPFS hang
  • Needed to restart the IB fabric and GPFS

Storage Status

• Space management
  • /gs is full: 99% used, 51 TB free (as of Nov. 7)
  • For better space management we continue to migrate cold data from disk to tape
    • Metadata remains on disk
    • Users can still access their files through the usual methods, but with increased latency
  • Storage space is a precious resource - manage it wisely!
    • Delete temporary files, compress large files that are not frequently accessed, tar many smaller files into collections, … (quick examples below)
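
A quick illustration of the cleanup suggestions above (the paths are hypothetical):

    # compress a large, rarely accessed file in place
    $ gzip /gs/project/xyz-123-aa/results/large_output.dat
    # bundle a directory of many small files into one archive, then remove the originals
    $ tar -czf small_files.tar.gz small_files/ && rm -rf small_files/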

New Software Installations

• Maven/3.3.9 - Apache Maven: a build manager for Java projects (usage sketch below)
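
A quick sketch of using the new module (the module name is as listed above; the exact module command may vary with your shell setup):

    # load Apache Maven into the environment
    $ module load Maven/3.3.9
    # confirm the version, then run a build in a Java project directory
    $ mvn -version
    $ mvn package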

Training News

• All upcoming events: calculquebec.eventbrite.ca
  • Nov. 23 - Intermediate R Programming (U. Montreal)
• Recently completed:
  • Oct. 27 - Software Carpentry (U. Montreal)
  • Oct. 27 - Introduction to Intel Xeon Phi (McGill U.)
  • Nov. 8 - Easy GPU Programming with OpenACC (U. Laval)
  • Nov. 9 - Introduction to Compute Servers (U. Sherb.)
• All materials from previous workshops are available online: wiki.calculquebec.ca/w/Formations/en
• All user meeting presentations are online at www.hpc.mcgill.ca

User Feedback and Discussion

• Questions? Comments?
• We value your feedback. Contact us at:
• Guillimin Operational News for Users
  • Status pages:
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  • Follow us on Twitter: http://twitter.com/McGillHPC

CernVM File System (CVMFS)
November 10, 2016
McGill University / Calcul Québec / Compute Canada
Montréal, QC Canada

Outline

• What is CVMFS?
• How it works
  • Structure, technology, and workflow
• CVMFS in Compute Canada projects
  • GenAP, MUGQIC, SoftCC
• Outlook and support
• Conclusion

What is CVMFS?

• CVMFS: CERN Virtual Machine File System (CernVM-FS)
• Designed to deliver software in a fast, scalable, reliable, and distributed way
• A file system hosted on standard web servers and mounted by clients at a universal location (/cvmfs) as a POSIX read-only file system in user space (a FUSE module); see the client-side sketch below
• Software can be installed in one location and cached on demand anywhere using CVMFS technology
• Through aggressive caching and reduction of latency, CernVM-FS focuses specifically on the software use case (many small files)
• Recent development is extending it to data files (large files)
• Originally developed for the LHC (Large Hadron Collider) experiments to optimally deliver software for VM images, and as a replacement for separate software installation areas at many distributed locations (>200 HPC sites)
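
A minimal sketch of the client-side view (atlas.cern.ch is a real LHC repository name, used here only as an example):

    # autofs mounts a repository under /cvmfs the first time it is accessed
    $ ls /cvmfs/atlas.cern.ch
    # directory listings come from the catalog; file contents are downloaded
    # from the web servers only when a file is opened, then cached locally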

CVMFS Structure Example at the LHC

(Figure: diagram of the CVMFS distribution structure used by the LHC experiments)

CVMFS Technology

• The FUSE kernel module is used
• A virtual file system that loads data only on access
• All data/software is hosted on a CernVM-FS repository (the stratum-0)

Opening a File on CernVM-FS (Client Side)

• Name resolution via an SQLite catalog
• File downloads are verified against the cryptographic hash of the corresponding catalog entry

CVMFS Distribution Workflow

• Librarian node -> stratums -> clients
• If a file is not in the local cache, it is fetched from the squid; if the squid does not have it either, the squid fetches it from a stratum-1 (a stratum-1 is always a replica of the stratum-0)

The Repository (Librarian) Node

• A node on which the developer (librarian) interacts with the repository in order to publish files (software or data files)
• Protection mechanisms exist to preserve publishing integrity, e.g. one librarian at a time (a single librarian account)
• Files are published to the stratum-0 via a publish command
• Publishing changes are tracked by a union file system that combines a CernVM-FS read-only mount point with a writable scratch area

Example of a Librarian Interaction

• A sudo librarian account is used to manage software injection into CVMFS for Compute Canada projects
• Commands: cvmfs_server transaction, cvmfs_server publish, cvmfs_server abort (example below)
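
A minimal sketch of one publish cycle using the commands above (the repository name soft.example.org is hypothetical):

    # open a writable transaction on the repository
    $ cvmfs_server transaction soft.example.org
    # install new files into the writable union mount
    $ cp -r mytool-1.0 /cvmfs/soft.example.org/tools/
    # sign and publish the new revision (or discard it with: cvmfs_server abort)
    $ cvmfs_server publish soft.example.org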

Stratum-0, Stratum-1

• Stratum-0: a protected read/write instance of the files
  • It feeds the public, distributed mirror web servers (the stratum-1s); see the replication sketch below
• A distributed hierarchy of proxy servers (squids) fetches content from the closest public mirror server
• Projects can share a stratum-0, or each project can have its own, depending on repository sizes and purposes
  • Example in Compute Canada: we have one for MUGQIC and one for the Compute Canada software (softcc)
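
A hedged sketch of setting up a stratum-1 mirror with the standard cvmfs_server tools (the hostname and key path are hypothetical):

    # on the stratum-1 host: register a replica of the stratum-0 repository
    $ cvmfs_server add-replica -o root \
        http://stratum0.example.org/cvmfs/soft.example.org \
        /etc/cvmfs/keys/soft.example.org.pub
    # synchronize the replica with the stratum-0 (typically run from cron)
    $ cvmfs_server snapshot soft.example.org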

Squid Caches

• Squid: a caching proxy for the web supporting HTTP, HTTPS, FTP, and more
• It reduces bandwidth and improves response times by caching and reusing frequently requested web files; Squid has extensive access controls and makes a great server accelerator
• Depending on cluster size, a site can employ two or more squids, and can also define more than one squid for failover (client-side configuration sketch below)
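
A sketch of the client-side proxy setting (CVMFS_HTTP_PROXY is a real client parameter; the hostnames are hypothetical). Proxies separated by '|' are load-balanced, and groups separated by ';' are tried in order for failover:

    # /etc/cvmfs/default.local
    CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128;DIRECT"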

CVMFS Clients

• CernVM-FS mounts are controlled by autofs and automount
• The base directory is /cvmfs
• CVMFS clients can be:
  • Compute nodes (bare metal)
  • VMs
  • Containers
• It can also deliver files to cloud S3 space
• Clients can use a local cache or an alien cache (configuration sketch below)
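
A minimal client configuration sketch (these are real CVMFS client parameters; the repository and proxy names are hypothetical):

    # /etc/cvmfs/default.local
    CVMFS_REPOSITORIES=soft.example.org               # repositories to mount under /cvmfs
    CVMFS_HTTP_PROXY="http://squid1.example.org:3128" # site squid
    CVMFS_CACHE_BASE=/var/lib/cvmfs                   # location of the local cache
    CVMFS_QUOTA_LIMIT=20000                           # local cache quota, in MB

    # verify that all configured repositories can be mounted
    $ cvmfs_config probe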

"Client-End" Cache Types

• Local cache: local to the node
• Alien cache: files on a shared file system, in a cache outside the control of CernVM-FS (sketch below)
• NFS server mode: exporting a single CernVM-FS client via NFS to compute nodes

(Figure: example of a compute node with a local cache of software at Guillimin)
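
A hedged sketch of pointing clients at an alien cache on shared storage (the path is hypothetical; the parameters are real, and an alien cache requires disabling CernVM-FS's own cache management):

    # /etc/cvmfs/default.local
    CVMFS_ALIEN_CACHE=/gs/shared/cvmfs-cache   # externally managed cache on shared storage
    CVMFS_SHARED_CACHE=no                      # an alien cache cannot be a shared local cache
    CVMFS_QUOTA_LIMIT=-1                       # disable CernVM-FS cache eviction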

Applications (1)

• Software distribution:
  • Delivers software to clusters
  • Clusters can load only the relevant software and versions
  • Software can be organised into whatever tree is desired, to handle different OS types, versions, architectures, and modularization (see the sketch below)
• Data federation (new!): StashCache
  • A cooperating set of storage resources transparently accessible across a wide-area network via a common namespace
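
A sketch of how a module-based software tree served from CVMFS might be used on a client (the repository path and module name are hypothetical):

    # make the CVMFS-hosted module tree visible to the module system
    $ module use /cvmfs/soft.example.org/modules
    # list and load software that was published once at the librarian node
    $ module avail
    $ module load gcc/5.4.0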

StashCache Example

• Workflows for up to 10 TB of data can be achieved
• 500 MB/job of data has been delivered using StashCache
• A good application for remote data access (access sketch below)
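
A hedged sketch of accessing StashCache-federated data through CVMFS (stash.osgstorage.org follows the OSG naming convention; treat the exact repository and path as assumptions):

    # federated data appears under a CVMFS namespace
    $ ls /cvmfs/stash.osgstorage.org
    # a file is fetched from the nearest cache only when it is opened
    $ cp /cvmfs/stash.osgstorage.org/user/someuser/public/dataset.root .   # hypothetical path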

Advantages of CVMFS for Software (1)

1. Centralised software maintenance
   a. Single-point injection and propagation to multiple sites/clients
   b. Software versioning across sites is reinforced
   c. Software mounts on VMs, containers, and compute nodes (bare metal)
   d. Software prerequisites can be installed to meet site compatibilities
   e. Low maintenance, high-scale delivery
2. Files are fetched from the server using the standard HTTP protocol (see the sketch below)
3. Files are cached on demand in order to reduce local network traffic
   a. Tuning is possible to give cached files a longer lifetime
4. The CVMFS structure is highly scalable and redundant
5. Failover is achieved using local and remote squids
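
Illustrating point 2: everything a client fetches is an ordinary web object. For example, each repository exposes a manifest at a standard URL (.cvmfspublished is the standard file name; the hostname is hypothetical):

    # fetch the current repository manifest over plain HTTP
    $ curl http://stratum1.example.org/cvmfs/soft.example.org/.cvmfspublished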

Applications Within Compute Canada

• GenAP - Genetics and Genomics Analysis Platform: https://genap.ca/public/home
  • Currently over 100 bioinformatics tools, packages, and pipelines are transparently distributed
• Compute Canada software
  • Presently each cluster has its own separate software module system
  • CVMFS technology will be used in the new systems
• Other projects:
  • https://bitbucket.org/mugqic/mugqic_pipelines
  • Also subatomic physics projects (the traditional users)

Present GenAP, MUGQIC CVMFS Structure in CC

(Figure: deployment diagram spanning West and East sites, including GP2 (UVic) and also Briarée)

GenAP CVMFS Deploying Software to Clusters

(Figure: GenAP CVMFS; source: http://cggony.wixsite.com/genap-v1)

SoftCC CVMFS

• SoftCC-CVMFS: software for Compute Canada sites
• Software distribution targeted at the new sites
• Under test and piloting
• The softcc CVMFS repository will be deployed on GP2 and GP3, and will be accessible to new and other CC systems, providing a standard set of software that is identical at all locations

Support

• Main development is done by CERN
• Within Compute Canada, adaptation, deployment, and minimal development tuning to suit CC projects are done by the CVMFS group
• Infrastructure to deploy global CVMFS to the new sites is being deployed
• CC support email: [email protected]

End-User Experience (Demo on a Login Node)

• The cvmfs_config command is available to check CVMFS (sketch below)
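
A sketch of what such a check might look like (cvmfs_config probe and stat are standard client subcommands; the repository name is hypothetical):

    # check that all configured repositories can be mounted and reached
    $ cvmfs_config probe
    # show client details (revision, cache usage, proxy in use) for one repository
    $ cvmfs_config stat -v soft.example.org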

Conclusion

• CernVM-FS technology is very useful for software distribution and, recently, for data
  • Low maintenance but highly scalable: it can serve hundreds of clusters with only one software injection point
• Beyond subatomic physics projects, two bioinformatics CC projects have adopted it, and it is to be implemented at all Compute Canada sites as the standard way to distribute software

Questions?

EXTRA SLIDES

CVMFS Key Features Desirable for Software Delivery

• Use of the FUSE kernel module, which comes with in-kernel caching of file data and file attributes
• Cache quota management
• Use of a content-addressable storage format, resulting in immutable files and automatic file de-duplication
• Possibility to split a directory hierarchy into sub-catalogs at user-defined levels
• Automatic updates of file catalogs, controlled by a time-to-live stored inside the file catalogs
• Digitally signed repositories
• Transparent file compression/decompression and transparent file chunking
• Capability to work in offline mode, provided that all required files are cached
• File system versioning and hotpatching
• Dynamic expansion of environment variables embedded in symbolic links