
Berkeley Research Computing

Town Hall Meeting

Savio Overview

SAVIO - The Need Has Been Stated

Savio's inception and design were based on a specific need articulated by Eliot Quataert and nine other faculty:

Dear Graham,

We are writing to propose that UC Berkeley adopt a condominium computing model, i.e., a more centralized model for supporting research computing on campus...

SAVIO - Condo Service Offering

● Purchase into Savio by contributing standardized compute hardware

● An alternative to running a cluster in a closet maintained by grad students and postdocs

● The condo trade-off:
○ Idle resources are made available to others
○ There are no (ZERO) operational costs for administration, colocation, base storage, optimized networking and access methods, and user services

● Scheduler gives priority access to resources equivalent to the hardware contribution

SAVIO - Faculty Computing Allowance

● Provides allocations to run on Savio as well as support to researchers who have not purchased Condo nodes

● 200k Service Units (core hours) annually
● More than just compute:
○ File systems
○ Training/support
○ User services
● PIs request their allocation via survey
● Early user access (based on readiness) now
● General availability planned for fall semester

SAVIO - System Overview

● Similar in design to a typical research cluster
○ The Master Node role has been broken out (management, scheduling, logins, file system, etc.)
● Home storage: enterprise level, backed up, with quotas
● Scratch space: large and fast (Lustre)
● Multiple login/interactive nodes
● DTN: Data Transfer Node
● Compute nodes are delineated based on role

SAVIO - System Architecture

SAVIO - Specification

● Hardware
○ Compute Nodes: 20-core, 64 GB, InfiniBand
○ BigMem Nodes: 20-core, 512 GB, InfiniBand
● Software Stack
○ Scientific Linux 6 (equivalent to Red Hat Enterprise Linux 6)
○ Parallelization: OpenMPI, OpenMP, POSIX threads
○ Intel Compiler
○ SLURM job scheduler
○ Software Environment Modules (example below)
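
As a rough sketch of how the pieces of this stack fit together, loading the compiler and MPI modules and building a small MPI program might look like the following (the module names "intel" and "openmpi" and the source file hello.c are assumptions for illustration; the actual names on Savio may differ):

    module load intel openmpi     # assumed module names; check what is available with "module avail"
    mpicc hello.c -o hello        # compile an MPI program with the Intel-backed wrapper
    mpirun -np 20 ./hello         # launch 20 ranks, one per core on a standard node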

SAVIO - OTP

● The biggest security threat that we encounter ...

STOLEN CREDENTIALS

● Credentials are stolen via keyboard sniffers installed on researchers' laptops or workstations, which are incorrectly assumed to be secure

● OTP (One-Time Passwords) offers mitigation
● Easy to learn, simple to use, and works on both computers and smartphones!

SAVIO - Future Services

● Serial/HTC Jobs
○ Expanding the initial architecture beyond just HPC
○ Specialized node hardware (12-core, 128 GB, PCI flash storage)
○ Designed for jobs that use <= 1 node
○ Nodes are shared between jobs
● GPU nodes
○ GPUs are optimal for massively parallel algorithms
○ Specialized node hardware (8-core, 64 GB, 2x Nvidia K80)

Questions

Berkeley Research Computing

Town Hall Meeting

Savio User Environment

SAVIO - Faculty Computing Allowance

● Eligibility requirements
○ Ladder-rank faculty or PI on the UCB campus
○ In need of compute power to solve a research problem
● Allowance Request Procedure
○ First fill out the Online Requirements Survey
○ The allowance can be used either by the faculty member or by immediate group members
○ For additional cluster accounts, fill out the Additional User Account Request Form
● Allowances
○ New allowances start on June 1st of every year
○ Mid-year requests are granted a prorated allocation
○ A cluster-specific project (fc_projectname) with all user accounts is set up
○ A scheduler account (fc_projectname) with 200K core hours is set up
○ Annual allocations expire on May 31st of the following year

SAVIO - Access

● Cluster access (example below)
○ Connect using SSH (server name - hpc.brc.berkeley.edu)
○ Uses OTP - One-Time Passwords (multifactor authentication)
○ Multiple login nodes (users are randomly distributed)
● Coming in the future
○ NERSC's NEWT REST API for web portal development
○ iPython notebooks & JupyterHub integration
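
For reference, a typical login from a terminal might look like the following sketch (the username is a placeholder; the password prompt expects your one-time password rather than a static password):

    # Replace "myusername" with your own Savio account name (placeholder).
    ssh myusername@hpc.brc.berkeley.edu
    # At the "Password:" prompt, enter your OTP (e.g., PIN plus token code), not a fixed password.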

SAVIO - Data Storage Options

● Storage
○ No local storage on compute nodes
○ All storage accessed over the network
○ Either NFS or Lustre protocol
● Multiple file systems (see the usage sketch below)
○ HOME - NFS, 10 GB quota, backed up, no purge
○ SCRATCH - Lustre, no quota, no backups, can be purged
○ Project (GROUP) space - NFS, 200 GB quota, no backups, no purge
○ No long-term archive
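
A quick way to keep an eye on the 10 GB home quota is to check usage from a login node; this is a generic sketch using standard tools rather than a Savio-specific quota command:

    du -sh ~               # total size of your home directory (compare against the 10 GB quota)
    du -sh ~/* | sort -h   # per-directory breakdown, largest last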

SAVIO - Data Transfers

● Use only the dedicated Data Transfer Node (DTN)
● Server name - dtn.brc.berkeley.edu
● Globus (web interface) is highly recommended for transfer management
● Many other traditional tools are also supported on the DTN (example below)
○ SCP/SFTP
○ Rsync
○ BBCP
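
As a sketch of a command-line transfer through the DTN (the username and the local/remote paths shown are placeholders, not actual Savio paths):

    # Copy a local directory to the cluster via the DTN (placeholder username and paths).
    rsync -avP ./my_dataset/ myusername@dtn.brc.berkeley.edu:my_dataset/
    # Single files can also go over scp:
    scp results.tar.gz myusername@dtn.brc.berkeley.edu: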

SAVIO - Software Support

● Software module farm
○ Many of the most commonly used packages are already available
○ In most cases, packages are compiled from source
○ Easy command-line tools to browse and access packages ($ module cmd; see the sketch below)
● Supported package list
○ Open Source
■ Tools - octave, gnuplot, imagemagick, visit, qt, ncl, paraview, lz4, git, valgrind, etc.
■ Languages - GNU C/C++/Fortran compilers, Java (JRE), Python, R, etc.
○ Commercial
■ Intel C/C++/Fortran compiler suite, Matlab with an 80-core license for MDCS
● User applications
○ Individual user/group-specific packages can be built from source by users
○ GROUP storage space is recommended for sharing with others in the group
○ Savio consultants are available to answer your questions
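
A minimal sketch of browsing and loading packages with the environment modules system (the package names shown are drawn from the supported list above; exact module names and versions on Savio may differ):

    module avail             # list all packages provided by the module farm
    module load gcc python   # load the GNU compilers and Python (assumed module names)
    module list              # show what is currently loaded
    module unload python     # drop a package from the environment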

SAVIO - Job Scheduler

● SLURM

● Multiple Node Options (partitions)

● Interaction with Scheduler (example below)
○ Only with command-line tools and utilities
○ Online web interfaces for job management could be supported in the future via NERSC's NEWT REST API, iPython/Jupyter, or both
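
The standard SLURM utilities cover day-to-day interaction; a brief sketch (the script name myjob.sh and the job ID are placeholders):

    sbatch myjob.sh     # submit a batch job script (placeholder filename)
    squeue -u $USER     # list your queued and running jobs
    scancel 12345       # cancel a job by its ID (placeholder ID)
    sinfo               # show partition and node status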

Quality of Service | Max allowed running time/job | Max number of nodes/job
savio_debug        | 30 minutes                   | 4
savio_normal       | 72 hours (i.e., 3 days)      | 24

Partition    | # of nodes | # of cores/node | Memory/node | Local Storage
savio        | 160        | 20              | 64 GB       | No local storage
savio_bigmem | 4          | 20              | 512 GB      | No local storage
savio_htc    | 12         | 12              | 128 GB      | Local PCI Flash
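
Putting the partitions and QoS levels above together, a batch job script might look like the following sketch (the account name follows the fc_projectname convention from the allowance slide; the job name, node count, time limit, modules, and application are placeholders):

    #!/bin/bash
    #SBATCH --job-name=example            # placeholder job name
    #SBATCH --account=fc_projectname      # scheduler account from your allowance (placeholder)
    #SBATCH --partition=savio             # standard 20-core / 64 GB nodes
    #SBATCH --qos=savio_normal            # up to 72 hours and 24 nodes
    #SBATCH --nodes=2
    #SBATCH --time=05:00:00               # 5 hours (placeholder)

    module load intel openmpi             # assumed module names, as in the stack sketch above
    mpirun ./my_application               # placeholder executable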

SAVIO - Job Accounting

● Jobs gain exclusive access to assigned compute nodes
● Jobs are expected to be highly parallel and capable of using all the resources on assigned nodes

For example:

● Running on one standard node for 5 hours uses 1 (node) * 20 (cores) * 5 (hours) = 100 core-hours (or Service Units).
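
● By the same accounting, a 4-node job that runs for 24 hours would charge 4 * 20 * 24 = 1,920 core-hours, and the 200K annual allowance corresponds to roughly 10,000 hours on a single standard node.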

SAVIO - How to Get Help

● Online User Documentation
○ User Guide - http://research-it.berkeley.edu/services/high-performance-computing/user-guide
○ New User Information - http://research-it.berkeley.edu/services/high-performance-computing/new-user-information
● Helpdesk
○ Email : [email protected]
○ Monday - Friday, 9:00 am to 5:00 pm
○ Best effort outside working hours

Thank you

Questions