National Center for Atmospheric Research
Computation and Information Systems Laboratory
Facilities and Support Overview
Feb 14, 2010
Si Liu, NCAR/CISL/OSD/USS Consulting Service Group

TRANSCRIPT

Page 1:

National Center for Atmospheric Research
Computation and Information Systems Laboratory

Facilities and Support Overview

Feb 14, 2010

Si Liu NCAR/CISL/OSD/USS

Consulting Service Group

Page 2:

CISL’s Mission for User Support

“CISL will provide a balanced set of services to enable researchers to securely, easily, and effectively utilize community resources”

CISL Strategic Plan

CISL also supports special colloquia, workshops, and computational campaigns, giving these groups of users special privileges and access to facilities and services above and beyond normal service levels.

Page 3:

CISL Facility

Navigation and usage of the facility requires basic familiarity with a number of its functional aspects:

• Computing systems
  o Bluefire
  o Frost
  o Lynx
  o NWSC machine

• Usage
  o Batch
  o Interactive

• Data Archival
  o MSS
  o HPSS
  o GLADE

• Data Analysis & Visualization
  o Mirage and Storm

• Allocations and security

• User support

Page 4:

Allocations

• Allocations are granted in General Accounting Units (GAUs)

• Monitor GAU usage through the CISL portal: http://cislportal.ucar.edu/portal (requires UCAS password)

• Charges are assessed overnight and are available for review for runs that complete by midnight.

• GAUs charged = wall-clock hours used × number of nodes used × number of processors per node × computer factor × queue charging factor (see the worked example below)

• The computer charging factor for Bluefire is 1.4.
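As a worked example (queue charging factor assumed to be 1.0 for illustration): a job that runs for 2 wall-clock hours on 4 Bluefire nodes with 32 processors each is charged

  2 hours × 4 nodes × 32 processors × 1.4 × 1.0 = 358.4 GAUs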

Page 5:

Security

• CISL Firewall
  o Internal networks are separated from the external Internet
  o Protects servers from malicious attacks from external networks

• Secure Shell
  o Use SSH for local and remote access to all CISL systems (see the example below).

• One-time passwords for protected systems
  o CryptoCard
  o Yubikey
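A typical SSH login then looks like the following (the host name is illustrative; use the documented login host for your system):

  ssh -X -l username bluefire.ucar.edu

For systems behind one-time-password protection, the password prompt expects the CryptoCard or Yubikey response rather than a fixed password.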

Page 6:

Page 7:

Computing System - Bluefire

• IBM clustered Symmetric MultiProcessing (SMP) system
  o Operating system: AIX (IBM-proprietary UNIX)
  o Batch system: Load Sharing Facility (LSF)
  o File system: General Parallel File System (GPFS)

• 127 32-way 4.7 GHz nodes
  o 4,064 POWER6 processors
  o SMT enabled (64 SMT threads per node)
  o 76.4 TFLOPS

• 117 compute nodes (70.4 TFLOPS peak)
  o 3,744 POWER6 processors (32 per node)
  o 69 compute nodes have 64 GB memory
  o 48 compute nodes have 128 GB memory

• 10 other nodes
  o 2 interactive session/login nodes (256 GB memory)
  o 2 debugging and share-queue nodes (256 GB memory)
  o 4 GPFS/VSD nodes
  o 2 service nodes

Page 8:

Computing System – Bluefire, continued

• Memory
  o L1 cache is 128 KB per processor.
  o L2 cache is 4 MB per processor, on-chip.
  o The off-chip L3 cache is 32 MB per two-processor chip, shared by the two processors on the chip.
  o 48 nodes contain 128 GB of shared memory, and 69 nodes contain 64 GB of shared memory.

• Disk storage
  o 150 terabytes of usable file system space
  o 5 GB home directory; backed up and not subject to scrubbing
  o 400 GB /ptmp directory; not backed up

• High-speed interconnect: InfiniBand switch
  o 4X InfiniBand DDR links capable of 2.5 GB/sec with 1.3-microsecond latency
  o 8 links per node (see the aggregate note below)
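A quick aggregate from the figures above: 8 links × 2.5 GB/sec = 20 GB/sec of peak interconnect bandwidth per node (achievable bandwidth depends on traffic pattern and protocol overhead).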

Page 9:

Computing System - Frost

• Four-rack IBM Blue Gene/L system
  o Operating system: SuSE Linux Enterprise Server 9
  o Batch system: Cobalt, from ANL
  o File system: the scratch space (/ptmp) is GPFS; home directories are NFS-mounted

• 4,096 compute nodes (22.936 TFLOPS peak; see the note below)
  o 1,024 compute nodes per rack
  o Two 700 MHz PowerPC 440 CPUs, 512 MB of memory, and two floating-point units (FPUs) per core
  o One I/O node for every 32 compute nodes

• Four front-end cluster nodes
  o IBM OpenPower 720 servers
  o Four POWER5 1.65 GHz CPUs
  o 8 GB of memory
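The peak figure is consistent with the usual Blue Gene/L accounting of 4 floating-point operations per core per cycle (two fused multiply-adds through the dual FPU pipes; this flops-per-cycle count is an assumption, not stated on the slide):

  4,096 nodes × 2 cores × 0.7 GHz × 4 flops/cycle ≈ 22.9 TFLOPS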

Page 10:

Computing System – Lynx

• Single-cabinet Massively Parallel Processing supercomputer
  o Operating system: Cray Linux Environment
     Compute Node Linux (CNL), based on SuSE Linux SLES 10
  o Batch system:
     MOAB workload manager
     Torque (aka OpenPBS) resource manager
     Cray's ALPS (Application Level Placement Scheduler)
  o File system: Lustre

• 912 compute processors (8.026 TFLOPS peak; see the note below)
  o 76 compute nodes, 12 processors per node
  o Two hex-core 2.2 GHz AMD Opteron chips per node
  o Each processor has 1.3 gigabytes of memory, totaling 1.216 terabytes of memory in the system.

• 10 I/O nodes
  o Each has a single dual-core 2.6 GHz AMD Opteron chip and 8 gigabytes of memory.
  o 2 are login nodes; 4 are reserved for system functions.
  o 4 are for external Lustre file system and GPFS file system testing.
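Here, too, the peak figure is consistent with 4 floating-point operations per core per cycle (an assumption based on the Opteron's SSE units, not stated on the slide):

  912 cores × 2.2 GHz × 4 flops/cycle ≈ 8.03 TFLOPS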

Page 11:

A Job Script on Bluefire

#!/bin/csh
# LSF batch script to run an MPI application
#BSUB -P 12345678           # project number (required)
#BSUB -W 1:00               # wall-clock time (hours:minutes)
#BSUB -n 64                 # number of MPI tasks
#BSUB -R "span[ptile=64]"   # run 64 tasks per node
#BSUB -J matadd_mpi         # job name
#BSUB -o matadd_mpi.%J.out  # output filename
#BSUB -e matadd_mpi.%J.err  # error filename
#BSUB -q regular            # queue

# edit the executable's header to enable 64K pages
ldedit -bdatapsize=64K -bstackpsize=64K -btextpsize=64K matadd_mpi.exe

# set this environment variable to launch with default processor binding
setenv TARGET_CPU_LIST "-1"
mpirun.lsf /usr/local/bin/launch ./matadd_mpi.exe

For more examples, see the /usr/local/examples directory.

Page 12:

Submitting, Deleting, and Monitoring Jobs on Bluefire

• Job submission
  o bsub < script (see the example session below)

• Monitor jobs
  o bjobs
  o bjobs -u all
  o bjobs -q regular -u all
  o bhist -n 3 -a

• Deleting a job
  o bkill [jobid]

• System batch load
  o batchview
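An example session, submitting the script from the previous page (the script filename and job ID are illustrative):

  bsub < matadd_mpi.lsf
  Job <123456> is submitted to queue <regular>.

  bjobs -J matadd_mpi

The job ID echoed by bsub is what you pass to bhist or bkill.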

Page 13:

"Big 3“ to get a best performance on bluefire

• Simultaneous Multi-Threading(SMT)o a second, on-board "virtual" processoro 64 virtual cpus in each node

• Multiple page size supporto The default page size: 4 KB. o 64-KB page size when running the 64-bit kernel.o Large pages (16 MB) and "huge" pages (16 GB)

• Processor binding
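A minimal sketch of the three settings, drawn from the script on page 11 (the executable name is illustrative):

  #BSUB -R "span[ptile=64]"   # SMT: 64 tasks on one node's 64 virtual CPUs

  # multiple page sizes: rewrite the executable's header for 64K pages
  ldedit -bdatapsize=64K -bstackpsize=64K -btextpsize=64K myprog.exe

  # processor binding: launch through the binding wrapper
  setenv TARGET_CPU_LIST "-1"
  mpirun.lsf /usr/local/bin/launch ./myprog.exe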

Page 14:

Submitting, Deleting, and Monitoring Jobs on Frost

• Submitting a job
  o cqsub -n 28 -c 55 -m vn -t 00:10:00 example (annotated below)

• Check job status
  o cqstat

• Deleting a job
  o cqdel [jobid]

• Altering jobs
  o qalter -t 60 -n 32 --mode vn 118399
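An annotated reading of the cqsub line above (flag meanings follow common Cobalt usage on Blue Gene/L systems; confirm against the man pages on Frost):

  cqsub -n 28          # number of compute nodes
        -c 55          # total process count
        -m vn          # virtual-node mode: use both CPUs on each node
        -t 00:10:00    # wall-clock limit
        example        # executable to run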

Page 15:

A job script on Lynx

#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=60
#PBS -l walltime=01:30:00
#PBS -N example
#PBS -e testrun.$PBS_JOBID.err
#PBS -o testrun.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 60 ./testrun

Page 16:

Submitting, Deleting, and Monitoring Jobs on Lynx

• Submitting a job
  o qsub [batch_script]

• Check job status
  o qstat -a

• Deleting a job
  o qdel [jobid]

• Hold or release a job
  o qhold [jobid]
  o qrls [jobid]

• Change attributes of a submitted job
  o qalter (see the example below)
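For example, to raise a queued job's wall-clock limit (standard Torque/PBS qalter syntax; the job ID is illustrative):

  qalter -l walltime=02:30:00 123456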

Page 17:

NCAR Archival System

• NCAR Mass Store Subsystem (MSS)
• Currently stores 70 million files, 11 petabytes of data
  o Library of Congress (printed collection): 10 terabytes = 0.01 petabytes
  o The Mass Store holds 800 × the Library of Congress
• Growing by 2-6 terabytes of new data per day
• Data holdings increasing exponentially
  o 1986 - 2 TB
  o 1997 - 100 TB
  o 2002 - 1,000 TB
  o 2004 - 2,000 TB
  o 2008 - 5,000 TB
  o 2010 - 8,000 TB

Page 18:

Page 19:

Migrating from the NCAR MSS to HPSS

• MSS and its interfaces will be replaced by High Performance Storage System (HPSS) in March 2011.

• When the conversion is complete, the existing storage hardware will remain in place, but it will be under HPSS control.

• Users will not have to copy any of their files from the MSS to HPSS. There will be a period of downtime.

• The MSS Group will be working closely with users during this transition.

Page 20:

The differences between MSS and HPSS

• What will be going away:
  o msrcp and all DCS libraries
  o The MSS ftp server
  o All DCS metadata commands for listing and manipulating files
  o Retention periods, the weekly "purge" and email purge notices, the "trash can" for purged/deleted files, and the "msrecover" command
  o Read and write passwords on MSS files

• What we will have instead:
  o Hierarchical Storage Interface (HSI), the primary interface NCAR will support for data transfers to and from HPSS, along with metadata access and data management
  o GridFTP (under development)
  o HPSS files have NO expiration date. They remain in the archive until they are explicitly deleted. Once deleted, they cannot be recovered.
  o POSIX-style permission bits for controlling access to HPSS files and directories
  o Persistent directories
  o Longer filenames (HPSS full pathnames can be up to 1,023 characters)
  o Higher maximum file size (HPSS files can be up to 1 terabyte)

Page 21:

Charges for MSS and HPSS

The current MSS charging formula is:

GAUs charged = 0.0837 R + 0.0012 A + N (0.1195 W + 0.205 S)

where
• R = gigabytes read
• W = gigabytes created or written
• A = number of disk drive or tape cartridge accesses
• S = data stored, in gigabyte-years
• N = number of copies of the file:
  = 1 if economy reliability is selected
  = 2 if standard reliability is selected
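A worked example with illustrative numbers: reading 100 GB (R = 100), writing 50 GB at standard reliability (W = 50, N = 2), 20 media accesses (A = 20), and storing the new data for one year (S = 50 gigabyte-years):

  GAUs charged = 0.0837(100) + 0.0012(20) + 2 × (0.1195(50) + 0.205(50))
               = 8.37 + 0.024 + 2 × 16.225
               ≈ 40.8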

Page 22:

HPSS Usage

• To request an account
  o TeraGrid users should send email to: [email protected]
  o NCAR users should contact CISL Customer Support: [email protected]

• HPSS uses HSI as its POSIX-compliant interface.

• HSI uses Kerberos as an authentication mechanism.

Page 23:

Kerberos
http://www2.cisl.ucar.edu/docs/hpss/kerberos

• Authentication service
  o Authentication: validates who you are
  o Service: with a server, a set of functions, etc.

• Kerberos operates in a domain
  o The default is UCAR.EDU

• The domain is served by a KDC
  o Key Distribution Center
  o Server (dog.ucar.edu)
  o Different pieces (not necessary to get into this)

• Users of the service
  o Individual people
  o Kerberized services (like HPSS)

Page 24:

Kerberos Commands

• kinit
  o Authenticate with the KDC and get your TGT (ticket-granting ticket); see the example session below

• klist
  o List your ticket cache

• kdestroy
  o Clear your ticket cache

• kpasswd
  o Change your password
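A typical sequence before an HSI session (the username is illustrative; UCAR.EDU is the default domain from the previous page):

  kinit username@UCAR.EDU    # prompts for your Kerberos password
  klist                      # verify the ticket-granting ticket is cached
  ...                        # work with HSI/HPSS
  kdestroy                   # clear the ticket cache when finished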

Page 25:

Hierarchical Storage Interface (HSI)

• POSIX-like interface

• Different ways to invoke HSI
  o Command-line invocation
     hsi cmd
     hsi get myhpssfile (in your default dir on HPSS)
     hsi put myunixfile (in your default dir on HPSS)
  o Open an HSI session
     hsi to get in and establish a session; end, exit, or quit to get out
     a restricted, shell-like environment
  o hsi "in cmdfile"
     runs the commands scripted in "cmdfile" (see the sketch below)

• Navigating HPSS while in an HSI session
  o pwd, cd, ls, cdls
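A minimal sketch of the scripted form (filenames are illustrative). If cmdfile contains:

  mkdir run01
  cd run01
  put results.tar
  ls -l

then hsi "in cmdfile" executes each line as if it had been typed at the HSI prompt.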

Page 26:

Data Transfer

• Writing data: the put command

  o [HSI]/home/user1> put file.01
  o [HSI]/home/user1> put file.01 : new.hpss.file

• Reading data: the get command

  o [HSI]/home/user1> get file.01
  o [HSI]/home/user1> get file.01 : new.hpss.file

In the two-name forms, the colon separates the local filename (on the left) from the HPSS filename (on the right).

Detailed documentation for HSI can be found at http://www.ucar.edu/docs/hpss

Page 27:

GLADE centralized file service

• High performance shared file system technology

• Shared work spaces across CISL's HPC resources

• Multiple spaces based on different requirements
  o /glade/home/username
  o /glade/scratch/username
  o /glade/proj*

Page 28:

Data Analysis & Visualization

• Data Analysis and Visualization
  o High-end servers available 7 x 24 for interactive data analysis, data post-processing, and visualization

• Data Sharing
  o Shared data access within the lab, plus access to the NCAR Archival Systems and NCAR Data Sets

• Remote Visualization
  o Access DASG's visual computing platforms from the convenience of your office using DASG's TCP/IP-based remote image delivery service

• Visualization Consulting
  o Consult with DASG staff on your visualization problems

• Production Visualization
  o DASG staff can in some instances generate images and/or animations of your data on your behalf

Page 29:

Software Configuration (Mirage and Storm)

• Development Environments
  o Intel C, C++, F77, F90
  o GNU C, C++, F77, tools
  o TotalView debugger

• Software Tools
  o VAPOR
  o NCL, NCO, NCARG
  o IDL
  o MATLAB
  o ParaView
  o ImageMagick
  o VTK

Page 30:

User Support

• CISL homepage
  o http://www2.cisl.ucar.edu/

• CISL documentation
  o http://www2.cisl.ucar.edu/docs/user-documentation

• CISL HELP
  o Call (303) 497-1200
  o Email [email protected]
  o Submit an ExtraView ticket

• CISL Consulting Service
  o NCAR Mesa Lab, Area 51/55, Floor 1B

Page 31:

Information We Need From You

• Machine name

• Job ID, date

• Nodes job ran on, if known

• Description of the problem

• Commands you typed

• Error code you are getting

• These are best provided in an ExtraView ticket or via email to [email protected].

Page 32:

Working on mirage

• Log on to 'mirage'
  o ssh -X -l username mirage3.ucar.edu
  o One-time password using CryptoCard or Yubikey
  o Use 'free' or 'top' to see whether there are currently enough resources

• Enabling your Yubikey token
  o Your Yubikey has been activated and is ready for use.
  o The Yubikey is activated by the warmth of your finger, not the pressure of pushing the button.

• Using your Yubikey token
  o When you are logging in to mirage, your screen displays the prompt: Token_Response:
  o Enter your PIN on the screen (do not hit Enter), then touch the Yubikey button. This inserts a new one-time password (OTP) and a return.