
PDSF and the Alvarez Clusters

Presented by Shane Canon, NERSC/PDSF [email protected]

NERSC Hardware

National Energy Research Scientific Computing Center http://www.nersc.gov

One of the nation's top unclassified computing resources, funded by the DOE for over 25 years with the mission of providing computing and network services for research. NERSC is located at Lawrence Berkeley Laboratory in Berkeley, CA http://www.lbl.gov

High Performance Computing Resources http://hpcf.nersc.gov

- IBM SP cluster: 2,000+ processors, 1.2+ TB RAM, 20+ TB cluster filesystem
- Cray T3E: 692 processors, 177 GB RAM
- Cray PVP: 64 processors, 3 GW RAM
- PDSF: 160 compute nodes, 281 processors, 7.5 TB disk space
- HPSS: 6 StorageTek silos, 880 TB of near-line and offline storage, soon to be expanded to a full petabyte

NERSC Facilities

New Oakland Scientific Facility:
- 20,000 sq. ft. data center
- 24x7 operations team
- OC48 (2.5 Gbit/sec) connection to LBL/ESNet
- Options on a 24,000 sq. ft. expansion

NERSC Internet Access

ESNet Headquarters http://www.es.net/

- Provides leading-edge networking to DOE researchers
- Backbone has an OC12 (622 Mbit/sec) connection to CERN
- Backbone connects key DOE sites
- Headquartered at Lawrence Berkeley
- Location assures prompt response

Cluster Design

• Embarrassingly parallel: commodity networking
• Commodity parts: buy "at the knee"
• No modeling
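The slides give only the design principle, so here is a minimal, hypothetical sketch of what "embarrassingly parallel" means in practice: each work unit (for example, one DST file) is processed independently, with no communication between workers, which is why commodity networking is sufficient.

```python
# Hypothetical sketch (not from the slides): an "embarrassingly parallel" workload.
# Each work unit is processed independently, with no communication between workers,
# which is why commodity networking is sufficient for this class of application.
from multiprocessing import Pool
import random


def analyze(seed: int):
    """Stand-in for per-file DST analysis: independent and CPU-bound."""
    rng = random.Random(seed)
    # Pretend each "file" holds 10,000 events; return a per-file summary statistic.
    total = sum(rng.gauss(0.0, 1.0) for _ in range(10_000))
    return seed, total / 10_000


if __name__ == "__main__":
    work_units = list(range(8))        # e.g. one unit per DST file
    with Pool(processes=4) as pool:    # e.g. one worker per CPU
        for seed, mean in pool.map(analyze, work_units):
            print(f"unit {seed}: mean = {mean:+.4f}")
```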

Issues with Cluster Configuration

• Maintaining consistency
• Scalability
  - System
  - Human
• Adaptability/Flexibility
• Community tools

Cluster Configuration: Present

• Installation: home grown (nfsroot/tar image)
• Configuration management: rsync/RPM, Cfengine
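The slides name the tools but not the mechanics. As a minimal sketch of the rsync side of this approach, assuming an illustrative master tree and node naming scheme (Cfengine would layer convergent, rule-based edits on top of this kind of file mirroring):

```python
# Hypothetical sketch of the rsync side of configuration management: mirror a master
# configuration tree onto each compute node. Hostnames and paths are illustrative,
# not PDSF's actual layout.
import subprocess

MASTER_TREE = "/admin/config/"                  # assumed location of the master copy
NODES = [f"node{i:03d}" for i in range(1, 5)]   # assumed node naming scheme


def push_config(node: str) -> bool:
    """Mirror the master tree onto one node over ssh; returns True on success."""
    result = subprocess.run(
        ["rsync", "-az", "--delete", MASTER_TREE, f"{node}:/etc/cluster/"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"{node}: rsync failed: {result.stderr.strip()}")
    return result.returncode == 0


if __name__ == "__main__":
    updated = [n for n in NODES if push_config(n)]
    print(f"updated {len(updated)}/{len(NODES)} nodes")
```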

Cluster Configuration: Future

• Installation: kickstart (or systemimager/systeminstaller)
• Configuration management: RPM, Cfengine
• Database
  - Resource management
  - Integrate with configuration management
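The slides only name the direction (a node database integrated with configuration management). As a hypothetical sketch with an invented schema, an assumed role-to-package mapping, and an illustrative kickstart snippet, generation from such a database could look like:

```python
# Hypothetical sketch of database-driven configuration: node attributes live in a
# small database and per-node install files are generated from it. The schema, the
# role/package mapping, and the kickstart snippet are all illustrative inventions.
import sqlite3

KICKSTART_TEMPLATE = """\
# generated for {name} (role: {role})
network --hostname={name}
part / --size={root_mb}
%packages
@base
{role_packages}
%end
"""

ROLE_PACKAGES = {"compute": "lsf-client", "disk": "nfs-utils"}  # assumed roles/packages


def generate(conn: sqlite3.Connection) -> None:
    """Emit one kickstart snippet per node row."""
    for name, role, disk_gb in conn.execute("SELECT name, role, disk_gb FROM nodes"):
        print(KICKSTART_TEMPLATE.format(
            name=name,
            role=role,
            root_mb=min(disk_gb, 8) * 1024,   # cap the root partition at 8 GB
            role_packages=ROLE_PACKAGES.get(role, ""),
        ))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, role TEXT, disk_gb INTEGER)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [("node001", "compute", 40), ("node002", "disk", 500)],
    )
    generate(conn)
```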

NERSC Staff

NERSC and LBL have dedicated, experienced staff in the fields of high performance computing, GRID computing, and mass storage

Researchers

- Will Johnston, Head of Distributed Systems Dept., GRID researcher http://www.itg.lbl.gov/
  Project manager for the NASA Information Power Grid http://www.nas.nasa.gov/IPG
- Arie Shoshani, Head of Scientific Data Management http://gizmo.lbl.gov/DM.html
  Researches mass storage issues related to scientific computing
- Doug Olson, Project Coordinator, Particle Physics Data Grid http://www.ppdg.net/
  Coordinator for STAR computing at PDSF
- Dave Quarrie, Chief Software Architect, ATLAS http://www.nersc.gov/aboutnersc/bios/henpbios.html
- Craig Tull, Offline Software Framework/Control, Coordinator for ATLAS computing at PDSF

NERSC High Performance Computing Department http://www.nersc.gov/aboutnersc/hpcd.html

- Advanced Systems Group: evaluates and vets HW/SW for production computing (4 FTE)
- Computing Systems Group: manages infrastructure for computing (9 FTE)
- Computer Operations & Support: provides 24x7x365 support (14 FTE)
- Networking and Security Group: provides networking and security (3 FTE)
- Mass Storage: manages the near-line and off-line storage facilities (5 FTE)

PDSF & STAR

PDSF has been working with the STAR collaboration since 1998 http://www.star.bnl.gov/

- Data collection occurs at Brookhaven, and DSTs are sent to NERSC
- PDSF is the primary offsite computing facility for STAR
- The collaboration carries out DST analysis and simulations at PDSF
- STAR has 37 collaborating institutions (too many for arrows!)

PDSF Philosophy

PDSF is a Linux cluster built from commodity hardware and open source software http://pdsf.nersc.gov

- Our mission is to provide the most effective distributed computer cluster possible that is suitable for experimental HENP applications
- The PDSF acronym came from the SSC lab in 1995, along with the original equipment
- Architecture tuned for "embarrassingly parallel" applications
- Uses LSF 4.1 for batch scheduling (see the sketch after this list)
- AFS access, and access to HPSS for mass storage
- High speed (Gigabit Ethernet) access to the HPSS system
- One of several Linux clusters at LBL:
  - The Alvarez cluster has a similar architecture, but supports a Myrinet cluster interconnect
  - The NERSC PC Cluster project by the Future Technology Group is an experimental cluster http://www.nersc.gov/research/FTG/index.html
  - A genome cluster at LBL for research into the fruit fly genome
- 152 compute nodes, 281 processors, 7.5 TB of storage
- Cluster uptime for the year 2000 was > 98%; for the most recently measured period (January 2001), cluster utilization for batch jobs was 78%
- The overall cluster has had zero downtime due to security issues
- PDSF and NERSC have a track record of solid security balanced with unobtrusive practices
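The LSF 4.1 batch workflow referenced above is not detailed in the slides. As a hedged sketch, with placeholder queue, job, and script names, submitting one independent analysis job per DST file via bsub might look like:

```python
# Hypothetical sketch of submitting a batch job to LSF with bsub. The queue name,
# job name, and analysis script are placeholders, not PDSF's actual configuration.
import subprocess


def submit(command: str, queue: str = "star_analysis", name: str = "dst-analysis") -> str:
    """Submit one job with standard bsub options and return bsub's confirmation line."""
    result = subprocess.run(
        [
            "bsub",
            "-q", queue,                # target queue (placeholder name)
            "-J", name,                 # job name
            "-o", f"{name}.%J.out",     # output file; %J expands to the LSF job ID
            command,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # One job per DST file: each job is independent, i.e. embarrassingly parallel.
    for dst in ["dst_0001.root", "dst_0002.root"]:
        print(submit(f"./run_analysis.sh {dst}"))
```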

More About PDSF

PDSF uses a common resource pool for all projects

- PDSF supports multiple experiments: STAR, ATLAS, BABAR, D0, Amanda, E871, E895, E896 and CDF
- Multiple projects have access to the computing resources; the available software supports all experiments
- The actual level of access is determined by the batch scheduler, using fair share rules (see the sketch after this list)
- Each project's investment goes into purchasing hardware and support infrastructure for the entire cluster
- The use of a common configuration decreases management overhead, lowers administration complexity, and increases availability of usable computing resources
- Use of commodity Intel hardware makes us vendor neutral and lowers the cost to all of our users
- Low cost and easy access to hardware make it possible for us to update configurations relatively quickly to support new computing requirements
- Because the physical resources available are always greater than any individual contributor's investment, there is usually some excess capacity available for sudden peaks in usage, and always a buffer to absorb sudden hardware failures
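The slides do not spell out the fair-share rules. As a hypothetical illustration of the idea (share counts and usage numbers are invented, and this is not LSF's actual algorithm), dispatch priority can be modeled as allocated shares discounted by recent usage, so under-served projects rise to the top of the queue:

```python
# Hypothetical illustration of fair-share scheduling (not LSF's actual algorithm):
# each project holds shares proportional to its investment, and its dispatch priority
# falls as its recent usage grows relative to those shares. All numbers are invented.
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    shares: int               # allocated shares, e.g. proportional to hardware contributed
    recent_cpu_hours: float   # decayed recent usage, as tracked by the scheduler


def priority(p: Project) -> float:
    """Higher value = dispatched sooner; under-served projects rise to the top."""
    return p.shares / (1.0 + p.recent_cpu_hours)


if __name__ == "__main__":
    projects = [
        Project("star", shares=60, recent_cpu_hours=5000.0),
        Project("atlas", shares=25, recent_cpu_hours=400.0),
        Project("amanda", shares=15, recent_cpu_hours=50.0),
    ]
    for p in sorted(projects, key=priority, reverse=True):
        print(f"{p.name:8s} priority = {priority(p):.4f}")
```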