

Operated by the Southeastern Universities Research Association for the U.S. Department of Energy

Thomas Jefferson National Accelerator Facility

Jefferson Lab LQCD

Installation of the 2004 LQCD Compute Cluster

Walt Akers


Jefferson Lab’s Scientific Purpose


LQCD – Lattice Quantum Chromodynamics

Jefferson Lab and MIT lead a collaboration of 28 senior theorists from 14 institutions addressing the hadron physics portion of the U.S. National Lattice QCD Collaboration. The National LQCD Collaboration has three major sites for hardware:

BNL, FNAL and JLab.

A goal of the collaboration is to have access to tens of Teraflops (sustained) in the very near future.

Achieving this goal would make the U.S. a world leader in LQCD and put discovery potential in the hands of U.S. LQCD physicists.

2005 status: U.S. ~8 Teraflops; World ~25 Teraflops


Computing Resources at Jefferson Lab

Jefferson Lab currently employs more than 1000 compute nodes for parallel and batch data processing. These are managed independently by the High Performance Computing Group and the Physics Computer Center.

High Performance Computing (HPC) Resources: resources used exclusively for parallel processing of lattice quantum chromodynamics.

• 384-node Gig-E mesh cluster (Dell PowerEdge 2850, 2.8 GHz)
• 256-node Gig-E mesh cluster (Supermicro, 2.6 GHz)
• 128-node Myrinet cluster (Supermicro, 1.8 GHz)
• 32-node test cluster (mixed systems)

Physics Computer Center (CC) Resources: resources used for sequential processing of experimental data.

• 200+ node batch farm (mixed systems)


The LQCD-04 Cluster - Dell PowerEdge 2850

The 384-node compute cluster provided by Dell is our most recent addition.

Each node is equipped with:

• Single 2.8 GHz processor
• 512 MBytes RAM
• 38 GByte SCSI drive
• 3 dual-port Intel gigabit network interface cards
• Intelligent Platform Management Interface (IPMI)


The LQCD-04 Cluster - Interconnects

Each compute node uses 6 gigabit Ethernet links to perform nearest-neighbor communications in three dimensions.

One onboard gigabit port is used to provide a connection to the service network.

TCP was replaced with the VIA protocol, which provides less overhead, lower latency (18.75 usec), and higher throughput (500 MByte/second aggregate).

This cluster can currently be employed as either a single 384-node machine or three distinct 128-node machines.

Why use a gigabit Ethernet mesh? Price/performance! Lattice Quantum Chromodynamics calculations deal almost exclusively with nearest-neighbor communication, so a mesh solution is optimal.

Direct gigabit connections deliver 2/3 of the throughput at 1/3 of the current cost of an InfiniBand solution.
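
As a concrete illustration of the nearest-neighbor wiring, the sketch below computes which six peers a node would talk to on a 3-D grid. The 8 x 8 x 6 dimensions (chosen only because they total 384 nodes), the row-major node numbering, and the wrap-around edges are assumptions for illustration, not the cluster's actual layout.

# Hypothetical sketch: the six nearest neighbors of a node in a 3-D mesh.
# Grid dimensions and numbering are illustrative assumptions only.

def neighbors_3d(rank, dims=(8, 8, 6)):
    """Return the ranks of the +/-x, +/-y, +/-z neighbors of `rank`
    on a 3-D grid with wrap-around edges (row-major numbering assumed)."""
    nx, ny, nz = dims
    x, y, z = rank // (ny * nz), (rank // nz) % ny, rank % nz

    def to_rank(i, j, k):
        return (i % nx) * ny * nz + (j % ny) * nz + (k % nz)

    return [
        to_rank(x + 1, y, z), to_rank(x - 1, y, z),   # +/-x links
        to_rank(x, y + 1, z), to_rank(x, y - 1, z),   # +/-y links
        to_rank(x, y, z + 1), to_rank(x, y, z - 1),   # +/-z links
    ]

# Example: node 0 talks to exactly six peers, one per gigabit link.
print(neighbors_3d(0))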


HPC Batch Management Software

TORQUE (Tera-scale Open-source Resource and QUEue manager): TORQUE is an extension of OpenPBS that includes revisions allowing it to scale to thousands of nodes. TORQUE provides a queue-based infrastructure for batch submission and resource manager daemons that run on each node.

UnderLord Scheduling System: The UnderLord scheduler was developed at Jefferson Lab. It provides a hierarchical algorithm that selects jobs for execution based on a collection of weighted parameters. The UnderLord allows nodes to be associated with individual queues.
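
To make the idea of "weighted parameters" concrete, here is a minimal sketch of a weighted job-scoring function of the general kind such a scheduler might use. The parameter names, weights, and formula are invented for illustration; this is not UnderLord's actual algorithm.

# Illustrative weighted job score (all names and weights are assumptions).
from dataclasses import dataclass

@dataclass
class Job:
    user_share_used: float   # fraction of the user's fair share already consumed
    queue_priority: float    # administrative priority of the submitting queue
    hours_waiting: float     # time the job has been queued
    nodes_requested: int     # size of the parallel job

# Hypothetical weights an administrator might tune.
WEIGHTS = {"fair_share": 4.0, "queue": 2.0, "wait": 1.0, "size": 0.5}

def score(job: Job) -> float:
    """Higher score = dispatched sooner (every term here is an assumption)."""
    return (WEIGHTS["fair_share"] * (1.0 - job.user_share_used)
            + WEIGHTS["queue"] * job.queue_priority
            + WEIGHTS["wait"] * job.hours_waiting
            - WEIGHTS["size"] * job.nodes_requested / 128)

jobs = [Job(0.2, 1.0, 5.0, 128), Job(0.9, 1.0, 1.0, 32)]
next_job = max(jobs, key=score)   # pick the highest-scoring job to run
print(next_job)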


Considerations in Selecting a Cluster Vendor

1) Price/Performance: measured in sustained MFlops per dollar (the Dell cluster was $1.10 / MFlop); a worked example follows this list.

2) Reliability: quality of the individual components.

3) Maintainability: ease of replacement of failed components; features for advanced detection of failures; features for monitoring the performance of the overall system.

4) Service: does the vendor provide a streamlined process for repair/replacement? What is the time between failure and repair/replacement?
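
A minimal worked example of the price/performance metric from item 1, using purely hypothetical cost and performance figures (not the Dell cluster's actual numbers):

# Price/performance: dollars per sustained MFlop and its inverse.
total_cost_dollars = 500_000        # hypothetical acquisition cost
sustained_mflops   = 400_000        # hypothetical sustained performance (400 GFlops)

dollars_per_mflop = total_cost_dollars / sustained_mflops
mflops_per_dollar = sustained_mflops / total_cost_dollars

print(f"${dollars_per_mflop:.2f} / MFlop, {mflops_per_dollar:.2f} MFlops / $")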


Performance per Dollar for Typical LQCD Applications

[Chart: MFlops per dollar (log scale, 10^-2 to 10^1) versus year, 1990-2010, comparing QCDSP, QCDOC, vector supercomputers including the Japanese Earth Simulator, and the JLab SciDAC prototype clusters of 2002, 2003 and 2004.]

• Commodity compute nodes (leverage marketplace & Moore's law)

• Large enough to exploit cache boost

• Low latency, high bandwidth network to exploit full I/O capability (& keep up with cache performance)

Chart annotations: QCDOC will deliver significant capabilities in early 2005. Future clusters will significantly extend scientific reach. Anticipated boost due to SciDAC-funded software.


Cluster Reliability

Infant Mortality: On average, 7% of the machines in a large cluster acquisition will have some component failure upon delivery or during the first month of running.

The most common failures are:

• Shipping damage
• Hard drive failures (often as a result of mishandling during shipping)
• Improperly installed or fitted components resulting from accelerated production schedules
• Manufacturing defects

The cluster provided by Dell had fewer than 2% early failures, and 2/3 of these failures were related to third-party Ethernet cards.

We are very pleased with the early reliability of this cluster.


Service

Installation: The installation team sent by Dell and Ideal Systems was phenomenal. The team adapted quickly to shipping and delivery problems that were outside of their control and delivered an operational configuration on schedule.

Return/Replacement Protocol: We began exercising the return protocol during the first weeks of commissioning to replace several defective network cards. It took very little effort to develop a return strategy that was straightforward enough to be handled by our part-time student assistant.


Maintainability: Our Most Critical Requirement

The Importance of Maintainability: To remain competitive with other cluster technologies funded by the DOE, we must provide maximum system availability with minimal staff.

The 800 nodes (in 4 computing clusters) operated by Jefferson Lab's High Performance Computing Group are managed, operated and maintained by three regular staff members and one student assistant. These staff are also responsible for cluster software development.

Because our configuration is highly parallel, the failure of a single node within a computing cluster renders the entire cluster unusable.

Our compute nodes must therefore be configured to detect hardware/software problems before they become critical and, whenever possible, take corrective measures without operator intervention.


Sensors and Intelligent Platform Management

Constant Monitoring: All systems are constantly monitored by a local daemon that collects hardware and software operating statistics.

These results are combined with the sensor values obtained through lm_sensors or through IPMI (where available).

Sensor data is consolidated on a centralized server, where it can be monitored and used by our system management utilities.

http://lqcd.jlab.org/monitor/web/facility.html
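
A hedged sketch of the kind of node-level collection described above: read the local sensor values and forward them to a central server. The use of ipmitool, the parsing of its output, and the collector hostname/port are illustrative assumptions, not our actual daemon.

# Illustrative node-side collector (ipmitool usage and collector address are assumptions).
import json
import socket
import subprocess

def read_ipmi_sensors():
    """Return {sensor_name: reading} parsed from `ipmitool sensor` output
    (fields are assumed to be '|'-separated: name first, value second)."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True, text=True)
    readings = {}
    for line in out.stdout.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[1] not in ("na", ""):
            readings[fields[0]] = fields[1]
    return readings

def push_to_collector(readings, host="monitor.example.org", port=9999):
    """Send one JSON blob per collection cycle to a hypothetical central server."""
    payload = json.dumps({"node": socket.gethostname(), "sensors": readings})
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode())

if __name__ == "__main__":
    push_to_collector(read_ipmi_sensors())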


Sensors and Intelligent Platform Management

Sensor Summaries: Our sensor summary pages display the values of all of the critical system parameters.

Actual values are presented in gauges that reflect their min and max, as well as low and high thresholds.

Data collected at the machine level is then used to produce a ‘rack summary’ and then further condensed into a ‘room overview’ that displays the most severe conditions throughout the room.

Administrators can ‘drill down’ from the room overview to find most problems.
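
The condensation from machine level to rack summary to room overview can be pictured as a "worst severity wins" roll-up. The sketch below uses invented severity levels and data layout; it is not the actual monitoring code.

# Illustrative machine -> rack -> room roll-up (severity scheme is an assumption).
SEVERITY_ORDER = {"ok": 0, "warning": 1, "critical": 2}

def worst(levels):
    """Return the most severe level in an iterable of severity strings."""
    return max(levels, key=SEVERITY_ORDER.__getitem__, default="ok")

def rack_summary(nodes):
    """nodes: {node_name: severity} for one rack -> single rack severity."""
    return worst(nodes.values())

def room_overview(racks):
    """racks: {rack_name: {node_name: severity}} -> {rack_name: severity}."""
    return {rack: rack_summary(nodes) for rack, nodes in racks.items()}

racks = {
    "rack01": {"lqcd001": "ok", "lqcd002": "warning"},
    "rack02": {"lqcd033": "ok", "lqcd034": "ok"},
}
print(room_overview(racks))   # {'rack01': 'warning', 'rack02': 'ok'}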


Using SNMP to Monitor Our Infrastructure

Responding to Power Outages: The Dell computing cluster is on an individual Uninterruptible Power Supply (UPS) that is not generator backed.

When a power failure is detected by our monitoring software (and the remaining battery drops below 90%), IPMI is used to power down all compute nodes.

Once power has been restored (for at least 5 minutes) and the battery has recharged to 95%, IPMI is used to power on all compute nodes.

Any previously running batch job is restarted, and the system continues to operate.
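
The shutdown/restart logic can be sketched as follows. Only the thresholds (90%, 95%, 5 minutes) come from this slide; the SNMP OID, UPS hostname, BMC hostnames and credentials, and the use of the snmpget and ipmitool command-line tools are assumptions for illustration.

# Illustrative outage handler (hosts, credentials and OID are assumptions).
import subprocess
import time

UPS_HOST = "ups.example.org"                            # hypothetical UPS address
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"                   # assumed UPS-MIB battery-charge OID
NODES = [f"lqcd{n:03d}-ipmi" for n in range(1, 385)]    # hypothetical BMC hostnames

def ups_charge_percent():
    """Query the UPS battery charge via SNMP (Net-SNMP snmpget, -Oqv = value only)."""
    out = subprocess.run(["snmpget", "-v1", "-c", "public", "-Oqv", UPS_HOST, CHARGE_OID],
                         capture_output=True, text=True)
    return int(out.stdout.strip())

def set_power(state):
    """Power every compute node 'on' or 'off' through its BMC via IPMI-over-LAN."""
    for node in NODES:
        subprocess.run(["ipmitool", "-H", node, "-U", "admin", "-P", "secret",
                        "chassis", "power", state])

def handle_outage(on_utility_power):
    """on_utility_power: caller-supplied callable reporting whether utility power is up."""
    if not on_utility_power() and ups_charge_percent() < 90:
        set_power("off")                     # battery below 90%: shut nodes down cleanly
        while not on_utility_power():
            time.sleep(30)                   # wait for utility power to return
        time.sleep(300)                      # require 5 minutes of stable power
        while ups_charge_percent() < 95:
            time.sleep(60)                   # wait for battery to recharge to 95%
        set_power("on")                      # bring the cluster back up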


What We Like Most About the Dell Cluster

• The installation team provided by Dell and Ideal Systems was fast, knowledgeable and efficient.

• Compute nodes are easily disassembled and reassembled for repair or maintenance.

• Dell's IPMI implementation provides a wealth of system health information for cluster monitoring.

• The systems have demonstrated a high degree of stability and reliability so far.


What We Would Do Differently Next Time

• Start with more space, electricity and cooling.
A new 10,000 square foot computer facility is currently under construction and should be online in late 2005.

• Order our systems preinstalled in racks.
This will minimize the shipping debris that we struggled with during the last installation and should greatly improve installation speed.

• If feasible, consider a single, high-speed interconnect rather than a mesh topology.
While the gig-e mesh provides adequate bandwidth at a very affordable price, it does represent a burden to install, troubleshoot and maintain. Because the price of InfiniBand is falling, we anticipate that our next cluster will use a switched network to provide greater configuration flexibility and reduced wire-management concerns.


How Dell Can Help Us on the Next Cluster

• Improve the DOS-based BIOS configuration utilities.
Specifically, the bioscfg.exe utility had trouble writing changes to the boot order in BIOS; we had to modify all of those by hand.

• Make sensor data available from the /proc filesystem in Linux.
A Linux driver that provides local access to sensor data would give us a lot of troubleshooting flexibility.

• Provide a BMC console that allows administrators to remotely monitor the system boot process using IPMI.
Since our Computer Center is currently located in a separate building from our offices, this would save everyone on our team a long walk through the cold.