

Operated by the Southeastern Universities Research Association for the U.S. Department of Energy

Thomas Jefferson National Accelerator Facility

Jefferson Lab LQCD

Installation of the 2004 LQCD Compute Cluster

Walt Akers


Jefferson Lab’s Scientific Purpose


LQCD – Lattice Quantum Chromodynamics

Jefferson Lab and MIT lead a collaboration of 28 senior theorists from 14 institutions addressing the hadron physics portion of the U.S. National Lattice QCD Collaboration. The National LQCD Collaboration has three major sites for hardware:

BNL, FNAL and JLab.

A goal of the collaboration is to have access to tens of Teraflops (sustained) in the very near future.

Achieving this goal would make the U.S. a world leader in LQCD and put discovery potential in the hands of U.S. LQCD physicists.

2005 status: U.S. ~8 Teraflops; World ~25 Teraflops


Computing Resources at Jefferson Lab

Jefferson Lab currently employs more than 1000 compute nodes for parallel and batch data processing. These are managed independently by the High Performance Computing Group and the Physics Computer Center.

High Performance Computing (HPC) Resources: resources used exclusively for parallel processing of lattice quantum chromodynamics.

• 384-node Gig-E mesh cluster (Dell PowerEdge 2850, 2.8 GHz)
• 256-node Gig-E mesh cluster (Supermicro, 2.6 GHz)
• 128-node Myrinet cluster (Supermicro, 1.8 GHz)
• 32-node test cluster (mixed systems)

Physics Computer Center (CC) Resources: resources used for sequential processing of experimental data.

• 200+ node batch farm (mixed systems)


The LQCD-04 Cluster - Dell PowerEdge 2850

The 384-node compute cluster provided by Dell is our most recent addition.

Each node is equipped with:

• Single 2.8 GHz processor
• 512 MBytes RAM
• 38 GByte SCSI drive
• 3 dual-port Intel gigabit network interface cards
• Intelligent Platform Management Interface (IPMI)


The LQCD-04 Cluster - Interconnects

Each compute node uses 6 gigabit Ethernet links to perform nearest-neighbor communications in three dimensions.

One onboard gigabit port is used to provide a connection to the service network.

TCP was replaced with the VIA protocol, which provides less overhead, lower latency (18.75 usec), and higher throughput (500 MByte/second aggregate).

This cluster can currently be employed as either a single 384-node machine or three distinct 128-node machines.

Why use a gigabit Ethernet mesh? Price/performance! Lattice Quantum Chromodynamics calculations deal almost exclusively with nearest-neighbor communication, so a mesh solution is optimal.

Direct gigabit connections deliver 2/3 of the throughput at 1/3 of the current cost of an InfiniBand solution.
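
As a concrete illustration of the nearest-neighbor wiring, the sketch below computes which six peers a node would talk to on a 3-D grid. The 8 x 8 x 6 dimensions (chosen only because they total 384 nodes), the row-major node numbering, and the wrap-around edges are assumptions for illustration, not the cluster's actual layout.

# Hypothetical sketch: the six nearest neighbors of a node in a 3-D mesh.
# Grid dimensions and numbering are illustrative assumptions only.

def neighbors_3d(rank, dims=(8, 8, 6)):
    """Return the ranks of the +/-x, +/-y, +/-z neighbors of `rank`
    on a 3-D grid with wrap-around edges (row-major numbering assumed)."""
    nx, ny, nz = dims
    x, y, z = rank // (ny * nz), (rank // nz) % ny, rank % nz

    def to_rank(i, j, k):
        return (i % nx) * ny * nz + (j % ny) * nz + (k % nz)

    return [
        to_rank(x + 1, y, z), to_rank(x - 1, y, z),   # +/-x links
        to_rank(x, y + 1, z), to_rank(x, y - 1, z),   # +/-y links
        to_rank(x, y, z + 1), to_rank(x, y, z - 1),   # +/-z links
    ]

# Example: node 0 talks to exactly six peers, one per gigabit link.
print(neighbors_3d(0))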


HPC Batch Management Software

TORQUE (Tera-scale Open-source Resource and QUEue manager): TORQUE is an extension of OpenPBS that includes revisions allowing it to scale to thousands of nodes. TORQUE provides a queue-based infrastructure for batch submission and resource manager daemons that run on each node.

UnderLord Scheduling System: The UnderLord scheduler was developed at Jefferson Lab. It provides a hierarchical algorithm that selects jobs for execution based on a collection of weighted parameters. The UnderLord allows nodes to be associated with individual queues.
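
To make the idea of "weighted parameters" concrete, here is a minimal sketch of a weighted job-scoring function of the general kind such a scheduler might use. The parameter names, weights, and formula are invented for illustration; this is not UnderLord's actual algorithm.

# Illustrative weighted job score (all names and weights are assumptions).
from dataclasses import dataclass

@dataclass
class Job:
    user_share_used: float   # fraction of the user's fair share already consumed
    queue_priority: float    # administrative priority of the submitting queue
    hours_waiting: float     # time the job has been queued
    nodes_requested: int     # size of the parallel job

# Hypothetical weights an administrator might tune.
WEIGHTS = {"fair_share": 4.0, "queue": 2.0, "wait": 1.0, "size": 0.5}

def score(job: Job) -> float:
    """Higher score = dispatched sooner (every term here is an assumption)."""
    return (WEIGHTS["fair_share"] * (1.0 - job.user_share_used)
            + WEIGHTS["queue"] * job.queue_priority
            + WEIGHTS["wait"] * job.hours_waiting
            - WEIGHTS["size"] * job.nodes_requested / 128)

jobs = [Job(0.2, 1.0, 5.0, 128), Job(0.9, 1.0, 1.0, 32)]
next_job = max(jobs, key=score)   # pick the highest-scoring job to run
print(next_job)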


Considerations in Selecting a Cluster Vendor

1) Price/Performance: measured in sustained MFlops per dollar (the Dell cluster was $1.10 / MFlop); a worked example follows this list.

2) Reliability: quality of the individual components.

3) Maintainability: ease of replacement of failed components; features for advanced detection of failures; features for monitoring the performance of the overall system.

4) Service: does the vendor provide a streamlined process for repair/replacement? What is the time between failure and repair/replacement?
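
A minimal worked example of the price/performance metric from item 1, using purely hypothetical cost and performance figures (not the Dell cluster's actual numbers):

# Price/performance: dollars per sustained MFlop and its inverse.
total_cost_dollars = 500_000        # hypothetical acquisition cost
sustained_mflops   = 400_000        # hypothetical sustained performance (400 GFlops)

dollars_per_mflop = total_cost_dollars / sustained_mflops
mflops_per_dollar = sustained_mflops / total_cost_dollars

print(f"${dollars_per_mflop:.2f} / MFlop, {mflops_per_dollar:.2f} MFlops / $")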


Performance per Dollar for Typical LQCD Applications

[Chart: MFlops per dollar (log scale, 10^-2 to 10^1) versus year, 1990-2010, comparing QCDSP, QCDOC, vector supercomputers including the Japanese Earth Simulator, and the JLab SciDAC prototype clusters of 2002, 2003 and 2004.]

• Commodity compute nodes (leverage marketplace & Moore's law)

• Large enough to exploit cache boost

• Low latency, high bandwidth network to exploit full I/O capability (& keep up with cache performance)

Chart annotations: QCDOC will deliver significant capabilities in early 2005. Future clusters will significantly extend scientific reach. Anticipated boost due to SciDAC-funded software.


Cluster Reliability

Infant Mortality: On average, 7% of the machines in a large cluster acquisition will have some component failure upon delivery or during the first month of running.

The most common failures are:

• Shipping damage
• Hard drive failures (often as a result of mishandling during shipping)
• Improperly installed or fitted components resulting from accelerated production schedules
• Manufacturing defects

The cluster provided by Dell had fewer than 2% early failures, and 2/3 of these failures were related to third-party Ethernet cards.

We are very pleased with the early reliability of this cluster.


Service

Installation: The installation team sent by Dell and Ideal Systems was phenomenal. The team adapted quickly to shipping and delivery problems that were outside of their control and delivered an operational configuration on schedule.

Return/Replacement Protocol: We began exercising the return protocol during the first weeks of commissioning to replace several defective network cards. It took very little effort to develop a return strategy that was straightforward enough to be handled by our part-time student assistant.


Maintainability: Our Most Critical Requirement

The Importance of Maintainability: To remain competitive with other cluster technologies funded by the DOE, we must provide maximum system availability with minimal staff.

The 800 nodes (in 4 computing clusters) operated by Jefferson Lab's High Performance Computing Group are managed, operated and maintained by three regular staff members and one student assistant. These staff are also responsible for cluster software development.

Because our configuration is highly parallel, the failure of a single node within a computing cluster renders the entire cluster unusable.

Our compute nodes must therefore be configured to detect hardware/software problems before they become critical and, whenever possible, take corrective measures without operator intervention.


Sensors and Intelligent Platform Management

Constant Monitoring: All systems are constantly monitored by a local daemon that collects hardware and software operating statistics.

These results are combined with the sensor values obtained through lm_sensors or through IPMI (where available).

Sensor data is consolidated on a centralized server, where it can be monitored and used by our system management utilities.

http://lqcd.jlab.org/monitor/web/facility.html
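
A hedged sketch of the kind of node-level collection described above: read the local sensor values and forward them to a central server. The use of ipmitool, the parsing of its output, and the collector hostname/port are illustrative assumptions, not our actual daemon.

# Illustrative node-side collector (ipmitool usage and collector address are assumptions).
import json
import socket
import subprocess

def read_ipmi_sensors():
    """Return {sensor_name: reading} parsed from `ipmitool sensor` output
    (fields are assumed to be '|'-separated: name first, value second)."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True, text=True)
    readings = {}
    for line in out.stdout.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[1] not in ("na", ""):
            readings[fields[0]] = fields[1]
    return readings

def push_to_collector(readings, host="monitor.example.org", port=9999):
    """Send one JSON blob per collection cycle to a hypothetical central server."""
    payload = json.dumps({"node": socket.gethostname(), "sensors": readings})
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode())

if __name__ == "__main__":
    push_to_collector(read_ipmi_sensors())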


Sensors and Intelligent Platform Management

Sensor Summaries: Our sensor summary pages display the values of all of the critical system parameters.

Actual values are presented in gauges that reflect their min and max, as well as low and high thresholds.

Data collected at the machine level is then used to produce a ‘rack summary’ and then further condensed into a ‘room overview’ that displays the most severe conditions throughout the room.

Administrators can ‘drill down’ from the room overview to find most problems.
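
The condensation from machine level to rack summary to room overview can be pictured as a "worst severity wins" roll-up. The sketch below uses invented severity levels and data layout; it is not the actual monitoring code.

# Illustrative machine -> rack -> room roll-up (severity scheme is an assumption).
SEVERITY_ORDER = {"ok": 0, "warning": 1, "critical": 2}

def worst(levels):
    """Return the most severe level in an iterable of severity strings."""
    return max(levels, key=SEVERITY_ORDER.__getitem__, default="ok")

def rack_summary(nodes):
    """nodes: {node_name: severity} for one rack -> single rack severity."""
    return worst(nodes.values())

def room_overview(racks):
    """racks: {rack_name: {node_name: severity}} -> {rack_name: severity}."""
    return {rack: rack_summary(nodes) for rack, nodes in racks.items()}

racks = {
    "rack01": {"lqcd001": "ok", "lqcd002": "warning"},
    "rack02": {"lqcd033": "ok", "lqcd034": "ok"},
}
print(room_overview(racks))   # {'rack01': 'warning', 'rack02': 'ok'}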


Using SNMP to Monitor Our Infrastructure

Responding to Power Outages: The Dell computing cluster is on an individual Uninterruptible Power Supply (UPS) that is not generator backed.

When a power failure is detected by our monitoring software (and the remaining battery drops below 90%), IPMI is used to power down all compute nodes.

Once power has been restored (for at least 5 minutes) and the battery has recharged to 95%, IPMI is used to power on all compute nodes.

Any previously running batch job is restarted, and the system continues to operate.
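
The shutdown/restart logic can be sketched as follows. Only the thresholds (90%, 95%, 5 minutes) come from this slide; the SNMP OID, UPS hostname, BMC hostnames and credentials, and the use of the snmpget and ipmitool command-line tools are assumptions for illustration.

# Illustrative outage handler (hosts, credentials and OID are assumptions).
import subprocess
import time

UPS_HOST = "ups.example.org"                            # hypothetical UPS address
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"                   # assumed UPS-MIB battery-charge OID
NODES = [f"lqcd{n:03d}-ipmi" for n in range(1, 385)]    # hypothetical BMC hostnames

def ups_charge_percent():
    """Query the UPS battery charge via SNMP (Net-SNMP snmpget, -Oqv = value only)."""
    out = subprocess.run(["snmpget", "-v1", "-c", "public", "-Oqv", UPS_HOST, CHARGE_OID],
                         capture_output=True, text=True)
    return int(out.stdout.strip())

def set_power(state):
    """Power every compute node 'on' or 'off' through its BMC via IPMI-over-LAN."""
    for node in NODES:
        subprocess.run(["ipmitool", "-H", node, "-U", "admin", "-P", "secret",
                        "chassis", "power", state])

def handle_outage(on_utility_power):
    """on_utility_power: caller-supplied callable reporting whether utility power is up."""
    if not on_utility_power() and ups_charge_percent() < 90:
        set_power("off")                     # battery below 90%: shut nodes down cleanly
        while not on_utility_power():
            time.sleep(30)                   # wait for utility power to return
        time.sleep(300)                      # require 5 minutes of stable power
        while ups_charge_percent() < 95:
            time.sleep(60)                   # wait for battery to recharge to 95%
        set_power("on")                      # bring the cluster back up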


What We Like Most About the Dell Cluster

• The installation team provided by Dell and Ideal Systems was fast, knowledgeable and efficient.

• Compute nodes are easily disassembled and reassembled for repair or maintenance.

• Dell's IPMI implementation provides a wealth of system health information for cluster monitoring.

• The systems have demonstrated a high degree of stability and reliability so far.


What We Would Do Differently Next Time

• Start with more space, electricity and cooling.
A new 10,000 square foot computer facility is currently under construction and should be online in late 2005.

• Order our systems preinstalled in racks.
This will minimize the shipping debris that we struggled with during the last installation and should greatly improve installation speed.

• If feasible, consider a single, high-speed interconnect rather than a mesh topology.
While the gig-e mesh provides adequate bandwidth at a very affordable price, it does represent a burden to install, troubleshoot and maintain. Because the price of InfiniBand is falling, we anticipate that our next cluster will use a switched network to provide greater configuration flexibility and reduced wire-management concerns.


How Dell Can Help Us on the Next Cluster

• Improve the DOS-based BIOS configuration utilities.
Specifically, the bioscfg.exe utility had trouble writing changes to the boot order in BIOS; we had to modify all of those by hand.

• Make sensor data available from the /proc filesystem in Linux.
A Linux driver that provides local access to sensor data would give us a lot of troubleshooting flexibility.

• Provide a BMC console that allows administrators to remotely monitor the system boot process using IPMI.
Since our Computer Center is currently located in a separate building from our offices, this would save everyone on our team a long walk through the cold.