TRANSCRIPT
High Performance Linux Clusters
Guru Session, Usenix, Boston
June 30, 2004
Greg Bruno, SDSC
Overview of San Diego Supercomputer Center

- Founded in 1985 to provide non-military access to supercomputers
- Over 400 employees
- Mission: innovate, develop, and deploy technology to advance science
- Recognized as an international leader in: Grid and Cluster Computing, Data Management, High Performance Computing, Networking, Visualization
- Primarily funded by NSF
My Background

- 1984 - 1998: NCR. Helped to build the world’s largest database computers; saw the transition from proprietary parallel systems to clusters
- 1999 - 2000: HPVM. Helped build Windows clusters
- 2000 - now: Rocks. Helping to build Linux-based clusters
Why Clusters?
Moore’s Law
Cluster Pioneers

In the mid-1990s, the Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question:

Can You Build a High Performance Machine From Commodity Components?
The Answer is: Yes
Source: Dave Pierce, SIO
Types of Clusters
- High Availability: generally small (less than 8 nodes)
- Visualization
- High Performance: computational tools for scientific computing; large database machines
High Availability Cluster
Composed of redundant components and multiple communication paths
Visualization Cluster
Each node in the cluster drives a display
High Performance Cluster
Constructed with many compute nodes and often a high-performance interconnect
Cluster Hardware Components
Cluster Processors
- Pentium/Athlon
- Opteron
- Itanium
Processors: x86
Most prevalent processor used in commodity clustering
Fastest integer processor on the planet: 3.4 GHz Pentium 4, SPEC2000int: 1705
Processors: x86
Capable floating point performance: the #5 machine on the Top500 list is built with Pentium 4 processors.
Processors: Opteron

- Newest 64-bit processor
- Excellent integer performance: SPEC2000int 1655
- Good floating point performance: SPEC2000fp 1691
- #10 machine on Top500
Processors: Itanium

- First systems released June 2001
- Decent integer performance: SPEC2000int 1404
- Fastest floating-point performance on the planet: SPEC2000fp 2161
- Impressive Linpack efficiency: 86%
Processors Summary

| Processor | GHz | SPECint | SPECfp | Price (US$) |
|-----------|-----|---------|--------|-------------|
| Pentium 4 EE | 3.4 | 1705 | 1561 | 791 |
| Athlon FX-51 | 2.2 | 1447 | 1423 | 728 |
| Opteron 150 | 2.4 | 1655 | 1644 | 615 |
| Itanium 2 | 1.5 | 1404 | 2161 | 4798 |
| Itanium 2 | 1.3 | 1162 | 1891 | 1700 |
| Power4+ | 1.7 | 1158 | 1776 | ???? |
But What Can You Really Build?
Itanium: Dell PowerEdge 3250

- Two 1.4 GHz CPUs (1.5 MB cache); 11.2 Gflops peak
- 2 GB memory, 36 GB disk
- $7,700
- With two 1.5 GHz CPUs (6 MB cache), the system costs ~$17,700
- 1.4 GHz vs. 1.5 GHz: ~7% slower, but the 1.5 GHz system costs ~130% more
Opteron: IBM eServer 325

- Two 2.0 GHz Opteron 246 CPUs; 8 Gflops peak
- 2 GB memory, 36 GB disk
- $4,539
- With two 2.4 GHz CPUs: $5,691
- 2.0 GHz vs. 2.4 GHz: ~17% slower, but the 2.4 GHz system costs ~25% more
Pentium 4 Xeon: HP DL140

- Two 3.06 GHz CPUs; 12 Gflops peak
- 2 GB memory, 80 GB disk
- $2,815
- With two 3.2 GHz CPUs: $3,368
- 3.06 GHz vs. 3.2 GHz: ~4% slower, but the 3.2 GHz system costs ~20% more
If You Had $100,000 To Spend On A Compute Farm

| System | # of Boxes | Peak GFlops | Aggregate SPEC2000fp | Aggregate SPEC2000int |
|--------|------------|-------------|----------------------|-----------------------|
| Pentium 4, 3 GHz | 35 | 420 | 89810 | 104370 |
| Opteron 246, 2.0 GHz | 22 | 176 | 56892 | 57948 |
| Itanium, 1.4 GHz | 12 | 132 | 46608 | 24528 |
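The box counts follow from dividing the budget by the per-box prices quoted on the earlier slides; a quick sanity check (a sketch that ignores network and rack costs):

```bash
# Rough sanity check of the box counts, using the per-box prices above
# (integer division; network and rack costs ignored)
echo $((100000 / 2815))   # Pentium 4 Xeon (HP DL140)      -> 35
echo $((100000 / 4539))   # Opteron 246 (IBM eServer 325)  -> 22
echo $((100000 / 7700))   # Itanium (Dell PowerEdge 3250)  -> 12
```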
What People Are Buying

Gartner study, servers shipped in 1Q04:
- Itanium: 6,281
- Opteron: 31,184

Opteron shipped 5x more servers than Itanium.
What Are People Buying

Gartner study, servers shipped in 1Q04:
- Itanium: 6,281
- Opteron: 31,184
- Pentium: 1,000,000

Pentium shipped 30x more than Opteron.
Interconnects
Interconnects

- Ethernet: most prevalent on clusters
- Low-latency interconnects: Myrinet, Infiniband, Quadrics, Ammasso
Why Low-Latency Interconnects?
- Performance: lower latency, higher bandwidth
- Accomplished through OS-bypass
How Low-Latency Interconnects Work

Decrease latency for a packet by reducing the number of memory copies per packet.
Bisection Bandwidth
Definition: if you split the system in half, what is the maximum amount of data that can pass between the two halves?

Assuming 1 Gb/s links: bisection bandwidth = 1 Gb/s.
Bisection Bandwidth
Assuming 1 Gb/s links: Bisection bandwidth = 2 Gb/s
Bisection Bandwidth
Definition: a network has full bisection bandwidth if its topology can support N/2 simultaneous communication streams. That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed.
Large Networks

When you run out of ports on a single switch, you must add another network stage.

In the example above: assuming 1 Gb/s links, the uplinks from stage-1 switches to stage-2 switches must carry at least 6 Gb/s.
Large Networks

With low-port-count switches, large systems need many switches in order to maintain full bisection bandwidth. A 128-node system built from 32-port switches requires 12 switches and 256 total cables.
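Those counts fall out of a standard two-stage fat tree; a sketch of the arithmetic (the slide's exact diagram may differ in detail):

```bash
# Two-stage fat tree for 128 nodes with 32-port switches (a sketch):
#   Stage 1: 8 leaf switches, each using 16 ports for nodes + 16 for uplinks
#   Stage 2: 8 switches x 16 uplinks = 128 uplink ports; 128 / 32 = 4 spine switches
#   Switches: 8 + 4 = 12
#   Cables:   128 node links + 128 uplinks = 256
```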
Myrinet

- Long-time interconnect vendor: delivering products since 1995
- Delivers a single 128-port full bisection bandwidth switch
- MPI performance: latency 6.7 us, bandwidth 245 MB/s
- Cost/port (based on a 64-port configuration): $1000 for switch + NIC + cable
- http://www.myri.com/myrinet/product_list.html
Myrinet
Recently announced a 256-port switch, available August 2004.
Myrinet
- #5 system on the Top500 list
- The system sustains 64% of peak performance, but smaller Myrinet-connected systems hit 70-75% of peak
Quadrics

- QsNetII E-series: released at the end of May 2004
- Delivers 128-port standalone switches
- MPI performance: latency 3 us, bandwidth 900 MB/s
- Cost/port (based on a 64-port configuration): $1800 for switch + NIC + cable
- http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227
Quadrics
- #2 on the Top500 list sustains 86% of peak
- Other Quadrics-connected systems on the Top500 list sustain 70-75% of peak
Infiniband
- Newest cluster interconnect
- Currently shipping 32-port and 192-port switches
- MPI performance: latency 6.8 us, bandwidth 840 MB/s
- Estimated cost/port (based on a 64-port configuration): $1700 - $3000 for switch + NIC + cable
- http://www.techonline.com/community/related_content/24364
Ethernet
- Latency: 80 us
- Bandwidth: 100 MB/s
- The Top500 list has Ethernet-based systems sustaining between 35% and 59% of peak
Ethernet
What we did: 128 nodes with a $13,000 Ethernet network ($101/port; $28/port with our latest Gigabit Ethernet switch) sustained 48% of peak.

With Myrinet, we would have sustained ~1 Tflop, but at a network cost of ~$130,000, roughly 1/3 the cost of the system.
Rockstar Topology
- 24-port switches; not a symmetric network
- Best case: 4:1 bisection bandwidth; worst case: 8:1; average: 5.3:1
Low-Latency Ethernet
- Brings OS-bypass to Ethernet
- Projected performance: latency less than 20 us, bandwidth 100 MB/s
- Could potentially merge the management and high-performance networks
- Vendor: Ammasso
Application Benefits
Storage
Local Storage
Exported to compute nodes via NFS
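On a typical frontend this is plain NFS; a minimal sketch of the export (the path and private network below are assumptions, not necessarily Rocks' exact layout):

```bash
# Export home directories to the private cluster network (values are examples)
echo '/export/home 10.0.0.0/255.0.0.0(rw,async)' >> /etc/exports
exportfs -a     # activate everything listed in /etc/exports
showmount -e    # verify what the frontend now exports
```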
Network Attached Storage
A NAS box is an embedded NFS appliance
Storage Area Network
- Provides a disk-block interface over a network (Fibre Channel or Ethernet)
- Moves the shared disks out of the servers and onto the network
- Still requires a central service to coordinate file system operations
Parallel Virtual File System
- PVFS version 1 has no fault tolerance
- PVFS version 2 (in beta) has fault tolerance mechanisms
Lustre
- Open source
- “Object-based” storage: files become objects, not blocks
Cluster Software
Cluster Software Stack
Linux kernel/environment: RedHat, SuSE, Debian, etc.
Cluster Software Stack
HPC device drivers: interconnect drivers (e.g., Myrinet, Infiniband, Quadrics) and storage drivers (e.g., PVFS)
Cluster Software Stack
Job scheduling and launching: Sun Grid Engine (SGE), Portable Batch System (PBS), Load Sharing Facility (LSF)
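As a concrete example of this layer, a minimal SGE batch script might look like this (a sketch; the parallel environment name, slot count, and program are assumptions that vary by site):

```bash
#!/bin/bash
# Minimal SGE job script (PE name "mpich" and the application are examples)
#$ -cwd               # run from the submission directory
#$ -pe mpich 16       # request 16 slots from an MPICH parallel environment
mpirun -np $NSLOTS ./my_mpi_app
```

Submitted with `qsub job.sh`; PBS and LSF fill the same role with their own `qsub`/`bsub` front ends.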
Cluster Software Stack
Cluster software management: e.g., Rocks, OSCAR, Scyld
Cluster Software Stack
Cluster state management and monitoring:
- Monitoring: Ganglia, Clumon, Nagios, Tripwire, Big Brother
- Management: node naming and configuration (e.g., DHCP)
Cluster Software Stack
Message passing and communication layer: e.g., sockets, MPICH, PVM
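With MPICH of this era, for instance, a job at this layer is launched along these lines (a sketch; the host names and program are placeholders):

```bash
# Launch a 4-process MPICH job across two compute nodes (names are examples)
cat > machines <<'EOF'
compute-0-0
compute-0-1
EOF
mpirun -np 4 -machinefile machines ./my_mpi_app
```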
Cluster Software Stack
Parallel code / web farm / grid / computer lab: locally developed code
Cluster Software Stack
Questions:
- How to deploy this stack across every machine in the cluster?
- How to keep this stack consistent across every machine?
Software Deployment
Known methods:
- Manual approach
- “Add-on” method: bring up a frontend, then add cluster packages (OpenMosix, OSCAR, Warewulf)
- Integrated: cluster packages are added at frontend installation time (Rocks, Scyld)
Rocks
Primary Goal
Make clusters easy
Target audience: Scientists who want a capable computational resource in their own lab
Philosophy

- It’s no fun to “care and feed” for a system: all compute nodes are 100% automatically installed, which is critical for scaling
- It’s essential to track software updates: RHEL 3.0 has issued 232 source RPM updates since Oct 21, roughly 1 updated SRPM per day
- Run on heterogeneous, standard high-volume components: use the components that offer the best price/performance!
More Philosophy

- Use installation as the common mechanism to manage a cluster. Everyone installs a system: on initial bring-up, when replacing a dead node, and when adding new nodes
- Rocks also uses installation to keep software consistent: if you catch yourself wondering whether a node’s software is up-to-date, reinstall! In 10 minutes, all doubt is erased
- Rocks doesn’t attempt to incrementally update software
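In practice, forcing a reinstall is a one-liner; a hedged sketch (Rocks of this era shipped a `shoot-node` utility for this, though invocation details may differ by version):

```bash
# Force a compute node to reinstall itself (node name is an example)
shoot-node compute-0-0
# ~10 minutes later the node is back with a known-good software stack
```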
Rocks Cluster Distribution
- Fully-automated, cluster-aware distribution: a cluster on a CD set
- Software packages: a full Red Hat Linux distribution (Red Hat Enterprise Linux 3.0 rebuilt from source), de-facto standard cluster packages, Rocks packages, and Rocks community packages
- System configuration: configures the services in the packages
Rocks Hardware Architecture
Minimum Components
- x86, Opteron, or IA64 server
- Local hard drive
- Power
- Ethernet
- OS on all nodes (not SSI)
Optional Components
- Myrinet high-performance network (Infiniband support in Nov 2004)
- Network-addressable power distribution unit
- A keyboard/video/mouse network is not required: it’s non-commodity, and how do you manage your management network? Crash carts have a lower TCO
Storage
- NFS: the frontend exports all home directories
- Parallel Virtual File System version 1: system nodes can be targeted as Compute + PVFS or strictly PVFS nodes
Minimum Hardware Requirements
- Frontend: 2 Ethernet connections, 18 GB disk drive, 512 MB memory
- Compute: 1 Ethernet connection, 18 GB disk drive, 512 MB memory
- Power
- Ethernet switches
Cluster Software Stack
Rocks ‘Rolls’
Rolls are containers for software packages and the configuration scripts for the packages
Rolls dissect a monolithic distribution
Rolls: User-Customizable Frontends
- Rolls are added by the Red Hat installer: software is added and configured at initial installation time
- Benefit: security patches are applied during the initial installation. This method is more secure than the add-on method
Red Hat Installer Modified to Accept Rolls
Approach

Install a frontend:
1. Insert the Rocks Base CD
2. Insert Roll CDs (optional components)
3. Answer 7 screens of configuration data
4. Drink coffee (takes about 30 minutes to install)

Install compute nodes:
1. Login to the frontend
2. Execute insert-ethers
3. Boot a compute node with the Rocks Base CD (or PXE)
4. Insert-ethers discovers the node
5. Goto step 3

Add user accounts. Start computing.

Optional Rolls: Condor, Grid (based on NMI R4), Intel (compilers), Java, SCE (developed in Thailand), Sun Grid Engine, PBS (developed in Norway), Area51 (security monitoring tools)
Login to Frontend

- Create an ssh public/private key pair; you are asked for a passphrase. These keys are used to securely log in to compute nodes without entering a password each time
- Execute ‘insert-ethers’: this utility listens for new compute nodes
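Concretely, that first session looks something like this (a sketch; the key type and agent usage are assumptions, and Rocks typically prompts for key generation automatically at first login):

```bash
# First session on the frontend (a sketch)
ssh-keygen -t rsa    # accept the default file; choose a passphrase
eval `ssh-agent`     # optional: cache the key so node logins
ssh-add              #   don't re-prompt for the passphrase
insert-ethers        # now listen for new compute nodes
```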
Insert-ethers
Used to integrate “appliances” into the cluster
Boot a Compute Node in Installation Mode
- Instruct the node to network boot. Network boot forces the compute node to run PXE (the Preboot eXecution Environment)
- You can also use the Rocks Base CD
- If there is no CD and no PXE-enabled NIC, you can use a boot floppy built from ‘Etherboot’ (http://www.rom-o-matic.net)
Insert-ethers Discovers the Node
Insert-ethers Status
eKV: Ethernet Keyboard and Video

- Monitor your compute node installation over the Ethernet network. No KVM required!
- Execute: ‘ssh compute-0-0’
Node Info Stored In A MySQL Database
If you know SQL, you can execute some powerful commands
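For instance (a hedged sketch: the database name, table, and columns below are illustrative, not necessarily the exact Rocks schema):

```bash
# List compute nodes and their physical positions (schema is illustrative)
mysql -e 'SELECT name, rack, rank FROM nodes' cluster
```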
Cluster Database
Kickstart

Red Hat’s Kickstart:
- Monolithic flat ASCII file
- No macro language
- Requires forking based on site information and node type

Rocks XML Kickstart:
- Decomposes a kickstart file into nodes and a graph
- The graph specifies an OO framework
- Each node specifies a service and its configuration
- Macros and SQL for site configuration
- Driven from a web CGI script
Sample Node File

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>

  <description>Enable SSH</description>

  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <package>&ssh;-askpass</package>

  <post>

<file name="/etc/ssh/ssh_config">
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
</file>

chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh

  </post>
</kickstart>
```
Sample Graph File

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">

<graph>

  <description>Default Graph for NPACI Rocks.</description>

  <edge from="base" to="scripting"/>
  <edge from="base" to="ssh"/>
  <edge from="base" to="ssl"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
  …
  <edge from="node" to="base" weight="80"/>
  <edge from="node" to="accounting"/>
  <edge from="slave-node" to="node"/>
  <edge from="slave-node" to="nis-client"/>
  <edge from="slave-node" to="autofs-client"/>
  <edge from="slave-node" to="dhcp-client"/>
  <edge from="slave-node" to="snmp-server"/>
  <edge from="slave-node" to="node-certs"/>
  <edge from="compute" to="slave-node"/>
  <edge from="compute" to="usher-server"/>
  <edge from="master-node" to="node"/>
  <edge from="master-node" to="x11"/>
  <edge from="master-node" to="usher-client"/>

</graph>
```
Kickstart framework
Appliances
- Laptop / desktop appliances
- Final classes / node types:
  - Desktop IsA standalone
  - Laptop IsA standalone + pcmcia
- Code re-use is good
Architecture Differences
- Conditional inheritance: annotate edges with target architectures
  - If i386: Base IsA grub
  - If ia64: Base IsA elilo
- One graph, many CPUs: heterogeneity is easy
- Not for SSI or imaging
Installation Timeline
Status
But Are Rocks Clusters High Performance Systems?

Rocks clusters on the June 2004 Top500 list:
What We Proposed To Sun
Let’s build a Top500 machine … from the ground up … in 2 hours … in the Sun booth at Supercomputing ‘03.
Rockstar Cluster (SC’03)

- Demonstrate that we are now in the age of “personal supercomputing”
- Highlight the abilities of Rocks and SGE
- Top500 list: #201 (November 2003), #413 (June 2004)
- Hardware: 129 Intel Xeon servers (1 frontend node, 128 compute nodes)
- Gigabit Ethernet network: $13,000 (US); 9 24-port switches; 8 4-gigabit trunk uplinks