Technological Overview of High-Performance Computing
Gerolf Ziegenhain - TU Kaiserslautern, Germany
Outline of This Talk
● Give a glance at the important technologies
● The most important stuff is at least mentioned...
● Provide keywords / directions for further reading
Where is the Bottleneck?
● Beowulf cluster
⇒ Communication between nodes is critical
● Algorithms / problems to solve
– Look for a good decomposition
– Many parallel algorithms ↔ many systems
● This talk:
– Which technologies exist for building a Beowulf cluster?
How to Share Resources
● What can be shared?
– Computation time
– Storage
– Administrative overhead
● Each participating group has to have at least one administrator, to explain usage and to know the needs ⇒ make design decisions!
● Critical technologies
– Hardware: interconnect of the nodes
– Software: automatic fair-share
Part - Networking
Networking Technologies
Various technologies have existed over the last decade
Networking
● Backbone of a cluster
– Communication between nodes
– Experience: 90% of failures are due to network trouble
● This talk: basic knowledge
– Different topologies
– Latency & throughput
Network Topologies
● How are the nodes connected?
● All-to-all communication ⇒ fully connected
● Long-ranged interactions with field approximation ⇒ tree
● Storage with a NAS station ⇒ star
● GBit with a switch provides: bus
(Topology diagrams: Wikipedia)
Topologies of Switched Networks
● Simple switch
– Cheap
– All-to-all communication within one ring
– Different rings possible
● Stacked switches
– All-to-all communication
– Limited bandwidth
● Fat tree
– Unlimited all-to-all communication
Bauke & Mertens, Springer 2005
Latency

Technology      Latency (µs)   Bandwidth (MB/s)
MD-step(N)      2.8            0.00007
SD-RAM          <0.007         >1000
MBit Ethernet   70             11
GBit Ethernet   30             110
Infiniband      7.5            800
Myrinet         6.3            248
SCI             2.7            326
Latency
● Critical parameter: latency
● RAM: fast, expensive
● Networking: slow if cheap → low latency is expensive
⇒ Look for a good decomposition
Bandwidth
● Evaluation of trajectories (data in 50 GB packets is common for MD simulations)
⇒ Bandwidth also matters
⇒ Know your needs!
Collisions of Packets?
● Different packet types ↔ collisions
● Packet transfer tolerance
– Persistent / blocking / non-blocking
– Point-to-point / broadcast / multicast
● Can your switch handle the load?
– The switch is essential to your performance!
● MPI is fragile
– At least with GBit: separate the networks (see the sketch below)
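One way to keep MPI traffic off the administration network is to pin it to a dedicated interface. A minimal sketch for Open MPI (the talk names no MPI implementation; the interface name and process count are placeholders):

  # Route MPI traffic over the dedicated GBit interface eth1 only
  mpirun --mca btl_tcp_if_include eth1 -np 16 ./md-simulation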
Channel Bonding
● Combine multiple GBit channels (setup sketch below)
● Increased bandwidth (load balancing)
– Bandwidth ~ #channels
– Fault tolerance
● Loss in latency <1%
● Algorithms for channel bonding
– XOR of MAC addresses
– ARP packets
– Dynamic link aggregation (802.3ad-compliant switch)
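A minimal sketch of channel bonding on Linux, assuming the bonding kernel module and the ifenslave tool; interface names and the address are placeholders:

  # Load the bonding driver in 802.3ad (dynamic link aggregation) mode;
  # miimon polls the link state every 100 ms for fault tolerance
  modprobe bonding mode=802.3ad miimon=100
  ip addr add 192.168.0.10/24 dev bond0
  ip link set bond0 up
  # Enslave two GBit NICs to the bond
  ifenslave bond0 eth0 eth1
  # Inspect the aggregated link
  cat /proc/net/bonding/bond0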
IPMI / SOL
● IPMI = Intelligent Platform Management Interface (command examples below)
– Monitor hardware status
– BIOS access
– Reboot or power on / off
– Serial interface
– Vendors: Dell, HP, Intel, NEC
● SOL = Serial over LAN
– Access the serial port over the LAN
– Only one cable infrastructure
– ≠ KVM switch
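Typical IPMI tasks from the command line, sketched with the common ipmitool client (hostname and credentials are placeholders):

  ipmitool -I lanplus -H node01-ipmi -U admin -P secret sensor list          # hardware status
  ipmitool -I lanplus -H node01-ipmi -U admin -P secret chassis power cycle  # remote reboot
  ipmitool -I lanplus -H node01-ipmi -U admin -P secret sol activate         # serial console over LAN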
Computer Architecture
Development of Architecture
Architecture
● Feynman 1960: plenty of room at the bottom
● But: heat (power) dissipation is problematic
● Paradigm change in CPU design
– Not possible: one big CPU
– Possible: many CPUs
– Many processes in parallel
– Virtualization
– Parallel programming
Architecture ↔ Networking
● Latency RAM ≪ latency networking
⇒ Use as many CPUs per node as possible
– But: overhead, because there is only one memory!
● Network
– Topology, technology
– Price increases nonlinearly with #nodes
● The optimum changes over the years...
– Price for multi-CPU mainboards
– Price for networking overhead
⇒ Know your needs!
Software and System Management
Scripting
● Occurrence
– Init scripts
– Cron jobs
– Customization of software (most of it written in Bash)
● Applications (see the sketch below)
– Monitoring (temperature, storage, CPU usage, ...)
– Maintenance (logfiles, storage, users, ...)
– Initialization (services, mounting storage, ...)
– Job control (initialization, cleanup, ...)
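A minimal sketch of such a monitoring script in Bash, e.g. run from cron; the threshold and mail address are placeholders:

  #!/bin/bash
  # Warn by mail when any filesystem exceeds 90% usage
  THRESHOLD=90
  df -P | awk -v t="$THRESHOLD" 'NR > 1 && $5+0 > t {print $6, $5}' |
  while read mount usage; do
      echo "Disk usage on $mount is $usage" \
        | mail -s "storage warning: $mount" admin@cluster
  done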
Which Scripting Languages
● Absolutely essential:
– Shell (/bin/sh & /bin/bash)
– awk & sed
– Python
● Optional
– Perl
– Shell (/bin/zsh)
● Good to know
– C
Queue System
● Objective
– Start / stop / monitor jobs
– Fair share (users / groups / university)
– Hard / soft limits of resources
● Job requirements
– Memory
– #cores
– Runtime
– Priority (can it be resumed?)
● Quickshots, long-term simulations
● Optimal usage of the hardware
● Transparent & fair end-user interface
How to Configure the Queue
● Properties of nodes
– Memory
– #cores
– HDD space
– Architecture
● Priorities / rights (user, group)
● Different queues (priority, resources; see the sketch below)
● Fair-share policies
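As an illustration, creating such a queue with qmgr in Torque/PBS syntax (the talk names no specific batch system; queue name and limits are placeholders):

  # Create a "long" queue with resource limits and defaults
  qmgr -c "create queue long queue_type=execution"
  qmgr -c "set queue long resources_max.walltime=240:00:00"
  qmgr -c "set queue long resources_default.mem=2gb"
  qmgr -c "set queue long enabled=true"
  qmgr -c "set queue long started=true"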
How to Use the Queue
● Specify the requirements for a job (example script below)
– Runtime
– #cores
– Memory
– Architecture
– Priority
● Create dependent jobs
– Evaluation
– Parameter scans
● Interactive jobs
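A sketch of a job script, again in Torque/PBS syntax; names and resource values are placeholders:

  #!/bin/bash
  #PBS -N md-run             # job name
  #PBS -l walltime=24:00:00  # runtime
  #PBS -l nodes=1:ppn=8      # #cores
  #PBS -l mem=4gb            # memory
  #PBS -q long               # queue
  cd $PBS_O_WORKDIR
  ./md-simulation input.dat

Submit it with "qsub job.sh"; a dependent evaluation job can be chained with "qsub -W depend=afterok:<jobid> evaluate.sh".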
Monitoring Performance
● What to monitor?
– Temperature
– Disk space / memory / CPU load
– Job queue
– Network traffic
● How to monitor?
– Automatic monitoring system
● Ganglia
● Multicast, web interface, charts
– Manual monitoring (examples below)
● vmstat, tcpdump
● logfiles
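The manual tools as one-line examples (the interface name is a placeholder):

  vmstat 5                   # CPU, memory and I/O statistics every 5 seconds
  tcpdump -i eth0            # inspect the traffic on one interface (as root)
  tail -f /var/log/messages  # follow the syslog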
OpenSSH
● Secure SHell
● Idea: point-to-point
● Toolset
– ssh, scp, sftp
– Key- & host-based authentication
● Features (examples below)
– Tunneling (→ VPN)
– Shell
– Copying files
– Mounting a server on the remote desktop (sshfs)
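The features above as command-line sketches; hosts, ports and paths are placeholders:

  ssh -L 2525:mailhost:25 user@gateway       # tunnel a local port through the gateway
  scp results.tar.gz user@fileserver:/data/  # copy files
  sshfs user@fileserver:/data ~/data         # mount a remote directory (FUSE)
  fusermount -u ~/data                       # unmount it again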
Parallel ssh
● ssh:
– Secure, fast
– Point-to-point ⇒ 1000*n seconds:
  for i in node{1..1000}; do ssh $i reboot; done
● Rgang
– Python script
– Parallel ssh sessions
– Tree structure of spawned ssh instances ⇒ n seconds:
  rgang nodes reboot
– Scales up to thousands of nodes (a plain-Bash alternative is sketched below)
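Without rgang, a poor man's parallel ssh in plain Bash (flat fan-out, no tree structure, so it scales worse; "nodes" is a file with one hostname per line):

  # Spawn all sessions in the background, then wait for them
  while read host; do
      ssh -o ConnectTimeout=5 "$host" reboot &
  done < nodes
  wait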
Nice to Have
● Syslog server
● Monitoring
● Snapshot of the virtualized system states
● Automatic installation server
– A complete reinstall in 2 minutes is doable
Storage
Storage Hardware
● Hard disk drive
– Invented 1954
– State-of-the-art technologies
– Moore's law holds
– Parameters to consider
● Average lifetime
● Power consumption
● Heat dissipation
– Vendor specific
● Quality
● Compatibility with Linux / BSD
RAID
● Striping (RAID 0)
– Size, speed: ~N; 1 disk fails ⇒ all broken
● Mirroring (RAID 1)
– Data security, increased reading speed
● Striping with distributed parity (RAID 5)
– Size, speed: ~(N-1); 1 HDD may fail
RAID
● Software / hardware?
– Little speed difference
– Software RAID works with all disk controllers
– Hardware RAID: vendor-specific format
● Consider (software-RAID sketch below)
– Failure tolerance: an HDD may break
– Total capacity of the storage
– Create multiple storages?
– How to share the resources?
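Creating a software RAID 5 under Linux with mdadm, as a sketch; device names and the filesystem are examples:

  # Build a RAID 5 from three partitions and put a filesystem on it
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[abc]1
  mkfs.ext3 /dev/md0
  cat /proc/mdstat    # watch the array sync / check its status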
Storage Sharing
● NAS: export a local (RAID) filesystem over NFS ⇒ size
● SAN: one HDD per node in the rack ⇒ size, speed
Distributed Storage
● NAS: slow for parallel access
● Solution: store local data portions
– Custom solution within your code
● Parallel filesystems
– 1 HDD per node, combine all nodes into one big shared storage
– GlusterFS, Coda, Lustre, PVFS, XtreemFS, ...
– Licensing
– Installation effort
– Fault tolerance
– Performance?
HDF5
● Hierarchical Data Format
● Use (Python) libraries for evaluation (inspection example below)
● Advantages
– As fast as binary files, but much smaller (statistics in less than one second through a 100 GB data file)
– Include images, descriptions, axis labels, units etc. in the same file
– Portable
– I/O libraries exist
– Seamless integration with MPI (support for output in slices)
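Inspecting an HDF5 file with the standard command-line tools that ship with the library (filename and dataset path are placeholders):

  h5ls -r trajectory.h5                     # list the hierarchy of groups and datasets
  h5dump -d /step0/positions trajectory.h5  # dump one dataset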
Administration
● User management
● Core programs
– ssh
– NIS server
– Queue (automatic job kill?)
● Storage: quota (examples below)
– Allowed filesize
– Automatic deletion
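Quota handling with the standard Linux tools (the username is a placeholder):

  repquota -a       # report usage and limits on all quota-enabled filesystems
  edquota -u alice  # edit the soft / hard limits for one user
  quota -u alice    # show a user's current usage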
Housing of the Cluster
● Nominal power consumption
– Rises with time
– Consider also the power supply of the cooling
● Power supplies
– Influence of voltage peaks ⇒ equalizer
● Power may fail ⇒ UPS for critical infrastructure
● Cooling
– Each W burned in a CPU ⇒ heat
Thank you!
● Acknowledgements:
– Jan Janßen
– Prof. Dr. rer. nat. Herbert M. Urbassek