Technological Overview of High-Performance Computing
Gerolf Ziegenhain - TU Kaiserslautern, Germany
Outline of This Talk
● Give a glance at the important technologies
● The most important stuff is at least mentioned...
● Provide keywords / directions for further reading
Where is the Bottleneck?
● Beowulf cluster
⇒ Communication between nodes is critical
● Algorithms / problems to solve
– Look for a good decomposition
– Many parallel algorithms ↔ many systems
● This talk:
– Which technologies exist for building a Beowulf cluster?
How to Share Resources
● What can be shared?
– Computation time
– Storage
– Administrative overhead
● Each participating group has to have at least one administrator, to explain usage and to know the needs ⇒ make design decisions!
● Critical technologies
– Hardware: interconnect of the nodes
– Software: automatic fair-share
Part - Networking
Networking Technologies
Various technologies have existed over the last decade
Networking
● Backbone of a cluster
– Communication between nodes
– Experience: 90% of failures are due to network trouble
● This talk: basic knowledge
– Different topologies
– Latency & throughput
Network Topologies
● How are the nodes connected?
● All-to-all communication ⇒ fully connected
● Long-ranged interactions with field approximation ⇒ tree
● Storage with a NAS station ⇒ star
● GBit with a switch provides: bus
(Topology diagrams: Wikipedia)
Topologies of Switched Networks
● Simple switch
– Cheap
– All-to-all communication within one ring
– Different rings possible
● Stacked switches
– All-to-all communication
– Limited bandwidth
● Fat tree
– Unlimited all-to-all communication
Bauke & Mertens, Springer 2005
Latency

Technology      Latency (µs)   Bandwidth (MB/s)
MD-step(N)      2.8            0.00007
SD-RAM          <0.007         >1000
MBit Ethernet   70             11
GBit Ethernet   30             110
Infiniband      7.5            800
Myrinet         6.3            248
SCI             2.7            326
Latency
● Critical parameter: latency
● RAM: fast, expensive
● Networking: slow if cheap → low latency is expensive
⇒ Look for a good decomposition
Bandwidth
● Evaluation of trajectories (data in 50 GB packets is common for MD simulations)
⇒ Bandwidth also matters
⇒ Know your needs!
Collisions of Packets?
● Different packet types ↔ collisions
● Packet transfer tolerance
– Persistent / blocking / non-blocking
– Point-to-point / broadcast / multicast
● Can your switch handle the load?
– The switch is essential to your performance!
● MPI is fragile
– At least with GBit: separate the networks (see the sketch below)
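One way to keep MPI traffic off the administration network is to pin it to a dedicated interface. A minimal sketch for Open MPI (the talk names no MPI implementation; the interface name and process count are placeholders):

  # Route MPI traffic over the dedicated GBit interface eth1 only
  mpirun --mca btl_tcp_if_include eth1 -np 16 ./md-simulation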
Channel Bonding
● Combine multiple GBit channels (setup sketch below)
● Increased bandwidth (load balancing)
– Bandwidth ~ #channels
– Fault tolerance
● Loss in latency <1%
● Algorithms for channel bonding
– XOR of MAC addresses
– ARP packets
– Dynamic link aggregation (802.3ad-compliant switch)
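A minimal sketch of channel bonding on Linux, assuming the bonding kernel module and the ifenslave tool; interface names and the address are placeholders:

  # Load the bonding driver in 802.3ad (dynamic link aggregation) mode;
  # miimon polls the link state every 100 ms for fault tolerance
  modprobe bonding mode=802.3ad miimon=100
  ip addr add 192.168.0.10/24 dev bond0
  ip link set bond0 up
  # Enslave two GBit NICs to the bond
  ifenslave bond0 eth0 eth1
  # Inspect the aggregated link
  cat /proc/net/bonding/bond0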
IPMI / SOL
● IPMI = Intelligent Platform Management Interface (command examples below)
– Monitor hardware status
– BIOS access
– Reboot or power on / off
– Serial interface
– Vendors: Dell, HP, Intel, NEC
● SOL = Serial over LAN
– Access the serial port over the LAN
– Only one cable infrastructure
– ≠ KVM switch
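Typical IPMI tasks from the command line, sketched with the common ipmitool client (hostname and credentials are placeholders):

  ipmitool -I lanplus -H node01-ipmi -U admin -P secret sensor list          # hardware status
  ipmitool -I lanplus -H node01-ipmi -U admin -P secret chassis power cycle  # remote reboot
  ipmitool -I lanplus -H node01-ipmi -U admin -P secret sol activate         # serial console over LAN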
Computer Architecture
Development of Architecture
Architecture
● Feynman 1960: plenty of room at the bottom
● But: heat (power) dissipation is problematic
● Paradigm change in CPU design
– Not possible: one big CPU
– Possible: many CPUs
– Many processes in parallel
– Virtualization
– Parallel programming
Architecture ↔ Networking
● Latency RAM ≪ latency networking
⇒ Use as many CPUs per node as possible
– But: overhead, because there is only one memory!
● Network
– Topology, technology
– Price increases nonlinearly with #nodes
● The optimum changes over the years...
– Price for multi-CPU mainboards
– Price for networking overhead
⇒ Know your needs!
Software and System Management
Scripting
● Occurrence
– Init scripts
– Cron jobs
– Customization of software (most of it written in Bash)
● Applications (see the sketch below)
– Monitoring (temperature, storage, CPU usage, ...)
– Maintenance (logfiles, storage, users, ...)
– Initialization (services, mounting storage, ...)
– Job control (initialization, cleanup, ...)
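A minimal sketch of such a monitoring script in Bash, e.g. run from cron; the threshold and mail address are placeholders:

  #!/bin/bash
  # Warn by mail when any filesystem exceeds 90% usage
  THRESHOLD=90
  df -P | awk -v t="$THRESHOLD" 'NR > 1 && $5+0 > t {print $6, $5}' |
  while read mount usage; do
      echo "Disk usage on $mount is $usage" \
        | mail -s "storage warning: $mount" admin@cluster
  done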
Which Scripting Languages
● Absolutely essential:
– Shell (/bin/sh & /bin/bash)
– awk & sed
– Python
● Optional
– Perl
– Shell (/bin/zsh)
● Good to know
– C
Queue System
● Objective
– Start / stop / monitor jobs
– Fair share (users / groups / university)
– Hard / soft limits of resources
● Job requirements
– Memory
– #cores
– Runtime
– Priority (can it be resumed?)
● Quickshots, long-term simulations
● Optimal usage of the hardware
● Transparent & fair end-user interface
How to Configure the Queue
● Properties of nodes
– Memory
– #cores
– HDD space
– Architecture
● Priorities / rights (user, group)
● Different queues (priority, resources; see the sketch below)
● Fair-share policies
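As an illustration, creating such a queue with qmgr in Torque/PBS syntax (the talk names no specific batch system; queue name and limits are placeholders):

  # Create a "long" queue with resource limits and defaults
  qmgr -c "create queue long queue_type=execution"
  qmgr -c "set queue long resources_max.walltime=240:00:00"
  qmgr -c "set queue long resources_default.mem=2gb"
  qmgr -c "set queue long enabled=true"
  qmgr -c "set queue long started=true"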
How to Use the Queue
● Specify the requirements for a job (example script below)
– Runtime
– #cores
– Memory
– Architecture
– Priority
● Create dependent jobs
– Evaluation
– Parameter scans
● Interactive jobs
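A sketch of a job script, again in Torque/PBS syntax; names and resource values are placeholders:

  #!/bin/bash
  #PBS -N md-run             # job name
  #PBS -l walltime=24:00:00  # runtime
  #PBS -l nodes=1:ppn=8      # #cores
  #PBS -l mem=4gb            # memory
  #PBS -q long               # queue
  cd $PBS_O_WORKDIR
  ./md-simulation input.dat

Submit it with "qsub job.sh"; a dependent evaluation job can be chained with "qsub -W depend=afterok:<jobid> evaluate.sh".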
Monitoring Performance
● What to monitor?
– Temperature
– Disk space / memory / CPU load
– Job queue
– Network traffic
● How to monitor?
– Automatic monitoring system
● Ganglia
● Multicast, web interface, charts
– Manual monitoring (examples below)
● vmstat, tcpdump
● logfiles
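The manual tools as one-line examples (the interface name is a placeholder):

  vmstat 5                   # CPU, memory and I/O statistics every 5 seconds
  tcpdump -i eth0            # inspect the traffic on one interface (as root)
  tail -f /var/log/messages  # follow the syslog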
OpenSSH
● Secure SHell
● Idea: point-to-point
● Toolset
– ssh, scp, sftp
– Key- & host-based authentication
● Features (examples below)
– Tunneling (→ VPN)
– Shell
– Copying files
– Mounting a server on the remote desktop (sshfs)
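The features above as command-line sketches; hosts, ports and paths are placeholders:

  ssh -L 2525:mailhost:25 user@gateway       # tunnel a local port through the gateway
  scp results.tar.gz user@fileserver:/data/  # copy files
  sshfs user@fileserver:/data ~/data         # mount a remote directory (FUSE)
  fusermount -u ~/data                       # unmount it again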
Parallel ssh
● ssh:
– Secure, fast
– Point-to-point ⇒ 1000*n seconds:
  for i in node{1..1000}; do ssh $i reboot; done
● Rgang
– Python script
– Parallel ssh sessions
– Tree structure of spawned ssh instances ⇒ n seconds:
  rgang nodes reboot
– Scales up to thousands of nodes (a plain-Bash alternative is sketched below)
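Without rgang, a poor man's parallel ssh in plain Bash (flat fan-out, no tree structure, so it scales worse; "nodes" is a file with one hostname per line):

  # Spawn all sessions in the background, then wait for them
  while read host; do
      ssh -o ConnectTimeout=5 "$host" reboot &
  done < nodes
  wait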
Nice to Have
● Syslog server
● Monitoring
● Snapshot of the virtualized system states
● Automatic installation server
– A complete reinstall in 2 minutes is doable
Storage
Storage Hardware
● Hard disk drive
– Invented 1954
– State-of-the-art technologies
– Moore's law holds
– Parameters to consider
● Average lifetime
● Power consumption
● Heat dissipation
– Vendor specific
● Quality
● Compatibility with Linux / BSD
RAID
● Striping (RAID 0)
– Size, speed: ~N; 1 disk fails ⇒ all broken
● Mirroring (RAID 1)
– Data security, increased reading speed
● Striping with distributed parity (RAID 5)
– Size, speed: ~(N-1); 1 HDD may fail
RAID
● Software / hardware?
– Little speed difference
– Software RAID works with all disk controllers
– Hardware RAID: vendor-specific format
● Consider (software-RAID sketch below)
– Failure tolerance: an HDD may break
– Total capacity of the storage
– Create multiple storages?
– How to share the resources?
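Creating a software RAID 5 under Linux with mdadm, as a sketch; device names and the filesystem are examples:

  # Build a RAID 5 from three partitions and put a filesystem on it
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[abc]1
  mkfs.ext3 /dev/md0
  cat /proc/mdstat    # watch the array sync / check its status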
Storage Sharing
● NAS: export a local (RAID) filesystem over NFS ⇒ size
● SAN: one HDD per node in the rack ⇒ size, speed
Distributed Storage
● NAS: slow for parallel access
● Solution: store local data portions
– Custom solution within your code
● Parallel filesystems
– 1 HDD per node, combine all nodes into one big shared storage
– GlusterFS, Coda, Lustre, PVFS, XtreemFS, ...
– Licensing
– Installation effort
– Fault tolerance
– Performance?
HDF5
● Hierarchical Data Format
● Use (Python) libraries for evaluation (inspection example below)
● Advantages
– As fast as binary files, but much smaller (statistics in less than one second through a 100 GB data file)
– Include images, descriptions, axis labels, units etc. in the same file
– Portable
– I/O libraries exist
– Seamless integration with MPI (support for output in slices)
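Inspecting an HDF5 file with the standard command-line tools that ship with the library (filename and dataset path are placeholders):

  h5ls -r trajectory.h5                     # list the hierarchy of groups and datasets
  h5dump -d /step0/positions trajectory.h5  # dump one dataset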
Administration
● User management
● Core programs
– ssh
– NIS server
– Queue (automatic job kill?)
● Storage: quota (examples below)
– Allowed filesize
– Automatic deletion
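Quota handling with the standard Linux tools (the username is a placeholder):

  repquota -a       # report usage and limits on all quota-enabled filesystems
  edquota -u alice  # edit the soft / hard limits for one user
  quota -u alice    # show a user's current usage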
Housing of the Cluster
● Nominal power consumption
– Rises with time
– Consider also the power supply of the cooling
● Power supplies
– Influence of voltage peaks ⇒ equalizer
● Power may fail ⇒ UPS for critical infrastructure
● Cooling
– Each W burned in a CPU ⇒ heat
Thank you!
● Acknowledgements:
– Jan Janßen
– Prof. Dr. rer. nat. Herbert M. Urbassek