Early Experience with I/O on Mira
Venkat Vishwanath, Steve Crusan, and Kevin Harms, Argonne National Laboratory
Disclaimer: The Mira I/O subsystem and filesystem are still a work in progress. The configuration and performance will change, and we will keep users informed of these changes.
ALCF-1 I/O Infrastructure
[Diagram of the Intrepid I/O path: compute resource → I/O nodes → switch complex → file servers → storage system]
• Intrepid BG/P compute resource: 40K nodes, 160K cores, 0.55 PFlops; 640 GB/s into the I/O nodes
• 640 I/O nodes: 640 x 0.85 GB/s (544 GB/s) into the switch
• Switch complex: 10 Gbps Myrinet, 900+ ports
• Eureka analysis cluster: 100 nodes, 3.2 TB memory, 200 GPUs; 100 GB/s into the switch
• 128 file servers: 128 GB/s from the switch, 65 GB/s to the storage system
• ALCF uses the GPFS filesystem for production I/O
ALCF-2 I/O Infrastructure
[Diagram of the Mira I/O path: compute resource → I/O nodes → switch complex → file servers → storage system]
• Mira BG/Q compute resource: 48K nodes, 768K cores, 10 PFlops; 768 x 2 GB/s (1536 GB/s) into the I/O nodes
• 384 I/O nodes: 1536 GB/s into the switch
• Switch complex: QDR InfiniBand, 900+ ports
• Tukey analysis cluster: 96 nodes, 6 TB memory, 192 GPUs; 384 GB/s into the switch
• 128 DDN VM file servers, 26 PB: 512-1024 GB/s from the switch, ~250 GB/s to the storage system
• Storage system: 16 DDN SFA12KE "couplets," each consisting of 560 x 3 TB HDDs, 32 x 200 GB SSDs, and 8 virtual machines (VMs) with two QDR IB ports per VM
I/O Subsystem of BG/P
[Diagram: BG/P compute nodes in a pset → tree network switch → BG/P I/O node → storage/analysis node]
I/O forwarding is a paradigm that bridges the growing performance and scalability gap between the compute and I/O components of leadership-class machines. To meet the requirements of data-intensive applications, I/O calls are shipped from the compute nodes to dedicated I/O nodes.
I/O Subsystem of BG/Q: Simplified View
[Diagram: a 2x2x4x4x2 block of 128 BG/Q compute nodes → two 2 GB/s torus links → ION → 4 GB/s QDR IB link]
• For every 128 compute nodes, there are two 2 GB/s links to the I/O node (ION) on the BG/Q torus
• On the ION, a QDR InfiniBand adapter sits in a PCIe Gen2 slot, providing 4 GB/s toward storage
Control and I/O Daemon (CIOD): the I/O Forwarding Mechanism for BG/P and BG/Q
• CIOD is the I/O forwarding infrastructure provided by IBM for the Blue Gene/P
• For each compute node, a dedicated I/O proxy process handles the associated I/O operations
• On BG/Q, the implementation is thread-based instead of process-based
Intrepid vs. Mira: I/O Comparison

                     Intrepid BG/P    Mira BG/Q           Increase
Flops                0.557 PF         10 PF               20X
Cores                160K             768K                ~5X
ION/CN ratio         1/64             1/128               0.5X
IONs per rack        16               8                   0.5X
Total storage        6 PB             35 PB               6X
I/O throughput       62 GB/s          240 GB/s peak       4X
User quota?          None             Quotas enforced!!   -
I/O per Flops (%)    0.01             0.0024              0.24X
I/O Interfaces and Libraries on Mira
• Interfaces
  – POSIX
  – MPI-IO
• Higher-level libraries
  – HDF5
  – NetCDF4
  – PnetCDF
• Early I/O performance on Mira
  – Shared file write: ~70 GB/s
  – Shared file read: ~180 GB/s
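To make the shared-file configuration concrete, here is a minimal MPI-IO sketch of the pattern those numbers refer to: every rank writes a disjoint block of one shared file through a collective call. The file name and payload size are illustrative, not taken from the measurements above.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4096];                          /* dummy per-rank payload */
    for (int i = 0; i < 4096; i++) buf[i] = (double)rank;

    /* All ranks open one shared file and write disjoint blocks collectively. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, off, buf, 4096, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}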
Early Experience Scaling I/O for HACC
• The Hybrid/Hardware Accelerated Computational Cosmology (HACC) code is led by Salman Habib at Argonne
• Team members include Katrin Heitmann, Hal Finkel, Adrian Pope, and Vitali Morozov
• Particle-in-Cell code
• Achieved ~14 PF on Sequoia and 7 PF on Mira
• On Mira, HACC uses 1-10 trillion particles
HACC on Mira: Early Science Test Run
• First large-scale simulation on Mira
  – 16 racks, 262,144 cores (one-third of Mira)
  – 14 hours continuous, no checkpoint restarts
• Simulation parameters
  – 9.14 Gpc simulation box
  – 7 kpc force resolution
  – 1.07 trillion particles
• Science motivations
  – Resolve halos that host luminous red galaxies for analysis of SDSS observations
  – Cluster cosmology
  – Baryon acoustic oscillations in galaxy clustering
[Figure: zoom-in visualization of the density field illustrating the global spatial dynamic range of the simulation, approximately a million to one]
Early Experience Scaling I/O for HACC
• A typical checkpoint/restart dataset is ~100-400 TB
• Analysis output, written more frequently, is ~1 TB-10 TB
• Step 1:
  – Default I/O mechanism using a single shared file with MPI-IO
  – Achieved ~15 GB/s
• Step 2:
  – File per rank
  – Too many files at 32K nodes (262K files with 8 ranks per node)
• Step 3:
  – An I/O mechanism writing 1 file per ION
  – Topology-aware I/O to reduce network contention
  – Achieves up to 160 GB/s (~10X improvement); see the sketch after this list
• Written and read ~10 PB of data on Mira (and counting)
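As a rough illustration of the Step 3 idea, the sketch below groups ranks into per-ION communicators and writes one shared file per group with MPI-IO. This is a minimal sketch, not HACC's actual implementation: the GROUP_SIZE of 128 and the rank-order MPI_Comm_split are simplifying assumptions, whereas a genuinely topology-aware version would group ranks that share a physical I/O node on the torus.

#include <mpi.h>
#include <stdio.h>

#define GROUP_SIZE 128   /* assumed compute nodes per ION; a real version queries the topology */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, grank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Approximate "one file per ION" by splitting ranks into fixed-size groups. */
    int group = rank / GROUP_SIZE;
    MPI_Comm ion_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &ion_comm);
    MPI_Comm_rank(ion_comm, &grank);

    /* Each group writes its own shared file: 384 files instead of 262K. */
    char fname[64];
    snprintf(fname, sizeof(fname), "checkpoint.%04d", group);

    double buf[1024];                        /* dummy per-rank particle payload */
    for (int i = 0; i < 1024; i++) buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(ion_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)grank * sizeof(buf);
    MPI_File_write_at_all(fh, off, buf, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Comm_free(&ion_comm);
    MPI_Finalize();
    return 0;
}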
A Few Tips and Tricks for I/O on BG/Q
• Pre-create the files
  – Increases shared-file write performance from 70 GB/s to 130 GB/s
• Single shared file vs. file per process (MPI rank)
• Pre-allocate files
• HDF5 datatype
  – Instead of native, use big endian (e.g., H5T_IEEE_F64BE); sketched in code after this list
• MPI datatypes
• Collective I/O
• MPI-IO bglockless environment variable
  – Prefix the file name with bglockless:/
  – BGLOCKLESSMPIO_F_TYPE=0x47504653
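A hedged sketch of two of these tips, with placeholder file names and sizes (this is not the exact ALCF-recommended code): an HDF5 dataset declared with the big-endian H5T_IEEE_F64BE file type instead of H5T_NATIVE_DOUBLE, and an MPI-IO open using the bglockless:/ prefix.

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* --- HDF5: big-endian file datatype instead of native --- */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, NULL);
    /* H5T_IEEE_F64BE (big endian), per the tip above */
    hid_t dset = H5Dcreate2(file, "data", H5T_IEEE_F64BE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dclose(dset); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);

    /* --- MPI-IO: bglockless prefix (illustrative path) --- */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bglockless:/path/to/file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}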
Darshan – An I/O Characterization Tool
• An open-source tool developed for statistical profiling of I/O
• Designed to be lightweight and low overhead
  – Finite memory allocation for statistics (about 2 MB) done during MPI_Init
  – Overhead of 1-2% total to record I/O calls
  – Variation of I/O is typically around 4-5%
  – Darshan does not create detailed function call traces
• No source modifications
  – Uses PMPI interfaces to intercept MPI calls
  – Uses ld wrapping to intercept POSIX calls
  – Can use dynamic linking with LD_PRELOAD instead
• Stores results in a single compressed log file
• http://www.mcs.anl.gov/darshan
Darshan on BG/Q
• Logs: /gpfs/vesta-fs0/logs/darshan/vesta and /gpfs/mira-fs0/logs/darshan/mira
• Binaries: /soft/perftools/darshan/darshan/bin
• Wrappers: /soft/perftools/darshan/darshan/wrappers/<wrapper>
• Example: export PATH=/soft/perftools/darshan/darshan/wrappers/xl:${PATH}
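A typical workflow, sketched with a hypothetical application name (the wrappers stand in for the usual compiler drivers, so no source changes are needed):
  mpixlc -o myapp myapp.c                       # wrapped compiler instruments I/O at link time
  # run the job as usual; Darshan writes a compressed log under the logs directory above
  darshan-job-summary.pl <logfile>.darshan.gz   # produces a summary like the sample below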
[Sample Darshan job summary for fms_c48L24_am3p5.x (3/27/2010), page 1 of 2: uid 6277, nprocs 1944, runtime 5382 seconds. Panels show the average I/O cost per process (read, write, metadata, and other time, including application compute); I/O operation counts for POSIX, independent MPI-IO, and collective MPI-IO (read, write, open, stat, seek, mmap, fsync); a histogram of I/O access sizes; the I/O pattern (total vs. sequential vs. consecutive); and a table of the most common access sizes, led by 65536-byte accesses.]
Questions