
Quantifying NUMA and Contention Effects in Multi-GPU Systems

Kyle Spafford

Jeremy S. Meredith

Jeffrey S. Vetter

http://ft.ornl.gov

Early Work

[Figures: GPU acceleration of the S3D and DCA++ applications]

“An experimental high performance computing system of innovative design.”

“Outside the mainstream of what is routinely available from computer vendors.”

- National Science Foundation, Track 2D Call, Fall 2008

Keeneland ID @ GT/ORNL

Inside a Node

[Photo callouts:]
• 4 hot-plug SFF (2.5") HDDs
• 2 non-hot-plug SFF (2.5") HDDs
• 1 GPU module in the rear, lower 1U
• 2 GPU modules in the upper 1U
• Dual 1GbE
• Dedicated management iLO3 LAN & 2 USB ports
• VGA
• UID LED & button
• Health LED
• Serial (RJ45)
• Power button
• QSFP (QDR IB)

Node Block Diagram

[Block diagram: two CPUs, each with its own DDR3 memory, linked by QPI; two I/O hubs; three GPUs (6 GB each), each on a dedicated PCIe x16 link to an I/O hub; integrated QDR InfiniBand attached to one I/O hub]
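Before pinning anything, it helps to see how the OS numbers the sockets, memory nodes, and PCIe devices on a node like this. A minimal sketch of how one might inspect the layout from a shell; hwloc's lstopo gives a similar picture graphically, and the nvidia-smi topo subcommand exists only on newer drivers (the installed tools are assumptions, not something shown in the slides):

# List NUMA nodes, their CPUs, and their memory sizes
numactl --hardware

# Find the GPUs and the InfiniBand HCA on the PCIe buses
lspci | grep -i -e nvidia -e infiniband

# On newer NVIDIA drivers, print the GPU/CPU affinity matrix directly
nvidia-smi topo -m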

Why a dual I/O hub?

[Diagrams: in a Tesla 1U configuration, GPU #0 and GPU #1 sit behind a PCIe switch and share a single 8.0 GB/s link into one IOH, creating a bottleneck. In the dual I/O hub design, CPU #0 and CPU #1 each reach an IOH over 12.8 GB/s QPI links, and GPU #0, GPU #1, and GPU #2 each get a dedicated 8.0 GB/s PCIe link.]

Introduction of NUMA

[Diagram: a host-GPU transfer can take a short path (CPU #0 through its local IOH to the GPU: one 12.8 GB/s QPI hop plus the 8.0 GB/s PCIe link) or a long path (CPU #0 through the remote IOH, crossing an additional QPI link).]

Bandwidth Penalty

[Charts: bandwidth of CPU #0 host-to-device and device-to-host copies for each GPU; the device-to-host chart is annotated "~2 GB/s".]
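One way to reproduce this penalty is to pin a bandwidth benchmark to each socket in turn while copying to the same GPU. A sketch, assuming the bandwidthTest sample from the CUDA SDK is built in the current directory and that GPU 0 is local to socket 0 (check the topology first; the binary name and flags can differ by CUDA version):

# Short path: run from the socket whose IOH hosts GPU 0
numactl --cpunodebind=0 --membind=0 ./bandwidthTest --device=0 --memory=pinned

# Long path: same GPU, but run from the far socket (extra QPI hop)
numactl --cpunodebind=1 --membind=1 ./bandwidthTest --device=0 --memory=pinned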

Other Benchmark Results

• MPI Latency
  – 26% penalty for large messages, 12% small messages
• SHOC Benchmarks
  – Mismap penalty shown below
  – Gives this effect context

[Chart: mismap penalty by SHOC benchmark]
SGEMM 3%, DGEMM 4%, MD_DP 7%, MD 9%, Sort 9%, FFT 12%, FFT_DP 12%, Reduction 12%, Scan 31%, Stencil 36%
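The mismap penalty for any one kernel can be reproduced by timing it under both bindings. A sketch, assuming a SHOC CUDA build with a Stencil2D binary in the current directory and GPU 0 local to socket 0 (both assumptions, not from the slides):

# Run the same kernel on GPU 0 from the near socket, then the far socket
for node in 0 1
do
    echo "binding to NUMA node $node"
    CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=$node --membind=$node ./Stencil2D
done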

Given a Multi-GPU app, how should processes be pinned?

[Diagram: three MPI processes, ranks 0, 1, and 2]

Maximize GPU Bandwidth

[Diagram: ranks 0, 1, and 2 placed on CPU #0 and CPU #1 so that each rank sits on the socket whose IOH hosts its GPU]

Maximize MPI Bandwidth

[Diagram: ranks 0, 1, and 2 placed so that they sit near the IOH that hosts the InfiniBand adapter]

Pretty easy, right?


Pinning with numactl

numactl --cpunodebind=0 --membind=0 ./program

0-1-1 Pinning with numactl

if [[ "$OMPI_COMM_WORLD_LOCAL_RANK" == "2" ]]; then
    # rank 2 -> socket 1
    numactl --cpunodebind=1 --membind=1 ./prog
elif [[ "$OMPI_COMM_WORLD_LOCAL_RANK" == "1" ]]; then
    # rank 1 -> socket 1
    numactl --cpunodebind=1 --membind=1 ./prog
else
    # rank 0 -> socket 0
    numactl --cpunodebind=0 --membind=0 ./prog
fi
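A sketch of how this wrapper might be launched (the script name pin.sh is an assumption): make the branching above an executable script that ends by running the real binary, then start it under the MPI launcher so that Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each copy:

# Three ranks on one node; each copy of pin.sh chooses its own binding
mpirun -np 3 ./pin.sh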


HPL Scaling

• Sustained MPI and GPU ops

• Uses other CPU cores via Intel MKL

What Happened with 0-1-1?

[Diagrams: MPI tasks 0, 1, and 2 placed on CPU #0 and CPU #1, each spawning MKL threads. Threads inherit pinning! With rank 0 bound to socket 0 and ranks 1 and 2 bound to socket 1, the result is two idle cores and one oversubscribed socket!]
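One possible fix, sketched here rather than taken from the slides: bind each rank to an explicit core list with --physcpubind and size the MKL thread pool to match, so every core is used and no socket is oversubscribed. The core numbering (cores 0-5 on socket 0, 6-11 on socket 1 of a 12-core node) and the 4-threads-per-rank split are assumptions:

if [[ "$OMPI_COMM_WORLD_LOCAL_RANK" == "0" ]]; then
    # rank 0: four cores on socket 0
    MKL_NUM_THREADS=4 numactl --physcpubind=0-3 --membind=0 ./prog
elif [[ "$OMPI_COMM_WORLD_LOCAL_RANK" == "1" ]]; then
    # rank 1: the remaining two cores on socket 0 plus two on socket 1
    MKL_NUM_THREADS=4 numactl --physcpubind=4-7 --membind=1 ./prog
else
    # rank 2: four cores on socket 1
    MKL_NUM_THREADS=4 numactl --physcpubind=8-11 --membind=1 ./prog
fi

Rank 1 deliberately straddles the sockets here; whether that beats leaving two cores idle is exactly the kind of trade-off the takeaways call tedious.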


NUMA Impact on Apps

Well…

[Chart: application run time]

Can we improve utilization by sharing a Fermi among multiple tasks?
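The sharing experiment can be set up from the launcher without modifying the application: point every local rank at the same device with CUDA_VISIBLE_DEVICES and sweep the rank count. A sketch using a hypothetical wrapper script share.sh, and assuming the GPU's compute mode allows multiple host processes:

# share.sh: every rank sees only GPU 0, so all ranks contend for it
export CUDA_VISIBLE_DEVICES=0
exec ./prog

Launching it as, e.g., mpirun -np 2 ./share.sh puts two ranks on the same Fermi; sweeping -np from 1 to 4 over such a wrapper would produce data like the chart that follows.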

Bandwidth of Most Bottlenecked Task

[Chart: PCIe bandwidth (GB/s) of the most bottlenecked task vs. number of tasks sharing a GPU]

Tasks sharing a GPU:   1      2      3      4
min (GB/s):            5.13   2.57   1.71   0.07
mean (GB/s):           5.43   2.72   1.81   0.96

Is the second IO hub worth it?

• Aggregate bandwidth to GPUs is 16.9 GB/s
• What about real app behavior?
  – Scenario A: “HPL” -- 1 MPI & 1 GPU task per GPU
  – Scenario B: Scenario A plus 1 MPI task on each remaining core


Contention Penalty

Puzzler – Pinning Redux

Do ranks 1 and 2 always have a long path?

[Diagrams: node block diagram highlighting the paths from CPU #0 and CPU #1 across the two IOHs to GPU #1 and to the InfiniBand adapter]


Split MPI and GPU – MPI Latency


Split MPI and GPU – PCIe bandwidth

Takeaways

• Dual IO hubs deliver
  – But add complexity
• Ignoring the complexity will sink some apps
  – Wrong pinning sunk HPL
  – Bandwidth-bound kernels & “function offload” apps are most at risk
• Threads and libnuma can help
  – But can be tedious to use

[email protected]
http://kylespafford.com/