abela amos, torsten oefler capability models for manycore...

spcl.inf.ethz.ch

@spcl_eth

SABELA RAMOS, TORSTEN HOEFLER

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

spcl.inf.ethz.ch

@spcl_eth

Microarchitectures are becoming more and more complex

… CPU

CPUL1L2

GDDR5 GDDR5

spcl.inf.ethz.ch

@spcl_eth

▪ Performance engineering: “encompasses the set of roles, skills, activities, practices, tools, and deliverables applied at every phase of the systems development life cycle which ensures that a solution will be designed, implemented, and operationally supported to meet the non-functional requirements for performance (such as throughput, latency, or memory usage).”

▪ Manually profile codes and tune them to the given architecture

▪ Requires highly-skilled performance engineers

▪ Need familiarity with

NUMA (topology, bandwidths etc.)

Caches (associativity, sizes etc.)

Microarchitecture (number of outstanding loads etc.)

How to optimize codes for these complex architectures?

Trust me, I’m an engineer!

spcl.inf.ethz.ch

@spcl_eth

An engineering example – Tacoma Narrows Bridge

spcl.inf.ethz.ch

@spcl_eth

Scientific Performance Engineering

1) Observe2) Model

3) Understand4) Build

spcl.inf.ethz.ch

@spcl_eth

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

IMC Tile Tile IMCTile Tile

MCDRAM MCDRAM MCDRAM MCDRAM

DDR DDR

Modeling by example: KNL Architecture (mesh)

L2Core

2 VPU2 VPU

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (memory: Flat & Cache)

EDC EDC EDC EDC

Tile Tile Tile Tile

DDR DDR

L2Core

2 VPU2 VPU

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (all to all mode)

EDC EDC EDC EDC

Tile Tile Tile Tile

DDR DDR

L2Core

2 VPU2 VPU

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (Quadrant or Hemisphere)

EDC EDC EDC EDC

Tile Tile Tile Tile

DDR DDR

L2Core

2 VPU2 VPU

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (SNC-4 or SNC-2)

EDC EDC EDC EDC

Tile Tile Tile Tile

DDR DDR

L2Core

2 VPU2 VPU

spcl.inf.ethz.ch

@spcl_eth

EDC EDC EDC EDC

Tile Tile Tile Tile

DDR DDR

KNL Architecture

How much does this all matter?

What is the real cost of accessing cache?

What is the cost of accessing memory?

spcl.inf.ethz.ch

@spcl_eth

Step 1: Understand core-to-core transfers – MESIF cache coherence

Write back overhead

Location: only 5-15% difference

That is curious!

Contention effects?

All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: “Scientific Benchmarking of Parallel Computing Systems”, SC16

spcl.inf.ethz.ch

@spcl_eth

Step 2: Understand core-to-memory transfers – DRAM and MCDRAM

13All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: “Scientific Benchmarking of Parallel Computing Systems”, SC16

MCDRAM 20% slower!

MCDRAM 4-6x faster!

Need to read and write for full bandwidth

Cache mode >20% slower

Bandwidth suffers a bit

spcl.inf.ethz.ch

@spcl_eth

Performance engineers optimize your code!

Are you kidding me?

spcl.inf.ethz.ch

@spcl_eth

A principled approach to designing cache-to-cache broadcast algorithms

Tree cost

Tree depth

Reached threads

S. Ramos, TH: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi “, ACM HPDC’13S. Ramos, TH: “Cache line aware optimizations for ccNUMA systems (IEEE TPDS’17).

4 5 6 7

Multi-ary tree example

depth d = 2

k1 = 2

k2 = 3

Level size

spcl.inf.ethz.ch

@spcl_eth

Model-driven performance engineering for broadcast

Binary or binomial trees?

Oh! But sure I can do that.

spcl.inf.ethz.ch

@spcl_eth

Model-driven performance engineering for broadcast

Binary or binomial trees?

Oh! But sure I can do that.

13x faster, lower

variance

spcl.inf.ethz.ch

@spcl_eth

Easy to generalize to similar algorithms

Barrier (7x faster than OpenMP) Reduce (5x faster then OpenMP)

spcl.inf.ethz.ch

@spcl_eth

Sorting in complex memories

Check! Let’s do something “Big Data”!

TriadCache Mode

TriadFlat Mode

We’ll need a parallel sort on all cores, right?

spcl.inf.ethz.ch

@spcl_eth

Memory model: Bitonic Mergesort

• Slices of 16 elements go through a bitonic network.• Communication: CPU0 accesses data from local and

remote caches.• Synchronization: CPU0 waits for CPU1.• Memory accesses: latency vs. bandwidth.

spcl.inf.ethz.ch

@spcl_eth

Modeling Bitonic Sorting memory access cost

L1 access cost

elements in L1

L2 access cost

elements in L2

spcl.inf.ethz.ch

@spcl_eth

Bitonic Sort of 1 kiB measurement

model including synchronization cost

model (just) memory costs

Synchronization outweights memory costs for small data!

So don’t parallelize too much!

spcl.inf.ethz.ch

@spcl_eth

Bitonic Sort of 4 MiB

model including synchronization cost

model (just) memory costs“parallelization

boundary”

spcl.inf.ethz.ch

@spcl_eth

Bitonic Sort of 1 GiB

synchronization negligible

Always use all cores!

spcl.inf.ethz.ch

@spcl_eth

The most surprising result last …

Hey, but which memory? DRAM

or MCDRAM?

The model (and practical measurements) indicate that it does not matter.

Thesis: the higher bandwidth of MCDRAM did not help due to the higher latency (log2 n depth).

Disclaimer: this is NOT the best sorting algorithm for Xeon Phi KNL. It is the best we found with limited effort. We suspect that

a combination of algorithms will perform best.

spcl.inf.ethz.ch

@spcl_eth

Scientific performance engineering for complex memory systems

Questions/Discussions?

abela amos, torsten oefler capability models for manycore...

Documents

drian errig & torsten hoefler networks and operating systems...

torsten hoefler automatic performance models for the...

non-blocking collectives for mpi -...

comparative firewall study - torsten hoefler, christian

hoefler text font book

mpi3 coll workgroup - eth z · 2018-03-05 · mpi3 coll...

orsten oefler progress in automatic gpu compilation and...

mpi for dummies - t. hoefler

drian errig & torsten hoefler networks and operating

amd powerpoint - white template · [10] tobias gysi,...

torsten hoefler performance reproducibility in hpc and...

torsten hoefler an overview of static & dynamic techniques...

toward performance models of mpi...

towards efficient mapreduce using mpi - eth z · pdf...

slim graph: practical lossy graph compression for...

spcl.inf.ethz.ch @spcl_eth torsten hoefler eth zürich sc’...

erc consolidator grants 2020 list of principal investigators...

torsten hoefler (most work by tobias grosser, tobias gysi,...

adrian perrig & torsten hoefler networks and ... - eth z

fast and strongly-consistent per-item resilience in …fast...