A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures Minsoo Rhu Michael Sullivan Jingwen Leng Mattan Erez The University of Texas at Austin


Page 1:

A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures

Minsoo Rhu Michael Sullivan Jingwen Leng Mattan Erez

The University of Texas at Austin

Page 2:

Introduction – (1)

•  “General purpose computing” with GPUs –  Enabled by non-graphics APIs (e.g., CUDA, OpenCL)

•  ‘GP’GPU == massively multithreaded throughput processor

•  Excellent for executing regular programs –  Simple control flow –  Regular memory access patterns (with high locality)

• e.g., matrix-multiplication

2

Page 3:

Introduction – (2)

•  Importance of executing irregular apps is increasing –  Still embarrassingly parallel … but with complex control flow and irregular memory access behavior

• Molecular dynamics • Computer vision • Analytics • And much more …

•  GPU’s massive multi-threading + irregular memory –  Small per-thread cache space available –  Efficient caching becomes extremely challenging

•  Low hit rates •  Low block reuse •  Significant waste in off-chip bandwidth utilization

3

Page 4:

Problem / Our solution

•  Problem : Waste in off-chip bandwidth utilization

•  Solution : Locality-aware memory hierarchy (LAMAR)

4

(a) Off-chip byte-traffic per instruction (left axis) and its value normalized to the baseline memory system (right axis).

Lower is better

BASE: baseline memory system. LAMAR: proposed solution.

[Bar chart: off-chip byte-traffic per instruction (left axis) and its value normalized to BASE (right axis) for BASE vs. LAMAR on IIX, SSSP, BFS1, SP, SSC, BFS2, MUM, NW, PVC, and WP.]

Page 5:

Background

5

Page 6:

Graphics Processing Units (GPUs)

•  General-purpose many-core accelerators –  Use tens to hundreds of simple in-order shader cores

•  Scalar frontend (fetch & decode) + parallel backend –  Amortizes the cost of frontend and control

6

Page 7:

CUDA exposes a hierarchy of data-parallel threads

•  SPMD model: single kernel executed by all scalar threads •  Kernel / Thread-block / Warp / Thread

–  Multiple warps compose a thread-block

–  Multiple threads (32) compose a warp

7

A warp is scheduled as a batch of scalar threads
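A minimal CUDA sketch of this hierarchy, for illustration only (the kernel name, block size, and launch configuration below are hypothetical, not taken from the slides):

    __global__ void scale(float *data, float alpha, int n) {
        // Each scalar thread handles one element; threadIdx/blockIdx give its
        // position within the thread-block and the grid, respectively.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            data[tid] *= alpha;
    }

    // Launch a grid of thread-blocks; each 256-thread block consists of
    // 8 warps of 32 threads, and each warp is scheduled as one batch of
    // scalar threads.
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);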

Page 8:

SIMT: Single-Instruction Multiple-Thread

•  Programmer writes code to be executed by scalar threads on vector SIMD hardware.

–  HW/SW supports conditional branches • Each thread can follow its own control flow

–  Fine-grained scatter-gather LD/STs • Each thread can LD/ST from arbitrary memory address

8

Page 9:

Memory divergence (a.k.a. address divergence)

•  Well-structured memory accesses are coalesced into a single memory transaction
   –  e.g., all the threads in a warp access the same cache block
   –  Minimizes cache port contention, saving port bandwidth

•  A single warp (32 threads) can generate up to 32 memory transactions
   –  Memory divergence
   –  e.g., each thread within a warp accesses a distinct cache block

9

(a) Regular memory accesses

(b) Irregular memory accesses
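A hedged CUDA sketch of the two cases above; kernel and variable names are illustrative only:

    __global__ void regular(float *out, const float *in) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // (a) Threads of a warp read 32 consecutive 4B words, which fall into
        //     a single 128B cache block and coalesce into one transaction.
        out[tid] = in[tid];
    }

    __global__ void irregular(float *out, const float *in, const int *idx) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // (b) Each thread gathers from an arbitrary index, so one warp may
        //     touch up to 32 distinct cache blocks (up to 32 transactions).
        out[tid] = in[idx[tid]];
    }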

Page 10:

The problem:

GPU caching inefficiencies

10

Page 11:

GPUs leverage massive multithreading

•  It is what makes them throughput processors! –  The core of their latency tolerance

• Thread-level parallelism >> Instruction-level parallelism

•  Massive multithreading + irregular memory accesses
   –  Bursts of concurrent cache accesses within a given time frame
   –  Orders of magnitude higher cache contention than CMP counterparts
      • Efficient caching becomes extremely challenging!

11

Page 12:

On-chip cache hierarchy of GPUs

•  Per-thread cache capacity of modern CPUs/GPUs

•  NVIDIA GPU cores issue LD/STs at [32B – 128B] granularity
   –  Each scalar thread can request [1B – 16B] of data*
   –  The HW address coalescer generates one or more [32B – 128B]* memory transactions per warp, depending on:
      • Control divergence (subset of active threads within a warp)*
      • Memory divergence

12

   Intel Core i7-4960X:    32KB L1,  2 threads/core  → 16KB/thread
   IBM Power7:             32KB L1,  4 threads/core  →  8KB/thread
   Oracle UltraSPARC T3:    8KB L1,  8 threads/core  →  1KB/thread
   NVIDIA Kepler GK110:    48KB L1, 2K threads/core  → 24B/thread

* 1B (char) – 16B (vector of floats)
* NVIDIA, "CUDA C Programming Guide", 2013
* Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", MICRO-2007

Page 13:

Baseline GPU memory hierarchy (Coarse-grained [CG] fetch-only)

•  128B L1/L2 cache blocks –  Cache misses trigger data requests at cache-block granularity (CG fetches)

•  Width of each memory channel: 64b (two 32b GDDR5 chips) –  Fermi (GF110) / Kepler (GK110) / Southern Islands

•  64B (64b x 8-bursts) minimum access granularity

13

(a)  Baseline cache and memory system (V: valid)

Page 14:

Why is GPU caching so challenging?

•  Small per-thread cache capacity –  24B per thread (worst case)

•  Larger cache block size –  128B per cache block

•  Caching efficiency is significantly compromised! –  High miss rate –  Low block reuse –  Sub-optimal utilization of off-chip bandwidth

14

Page 15:

Cache block reuse (temporal)

•  Number of repeated accesses to L1/L2 cache blocks after fill (before eviction)

15

(a)   L1 cache

(b) L2 cache

[Stacked bar charts: fraction of cache blocks by reuse count (0, 1, 2–4, 5–9, 10–19, 20+) for the L1 and L2 caches.]

Page 16:

Cache block reuse (spatial)

•  Number of 32B chunks (or sectors) of a 128B cache block actually referenced in L1/L2

16

[Stacked bar charts: fraction of 128B cache blocks by number of 32B sectors referenced (1–4).]

(a)   L1 cache

(b) L2 cache

Page 17:

Our solution:

Locality-aware memory hierarchy

17

Page 18:

LAMAR: Locality-aware memory hierarchy

18

GDU: Granularity Decision Unit

Determines whether to fetch a missed block at CG or FG granularity

Static GDU: always fetch CG (or always FG)
Dynamic GDU: choose at runtime

•  Provide FG data-fetching capability to save BW
   –  Motivation: massive multithreading limits cache block lifetime, so the likelihood of block reuse is very low in GPUs
      • Effectiveness of prefetching (e.g., filling all 128B) decreases
•  Sectored caches* + sub-ranked* memory system

* Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache", IBM Systems Journal, 1968
* Zheng et al., "Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency", MICRO-2008

Page 19:

LAMAR memory systems (Fine-grained [FG]-enabled)

•  Sectored cache: amortizes tag overhead –  32B sectors, 4 sectors per cache block (hence a single tag)

•  Each channel divided into two sub-ranks (retrieve data from 1 chip) –  32B (32b x 8-bursts) minimum access granularity –  Allows finer control of data fetches from DRAM (64B vs 32B)

19

(a) Memory hierarchy with a sectored cache and a sub-ranked memory system (128B cache block, V: valid)
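A rough C++-style sketch of the sectored-cache bookkeeping this implies; the struct and field names are made up for illustration, not taken from the paper:

    #include <cstdint>

    struct SectoredBlock {
        uint64_t tag;        // one tag shared by the whole 128B block
        uint8_t  valid;      // one valid bit per 32B sector (bits 0..3)
        uint8_t  data[128];  // payload, filled sector by sector

        bool sectorValid(int s) const { return (valid >> s) & 1; }

        // FG fill: bring in only the requested 32B sector, served by a
        // single 32b x 8-burst access to one sub-rank.
        void fillFG(int s) { valid |= uint8_t(1u << s); }

        // CG fill: fetch all four sectors (the whole 128B block), as in
        // the baseline CG-only memory system.
        void fillCG()      { valid = 0xF; }
    };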

Page 20:

Predicting Access Granularity

20

Page 21:

Bi-modal granularity predictor (BGP)

21

•  Predictor provides bi-modal prediction –  Coarse granularity (CG) –  Fine granularity (FG)

•  Fetch the full block of data (CG) or just enough to service the data requested by the GPU core (FG)
   –  Reduces the number of RD/WR commands sent to DRAM
   –  Reduces byte-traffic sent to off-chip memory

Page 22:

Light-weight BGP microarchitecture

22

•  Bloom-filter-based design with dual bit-arrays
   –  Insertions into the two arrays are temporally delayed/overlapped
      • The array with more insertions is used for TESTing membership
      • The older bit-array is CLEARed at regular intervals (every N insertions)
      • When the old bit-array is cleared, the other one is used for TESTing
   –  Balances size, false-positive rate, and history depth
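A small C++ sketch of the dual-bitarray scheme described above; the array size, hash functions, and clearing interval N are placeholder values, not the paper's configuration:

    #include <bitset>
    #include <cstdint>

    struct DualBloomFilter {
        static const int BITS = 1024;   // bits per array (illustrative)
        static const int N    = 512;    // CLEAR interval, in insertions
        std::bitset<BITS> arr[2];
        int active  = 0;                // array holding the longer history (TEST)
        int inserts = 0;                // insertions since the last CLEAR

        int h1(uint64_t a) const { return int((a * 0x9E3779B97F4A7C15ULL) % BITS); }
        int h2(uint64_t a) const { return int((a ^ (a >> 17)) % BITS); }

        void insert(uint64_t blockAddr) {
            for (int i = 0; i < 2; ++i) {   // overlapped insertion into both arrays
                arr[i][h1(blockAddr)] = true;
                arr[i][h2(blockAddr)] = true;
            }
            if (++inserts == N) {           // every N insertions...
                arr[active].reset();        // ...CLEAR the older bit-array
                active ^= 1;                // the other array now serves TESTs
                inserts = 0;
            }
        }

        bool test(uint64_t blockAddr) const {
            return arr[active][h1(blockAddr)] && arr[active][h2(blockAddr)];
        }
    };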

Page 23:

Granularity prediction algorithm

•  Predictor has a default prediction (like agree predictor*) –  Initialized as CG (or FG) at kernel initialization

•  Each cache miss queries the Bloom filter
   –  Intuition: check whether the missing block's optimal fetch granularity agrees* with the default prediction

•  What/when are elements inserted into the Bloom filter?
   –  What: the evicted block's address (+ spatial reuse information)
   –  When: the evicted block's reuse information disagrees* with the default prediction

23

* Sprangle et al., “The agree predictor: A mechanism for reducing negative branch history interference”, ISCA-1997
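A hedged sketch of how such a predictor might tie together, built on the DualBloomFilter sketch from the previous slide; the names and the sector-count threshold are illustrative assumptions, not the paper's exact mechanism:

    enum Granularity { CG, FG };

    struct GranularityPredictor {
        Granularity     defaultPred = CG;   // default prediction, set per kernel
        DualBloomFilter filter;             // records "disagreeing" block addresses

        // On a cache miss: blocks found in the filter disagreed with the
        // default in the past, so fetch them with the opposite granularity.
        Granularity predict(uint64_t blockAddr) const {
            bool disagrees = filter.test(blockAddr);
            return disagrees ? (defaultPred == CG ? FG : CG) : defaultPred;
        }

        // On eviction: insert the evicted block's address only when its
        // observed spatial reuse disagrees with the default (e.g., default CG,
        // but so few sectors were referenced that FG would have sufficed).
        void onEviction(uint64_t blockAddr, int sectorsReferenced) {
            bool preferFG  = (sectorsReferenced <= 1);   // illustrative threshold
            bool disagrees = (defaultPred == CG) ? preferFG : !preferFG;
            if (disagrees)
                filter.insert(blockAddr);
        }
    };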

Page 24:

Key experiments and analysis

24

Page 25:

Simulation environment

•  GPGPU-Sim
   –  Processor architecture
      • Similar to GTX 480
   –  Memory hierarchy
      • Sectored cache
      • Sub-ranked memory system (using DrSim*)
      • Memory bandwidth: 179.2 GB/s overall (through 8 channels)

•  Applications
   –  CUDA SDK, Rodinia, LonestarGPU, MapReduce
   –  Characterization
      • FG-leaning (avg. number of sectors referenced ≤ 2.0)
      • CG-leaning (avg. number of sectors referenced > 2.0)

25

* http://lph.ece.utexas.edu/public/Main/DrSim

Page 26:

Off-chip byte traffic (normalized to # of instructions)

26

[Bar chart: off-chip traffic per instruction (left axis) and its value normalized to CG (right axis) for CG, FG, and BGP on IIX, SSSP, BFS1, SP, SSC, BFS2, MUM, NW, PVC, and WP.]

(a) FG-leaning applications

[Bar chart: off-chip traffic per instruction (left axis) and its value normalized to CG (right axis) for CG, FG, and BGP on MST, RAY, SCLST, BACKP, NN, SRAD, LAVAMD, SPROD, MCARLO, and FWT.]

(b) CG-leaning applications

Page 27:

Performance improvements

27

Higher is better

(a) FG-leaning applications

(b) CG-leaning applications

[Bar charts: speedup of CG, FG, and BGP (including HarMean) for the FG-leaning applications (IIX, SSSP, BFS1, SP, SSC, BFS2, MUM, NW, PVC, WP) and the CG-leaning applications (MST, RAY, SCLST, BACKP, NN, SRAD, LAVAMD, SPROD, MCARLO, FWT).]

Page 28:

DRAM power consumption

28

(a) FG-leaning applications

(b) CG-leaning applications

[Bar charts: DRAM power (Watts), broken down into Background, ACT/PRE, READ, WRITE, and REFRESH, with normalized Perf/Watt overlaid, for CG, FG, and BGP on the FG-leaning and CG-leaning applications.]

Page 29:

LAMAR conclusions

•  The memory hierarchy of GPUs necessitates fine-grained access granularity
   –  Inherent data locality is rarely captured in GPUs
   –  Minimizing useless overfetching (within a cache block) improves BW utilization
      • Reduces DRAM power consumption
      • Improves performance
      • Enhances Perf/Watt

29