
Page 1: Supporting x86-64 Address Translation for 100s of GPU Lanes


Supporting x86-64 Address Translation for 100s of GPU Lanes

Jason Power, Mark D. Hill, David A. Wood
Based on the HPCA 2014 paper

2/19/2014

Page 2: Supporting x86-64 Address Translation for 100s of GPU Lanes

Summary

• CPUs & GPUs: physically integrated, logically separate
• Near future: cache coherence, shared virtual address space
• Proof-of-concept GPU MMU design
  – Per-CU TLBs, highly-threaded PTW, page walk cache
  – Full x86-64 support
  – Modest performance decrease (2% vs. ideal MMU)
• Design alternatives considered but not chosen

Page 3: Supporting x86-64 Address Translation for 100s of GPU Lanes

Motivation

• Closer physical integration of GPGPUs
• Programming model still decoupled
  – Separate address spaces
  – Unified virtual address (NVIDIA) simplifies code
  – Want: shared virtual address space (HSA hUMA)

Page 4: Supporting x86-64 Address Translation for 100s of GPU Lanes

[Figure: CPU address space containing pointer-based data structures (tree index, hash-table index)]

Page 5: Supporting x86-64 Address Translation for 100s of GPU Lanes

Separate Address Space

[Figure: Separate CPU and GPU address spaces; data must be copied and pointers transformed to new addresses]

Page 6: Supporting x86-64 Address Translation for 100s of GPU Lanes

Unified Virtual Addressing

[Figure: CPU and GPU address spaces with a 1-to-1 address mapping]

Page 7: Supporting x86-64 Address Translation for 100s of GPU Lanes

Unified Virtual Addressing

void main() {
    int *h_in, *h_out;
    h_in = cudaHostMalloc(sizeof(int)*1024);   // allocate input array on host
    h_in = ...                                 // initialize host array
    h_out = cudaHostMalloc(sizeof(int)*1024);  // allocate output array on host

    Kernel<<<1,1024>>>(h_in, h_out);

    ... h_out                                  // continue host computation with result

    cudaHostFree(h_in);                        // free memory on host
    cudaHostFree(h_out);
}

Page 8: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared Virtual Address Space

[Figure: CPU and GPU share one virtual address space; all addresses map 1-to-1]

Page 9: Supporting x86-64 Address Translation for 100s of GPU Lanes


Shared Virtual Address Space

• No caveats

• Programs “just work”

• Same as multicore model


Page 10: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared Virtual Address Space

• Simplifies code
• Enables rich pointer-based data structures (see the sketch below)
  – Trees, linked lists, etc.
• Enables composability

Need: an MMU (memory management unit) for the GPU
• Low overhead
• Support for CPU page tables (x86-64)
• 4 KB pages
• Page faults, TLB flushes, TLB shootdowns, etc.
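A minimal sketch (not from the slides) of what this buys the programmer, assuming a CUDA-style kernel launch on a system with full CPU-GPU shared virtual memory: a linked list built with ordinary host allocation is traversed directly by the GPU, with no copies and no pointer transformation. Node, sumList, and the host pointers are hypothetical names; on hardware without shared virtual memory this would not run as written.

    // Hypothetical illustration: a CPU-built linked list traversed directly by a GPU
    // kernel under a shared virtual address space (no cudaMemcpy, no pointer rewriting).
    #include <cstdio>

    struct Node {
        int   value;
        Node *next;                        // ordinary CPU pointer, valid on the GPU under shared virtual memory
    };

    __global__ void sumList(Node *head, int *result) {
        int sum = 0;
        for (Node *n = head; n != nullptr; n = n->next)
            sum += n->value;               // dereferences CPU-created pointers directly
        *result = sum;
    }

    int main() {
        Node *head = nullptr;
        for (int i = 0; i < 16; ++i)
            head = new Node{i, head};      // plain host allocation, no special API

        int result = 0;
        sumList<<<1, 1>>>(head, &result);  // pass host pointers as-is (assumes shared virtual memory)
        cudaDeviceSynchronize();
        printf("sum = %d\n", result);
        return 0;
    }

Contrast with the UVA example on Page 7: there the arrays had to come from cudaHostMalloc, whereas here any host data structure is directly visible to the kernel.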

Page 11: Supporting x86-64 Address Translation for 100s of GPU Lanes

Outline

• Motivation
• Data-driven GPU MMU design
  – D1: Post-coalescer MMU
  – D2: + Highly-threaded page table walker
  – D3: + Shared page walk cache
• Alternative designs
• Conclusions

Page 12: Supporting x86-64 Address Translation for 100s of GPU Lanes

System Overview

[Figure: Integrated CPU-GPU system. The CPU core has L1 and L2 caches; the GPU has multiple compute units sharing a GPU L2 cache; both share DRAM]

Page 13: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU Overview

[Figure: GPU built from compute units sharing an L2 cache. Each compute unit (32 lanes) contains instruction fetch/decode, a register file, the lanes, a coalescer, a scratchpad memory, and an L1 cache]

Page 14: Supporting x86-64 Address Translation for 100s of GPU Lanes

Methodology

• Target system
  – Heterogeneous CPU-GPU system (16 CUs, 32 lanes each)
  – Linux OS 2.6.22.9
• Simulation
  – gem5-gpu, full-system mode (gem5-gpu.cs.wisc.edu)
• Ideal MMU (comparison baseline)
  – Infinite caches
  – Minimal latency

Page 15: Supporting x86-64 Address Translation for 100s of GPU Lanes

Workloads

• Rodinia benchmarks
  – backprop
  – bfs: breadth-first search
  – gaussian
  – hotspot
  – lud: LU decomposition
  – nn: nearest neighbor
  – nw: Needleman-Wunsch
  – pathfinder
  – srad: anisotropic diffusion
• Database sort
  – 10-byte keys, 90-byte payload

Page 16: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 0

[Figure: Design 0 places a TLB at every lane, before the coalescer: per-lane TLBs feed each CU's coalescer, L1 cache, and the shared GPU L2 cache]

Page 17: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 1

Per-CU MMUs: Reducing the translation request rate

Page 18: Supporting x86-64 Address Translation for 100s of GPU Lanes

[Figure: Memory references issued by the individual lanes of a compute unit]

Page 19: Supporting x86-64 Address Translation for 100s of GPU Lanes

[Figure: The scratchpad memory filters lane memory references to 0.45x of the baseline (1x) rate]

Page 20: Supporting x86-64 Address Translation for 100s of GPU Lanes

[Figure: Adding the coalescer filters references further, to 0.06x of the baseline rate]

Page 21: Supporting x86-64 Address Translation for 100s of GPU Lanes

Reducing translation request rate

• Shared (scratchpad) memory and the coalescer effectively filter global memory accesses (see the coalescing sketch below)
• Average of 39 TLB accesses per 1000 cycles for a 32-lane CU

[Figure: Breakdown of memory operations]
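As a back-of-the-envelope illustration (not the gem5-gpu coalescer itself), the host-side sketch below counts how many unique 128-byte requests and unique 4 KB page translations one warp's 32 unit-stride lane accesses require after coalescing; the base address and stride are arbitrary assumptions.

    // Illustration only: count unique 128-byte requests and unique 4 KB pages touched
    // by one warp's 32 lane accesses, i.e. the traffic seen after the coalescer.
    #include <cstdint>
    #include <cstdio>
    #include <set>

    int main() {
        const int warpSize = 32;
        std::set<uint64_t> lineBlocks;   // unique 128-byte blocks (coalesced memory requests)
        std::set<uint64_t> pages;        // unique 4 KB pages (post-coalescer TLB lookups)

        // Unit-stride pattern: lane i reads a 4-byte element at base + 4*i.
        uint64_t base = 0x7f0000001000ULL;
        for (int lane = 0; lane < warpSize; ++lane) {
            uint64_t addr = base + 4ULL * lane;
            lineBlocks.insert(addr >> 7);    // 128-byte granularity
            pages.insert(addr >> 12);        // 4 KB page granularity
        }

        printf("%d lane accesses -> %zu coalesced requests -> %zu translations\n",
               warpSize, lineBlocks.size(), pages.size());
        // Prints: 32 lane accesses -> 1 coalesced requests -> 1 translations
        return 0;
    }

Scattered or strided patterns coalesce less well, which is consistent with the measured average of 39 TLB accesses per 1000 cycles rather than a number near zero.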

Page 22: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 1

[Figure: Design 1 adds a post-coalescer TLB per CU (between the coalescer and the L1 cache) and a shared page walk unit with a page fault register, backed by the GPU L2 cache]
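For concreteness, a software sketch of a per-CU post-coalescer TLB lookup follows. The slides only specify entry counts (128 here), so the 4-way set-associative organization and the replacement choice are assumptions, not the paper's design; misses go to the shared page walk unit.

    // Sketch of a per-CU post-coalescer TLB (organization assumed, not taken from the paper).
    #include <array>
    #include <cstdint>

    struct TlbEntry {
        bool     valid = false;
        uint64_t vpn   = 0;   // virtual page number (vaddr >> 12 for 4 KB pages)
        uint64_t pfn   = 0;   // physical frame number
    };

    class PerCuTlb {
        static constexpr int kEntries = 128;
        static constexpr int kWays    = 4;                // assumed associativity
        static constexpr int kSets    = kEntries / kWays;
        std::array<std::array<TlbEntry, kWays>, kSets> sets{};

    public:
        // Returns true and fills *paddr on a hit; a miss is forwarded to the shared page walk unit.
        bool lookup(uint64_t vaddr, uint64_t *paddr) {
            const uint64_t vpn = vaddr >> 12;
            auto &set = sets[vpn % kSets];
            for (auto &e : set) {
                if (e.valid && e.vpn == vpn) {
                    *paddr = (e.pfn << 12) | (vaddr & 0xfff);
                    return true;
                }
            }
            return false;                                 // TLB miss
        }

        // Fill after a completed page walk (simplistic victim selection for the sketch).
        void fill(uint64_t vaddr, uint64_t paddr) {
            const uint64_t vpn = vaddr >> 12;
            TlbEntry &victim = sets[vpn % kSets][vpn % kWays];
            victim.valid = true;
            victim.vpn   = vpn;
            victim.pfn   = paddr >> 12;
        }
    };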

Page 23: Supporting x86-64 Address Translation for 100s of GPU Lanes

Performance

Page 24: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 2

Highly-threaded page table walker: Increasing TLB miss bandwidth

Page 25: Supporting x86-64 Address Translation for 100s of GPU Lanes

Outstanding TLB Misses

[Figure: The lanes behind one coalescer and scratchpad can generate multiple concurrent TLB misses]

Page 26: Supporting x86-64 Address Translation for 100s of GPU Lanes

Multiple Outstanding Page Walks

• Many workloads are bursty
• With a blocking page walker, miss latency skyrockets due to queuing delays

[Figure: Concurrent page walks per CU when non-zero (log scale)]

Page 27: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 2

[Figure: Design 2 adds a highly-threaded page table walker and page walk buffers to the shared page walk unit, alongside the page fault register]

Page 28: Supporting x86-64 Address Translation for 100s of GPU Lanes

Highly-threaded PTW

[Figure: The highly-threaded page table walker: one page walk state machine shared by 32 page walk buffers, each holding an outstanding address and its walk state; requests arrive from the per-CU TLBs and walk reads are issued to memory]
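A minimal software sketch of the idea (not the hardware or the gem5-gpu model): every outstanding TLB miss gets its own page walk buffer that steps independently through the four x86-64 levels, so walks overlap instead of serializing behind one blocking walker. The readPTE/fillTLB hooks, the per-step advancement, and the omission of large pages, permission checks, and faults are simplifications.

    // Conceptual sketch of a highly-threaded page table walker for 4 KB x86-64 pages.
    #include <array>
    #include <cstdint>

    constexpr int kLevels      = 4;    // x86-64 radix-tree depth (PML4 -> PDP -> PD -> PT)
    constexpr int kWalkBuffers = 32;   // outstanding walks supported (32-way, as in the summary table)

    // Stub hooks for the sketch: in hardware these are memory reads issued by the walker
    // and fills into the requesting CU's TLB.
    static uint64_t readPTE(uint64_t /*paddr*/) { return 0; }
    static void     fillTLB(uint64_t /*vaddr*/, uint64_t /*paddr*/) {}

    struct WalkBuffer {
        bool     busy  = false;
        uint64_t vaddr = 0;   // virtual address being translated
        uint64_t table = 0;   // physical base of the current page-table level
        int      level = 0;   // 0 = PML4 ... 3 = PT
    };

    struct ThreadedWalker {
        std::array<WalkBuffer, kWalkBuffers> bufs{};
        uint64_t cr3 = 0;     // page-table root, shared with the owning CPU process

        // Accept a new TLB miss if a buffer is free; otherwise the miss waits.
        bool startWalk(uint64_t vaddr) {
            for (auto &b : bufs) {
                if (b.busy) continue;
                b.busy = true; b.vaddr = vaddr; b.table = cr3; b.level = 0;
                return true;
            }
            return false;
        }

        // One step: every busy buffer advances one page-table level, independently.
        void step() {
            for (auto &b : bufs) {
                if (!b.busy) continue;
                int      shift = 39 - 9 * b.level;             // 39, 30, 21, 12
                uint64_t idx   = (b.vaddr >> shift) & 0x1ff;   // 9-bit index per level
                uint64_t pte   = readPTE(b.table + idx * 8);   // read one 8-byte entry
                if (b.level == kLevels - 1) {                  // leaf: 4 KB frame reached
                    fillTLB(b.vaddr, (pte & ~0xfffULL) | (b.vaddr & 0xfff));
                    b.busy = false;
                } else {
                    b.table = pte & ~0xfffULL;                 // descend to the next level
                    b.level++;
                }
            }
        }
    };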

Page 29: Supporting x86-64 Address Translation for 100s of GPU Lanes

Performance

Page 30: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 3

Page walk cache: Overcoming the high TLB miss rate

Page 31: Supporting x86-64 Address Translation for 100s of GPU Lanes

High L1 TLB Miss Rate

• Average miss rate: 29%!

[Figure: Per-access miss rate with a 128-entry L1 TLB]

Page 32: Supporting x86-64 Address Translation for 100s of GPU Lanes

High TLB Miss Rate

• Requests from the coalescer are spread out
  – Multiple warps issuing unique streams
  – Mostly 64- and 128-byte requests
• Often streaming access patterns
  – Each TLB entry is not reused many times (a 4 KB page supplies at most 4096 / 128 = 32 coalesced requests before the stream moves on)
  – Many demand misses

Page 33: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design 3

[Figure: Design 3 adds a page walk cache to the shared page walk unit (highly-threaded page table walker, page walk buffers, page fault register), shared by all 16 CUs]

Page 34: Supporting x86-64 Address Translation for 100s of GPU Lanes

Performance

• Worst case: 12% slowdown
• Average: less than 2% slowdown (vs. the ideal MMU)

Page 35: Supporting x86-64 Address Translation for 100s of GPU Lanes

Correctness Issues

• Page faults
  – Stall the GPU as on a TLB miss
  – Raise an interrupt on the CPU core
  – Allow the OS to handle the fault in the normal way
  – One change to CPU microcode, no OS changes
• TLB flush and shootdown
  – Leverage the CPU core, again
  – Flush GPU TLBs whenever the CPU core's TLB is flushed

Page 36: Supporting x86-64 Address Translation for 100s of GPU Lanes


Page-fault Overview

1. Write faulting address to CR2

2. Interrupt CPU

3. OS handles page fault

4. iret instruction executed

5. GPU notified of completed page fault

[Figure: Page-fault handshake between the GPU MMU (CR3 copy, page-fault register) and the CPU core (CR3, CR2), through the shared page table]
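A conceptual model of the five-step handshake above, written as plain C++ rather than microcode or driver code; the structure names and the direct function call standing in for the interrupt are illustrative assumptions.

    // Conceptual sketch of the GPU-originated page-fault handshake (not actual microcode).
    #include <cstdint>

    struct PageFaultRegister {          // GPU-side "PF register" from the figure
        bool     pending       = false;
        uint64_t faultingVaddr = 0;
    };

    struct CpuCore {
        uint64_t cr2 = 0;               // x86-64: faulting address read by the OS handler

        // Hypothetical microcode hook: take the GPU's faulting address, run the normal
        // OS page-fault handler, and signal completion when iret executes.
        void serviceGpuFault(PageFaultRegister &pf) {
            cr2 = pf.faultingVaddr;     // 1. write faulting address to CR2
            // 2. raise a page-fault interrupt on this core
            // 3. unmodified OS handler maps the page
            // 4. iret executes; microcode notices the fault was GPU-originated
            pf.pending = false;         // 5. notify the GPU MMU that the fault is resolved
        }
    };

    struct GpuMmu {
        PageFaultRegister pf;

        // Called when a page walk finds a not-present entry: stall as on a TLB miss and
        // ask the CPU core to run the OS handler on the GPU's behalf.
        void raisePageFault(uint64_t vaddr, CpuCore &cpu) {
            pf.pending       = true;
            pf.faultingVaddr = vaddr;
            cpu.serviceGpuFault(pf);    // in hardware this is an interrupt, not a call
            // once pf.pending clears, the stalled page walk is retried
        }
    };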

Page 37: Supporting x86-64 Address Translation for 100s of GPU Lanes

Page faults

Workload     Number of page faults   Average page fault latency
Lud          1                       1698 cycles
Pathfinder   15                      5449 cycles
Cell         31                      5228 cycles
Hotspot      256                     5277 cycles
Backprop     256                     5354 cycles

• Minor page faults are handled correctly by Linux 2.6.22.9
• ~5000 cycles (3.5 µs) per fault: too fast to warrant a GPU context switch

Page 38: Supporting x86-64 Address Translation for 100s of GPU Lanes

Summary

• D3: proof of concept
• Non-exotic design

Design      Per-CU L1 TLB entries   Highly-threaded page table walker   Page walk cache size   Shared L2 TLB entries
Ideal MMU   Infinite                Infinite                            Infinite               None
Design 0    N/A: per-lane MMUs      N/A                                 None                   None
Design 1    128                     Per-CU walkers                      None                   None
Design 2    128                     Yes (32-way)                        None                   None
Design 3    64                      Yes (32-way)                        8 KB                   None

Low overhead, fully compatible, correct execution

Page 39: Supporting x86-64 Address Translation for 100s of GPU Lanes

Outline

• Motivation
• Data-driven GPU MMU design
• Alternative designs
  – Shared L2 TLB
  – TLB prefetcher
  – Large pages
• Conclusions

Page 40: Supporting x86-64 Address Translation for 100s of GPU Lanes

Alternative Designs

• Shared L2 TLB
  – Shared between all CUs
  – Captures address-translation sharing
  – Poor performance: increased TLB-miss penalty
• Shared L2 TLB and PWC
  – No performance difference
  – Trade-offs in the design space

Page 41: Supporting x86-64 Address Translation for 100s of GPU Lanes

Alternative Design Summary

• Each design has 16 KB of extra SRAM for all 16 CUs
• A PWC is needed for some workloads
• Trade-offs in energy and area

Design            Per-CU L1 TLB entries   Page walk cache size   Shared L2 TLB entries   Performance
Design 3          64                      8 KB                   None                    0%
Shared L2         64                      None                   1024                    50%
Shared L2 & PWC   32                      8 KB                   512                     0%

Page 42: Supporting x86-64 Address Translation for 100s of GPU Lanes

GPU MMU Design Tradeoffs

• Shared L2 & PWC provides lower area, but higher energy
• Proof-of-concept Design 3 provides low area and energy

*Data produced using McPAT and CACTI 6.5

Page 43: Supporting x86-64 Address Translation for 100s of GPU Lanes

Alternative Designs

• TLB prefetching
  – A simple one-ahead prefetcher does not affect performance
• Large pages
  – Effective at reducing TLB misses
  – Requiring them places a burden on the programmer
  – Compatibility must be maintained

Page 44: Supporting x86-64 Address Translation for 100s of GPU Lanes

Related Work

• “Architectural Support for Address Translation on GPUs.” Bharath Pichai, Lisa Hsu, Abhishek Bhattacharjee (ASPLOS 2014).
  – Concurrent with our work
  – High level: modest changes for high performance
  – Pichai et al. investigate warp scheduling
  – We use a highly-threaded PTW

Page 45: Supporting x86-64 Address Translation for 100s of GPU Lanes

Conclusions

• Shared virtual memory is important
• Non-exotic MMU design
  – Post-coalescer L1 TLBs
  – Highly-threaded page table walker
  – Page walk cache
• Full compatibility with minimal overhead

Page 46: Supporting x86-64 Address Translation for 100s of GPU Lanes

Questions?

Page 47: Supporting x86-64 Address Translation for 100s of GPU Lanes

Simulation Details

Workload     Size      Workload     Size
backprop     74 MB     nn           500 KB
bfs          37 MB     nw           128 MB
gaussian     340 KB    pathfinder   38 MB
hotspot      12 MB     srad         96 MB
lud          4 MB      sort         208 MB

CPU                   1 core, 2 GHz, 64 KB L1, 2 MB L2
GPU                   16 CUs, 1.4 GHz, 32 lanes
L1 cache (per CU)     64 KB, 15 ns latency
Scratchpad memory     16 KB, 15 ns latency
GPU L2 cache          1 MB, 16-way set associative, 130 ns latency
DRAM                  2 GB, DDR3 timing, 8 channels, 667 MHz

Page 48: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared L2 TLB

[Figure: Translation sharing pattern across CUs]

• Many benchmarks share translations between CUs

Page 49: Supporting x86-64 Address Translation for 100s of GPU Lanes

Perf. of Alternative Designs

• D3 uses an AMD-like, physically-addressed page walk cache
• An ideal PWC performs slightly better (1% on average, up to 10%)

Page 50: Supporting x86-64 Address Translation for 100s of GPU Lanes

Large pages are effective

• 2 MB pages significantly reduce the miss rate
• But can they be used?
  – Slow adoption on CPUs
  – Special allocation API (see the sketch below)

[Figure: Relative miss rate with 2 MB pages]
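To illustrate the "special allocation API" burden, a minimal sketch of explicitly requesting 2 MB pages on Linux via mmap with MAP_HUGETLB (a feature of kernels newer than the 2.6.22.9 simulated here, and one that requires pre-reserved huge pages); this shows the extra work large pages push onto the programmer, and is not part of the proposed design.

    // Example of a special allocation API: backing a buffer with 2 MB huge pages on Linux.
    #include <sys/mman.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const size_t bytes = 64 * (1 << 20);   // 64 MB, a multiple of the 2 MB page size

        void *buf = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            // Fails if no huge pages are reserved: exactly the kind of corner case that
            // ordinary malloc() never exposes to the programmer.
            perror("mmap(MAP_HUGETLB)");
            return EXIT_FAILURE;
        }

        // ... use buf as the GPU-visible data structure ...

        munmap(buf, bytes);
        return EXIT_SUCCESS;
    }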

Page 51: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared L2 TLB

[Figure: Alternative design with a shared L2 TLB in the shared page walk unit (highly-threaded page table walker, page walk buffers, page fault register) and no page walk cache]

Page 52: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared L2 TLB and PWC

[Figure: Alternative design with both a shared L2 TLB and a page walk cache in the shared page walk unit]

Page 53: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared L2 TLB

[Figure: Performance relative to D3]

• A shared L2 TLB can work as well as a page walk cache
• Some workloads need low-latency TLB misses

Page 54: Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared L2 TLB and PWC

• Benefits of both the page walk cache and the shared L2 TLB
• Can reduce the L1 TLB size without a performance penalty

Page 55: Supporting x86-64 Address Translation for 100s of GPU Lanes

Unified MMU Design

[Figure: Generalized MMU design. (a) Per-CU post-coalescer TLBs take memory requests with virtual addresses and forward physical-address requests to the L1 caches and GPU L2. (b) A shared page walk unit, with a multi-threaded page table walker, page walk buffers, a page fault register, an optional page walk cache, and an optional shared L2 TLB, services page walk requests and responses]