TRANSCRIPT
The Architecture and Evolution of CPU-GPU Systems for General
Purpose Computing
Manish Arora, Computer Science and Engineering, University of California, San Diego
From GPU to GPGPU
[Figure: block diagrams contrasting a classic GPU and a GPGPU. The GPU pipeline consists of fixed stages (Input Assembly, Vertex Processing, Geometry Processing, Frame Buffer Operations) backed by an L2, a memory controller, and off-chip memory. The GPGPU replaces the fixed pipeline with an array of SMs, each with shared memory, in front of a shared L2, memory controller, and off-chip memory.]
Widespread adoption (300M devices); first with NVIDIA Tesla in 2006-2007.
[Figure: Previous-generation consumer hardware (2006-2010): a multicore CPU (cores with private cache hierarchies, a shared last-level cache, a memory controller, and off-chip memory) connected over a PCI bridge to a discrete GPGPU (SMs with shared memory, L2, memory controller, and its own off-chip memory). Current consumer hardware (2011-2012), e.g. Intel Sandy Bridge and AMD Fusion APUs: CPU cores and GPGPU SMs integrated on one die, sharing the on-chip last-level cache, memory controller, and off-chip memory.]
Our Goals Today
- Examine the current state of the art
- Trace the next steps of this evolution (major part)
- Lay out research opportunities
Outline
[Roadmap figure relating throughput applications, energy-efficient GPUs, lower costs and overheads, CPU-only workloads, the GPGPU, chip-integrated CPU-GPU systems, and next-generation CPU-GPU architectures.]
- Part 1: GPGPU
- Part 2: GPGPU Evolution
- Part 3: Holistic Optimizations - CPU core optimization, redundancy elimination
- Part 4: Shared Components
- Part 5: Opportunistic Optimizations
- Part 6 (Future Work): emerging technologies; power, temperature, and reliability; tools
Part 1: Progression of GPGPU Architectures
GPGPUs - 1: The fixed-function graphics era (pre-2006)
- Programmable vertex processors and programmable pixel processors
- Lots of fixed hardware blocks (assembly, geometry, z-culling…)
- Non-graphics processing was possible: represent user work as graphics tasks and trick the graphics pipeline; programming via graphics APIs; no hardware for bit-wise operations, no explicit branching…
- Imbalance in modern workloads motivated unification; vendors sensed the general-purpose opportunity
GPGPUs - 2: The unified graphics and computing era (2006-2010)
- Single programmable processor design with explicit support for both graphics and computing; computing-specific modifications (IEEE FP compliance and ECC)
- Non-graphics processing made easy: high-level programming (C, C++, Python, etc.)
- Separate GPU and CPU memory spaces; explicit GPU memory management required (see the sketch below)
- High overhead to process on the GPU: memory transfers over PCI
- Significant customer market penetration
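To make the overhead concrete, here is a minimal CUDA sketch of the discrete-GPU workflow of this era, where every kernel launch is bracketed by explicit allocation and PCI transfers (the kernel, names, and sizes are illustrative, not from the talk):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;            // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *h = new float[n];                 // host (CPU) memory
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));                            // separate GPU memory space
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer over PCI
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // transfer results back
        cudaFree(d);
        delete[] h;
        return 0;
    }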
GPGPUs - 3: The chip-integrated CPU-GPU era (2011 onwards)
- Multicore CPU + GPGPU on the same die; shared last-level cache, memory controller, and main memory system
- Chip integration advantages: lower total system costs, shared hardware blocks improve utilization, lower latency, higher bandwidth
- Continued improvements in programmability; standardization efforts (OpenCL and DirectCompute)
Contemporary GPU Architecture (Lindholm et al., IEEE Micro 2007 / Wittenbrink et al., IEEE Micro 2011)
[Figure: a contemporary discrete system. The multicore CPU (cores with private cache hierarchies, a shared last-level cache, and a memory controller to off-chip memory) connects over a PCI bridge to the GPGPU. Inside the GPU, many SMs (each with shared memory) communicate through an interconnect with multiple L2 cache slices, each paired with its own memory controller and DRAM channel.]
SM Architecture (Lindholm et al., IEEE Micro 2007 / Wittenbrink et al., IEEE Micro 2011)
[Figure: SM organization - warp scheduler, banked register file, operand buffering, SIMT lanes (ALUs, SFUs, MEM, and TEX units), and shared memory / L1 cache.]
Multi-threading and Warp Scheduling
- Warp processing: 32 threads are grouped and processed as a warp; a single instruction is fetched and issued per warp; lots of active threads per SM (Fermi: 1536 threads in 48 warps)
- Hardware multithreading for latency hiding: each thread has dedicated registers (Fermi: 21 registers per thread), so register state need not be copied or restored, enabling fast switching (potentially a new warp each cycle)
- Threads are processed in-order; warps are scheduled out-of-order
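A quick CUDA illustration of the grouping (hypothetical kernel, not from the talk): consecutive threads of a block form warps of 32, and all lanes of a warp execute the same fetched instruction in lock-step.

    __global__ void warp_view(int *warp_of_thread) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = threadIdx.x / warpSize;   // warpSize == 32: threads 0-31 are warp 0, 32-63 warp 1, ...
        int lane = threadIdx.x % warpSize;   // position of this thread within its warp
        // Every lane of a warp executes this same store in lock-step (SIMT).
        warp_of_thread[tid] = warp * 100 + lane;
    }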
Example of Warp Scheduling (Lindholm et al., IEEE Micro 2007)
[Figure: the SM multithreaded instruction scheduler interleaves warps over time, e.g. Warp 1 Instruction 1, Warp 2 Instruction 1, Warp 3 Instruction 1, then Warp 2 Instruction 2, Warp 3 Instruction 2, Warp 1 Instruction 2, and so on.]
Design for Efficiency and Scalability (Nickolls et al., IEEE Micro 2010 / Keckler et al., IEEE Micro 2011)
- Amortized costs of instruction supply: single-instruction multiple-thread (SIMT) model
- Efficient data supply: large register files, managed locality (via shared memories)
- Lack of global structures: no out-of-order processing
- High utilization with hardware multithreading
- Biggest tradeoff: programmability - exposed microarchitecture with frequent changes, and the programmer has to manage data
Scalability (Lee et al., ISCA 2010 / Nickolls et al., IEEE Micro 2010 / Keckler et al., IEEE Micro 2011, and other public sources)
- Double-precision performance up 10x in 3 generations
- Memory structures growing slower than ALUs (22.5x); memory bandwidth even slower (2.2x in 4 generations)
- Clearly favors workloads with high arithmetic intensity (see the definition below)
- CPU performance gap increasing rapidly: the double-precision performance gap has grown from 2x to 9x
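Arithmetic intensity here is the usual roofline-model quantity; as a reminder (standard definition, not specific to the talk):

    \text{Arithmetic Intensity (AI)} = \frac{\text{arithmetic operations performed}}{\text{bytes moved to/from memory}},
    \qquad
    \text{attainable FLOP/s} \approx \min\bigl(\text{peak FLOP/s},\; \text{AI} \times \text{memory bandwidth}\bigr)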
Part 2: GPGPU Evolution - Towards Better GPGPU
Control-flow Divergence Losses (Fung et al., Micro 2007)
[Figure: a warp reaches a divergent branch with a full mask (1111); the lanes on Path A execute its instructions while the Path B lanes are masked off, then vice versa, until the paths reconverge at the merge point. The result is low SIMT-lane utilization between the diverge and converge points.]
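A minimal CUDA kernel showing the pattern (hypothetical example, not from the talk): odd and even lanes of the same warp take different sides of the branch, so the two paths are serialized with part of the warp masked off on each.

    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Odd and even lanes of the same warp take different paths, so Path A and
        // Path B execute one after the other, each at roughly 50% lane utilization.
        if (i % 2 == 0) {
            x[i] = x[i] * 2.0f;   // Path A
        } else {
            x[i] = x[i] + 1.0f;   // Path B
        }
    }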
Dynamic Warp Formation (Fung et al., Micro 2007)
- Key insight: several warps sit at the same diverge point, so threads from the same execution path can be combined dynamically, generating warps on the fly
- 20.7% improvement @ 4.7% area overhead
[Figure: in the original scheme, Warp 0 and Warp 1 each execute Path A and then Path B with partially masked lanes; with DWF, 2 new warps are dynamically formed from the 4 original warp fragments - one combined warp (0+1) for Path A and one for Path B.]
Dynamic Warp Formation Intricacies (Fung et al., Micro 2007)
- Needs several warps at the same execution point: "majority" warp scheduling policy
- Need for lane-awareness: register files are banked per lane, so threads of the dynamic warp must be spread out across lanes; lane-aware formation keeps each thread over its own bank and simplifies the design
[Figure: register file access patterns for static warps, for lane-aware dynamic warp formation (one access per bank/ALU lane), and for formation without lane awareness (conflicting accesses to the same bank).]
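A host-side C++ sketch of the regrouping step under my own simplifications (4-wide warps, one divergent branch, lane-aware packing); it illustrates the idea rather than the paper's hardware:

    #include <vector>

    constexpr int kLanes = 4;                 // SIMT width (illustrative; real warps are 32 wide)

    struct Warp { int thread[kLanes]; };      // thread id per lane, -1 = lane inactive

    // Lane-aware dynamic warp formation: threads that took `path` are packed into new
    // warps, but each thread stays in its original lane so it still reads its own
    // register-file bank.
    std::vector<Warp> form_warps(const std::vector<Warp> &in, const std::vector<bool> &took_path) {
        std::vector<Warp> out;
        for (const Warp &w : in) {
            for (int lane = 0; lane < kLanes; ++lane) {
                int t = w.thread[lane];
                if (t < 0 || !took_path[t]) continue;
                // Find an output warp whose slot for this lane is still free.
                Warp *dst = nullptr;
                for (Warp &cand : out)
                    if (cand.thread[lane] < 0) { dst = &cand; break; }
                if (!dst) { out.push_back(Warp{{-1, -1, -1, -1}}); dst = &out.back(); }
                dst->thread[lane] = t;
            }
        }
        return out;
    }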
Large Warp Microarchitecture (Narasiman et al., Micro 2011)
- Similar idea of generating dynamic warps, but differs in the creation method
- The machine is organized as large warps, bigger than the SIMT width
- SIMT-width warps are dynamically created from within the large warp
[Figure: a large warp's activity mask over time (T = 0 to 3); each step, active bits are picked column by column to form a densely packed sub-warp, and the chosen bits are cleared from the mask.]
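A C++ sketch of the packing step, assuming the large warp is stored as a rows-by-lanes activity mask (the sizes and representation are my own, chosen for illustration):

    #include <array>

    constexpr int kLanes = 4;   // SIMD width (illustrative)
    constexpr int kRows  = 4;   // large warp = kRows * kLanes threads

    using ActivityMask = std::array<std::array<bool, kLanes>, kRows>;

    // Form one SIMD-width sub-warp: for every lane (column), grab the first still-active
    // row and clear its bit. Returns the row chosen per lane (-1 if the lane stays idle).
    std::array<int, kLanes> pack_subwarp(ActivityMask &mask) {
        std::array<int, kLanes> row_for_lane;
        for (int lane = 0; lane < kLanes; ++lane) {
            row_for_lane[lane] = -1;
            for (int row = 0; row < kRows; ++row) {
                if (mask[row][lane]) {
                    mask[row][lane] = false;   // this thread is now issued
                    row_for_lane[lane] = row;
                    break;
                }
            }
        }
        return row_for_lane;
    }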
Two-Level Scheduling (Narasiman et al., Micro 2011)
- Typical warp scheduling scheme is round-robin, which is beneficial because it exploits data locality across warps
- But all warps then tend to reach long-latency operations at the same time, and latency cannot be hidden because everyone is waiting
- Solution: group warps into several sets; schedule warps within a single set round-robin to still exploit data locality, and switch to another set when all warps of a set hit long-latency operations (see the sketch below)
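A toy C++ sketch of the two-level policy under my own simplifications (a warp is modeled as simply ready or stalled on a long-latency operation; group sizes and names are illustrative, not from the paper):

    #include <vector>

    struct WarpState { bool stalled; };        // true while waiting on a long-latency op

    struct TwoLevelScheduler {
        std::vector<std::vector<int>> groups;  // warp ids partitioned into sets (fetch groups)
        int active_group = 0;                  // level 1: which set may issue
        int rr_cursor = 0;                     // level 2: round-robin position inside it

        // Pick the next warp to issue, or -1 if every warp in every set is stalled.
        int next(const std::vector<WarpState> &warps) {
            for (size_t g = 0; g < groups.size(); ++g) {
                int gid = (active_group + g) % groups.size();       // switch sets only when needed
                const std::vector<int> &grp = groups[gid];
                for (size_t i = 0; i < grp.size(); ++i) {
                    int slot = (rr_cursor + i) % grp.size();
                    int w = grp[slot];
                    if (!warps[w].stalled) {
                        active_group = gid;
                        rr_cursor = (slot + 1) % grp.size();        // continue round-robin next time
                        return w;
                    }
                }
                // all warps of this set hit long-latency ops -> fall through to the next set
            }
            return -1;
        }
    };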
Dynamic Warps vs. Large Warp + 2-Level Scheduling (Fung et al., Micro 2007 vs. Narasiman et al., Micro 2011)
- Dynamic warp formation gives better performance than the large warp alone: more opportunities to form warps (all warps vs. one large warp's size)
- Large warp + 2-level scheduling is better than dynamic warp formation
- 2-level scheduling can also be applied together with dynamic warp formation
Part 3: Holistically Optimized CPU Designs
Motivation to Rethink CPU Design (Arora et al., in submission to IEEE Micro 2012)
- Heterogeneity works best when each composing core runs a subset of codes well (Kumar et al., PACT 2006); the GPGPU is already an example of this
- The CPU need not be fully general-purpose; it is sufficient to optimize it for non-GPU code
- The CPU undergoes a "holistic optimization": the code expected to run on the CPU is very different, so we start by investigating the properties of this code
Benchmarks
- Took important computing applications and partitioned them over the CPU and GPU
- Partitioning knowledge mostly based on expert information: either used publicly available source code or details from publications; performed our own CUDA implementations for 3 benchmarks
- Also used serial and parallel programs with no known GPU implementations as CPU-only workloads
- Total of 11 CPU-heavy, 11 mixed, and 11 GPU-heavy benchmarks
Methodology
- Used a combination of two techniques: inserted start-end functions based on the partitioning information, then real-machine measurements and PIN-based simulators
- Branches categorized into 4 categories: biased (same direction), patterned (95% accuracy on a local predictor), correlated (95% accuracy on gshare), hard (remaining)
- Loads and stores characterized into 4 categories: static (same address), strided (95% accuracy on a stride prefetcher), patterned (95% accuracy on a Markov predictor), hard (remaining)
- Thread-level parallelism is the speedup on a 32-core machine (see the classification sketch below)
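For concreteness, a small C++ sketch of how the branch categories could be assigned once per-branch predictor accuracies have been collected (the struct, function name, and the strict "same direction" test are my own illustration):

    #include <string>

    struct BranchStats {
        double taken_fraction;      // fraction of dynamic executions that were taken
        double local_accuracy;      // accuracy of a local-history predictor for this branch
        double gshare_accuracy;     // accuracy of a gshare predictor for this branch
    };

    // Buckets mirror the talk's categories: biased, patterned, correlated, hard.
    std::string classify_branch(const BranchStats &s) {
        if (s.taken_fraction == 0.0 || s.taken_fraction == 1.0) return "biased";   // always same direction
        if (s.local_accuracy  >= 0.95) return "patterned";
        if (s.gshare_accuracy >= 0.95) return "correlated";
        return "hard";
    }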
Results – CPU Time
- Conservative speedups are capped at 10x
- More time is being spent on the CPU than the GPU
Results – Instruction Level Parallelism
- Drops in 17/22 apps (11% drop for the larger window size)
- Short independent loops go to the GPU; dependence-heavy code stays on the CPU
Results – Branch Characterization
- Frequency of hard branches rises from 11.3% to 18.6%
- Occasional effects of data-dependent branches
Results – Loads
- Reduction in strided loads; increase in hard loads
- Occasional GPU mapping of irregular-access kernels
Results – Vector Instructions
- SSE usage drops to almost half
- GPUs and SSE extensions target the same regions of code
Results – Thread Level Parallelism
- GPU-heavy benchmarks are worst hit (from 14x down to 2.1x); overall 40-60% drops
- The majority of benchmarks have almost no post-GPU TLP
- Going from 8 cores to 32 cores has only a 10% benefit
Impact: CPU Core Directions
- Larger instruction windows will have muted gains
- Considerably increased pressure on the branch predictor; need to adopt better-performing techniques (e.g., L-TAGE, Seznec et al.)
- Memory accesses will continue to be a major bottleneck
- Stride or next-line prefetching is almost irrelevant; need to apply techniques that capture complex patterns
- Lots of literature, but never adopted on real machines (e.g., Markov prediction, helper-thread prefetching)
Impact: Redundancy Elimination
- SSE is rendered significantly less important: every core need not have it; cores could share SSE hardware
- Extra CPU cores are not of much use because of the lack of TLP
- A few bigger cores with a focus on addressing highly irregular code will improve performance
Part 4: Shared Component Designs
Optimization of Shared Structures
[Figure: integrated die with CPU cores (latency-sensitive) and GPGPU SMs (potentially latency-insensitive but bandwidth-hungry) sharing the on-chip last-level cache, memory controller, and off-chip memory.]
TAP: TLP-Aware Shared LLC Management (Lee et al., HPCA 2012)
- Insight 1: GPU cache misses/hits may or may not impact performance; misses only matter if there is not enough latency hiding, and allocated capacity is useless if there is abundant parallelism. Measure cache sensitivity to performance with a core sampling controller.
- Insight 2: the GPU causes a lot more cache traffic than the CPU; allocation schemes typically allocate based on the number of accesses, so normalization is needed for the larger number of GPU accesses (cache block lifetime normalization).
TAP Design - 1
- Core sampling controller: GPUs usually run the same workload on all cores, so use different cache policies on 2 of the cores (e.g., LRU on one, MRU on the other) and measure the performance difference
- Cache block lifetime normalization: count the number of cache accesses for all CPU and GPU workloads and calculate the ratios of access counts across workloads
TAP Design - 2
- Utility-based Cache Partitioning (UCP): a dynamic cache way allocation scheme that allocates ways based on an application's expected gain from additional space (utility); uses the cache hit rate to calculate utility and cache access rates to calculate cache block lifetime
- TLP-Aware Utility-based Cache Partitioning (TAP-UCP): uses the core sampling controller information to allocate ways based on performance sensitivity rather than hit rate, and normalizes access rates to reduce GPU workload weight
- 5% better performance than UCP, 11% over LRU (see the sketch below)
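To make the way-allocation idea concrete, a greedy utility-based partitioning sketch in C++ with a crude TLP-aware twist: applications that core sampling finds cache-insensitive have their utility zeroed. The data layout, the sensitivity flag, and the greedy loop are my own simplifications, not the TAP or UCP hardware.

    #include <vector>

    // utility[a][w] = expected hits for application a when given w ways (w = 0..kWays).
    constexpr int kWays = 16;

    std::vector<int> partition_ways(std::vector<std::vector<double>> utility,
                                    const std::vector<bool> &cache_sensitive) {
        // TLP-aware step: an application that core sampling found insensitive to the
        // cache (e.g., a GPU workload with ample latency hiding) gets a flat utility curve.
        for (size_t a = 0; a < utility.size(); ++a)
            if (!cache_sensitive[a])
                for (double &u : utility[a]) u = 0.0;

        std::vector<int> ways(utility.size(), 0);
        for (int allocated = 0; allocated < kWays; ++allocated) {
            // Give the next way to whichever application gains the most from it.
            size_t best = utility.size();       // sentinel: none chosen yet
            double best_gain = -1.0;
            for (size_t a = 0; a < utility.size(); ++a) {
                if (ways[a] >= kWays) continue; // cannot take more ways
                double gain = utility[a][ways[a] + 1] - utility[a][ways[a]];
                if (gain > best_gain) { best_gain = gain; best = a; }
            }
            if (best == utility.size()) break;
            ways[best] += 1;
        }
        return ways;
    }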
QoS-Aware Memory Bandwidth Partitioning (Jeong et al., DAC 2012)
- Typical memory controller policy: always prioritize the CPU, since the CPU is latency-sensitive and the GPU is not
- However, this can slow down GPU traffic - a problem for real-time applications (graphics)
- Static management policies are problematic, so the authors propose a dynamic management scheme
- The default is still to prioritize CPU requests over GPU requests
- Periodically measure the current rate of progress on the frame (work is decomposed into smaller tiles, so measurement is simple) and compare it with the target frame rate
- If the current frame rate is slower than the target, set CPU and GPU priorities equal
- If close to the deadline and still behind, boost GPU request priority even further (see the sketch below)
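A small C++ sketch of such a dynamic priority policy, using my own illustrative thresholds and names (the paper's exact conditions and constants are not reproduced here):

    enum class GpuPriority { BelowCpu, EqualToCpu, AboveCpu };

    // Decide the GPU's memory-request priority for the next interval.
    //   frame_progress: fraction of the current frame's tiles completed (0..1)
    //   time_elapsed:   fraction of the frame deadline already used (0..1)
    GpuPriority gpu_priority(double frame_progress, double time_elapsed) {
        bool behind_schedule = frame_progress < time_elapsed;   // slower than the target rate
        bool near_deadline   = time_elapsed > 0.9;              // illustrative threshold
        if (behind_schedule && near_deadline) return GpuPriority::AboveCpu;   // boost further
        if (behind_schedule)                  return GpuPriority::EqualToCpu;
        return GpuPriority::BelowCpu;                           // default: CPU first
    }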
Part 5: Opportunistic Optimizations
Opportunistic Optimizations
- Chip integration advantages: lower latency and new communication paths (e.g., the shared L2)
- Opportunity for non-envisioned usage: using idle resources to help active execution
- Idle GPU helps the CPU; idle CPU helps the GPU
Idle GPU Shader-Based Prefetching (Woo et al., ASPLOS 2010)
- Realization: advanced prefetching has not been adopted because of high storage costs
- The GPU system can have exploitable idle resources, so use idle GPU shader resources: register files as prefetcher storage, execution threads as the logic structures, and parallel prefetcher execution threads to improve latency
- Propose an OS-based enabling and control interface: the Miss Address Provider
- Library of prefetchers with application-specific selection
- Prefetching performance benefit of 68%
Miss Address Provider
[Figure: the MAP sits alongside the shared on-chip last-level cache, between the CPU cores and the SMs. It records the miss PC and miss address, and holds a shader pointer and a command buffer. The OS allocates an idle GPU core, miss information is forwarded to it, the GPU core stores and processes the miss stream, and the data it prefetches lands in the shared LLC.]
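A sketch of the forwarding interface implied by the figure, written as plain C structures (the field widths, queue size, and exact command format are assumptions, not from the paper):

    #include <cstdint>

    // One LLC miss observed on the CPU side and forwarded to the idle GPU core.
    struct MissRecord {
        uint64_t miss_pc;        // PC of the load/store that missed
        uint64_t miss_address;   // the missing cache-line address
    };

    // Miss Address Provider: the OS fills in which shader (prefetcher kernel) handles
    // the stream; miss records are queued in the command buffer for that shader.
    struct MissAddressProvider {
        uint64_t   shader_pointer;      // entry point of the prefetcher on the idle GPU core
        MissRecord command_buffer[64];  // illustrative fixed-size queue of forwarded misses
        uint32_t   head, tail;
    };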
CPU-Assisted GPGPU Processing (Yang et al., HPCA 2012)
- Use idle CPU resources to prefetch for GPGPU applications, targeting bandwidth-sensitive GPGPU applications
- A compiler-based framework converts GPU kernels into a CPU prefetching program
- The CPU must run ahead of the GPU by an appropriate amount: if it falls too far behind, the CPU cache hit rate will be very high; if it runs too far ahead, the GPU cache hit rate will be very low
- Very few CPU cycles are required since an LLC line is large
- Prefetching performance benefit of 21%
Example GPU Kernel and CPU Program

    // GPU kernel: requests for a single thread (one element per thread).
    __global__ void VecAdd(float *A, float *B, float *C, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

    // CPU-side stand-in for one thread's memory requests (computation stripped).
    float mem_fetch(const float *A, const float *B, const float *C, int n) {
        return A[n] + B[n] + C[n];
    }

    // CPU prefetching program generated from the kernel (schematic, as on the slide).
    void cpu_prefetching(…) {
        unroll_factor = 8;                        // unroll_factor artificially boosts CPU requests
        // Traverse through all thread blocks (TB), Concurrent_TB at a time.
        for (j = 0; j < N_TB; j += Concurrent_TB)
            // Loop to traverse the concurrent threads (Concurrent_TB * TB_Size of them).
            for (i = 0; i < Concurrent_TB * TB_Size; i += skip_factor * batch_size * unroll_factor) {
                for (k = 0; k < batch_size; k++) {   // batch_size controls how often skip_factor is updated
                    id = i + skip_factor * k * unroll_factor + j * TB_Size;
                    // Unrolled loop of requests.
                    float a0 = mem_fetch(A, B, C, id + skip_factor * 0);
                    float a1 = mem_fetch(A, B, C, id + skip_factor * 1);
                    . . .
                    sum += a0 + a1 + . . .;
                }
                update skip_factor;               // skip_factor controls CPU timing relative to the GPU
            }
    }
Drawbacks: CPU-Assisted GPGPU Processing
- Does not consider the effects of thread block scheduling
- The CPU program is stripped of the actual computations, so memory requests from data- or computation-dependent paths are not considered
Part 6: Future Work - emerging technologies; power, temperature, and reliability; tools
Continued System Optimizations
- Continued holistic optimizations: understand the impact of GPU workloads on CPU requests to the memory controller
- Continued opportunistic optimizations: the latest GPUs allow different kernels to run on the same GPU - can GPU threads prefetch for other GPU kernels?
Research Tools
- Severe lack of GPU research tools: no GPU power model, no GPU temperature model
- Immediate and impactful opportunities
Power, Temperature and Reliability
- Bounded by the lack of power tools
- No work yet on effective power management or effective temperature management
Emerging Technologies
- Impact of non-volatile memories on GPUs
- 3D die-stacked GPUs
- Stacked CPU-GPU-main-memory systems
Conclusions
- In this work we looked at the CPU-GPU research landscape
- GPGPU systems are quickly scaling in performance
- The CPU needs to be refocused to handle extremely irregular code
- The design of shared components needs to be rethought
- Abundant optimization and research opportunities!
Questions?
Backup Slides
Results – Stores
- Similar trends as loads, but slightly less pronounced
Results – Branch Prediction Rates
- Hard branches translate to higher misprediction rates
- Strong influence of CPU-only benchmarks