
Page 1: Title

The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing

Manish Arora
Computer Science and Engineering
University of California, San Diego

Page 2: From GPU to GPGPU

[Figure: a classic GPU pipeline (Input Assembly, Vertex Processing, Geometry Processing, Frame Buffer Operations) backed by an L2 cache, memory controller, and off-chip memory, shown evolving into a GPGPU built from SMs with shared memories in front of the same L2 / memory controller / off-chip memory hierarchy.]

Widespread adoption (300M devices). First with NVIDIA Tesla in 2006-2007.

Page 3: Previous Generation Consumer Hardware (2006 - 2010)

[Figure: a discrete GPGPU (SMs with shared memories, L2, memory controller, off-chip memory) attached through a PCI bridge to a multicore CPU (cores with private cache hierarchies, a shared last level cache, memory controller, and its own off-chip memory).]

Page 4: Current Consumer Hardware (2011 - 2012)

[Figure: a chip-integrated system: CPU cores with private cache hierarchies and GPGPU SMs with shared memories sharing an on-chip last level cache, a memory controller, and off-chip memory.]

Examples: Intel Sandy Bridge, AMD Fusion APUs.

Page 5: Our Goals Today

- Examine the current state of the art
- Trace the next steps of this evolution (major part)
- Lay out research opportunities

Page 6: Outline

[Figure: roadmap. Throughput applications and energy-efficient GPUs lead to GPGPU (Part 1); together with CPU-only workloads and lower costs/overheads they lead to chip-integrated CPU-GPU systems and on to next generation CPU-GPU architectures.]

- Part 1: GPGPU
- Part 2: GPGPU Evolution
- Part 3: Holistic Optimizations (CPU Core Optimization, Redundancy Elimination)
- Part 4: Shared Components
- Part 5: Opportunistic Optimizations
- Part 6 (Future Work): Emerging Technologies; Power, Temperature, Reliability; Tools

Page 7: Part 1 - Progression of GPGPU Architectures

Page 8: GPGPUs - 1: The Fixed-Function Graphics Era (pre-2006)

- Programmable vertex processors
- Programmable pixel processors
- Lots of fixed hardware blocks (assembly, geometry, z-culling, ...)
- Non-graphics processing was possible
  - Represent user work as graphics tasks; trick the graphics pipeline
  - Programming via graphics APIs
  - No hardware for bit-wise operations, no explicit branching, ...
- Imbalance in modern workloads motivated unification
- General-purpose opportunity sensed by vendors

Page 9: GPGPUs - 2: The Unified Graphics and Computing Era (2006 - 2010)

- Single programmable processor design
  - Explicit support for both graphics and computing
  - Computing-specific modifications (IEEE FP compliance and ECC)
- Non-graphics processing easy
  - High-level programming (C, C++, Python, etc.)
  - Separate GPU and CPU memory spaces; explicit GPU memory management required
- High overhead to process on the GPU: memory transfers over PCI
- Significant customer market penetration

Page 10: GPGPUs - 3: The Chip-Integrated CPU-GPU Era (2011 onwards)

- Multicore CPU + GPGPU on the same die
  - Shared last level caches and memory controller
  - Shared main memory system
- Chip integration advantages
  - Lower total system costs
  - Shared hardware blocks improve utilization
  - Lower latency, higher bandwidth
- Continued improvements in programmability
  - Standardization efforts (OpenCL and DirectCompute)

Page 11: Contemporary GPU Architecture (Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: a multicore CPU (cores with cache hierarchies, last level cache, memory controller, off-chip memory) connected over a PCI bridge to a discrete GPGPU. Inside the GPU, many SMs with shared memories feed an interconnect that links six L2 cache slices, each paired with its own memory controller and DRAM channel.]

Page 12: SM Architecture (Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: inside an SM: a warp scheduler drives a banked register file and operand buffering, which feed the SIMT lanes (ALUs, SFUs, memory, and texture units), backed by a shared memory / L1 cache.]

Page 13: Multi-threading and Warp Scheduling

- Warp processing
  - 32 threads grouped and processed as a warp
  - A single instruction is fetched and issued per warp
  - Lots of active threads per SM (Fermi: 1536 threads in 48 warps)
- Hardware multithreading for latency hiding
  - Each thread has dedicated registers (Fermi: 21 registers per thread)
  - Register state need not be copied or restored
  - Enables fast switching (potentially a new warp each cycle)
- Threads are processed in-order; warps are scheduled out-of-order
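
To make the warp grouping concrete, here is a minimal CUDA sketch (mine, not from the talk) that recovers each thread's warp and lane index from its flat thread index; warpSize is CUDA's built-in constant, 32 on the hardware discussed here.

#include <cstdio>

// Each block's threads are split into warps of warpSize (32) consecutive
// threads; all threads in a warp share one fetched instruction, and the
// lane index selects the SIMT lane within it.
__global__ void show_warps() {
    int warp_id = threadIdx.x / warpSize;  // which warp within the block
    int lane_id = threadIdx.x % warpSize;  // which SIMT lane within the warp
    if (lane_id == 0)                      // one printf per warp
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp_id, threadIdx.x);
}

int main() {
    show_warps<<<1, 128>>>();  // 128 threads -> 4 warps
    cudaDeviceSynchronize();
    return 0;
}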

Page 14: Example of Warp Scheduling (Lindholm et al. IEEE Micro 2007)

[Figure: the SM's multithreaded instruction scheduler issuing over time: warp 1 instruction 1, warp 2 instruction 1, warp 3 instruction 1, then warp 2 instruction 2, warp 3 instruction 2, ..., warp 1 instruction 2, and so on.]
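
The figure's issue pattern can be captured in a few lines. A host-side sketch (my illustration, with a hypothetical Warp struct): each cycle the scheduler scans round robin from the last issued warp and picks the first ready one, so stalled warps drop out of the rotation and their memory latency is hidden by the others.

#include <vector>

struct Warp { int stall_cycles; };  // > 0 while waiting on a long-latency op

// One issue decision: scan round robin starting after the last issued warp
// and return the first ready one, or -1 if every warp is stalled.
int issue_next(const std::vector<Warp>& warps, int last_issued) {
    int n = (int)warps.size();
    for (int i = 1; i <= n; i++) {
        int w = (last_issued + i) % n;
        if (warps[w].stall_cycles == 0) return w;
    }
    return -1;  // all warps waiting: the SM idles this cycle
}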

Page 15: Design for Efficiency and Scalability (Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011)

- Amortized costs of instruction supply
  - Single instruction, multiple thread (SIMT) model
- Efficient data supply
  - Large register files
  - Managed locality (via shared memories)
- Lack of global structures: no out-of-order processing
- High utilization with hardware multithreading
- Biggest tradeoff: programmability
  - Exposed microarchitecture, frequent changes
  - Programmer has to manage data

Page 16: Scalability (Lee et al. ISCA 2010 / Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011, and other public sources)

- Double precision performance: 10x in 3 generations
- Memory structures growing slower than ALUs (2-2.5x)
- Memory bandwidth growing even slower (2.2x in 4 generations)
- Clearly favors workloads with high arithmetic intensity
- CPU performance gap increasing rapidly: the double precision gap grew from 2x to 9x

Page 17: Part 2 - GPGPU Evolution: Towards Better GPGPU

Page 18: Control-flow Divergence Losses (Fung et al. Micro 2007)

[Figure: a warp with mask 1111 reaches a divergent branch. Threads serialize through path A (instructions 1 and 2) and path B (instructions 1 and 2) under complementary partial masks, reconverging at the merge point with mask 1111 again. Utilization is low between the diverge and converge points.]
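
A minimal CUDA kernel (my example, not from the paper) that produces exactly this pattern: within one 32-thread warp, odd and even lanes take different sides of the branch and execute serially under partial masks.

__global__ void divergent(int *out) {
    int i = threadIdx.x;
    // Within a warp, even and odd lanes diverge: the hardware runs path A
    // with the odd lanes masked off, then path B with the even lanes masked
    // off, so utilization drops to 50% between diverge and converge.
    if (i % 2 == 0)
        out[i] = i * 2;      // path A
    else
        out[i] = i * 3 + 1;  // path B
    // Execution reconverges here; the full mask is restored.
}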

Page 19: Dynamic Warp Formation (Fung et al. Micro 2007)

- Key insight: several warps sit at the same diverge point
- Combine threads from the same execution path dynamically; generate warps on the fly
- 20.7% improvement @ 4.7% area overhead

[Figure: under the original scheme, warp 0 and warp 1 each serialize path A and then path B with partial masks. With DWF, two new full warps are formed dynamically (warp 0+1: path A and warp 0+1: path B) out of the four original partial-mask warps.]
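
A host-side sketch of the regrouping step (my simplification; it assumes threads parked at a diverge point are keyed by the PC they will execute next): threads from different warps on the same path are pooled and carved into full warps.

#include <algorithm>
#include <map>
#include <vector>

const int WARP_SIZE = 32;

// Pool threads by their next PC, then cut full warps out of each pool.
// Threads from different original warps that diverged the same way end up
// packed together into denser warps.
std::vector<std::vector<int>> form_warps(
        const std::map<int, std::vector<int>>& threads_at_pc) {
    std::vector<std::vector<int>> new_warps;
    for (const auto& entry : threads_at_pc) {
        const std::vector<int>& t = entry.second;
        for (size_t i = 0; i < t.size(); i += WARP_SIZE) {
            size_t end = std::min(i + WARP_SIZE, t.size());
            new_warps.emplace_back(t.begin() + i, t.begin() + end);
        }
    }
    return new_warps;  // real DWF also enforces lane placement (next slide)
}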

Page 20: Dynamic Warp Formation Intricacies (Fung et al. Micro 2007)

- Needs several warps at the same execution point: "majority" warp scheduling policy
- Need for lane-awareness
  - Register files are banked, one bank per lane
  - Spread out the threads of the dynamic warp across lanes
  - Simplifies design

[Figure: register file accesses. With static warps, each ALU 1..N reads only its own bank 1..N. Lane-aware dynamic warp formation preserves this one-access-per-bank pattern; without lane awareness, threads of a formed warp collide on the same banks.]
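
A sketch of the lane-aware constraint (mine): a thread may only occupy its home lane, because that lane's register file bank holds its registers, so a dynamic warp accepts at most one thread per lane.

#include <vector>

const int WARP_SIZE = 32;

// Pack threads (all waiting at the same PC) into warps, one thread per home
// lane. -1 marks an empty lane. Keeping each thread in its original lane
// means no cross-bank register file ports are needed.
std::vector<std::vector<int>> pack_lane_aware(const std::vector<int>& threads) {
    std::vector<std::vector<int>> warps;
    for (int t : threads) {
        int lane = t % WARP_SIZE;  // home lane = lane in the original warp
        bool placed = false;
        for (auto& w : warps)      // first warp with that lane still free
            if (w[lane] < 0) { w[lane] = t; placed = true; break; }
        if (!placed) {
            warps.emplace_back(WARP_SIZE, -1);
            warps.back()[lane] = t;
        }
    }
    return warps;
}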

Page 21: Large Warp Microarchitecture (Narasiman et al. Micro 2011)

- Similar idea of generating dynamic warps; differs in the creation method
- Machine organized as large warps, bigger than the SIMT width
- Dynamically create SIMT-width warps from within the large warp

[Figure: a large warp's 4x4 activity mask. At T = 0 the original mask holds the active and inactive bits of all its threads; at T = 1, 2, 3 one active thread per lane column is peeled off into a SIMT-width sub-warp, clearing those bits from the mask.]
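
The peeling step in the figure can be written directly (my sketch): each cycle, take the first still-active thread in every lane column of the large warp's activity mask to form one SIMT-width sub-warp.

#include <vector>

// Peel one SIMT-width sub-warp off a large warp. mask[row][lane] is true
// while that thread is active and not yet issued. Returns the row chosen
// per lane, or -1 for a lane with no active thread left.
std::vector<int> peel_subwarp(std::vector<std::vector<bool>>& mask) {
    int rows = (int)mask.size();
    int lanes = rows ? (int)mask[0].size() : 0;
    std::vector<int> picked(lanes, -1);
    for (int lane = 0; lane < lanes; lane++)
        for (int row = 0; row < rows; row++)
            if (mask[row][lane]) {        // first active thread in the column
                picked[lane] = row;
                mask[row][lane] = false;  // clear: it has been issued
                break;
            }
    return picked;
}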

Page 22: Two-Level Scheduling (Narasiman et al. Micro 2011)

- Typical warp scheduling scheme: round robin
  - Beneficial because it exploits data locality across warps
  - But all warps tend to reach long-latency operations at the same time
  - Latency cannot be hidden because everyone is waiting
- Solution: group warps into several sets (see the sketch below)
  - Schedule warps within a single set round robin: still exploits data locality
  - Switch to another set when all warps of a set hit long-latency operations
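
A minimal sketch of the two levels (mine, with a hypothetical Warp struct): level 1 round-robins within the active fetch group, and level 2 switches groups only when every warp in the active group is stalled.

#include <vector>

struct Warp { bool stalled; };

// Returns the warp to issue from the active group, advancing `group` when
// the whole active group is waiting on long-latency operations.
int schedule(std::vector<std::vector<Warp>>& groups, int& group, int& last) {
    int remaining = (int)groups.size();
    while (remaining--) {
        std::vector<Warp>& g = groups[group];
        int n = (int)g.size();
        for (int i = 1; i <= n; i++) {             // level 1: round robin
            int w = (last + i) % n;
            if (!g[w].stalled) { last = w; return w; }
        }
        group = (group + 1) % (int)groups.size();  // level 2: switch groups
        last = -1;
    }
    return -1;  // every warp in every group is stalled
}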

Page 23: Dynamic Warps vs Large Warp + 2-Level Scheduling (Fung et al. Micro 2007 vs Narasiman et al. Micro 2011)

- Dynamic warp formation gives better performance than the large warp alone
  - More opportunities to form warps: all warps vs only the large warp's size
- Large warp + 2-level scheduling beats dynamic warp formation alone
- 2-level scheduling can be applied together with dynamic warp formation

Page 24: Part 3 - Holistically Optimized CPU Designs

Page 25: Motivation to Rethink CPU Design (Arora et al., in submission to IEEE Micro 2012)

- Heterogeneity works best when each composing core runs a subset of codes well (Kumar et al. PACT 2006)
  - GPGPU is already an example of this
- The CPU need not be fully general-purpose: it is sufficient to optimize it for non-GPU code
- The CPU undergoes a "holistic optimization": the code expected to run on the CPU is very different
- We start by investigating the properties of this code

Page 26: Benchmarks

- Took important computing applications and partitioned them over the CPU and GPU
- Partitioning knowledge mostly based on expert information
  - Either used publicly available source code or details from publications
  - Performed our own CUDA implementations for 3 benchmarks
- Also used serial and parallel programs with no known GPU implementations as CPU-only workloads
- Total of 11 CPU-heavy, 11 mixed, and 11 GPU-heavy benchmarks

Page 27: Methodology

- Used a combination of two techniques
  - Inserted start/end functions based on the partitioning information
  - Real machine measurements plus Pin-based simulators
- Branches categorized into 4 classes: biased (same direction), patterned (95% accuracy on a local predictor), correlated (95% accuracy on gshare), and hard (the remainder)
- Loads and stores categorized into 4 classes: static (same address), strided (95% accuracy on a stride prefetcher), patterned (95% accuracy on a Markov predictor), and hard (the remainder)
- Thread-level parallelism is measured as speedup on a 32-core machine
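
For concreteness, a sketch of the branch rule (my reconstruction, not the authors' tool; the 95% cut-offs for the local and gshare predictors come from the slide, while the bias threshold is my assumption): try each category in order and take the first that fits.

#include <algorithm>
#include <string>

// Classify one static branch from the measured behavior of its dynamic
// instances under the slide's two reference predictors.
std::string classify_branch(double taken_fraction,
                            double local_acc, double gshare_acc) {
    double bias = std::max(taken_fraction, 1.0 - taken_fraction);
    if (bias >= 0.95)       return "biased";      // same direction nearly always
    if (local_acc >= 0.95)  return "patterned";   // local history captures it
    if (gshare_acc >= 0.95) return "correlated";  // global history captures it
    return "hard";                                // none of the above
}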

Page 28: Results - CPU Time

[Figure: fraction of execution time spent on the CPU vs the GPU, per benchmark.]

- Conservative methodology: GPU speedups are capped at 10x
- More time is being spent on the CPU than on the GPU

Page 29: Results - Instruction Level Parallelism

[Figure: ILP of the code remaining on the CPU, per benchmark.]

- ILP drops in 17 of 22 apps (11% drop for the larger window size)
- Short independent loops move to the GPU; dependence-heavy code stays on the CPU

Page 30: Results - Branch Characterization

[Figure: branch category mix of the code remaining on the CPU, per benchmark.]

- Frequency of hard branches rises from 11.3% to 18.6%
- Occasional effects of data-dependent branches

Page 31: Results - Loads

[Figure: load category mix of the code remaining on the CPU, per benchmark.]

- Reduction in strided loads; increase in hard loads
- Occasional GPU mapping of irregular-access kernels

Page 32: Results - Vector Instructions

[Figure: SSE usage before and after GPU partitioning, per benchmark.]

- SSE usage drops to almost half
- GPUs and SSE extensions target the same regions of code

Page 33: Results - Thread Level Parallelism

[Figure: TLP (speedup on 32 cores) before and after GPU partitioning, per benchmark.]

- GPU-heavy benchmarks are the worst hit (14x down to 2.1x); overall 40-60% drops
- The majority of benchmarks have almost no post-GPU TLP
- Going from 8 cores to 32 cores yields only a 10% benefit

Page 34: Impact: CPU Core Directions

- Larger instruction windows will have muted gains
- Considerably increased pressure on the branch predictor
  - Need to adopt better-performing techniques (e.g., L-TAGE, Seznec et al.)
- Memory accesses will continue to be a major bottleneck
  - Stride or next-line prefetching becomes almost irrelevant
  - Need to apply techniques that capture complex patterns (sketch below)
  - Lots of literature, but never adopted on real machines (e.g., Markov prediction, helper-thread prefetching)
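
As one example of a complex-pattern technique, a minimal Markov prefetcher sketch (mine, not from the talk): remember which miss address tends to follow which, and prefetch the recorded successor on the next miss. The per-address table is exactly the storage cost that has kept such designs off real machines.

#include <cstdint>
#include <unordered_map>

// 1-successor Markov prefetcher. A real design would keep several weighted
// successors per address and bound the table size.
struct MarkovPrefetcher {
    std::unordered_map<uint64_t, uint64_t> next;  // miss addr -> next miss addr
    uint64_t last_miss = 0;

    // Called on every cache miss; returns an address worth prefetching,
    // or 0 when this address has no recorded successor yet.
    uint64_t on_miss(uint64_t addr) {
        if (last_miss) next[last_miss] = addr;    // learn the transition
        last_miss = addr;
        auto it = next.find(addr);
        return it != next.end() ? it->second : 0;
    }
};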

Page 35: Impact: Redundancy Elimination

- SSE rendered significantly less important
  - Every core need not have it; cores could share SSE hardware
- Extra CPU cores are not of much use because of the lack of TLP
- A few bigger cores focused on highly irregular code will improve performance

Page 36: Part 4 - Shared Component Designs

Page 37: Optimization of Shared Structures

[Figure: the chip-integrated system again. The CPU side is latency sensitive; the GPGPU side is potentially latency-insensitive but bandwidth hungry. Both contend for the shared on-chip last level cache and memory controller.]

Page 38: TAP: TLP-Aware Shared LLC Management (Lee et al. HPCA 2012)

- Insight 1: GPU cache misses/hits may or may not impact performance
  - Misses only matter if there is not enough latency hiding
  - Allocated capacity is useless if there is abundant parallelism
  - So measure the cache's sensitivity to performance: the core sampling controller
- Insight 2: the GPU causes a lot more cache traffic than the CPU
  - Allocation schemes typically allocate based on the number of accesses
  - Normalization is needed for the GPU's larger access counts: cache block lifetime normalization

Page 39: TAP Design - 1

- Core sampling controller
  - GPUs usually run the same workload on all cores
  - Use different cache policies on 2 of the cores and measure the performance difference (e.g., LRU on one core, MRU on the other)
- Cache block lifetime normalization
  - Count cache accesses for all CPU and GPU workloads
  - Calculate ratios of access counts across workloads

Page 40: TAP Design - 2

- Utility-based Cache Partitioning (UCP)
  - Dynamic cache way allocation scheme
  - Allocates ways based on an application's expected gain from additional space (its utility)
  - Uses cache hit rates to calculate utility and access rates to calculate cache block lifetimes
- TLP-Aware UCP (TAP-UCP), sketched below
  - Uses the core sampling controller's information
  - Allocates ways based on performance sensitivity, not hit rate
  - Normalizes access rates to reduce the GPU workload's weight
- 5% better performance than UCP, 11% over LRU
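
A greedy way-allocation sketch (my simplification of UCP-style partitioning; the gpu_sensitive flag stands in for the core sampling result): each LLC way goes to whichever application gains the most from one more way, and a cache-insensitive GPU is skipped entirely.

#include <vector>

// utility[a][w] = extra hits application a would get from its (w+1)-th way,
// already normalized for access rate; each row has total_ways entries.
// If core sampling saw no LRU-vs-MRU performance difference on the GPU,
// gpu_sensitive is false and the GPU's utility is ignored.
std::vector<int> allocate_ways(const std::vector<std::vector<long>>& utility,
                               int total_ways, int gpu_app, bool gpu_sensitive) {
    int apps = (int)utility.size();
    std::vector<int> ways(apps, 0);
    for (int w = 0; w < total_ways; w++) {
        int best = -1;
        long best_gain = -1;
        for (int a = 0; a < apps; a++) {
            if (a == gpu_app && !gpu_sensitive) continue;  // the TAP twist
            if (utility[a][ways[a]] > best_gain) {
                best_gain = utility[a][ways[a]];
                best = a;
            }
        }
        if (best < 0) best = gpu_app;  // nobody else wants capacity
        ways[best]++;
    }
    return ways;
}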

Page 41: QoS-Aware Memory Bandwidth Partitioning (Jeong et al. DAC 2012)

- Typical memory controller policy: always prioritize the CPU
  - The CPU is latency sensitive; the GPU is not
- However, this can slow down GPU traffic
  - A problem for real-time applications (graphics)

Page 42: QoS-Aware Memory Bandwidth Partitioning (Jeong et al. DAC 2012)

- Static management policies are problematic, so the authors propose a dynamic scheme (sketch below)
  - Default: prioritize the CPU over the GPU
  - Periodically measure the current rate of progress on the frame (work is decomposed into small tiles, so measurement is simple)
  - Compare it with the target frame rate
  - If the frame is progressing slower than the target rate, set CPU and GPU priorities equal
  - If close to the deadline and still behind, boost GPU request priority even further
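
The policy reduces to a small decision function. A sketch (my paraphrase of the steps above; the 0.9 near-deadline threshold is an assumption):

// Called periodically with the fraction of the frame rendered and the
// fraction of the frame's time budget already spent.
enum class Priority { CpuOverGpu, Equal, GpuBoosted };

Priority frame_qos(double frame_done, double time_elapsed) {
    if (frame_done >= time_elapsed)   // on pace for the target frame rate
        return Priority::CpuOverGpu;  // default: CPU requests first
    if (time_elapsed > 0.9)           // near the deadline and still behind
        return Priority::GpuBoosted;  // boost GPU requests even further
    return Priority::Equal;           // behind: stop de-prioritizing the GPU
}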

Page 43: Part 5 - Opportunistic Optimizations

Page 44: Opportunistic Optimizations

- Chip integration advantages: lower latency, new communication paths (e.g., the shared L2)
- Opportunity for non-envisioned usage: using idle resources to help active execution
  - The idle GPU helps the CPU
  - The idle CPU helps the GPU

Page 45: Idle-GPU Shader-Based Prefetching (Woo et al. ASPLOS 2010)

- Realization: advanced prefetching is not adopted because of high storage costs
- The GPU system can have exploitable idle resources: use idle GPU shader resources
  - Register files as prefetcher storage
  - Execution threads as logic structures
  - Parallel prefetcher execution threads to improve latency
- Propose an OS-based enabling and control interface: the Miss Address Provider
- Library of prefetchers with application-specific selection
- Prefetching performance benefit of 68%

Page 46: Miss Address Provider

[Figure: CPU cores and GPU SMs share the on-chip last level cache. The MAP sits beside the LLC and holds the miss PC, miss address, a shader pointer, and a command buffer. Flow: the OS allocates an idle GPU core; miss info is forwarded to that core; the GPU core stores and processes the miss stream; data is prefetched into the shared LLC.]
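
A hypothetical sketch of one MAP command-buffer entry (the field names follow the figure; the layout and types are my assumptions):

#include <cstdint>

// One entry forwarded from the LLC to the idle GPU core. The shader pointer
// selects which prefetcher from the library handles this miss stream; the
// GPU core's register file buffers the stream, and generated prefetches
// fill into the shared LLC.
struct MapEntry {
    uint64_t miss_pc;     // PC of the CPU load that missed
    uint64_t miss_addr;   // address that missed in the shared LLC
    uint64_t shader_ptr;  // prefetcher routine to run on the idle GPU core
};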

Page 47: CPU-Assisted GPGPU Processing (Yang et al. HPCA 2012)

- Use idle CPU resources to prefetch for GPGPU applications
  - Targets bandwidth-sensitive GPGPU applications
- Compiler-based framework converts GPU kernels into a CPU prefetching program
- The CPU runs appropriately ahead of the GPU
  - If it falls too far behind, the CPU cache hit rate will be very high
  - If it runs too far ahead, the GPU cache hit rate will be very low
- Very few CPU cycles are required, since an LLC line is large
- Prefetching performance benefit of 21%

Page 48: Example GPU Kernel and CPU Program

// GPU kernel: one element per thread.
__global__ void VecAdd(float *A, float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// Requests for a single thread with flat id n (A, B, C in scope, as on the
// slide; touching C also warms the line the store will hit).
float mem_fetch(int n) {
    return A[n] + B[n] + C[n];
}

// CPU prefetching program generated from the kernel. N_TB, Concurrent_TB,
// TB_Size, skip_factor, batch_size, and sum are in scope, as on the slide.
// skip_factor controls the CPU's timing relative to the GPU; batch_size
// controls how often skip_factor is updated; unroll_factor artificially
// boosts the rate of CPU requests.
void cpu_prefetching(...) {
    unroll_factor = 8;
    // traverse all thread blocks (TB), Concurrent_TB at a time
    for (j = 0; j < N_TB; j += Concurrent_TB)
        // traverse the threads of the concurrent thread blocks
        for (i = 0; i < Concurrent_TB * TB_Size;
             i += skip_factor * batch_size * unroll_factor) {
            for (k = 0; k < batch_size; k++) {  // the slide's "j < batch_size" is a typo
                id = i + skip_factor * k * unroll_factor + j * TB_Size;
                // unrolled loop: unroll_factor fetches in all
                float a0 = mem_fetch(id + skip_factor * 0);
                float a1 = mem_fetch(id + skip_factor * 1);
                // . . .
                sum += a0 + a1 /* + . . . */;
            }
            /* update skip_factor */
        }
}

Page 49: Drawbacks: CPU-Assisted GPGPU Processing

- Does not consider the effects of thread block scheduling
- The CPU program is stripped of the actual computations
  - Memory requests from data- or computation-dependent paths are not considered

Page 50: Part 6 - Future Work

Page 51: Continued System Optimizations

- Continued holistic optimizations
  - Understand the impact of GPU workloads on CPU requests at the memory controller
- Continued opportunistic optimizations
  - The latest GPUs allow different kernels to run on the same GPU
  - Can GPU threads prefetch for other GPU kernels?

Page 52: Research Tools

- Severe lack of GPU research tools
  - No GPU power model
  - No GPU temperature model
- Immediate and impactful opportunities

Page 53: Power, Temperature and Reliability

- Bounded by the lack of power tools
- No work yet on effective power management
- No work yet on effective temperature management

Page 54: Emerging Technologies

- Impact of non-volatile memories on GPUs
- 3D die-stacked GPUs
- Stacked CPU-GPU-main memory systems

Page 55: Conclusions

- In this work we looked at the CPU-GPU research landscape
- GPGPU systems are quickly scaling in performance
- The CPU needs to be refocused to handle extremely irregular code
- The design of shared components needs to be rethought
- Abundant optimization and research opportunities!

Questions?

Page 56: Backup Slides

Page 57: Results - Stores

[Figure: store category mix of the code remaining on the CPU, per benchmark.]

- Similar trends as for loads, but slightly less pronounced

Page 58: Results - Branch Prediction Rates

[Figure: branch misprediction rates, per benchmark.]

- Hard branches translate into higher misprediction rates
- Strong influence of the CPU-only benchmarks