TRANSCRIPT
The Architecture and Evolution of CPU-GPU Systems for General
Purpose Computing
Manish Arora, Computer Science and Engineering, University of California, San Diego
From GPU to GPGPU
[Figure: block diagrams contrasting a classic GPU and a GPGPU. The GPU pipeline consists of fixed stages (Input Assembly, Vertex Processing, Geometry Processing, Frame Buffer Operations) backed by an L2, a memory controller, and off-chip memory. The GPGPU replaces the fixed pipeline with an array of SMs, each with shared memory, in front of a shared L2, memory controller, and off-chip memory.]
Widespread adoption (300M devices); first with NVIDIA Tesla in 2006-2007.
[Figure: Previous-generation consumer hardware (2006-2010): a multicore CPU (cores with private cache hierarchies, a shared last-level cache, a memory controller, and off-chip memory) connected over a PCI bridge to a discrete GPGPU (SMs with shared memory, L2, memory controller, and its own off-chip memory). Current consumer hardware (2011-2012), e.g. Intel Sandy Bridge and AMD Fusion APUs: CPU cores and GPGPU SMs integrated on one die, sharing the on-chip last-level cache, memory controller, and off-chip memory.]
Our Goals Today
- Examine the current state of the art
- Trace the next steps of this evolution (major part)
- Lay out research opportunities
Outline
[Roadmap figure relating throughput applications, energy-efficient GPUs, lower costs and overheads, CPU-only workloads, the GPGPU, chip-integrated CPU-GPU systems, and next-generation CPU-GPU architectures.]
- Part 1: GPGPU
- Part 2: GPGPU Evolution
- Part 3: Holistic Optimizations - CPU core optimization, redundancy elimination
- Part 4: Shared Components
- Part 5: Opportunistic Optimizations
- Part 6 (Future Work): emerging technologies; power, temperature, and reliability; tools
Part 1: Progression of GPGPU Architectures
GPGPUs - 1: The fixed-function graphics era (pre-2006)
- Programmable vertex processors and programmable pixel processors
- Lots of fixed hardware blocks (assembly, geometry, z-culling…)
- Non-graphics processing was possible: represent user work as graphics tasks and trick the graphics pipeline; programming via graphics APIs; no hardware for bit-wise operations, no explicit branching…
- Imbalance in modern workloads motivated unification; vendors sensed the general-purpose opportunity
GPGPUs - 2: The unified graphics and computing era (2006-2010)
- Single programmable processor design with explicit support for both graphics and computing; computing-specific modifications (IEEE FP compliance and ECC)
- Non-graphics processing made easy: high-level programming (C, C++, Python, etc.)
- Separate GPU and CPU memory spaces; explicit GPU memory management required (see the sketch below)
- High overhead to process on the GPU: memory transfers over PCI
- Significant customer market penetration
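To make the overhead concrete, here is a minimal CUDA sketch of the discrete-GPU workflow of this era, where every kernel launch is bracketed by explicit allocation and PCI transfers (the kernel, names, and sizes are illustrative, not from the talk):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;            // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *h = new float[n];                 // host (CPU) memory
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));                            // separate GPU memory space
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer over PCI
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // transfer results back
        cudaFree(d);
        delete[] h;
        return 0;
    }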
GPGPUs - 3: The chip-integrated CPU-GPU era (2011 onwards)
- Multicore CPU + GPGPU on the same die; shared last-level cache, memory controller, and main memory system
- Chip integration advantages: lower total system costs, shared hardware blocks improve utilization, lower latency, higher bandwidth
- Continued improvements in programmability; standardization efforts (OpenCL and DirectCompute)
Contemporary GPU Architecture (Lindholm et al., IEEE Micro 2007 / Wittenbrink et al., IEEE Micro 2011)
[Figure: a contemporary discrete system. The multicore CPU (cores with private cache hierarchies, a shared last-level cache, and a memory controller to off-chip memory) connects over a PCI bridge to the GPGPU. Inside the GPU, many SMs (each with shared memory) communicate through an interconnect with multiple L2 cache slices, each paired with its own memory controller and DRAM channel.]
SM Architecture (Lindholm et al., IEEE Micro 2007 / Wittenbrink et al., IEEE Micro 2011)
[Figure: SM organization - warp scheduler, banked register file, operand buffering, SIMT lanes (ALUs, SFUs, MEM, and TEX units), and shared memory / L1 cache.]
Multi-threading and Warp Scheduling
- Warp processing: 32 threads are grouped and processed as a warp; a single instruction is fetched and issued per warp; lots of active threads per SM (Fermi: 1536 threads in 48 warps)
- Hardware multithreading for latency hiding: each thread has dedicated registers (Fermi: 21 registers per thread), so register state need not be copied or restored, enabling fast switching (potentially a new warp each cycle)
- Threads are processed in-order; warps are scheduled out-of-order
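A quick CUDA illustration of the grouping (hypothetical kernel, not from the talk): consecutive threads of a block form warps of 32, and all lanes of a warp execute the same fetched instruction in lock-step.

    __global__ void warp_view(int *warp_of_thread) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = threadIdx.x / warpSize;   // warpSize == 32: threads 0-31 are warp 0, 32-63 warp 1, ...
        int lane = threadIdx.x % warpSize;   // position of this thread within its warp
        // Every lane of a warp executes this same store in lock-step (SIMT).
        warp_of_thread[tid] = warp * 100 + lane;
    }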
Example of Warp Scheduling (Lindholm et al., IEEE Micro 2007)
[Figure: the SM multithreaded instruction scheduler interleaves warps over time, e.g. Warp 1 Instruction 1, Warp 2 Instruction 1, Warp 3 Instruction 1, then Warp 2 Instruction 2, Warp 3 Instruction 2, Warp 1 Instruction 2, and so on.]
Design for Efficiency and Scalability (Nickolls et al., IEEE Micro 2010 / Keckler et al., IEEE Micro 2011)
- Amortized costs of instruction supply: single-instruction multiple-thread (SIMT) model
- Efficient data supply: large register files, managed locality (via shared memories)
- Lack of global structures: no out-of-order processing
- High utilization with hardware multithreading
- Biggest tradeoff: programmability - exposed microarchitecture with frequent changes, and the programmer has to manage data
Scalability (Lee et al., ISCA 2010 / Nickolls et al., IEEE Micro 2010 / Keckler et al., IEEE Micro 2011, and other public sources)
- Double-precision performance up 10x in 3 generations
- Memory structures growing slower than ALUs (22.5x); memory bandwidth even slower (2.2x in 4 generations)
- Clearly favors workloads with high arithmetic intensity (see the definition below)
- CPU performance gap increasing rapidly: the double-precision performance gap has grown from 2x to 9x
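Arithmetic intensity here is the usual roofline-model quantity; as a reminder (standard definition, not specific to the talk):

    \text{Arithmetic Intensity (AI)} = \frac{\text{arithmetic operations performed}}{\text{bytes moved to/from memory}},
    \qquad
    \text{attainable FLOP/s} \approx \min\bigl(\text{peak FLOP/s},\; \text{AI} \times \text{memory bandwidth}\bigr)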
Part 2: GPGPU Evolution - Towards Better GPGPU
Control-flow Divergence Losses (Fung et al., Micro 2007)
[Figure: a warp reaches a divergent branch with a full mask (1111); the lanes on Path A execute its instructions while the Path B lanes are masked off, then vice versa, until the paths reconverge at the merge point. The result is low SIMT-lane utilization between the diverge and converge points.]
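A minimal CUDA kernel showing the pattern (hypothetical example, not from the talk): odd and even lanes of the same warp take different sides of the branch, so the two paths are serialized with part of the warp masked off on each.

    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Odd and even lanes of the same warp take different paths, so Path A and
        // Path B execute one after the other, each at roughly 50% lane utilization.
        if (i % 2 == 0) {
            x[i] = x[i] * 2.0f;   // Path A
        } else {
            x[i] = x[i] + 1.0f;   // Path B
        }
    }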
Dynamic Warp Formation (Fung et al., Micro 2007)
- Key insight: several warps sit at the same diverge point, so threads from the same execution path can be combined dynamically, generating warps on the fly
- 20.7% improvement @ 4.7% area overhead
[Figure: in the original scheme, Warp 0 and Warp 1 each execute Path A and then Path B with partially masked lanes; with DWF, 2 new warps are dynamically formed from the 4 original warp fragments - one combined warp (0+1) for Path A and one for Path B.]
Dynamic Warp Formation Intricacies (Fung et al., Micro 2007)
- Needs several warps at the same execution point: "majority" warp scheduling policy
- Need for lane-awareness: register files are banked per lane, so threads of the dynamic warp must be spread out across lanes; lane-aware formation keeps each thread over its own bank and simplifies the design
[Figure: register file access patterns for static warps, for lane-aware dynamic warp formation (one access per bank/ALU lane), and for formation without lane awareness (conflicting accesses to the same bank).]
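A host-side C++ sketch of the regrouping step under my own simplifications (4-wide warps, one divergent branch, lane-aware packing); it illustrates the idea rather than the paper's hardware:

    #include <vector>

    constexpr int kLanes = 4;                 // SIMT width (illustrative; real warps are 32 wide)

    struct Warp { int thread[kLanes]; };      // thread id per lane, -1 = lane inactive

    // Lane-aware dynamic warp formation: threads that took `path` are packed into new
    // warps, but each thread stays in its original lane so it still reads its own
    // register-file bank.
    std::vector<Warp> form_warps(const std::vector<Warp> &in, const std::vector<bool> &took_path) {
        std::vector<Warp> out;
        for (const Warp &w : in) {
            for (int lane = 0; lane < kLanes; ++lane) {
                int t = w.thread[lane];
                if (t < 0 || !took_path[t]) continue;
                // Find an output warp whose slot for this lane is still free.
                Warp *dst = nullptr;
                for (Warp &cand : out)
                    if (cand.thread[lane] < 0) { dst = &cand; break; }
                if (!dst) { out.push_back(Warp{{-1, -1, -1, -1}}); dst = &out.back(); }
                dst->thread[lane] = t;
            }
        }
        return out;
    }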
Large Warp Microarchitecture (Narasiman et al., Micro 2011)
- Similar idea of generating dynamic warps, but differs in the creation method
- The machine is organized as large warps, bigger than the SIMT width
- SIMT-width warps are dynamically created from within the large warp
[Figure: a large warp's activity mask over time (T = 0 to 3); each step, active bits are picked column by column to form a densely packed sub-warp, and the chosen bits are cleared from the mask.]
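A C++ sketch of the packing step, assuming the large warp is stored as a rows-by-lanes activity mask (the sizes and representation are my own, chosen for illustration):

    #include <array>

    constexpr int kLanes = 4;   // SIMD width (illustrative)
    constexpr int kRows  = 4;   // large warp = kRows * kLanes threads

    using ActivityMask = std::array<std::array<bool, kLanes>, kRows>;

    // Form one SIMD-width sub-warp: for every lane (column), grab the first still-active
    // row and clear its bit. Returns the row chosen per lane (-1 if the lane stays idle).
    std::array<int, kLanes> pack_subwarp(ActivityMask &mask) {
        std::array<int, kLanes> row_for_lane;
        for (int lane = 0; lane < kLanes; ++lane) {
            row_for_lane[lane] = -1;
            for (int row = 0; row < kRows; ++row) {
                if (mask[row][lane]) {
                    mask[row][lane] = false;   // this thread is now issued
                    row_for_lane[lane] = row;
                    break;
                }
            }
        }
        return row_for_lane;
    }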
Two-Level Scheduling (Narasiman et al., Micro 2011)
- Typical warp scheduling scheme is round-robin, which is beneficial because it exploits data locality across warps
- But all warps then tend to reach long-latency operations at the same time, and latency cannot be hidden because everyone is waiting
- Solution: group warps into several sets; schedule warps within a single set round-robin to still exploit data locality, and switch to another set when all warps of a set hit long-latency operations (see the sketch below)
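A toy C++ sketch of the two-level policy under my own simplifications (a warp is modeled as simply ready or stalled on a long-latency operation; group sizes and names are illustrative, not from the paper):

    #include <vector>

    struct WarpState { bool stalled; };        // true while waiting on a long-latency op

    struct TwoLevelScheduler {
        std::vector<std::vector<int>> groups;  // warp ids partitioned into sets (fetch groups)
        int active_group = 0;                  // level 1: which set may issue
        int rr_cursor = 0;                     // level 2: round-robin position inside it

        // Pick the next warp to issue, or -1 if every warp in every set is stalled.
        int next(const std::vector<WarpState> &warps) {
            for (size_t g = 0; g < groups.size(); ++g) {
                int gid = (active_group + g) % groups.size();       // switch sets only when needed
                const std::vector<int> &grp = groups[gid];
                for (size_t i = 0; i < grp.size(); ++i) {
                    int slot = (rr_cursor + i) % grp.size();
                    int w = grp[slot];
                    if (!warps[w].stalled) {
                        active_group = gid;
                        rr_cursor = (slot + 1) % grp.size();        // continue round-robin next time
                        return w;
                    }
                }
                // all warps of this set hit long-latency ops -> fall through to the next set
            }
            return -1;
        }
    };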
Dynamic Warps vs. Large Warp + 2-Level Scheduling (Fung et al., Micro 2007 vs. Narasiman et al., Micro 2011)
- Dynamic warp formation gives better performance than the large warp alone: more opportunities to form warps (all warps vs. one large warp's size)
- Large warp + 2-level scheduling is better than dynamic warp formation
- 2-level scheduling can also be applied together with dynamic warp formation
Part 3: Holistically Optimized CPU Designs
Motivation to Rethink CPU Design (Arora et al., in submission to IEEE Micro 2012)
- Heterogeneity works best when each composing core runs a subset of codes well (Kumar et al., PACT 2006); the GPGPU is already an example of this
- The CPU need not be fully general-purpose; it is sufficient to optimize it for non-GPU code
- The CPU undergoes a "holistic optimization": the code expected to run on the CPU is very different, so we start by investigating the properties of this code
Benchmarks
- Took important computing applications and partitioned them over the CPU and GPU
- Partitioning knowledge mostly based on expert information: either used publicly available source code or details from publications; performed our own CUDA implementations for 3 benchmarks
- Also used serial and parallel programs with no known GPU implementations as CPU-only workloads
- Total of 11 CPU-heavy, 11 mixed, and 11 GPU-heavy benchmarks
Methodology
- Used a combination of two techniques: inserted start-end functions based on the partitioning information, then real-machine measurements and PIN-based simulators
- Branches categorized into 4 categories: biased (same direction), patterned (95% accuracy on a local predictor), correlated (95% accuracy on gshare), hard (remaining)
- Loads and stores characterized into 4 categories: static (same address), strided (95% accuracy on a stride prefetcher), patterned (95% accuracy on a Markov predictor), hard (remaining)
- Thread-level parallelism is the speedup on a 32-core machine (see the classification sketch below)
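For concreteness, a small C++ sketch of how the branch categories could be assigned once per-branch predictor accuracies have been collected (the struct, function name, and the strict "same direction" test are my own illustration):

    #include <string>

    struct BranchStats {
        double taken_fraction;      // fraction of dynamic executions that were taken
        double local_accuracy;      // accuracy of a local-history predictor for this branch
        double gshare_accuracy;     // accuracy of a gshare predictor for this branch
    };

    // Buckets mirror the talk's categories: biased, patterned, correlated, hard.
    std::string classify_branch(const BranchStats &s) {
        if (s.taken_fraction == 0.0 || s.taken_fraction == 1.0) return "biased";   // always same direction
        if (s.local_accuracy  >= 0.95) return "patterned";
        if (s.gshare_accuracy >= 0.95) return "correlated";
        return "hard";
    }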
Results – CPU Time
- Conservative speedups are capped at 10x
- More time is being spent on the CPU than the GPU
Results – Instruction Level Parallelism
- Drops in 17/22 apps (11% drop for the larger window size)
- Short independent loops go to the GPU; dependence-heavy code stays on the CPU
Results – Branch Characterization
- Frequency of hard branches rises from 11.3% to 18.6%
- Occasional effects of data-dependent branches
Results – Loads
- Reduction in strided loads; increase in hard loads
- Occasional GPU mapping of irregular-access kernels
Results – Vector Instructions
- SSE usage drops to almost half
- GPUs and SSE extensions target the same regions of code
Results – Thread Level Parallelism
- GPU-heavy benchmarks are worst hit (from 14x down to 2.1x); overall 40-60% drops
- The majority of benchmarks have almost no post-GPU TLP
- Going from 8 cores to 32 cores has only a 10% benefit
Impact: CPU Core Directions
- Larger instruction windows will have muted gains
- Considerably increased pressure on the branch predictor; need to adopt better-performing techniques (e.g., L-TAGE, Seznec et al.)
- Memory accesses will continue to be a major bottleneck
- Stride or next-line prefetching is almost irrelevant; need to apply techniques that capture complex patterns
- Lots of literature, but never adopted on real machines (e.g., Markov prediction, helper-thread prefetching)
Impact: Redundancy Elimination
- SSE is rendered significantly less important: every core need not have it; cores could share SSE hardware
- Extra CPU cores are not of much use because of the lack of TLP
- A few bigger cores with a focus on addressing highly irregular code will improve performance
Part 4: Shared Component Designs
Optimization of Shared Structures
[Figure: integrated die with CPU cores (latency-sensitive) and GPGPU SMs (potentially latency-insensitive but bandwidth-hungry) sharing the on-chip last-level cache, memory controller, and off-chip memory.]
TAP: TLP-Aware Shared LLC Management (Lee et al., HPCA 2012)
- Insight 1: GPU cache misses/hits may or may not impact performance; misses only matter if there is not enough latency hiding, and allocated capacity is useless if there is abundant parallelism. Measure cache sensitivity to performance with a core sampling controller.
- Insight 2: the GPU causes a lot more cache traffic than the CPU; allocation schemes typically allocate based on the number of accesses, so normalization is needed for the larger number of GPU accesses (cache block lifetime normalization).
TAP Design - 1
- Core sampling controller: GPUs usually run the same workload on all cores, so use different cache policies on 2 of the cores (e.g., LRU on one, MRU on the other) and measure the performance difference
- Cache block lifetime normalization: count the number of cache accesses for all CPU and GPU workloads and calculate the ratios of access counts across workloads
TAP Design - 2
- Utility-based Cache Partitioning (UCP): a dynamic cache way allocation scheme that allocates ways based on an application's expected gain from additional space (utility); uses the cache hit rate to calculate utility and cache access rates to calculate cache block lifetime
- TLP-Aware Utility-based Cache Partitioning (TAP-UCP): uses the core sampling controller information to allocate ways based on performance sensitivity rather than hit rate, and normalizes access rates to reduce GPU workload weight
- 5% better performance than UCP, 11% over LRU (see the sketch below)
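To make the way-allocation idea concrete, a greedy utility-based partitioning sketch in C++ with a crude TLP-aware twist: applications that core sampling finds cache-insensitive have their utility zeroed. The data layout, the sensitivity flag, and the greedy loop are my own simplifications, not the TAP or UCP hardware.

    #include <vector>

    // utility[a][w] = expected hits for application a when given w ways (w = 0..kWays).
    constexpr int kWays = 16;

    std::vector<int> partition_ways(std::vector<std::vector<double>> utility,
                                    const std::vector<bool> &cache_sensitive) {
        // TLP-aware step: an application that core sampling found insensitive to the
        // cache (e.g., a GPU workload with ample latency hiding) gets a flat utility curve.
        for (size_t a = 0; a < utility.size(); ++a)
            if (!cache_sensitive[a])
                for (double &u : utility[a]) u = 0.0;

        std::vector<int> ways(utility.size(), 0);
        for (int allocated = 0; allocated < kWays; ++allocated) {
            // Give the next way to whichever application gains the most from it.
            size_t best = utility.size();       // sentinel: none chosen yet
            double best_gain = -1.0;
            for (size_t a = 0; a < utility.size(); ++a) {
                if (ways[a] >= kWays) continue; // cannot take more ways
                double gain = utility[a][ways[a] + 1] - utility[a][ways[a]];
                if (gain > best_gain) { best_gain = gain; best = a; }
            }
            if (best == utility.size()) break;
            ways[best] += 1;
        }
        return ways;
    }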
QoS-Aware Memory Bandwidth Partitioning (Jeong et al., DAC 2012)
- Typical memory controller policy: always prioritize the CPU, since the CPU is latency-sensitive and the GPU is not
- However, this can slow down GPU traffic - a problem for real-time applications (graphics)
- Static management policies are problematic, so the authors propose a dynamic management scheme
- The default is still to prioritize CPU requests over GPU requests
- Periodically measure the current rate of progress on the frame (work is decomposed into smaller tiles, so measurement is simple) and compare it with the target frame rate
- If the current frame rate is slower than the target, set CPU and GPU priorities equal
- If close to the deadline and still behind, boost GPU request priority even further (see the sketch below)
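A small C++ sketch of such a dynamic priority policy, using my own illustrative thresholds and names (the paper's exact conditions and constants are not reproduced here):

    enum class GpuPriority { BelowCpu, EqualToCpu, AboveCpu };

    // Decide the GPU's memory-request priority for the next interval.
    //   frame_progress: fraction of the current frame's tiles completed (0..1)
    //   time_elapsed:   fraction of the frame deadline already used (0..1)
    GpuPriority gpu_priority(double frame_progress, double time_elapsed) {
        bool behind_schedule = frame_progress < time_elapsed;   // slower than the target rate
        bool near_deadline   = time_elapsed > 0.9;              // illustrative threshold
        if (behind_schedule && near_deadline) return GpuPriority::AboveCpu;   // boost further
        if (behind_schedule)                  return GpuPriority::EqualToCpu;
        return GpuPriority::BelowCpu;                           // default: CPU first
    }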
Part 5: Opportunistic Optimizations
Opportunistic Optimizations
- Chip integration advantages: lower latency and new communication paths (e.g., the shared L2)
- Opportunity for non-envisioned usage: using idle resources to help active execution
- Idle GPU helps the CPU; idle CPU helps the GPU
Idle GPU Shader-Based Prefetching (Woo et al., ASPLOS 2010)
- Realization: advanced prefetching has not been adopted because of high storage costs
- The GPU system can have exploitable idle resources, so use idle GPU shader resources: register files as prefetcher storage, execution threads as the logic structures, and parallel prefetcher execution threads to improve latency
- Propose an OS-based enabling and control interface: the Miss Address Provider
- Library of prefetchers with application-specific selection
- Prefetching performance benefit of 68%
Miss Address Provider
[Figure: the MAP sits alongside the shared on-chip last-level cache, between the CPU cores and the SMs. It records the miss PC and miss address, and holds a shader pointer and a command buffer. The OS allocates an idle GPU core, miss information is forwarded to it, the GPU core stores and processes the miss stream, and the data it prefetches lands in the shared LLC.]
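A sketch of the forwarding interface implied by the figure, written as plain C structures (the field widths, queue size, and exact command format are assumptions, not from the paper):

    #include <cstdint>

    // One LLC miss observed on the CPU side and forwarded to the idle GPU core.
    struct MissRecord {
        uint64_t miss_pc;        // PC of the load/store that missed
        uint64_t miss_address;   // the missing cache-line address
    };

    // Miss Address Provider: the OS fills in which shader (prefetcher kernel) handles
    // the stream; miss records are queued in the command buffer for that shader.
    struct MissAddressProvider {
        uint64_t   shader_pointer;      // entry point of the prefetcher on the idle GPU core
        MissRecord command_buffer[64];  // illustrative fixed-size queue of forwarded misses
        uint32_t   head, tail;
    };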
CPU-Assisted GPGPU Processing (Yang et al., HPCA 2012)
- Use idle CPU resources to prefetch for GPGPU applications, targeting bandwidth-sensitive GPGPU applications
- A compiler-based framework converts GPU kernels into a CPU prefetching program
- The CPU must run ahead of the GPU by an appropriate amount: if it falls too far behind, the CPU cache hit rate will be very high; if it runs too far ahead, the GPU cache hit rate will be very low
- Very few CPU cycles are required since an LLC line is large
- Prefetching performance benefit of 21%
Example GPU Kernel and CPU Program

    // GPU kernel: requests for a single thread (one element per thread).
    __global__ void VecAdd(float *A, float *B, float *C, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

    // CPU-side stand-in for one thread's memory requests (computation stripped).
    float mem_fetch(const float *A, const float *B, const float *C, int n) {
        return A[n] + B[n] + C[n];
    }

    // CPU prefetching program generated from the kernel (schematic, as on the slide).
    void cpu_prefetching(…) {
        unroll_factor = 8;                        // unroll_factor artificially boosts CPU requests
        // Traverse through all thread blocks (TB), Concurrent_TB at a time.
        for (j = 0; j < N_TB; j += Concurrent_TB)
            // Loop to traverse the concurrent threads (Concurrent_TB * TB_Size of them).
            for (i = 0; i < Concurrent_TB * TB_Size; i += skip_factor * batch_size * unroll_factor) {
                for (k = 0; k < batch_size; k++) {   // batch_size controls how often skip_factor is updated
                    id = i + skip_factor * k * unroll_factor + j * TB_Size;
                    // Unrolled loop of requests.
                    float a0 = mem_fetch(A, B, C, id + skip_factor * 0);
                    float a1 = mem_fetch(A, B, C, id + skip_factor * 1);
                    . . .
                    sum += a0 + a1 + . . .;
                }
                update skip_factor;               // skip_factor controls CPU timing relative to the GPU
            }
    }
Drawbacks: CPU-Assisted GPGPU Processing
- Does not consider the effects of thread block scheduling
- The CPU program is stripped of the actual computations, so memory requests from data- or computation-dependent paths are not considered
Part 6: Future Work - emerging technologies; power, temperature, and reliability; tools
Continued System Optimizations
- Continued holistic optimizations: understand the impact of GPU workloads on CPU requests to the memory controller
- Continued opportunistic optimizations: the latest GPUs allow different kernels to run on the same GPU - can GPU threads prefetch for other GPU kernels?
Research Tools
- Severe lack of GPU research tools: no GPU power model, no GPU temperature model
- Immediate and impactful opportunities
Power, Temperature and Reliability
- Bounded by the lack of power tools
- No work yet on effective power management or effective temperature management
Emerging Technologies
- Impact of non-volatile memories on GPUs
- 3D die-stacked GPUs
- Stacked CPU-GPU-main-memory systems
Conclusions
- In this work we looked at the CPU-GPU research landscape
- GPGPU systems are quickly scaling in performance
- The CPU needs to be refocused to handle extremely irregular code
- The design of shared components needs to be rethought
- Abundant optimization and research opportunities!
Questions?
Backup Slides
Results – Stores
- Similar trends as loads, but slightly less pronounced
Results – Branch Prediction Rates
- Hard branches translate to higher misprediction rates
- Strong influence of CPU-only benchmarks