GPU Power Model
Nandhini Sudarsanan, Nathan Vanderby, Neeraj Mishra, Vinodh, Chi Xu
CSCI 8205: GPU Power Model
Outline
Introduction and Motivation
Analytical Model Description
Experiment Setup
Results
Conclusion and Further Work
5/4/11
Introduction
Develop a methodology for building an accurate power model for a GPU.
Validate with an NVIDIA GTX 480 GPU.
Measure power efficiency of various NVIDIA SDK benchmarks.
An accurate power model can help:
  Explore various architectural and algorithmic trade-offs.
  Figure out the balance of workload between GPU and CPU.
Motivation
Power consumption: a key criterion for future hardware devices and embedded software.
The effect of increased power density has not been felt until now:
  Supply voltage was scaled back too.
  Current and power density remained constant.
Further reduction in supply voltage will be difficult in the future:
  Supply voltage is approaching the threshold voltage.
  Gate oxide thickness is almost equal to 1 nm.
Motivation
GPU Processing Power
Price of Power
Maximum load = a lot of power
  NVIDIA 8800 GTX: 137 W
  Intel Xeon LS5400: 50 W
Power Wall
Power density in GPUs is larger than in even high-end CPUs.
Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].
Power gating, clock gating, and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].
An accurate power model can help:
  Explore various architectural and algorithmic trade-offs.
  Figure out the balance of workload between GPU and CPU.
Background
Power consumption can be divided into:
  Power = Dynamic_power + Static_power + Short_Ckt_Power
Dynamic power is determined by run-time events:
  Fixed-function units: texture filtering and rasterization
  Programmable units: memory and floating point
Static power is determined by circuit technology, chip layout, and operating temperature:
  P_static = V_CC * N * K_design * I_leak
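The decomposition above can be written out as a small Python sketch. It assumes the usual reading of the static power formula (N as the transistor count, K_design as a design-dependent factor, I_leak as the leakage current per transistor). Every number below is an illustrative placeholder, not a measured GTX 480 value, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StaticPowerParams:
    """Inputs to P_static = V_CC * N * K_design * I_leak (placeholder values only)."""
    vcc: float            # supply voltage in volts
    n_transistors: float  # N: number of transistors
    k_design: float       # K_design: design-dependent factor (dimensionless)
    i_leak: float         # I_leak: leakage current per transistor in amperes


def static_power(p: StaticPowerParams) -> float:
    """Static (leakage) power in watts."""
    return p.vcc * p.n_transistors * p.k_design * p.i_leak


def total_power(dynamic: float, static: float, short_circuit: float) -> float:
    """Power = Dynamic_power + Static_power + Short_Ckt_Power, all in watts."""
    return dynamic + static + short_circuit


# Illustrative placeholders, chosen only to make the arithmetic easy to follow.
params = StaticPowerParams(vcc=1.0, n_transistors=3.0e9, k_design=0.5, i_leak=2.0e-8)
p_static = static_power(params)
p_total = total_power(dynamic=150.0, static=p_static, short_circuit=5.0)

print(f"static power: {p_static:.1f} W")
print(f"total power:  {p_total:.1f} W")
```

Because I_leak grows with temperature, a model built this way would normally treat it as a function of the measured die temperature rather than a constant.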
Previous Power Models
Statistical power modeling approach for GPUs [Matsuoka 2010]
  Uses 13 CUDA performance counters (ld, st, branch, tlb miss) to obtain a profile.
  Finds the correlation between profiles and power by statistical model learning.
  Information not captured by the counters is lost.
Cycle-level simulation based power model [Skadron, HWWS'04]
  Assumes a hypothetical architecture to explore new GPU microarchitectures and model power and leakage properties.
  Cycle-level processor simulations are time consuming [Martonosi & Isci 2003].
  They do not allow a complete view of operating system effects and I/O [Isci 2003].
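To make the statistical approach concrete, here is a minimal sketch of fitting power as a linear function of counter values by ordinary least squares. It is not the cited authors' method: the two counters, the synthetic data, and all function names are assumptions made for illustration, and a real profile would use many more counters and measured power.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_power_model(profiles, powers):
    """Least-squares fit of power ~= w0 + sum(w_i * counter_i) via the normal equations."""
    rows = [[1.0] + list(p) for p in profiles]  # leading 1.0 models the constant term
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, powers)) for i in range(k)]
    return solve_linear_system(ata, atb)


def predict(weights, profile):
    """Predicted power in watts for one counter profile."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], profile))


# Synthetic profiles: (load rate, branch rate) per kernel, in arbitrary units.
profiles = [(10.0, 4.0), (20.0, 8.5), (5.0, 30.0), (40.0, 2.0), (15.0, 15.0)]
# Synthetic "measurements" generated from power = 50 + 2.0 * ld + 0.5 * branch.
powers = [50.0 + 2.0 * ld + 0.5 * br for ld, br in profiles]

weights = fit_power_model(profiles, powers)
print([round(w, 3) for w in weights])
```

The fit can only explain power that correlates with the chosen counters, which is the weakness noted above: activity that no counter observes is invisible to the model.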
Outline
Introduction and Motivation
Analytical Model Description
  Parser
  Power Model
Experiment Setup
Results
Conclusion and Further Work
Need for a Parser
GPGPU-Sim is time consuming.
GPGPU-Sim output is not tailored to our needs.
The parser is very fast.
GPGPU-Sim works only with CUDA 2.3 or earlier.
Limitations of the Parser
Dynamic loops are not automatically determined.
Branches are assumed to be taken.
Highly tailored to our specific needs.
A change in the PTX layout might require changes to the parser.
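A static PTX parser of the kind described here can be sketched in a few lines of Python. This is not the presenters' parser: the PTX fragment is hand-written for illustration (not compiler output), the function names are hypothetical, and the sketch shares the stated limitations, since it counts each instruction once and knows nothing about loop trip counts or branch outcomes.

```python
import re
from collections import Counter

# Hand-written PTX-like fragment, for illustration only.
PTX_SAMPLE = """
.visible .entry vec_add(.param .u64 a)
{
    .reg .f32 %f<4>;
    ld.param.u64 %rd1, [a];
    ld.global.f32 %f1, [%rd1];
    ld.global.f32 %f2, [%rd1+4];
    add.f32 %f3, %f1, %f2;      // arithmetic
    mul.f32 %f3, %f3, %f1;
    st.global.f32 [%rd1+8], %f3;
    @%p1 bra BB0_2;
BB0_2:
    ret;
}
"""

# Optional guard predicate (e.g. "@%p1" or "@!%p1"), then the dotted opcode.
INSTRUCTION = re.compile(r"^(?:@!?%\w+\s+)?([a-z][\w.]*)")


def count_ptx_instructions(ptx_text):
    """Count full opcodes (e.g. 'ld.global.f32'), skipping directives, braces, and labels."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing comments
        if not line or line.startswith((".", "{", "}")) or line.endswith(":"):
            continue
        match = INSTRUCTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_by_base_opcode(counts):
    """Collapse 'ld.global.f32' and 'ld.param.u64' into a single 'ld' bucket."""
    grouped = Counter()
    for opcode, n in counts.items():
        grouped[opcode.split(".")[0]] += n
    return grouped


full_counts = count_ptx_instructions(PTX_SAMPLE)
base_counts = count_by_base_opcode(full_counts)
print(dict(base_counts))
```

Because it depends on line-oriented patterns, a change in how PTX is laid out would break it, which is the fragility listed above.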
Fermi Architecture: sm_20
Memory Hierarchy
  PCIe and RAM
  L2 cache
  L1 cache
  Shared memory
  Registers
Streaming Multiprocessor
  32 ALUs, 32 FPUs, 4 SFUs
  2 pipelines, 16-24 stages
  2 warp schedulers, 2 instructions/cycle
Factors in the Power Model
Temperature
Number of SMs
Power Model
Assembly Level
Experiment Setup - Hardware
Measure power consumption and temperature
  Sample temperature at 10 Hz from the GPU sensor
  Current clamps on the PCIe and GPU power cables
  Data acquisition card sampling at 100 Hz
GPU performance counters
  Profile 57 counters per kernel
  9 executions
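Turning clamp readings into power is a matter of multiplying each current sample by its rail voltage and summing over the rails. The sketch below assumes both measured rails sit at a nominal 12 V and uses made-up current samples. The rail voltage, the sample values, and the function names are assumptions, not values from the experiment.

```python
SAMPLE_RATE_HZ = 100.0   # data acquisition rate stated in the setup
RAIL_VOLTAGE_V = 12.0    # assumed nominal voltage for both measured rails


def average_power(currents_a, rail_voltage_v=RAIL_VOLTAGE_V):
    """Mean power in watts for one rail, given current samples in amperes."""
    return rail_voltage_v * sum(currents_a) / len(currents_a)


def energy_joules(currents_a, rail_voltage_v=RAIL_VOLTAGE_V, sample_rate_hz=SAMPLE_RATE_HZ):
    """Energy in joules for one rail: sum of V * I * dt over all samples."""
    dt = 1.0 / sample_rate_hz
    return sum(rail_voltage_v * i * dt for i in currents_a)


# Made-up samples in amperes, one list per clamp.
pcie_slot_currents = [4.0, 4.2, 4.1, 4.3]
power_cable_currents = [12.0, 12.5, 12.2, 12.3]

total_avg_power = average_power(pcie_slot_currents) + average_power(power_cable_currents)
total_energy = energy_joules(pcie_slot_currents) + energy_joules(power_cable_currents)

print(f"average power: {total_avg_power:.1f} W")
print(f"energy over {len(pcie_slot_currents) / SAMPLE_RATE_HZ:.2f} s: {total_energy:.3f} J")
```

At 100 Hz a kernel has to run for a reasonably long time before the average settles, which is why short kernels are a poor fit for this kind of measurement.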
Experiment Setup - Software
Driver API
PTX-level micro-benchmarks
  Minimize control loops
  Stress one type of PTX instruction per kernel (over 95% of the kernel)
  76 kernels
  Choose block and grid sizes wisely
CUDA 4.0 built-in binary-to-assembly converter (cuobjdump)
Timer interrupt to collect temperature
Remote login
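The "over 95%" target constrains how far a micro-benchmark loop must be unrolled. The sketch below assumes a hypothetical loop overhead of three instructions (counter increment, compare, branch) and generates a PTX-like loop body for one stressed instruction. The overhead instructions, register names, and function names are illustrative assumptions, not the actual benchmark kernels.

```python
# Hypothetical per-iteration loop overhead: increment, compare, conditional branch.
LOOP_OVERHEAD = [
    "add.s32 %r1, %r1, 1;",
    "setp.lt.s32 %p1, %r1, %r2;",
    "@%p1 bra LOOP;",
]


def min_unroll_for_share(target_percent, overhead_count):
    """Smallest unroll factor whose share of the loop body strictly exceeds target_percent."""
    unroll = 1
    # Integer arithmetic avoids floating-point edge cases at exactly the target share.
    while unroll * 100 <= target_percent * (unroll + overhead_count):
        unroll += 1
    return unroll


def make_microbenchmark_body(instruction, operands, unroll):
    """Return PTX-like lines for one loop that repeats a single instruction `unroll` times."""
    stressed = [f"{instruction} {operands};" for _ in range(unroll)]
    return ["LOOP:"] + stressed + LOOP_OVERHEAD


unroll = min_unroll_for_share(95, len(LOOP_OVERHEAD))
body = make_microbenchmark_body("add.f32", "%f1, %f1, %f2", unroll)
share = unroll / (unroll + len(LOOP_OVERHEAD))

print(f"unroll factor: {unroll}, stressed share: {share:.3f}")
print("\n".join(body[:3]))
```

This counts instructions in the loop body only. Setup code outside the loop lowers the share slightly unless the loop runs for many iterations.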
Limitations of PTX
Higher level than assembly
  30 out of 76 PTX instructions expand to multiple assembly instructions
  Divide, sqrt, etc.: one PTX line, a library routine in assembly
Compiler optimizations occur between PTX and assembly
PTX does not reflect RAW (read-after-write) dependencies
Performance counter results are based on assembly
CUDA – Fermi Architecture
Third-generation Streaming Multiprocessor (SM)
  32 CUDA cores per SM, 4x over GT200
  1024-thread block size, 2x over GT200
Unified address space enables full C++ support
Improved memory subsystem
CUDA – Fermi Architecture
Fermi Memory Hierarchy: each SM (SM-0 through SM-N) has its own registers and an L1 cache / shared memory; all SMs share the L2 cache, which sits in front of global memory.
Validation Benchmarks
Small number of overhead operations (loop counters, initialization, etc.).
Computationally intensive work, so that each experiment runs long enough for accurate current measurement.
High utilization of the CUDA cores, with as few data hazards as possible.
Grid and block sizes chosen so that all SMs are used, since idle SMs still leak power.
Accordingly, 7 benchmarks were selected from the CUDA SDK.
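One simple way to keep every SM busy is to round the grid size up to a multiple of the SM count. The sketch below uses the GTX 480's 15 SMs and Fermi's 1024-thread block limit. The rounding heuristic and the function name are assumptions for illustration, and the hardware scheduler, not this calculation, decides the real block-to-SM mapping.

```python
NUM_SMS = 15                  # GTX 480: 480 CUDA cores / 32 cores per SM
MAX_THREADS_PER_BLOCK = 1024  # Fermi thread block limit
WARP_SIZE = 32


def choose_launch_config(total_threads, threads_per_block=256, num_sms=NUM_SMS):
    """Return (grid_size, block_size) with grid_size rounded up to a multiple of num_sms."""
    if threads_per_block % WARP_SIZE != 0:
        raise ValueError("block size should be a multiple of the warp size")
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size exceeds the per-block thread limit")
    # Ceiling division: enough blocks to cover every thread.
    blocks_needed = -(-total_threads // threads_per_block)
    # Round up so each SM can be given the same number of blocks.
    grid_size = -(-blocks_needed // num_sms) * num_sms
    return grid_size, threads_per_block


grid, block = choose_launch_config(100_000)
print(f"grid = {grid} blocks, block = {block} threads, {grid // NUM_SMS} blocks per SM")
```

Rounding up launches more threads than there are work items, so a kernel launched this way needs a bounds check on its global thread index.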
Validation Benchmarks
Our benchmarks:
  2D convolution
  Matrix multiplication
  Vector addition
  Vector reduction
  Scalar product
  DCT 8x8
  3DFD
Results
Conclusion and Further Work
Conclusion
Further Work
  Take context switches into account
  Consider multiple kernels running simultaneously
The End
Thanks
Q&A