
GPU Power Model

Nandhini Sudarsanan, Nathan Vanderby

Neeraj Mishra, Vinodh

Chi Xu





CSCI 8205: GPU Power Model


Outline

Introduction and Motivation

Analytical Model Description

Experiment Setup

Results

Conclusion and Further Work

5/4/11



Introduction

Develop a methodology for building an accurate power model for a GPU.

Validate with an NVIDIA GTX 480 GPU.

Measure power efficiency of various NVIDIA SDK benchmarks.

An accurate power model can help:
  Explore various architectural and algorithmic trade-offs.
  Figure out the balance of workload between GPU and CPU.


Motivation

Power Consumption: Key criterion for future Hardware Devices and Embedded Software.

The effect of increased power density has not been felt until now:
  Supply voltage was scaled back too.
  Current and power density remained constant.

Further reduction in supply voltage will be difficult:
  Supply voltage is approaching the threshold voltage.
  Gate oxide thickness is almost equal to 1 nm.


Motivation


GPU Processing Power


Price of Power

Maximum load = a lot of power
  NVIDIA 8800 GTX: 137 W
  Intel Xeon LS5400: 50 W


Power Wall

Power density in GPUs is larger than even in high-end CPUs.

Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].

Power gating, clock gating, and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].

An accurate power model can help:
  Explore various architectural and algorithmic trade-offs.
  Figure out the balance of workload between GPU and CPU.


Background

Power consumption can be divided into:

Power = Dynamic_power + Static_power + Short_Ckt_Power

Dynamic power is determined by run-time events:
  Fixed-function units: texture filtering and rasterization
  Programmable units: memory and floating point

Static power is determined by circuit technology, chip layout, and operating temperature:

P = VCC * N * Kdesign * Ileak
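
The decomposition above can be sketched in a few lines of Python. The slide does not define the symbols of the static-power formula, so this sketch assumes the usual reading: VCC is the supply voltage, N the transistor count, Kdesign a design-dependent constant, and Ileak the per-transistor leakage current. The function names and all numeric values are hypothetical and chosen only to exercise the formulas; they are not measurements.

```python
def static_power(vcc, n_transistors, k_design, i_leak):
    """Static power following the slide's formula: P = VCC * N * Kdesign * Ileak."""
    return vcc * n_transistors * k_design * i_leak


def total_power(dynamic, static, short_circuit):
    """Total power as the sum of the three components listed above (watts)."""
    return dynamic + static + short_circuit


# Hypothetical inputs (assumptions for illustration, not measured values).
p_static = static_power(vcc=1.0, n_transistors=3.0e9, k_design=0.5, i_leak=2.0e-8)
p_total = total_power(dynamic=120.0, static=p_static, short_circuit=5.0)
print(f"static = {p_static:.1f} W, total = {p_total:.1f} W")
```

Because Ileak grows with temperature, a sketch like this would need a temperature-dependent Ileak before it could be compared against measurements.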


Previous Power Models

Statistical power modeling approach for GPUs [Matsuoka 2010]:
  Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile.
  Finds the correlation between profiles and power by statistical model learning.
  A lot of information that the counters do not capture is lost.

Cycle-level simulation based power model [Skadron, HWWS'04]:
  Assumes a hypothetical architecture to explore new GPU microarchitectures and model power and leakage properties.
  Cycle-level processor simulations are time-consuming [Martonosi & Isci 2003].
  Does not allow a complete view of operating system effects and I/O [Isci 2003].
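
To make the statistical approach concrete, here is a minimal sketch of fitting power against counter profiles. The slides do not say which learning method the cited work uses, so ordinary least squares is an assumption here, and the function names and the tiny synthetic data set are hypothetical.

```python
def fit_linear_power_model(profiles, powers):
    """Least-squares fit of power ~ w0 + sum(w_i * counter_i).

    profiles: one counter vector per kernel (e.g. scaled ld and st counts)
    powers:   measured average power per kernel, in watts
    Returns [w0, w1, ..., wk].
    """
    # Design matrix with a leading 1 for the constant term.
    rows = [[1.0] + [float(c) for c in p] for p in profiles]
    k = len(rows[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, powers))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict_power(weights, profile):
    """Predicted power for one counter profile."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], profile))


# Synthetic data generated from power = 50 + 2*ld + 3*st (not measurements).
profiles = [(1, 0), (0, 1), (1, 1), (2, 1)]
powers = [52.0, 53.0, 55.0, 57.0]
weights = fit_linear_power_model(profiles, powers)
estimate = predict_power(weights, (3, 2))
```

A fit like this can only explain power through the events the counters expose, which is the limitation noted above.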


Outline


Analytical Model Description
  Parser
  Power Model

Experiment Setup

Results



Need for a Parser

GPGPU-Sim is time-consuming.

GPGPU-Sim output is not tailored to our needs.

The parser is very fast.

GPGPU-Sim works only with CUDA 2.3 or earlier.


Limitations of the Parser

Dynamic loops are not automatically determined.

Branches are always assumed to be taken.

Highly tailored to our specific needs.

A change in the PTX layout might require changes to the parser.
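
As an illustration of what such a parser involves, the sketch below counts PTX instructions per base opcode. It is not the parser described in these slides: the regular expressions, the function name, and the `loop_trip_counts` argument are hypothetical. Loop trip counts are supplied by hand, since dynamic loops cannot be determined from the PTX text alone.

```python
import re
from collections import Counter

LABEL = re.compile(r"^([$\w]+):$")
# Optional predicate guard (e.g. "@%p1"), base opcode, type suffixes, operands.
INSTR = re.compile(r"^(?:@!?%\w+\s+)?([a-z]\w*)[\w.]*\s*(.*);$")


def count_ptx_instructions(ptx_text, loop_trip_counts=None):
    """Count instructions per base opcode, weighting hand-annotated loops."""
    loop_trip_counts = loop_trip_counts or {}
    items = []  # (kind, name, operands) in program order
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()
        if not line or line.startswith((".", "{", "}")):
            continue  # skip blanks, directives (.reg, .entry, ...), braces
        m = LABEL.match(line)
        if m:
            items.append(("label", m.group(1), ""))
            continue
        m = INSTR.match(line)
        if m:
            items.append(("instr", m.group(1), m.group(2).strip()))
    weights = [1] * len(items)
    label_pos = {n: i for i, (kind, n, _) in enumerate(items) if kind == "label"}
    for j, (kind, name, target) in enumerate(items):
        if kind == "instr" and name == "bra" and target in label_pos:
            i = label_pos[target]
            if i < j and target in loop_trip_counts:  # backward branch = loop
                for k in range(i, j + 1):
                    weights[k] *= loop_trip_counts[target]
    counts = Counter()
    for (kind, name, _), w in zip(items, weights):
        if kind == "instr":
            counts[name] += w
    return counts


example_ptx = """
.visible .entry k()
{
    .reg .f32 %f<4>;
    mov.f32 %f1, 0f3F800000;   // one-time setup
LOOP:
    add.f32 %f1, %f1, %f1;
    mul.f32 %f2, %f1, %f1;
    setp.lt.f32 %p1, %f2, %f3;
    @%p1 bra LOOP;
    ret;
}
"""
counts = count_ptx_instructions(example_ptx, loop_trip_counts={"LOOP": 10})
```

Because the patterns are tied to one textual layout of PTX, a sketch like this shares the last limitation listed above: a change in the layout means changing the patterns.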


Outline



Experiment Setup

Results



Fermi Architecture: sm_20

Memory Hierarchy
  PCIe and RAM
  L2 cache
  L1 cache
  Shared memory
  Registers

Streaming Multiprocessor
  32 ALUs, 32 FPUs, 4 SFUs
  2 pipelines, 16-24 stages
  2 warp schedulers, 2 instructions per cycle



Factors in the Power Model

Temperature

Number of SMs


Power Model

Assembly Level


Experiment Setup - Hardware

Measure power consumption and temperature
  Sample temperature at 10 Hz from the GPU sensor
  Current clamp on the PCIe and GPU power cables
  Data acquisition card at 100 Hz

GPU performance counters
  Profile 57 counters per kernel
  9 executions
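
Turning the sampled currents into power can be sketched as follows. This is an illustrative calculation, not the authors' acquisition code: it assumes both clamps sit on nominal 12 V rails (an assumption, since the slides do not state the rail voltages), and the sample values are made up.

```python
def average_power(current_samples_a, rail_voltage_v):
    """Mean power in watts from current-clamp samples on one supply rail."""
    return rail_voltage_v * sum(current_samples_a) / len(current_samples_a)


def energy_joules(current_samples_a, rail_voltage_v, sample_rate_hz):
    """Energy over the run, holding each sample for one sampling period."""
    dt = 1.0 / sample_rate_hz
    return sum(rail_voltage_v * i * dt for i in current_samples_a)


RAIL_V = 12.0          # assumed nominal rail voltage
SAMPLE_RATE_HZ = 100   # data acquisition rate from the setup above

# Made-up current samples in amperes for the two measured paths.
pcie_slot_a = [2.0, 2.5, 3.0, 2.5]
power_cable_a = [10.0, 12.0, 14.0, 12.0]

total_avg_w = (average_power(pcie_slot_a, RAIL_V)
               + average_power(power_cable_a, RAIL_V))
total_energy_j = (energy_joules(pcie_slot_a, RAIL_V, SAMPLE_RATE_HZ)
                  + energy_joules(power_cable_a, RAIL_V, SAMPLE_RATE_HZ))
```

At 100 Hz a kernel has to run long enough to yield a useful number of samples, which is why short kernels are a poor fit for this kind of measurement.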


Experiment Setup - Software

Driver API

PTX-level micro-benchmarks
  Minimize control loops
  Stress one type of PTX instruction per kernel (over 95% of the instructions)
  76 kernels
  Wisely choose block and grid size

CUDA 4.0
  Built-in binary -> assembly converter (cuobjdump)

Timer interrupt to collect Temperature

Remote login


Limitations of PTX

Higher level than assembly
  30 out of 76 PTX instructions take multiple assembly instructions
  Divide, sqrt, etc.: 1 PTX line, but a library routine in assembly

Compiler optimizations from PTX -> assembly

Doesn’t reflect RAW dependencies

Performance counter results are based on assembly


CUDA – Fermi Architecture

Third-generation Streaming Multiprocessor (SM)
  32 CUDA cores per SM, 4x over GT200
  1024-thread block size, 2x over GT200
  Unified address space enables full C++ support
  Improved memory subsystem


CUDA – Fermi Architecture

Fermi Memory Hierarchy (figure): each SM, from SM-0 to SM-N, has its own registers, L1 cache, and shared memory; all SMs share the L2 cache and global memory.



Validation Benchmarks

Small number of overhead operations (loop counters, initialization, etc.).

Computationally intensive work, so that each experiment runs long enough for accurate current measurement.

High utilization of the CUDA cores, with as few data hazards as possible.

Grid and block sizes chosen so that all SMs are used, since idle SMs still leak.

Accordingly, 7 benchmarks were selected from the CUDA SDK.
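
The grid-size criterion can be checked with a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the SM count of 15 is that of the GTX 480, and having at least one block per SM is treated as a necessary condition only, since how the hardware scheduler distributes blocks is not modeled.

```python
import math

GTX480_SM_COUNT = 15          # the GTX 480 has 15 SMs of 32 CUDA cores each
FERMI_MAX_BLOCK_SIZE = 1024   # maximum threads per block on Fermi


def grid_size(n_elements, block_size):
    """Number of blocks needed to cover n_elements with one thread each."""
    if not 1 <= block_size <= FERMI_MAX_BLOCK_SIZE:
        raise ValueError("block size out of range for Fermi")
    return math.ceil(n_elements / block_size)


def can_use_all_sms(n_blocks, n_sms=GTX480_SM_COUNT):
    """True if there are enough blocks to give every SM at least one."""
    return n_blocks >= n_sms


large = grid_size(1 << 20, 256)   # 4096 blocks
small = grid_size(2048, 256)      # 8 blocks, fewer than the 15 SMs
print(large, can_use_all_sms(large))
print(small, can_use_all_sms(small))
```

A launch like the second one would leave some SMs idle, and their leakage would then be charged to a kernel that did not use them.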


Validation Benchmarks

Our benchmarks:
  2D convolution
  Matrix multiplication
  Vector addition
  Vector reduction
  Scalar product
  DCT 8x8
  3DFD


Results



Conclusion

Further Work
  Take into account context switches
  Consider multiple kernels running simultaneously


The End

Thanks

Q&A
