Performance in GPU Architectures: Potentials and Distances

Ahmad Lashgar
ECE, University of Tehran

Amirali Baniasadi
ECE, University of Victoria

WDDD-9, June 5, 2011

This Work

Goal: Investigating GPU performance for general-purpose workloads

How: Studying the isolated impact of
I. Memory divergence
II. Branch divergence
III. Context-keeping resources

Key finding: Memory has the biggest impact. A branch divergence solution needs to take memory into account.

A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

Outline

Background

Performance Impacting Parameters

Machine Models

Performance Potentials

Performance Distances

Sensitivity Analysis

Conclusion


GPU Architecture

[Figure: GPU block diagram. TPCs (up to TPC10), each grouping SM1 to SM3, connect through an interconnection network to memory controllers MCtrl1 to MCtrl6, each attached to DRAM. Each SM holds a thread pool (TID, CTAID, and program counter per thread), a register file, PE1 to PE32, and L1 data, constant, and texture caches.]

• Number of concurrent CTAs per SM is limited by the size of 3 shared resources:

1. Thread Pool
2. Register File
3. Shared Memory
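The limit set by these three resources can be sketched as a minimum over per-resource bounds. This is a minimal sketch, not the simulator's logic: the function name and the per-CTA demands in the usage line are hypothetical, the default capacities follow the per-SM values in the methodology table (1024 threads, 16384 registers, 16KB shared memory), and any fixed hardware cap on CTAs per SM is ignored.

```python
def max_concurrent_ctas(threads_per_cta, regs_per_thread, smem_per_cta,
                        thread_pool=1024, register_file=16384,
                        shared_mem=16 * 1024):
    """Return (CTA limit, limiting resource) for one SM.

    Each resource allows floor(capacity / per-CTA demand) CTAs;
    the smallest of the three bounds wins.
    """
    limits = {
        "thread_pool": thread_pool // threads_per_cta,
        "register_file": register_file // (regs_per_thread * threads_per_cta),
        # A CTA that uses no shared memory is never limited by it.
        "shared_memory": (shared_mem // smem_per_cta) if smem_per_cta else 10**9,
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical kernel: 256 threads per CTA, 20 registers per thread, 4 KB shared memory.
result = max_concurrent_ctas(256, 20, 4096)
print(result)
```

In this hypothetical case the register file is the bottleneck: 256 threads at 20 registers each need 5120 registers per CTA, so only three CTAs fit in 16384 registers even though the thread pool and shared memory would each allow four.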


Branch Divergence

An SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes. A branch instruction can split a warp into two groups:
1. Threads with taken outcome
2. Threads with not-taken outcome


Active masks per basic block (8 threads):

A  1 1 1 1 1 1 1 1
B  1 1 0 1 0 0 1 0
C  0 0 1 0 1 1 0 1
D  1 1 1 1 1 1 1 1

A: // Pre-divergence
if (CONDITION) {
B:   // NT path
} else {
C:   // T path
}
D: // Reconvergence point
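How the two masks arise from per-thread branch outcomes can be sketched as follows. The function name is hypothetical; the outcome vector is chosen so that the resulting masks match the B (not-taken) and C (taken) rows of the 8-thread example.

```python
def split_warp(active_mask, taken):
    """Split a warp's active mask at a branch.

    active_mask: 1 for each lane that is active before the branch.
    taken:       1 for each lane whose branch outcome is 'taken'.
    Returns (not_taken_mask, taken_mask).
    """
    taken_mask = [a & t for a, t in zip(active_mask, taken)]
    not_taken_mask = [a & (1 - t) for a, t in zip(active_mask, taken)]
    return not_taken_mask, taken_mask


active = [1] * 8                          # block A: all lanes active
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]       # per-thread branch outcome
b_mask, c_mask = split_warp(active, outcomes)
print("B:", b_mask)   # not-taken path
print("C:", c_mask)   # taken path
```

The two masks are complementary within the active set, so every active thread executes exactly one of the two paths.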

Control-flow mechanism

Control-flow solutions address branch divergence. Previous solutions:

• Postdominator Reconvergence (PDOM): masking and serializing the diverging paths, then reconverging all paths
• Dynamic Warp Formation (DWF): regrouping the threads in diverging paths into new warps
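The cost of masking and serializing can be illustrated by counting active lanes per issued block. This is a rough sketch under a simplifying assumption that each basic block takes one issue slot; the function name is hypothetical and the masks are those of the 8-thread A/B/C/D example.

```python
def simd_efficiency(schedule, warp_size):
    """Fraction of lane slots doing useful work.

    schedule: list of (block_name, active_mask) in issue order,
              one issue slot per entry (a simplifying assumption).
    """
    active_lanes = sum(sum(mask) for _, mask in schedule)
    return active_lanes / (len(schedule) * warp_size)


# Serialized execution: both diverging paths are issued one after the other.
serialized = [
    ("A", [1, 1, 1, 1, 1, 1, 1, 1]),
    ("B", [1, 1, 0, 1, 0, 0, 1, 0]),
    ("C", [0, 0, 1, 0, 1, 1, 0, 1]),
    ("D", [1, 1, 1, 1, 1, 1, 1, 1]),
]
efficiency = simd_efficiency(serialized, warp_size=8)
print(efficiency)
```

With half of the lanes idle in each of B and C, 24 of the 32 lane slots are used, so the efficiency in this example is 0.75.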


[Figure: utilization over time across paths B, C, and D.]

Dynamic regrouping of diverged threads on the same path increases utilization.
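One way to picture the regrouping is a lane-aware merge: threads waiting at the same PC are packed into as few warps as possible while each thread stays in its own lane. The sketch below is illustrative only and is not the DWF hardware algorithm or its issue heuristic; the function name and the 4-lane input masks are example values.

```python
from collections import defaultdict


def regroup(warps, warp_size):
    """Merge diverged warps that wait at the same PC.

    warps: list of (pc, mask) where mask is a string such as "0110",
           lane 0 leftmost. Threads never change lanes, so the number
           of warps needed per PC is the largest per-lane thread count.
    """
    lane_counts = defaultdict(lambda: [0] * warp_size)
    for pc, mask in warps:
        for lane, bit in enumerate(mask):
            if bit == "1":
                lane_counts[pc][lane] += 1

    regrouped = []
    for pc in sorted(lane_counts):
        counts = lane_counts[pc]
        for k in range(max(counts)):
            # The k-th warp takes one waiting thread from every lane that still has one.
            regrouped.append((pc, "".join("1" if c > k else "0" for c in counts)))
    return regrouped


diverged = [("B", "0110"), ("B", "0001"), ("C", "1001"), ("C", "1110")]
merged = regroup(diverged, warp_size=4)
print(merged)
```

Here four partially filled warps become three: the two B warps merge into one, while the two C warps both hold a thread in lane 0 and therefore still need two warps.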

[Figure: utilization over time across paths B, C, and D.]

Warp Pool

[Figure: warp pool snapshots over time. Each entry reads Wi: PC, mask vector.]

 1. W0: A 1111 | W1: A 1111
 2. W0: B 0110 | W1: A 1111 | W2: C 1001
 3. W0: B 0110 | W1: B 0001 | W2: C 1001 | W3: C 1110
 4. W0: B 0111 | W1: C 1111 | W2: C 1000
 5. W0: D 0111 | W1: C 1111 | W2: C 1000
 6. W0: D 0111 | W1: D 1111 | W2: C 1000
 7. W0: D 0111 | W1: D 1111 | W2: D 1000
 8. W0: D 1111 | W1: D 1111
 9. W0: A 1111 | W1: D 1111
10. W0: A 1111 | W1: A 1111


Performance impacting parameters

• Memory Divergence: increase of memory pressure with un-coalesced memory accesses
• Branch Divergence: decrease of SIMD efficiency with intra-warp diverging branches
• Workload Parallelism: CTA-limiting resources bound the memory-latency-hiding capability

Concurrent CTAs share 3 CTA-limiting resources:
1. Shared Memory
2. Register File
3. Thread Pool


Machine Models

Each machine model is named X-Y-Z, which isolates the impact of each parameter:

X (resources): LR = limited resources, UR = unlimited resources
Y (control flow): DC = DWF control-flow, PC = PDOM control-flow, IC = ideal control-flow (MIMD)
Z (memory): M = real memory, IM = ideal memory

Machine Models continued…

                Limited per-SM resources         Unlimited per-SM resources
Real memory     LR-DC-M   LR-PC-M   LR-IC-M      UR-DC-M   UR-PC-M   UR-IC-M
Ideal memory    LR-DC-IM  LR-PC-IM  LR-IC-IM     UR-DC-IM  UR-PC-IM  UR-IC-IM
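The twelve model names are the cross product of the three parameters. A small sketch that enumerates them, using the abbreviations from the slides; the dictionary and variable names are only for illustration.

```python
from itertools import product

RESOURCES = {"LR": "limited resources", "UR": "unlimited resources"}
CONTROL = {"DC": "DWF control-flow", "PC": "PDOM control-flow",
           "IC": "ideal control-flow (MIMD)"}
MEMORY = {"M": "real memory", "IM": "ideal memory"}

# Iterating a dict yields its keys, so each tuple is (resource, control, memory).
models = ["-".join(parts) for parts in product(RESOURCES, CONTROL, MEMORY)]

for name in models:
    x, y, z = name.split("-")
    print(f"{name:9s} {RESOURCES[x]}, {CONTROL[y]}, {MEMORY[z]}")
```

Two machines that differ in exactly one field of the name differ in exactly one parameter, which is what lets a comparison between them isolate that parameter.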

Methodology

• GPGPU-sim v2.1.1b
• 13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3


Parameter                                Value

NoC
Total number of SMs                      30
Number of memory controllers             6
Number of SMs sharing an interconnect    3

Warp size                                32 threads
Number of threads per SM                 1024
Number of registers per SM               16384 (32-bit)
Number of PEs per SM                     32
Shared memory size                       16KB
L1 data cache                            32KB

Clocking
Core clock                               325 MHz
Interconnect clock                       650 MHz
DRAM memory clock                        800 MHz

Control-Flow Mechanisms
Base DWF issue heuristic                 Majority
PDOM warp scheduling                     Round-robin

Performance Potentials

A potential is the speedup that can be reached if the impacting parameter is idealized.

3 Potentials (per control-flow mechanism):
• Memory Potential: speedup due to ideal memory
• Control Potential: speedup due to a divergence-free architecture
• Resource Potential: speedup due to infinite CTA-limiting resources per SM
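In terms of the X-Y-Z machine models, one plausible reading is that each potential compares the realistic machine (limited resources, real memory, a given control-flow mechanism) with the machine that idealizes a single parameter. The slides do not spell out the formula, so the pairing below is an assumption, the function names are hypothetical, and the IPC values are made-up placeholders rather than measured results.

```python
def speedup(ipc, baseline, idealized):
    """Relative improvement of the idealized machine over the baseline."""
    return ipc[idealized] / ipc[baseline] - 1.0


def potentials(ipc, control):
    """Potentials for one control-flow mechanism ("DC" for DWF, "PC" for PDOM).

    Assumed pairing: baseline LR-<control>-M, with one field idealized at a time.
    """
    base = f"LR-{control}-M"
    return {
        "memory": speedup(ipc, base, f"LR-{control}-IM"),
        "control": speedup(ipc, base, "LR-IC-M"),
        "resource": speedup(ipc, base, f"UR-{control}-M"),
    }


# Placeholder IPC values for illustration only (not from the study).
placeholder_ipc = {"LR-PC-M": 100.0, "LR-PC-IM": 150.0,
                   "LR-IC-M": 90.0, "UR-PC-M": 110.0}
pdom = potentials(placeholder_ipc, "PC")
print(pdom)
```

A potential can be negative under this definition: if the idealized machine runs slower than the baseline, as in the placeholder control case, the ratio falls below one.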


Performance Potentials continued…


Memory Potentials

DWF: 61%, PDOM: 59%

Resource Potentials

DWF: 8.6%, PDOM: 9.4%

Control Potentials

PDOM: -7%

Performance Distances

A distance is how far an otherwise ideal GPU falls from the ideal due to one parameter.

3 Distances:
• Memory Distance: distance from the ideal GPU due to real memory
• Resource Distance: distance from the ideal GPU due to limited resources
• Control Distance: distance from the ideal GPU due to branch divergence
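A distance can be sketched as the performance lost relative to the fully ideal machine UR-IC-IM when exactly one parameter is made real. Both the pairing of machine models and the "1 minus ratio" form are assumptions, since the slides give only the verbal definition; the function name is hypothetical and the IPC values are made-up placeholders, not results.

```python
IDEAL = "UR-IC-IM"  # unlimited resources, ideal control-flow, ideal memory


def distance(ipc, degraded, ideal=IDEAL):
    """Fraction of ideal performance lost on the degraded machine."""
    return 1.0 - ipc[degraded] / ipc[ideal]


# Placeholder IPC values for illustration only (not from the study).
placeholder_ipc = {
    "UR-IC-IM": 200.0,  # fully ideal
    "UR-IC-M": 120.0,   # only memory is real
    "LR-IC-IM": 190.0,  # only resources are limited
    "UR-PC-IM": 180.0,  # only control flow is real (PDOM)
}

distances = {
    "memory": distance(placeholder_ipc, "UR-IC-M"),
    "resource": distance(placeholder_ipc, "LR-IC-IM"),
    "control (PDOM)": distance(placeholder_ipc, "UR-PC-IM"),
}
print(distances)
```

Under this reading, only the control distance depends on the control-flow mechanism, because the memory and resource distances are both measured on machines with ideal control flow.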


Performance Distances continued…


Memory Distance


Resource Distance


Control Distances

DWF: 15%, PDOM: 8%

Sensitivity Analysis

Validating the findings under aggressive configurations:
• Aggressive-Memory: 2x L1 caches, 2x number of memory controllers
• Aggressive-Resource: 2x CTA-limiting resources

Limited to performance potentials


Aggressive-memory

Memory Potentials

[Figure: PDOM and DWF memory potentials under the aggressive-memory configuration.]

Aggressive-memory continued…

Control Potentials

[Figure: PDOM and DWF control potentials under the aggressive-memory configuration.]

Aggressive-memory continued…

Resource Potentials

[Figure: PDOM and DWF resource potentials under the aggressive-memory configuration.]

Aggressive-resource

Memory Potentials

[Figure: PDOM and DWF memory potentials under the aggressive-resource configuration.]

Aggressive-resource continued…

Control Potentials

[Figure: PDOM and DWF control potentials under the aggressive-resource configuration.]

Aggressive-resource continued…

Resource Potentials

[Figure: PDOM and DWF resource potentials under the aggressive-resource configuration.]


Conclusion

Performance in GPUs

Potentials: improvement by idealizing a factor
• Memory: 59% and 61% for PDOM and DWF
• Control: -7% and 2% for PDOM and DWF
• Resource: 9.4% and 8.6% for PDOM and DWF

Distances: distance from the ideal system due to a non-ideal factor
• Memory: 40%
• Control: 8% and 15% for PDOM and DWF
• Resource: 2%

Findings:
• Memory has the biggest impact among the 3 factors
• Improving the control-flow mechanism has to consider memory pressure
• Same trend under aggressive memory and context-keeping resources


Thank you.

Questions?

Why 32 PEs per SM

GPGPU-sim v2.1.1b coalesces memory accesses over SIMD-width slices of a warp separately, similar to pre-Fermi GPUs.

Example: Warp Size = 32, PEs per SM = 8 gives 4 independent coalescing domains in a warp: threads 0-7, 8-15, 16-23, and 24-31.

We used 32 PEs per SM with ¼ clock rate to model coalescing similar to Fermi GPUs.
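The effect of the coalescing domain width can be illustrated by counting memory transactions for one warp-wide access. This is a simplified sketch and not the simulator's coalescing logic: the function name, the 64-byte segment size, and the one-transaction-per-distinct-segment rule are all assumptions made for the illustration.

```python
def count_transactions(addresses, domain_width, segment_bytes=64):
    """Count memory transactions for one warp-wide access.

    Threads are coalesced only within their own domain (a slice of
    domain_width consecutive threads); each distinct segment touched
    inside a domain costs one transaction.
    """
    total = 0
    for start in range(0, len(addresses), domain_width):
        domain = addresses[start:start + domain_width]
        total += len({addr // segment_bytes for addr in domain})
    return total


# 32 threads each read a consecutive 4-byte word (128 contiguous bytes).
addresses = [4 * tid for tid in range(32)]

sliced = count_transactions(addresses, domain_width=8)       # 8 PEs: 4 domains
whole_warp = count_transactions(addresses, domain_width=32)  # 32 PEs: 1 domain
print(sliced, whole_warp)
```

With four separate domains the same 128 bytes cost four transactions in this model, against two when the whole warp is one domain, which is the kind of difference the 32-PE configuration is meant to avoid.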
