Performance in GPU Architectures: Potentials and Distances

Ahmad Lashgar, ECE, University of Tehran

Amirali Baniasadi, ECE, University of Victoria

WDDD-9, June 5, 2011

This Work

Goal: Investigating GPU performance for general-purpose workloads

How: Studying the isolated impact of
I. Memory divergence
II. Branch divergence
III. Context-keeping resources

Key finding: Memory has the biggest impact. Branch divergence solution needs memory consideration.

A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

Outline

Background

Performance Impacting Parameters

Machine Models

Performance Potentials

Performance Distances

Sensitivity Analysis

Conclusion

GPU Architecture

[Figure: GPU organization. Ten TPCs (TPC1 to TPC10), each containing three SMs (SM1 to SM3), connect through an interconnection network to six memory controllers (MCtrl1 to MCtrl6), each attached to its own DRAM (DRAM1 to DRAM6).]

[Figure: SM organization. Each SM holds a thread pool (one TID, CTAID, and program counter entry per thread), a register file, 32 PEs (PE1 to PE32), and L1 data, constant, and texture caches.]

• Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
1. Thread Pool
2. Register File
3. Shared Memory
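As a rough illustration of how the three resources bound concurrency, the sketch below takes the per-SM limits from the simulated configuration in this deck (1024 threads, 16384 registers, 16KB shared memory) and computes how many CTAs fit. The kernel parameters are hypothetical, and real hardware may add further caps that the slide does not list.

```python
# Per-SM limits taken from the simulated configuration in this deck.
THREAD_POOL = 1024          # thread contexts per SM
REGISTER_FILE = 16384       # 32-bit registers per SM
SHARED_MEMORY = 16 * 1024   # bytes of shared memory per SM


def max_concurrent_ctas(threads_per_cta, regs_per_thread, smem_per_cta):
    """Return (CTA count, limiting resource) for one SM.

    The three arguments describe a hypothetical kernel, not a benchmark
    from the talk.
    """
    limits = {
        "thread pool": THREAD_POOL // threads_per_cta,
        "register file": REGISTER_FILE // (threads_per_cta * regs_per_thread),
    }
    if smem_per_cta > 0:  # a kernel without shared memory is not bound by it
        limits["shared memory"] = SHARED_MEMORY // smem_per_cta
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical kernel: 256 threads per CTA, 20 registers per thread, 4KB shared memory.
ctas, bottleneck = max_concurrent_ctas(256, 20, 4096)
print(ctas, bottleneck)
```

For this hypothetical kernel the register file runs out first: the thread pool and shared memory would each allow four CTAs, but the registers allow only three.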

Branch Divergence

An SM is a SIMD processor. A group of threads (a warp) executes the same instruction on the lanes.

A branch instruction can diverge a warp into two groups:
1. Threads with taken outcome
2. Threads with not-taken outcome

A 1 1 1 1 1 1 1 1

B 1 1 0 1 0 0 1 0

C 0 0 1 0 1 1 0 1

D 1 1 1 1 1 1 1 1

A: // Pre-divergence
if (CONDITION) {
B: // NT path
} else {
C: // T path
}
D: // Reconvergence point
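To put a number on the cost of this divergence, the sketch below computes SIMD utilization from the four active-lane masks of the 8-lane example. It assumes, for illustration only, that each block A to D occupies the SIMD lanes for the same amount of time.

```python
# Active-lane masks for the four blocks of the 8-lane example above.
MASKS = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1],
    "B": [1, 1, 0, 1, 0, 0, 1, 0],
    "C": [0, 0, 1, 0, 1, 1, 0, 1],
    "D": [1, 1, 1, 1, 1, 1, 1, 1],
}


def simd_utilization(masks):
    """Fraction of lane slots that do useful work across all blocks."""
    active = sum(sum(mask) for mask in masks.values())
    slots = sum(len(mask) for mask in masks.values())
    return active / slots


print(simd_utilization(MASKS))  # 24 active lanes out of 32 slots -> 0.75
```

Blocks B and C each keep only half of the lanes busy, so under this equal-time assumption the warp uses three quarters of the available lane slots.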

Control-flow mechanism

Control-flow solutions address this. Previous solutions:

Postdominator Reconvergence (PDOM): Masking and serializing in diverging paths, finally reconverging all paths

Dynamic Warp Formation (DWF): Regrouping the threads in diverging paths into new warps
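The PDOM behaviour described above (mask, serialize, reconverge) is commonly implemented with a per-warp reconvergence stack. The following is a minimal sketch of that idea for the single if/else of the running example, not the simulator's implementation; the function name and the string mask encoding are assumptions made for illustration.

```python
def pdom_trace(full_mask, not_taken_mask):
    """Issue order of one warp through A -> {B | C} -> D under PDOM.

    Masks are strings such as "0110", one character per lane.
    """
    # Lanes that are active but did not go to B take the other path, C.
    taken_mask = "".join(
        "1" if f == "1" and n == "0" else "0"
        for f, n in zip(full_mask, not_taken_mask)
    )
    trace = [("A", full_mask)]  # pre-divergence block, all lanes active
    # At the branch, the reconvergence entry goes on the stack first,
    # followed by one entry per diverging path. The top of stack executes.
    stack = [("D", full_mask), ("C", taken_mask), ("B", not_taken_mask)]
    while stack:
        trace.append(stack.pop())  # finish one path, then pop to the next
    return trace


for block, mask in pdom_trace("1111", "0110"):
    print(block, mask)
```

Each diverging path runs with only its own lanes enabled, and the full mask is restored once the stack unwinds to the reconvergence point D.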

PDOM

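[Figure: SIMD utilization over time under PDOM for two 4-lane warps, shown as active masks per block. A: W0 1111, W1 1111. B: W0 0110, W1 0001. C: W0 1001, W1 1110. D: W0 1111, W1 1111. TOS marks the top-of-stack entry of each warp.]

Dynamic regrouping of diverged threads at the same path increases utilization.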

DWF

[Figure: SIMD utilization over time under DWF for the same two warps, with the warp pool contents at each step. The "Merge Possibility" label marks where diverged threads on the same path are regrouped.]

Warp pool states (Wi, PC, Mask Vector):

1. W0 A 1111, W1 A 1111
2. W0 B 0110, W1 A 1111, W2 C 1001
3. W0 B 0110, W1 B 0001, W2 C 1001, W3 C 1110
4. W0 B 0111, W1 C 1111, W2 C 1000
5. W0 D 0111, W1 C 1111, W2 C 1000
6. W0 D 0111, W1 D 1111, W2 C 1000
7. W0 D 0111, W1 D 1111, W2 D 1000
8. W0 D 1111, W1 D 1111
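The regrouping step can be sketched with the B and C masks of the two-warp example. The sketch assumes that a thread stays in its home lane, so two threads from the same lane cannot share a new warp. The function names are hypothetical, and the greedy packing is an illustration rather than the DWF hardware algorithm.

```python
def regroup(masks):
    """Pack same-path threads into as few warps as possible.

    A thread keeps its home lane, so two threads from the same lane
    cannot be placed in the same warp.
    """
    width = len(masks[0])
    per_lane = [sum(m[lane] == "1" for m in masks) for lane in range(width)]
    n_warps = max(per_lane)
    return [
        "".join("1" if per_lane[lane] > w else "0" for lane in range(width))
        for w in range(n_warps)
    ]


def utilization(masks):
    """Fraction of lane slots that are active over the issued warps."""
    return sum(m.count("1") for m in masks) / sum(len(m) for m in masks)


b_pdom, c_pdom = ["0110", "0001"], ["1001", "1110"]  # per-warp masks without regrouping
b_dwf, c_dwf = regroup(b_pdom), regroup(c_pdom)      # regrouped by path
print(b_dwf, c_dwf)
print(utilization(b_pdom + c_pdom), utilization(b_dwf + c_dwf))
```

The three B threads fit into one warp, while the five C threads need two because lane 0 is occupied twice. The diverged region then issues three warps instead of four, which raises utilization on paths B and C from one half to two thirds in this example.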

Performance Impacting Parameters

Memory Divergence: Increase of memory pressure with un-coalesced memory accesses

Branch Divergence: Decrease of SIMD efficiency with intra-warp diverging branches

Workload Parallelism: CTA-limiting resources bound the memory latency hiding capability. Concurrent CTAs share 3 CTA-limiting resources:
1. Shared Memory
2. Register File
3. Thread Pool

Machine Models

Isolates the impact of each parameter. Each machine model is named X-Y-Z:

X (per-SM resources): LR = Limited Resources, UR = Unlimited Resources
Y (control flow): DC = DWF Control-flow, PC = PDOM Control-flow, IC = Ideal Control-flow (MIMD)
Z (memory): M = Real Memory, IM = Ideal Memory

Machine Models continued…

Limited per-SM resources, Real-Memory: LR-DC-M, LR-PC-M, LR-IC-M
Limited per-SM resources, Ideal-Memory: LR-DC-IM, LR-PC-IM, LR-IC-IM
Unlimited per-SM resources, Real-Memory: UR-DC-M, UR-PC-M, UR-IC-M
Unlimited per-SM resources, Ideal-Memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
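The twelve model names are the cross product of the three naming dimensions. A small sketch that enumerates them follows; the variable and function names are illustrative only.

```python
from itertools import product

RESOURCES = {"LR": "limited per-SM resources", "UR": "unlimited per-SM resources"}
CONTROL = {"DC": "DWF control flow", "PC": "PDOM control flow", "IC": "ideal control flow (MIMD)"}
MEMORY = {"M": "real memory", "IM": "ideal memory"}

# Every combination of one choice per dimension gives one machine model.
MODELS = ["-".join(parts) for parts in product(RESOURCES, CONTROL, MEMORY)]


def describe(model):
    """Expand a model name such as 'LR-PC-M' into plain words."""
    resources, control, memory = model.split("-")
    return f"{RESOURCES[resources]}, {CONTROL[control]}, {MEMORY[memory]}"


print(len(MODELS))
for model in MODELS:
    print(model, "=", describe(model))
```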

Methodology

GPGPU-sim v2.1.1b

13 benchmarks from the Rodinia benchmark suite and CUDA SDK 2.3

Parameter                                  Value

NoC
Total Number of SMs                        30
Number of Memory Ctrls                     6
Number of SMs Sharing an Interconnect      3

SM
Warp Size                                  32 Threads
Number of Threads per SM                   1024
Number of Registers per SM                 16384 32-bit
Number of PEs per SM                       32
Shared Memory Size                         16KB
L1 Data Cache                              32KB

Clocking
Core Clock                                 325 MHz
Interconnect Clock                         650 MHz
DRAM Memory Clock                          800 MHz

Control-Flow Mechanisms
Base DWF issue heuristic                   Majority
PDOM warp scheduling                       round-robin

Performance Potentials

The speedup that can be reached if the impacting parameter is idealized.

3 Potentials (per control-flow mechanism):

Memory Potential: Speedup due to ideal memory

Control Potential: Speedup due to free-of-divergence architecture

Resource Potential: Speedup due to infinite CTA-limiting resources per SM
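One plausible way to compute the three potentials from the machine models is sketched below. It assumes that performance is measured as IPC and that each potential compares the realistic baseline LR-x-M with the model that idealizes exactly one parameter. The pairing of models and the IPC numbers are illustrative assumptions, not values from the talk.

```python
def potential(ipc, baseline, idealized):
    """Relative speedup of the idealized model over the baseline."""
    return ipc[idealized] / ipc[baseline] - 1.0


def potentials(ipc, control):
    """Potentials for one control-flow mechanism: 'PC' (PDOM) or 'DC' (DWF)."""
    base = f"LR-{control}-M"  # limited resources, real control flow, real memory
    return {
        "memory": potential(ipc, base, f"LR-{control}-IM"),
        "control": potential(ipc, base, "LR-IC-M"),
        "resource": potential(ipc, base, f"UR-{control}-M"),
    }


# Placeholder IPC values for illustration only; not results from the talk.
ipc = {"LR-PC-M": 100.0, "LR-PC-IM": 150.0, "LR-IC-M": 90.0, "UR-PC-M": 110.0}
for name, value in potentials(ipc, "PC").items():
    print(f"{name}: {value:+.0%}")
```

A potential is negative whenever the idealized model runs slower than the baseline, which is the case for the control entry in these placeholder numbers.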

Memory Potentials

DWF: 61%

PDOM: 59%

Resource Potentials

DWF: 8.6%

PDOM: 9.4%

Control Potentials

DWF: 2%

PDOM: -7%

Performance Distances

How far an otherwise ideal GPU falls from ideal because of the parameter.

3 Distances:

Memory Distance: Distance from ideal GPU due to real memory

Resource Distance: Distance from ideal GPU due to limited resources

Control Distance: Distance from ideal GPU due to branch divergence
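A distance can be read as the fraction of ideal performance lost when a single parameter of the fully ideal machine (UR-IC-IM) is made real. The sketch below follows that reading. The choice of IPC as the metric, the formula, and the numbers are assumptions made for illustration, not results from the talk.

```python
IDEAL = "UR-IC-IM"  # unlimited resources, ideal control flow, ideal memory


def distance(ipc, model):
    """Fraction of ideal performance lost by `model`."""
    return 1.0 - ipc[model] / ipc[IDEAL]


def distances(ipc):
    """Each entry de-idealizes exactly one parameter of the ideal machine."""
    return {
        "memory": distance(ipc, "UR-IC-M"),
        "resource": distance(ipc, "LR-IC-IM"),
        "control (DWF)": distance(ipc, "UR-DC-IM"),
        "control (PDOM)": distance(ipc, "UR-PC-IM"),
    }


# Placeholder IPC values for illustration only; not results from the talk.
ipc = {
    "UR-IC-IM": 200.0,
    "UR-IC-M": 150.0,
    "LR-IC-IM": 190.0,
    "UR-DC-IM": 160.0,
    "UR-PC-IM": 180.0,
}
for name, value in distances(ipc).items():
    print(f"{name}: {value:.0%}")
```

Under this reading the control distance has one value per control-flow mechanism, while memory and resource distances have a single value each, because only the control-flow dimension has two real variants.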

Memory Distance

40%

Resource Distance

2%

Control Distances

DWF: 15%

PDOM: 8%

Sensitivity Analysis

Validating the findings under aggressive configurations:

Aggressive-Memory: 2x L1 caches, 2x number of memory controllers

Aggressive-Resource: 2x CTA-limiting resources

Limited to performance potentials

Aggressive-memory

Memory Potentials

PDOM memory potential: 28%

DWF memory potential: 28%

Aggressive-memory continued…

Control Potentials

PDOM control potential: -8%

DWF control potential: -0.4%

Aggressive-memory continued…

Resource Potentials

PDOM resource potential: 8%

DWF resource potential: ~0%

Aggressive-resource

Memory Potentials

PDOM memory potential: 51%

DWF memory potential: 52%

Aggressive-resource continued…

Control Potentials

PDOM control potential: -8%

DWF control potential: 2%

Aggressive-resource continued…

Resource Potentials

PDOM resource potential: 4%

DWF resource potential: 3%

Conclusion

Performance in GPUs

Potentials: Improvement by idealizing
Memory: 59% and 61% for PDOM and DWF
Control: -7% and 2% for PDOM and DWF
Resource: 9.4% and 8.6% for PDOM and DWF

Distances: Distance from ideal system due to a non-ideal factor
Memory: 40%
Control: 8% and 15% for PDOM and DWF
Resource: 2%

Findings:
Memory has the biggest impact among the 3 factors
Improving the control-flow mechanism has to consider memory pressure
Same trend under aggressive memory and context-keeping resources

Thank you.

Questions?

Why 32 PEs per SM

GPGPU-sim v2.1.1b coalesces memory accesses over SIMD width slices of a warp separately, similar to pre-Fermi GPUs:

Example: Warp Size = 32, PEs per SM = 8: 4 independent coalescing domains in a warp (threads 0-7, 8-15, 16-23, 24-31)

We used 32 PEs per SM with ¼ clock rate to model coalescing similar to Fermi GPUs: a single coalescing domain per warp (threads 0-31)
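The effect of the coalescing domain width can be sketched by counting memory transactions for one warp. The 128-byte segment size and the two access patterns are assumptions chosen for illustration; the sketch does not reproduce GPGPU-sim's coalescing logic.

```python
SEGMENT_BYTES = 128  # assumed transaction size, not stated on the slide
WARP_SIZE = 32


def transactions(addresses, domain_width):
    """Memory transactions for one warp when every `domain_width`-thread
    slice of the warp is coalesced on its own."""
    total = 0
    for start in range(0, len(addresses), domain_width):
        domain = addresses[start:start + domain_width]
        # Threads of one domain that hit the same segment share a transaction.
        total += len({addr // SEGMENT_BYTES for addr in domain})
    return total


contiguous = [4 * tid for tid in range(WARP_SIZE)]           # 4-byte words, stride 1
strided = [SEGMENT_BYTES * tid for tid in range(WARP_SIZE)]  # one segment per thread

print(transactions(contiguous, 8), transactions(contiguous, 32))
print(transactions(strided, 8), transactions(strided, 32))
```

With 8-thread domains a fully contiguous warp access still costs four transactions, while a single 32-thread domain needs one. The strided pattern costs 32 transactions in both cases, since no two threads share a segment.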
