
Page 1

Understanding Performance of Concurrent Data Structures on Graphics Processors

Daniel Cederman, Bapi Chatterjee, Philippas Tsigas
Distributed Computing and Systems
D&IT, Chalmers University, Sweden
(Supported by PEPPHER, SCHEME, VR)

Euro-PAR 2012

Page 2

Parallelization on GPU (GPGPU)
• CUDA, OpenCL
• Independent of CPU
• Ubiquitous

Main processor
• Uniprocessors: no more
• Multi-core, many-core

Co-processor
• Graphics processors
• SIMD, N x speedup

2/31

Page 3

Data structures + Multi-core/Many-core = Concurrent data structures
• Rich and growing literature
• Applications

CDS on GPU
• Synchronization-aware applications on GPU
• Challenging but required

[Slide diagram labels: Concurrent Programming, Parallel Slowdown]

3/31

Page 4

Concurrent Data Structures on GPU

• Implementation Issues

• Performance Portability

4/31

Page 5

Outline of the talk

GPU (Nvidia)
• Architecture Evolution
• Support for Synchronization

Concurrent Data Structures
• Concurrent FIFO Queues

CDS on GPU
• Implementation & Optimization
• Performance Portability Analysis

5/31

Page 6

Outline of the talk (recap)

6/31

Page 7

GPU Architecture Evolution

Processor           | Atomics             | Cache
Tesla (CC 1.0)      | No atomics          | No cache
Tesla (CC 1.x, x>0) | Atomics available   | No cache
Fermi (CC 2.x)      | Atomics on L2       | Unified L2 and configurable L1
Kepler (CC 3.x)     | Faster than earlier | L2 73% faster than Fermi

7/31

Page 8

Compare and Swap (CAS) on GPU

[Figure: CAS behavior on GPUs. CAS operations per ms per thread block vs. number of thread blocks (0-60), for GeForce GTX 280, Tesla C2050 and GeForce GTX 680]

(A microbenchmark sketch follows below.)

8/31
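
To make the figure concrete, here is a minimal CUDA sketch of how CAS throughput per thread block could be probed. The kernel, loop count and launch configuration are our own assumptions, not the authors' benchmark code.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread repeatedly attempts a CAS on one word in global memory;
// total operations / elapsed ms / blocks approximates the y-axis of
// the figure above.
__global__ void casBench(unsigned int *word, int opsPerThread) {
    for (int i = 0; i < opsPerThread; ++i) {
        unsigned int old = *word;          // read the current value
        atomicCAS(word, old, old + 1);     // attempt to swap it
    }
}

int main() {
    unsigned int *word;
    cudaMalloc(&word, sizeof(unsigned int));
    cudaMemset(word, 0, sizeof(unsigned int));

    const int blocks = 60, threads = 32, ops = 1000;  // assumed config
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    casBench<<<blocks, threads>>>(word, ops);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    long long total = (long long)blocks * threads * ops;
    printf("%.0f CAS operations per ms per block\n", total / ms / blocks);

    cudaFree(word);
    return 0;
}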

Page 9

CDS on GPU: Motivation & Challenges

• Transition from a pure co-processor to a more independent compute unit.
• CUDA and OpenCL.
• Synchronization primitives are getting cheaper with the availability of multilevel caches.
• Synchronization-aware programs vs. inherent SIMD.

9/31

Page 10

Outline of the talk (recap)

10/31

Page 11

Concurrent Data Structures: synchronization progress guarantees

1. Blocking
2. Non-blocking
   • Lock-free
   • Wait-free

11/31

Page 12

Concurrent FIFO Queues

12/31

Page 13

Concurrent FIFO Queues

Single Producer Single Consumer (SPSC)
• Lamport 1983: Lamport Queue [1]
• Giacomoni et al. 2008: FastForward Queue [2]
• Lee et al. 2009: MCRingBuffer [4]
• Preud'homme et al. 2010: BatchQueue [3]

Multi Producer Multi Consumer (MPMC)
• Michael & Scott 1996: MS-Queue (blocking and non-blocking) [5]
• Tsigas & Zhang 2001: TZ-Queue [6]

13/31

Page 14

SPSC FIFO Queues

Lamport [1]
1. Lock-free, array-based.
2. Synchronization through atomic read and write on shared head and tail; causes cache thrashing.

enqueue(data) {
  if (NEXT(head) == tail)
    return FALSE;            // queue full
  buffer[head] = data;
  head = NEXT(head);
  return TRUE;
}

dequeue(data) {
  if (head == tail)
    return FALSE;            // queue empty
  data = buffer[tail];
  tail = NEXT(tail);
  return TRUE;
}

14/31

Page 15

SPSC FIFO Queues

FastForward [2]
• head and tail are private to the producer and consumer, lowering cache thrashing.

BatchQueue [3]
• The queue is divided into two batches: the producer writes to one while the consumer reads from the other.

(A sketch of the FastForward idea follows below.)

15/31
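
A sketch of the FastForward idea, simplified from [2] (the names and the C-style formulation are ours): an empty slot is marked NULL, so the producer and consumer never read each other's index.

#define QSIZE 1024
#define NEXT(i) (((i) + 1) % QSIZE)

void *buffer[QSIZE];   // all slots start out NULL (empty)
int head = 0;          // private to the producer
int tail = 0;          // private to the consumer

int enqueue(void *data) {
    if (buffer[head] != NULL) return FALSE;  // slot still occupied: full
    buffer[head] = data;                     // publishing the element is
    head = NEXT(head);                       // the only shared write
    return TRUE;
}

int dequeue(void **data) {
    if (buffer[tail] == NULL) return FALSE;  // nothing published: empty
    *data = buffer[tail];
    buffer[tail] = NULL;                     // mark the slot free again
    tail = NEXT(tail);
    return TRUE;
}

A real implementation also needs memory fences (or volatile accesses) so that the data write is visible before the slot is seen as non-NULL.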

Page 16

SPSC FIFO Queues

MCRingBuffer [4]
1. Similar to BatchQueue, but handles many batches.
2. Many smaller batches can lower latency when the producer is not fast enough to fill a large batch.

(A sketch of the batching idea follows below.)

16/31
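
The batching idea common to BatchQueue and MCRingBuffer can be sketched as follows (our own simplification; BATCH and the variable names are assumptions). The shared write index is advanced once per batch instead of once per element, so the cache lines holding the control variables bounce far less often:

#define QSIZE 1024
#define BATCH 64

void *buffer[QSIZE];
volatile int write = 0;   // shared; advanced once per full batch
volatile int read  = 0;   // shared; advanced by the consumer per batch
int localHead = 0;        // producer-private position within the batch

int enqueue(void *data) {
    int next = (localHead + 1) % QSIZE;
    if (next == read) return FALSE;   // would overrun the consumer
    buffer[localHead] = data;
    localHead = next;
    if (localHead % BATCH == 0)       // batch boundary reached:
        write = localHead;            // one shared update per BATCH items
    return TRUE;
}

The consumer mirrors this with a private tail. MCRingBuffer additionally caches read locally and refreshes it only when the queue appears full, saving further shared reads.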

Page 17

MPMC FIFO Queues

MS-Queue (Blocking) [5]
1. Linked-list based.
2. Mutual exclusion locks for synchronization.
3. CAS-based spin lock and bakery lock; fine-grained and coarse-grained variants.

(A spin-lock sketch follows below.)

17/31
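
A minimal CUDA sketch of the kind of CAS-based spin lock used for the blocking variants (our illustration; note that in this study each queue operation is driven by one thread per block, which avoids intra-warp livelock on the lock):

// 0 = free, 1 = taken
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0)
        ;                        // spin until we swap 0 -> 1
}

__device__ void unlock(int *mutex) {
    __threadfence();             // make protected writes visible first
    atomicExch(mutex, 0);        // release the lock
}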

Page 18

MPMC FIFO Queues

MS-Queue (Non-blocking) [5]
1. Lock-free.
2. Uses CAS to add nodes at the tail and remove nodes from the head.
3. A helping mechanism between threads provides true lock-freedom.

(An enqueue sketch follows below.)

18/31
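
A sketch of the lock-free MS-queue enqueue (simplified from [5]; node allocation, the dequeue side and ABA protection are omitted). Pointer CAS is expressed through atomicCAS on the pointer's 64-bit representation, which requires a GPU with 64-bit atomics:

struct Node  { int value; Node *next; };
struct Queue { Node *head; Node *tail; };

__device__ bool casPtr(Node **addr, Node *expected, Node *desired) {
    return atomicCAS((unsigned long long *)addr,
                     (unsigned long long)expected,
                     (unsigned long long)desired)
           == (unsigned long long)expected;
}

__device__ void enqueue(Queue *q, Node *n) {
    n->next = NULL;
    while (true) {
        Node *tail = q->tail;
        Node *next = tail->next;
        if (next == NULL) {
            // tail really is last: try to link the new node
            if (casPtr(&tail->next, NULL, n)) {
                casPtr(&q->tail, tail, n);  // swing tail; failing is fine
                return;
            }
        } else {
            // another thread is mid-enqueue: help it move tail forward
            casPtr(&q->tail, tail, next);
        }
    }
}

The helping branch is what makes the queue lock-free: a stalled thread can never prevent the others from completing.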

Page 19

MPMC FIFO Queues

TZ-Queue (Non-blocking) [6]
1. Lock-free, array-based.
2. Uses CAS to insert elements and to move head and tail.
3. The head and tail pointers are only moved after every x-th operation.

(A sketch follows below.)

19/31
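
A heavily simplified sketch of the lazy pointer update (not the full algorithm of [6], which also handles wrap-around, full-queue detection and ABA). The element itself is inserted with a CAS; the shared tail is only swung forward on a fraction of the operations, so most operations touch the hot tail word read-only:

#define EMPTY 0u   // free-slot marker; stored values must be nonzero

__device__ void tzEnqueue(unsigned int *buf, unsigned int *tail,
                          unsigned int size, unsigned int value) {
    while (true) {
        unsigned int t = *tail;
        while (buf[t] != EMPTY)            // tail may lag behind:
            t = (t + 1) % size;            // scan to the real end
        if (atomicCAS(&buf[t], EMPTY, value) == EMPTY) {
            if ((t & 1) == 0)              // every x-th slot only (x = 2)
                atomicCAS(tail, *tail, t); // best effort; losing is fine
            return;
        }
        // lost the race for the slot: retry from the tail
    }
}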

Page 20

Outline of the talk (recap)

20/31

Page 21

Implementation Platform

Processor        | Clock Speed | Memory Clock | Cores | LL Cache | Architecture
8800GT           | 1.6 GHz     | 1.0 GHz      | 14    | 0        | Tesla (CC 1.1)
GTX280           | 1.3 GHz     | 1.1 GHz      | 30    | 0        | Tesla (CC 1.3)
Tesla C2050      | 1.2 GHz     | 1.5 GHz      | 14    | 768 kB   | Fermi (CC 2.0)
GTX680           | 1.1 GHz     | 3.0 GHz      | 8     | 512 kB   | Kepler (CC 3.0)
Intel E5645 (2x) | 2.4 GHz     | 0.7 GHz      | 24    | 12 MB    | Intel HT

21/31

Page 22

GPU implementation

1. A thread block works either as a producer or as a consumer.
2. Varying number of thread blocks for the MPMC queues.
3. Shared memory is used for the private variables of producers and consumers.

(A kernel skeleton follows below.)

22/31
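
The setup described above might look like the following kernel skeleton (our reconstruction; tryEnqueue and tryDequeue are hypothetical stand-ins for any of the measured queue algorithms, and Queue is whichever queue implementation is under test):

__global__ void queueBench(Queue *q, int producerBlocks, int ops) {
    __shared__ unsigned int privateIdx;  // producer/consumer-private
                                         // state lives in shared memory
    if (threadIdx.x == 0) {              // one thread per block drives ops
        privateIdx = 0;
        bool producer = (blockIdx.x < producerBlocks);
        for (int i = 0; i < ops; ++i) {
            if (producer) tryEnqueue(q, &privateIdx, i);  // hypothetical
            else          tryDequeue(q, &privateIdx);     // helpers
        }
    }
}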

Page 23

GPU optimization

1. BatchQueue and MCRingBuffer take advantage of shared memory to make them buffered.
2. Coalesced memory transfers in the buffered queues.
3. Empirical optimization in the TZ-Queue: move the pointers after every second operation.

(A sketch of the buffering optimization follows below.)

23/31
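
Points 1 and 2 can be sketched like this (our illustration; BATCH and the queue layout are assumptions): a whole batch is staged in fast shared memory and then written to the global queue in one coalesced burst by all threads of the block:

#define BATCH 256

__device__ void flushBatch(int *globalBuf, const int *shBatch, int base) {
    // thread i copies elements i, i + blockDim.x, ... so consecutive
    // threads hit consecutive addresses: the transfer coalesces
    for (int i = threadIdx.x; i < BATCH; i += blockDim.x)
        globalBuf[base + i] = shBatch[i];
    __syncthreads();       // whole batch written before the index update
    __threadfence();       // ...and visible to the consumer block
}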

Page 24

Experimental Setup

1. Throughput = #{successful enqueues or dequeues} / ms.
2. MPMC experiments: 25% enqueues and 75% dequeues.
3. Contention: high and low.
4. On the CPU, producers and consumers were placed on different sockets.

24/31

Page 25

SPSC on CPU

Reducing cache thrashing increases throughput.

[Figure: Throughput (operations per ms) on the Intel 24-core for Lamport, FastForward, MCRingBuffer and BatchQueue]
[Figure: Cache profile on the Intel 24-core: ratio of LLC misses relative to BatchQueue, and stalled cycles per instruction]

25/31

Page 26

SPSC on GPU

• GPU without cache: no cache thrashing.
• GPU shared memory advantage: buffering.
• High memory clock + faster cache advantage: unbuffered.

[Figure: Throughput (operations per ms) on GeForce 8800 GT, GeForce GTX 280, Tesla C2050 and GeForce GTX 680, for Lamport, FastForward, MCRingBuffer, BatchQueue, Buffered MCRingBuffer and Buffered BatchQueue]

26/31

Page 27

MPMC on CPU

• SpinLock (CAS-based) beats the bakery lock (read/write).
• Lock-free better than blocking.

[Figure: Best lock-based vs. lock-free, under high and low contention: operations per ms vs. number of threads (0-25), for Dual SpinLock, MS-Queue and TZ-Queue]

27/31

Page 28

MPMC on GPU (High Contention)

Newer architecture, better scalability.

[Figure: Operations per ms vs. thread blocks (0-60), for Dual SpinLock, MS-Queue and TZ-Queue, on GTX280 (CC 1.3), C2050 (CC 2.0) and GTX680 (CC 3.0)]

28/31

Page 29

Compare and Swap (CAS) on GPU

[Figure repeated from page 8: CAS operations per ms per thread block vs. number of thread blocks (0-60), for GeForce GTX 280, Tesla C2050 and GeForce GTX 680]

29/31

Page 30

MPMC on GPU (Low Contention)

Lower contention, better scalability.

[Figure: Operations per ms vs. thread blocks (0-60), for Dual SpinLock, MS-Queue and TZ-Queue, on GTX280 (CC 1.3), C2050 (CC 2.0) and GTX680 (CC 3.0)]

30/31

Page 31

Summary

1. Concurrent queues are in general performance-portable from CPU to GPU.
2. The configurable caches are still NOT enough to remove the benefit of redesigning algorithms from a GPU shared-memory viewpoint.
3. The significantly improved atomics in Fermi, and further in Kepler, are a big motivation for algorithmic design of CDS for GPUs.

31/31

Page 32

References

1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190-222
2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52
3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (2010) 215-222
4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '09), New York, NY, USA, ACM (2009) 78-79
5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275
6. Tsigas, P., Zhang, Y.: A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In: Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM (2001) 134-143