
Page 1

Understanding Performance of Concurrent Data Structures on Graphics Processors

Daniel Cederman, Bapi Chatterjee, Philippas Tsigas
Distributed Computing and Systems
D&IT, Chalmers University, Sweden
(Supported by PEPPHER, SCHEME, VR)

Euro-PAR 2012

Page 2

Parallelization on GPU (GPGPU)
• CUDA, OpenCL
• Independent of CPU
• Ubiquitous

Main processor
• Uniprocessors: no more
• Multi-core, many-core

Co-processor
• Graphics processors
• SIMD, N x speedup

2/31

Page 3

Data structures + Multi-core/Many-core = Concurrent data structures
• Rich and growing literature
• Applications

CDS on GPU
• Synchronization-aware applications on GPU
• Challenging but required

[Slide diagram labels: Concurrent Programming, Parallel Slowdown]

3/31

Page 4

Concurrent Data Structures on GPU

• Implementation Issues

• Performance Portability

4/31

Page 5

Outline of the talk

GPU (Nvidia)
• Architecture Evolution
• Support for Synchronization

Concurrent Data Structures
• Concurrent FIFO Queues

CDS on GPU
• Implementation & Optimization
• Performance Portability Analysis

5/31

Page 6

Outline of the talk (recap)

6/31

Page 7

GPU Architecture Evolution

Processor           | Atomics             | Cache
Tesla (CC 1.0)      | No atomics          | No cache
Tesla (CC 1.x, x>0) | Atomics available   | No cache
Fermi (CC 2.x)      | Atomics on L2       | Unified L2 and configurable L1
Kepler (CC 3.x)     | Faster than earlier | L2 73% faster than Fermi

7/31

Page 8

Compare and Swap (CAS) on GPU

[Figure: CAS behavior on GPUs. CAS operations per ms per thread block vs. number of thread blocks (0-60), for GeForce GTX 280, Tesla C2050 and GeForce GTX 680]

(A microbenchmark sketch follows below.)

8/31
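
To make the figure concrete, here is a minimal CUDA sketch of how CAS throughput per thread block could be probed. The kernel, loop count and launch configuration are our own assumptions, not the authors' benchmark code.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread repeatedly attempts a CAS on one word in global memory;
// total operations / elapsed ms / blocks approximates the y-axis of
// the figure above.
__global__ void casBench(unsigned int *word, int opsPerThread) {
    for (int i = 0; i < opsPerThread; ++i) {
        unsigned int old = *word;          // read the current value
        atomicCAS(word, old, old + 1);     // attempt to swap it
    }
}

int main() {
    unsigned int *word;
    cudaMalloc(&word, sizeof(unsigned int));
    cudaMemset(word, 0, sizeof(unsigned int));

    const int blocks = 60, threads = 32, ops = 1000;  // assumed config
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    casBench<<<blocks, threads>>>(word, ops);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    long long total = (long long)blocks * threads * ops;
    printf("%.0f CAS operations per ms per block\n", total / ms / blocks);

    cudaFree(word);
    return 0;
}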

Page 9

CDS on GPU: Motivation & Challenges

• Transition from a pure co-processor to a more independent compute unit.
• CUDA and OpenCL.
• Synchronization primitives are getting cheaper with the availability of multilevel caches.
• Synchronization-aware programs vs. inherent SIMD.

9/31

Page 10

Outline of the talk (recap)

10/31

Page 11

Concurrent Data Structures: synchronization progress guarantees

1. Blocking
2. Non-blocking
   • Lock-free
   • Wait-free

11/31

Page 12

Concurrent FIFO Queues

12/31

Page 13

Concurrent FIFO Queues

Single Producer Single Consumer (SPSC)
• Lamport 1983: Lamport Queue [1]
• Giacomoni et al. 2008: FastForward Queue [2]
• Lee et al. 2009: MCRingBuffer [4]
• Preud'homme et al. 2010: BatchQueue [3]

Multi Producer Multi Consumer (MPMC)
• Michael & Scott 1996: MS-Queue (blocking and non-blocking) [5]
• Tsigas & Zhang 2001: TZ-Queue [6]

13/31

Page 14

SPSC FIFO Queues

Lamport [1]
1. Lock-free, array-based.
2. Synchronization through atomic read and write on shared head and tail; causes cache thrashing.

enqueue(data) {
  if (NEXT(head) == tail)
    return FALSE;            // queue full
  buffer[head] = data;
  head = NEXT(head);
  return TRUE;
}

dequeue(data) {
  if (head == tail)
    return FALSE;            // queue empty
  data = buffer[tail];
  tail = NEXT(tail);
  return TRUE;
}

14/31

Page 15

SPSC FIFO Queues

FastForward [2]
• head and tail are private to the producer and consumer, lowering cache thrashing.

BatchQueue [3]
• The queue is divided into two batches: the producer writes to one while the consumer reads from the other.

(A sketch of the FastForward idea follows below.)

15/31
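
A sketch of the FastForward idea, simplified from [2] (the names and the C-style formulation are ours): an empty slot is marked NULL, so the producer and consumer never read each other's index.

#define QSIZE 1024
#define NEXT(i) (((i) + 1) % QSIZE)

void *buffer[QSIZE];   // all slots start out NULL (empty)
int head = 0;          // private to the producer
int tail = 0;          // private to the consumer

int enqueue(void *data) {
    if (buffer[head] != NULL) return FALSE;  // slot still occupied: full
    buffer[head] = data;                     // publishing the element is
    head = NEXT(head);                       // the only shared write
    return TRUE;
}

int dequeue(void **data) {
    if (buffer[tail] == NULL) return FALSE;  // nothing published: empty
    *data = buffer[tail];
    buffer[tail] = NULL;                     // mark the slot free again
    tail = NEXT(tail);
    return TRUE;
}

A real implementation also needs memory fences (or volatile accesses) so that the data write is visible before the slot is seen as non-NULL.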

Page 16

SPSC FIFO Queues

MCRingBuffer [4]
1. Similar to BatchQueue, but handles many batches.
2. Many smaller batches can lower latency when the producer is not fast enough to fill a large batch.

(A sketch of the batching idea follows below.)

16/31
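
The batching idea common to BatchQueue and MCRingBuffer can be sketched as follows (our own simplification; BATCH and the variable names are assumptions). The shared write index is advanced once per batch instead of once per element, so the cache lines holding the control variables bounce far less often:

#define QSIZE 1024
#define BATCH 64

void *buffer[QSIZE];
volatile int write = 0;   // shared; advanced once per full batch
volatile int read  = 0;   // shared; advanced by the consumer per batch
int localHead = 0;        // producer-private position within the batch

int enqueue(void *data) {
    int next = (localHead + 1) % QSIZE;
    if (next == read) return FALSE;   // would overrun the consumer
    buffer[localHead] = data;
    localHead = next;
    if (localHead % BATCH == 0)       // batch boundary reached:
        write = localHead;            // one shared update per BATCH items
    return TRUE;
}

The consumer mirrors this with a private tail. MCRingBuffer additionally caches read locally and refreshes it only when the queue appears full, saving further shared reads.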

Page 17

MPMC FIFO Queues

MS-Queue (Blocking) [5]
1. Linked-list based.
2. Mutual exclusion locks for synchronization.
3. CAS-based spin lock and bakery lock; fine-grained and coarse-grained variants.

(A spin-lock sketch follows below.)

17/31
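
A minimal CUDA sketch of the kind of CAS-based spin lock used for the blocking variants (our illustration; note that in this study each queue operation is driven by one thread per block, which avoids intra-warp livelock on the lock):

// 0 = free, 1 = taken
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0)
        ;                        // spin until we swap 0 -> 1
}

__device__ void unlock(int *mutex) {
    __threadfence();             // make protected writes visible first
    atomicExch(mutex, 0);        // release the lock
}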

Page 18

MPMC FIFO Queues

MS-Queue (Non-blocking) [5]
1. Lock-free.
2. Uses CAS to add nodes at the tail and remove nodes from the head.
3. A helping mechanism between threads provides true lock-freedom.

(An enqueue sketch follows below.)

18/31
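
A sketch of the lock-free MS-queue enqueue (simplified from [5]; node allocation, the dequeue side and ABA protection are omitted). Pointer CAS is expressed through atomicCAS on the pointer's 64-bit representation, which requires a GPU with 64-bit atomics:

struct Node  { int value; Node *next; };
struct Queue { Node *head; Node *tail; };

__device__ bool casPtr(Node **addr, Node *expected, Node *desired) {
    return atomicCAS((unsigned long long *)addr,
                     (unsigned long long)expected,
                     (unsigned long long)desired)
           == (unsigned long long)expected;
}

__device__ void enqueue(Queue *q, Node *n) {
    n->next = NULL;
    while (true) {
        Node *tail = q->tail;
        Node *next = tail->next;
        if (next == NULL) {
            // tail really is last: try to link the new node
            if (casPtr(&tail->next, NULL, n)) {
                casPtr(&q->tail, tail, n);  // swing tail; failing is fine
                return;
            }
        } else {
            // another thread is mid-enqueue: help it move tail forward
            casPtr(&q->tail, tail, next);
        }
    }
}

The helping branch is what makes the queue lock-free: a stalled thread can never prevent the others from completing.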

Page 19

MPMC FIFO Queues

TZ-Queue (Non-blocking) [6]
1. Lock-free, array-based.
2. Uses CAS to insert elements and to move head and tail.
3. The head and tail pointers are only moved after every x-th operation.

(A sketch follows below.)

19/31
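
A heavily simplified sketch of the lazy pointer update (not the full algorithm of [6], which also handles wrap-around, full-queue detection and ABA). The element itself is inserted with a CAS; the shared tail is only swung forward on a fraction of the operations, so most operations touch the hot tail word read-only:

#define EMPTY 0u   // free-slot marker; stored values must be nonzero

__device__ void tzEnqueue(unsigned int *buf, unsigned int *tail,
                          unsigned int size, unsigned int value) {
    while (true) {
        unsigned int t = *tail;
        while (buf[t] != EMPTY)            // tail may lag behind:
            t = (t + 1) % size;            // scan to the real end
        if (atomicCAS(&buf[t], EMPTY, value) == EMPTY) {
            if ((t & 1) == 0)              // every x-th slot only (x = 2)
                atomicCAS(tail, *tail, t); // best effort; losing is fine
            return;
        }
        // lost the race for the slot: retry from the tail
    }
}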

Page 20

Outline of the talk (recap)

20/31

Page 21

Implementation Platform

Processor        | Clock Speed | Memory Clock | Cores | LL Cache | Architecture
8800GT           | 1.6 GHz     | 1.0 GHz      | 14    | 0        | Tesla (CC 1.1)
GTX280           | 1.3 GHz     | 1.1 GHz      | 30    | 0        | Tesla (CC 1.3)
Tesla C2050      | 1.2 GHz     | 1.5 GHz      | 14    | 768 kB   | Fermi (CC 2.0)
GTX680           | 1.1 GHz     | 3.0 GHz      | 8     | 512 kB   | Kepler (CC 3.0)
Intel E5645 (2x) | 2.4 GHz     | 0.7 GHz      | 24    | 12 MB    | Intel HT

21/31

Page 22

GPU implementation

1. A thread block works either as a producer or as a consumer.
2. Varying number of thread blocks for the MPMC queues.
3. Shared memory is used for the private variables of producers and consumers.

(A kernel skeleton follows below.)

22/31
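
The setup described above might look like the following kernel skeleton (our reconstruction; tryEnqueue and tryDequeue are hypothetical stand-ins for any of the measured queue algorithms, and Queue is whichever queue implementation is under test):

__global__ void queueBench(Queue *q, int producerBlocks, int ops) {
    __shared__ unsigned int privateIdx;  // producer/consumer-private
                                         // state lives in shared memory
    if (threadIdx.x == 0) {              // one thread per block drives ops
        privateIdx = 0;
        bool producer = (blockIdx.x < producerBlocks);
        for (int i = 0; i < ops; ++i) {
            if (producer) tryEnqueue(q, &privateIdx, i);  // hypothetical
            else          tryDequeue(q, &privateIdx);     // helpers
        }
    }
}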

Page 23

GPU optimization

1. BatchQueue and MCRingBuffer take advantage of shared memory to make them buffered.
2. Coalesced memory transfers in the buffered queues.
3. Empirical optimization in the TZ-Queue: move the pointers after every second operation.

(A sketch of the buffering optimization follows below.)

23/31
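
Points 1 and 2 can be sketched like this (our illustration; BATCH and the queue layout are assumptions): a whole batch is staged in fast shared memory and then written to the global queue in one coalesced burst by all threads of the block:

#define BATCH 256

__device__ void flushBatch(int *globalBuf, const int *shBatch, int base) {
    // thread i copies elements i, i + blockDim.x, ... so consecutive
    // threads hit consecutive addresses: the transfer coalesces
    for (int i = threadIdx.x; i < BATCH; i += blockDim.x)
        globalBuf[base + i] = shBatch[i];
    __syncthreads();       // whole batch written before the index update
    __threadfence();       // ...and visible to the consumer block
}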

Page 24

Experimental Setup

1. Throughput = #{successful enqueues or dequeues} / ms.
2. MPMC experiments: 25% enqueues and 75% dequeues.
3. Contention: high and low.
4. On the CPU, producers and consumers were placed on different sockets.

24/31

Page 25

SPSC on CPU

Reducing cache thrashing increases throughput.

[Figure: Throughput (operations per ms) on the Intel 24-core for Lamport, FastForward, MCRingBuffer and BatchQueue]
[Figure: Cache profile on the Intel 24-core: ratio of LLC misses relative to BatchQueue, and stalled cycles per instruction]

25/31

Page 26

SPSC on GPU

• GPU without cache: no cache thrashing.
• GPU shared memory advantage: buffering.
• High memory clock + faster cache advantage: unbuffered.

[Figure: Throughput (operations per ms) on GeForce 8800 GT, GeForce GTX 280, Tesla C2050 and GeForce GTX 680, for Lamport, FastForward, MCRingBuffer, BatchQueue, Buffered MCRingBuffer and Buffered BatchQueue]

26/31

Page 27

MPMC on CPU

• SpinLock (CAS-based) beats the bakery lock (read/write).
• Lock-free better than blocking.

[Figure: Best lock-based vs. lock-free, under high and low contention: operations per ms vs. number of threads (0-25), for Dual SpinLock, MS-Queue and TZ-Queue]

27/31

Page 28

MPMC on GPU (High Contention)

Newer architecture, better scalability.

[Figure: Operations per ms vs. thread blocks (0-60), for Dual SpinLock, MS-Queue and TZ-Queue, on GTX280 (CC 1.3), C2050 (CC 2.0) and GTX680 (CC 3.0)]

28/31

Page 29

Compare and Swap (CAS) on GPU

[Figure repeated from page 8: CAS operations per ms per thread block vs. number of thread blocks (0-60), for GeForce GTX 280, Tesla C2050 and GeForce GTX 680]

29/31

Page 30

MPMC on GPU (Low Contention)

Lower contention, better scalability.

[Figure: Operations per ms vs. thread blocks (0-60), for Dual SpinLock, MS-Queue and TZ-Queue, on GTX280 (CC 1.3), C2050 (CC 2.0) and GTX680 (CC 3.0)]

30/31

Page 31

Summary

1. Concurrent queues are in general performance-portable from CPU to GPU.
2. The configurable caches are still NOT enough to remove the benefit of redesigning algorithms from a GPU shared-memory viewpoint.
3. The significantly improved atomics in Fermi, and further in Kepler, are a big motivation for algorithmic design of CDS for GPUs.

31/31

Page 32

References

1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190-222
2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52
3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (2010) 215-222
4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '09), New York, NY, USA, ACM (2009) 78-79
5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275
6. Tsigas, P., Zhang, Y.: A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In: Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM (2001) 134-143