compiler and runtime support for efficient software transactional memory

Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman


Page 1: Compiler and Runtime Support for Efficient Software Transactional Memory

Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman


Motivation

Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data

Traditional mechanism: Lock-based Synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable

New Solution: Transactional Memory (TM)
– Simpler programming model: Atomicity, Consistency, Isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
  • GC : Memory allocation ≈ TM : Mutual exclusion


Composability

class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);   // Get the current balance
      balance = balance + amount;         // Increment it
      accounts.put(name, balance);        // Set the new balance
    }
  }
  …
}

Thread-safe – but no scaling
• ConcurrentHashMap (Java 5/JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking


Transactional solution

class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);   // Get the current balance
      balance = balance + amount;         // Increment it
      accounts.put(name, balance);        // Set the new balance
    }
  }
  …
}

Underlying system provides:
• isolation (thread safety)
• optimistic concurrency


Transactions are Composable

[Figure: Scalability for 10,000,000 operations. x-axis: # of Processors (0 to 16); y-axis: Scalability (0 to 4); series: Synchronized, Transactional]

Scalability on 16-way 2.2 GHz Xeon System


Our System

A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT

Novel Features
– Rich transactional language constructs in Java
– Efficient, first class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word and object level conflict detection
– Complete GC support


System Overview

[Diagram: compilation pipeline. Components: Polyglot, StarJIT, ORP VM, McRT STM. Code stages: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code]


Transactional Java

Java + new language constructs:
• Atomic: execute block atomically
  – atomic {S}
• Retry: block until alternate path possible
  – atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  – atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  – tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
  – when (condition) {S}

Builds on prior research
• Concurrent Haskell, CAML, CILK, Java
• HPCS languages: Fortress, Chapel, X10


Transactional Java → Java

Transactional Java

  atomic {
    S;
  }

STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...

Standard Java + STM API

  while(true) {
    TxnHandle th = txnStart();
    try {
      S’;
      break;
    } finally {
      if(!txnCommit(th))
        continue;
    }
  }
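
The retry loop above can be exercised against a mock runtime. This is a minimal sketch, not the system's implementation: `RetryLoopSketch`, its counters, and the bodies of `txnStart` and `txnCommit` are invented for illustration, and the mock simply rejects a fixed number of commits. Only the loop shape comes from the slide. A real STM would also roll back the body's writes before retrying, which the mock does not model.

```java
// Hypothetical mock of the STM runtime; only the loop shape comes from the slide.
class RetryLoopSketch {
    static final class TxnHandle {}

    static int commitFailuresLeft; // how many commits the mock will still reject
    static int attempts;           // how many times the transaction body ran

    static TxnHandle txnStart() { return new TxnHandle(); }

    // Mock commit: reports failure until the failure budget is used up.
    static boolean txnCommit(TxnHandle th) {
        if (commitFailuresLeft > 0) {
            commitFailuresLeft--;
            return false;
        }
        return true;
    }

    // Runs the body inside the retry loop and returns the number of attempts.
    static int runAtomic(Runnable body, int failures) {
        commitFailuresLeft = failures;
        attempts = 0;
        while (true) {
            TxnHandle th = txnStart();
            try {
                attempts++;
                body.run();   // stands in for S'
                break;        // takes effect only if the commit below succeeds
            } finally {
                if (!txnCommit(th))
                    continue; // commit failed: the pending break is discarded, retry
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Two rejected commits, then success: the body runs three times.
        System.out.println(runAtomic(() -> {}, 2));
    }
}
```

The `continue` inside `finally` is what lets a failed commit override the `break`; Java compilers accept this but may warn that the `finally` clause cannot complete normally.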


JVM STM support

On-demand cloning of methods called inside transactions

Garbage collection support
• Enumeration of refs in read set, write set & undo log

Extra transaction record field in each object
• Supports both word & object granularity

Native method invocation throws exception inside transaction
• Some intrinsic functions allowed

Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API


Background: McRT-STM

STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)

• Writes:
  – strict two-phase locking
  – update in place
  – undo on abort

• Reads:
  – versioning
  – validation before commit

• Granularity per type
  – Object-level : small objects
  – Word-level : large arrays

• Benefits
  – Fast memory accesses (no buffering / object wrapping)
  – Minimal copying (no cloning for large objects)
  – Compatible with existing types & libraries


Ensuring Atomicity: Novel Combination

Mode ↓ / Memory Ops      | Reads                     | Writes
Pessimistic Concurrency  |                           | + In place updates
                         |                           | + Fast commits
                         |                           | + Fast reads
Optimistic Concurrency   | + Caching effects         |
                         | + Avoids lock operations  |

Quantitative results in PPoPP’06


McRT-STM: Example

  …
  atomic {
    B = A + 5;
  }
  …

  …
  stmStart();
  temp = stmRd(A);
  stmWr(B, temp + 5);
  stmCommit();
  …

STM read & write barriers before accessing memory inside transactions

STM tracks accesses & detects data conflicts


Transaction Record

Pointer-sized record per object / word

Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)

Mapping
• Object : slot in object
• Field : hashed index into global record table
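
The two record states can be told apart with a single bit test. The sketch below assumes a `long` stands in for the pointer-sized word; the class and helper names (`TxRecordSketch`, `nextVersion`, `isAlignedDescriptor`) are hypothetical, and stepping versions by 2 is an assumption chosen only because it keeps them odd.

```java
// Sketch of the two transaction-record states packed into one word.
// A long stands in for a pointer-sized value; all names here are illustrative.
class TxRecordSketch {
    // Shared: low bit set, so the value is an odd version number.
    static boolean isShared(long rec) {
        return (rec & 1L) == 1L;
    }

    // Exclusive: low bit clear, so the value is the owner's descriptor address.
    static boolean isExclusive(long rec) {
        return !isShared(rec);
    }

    // Assumed policy: advance versions by 2 so they always stay odd.
    static long nextVersion(long version) {
        if (!isShared(version))
            throw new IllegalArgumentException("not a version number");
        return version + 2;
    }

    // A 4-byte-aligned address has its two low bits clear, so it can never
    // be mistaken for an odd version number.
    static boolean isAlignedDescriptor(long addr) {
        return (addr & 3L) == 0L;
    }

    public static void main(String[] args) {
        System.out.println(isShared(5L));          // true: version 5
        System.out.println(isExclusive(0x1000L));  // true: aligned address
        System.out.println(nextVersion(5L));       // 7
    }
}
```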


Transaction Record: Example

Every data item has an associated transaction record

Object granularity
  class Foo { int x; int y; }
  Object layout: vtbl, TxR, x, y (extra transaction record field in the object)

Word granularity
  class Foo { int x; int y; }
  Object layout: vtbl, hash, x, y
  Object words hash into table of TxRs (TxR1, TxR2, TxR3, …, TxRn)
  Hash is f(obj.hash, offset)
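
For word granularity, the mapping from a field to its record might look like the sketch below. The slide does not define `f`, so the mixing function, the table size, and the names `TxRecordTableSketch` and `recordIndex` are all assumptions made for illustration.

```java
import java.util.Arrays;

// Sketch of word-granularity mapping: each (object hash, field offset) pair
// selects one slot in a global table of transaction records.
class TxRecordTableSketch {
    static final int TABLE_SIZE = 1 << 10;            // assumed power-of-two size
    static final long[] TABLE = new long[TABLE_SIZE];

    static {
        Arrays.fill(TABLE, 1L);                        // every record starts shared, version 1
    }

    // Assumed f(obj.hash, offset): combine, fold the high bits down, mask to the table.
    static int recordIndex(int objHash, int offset) {
        int h = objHash * 31 + offset;
        h ^= (h >>> 16);
        return h & (TABLE_SIZE - 1);
    }

    public static void main(String[] args) {
        int objHash = 12345;                           // stand-in for obj.hash
        System.out.println(recordIndex(objHash, 8));   // record slot for field x
        System.out.println(recordIndex(objHash, 12));  // record slot for field y
    }
}
```

Because the table is finite, unrelated words can land on the same record; in a scheme like this that would show up as a false conflict rather than a correctness problem.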


Transaction Descriptor

Descriptor per thread
– Info for version validation, lock release, undo on abort, …

Read and Write set : {<Ti, Ni>}
– Ti: transaction record
– Ni: version number

Undo log : {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)

In atomic region
– Read operation appends read set
– Write operation appends write set and undo log
– GC enumerates read/write/undo logs
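
The descriptor's three logs can be modeled with plain lists. This is a sketch under stated assumptions: an `int[]` stands in for the containing object, an array slot for the field address, and the names `TxnDescriptorSketch`, `logRead`, `logWrite`, and `undoAll` are hypothetical rather than part of the STM API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-thread transaction descriptor with read set, write set, and undo log.
class TxnDescriptorSketch {
    // <Ti, Ni>: transaction record (an index here) and the version number seen.
    static final class SetEntry {
        final int txRecord;
        final long version;
        SetEntry(int txRecord, long version) {
            this.txRecord = txRecord;
            this.version = version;
        }
    }

    // <Ai, Oi, Vi, Ki>: slot, containing object, original value, type tag.
    static final class UndoEntry {
        final int slot;
        final int[] object;
        final int oldValue;
        final char typeTag;
        UndoEntry(int slot, int[] object, int oldValue, char typeTag) {
            this.slot = slot;
            this.object = object;
            this.oldValue = oldValue;
            this.typeTag = typeTag;
        }
    }

    final List<SetEntry> readSet = new ArrayList<>();
    final List<SetEntry> writeSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    // A read appends to the read set.
    void logRead(int txRecord, long version) {
        readSet.add(new SetEntry(txRecord, version));
    }

    // A write appends to the write set and saves the old value before it is overwritten.
    void logWrite(int txRecord, long version, int[] object, int slot) {
        writeSet.add(new SetEntry(txRecord, version));
        undoLog.add(new UndoEntry(slot, object, object[slot], 'I'));
    }

    // On abort, replay the undo log newest-first to restore the original values.
    void undoAll() {
        for (int i = undoLog.size() - 1; i >= 0; i--) {
            UndoEntry u = undoLog.get(i);
            u.object[u.slot] = u.oldValue;
        }
        undoLog.clear();
    }
}
```

Replaying the log in reverse matters when the same location is written twice in one transaction: the oldest saved value is applied last, so the location ends up with its pre-transaction contents.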


McRT-STM: Example

T1:
  atomic {
    t = foo.x;
    bar.x = t;
    t = foo.y;
    bar.y = t;
  }

T2:
  atomic {
    t1 = bar.x;
    t2 = bar.y;
  }

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values

class Foo { int x; int y; };
Foo bar, foo;


McRT-STM: Example

T1:
  stmStart();
  t = stmRd(foo.x);
  stmWr(bar.x,t);
  t = stmRd(foo.y);
  stmWr(bar.y,t);
  stmCommit();

T2:
  stmStart();
  t1 = stmRd(bar.x);
  t2 = stmRd(bar.y);
  stmCommit();

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values


McRT-STM: Example

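T1:
  stmStart();
  t = stmRd(foo.x);
  stmWr(bar.x,t);
  t = stmRd(foo.y);
  stmWr(bar.y,t);
  stmCommit();

T2:
  stmStart();
  t1 = stmRd(bar.x);
  t2 = stmRd(bar.y);
  stmCommit();

Initial state (hdr holds the version number):
  foo: hdr = 3, x = 9, y = 7
  bar: hdr = 5, x = 0, y = 0

T1 logs:
  Reads <foo, 3>
  Writes <bar, 5>
  Undo <bar.x, 0>, <bar.y, 0>
  bar is updated in place to x = 9, y = 7; its version becomes 7 at Commit

T2 logs:
  Reads <bar, 5>
  T2 waits (while T1 owns bar)
  Reads <bar, 7>
  Abort (the recorded version 5 no longer matches)

• T2 should read [0, 0] or should read [9, 7]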
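
The abort in this trace follows from read-set validation: T2 recorded version 5 for bar, but bar carries version 7 once T1 has committed. A minimal sketch of that check, assuming records are keyed by name and that the first observed version is the one kept; `ReadValidationSketch`, `recordRead`, and `validate` are hypothetical names, not the McRT-STM API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of optimistic read validation: a transaction may commit only if every
// record it read still carries the version it first observed.
class ReadValidationSketch {
    // Read set: record name -> version seen at the first read.
    final Map<String, Long> readSet = new LinkedHashMap<>();

    void recordRead(String record, long versionSeen) {
        readSet.putIfAbsent(record, versionSeen);
    }

    // Compares each logged version with the record's current version.
    boolean validate(Map<String, Long> currentVersions) {
        for (Map.Entry<String, Long> e : readSet.entrySet()) {
            Long now = currentVersions.get(e.getKey());
            if (now == null || !now.equals(e.getValue()))
                return false; // a writer committed in between: abort and retry
        }
        return true;
    }

    public static void main(String[] args) {
        ReadValidationSketch t2 = new ReadValidationSketch();
        t2.recordRead("bar", 5L);            // T2 reads bar.x at version 5

        Map<String, Long> current = new LinkedHashMap<>();
        current.put("bar", 7L);              // T1 has committed; bar is now at version 7
        System.out.println(t2.validate(current)); // false
    }
}
```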


Early Results: Overhead breakdown

[Figure: STM time breakdown (0% to 100%) by application: Binary tree, Hashtable, Linked list, Btree; components: TLS access, STM write, STM commit, STM validate, STM read]

Time breakdown on single processor

STM read & validation overheads dominate

Good optimization targets


System Overview

[Diagram: compilation pipeline. Components: Polyglot, StarJIT, ORP VM, McRT STM. Code stages: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code]


Leveraging the JIT

StarJIT: High-performance dynamic compiler

• Identifies transactional regions in Java+STM code

• Differentiates top-level and nested transactions

• Inserts read/write barriers in transactional code

• Maps STM API to first class opcodes in STIR

Good compiler representation → greater optimization opportunities


Representing Read/Write Barriers

  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }

  stmWr(&a.x, t1)
  stmWr(&a.y, t2)
  if(stmRd(&a.z) == 0) {
    stmWr(&a.x, 0)
    stmWr(&a.z, t3)
  }

Traditional barriers hide redundant locking/logging


An STM IR for Optimization

Redundancies exposed:

  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }

  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = t1
  txnOpenForWrite(a)
  txnLogObjectInt(&a.y, a)
  a.y = t2
  txnOpenForRead(a)
  if(a.z == 0) {
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = 0
    txnOpenForWrite(a)
    txnLogObjectInt(&a.z, a)
    a.z = t3
  }


Optimized Code

  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }

  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = t1
  txnLogObjectInt(&a.y, a)
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    txnLogObjectInt(&a.z, a)
    a.z = t3
  }

Fewer & cheaper STM operations
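
The saving can be made concrete by counting STM operations in both versions. This is a sketch only: the barrier functions are mocks that just count calls, they take the object alone rather than a field address, and `BarrierCountSketch`, `Obj`, `naive`, `optimized`, and `count` are names invented for illustration.

```java
// Sketch comparing the number of STM operations issued before and after
// redundant-barrier elimination. The barriers are counting mocks.
class BarrierCountSketch {
    static final class Obj { int x, y, z; }

    static int opens; // txnOpenForRead / txnOpenForWrite calls
    static int logs;  // txnLogObjectInt calls

    static void txnOpenForWrite(Obj o) { opens++; }
    static void txnOpenForRead(Obj o)  { opens++; }
    static void txnLogObjectInt(Obj o) { logs++; } // simplified: no field address

    // One open and one log per access, as in the unoptimized IR.
    static void naive(Obj a, int t1, int t2, int t3) {
        txnOpenForWrite(a); txnLogObjectInt(a); a.x = t1;
        txnOpenForWrite(a); txnLogObjectInt(a); a.y = t2;
        txnOpenForRead(a);
        if (a.z == 0) {
            txnOpenForWrite(a); txnLogObjectInt(a); a.x = 0;
            txnOpenForWrite(a); txnLogObjectInt(a); a.z = t3;
        }
    }

    // Repeated opens of a and the second log of a.x are removed.
    static void optimized(Obj a, int t1, int t2, int t3) {
        txnOpenForWrite(a);
        txnLogObjectInt(a); a.x = t1;
        txnLogObjectInt(a); a.y = t2;
        if (a.z == 0) {
            a.x = 0;                  // a.x was already logged above
            txnLogObjectInt(a); a.z = t3;
        }
    }

    // Returns { opens, logs, a.x, a.y, a.z } for one run on a fresh object.
    static int[] count(boolean useOptimized) {
        opens = 0;
        logs = 0;
        Obj a = new Obj();
        if (useOptimized) optimized(a, 1, 2, 3); else naive(a, 1, 2, 3);
        return new int[] { opens, logs, a.x, a.y, a.z };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(count(false))); // [5, 4, 0, 2, 3]
        System.out.println(java.util.Arrays.toString(count(true)));  // [1, 3, 0, 2, 3]
    }
}
```

Both versions leave the object in the same final state; only the number of opens and logs differs.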


Compiler Optimizations for Transactions

Standard optimizations
• CSE, Dead-code-elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in presence of nesting

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead


Experiments

16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)

Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial


Effect of Compiler Optimizations

1P overheads over thread-unsafe baseline

Prior STMs typically incur ~2x on 1P

With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization

[Figure: % Overhead on 1P (0% to 90%) for HashMap and TreeMap; series: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining]


Scalability: Java HashMap Shootout

Unsafe (java.util.HashMap)
• Thread-unsafe w/o Concurrency Control

Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 -> Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)

Atomic
• Transactional version via “AtomicMap” wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
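
The per-segment size tracking used by Atomic Prime could be sketched as below. The slides do not show that code, so everything here is an assumption: the class `SegmentedSizeSketch`, the hash spreading in `segmentFor`, and the choice of 16 counters to mirror ConcurrentHashMap's 16 segments. The class is not thread-safe by itself; in the setting described, the updates would run inside atomic blocks.

```java
// Sketch of per-segment size counters: inserts and removes touch one of 16
// counters instead of a single shared size field.
class SegmentedSizeSketch {
    static final int SEGMENTS = 16;          // mirrors the 16 segments mentioned above
    final int[] sizes = new int[SEGMENTS];

    // Assumed segment choice: spread the hash, then mask to a segment index.
    static int segmentFor(int hash) {
        return (hash ^ (hash >>> 16)) & (SEGMENTS - 1);
    }

    void onInsert(Object key) {
        sizes[segmentFor(key.hashCode())]++;
    }

    void onRemove(Object key) {
        sizes[segmentFor(key.hashCode())]--;
    }

    // The total size is the sum over all segments.
    int size() {
        int total = 0;
        for (int s : sizes) total += s;
        return total;
    }

    public static void main(String[] args) {
        SegmentedSizeSketch m = new SegmentedSizeSketch();
        m.onInsert("a");
        m.onInsert("b");
        m.onInsert("c");
        m.onRemove("a");
        System.out.println(m.size()); // 2
    }
}
```

With word-level conflict detection on the counter array, two transactions that update different segments would not conflict on the size bookkeeping, which is the bottleneck a single size field creates.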


Scalability: 100% Gets

Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales

[Figure: Speedup over 1P Unsafe (0 to 16) vs. # of Processors (0 to 16); series: Unsafe, Synchronized, Concurrent, Atomic (No Opt), Atomic]


Scalability: 20% Gets / 80% Updates

ConcurrentHashMap thrashes on 16 segments
Atomic still scales

[Figure: Speedup over 1P Unsafe (0 to 16) vs. # of Processors (0 to 16); series: Synchronized, Concurrent, Atomic (No Opt), Atomic]


20% Inserts and Removes

Atomic conflicts on entire bucket array
- The array is an object

[Figure: Speedup over 1P Unsafe (0 to 3) vs. # of Processors (0 to 16); series: Synchronized, Concurrent, Atomic]


20% Inserts and Removes: Word-Level

We still conflict on the single size field in java.util.HashMap

[Figure: Speedup over 1P Unsafe (0 to 3) vs. # of Processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic]


20% Inserts and Removes: Atomic Prime

Atomic Prime tracks size / segment – lowering bottleneck
No degradation, modest performance gain

[Figure: Speedup over 1P Unsafe (0 to 3) vs. # of Processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime]


20% Inserts and Removes: Mixed-Level

Mixed-level preserves wins & reduces overheads
- word-level for arrays
- object-level for non-arrays

[Figure: Speedup over 1P Unsafe (0 to 3) vs. # of Processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime]


Scalability: java.util.TreeMap

[Figure, two panels titled 100% Gets and 80% Gets. Axes: Scalability vs. # of Processors (0 to 16). One panel spans 0 to 16 with series Unsafe, Synchronized, Atomic; the other spans 0 to 1.2 with series Synchronized, Atomic, Atomic Prime]

Results similar to HashMap


Scalability: OO7 – 80% Reads

“Coarse” atomic is competitive with medium-grain synchronization

Operations & traversal over synthetic database

[Figure: Scalability (0 to 6) vs. # of Processors (0 to 16); series: Atomic, Synch (Coarse), Synch (Med.), Synch (Fine)]


Key Takeaways

Optimistic reads + pessimistic writes is a nice sweet spot

Compiler optimizations significantly reduce STM overhead
- 20-40% over thread-unsafe
- 10-30% over synchronized

Simple atomic wrappers sometimes good enough

Minor modifications give competitive performance to complex fine-grain synchronization

Word-level contention is crucial for large arrays

Mixed contention provides best of both


Research challenges

Performance
– Compiler optimizations
– Hardware support
– Dealing with contention

Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions

Debugging & performance analysis tools

System integration


Conclusions

Rich transactional language constructs in Java

Efficient, first class nested transactions

RISC-like STM API

Compiler optimizations

Per-type word and object level conflict detection

Complete GC support
