concurrency on the jvm
DESCRIPTION
Concurrency on the JVM, showing some of the nuts and bolts of Akka (I presume .. it's not first-hand information, just speculation). The Java Memory Model, thread pools, actors and the like will be covered.

TRANSCRIPT
![Page 1: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/1.jpg)
Concurrency on the JVM.. or some of the nuts and bolts of Akka
Bernhard Huemer · bhuemer.at · @bhuemer
23 July 2013
IRIAN Solutions · irian.at
![Page 2: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/2.jpg)
Agenda

• In general, a (random) selection of (more or less loosely coupled) points I would like to address
• Low-level concurrency - only once you understand the complexity will you appreciate the solution :)
• Thread pools, contention issues around them and the enlightened path to Akka
• What's missing: a lot. Software transactional memory (Clojure), data-flow concurrency, futures and more theory I wanted to cover
![Page 3: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/3.jpg)
We’ll focus on utilisation
• “The number of idle cores on my machine doubles every two years” - Sander Mak (DZone interview)
• Distinction between low latency (produce one answer fast) and high throughput (produce lots of answers fast) somewhat fuzzy anyway
![Page 4: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/4.jpg)
Synchronisation is not the enemy

• Locks are not expensive, lock contention is - don't shoot the messenger! Does contention at junctions arise because of traffic lights or because of bad traffic planning?
• Most locking in Java programs is not only uncontended, but also unshared
• Rule of thumb: think about contention first, and only then worry about your locking.

See also: Brian Goetz, "Threading lightly, Part 1: Synchronization is not the enemy", http://www.ibm.com/developerworks/library/j-threads1/index.html

Note: When benchmarking your application, don't deliberately provoke contention that wouldn't arise otherwise!
![Page 5: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/5.jpg)
Synchronised on the JVM
• Optimised for the uncontended case (i.e. the usual one) - can be handled entirely within the JVM (i.e. no OS calls)
• Lightweight locking based on CAS instructions
Example: implementation of thin locks in IBM's version of the JDK 1.1.2 for AIX (yes, yes, .. totally outdated, but you get the idea ..)
See also: http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon98Thin.pdf
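To make the CAS idea concrete, here is a hypothetical sketch (my illustration, not the thin-lock algorithm from the paper and not how HotSpot actually implements `synchronized`): in the uncontended case a single compare-and-set acquires the lock with no OS involvement.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of CAS-based lightweight locking: the common,
// uncontended case is a single compareAndSet with no OS call at all.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Fast path: one CAS succeeds when there is no contention.
        while (!locked.compareAndSet(false, true)) {
            // Slow path: a real JVM would "inflate" to an OS-level fat
            // lock here instead of burning CPU by spinning.
            Thread.yield();
        }
    }

    void unlock() {
        locked.set(false);
    }

    boolean isLocked() {
        return locked.get();
    }
}
```

A real implementation also has to handle reentrancy and lock inflation; this only shows the fast path the slide is talking about.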
![Page 6: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/6.jpg)
Locking benchmarks (1)
Shamelessly taken from: http://www.ibm.com/developerworks/library/j-jtp11234/
Not the code used in the benchmarks(!) - this is just to illustrate it (and to show off my uber non-blocking locking skills*)
* You are allowed to keep any bugs you find at your discretion.
![Page 7: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/7.jpg)
Locking benchmarks (2)
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 15
Uncontended version
Contended version
“With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance.” - Brian Goetz
• Think roundabouts vs. traffic lights
• Benchmark is deceptive as it produces an unusually high amount of contention. Atomics scale quite nicely in reality.
• Actual lesson learned: always measure before you assume anything! There is no general performance advice.
[Two graphs from JCIP: throughput vs. number of threads (2, 4, 8, 16, 32, 64), comparing ReentrantLock and AtomicInteger under uncontended and contended workloads]
Note: The graph is not based on values I measured - it's from JCIP, and I didn't use a ruler to measure points in the pictures. It's not accurate and doesn't aim to be!
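For reference, the two counter variants the benchmark compares look roughly like this (my minimal sketch, not the actual JCiP benchmark code):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Lock-based counter: blocks other threads while incrementing.
class LockCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value;

    int increment() {
        lock.lock();
        try {
            return ++value;
        } finally {
            lock.unlock();
        }
    }
}

// Atomic counter: a CAS loop under the hood - retries on contention
// instead of blocking, which is why it scales better at low contention.
class AtomicCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        return value.incrementAndGet();
    }
}
```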
![Page 8: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/8.jpg)
Lock splitting

• Incohesive classes tend to coarsen lock granularity ..
• .. at least make your locks cohesive by splitting them (even better: write cohesive classes to begin with!)
• Only a short-term solution to contention - in this case, as soon as you double the load, you're back where you started
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 11
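A minimal sketch of the splitting idea (hypothetical class, modelled on the kind of example Chapter 11 uses): independent state gets independent locks instead of one lock on `this`.

```java
// Lock splitting: `users` and `queries` are unrelated, so guarding them
// with separate lock objects halves the contention on this class.
class ServerStats {
    private final Object userLock = new Object();
    private final Object queryLock = new Object();

    private int users;    // guarded by userLock
    private int queries;  // guarded by queryLock

    int addUser() {
        synchronized (userLock) { return ++users; }
    }

    int addQuery() {
        synchronized (queryLock) { return ++queries; }
    }
}
```

A thread adding a user no longer blocks a thread adding a query - but as the slide says, double the load on either field and you're contended again.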
![Page 9: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/9.jpg)
Lock striping

• Extends the lock splitting idea, but works on partitions of variably sized data
• Classic example: ConcurrentHashMap - 16 segments, each with its own lock, rather than one "global" lock
• Effectiveness depends on the number of available processors and the likelihood that threads end up locking the same partition (e.g. non-uniformly distributed data)
• To some extent also a trade-off between memory and performance (e.g. do you really need 16 segments in every ConcurrentHashMap? They're not that cheap!)
See also: http://ria101.wordpress.com/2011/12/12/concurrenthashmap-avoid-a-common-misuse/ and of course “Java Concurrency in Practice”, Chapter 11
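The striping idea in miniature (a hypothetical counter of mine, not ConcurrentHashMap's actual segment code): hash each key onto one of N stripes so that threads touching different stripes never contend.

```java
// Lock striping: 16 stripes, each with its own lock, instead of one
// global lock over the whole counts array.
class StripedCounter {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final int[] counts = new int[STRIPES];

    StripedCounter() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new Object();
    }

    private int stripe(Object key) {
        return Math.floorMod(key.hashCode(), STRIPES);
    }

    void increment(Object key) {
        int s = stripe(key);
        synchronized (locks[s]) { counts[s]++; }
    }

    // Whole-structure operations need every stripe lock - the same reason
    // size() was expensive on the segmented ConcurrentHashMap.
    int total() {
        int sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) { sum += counts[i]; }
        }
        return sum;
    }
}
```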
![Page 10: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/10.jpg)
Layers of synchronisation
• High-level concurrency abstractions (java.util.concurrent, scala.concurrent)
• Low-level locking (synchronized() blocks and util.concurrent.locks)
• Low-level primitives (volatile variables, util.concurrent.atomic classes)
• Data races: deliberate undersynchronisation (Avoid!)
Shamelessly taken from: Jeremy Manson, “Advanced Topics in Programming Languages: The Java Memory Model” http://www.youtube.com/watch?v=1FX4zco0ziY
Let’s take a step back for a moment ..
![Page 11: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/11.jpg)
Synchronisation addresses two distinct issues
• Thread-interference or atomicity
• Visibility, ordering and memory consistency (i.e. what volatile is about)
Quantum concurrency and Schrödinger’s memory tricks:
The thread we’ll use to observe the value of the counter has an effect on the observation!
![Page 12: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/12.jpg)
Why is this code broken?

• Double-checked locking and concurrent collections, so what's the problem then? (Don't argue about whether or not caches should preload everything up-front - fair point, but that's not the issue here)
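The code from the slide isn't reproduced in this transcript; the classic shape of the bug looks like this hypothetical cache (field names are my invention):

```java
import java.util.HashMap;
import java.util.Map;

// Broken double-checked locking: `cache` is NOT volatile, so another
// thread may observe a non-null reference to a map whose internal
// fields are not yet visible (the write can appear reordered).
class BrokenCache {
    private Map<String, String> cache; // BROKEN: missing volatile

    Map<String, String> get() {
        if (cache == null) {                  // 1st check: no lock
            synchronized (this) {
                if (cache == null) {          // 2nd check: under the lock
                    cache = new HashMap<>();  // may look "published" to a
                }                             // reader before it's initialised
            }
        }
        return cache;
    }
}
```

Single-threaded this works fine every time, which is exactly what makes it so treacherous.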
![Page 13: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/13.jpg)
Can you see it now?

• Semantically speaking, this is exactly the same code
• The compiler, the JVM, the operating system and even the CPU conspire behind your back in the Extraordinary League of Ordinary Things That Will Mess You Up! Most likely, they're sinister enough to wait until you deploy to production before they show their true faces!
See also: Most/many double-checked locking implementations around Singletons
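For completeness (the slides don't show it), the standard repair for a double-checked singleton is a `volatile` field, which restores the happens-before edge between the write and any later read - a hedged sketch:

```java
// Correct double-checked locking: `instance` is volatile, so once a
// reader sees a non-null reference, it also sees a fully constructed
// object.
class FixedSingleton {
    private static volatile FixedSingleton instance;

    static FixedSingleton get() {
        FixedSingleton result = instance;  // one volatile read on the fast path
        if (result == null) {
            synchronized (FixedSingleton.class) {
                result = instance;
                if (result == null) {
                    instance = result = new FixedSingleton();
                }
            }
        }
        return result;
    }
}
```

(In practice an enum or a static holder class is usually simpler than getting this idiom right.)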
![Page 14: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/14.jpg)
Happens-before relationships

• Monitor lock rule. An unlock on a monitor lock happens before every subsequent lock on that same monitor lock.
• Volatile variable rule. A write to a volatile field happens before every subsequent read of that same field.
• ...
• Transitivity. If A happens before B, and B happens before C, then A happens before C.
Shamelessly taken from: “Java theory and practice: Fixing the Java Memory Model, Part 2“,http://www.ibm.com/developerworks/library/j-jtp03304/
![Page 15: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/15.jpg)
Volatile piggybacking (1)
Shamelessly taken from Viktor Klang's GitHub: https://gist.github.com/viktorklang/2362563
• With high-level concurrency frameworks, you may not have to worry about these issues (note: plain, vanilla thread pools are not high-level enough - a very fragile technique anyway)
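The gist itself isn't in this transcript, but the piggybacking idea looks roughly like this (my sketch, not Klang's code): a volatile write publishes *all* writes that happened before it, so a plain field can ride along on a volatile flag.

```java
// Volatile piggybacking: the non-volatile `payload` becomes safely
// visible because its write happens before the volatile write to
// `ready`, and the volatile read of `ready` happens before the read
// of `payload` (volatile rule + program order + transitivity).
class Piggyback {
    private int payload;            // deliberately NOT volatile
    private volatile boolean ready; // the barrier we piggyback on

    void publish(int value) {
        payload = value; // ordinary write ..
        ready = true;    // .. published by the volatile write after it
    }

    // Returns -1 if nothing has been published yet.
    int read() {
        if (ready) {        // volatile read pairs with the write above
            return payload; // guaranteed to see `value`
        }
        return -1;
    }
}
```

Swap the two lines in `publish` and the guarantee evaporates - the ordering is the whole trick.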
![Page 16: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/16.jpg)
Volatile piggybacking (2)

• A repetitive exercise, I know, but why can't we rely on thread pools for memory consistency? They do have locks internally! (I promise, you'll understand concurrent code a lot better if you think this through!)

Hint: Think about happens-before relationships with regard to locks and multiple workers (i.e. what are the release/acquire pairs for your memory barriers?)
![Page 17: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/17.jpg)
A word on immutability (1)

• The Java Memory Model treats final fields / val fields specially (the value must be assigned before the constructor returns and cannot be re-assigned)
Actors and the Java Memory Model. In most cases messages are immutable, but if that message is not a properly constructed immutable object, without a "happens before" rule, it would be possible for the receiver to see partially initialized data structures and possibly even values out of thin air (longs/doubles).
See: http://typesafe.com/blog/akka-and-the-java-memory-model
![Page 18: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/18.jpg)
A word on immutability (2)
• Its state cannot be modified after construction (i.e. no getters that return mutable objects, nothing passed to the constructor references mutable objects held by this one, etc.)
• All fields are declared as final / val *
• It is properly constructed (i.e. the this reference doesn’t escape during construction)
This gives us a precisely defined notion of immutability.

* Yes, java.lang.String is not immutable according to that definition: hash codes are cached and there actually is a data race in that method, but it's a benign one. So for all intents and purposes, java.lang.String can still be considered an immutable class.
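The String-style benign race is worth seeing spelled out - a hypothetical class of mine imitating the idiom (String's real code is structured a little differently, but treats 0 as "not yet computed" in the same way):

```java
// Benign data race: two threads may both compute the hash, but they
// compute the SAME value over immutable state, so a lost update is
// harmless - every writer writes the identical number.
final class Name {
    private final char[] chars; // immutable: final and never leaked
    private int hash;           // racy cache, 0 means "not computed yet"

    Name(String s) {
        this.chars = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;           // read the field ONCE into a local
        if (h == 0) {
            for (char c : chars) h = 31 * h + c;
            hash = h;           // racy write, but always the same value
        }
        return h;
    }
}
```

The single read into a local is essential: checking `hash` twice could observe two different values under the race.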
![Page 19: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/19.jpg)
Tasks and thread pools

• Heterogeneous tasks are annoying when you aim for utilisation (a bit theoretical, though, as it presumably averages out .. but ..)
• Dependent tasks cause even more issues (possibly even deadlocks, if it's a bounded thread pool)

Sequential: Task A, then Task B (10x Task A)
Parallel: Task A alongside Task B (10x Task A)

Result: a whopping 9% speedup! (well, we still need to deduct something for concurrency overhead ..)
![Page 20: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/20.jpg)
Configuring thread pools (1)

| | Worker "queue" (no such queue really exists, but we'll just think that way) | Task queue |
|---|---|---|
| newFixedThreadPool | bounded - n | unbounded LinkedBlockingQueue |
| newSingleThreadExecutor | bounded - 1 | unbounded LinkedBlockingQueue |
| newCachedThreadPool | unbounded | SynchronousQueue |
| alternative invocation of new ThreadPoolExecutor(..) | bounded - n | bounded - m, m > n: LinkedBlockingQueue(m) |
| alternative invocation of new ThreadPoolExecutor(..) | bounded - n | SynchronousQueue |

The same implementation can exhibit radically different behaviour depending on how you instantiate it.

Note: SynchronousQueues are not just LinkedBlockingQueues with capacity 1. They're more like rendezvous channels in CSP.
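The factory methods above are all thin wrappers around `ThreadPoolExecutor` - the behaviour comes entirely from the constructor arguments. A sketch of the two "alternative invocation" rows, with illustrative values n = 4 and m = 64 (my numbers, not from the slides):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class Pools {
    // Bounded pool with a bounded queue (4th row of the table).
    static ThreadPoolExecutor boundedBoth() {
        return new ThreadPoolExecutor(
                4, 4,                           // n = 4 worker threads
                0L, TimeUnit.MILLISECONDS,      // core threads never time out
                new LinkedBlockingQueue<>(64)); // at most m = 64 queued tasks
    }

    // Bounded pool with direct hand-off, no buffering at all (last row).
    static ThreadPoolExecutor boundedHandoff() {
        return new ThreadPoolExecutor(
                4, 4,
                0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());      // rendezvous, capacity 0
    }
}
```

Same class, radically different saturation behaviour: the first absorbs bursts of up to 64 tasks, the second rejects the fifth concurrent task immediately unless a saturation policy says otherwise.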
![Page 21: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/21.jpg)
Configuring thread pools (2)

• A client-run saturation policy means that overload causes tasks to be pushed outward from the thread pool (no more accepts, TCP might drop connections, etc. - which ultimately enables clients to handle degradation as well, e.g. via load balancing)
• For example, asynchronous loggers that don't break down when sh** hits the fan!
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 8.3
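The client-run saturation policy is available out of the box as `ThreadPoolExecutor.CallerRunsPolicy`; a hedged sketch of wiring it up (sizes are illustrative):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// When the pool is saturated, overflow tasks run in the SUBMITTING
// thread. That thread is then too busy to submit more work, which
// throttles the producers instead of dropping tasks - exactly the
// "push the load outward" behaviour the slide describes.
class ThrottledPool {
    static ThreadPoolExecutor create() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>()); // no internal buffering
        pool.setRejectedExecutionHandler(
                new ThreadPoolExecutor.CallerRunsPolicy());
        return pool;
    }
}
```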
![Page 22: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/22.jpg)
Visualising task queues

• Predefined tasks (nudge nudge, actors) will be used to process different data (I don't even need to "nudge nudge" here ..)

One shared queue: [Task A / Payload 1] [Task B / Payload 2] [Task A / Payload 3] [Task C / Payload 4] [Task A / Payload 5] [Task B / Payload 6] [Task A / Payload 7] [Task C / Payload 8] ... consumed by Thread 1, Thread 2, Thread 3, Thread 4

Spot the issue in this model! Hint: think "contention", think "BlockingQueue.take()"

What could the solution look like? Maintain the invariant that we're only allowed to process a message once and only once!

Hint: It's not non-blocking locking!
![Page 23: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/23.jpg)
Organising task queues

• Make a distinction between tasks and data and do some sensible partitioning

Task A's queue: Payload 1, Payload 3, Payload 5, Payload 7, ...
Task B's queue: Payload 2, Payload 6, ...
(served by Thread 1, Thread 2, Thread 3, Thread 4)

• Tasks now have message .. I mean .. payload queues
• n tasks with a queue each means 1/n of the load per queue (if you add new kinds of tasks, this scales; if you just add more messages, not so much - but hold that thought!)
• Tasks can still be executed in parallel (i.e. you don't get away without synchronisation yet)
![Page 24: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/24.jpg)
Does it really make a difference? (1)

• The comparison is silly and totally crazy, but it's a bit like the difference between these two pieces of code (obviously neither is recommended ..)
• Apart from reduced contention, there are all kinds of localities that you're exploiting (cache friendliness, GC friendliness - new objects don't span multiple threads, and so on and so forth)
In case you haven’t had enough background literature yet: http://gee.cs.oswego.edu/dl/papers/fj.pdf
![Page 25: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/25.jpg)
Does it really make a difference? (2)

• In case you still don't believe me, here's a proof by "Pics or it didn't happen!"
• ForkJoin pools organise tasks similarly, hence the comparison
Shamelessly taken from: “Scalability of ForkJoin Pool”, http://letitcrash.com/post/17607272336/scalability-of-fork-join-pool
![Page 26: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/26.jpg)
One more thing!

• The missing commandment: thou shalt not schedule two tasks at the same time if they both need the same locks!
• How would the scheduler know? Well, here's an educated guess: if two tasks are the same task, they will most likely also need the same locks!
• Executing an actor only once at a time therefore has performance reasons (yes, it also makes the actor easier to reason about .. but we wouldn't want to appear lame ..)
• Conversely, if you write different actors, make sure they don't use the same locks (not sure if this is a best practice in Akka, but it's certainly true in Erlang)
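A hedged sketch of the "only once at a time" invariant (my toy code, not Akka's actual mailbox implementation): a single CAS on a `scheduled` flag guarantees that at most one pool thread drains an actor's mailbox at any moment, so the actor's own state needs no locks.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal actor-style mailbox: messages from any thread go into the
// queue, but only the thread that wins the false -> true CAS on
// `scheduled` may process them - one drain at a time, ever.
class MiniActor {
    private final Queue<Runnable> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final Executor pool;

    MiniActor(Executor pool) {
        this.pool = pool;
    }

    void tell(Runnable message) {
        mailbox.add(message);
        trySchedule();
    }

    private void trySchedule() {
        if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }

    private void drain() {
        Runnable message;
        while ((message = mailbox.poll()) != null) {
            message.run(); // processed one at a time, single-threadedly
        }
        scheduled.set(false);
        trySchedule(); // re-check: a message may have raced in after poll()
    }
}
```

The same flag also gives the scheduler the hint from the first bullet for free: the "same task" can never be scheduled twice concurrently.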
![Page 27: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/27.jpg)
Finally!
• The devil’s in the detail and unfortunately some knowledge of these details is required to design scalable architectures.
• In particular, understanding the underlying issues will hopefully help you with designing scalable Akka applications (e.g. applying what you’ve heard, what can you do about too many messages being queued up?)
• Concurrency is hard, yes, but isn't that the beauty of it? Not at all, but never mind!
![Page 28: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/28.jpg)
To sum up, just read this book!
![Page 29: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/29.jpg)
Q&A
![Page 30: Concurrency on the JVM](https://reader033.vdocuments.site/reader033/viewer/2022042613/54b7a93a4a7959b0218b4623/html5/thumbnails/30.jpg)
Thanks!
Bernhard Huemer · bhuemer.at · @bhuemer
23 July 2013
IRIAN Solutions · irian.at