JVM Memory Model - Yoav Abrahami, Wix
TRANSCRIPT
JVM
Memory Model
JIT
Anomalies
• How long does it take to count to 100?
• How long does it take to append to a list?
To sort a list?
• How long does it take to append to a
vector? To sort a vector?
Dynamic vs Static Compilation
• Static Compilation
– “ahead-of-time” (AOT) compilation
– Source code -> Native executable
– Compiles before executing
• Dynamic compiler (JIT)
– “just-in-time” (JIT) compilation
– Source -> bytecode -> interpreter -> JITed
– Most compilation happens during execution
JIT Compilation
• Aggressive optimistic optimizations
– Through extensive usage of profiling info
– Limited budget (CPU, Memory)
– Startup speed may suffer
• The JIT
– Compiles bytecode when needed
– Maybe immediately before execution?
– Maybe never?
JVM JIT Compilation
• Eventually JITs bytecode
– Based on profiling
– After 10,000 invocations, again after 20,000 invocations
• Profiling allows focused code-gen
• Profiling allows better code-gen
– Inline what’s hot
– Loop unrolling, range-check elimination, etc.
– Branch prediction, spill-code-gen, scheduling
JVM JIT Compilation
• JVM applications operate in mixed mode
• Interpreted
– Bytecode-walking
– Artificial stack machine
• Compiled
– Direct native operations
– Native register machine
JVM application utilization
Optimizations in the HotSpot JVM
Inlining
int addAll(int max) {
int accum = 0;
for (int i=0; i < max; i++) {
accum = add(accum, i);
}
return accum;
}
int add(int a, int b) {
return a+b;
}
// after inlining add()
int addAll(int max) {
int accum = 0;
for (int i=0; i < max; i++) {
accum = accum + i;
}
return accum;
}
Loop unrolling
public void foo(int[] arr, int a) {
for (int i=0; i<arr.length; i++) {
arr[i] += a;
}
}
// after unrolling by 4
public void foo(int[] arr, int a) {
int limit = arr.length / 4;
for (int i=0; i<limit ; i++){
arr[4*i] += a; arr[4*i+1] += a;
arr[4*i+2] += a; arr[4*i+3] += a;
}
for (int i=limit*4; i<arr.length; i++) {
arr[i] += a;
}
}
Escape Analysis
public int m1() {
Pair p = new Pair(1,2);
return m2(p);
}
public int m2(Pair p) {
return p.first + m3(p);
}
public int m3(Pair p) {
return p.second;
}
// after deep inlining
public int m1() {
Pair p = new Pair(1,2);
return p.first + p.second;
}
// optimized version
public int m1() {
return 3;
}
Monitoring the JIT
• Info about compiled methods
– -XX:+PrintCompilation
• Info about inlining
– -XX:+PrintInlining
– Also requires -XX:+UnlockDiagnosticVMOptions
• Print the assembly code
– -XX:+PrintAssembly
– Also requires -XX:+UnlockDiagnosticVMOptions
– On Mac OS, requires adding hsdis-amd64.dylib to the LD_LIBRARY_PATH environment variable
Challenge
• Rerun the benchmarks, this time using
1. -XX:+PrintCompilation
2. -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
JVM Memory
The Java Memory Model
Java Memory Model
• The Java Memory Model (JMM) describes
how threads in the Java (and Scala)
programming languages interact through
memory.
• It provides sequential consistency for
data-race-free programs.
Instruction Reordering
• Program Order
int a=1;
int b=2;
int c=3;
int d=4;
int e = a + b;
int f = c - d;
• Execution Order
int d=4;
int c=3;
int f = c - d;
int b=2;
int a=1;
int e = a + b;
Anomaly
• Two threads running, with x = y = 0 initially
Thread 1: x=1; j=y
Thread 2: y=1; i=x
• What will be the result?
i=1, j=1
i=0, j=1
i=1, j=0
i=0, j=0
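The talk's repo builds this experiment in Scala; here is a minimal Java sketch of the same idea (class and method names are illustrative, not from the repo). Because neither thread synchronizes, the compiler and the processor may reorder the two statements in each thread, so even i=0, j=0 is a legal outcome under the JMM.

```java
// Illustrative sketch of the two-thread reordering anomaly from the slide.
public class ReorderDemo {
    static int x, y, i, j;

    // Run one trial of the experiment and return the observed (i, j).
    static int[] trial() {
        x = 0; y = 0; i = 0; j = 0;
        Thread t1 = new Thread(() -> { x = 1; j = y; });
        Thread t2 = new Thread(() -> { y = 1; i = x; });
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { i, j };
    }

    public static void main(String[] args) {
        int surprising = 0;
        for (int n = 0; n < 1000; n++) {
            int[] r = trial();
            // (0, 0) is the outcome that reordering makes possible
            if (r[0] == 0 && r[1] == 0) surprising++;
        }
        System.out.println("i=0, j=0 observed " + surprising + " times out of 1000");
    }
}
```

Whether the surprising outcome actually shows up depends on the hardware and how the JIT compiles the lambdas, which is exactly why the slide asks "do we see the anomaly?"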
Let’s Check
• Let’s build the scenario
val t1 = new Thread(new Runnable {
def run() {
// sleep a little to add some uncertainty
Thread.sleep(1)
x=1
j=y
}
})
• Then run it a few times
• Do we see the anomaly?
Happens Before Ordering
• Defines constraints on instruction reordering
• A monitor release happens-before a matching monitor acquire
• A volatile write happens-before a subsequent read of the same volatile field
– For non-volatile fields, this is not necessarily the case!
• Assignment dependencies within a single thread are preserved
• Happens Before ordering is transitive
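The transitivity rule is what makes safe publication work. A hedged Java sketch (names are illustrative): a plain write before a volatile write becomes visible to any thread that reads the volatile flag, even though the data field itself is not volatile.

```java
// Safe publication via a volatile flag, using happens-before transitivity:
// write(data) hb write(ready) hb read(ready == true) hb read(data).
public class Publication {
    static int data = 0;                   // plain, non-volatile field
    static volatile boolean ready = false; // volatile flag

    static int readWhenReady() {
        data = 0;
        ready = false;
        Thread writer = new Thread(() -> {
            data = 42;    // ordinary write...
            ready = true; // ...published by the volatile write
        });
        writer.start();
        while (!ready) { } // volatile read; spins until the flag flips
        return data;       // guaranteed to observe 42
    }

    public static void main(String[] args) {
        System.out.println(readWhenReady());
    }
}
```

If `ready` were not volatile, the spin loop could run forever and the read of `data` could observe 0.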
Anomaly
• Let’s see how far we can count in 100 milliseconds
var running = true
• Let thread 1 count
var count = 0
while (running)
  count = count + 1
println(count)
• Let thread 2 signal thread 1 to stop
Thread.sleep(100)
running = false
println("thread 2 set running to false")
Volatile
• Compilers can reorder instructions
• Compilers can keep values in registers
• Processors can reorder instructions
• Values may be in different caching levels
and not synced to main memory
• JMM is designed for aggressive
optimizations
Volatile
• Modern processor cache hierarchy: four cores, each with its own L1 and L2; pairs of cores share an L3; all cores share main memory
• Typical access latencies:
– Core registers: < 1 ns
– L1: ~1 ns (3-4 cycles)
– L2: ~3 ns (10-12 cycles)
– L3: ~15 ns (40-45 cycles)
– Main Memory (DRAM): ~65 ns
Volatile
• Volatile instructs the compiler and processor to sync the value to main memory on every access
– Does not rely on the L1, L2 or L3 cache
• Volatile reads / writes cannot be reordered
• Volatile longs and doubles are atomic
– long and double are 64-bit types, while the processor only guarantees 32-bit atomicity by default
Resolve the Anomaly
• Let’s see how far we can count in 100 milliseconds
@volatile var running = true
• Let thread 1 count
var count = 0
while (running)
  count = count + 1
println(count)
• Let thread 2 signal thread 1 to stop
Thread.sleep(100)
running = false
println("thread 2 set running to false")
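The same fix, sketched in Java (the talk's version is Scala; class and method names here are assumptions for illustration). Marking the flag `volatile` guarantees the counter thread eventually sees the write and the loop terminates.

```java
// Stop-flag pattern: volatile guarantees the counter thread sees the update.
public class StopFlag {
    static volatile boolean running = true;

    // Count in a background thread for roughly `millis` ms, then stop it.
    static long countWhileRunning(long millis) {
        running = true;
        final long[] count = { 0 };
        Thread counter = new Thread(() -> {
            long c = 0;
            while (running) c++;   // volatile read on every iteration
            count[0] = c;
        });
        counter.start();
        try {
            Thread.sleep(millis);
            running = false;       // volatile write: visible to the counter
            counter.join();
        } catch (InterruptedException e) {
            running = false;
            Thread.currentThread().interrupt();
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countWhileRunning(100));
    }
}
```

Without `volatile`, the JIT may hoist the read of `running` out of the loop, and the counter thread can spin forever.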
Anomaly
• Let’s count to 10,000
• But let’s use 10 threads, each adding 1,000 to our count
var count = 0
• Each of the 10 threads does
for (i <- 1 to 1000)
  count = count + 1
• What did we get?
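A Java sketch of the same experiment (names are illustrative). Because `count = count + 1` compiles to a separate read, add, and write, concurrent increments can interleave and lose updates, so the result is typically below 10,000.

```java
// Lost-update anomaly: 10 threads each increment a shared counter 1,000 times.
public class LostUpdates {
    static int count;

    static int raceCount(int nThreads, int perThread) {
        count = 0;
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    count = count + 1;  // getfield / iadd / putfield: three steps
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return count;  // often less than nThreads * perThread
    }

    public static void main(String[] args) {
        System.out.println(raceCount(10, 1000));
    }
}
```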
Synchronization
• Let’s have another look at the assignment
count = count + 1
• Is this a single instruction?
• javap
– javap <class> - Print the class signature
– javap -c <class> - Print the class bytecode
Synchronization
• The bytecode for count = count + 1
14: getfield #38 // Field scala/runtime/IntRef.elem:I
17: iconst_1
18: iadd
19: putfield #38 // Field scala/runtime/IntRef.elem:I
Synchronization
• The bytecode for count = count + 1
// Read the current counter value from field #38
// and push it onto the stack
14: getfield #38 // Field scala/runtime/IntRef.elem:I
// Push the constant 1 onto the stack
17: iconst_1
// Pop the top two stack elements, add them as integers,
// and push the result
18: iadd
// Pop the top of the stack and store it into field #38,
// assuming it is an integer
19: putfield #38 // Field scala/runtime/IntRef.elem:I
Synchronization Tools
[Diagram: actions by thread 1, then thread 1 “releases” a monitor; thread 2 “acquires” the same monitor, then performs its actions. The release happens-before the acquire, so thread 1’s actions are visible to thread 2.]
Synchronization Tools
• Synchronization tools allow grouping instructions as if “one atomic instruction”
– Only one thread can perform the code at a time
• Some tools
– Synchronized
– ReentrantLock
– CountDownLatch
– Semaphore
– ReentrantReadWriteLock
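One of the listed tools, CountDownLatch, sketched in Java (names are illustrative): it lets one thread wait for N others, and each `countDown()` happens-before a successful `await()`, so the workers' writes are visible afterwards.

```java
import java.util.concurrent.CountDownLatch;

// CountDownLatch: wait for N workers, with visibility of their writes.
public class LatchDemo {
    static int sumAfterLatch(int workers) {
        int[] results = new int[workers];
        CountDownLatch done = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                results[id] = id + 1;  // write before countDown()...
                done.countDown();      // ...happens-before await() returning
            }).start();
        }
        try {
            done.await();  // blocks until all workers counted down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int sum = 0;
        for (int r : results) sum += r;  // all writes are visible here
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumAfterLatch(4));  // 1+2+3+4 = 10
    }
}
```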
Synchronization Tools
• Simplest tool – synchronized
// for each thread
for (i <- 1 to 1000)
  synchronized {
    count = count + 1
  }
• Works relative to ‘this’
Synchronization Tools
• Using ReentrantLock
// before the threads
val lock = new ReentrantLock()
// for each thread
for (i <- 1 to 1000) {
lock.lock()
try {
count = count + 1
}
finally {
lock.unlock()
}
}
Atomic Operations
• Containers for simple values or references
with atomic operations
• getAndIncrement
• getAndDecrement
• getAndAdd
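Replacing the plain counter with an AtomicInteger fixes the earlier lost-update anomaly without any locks; a Java sketch (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free counter: getAndIncrement() is an atomic read-modify-write.
public class AtomicCounter {
    static int countTo(int nThreads, int perThread) {
        AtomicInteger count = new AtomicInteger(0);
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    count.getAndIncrement();  // no updates can be lost
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(countTo(10, 1000));  // always 10000
    }
}
```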
Atomic Operations
• All are based on compareAndSwap
– From the sun.misc.Unsafe class
– Used to implement spin locks
Atomic Operations
• Spin lock, as implemented in AtomicInteger:
public final int getAndIncrement() {
  for (;;) {
    int current = get();
    int next = current + 1;
    if (compareAndSet(current, next))
      return current;
  }
}
public final boolean compareAndSet(int expect, int update) {
  return unsafe.compareAndSwapInt(this,
    valueOffset, expect, update);
}
References
• The examples on Github
https://github.com/yoavaa/jvm-memory-model
Questions?