JIT-Compiler-Assisted Distributed Java Virtual Machine

Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau

The Systems Research Group

Department of Computer Science and Information Systems

The University of Hong Kong

Presented by Cho-Li Wang

TCHPC 2004, Taiwan, March 2004

Outline

Distributed Java Virtual Machine (DJVM)
Design tradeoffs
Related work
JESSICA2 DJVM
  JIT-compiler-assisted dynamic thread migration
  Global Object Space (GOS) for location-transparent object access
Experimental results + a demo
Conclusion & future work


Distributed Java Virtual Machine (DJVM)

A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running in a distributed environment to support true parallel execution of a multithreaded Java application.

A DJVM provides all JVM services in compliance with the Java language specification.

A DJVM gives the illusion that the program is running on a single, more powerful machine: a Single System Image (SSI).

[Figure: DJVM (Single System Image) comprising Class, Thread, Heap and Bytecode Execution Engine]

import java.util.*;

class worker extends Thread {
    private long n;

    public worker(long N) { n = N; }

    // Each thread sums 0..n-1 and prints the result.
    public void run() {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i;
        System.out.println("N=" + n + " Sum=" + sum);
    }
}

public class test {
    static final int N = 100;

    public static void main(String args[]) {
        worker[] w = new worker[N];
        Random r = new Random();
        // Give every worker a random amount of work.
        for (int i = 0; i < N; i++) w[i] = new worker(r.nextLong());
        for (int i = 0; i < N; i++) w[i].start();
        // Wait for all workers to finish.
        try {
            for (int i = 0; i < N; i++) w[i].join();
        } catch (Exception e) {}
    }
}

[Figure: the Java threads of the program above distributed across four JVMs]


Design Tradeoffs of a DJVM

How to manage the threads? (ThreadSched)
  Distributed thread scheduling
  Initial thread placement vs. migration

How to store the data? (Heap)
  Object store: a global heap shared by threads?
  Memory consistency: Java memory model?
  Can an off-the-shelf DSM be used? Or others?

How to process the bytecode? (ExecEngine)
  Execution engine: interpretation, Just-in-Time (JIT) compilation, static compilation
  High performance?


Related work

cJVM (IBM Haifa Research)
  Interpreter mode execution
  Embedded OO-based DSM (Proxy)

JAVA/DSM (Rice University)
  Interpreter mode execution
  Heap built on top of a page-based DSM

JESSICA (HKU)
  Thread migration
  Interpreter mode execution
  Heap built on top of a page-based DSM

Jackal, Hyperion
  Static compilation
  Link to an object-based DSM

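System            Thread distribution    Execution engine    Heap
cJVM              Remote creation        Interpreter         Embedded OO-based DSM (Proxy)
JAVA/DSM          Manual distribution    Interpreter         Page-based DSM
JESSICA           Transparent migration  Interpreter         Page-based DSM
Jackal, Hyperion  Remote creation        Static compilation  OO-based DSM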


JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)

[Figure: JESSICA2 architecture. A multithreaded Java program runs on six JESSICA2 JVMs (one master and five workers) in JIT compiler mode. Threads migrate between nodes as portable Java frames, and the Global Object Space provides a shared global heap spanning all cluster nodes.]


JESSICA2 Main Features

Cluster-aware bytecode execution engine (JITEE)
  JVM operated in Just-In-Time (JIT) compilation mode
  Cluster-aware: global naming scheme for threads, objects, ...

JIT-compiler-assisted dynamic thread migration
  Runtime capturing and restoring of thread execution context
  No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
  Enables dynamic load balancing

Global Object Space (GOS)
  Provides location-transparent object access for threads
  Tightly integrated with the JVM
  Memory consistency: compliant with the Java Memory Model (JMM)
  Various optimizing schemes: adaptive home migration, synchronized method shipping, object pushing

I/O redirection


JESSICA2 thread migration (In a JIT-enabled JVM)

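[Figure: thread migration between a source node and a destination node. Components shown: Thread Scheduler, Load Monitor, Migration Manager and the Method Area of each JVM. On the source node, an alert (step 1) triggers stack analysis and stack capturing, which turn the thread's frames (RTC) into a BTC. On the destination node, frame parsing and restore execution rebuild the frames (RTC) and the PC.]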

RTC: Raw Thread Context

BTC: Bytecode-oriented Thread Context = thread id + Java frames (class name, method signature, PC, operand stack pointer, local variables, ...)

The RTC is transformed into the BTC directly inside the JIT compiler.
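The BTC contents listed above can be pictured as a small data structure. The following is only a minimal sketch, not the JESSICA2 implementation: the class and field names (`BytecodeThreadContext`, `JavaFrame`, `operandStackPtr`, and so on) are hypothetical and simply mirror the fields named on the slide. The sample values come from the CPI::run()V frame shown on the stack transformation slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one Java frame, using the fields listed for the BTC. */
final class JavaFrame {
    final String className;        // e.g. "CPI"
    final String methodSignature;  // e.g. "run()V"
    final int pc;                  // bytecode program counter
    final int operandStackPtr;     // current depth of the operand stack
    final Object[] localVars;      // local variable slots

    JavaFrame(String className, String methodSignature, int pc,
              int operandStackPtr, Object[] localVars) {
        this.className = className;
        this.methodSignature = methodSignature;
        this.pc = pc;
        this.operandStackPtr = operandStackPtr;
        this.localVars = localVars;
    }
}

/** Hypothetical BTC: a thread id plus its Java frames. */
final class BytecodeThreadContext {
    final long threadId;
    final List<JavaFrame> frames = new ArrayList<>();

    BytecodeThreadContext(long threadId) { this.threadId = threadId; }

    void push(JavaFrame frame) { frames.add(frame); }

    /** Renders the context in a form loosely modelled on the slide's frame dump. */
    String describe() {
        StringBuilder sb = new StringBuilder("thread " + threadId + ":");
        for (JavaFrame f : frames) {
            sb.append(' ').append(f.className).append("::").append(f.methodSignature)
              .append('@').append(f.pc)
              .append(" local=").append(f.localVars.length)
              .append(";stack=").append(f.operandStackPtr);
        }
        return sb.toString();
    }
}

final class BtcDemo {
    public static void main(String[] args) {
        BytecodeThreadContext btc = new BytecodeThreadContext(1L);
        btc.push(new JavaFrame("CPI", "run()V", 111, 0, new Object[13]));
        System.out.println(btc.describe());
    }
}
```

Because such a context refers only to bytecode-level entities (class, method, bytecode PC, variable slots), it does not depend on native stack addresses or register contents of the source node.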


Thread Stack Transformation

Raw Thread Context (RTC), input to stack capturing:

  %esp:    0x00000000
  %esp+4:  0x082ca809
  %esp+8:  0x08225400
  %esp+12: 0x08266bc0

Bytecode-oriented Thread Context (BTC):

  Frames{
  method CPI::run()V@111
  local=13;stack=0;
  var:
  arg0:CPI, 33, 0x8225400
  local1: [D; 33, 0x8266bc0@2
  local2: int, 2;
  ...

Raw Thread Context (RTC), output of stack restoration:

  %esp:    0x00000000
  %esp+4:  0x086243c
  %esp+8:  0x08623200
  %esp+12: 0x08293010
  ...
  %eax = 0x08623200
  %ebx = 0x08293010

Figure callouts: method id; bytecode program counter; node id; %esp: stack pointer; [: array; D: double


Thread State Capturing: Details

[Figure: JIT compilation pipeline with instrumentation. Stages shown: bytecode verifier, bytecode translation (construct control flow graph), intermediate code, code generation, native code, linking and constant resolution. Java frame detection separates Java frames from C frames on the raw thread stack. The Global Object Space is also shown.]

Migration points:
(1) head of a basic block (loop)
(2) before a method invocation (invoke)

Instrumentation added during compilation:
1. Add migration checking code (cmp mflag,0)
2. Add object checking (local or remote object)
3. Add type and register spilling
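The migration check at each migration point amounts to polling a flag. The sketch below is a Java-level analogue written only for illustration: JESSICA2 emits the check as native code (`cmp mflag,0`) inside the JIT compiler, not as Java source, and the names `migrationRequested` and `migrationPoint` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class MigrationPointSketch {
    // Stand-in for the "mflag" mentioned on the slide (hypothetical name).
    static final AtomicBoolean migrationRequested = new AtomicBoolean(false);
    static int checksPerformed = 0;
    static int migrationsTriggered = 0;

    // Analogue of "cmp mflag,0": a cheap test on the fast path, a rare slow path.
    static void migrationPoint() {
        checksPerformed++;
        if (migrationRequested.compareAndSet(true, false)) {
            // A real DJVM would capture the thread context here; this sketch only counts.
            migrationsTriggered++;
        }
    }

    static long square(long x) { return x * x; }

    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            migrationPoint();   // migration point (1): head of the loop's basic block
            migrationPoint();   // migration point (2): before a method invocation
            sum += square(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        migrationRequested.set(true);        // simulate an alert from the scheduler
        long result = sumOfSquares(4);       // 0 + 1 + 4 + 9
        System.out.println(result + " " + checksPerformed + " " + migrationsTriggered);
        // prints "14 8 1"
    }
}
```

The point of the design is that the common case costs one compare and one untaken branch per migration point, which is consistent with the small time overhead the slides report for normal execution.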


Restoring: Dynamic Register Patching (on i386 Architecture)

[Figure: rebuilt thread stack on i386. In the direction of stack growth: bootstrap frame, trampoline frame, frame 0, frame 1. Each rebuilt frame holds a saved %ebp and a return address ("Ret addr").]

%ebp: i386 frame pointer
"Ret addr": return address of the current function call

Small code stubs rebuild the register context and jump into the native code:

  Stub for frame 0:
    reg1 <- value1
    reg2 <- value2
    jmp restore_point0

  Stub for frame 1:
    reg1 <- value1
    jmp restore_point1

Compiled methods:

  Method0(){ ... restore_point0: }
  Method1(){ ... restore_point1: }

Bootstrap:

  bootstrap(){ trampoline(); closing handler(); }


Global Object Space (GOS)

Provides a global heap abstraction for the DJVM

Home-based object coherence protocol, compliant with the Java Memory Model
  OO-based to reduce false sharing

Non-blocking communication
  Uses the threaded I/O interface inside the JVM for communication to hide latency

Adaptive object home migration mechanism
  Takes advantage of JVM runtime information for optimization

Optimizations: home migration, synchronized method shipping, object pushing
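To make the idea of adaptive home migration concrete, here is a minimal sketch. In a home-based protocol each object has a home node that holds its authoritative copy, so accesses from other nodes require communication with the home. The slides do not state the policy JESSICA2 uses to decide when a home moves; the rule below (move the home after `threshold` consecutive accesses from the same remote node) is an assumption chosen purely for illustration, and all names are hypothetical.

```java
/** Hypothetical object header that tracks the home node of a shared object. */
final class HomedObject {
    private int homeNode;
    private int lastRemoteNode = -1;
    private int consecutiveRemote = 0;
    private final int threshold;   // assumed tuning knob, not taken from the slides

    HomedObject(int homeNode, int threshold) {
        this.homeNode = homeNode;
        this.threshold = threshold;
    }

    int home() { return homeNode; }

    /** Records an access by a node; returns true if it had to contact a remote home. */
    boolean access(int node) {
        if (node == homeNode) {
            // Local access at the home: no communication, reset the remote streak.
            consecutiveRemote = 0;
            lastRemoteNode = -1;
            return false;
        }
        if (node == lastRemoteNode) {
            consecutiveRemote++;
        } else {
            lastRemoteNode = node;
            consecutiveRemote = 1;
        }
        if (consecutiveRemote >= threshold) {
            // One remote node dominates the accesses: move the home to it.
            homeNode = node;
            consecutiveRemote = 0;
            lastRemoteNode = -1;
        }
        return true;
    }
}

final class HomeMigrationDemo {
    public static void main(String[] args) {
        HomedObject obj = new HomedObject(0, 3);
        int remoteAccesses = 0;
        for (int i = 0; i < 10; i++) {
            if (obj.access(2)) remoteAccesses++;
        }
        System.out.println("home=" + obj.home() + " remoteAccesses=" + remoteAccesses);
        // prints "home=2 remoteAccesses=3"
    }
}
```

In this run, node 2 pays for three remote accesses, after which the home moves to node 2 and the remaining seven accesses are local.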


Experimental environment

HKU Gideon 300 Linux cluster : 300 P4 PCs (2GHz, 512 MB RAM, 40 GB disk)

Network: 312-port Foundry FastIron 1500 Non-blocking switch (100 Mbits/s)

Kaffe JVM version 1.0.6; Linux kernel 2.4.18-3 (RedHat 7.3)


Migration overhead during normal execution (SPECJVM98 benchmark)

Benchmark    Time (seconds)                   Space (native code/bytecode)
             No migration   Migration         No migration   Migration
compress     11.31          11.39 (+0.71%)    6.89           7.58 (+10.01%)
jess         30.48          30.96 (+1.57%)    6.82           8.34 (+22.29%)
raytrace     24.47          24.68 (+0.86%)    7.47           8.49 (+13.65%)
db           35.49          36.69 (+3.38%)    7.01           7.63 (+8.84%)
javac        38.66          40.96 (+5.95%)    6.74           8.72 (+29.38%)
mpegaudio    28.07          29.28 (+4.31%)    7.97           8.53 (+7.03%)
mtrt         24.91          25.05 (+0.56%)    7.47           8.49 (+13.65%)
jack         37.78          37.90 (+0.32%)    6.95           8.38 (+20.58%)
Average                     (+2.21%)                         (+15.68%)
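The percentages in the table are relative overheads, (migration - no migration) / no migration. As a quick check, the sketch below recomputes the time column and its average from the values in the table; the class and method names are hypothetical.

```java
import java.util.Locale;

final class MigrationOverhead {
    /** Relative overhead of the instrumented run over the baseline, in percent. */
    static double percent(double base, double instrumented) {
        return (instrumented - base) / base * 100.0;
    }

    /** Arithmetic mean of the per-benchmark overhead percentages. */
    static double averagePercent(double[] base, double[] instrumented) {
        double sum = 0.0;
        for (int i = 0; i < base.length; i++) {
            sum += percent(base[i], instrumented[i]);
        }
        return sum / base.length;
    }

    public static void main(String[] args) {
        // Execution times in seconds, copied from the table (compress .. jack).
        double[] noMigration = {11.31, 30.48, 24.47, 35.49, 38.66, 28.07, 24.91, 37.78};
        double[] migration   = {11.39, 30.96, 24.68, 36.69, 40.96, 29.28, 25.05, 37.90};
        System.out.println(String.format(Locale.ROOT, "average time overhead: %.2f%%",
                averagePercent(noMigration, migration)));
    }
}
```

The result agrees with the reported average of +2.21%.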


Migration overhead analysis

Overall migration latency (2-10 ms)

Program (frame #)   LT(1)    CPI(1)   ASP(1)   N-Body(8)   SOR(2)
Latency (ms)        4.997    2.680    4.678    10.803      8.467

Migration time breakdown (LT program)

Frame #        1       2       4       6       8       10
Var #          4       15      37      59      81      103
Size (B)       201     417     849     1281    1713    2145
Capture (us)   202     266     410     495     605     730
Parse (us)     235     253     447     526     611     724
Create (us)    360     360     360     360     360     360
Compile (us)   478     575     847     1,169   1,451   1,720
Build (us)     7       11      14      16      21      28
Total (us)     1,282   1,465   2,078   2,566   3,048   3,562
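In the LT breakdown, each Total value equals the sum of the Capture, Parse, Create, Compile and Build values for that frame count. The sketch below reproduces that sum and derives the share of time spent in the Compile phase; the share is not reported on the slide and is computed here only as an illustration. Class and method names are hypothetical.

```java
import java.util.Locale;

final class MigrationBreakdown {
    // Values copied from the LT breakdown; all times in microseconds.
    static final int[] FRAMES  = {1, 2, 4, 6, 8, 10};
    static final int[] CAPTURE = {202, 266, 410, 495, 605, 730};
    static final int[] PARSE   = {235, 253, 447, 526, 611, 724};
    static final int[] CREATE  = {360, 360, 360, 360, 360, 360};
    static final int[] COMPILE = {478, 575, 847, 1169, 1451, 1720};
    static final int[] BUILD   = {7, 11, 14, 16, 21, 28};

    /** Sum of the five phases for column i. */
    static int total(int i) {
        return CAPTURE[i] + PARSE[i] + CREATE[i] + COMPILE[i] + BUILD[i];
    }

    /** Percentage of the total spent in the Compile phase for column i. */
    static double compileShare(int i) {
        return 100.0 * COMPILE[i] / total(i);
    }

    public static void main(String[] args) {
        for (int i = 0; i < FRAMES.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "frames=%d total=%d us compile=%.1f%%",
                    FRAMES[i], total(i), compileShare(i)));
        }
    }
}
```

With these figures, Compile is the largest phase at every frame count, and its share grows from roughly 37% with one frame to roughly 48% with ten.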


GOS Optimizations (using 4 PCs)

[Figure: stacked bar chart on a 0% to 100% scale with components Obj, Syn and Comp, for the applications ASP, SOR, Nbody and TSP, each under the configurations NO, H, HS and HSP]

NO = No optimizations
H = Home migration
HS = Home migration + Synchronized Method Shipping
HSP = HS + Object pushing


Application benchmark

[Figure: speedup (axis 0 to 10) versus number of nodes (2, 4, 8) for CPI, TSP, Raytracer and nBody, plotted against linear speedup]


JESSICA2 vs JESSICA (CPI)

[Figure: execution time in ms (axis 0 to 250,000) of CPI with 50,000,000 iterations on 2, 4 and 8 nodes, JESSICA versus JESSICA2]


Parallel Ray Tracing (using 64 nodes of Gideon 300 cluster)

Linux 2.4.18-3 kernel (Redhat 7.3)

64 nodes: 108 seconds

1 node: 4402 seconds (about 1.2 hours)

Speedup = 4402/108 ≈ 40.76
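The speedup figure is the single-node time divided by the 64-node time. The sketch below reproduces it and also derives the parallel efficiency (speedup divided by node count), which is not stated on the slide and is added here only as a worked example. The class and method names are hypothetical.

```java
import java.util.Locale;

final class RayTracingSpeedup {
    /** Speedup of the parallel run over the single-node run. */
    static double speedup(double singleNodeSeconds, double parallelSeconds) {
        return singleNodeSeconds / parallelSeconds;
    }

    /** Parallel efficiency: speedup per node, between 0 and 1 for sub-linear scaling. */
    static double efficiency(double singleNodeSeconds, double parallelSeconds, int nodes) {
        return speedup(singleNodeSeconds, parallelSeconds) / nodes;
    }

    public static void main(String[] args) {
        double t1 = 4402.0;   // 1 node, seconds (from the slide)
        double t64 = 108.0;   // 64 nodes, seconds (from the slide)
        System.out.println(String.format(Locale.ROOT, "speedup=%.2f efficiency=%.1f%%",
                speedup(t1, t64), 100.0 * efficiency(t1, t64, 64)));
    }
}
```

With the slide's timings this gives a speedup of about 40.76 and an efficiency of about 63.7% on 64 nodes.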


Demo

Execution Steps

1. Create the display panel

2. Start the ray tracing program on node 26 with 8 threads

3. Add two more nodes: 27 and 28

4. Add 5 more nodes: 29, 30, 31, 32, 33


Conclusions

Dynamic Java thread migration makes true parallel execution of Java threads possible and enables dynamic load balancing.

Runtime ("just-in-time") code instrumentation for capturing and restoring thread state is feasible.

An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.


Advantages of native code instrumentation

Lightweight
  Re-uses JIT compiler internal data structures and control flow analysis functions
  Instrumented native code is more efficient than instrumented bytecode

Transparent
  No source code modification
  No new API introduced
  No preprocessing


Future work

Advanced thread migration mechanism without overhead during normal execution

Incremental Distributed GC

Enhanced Single I/O Space to benefit more real-life applications

Parallel I/O Support


Thanks

JESSICA2 Webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html
