JIT-Compiler-Assisted Distributed Java Virtual Machine
Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau
The Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
Presented by Cho-Li Wang
TCHPC 2004, Taiwan, March 2004
Outline
  Distributed Java Virtual Machine (DJVM)
  Design tradeoffs
  Related work
  JESSICA2 DJVM
    JIT-compiler-assisted dynamic thread migration
    Global Object Space (GOS) for location-transparent object access
  Experimental results + a demo
  Conclusion & future work
Distributed Java Virtual Machine (DJVM)
A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running on a distributed environment to support true parallel execution of a multithreaded Java application.
A DJVM provides all the JVM services in compliance with the Java language specification.
A DJVM provides the illusion that the program is running on a single, more powerful machine: a Single System Image (SSI).
[Figure: a multithreaded Java program whose Java threads are spread over four JVMs; together the JVMs form the DJVM (Single System Image). Component labels: Heap, Bytecode Execution Engine, Class, Thread.]

Example multithreaded Java program:

    import java.util.*;

    // Each worker thread sums the integers 0..n-1 and prints the result.
    class worker extends Thread {
        private long n;

        public worker(long N) {
            n = N;
        }

        public void run() {
            long sum = 0;
            for (long i = 0; i < n; i++)
                sum += i;
            System.out.println("N=" + n + " Sum=" + sum);
        }
    }

    public class test {
        static final int N = 100;

        public static void main(String args[]) {
            worker[] w = new worker[N];
            Random r = new Random();
            for (int i = 0; i < N; i++)
                w[i] = new worker(r.nextLong());  // random workload per thread
            for (int i = 0; i < N; i++)
                w[i].start();
            try {
                for (int i = 0; i < N; i++)
                    w[i].join();
            } catch (Exception e) {}
        }
    }
Design Tradeoffs of a DJVM
How to manage the threads?
  Distributed thread scheduling
  Initial thread placement vs. migration
How to store the data?
  Object store: a global heap shared by threads?
  Memory consistency: Java memory model?
  Can an off-the-shelf DSM be used? Or others?
How to process the bytecode?
  Execution engine: interpretation, Just-in-Time (JIT) compilation, static compilation
  High performance?
[Figure labels: ThreadSched, ExecEngine, Heap]
Related work
cJVM (IBM Haifa Research)
  Interpreter mode execution
  Embedded OO-based DSM (Proxy)
JAVA/DSM (Rice University)
  Interpreter mode execution
  Heap built on top of a page-based DSM
JESSICA (HKU)
  Thread migration
  Interpreter mode execution
  Heap built on top of a page-based DSM
Jackal, Hyperion
  Static compilation
  Link to an object-based DSM

System            Thread distribution    Execution           Heap
cJVM              Remote creation        Interpreter         Embedded OO-based DSM (Proxy)
JAVA/DSM          Manual distribution    Interpreter         Page-based DSM
JESSICA           Transparent migration  Interpreter         Page-based DSM
Jackal, Hyperion  Remote creation        Static compilation  OO-based DSM
JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)
[Figure: a multithreaded Java program runs on six JESSICA2 JVMs (one master, five workers) in JIT compiler mode. Threads migrate between nodes as portable Java frames, and the Global Object Space provides a shared global heap spanning all cluster nodes.]
JESSICA2 Main Features
Cluster-aware bytecode execution engine (JITEE)
  JVM operated in Just-In-Time (JIT) compilation mode
  Cluster-aware: global naming scheme for threads, objects, ...
JIT-compiler-assisted dynamic thread migration
  Runtime capturing and restoring of thread execution context
  No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
  Enables dynamic load balancing
Global Object Space (GOS)
  Provides location-transparent object access for threads
  Tightly integrated with the JVM
  Memory consistency: compliant with the Java Memory Model (JMM)
  Various optimizing schemes: adaptive home migration, synchronized method shipping, object pushing
I/O redirection
JESSICA2 thread migration (In a JIT-enabled JVM)
[Figure: thread migration between a source node and a destination node, each with its own JVM method area. On the source node, the load monitor and migration manager (1) alert the thread scheduler; stack analysis and stack capturing convert the thread's frames from RTC to BTC. On the destination node, frame parsing converts the BTC frames back into an RTC and execution is restored. Steps (2) and (3) mark the later stages of this path.]
RTC: Raw Thread Context
BTC: Bytecode-oriented Thread Context = thread id + Java frames (class name, method signature, PC, operand stack pointer, local vars …)
The RTC is transformed into the BTC directly inside the JIT compiler.
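A minimal Java sketch of what a BTC could hold, following the field list above. The class and field names (`JavaFrame`, `BytecodeThreadContext`, `operandStackPtr`) and the frame ordering are hypothetical and chosen for illustration; they are not JESSICA2's actual data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one Java frame inside a BTC. Everything is expressed
// in bytecode terms, so nothing depends on native addresses or registers.
final class JavaFrame {
    final String className;
    final String methodSignature;
    final int pc;               // bytecode program counter, not a native address
    final int operandStackPtr;  // operand stack depth at capture time
    final Object[] localVars;

    JavaFrame(String className, String methodSignature, int pc,
              int operandStackPtr, Object[] localVars) {
        this.className = className;
        this.methodSignature = methodSignature;
        this.pc = pc;
        this.operandStackPtr = operandStackPtr;
        this.localVars = localVars.clone();
    }
}

// Hypothetical BTC: a thread id plus its Java frames.
final class BytecodeThreadContext {
    final long threadId;
    final List<JavaFrame> frames = new ArrayList<>();  // assumed order: outermost first

    BytecodeThreadContext(long threadId) {
        this.threadId = threadId;
    }

    BytecodeThreadContext push(JavaFrame frame) {
        frames.add(frame);
        return this;
    }

    // Human-readable dump of the captured context.
    String describe() {
        StringBuilder sb = new StringBuilder("thread " + threadId);
        for (JavaFrame f : frames) {
            sb.append(" | ").append(f.className).append("::").append(f.methodSignature)
              .append("@").append(f.pc).append(" locals=").append(f.localVars.length);
        }
        return sb.toString();
    }
}
```

Because such a context names classes, methods, and bytecode offsets rather than machine addresses, a destination JVM could in principle rebuild native frames from it after compiling the same methods locally.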
Thread Stack Transformation

Raw Thread Context (RTC)
  %esp:    0x00000000
  %esp+4:  0x082ca809
  %esp+8:  0x08225400
  %esp+12: 0x08266bc0

  %esp:    0x00000000
  %esp+4:  0x086243c
  %esp+8:  0x08623200
  %esp+12: 0x08293010
  ...
  %eax = 0x08623200
  %ebx = 0x08293010

Stack capturing: RTC -> BTC
Stack restoration: BTC -> RTC

Bytecode-oriented Thread Context (BTC)
  Frames{
  method CPI::run()V@111
  local=13;stack=0;
  var:
  arg0:CPI, 33, 0x8225400
  local1: [D; 33, 0x8266bc0@2
  local2: int, 2;
  ...

Callouts in the figure: method id, bytecode program counter, node id
%esp: stack pointer
[ : array; D: double
Thread State Capturing: Details
JIT compilation pipeline: bytecode verifier -> bytecode translation (construct control flow graph) -> intermediate code -> code generation -> native code -> linking & constant resolution
Migration points:
  (1) head of a basic block (loop)
  (2) before a method invocation
Instrumentation added by the JIT compiler:
  1. Add migration checking code (cmp mflag,0)
  2. Add object checking (local or remote object)
  3. Add type and register spilling
[Figure labels: thread stack (raw stack) with Java frames and C frames, Java frame detection, invoke, Global Object Space]
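The migration check at a loop head can be pictured with a source-level analogue. This is a sketch under stated assumptions: the slide describes the check as native code (`cmp mflag,0`) emitted by the JIT compiler, whereas the version below is written by hand in Java, and `MigrationPointDemo`, `Snapshot`, `run`, and the `alertAt` parameter are hypothetical names used only to simulate an alert arriving mid-loop.

```java
// Source-level analogue (an assumption for illustration) of the check the
// JIT compiler emits at a migration point; the real check is native code.
final class MigrationPointDemo {
    // Stand-in for the slide's mflag, set when migration is requested.
    static volatile boolean mflag = false;

    // Captured loop state: enough to resume the loop somewhere else.
    static final class Snapshot {
        final long i, sum, n;
        Snapshot(long i, long sum, long n) {
            this.i = i;
            this.sum = sum;
            this.n = n;
        }
    }

    // Runs the summing loop from the given state. If mflag is seen at the
    // loop head, returns early with i < n; otherwise returns with i == n.
    static Snapshot run(Snapshot s, long alertAt) {
        long i = s.i, sum = s.sum;
        for (; i < s.n; i++) {
            if (mflag) break;                // migration point: loop head
            sum += i;
            if (i == alertAt) mflag = true;  // demo only: simulate an alert
        }
        return new Snapshot(i, sum, s.n);
    }

    // Simulates migration: capture on an alert, then resume from the snapshot.
    static long sumWithMigration(long n, long alertAt) {
        mflag = false;
        Snapshot s = run(new Snapshot(0, 0, n), alertAt);
        while (s.i < s.n) {
            mflag = false;                   // the resumed run starts with the flag cleared
            s = run(s, -1);
        }
        return s.sum;
    }
}
```

The check costs one flag test per loop iteration when no migration is pending, which is the kind of overhead the SPECJVM98 measurements later in the talk quantify.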
Restoring: Dynamic Register Patching (on i386 Architecture)
[Figure: rebuilt thread stack showing, along the direction of stack growth, a bootstrap frame, a trampoline frame, frame 0, and frame 1, each with a saved %ebp and a return address. Small code stubs reload the registers and jump to restore points in the compiled methods.]

Small code stubs (rebuilt register context):
  frame 0:
    reg1 <- value1
    reg2 <- value2
    jmp restore_point0
  frame 1:
    reg1 <- value1
    jmp restore_point1

Compiled methods (native code):
  Method0(){...restore_point0:}
  Method1(){...restore_point1:}

Bootstrap:
  bootstrap(){ trampoline(); closing handler(); }

%ebp: i386 frame pointer
"Ret addr": return address of the current function call
Global Object Space (GOS)
Provides a global heap abstraction for the DJVM
  Home-based object coherence protocol, compliant with the Java Memory Model
  OO-based to reduce false sharing
Non-blocking communication
  Uses a threaded I/O interface inside the JVM for communication to hide the latency
Adaptive object home migration mechanism
  Takes advantage of JVM runtime information for optimization
  Optimizations: home migration, synchronized method shipping, object pushing
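A minimal sketch of adaptive home migration in a home-based protocol. The policy shown, moving an object's home after `THRESHOLD` consecutive remote writes from the same node, is an assumption made for illustration; the slides do not state the policy JESSICA2 uses, and `HomedObject` is a hypothetical class.

```java
// Hypothetical sketch of adaptive object home migration. The threshold
// policy below is an assumption for illustration, not JESSICA2's policy.
final class HomedObject {
    static final int THRESHOLD = 3;  // assumed: consecutive remote writes before the home moves

    private int homeNode;
    private int lastWriter = -1;
    private int consecutiveRemoteWrites = 0;
    private int remoteMessages = 0;  // writes that had to be sent to a remote home

    HomedObject(int homeNode) {
        this.homeNode = homeNode;
    }

    // Records a write by the given node; writes at the home need no message.
    void write(int node) {
        if (node == homeNode) {
            consecutiveRemoteWrites = 0;
            lastWriter = node;
            return;
        }
        remoteMessages++;
        consecutiveRemoteWrites = (node == lastWriter) ? consecutiveRemoteWrites + 1 : 1;
        lastWriter = node;
        if (consecutiveRemoteWrites >= THRESHOLD) {
            homeNode = node;               // migrate the home to the dominant writer
            consecutiveRemoteWrites = 0;
        }
    }

    int homeNode() {
        return homeNode;
    }

    int remoteMessages() {
        return remoteMessages;
    }
}
```

With this policy, a node that writes an object ten times in a row pays for three remote writes and then works locally; without migration it would pay for all ten.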
Experimental environment
HKU Gideon 300 Linux cluster : 300 P4 PCs (2GHz, 512 MB RAM, 40 GB disk)
Network: 312-port Foundry FastIron 1500 Non-blocking switch (100 Mbits/s)
Kaffe JVM version 1.0.6; Linux kernel 2.4.18-3 (RedHat 7.3)
Migration overhead during normal execution
(SPECJVM98 benchmark)
Benchmark   Time (seconds)                  Space (native code/bytecode)
            No migration  Migration         No migration  Migration
compress    11.31         11.39 (+0.71%)    6.89          7.58 (+10.01%)
jess        30.48         30.96 (+1.57%)    6.82          8.34 (+22.29%)
raytrace    24.47         24.68 (+0.86%)    7.47          8.49 (+13.65%)
db          35.49         36.69 (+3.38%)    7.01          7.63 (+8.84%)
javac       38.66         40.96 (+5.95%)    6.74          8.72 (+29.38%)
mpegaudio   28.07         29.28 (+4.31%)    7.97          8.53 (+7.03%)
mtrt        24.91         25.05 (+0.56%)    7.47          8.49 (+13.65%)
jack        37.78         37.90 (+0.32%)    6.95          8.38 (+20.58%)
Average                   (+2.21%)                        (+15.68%)
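The percentages in the table are consistent with relative overhead computed as (migration - no migration) / no migration, averaged over the eight benchmarks. A small sketch that recomputes the time column from the table's figures; the class and method names are hypothetical.

```java
// Recomputes the time overhead column from the table's measurements.
final class OverheadTable {
    // Relative overhead in percent: (with - without) / without * 100.
    static double overheadPercent(double without, double with) {
        return (with - without) / without * 100.0;
    }

    // Arithmetic mean of the per-benchmark overheads.
    static double averageOverhead(double[] without, double[] with) {
        double total = 0;
        for (int i = 0; i < without.length; i++) {
            total += overheadPercent(without[i], with[i]);
        }
        return total / without.length;
    }

    public static void main(String[] args) {
        // Time (seconds) from the table, in the order compress ... jack.
        double[] noMigration = {11.31, 30.48, 24.47, 35.49, 38.66, 28.07, 24.91, 37.78};
        double[] migration   = {11.39, 30.96, 24.68, 36.69, 40.96, 29.28, 25.05, 37.90};
        System.out.printf("average time overhead: %.2f%%%n",
                averageOverhead(noMigration, migration));
    }
}
```

The same two methods apply unchanged to the space columns.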
Migration overhead analysis

Overall migration latency (2-10 ms)
Program (frame #)  LT(1)  CPI(1)  ASP(1)  N-Body(8)  SOR(2)
Latency (ms)       4.997  2.680   4.678   10.803     8.467

Migration time breakdown (LT program)
Frame #       1      2      4      6      8      10
Var #         4      15     37     59     81     103
Size (B)      201    417    849    1281   1713   2145
Capture (us)  202    266    410    495    605    730
Parse (us)    235    253    447    526    611    724
Create (us)   360    360    360    360    360    360
Compile (us)  478    575    847    1,169  1,451  1,720
Build (us)    7      11     14     16     21     28
Total (us)    1,282  1,465  2,078  2,566  3,048  3,562
GOS Optimizations (using 4 PCs)
[Figure: stacked bar chart, 0% to 100%, for ASP, SOR, Nbody, and TSP under four configurations (NO, H, HS, HSP); each bar is broken down into Obj, Syn, and Comp.]
NO = No optimizations
H = Home migration
HS = Home migration + Synchronized Method Shipping
HSP = HS + Object pushing
Application benchmark
[Figure: speedup (0 to 10) versus number of nodes (2, 4, 8) for CPI, TSP, Raytracer, and nBody, plotted against linear speedup.]
JESSICA2 vs JESSICA (CPI)
[Figure: execution time in ms (0 to 250,000) of CPI with 50,000,000 iterations on 2, 4, and 8 nodes, JESSICA versus JESSICA2.]
Parallel Ray Tracing (using 64 nodes of Gideon 300 cluster)
Linux 2.4.18-3 kernel (Redhat 7.3)
64 nodes: 108 seconds
1 node: 4402 seconds (about 1.2 hours)
Speedup = 4402/108 ≈ 40.76
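The speedup figure can be reproduced from the two timings, giving roughly 40.76 when rounded to two decimals. The sketch below also derives parallel efficiency (speedup divided by node count); efficiency is not reported on the slide and is included only as a standard derived metric. The class and method names are hypothetical.

```java
// Speedup and parallel efficiency from the ray tracing timings above.
final class Speedup {
    // Speedup: single-node time divided by multi-node time.
    static double speedup(double singleNodeSeconds, double multiNodeSeconds) {
        return singleNodeSeconds / multiNodeSeconds;
    }

    // Efficiency: the fraction of ideal linear speedup that was achieved.
    static double efficiency(double singleNodeSeconds, double multiNodeSeconds, int nodes) {
        return speedup(singleNodeSeconds, multiNodeSeconds) / nodes;
    }

    public static void main(String[] args) {
        System.out.printf("speedup: %.2f%n", speedup(4402, 108));
        System.out.printf("efficiency on 64 nodes: %.1f%%%n", efficiency(4402, 108, 64) * 100);
    }
}
```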
Demo
Execution Steps
1. Create the display panel
2. Start the ray tracing program on node 26 with 8 threads
3. Add two more nodes: 27 and 28
4. Add 5 more nodes: 29, 30, 31, 32, 33
Conclusions
Dynamic Java thread migration makes true parallel execution of Java threads possible and enables dynamic load balancing.
Runtime ("just-in-time") code instrumentation for capturing and restoring thread state is feasible.
An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.
Advantages of native code instrumentation
Lightweight
  Re-uses JIT compiler internal data structures and control flow analysis functions
  Instrumented native code is more efficient than instrumented bytecode
Transparent
  No source code modification
  No new API introduced
  No preprocessing
Future work
Advanced thread migration mechanism without overhead during normal execution
Incremental Distributed GC
Enhanced Single I/O Space to benefit more real-life applications
Parallel I/O Support
Thanks
JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html