1
Grid Computing in Hong Kong
Dr. Cho-Li Wang
Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
2
Agenda
- Grid computing – a simple picture
- The Hong Kong Grid
- SRG Projects
  - SLIM, ODGPC
  - G-JavaMPI
  - JESSICA2
  - LOTS DSM for Grid
- Summary and Conclusion
3
Grid Computing : A Simple Picture
Grid computing: access to remote resources via standard protocols for cross-domain collaboration
Resources: CPU power, memory, network, storage, data, services, …
Parties: resource providers and end users
Much like “utilities” in our daily lives – electricity, water, etc.
Advantages:
- Cost-effectiveness
- Platform extensibility
- Convenience (P&P)
4
Grid Computing in Hong Kong -- The Hong Kong Grid
The experimental grid in HK
Supported under HKU Foundation Seed Grant http://www.hkgrid.org/
5
The Hong Kong Grid (HKGrid)
Goals: to construct and make available a grid test bed
to facilitate the development of grid middleware and applications by local industry and institutions in Hong Kong and their partners in the region
to demonstrate the benefits of adopting grid technologies and to showcase any outstanding results of development or application
HKGrid provides a platform for its members to experiment with various research prototypes and pilot applications
6
HKGrid - Current constituents
Institution                                Computing facilities
City University of HK                      Service gateway (2-way Xeon SMP)
HK Baptist University                      2-way Xeon SMP x 64 (#300 in TOP500, 6/2003)
HK University of Science and Technology    4-way SMP cluster
The HK Polytechnic University              Service gateway (2-way Xeon SMP)
The HK Institute of HPC                    Service gateway (2-way Xeon SMP)
HKU – Computer Centre                      2-way Xeon SMP x 128 (#240 in TOP500, 11/2003)
HKU – Department of CSIS                   Pentium 4 x 300 (#340 in TOP500, 6/2003)
Total: 4 Tflop/s theoretical maximum computing power
7
Grid Point Monitoring with Ganglia
URL: http://gideon.csis.hku.hk/status/
8
HKU Grid Point: Grid and Cluster Software
Gatekeeper: gideon.csis.hku.hk (remote job submission)
Gideon, Ostrich, Srgdell, Real
Grid middleware:
- Globus Toolkit (GT) 2.0, 2.4, 3.0.1
Job scheduling (local job scheduler):
- OpenPBS 2.3.16
- Maui 3.2.5
Programming:
- HPF, Fortran 90
- C, C++, Java with MPI
- JESSICA2 (HKU)
Communication library (IPC / network communication):
- MPICH-G2 1.2.3
9
Main Computing Facilities: HKU-CSIS Gideon 300 Cluster
10
Research Projects in HKGrid
- HKBU: Knowledge Grid (autonomous grid service composition)
- HKPU: Peer-to-peer (P2P) grid, meta scheduler, fault tolerance
- HKUST: Development of sensor Grid infrastructure
- HKU
  - ETI: Modelling of Air Quality in Hong Kong (E-Business Technology Institute with the Environmental Protection Department, HKSAR)
  - Computer Centre: HKU campus grid; scientific applications running across the ApGrid
  - CSIS: Robust Speech Recognition (J. Wu and Dr. Q. Huo)
  - CSIS: Simulation for the DNA Shuffling Experiment (W.H. Hon and Dr. T.W. Lam)
  - CSIS: Approximate String Matching on DNA Sequences (L.L. Cheng)
  - CSIS: Whole Genome Alignment via Mutation-Sensitive Sequence Similarity (H.L. Chan, N. Lu, and Dr. T.W. Lam)
  - ME: Parallel Simulation of Turbulent Flow Model (Dr. C.H. Liu, Dept. of Mechanical Engineering)
  - CSIS: HKU Grid Point (863 Project: China National Grid)
  - CSIS: Asia-Pacific Grid
  - …..
11
HKGrid – Connections
Links to China National Grid (CNGrid) and Asia-Pacific Grid (ApGrid) via CERNET and APAN
Internet2 connection to the Abilene backbone at Chicago, USA
Plays the role of a gateway for the other bigger grids
12
China National Grid (CNGrid) : 863 Project
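China National Grid Participants
- Shanghai Supercomputer Center
- Institute of Computing Technology, Chinese Academy of Sciences
- The University of Hong Kong (CSIS)
- Xi'an Jiaotong University
- University of Science and Technology of China
- National University of Defense Technology
- Institute of Applied Physics, Chinese Academy of Sciences
- Tsinghua University
Supporting software:
VEGA (织女星, the star Vega) grid management system: dynamic service deployment, single-sign-on, data replication, and performance monitoring. Developed by the Institute of Computing Technology, Chinese Academy of Sciences. V.1.0 released 8
The grid system software developed by the Institute of Computing Technology, Chinese Academy of Sciences, has connected the grid nodes of the Institute, Huazhong University of Science and Technology, and The University of Hong Kong via VEGA_GOS …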
13
ApGrid / PRAGMA Testbed
- 10 countries
- 21 organizations
- 22 clusters
- 853 CPUs
14
ApGrid Demo at the HKU School Open Day (Oct. 2003)
15
Grid Research at HKU-CSIS
SRG Projects
- SLIM + ODGPC
- G-JavaMPI
- JESSICA2
- LOTS DSM
16
Our Goal
Utility computing: to aggregate and make use of distributed computing resources transparently
- Traditional means: to utilize the dedicated HPC facilities distributed across institutions
  - Performance and reliability are key
- Pervasive means: any user can be a resource provider (e.g., idle PCs), a consumer, or both
  - Convenience and security are key
To construct an advanced grid computing platform to accommodate utility-like computing via traditional and “pervasive” means
17
Research at HKU – An Advanced Grid Computing Platform
[Diagram: AGP]
Components: G-JavaMPI, JESSICA, LOTS, ODGPC, SLIM
Objectives:
- Performance and reliability
- User’s convenience
- Convenient system administration
Research issues:
- Load balancing
- Single-system image
- Grid point construction: on-demand Grid point construction (ODGPC)
(Programming environment)
18
SLIM
Single Linux Image Management
19
SLIM
Utility computing decouples computing platforms (resources) and computing logic (applications)
I.e., a single platform can run completely different applications
Problem: different applications demand different execution environments (OS, shared libraries, supporting apps, etc.)
The hassle of managing execution environments (EEs) on the resource provider side offsets the benefits of resource sharing
SLIM is a network service for managing and constructing EEs, and disseminating them to remote computing platforms
20
SLIM – System design
How it works:
1. A node sends an EE specification across the network to find the Boot server
2. The Boot server delivers the requested Linux kernel
3. The Image server constructs an EE by collecting shared libraries, user data, etc.
4. The Linux kernel boots and contacts the Image server to “mount” the EE via a file synchronization protocol such as NFS
5. Aggressive caching techniques are deployed to optimize performance
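The boot sequence above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class names (`EESpec`, `BootServer`, `ImageServer`), the fields of the specification, and the dictionary-based cache are hypothetical and are not taken from SLIM itself, which works over the network rather than in one process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EESpec:
    """Hypothetical execution-environment (EE) specification sent by a node."""
    kernel: str
    libraries: tuple
    user: str


class BootServer:
    """Delivers the Linux kernel named in the EE specification."""

    def __init__(self, kernels):
        self.kernels = kernels  # kernel name -> kernel image (a string here)

    def get_kernel(self, spec):
        return self.kernels[spec.kernel]


class ImageServer:
    """Constructs an EE from shared libraries and user data, with a cache."""

    def __init__(self, library_store, user_data):
        self.library_store = library_store
        self.user_data = user_data
        self.cache = {}   # EESpec -> constructed EE (stand-in for caching)
        self.builds = 0   # how many EEs were constructed from scratch

    def mount(self, spec):
        if spec not in self.cache:
            self.builds += 1
            self.cache[spec] = {
                "libraries": {n: self.library_store[n] for n in spec.libraries},
                "home": self.user_data.get(spec.user, {}),
            }
        return self.cache[spec]


def boot_node(spec, boot_server, image_server):
    """Steps 1-4: send the spec, receive the kernel, then mount the EE."""
    kernel = boot_server.get_kernel(spec)
    ee = image_server.mount(spec)
    return kernel, ee


if __name__ == "__main__":
    boot = BootServer({"linux-2.4": "<kernel image>"})
    image = ImageServer({"libmpi": "<lib>", "libc": "<lib>"}, {"alice": {"job": "a.out"}})
    spec = EESpec(kernel="linux-2.4", libraries=("libc", "libmpi"), user="alice")
    for _ in range(3):  # three nodes asking for the same EE
        boot_node(spec, boot, image)
    print("EEs constructed:", image.builds)
```

Keying the cache on the whole specification means nodes that ask for an identical environment share one constructed EE, which is one plausible reading of how caching could cut construction cost.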
21
SLIM – Ongoing and future work
SLIM has been managing:
- the HKU-CSIS grid point (350 nodes) for various grid research projects
- an additional 300+ lab machines for teaching purposes (different courses have different requirements)
Future work:
- To overcome the challenges in deploying SLIM over broadband links
- Realizing “pervasive utility computing”
22
On-Demand Grid Point Construction (ODGPC)
1. Software installation at SLIM server
2. Client boots and obtains kernel
3. OS image/App disseminated
4. Process to generate certificates
[Figure: four panels, one per step, showing the SLIM server, clients, and a CA server; labels: /usr/local/gt3.2, OS image, DHCP, TFTP, certificate]
23
SLIM and ODGPC Performance Evaluation
- Boot up 100 machines (Linux + GT3): 6 minutes
- Generate certificates for 100 machines (Step 4): 30 minutes
- Total time: 6 + 30 = 36 minutes
- 256 PCs: < 5 minutes (OS only)
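A short calculation makes the cost breakdown of the 100-machine run explicit. The per-machine figures are amortized averages derived from the numbers on the slide; they are not measured per-machine latencies, since machines may well boot in parallel.

```python
MACHINES = 100
BOOT_MINUTES = 6    # boot Linux + GT3 on 100 machines (from the slide)
CERT_MINUTES = 30   # generate certificates for 100 machines, Step 4 (from the slide)

total_minutes = BOOT_MINUTES + CERT_MINUTES

# Amortized cost per machine, in seconds.
boot_seconds_per_machine = BOOT_MINUTES * 60 / MACHINES
cert_seconds_per_machine = CERT_MINUTES * 60 / MACHINES

# Fraction of the total time spent on certificate generation.
cert_share = CERT_MINUTES / total_minutes

print(f"total: {total_minutes} min")
print(f"boot: {boot_seconds_per_machine:.1f} s per machine (amortized)")
print(f"certificates: {cert_seconds_per_machine:.1f} s per machine (amortized)")
print(f"certificate share of total: {cert_share:.0%}")
```

On these numbers certificate generation accounts for five sixths of the construction time, so it is the step where optimization would pay off most.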
24
SLIM – Key references
http://www.csis.hku.hk/~cmlee/slim/
C.M. Lee, R.S.C. Ho, D.H.F. Hung, C.L. Wang, and F.C.M. Lau, “Managing Execution Environments for Utility Computing,” Network Research Workshop 2004 (with APAN 2004), March, 2004.
(LinuxPilot 2004/04)
25
G-JavaMPI
A grid-enabled Java-MPI system with dynamic load-balancing via process migration
26
G-JavaMPI
A grid-enabled implementation of Java binding of MPI, supporting efficient MPI communication among distributed Java processes
Supports transparent Java process migration (through JVMDI) within and across grid points for balancing CPU and network loads
Communication-aware process migration policies based on:
- the application’s communication pattern
- the available network bandwidth on grid overlays
27
G-JavaMPI – System design
[Figure: grid points connected over a WAN, each with a Gatekeeper and an LS; steps are numbered (1) to (3) and (1*) to (3*); M marks the migration module inside a JVM]
- Migrating: restarting a new process through a Globus remote job request with delegated user credentials and Java-MPI job credentials
- Java-MPI communication: some legacy messages are redirected during migration
- A migration module resides in each JVM
28
G-JavaMPI – Ongoing and future work
- The migration mechanism has been implemented
- Future work targets process migration policies
- Goal: to offset performance pitfalls caused by heterogeneity through dynamic process migration
- Sources of heterogeneity in grids: CPU, network, runtime environments, etc.
- CPU and network heterogeneities cause long “blocking” periods in cooperative processes, thus limiting the system throughput
- G-JavaMPI aims to detect and eliminate “blocking” through process migration (e.g., migrating a “bottleneck” process to a faster node)
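The slides describe the migration policies as future work, so no actual policy is given. The sketch below is one hypothetical way to combine the two stated inputs (communication pattern and available bandwidth) with CPU speed: it estimates each process's per-iteration time, picks the slowest as the bottleneck, and proposes the node that would shorten it most. All names, the cost model, and the `min_gain` threshold are assumptions, and the model ignores CPU contention when several processes share a node.

```python
def step_time(proc, placement, work, speed, traffic, bandwidth):
    """Estimated time of one iteration for `proc`: compute plus communication."""
    node = placement[proc]
    total = work[proc] / speed[node]
    for (a, b), volume in traffic.items():
        if proc in (a, b):
            peer = b if a == proc else a
            peer_node = placement[peer]
            if peer_node != node:  # traffic within one node is treated as free
                total += volume / bandwidth[frozenset((node, peer_node))]
    return total


def propose_migration(placement, work, speed, traffic, bandwidth, min_gain=0.1):
    """Return (process, target_node) for the bottleneck process, or None."""
    times = {p: step_time(p, placement, work, speed, traffic, bandwidth)
             for p in placement}
    bottleneck = max(times, key=times.get)
    current = times[bottleneck]
    if current == 0:
        return None
    best_node, best_time = placement[bottleneck], current
    for node in speed:
        trial = {**placement, bottleneck: node}
        t = step_time(bottleneck, trial, work, speed, traffic, bandwidth)
        if t < best_time:
            best_node, best_time = node, t
    gain = 1 - best_time / current
    if best_node != placement[bottleneck] and gain >= min_gain:
        return bottleneck, best_node
    return None


if __name__ == "__main__":
    speed = {"slow": 1.0, "fast": 4.0}             # relative CPU speeds
    bandwidth = {frozenset(("slow", "fast")): 1.0}  # data units per second
    work = {"p0": 8.0, "p1": 8.0}                   # compute units per iteration
    traffic = {("p0", "p1"): 2.0}                   # data exchanged per iteration
    placement = {"p0": "slow", "p1": "fast"}
    print(propose_migration(placement, work, speed, traffic, bandwidth))
```

The `min_gain` threshold is there because a real migration has a cost; moving a process for a marginal improvement would likely not pay off.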
29
G-JavaMPI – Key references
L. Chen, C.L. Wang, and F.C.M. Lau, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” Journal of Computer Science and Technology (China), Vol. 18, No. 4, July 2003, pp. 505-514.
L. Chen, C.L. Wang, F.C.M. Lau, and R.K.K. Ma, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” International Workshop on Grid and Cooperative Computing (GCC-2002), December 26-28, 2002, Hainan, China, pp. 640-652.
30
JESSICA2 : A Java-Enabled Single-System Image Computing Architecture
JESSICA2 is a distributed Java Virtual Machine (DJVM) which consists of a group of extended JVMs running on a distributed environment to support true parallel execution of a multithreaded Java application.
Java threads can freely move across node boundaries and execute in parallel to achieve more scalable high-performance computing using clusters
The JESSICA2 DJVM provides standard JVM services that are compliant with the Java language specification, as if running on a single machine: a Single System Image (SSI).
31
JESSICA2 Architecture
[Figure: a multithreaded Java program running on six JESSICA2 JVMs (one master, five workers), connected by thread migration and a Global Object Space; labels: JIT compiler mode, portable Java frame]
32
JESSICA2 Main Features
Transparent Java thread migration
- Runtime capturing and restoring of thread execution context
- No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
- Enables dynamic load balancing on clusters
Full-speed computation
- JITEE: cluster-aware bytecode execution engine
- Operates in Just-In-Time (JIT) compilation mode
- Zero cost if no migration
Transparent remote object access
- Global Object Space: a shared global heap spanning all cluster nodes
- Adaptive migrating home protocol for memory consistency, plus various optimizing schemes
- I/O redirection
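The slide names an adaptive migrating home protocol without giving its rules. As a rough illustration of the general idea of home migration only, the sketch below moves an object's home to a remote node once that node has written to it several times in a row. The class name, the counters, and the fixed `threshold` are assumptions and should not be read as JESSICA2's actual protocol.

```python
class SharedObject:
    """Toy model of a shared object whose home node can migrate."""

    def __init__(self, home, threshold=3):
        self.home = home              # node holding the master copy
        self.threshold = threshold    # consecutive remote writes that trigger migration
        self.last_writer = None
        self.streak = 0               # consecutive remote writes by last_writer
        self.remote_updates = 0       # writes that had to be sent to a remote home

    def write(self, node):
        if node == self.home:
            # Local write at the home: no network traffic, streak is broken.
            self.last_writer, self.streak = None, 0
            return
        self.remote_updates += 1
        if node == self.last_writer:
            self.streak += 1
        else:
            self.last_writer, self.streak = node, 1
        if self.streak >= self.threshold:
            # The same remote node keeps writing: make it the new home.
            self.home = node
            self.last_writer, self.streak = None, 0


if __name__ == "__main__":
    obj = SharedObject(home="node0")
    for _ in range(10):
        obj.write("node1")
    print("home:", obj.home, "remote updates:", obj.remote_updates)
```

With a single dominant writer, only the first few writes cross the network; after the home moves, the remaining writes are local, which is the kind of traffic saving home migration aims for.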
33
Ray Tracing on JESSICA2 (64 PCs)
Linux 2.4.18-3 kernel (Redhat 7.3)
64 nodes: 108 seconds
1 node: 4402 seconds (about 1.2 hours)
Speedup = 4402/108 ≈ 40.76
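The speedup follows directly from the two timings, and dividing by the node count gives the parallel efficiency, which the slide does not state. The function names below are illustrative; the timings are the ones reported above.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-node time to parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, nodes):
    """Speedup divided by node count (1.0 would be ideal linear scaling)."""
    return speedup(t_serial, t_parallel) / nodes


T_ONE_NODE = 4402   # seconds on 1 node (from the slide)
T_64_NODES = 108    # seconds on 64 nodes (from the slide)

s = speedup(T_ONE_NODE, T_64_NODES)
e = efficiency(T_ONE_NODE, T_64_NODES, 64)
print(f"speedup    = {s:.2f}")   # 40.76
print(f"efficiency = {e:.1%}")   # 63.7%
```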
34
JESSICA – Key references
W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers,” The 2003 International Conference on Parallel Processing (ICPP-2003), Taiwan, Oct. 6-10, 2003, pp. 465-472.
W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support,” IEEE Fourth International Conference on Cluster Computing (CLUSTER 2002), Chicago, USA, September 23-26, 2002, pp. 381-388.
M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture,” Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, pp. 1194-1222.
35
LOTS: Large Object Space on Grid
- A large software distributed shared memory (DSM) system for the Grid
- Provides a global object space larger than the process space (4 GB on a 32-bit CPU)
- Uses local hard disk to store recently unused objects
- Scope Consistency + Home Migration to reduce redundant data traffic
[Figure: five nodes, each a LOTS / OS / H/W stack, joined across the Grid by a Large Global Object Space]
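The idea of keeping recently unused objects on local disk, so that the object space can exceed what fits in memory, can be illustrated with a small single-process sketch. It is an assumption-laden toy: the class name, the least-recently-used eviction rule, and the use of `pickle` files in a temporary directory are all hypothetical, and it leaves out distribution and consistency entirely. Object ids are assumed to be safe to use in file names (for example, integers).

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingObjectSpace:
    """Toy object store that spills least recently used objects to disk."""

    def __init__(self, max_in_memory, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="objspace-")
        self.memory = OrderedDict()  # object id -> value, least recently used first
        self.on_disk = set()         # ids whose only copy is on disk

    def _path(self, oid):
        return os.path.join(self.spill_dir, f"{oid}.pkl")

    def _evict_if_needed(self):
        while len(self.memory) > self.max_in_memory:
            oid, value = self.memory.popitem(last=False)  # drop the LRU entry
            with open(self._path(oid), "wb") as f:
                pickle.dump(value, f)
            self.on_disk.add(oid)

    def put(self, oid, value):
        if oid in self.on_disk:  # discard the stale on-disk copy
            os.remove(self._path(oid))
            self.on_disk.discard(oid)
        self.memory[oid] = value
        self.memory.move_to_end(oid)
        self._evict_if_needed()

    def get(self, oid):
        if oid in self.memory:
            self.memory.move_to_end(oid)  # mark as recently used
            return self.memory[oid]
        if oid not in self.on_disk:
            raise KeyError(oid)
        with open(self._path(oid), "rb") as f:
            value = pickle.load(f)
        os.remove(self._path(oid))
        self.on_disk.discard(oid)
        self.memory[oid] = value
        self._evict_if_needed()
        return value


if __name__ == "__main__":
    space = SpillingObjectSpace(max_in_memory=2)
    for i in range(5):
        space.put(i, [i] * 3)
    print("in memory:", list(space.memory), "on disk:", sorted(space.on_disk))
    print("object 0:", space.get(0))
```

The caller sees one flat object space regardless of where an object currently resides, which mirrors the single-image view the slide describes.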
36
Summary
Performance
- G-JavaMPI and JESSICA establish extensible grid platforms (good for computation-intensive applications)
- Process/thread migration enables performance optimization and load balancing
- LOTS supports a shared memory programming environment on a large object space (easier to develop data grid applications)
Reliability
- G-JavaMPI migrates processes from failed machines
- SLIM helps construct platforms for failover
Convenience
- G-JavaMPI, JESSICA, and LOTS enable users to harness distributed resources via traditional means
- SLIM and ODGPC simplify Grid point management
37
Conclusion
Grid/utility computing are relatively new paradigms that deserve further investigation
We address the performance, reliability, and user convenience issues in grid/utility computing
Our advanced grid computing platform (consisting of G-JavaMPI, JESSICA2, LOTS, and SLIM/ODGPC) is geared for deployment in the HKGrid, for easy adoption of Grid technologies.
38
Q&A
Thank you!
The SRGers (Photo: 12/2003)
39
References
- Hong Kong Grid: http://www.hkgrid.org/
- Grid Computing Research Portal: http://grid.csis.hku.hk/
- The HKU Systems Research Group: http://www.srg.csis.hku.hk
- VEGA Project: http://vega.ict.ac.cn/
- The HK Supercomputing Directory: http://www.hkhpc.org/~SuperDir/