HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
AND THE SOFTWARE ECOSYSTEM
MANJU HEGDE, CORPORATE VP, PRODUCTS GROUP, AMD
OUTLINE
Motivation
HSA architecture v1
Software stack
Workload analysis
Software Ecosystem
PARADIGM SHIFTS

[Chart: three eras of processor performance over time]
Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. ("we are here" near the top of the curve)
Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability. ("we are here")
Heterogeneous Systems Era: modern application performance vs. time. Enabled by: abundant data parallelism, power-efficient GPUs.
Programming models across the eras: Assembly, C/C++, Java; pthreads, OpenMP / TBB, Shader, CUDA
WITNESS DISCRETE CPU AND DISCRETE GPU COMPUTE
[Diagram: CPU1 … CPUN share coherent CPU memory; the discrete GPU has its own GPU memory; the two are connected over PCIe]

Compute acceleration works well for large offload
Slow data transfer between CPU and GPU
Expert programming necessary to take advantage of the GPU compute
FIRST AND SECOND GENERATION APUs

[Diagram: CPU partition (coherent, CPU1 … CPUN) and GPU partition on the same die, connected by a high-speed internal bus]

First integration of CPU and GPU on-chip
Common physical memory, but not to the programmer
Faster transfer of data between CPU and GPU to enable more code to run on the GPU
COMMON PHYSICAL MEMORY, BUT NOT TO THE PROGRAMMER

[Diagram: the CPU and GPU still operate on separate copies of the data in CPU memory and GPU memory]

CPU explicitly copies data to GPU memory
GPU completes computation
CPU explicitly copies result back to CPU memory
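For reference, a minimal sketch of this explicit-copy pattern with the OpenCL 1.x host API (the context, queue, kernel, and host arrays are assumed to exist; error handling omitted):

// Explicit copy flow on a discrete GPU (illustrative OpenCL 1.x host code).
#include <CL/cl.h>

void run_on_gpu(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                const float *host_in, float *host_out, size_t n) {
    cl_int err;
    cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
    cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

    // 1. CPU explicitly copies input data to GPU memory.
    clEnqueueWriteBuffer(queue, in_buf, CL_TRUE, 0, n * sizeof(float), host_in, 0, NULL, NULL);

    // 2. GPU completes the computation.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    // 3. CPU explicitly copies the result back to CPU memory.
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);

    clReleaseMemObject(in_buf);
    clReleaseMemObject(out_buf);
}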
WHAT ARE THE PROBLEMS WE ARE TRYING TO SOLVE
SoCs are quickly running into the same many-CPU-core bottlenecks as the PC
To move beyond this we need to look at the right processor(s) and/or execution device for a given workload at reasonable power
While addressing the core issues of:
Easier to program
Easier to optimize
Easier to load balance
High performance
Lower power
COMBINE INTO UNIFIED PROGRAMMING MODEL
[Diagram: CPU, GPU, audio processor, video hardware, DSP, image signal processor, fixed-function accelerator, and encode/decode engines, all connected by shared memory, coherency, and user mode queues]
WHO IS DOING THIS? HSA FOUNDATION MEMBERSHIP, JUNE 2013

Membership tiers (member logos shown by tier): Founders, Promoters, Supporters, Contributors, Academic, Associates

http://www.multicorewareinc.com/index.php
http://www.apical.co.uk/
HSA FOUNDATION'S FOCUS
Identify design features to make accelerators first class processors
Attract mainstream programmers
Create a platform architecture for ALL accelerators
HSA ARCHITECTURE V1

[Diagram: CPU, GPU, audio processor, video hardware, DSP, image signal processor, fixed-function accelerator, and encode/decode engines connected by shared memory, coherency, and user mode queues]

GPU compute C++ support
User Mode Scheduling
Fully coherent memory between CPU & GPU
GPU uses pageable system memory via CPU pointers
GPU graphics pre-emption
GPU compute context switch
HSA KEY FEATURES

[Diagram: CPU and GPU caches kept coherent in hardware over a shared virtual memory space backed by physical memory]

Entire memory space: both CPU and GPU can access and allocate any location in the system's virtual memory space
Coherent memory: ensures CPU and GPU caches both see an up-to-date view of data
Pageable memory: the GPU can seamlessly access virtual memory addresses that are not present in physical memory
CPU / GPU UNIFORM MEMORY WITH HSA

CPU simply passes a pointer to the GPU
GPU completes computation
CPU can read the result directly; no copying needed!
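A minimal sketch of what this looks like to the programmer, using OpenCL 2.0 fine-grained shared virtual memory as one concrete API that exposes HSA-style uniform memory (context, queue, and kernel assumed set up; fine-grained SVM support assumed on the device):

// CPU and GPU share one allocation; no staging buffers, no explicit copies (illustrative).
#include <CL/cl.h>

void run_with_shared_memory(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n) {
    float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                      n * sizeof(float), 0);
    for (size_t i = 0; i < n; ++i)               // CPU initializes the data in place
        data[i] = (float)i;

    clSetKernelArgSVMPointer(kernel, 0, data);   // CPU simply passes a pointer to the GPU
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);                             // GPU completes the computation

    float first = data[0];                       // CPU reads the result directly; no copy back
    (void)first;
    clSVMFree(ctx, data);
}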
HSA ARCHITECTURE V1: HSA SOFTWARE STACK

[Diagram: the HSA platform (CPU, GPU, audio processor, video hardware, DSP, image signal processor, fixed-function accelerator, encode/decode engines, joined by shared memory, coherency, and user mode queues) underneath the HSA software stack]

Apps
Task Queuing Libraries
HSA Domain Libraries, OpenCL 2.x Runtime
HSA Runtime
HSA JIT
HSA Kernel Mode Driver
HETEROGENEOUS COMPUTE DISPATCH
How compute dispatch operates today in the driver model
How compute dispatch improves under HSA
TODAY'S COMMAND AND DISPATCH FLOW

[Diagram: command flow and data flow for Application A: the application fills a soft queue, the user mode driver (Direct3D) builds a command buffer, and the kernel mode driver builds a DMA buffer that is placed on the single hardware queue]
TODAY'S COMMAND AND DISPATCH FLOW

[Diagram: Applications A, B, and C each have their own soft queue, user mode driver (Direct3D), command buffer, kernel mode driver, and DMA buffer, yet all contend for the same single hardware queue]
TODAY'S COMMAND AND DISPATCH FLOW

[Diagram: work from Applications A, B, and C is serialized through the single hardware queue (A, C, B, A, B, ...), with the kernel mode driver arbitrating between the per-application command and data flows]
HSA COMMAND AND DISPATCH FLOW

[Diagram: Applications A, B, and C each dispatch work directly into their own hardware queue (with an optional dispatch buffer) on the GPU hardware]

No APIs in the dispatch path
No soft queues
No user mode driver
No kernel mode transitions
No overhead
Application dispatches directly to hardware
User mode queuing
Hardware scheduling
Low dispatch times
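As an illustration of user mode queuing, a heavily simplified sketch loosely based on the HSA runtime API; agent discovery, code-object loading, kernarg allocation, and hsa_init() are assumed to have happened elsewhere, and the grid/workgroup sizes are arbitrary:

// Illustrative sketch: dispatch a kernel by writing a packet into a user mode queue
// and ringing a doorbell; no driver call sits on this path.
#include <hsa.h>

void dispatch(hsa_agent_t agent, uint64_t kernel_object, void *kernarg) {
    hsa_queue_t *queue;
    hsa_queue_create(agent, 4096, HSA_QUEUE_TYPE_MULTI, NULL, NULL,
                     UINT32_MAX, UINT32_MAX, &queue);

    // Reserve a packet slot: just an atomic bump of the write index, no system call.
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)queue->base_address + (index % queue->size);

    pkt->setup            = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  // 1-D grid
    pkt->workgroup_size_x = 256;
    pkt->workgroup_size_y = pkt->workgroup_size_z = 1;
    pkt->grid_size_x      = 1024 * 1024;
    pkt->grid_size_y      = pkt->grid_size_z = 1;
    pkt->private_segment_size = 0;
    pkt->group_segment_size   = 0;
    pkt->kernel_object    = kernel_object;
    pkt->kernarg_address  = kernarg;
    hsa_signal_create(1, 0, NULL, &pkt->completion_signal);

    // Publish the packet type last, then ring the doorbell; the GPU picks the work up directly.
    __atomic_store_n(&pkt->header,
                     (uint16_t)(HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE),
                     __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, index);
}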
COMMAND AND DISPATCH: CPU AND GPU

[Diagram: the application/runtime dispatches work to CPU1, CPU2, and the GPU through the same user mode queuing mechanism]
MAKING GPUS AND APUS EASIER TO PROGRAM: TASK QUEUING RUNTIMES

Popular pattern for task and data parallel programming on SMP systems today
Characterized by:
A work queue per core
Runtime library that divides large loops into tasks and distributes them to queues
A work stealing runtime that keeps the system balanced
HSA is designed to extend this pattern to run on heterogeneous systems (see the sketch below)
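A compact CPU-side sketch of the pattern just described: one queue per worker plus a steal path when a worker's own queue runs dry (plain C++ threads; Worker, Task, and the chunk count are illustrative names, not part of any HSA library):

// Minimal per-core task queues with work stealing (illustrative only).
#include <algorithm>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> q;   // this worker's own queue
    std::mutex m;
};

int main() {
    const unsigned n = std::max(2u, std::thread::hardware_concurrency());
    std::vector<Worker> workers(n);

    // The runtime divides a large loop into tasks and distributes them across the queues.
    for (int t = 0; t < 1000; ++t)
        workers[t % n].q.push_back([t] { (void)t; /* process chunk t */ });

    std::vector<std::thread> threads;
    for (unsigned id = 0; id < n; ++id) {
        threads.emplace_back([&, id] {
            for (;;) {
                Task task;
                {   // 1. Take work from our own queue first.
                    std::lock_guard<std::mutex> lk(workers[id].m);
                    if (!workers[id].q.empty()) {
                        task = std::move(workers[id].q.front());
                        workers[id].q.pop_front();
                    }
                }
                if (!task) {  // 2. Otherwise steal from the back of another queue.
                    for (unsigned v = 0; v < n && !task; ++v) {
                        if (v == id) continue;
                        std::lock_guard<std::mutex> lk(workers[v].m);
                        if (!workers[v].q.empty()) {
                            task = std::move(workers[v].q.back());
                            workers[v].q.pop_back();
                        }
                    }
                }
                if (!task) return;  // nothing left anywhere (fine for this static task set)
                task();             // 3. Run the task; stealing keeps the system balanced.
            }
        });
    }
    for (auto &th : threads) th.join();
}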
TASK QUEUING RUNTIME ON CPUs

[Diagram: four x86 CPU cores, each with its own CPU worker and queue, kept balanced by a work stealing runtime; only CPU threads and memory are involved]
TASK QUEUING RUNTIME ON THE HSA PLATFORM

[Diagram: the same work stealing runtime now also feeds the GPU: four x86 CPU cores with CPU workers and queues, plus a GPU manager that fetches and dispatches work to the GPU's SIMD units; CPU threads, GPU threads, and memory all participate]
HSA SOFTWARE STACK

Today's driver stack:
Apps
Domain Libraries
OpenCL 1.x, DX Runtimes, User Mode Drivers
Graphics Kernel Mode Driver
Hardware - APUs, CPUs, GPUs

HSA software stack:
Apps
Task Queuing Libraries
HSA Domain Libraries, OpenCL 2.x Runtime
HSA Runtime
HSA JIT
HSA Kernel Mode Driver
Hardware - APUs, CPUs, GPUs

Legend: user mode components, kernel mode components, components contributed by third parties
HSA INTERMEDIATE LANGUAGE - HSAIL
HSAIL is the intermediate language for parallel compute in HSA
Generated by a high level compiler (LLVM, gcc, Java VM, etc.)
Compiled down to GPU ISA or other parallel processor ISA by an IHV finalizer
The finalizer may execute at run time, install time, or build time, depending on platform type
HSAIL is a low level instruction set designed for parallel compute in a shared virtual memory environment. HSAIL is SIMT in form and does not dictate hardware microarchitecture
HSAIL is designed for fast compile time, moving most optimizations to the high-level compiler
HSAIL is at the same level as PTX: an intermediate assembly or virtual machine target
Represented as bit-code in a BRIG file format, with support for late binding of libraries
HSA BRINGS A MODERN OPEN COMPILATION FOUNDATION

This brings about a fully competitive, rich, complete compilation stack architecture and the creation of a broader set of GPU computing tools, languages, and libraries
HSAIL supports LLVM and other compilers (GCC, Java VM)

CUDA path: EDG or CLANG -> NVVM IR -> LLVM -> PTX -> hardware
OpenCL path: EDG or CLANG -> SPIR -> LLVM -> HSAIL -> hardware
OPENCL AND HSA
HSA is an optimized platform architecture for OpenCL
Not an alternative to OpenCL
Focused on the hardware platform more than the API
Ready to support many more languages than C/C++
OpenCL on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
HSA also exposes a lower level programming interface
Optimized libraries may choose the lower level interface
AMD'S OPEN SOURCE COMMITMENT TO HSA

We will open source our Linux execution and compilation stack
Jump start the ecosystem
Allow a single shared implementation where appropriate
Enable university research in all areas

Component Name        AMD Specific   Rationale
HSA Bolt Library      No             Enable understanding and debug
HSAIL Code Generator  No             Enable research
LLVM Contributions    No             Industry and academic collaboration
HSA Assembler         No             Enable understanding and debug
HSA Runtime           No             Standardize on a single runtime
HSA Finalizer         Yes            Enable research and debug
HSA Kernel Driver     Yes            For inclusion in Linux distros
WORKLOAD ANALYSIS
HAAR FACE DETECTION: CORNERSTONE TECHNOLOGY FOR COMPUTER VISION
LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD calculations:
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 million
LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME

More HD calculations:
70% scaling in H and V
Total pixels = 4.07 million
Search squares = 3.8 million
HAAR CASCADE STAGES

[Diagram: within a stage, features (k, l, m, p, q, r) are evaluated in sequence; a feature that passes ("yes") leads to the next feature, and a stage that passes leads to the next stage (Stage N, Stage N+1)]
22 CASCADE STAGES, EARLY OUT BETWEEN EACH

Stage 1 -> Stage 2 -> ... -> Stage 21 -> Stage 22, with a NO FACE early exit after every stage

Final HD calculations:
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs

Calculation rate:
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second

...and this only gets front-facing faces
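The arithmetic behind these figures, written out as a small self-checking sketch (values are the averages quoted above):

// Back-of-the-envelope compute budget for HAAR face detection on a scaled 1080p frame.
#include <cstdio>

int main() {
    const double search_squares    = 3.8e6;   // after 70% H/V pyramid scaling
    const double features_per_sq   = 124;     // average features evaluated per square
    const double calcs_per_feature = 100;

    const double calcs_per_frame = search_squares * features_per_sq * calcs_per_feature;
    std::printf("Calculations per frame: %.1f GCalcs\n", calcs_per_frame / 1e9);   // ~47 GCalcs
    std::printf("At 30 fps: %.1f TCalcs/s\n", calcs_per_frame * 30 / 1e12);        // ~1.4 TCalcs/s
    std::printf("At 60 fps: %.1f TCalcs/s\n", calcs_per_frame * 60 / 1e12);        // ~2.8 TCalcs/s
    return 0;
}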
CASCADE DEPTH ANALYSIS

[Chart: cascade depth reached across the frame, bucketed into ranges 0-5, 5-10, 10-15, 15-20, and 20-25]
PROCESSING TIME/STAGE

[Chart: processing time (ms) per cascade stage (stages 1 through 8, and 9-22) for CPU vs. GPU on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz)]

AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
PERFORMANCE: CPU VS. GPU

[Chart: images/sec as a function of the number of cascade stages run on the GPU (0 through 8, and 22), comparing CPU, HSA, and GPU on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz)]

AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU

By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload

+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
GAMEPLAY RIGID BODY PHYSICS
RIGID BODY PHYSICS SIMULATION
Rigid-body physics simulation is:
a way to animate and interact with objects, widely used in games and movie production
used to drive game play and for visual effects (eye candy)

Physics simulation is used in much of today's software:
Middleware physics engines such as Bullet, Havok, PhysX
Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3
3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave
Industrial applications such as Siemens NX8 Mechatronics Concept Design
Medical applications such as surgery trainers
Robotics simulation

But GPU-accelerated rigid-body physics is not used in game play, only in effects
RIGID BODY PHYSICS - ALGORITHM
Pipeline stages: broad-phase collision detection -> mid-phase collision detection -> narrow-phase collision detection / compute contact points -> setup constraints -> solve constraints

Find potentially interacting object pairs using bounding shape approximations (broad phase; see the sweep-and-prune sketch below)
Perform full overlap testing between potentially interacting pairs
Compute exact contact information for various shape types
Compute constraint forces for natural motion and stable stacking

[Diagram: objects A-D and their bounding shapes flowing through the pipeline stages]
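To illustrate the broad phase referenced above, a minimal sweep-and-prune sketch over axis-aligned bounding boxes; this is a generic textbook formulation, not code from Bullet or any other engine named earlier:

// Broad phase via sweep-and-prune on the x axis: sort AABBs by min x, then sweep and
// emit candidate pairs whose intervals overlap (y/z checked before emitting the pair).
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct AABB { float min[3], max[3]; int id; };

std::vector<std::pair<int, int>> broad_phase(std::vector<AABB> boxes) {
    std::sort(boxes.begin(), boxes.end(),
              [](const AABB &a, const AABB &b) { return a.min[0] < b.min[0]; });

    std::vector<std::pair<int, int>> pairs;
    for (size_t i = 0; i < boxes.size(); ++i) {
        for (size_t j = i + 1; j < boxes.size(); ++j) {
            if (boxes[j].min[0] > boxes[i].max[0]) break;  // sorted on x: no later box overlaps
            bool overlapYZ =
                boxes[i].min[1] <= boxes[j].max[1] && boxes[j].min[1] <= boxes[i].max[1] &&
                boxes[i].min[2] <= boxes[j].max[2] && boxes[j].min[2] <= boxes[i].max[2];
            if (overlapYZ) pairs.emplace_back(boxes[i].id, boxes[j].id);  // to mid/narrow phase
        }
    }
    return pairs;
}

int main() {
    std::vector<AABB> boxes = {{{0, 0, 0}, {1, 1, 1}, 0},
                               {{0.5f, 0.5f, 0}, {1.5f, 1.5f, 1}, 1},
                               {{3, 3, 3}, {4, 4, 4}, 2}};
    for (auto &p : broad_phase(boxes))
        std::printf("potential pair: %d %d\n", p.first, p.second);
}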
RIGID BODY PHYSICS - CHALLENGES & SOLUTIONS
Implementation challenges:
Simulation is a pipeline of many different algorithms, some of which are more suitable for the CPU while others are more suitable for the GPU
Many CPU optimizations (e.g. early outs) aren't efficient on GPUs, requiring the use of more brute-force but GPU-friendly algorithms; the diversity of intersection algorithms causes load balancing challenges
Varying object sizes require more complex and difficult-to-parallelize broad-phase algorithms; sweep-and-prune uses incremental sorting and traversal of lists
Narrow-phase algorithms (such as SAT or GJK) cause thread divergence

Benefits of HSA:
Avoidance of the data copy to/from GPU and of the overhead of maintaining two copies of simulation state (SMA, COH)
Usage of early-out optimizations and more efficient load balancing (ENQ)
More efficient serial aspects of the broad phase can run on the CPU (SMA, COH)
Improved handling of thread divergence (ENQ)

EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch
GESTURE RECOGNITION
An emerging natural way of interacting with a computer
Compute intensive, where the computational complexity depends on the number and complexity of recognized gestures
Strongly benefits from availability of depth information
Browsing (previous/next, scroll), media players (next/previous song/video/image, pause/start), collaboration tools such as slideshows, gaming (finger/hand as the controller), immersive environments, virtual reality
Today's systems are tuned to today's HW, lacking in robustness and usability, which can only be achieved by use of special-purpose HW. They do not do well for:
A wide variety of useful gestures (one or two hand, multiple finger, arm or full body)
Motion dependent gestures (e.g. finger pinch), which require correlating information from multiple frames
Adaptability to variable lighting conditions
Larger region/distance of input, enabled by processing higher resolution video
ALGORITHM PIPELINE
Image processing:
adaptive light normalization
edge and corner detection
erode/dilate/threshold filter, to produce a feature image
Depth analysis (for fg/bg segmentation, if using stereo cameras)
Sparse approach: correlate salient points in the feature image, and validate via local histogram matching in the original image
Connected components analysis, for hand identification (based on level sets)
The GPU can recognize local connectivity with a parallel scan; the CPU can apply transitivity of labels (the neighbor of your neighbor is your neighbor) (see the sketch below)
Feature vector (local histogram) extraction:
Global: HOG on tiles; or
Contextual: SURF/SIFT keypoints
Find the best match of the histogram with the training set (support vector machine), optionally update the training set
Update temporal model state machine
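A small serial C++ analogue of the CPU/GPU split described above: a local pass that merges each pixel with its immediate neighbours (the connectivity step the slide assigns to the GPU), followed by a transitive flattening pass (the part kept on the CPU). It uses a plain union-find rather than the parallel scan a real GPU kernel would use:

// Two-phase connected-component labelling of a binary feature image (illustrative, serial C++).
#include <cstdio>
#include <vector>

// Union-find with path halving; used to merge labels of touching pixels.
static int find(std::vector<int> &parent, int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
}

int main() {
    const int W = 6, H = 4;
    const int img[H * W] = {1,1,0,0,1,1,
                            0,1,0,0,1,0,
                            0,0,0,0,1,0,
                            1,0,0,0,0,0};          // two blobs plus an isolated pixel

    std::vector<int> label(W * H), parent(W * H);
    for (int i = 0; i < W * H; ++i) { label[i] = i; parent[i] = i; }

    // Pass 1: purely local neighbour checks; each foreground pixel records a merge
    // with its left and upper neighbours.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int i = y * W + x;
            if (!img[i]) { label[i] = -1; continue; }
            if (x > 0 && img[i - 1]) parent[find(parent, i)] = find(parent, i - 1);
            if (y > 0 && img[i - W]) parent[find(parent, i)] = find(parent, i - W);
        }

    // Pass 2: apply transitivity ("the neighbour of your neighbour is your neighbour"),
    // flattening every pixel's label to its component's root.
    for (int i = 0; i < W * H; ++i)
        if (label[i] >= 0) label[i] = find(parent, i);

    for (int y = 0; y < H; ++y, std::printf("\n"))
        for (int x = 0; x < W; ++x) std::printf("%3d", label[y * W + x]);
    return 0;
}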
GESTURE RECOGNITION CHALLENGES AND SOLUTIONS
Implementation challenges:
Transfer of raw image data from CPU to GPU adds latency
Feature matching and depth reconstruction is a divergent workload, as images are sparsely populated by keypoints, which require extensive processing
Connected component analysis on GPU uses a parallel scan, of which the last stages of reduction are more efficiently performed on the CPU
High overhead of the per-frame updates to the GPU copy of the feature database, for unsupervised learning algorithms (e.g. Oja's rule)

Benefits of HSA:
Avoidance of the latency of duplicating data in GPU memory (SMA)
Higher GPU utilization is achieved via wavefront reshaping (ENQ)
Reduction is most optimally implemented by using both CPU and GPU (COH, SMA)
The CPU can update the database while the GPU is accessing it (SMA, COH)

EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch
RAY TRACING
Photo-realistic visualization method that is widely used in movie production and high-fidelity visual effects
Used in many of today's photorealistic rendering packages:
Maxwell Render (photorealistic high-end renderer)
Nvidia's OptiX (Nvidia GPU ray tracing renderer)
POV-Ray (popular CPU-only ray tracer)
Luxmark (popular ray tracing benchmark)
Rendering method that is friendly to parallelism, but not trivially ported to parallel architectures, due to the complexity of an efficient implementation
However, it is not used in interactive applications due to performance limitations
RAY TRACING - CHALLENGES & SOLUTIONS
Implementation challenges:
Scene database and acceleration data structure can be huge
E.g. a power plant scene (shown left) contains 12.7M polygons, has a size of 500 MBytes, and an acceleration data structure of 250MB-1.5GB (depending on renderer)
Today's GPUs have problems fitting them into video memory
Acceleration data structure has to be built and updated using the CPU and transferred to video memory
8 ms to transfer the above data structure (250MB) to the GPU

Benefits of HSA:
GPU compute units can access the scene and acceleration data structure from main memory (SMA, PM)
Avoidance of the acceleration data structure copy to GPU memory (SMA)

EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch
RAY TRACING - CHALLENGES & SOLUTIONS
Implementation challenges:
Dynamic scenes are impractical with current GPU compute implementations
Data structure build time is too long for interactive frame rates
Simple data structures can be built fast, but are difficult to traverse
Faster traversal requires complex structures that take a long time to compute and are difficult to transfer to the GPU
Ray divergence, caused by child rays hitting different object types with different shading models (both GPUs & APUs like regular operations), results in lower utilization of CUs
The number of rays can be immense (in the billions), and the ray intersection process is compute intensive
Power plant scene at 1080p: conservative estimate of 2 billion rays

Benefits of HSA:
CPU updates to the scene are transparently and immediately available (without any transfer penalty) to the GPU (SMA, PM)
Casting of child rays with no CPU-GPU round trip (ENQ)
Wavefront reshaping can improve CU utilization (ENQ)

EMS: Entire Memory Space; PGM: Pageable Memory; COH: Bidirectional Coherency; SMA: System Memory Access; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch
ACCELERATING MEMCACHED: CLOUD SERVER WORKLOAD
MEMCACHED
A Distributed Memory Object Caching System Used in Cloud Servers
Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses
Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others
Effectively a large distributed hash table
Responds to store and get requests received over the network
Conceptually:
store(key, object)
object = get(key)
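To make that conceptual interface concrete, a toy, single-node sketch (plain C++; real memcached has its own slab allocator, LRU eviction, and network protocol). The key lookups in the batch loop are independent of one another, which is what makes the lookup stage a candidate for the GPU offload on the next slide:

// Toy key-value store with the memcached-style interface: store(key, object), get(key).
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

class KVStore {
    std::unordered_map<std::string, std::string> table;  // stands in for the distributed hash table
public:
    void store(const std::string &key, const std::string &object) { table[key] = object; }
    const std::string *get(const std::string &key) const {
        auto it = table.find(key);
        return it == table.end() ? nullptr : &it->second;
    }
};

int main() {
    KVStore cache;
    cache.store("user:42", "{\"name\":\"alice\"}");

    // A batch of GET requests arriving off the network; each lookup is independent,
    // so the hash-and-probe work per request is naturally data parallel.
    std::vector<std::string> batch = {"user:42", "user:7", "user:42"};
    for (const auto &key : batch) {
        const std::string *obj = cache.get(key);
        std::printf("get(%s) -> %s\n", key.c_str(), obj ? obj->c_str() : "MISS");
    }
}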
OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU
[Chart: key look-up performance and execution breakdown (data transfer vs. execution) for a multithreaded CPU, a Radeon HD 5870, a Trinity A10-5800K, and a Zacate E-series APU]

T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209
ACCELERATING JAVA: GOING BEYOND NATIVE LANGUAGES
GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS
Existing Java GPU (OpenCL/CUDA) bindings require coding a kernel in a domain-specific language:

// JOCL/OpenCL kernel code
__kernel void squares(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid] * in[gid];
}
Along with the Java host code to:
Initialize the data
Select/Initialize execution device
Allocate or define memory buffers for args/parameters
Compile 'Kernel' for a selected device
Enqueue/Send arg buffers to device
Execute the kernel
Read results buffers back from the device
Cleanup (remove buffers/queues/device handles)
Use the results
import static org.jocl.CL.*;
import org.jocl.*;

public class Sample {
    public static void main(String[] args) {
        // Create input- and output data
        int size = ...
        float inArr[] = new float[size];
        float outArray[] = new float[size];
        for (int i = 0; i