
A Framework for Dynamically Instrumenting GPU Compute Applications within GPU Ocelot

Naila Farooqui (1), Andrew Kerr (2), Gregory Diamos (3), S. Yalamanchili (4), and K. Schwan (5)

College of Computing (1,5), School of Electrical and Computer Engineering (2,3,4)

Georgia Institute of Technology, Atlanta, GA

{naila, schwan}@cc.gatech.edu, {arkerr, gregory.diamos, sudha}@gatech.edu

ABSTRACT

In this paper we present the design and implementation of a dynamic instrumentation infrastructure for PTX programs that procedurally transforms kernels and manages related data structures. We show how performing instrumentation within the GPU Ocelot dynamic compiler infrastructure provides unique capabilities not available to other profiling and instrumentation toolchains for GPU computing. We demonstrate the utility of this instrumentation capability with three example scenarios: (1) performing workload characterization accelerated by a GPU, (2) providing load imbalance information for use by a resource allocator, and (3) providing compute utilization feedback to be used online by a simulated process scheduler that might be found in a hypervisor. Additionally, we measure both (1) the compilation overheads of performing dynamic compilation and (2) the increases in runtimes when executing instrumented kernels. On average, compilation overheads due to instrumentation consisted of 69% of the time needed to parse a kernel module in the case of the Parboil benchmark suite. Slowdowns for instrumenting each basic block ranged from 1.5x to 5.5x, with the largest slowdowns attributed to kernels with large numbers of short, compute-bound blocks.

Categories and Subject Descriptors

C.1.3 [Processor Architectures]: Heterogeneous (hybrid) systems; D.3.4 [Programming Languages]: Retargetable compilers; D.3.4 [Programming Languages]: Run-time environments; D.4.8 [Operating Systems]: Measurements

General Terms

GPU Computing, Instrumentation, Dynamic Binary Compilation

Keywords

CUDA, OpenCL, Ocelot, GPGPU, PTX, Parboil, Rodinia

1. INTRODUCTION

Dynamic binary instrumentation is a technique in which application binaries are modified by an instrumentation tool to insert additional procedures into the existing execution path. Such instrumentation provides access to the run-time state of the application and enables sophisticated actions to take place. For example, code can be inserted for correctness checks such as memory bounds checking, or to improve performance, such as the insertion of prefetch instructions. Instrumentation takes place at the instruction level and offers inspection opportunities not exposed at a higher level, such as source-level assertions. Dynamic compilation frameworks offer the additional capability of adding and removing instrumentation at runtime to avoid performance costs unless instrumentation is needed.

Toolchains such as OpenCL [1] and CUDA [2] have tremendously enhanced the state of the art for developing high-performance applications targeting GPU architectures. Both platforms provide a C-like language for expressing data-parallel kernels and an API for launching them on GPU accelerators as well as for managing associated resources such as textures and device memory. However, GPUs are typically designed with vendor-specific instruction set architectures, and the toolchains do not facilitate direct inspection and modification of native application binaries. Consequently, developers of GPU compute applications are deprived of much of the flexibility and power that dynamic binary instrumentation has brought to traditional architectures.

In this paper, we discuss techniques for dynamically instrumenting CUDA kernels at the PTX level [3] and present enhancements to a toolchain, GPU Ocelot, for transparently instrumenting unmodified CUDA applications. Specifically, we discuss the use of GPU Ocelot's [4] pass manager for applying PTX-to-PTX transformations to loaded modules before they are dispatched to the GPU driver or other backends. We define an instrumentation model and expose APIs that enable the construction of user-defined, custom instrumentation tools. Additionally, we present a framework within Ocelot's implementation of the CUDA Runtime API for managing data structures associated with instrumentation. Our tool inserts instrumentation via a transformation pipeline that exists upstream of each of Ocelot's supported processor backends. Consequently, procedural instrumentation may be inserted and utilized transparently for each of the Ocelot backends: NVIDIA GPUs, x86 multicore CPUs, PTX emulation, and (under development) AMD Radeon GPUs. The details of translation to multicore x86 are described in [5], and translation to AMD GPUs is described by Dominguez et al. in [6]. We also anticipate the possibility of an OpenCL API front end to Ocelot that would extend the reach of the toolchain described in this paper to OpenCL applications.
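As a concrete preview of the workflow detailed in Section 3, the sketch below shows how a host application might attach a custom instrumentor around a kernel launch. It is a minimal sketch under assumptions: the header path and the ocelot::addInstrumentor/removeInstrumentor entry points are illustrative stand-ins, not the verbatim Ocelot API.

    // Hypothetical sketch: attaching an instrumentor around one launch.
    // The header path and the add/remove entry points are assumptions.
    #include <ocelot/api/interface/ocelot.h>   // assumed header
    #include "BasicBlockInstrumentor.h"        // user-defined tool (Section 3.3)

    __global__ void myKernel() { /* ... */ }

    int main() {
        BasicBlockInstrumentor instrumentor;       // derived from PTXInstrumentor
        ocelot::addInstrumentor(instrumentor);     // assumed registration call

        myKernel<<<64, 256>>>();                   // this launch is instrumented
        cudaDeviceSynchronize();

        ocelot::removeInstrumentor(instrumentor);  // later launches run clean
        return 0;
    }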


We demonstrate Ocelot's instrumentation capability with several example use cases revealing application behaviors that would take up to 1000x longer to capture if run on Ocelot's PTX emulator. To the best of our knowledge, this is the first implementation of a dynamic transformation and instrumentation tool for PTX.

This paper provides a background of GPU computing and of GPU Ocelot in Section 2. In Section 3, we discuss the design of GPU Ocelot's PTX-to-PTX pass manager, facilities for modifying kernels at the PTX level, and the hooks into Ocelot's CUDA Runtime API implementation. The PTX-to-PTX pass manager is not new to GPU Ocelot, but integrating it as part of a transformation pipeline upstream of each device backend is a new capability. Enhancements to Ocelot's API for managing instrumentation procedures and extracting results are also a novel capability not discussed in previous works. In Section 4, we describe several metrics and gather results from applications from the CUDA SDK and the Parboil Benchmark Suite. We show how a process scheduler might use this information to enforce a fairness target, and we characterize the overheads of dynamic compilation and instrumentation.

2. GPU COMPUTING

NVIDIA's CUDA [2] toolchain is a programming language and API that enables data-parallel kernels to be written in a language with C++-like semantics. Computations are performed by a tiered hierarchy of threads. At the lowest level, collections of threads are mapped to a single streaming multiprocessor (SM) and executed concurrently. Each SM includes an L1 data cache, a shared scratch-pad memory for exchanging data between threads, and a SIMD array of functional units. This collection of threads is known as a cooperative thread array (CTA), and kernels are typically launched with tens or hundreds of CTAs, which are oversubscribed to the set of available SMs. A work scheduler on the GPU maps CTAs onto individual SMs for execution, and the programming model forbids global synchronization between SMs except at kernel boundaries.
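To make the thread hierarchy concrete, the minimal CUDA fragment below launches a grid of CTAs; the hardware work scheduler distributes the CTAs across SMs, and barriers synchronize threads only within a CTA. This is an illustrative example, not code from the paper.

    #include <cuda_runtime.h>

    __global__ void scale(float* data, float alpha) {
        // Each CTA (thread block) covers a contiguous tile of the data.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] *= alpha;
        __syncthreads();   // synchronizes threads of this CTA only
    }

    int main() {
        const int n = 1 << 20;
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        // 4096 CTAs of 256 threads: far more CTAs than SMs, so the hardware
        // work scheduler oversubscribes CTAs to the available SMs.
        scale<<<n / 256, 256>>>(d, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }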

2.1 GPU Ocelot

GPU Ocelot [4] is a dynamic compilation and binary translation infrastructure for CUDA that implements the CUDA Runtime API and executes PTX kernels on several types of backend execution targets. Ocelot includes a functional simulator for offline workload characterization, profiling, and correctness checking. A translator from PTX to LLVM provides efficient execution of PTX kernels on multicore CPU devices with the addition of a runtime execution manager. To support Ocelot's NVIDIA GPU device, PTX kernels are emitted and invoked via the CUDA Driver API. Ocelot has the unique capability of inspecting the state of the application as it is running, transforming PTX kernels before they are executed natively on available GPU devices, and managing additional resources and data structures needed to support instrumentation.

Figure 1 illustrates Ocelot's relationship with CUDA applications. Ocelot replaces the CUDA Runtime API library that CUDA applications link with. CUDA API calls pass through Ocelot, which provides a layer of compilation support, resource management, and execution. Ocelot may modify CUDA kernels as they are registered and launched, as well as insert additional state and functionality into the host application. Consequently, it is uniquely positioned to transparently instrument applications and respond to data-dependent application behaviors, which would not be possible with static transformation techniques. For example, Ocelot may implement random sampling by inserting instrumentation into a kernel as an application is running, profiling for a brief period, then re-issuing the original kernel without instrumentation. A more sophisticated approach might use the results of profiling to perform optimizations of the kernel, although the application of profile-directed optimization is beyond the scope of this work.

2.2 PTX Instruction Set and Internal Representation

Parallel Thread eXecution, or PTX, is the RISC-like virtual instruction set targeted by NVIDIA's CUDA and OpenCL compilers and used as an intermediate representation for GPU kernels. PTX consists of standard instructions for integer and floating-point arithmetic, load and store instructions to explicitly denoted address spaces, texture sampling and graphics-related instructions, and branch instructions. Additionally, special instructions for interacting with other threads within the CTA are provided, such as CTA-wide barrier instructions, warp-wide vote instructions, and reduction operations, to name several. Implementing the so-called Single-Instruction, Multiple-Thread (SIMT) execution model, PTX specifies the execution of a scalar thread, and the hardware executes many threads concurrently. PTX is decoupled from actual hardware instantiations and includes an abstract Application Binary Interface (ABI) for calling functions and managing a local parameter memory space while leaving the actual calling convention semantics to the native ISA implementation.

CUDA sources compiled by nvcc, NVIDIA's CUDA compiler, become C++ programs with calls that register PTX kernels with the active CUDA context via the CUDA Runtime API. The kernels themselves are stored as string literals within the program, and kernel names are bound to objects the host program may then refer to when configuring and launching kernel grids. Ocelot implements the CUDA Runtime API's __cudaRegisterFatBinary() and parses PTX kernels into an internal representation. Control-flow analysis partitions instructions into basic blocks connected by edges indicating control dependencies. An optional data-flow analysis pass transforms the kernel into static single-assignment form [7], which enables register re-allocation. Ocelot's PTX internal representation (PTX IR) covers the entire PTX 1.4 ISA specification (Ocelot 2.0.969 supports PTX 2.1 for Fermi and has become the main trunk as of January 31, 2011) and provides a convenient programming interface for analyzing kernels and adding new instructions.
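Because the PTX IR exposes each kernel as a control-flow graph of basic blocks, analyses can be written as plain traversals. The fragment below conveys the flavor of such a traversal; the type and accessor names (PTXKernel, cfg(), instructions) are assumptions standing in for Ocelot's actual IR interfaces.

    // Hypothetical IR traversal; type and member names are assumptions.
    unsigned countMemoryOps(PTXKernel& kernel) {
        unsigned loadsAndStores = 0;
        for (BasicBlock& block : kernel.cfg().blocks()) {      // assumed iterator
            for (PTXInstruction& inst : block.instructions) {  // assumed member
                if (inst.opcode == PTXInstruction::Ld ||
                    inst.opcode == PTXInstruction::St) {
                    ++loadsAndStores;                          // count ld/st ops
                }
            }
        }
        return loadsAndStores;
    }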

3. DESIGN AND IMPLEMENTATION

In this section, we discuss the specific enhancements made to Ocelot to add externally defined instrumentation procedures, apply them to PTX modules, and extract profiling information during application execution.

Ocelot defines an interface for implementing PTX instrumentation tools and provides an externally visible API for attaching instrumentation passes to Ocelot before and during the execution of GPU compute applications. As described in previous work [8], Ocelot replaces NVIDIA's CUDA Runtime API library (libcudart on Mac and Linux, cudart.dll on Windows) during the link step when CUDA applications are compiled.



Figure 1: Overview of the GPU Ocelot dynamic compilation infrastructure.

To insert third-party instrumentation procedures, applications can be modified to explicitly add and remove instrumentors between kernel launches via Ocelot's add and remove APIs. Alternatively, instrumentation tools built as an additional library and linked with the application may add themselves when the library is initialized. With this approach, application sources do not need to be modified or recompiled.
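One plausible realization of the library-based route, sketched under the same assumed registration API as above: a static object registers the tool during library initialization, before main() runs, so the application binary is never touched.

    // Hypothetical self-registering instrumentation library; the
    // registration call and tool class are assumed names.
    namespace {

    struct AutoRegister {
        BasicBlockInstrumentor instrumentor;   // the tool to install
        AutoRegister() {
            // Runs during static initialization of the linked library,
            // before main() and before any kernel launch.
            ocelot::addInstrumentor(instrumentor);
        }
    };

    AutoRegister autoRegister;   // linking the library activates the tool

    }   // anonymous namespace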

The instrumentation tools themselves are C++ classes that consist of two logical components: (1) an instrumentor class derived from the abstract base class PTXInstrumentor, and (2) an instrumentation pass class derived from Ocelot's Pass abstract class. The class diagram in Figure 2 illustrates an example instrumentation tool, BasicBlockInstrumentor, and presents the class structure graphically. The instrumentor is responsible for performing any static analysis necessary for the instrumentation, constructing instrumentation-related data structures, instantiating a PTX transformation pass, extracting instrumentation results, and cleaning up resources. The PTX pass applies transformations to PTX modules, which are presented to it via Ocelot's PTX internal representation (IR).

Figure 2: Class diagram for instrumentation passes.
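A minimal skeleton of this two-class structure might read as follows. The virtual method names mirror those discussed in Sections 3.1 and 3.2 (instrument, analyze, initialize, finalize, runOnBlock), but the exact base-class signatures are assumptions, not Ocelot's verbatim headers.

    // Sketch of the two-part tool structure; signatures are assumptions.
    class BasicBlockInstrumentor : public PTXInstrumentor {
    public:
        void analyze(Module& module) override;     // inspect CFGs, count blocks
        void initialize() override;                // allocate device-side counters
        void instrument(Module& module) override;  // apply the PTX pass below
        void finalize() override;                  // extract results, free memory
    private:
        size_t basicBlocks = 0;                    // filled in by analyze()
    };

    class BasicBlockInstrumentationPass : public BasicBlockPass {
    public:
        void runOnBlock(BasicBlock& block) override;  // insert counting PTX
    };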

3.1 Instrumentation and PTX Passes

The host application calling the CUDA Runtime API is responsible for registering PTX modules, which are then loaded on the selected CUDA device. Under the Runtime API layer, within Ocelot, modules are loaded by parsing and analyzing PTX, applying procedural transformations, then translating for execution on the target device. Registered PTXInstrumentor instances are applied during the procedural transformation step by invoking their instrument() method on each module.

PTX modules contain a set of global variables and functions and must be modified to include a global variable pointing to additional data structures that receive and output instrumentation results. This module-level change is performed by a ModulePass specific to each PTXInstrumentor instance. Subsequently, kernels within the module are instrumented by either a kernel pass or a basic block pass.

Between the parse phase and the translation phase, a pass manager is positioned to apply a sequence of procedural transformations to the PTX kernels before they are loaded onto the selected device. This approach is largely inspired by LLVM's [9] pass manager framework, in which transformations may be made at the module level, the function level, and the basic block level depending on the scope of the transformation. The manager orchestrates pass application to ensure the sequential order of passes is preserved while taking advantage of locality of reference. Figure 3 illustrates this method of restricting pass scope and coalescing passes by type.

Figure 3: PTX pass manager showing application of transformations to a PTX module. The circle indicates BasicBlockPasses pass3 and pass4 are each applied to basic block B5 before moving on to other blocks.

Passes themselves are C++ classes that implement interfaces derived from Ocelot's analysis::Pass abstract class. These include ImmutablePass for performing static analysis consumed later in the transformation pipeline, ModulePass for performing module-level modifications, KernelPass for changing the control structure of the kernel, and BasicBlockPass for applying transformations within the scope of one basic block at a time. PTX passes may access the kernel's control-flow graph, dominator tree, and data-flow graph. The data-flow graph is updated when new instructions are added that create new values, and the dominator trees are recomputed when kernel and module passes change the control flow of kernels.

3.2 Instrumentor APIs and CUDA Runtime

Certain instrumentations may require inspection of the kernel's CFG to obtain information required by the CUDA Runtime API to properly allocate resources on the device. In general, any actions that must be performed prior to allocating resources on the device are encapsulated in the analyze() method. For our basic block execution count instrumentation, we obtain the CFG of each kernel to determine the total number of basic blocks.

Before launching a kernel, memory on the device must be allocated and initialized to store the instrumentation results. Ocelot calls each registered instrumentation pass's initialize() method, which may allocate memory and transfer data to and from the selected device. After the kernel has been launched, each instrumentor's finalize() method is invoked to free allocated resources and extract instrumentation results into an instance of KernelProfile. The KernelProfile class outputs results either to a file or a database, or it may channel instrumentation results to other components or applications that link with Ocelot. External applications can access the KernelProfile instance via the kernelProfile() API within Ocelot.
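For the basic block counter tool, these lifecycle methods reduce to ordinary CUDA Runtime calls. A minimal sketch, assuming the instrumentor stores the device pointer and the counter dimensions as members (the class, member, and KernelProfile method names are illustrative assumptions):

    // Sketch of initialize()/finalize() using real CUDA Runtime calls;
    // the class and member names are illustrative assumptions.
    void BasicBlockInstrumentor::initialize() {
        size_t bytes = basicBlocks * threads * sizeof(unsigned long long);
        cudaMalloc(reinterpret_cast<void**>(&d_counters), bytes);
        cudaMemset(d_counters, 0, bytes);    // counters start at zero
    }

    void BasicBlockInstrumentor::finalize() {
        std::vector<unsigned long long> counters(basicBlocks * threads);
        cudaMemcpy(counters.data(), d_counters,
                   counters.size() * sizeof(unsigned long long),
                   cudaMemcpyDeviceToHost);  // extract results after the launch
        cudaFree(d_counters);                // release instrumentation storage
        profile.record(counters);            // hand off to a KernelProfile (assumed API)
    }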

3.3 Example Instrumentation Tools

The ClockCycleCountInstrumentationPass inserts instrumentation to read the clock cycle counter exposed by PTX's special register %clock, which corresponds to a built-in hardware clock cycle counter. Instrumentation is inserted at the beginning of the kernel to record the starting clock cycle number and at the end, along with a barrier waiting for all threads to finish, to record the CTA's ending time. In addition to storing runtimes, the PTX register %smid is accessed to determine which streaming multiprocessor each CTA was mapped to.
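The effect of this pass can be pictured with hand-written CUDA device code: read the cycle counter at kernel entry, synchronize, and record elapsed cycles together with the SM the CTA ran on. The snippet below is an illustrative analogue of what the pass injects at the PTX level, using the 64-bit clock64() intrinsic where the pass reads %clock, and inline PTX for %smid; the result-buffer layout is an assumption.

    // Hand-written analogue of the injected timing code; the result
    // buffer layout is an illustrative assumption.
    __global__ void timedKernel(unsigned long long* ctaCycles,
                                unsigned int* ctaToSm) {
        long long start = clock64();     // starting cycle count for this CTA

        // ... the original kernel body would execute here ...

        __syncthreads();                 // wait for all threads in the CTA
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm("mov.u32 %0, %%smid;" : "=r"(smid));  // SM this CTA ran on
            ctaCycles[blockIdx.x] = clock64() - start;
            ctaToSm[blockIdx.x] = smid;
        }
    }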

The BasicBlockInstrumentationPass constructs a matrix of counters with one row per basic block in the executed kernel and one column per dynamic PTX thread. By assigning one basic block counter per thread, the instrumentation avoids the contention on global memory that would arise if each thread performed atomic increments on the same block counter. Instrumentation code added via a BasicBlockPass loads a pointer to the counter matrix from a global variable. The instrumentation pass then adds PTX instructions to each basic block that compute that thread's counter index and increment the associated counter using non-atomic loads and stores. Counters of the same block for consecutive threads are arranged consecutively in global memory so that accesses are coalesced and guaranteed to hit the same L1 cache line. The PTX instructions added to the beginning of each basic block appear in Listing 1, annotated with pseudocode illustrating the purpose of each instruction sequence. At runtime, this instrumentation pass allocates the counter matrix sized according to the kernel's configured block size.

Listing 1: Instrumentation inserted into each basic block for BasicBlockInstrumentationPass.

    // get pointer to counter matrix
    mov.u64 %r28, __ocelot_basic_block_counter_base;
    ld.global.u64 %r13, [%r28 + 0];
    add.u64 %r13, %r13, %ctaoffset;

    // idx = nthreads * blockId + threadid
    mad.lo.u64 %r14, %nthreads, 5, %threadid;

    // ptr = idx * sizeof(Counter) + base
    mad.lo.u64 %r15, %r14, 8, %r13;

    // *ptr++;
    ld.global.u64 %r12, [%r15 + 0];
    add.u64 %r12, %r12, 1;
    st.global.u64 [%r15 + 0], %r12;
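On the host side, the layout that Listing 1 indexes is simple to size and address. The sketch below restates the arrangement in plain C++ (function names are illustrative):

    #include <cstddef>

    // Counter matrix: one row per basic block, one column per thread.
    // Consecutive threads of the same block are adjacent in memory,
    // so the per-block increments coalesce into the same cache lines.
    size_t counterBytes(size_t basicBlocks, size_t nThreads) {
        return basicBlocks * nThreads * sizeof(unsigned long long);
    }

    // Counter index for basic block b and global thread t; this matches
    // the "idx = nthreads * blockId + threadid" computation in Listing 1.
    size_t counterIndex(size_t b, size_t t, size_t nThreads) {
        return b * nThreads + t;
    }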

The C++ BasicBlockInstrumentationPass class uses PTX IR factories to construct the instrumentation code inserted into the kernel. The BasicBlockInstrumentationPass implements the runOnBlock method to add the relevant PTX at the beginning of every basic block. Listing 2 shows a snippet of the corresponding C++ code in the runOnBlock method that creates the PTX to update the global basic block counter. This code corresponds to the last three lines of the PTX shown in Listing 1.

Listing 2: Corresponding C++ code for inserting instrumentation into each basic block for BasicBlockInstrumentationPass.

    PTXInstruction ld(PTXInstruction::Ld);
    PTXInstruction add(PTXInstruction::Add);
    PTXInstruction st(PTXInstruction::St);

    ld.addressSpace = PTXInstruction::Global;
    ld.a.addressMode = PTXOperand::Indirect;
    ld.a.reg = registerMap["counterPtrReg"];
    ld.d.reg = registerId;
    ld.d.addressMode = PTXOperand::Register;

    add.addressSpace = PTXInstruction::Global;
    add.d = ld.d;
    add.a = ld.d;
    add.b.addressMode = PTXOperand::Immediate;
    add.b.imm_int = 1;

    st.addressSpace = PTXInstruction::Global;
    st.d.addressMode = PTXOperand::Indirect;
    st.d.reg = registerMap["counterPtrReg"];
    st.a.addressMode = PTXOperand::Register;
    st.a.reg = registerId;

    // Inserts at the beginning of the basic block,
    // as the 5th, 6th, and 7th statements
    // (first statement indexed at 0).
    kernel->dfg()->insert(block, ld, 4);
    kernel->dfg()->insert(block, add, 5);
    kernel->dfg()->insert(block, st, 6);

4. EVALUATION

The above instrumentation passes were implemented as a branch of Ocelot version 1.1.560 [4]. To evaluate the usefulness and performance impact of these instrumentation passes, the following experiments were performed on a system with an Intel Core i7 running Ubuntu 10.04 x86-64 and equipped with an NVIDIA GeForce GTX 480. Benchmark applications were chosen from the NVIDIA CUDA Software Development Kit [2] and the Parboil Benchmark Suite [10].

Experiment 1 - Hot Region Detection. To determine the most frequently executed basic blocks within a kernel, we use our BasicBlockInstrumentationPass. Figure 4 is a heat map visualizing the results of this experiment for the Scan application from the CUDA SDK. Basic blocks are colored in intensity proportional to the number of threads that have entered them. The hottest region consists of blocks BB_001_007, BB_001_008, and BB_001_009, corresponding to this kernel's inner loop. This metric captures architecture-independent behavior specified by the application. A similar instruction trace analysis offered by Ocelot's PTX emulator provides the same information, but at the cost of emulation: by instrumenting native PTX and executing the kernels natively on a Fermi-class NVIDIA GTX 480 [11], a speedup of approximately 1000x over the emulator was achieved. This is an example of workload characterization accelerated by GPUs.

Figure 4: Hot region visualization of the CUDA SDK Scan application profiled during native GPU execution. Each block presents a count of the number of times a thread entered the basic block and is color coded to indicate computational intensity. The magnified portion of the control-flow graph illustrates a loop, the dominant computation in the kernel.
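Deriving such a heat map from the extracted counter matrix is a small host-side reduction: sum each block's row and rank the totals. A minimal sketch in plain C++ (illustrative, not the paper's visualization code):

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Rank basic blocks by total thread entries, given the counter matrix
    // (row per block, column per thread) copied back from the device.
    void reportHotBlocks(const std::vector<unsigned long long>& counters,
                         size_t basicBlocks, size_t nThreads) {
        std::vector<std::pair<unsigned long long, size_t>> totals;
        for (size_t b = 0; b < basicBlocks; ++b) {
            unsigned long long entries = 0;
            for (size_t t = 0; t < nThreads; ++t)
                entries += counters[b * nThreads + t];
            totals.emplace_back(entries, b);
        }
        std::sort(totals.rbegin(), totals.rend());   // hottest block first
        for (const auto& [entries, block] : totals)
            std::printf("BB_%03zu: %llu entries\n", block, entries);
    }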

Experiment 2 - Overhead of Instrumentation. The basic block execution count instrumentation contributes a per-block overhead in terms of memory bandwidth and computation. Blocks in the hottest region make numerous accesses when incrementing their respective per-thread counters and displace some cache lines from the L1 and L2 caches. Clock cycle count instrumentation inserts instructions to read clock cycles at the beginning of the kernel and then to store the difference into a counter in global memory. All forms of instrumentation can be expected to perturb execution times in some way. This experiment measures runtimes of sample applications with and without each instrumentation pass. Slowdowns for selected applications from the CUDA SDK and Parboil appear in Figure 5. These applications cover a spectrum of structural properties relevant to basic block instrumentation, including the number of operations per basic block, the number of kernels launched, and whether they are memory- or compute-bound.

Figure 5: Slowdowns of selected applications due to BasicBlockInstrumentor and ClockCycleCountInstrumentor.

All applications perform consistently well with clock cycle count instrumentation, achieving minimal slowdown (less than 1.3x). Compute-intensive applications with many instructions per basic block, such as BicubicTexture, BlackScholes, BoxFilter, and Nbody, achieve the least slowdown from BasicBlockInstrumentor, as the costs of accessing memory are amortized or hidden entirely. Applications with a large number of short basic blocks, such as BinomialOptions, tpacf, and rpes, exhibit the largest slowdowns from BasicBlockInstrumentor. Other applications that exhibited a mixture of large and small basic blocks but still had either a significantly large number of total basic blocks per kernel or many kernel invocations, such as Mandelbrot, UnstructuredMandelbrot, EigenValues, and ThreadFenceReduction, had slowdowns between 2x and 4x.

To verify the cause of the largest slowdown in our experiments, BinomialOptions, we use our basic block instrumentation data to determine the most frequently executed basic blocks within the BinomialOptions kernel. An analysis of the associated PTX shows that this kernel consists of many compute-intensive, short basic blocks with very few memory accesses (less than 11% of the total instructions, without instrumentation, are ld and/or st instructions). The basic block instrumentation contributes a large additional bandwidth demand as well as a significant fraction of dynamic instructions, resulting in a 5.5x slowdown for this application. As an optimization, a more sophisticated instrumentation pass could use registers for counter variables and elide counters for blocks that could be fused.

Experiment 3 - CTA Load Imbalance. CUDA execution semantics specify coarse-grained parallelism in which cooperative thread arrays (CTAs) may execute concurrently but independently. By excluding synchronization across CTAs from the execution model, GPUs are free to schedule the execution of CTAs in any order and with any level of concurrency.




Moreover, the programmer is encouraged to specify the number of CTAs per kernel grid as a function of problem size without consideration of the target GPU architecture. This is one approach toward parallelism scalability, as the hardware is free to map CTAs to streaming multiprocessors (SMs) as they become available; an application's performance may scale as it is run on a low-end GPU with 2 SMs or a high-end GPU with 30 SMs. Unfortunately, this approach may also result in load imbalance, as more work may be assigned to some CTAs than others. ClockCycleCountInstrumentor records CTA runtimes and the mapping from CTA to SM and determines whether the number of CTAs and their corresponding workloads leave any SMs idle for extended periods or whether the workload is well balanced. Figure 6 plots runtimes of selected applications for each SM, normalized to total kernel runtime. The Mandelbrot application exhibits zero clock cycles for SMs 0, 4, and 8, implying they are idle, yet SMs 1, 2, and 3 have runtimes that are nearly twice as long as those of the other SMs. This implies a rather severe load imbalance in which nearly 80% of the GPU is unutilized for half of the kernel's execution. This level of feedback may hint to the programmer to reduce the amount of work per CTA so that the hardware work scheduler can assign additional, shorter-running CTAs to the underutilized SMs. The other applications exhibit a balanced workload, with all SMs utilized for over 75% of the total kernel runtime.

Figure 6: Normalized runtimes for each streaming multiprocessor for several workloads. Kernel runtime is the maximum number of cycles over all SMs, and some SMs are less heavily used than others.
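The per-SM view in Figure 6 can be computed directly from the records gathered by ClockCycleCountInstrumentor: accumulate cycles per SM and normalize by the longest-running SM, which bounds the kernel runtime. A sketch under the buffer layout assumed earlier (summing per-CTA cycles approximates busy time and ignores CTA overlap on one SM):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Approximate per-SM busy time by summing the cycles of the CTAs
    // mapped to each SM, then normalize by the longest-running SM.
    void reportSmUtilization(const std::vector<unsigned long long>& ctaCycles,
                             const std::vector<unsigned int>& ctaToSm,
                             unsigned int numSms) {
        std::vector<unsigned long long> smCycles(numSms, 0);
        for (size_t cta = 0; cta < ctaCycles.size(); ++cta)
            smCycles[ctaToSm[cta]] += ctaCycles[cta];
        unsigned long long kernelCycles =
            *std::max_element(smCycles.begin(), smCycles.end());
        if (kernelCycles == 0) return;               // nothing recorded yet
        for (unsigned int sm = 0; sm < numSms; ++sm) // near-zero => idle SM
            std::printf("SM %2u: %.2f\n", sm,
                        double(smCycles[sm]) / double(kernelCycles));
    }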

Experiment 4 - Online Credit-based Process Scheduling. In this experiment, we demonstrate Ocelot's API for querying the results of profiling information gathered from dynamic binary instrumentation by simulating an online process scheduler. This contrasts with the NVIDIA Compute Visual Profiler [12], which only offers offline analysis. The experimental configuration is derived from work published by Gupta et al. [13], in which a hypervisor attempts to balance contention for a physical GPU among a collection of guest operating systems in a virtualized environment. By instrumenting kernels at the hypervisor level, scheduling decisions may be informed by data collected without modifying the actual applications.

A credit-based scheduler, illustrated in Figure 7, assigns a kernel launch rate to each process according to a history of previous execution times, and rates are adjusted to achieve utilization targets for each process. For our demonstration, we wrote two CUDA applications, each performing matrix multiplication on distinct data sets. Application A performs matrix multiplication on a randomly generated data set that is intentionally smaller than Application B's randomly generated data set, resulting in disparate GPU utilizations.


Figure 7: Credit-based process scheduler utilizing online profile information to make control decisions.

To properly balance compute resources across both processes, the hypervisor must transparently determine kernel runtimes and assign credits accordingly. We capture kernel runtimes via the ClockCycleCountInstrumentor and adjust each application's kernel launch rate to achieve the desired 50% GPU utilization target. The scheduler initially launches each application's kernels with a period of 10 ms and adjusts the rates every 50 ms.

Figure 8(a) plots GPU utilizations for the two applications running without any scheduling policy. As noted earlier, Application B launches kernels with considerably larger workloads than Application A. Figure 8(b) plots the same applications with kernel launch rates specified by the process scheduler under a policy that targets 50% utilization for each application. The scheduler uses online information about kernel runtimes and adjusts kernel launch rates so that Application B launches kernels less frequently than A. The startup transient visible during the first few kernel launches results as runtime histories are constructed for both applications; utilization then converges toward the 50% target.

Figure 8: Instrumentation used for process scheduling to achieve load balance in a simulated virtualized environment. (a) A set of applications with imbalanced workloads and no rate-limiting scheduler. (b) Process scheduler targeting 50% utilization for both processes by adjusting the kernel launch rate.
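The scheduler's control law can be summarized in a few lines: at each 50 ms adjustment interval, compare each process's measured share of GPU cycles against its target and scale its launch period accordingly. A simplified sketch of this feedback step (pure host logic; struct and function names are illustrative):

    #include <vector>

    struct GuestProcess {
        double launchPeriodMs;   // current launch period (initially 10 ms)
        double measuredCycles;   // recent runtime history from instrumentation
    };

    // One adjustment step: scale each process's kernel launch period so that
    // its measured utilization share converges toward the target share.
    void adjustRates(std::vector<GuestProcess>& guests, double targetShare) {
        double total = 0.0;
        for (const GuestProcess& g : guests) total += g.measuredCycles;
        if (total == 0.0) return;   // no history yet (startup transient)
        for (GuestProcess& g : guests) {
            double share = g.measuredCycles / total;
            // Launch less often when over target, more often when under.
            g.launchPeriodMs *= share / targetShare;
        }
    }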

Experiment 5 - Characterization of JIT Compilation Overheads. Dynamic binary instrumentation invokes a compilation step as the program is running. Application runtime is impacted both by the overheads associated with executing instrumentation code when it is encountered and by the process of inserting the instrumentation itself. Dynamically instrumented CUDA programs require an additional just-in-time compilation step to translate from PTX to the native GPU instruction set, but applications are typically written with long-running kernels in mind. In this experiment, we characterize the overheads of each step of Ocelot's compilation pipeline: parsing large PTX modules, performing static analysis, executing PTX-to-PTX transformations, JIT compiling via the CUDA Driver API, and executing on the GPU.

Figure 9 presents the dynamic compilation overheads in compiling and executing the Parboil application mri-fhd and instrumenting it with the basic block counters described in Section 3.3. This application consists of a single PTX module of moderate size (2,916 lines of PTX). The figure shows that the relative time spent performing the instrumentation passes, 14.6%, is less than both the time to parse the PTX module and the time to re-emit it for loading by the CUDA Driver API, steps that would be needed even without instrumentation. Online use of instrumentation would not need to perform the parse step more than once. The results indicate there would be less than a 2x slowdown if kernels were instrumented and re-emitted with each invocation, a slowdown that would decrease for longer-running kernels.

Figure 9: Overheads in compiling and executing an application from the Parboil benchmark suite. Instrumentation occupies 14.6% of total kernel runtime, including compilation.
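Stage-by-stage attribution of these overheads only requires timestamping around each pipeline step. A minimal sketch of the measurement harness (the lambdas are placeholders for the parse, instrumentation, re-emission, and driver JIT steps):

    #include <chrono>
    #include <cstdio>

    // Time one stage of the dynamic compilation pipeline in milliseconds.
    template <typename Stage>
    double timeStageMs(Stage&& stage) {
        auto begin = std::chrono::steady_clock::now();
        stage();   // run the stage (parse, instrument, re-emit, or JIT)
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(end - begin).count();
    }

    int main() {
        double parseMs = timeStageMs([] { /* parse the PTX module */ });
        double passMs  = timeStageMs([] { /* run instrumentation passes */ });
        double emitMs  = timeStageMs([] { /* re-emit PTX for the driver */ });
        std::printf("parse %.3f ms, instrument %.3f ms, emit %.3f ms\n",
                    parseMs, passMs, emitMs);
        return 0;
    }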

5. FUTURE WORK

With our current work, we have demonstrated the capability and usefulness of online profiling for GPU compute applications. However, we have only scratched the surface of what can be achieved with this capability. For future work, we would like to develop a much more comprehensive suite of metrics. We would also like to investigate customized instrumentations for capturing application-specific behavior; one of the major benefits of our approach is the ability to perform selective instrumentation for specific cases.



From a design perspective, our instrumentation framework provides APIs that ease the design and implementation of custom, user-defined instrumentations. However, these APIs still require instrumentation to be specified at the PTX level. We are currently investigating higher-level, C-like constructs for specifying instrumentation instead of requiring users of our tool to generate PTX instructions via the Ocelot IR interface. Finally, we demonstrated online scheduling as an example of how our framework can be used to enforce system policies. This capability opens new avenues for automatic run-time resource allocation and decision-making, potentially leading to higher system efficiency. Several real-world use cases can benefit from this capability. GPU-accelerated virtual machines that provide system management capabilities for heterogeneous manycore systems with specialized accelerators [13] are one such example. Another interesting use case for online instrumentation is optimizing GPUs for power consumption: the GPU power and performance model by Hong and Kim predicts the optimal number of active processors for a given application [14], and we would like to explore opportunities to integrate with such analytical models.

6. RELATED WORK

Pin [15] is a mature dynamic instrumentation tool for inserting probes into CPU application binaries. It strives to be architecture independent, supporting CPU architectures such as x86, Itanium, and ARM, among others. Our approach to instrumenting CUDA kernels was largely inspired by Pin's instrumentation model, which facilitates the creation of user-supplied instrumentation tools and their insertion into existing applications. Pin does not target data-parallel architectures or execution models such as GPUs and PTX, nor does it identify embedded PTX kernels as executable code. Moreover, Ocelot manages resources such as device memory allocations and textures and presents these to the instrumentation tools when they are inserted.

NVIDIA's Compute Visual Profiler [12] was released to address the profiling needs of developers of GPU compute applications and provides a selection of metrics to choose from. This utility is implemented by reading hardware performance counters available in NVIDIA GPUs after applications have run. However, it does not offer the opportunity to insert user-supplied instrumentation procedures, and the results are not conveniently available for making online scheduling decisions.

Boyer et al. [16] propose statically analyzing CUDA programs and inserting instrumentation at the source level. This approach was used to detect race conditions among threads within a CTA and bank conflicts during accesses to shared memory. In their tool, a front-end parser was written to construct an intermediate representation of the high-level program, to annotate it with instrumentation procedures, and ultimately to re-emit CUDA for compilation using the existing toolchain. This is a heavyweight approach that must be performed before the program is run and does not enable removing instrumentation once a program has reached a steady phase of operation. Moreover, this approach misses the opportunity to observe behaviors only visible after CUDA has been compiled to PTX. These phenomena include PTX-level register spills if the program has been compiled with register count constraints or requires features available only on narrow classes of hardware. Finally, instrumentation procedures added at the source level exist upstream of lower-level optimization transformations and may not reflect the true behavior of the program. For example, a loop unrolling transformation may duplicate a basic block counter in the loop body even though the resulting binary has fused several loop iterations into a single block. Ocelot enables instrumentation to be added at arbitrary points as PTX-level optimizations are applied.

Many metrics of interest may be available through simulation using tools such as GPGPU-Sim [17], Barra [18], or Ocelot's own PTX emulator [8]. Additionally, these may drive timing models such as those proposed by Zhang [19] and by Hong [20]. While these provide a high level of detail, compute-intensive GPU applications are intended to achieve notable speedups over CPUs when run natively on GPU hardware; a CPU executing an emulator that in turn executes a GPU compute application exhibits slowdowns on the order of 100x to 1000x, which can be prohibitively slow when characterizing long-running benchmark applications. Such simulators implement GPU architectural features with varying levels of fidelity, ranging from cycle-accurate simulation of a hypothetical GPU architecture in the case of GPGPU-Sim to complete decoupling of the abstract PTX execution model in the case of Ocelot. Consequently, they do not necessarily capture the non-deterministic but significant interactions encountered when executing GPU kernels on actual hardware. Instrumenting PTX kernels and executing them natively renders large GPU compute applications tractable and captures hardware-specific behaviors.

7. CONCLUSION

GPU compute toolchains have greatly facilitated the explosion of research into GPUs as accelerators in heterogeneous compute systems. However, these toolchains have lacked the ability to perform dynamic binary instrumentation, and if such a capability were added natively, instrumentation would have to be written for each processor architecture. By inserting instrumentation at the PTX level via a dynamic compilation framework, we are able to bridge the gap in transparent dynamic binary instrumentation capability for CUDA applications targeting both NVIDIA GPUs and x86 CPUs. This platform provides an open-ended and flexible environment for adding instrumentation probes not present in closed tools such as the Compute Visual Profiler. We envision our tool improving the state of workload characterization by profiling native executions rather than simulations, facilitating profile-directed optimization of data-parallel kernels, and enabling application-level scheduling and resource management decisions to be made online.


Acknowledgements

This research was supported by NSF under grants CCF-0905459 and OCI-0910735, IBM through an OCR Innovation award, LogicBlox Corporation, and an NVIDIA Graduate Fellowship. We gratefully acknowledge the insights of Vishakha Gupta and Alexander Merritt and their suggestion of the online credit-based scheduler.

8. REFERENCES

[1] Khronos OpenCL Working Group. The OpenCL Specification, December 2008.

[2] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture. NVIDIA Corporation, Santa Clara, California, 2.1 edition, October 2008.

[3] NVIDIA. NVIDIA Compute PTX: Parallel Thread Execution. NVIDIA Corporation, Santa Clara, California, 1.3 edition, October 2008.

[4] Gregory Diamos, Andrew Kerr, and Sudhakar Yalamanchili. GPU Ocelot: A binary translation framework for PTX, June 2009. http://code.google.com/p/gpuocelot/.

[5] Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan Clark. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 353-364, New York, NY, USA, 2010. ACM.

[6] Rodrigo Dominguez, Dana Schaa, and David Kaeli. Caracal: Dynamic translation of runtime environments for GPUs. In Proceedings of the 4th Workshop on General-Purpose Computation on Graphics Processing Units, 2011. To appear.

[7] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, October 1991.

[8] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. A characterization and analysis of PTX kernels. In IEEE International Symposium on Workload Characterization (IISWC), 2009.

[9] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, March 2004.

[10] IMPACT. The Parboil benchmark suite, 2007.

[11] NVIDIA Corporation. NVIDIA's next generation CUDA compute architecture: Fermi. White paper, NVIDIA, November 2009.

[12] NVIDIA. NVIDIA Compute Visual Profiler. NVIDIA Corporation, Santa Clara, California, 1.0 edition, October 2010.

[13] Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvardhan Kharche, Niraj Tolia, Vanish Talwar, and Parthasarathy Ranganathan. GViM: GPU-accelerated virtual machines. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt '09, pages 17-24, New York, NY, USA, 2009. ACM.

[14] Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In IEEE International Symposium on Computer Architecture, 2010.

[15] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190-200, New York, NY, USA, 2005. ACM.

[16] Michael Boyer, Kevin Skadron, and Westley Weimer. Automated dynamic analysis of CUDA programs. In Third Workshop on Software Tools for MultiCore Systems (STMCS), 2008.

[17] Ali Bakhoda, George Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA, USA, April 2009.

[18] Sylvain Collange, David Defour, and David Parello. Barra, a modular functional GPU simulator for GPGPU. Technical Report hal-00359342, 2009.

[19] Yao Zhang and John D. Owens. A quantitative performance analysis model for GPU architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA 17), February 2011.

[20] Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Computer Architecture News, 37(3):152-163, 2009.