ocelot: ptx emulator
DESCRIPTION
Ocelot: PTX Emulator. Overview. Ocelot PTX Emulator Multicore-Backend NVIDIA GPU Backend AMD GPU Backend. Execution Model . NVIDIA’s PTX Execution Model. Array of multiprocessors each executing a CTA SIMD multiprocessors Single instruction multiple thread (SIMT) execution - PowerPoint PPT PresentationTRANSCRIPT
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 1
OCELOT: PTX EMULATOR
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2
OverviewOcelot PTX Emulator
Multicore-Backend
NVIDIA GPU Backend
AMD GPU Backend
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Execution Model
Parallel thread execution (PTX) Explicit memory hierarchy Cooperative thread arrays (CTAs)
and ordering constraints
Array of multiprocessors each executing a CTA
SIMD multiprocessors Single instruction multiple thread (SIMT) execution Enables hardware to exploit
control flow uniformity and data locality
3
NVIDIA’s PTX Execution Model
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
The Ocelot PTX EmulatorPerforms functional simulation of the PTX execution model
Access to complete machine stateEnables detailed performance evaluation/debugging
e.g. Program correctness checking Alignment checks, out of bounds etc.
Workload characterization and modeling
Trace generation to drive architecture simulators
Timing information not availablePTX 3.0 (Fermi) support
Abstract machine model
4
4
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Emulator ImplementationSerialized execution of CTAs
Implements the CUDA execution model semantics CUDA does not guarantee concurrent CTA execution!
Implements abstract reconvergence mechanism Multiple reconvergence mechanisms are implemented:
IPDOM, Barrier, Thread frontiers
Software support for special functions: Texture sampling
1D, 2D, 3D, cube Nearest, linear
New instructions may be prototyped
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Ocelot Source Code: PTX Emulator Device Backend• ocelot/
• executive/
• interface/EmulatorDevice.h• interface/EmulatedKernel.h• interface/EmulatorCallStack.h• interface/CooperativeThreadArray.h• interface/ReconvergenceMechanism.h• interface/TextureOperations.h
• trace/
• interface/TraceEvent.h• interface/TraceGenerator.h
6
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Trace Generator Interface7
7
Emulator broadcasts events to event trace analyzers during execution
Events provide detailed device state: PC, activity mask, operand data, thread ID, etc.
Used for error checking, instrumentation, and simulation
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Trace Generator Interface// ocelot/trace/interface/TraceGenerator.h
// Base class for generating traces
class TraceGenerator {public:TraceGenerator();
virtual ~TraceGenerator();
// Called when a traced kernel is launched to
// retrieve some parameters from the kernel
virtual void initialize( const executive::ExecutableKernel& kernel);
// Called whenever an event takes place.
virtual void event(const TraceEvent & event);
// Called when an event is committed
virtual void postEvent(const TraceEvent & event);
// Called when a kernel is finished. There will
// be no more events for this kernel.
virtual void finish();};
8
8
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
TraceEvent ObjectCaptures execution of a dynamic instruction, including
Device pointer to access device state Kernel grid, CTA dimensions, parameter values PC and internal representation of PTX instruction Set of active threads executing instruction Memory addresses and size of transfer Branch target(s) and diverging threads
9
9
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
trace::TraceEvent
class TraceEvent {public:
// ID of the block that generated the event ir::Dim3 blockId;
// PC index into EmulatedKernel's packed instruction sequence
ir::PTXU64 PC;
// Depth of call stack [i.e. number of contexts on the runtime stack]
ir::PTXU32 contextStackSize;
// Instruction pointer to instruction pointed to by PC const ir::PTXInstruction* instruction;
// Bit mask of active threads that executed this instruction
BitMask active; // Taken thread mask in case of a branch BitMask taken;
// Fall through thread mask in case of a branch BitMask fallthrough;
// Vector of memory addresses possibly generated for this instruction
U64Vector memory_addresses;
// Vector of sizes of memory operations possibly issued by this instruction
ir::PTXU32 memory_size;
// Dimensions of the kernel grid that generated the event
ir::Dim3 gridDim;
// Dimensions of the kernel block that generated the event
ir::Dim3 blockDim; // Captures just events related to thread reconvergence ReconvergenceTraceEvent reconvergence;
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
PTX Emulator – CUDA Debugging – Race Detection
// file: raceCondition.cu__global__ void raceCondition(int *A) {
__shared__ int SharedMem[64];
SharedMem[threadIdx.x] = A[threadIdx.x];
// no synchronization barrier!
A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load}
. . .
raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
. . .
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]
- Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.
==Ocelot== Near raceCondition.cu:9:0
11
11
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
PTX Emulator – CUDA Debugging- Illegal Memory Accesses
// file: memoryCheck.cu__global__ void badMemoryReference(int *A) {
A[threadIdx.x] = 0; // line 3 - faulting store}
int main() {
int *invalidPtr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation
int *validPtr = 0; cudaMalloc((void **)&validPtr, sizeof(int)*64);
badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr ); return 0;
} ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.==Ocelot== ==Ocelot== Nearby Device Allocations==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0
12
12
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Interactive DebuggerInteractive command-line debugger implemented using TraceGenerator interface
$ ./TestCudaSequenceA_gpu = 0x16dcbe0(ocelot-dbg) Attaching debugger to kernel 'sequence'(ocelot-dbg)(ocelot-dbg) watch global address 0x16dcbe4 s32[3]set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes(ocelot-dbg)(ocelot-dbg) continuest.global.s32 [%r11 + 0], %r7watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6break on watchpoint(ocelot-dbg)
• Enables • Inspection of application
state • Single-stepping of
instructions• Breakpoints and
watchpoints
• Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Performance Tuning
....for (int offset = 1; offset < n; offset *= 2){ // line 61
pout = 1 - pout;pin = 1 - pout;__syncthreads();temp[pout*n+thid] = temp[pin*n+thid];
if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid-offset];
}}....
• Identify critical regions• Memory demand• Floating-point intensity• Shared memory bank
conflicts
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Example: Inter-thread Data Flow
• Which kernels exchange computed results through shared memory
• Track id of producer thread
• Ensure threads are well synchronized
15
15
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Current Trace Generators16
Trace Generators FunctionBranch Measures control flow uniformity and
branch divergenceInstruction Static and dynamic instruction countIntegrated Debugger GDB-like interfaceKernel Dimension Kernel grid and block dimensionsMachine Attributes Observe and record machine characteristics
Memory Working set size, memory intensity, memory efficiency
Memory Checker i) Bounds checks, ii) alignment checksiii) uninitialized loads (shared memory)
Memory Race Detector Race conditions on shared memoryParallelism MIMD and SIMD parallelism limitsPerformance Bound Compute and memory throughputShared Computation Extent of data flow among threads
Warp Synchronous Hot-paths/regions for warp synchronous execution
16
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Using Trace Generators
Implement TraceGenerator interface
override methods: initialize( ), event( ), postEvent( ), finish( )
Add to Ocelot runtime explicitly: ocelot::addTraceGenerator( )
or, add to trace::TraceConfiguration
Online analysis or serialize event traces
Link applications with libocelotTrace.so
17
17
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Configuring & Executing Applications Edit configure.ocelot
Controls Ocelot’s initial state Located in application’s startup directory trace specifies which trace generators are initially
attached executive controls device properties
trace: memoryChecker – ensures accesses to a memory
region associated with the currently selected device
raceDetector - enforces synchronized access to .shared
debugger - interactive debugger executive:
devices: emulated – Ocelot PTX emulator (trace generators)
trace: { memoryChecker: { enabled: true, checkInitialization: false }, raceDetector: { enabled: true, ignoreIrrelevantWrites: true }, debugger: { enabled: true, kernelFilter:
"_Z13scalarProdGPUPfS_S_ii", alwaysAttach: false }, }, executive: { devices: [ "emulated" ], }}
18
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Example: Thread Load Imbalance
//! Computes number of dynamic instructions for each thread
class ThreadLoadImbalance: public trace::TraceGenerator {
public:
std::vector< size_t > dynamicInstructions;
// For each dynamic instruction, increment counters of each
// thread that executes it
virtual void event(const TraceEvent & event) {
if (!dynamicInstructions.size())
dynamicInstructions.resize(event.active.size(), 0);
for (int i = 0; i < event.active.size(); i++) {
if (event.active[i])
dynamicInstructions[i]++;
}
}
};
• Mandelbrot (CUDA SDK)
Dyna
mic
Inst
ruct
ion
Coun
t
19
19
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20
EXAMPLE: CONTROL-FLOW DIVERGENCE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21
Control-Flow Divergence in PTX Emulator PTX Emulator facilitates customizable handlers for
control flow divergence
Currently implements: Immediate post-dominator (ipdom) And many more
Abstract handlers implement potentially divergent control instructions
e.g. Bra, Bar, Exit, Vote
Executing instructions drives TraceGenerators Reconvergence affects active threads, dynamic instruction
count, and instruction trace Analysis tools can group threads into warps of arbitrary size
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22
class ReconvergenceMechanism// ocelot/executive/interface/ReconvergenceMechanism.h
class ReconvergenceMechanism {public: ReconvergenceMechanism(CooperativeThreadArray *cta); virtual ~ReconvergenceMechanism();
virtual void initialize() = 0;
virtual void evalPredicate(CTAContext &context) = 0;
virtual bool eval_Bra(CTAContext &context, const ir::PTXInstruction &instr, const boost::dynamic_bitset<> & branch, const boost::dynamic_bitset<> & fallthrough) = 0;
virtual void eval_Bar(CTAContext &context, const ir::PTXInstruction &instr) = 0;
virtual void eval_Reconverge(CTAContext &context, const ir::PTXInstruction &instr) = 0;
virtual void eval_Exit(CTAContext &context, const ir::PTXInstruction &instr) = 0;
virtual void eval_Vote(CTAContext &context, const ir::PTXInstruction &instr);
virtual bool nextInstruction(CTAContext &context, const ir::PTXInstruction &instr, const ir::PTXInstruction::Opcode &) = 0; virtual CTAContext& getContext() = 0;}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23
Example: Immediate Post-Dominator Reconvergence// ocelot/executive/interface/ReconvergenceMechanism.h
class ReconvergenceIPDOM: public ReconvergenceMechanism {public: ReconvergenceIPDOM(CooperativeThreadArray *cta); ~ReconvergenceIPDOM(); void initialize(); void evalPredicate(CTAContext &context); bool eval_Bra(CTAContext &context, PTXInstruction &instr, dynamic_bitset<> & branch, dynamic_bitset<> & fallthrough);
void eval_Bar(CTAContext &context, PTXInstruction &instr);
void eval_Reconverge(CTAContext &context, PTXInstruction &instr);
void eval_Exit(CTAContext &context, PTXInstruction &instr);
bool nextInstruction(CTAContext &context, PTXInstruction &instr, PTXInstruction::Opcode &);
CTAContext& getContext(); size_t stackSize() const; void push(CTAContext&); void pop(); std::vector<CTAContext> runtimeStack; std::vector<int> pcStack; unsigned int reconvergeEvents;};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 24
Example: Immediate Post-Dominator Reconvergence
// ocelot/executive/implementation/ReconvergenceMechanism.cpp
void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context, const ir::PTXInstruction &instr) { if (context.active.count() < context.active.size()) { // deadlock - not all threads reach synchronization barrier std::stringstream message; message << "barrier deadlock:\n"; throw RuntimeException(message.str(), context.PC, instr); }}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 25
Example: Immediate Post-Dominator Reconvergencevoid executive::ReconvergenceIPDOM::eval_Reconverge( executive::CTAContext &context, const ir::PTXInstruction &instr) { if(runtimeStack.size() > 1) { if(pcStack.back() == context.PC) { pcStack.pop_back(); runtimeStack.pop_back(); ++reconvergeEvents; } else { context.PC++; } } else { context.PC++; }}void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context, const ir::PTXInstruction &instr) { eval_Bar(context, instr); context.running = false;}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 26
Summary: Ocelot PTX Emulator Used for instruction level analysis
Can be attached to trace generators and trace analyzers
Simple processing and filtering of machine state
Forms the basis for a range of productivity tools Correctness tools Debugging tools Workload characterization
Drives instruction and address traces to MacSim