TRANSCRIPT
Visualization in Distributed Systems (Vizualizarea in sisteme distribuite)
S.l. Dr. ing. Simona Caraiman
Master SDTW, Year II, 2011 - 2012
VSD - Lecture 10-11
GPU Programming (IV)
CUDA: Advanced Topics
Textures in CUDA
A texture is an object for reading data. Benefits:
- data is cached (optimized for 2D locality)
- filtering: linear / bilinear / trilinear, with dedicated hardware
- wrap modes for out-of-bounds addresses: clamp to edge / repeat
- addressable in 1D, 2D or 3D, using integer or normalized coordinates
Usage: CPU code binds data to a texture object; a kernel reads the data by calling a fetch function.
Textures in CUDA
Texture Addressing
Textures in CUDA
Two texture types:
- Bound to linear memory: a global memory address is bound to a texture; 1D only, integer addressing, no filtering, no addressing modes.
- Bound to CUDA arrays: a CUDA array is bound to a texture; 1D, 2D or 3D, float addressing (size-based or normalized), filtering, addressing modes (clamp, repeat).
CUDA Texturing Steps
Host (CPU) code:
- allocate/obtain memory (global linear memory, or a CUDA array)
- create a texture reference object
- bind the texture reference to the memory/array
- when done: unbind the texture reference, free resources
Device (kernel) code:
- fetch using the texture reference
- linear memory textures: tex1Dfetch()
- array textures: tex1D(), tex2D() or tex3D()
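A minimal sketch of these steps for a linear-memory texture, using the legacy texture reference API of this CUDA generation (the kernel, buffer names and launch configuration are illustrative):

    // file-scope texture reference
    texture<float, 1, cudaReadModeElementType> tex;

    __global__ void copy_kernel(float *out, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex, i);   // fetch through the texture cache
    }

    void host_side(float *d_in, float *d_out, int n)
    {
        cudaBindTexture(0, tex, d_in, n * sizeof(float));  // bind linear memory
        copy_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(tex);                            // unbind when done
    }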
Atomics
Problem: how do you do global communication?
- finish a grid and start a new one
- finish a kernel and start a new one
- all writes from all threads complete before a kernel finishes

    step1(...); // The system ensures that all
                // writes from step1 complete.
    step2(...);
Atomics
Global Communication
- Would need to decompose kernels into before and after parts
- Or, write to a predefined memory location
- Race condition! Updates can be lost:

    // vector[0] was equal to 0
    Thread 0:           Thread 1917:
    vector[0] += 5;     vector[0] += 1;
    ...                 ...
    a = vector[0];      a = vector[0];

What is the value of a in thread 0?
What is the value of a in thread 1917?
Atomics
Race conditions
- Thread 0 could have finished execution before thread 1917 started
- Or the other way around
- Or both are executing at the same time
Answer: not defined by the programming model; the result can be arbitrary.
CUDA provides atomic operations to deal with this problem.
Atomics
An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.
The name atomic comes from the fact that it is uninterruptible: no dropped updates, but ordering is still arbitrary.
Different types of atomic instructions:
atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
More types on Fermi.
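As an illustration of the CAS primitive, the classic compare-and-swap loop below sketches how a read-modify-write that the hardware lacks (e.g., a float add on pre-Fermi GPUs) can be built from atomicCAS; the helper name is ours:

    __device__ float atomicAddFloat(float *addr, float val)
    {
        // Re-read, compute, and attempt to swap until no other
        // thread has modified *addr in between.
        int *addr_as_int = (int *)addr;
        int old = *addr_as_int, assumed;
        do {
            assumed = old;
            old = atomicCAS(addr_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }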
Atomics
Example: Histogram

    // Determine frequency of colors in a picture.
    // Colors have already been converted into ints.
    // Each thread looks at one pixel and increments
    // a counter atomically.
    __global__ void histogram(int *colors, int *buckets)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int c = colors[i];
        atomicAdd(&buckets[c], 1);
    }
Atomics
Example: Workqueue

    // For algorithms where the amount of work per item
    // is highly non-uniform, it often makes sense
    // to continuously grab work from a queue.
    __global__ void workq(int *work_q, unsigned int *q_counter,
                          int *output, unsigned int queue_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        // atomicInc operates on unsigned counters
        unsigned int q_index = atomicInc(q_counter, queue_max);
        int result = do_work(work_q[q_index]);
        output[i] = result;
    }
Atomics
Atomics are slower than normal loads/stores.
You can have the whole machine queuing on a single location in memory.
Atomics were unavailable on G80!
Atomics
Example: Global Min/Max (Naive)

    // If you require the maximum across all threads
    // in a grid, you could do it with a single
    // global maximum value, but it will be VERY slow.
    __global__ void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);
    }
Atomics
Example: Global Min/Max (Better)

    // Introduce intermediate maximum results, so that
    // most threads do not try to update the global max.
    __global__ void global_max(int *values, int *max,
                               int *regional_maxes,
                               int num_regions)
    {
        // i and val as before
        int region = i % num_regions;
        if (atomicMax(&regional_maxes[region], val) < val)
        {
            atomicMax(max, val);
        }
    }
Atomics
Global Min/Max
- A single value causes a serial bottleneck.
- Create a hierarchy of values for more parallelism.
- Performance will still be slow, so use judiciously.
Performance optimization
Overview
Memory Optimizations
Execution Configuration Optimizations
Examples
Performance optimization - Overview
Optimize algorithms for the GPU:
- maximize independent parallelism
- maximize arithmetic intensity (math/bandwidth)
- sometimes it's better to recompute than to cache: the GPU spends its transistors on ALUs, not memory
- do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to the host
Performance optimization - Overview
Optimize memory access:
- coalesced vs. non-coalesced access to global/local device memory = an order of magnitude difference
- optimize for spatial locality in cached texture memory
- in shared memory, avoid high-degree bank conflicts
Performance optimization - Overview
Take advantage of shared memory:
- hundreds of times faster than global memory
- threads can cooperate via shared memory
- use one / a few threads to load / compute data shared by all threads
- use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing
Performance optimization - Overview
Use parallelism efficiently:
- partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks
- keep resource usage (registers, shared memory) low enough to support multiple active thread blocks per multiprocessor
Memory optimizations
The global, constant and texture spaces are regions of device memory.
Each multiprocessor has:
- a set of 32-bit registers per processor
- on-chip shared memory, where the shared memory space resides
- a read-only constant cache, to speed up access to the constant memory space
- a read-only texture cache, to speed up access to the texture memory space
Memory optimizations
Optimizing host-device data transfers
Coalescing global data accesses
Using shared memory effectively
Memory optimizations
Host-Device Data Transfers
- Host-device memory bandwidth is much lower than device-to-device memory bandwidth: 4 GB/s peak (PCIe x16 Gen 1) vs. 76 GB/s peak (Tesla C870).
- Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory.
- Group transfers: one large transfer is much better than many small ones (see the sketch below).
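A hedged sketch of the grouping advice (buffer names and sizes are illustrative): packing many small chunks into one staging buffer replaces many small copies with a single large one.

    #include <cstring>
    #include <cuda_runtime.h>

    // h_small[k] are n_arrays host chunks of chunk_bytes each.
    void grouped_upload(char **h_small, char *h_staging, char *d_staging,
                        int n_arrays, size_t chunk_bytes)
    {
        // Pack the chunks contiguously on the host ...
        for (int k = 0; k < n_arrays; ++k)
            memcpy(h_staging + k * chunk_bytes, h_small[k], chunk_bytes);
        // ... then issue one large transfer instead of n_arrays small ones.
        cudaMemcpy(d_staging, h_staging, n_arrays * chunk_bytes,
                   cudaMemcpyHostToDevice);
    }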
Memory optimizations
Global and shared memory
- Global memory is not cached on G8x GPUs: high latency, but launching more threads hides the latency; important to minimize accesses and to coalesce global memory accesses.
- Shared memory is on-chip: very high bandwidth, low latency, like a user-managed per-multiprocessor cache; try to minimize or avoid bank conflicts.
Memory optimizations
Texture and Constant Memory
- The texture partition is cached: it uses the texture cache (also used for graphics), optimized for 2D spatial locality; best performance when the threads of a warp read locations that are close together in 2D.
- Constant memory is cached: 4 cycles per address read within a single warp
  - total cost 4 cycles if all threads in a warp read the same address
  - total cost 64 cycles if all threads read different addresses
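A minimal sketch of declaring and filling constant memory (names are illustrative); every thread of a warp reading coeffs[0] hits the 4-cycle broadcast case above:

    __constant__ float coeffs[16];        // lives in cached constant memory

    __global__ void scale(float *data, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            data[i] *= coeffs[0];         // all threads read the same address
    }

    void setup(const float *h_coeffs)
    {
        // constant memory is written from the host via cudaMemcpyToSymbol
        cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
    }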
Memory optimizations
Global Memory Reads/Writes
- global memory is not cached on G8x
- highest-latency instructions: 400-600 clock cycles
- likely to be a performance bottleneck
- optimizations can greatly increase performance
Memory optimizations
Coalescing: a coordinated read by a half-warp (16 threads) from a contiguous region of global memory:
- 64 bytes: each thread reads a word (int, float, ...)
- 128 bytes: each thread reads a double-word (int2, float2, ...)
- 256 bytes: each thread reads a quad-word (int4, float4, ...)
Additional restrictions:
- the starting address for a region must be a multiple of the region size
- the k-th thread in a half-warp must access the k-th element in a block being read
- exception: not all threads need to participate (see the sketch below)
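A hedged sketch contrasting a pattern that satisfies these rules with one that does not (the kernels are illustrative):

    // Coalesced: thread k of each half-warp reads element k of a
    // contiguous, properly aligned block of floats.
    __global__ void read_coalesced(float *out, const float *in)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        out[i] = in[i];
    }

    // Uncoalesced: a strided pattern breaks the
    // "k-th thread reads the k-th element" rule.
    __global__ void read_strided(float *out, const float *in, int stride)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        out[i] = in[i * stride];
    }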
Memory optimizations
Coalesced Access: Reading floats
Memory optimizations
Uncoalesced Access: Reading floats
Memory optimizations
Coalescing: Timing results
Experiment: kernel reads a float, increments it, writes it back; 3M floats (12 MB); times averaged over 10K runs.
12K blocks x 256 threads:
- 356 µs: coalesced
- 357 µs: coalesced, some threads don't participate
- 3,494 µs: permuted/misaligned thread access
Memory optimizations
Shared Memory
- hundreds of times faster than global memory
- cache data to reduce global memory accesses
- threads can cooperate via shared memory
- use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing
Memory optimizations
Example: thread-local variables

    // Motivate per-thread variables with a
    // Ten Nearest Neighbors application.
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                           size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];
        // the per-thread heap goes in off-chip (local) memory
        float2 heap[10];
        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...
        // write out the contents of heap to result
        ...
    }
Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // what are the bandwidth requirements of this kernel?
            int x_i = input[i];              // two loads
            int x_i_minus_one = input[i-1];  // per thread
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // How many times does this kernel load input[i]?
            int x_i = input[i];              // once by thread i
            int x_i_minus_one = input[i-1];  // again by thread i+1
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // Idea: eliminate redundancy by sharing data
            int x_i = input[i];
            int x_i_minus_one = input[i-1];
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;
        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];
        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];
        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();
        ...
    }
Memory optimizations
Example: shared variables

    // optimized version of adjacent difference (continued)
    __global__ void adj_diff(int *result, int *input)
    {
        ...
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if (i > 0)
        {
            // handle the thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }
Memory optimizations
Example: shared variables

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate that a __shared__ array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
Execution Configuration Optimizations
Occupancy
- Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy.
- Occupancy = the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently.
- Occupancy is limited by resource usage: registers, shared memory (see the sketch below).
Execution Configuration Optimizations
Grid/Block Size Heuristics
- # of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute.
- # of blocks / # of multiprocessors > 2: multiple blocks can run concurrently on a multiprocessor, and blocks that aren't waiting at a __syncthreads() keep the hardware busy; subject to resource availability (registers, shared memory).
- # of blocks > 100 to scale to future devices: blocks are executed in pipeline fashion, and 1,000 blocks per grid will scale across multiple generations (see the sketch below).
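A hedged host-side sketch of applying these heuristics by querying the device (the factor of 2 blocks per multiprocessor follows the rule above):

    #include <cuda_runtime.h>

    int pick_num_blocks(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // at least 2 blocks per multiprocessor to keep every SM busy
        return 2 * prop.multiProcessorCount;
    }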
Execution Configuration Optimizations
Register Dependency
- Read-after-write register dependency: an instruction's result can be read ~11 cycles later.
- To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor, i.e. at least 25% occupancy.
- The threads don't have to belong to the same thread block.
Execution Configuration Optimizations
Register Pressure
- Hide latency by using more threads per SM.
- Limiting factors:
  - number of registers per kernel: 8192 per SM, partitioned among concurrent threads
  - amount of shared memory: 16 KB per SM, partitioned among concurrent thread blocks
- Compile with the --ptxas-options=-v flag.
- Use the --maxrregcount=N flag to NVCC (N = desired maximum registers per kernel); at some point spilling into local memory (LMEM) may occur, which reduces performance because LMEM is slow (see the example below).
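For example (the file name and register cap are illustrative):

    nvcc --ptxas-options=-v kernel.cu     (prints each kernel's register and shared memory usage)
    nvcc --maxrregcount=32 kernel.cu      (caps register usage at 32 per thread)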
Execution Configuration Optimizations
Determining resource usage
- Compile the kernel code with the -cubin flag to determine register usage.
- Open the .cubin file with a text editor and look for the code section.
Execution Configuration Optimizations
Optimizing threads per block
- Choose threads per block as a multiple of the warp size: avoid wasting computation on under-populated warps.
- More threads per block == better memory latency hiding.
- But more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used.
Heuristics:
- minimum: 64 threads per block, and only if there are multiple concurrent blocks
- 192 or 256 threads is usually a better choice, and usually still leaves enough registers to compile and invoke successfully
- this all depends on your computation, so experiment!
Execution Configuration Optimizations
Occupancy != Performance
Increasing occupancy does not necessarily increase performance.
BUT low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels: it all comes down to arithmetic intensity and available parallelism.
Execution Configuration Optimizations
Parameterize your application
- Parameterization helps adaptation to different GPUs.
- GPUs vary in many ways: # of multiprocessors, memory bandwidth, shared memory size, register file size, max. threads per block.
- You can even make apps self-tuning: an experiment mode discovers and saves the optimal configuration.
A Common Programming Strategy
Global memory resides in device memory (DRAM): much slower access than shared memory.
Tile data to take advantage of fast shared memory:
- generalize from the adjacent_difference example
- divide and conquer
A Common Programming Strategy
Partition data into subsets that fit into shared memory.
A Common Programming Strategy
Handle each data subset with one thread block.
A Common Programming Strategy
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.
A Common Programming Strategy
Perform the computation on the subset from shared memory.
A Common Programming Strategy
Copy the result from shared memory back to global memory.
A Common Programming Strategy
Carefully partition data according to access patterns:
- read-only → __constant__ memory (fast)
- R/W & shared within block → __shared__ memory (fast)
- R/W within each thread → registers (fast)
- indexed R/W within each thread → local memory (slow)
- R/W inputs/results → cudaMalloc'ed global memory (slow)
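A hedged sketch mapping each row of this list to a declaration (all names are illustrative):

    __constant__ float params[64];      // read-only -> constant memory

    __global__ void kernel(float *in, float *out)  // inputs/results -> global memory
    {
        __shared__ float tile[256];     // R/W, shared within the block
        float acc = 0.0f;               // per-thread scalar -> register
        float scratch[32];              // indexed per-thread array -> local memory
        ...
    }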
Communication Through Memory
Question:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;
        // what is the value of
        // my_shared_variable?
    }
Communication Through Memory
- This is a race condition.
- The result is undefined.
- The order in which threads access the variable is undefined without explicit coordination.
- Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics.
Communication Through Memory
Use __syncthreads to ensure data is ready for access:

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }
Communication Through Memory
Use atomic operations to ensure exclusive access to a variable:

    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
        // after this kernel exits, the value of
        // *result will be the sum of the input
    }
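A hedged host-side sketch of the setup the first comment assumes (names and sizes are illustrative; the kernel indexes only threadIdx.x, so a single block of N threads is launched):

    int *d_input, *d_result;
    cudaMalloc(&d_input, N * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));  // *result starts at 0
    sum<<<1, N>>>(d_input, d_result);      // one block of N threads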
Resource Contention
Atomic operations aren't cheap! They imply serialized access to a variable.

    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }
    ...
    // how many threads will contend
    // for exclusive access to result?
    sum<<<B, N/B>>>(input, result);
Hierarchical Atomics
Divide & Conquer:
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum
Hierarchical Atomics
    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;
        // thread 0 is responsible for
        // initializing partial_sum
        if (threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();
        ...
    }
Hierarchical Atomics
    __global__ void sum(int *input, int *result)
    {
        ...
        // each thread updates the partial sum
        atomicAdd(&partial_sum, input[threadIdx.x]);
        __syncthreads();
        // thread 0 updates the total sum
        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }
Advice
- Use barriers such as __syncthreads to wait until __shared__ data is ready.
- Prefer barriers to atomics when data access patterns are regular or predictable.
- Prefer atomics to barriers when data access patterns are sparse or unpredictable.
- Atomics to __shared__ variables are much faster than atomics to global variables.
- Don't synchronize or serialize unnecessarily.
Matrix Multiplication Example
- Generalize the adjacent_difference example.
- AB = A * B: each element AB[i][j] = dot(row(A,i), col(B,j)).
- Parallelization strategy: one thread per AB[i][j] element; 2D kernel.
First Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        float result = 0;
        // do dot product between row of a and col of b
        for(int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];
        ab[row*width+col] = result;
    }
How will this perform?
- How many loads per term of the dot product? 2 (a & b) = 8 bytes.
- How many floating point operations? 2 (multiply & add).
- Global memory access to flop ratio (GMAC): 8 bytes / 2 ops = 4 B/op.
- Peak fp performance of a GeForce GTX 260? 805 GFLOPS.
- Lower bound on bandwidth required to reach peak fp performance: GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s.
- Actual memory bandwidth of a GeForce GTX 260? 112 GB/s.
- Upper bound on the performance of our implementation: actual BW / GMAC = 112 / 4 = 28 GFLOPS.
Idea: use __shared__ memory to reuse global data
- Each input element is read by `width` threads.
- Load each element into __shared__ memory and have several threads use the local version, to reduce the required memory bandwidth.
Tiled Multiply
- Partition the kernel loop into phases.
- Load a tile of both matrices into __shared__ memory each phase.
- In each phase, each thread computes a partial result.
Better Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x, by = blockIdx.y;
        // allocate tiles in __shared__ memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;
        float result = 0;
Better Implementation (continued)

        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();
            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }
        ab[row*width+col] = result;
    }
Use of Barriers in mat_mul
Two barriers per phase:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
Note that the second __syncthreads in phase p guards the load in phase p+1.
Use barriers to guard data:
- guard against using uninitialized data
- guard against overwriting live data
First Order Size Considerations
- Each thread block should have many threads: TILE_WIDTH = 16 gives 16*16 = 256 threads.
- There should be many thread blocks: 1024x1024 matrices give 64*64 = 4096 thread blocks.
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads: full occupancy.
- Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops.
- Memory bandwidth is no longer the limiting factor (see the launch sketch below).
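A hedged sketch of the 2D launch these numbers assume (pointer names are illustrative; with width = 1024 and TILE_WIDTH = 16 this yields 4096 blocks of 256 threads):

    void launch_mat_mul(float *a, float *b, float *ab, int width)
    {
        dim3 block(TILE_WIDTH, TILE_WIDTH);                 // 16*16 = 256 threads
        dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);  // 64*64 = 4096 blocks
        mat_mul<<<grid, block>>>(a, b, ab, width);
    }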
Optimization Analysis
- Experiment performed on a GT200.
- This optimization was clearly worth the effort.
- Better performance is still possible in theory.

Implementation           Original       Improved
Global loads             2N^3           2N^2 * (N/TILE_WIDTH)
Throughput               10.7 GFLOPS    183.9 GFLOPS
SLOCs                    20             44
Relative improvement     1x             17.2x
Improvement/SLOC         1x             7.8x
TILE_SIZE Effects
Memory Resources as Limit to Parallelism
- Effective use of the different memory resources reduces the number of accesses to global memory.
- These resources are finite! The more memory locations each thread requires, the fewer threads an SM can accommodate.

Resource      Per GT200 SM    For full occupancy on GT200
Registers     16,384          ...
Final Thoughts
- Effective use of the CUDA memory hierarchy decreases bandwidth consumption and increases throughput.
- Use __shared__ memory to eliminate redundant loads from global memory.
- Use __syncthreads barriers to protect __shared__ data.
- Use atomics if access patterns are sparse or unpredictable.
- Optimization comes with a development cost.
- Memory resources ultimately limit parallelism.