
Task 2: Parallel Pseudo-Random Number Generation

Nimalan Nandapalan and Richard Brent

30 March 2011 (revised 2 April 2011)


    Acknowledgements

Thanks to Jiří Jaroš and Alistair Rendell for their assistance with Chapter 3.


Contents

4.9.2 CURAND
4.9.3 Park-Miller on GPU
4.9.4 Parallel LCG
4.9.5 Parallel MRG
4.9.6 Other Parallel Lagged Fibonacci Generators
4.9.7 Additional Notes
4.10 Summary
Bibliography


    Chapter 1

    Introduction and Conclusions

This is a preliminary report on parallel pseudo-random number generation. It was written under tight time constraints, so makes no claim to being an exhaustive survey of the field, which is already extensive, and in a state of flux as new computer architectures are introduced.

Chapter 2 summarises the requirements of a good pseudo-random number generator (PRNG), whether serial or parallel, and describes some popular classes of generators, including LFSR generators, the Mersenne Twister, and xorgens. This chapter also includes comments on statistical testing of PRNGs.

In Chapter 3 we summarise the features and constraints of contemporary multicore computer systems, including GPUs, with particular reference to the constraints that their hardware and APIs impose on parallel implementations of PRNGs.

Some specific implementations of parallel PRNGs are described in Chapter 4. Our conclusion is that the jury is still out on the best parallel PRNG, but quite likely what is best depends on the precise architecture on which it is to be implemented. The same conclusion was reached by Thomas et al [64] in their recent comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for pseudo-random number generation. There is also a definite tradeoff between speed and quality/large-state-space, so it is hard to say what is best in all applications.

We include (in Chapter 4) some specific recommendations for testing of parallel PRNGs, which is a more demanding task than testing a single sequential PRNG.

Finally, we include a bibliography which is by no means exhaustive, but does include both important historical references and a sample of recent references on both serial and parallel pseudo-random number generation.


Chapter 2

Pseudo-Random Number Generators

In this chapter we survey popular and/or historically important pseudo-random number generators (PRNGs) and comment on their speed, quality, and suitability for parallel implementation. Specific comments on some parallel PRNGs are given in Chapter 4.

Applications require random numbers with various distributions (e.g. uniform, normal, exponential, Poisson, ...) but the algorithms used to generate these random numbers, and other random objects such as random permutations, almost invariably require a good uniform random number generator. In this report we consider only the generation of uniformly distributed numbers. For other distributions see, for example, [3, 5, 6, 7, 11, 25, 57, 63]. Usually we are concerned with real numbers u_n which are intended to be uniformly distributed on the interval [0, 1). Often it is convenient to consider integers U_n in some range 0 ≤ U_n < m. In this case we require u_n = U_n/m to be (approximately) uniformly distributed. Typically m is a power of two, e.g. 2^63 or 2^64 on a machine that supports 64-bit integer arithmetic.
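For example, in C the conversion from a 64-bit integer U_n to a double in [0, 1) can be done as follows (a common idiom, not specific to any generator in this report; only the top 53 bits are kept, since an IEEE double has a 53-bit significand):

    #include <stdint.h>

    /* Map a 64-bit integer U uniformly onto [0, 1).  Keeping only the
       top 53 bits matches the 53-bit significand of an IEEE double. */
    static double to_uniform_double(uint64_t U)
    {
        return (double)(U >> 11) * (1.0 / 9007199254740992.0);  /* * 2^-53 */
    }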

Pseudo-random numbers generated in a deterministic manner on a digital computer can not be truly random [65]. We should always say "pseudo-random" to distinguish such numbers from truly random numbers (generated, for example, by tossing dice, counting cosmic rays, or observing quantum fluctuations). However, for brevity we often omit the qualifier "pseudo" when it is clear from the context. What is required of a PRNG is that finite segments of the sequence u_0, u_1, ... behave in a manner indistinguishable from a truly random sequence. In practice, this means that they pass all statistical tests which are relevant to the problem at hand. Since the problems to which a library routine will be applied are not known in advance, random number generators in subroutine libraries should pass a number of stringent


    statistical tests (and not fail any) before being released for general use.

A sequence u_0, u_1, ... depending on a finite state must eventually be periodic, i.e. there is a positive integer ρ such that u_{n+ρ} = u_n for all sufficiently large n. The minimal such ρ is called the period. If the generator has s state bits, then clearly ρ ≤ 2^s. It is often (but not always) true that u_ρ = u_0, i.e. the sequence generated is a pure cycle with no tail.

    2.1 Requirements

Following are the main requirements for a good uniform PRNG and its implementation in a subroutine library:

Uniformity. The sequence of random numbers should pass statistical tests for uniformity of distribution. In one dimension this is easy to achieve. Most generators in common use are provably uniform (apart from discretization due to the finite wordlength) when considered over their full period.

Independence. Subsequences of the full sequence u_0, u_1, ... should be independent. For example, members of the even subsequence u_0, u_2, u_4, ... should be independent of their odd neighbours u_1, u_3, ... This implies that the sequence of pairs (u_{2n}, u_{2n+1}) should be uniformly distributed in the unit square. More generally, random numbers are often used to sample a d-dimensional space, so the sequence of d-tuples (u_{dn}, u_{dn+1}, ..., u_{dn+d-1}) should be uniformly distributed in the d-dimensional cube [0, 1]^d for all reasonable values of d (certainly for all d ≤ 6).

Passing Statistical Tests. As a generalisation of the two requirements above (uniformity and independence), we can ask for our PRNG not to fail statistical tests that should be passed by truly random numbers. Of course, there is an infinite number of conceivable statistical tests; in practice a finite battery of tests is chosen. This is discussed further in §2.3.

Long Period versus Small State Space. A long simulation on a parallel computer might use 10^20 random numbers. In such a case the period ρ must exceed 10^20. For many generators there are strong correlations between u_0, u_1, ... and u_m, u_{m+1}, ..., where m = ρ/2 or (ρ + 1)/2 (and similarly for other simple fractions of the period). Thus, in practice the period ρ should be much larger than the number of random numbers


which will ever be used. A good rule of thumb is that at most ρ^{1/2} random numbers should be used from a generator with period ρ. To be conservative, ρ^{1/2} might be reduced to ρ^{1/3} if we are concerned with passing certain statistical tests such as the birthday-spacings test [39].

On the other hand, the state space required by the PRNG should not be too large, especially if multiple copies are to be implemented on a parallel machine where the memory per processor is small. For period ρ, the state space must be at least log₂ ρ bits. As a tradeoff between long period and small state space, we recommend generators with log₂ ρ in the range 256 to 1024. Generators with smaller ρ will probably fail some statistical tests, and generators with larger ρ will consume too much memory on certain parallel machines (e.g. GPGPUs). Generators with very large ρ may also take a significant time to initialise. In practice, it is convenient to have a family of related generators with different periods ρ, so a suitable selection can be made, depending on the statistical requirements and memory constraints.

Repeatability. For testing and development it is useful to be able to repeat a run with exactly the same sequence of random numbers as was used in an earlier run [23]. This is usually easy if the sequence is restarted from the beginning (u_0). It may not be so easy if the sequence is to be restarted from some other value, say u_m for a large integer m, because this requires saving the state information associated with the random number generator. In some applications it is desirable to incorporate some physical source of randomness (e.g. as a seed for a deterministic PRNG), but this clearly rules out repeatability.

Unpredictability. In cryptographic applications we usually require the sequence of random numbers to be unpredictable, in the sense that it is computationally very difficult, if not impossible, to predict the next number u_n in the sequence, given the preceding numbers u_0, ..., u_{n-1}. This requires the generator to have some hidden state bits; also, u_n must not be a linear function of u_0, ..., u_{n-1}. If we insist on unpredictability, then several important classes of generators are ruled out (e.g. any that are linear over GF(2)). In the following we generally assume that unpredictability is not a requirement. For more on unpredictable generators, see [2].

Portability. For testing and development purposes, it is useful to be able to generate exactly the same sequence of random numbers on two


different machines, possibly with different wordlengths. In practice it will be expensive to simulate a long wordlength on a machine with a short wordlength, but the converse should be easy: a machine with a long wordlength (say w = 64) should be able to simulate a machine with a smaller wordlength (say w = 32) without loss of efficiency.

Disjoint Subsequences. If a simulation is to be run on a machine with several processors, or if a large simulation is to be performed on several independent machines, it is essential to ensure that the sequences of random numbers used by each processor are disjoint. Two methods of subdivision are commonly used [35]. Suppose, for example, that we require 4 disjoint subsequences for a machine with 4 processors. One processor could use the subsequence (u_0, u_4, u_8, ...), another the subsequence (u_1, u_5, u_9, ...), etc. For efficiency each processor should be able to skip over the terms which it does not require. Alternatively (and generally easier to implement), processor j could use the subsequence (u_{m_j}, u_{m_j+1}, ...), where the indices m_0, m_1, m_2, m_3 are sufficiently widely separated that the (finite) subsequences do not overlap. This requires some efficient method of generating u_m for large m without generating all the intermediate values u_1, ..., u_{m-1} (this is called "jumping ahead" below).

Efficiency. It should be possible to implement the method efficiently so that only a few arithmetic operations are required to generate each random number, all vector/parallel capabilities of the machine are used, and overheads such as those for subroutine calls are minimal. This implies that the random number routine should (optionally) return an array of several numbers at a time, not just one, to avoid procedure call overheads.

Proper Initialisation. It is critically important that the PRNG is initialised correctly. A surprising number of problems observed with implementations of PRNGs are due, not to the choice of a bad generator or a faulty implementation of the generator itself, but to inadequate initialisation. For example, Matsumoto et al [49] tested 58 generators in the GNU Scientific Library, and found some kind of initialisation defect in 40 of them.

A common situation on a parallel machine is that different processors will use the same PRNG, initialised with consecutive seeds. It is a requirement that the sequences generated will be independent. (Unfortunately, implementations often fail to meet this requirement; see §4.1.)


    2.2 Spurious Requirements

In the literature one sometimes sees requirements for random number generators that we might class as spurious because, although attractive or convenient, they are not necessary. In this category we mention the following.

Equidistribution. This is a property that, for certain popular classes of generators, can be tested efficiently without generating a complete cycle of random numbers, even though the definition is expressed in terms of a complete cycle. We have argued in [12] that equidistribution is neither necessary nor sufficient for a good pseudo-random number generator. Briefly, the equidistribution test would be failed by any genuine (physical) random number generator; also, we have noted above (under Long Period) that one should not use the whole period provided by a pseudo-random number generator, so a criterion based on behaviour over the whole period is not necessarily relevant.

Ability to Jump Ahead. Under Disjoint Subsequences above we mentioned that when using a pseudo-random number generator on several (say p) processors we need to split the cycle generated into p distinct, non-overlapping segments. This can be done if we know in advance how many random numbers will be used by each processor (or an upper bound on this number, say b) and we have the ability to jump ahead from u_0 to u_b and restart the generator from index b (also 2b, 3b, ..., (p-1)b). Recently Haramoto et al [18, 19] proposed an efficient way of doing this for the important class of F_2-linear generators.¹

If the period of the generator is ρ and we have p processors, then by a birthday paradox argument we can take p randomly chosen seeds to start the generator at p different points in its cycle, and the probability that the segments of length b starting from these points are not disjoint is O(p²b/ρ). For example, if p ≤ 2^20, b ≤ 2^64, and ρ ≥ 2^128, this probability is O(2^{-24}), which is negligible. We need ρ to be at least this large for the reasons stated above under Long Period. Thus, the ability to jump ahead is unnecessary. Since jumping ahead is non-trivial to implement and imposes a significant startup overhead at runtime, it is best

avoided in a practical implementation. That is why we did not include such a feature in our recent random number generator xorgens [9, 10].

¹We note that this is nothing new, since essentially the same idea and an efficient implementation via polynomial instead of matrix operations was proposed in [4] and implemented in RANU4 (1991) in the Fujitsu SSL for the VP series of vector processors. The idea of jumping ahead is called "fast leap-ahead" by Mascagni et al [43], who give examples, but use matrix operations instead of (equivalent but faster) polynomial operations to implement the idea.

    2.3 Statistical Testing

Theoretically, the performance of some PRNGs on certain statistical tests can be predicted, but usually this only applies if the test is performed over a complete period of the PRNG. In practice, statistical testing of PRNGs over realistic subsets of their periods requires empirical methods [31, 32, 40].

For a given statistical test and PRNG to be tested, a test statistic is computed using a finite number of outputs from the PRNG. It is necessary that the distribution of the test statistic for the test when applied to a sequence of uniform, independently distributed random numbers is known, or at least that a sufficiently good approximation can be computed [34]. Typically, a p-value is computed, which gives the probability that the test statistic exceeds the observed value.

The p-value can be thought of as the probability that the test statistic or a larger value would be observed for perfectly uniform and independent input. Thus the p-value itself should be distributed uniformly on (0, 1). If the p-value is extremely small, for example of the order 10^{-10}, then the PRNG definitely fails the test. Similarly if 1 − p is extremely small. If the p-value is neither close to 0 nor 1, then the PRNG is said to pass the test, although this only says that the test failed to detect any problem with the PRNG.

Typically a whole battery of tests is applied, so there are many p-values, not just one. We need to be cautious in interpreting the results of such a battery of tests. For example, if 1000 tests are applied, we should not be surprised to observe at least one p-value smaller than 0.001 or larger than 0.999. A χ²-test might be used to test the hypothesis that the p-values themselves look like a random sample from a uniform distribution.
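As an illustration, a χ²-statistic for the uniformity of a batch of p-values might be computed as in the following C sketch (the function name and the choice of equal-width bins are ours):

    /* Chi-squared statistic for testing whether n p-values in (0,1) look
       uniform, using k equal-width bins (k <= 64).  Compare the result
       against the chi-squared distribution with k-1 degrees of freedom. */
    static double chi2_uniform(const double *pvals, int n, int k)
    {
        int    count[64] = {0};
        double expected = (double)n / k, stat = 0.0;

        for (int i = 0; i < n; i++) {
            int bin = (int)(pvals[i] * k);
            if (bin >= k) bin = k - 1;          /* guard against p == 1.0 */
            count[bin]++;
        }
        for (int j = 0; j < k; j++) {
            double d = count[j] - expected;
            stat += d * d / expected;
        }
        return stat;
    }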

For comments on the statistical testing of parallel PRNGs, and on testing packages such as TestU01, see §4.1.

    2.4 Linear Congruential Generators

Linear congruential generators (LCGs) are PRNGs of the form

U_{n+1} = (a·U_n + c) mod m,

where m > 0 is the modulus, a is the multiplier (0 < a < m), and c is an additive constant. Usually the modulus is chosen to be a power of 2,


say m = 2^w where w is close to the wordlength, or close to a power of 2, say m = 2^w − δ, where a small δ is chosen so that m is prime. LCGs have period at most m, which is too small for demanding applications. Thus, although there is much theory regarding them, and historically they have been used widely, we regard them as obsolete, except possibly when used in combination with another generator.² Some references on LCGs are [4, 17, 25, 38, 56].
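For illustration, one step of an LCG with m = 2^64 can be written in C as follows (the multiplier and additive constant are Knuth's MMIX values, used here purely as an example; the report does not endorse any particular LCG):

    #include <stdint.h>

    /* One step of an LCG with modulus m = 2^64: the "mod m" is implicit
       in 64-bit unsigned overflow.  Constants from Knuth's MMIX LCG. */
    static uint64_t lcg_state = 1;  /* seed */

    static uint64_t lcg_next(void)
    {
        lcg_state = 6364136223846793005ULL * lcg_state
                  + 1442695040888963407ULL;
        return lcg_state;
    }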

The add-with-carry and multiply-with-carry generators of Marsaglia and Zaman [42] can be regarded as a way of implementing a multiple-precision modulus in a linear congruential generator. However, such generators tend to be slower than LFSR generators, and may suffer from statistical problems due to the special form of multiplier.

    2.5 LFSR Generators and Generalizations

Menezes et al [50, §6.2.1] give the following definition of a linear feedback shift register:

A linear feedback shift register (LFSR) of length L consists of L stages (or delay elements) numbered 0, 1, ..., L−1, each capable of storing one bit and having one input and one output; and a clock which controls the movement of data. During each unit of time the following operations are performed:

1. the content of stage 0 is output and forms part of the output sequence;

2. the content of stage i is moved to stage i−1 for each i, 1 ≤ i ≤ L−1; and

3. the new content of stage L−1 is the feedback bit s_j which is calculated by adding together modulo 2 the previous contents of a fixed subset of stages 0, 1, ..., L−1.

The definition is illustrated in Figure 2.1. The c_1, c_2, ..., c_L in Figure 2.1 are either zeros or ones. These c_i values are used in combination with the logical AND gates (semi-circles). When c_i is 1, the output of the gate is the output of the i-th stage. When c_i is 0, the output of the gate is 0. This means the feedback bit s_j is the sum modulo 2 of the outputs of the stages where c_i is 1.

²For example, the Weyl generator, sometimes used in combination with LFSR generators to hide their defect of linearity over GF(2), e.g. in xorgens [9, 10], is a special case of an LCG with multiplier 1.


Figure 2.1: Diagram of an LFSR: an illustration of the definition of an LFSR. The c_i allow the selection of the fixed subset of stages that are added. Source: Menezes et al [50].
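Rendered directly in software, one step of this bit-level definition looks like the following C sketch. The taps chosen here correspond to the primitive polynomial x^16 + x^14 + x^13 + x^11 + 1, a standard textbook example rather than one taken from this report:

    #include <stdint.h>

    /* One clock of a length-16 LFSR, following the definition above:
       stage 0 (the low bit) is output, all stages shift down one place,
       and the new stage L-1 is the mod-2 sum of a fixed subset of the
       old stages.  Period is 2^16 - 1 for any nonzero initial state. */
    static unsigned lfsr_step(uint16_t *state)
    {
        uint16_t s   = *state;
        unsigned out = s & 1u;                                 /* stage 0 */
        uint16_t fb  = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
        *state = (uint16_t)((s >> 1) | (fb << 15));   /* shift, feed back */
        return out;
    }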

The definition above is convenient for a hardware implementation. For a software implementation, it is convenient to extend it so the data at each stage is a vector of bits rather than a single bit. For example, this vector might consist of the 32 or 64 bits in a computer word.

More generally still, L'Ecuyer and Panneton [30, 55] consider a general framework for representing linear generators over GF(2). Consider the matrix linear recurrence:

x_i = A x_{i−1},
y_i = B x_i,
u_i = Σ_{ℓ=1}^{w} y_{i,ℓ−1} 2^{−ℓ},

where x_i is a k-bit state vector, y_i is a w-bit output vector, u_i ∈ [0, 1) is the output at step i, A is a k×k transition matrix, and B is a w×k output transformation matrix.

By making appropriate choices of the matrices A and B, several well-known types of generators can be obtained as special cases, including the Tausworthe, linear feedback shift register (LFSR), generalized (linear) feedback shift register (GFSR), twisted GFSR, Mersenne Twister, WELL, xorshift, etc.

These generators have the advantage that they can be implemented efficiently using full-word logical operations (since the full-word "exclusive or" operation corresponds to addition of vectors of size w over GF(2) if w is the computer wordlength). They also have the property that the output is a linear function of the initial state x_0 (by "linear" here we mean linear over GF(2)). To avoid linearity, we can combine these generators with a cheap generator of another class, e.g. a Weyl generator, or


combine two successive outputs using a nonlinear function (e.g. addition mod 2^w − 1, which is nonlinear over GF(2)).

In practice, the matrix A is usually sparse, an extreme case being generators based on primitive trinomials over GF(2), and the matrix B is often the identity I (in which case k = w). The transformation from x_i to y_i when B ≠ I is often called tempering.

    2.6 The Mersenne Twister

The Mersenne Twister is a PRNG and member of the class of LFSR generators. This generator was introduced by Matsumoto and Nishimura [46] (although the algorithm has been improved several times, so there are now several variants). This particular PRNG has attracted a lot of attention and has become quite popular, although it has also attracted some criticism [36, 37]. It is the default generator used by a number of different software packages including Goose, SPSS, EViews, GNU, R and VSL [32].

If M_p = 2^p − 1 is a (Mersenne) prime and p ≡ ±1 (mod 8), then it is usually possible to find a primitive trinomial of degree p. For examples, see [13]. Any such primitive trinomial can be used to give a PRNG with period M_p. For example, RANU4 (1991) and the closely related ranut (2001) implemented a choice of 25 different primitive trinomials, with degrees ranging from 127 to 3021377, depending on how much memory the user wished to allocate. The Mersenne Twister uses degree 19937. This gives a very long period, 2^19937 − 1, but a correspondingly large state space (at least 19937 bits). The exponent 19937 might be reduced to say 607 for a GPU implementation.

We omit details of the Mersenne Twister, except to note that the first version did not include a tempering step, but this was introduced in later versions to ameliorate some problems shared by all generators based on primitive trinomials [45] (or, more generally, primitive polynomials with small Hamming weight, see [55, 7]).

Although generally well accepted, the Mersenne Twister has received some criticism. For example, Marsaglia [36, 37] and others have observed that it is not a particularly elegant algorithm, there is considerable complexity in its implementation, and its state space is rather large.

The Mersenne Twister (version with tempering) generally performs well on statistical tests, but it has been observed to fail two of the tests in the TestU01 BigCrush benchmark. Specifically, it can fail linear dependency tests, because of its linearity over GF(2). If this is regarded as a problem, it can be avoided by combining the Mersenne Twister with a generator that is nonlinear over GF(2). Similar remarks applied to the first version


of the xorgens generator; the current version combines an LFSR generator with a Weyl generator to avoid linearity, and passes all tests in the TestU01 BigCrush benchmark.

    2.7 Combined Tausworthe Generators

Tausworthe PRNGs based on primitive trinomials can have a long period and fast implementation, but have bad statistical properties unless certain precautions are taken. Roughly speaking, this is because each output depends on only two previous outputs (the very fact that makes the implementation efficient). Tezuka and L'Ecuyer showed how to combine several "bad" generators to get a generator with good statistical properties.

Of course, the combined generator is slower than each of its component generators, but still quite fast. For further details see L'Ecuyer [29], Tezuka and L'Ecuyer [62].

    2.8 xorshift Generators

The class of xorshift (exclusive OR-shift) generators was first proposed by Marsaglia [41]. These generators are relatively simple in design and implementation when compared to generators like the Mersenne Twister (§2.6). This is because they are designed to be easy to implement with full-word logical operations, whereas the Mersenne Twister uses a trinomial of degree 19937, which is not a convenient multiple of the wordlength (usually 32 or 64).

An xorshift PRNG produces its sequence by repeatedly executing a series of simple operations: the exclusive OR (XOR) of a word, and the left or right shift of a word. In the C programming language these are basic operations (^, <<, >>) which map well onto the instruction set of most computers.
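For example, one of the full-period 64-bit generators from Marsaglia's paper uses the shift triple (13, 7, 17); in C:

    #include <stdint.h>

    /* Marsaglia's xorshift64 with triple (13, 7, 17): three XOR-shift
       steps per output, period 2^64 - 1 (the state must be nonzero). */
    static uint64_t xorshift64(uint64_t *state)
    {
        uint64_t x = *state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
    }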

The xorshift class of generators is a sub-class of the more general and better known LFSR class of PRNG generators [8, 54]. Xorshift class generators inherit the theoretical properties of LFSR class generators, both good and bad. From the viewpoint of software implementations, xorshift generators have a number of advantages over equivalent LFSR generators. The period of the xorshift generators is usually chosen to be 2^n − 1, where n is some multiple of the word length, see [41, 8].

This is in contrast with generators like the Mersenne Twister which require n to be a Mersenne exponent. In the case of the most common


implementation of the Mersenne Twister, n = 19937.

Unlike the Mersenne Twister, xorshift generators do not require a tempering step, because each output bit depends on more than two previous output bits. Because of this, xorshift generators have the potential to execute faster than other LFSR implementations, as they require fewer CPU operations and are more efficient in their memory operations per random number.

    2.9 xorgens

Marsaglia's original paper [41] only gave xorshift generators with periods up to 2^192 − 1. xorgens [9] is a recently proposed family of PRNGs that generalise the idea and have period 2^n − 1, where n can be chosen to be any convenient power of two up to 4096. The xorgens generator has been released as a free software package, in a C language implementation (most recently xorgens version 3.05 [10]).

Compared to previous xorshift generators, the xorgens family has several advantages:

A family of generators with different periods and corresponding memory requirements, instead of just one.

Parameters are chosen optimally, subject to certain criteria designed to give the best quality output.

The defect of linearity over GF(2) is overcome efficiently by combining the output with that of a Weyl generator.

Attention has been paid to the initialisation code (see comments in §2.1 on proper initialisation), so the generators are suitable for use in a parallel environment.

For details of the design and implementation of the xorgens family, we refer to [9, 10]. Here we just comment on the combination with a Weyl generator. This step is performed to avoid the problem of linearity over GF(2) that is common to all LFSR generators. A Weyl generator has the following simple form:

w_k = (w_{k−1} + ω) mod 2^w,

where ω is some odd constant (a recommended choice is an odd integer close to 2^{w−1}(√5 − 1)). The final output of an xorgens generator is given by:

(w_k(I + R^γ) + x_k) mod 2^w,   (2.1)


where x_k is the output before addition of the Weyl generator, γ is some integer constant close to w/2, and R is the right-shift operator. The inclusion of the term R^γ ensures that the least-significant bits have high linear complexity (if we omitted this term, the Weyl generator would do little to improve the quality of the least-significant bit, since (w_k mod 2) is periodic with period 2).

As addition mod 2^w is a non-linear operation over GF(2), the result is a mixture of operations from two different algebraic structures, allowing the sequence produced by this generator to pass all of the empirical tests in BigCrush, including those failed by the Mersenne Twister. A bonus is that the period is increased by a factor 2^w (though this is not free, since the state space size is increased by w bits).
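A minimal C sketch of the combination step of equation (2.1), assuming w = 64 and γ = 32 (the Weyl constant below is an odd integer close to 2^63(√5 − 1); this is an illustration only, not the released xorgens code):

    #include <stdint.h>

    #define OMEGA 0x9E3779B97F4A7C15ULL  /* odd, close to 2^63*(sqrt(5)-1) */
    #define GAMMA 32                     /* close to w/2 */

    static uint64_t w_k = 0;             /* Weyl state */

    /* Combine the F_2-linear output x_k with the Weyl sequence:
       w_k = w_{k-1} + omega mod 2^64, output (w_k(I + R^gamma) + x_k)
       mod 2^64.  Over GF(2), multiplying by (I + R^gamma) is an XOR
       with a right shift; the final addition is nonlinear over GF(2). */
    static uint64_t combine(uint64_t x_k)
    {
        w_k += OMEGA;                    /* mod 2^64 by unsigned overflow */
        return x_k + (w_k ^ (w_k >> GAMMA));
    }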

    2.9.1 Cryptographic applications

xorgens is not recommended for cryptographic applications, since an opponent who observed an initial segment of output might be able to subtract the Weyl generator component and then use linearity over GF(2) to predict the subsequent output. At the cost of a slowdown by a factor of about two, we can combine two successive outputs of xorgens using a suitable nonlinear operation, for example addition mod (2^w − 1). This generator [implemented but not yet released] is believed to be cryptographically secure if n ≥ 256, and it is certainly faster than most other cryptographically secure generators (e.g. those obtained by feeding the output of an insecure generator into a secure hash function, see for example [47, 48]).


Chapter 3

Contemporary Multicore Systems

Contemporary multicore systems can be loosely categorized into systems that have evolved based on replicating on a single die what was previously considered to be a single general-purpose CPU, and those that have evolved based on the concept of having multiple units executing the same instruction on different data, in a single instruction multiple data (SIMD) programming model. The former began to emerge in the x86 systems offered by AMD and Intel around 2005 through chips such as the model 865 Opteron and Pentium D respectively. The latter is most closely associated with modern Graphics Processing Units (GPUs), which arguably began to transition from dedicated graphics engines into more general purpose computing systems around 2006 with the release of the NVIDIA GTX8800 system.

Figure 3.1: Structural comparison between GPUs and CPUs [51].

The different philosophies behind the CPU and GPU architectures are further illustrated in Figure 3.1. This shows that the GPU architecture reserves a significantly higher proportion of its transistors for data processing


compared to the CPU [58]. That is, the GPU has evolved into a massively parallel, multi-threaded and multi-core processor, while CPUs have evolved to have a small number of cores where each is a highly efficient serial, single-threaded processor.

Existing work on the efficient implementation of random number generators on contemporary processors has tended to focus on either CPU or GPU type systems. In reality, however, the boundary between these two paradigms is somewhat blurred, with conventional x86 processors being capable of performing some SIMD-like instructions while GPUs contain multiple SIMD execution units each capable of executing a different instruction stream.

In the future the boundary between CPU and GPU will blur even further. Thus in this chapter, rather than review existing CPU and GPU systems, we instead focus on three emerging multicore systems from Intel, AMD and NVIDIA that incorporate concepts from both the CPU and GPU paradigms; namely the Intel Sandy Bridge, AMD Fusion and NVIDIA Tegra systems. We then briefly consider OpenCL as a portable programming language for multicore systems.

    3.1 Intel Sandy Bridge

The Sandy Bridge architecture [22] represents a new processor architecture proposed by Intel this year. Sandy Bridge is a 32nm CPU with an on-die GPU. While the Clarkdale/Arrandale architectures have a 45nm GPU on package, Sandy Bridge moves the transistors on die, enabling the GPU to share the L3 cache with the CPU.

There are many improvements in the Sandy Bridge architecture, such as a redesigned CPU front-end, new power-saving and turbo modes, and an improved memory controller. However, from the high-performance computation point of view the most important are the new Advanced Vector Extensions (AVX) instructions and a more powerful GPU integration.

The AVX [21] instructions support 256-bit operands, which is twice as much as the standard SSE registers. The extension is done at minimal die expense. This minimizes the impact of AVX on the execution die area while enabling twice the floating-point operation throughput. Intel AVX improves performance due to wider vectors, new extensible syntax, and rich functionality. This results in better management of data in general-purpose applications like image and audio/video processing, scientific simulations, financial analytics, and 3D modelling and analysis.

The largest performance improvement on Sandy Bridge vs. the current Westmere architecture actually has nothing to do with the CPU; it is all graphics.


While the CPU cores show a 10-30% improvement in performance, Sandy Bridge graphics performance is easily double. The GPU is treated like an equal citizen in the Sandy Bridge world; it gets equal access to the L3 cache. The graphics driver controls what gets into the L3 cache, and one can even limit how much cache the GPU is able to use. Storing graphics data in the cache is particularly important, as it saves trips to main memory, which are costly from both a performance and a power standpoint. Redesigning a GPU to make use of a cache is not a simple task. It usually requires the sort of complete re-design that NVIDIA did with GF100.

Sandy Bridge graphics is the anti-Larrabee. While Larrabee focused on extensive use of fully programmable hardware (with the exception of the texture hardware), Sandy Bridge graphics (internally referred to as Gen 6 graphics) makes extensive use of fixed-function hardware. The design mentality was that anything that could be described by a fixed function should be implemented in fixed-function hardware. The benefit is performance/power/die-area efficiency, at the expense of flexibility. Keeping much of the GPU fixed-function conforms to Intel's CPU-centric view of the world. By contrast, making the GPU as programmable as possible makes more sense for a GPU-focused company like NVIDIA.

The programmable shader hardware is composed of shaders/cores/execution units that Intel calls Execution Units (EUs). Each EU can dual-issue, picking instructions from multiple threads. The internal ISA maps one-to-one with most DirectX 10 API instructions, resulting in a very CISC-like architecture. Moving to one-to-one API-to-instruction mapping increases IPC by effectively increasing the width of the EUs.

There are other improvements within the EU. Transcendental functions are handled by hardware in the EU, and their performance has been sped up considerably. Intel informed us that sine and cosine operations are several orders of magnitude faster now than in current HD Graphics.

    3.2 AMD Fusion

The Advanced Micro Devices (AMD) Fusion Family [14, 16] has a design philosophy that represents another example of the blurring between CPUs and GPUs. These processors are classed by AMD as Accelerated Processing Units (APUs). The AMD Fusion processor is currently available in a 40nm microarchitecture, with a 32nm version in development. The philosophy taken by AMD makes the Fusion processors similar to Sandy Bridge in that they both combine an x86-based processor with a vector processor,


    similar to those on GPUs, on the same die.

Other improvements that the Fusion design offers include direct support for advanced memory controllers, input and output controllers, video decoders, display outputs, and bus interfaces, and other features such as SSE. The main advantage, however, is the inclusion and direct support of both scalar and vector hardware as high-level programmable processors. This allows high-level industry-standard tools such as DirectCompute and OpenCL to be fully utilised on this device. This level of support for high-level tools such as OpenCL means that code designed on other OpenCL-capable devices is readily portable to these architectures, and should offer a notable performance increase because of the integrated vector processing elements and redesigned northbridge.

As Fusion is a high-abstraction design philosophy for processors, many of the specific details of actual processors are not readily available. The few Fusion-based processors that have been released are targeted at small form factor desktop machines and portable devices (Bobcat-based). Consequently, these processors are made of lightweight serial and vector cores. The development plan for Fusion includes the provision for larger, more complex, more powerful desktop-based processors with multiple serial cores (Bulldozer-based).

    3.3 NVIDIA Tegra

NVIDIA are one of the largest producers of GPU processors and devices and until recently developed exclusively for this platform. The Tegra platform [53] is a recently announced architecture from NVIDIA that represents a shift from their GPU focus. The Tegra is designed particularly with portable devices in mind.

Like the Intel Sandy Bridge and AMD Fusion based processors, the Tegra gains performance by incorporating serial general-purpose CPU-based processors with vectorised GPU-based processors on the same chip. The difference with the Tegra-based processors is that they also include the northbridge, southbridge and complete memory controller on the same die. This approach is otherwise known as a System on a Chip (SoC), and is notably different from the APUs and Sandy Bridge architecture discussed earlier.

The serial CPU-based processor in the Tegra system uses an ARM-based processor. NVIDIA have announced that a quad-core system will be available. Like the Fusion systems, all Tegra systems will be fully programmable.


    Chapter 4

    Parallel PRNG

In this chapter we review some of the recent literature relating to the implementation of parallel PRNGs on GPUs and multi-core CPUs, and include a discussion of some of the general techniques used. First we discuss statistical testing of parallel PRNGs. Then we mention some approaches for generating random numbers on parallel architectures, as suggested by Aluru [1]. These are the contiguous subsequence, leapfrog, and independent subsequence techniques. We then present several current examples of parallel PRNGs and comment upon their design and performance.

    4.1 Statistical Testing of Parallel PRNGs

We discussed the statistical testing of PRNGs briefly in §2.3. Here we continue that discussion with an emphasis on the testing of parallel PRNGs.

Often statistical tests are applied to the sequence generated by a PRNG with one initial seed. This is inadequate because it does not test the Proper Initialisation requirement stated in §2.1. Suppose a PRNG generates a sequence (x_{s,j})_{j≥0} when initialised with seed s. We can think of

X = (x_{s,j})_{s≥0, j≥0}

as a two-dimensional semi-infinite array. Typical testing only tests the rows of X, i.e. the tests are applied with s fixed. We should also test columns, i.e. keep j fixed and vary s. The case j = 0, s varying is important: in this case the PRNG is being used as a pseudo-random function of the seed s; specifically

f(s) = x_{s,0}

should behave like a pseudo-random function.


Most common generators will fail this test unless it is considered as a possibility when the code for initialisation is implemented.¹ Similarly, we should test sequences obtained by interleaving rows of X, since such sequences might be generated on a parallel machine. For example, we could test the sequence

x_{0,0}, x_{1,0}, x_{0,1}, x_{1,1}, x_{0,2}, x_{1,2}, ...

obtained by interleaving rows 0 and 1, which could correspond to the sequence generated on a machine with two processors. Judging from a lack of mention in the literature, we surmise that relatively little such testing of parallel PRNGs has yet been done.
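For example, one can wrap two independently seeded copies of the generator under test so that successive calls alternate between them, and feed the merged stream to a serial test battery such as TestU01. A minimal C sketch (the stand-in generator and all names here are ours, not part of TestU01):

    #include <stdint.h>

    /* Stand-in for the serial PRNG under test (a trivial LCG step);
       replace with the real generator. */
    static uint32_t next_output(uint64_t *state)
    {
        *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
        return (uint32_t)(*state >> 32);
    }

    /* Interleave rows s = 0 and s = 1 of X = (x_{s,j}):
       x_{0,0}, x_{1,0}, x_{0,1}, x_{1,1}, ... */
    static uint64_t state[2] = {1, 2};   /* two distinct seeds */
    static int      turn     = 0;

    static uint32_t interleaved_next(void)
    {
        uint32_t x = next_output(&state[turn]);
        turn ^= 1;                       /* alternate between the rows */
        return x;
    }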

In some applications a relatively small number of random numbers is used, then the PRNG is re-initialised with a different seed. For example, a PRNG might be used to shuffle a pack of cards, with a different seed for each shuffle. Thus we should test the sequence

x_{0,0}, ..., x_{0,ℓ}, x_{1,0}, ..., x_{1,ℓ}, ..., x_{k,0}, ..., x_{k,ℓ}

for various choices of parameters (k, ℓ).

TestU01: A Software Library for the Testing of PRNGs

The TestU01 software library was created by L'Ecuyer and Simard to assist in the empirical statistical testing of uniform PRNGs [32].

The number of possible statistical tests is clearly infinite [31]. It is also clear that no single PRNG can pass all conceivable tests. However, statistical tests are still extremely useful, as a bad PRNG will fail simple tests, while a good PRNG will only fail overly complicated tests (perhaps designed with inside knowledge of the algorithm used in the PRNG), or tests that would take an impractically long time to run.

Perhaps the first de facto standard set of tests for PRNGs was suggested by Knuth in the first edition of The Art of Computer Programming, Vol. 2. Since then, many authors have contributed other tests for detecting regularities within the sequences generated by PRNGs [34]. The TestU01 library includes many of these and represents a good framework for the testing of PRNGs in a general environment.

Apart from TestU01, the main public-domain testing package for general PRNGs is DIEHARD [40]. However, the DIEHARD package suffers from a

number of limitations. Firstly, the suite of tests run, and the parameters they run under, are fixed inside the code of the package. Furthermore, the sample sizes for the tests are relatively small, and so the package is not as stringent as TestU01. The interface for DIEHARD also requires the output of the PRNG to be provided in a formatted file of 32-bit blocks, which is slower and quite restrictive when compared with the interface of TestU01. The tests within DIEHARD are available in TestU01 in a much less restrictive form, as the parameters and tests used can be changed at runtime. Thus, we can regard DIEHARD as obsolete.

¹For example, Pedro Gimenes and Richard Brent [unpublished] found such a defect in a PRNG recommended in the (draft) third edition of Knuth's The Art of Computer Programming, Volume 2 [25]. Fortunately, the defect was found and corrected just before the book was printed.

The TestU01 library includes three benchmarks with increasing levels of stringency for the purpose of testing PRNGs [31]. These are the SmallCrush, Crush, and BigCrush benchmarks, which require from a few seconds to many hours on present desktop personal computers. The BigCrush suite requires approximately 2^38 random samples and has 106 tests producing 160 test statistics and corresponding p-values.

Note that TestU01 is designed to test a single, serial PRNG. It is not specifically designed to test a parallel PRNG. Thus, in order to test various combinations of rows, columns, merged rows etc. of the array X = (x_{s,j})_{s≥0, j≥0} above, some coding effort by the user is required in order to generate the proper input for TestU01.

    4.2 The Contiguous Subsequences Technique

Now we mention some approaches for generating random numbers on parallel architectures, as suggested by Aluru [1] and others, starting with the contiguous subsequences technique.

In this approach each processor is responsible for producing and processing a contiguous subsequence of length b of the original sequence. Thus, if the original sequence is (x_j)_{j≥0}, then processor k is responsible for producing x_{kb}, ..., x_{kb+b-1}. In particular, processor k starts at x_{kb}.

There are a number of implications. Clearly b should be chosen sufficiently large, so that the subsequences of pseudo-random numbers required by each processor do not exceed b (if this happened, then the subsequences used on each processor would overlap, with disastrous consequences for their statistical independence). If it is not known in advance how many random numbers will be required, the only option is to set b very large. Producing the initial state for each processor to generate the correct subsequence is a nontrivial task, and is only feasible for certain classes of PRNGs (including, however, the common classes LCG and LFSR).


    4.3 The Leapfrog Technique

The Leapfrog technique is another approach to allocating subsequences between parallel processors. The motivation for Leapfrog is to preserve the statistical qualities of the underlying PRNG. If there are p processors, then processor i, 0 ≤ i < p, is responsible for generating x_i, x_{i+p}, x_{i+2p}, ... This implies that we need an efficient way of generating every p-th member of the original PRNG sequence. In general, this is difficult or impossible without generating all the intermediate values, but in certain special cases it is feasible.

For example, consider a linear congruential generator (LCG) of the form

x_j = (a·x_{j-1} + c) mod m.

Then it is easy to see that the subsequence (x_{i+jp})_{j≥0} has the same form:

x_{i+jp} = (A·x_{i+(j-1)p} + B) mod m,

where A = a^p mod m and B = c(a^p − 1)/(a − 1) mod m.
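Both constants can be computed in O(log p) time by binary powering, since (a^p − 1)/(a − 1) is the geometric sum 1 + a + ... + a^{p-1}. A C sketch for m = 2^64 (the function name is ours):

    #include <stdint.h>

    /* Leapfrog constants for x_j = (a*x_{j-1} + c) mod 2^64:
       A = a^p and B = c*(1 + a + ... + a^(p-1)), both mod 2^64, via a
       binary-powering recurrence on the pair (a^k, S_k) where
       S_k = 1 + a + ... + a^(k-1).  "mod 2^64" is implicit in overflow. */
    static void leapfrog_constants(uint64_t a, uint64_t c, uint64_t p,
                                   uint64_t *A, uint64_t *B)
    {
        uint64_t ak = a, Sk = 1;   /* a^k and S_k, k = 1 */
        uint64_t Ar = 1, Sr = 0;   /* a^r and S_r for accumulated r = 0 */

        while (p > 0) {
            if (p & 1) {           /* extend r by k: S_{r+k} = S_r + a^r*S_k */
                Sr = Sr + Ar * Sk;
                Ar = Ar * ak;
            }
            Sk = Sk * (1 + ak);    /* S_{2k} = S_k*(1 + a^k) */
            ak = ak * ak;
            p >>= 1;
        }
        *A = Ar;                   /* a^p mod 2^64 */
        *B = c * Sr;               /* c*S_p mod 2^64 */
    }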

Unfortunately, we saw in §2.4 that LCGs are not recommended for use as PRNGs, because of their short periods and poor statistical qualities. They are satisfactory in some undemanding applications, but unsuitable for inclusion in a library of (serial or parallel) PRNGs intended for demanding applications.

Provided p is a power of two, the Leapfrog technique is applicable to parallel lagged Fibonacci generators, for which see §4.7. However, the statistical quality of these PRNGs is also dubious [55, 7], since they are based on 3-term recurrences (see the discussion in §2.6).

    4.4 Independent Subsequences

The final technique we consider as a general approach to parallel PRNG is the independent subsequence technique [15]. This is a conceptually simpler approach to distributing random number generation on parallel architectures. The Leapfrog and Contiguous Subsequences techniques are limited in that they are dependent on the efficient calculation of certain elements of the sequence, which is only possible in special cases.

We already outlined the independent subsequence technique when discussing the ability to jump ahead in §2.2.

The independent subsequence technique is similar to the contiguous subsequence technique in that each processor produces a contiguous subsequence


of the original sequence. The difference is that these subsequences start at pseudo-random points in the full period of the underlying PRNG. We simply choose distinct pseudo-random seeds, one per processor, and each processor generates its sequence starting with its particular seed. In practice the seeds might be chosen to be some simple function of the processor number, but this is hardly a random choice, so care has to be taken with the initialisation of a PRNG to be used in this way; see the discussion of proper initialisation in §2.1.

In §2.2 we showed that, if there are p processors and each produces b pseudo-random numbers, then the period ρ of the underlying PRNG should be much larger than p²b in order to ensure that the chance of different processors producing overlapping sequences is negligible. In practice this means ρ ≥ 2^128 (approximately), which is not a severe constraint, since we want a period at least as large as 2^128 for other reasons discussed in §2.1.

    4.5 Combined and Hybrid Generators

In this section we present two closely related techniques that can be used to aid the creation of PRNGs suitable for parallel architectures. These are combined generators and hybrid generators. The goal of these techniques is to amalgamate the output of two different generators to improve the properties of the sequence of random numbers produced. This possibility has been known, both empirically and theoretically, for some time [15]. Hybrid generators differ from combined generators in that the underlying algorithms behind the PRNGs in the combination are different.² An example of a combined generator is the Combined Tausworthe Generator (cf. §2.7) of L'Ecuyer [29].

There are several ways that the output of two generators can be combined. Some comments on this have been made by Marsaglia in [39], by L'Ecuyer [28] in the context of LCGs, and later in the context of Tausworthe generators [29]. Some simple ways to combine outputs, say x and y, are to take the exclusive or x ⊕ y, the sum mod 2^w (where w is the wordlength and x, y are regarded as unsigned integers), and the sum mod 2^w − 1 (which has the advantage that the least-significant bit is modified by a carry from the most-significant bit position).
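In C, with w = 32, these three combining operations might be written as follows (function names are ours; the end-around-carry folds implement the sum mod 2^32 − 1):

    #include <stdint.h>

    static uint32_t combine_xor(uint32_t x, uint32_t y)
    {
        return x ^ y;                    /* exclusive or */
    }

    static uint32_t combine_sum(uint32_t x, uint32_t y)
    {
        return x + y;                    /* sum mod 2^32: overflow wraps */
    }

    static uint32_t combine_sum_ones(uint32_t x, uint32_t y)
    {
        /* Sum mod 2^32 - 1 via end-around carry: the carry out of the
           top bit is added back at the bottom, so the least-significant
           bit is affected by the most-significant bit position. */
        uint64_t s = (uint64_t)x + y;
        s = (s & 0xFFFFFFFFu) + (s >> 32);
        s = (s & 0xFFFFFFFFu) + (s >> 32);   /* second fold, edge case */
        return (uint32_t)s;
    }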

In essence, the combination of two generators can be described as follows. If one generator produces the sequence x_1, x_2, x_3, ... and another generator produces the sequence y_1, y_2, y_3, ... in some finite set which has some binary

operator ◦ that forms a Latin square, then the combination generator defined by producing the sequence x_1 ◦ y_1, x_2 ◦ y_2, x_3 ◦ y_3, ... should be better than, or at least no worse than, each of the original generators. Marsaglia [39] showed that, if a suitable operator ◦ is used to create a random variable z = x ◦ y, then the distribution of z should be no less uniform than that of x or y. A useful side-effect of combining generators is an increase in the period: the period of the combined generator is the least common multiple of the periods of the component generators.

²The terminology used in the literature is inconsistent; hybrid generators are often called combined generators.

An example which uses both techniques is the Hybrid Combined Tausworthe Generator. This generator was developed in the context of GPU implementations and Monte Carlo simulations and is presented by Howes et al [20]. The Hybrid Combined Tausworthe Generator is a combination of two earlier generators: taus88 (L'Ecuyer), a three-component combined Tausworthe generator, and "Quick and Dirty" (Press et al), a 32-bit LCG. The defects of each generator make them unsuitable individually as PRNGs. However, in combination the properties of one generator mask the statistical defects of the other, making this combined generator satisfactory (according to the authors).

This generator has a combined period of 2^121. Because of the simplicity of the Tausworthe generator and LCG designs, the Hybrid Combined Tausworthe Generator only requires a state space of four 32-bit words, for the 32-bit generator.

Howes et al compare their implementation of the Hybrid Combined Tausworthe Generator in CUDA with another generator that they implemented in CUDA on a G80 GPU. The second generator is a Wallace Gaussian Generator, which returns random numbers that are normally distributed. A Box-Muller transformation [25] was used to convert the output of the Hybrid Combined Tausworthe generator to the normal distribution. The Wallace generator is slightly faster at 5.2×10^9 RN/s than the Hybrid Combined Tausworthe Generator plus Box-Muller transformation at 4.3×10^9 RN/s, but both are fast: one hundred times faster than generating random numbers on the CPU.

There are a number of other notable hybrid/combined generators. The xorgens generator (§2.9) and its GPU implementation, xorgensGP (§4.8), are examples of hybrid generators showing improvements over simpler versions such as plain xorshift generators (§2.8). The xorgens class generators combine xorshift generators with a Weyl generator. Another example is the KISS family of PRNGs. One instance of this class of generator, KISS99 [32], combines three different types of generators: an LCG, three different LFSR generators, and two multiply-with-carry generators. Perhaps a total of six component generators is overkill!

The techniques for hybrid and combined generators have several


implications which make them suitable for use in parallel PRNGs. Recalling the contiguous subsequence technique, the increased period of the combined generators allows b, the size of the subsequences, to be set comfortably large. Although not implemented in the Hybrid Combined Tausworthe Generator, near maximally equidistributed generators can be found dynamically for a specified number of combinations [29]. This means that a large number of distinct combined generators can be created from a set of generators suitable for this technique, with an obvious application to parallel generation of PRNGs. In the extreme case, we could use a different PRNG on each processor.

    4.6 Mersenne Twister for Graphic Processors

The Mersenne Twister for Graphic Processors (MTGP) is a recently released variant of the Mersenne Twister [46, 60]. As its name suggests, it was designed for GPGPU applications. In particular, it was designed with parallel Monte Carlo simulations in mind. It is released with a parameter generator for the Mersenne Twister algorithm, to supply users with distinct generators on request (MTGPs with different sequences). The MTGP is implemented in NVIDIA CUDA [51] in both 32-bit and 64-bit versions. Following the popularity of the original Mersenne Twister PRNG, this generator has become a de facto standard for comparing GPU-based PRNGs.

The approach taken by the MTGP to make the Mersenne Twister parallel can be explained as follows. The next element of the sequence, x_i, is expressed as some function, h, of a number of previous elements in the sequence:

x_i = h(x_{i−N}, x_{i−N+1}, x_{i−N+M})

The parallelism that can be exploited in this algorithm becomes apparent when we consider the pattern of dependency between further elements of the sequence:

x_i = h(x_{i−N}, x_{i−N+1}, x_{i−N+M})
x_{i+1} = h(x_{i−N+1}, x_{i−N+2}, x_{i−N+M+1})
...
x_{i+N−M−1} = h(x_{i−M−1}, x_{i−M}, x_{i−1})
x_{i+N−M} = h(x_{i−M}, x_{i−M+1}, x_i)   (4.1)

The last element in this list, which produces x_{i+N−M}, requires the value of x_i, which has not been calculated yet. Thus, only N − M elements of the sequence produced by a Mersenne Twister can be calculated in parallel.
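This pattern can be sketched in C as follows. Here h is only a placeholder for the actual Mersenne Twister recurrence, and the value of M is illustrative, not taken from the report; in the real MTGP each loop iteration would be executed by one CUDA thread.

    /* Placeholder for the Mersenne Twister recurrence h: the real MTGP
       function involves masks, shifts and a small lookup table. */
    static unsigned h(unsigned a, unsigned b, unsigned c)
    {
        return a ^ (b << 1) ^ (c >> 1);            /* illustrative only */
    }

    #define N 351   /* 32-bit MTGP with p = 11213 (from the report) */
    #define M 87    /* illustrative lag; real MTGPs pick M so N-M >= 256 */

    /* One parallel round: x[0..N-1] holds x_{i-N}..x_{i-1}.  Each of
       the N - M updates below reads only entries with index < N, so
       they are mutually independent and may run as N - M threads. */
    static void mtgp_round(unsigned x[N + (N - M)])
    {
        for (int j = 0; j < N - M; j++)
            x[N + j] = h(x[j], x[j + 1], x[j + M]);
    }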


As N is fixed by the Mersenne prime chosen for the algorithm, all that is required to maximise the parallel efficiency of the MTGP is the careful selection of the constant M. This constant M, specific to each generator, determines the selection of one of the previous elements in the sequence in the recurrence that defines the MTGP. Thus, it has a direct impact on the quality of the random numbers generated, and the distribution of the sequence. A candidate set of parameters for an instance of the MTGP can be evaluated via equidistribution theory. This provides a measure called the total dimension defect. The Dynamic Creator for MTGP (MTGPDC) uses this to provide 128 parameter sets for each of the periods available (defined by different Mersenne primes: 2^11213 − 1, 2^23209 − 1, and 2^44497 − 1), for the initialisation of distinct MTGPs on request by applications at run-time. While the selection of N and M allows for N − M elements to be calculated in parallel, there is also the restriction imposed by the number of previous elements available. The MTGP maintains these previous elements in an internal state-space buffer. This is implemented as a circular array A, so that older elements are over-written after they are no longer needed, i.e. when no further terms in the recurrence are dependent on them. Thus, for a given size |A| of the state space array, we can only produce |A| − N numbers in parallel. As an example, for the 32-bit (w = 32) MTGP with p = 11213, we have N = 351 and N − M typically greater than or equal to 256. If we choose |A| = 512 then we can only calculate at most |A| − N = 161 numbers in parallel. The MTGP requires |A| in increments of powers of two to simplify memory access. Consequently, to achieve the full level of parallel computation supported by these parameters, this MTGP will require |A| = 1024 32-bit variables (4 KB of memory). At the time of this generator's development, the CUDA architecture (version 1.3) supported 512 threads per block of threads and a maximum of 16 KB of shared memory per multi-processor on the device [51].

This relatively large state space would ultimately be a limiting factor in achieving the full computational throughput possible on a GPU device. The author of MTGP notes in the most recent release a change in the CUDA sample from the larger periods (2^23209 - 1 and 2^44497 - 1 for the 32-bit and 64-bit versions respectively) to a smaller one (2^11213 - 1 for both versions), in order to improve the utilisation of the device given the increased memory requirements of the larger periods.

The author of MTGP provides performance results for the 32-bit MTGP on NVIDIA's GeForce GTX 260 GPU device: 5 × 10^7 random numbers are generated across 108 blocks of 256 threads. The time for this to complete was measured at 4.6 ms, which translates to a random number generation rate of approximately 1.1 × 10^10 RN/s. No information regarding the quality of the MTGP sequence is reported. However, the MTGP is still a variant of the Mersenne Twister, so it can be assumed to suffer, at some level, from the same problems in the linear recurrence tests described in [32].

With respect to a programming model, the MTGP implementation loosely conforms to the general CUDA programming model, which encourages a contiguous divide-and-conquer approach. The problem of producing a sequence of random numbers is divided such that each block computes a sub-sequence. For example, if only two blocks are used, each block computes a contiguous half of the 5 × 10^7 random numbers. Each block is then initialised as a distinct MTGP and can independently do its portion of the work. This avoids the issue of overlapping sub-sequences found in other contiguous sub-sequence approaches. Each thread within a block is responsible for generating one random number per iteration of the algorithm; the threads work synchronously to produce the sub-sequence allocated to the block. A sketch of this pattern follows.
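To make this concrete, the following CUDA sketch reproduces the block-synchronous update pattern just described. It is not the MTGP itself: the function h() below is a placeholder for the real recurrence and tempering, the value of M is hypothetical, and only the dependency and circular-buffer structure follows the description above.

    // Sketch of the MTGP block-update pattern; h() is a placeholder, not
    // the real MTGP recurrence/tempering, and M below is hypothetical.
    #define N   351    // degree of recurrence for p = 11213, w = 32
    #define M   87     // hypothetical middle index, so N - M = 264 >= 256
    #define BUF 1024   // circular state buffer size (power of two)

    __device__ unsigned int h(unsigned int a, unsigned int b, unsigned int c)
    {
        return a ^ (b << 1) ^ (c >> 1);          // placeholder only
    }

    // Launch with blockDim.x <= min(N - M, BUF - N); each block generates
    // an independent sub-sequence from its own pre-initialised state.
    // out must hold gridDim.x * rounds * blockDim.x words.
    __global__ void mtgp_like(const unsigned int *seed, unsigned int *out,
                              int rounds)
    {
        __shared__ unsigned int x[BUF];
        int t = threadIdx.x;

        // load this block's initial N-word state, striped across threads
        for (int j = t; j < N; j += blockDim.x)
            x[j] = seed[blockIdx.x * N + j];
        __syncthreads();

        int pos = N;                             // slot of next new element
        for (int rnd = 0; rnd < rounds; ++rnd) {
            int i = pos + t;                     // this thread's element
            unsigned int v = h(x[(i - N + BUF) % BUF],
                               x[(i - N + 1 + BUF) % BUF],
                               x[(i - N + M + BUF) % BUF]);
            __syncthreads();                     // all reads precede writes
            x[i % BUF] = v;                      // overwrite oldest element
            out[(blockIdx.x * rounds + rnd) * blockDim.x + t] = v;
            __syncthreads();                     // writes visible next round
            pos = (pos + blockDim.x) % BUF;
        }
    }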

    4.7 Parallel Lagged Fibonacci Generators

The Lagged Fibonacci Generator (LFG) is a well-documented class of generators. A number of authors have noted the advantages of LFGs over LCGs, and have implemented and analysed LFGs on a number of different parallel architectures, including distributed systems [1, 44, 61, 33] and GPU systems [24]. LFGs have been used in a wide variety of applications, including Monte Carlo algorithms, particle filtering, and machine-learning-inspired computer vision algorithms. Here we discuss a few different distributed implementations of this generator and one NVIDIA CUDA [51] implementation.

Worthy of note is the leapfrog variant of this generator class. To explain the application of the leapfrog technique to the LFG, let us first recall the recurrence

    x_i = (x_{i-r} ⊛ x_{i-r+s}) mod m,

where ⊛ is a suitable binary operator, e.g. the exclusive-or operator ⊕ (XOR); m is the modulus (usually 2^32 or 2^64, and irrelevant in the XOR case); and r and s are integer parameters with r > s.

In the recurrence, x_{i-r} is the older of the two terms used to define x_i, and as we do not need to store any older terms, this defines the size of the state space. If r and s are chosen suitably, and ⊛ is ⊕, the period is 2^r - 1. (If ⊛ is a different operator, such as addition or multiplication mod m, then the period may be larger, but the leapfrog technique is not applicable, so we assume here that ⊛ is ⊕.)

It is not difficult to show, by consideration of Pascal's triangle or using properties of the generating function in GF(2)[x], that

    x_i = x_{i-pr} ⊕ x_{i-pr+ps}                                   (4.2)

whenever p is a power of two. This identity is verified numerically below.
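Since (4.2) is a purely algebraic consequence of the XOR recurrence, it is easy to check mechanically. A minimal host-side check with toy lags (production lags are far larger) might look like this:

    // Host-side check of the leapfrog identity (4.2) for an XOR lagged
    // Fibonacci generator x[i] = x[i-R] ^ x[i-R+S].  The lags below are
    // for illustration only.
    #include <stdio.h>
    #include <stdlib.h>

    #define R   17
    #define S   5
    #define P   4                       /* leapfrog factor, a power of two */
    #define LEN 5000

    int main(void)
    {
        static unsigned int x[LEN];
        for (int i = 0; i < R; ++i)
            x[i] = (unsigned int)rand();          /* arbitrary seed values */
        for (int i = R; i < LEN; ++i)
            x[i] = x[i - R] ^ x[i - R + S];

        /* verify x[i] == x[i - P*R] ^ x[i - P*R + P*S] for all i */
        for (int i = P * R; i < LEN; ++i)
            if (x[i] != (x[i - P * R] ^ x[i - P * R + P * S])) {
                printf("identity fails at i = %d\n", i);
                return 1;
            }
        printf("leapfrog identity holds for p = %d\n", P);
        return 0;
    }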

Janowczyk et al [24] provide an implementation of this leapfrog LFG on a GPU architecture using NVIDIA CUDA [51]. Their implementation requires the initialisation of p·r random numbers, which can either be loaded from a file or created in parallel in a seeding step. As r also defines the size of the state space, its selection is important. Some suggested values of r for good statistical properties of the generator range into the tens of thousands, which is undesirable on a GPU architecture where memory is limited.

The experiments presented by Janowczyk et al were conducted on the NVIDIA GeForce 8800GTS GPU device. Values of r equal to 607 and 1279 are cited as being used, but the authors found that the larger value decreased the random number throughput (RN/s) achieved. They compared their results to the version of the Mersenne Twister found in the NVIDIA CUDA SDK. A speed of 1.26 × 10^9 RN/s is claimed; this figure includes the time taken to copy the random numbers to the device memory. In separate tests where the results are not copied, mimicking a situation where a simulation executes inline with the generation of random numbers, the leapfrog LFG performs at 3.5 × 10^9 RN/s, and the Mersenne Twister at 2.59 × 10^9 RN/s. The difference is explained by the increased computational cost per random number of the Mersenne Twister's tempering step, where the LFG requires only one XOR operation per random number (but may produce numbers of statistically lower quality as a result).

As the sequence produced by this generator is identical to that of the sequential LFG, its quality is exactly the same, and all comments on its quality in comparison to other generators hold, for example those made by L'Ecuyer [32] and Marsaglia [39].

    4.8 xorgensGP

In 2010 one of us [Nandapalan] implemented the xorgens generator in NVIDIA CUDA. Extending the xorgens PRNG to the GPGPU domain was not a trivial pursuit, as a number of design considerations had to be taken into account in order to create an efficient implementation. This provided some insight into the problem of adapting algorithms to a GPU environment.

We are essentially seeking to exploit some level of parallelism inherent in the flow of data. To realise this, let us first recall the recursion relation describing the xorgens algorithm:

    x_i = x_{i-r}(I + L^a)(I + R^b) + x_{i-s}(I + L^c)(I + R^d)

In this equation, the parameter r represents the degree of recurrence, and consequently the size of the state space (in words, not counting a small constant for the Weyl generator and a circular array index). L and R represent left-shift and right-shift operators, respectively. If we conceptualise this state-space array as a circular buffer of r elements, we can reveal some structure in the flow of data. In a circular buffer, x, of r elements, where x[i] denotes the i-th element, x_i, the indices i and i + r access the same position within the buffer. This means that as each new element x_i in the sequence is calculated from x[i - r] and x[i - s], the result replaces the r-th oldest element in the state space, which is no longer needed for calculating future elements. A sketch of one such step follows.
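The following device-function sketch shows one application of this recurrence with the circular indexing just described. The shift amounts and lags are illustrative placeholders rather than a vetted xorgens parameter set, and the Weyl combination applied to the output of the real generator is omitted.

    // One step of the xorgens recurrence on a circular buffer of r words
    // (r a power of two).  RDEG, SLAG, a, b, c, d are placeholders only;
    // real parameter sets are chosen to give period 2^(r*w) - 1 and good
    // statistical quality.  The Weyl-generator combination is omitted.
    #define RDEG 64                  // degree of recurrence r (words)
    #define SLAG 33                  // second lag s; s odd, GCD(r, s) = 1

    __device__ unsigned int xorgens_step(unsigned int *x, unsigned int *i)
    {
        const int a = 15, b = 14, c = 12, d = 17;    // placeholders only
        *i = (*i + 1) & (RDEG - 1);      // slot of x_{i-r}, reused for x_i
        unsigned int t = x[*i];          // t = x_{i-r}
        t ^= t << a;  t ^= t >> b;       // t := t (I + L^a)(I + R^b)
        unsigned int v = x[(*i + (RDEG - SLAG)) & (RDEG - 1)];   // x_{i-s}
        v ^= v << c;  v ^= v >> d;       // v := v (I + L^c)(I + R^d)
        v ^= t;                          // '+' is XOR on GF(2) word vectors
        x[*i] = v;                       // overwrite the oldest element
        return v;
    }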

Now we can begin to consider the parallel computation of a sub-sequence of xorgens. Writing A = (I + L^a)(I + R^b) and B = (I + L^c)(I + R^d), let us examine the dependencies of the data flow within the buffer x as a sequence is being produced:

    x_i         = x_{i-r} A       + x_{i-s} B
    x_{i+1}     = x_{i-r+1} A     + x_{i-s+1} B
      ...
    x_{i+(r-s)} = x_{i-r+(r-s)} A + x_{i-s+(r-s)} B
                = x_{i-s} A       + x_{i+r-2s} B
      ...
    x_{i+s}     = x_{i-r+s} A     + x_{i-s+s} B
                = x_{i-r+s} A     + x_i B                          (4.3)

If we consider the concurrent computation of the sequence, we observe that the maximum number of terms that can be computed in parallel is min(s, r - s). Here r is fixed by the period required, but we have some freedom in the choice of s; it is best to choose s ≈ r/2 to maximise the inherent parallelism. However, the constraint GCD(r, s) = 1 (with r a power of two, so s must be odd) implies that the best we can do is s = r/2 ± 1, except in the case r = 2, s = 1. For example, with r = 64 and s = 33, min(s, r - s) = 31 terms can be computed in parallel. This provides one additional constraint, in the context of xorgensGP versus (serial) xorgens, on the parameter set {r, s, a, b, c, d} defining a generator. These properties provided sufficient motivation for two different implementations of xorgensGP to be tested, each of which exploits the data-parallelism with non-overlapping contiguous blocks of threads. The two implementations are based on the independent subsequences technique and a variation of it.

Let us first consider the independent subsequences technique. Given the block-of-threads architecture of the CUDA interface and the technique's goal of creating subsequences, it is logical and natural to allocate each subsequence to a block within the grid of blocks. This can be implemented by providing each block with its own local copy of the state space. Each local state space then represents the same distinct generator, but at a different point within its period.

It should be noted that each generator is identical, in that the same parameter set {r, s, a, b, c, d} is used for each. An advantage of this is that the parameters are known at compile time, allowing compile-time optimisations that would not be available if the parameters were dynamically allocated, and thus known only at runtime. A schematic kernel for this approach is sketched below.
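Under these assumptions, a schematic kernel for the independent-subsequences approach looks as follows. The seeding interface and parameter values are illustrative, and the jump-ahead that separates the blocks' subsequences is assumed to be performed on the host.

    // Skeleton of the independent-subsequences kernel: every block runs
    // the same generator (same compile-time parameters) on its own shared
    // state.  Each block's state in seed_states is assumed to have been
    // separated on the host by a jump-ahead, so sub-sequences do not
    // overlap.  Launch with blockDim.x <= min(s, r - s) (= 31 here).
    #define RDEG 64                      // degree of recurrence r (words)
    #define SLAG 33                      // lag s (odd, GCD(r, s) = 1)

    __global__ void xorgensGP_like(const unsigned int *seed_states,
                                   unsigned int *out, int rounds)
    {
        const int a = 15, b = 14, c = 12, d = 17;    // placeholders only
        __shared__ unsigned int x[RDEG];

        for (int j = threadIdx.x; j < RDEG; j += blockDim.x)
            x[j] = seed_states[blockIdx.x * RDEG + j];
        __syncthreads();

        unsigned int pos = 0;                        // slot of next element
        for (int rnd = 0; rnd < rounds; ++rnd) {
            int j = (pos + threadIdx.x) & (RDEG - 1);
            unsigned int t = x[j];                   // x_{i-r}
            t ^= t << a;  t ^= t >> b;
            unsigned int v = x[(j + (RDEG - SLAG)) & (RDEG - 1)]; // x_{i-s}
            v ^= v << c;  v ^= v >> d;
            v ^= t;
            __syncthreads();                         // reads precede writes
            x[j] = v;
            out[(blockIdx.x * rounds + rnd) * blockDim.x + threadIdx.x] = v;
            __syncthreads();
            pos = (pos + blockDim.x) & (RDEG - 1);
        }
    }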

We now outline the implementation using a variation of the independent subsequences technique. Whereas the previous technique uses the same parameter set, and hence the same generator, on each block, this approach achieves independent subsequences by providing each block with a distinct generator, that is, with a unique set of parameters {r, s, a, b, c, d}. As r is fixed, this requires searching through a five-dimensional space of parameters to find those that give period 2^r - 1 and satisfy other constraints designed to give output with high statistical quality. This search is computationally expensive (requiring of the order of minutes on a single CPU), so suitable parameter sets were precomputed.

Each generator can be started with the same seed, since the unique parameters of each generator should be sufficient to produce independent and uncorrelated outputs.
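Schematically, the only structural change in this variation is that each block fetches its own parameter set, for example from constant memory. The struct layout and table size below are our own illustration, not the actual xorgensGP data structures.

    // In the variation, each block draws a distinct, precomputed parameter
    // set from constant memory (a hypothetical layout, for illustration).
    struct XorgensParams { int s, a, b, c, d; };   // r is fixed for all

    __constant__ XorgensParams param_table[128];   // filled from the host

    __global__ void xorgensGP_variant(const unsigned int *seed_states,
                                      unsigned int *out, int rounds)
    {
        const XorgensParams p = param_table[blockIdx.x];
        // ... stepping as in the previous sketch, but using p.s, p.a,
        // p.b, p.c, p.d, which are known only at run time, so the
        // compiler can no longer fold the shift amounts into immediates ...
    }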

We now present an evaluation of the results obtained in testing the two implementations of xorgensGP. All experiments were performed on a single GPU of the NVIDIA GeForce GTX 295, which is a dual-GPU device.

First, the performance of the two candidate solutions with respect to random number throughput (RN/s) was tested; see Table 4.1.

    Implementation               RN/s
    Independent Subsequence      6.3 × 10^9
    Variation                    2.6 × 10^8

    Table 4.1: Comparison of implementations of xorgensGP.

Second, a report was produced on the memory resource allocation of each implementation's kernel functions, known at the time of code compilation. This report is optionally produced by the CUDA compiler. The report for each kernel includes the number of registers required per thread, the total statically allocated shared memory consumed by the user and compiler (dynamically allocated shared memory can only be calculated at runtime), and the utilisation of the constant memory banks. The reports for the two implementations are presented in Table 4.2. It can be seen that the standard independent subsequences version is the more memory-efficient solution, by a significant number of registers per thread. Note that this does not include the dynamically allocated shared memory of size r for the state space, which is the same for both generators; in the case of 32-bit words and n set to 4096, this gives the state space a size of 128 words.

    Implementation               Registers   Shared Memory      Constant Memory
    Independent Subsequence      14          32+16 bytes [1]    8 bytes
    Variation                    21          32+16 bytes [14]   4 bytes

    Table 4.2: Comparison of memory resource allocation known at compilation time, per thread, between the candidate implementations of xorgensGP.

The period of this generator is greater than 2^4096 - 1 due to the combination with the Weyl generator. The xorshift-based component guarantees a period of at least 2^4096 - 1, and the period of the Weyl generator is either 2^32 or 2^64, depending on the use of 32-bit or 64-bit precision respectively. Since these two periods are coprime, the overall period is 2^32(2^4096 - 1) or 2^64(2^4096 - 1).

The two implementations were each subjected to the BigCrush battery for empirical statistical PRNG testing, as offered by the TestU01 software library [32]. Both implementations pass all tests within the battery, so it can be concluded that there is no appreciable difference, by accepted benchmarks, in the quality of the two sequences. The difference in performance between the two implementations can be attributed partly to the compiler optimisations that can be made when the parameters are known at compile time. Another factor is that the code is most efficient when multiples of 32 threads are run for each block; since s must be odd, the best we can achieve is usually 31 threads per block.

A preliminary test of the MTGP was conducted on the same device as the xorgensGP implementations, for comparison. The throughput of the MTGP was found to be about 1.2 × 10^9 RN/s. This suggests that the xorgensGP implementation with independent subsequences is approximately four times faster than the MTGP, at least on this and similar devices. (Note that the figures given in Table 4.3 use different platforms, so are not directly comparable.)

4.9 Other Parallel PRNGs

We now present a survey of some other recent parallel PRNGs. These implementations do not offer further insight into the design of parallel PRNGs beyond the examples described so far; we simply summarise their performance and any interesting features.

    4.9.1 CUDA SDK Mersenne Twister

The NVIDIA CUDA SDK provides a number of examples to assist developers. One of these is an implementation of a Mersenne Twister, described in [59]; it is not to be confused with the MTGP described in Section 4.6. The approach taken by this generator for the GPU architecture is similar to the variant of the independent subsequences method used for xorgensGP. The implementation takes advantage of the Dynamic Creator for the original Mersenne Twister (DCMT) to produce many distinct Mersenne Twisters that execute in parallel. This is required because, unlike the xorgens generator, the Mersenne Twister cannot avoid correlations between subsequences started from different, even very different (by any definition), seeds or initial values.

The CUDA SDK Mersenne Twister implementation uses the unique thread identifier as an input to the dynamic Mersenne Twister generator, allowing each thread to update a Mersenne Twister independently. To reiterate, every thread is associated with its own state space, so there are as many generators as threads running on the device. Consequently, each Mersenne Twister in this implementation is comparatively small, having a period of just 2^607 - 1, which is quite small in comparison to the period of the MTGP. This generator has been shown to be four times slower than the MTGP. The per-thread layout is sketched below.
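The following schematic shows only the one-state-per-thread indexing; mt_next() is a stand-in for the thread's DCMT-derived update, not the actual SDK code.

    // Schematic of the one-generator-per-thread layout: each thread owns
    // a private state vector in global memory, indexed by its global
    // thread id.
    #define MT_N 19                      // words of state for p = 607

    __device__ unsigned int mt_next(unsigned int *s)
    {
        // placeholder update so the sketch is self-contained; the real
        // code applies the thread's DCMT recurrence and tempering
        s[0] = 1664525u * s[0] + 1013904223u;
        return s[0];
    }

    __global__ void per_thread_mt(unsigned int *states, unsigned int *out,
                                  int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int *my_state = states + tid * MT_N;   // private state
        for (int k = 0; k < n; ++k)
            out[tid * n + k] = mt_next(my_state);
    }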


    4.9.2 CURAND

The CUDA CURAND library is NVIDIA's parallel PRNG framework and library, documented in [52]. The default generator for this library is based on the XORWOW algorithm introduced by Marsaglia in [41], an example of the xorshift class of generators. No performance data appears to be available for this particular implementation.
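For reference, here is a host-side sketch of the XORWOW step, following our reading of Marsaglia's description in [41]; CURAND's internal implementation may differ in details such as seeding.

    // XORWOW step as described by Marsaglia [41]: a five-word xorshift
    // generator combined by addition with a Weyl sequence (the d term).
    typedef struct { unsigned int x, y, z, w, v, d; } xorwow_state;

    unsigned int xorwow(xorwow_state *s)
    {
        unsigned int t = s->x ^ (s->x >> 2);
        s->x = s->y;  s->y = s->z;  s->z = s->w;  s->w = s->v;
        s->v = (s->v ^ (s->v << 4)) ^ (t ^ (t << 1));
        s->d += 362437u;                 // Weyl component
        return s->d + s->v;
    }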

    4.9.3 Park-Miller on GPU

The Park-Miller generator, also known as a Lehmer generator, can be viewed as a special case of the LCG class of generators. Langdon presents and releases code for a GPU implementation of this generator in CUDA in [26], and revises it in [27] with double precision on a pre-production model of an NVIDIA Tesla T10P unit. A peak random number throughput of 3.5 × 10^9 RN/s at double precision is claimed for this device.
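The underlying recurrence is simple enough to state in a few lines. A minimal sketch, using 64-bit intermediate arithmetic to avoid overflow:

    // The Park-Miller "minimal standard" Lehmer generator [56]:
    // x_{k+1} = 16807 x_k mod (2^31 - 1).  The seed must lie in
    // [1, 2^31 - 2].
    unsigned int park_miller(unsigned int *x)
    {
        unsigned long long p = 16807ULL * *x;
        *x = (unsigned int)(p % 2147483647ULL);   // modulus m = 2^31 - 1
        return *x;
    }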

    4.9.4 Parallel LCG

A parallel implementation of a 64-bit LCG is provided in the Scalable Library for Pseudorandom Number Generation (SPRNG), presented in [44]. Performance information for this parallel LCG is made available by Tan in [61], and was obtained on a cluster of twenty DEC Alpha 21164 processors. The throughput measured by Tan was 12.8 × 10^9 RN/s. This LCG was made parallel by a variant of the independent subsequences technique, using unique parameters to create distinct generators and subsequences.
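As a generic illustration of distinct generators through unique parameters, one can give each stream of a 64-bit LCG its own additive constant, as sketched below; the multiplier is a known 64-bit LCG multiplier, the odd constants are arbitrary, and SPRNG's actual parameterisation may differ.

    // Distinct 64-bit LCG streams via per-stream additive constants.
    #define NSTREAMS 4

    unsigned long long lcg64(unsigned long long *x, int stream)
    {
        static const unsigned long long c[NSTREAMS] =
            { 0xB5ULL, 0x1A7ULL, 0x2E5ULL, 0x3C3ULL };  // odd constants
        *x = 2862933555777941757ULL * *x + c[stream];   // mod 2^64
        return *x;
    }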

    4.9.5 Parallel MRG

The SPRNG library also provides a parallel implementation of a combined Multiple Recursive Generator (MRG). Again, performance information is available from Tan [61]: the throughput of this generator was measured at 4.2 × 10^9 RN/s. The same variant of the independent subsequences technique was used in this implementation.

    4.9.6 Other Parallel Lagged Fibonacci Generators

The LFG has been implemented in parallel a number of times; see Aluru [1], Mascagni [44], and Tan et al [61]. Aluru and Mascagni do not present performance information for any specific implementation, but performance information for Mascagni's implementation is available in Tan's paper.


    Generator                RN/s (×10^9)   Period      State Space   Platform
    Wallace (normal)         5.2            2^80        2048          G80 based GPU
    Hybrid Comb. Tausworthe  4.3            2^121       4             G80 based GPU
    MTGP                     10.1           2^11213     1024          GeForce GTX 260
                             1.2                                      GeForce GTX 295
    Leapfrog                 3.5            2^607       607           GeForce 8800GTS
    CUDA SDK MT              2.59           2^607       19            GeForce 8800GTS
    xorgensGP                6.5            2^(4096+w)  4096/w        GeForce GTX 295
    Park-MillerGP            3.5                                      Tesla T10P
    SPRNG LCG                12.8                                     DEC Alpha 21164
    SPRNG Combined MRG       4.2                                      DEC Alpha 21164
    PLFG                     4.0            2^23238                   DEC Alpha 21164
    SPRNG MLFG               5.4                                      DEC Alpha 21164
    SPRNG ALFG               3.8                                      DEC Alpha 21164

    Table 4.3: Summary of implementations surveyed. Periods and state-space sizes (in words) are approximate (and in some cases surmised). The word size w = 32 or 64 for xorgensGP. NB: The throughputs of the generators should not be directly compared, in view of the different platforms on which they were implemented.


    Bibliography

[1] S. Aluru. Lagged Fibonacci random number generators for distributed memory parallel computers. Journal of Parallel and Distributed Computing, 45(1):1-12, 1997.

[2] E. Barker and J. Kelsey. Recommendation for Random Number Generation Using Deterministic Random Bit Generators (Revised). NIST Special Publication 800-90, 2007.

[3] R. P. Brent. Algorithm 488: A Gaussian pseudo-random number generator. Communications of the ACM, 17:704-706, 1974.

[4] R. P. Brent. Uniform random number generators for supercomputers. In Proceedings Fifth Australian Supercomputer Conference, Melbourne, December 1992, pages 95-104, 1992.

[5] R. P. Brent. Fast normal random number generators on vector processors. Technical Report TR-CS-93-04, ANU, 1993. arXiv:1004.3105v2.

[6] R. P. Brent. A fast vectorised implementation of Wallace's normal random number generator. Technical Report TR-CS-97-07, ANU, 1997. arXiv:1004.3114v1.

[7] R. P. Brent. Random number generation and simulation on vector and parallel computers. Lecture Notes in Computer Science, 1470:1-20, 1998.

[8] R. P. Brent. Note on Marsaglia's xorshift random number generators. Journal of Statistical Software, 11(5):1-4, 2004.

[9] R. P. Brent. Some long-period random number generators using shifts and xors. ANZIAM Journal, 48 (CTAC2006):C188-C202, 2007.

[10] R. P. Brent. xorgens version 3.05, 2008. http://maths.anu.edu.au/brent/random.html.


[11] R. P. Brent. Some comments on C. S. Wallace's random number generators. Computer Journal, 51(5):579-584, 2008.

[12] R. P. Brent. The myth of equidistribution, 2010. arXiv:1005.1320v1.

[13] R. P. Brent and P. Zimmermann. The great trinomial hunt. Notices of the American Mathematical Society, 58(2):233-239, 2011.

[14] N. Brookwood. AMD Fusion family of APUs: Enabling a superior, immersive PC experience. Insight 64, 1:1-8, March 2010.

[15] P. D. Coddington. Random number generators for parallel computers. The NHSE Review, 2, 1997.

[16] Advanced Micro Devices. http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.

[17] G. S. Fishman and L. R. Moore. An exhaustive analysis of multiplicative congruential random number generators with modulus 2^31 - 1. SIAM J. Scientific and Statistical Computing, 7:24-45, 1986.

[18] H. Haramoto, M. Matsumoto, and P. L'Ecuyer. A fast jump ahead algorithm for linear recurrences in a polynomial space. In Proceedings SETA 2008, pages 290-298, 2008.

[19] H. Haramoto, M. Matsumoto, T. Nishimura, F. Panneton, and P. L'Ecuyer. Efficient jump ahead for F_2-linear random number generators. INFORMS J. on Computing, 20(3):385-390, 2008.

[20] L. Howes and D. Thomas. Efficient random number generation and application using CUDA. GPU Gems, 3:805-830, 2007.

[21] Intel Corp. Intel advanced vector extensions programming reference. http://software.intel.com/file/33301, 2010.

[22] Intel Corp. Intel microarchitecture codename Sandy Bridge. http://www.intel.com/technology/architecture-silicon/2ndgen/index.htm, 2011.

[23] F. James. A review of pseudorandom number generators. Computer Physics Communications, 60:329-334, 1990.

[24] A. Janowczyk, S. Chandran, and S. Aluru. Fast, processor-cardinality agnostic PRNG with a tracking application. In Computer Vision, Graphics and Image Processing, 2008: Proc. Sixth Indian Conference, ICVGIP'08, pages 171-178. IEEE, 2009.


[25] D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, third edition, 1997.

[26] W. B. Langdon. PRNG random numbers on GPU. Technical Report CES-477, University of Essex, UK, 2007.

[27] W. B. Langdon. A fast high quality pseudo random number generator for nVidia CUDA. In Proceedings of GECCO'09, pages 2511-2514, 2009.

[28] P. L'Ecuyer. Efficient and portable combined random number generators. Communications of the ACM, 31(6):742-751, 1988.

[29] P. L'Ecuyer. Maximally equidistributed combined Tausworthe generators. Mathematics of Computation, 65(213):203-213, 1996.

[30] P. L'Ecuyer and F. Panneton. Construction of equidistributed generators based on linear recurrences mod 2. In K.-T. Fang et al, editor, Monte Carlo and Quasi-Monte Carlo Methods 2000, pages 318-330. Springer-Verlag, 2002.

[31] P. L'Ecuyer and R. Simard. TestU01: A software library in ANSI C for empirical testing of random number generators. Technical report, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 17 August 2009, 219 pp.

[32] P. L'Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4), 2007.

[33] J. Lee, Y. Bi, G. D. Peterson, R. J. Hinde, and R. J. Harrison. HASPRNG: Hardware accelerated scalable parallel random number generators. Computer Physics Communications, 180(12):2574-2581, 2009.

[34] P. Leopardi. Testing the tests: using random number generators to improve empirical tests. Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 501-512, 2009.

[35] J. Makino and O. Miyamura. Generation of shift register random numbers on vector processors. Computer Physics Communications, 64:363-368, 1991.


[36] G. Marsaglia. http://groups.google.com/group/comp.lang.c/msg/e3c4ea1169e463ae, 14 May 2003.

[37] G. Marsaglia. http://groups.google.com/group/sci.crypt/msg/12152f657a3bb219, 14 July 2005.

[38] G. Marsaglia. Random numbers fall mainly on the planes. Proc. National Academy of Science USA, 61(1):25-28, 1968.

[39] G. Marsaglia. A current view of random number generators. In L. Billard, editor, Computer Science and Statistics: The Interface, pages 3-10. Elsevier Science Publishers B. V. (North-Holland), 1985.

[40] G. Marsaglia. DIEHARD: a battery of tests of randomness. http://stat.fsu.edu/~geo/diehard.html, 1996.

[41] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1-6, 2003.

[42] G. Marsaglia and A. Zaman. A new class of random number generators. Annals of Applied Probability, 1(3):462-480, 1991.

[43] M. Mascagni, S. A. Cuccaro, D. V. Pryor, and M. L. Robinson. Recent developments in parallel pseudorandom number generation. In Proceedings of PPSC 1993, pages 524-529, 1993.

[44] M. Mascagni and A. Srinivasan. Algorithm 806: SPRNG: a scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software (TOMS), 26(3):436-461, 2000.

[45] M. Matsumoto and Y. Kurita. Strong deviations from randomness in m-sequences based on trinomials. ACM Trans. on Modeling and Computer Simulation, 6(2):99-106, 1996.

[46] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):3-30, 1998.

[47] M. Matsumoto, T. Nishimura, M. Hagita, and M. Saito. Mersenne Twister and Fubuki stream/block cipher, 2005. http://eprint.iacr.org/2005/165.pdf.


[48] M. Matsumoto, M. Saito, T. Nishimura, and M. Hagita. CryptMT stream cipher ver. 3: description. Lecture Notes in Computer Science, 4986:7-19, 2008.

[49] M. Matsumoto, I. Wada, A. Kuramoto, and H. Ashihara. Common defects in initialization of pseudorandom number generators. ACM Trans. on Modeling and Computer Simulation, 17 (no. 4, article 15), 2007.

[50] A. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996 (fifth printing 2001).

[51] NVIDIA. Compute Unified Device Architecture (CUDA) Programming Guide. NVIDIA Corp., Santa Clara, CA, 2010.

[52] NVIDIA. CUDA CURAND Library. NVIDIA Corp., Santa Clara, CA, 2010.

[53] NVIDIA Corp. Tegra. http://www.nvidia.com/object/tegra.html, 2011.

[54] F. Panneton and P. L'Ecuyer. On the xorshift random number generators. ACM Transactions on Modeling and Computer Simulation, 15(4):346-361, 2005.

[55] F. Panneton, P. L'Ecuyer, and M. Matsumoto. Improved long-period generators based on linear recurrences modulo 2. ACM Transactions on Mathematical Software, 32(1):1-16, 2006.

[56] S. K. Park and K. W. Miller. Random number generators: good ones are hard to find. Communications of the ACM, 31:1192-1201, 1988.

[57] W. P. Petersen. Some vectorized random number generators for uniform, normal, and Poisson distributions for CRAY X-MP. Journal of Supercomputing, 1:327-335, 1988.

[58] M. Pharr and R. Fernando. GPU Gems 2. Addison-Wesley, 2005.

[59] V. Podlozhnyuk. Parallel Mersenne Twister. NVIDIA white paper, 2007.

[60] M. Saito. A variant of Mersenne Twister suitable for graphic processors, 2010. arXiv:1005.4973.

[61] C. Tan and J. R. Blais. PLFG: A highly scalable parallel pseudo-random number generator for Monte Carlo simulations. In M. Bubak, H. Afsarmanesh, B. Hertzberger, and R. Williams, editors, High Performance Computing and Networking, volume 1823 of Lecture Notes in Computer Science, pages 127-135. Springer Berlin, Heidelberg, 2000.

[62] S. Tezuka and P. L'Ecuyer. Efficient and portable combined Tausworthe random number generators. ACM Transactions on Modeling and Computer Simulation, 1(2):99-112, 1991.

[63] D. B. Thomas, W. Luk, P. H. W. Leong, and J. D. Villasenor. Gaussian random number generators. ACM Computing Surveys, 39(4):11:1-36, 2007.

[64] D. B. Thomas, L. Howes, and W. Luk. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 63-72. ACM, 2009.

[65] J. von Neumann. Various techniques used in connection with random digits. In The Monte Carlo Method, pages 36-38, 1951. Reprinted in John von Neumann Collected Works, volume 5, 768-770.