
How many operation units are adequate? Wolfgang Matthes, 1991

A manuscript in a tentative stage. The finished paper has been published in Computer Architecture News, June 1991.

Abstract

A uniprocessor superscalar architecture is proposed which comprises four universal operation units arranged according to a tree-shaped dataflow graph, instruction issuing hardware, and operand selection means. The control principles are based on VLIW, microprogramming, and dataflow concepts. The proposal emerged mainly from investigations of inherent mathematical structures of application problems, especially from the analysis of dataflow graphs of elementary mathematical formulas (arithmetic of intervals, complex and rational numbers etc.). Each operation unit is itself an ensemble of high-performance processing resources which may be compared to state-of-the-art processors (e.g. i860). It may require a silicon budget of one to five million transistors. The whole processor may require 10 to 50 million transistors, thus being a suitable implementation target for IC technologies of the 90's.

1. Introduction

To make use of the inherent parallelism in ordinary programs requires the availability of more than one processing resource to perform the desired operations. An obvious approach is to provide multiple operation units. This is known as the superscalar approach. Besides, multiple resources could be created by dividing an operation resource vertically into pipelined segments so that multiple operations can flow through in a step-by-step fashion (superpipelining approach). Both principles can be combined (superscalar/superpipelined machines; JO88). In this paper a particular superscalar architecture will be proposed (with obvious possibilities to add superpipelining, too). The main objective of the underlying research activities was to develop a performance-optimized uniprocessor architecture which should be

1) a versatile, powerful, and cost-effective ensemble of processing resources,

2) an advantageous implementation target for IC technologies of the 90's,

3) a suitable processing element for high-performance and massively parallel systems, according to the good old principle of engineering sciences, to optimize the components first before implementing them in cost-intensive technologies or assembling them together in large quantities.

2. Known Structures - an Overview

Superscalar machines can be built as ensembles of different operation units (integer add, integer multiply, floating point add, multiply etc.; the CDC 6600 [THO64] is a well-known example). Evidently, an ensemble of universal operation units, each of which is capable of performing all the operations specified in the architecture, will provide more opportunity to exploit the available inherent parallelism. Hence we will concentrate on such structures. In a rough taxonomy, known structures could be divided into two categories:

1) tree-like dataflow connection structures,

2) crossbar-like connection structures.

Figures 1 and 2 show two structures of the first category. The first structure [WU87] was derived from the observation that many operation sequences have the form

(a OP1 b) OP2 c,

with the SAXPY (linked triad) d(i) = (a * b(i)) + c(i) being a well-known example. The second structure (proposed for a GaAs microprocessor; VLA88) is based on the following empirical finding: Application programs can be divided into those with a low amount of calculations and those with an extraordinarily high amount. In programs belonging to the first category, approximately 93% of all assignments have no arithmetic operands or only one arithmetic operand. In calculation-intensive programs, approximately 93% of all assignments have up to 3 operands (see Figure 3). Figure 4 shows the basic structure of a superscalar machine with crossbar-like connections. The obvious advantage is the unlimited universality which is not restricted by a particular scheme of data flow. On the other hand, crossbar networks are cost-intensive and may lead to a slower machine cycle. This scheme is typical of VLIW architectures [COL87, CO89], sometimes with the modification of separate register files, crossbars, and operation units for both integer/logical and floating point operations.

An important hardware viewpoint deserves consideration: the information paths in dataflow schemes are point-to-point connections which can be kept short. Tree-shaped structures have the additional advantage that the connections do not cross each other. Hence they are better suited for integration than totally universal connection schemes (crossbar or bus structures).

3. Foundations for Developing a New Architecture

The present superscalar architectures have been developed on the basis of comprehensive analytical work. The experience was gathered essentially by measuring the frequency of usage of operations and operation sequences in comprehensive samples of application programs.

Such a measurement-oriented approach [MA86] may lead to quite good machines, but it has two obvious drawbacks:

1) The rate of usable inherent parallelism is disappointingly low. In the literature, the reported rate depends on the semantic level at which the investigations were done. A lower level means less usable parallelism. If only the instruction level is considered, there is strong evidence that it makes no sense to provide more than two operation units in parallel [JO89, SM89]. In contrast, investigations at Fortran source code level promised rates of usable parallelism from 16 to more than 128 [KU74].

2) Machine architectures derived from these data reflect current programming habits. Possible opportunities for further innovation may remain undiscovered.

Hence our approach is not to study programs, but to study the underlying mathematical structures or, in more general terms, the deep structures, the essence of important application problems (i.e. semantic levels above the programming languages).

Example: In many numerical applications it is possible to execute both integer and floating point operations in parallel. This fact has been applied in some architectures (e.g. Trace, Transputer T800, i860). But what is the essential cause behind this empirical observation? For what reason can integer units be kept busy in loops processing floating point data? Evidently, the integer operations are necessary to do the address calculations for array element addressing. From this insight we can draw a significant advantage: We can sub-optimize the integer units. For example, we may restrict the number of universal integer units to one or two, and additionally we may provide some units specialized for address calculations (as many as needed to feed the floating point units). Thus we may exploit more of the inherent parallelism and keep cost comparatively low.

To obtain initial empirical data for a new proposal we have simply browsed some collections of formulas.¹ Figures 5-8 show some frequently needed mathematical operations together with the corresponding dataflow graphs. We recognize only two essential interconnection structures:

1) none, i.e. independently operating units,

2) tree-shaped structures.

A more elaborate bookkeeping of the resources needed (skipped here for the sake of brevity) will show that four operation units may be used efficiently (i.e. they may be kept busy nearly the whole time). Calculations whose dataflow graph comprises more than four nodes are to be executed in more than one processing step. Hence some bypass and local storage means have to be provided.
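
To illustrate the bookkeeping, consider interval multiplication from Figure 5: the four products are mutually independent and could occupy four units in one step, after which a minimum/maximum selection yields the bounds. The following Python sketch is an illustration only; the function names are chosen here and are not part of the proposal, and directed rounding is ignored.

```python
def interval_mul(x, y):
    """Multiply the intervals x = (x1, x2) and y = (x3, x4).

    The four products do not depend on each other, so four operation
    units could compute them in the same step; min/max then selects
    the lower and upper bound of the result.
    """
    x1, x2 = x
    x3, x4 = y
    products = (x1 * x3, x1 * x4, x2 * x3, x2 * x4)  # one product per unit
    return (min(products), max(products))


def interval_sub(x, y):
    """Subtract intervals: [x1, x2] - [x3, x4] = [x1 - x4, x2 - x3]."""
    x1, x2 = x
    x3, x4 = y
    return (x1 - x4, x2 - x3)  # two independent subtractions
```

The subtraction needs only two of the four units, which corresponds to the first interconnection case above (independently operating units).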

¹ Of course, this simple method cannot substitute for comprehensive investigations. But it is sufficient to demonstrate the feasibility of our approach.

4. The Proposed Structure

According to Figures 9 and 10, the proposed structure comprises four universal operation units and (not explicitly shown) a minimum/maximum and a delta detector. The operation units form a tree structure with four operand data paths from memory and one result path to memory. The structure is an extension of the structure according to Figure 2 by an additional unit whose second operand input is attached to a stack-like accumulating memory or accumulator register, respectively. A stack-organized accumulating memory of sufficient capacity may be used as the runtime data stack (at least as a stack cache; DI87). Hence the problem of how ordinary programs can exploit the tree structure efficiently may be reduced to the problem of tree height reduction. The memory subsystem, the instruction issuing and control mechanisms, and address calculation means are implemented in additional hardware which will be explained below.

The structure according to Figures 9 and 10 is based on the assumption that a tree-like dataflow scheme will be more significant in a truly universal processor than independent operation of the four units. Hence only the tree connections are provided in hardware to keep interconnection cost as low as possible. This approach requires some bypass provisions to feed the operation units 3 and 4 with control information, and, if operated independently, with memory data. On the other hand, this cost/performance tradeoff will cause efficiency losses if vectorized or unrolled code is to be executed. As a matter of routine, to allow for independent operation of four units would require at least 12 memory ports (4*2 = 8 to fetch the operands, 4 to store the results), and additional interconnection means would be necessary to implement the tree-shaped structure. The urge for cost reduction led to the extension of the proposed structure shown in Figure 11.
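
Tree height reduction, as referred to above, means re-associating a chain of operations so that independent partial results can be computed side by side. The following Python sketch is an illustration only. It assumes an associative operation (floating point addition is associative only approximately), and all function names are hypothetical.

```python
def chain_height(n_operands):
    """Dependent steps of a left-to-right chain ((a + b) + c) + d ..."""
    return max(n_operands - 1, 0)


def balanced_height(n_operands):
    """Dependent steps after re-association into a balanced binary tree."""
    height = 0
    width = n_operands
    while width > 1:
        width = (width + 1) // 2  # pair up operands; an odd one passes through
        height += 1
    return height


def balanced_sum(values):
    """Sum a non-empty sequence level by level, as a balanced tree would."""
    level = list(values)
    while len(level) > 1:
        # All additions of one level are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd operand is bypassed to the next level
        level = nxt
    return level[0]
```

For four operands the chain needs three dependent steps, whereas the balanced form needs two: units 1 and 2 could form the partial sums and unit 3 could combine them.
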
Each of the operation units has a multipurpose memory (MPM) which can be used as an accumulator, a stack, a collection of vector registers, and a control storage. It has two independent ports for read and write accesses, respectively. Its capacity should be at least 8 kBytes, organized as 1024 buckets of 128 bits (if used as a vector register, it could hold two vectors of 1024 64-bit elements). The whole structure is connected to the memory subsystem via eight ports (four read-only and four read/write ports; the latter are used to provide the paths of the tree-shaped structure as well). This scheme allows the MPMs to be loaded and stored at maximum speed. Each of the operation units can execute even triadic operations (e.g. SAXPY) with two of the operands delivered via memory ports and one from the MPM. The results will be stored in the MPMs. They can be moved to the memory at maximum speed after the operations have been completed.

5. The Internal Structure of an Operation Unit

Each of the four operation units can process numerical and nonnumerical data, respectively. The internal structure of a processing kernel is shown in Figure 12. In principle, some state-of-the-art high-performance processors may serve as a paradigm for processing kernel design (e.g. TMS 34082, Motorola DSP96002, i860, AMD 29000), and Figure 12 shows nothing but an ensemble of processing resources a high-performance machine should have, according to today's knowledge. Compatibility with existing architectures was not our concern. Instead, we tried to put as many innovative ideas as possible into our proposal. Here are some of these concepts:

1) All data structures are packed in buckets (machine words) of 128 bits. In the hardware, a bucket can be divided into bags of 64, 32, 16, or 8 bits.

2) In the buckets, arbitrary bit fields can be selected.

3) The bit field is the basic type for nonnumerical data. Normally, such data structures are packed in bags (8-64 bits). In some cases, the bags of a bucket may be processed in parallel (scanning of character strings, graphics operations etc.).

4) For numerical data, there is only one basic type: the binary coded natural number. Arbitrary bit fields can be treated as natural numbers. They will be processed in multiples of 32 bits (with appropriate extension before processing, if necessary). All other numerical data types are extensions of this concept: Integers are naturals extended by a sign bit (sign/magnitude representation in contrast to the usual two's complement representation). Floating point numbers are composed of an integer mantissa and an integer exponent (fixed formats of 32, 64, and 96 bits). BCD coded decimal numbers are not provided. Decimal numbers can be represented as rational numbers (fractions) of the form a/b. The merits of this proposal have yet to be proved. But there are some obvious advantages:

a) For each type of operation, only one type of hardware resource is necessary.

b) Floating point operations could be controlled down to the elementary level (microcode level; DA89). High-accuracy algorithms (e.g. an accurate scalar product; KU81) could be implemented efficiently. If desired, extremely long integers could be used instead of floating point numbers for intermediate variables within high-accuracy calculations.

c) BCD hardware can be avoided. Binary rational number arithmetic can use the tree structure efficiently (see Figure 7). This promises to be considerably faster than the usual nibble-by-nibble BCD arithmetic.
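
As an illustration of point c), the textbook formulas for rational addition and multiplication map onto a small tree: three independent products followed by one addition. The Python sketch below is written under that assumption; it is not taken from Figure 7, whose exact dataflow graph may differ, and the function names are hypothetical.

```python
from math import gcd


def rational_add(a, b, c, d):
    """Return (n, m) with n/m = a/b + c/d, not reduced.

    Step 1: the products a*d, c*b and b*d are independent of each other.
    Step 2: one addition combines the first two products.
    """
    p1, p2, p3 = a * d, c * b, b * d  # could run on three units in parallel
    return p1 + p2, p3


def rational_mul(a, b, c, d):
    """Return (n, m) with n/m = (a/b) * (c/d); two independent products."""
    return a * c, b * d


def reduce_fraction(n, m):
    """Cancel the greatest common divisor of numerator and denominator."""
    g = gcd(n, m)
    return n // g, m // g
```

A decimal value such as 0.25 would be held as the pair (1, 4); adding it to itself gives (8, 16), which reduces to (1, 2).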

Of course, the machine should be compatible with widespread basic data structures (2's complement integers, bytes etc.). But conversion (e.g. 2's complement to sign/magnitude representation and vice versa) could be done on the fly and requires considerably less hardware than independent resources for each data type.
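
The conversion mentioned above amounts to a conditional negation. A minimal Python sketch follows, assuming a 32-bit word and a separate sign bit next to a full-width magnitude; the width and the function names are assumptions made here for illustration.

```python
WIDTH = 32  # assumed word width, not fixed by the text for this conversion


def twos_to_sign_magnitude(word, width=WIDTH):
    """Convert a two's complement bit pattern to (sign, magnitude)."""
    mask = (1 << width) - 1
    word &= mask
    sign = word >> (width - 1)              # most significant bit
    magnitude = ((~word + 1) & mask) if sign else word
    return sign, magnitude


def sign_magnitude_to_twos(sign, magnitude, width=WIDTH):
    """Convert (sign, magnitude) back to a two's complement bit pattern."""
    mask = (1 << width) - 1
    return ((-magnitude) & mask) if sign else (magnitude & mask)
```

In hardware this corresponds to an inverter stage and an incrementer controlled by the sign bit, which is far less than a second set of arithmetic resources.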

To each of the basic operations, one dedicated hardware resource is assigned. Some resources could be operated in parallel, but this kind of parallelism has been restricted to keep cost down (e.g. in the numerical section, only multiply-add dataflow has been provided). Exponent calculations are performed in dedicated hardware. Special circuitry has been provided for data conversion (unpacking of stored data into the internal representation and vice versa). This circuitry (in Figure 12: Arguments selection/alignment) consists mainly of barrel shifters which can be exploited for multiple functions (bitfield extraction/insertion, floating point mantissa shifting etc.).
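
Bitfield extraction and insertion with a shifter can be sketched as a shift followed by a mask. The Python sketch below assumes that bit 0 is the least significant bit of a 128-bit bucket and that a field is described by its first bit and its bit count (the two field names appear in Figure 15; the numbering convention and the function names are assumptions).

```python
BUCKET_BITS = 128


def extract_field(bucket, first_bit, bit_count):
    """Shift the field down to bit 0, then mask off everything above it."""
    assert 0 <= first_bit and first_bit + bit_count <= BUCKET_BITS
    return (bucket >> first_bit) & ((1 << bit_count) - 1)


def insert_field(bucket, first_bit, bit_count, value):
    """Clear the field in the bucket and merge in the shifted value."""
    assert 0 <= first_bit and first_bit + bit_count <= BUCKET_BITS
    mask = ((1 << bit_count) - 1) << first_bit
    return (bucket & ~mask) | ((value << first_bit) & mask)


def split_into_bags(bucket, bag_bits):
    """Divide a bucket into bags of 64, 32, 16 or 8 bits, lowest bag first."""
    assert bag_bits in (64, 32, 16, 8)
    return [extract_field(bucket, i, bag_bits)
            for i in range(0, BUCKET_BITS, bag_bits)]
```

The same shift-and-mask path could serve mantissa alignment, which is why one set of shifters can be shared among several functions.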

Since none of the resources and operations is completely new, estimations of expenditures can be based on known high-performance processors. For example, the i860 [INT90] comes very close to our proposal, including 8 kBytes on-chip memory, 64- and 128-bit data paths, multiply-add chaining in the FP unit, graphics operations, and integer multiplications done within the FP multiply hardware. The i860 requires slightly more than one million transistors. Thus we estimate that an operation unit can be implemented with a silicon budget between one and five million transistors, depending on particular cost/performance tradeoffs.

6. The Processor Structure

The overall processor structure, which contains the described ensemble of four operation units as a subsystem, is shown in Figure 13. The basic steps of the instruction processing are assigned to dedicated hardware resources:

- instruction issuing (control memory, Common Control),

- operand selection (Selector/Iterator Resources Ensemble, references/data memory),

- execution of operations (Processing Resources Ensemble, i.e. the structure of operation units described above).

For efficient operand selection, adequate hardware is provided to keep the operation units busy nearly the whole time. This hardware is responsible for machine word (bucket) addressing and for elementary address calculations. Bitfield selection is done within the operation units. More complicated address calculations are executed by the operation units, too. The Selector/Iterator Resources Ensemble is provided to produce addresses according to access patterns [JE88] which are typical of many kinds of innermost loops. Hardware implementation of such access patterns allows for simple and efficient circuitry. An example is shown in Figure 14. This hardware structure is able to produce address values to access array structures of one to three dimensions. To formulate such an access pattern in a common programming language requires nested DO-loops, e.g.:

for AD3 = 1 to EC3 do
  for AD2 = 1 to EC2 do
    for AD1 = 1 to EC1 do
      ... calculations using variables Vi (AD1,AD2,AD3) ...
    end;
  end;
end;

The usual way to calculate the address of an array element Vi (AD1,AD2,AD3) is to apply the formula

ADDRESS(Vi) = ARRAY_BASE + AD1 + (AD2-1)*EC1 + (AD3-1)*EC1*EC2.

(EC1, EC2, and EC3 represent the element count of the first, second, and third dimension, respectively.) The iterator hardware avoids address calculations in the loop body. It operates in parallel with the operation units. Multiply operations are only required for the set-up of the hardware (offset calculations) prior to loop execution.
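
The effect of the iterator can be sketched in software: all multiplications move into the set-up, and the loop itself only adds and counts. The Python sketch below is an illustration under assumptions; it maps the Offset and Current address registers of Figure 14 onto the variables `w1..w3` and `cur1..cur3`, which is a guess at the register roles and not a description of the actual circuitry.

```python
def address_by_formula(array_base, ad1, ad2, ad3, ec1, ec2):
    """ADDRESS = ARRAY_BASE + AD1 + (AD2-1)*EC1 + (AD3-1)*EC1*EC2."""
    return array_base + ad1 + (ad2 - 1) * ec1 + (ad3 - 1) * ec1 * ec2


def iterate_addresses(array_base, ec1, ec2, ec3):
    """Yield the addresses of all elements, innermost index varying fastest."""
    # Set-up prior to loop execution: the only multiplication.
    w1, w2, w3 = 1, ec1, ec1 * ec2
    cur3 = array_base + 1              # address of element (1, 1, 1)
    for _ in range(ec3):               # stage 3: outermost loop
        cur2 = cur3
        for _ in range(ec2):           # stage 2
            cur1 = cur2
            for _ in range(ec1):       # stage 1: innermost loop
                yield cur1
                cur1 += w1             # additions only inside the loops
            cur2 += w2
        cur3 += w3
```

Comparing the generated sequence with the formula for every index triple is a simple way to check the set-up values.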

The memory subsystem in Figure 13 is conceptually located outside of the processor. It must provide the necessary access paths as well as appropriate storage capacity (many Megabytes) and access bandwidth. In addition to this, it must provide the propagation of dataflow control information (see below). The control memory and the references/data memory are located inside of the processor. They may be used as instruction and data caches according to the well-known principles (e.g. set-associative access), but in our proposal the use of these memories should be directly controllable by software. The control memory contains the most recently used programs. It will be loaded via the four read-only ports. The preferred use of the references/data memory is to hold reference information of the most recently used programs (access descriptors), constants, and intermediate variables. Four references can be processed in parallel. This memory can exchange data with the memory subsystem and with the operation units. To avoid confusion, some details should be mentioned:

1. The Processing Resources Ensemble in Figure 13 corresponds to Figures 11 and 12, and the processor is designed with cost-effectiveness in mind, thus multiple use of the data paths is necessary.

2. The references/data memory contains bidirectional bypasses for the read/write memory ports (5-8). The Processing Resources Ensemble has no access to this memory except for address calculations.

3. The memory subsystem contains dataflow-controlled bypasses from the ports 5-8 to the ports 1-4, thus data out of the references/data memory can reach the operation units 1 and 2 via the corresponding ports.

4. If the Processing Resources Ensemble according to Figure 9 is chosen (low-cost alternative), then all internal memories are connected to the four read-only ports, and the write port from operation unit 4 is fed back to the references/data memory which in turn has write ports to the memory subsystem and may be used like a conventional data cache.

To give a rough estimation, the processor may be implemented within a silicon budget of 50 million transistors:

a) Memories: 2 memories, each with 4 banks of 4 k buckets (total 16 k buckets = 256 kBytes = 2 Mbit). 2 Mbit*2 = 4 Mbit; 6 transistors/bit: 24 million transistors (+ address decoders etc.),

b) Common Control, Selector/Iterator Resources Ensemble, bypasses, glue and driver circuitry: 1-2 million transistors,

c) 4 operation units: 4...20 (4*1...4*5) million transistors.

Further cost/performance tradeoffs may reduce the transistor budget to the 10 million range (dynamic memory cells, less memory, off-chip memory, operation units with fewer or performance-reduced resources).
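
The estimate can be retraced with a few lines of arithmetic. The Python sketch below uses only the figures given in the list above; the function and parameter names are chosen here for illustration. It counts 1 k as 1024, so the memory term comes out at about 25.2 million transistors instead of the rounded 24 million.

```python
BUCKET_BITS = 128
K = 1024


def memory_bits(memories=2, banks=4, buckets_per_bank=4 * K):
    """Total capacity of the on-chip memories in bits."""
    return memories * banks * buckets_per_bank * BUCKET_BITS


def budget_millions(transistors_per_bit=6, control=(1, 2), unit=(1, 5), units=4):
    """Return (memory, low total, high total) in millions of transistors."""
    mem = memory_bits() * transistors_per_bit / 1e6
    low = mem + control[0] + units * unit[0]    # cheapest control and units
    high = mem + control[1] + units * unit[1]   # most expensive variant
    return mem, low, high


mem, low, high = budget_millions()
print(f"memories {mem:.1f}, total {low:.1f} to {high:.1f} million transistors")
```

Both totals stay below the stated budget of 50 million transistors, with the memories taking the largest share unless the operation units are built at the upper end of their range.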

8. Control Principles

A combination of VLIW, microprogramming, and dataflow control principles is employed. The instruction formats are based on 128-bit buckets. Basically, the following types of control information are provided:

1. Resource Control Words (RCWs). RCWs are similar to horizontal microinstructions. An active RCW controls the information flow of the corresponding unit in the current machine cycle. RCWs can be executed out of the control memory or (in the operation units) out of the MPM.

2. Incarnation Control Words (ICWs). ICWs are used to control resources according to dataflow principles. Examples of such information structures are shown in Figure 15. An ICW is composed of at least a Resources Selection Bucket (RSB) and an Argument Selector Bucket (ASB). Thus resource control and argument selection are isolated from each other. An RSB contains 8 code fields. A code field can select a particular resource in a particular unit and an argument selector field out of the ASB. There is no operation code. Instead, the resources are identified by ordinal numbers, and the selected arguments will be delivered to the selected resource. For example, to multiply two numbers in a particular operation unit requires feeding the arguments to the multiplicand and multiplier registers in the desired unit. The appropriate control information accompanies the data. This control information is packed into 32-bit Resource Selection Words (RSWs) by the processor's Common Control circuitry. The RSWs must be propagated via the memory subsystem to appear together with the data buckets at the operation units. Proper synchronization is achieved by an order number in each RSW, which is generated by Common Control. A particular resource will be active only if all corresponding arguments have the same order number. The result will be forwarded with the same or with an advanced order number, according to the Advance Control bit in the RSW. Obviously, it poses no difficulties in principle to introduce superpipelining by inserting appropriate pipeline stages into the processing resources.
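
The synchronization rule can be sketched as follows: a resource collects its arguments and becomes active only when every argument is present and all carry the same order number. This Python sketch is an illustration under assumptions. The class and field names are hypothetical, the Advance Control bit is modelled as a property of the resource for brevity, and advancing is modelled as adding one, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    slot: str       # hypothetical input name, e.g. "multiplicand"
    order_no: int   # order number carried by the accompanying RSW
    value: int


class Resource:
    def __init__(self, slots, operation, advance=False):
        self.pending = {name: None for name in slots}
        self.operation = operation
        self.advance = advance  # stands in for the Advance Control bit

    def deliver(self, arg):
        """Store an argument; return (order_no, result) if the resource fires."""
        self.pending[arg.slot] = arg
        args = list(self.pending.values())
        if any(a is None for a in args):
            return None                      # still waiting for an argument
        if len({a.order_no for a in args}) != 1:
            return None                      # order numbers differ: stay idle
        result = self.operation(*(a.value for a in args))
        order_no = args[0].order_no + (1 if self.advance else 0)
        self.pending = {name: None for name in self.pending}
        return order_no, result
```

A multiplier would be set up as `Resource(["multiplicand", "multiplier"], lambda a, b: a * b, advance=True)`; delivering both arguments with order number 7 then yields a product tagged with order number 8.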

9. Conclusion

An overview of a proposal for a high-performance uniprocessor architecture has been given. Of course, many details and intricacies had to be skipped, and a lot of research work remains to be done. Obviously, the following problems deserve special interest:

- refinements of the dataflow control principles, especially with respect to branch and start-up latencies,

- evaluation of each innovation against well-introduced principles,

- migration paths from or even compatibility to systems representing de-facto standards,

- compiler-related issues.

10. References

CO89   R. Cohn et al., "Architecture and Compiler Tradeoffs for a Long Instruction Word Microprocessor," SIGARCH Computer Architecture News Vol. 17, No. 2, pp. 2-14, April 1989.

COL87  R.P. Colwell et al., "A VLIW Architecture for a Trace Scheduling Computer," Operating Systems Review Vol. 21, No. 4, pp. 180-192, October 1987.

DA89   W.J. Dally, "Micro-Optimization of Floating-Point Operations," SIGARCH Computer Architecture News Vol. 17, No. 2, pp. 283-289, April 1989.

DI87   D.R. Ditzel, "Design Tradeoffs to Support the C Programming Language in the CRISP Microprocessor," Operating Systems Review Vol. 21, No. 4, pp. 158-163, October 1987.

IN88   INMOS Ltd. Transputer Reference Manual. 1988.

INT90  Intel Corporation. i860 64-bit Microprocessor Programmers Reference Manual. 1990.

JE88   Y. Jegou, "Access Patterns: A Useful Concept in Vector Programming," Supercomputing. 1st International Conference Athens, Greece, June 1988 Proceedings, pp. 377-391 (Springer 1988, LNCS Vol. 297).

JO88   N.P. Jouppi, "Superscalar vs. Superpipelined Machines," SIGARCH Computer Architecture News Vol. 16, No. 5, pp. 71-80, November 1988.

JO89   N.P. Jouppi, D.W. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," SIGARCH Computer Architecture News Vol. 17, No. 2, pp. 272-282, April 1989.

KU74   D.J. Kuck et al., "Measurements of Parallelism in Ordinary FORTRAN Programs," Computer Vol. 7, No. 1, pp. 37-46, January 1974.

KU81   U.W. Kulisch, W.L. Miranker. Computer Arithmetic in Theory and Practice. Academic Press, 1981.

MA86   M.J. Mahon et al., "Hewlett-Packard Precision Architecture: The Processor," Hewlett-Packard Journal, August 1986, pp. 4 ff.

SM89   M.D. Smith et al., "Limits on Multiple Instruction Issue," SIGARCH Computer Architecture News Vol. 17, No. 2, pp. 290-302, April 1989.

SO89   G.S. Sohi, S. Vajapeyam, "Tradeoffs in Instruction Format Design for Horizontal Architectures," SIGARCH Computer Architecture News Vol. 17, No. 2, pp. 15-25, April 1989.

THO64  J.E. Thornton, "Parallel Operation in the Control Data 6600," AFIPS Proc. FJCC, Part 2, Vol. 26, pp. 33-40, 1964.

VLA88  H. Vlahos, V. Milutinovic, "GaAs Microprocessors and Digital Systems: An Overview of R&D Efforts," IEEE Micro Vol. 8, No. 1, pp. 28-56, February 1988.

WU87   Wm. A. Wulf, "The WM Computer Architecture," SIGARCH Computer Architecture News Vol. 15, No. 4, pp. 70-84, September 1987.

Figure 1: Structure of two operation units (WU87). Dataflow: d := (a OP1 b) OP2 c.

Figure 2: Structure of three operation units (VLA88). Dataflow: e := (a OP1 b) OP3 (c OP2 d).

Figure 3: Frequency of the number of operands in arithmetic assignments (VLA88). Curves: ordinary programs, computation-intensive programs. Horizontal axis: number of operands; vertical axis: frequency.

Figure 4: Crossbar connection structure in a superscalar (VLIW) machine. Labels: multiport register file, nonblocking crossbar networks.

Figure 5: Examples of elementary interval arithmetic operations (dataflow graphs).

Addition

Subtraction: [x1, x2] - [x3, x4] = [x1 - x4, x2 - x3]

Multiplication: [x1, x2] * [x3, x4] = [min(P), max(P)] with P = (x1*x3, x1*x4, x2*x3, x2*x4)

Figure 6: Examples of elementary complex number arithmetic operations (dataflow graphs).

Addition/subtraction

Multiplication: (a1,b1) * (a2,b2) = (a1 a2 - b1 b2, a1 b2 + a2 b1)

Division

Figure 7: Examples of elementary rational number arithmetic operations (dataflow graphs). Panels: addition/subtraction; multiplication; result rounding (according to a specified precision p). n: result numerator, d: result denominator.

Figure 8: Series expansion example (dataflow graph).

e^x = 1 + x/1! + x^2/2! + x^3/3! + ...

Notes in the figure: 0) initial value 0; 1) initial value 1. The calculation of n! could be replaced by fetching table values. The accumulating memory (register) contains the final result e^x. An END signalization is provided.

Figure 9: The proposed structure of four operation units (at first glance). Labels: from memory subsystem, to memory subsystem, min/max and delta detection hardware, accumulating memory (register).

Figure 10: The proposed structure: detailed view. Labels: read ports 1-4, three processing kernels with inputs A, B, control input C, and result R, bypasses (direct and bypassed paths), accumulating memory, information flow from the ports to the data and control inputs of the units, write port.

Figure 11: The proposed structure extended. Panel A: internal details of an operation unit (processing kernel with inputs A, B, C and result R, multipurpose memory MPM with ports MA and MB, load MPM path, selector SEL). Panel B: four operation units connected to the read-only ports (1-4) and the read/write ports (5-8), with load and store paths between the units.

Figure 12: Internal structure of a processing kernel. Labels: A Fifo, B Fifo, Resource Control Words (RCWs), microcode out of MPM, dataflow control information, arguments preselection, arguments selection/alignment, control lines. Buses: SARG1-SARG4 (selected arguments), AC (arithmetic chaining), R (result), RX (result extension, includes FP exponent hardware). Resources: Logic (gate thru, or, invert, xor, and etc.), Graphics (clip, draw, z-buffer operations), Conditions (first/last occurrence, no. of occurrences, interval compare, max/min detect.), Add/sub, Multiply, Divide. Outputs: to MPM / R-output, output of condition results.

Figure 13: Processor structure. Components: memory subsystem (4 read-only access paths 1-4, read/write access paths 5-8, from/to peripherals), control memory* (instruction cache), Common Control, Selector/Iterator Resources Ensemble (4 reference paths), references/data memory* (data cache), Processing Resources Ensemble (4 operation units; see Figures 11, 12 for details).

*: Each of the memories comprises at least 4 banks of 4 k buckets (128 bit).

Figure 14: Iterator hardware for three nested DO-loops. (Stage 1 corresponds to innermost loop.) Each of the stages 1-3 contains the registers Limit (L1-L3), Offset (W1-W3), and Current address, with a +1 step counter; the stages are loaded via a parameter bus. Sections: steps counting, address calculation. Outputs: address to memory, signals to control logic.

Figure 15: Example dataflow control word formats.

32 bit Resource Selection Word (RSW): Format, Order No., Resource ordinal, Argument control, Advance control, Bitfield selector or MPM/branching control.

Bitfield selector: Format, Bit count, First bit in bucket.

Structure of dataflow information (transferred via read-only ports): accompanying control information (up to 64 bit: RSW 1, RSW 2) and argument data bucket (128 bit).

Resources Selection Bucket (RSB) of 8 code fields; Argument Selector Bucket (ASB) of 8 argument selector fields.

RSB code field for a particular resource: Format, Argument selector, Order No. control (clear/keep/advance), Unit, Resource ordinal (for processing resources).
