An Introduction to VLSI Processor Architecture for GaAs
This research has been sponsored by RCA and conducted in collaboration with the RCA Advanced Technology Laboratories, Moorestown, New Jersey.
Page Number: 1/101
Page Number: 2/101
MICROPROCESSORS
DARPA EYES 100-MIPS GaAs CHIP FOR STAR WARS
PALO ALTO
For its Star Wars program, the Department of Defense intends to push well beyond the current limits of technology. And along with lasers and particle beams, one piece of hardware it has in mind is a microprocessor chip having as much computing power as 100 of Digital Equipment Corp.’s VAX-11/780 superminicomputers.
One candidate for the role of basic computing engine for the program, officially called the Strategic Defense Initiative [ElectronicsWeek, May 13, 1985, p. 28], is a gallium arsenide version of the Mips reduced-instruction-set computer (RISC) developed at Stanford University. Three teams are now working on the processor. And this month, the Defense Advanced Research Projects Agency closed the request-for-proposal (RFP) process for a 1.25-µm silicon version of the chip.
Last October, Darpa awarded three contracts for a 32-bit GaAs microprocessor and a floating-point coprocessor. One went to McDonnell Douglas Corp., another to a team formed by Texas Instruments Inc. and Control Data Corp., and the third to a team from RCA Corp. and Tektronix Inc. The three are now working on processes to get useful yields. After a year, the program will be reduced to one or two teams. Darpa’s target is to have a 10,000-gate GaAs chip by the beginning of 1988.
If it is as fast as Darpa expects, the chip will be the basic engine for the Advanced Onboard Signal Processor, one of the baseline machines for the SDI. “We went after RISC because we needed something small enough to put on GaAs,” says Sheldon Karp, principal scientist for strategic technology at Darpa. The agency had been working with the Motorola Inc. 68000 microprocessor, but Motorola wouldn’t even consider trying to put the complex 68000 onto GaAs, Karp says.
A natural. The Mips chip, which was originally funded by Darpa, was a natural for GaAs. “We have only 10,000 gates to work with,” Karp notes. “And the Mips people had taken every possible step to reduce hardware requirements. There are no hardware interlocks, and only 32 instructions.”
Even 10,000 gates is big for GaAs; the first phase of the work is intended to make sure that the RISC architecture can be squeezed into that size at respectable yields, Karp says.
Mips was designed by a group under John Hennessy at Stanford. Hennessy, who has worked as a consultant with Darpa on the SDI project, recently took the chip into the private sector by forming Mips Computer Systems of Mountain View, Calif. [ElectronicsWeek, April 29, 1985, p. 36]. Computer-aided-design software came from the Mayo Clinic in Rochester, Minn.
The silicon Mips chip will come from a two-year effort using the 1.25-µm design rules developed for the Very High Speed Integrated Circuit program. (The Darpa chip was not made part of VHSIC in order to open the RFP to contractors outside that program.)
Both the silicon and GaAs microprocessors will be full 32-bit engines sharing 90% of a common instruction core. Pascal and Air Force 1750A compilers will be targeted for the core instruction set, so that all software will be interchangeable.
The GaAs requirement specifies a clock frequency of 200 MHz and a computation rate of 100 million instructions per second. The silicon chip will be clocked at 40 MHz. Eventually, the silicon chip must be made radiation-hard; the GaAs chip will be intrinsically rad-hard.
Darpa will not release figures on the size of its RISC effort. The silicon version is being funded through the Air Force’s Air Development Center in Rome, N.Y.
–Clifford Barney
The GaAs chip will be clocked at 200 MHz, the silicon at 40 MHz
Reprinted with permission ElectronicsWeek/May 20, 1985
Figure 1.1.a. A brochure about RCA’s 32-bit and 8-bit versions of the GaAs RISC/MIPS processor, realized as a part of the “MIPS for Star Wars” project.
Page Number: 3/101
Phases of a Well-Structured VLSI Design
1. Generation of candidate architectures with approximately the same VLSI area.
2. Comparison of candidate architectures, from the point of view of the compiled HLL code speed.
3. Selection of one candidate architecture, and finalization of its schematics.
4. Design of the VLSI chip:
a. Schematic capture
b. Logic and timing testing
c. Placement and routing
5. Generation of the mask.
6. Chip fabrication, etc.
Page Number: 4/101
Typical Development Phases for One 32-bit Microprocessor on a VLSI Chip
(or about the development of DARPA's 32-bit RISC MIPS processors in GaAs and silicon)
1. Announcement of project requirements (on 1.1.1984.)
a. Type of the architecture (SU-MIPS)
b. Maximal on-chip transistor count (30K)
c. Detailed specification of the assembly language (Core-MIPS)
d. A set of benchmark programs typical of the end-user application (13)
Three competitors selected by 12.13.1984.
a. McDonnell Douglas
b. CDC + TI
c. RCA (Purdue + TriQuint)
Page Number: 5/101
2. In-house research by the three competitors (till 12.31.1985.)
a. Generation of several candidate architectures under 30K transistors.
b. Design of an ENDOT (isp') simulator of all candidate architectures (why isp'?).
c. All candidate architectures are ranked according to the above mentioned benchmark programs.
d. Reasons for high/low ranking of specific candidate architectures are analysed, and the best candidate architectures are modified to become better. The final architecture is determined and "frozen" after several iterations.
Detailed RTL design is completed, and it is proven that the total transistor count is below 30K.
Page Number: 6/101
3. Decision-making at the sponsor side (by 1.1.1986.)
a. Final architectures of all competitors are ranked (using the isp' simulators and the initially provided benchmarks).
b. A subset of competitors is selected for further financing; the others are offered the option of staying in the competition with their own financing.
c. All those that stay in competition are shown all reports generated (by others) till that point.
Page Number: 7/101
4. In-house development by the three competitors (till 12.31.1986.)
a. Improvements are added, after the solutions of the competition are reviewed, and their impact is verified with isp’ simulation.
b. The architecture is frozen, forever.
c. The RTL design is redone and frozen.
d. The appropriate semi-custom standard-cell family is selected, and the gate level design is completed. The standard-cell family choices, in the project which is the subject of this presentation:
The 1 micron E/D-MESFET GaAs
The 1.25 micron SOS-CMOS Si
e. The completed gate level (GTL) design contains only the elements of the cells from the selected family (which includes the input, output, and input/output pads).
Page Number: 8/101
f. The gate level design is entered into a computer, using one of the following methods:
Graphic entry
HDL based entry
Logic equation entry
State machine entry
Direct entry of the net-list, using a text editor
Except in the last case, the net-list (needed for further work) is obtained using the appropriate translator.
g. The net-list is tested (logic and timing), using an appropriate testing program (LOGSIM). If errors are found, the work iterates back, as needed.
h. The net-list is treated by an appropriate placement and routing program (MP2D). No timing errors (guaranteed) after the chip is fabricated! Logic errors are possible after the chip is fabricated. The two major output files:
Artwork file for visual analysis (for printer or plotter)
Fab file (for shipment to a chip foundry, by regular mail or email)
At the chip foundry, the fab file is analysed, and each standard cell is substituted with its full-custom equivalent (details are typically confidential).
Page Number: 9/101
5. Further narrowing down of the sponsored competition, and widening up of the support technology (by 1.1.1987.)
a. Only a subset of the sponsored competition is given further support for fabrication of a prototype at a lower-than-nominal speed.
b. More funding made available for R&D in both semiconductor and packaging technologies.
c. More funding made available for the Core-MIPS translators (for the MC680x0 and the 1750A assembly languages) and compilers (for ADA and C).
Page Number: 10/101
6. Prototype fabrication (by 12.31.1987.)
7. Zero series at a still-lower-than-nominal speed (by 12.31.1988.)
8. Commercial series at the nominal speed (by 12.31.1989.)
9. The US epilogue!
10. The rest-of-the-world epilogue!
Page Number: 11/101
The ENDOT Package by TDT
1. First, the appropriate files are formed. In the most general case:
a. One or more .isp (isp') files (different names; same extension)
b. One .t (topology) file (trivial if one .isp file; complex if many .isp files)
c. One .m (meta-micro) file (one jumbo case statement)
d. One .i file (information related to linking and loading)
e. One or more .b (benchmark) files (any extension allowed)
Only this, and nothing more! [Poe66]
2. Second, the formed files are treated with appropriate tools:
a. Hardware tools
b. Software tools
c. Postprocessing and utility tools
Finally, the simulator is completed.
3. Third, the simulator is run, and the statistics about the analyzed architecture(s) are collected.
4. Fourth, if needed, a silicon compiler is run, etc.
Page Number: 12/101
ENDOT
(1) Hardware Tools
(1.1) ISP' Language
(1.2) ISP' Compiler - ic
(1.3) Topology Language
(1.4) Ecologist - ec
(1.5) Simulation Command Language
(1.6) Simulator - n2
(2) Software Tools
(2.1) Meta-assembler - micro
(2.2) Meta-loader - the linker/loader
(2.2.1) Interpreter - inter
(2.2.2) Allocator - cater
(2.3) Minor programs
(2.3.1) mdump
(2.3.2) merge
(2.3.3) mas = micro + cater
(2.3.4) mkmem
(3) Postprocessing & Utility Tools
(3.1) Statements counter - coverage
(3.2) General purpose post-processor - gpp
(3.3) N.2 help utility - nhelp
(3.4) Build utility - build
(3.5) VHDL translator - icv
Page Number: 13/101
THE N.2 DESIGN PROCESS
Step 1: Idea!!!
Step 2: Hardware (and Software) design
Step 3: Simulation
Step 4: Analysis
Step 5: IF design <> ok THEN GOTO Step 2
Step 6: End
With N.2 your design iterations become painless!!!
Page Number: 14/101
HARDWARE TOOLS
ISP' language
Purpose: DESCRIPTION OF THE HARDWARE SYSTEMS
ISP' program:
(1) Declaration section
(2) Behavior section
Page Number: 15/101
Declaration section:
- CONTAINS STRUCTURE DECLARATIONS.
- STRUCTURES: ALL ISP' NAMED OBJECTS.
- STRUCTURE TYPES: (1) MACRO (2) PORT (3) STATE (4) MEMORY (5) FORMAT (6) QUEUE
MACRO subsection: convenient, easily remembered names given to objects.
PORT subsection: names which are used for communication with the outside world.
STATE subsection: internal names of the ISP' model that can store information.
MEMORY subsection: same as a state, except that memory can be initialized.
FORMAT subsection: convenient names for inconvenient names; typically subranges of states.
QUEUE subsection: names which are used for synchronization with the outside world.
Page Number: 16/101
Behavior section:
- CONTAINS ONE OR MORE PROCESSES.
- PROCESS: (1) PROCESS DECLARATION (2) PROCESS BODY
- PROCESS BODY: SET OF ISP' STATEMENTS.
- ISP' STATEMENTS: A PROCESS EXECUTES ALL ITS INDEPENDENT STATEMENTS CONCURRENTLY.
- next AND delay STATEMENTS: CAN BE USED TO FORCE SEQUENTIAL EXECUTION WITHIN A PROCESS.
- main: OPERATES IN A CONTINUOUS LOOP.
- when: WAITS FOR AN EVENT.
- procedure: SAME AS A SUBROUTINE IN A HLL; main process INVOKES a procedure.
- function: SAME AS A FUNCTION IN A HLL.
Page Number: 17/101
Example: “wave.isp”
port
    CK 'output;

main CYCLE :=
(
    CK = 0;
    delay(50);
    CK = 1;
    delay(50);
)
Figure 3.1. File wave.isp with the description of a clock generator in the ISP’ language.
Page Number: 18/101
File “cntr.isp”
port
    CK 'input,
    Q<4> 'output;

state
    COUNT<4>;

when EDGE(CK:lead) :=
(
    Q = COUNT + 1;
    COUNT = COUNT + 1;
)
Figure 3.2. File cntr.isp with the description of a clocked counter in the ISP’ language.
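As an illustration only, the behavior described by wave.isp and cntr.isp can be mimicked in a few lines of Python (not ISP'). The function names `clock_wave` and `simulate_counter` are hypothetical. The sketch assumes that COUNT starts at 0 and that both assignments in the `when` block read the old value of COUNT, since independent ISP' statements execute concurrently.

```python
def clock_wave(periods, half_period=50):
    """Yield (time, CK) pairs: CK = 0 for half a period, then CK = 1."""
    t = 0
    for _ in range(periods):
        yield t, 0
        t += half_period
        yield t, 1
        t += half_period


def simulate_counter(periods, width=4):
    """Return the Q values produced on each leading (0 -> 1) edge of CK."""
    mask = (1 << width) - 1
    count = 0            # assumed initial state of COUNT
    prev_ck = 0
    q_trace = []
    for _, ck in clock_wave(periods):
        if prev_ck == 0 and ck == 1:         # leading edge of CK
            new_value = (count + 1) & mask   # both statements read old COUNT
            q_trace.append(new_value)        # Q = COUNT + 1
            count = new_value                # COUNT = COUNT + 1
        prev_ck = ck
    return q_trace
```

With a 4-bit counter, the sixteenth leading edge wraps Q back to 0, which is what the `<4>` width in the declarations implies.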
Page Number: 19/101
ic - The ISP' Compiler
Purpose: COMPILES ".isp" SOURCE FILES INTO ".sim" OBJECT FILES
- input: ".isp" file
- output: ".sim" file
wave.isp ---> ic ---> wave.sim
cntr.isp ---> ic ---> cntr.sim
Page Number: 20/101
Topology Language
Purpose: DESCRIBES LINKS BETWEEN THE ".sim" FILES
Topology program:
(1) SIGNAL SECTION
(2) PROCESSOR SECTION
(3) MACRO SECTION
(4) COMPOSITE SECTION
(5) INCLUDE SECTION
- SIGNAL SECTION: IF EXISTS, CONTAINS A SET OF SIGNAL DECLARATIONS
- SIGNAL DECLARATIONS: signal_name [<width>][,signal declarations]
Page Number: 21/101
- PROCESSOR SECTION: CONTAINS A PROCESSOR DECLARATION.
- PROCESSOR DECLARATION:
processor_name = "filename.sim"
[time delay = integer;]
[connections signal_connections;]
[initial memory_name = l.out;]
- MACRO SECTION: USER'S CONVENIENT NAMES FOR TOPOLOGY OBJECTS.
- COMPOSITE SECTION: THIS SECTION MAY CONTAIN A SET OF TOPOLOGY LANGUAGE DECLARATIONS IN THE FOLLOWING FORMAT:
begin declaration {declaration} end
- INCLUDE SECTION: SIMPLE INCLUSION OF A FILE WHICH CONTAINS TOPOLOGY LANGUAGE DECLARATIONS.
Page Number: 22/101
File “clcnt.t”
signal
    CLOCK,
    BUS<4>;

processor CLK = "wave.sim";
    time delay = 10;
    connections
        CK = CLOCK;

processor CNT = "cntr.sim";
    connections
        CK = CLOCK,
        Q = BUS;
Figure 3.3. File clcnt.t with the topology language description of the connection between the clock generator and the clock counter, described in the wave.isp and cntr.isp files, respectively.
Page Number: 23/101
ec - The Ecologist
Purpose: COMPILES ".t" SOURCE FILES INTO ".e00" FILES
- explicit input: ".t" file
- implicit input: ".sim" file(s)
- optional implicit input: "l.out" file (derived by the software tools)
- output: ".e00" file (object file)
clcnt.t  ------->|
wave.sim ------->|---> ec ---> clcnt.e00
cntr.sim ------->|
[l.out   ------->]
Page Number: 24/101
n2 - The Simulator
Purpose: SIMULATION OF THE DESCRIBED HARDWARE SYSTEM.
- input: ".sim" & ".e00" files
- optional input: "l.out" file (derived by the software tools)
- output: if exists, ".txt" file
wave.sim  ------->|
cntr.sim  ------->|---> n2 [---> clcnt.txt]
clcnt.e00 ------->|
[l.out    ------->]
Page Number: 25/101
Simulation Command Language
Purpose: CONTROLLING THE FLOW OF SIMULATION
Some basic simulator commands:
- run: STARTS OR RESUMES THE SIMULATION.
- quit: EXIT THE SIMULATOR.
- time: QUERIES THE SIMULATION "CLOCK" TO OBTAIN THE ELAPSED UNITS OF SIMULATION TIME.
- examine structures: QUERIES THE CONTENTS OF THE STRUCTURES.
- help keyword: PROVIDES AN ON-LINE REFERENCE.
- deposite value structure: SETS THE CONTENTS OF THE STRUCTURE TO THE VALUE FIELD.
- monitor structures & alert structures: PROVIDE A VARIETY OF CAPABILITIES FOR GETTING INFORMATION DURING SIMULATION.
Page Number: 26/101
Installation of ENDOT package on systems running SCO UNIX
1. Login as root
2. cd /usr
3. tar xv n2.tar.Z (extract)
4. uncompress -v n2.tar.Z
5. tar xvf n2.tar (extract)
6. rm n2.tar
7. cd n2
8. tar xvf nmpc.uof
9. cp nmpc.uof /usr/USERNAME
Sequence of operations for simulation of the clocked counter
1. vi wave.isp
2. vi cntr.isp
3. ic wave.isp
4. ic cntr.isp
5. vi clcnt.t
6. ec -h clcnt.t
7. n2 -s clcnt.txt clcnt.e00
Page Number: 27/101
SOFTWARE TOOLS
metaMicro
Purpose: ASSEMBLING AN ASSEMBLER PROGRAM.
- input: METAMICRO ASSEMBLER SOURCE FILE AND ASSEMBLER PROGRAM
- output: ".n" FILE
arch.m    --->|
              |---> micro ---> arch.n
program.m --->|
- arch.m: CONTAINS DEFINITION OF THE ASSEMBLER INSTRUCTIONS AND THE Begin-end Section:
begin
include program.m$
end
- program.m: CONTAINS ASSEMBLER PROGRAM
- arch.n: OBJECT FILE.
Page Number: 28/101
inter - The Interpreter
Purpose: DESCRIPTION OF THE INSTRUCTION WORD; ADDRESS RESOLUTION AND RELOCATION.
- input: LINKER/LOADER SOURCE FILE
- output: ".a" FILE
arch.i ---> inter ---> arch.a
- arch.i: CONTAINS DEFINITIONS OF THE INSTRUCTION WORD AND INFORMATION FOR THE ADDRESS RESOLUTION AND RELOCATION.
- arch.a: OBJECT FILE.
Page Number: 29/101
cater - The Allocator
Purpose: LINKING THE ".n" AND ".a" FILES; RESOLVING ADDRESS & ALLOCATION.
- input: ".n" & ".a" files
- output: "l.out" file
- l.out: MEMORY IMAGE FILE
arch.n --->|
           |---> cater ---> l.out
arch.a --->|
Page Number: 30/101
Postprocessing & Utility Tools
coverage - ANALYZES PROCESSOR STATEMENTS BY USAGE, HIGHLIGHTING THE UNEXECUTED STATEMENTS.
gpp - ANALYZES PROCESSOR STRUCTURES BY VALUE, PROVIDING STATISTICAL, GRAPHICAL, OR COMPARATIVE PRESENTATION OF RESULTS.
nhelp - ON-LINE HELP.
build - MANAGING OF THE SOURCE FILES.
icv - TRANSLATING ISP' MODELS INTO VHDL
Page Number: 31/101
The Fura RISC CPU
Word length: 32 bits
Registers: sixteen 32-bit
Execution model: register-to-register
dp = register_read -> ALU_operation -> register_write
Memory access: load & store
Pipelining: delayed branching!!! delayed loading!
Instruction classes:
(1) ALU class
(2) branch class
(3) data memory class
(4) system class
Page Number: 32/101
Instruction cycles:
(1) INSTRUCTION FETCH (IF)
(2) INSTRUCTION DECODING AND EXECUTION (IDX)
(3) DATA LOAD (LD)
A D
i-1:  IF  IDX  LD
i:        IF   IDX  LD
i+1:           IF   IDX  LD
Possible isp' coding window positioning (i+1 is the current instruction):
main := (                main := (
    IF(i+1);                 IF(i+1);
    IDX(i);                  delay(1);
    LD(i-1);                 LD(i);
)                            IDX(i+1);
                         )
main := ( main := ( IF(i+1); delay(1); IDX(i+1); delay(1); LD(i+1); ) )
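To make the overlap of the three instruction cycles concrete, here is a minimal Python sketch that tabulates which instruction occupies which stage in each clock cycle. The name `pipeline_table` is hypothetical, and the sketch assumes an ideal pipeline in which every instruction passes through all three stages with no stalls.

```python
STAGES = ["IF", "IDX", "LD"]


def pipeline_table(n_instructions):
    """Return {cycle: {stage: instruction_index}} for an ideal 3-stage pipeline."""
    table = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset          # instruction k enters IF in cycle k
            table.setdefault(cycle, {})[stage] = instr
    return table


if __name__ == "__main__":
    for cycle, stages in sorted(pipeline_table(3).items()):
        print(cycle, stages)
```

In cycle 2 of a three-instruction run, instruction 0 is in LD, instruction 1 is in IDX, and instruction 2 is in IF, which corresponds to the first coding window shown above (IF of the newest instruction alongside IDX and LD of the two older ones).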
Page Number: 33/101
Instruction format:
Bits:   31..24   23..20   19..16   15..12   11..0
Field:  OP       DST      SRC#1    SRC#2    X

Bits:   31..24   23..20   19..16   15..5    4..0
Field:  OP       DST      SRC#1    X        SIMM

Bits:   31..24   23..20   19..16   15..0
Field:  OP       DST      SRC#1    LIMM
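The three formats share the OP, DST, and SRC#1 positions, so a single field extractor covers all of them. The following Python sketch (the helper names `bits` and `decode` are hypothetical) pulls every field out of a 32-bit instruction word; which of SRC#2, SIMM, or LIMM is meaningful depends on the opcode.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word):
    """Split an instruction word into the fields of the three formats."""
    return {
        "op":   bits(word, 31, 24),
        "dst":  bits(word, 23, 20),
        "src1": bits(word, 19, 16),
        "src2": bits(word, 15, 12),   # register-register format
        "simm": bits(word, 4, 0),     # short (5-bit) immediate format
        "limm": bits(word, 15, 0),    # long (16-bit) immediate format
    }
```

For example, `decode(0x01234005)` yields opcode 1, destination register 2, first source register 3, and a long immediate of 0x4005.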
Page Number: 34/101
ALU Class:
Add                      (a) ADD Rd, Rs1, Rs2
                         (b) ADD Rd, Rs1, imm16
                         (c) ADD Rd, PC, imm16
Subtract                 (a) SUB Rd, Rs1, Rs2
                         (b) SUB Rd, Rs1, imm16
                         (c) SUB Rd, PC, imm16
Move                     (a) MOV Rd, Rs1
                         (b) MOV Rd, imm16
                         (c) MOV Rd, PC
Negate                   (a) NEG Rd, Rs1
Logical Not              (a) LNOT Rd, Rs1
Logical And              (a) LAND Rd, Rs1, Rs2
                         (b) LAND Rd, Rs1, imm16
Logical Or               (a) LOR Rd, Rs1, Rs2
                         (b) LOR Rd, Rs1, imm16
Arithmetic Shift Left    (a) SLA Rd, Rs1, imm5
Arithmetic Shift Right   (a) SRA Rd, Rs1, imm5
Set if Equal             (a) SEQ Rd, Rs1, Rs2
Set if Greater Than      (a) SGT Rd, Rs1, Rs2
Page Number: 35/101
Branch Class:
Branch on True           (a) BT Rd, Rs1
Branch Always            (a) BA Rd
Data Memory Class: - load & store instructions
load:
(1) three cycles: IF, IDX & LD
(2) IDX: register_read - ALU_operation - output_latch_write (address)
(3) LD
Load                     (a) LD Rd, Rs2
store:
(1) two cycles: IF & IDX
(2) IDX: register_read - ALU_operation - output_latch_write (data & data address)
Store                    (a) ST Rd, Rs2
Page Number: 36/101
System instructions:
Noophalt                 (a) NOOPHALT
Idle state of the machine; this instruction may be used for filling slot(s) behind branches and/or loads, or for real-time isp' programming, or to support modular isp' programming.
Page Number: 37/101
Branching in pipelined machines: Interlock mechanism: hw (cisc-mostly) versus sw (risc-mostly)
i
i+1
i+75
Scoreboard branch: hw interlock (clock slow-down)
ALU (arithmetic-logic-unit) suspend RWB (register-write-unit) suspend
Page Number: 38/101
Delayed branch: sw interlock
source code:
i-1  ADD R7, imm32
i    JUMP R1, R2>R3
i+1  MOVE R3, R4
i+2  SUB R5, R6
after code generation:
i-1  ADD R7, imm32
i    JUMP R1+1, R2>R3
i+1  NOOP
i+2  MOVE R3, R4
i+3  SUB R5, R6
after code optimization:
i-1
i    JUMP R1+1, R2>R3
i+1  ADD R7, imm32
i+2  MOVE R3, R4
i+3  SUB R5, R6
Page Number: 39/101
condition: THE MOVED INSTRUCTION
(a) MUST BE EXECUTED (no matter if the branch is taken or not), AND
(b) MUST NOT AFFECT THE BRANCH CONDITION AND/OR THE JUMP TARGET ADDRESS.
parameters:
(a) PIPELINE FILL-IN DEPTH (which is not the pipeline depth minus one!)
(b) BRANCHING-RELATED STATISTICS (branches executed versus branches taken)
(c) BRANCH FILL-IN FUNCTION (local versus global code optimization)
(d) CLOCK SLOW DOWN FUNCTION (in-the-critical-path versus off-the-critical-path)
(e) TECHNOLOGY-RELATED STATISTICS (on-chip versus off-chip delays)
(f) CACHE IMPACT (hit versus miss penalty)
NUMERICAL EXAMPLE: What is the equation for the condition under which hw and sw interlock have the same benchmark execution time (not clock count)?
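One possible way to approach the numerical example is sketched below in Python. This is a deliberately simplified model, not the one from the slides. Every symbol is an assumption: `n` useful instructions, `b_exec` branches executed, `b_taken` branches taken, `d` fill-in depth in slots, `fill` the fraction of slots the optimizer fills with useful instructions, `slowdown` the relative clock stretch caused by interlock hardware, and `t_clk` the nominal clock period. The model assumes the hw interlock stalls `d` cycles only on taken branches, while the sw interlock pays one NOOP for every unfilled slot of every executed branch.

```python
def exec_time_hw(n, b_taken, d, t_clk, slowdown):
    """Assumed hw-interlock time: stalls on taken branches, slower clock."""
    return (n + b_taken * d) * t_clk * (1.0 + slowdown)


def exec_time_sw(n, b_exec, d, t_clk, fill):
    """Assumed sw-interlock time: NOOPs in unfilled slots, nominal clock."""
    return (n + b_exec * d * (1.0 - fill)) * t_clk


def break_even_slowdown(n, b_exec, b_taken, d, fill):
    """Clock slow-down at which both assumed models give the same time."""
    return (n + b_exec * d * (1.0 - fill)) / (n + b_taken * d) - 1.0
```

Under these assumptions, setting the two times equal gives (n + b_taken·d)(1 + slowdown) = n + b_exec·d·(1 − fill). A hw interlock whose clock penalty exceeds the break-even value loses to the delayed branch, and one with a smaller penalty wins.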
Page Number: 40/101
Loading in pipelined machines: Interlock mechanism: hw versus sw i IF IDX LD
i+1 IF IDX
Scoreboard LOAD:
Suspend Bypass
Page Number: 41/101
Delayed LOAD: sw interlock
source code:
i-1  MOVE R3, R4
i    LOAD R7, memory
i+1  ADD R2, R1, R7
after code generation:
i-1  MOVE R3, R4
i    LOAD R7, memory
i+1  NOOP
i+2  ADD R2, R1, R7
after code optimization:
i-1
i    LOAD R7, memory
i+1  MOVE R3, R4
i+2  ADD R2, R1, R7
condition: mutual independence
parameters: technology related, design + organization + architecture related, system software related, and application related.
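The "mutual independence" condition can be illustrated with a small Python sketch of a delay-slot filler. All names (`independent`, `fill_load_slots`, the tuple layout) are hypothetical, and the sketch only considers the instruction immediately preceding the LOAD as a candidate, which is a much narrower search than a real code optimizer would perform.

```python
NOOP = ("NOOP", None, ())


def independent(a, b):
    """True if a and b neither read nor overwrite each other's results."""
    _, a_dst, a_src = a
    _, b_dst, b_src = b
    return (a_dst not in b_src and b_dst not in a_src
            and (a_dst is None or a_dst != b_dst))


def fill_load_slots(program):
    """program: list of (opcode, dest, sources). Returns a hazard-free list."""
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        op, dst, _ = instr
        nxt = program[i + 1] if i + 1 < len(program) else None
        if op == "LOAD" and nxt is not None and dst in nxt[2]:
            prev = out[-2] if len(out) >= 2 else None
            movable = (prev is not None
                       and prev[0] not in ("LOAD", "NOOP")
                       and (len(out) < 3 or out[-3][0] != "LOAD")
                       and independent(prev, instr))
            if movable:
                out[-2], out[-1] = instr, prev   # move prev into the slot
            else:
                out.append(NOOP)                 # nothing safe to move
    return out
```

Applied to the three-instruction example above, the MOVE is moved behind the LOAD. If the MOVE wrote R7 instead, the two would not be independent and a NOOP would be inserted.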
Page Number: 42/101
CURRENT WINDOW
IF  IDX  LD
    IF   IDX  LD
         IF   IDX  LD
MAIN DELAY(1) END
IR=MEMRY[PASTPC] PASTPC=PC PC=PC+1 PASTOP=OP
PC=REG[DST]
i-1: leaves PASTPC, PASTOP (part of PASTIR)
i: leaves PC, OP (part of IR)
i+1: after IF, puts PC+1 into PC; after IDX (when branch), puts REG[dst] into PC;
Page Number: 43/101
Page Number: 44/101
The ".isp" file:
- Macro section
macro
    WORD = 32&,
    BYTE = 8&,
    NIBBLE = 4& ;
- State section
state
    reg[0:15]<WORD>, pc<WORD>, pastpc<WORD>, ir<WORD>, pastop<WORD>,
    ! pastdst<NIBBLE>, pastval<WORD>, hist[0:23]<WORD> !
    ;
- Memory section
memory
    memry[0:0xfff]<WORD> ;
- Format section
format
    op = ir<31:24>,
    dst = ir<23:20>,
    src1 = ir<19:16>,
    src2 = ir<15:12>,
    imm16 = ir<15:0>,
    imm5 = ir<4:0>
Page Number: 45/101
- Main Program
main := (
    pastop = op;
    pastpc = pc;
    pc = pc + 1;
    ir = memry[pastpc];
    hist[pastop] = hist[pastop] + 1;
    delay(1);
    if pastop eql 21
        reg[pastdst] = pastval;
    case op
        0: reg[dst] = reg[src1] + reg[src2]
        instructions 1 to 20
        21: ( pastdst = dst;
              pastval = memry[reg[src2]] )
        22: memry[reg[src2]] = reg[dst]
        23:
    esac;
)
Page Number: 46/101
The complete "case":
! Instruction decode and execution is done here. The "case" statement performs
! the decode - note that the opcode bits are tested as one would expect.
! For each legal opcode, a unique action is specified.
! Only one action is performed, then the bottom of the "main" process is reached,
! and we return to the top of the process.
case op
     0: reg[dst] = reg[src1] + reg[src2]                ! add (reg-reg)
     1: reg[dst] = reg[src1] + imm16 sxt 32             ! add (reg-imm)
     2: reg[dst] = pc + imm16 sxt 32                    ! add (pc-imm) !!
     3: reg[dst] = reg[src1] - reg[src2]                ! sub (reg-reg)
     4: reg[dst] = reg[src1] - imm16 sxt 32             ! sub (reg-imm)
     5: reg[dst] = pc - imm16 sxt 32                    ! sub (pc-imm)
     6: reg[dst] = reg[src1]                            ! mov (reg-reg)
     7: reg[dst] = imm16 sxt 32                         ! mov (reg-imm)
     8: reg[dst] = pc                                   ! mov (pc)
     9: reg[dst] = - reg[src1]                          ! negate
    10: reg[dst] = reg[src1] and reg[src2]              ! and (reg-reg)
    11: reg[dst] = reg[src1] and imm16 sxt 32           ! and (reg-imm)
    12: reg[dst] = reg[src1] or reg[src2]               ! or (reg-reg)
    13: reg[dst] = reg[src1] or imm16 sxt 32            ! or (reg-imm)
    14: reg[dst] = not reg[src1]                        ! not
    15: reg[dst] = reg[src1] *:arith (imm5 ext 32)      ! shift left !!
    16: reg[dst] = reg[src1] /:arith (imm5 ext 32)      ! shift right !!
    17: if reg[src1] eql reg[src2]                      ! set if equal
            reg[dst] = - 1
        else
            reg[dst] = 0
    18: if reg[src1] gtr reg[src2]                      ! set if greater
            reg[dst] = - 1
        else
            reg[dst] = 0
    19: if reg[src1] eql -1                             ! branch on true
            pc = reg[dst]
    20: pc = reg[dst]                                   ! branch always
    21: ( pastdst = dst;                                ! load
          pastval = memry[reg[src2]] )
    22: memry[reg[src2]] = reg[dst]                     ! store
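For readers who do not have the ENDOT tools at hand, the ALU part of the "case" statement can be approximated in Python. This is only a sketch under stated assumptions: `execute_alu`, `sxt16`, and `to_signed` are hypothetical names, registers are modeled as unsigned 32-bit integers, `gtr` is assumed to be a signed comparison, and the shift opcodes 15 and 16 are left out because the exact semantics of the `*:arith` and `/:arith` operators are not given here.

```python
MASK32 = 0xFFFFFFFF


def sxt16(value):
    """Sign-extend a 16-bit immediate (the 'imm16 sxt 32' of the listing)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def to_signed(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def execute_alu(op, reg, dst, src1, src2, imm16, pc):
    """Execute opcodes 0-14, 17 and 18; the result is written to reg[dst]."""
    a, b, imm = reg[src1], reg[src2], sxt16(imm16)
    table = {
        0: lambda: a + b,   1: lambda: a + imm,  2: lambda: pc + imm,
        3: lambda: a - b,   4: lambda: a - imm,  5: lambda: pc - imm,
        6: lambda: a,       7: lambda: imm,      8: lambda: pc,
        9: lambda: -a,
        10: lambda: a & b,  11: lambda: a & imm,
        12: lambda: a | b,  13: lambda: a | imm,
        14: lambda: ~a,
        17: lambda: -1 if a == b else 0,                        # true is -1
        18: lambda: -1 if to_signed(a) > to_signed(b) else 0,   # assumed signed
    }
    reg[dst] = table[op]() & MASK32
```

Note that "true" is represented as -1 (all ones), which is exactly what the branch-on-true opcode 19 tests for.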
Page Number: 47/101
The ".m" file:
- Instr Section
instr
    I<32>$
- Format Section
format
    op = I<31:24>,
    dst = I<23:20>,
    src1 = I<19:16>,
    src2 = I<15:12>,
    imm16 = I<15:0>,
    imm5 = I<4:0>$
- Macro section
macro
    r0 = 0&,
    r1 = 1&,
    ...
    r15 = 15&,
    addr(d,s1,s2) = op=0; dst=d; src1=s1; src2=s2$&,
    instructions 1 to 22
    noophalt = op=23$&$
- Begin-end section
begin
    include ee666.test$
end
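What the meta-assembler macros do, in effect, is pack field values into a 32-bit instruction word. A minimal Python sketch of that packing follows. The function names mirror the macro names, but the code itself is hypothetical, and the opcode 1 used for `addri` is an assumption taken from the "add (reg-imm)" entry of the case statement, since the macro definitions for instructions 1 to 22 are not reproduced in the file above.

```python
def encode(op, dst=0, src1=0, src2=0, imm16=None):
    """Pack fields into a 32-bit word: op<31:24>, dst<23:20>, src1<19:16>,
    then either src2<15:12> or imm16<15:0>."""
    word = ((op & 0xFF) << 24) | ((dst & 0xF) << 20) | ((src1 & 0xF) << 16)
    if imm16 is None:
        word |= (src2 & 0xF) << 12
    else:
        word |= imm16 & 0xFFFF
    return word


def addr(d, s1, s2):
    """add (reg-reg), opcode 0, as in the macro shown above."""
    return encode(0, d, s1, s2)


def addri(d, s1, imm):
    """add (reg-imm); opcode 1 is assumed from the case statement."""
    return encode(1, d, s1, imm16=imm)


def noophalt():
    """noophalt, opcode 23, as in the macro shown above."""
    return encode(23)
```

For instance, `addr(8, 0, 2)` produces the word 0x00802000.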
Page Number: 48/101
The ".i" file:
- Instr Section
instr
    I<32>$
- Format Section
format
    op = I<31:24>,
    dst = I<23:20>,
    src1 = I<19:16>,
    src2 = I<15:12>,
    imm16 = I<15:0>,
    imm5 = I<4:0>$
- Space section
space
    <0:4095>$
- Transfer section
transfer
    {new}
- Mode section
mode
    case op eql 7
        imm16~address$
        break$
    esac,
    default:
        imm16~imm16$
Page Number: 49/101
The ".t" file
processor cpu = "ee666.sim";
time delay = 100ns;
initial memry = l.out;
Page Number: 50/101
The ".b" file:
Sample assembler language program that uses the instructionsfor the RISC-like processor of the ee666 (Advanced Computer Systems),Purdue University, Spring Semester 1987.
Filename: ee666.test
movi(r0,100)
subri(r1,10,100)
movr(r2,r1)
seq(r3,r1,r2)
movi(r4,11)
movi(r5,12)
movi(r6,13)
bt(r4,r3)
ba(r5)
movi(r1,10)
11: addri(r1,r1,1)
addri(r1,r1,1)
12: sgt(r7,r2,r1)
bt(r6,r7)
addr(r8,r0,r2)
subri(r9,r1,10)
st(r9,r8)
ba(r5)
addri(r2,r2,2)
13: subri(r8,r8,2)
ld(r8,r8)
movr(r10,r8)
addr(r10,r10,r8)
sla(r10,r10,2)
halt
Page Number: 51/101
Sample Fura RISC VMS Session:
1. set def [.N2]
2. copy VL$A:[N2.E666]*.* *.*
3. @VL$A:[N2]login
4. n2 -s script.txt ee666.e00
If you want to test your own CPU:
1. @VL$A:[N2]login
2. edit cpuname.isp
3. ic cpuname.isp
4. edit cpuname.m
5. edit program.m
6. micro cpuname.m
7. edit cpuname.i
8. inter cpuname.i
9. cater cpuname.a cpuname.n
10. edit cpuname.t
11. ec -b cpuname.t
12. n2 -s script.txt cpuname.e00
Page Number: 52/101
Papers from the Open Literature:
1) Rose, C. W., Ordy, G. M., Drongowski, P. J., "N.mpc: A Study in University-Industry Technology Transfer," IEEE Design & Test of Computers, February 1984, pp. 44-56.
2) Rose, C. W., "System Design Tools - A Paradigm Shift," Endot Corporation Internal Report, 1986.
3) Gay, F., "Functional Simulation Fuels System Design," VLSI Design Technology.
4) Kong, S., Wood, D., Gibson, G., Katz, R., Patterson, D., "Design Methodology of a VLSI Multiprocessor Workstation," VLSI Systems, February 1987.
5) Bozanic, D., Fura, D., Milutinovic, V., "Simulation of a Simple RISC Processor," Application Note, No. D#001/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.
6) Petkovic, Z., Milutinovic, V., "Simulation of the Intel i860 RISC Processor," Application Note, No. D#003/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1994.
7) Milicev, D., Petkovic, Z., Milutinovic, V., "Simulation Study of Uniprocessor Cache Memories," Application Note, No. D#004/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1994.
8) Tomasevic, M., Milutinovic, V., "Using N.2 in a Simulation Study of Snoopy Cache Coherence Protocols for Shared Memory Multiprocessor System," Application Note, No. D#002/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.
Page Number: 53/101
WORKLOAD CHARACTERIZATION
Important Reference: Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, Englewood Cliffs, New Jersey, U.S.A., 1978.
Introduction:
The workload of a computer system has been defined as the set of all inputs (programs, data, commands, etc.) that the system receives from its environment.
In measurement experiments, the system is driven by a model of the workload which is just a sample of the real production workload. The major question is how representative this sample is.
Other important characteristics of a workload are:
a) simplicity of construction,
b) usage cost,
c) reproducibility,
d) compactness, and
e) system independence.
Types of Workload Models:
1. Natural workload model: A sample job stream taken from a production workload, and used to drive the system at the very time it was produced.
2. Artificial workload model: All other cases.
2a. Non-executable: Defined via statistical distributions of relevant parameters.
Usage: In analytical studies.
Typical forms: Probabilities of various instructions (instruction mixes), memory accesses, procedure nesting depths, etc.
Relevant issues: Mean values, variances, correlations, autocorrelations, etc.
Standard instruction mixes: Flynn (MLL), Knuth (HLL), etc.
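As a small illustration of a non-executable workload model, the Python sketch below derives an instruction mix (the probability of each instruction) from an execution trace, together with the mean and variance of any sampled parameter. The function names and the sample trace are hypothetical; in the simulator discussed earlier, counters such as the `hist` array would supply the raw counts.

```python
from collections import Counter


def instruction_mix(trace):
    """Return {opcode: relative frequency} for a list of executed opcodes."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def mean_and_variance(samples):
    """Population mean and variance of a list of numeric samples."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance
```

For a trace such as `["ADD", "LD", "ADD", "BT"]`, the mix assigns ADD a probability of 0.5 and LD and BT 0.25 each.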
Page Number: 54/101
2b. Executable: Defined via one or more programs.
Usage: In empirical studies.
Typical forms: Synthetic jobs (parametric programs) and benchmarks (semantic programs).
Relevant issues: application orientation, etc.
Standard ones: See the PC magazines, etc.
Synthetic job approaches:
Buchholz (fixed flowchart with variable parameters)
Kernighan + Hamilton (similar but more sophisticated)
Archibald + Baer (the most widely cited computer architecture paper in the 80's)
Benchmark types:
Extracted
Created
Standard (application dependent)
Page Number: 55/101
The DARPA/Stanford benchmarks:
The DARPA/Stanford Benchmark Package consists of thirteen PASCAL programs:
1) ackp.p
2) bubblesortp.p
3) fftp.p
4) fibp.p
5) intmmp.p
6) permp.p
7) puzzlep.p
8) eightqueenp.p
9) quickp.p
10) realmmp.p
11) sievep.p
12) towresp.p
13) treep.p
These programs are located on the ed machine, and the full path name of their directory is: /a/mips/bench
Page Number: 56/101
An Introduction to VLSI Processor Architecture for GaAs
This research has been sponsored by RCA and conducted in collaboration with the RCA Advanced Technology Laboratories, Moorestown, New Jersey.
Page Number: 57/101
Advantages
• For the same power consumption, at least half an order of magnitude faster than Silicon.
• Efficient integration of electronics and optics.
• Tolerant of temperature variations. Operating range: [-200°C, +200°C].
• Radiation hard. Several orders of magnitude more than Silicon: [>100 million RADs].
Page Number: 58/101
Disadvantages:
• High density of wafer dislocations -> Low yield -> Small chip size -> Low transistor count.
• Noise margin not as good as in Silicon. Area has to be traded in for higher reliability.
• At least two orders of magnitude more expensive than Silicon.
• Currently having problems with high-speed test equipment.
Page Number: 59/101
Basic Differences of Relevance for Microprocessor Architecture
• Small area and low transistor count (* in general, implications of this fact are dependent on the speed of the technology *)
• High ratio of off-chip and on-chip delays (* consequently, off-chip memory access is much longer than on-chip memory access *)
• Limited fan-in and fan-out (?) (* temporary differences *)
• High demand on efficient fault-tolerance (?) (* to improve the yield for bigger chips *)
Page Number: 60/101
A Brief Look Into the GaAs IC Design
• Bipolar (TI + CDC)
• JFET (McDAC)
• GaAs MESFET Logic Families (TriQuint + RCA)
D-MESFET (* Depletion Mode *)
E-MESFET (* Enhancement Mode *)
Page Number: 61/101
                                              Speed        Dissipation   Complexity
                                              (ns)         (W)           (K transistors)
Arithmetic
  32-bit adder (BFL D-MESFET)                 2.9 total    1.2           2.5
  16×16-bit multiplier (DCFL E/D MESFET)      10.5 total   1.0           10.0
Control
  1K gate array (STL HBT)                     0.4/gate     1.0           6.0
  2K gate array (DCFL E/D MESFET)             0.08/gate    0.4           8.2
Memory
  4Kbit SRAM (DCFL E/D MODFET)                2.0 total    1.6           26.9
  16K SRAM (DCFL E/D MESFET)                  4.1 total    2.5           102.3
Figure 7.1. Typical (conservative) data for speed, dissipation, and complexity of digital GaAs chips.
Page Number: 62/101
Figure 7.2. Comparison (conservative) of GaAs and silicon, in terms of complexity and speed of the chips (assuming equal dissipation). Symbols T and R refer to the transistors and the resistors, respectively. Data on silicon ECL technology complexity includes the transistor count increased for the resistor count.
                                   GaAs          Silicon       Silicon       Silicon        Silicon
                                   (1 µm         (2 µm         (2 µm         (1.25 µm       (2 µm
                                   E/D-MESFET)   NMOS)         CMOS)         NMOS)          ECL)
Complexity
  On-chip transistor count         40K           200K          200K          400K           40K (T or R)
Speed
  Gate delay (minimal fan-out)     50-150 ps     1-3 ns        800-1000 ps   500-700 ps     150-200 ps
  On-chip memory access
  (32×32 bit capacity)             0.5-2.0 ns    20-40 ns      10-20 ns      5-10 ns        2-3 ns
  Off-chip, on-package memory
  access (256×32 bits)             4-8 ns        40-80 ns      30-40 ns      20-30 ns       6-10 ns
  Off-package memory access
  (1K×32 bits)                     10-50 ns      100-200 ns    60-100 ns     40-80 ns       20-80 ns
Page Number: 63/101
Figure 7.3. Comparison of GaAs and silicon, in the case of actual 32-bit microprocessor implementations (courtesy of RCA). The impossibility of implementing “phantom” logic (wired-OR) is a consequence of the low noise immunity of GaAs circuits (200 mV).
                                 GaAs E/D-DCFL      Silicon SOS-CMOS
Minimal geometry                 1 µm               1.25 µm
Levels of metal                  2                  2
Gate delay                       250 ps             1.25 ns
Maximum fan-in                   5 NOR, 2 AND       4 NOR, 4 NAND
Maximum fan-out                  4                  20
Noise immunity level             220 mV             1.5 V
Average gate transistor count    4.5                7
On-chip transistor count         25 000             100 000-150 000
Page Number: 64/101
Figure 7.4. Processor organization based on the BS (bit-slice) components. The meaning of symbols is as follows: IN—input, BUFF—buffer, MUX—multiplexer, DEC—decoder, L—latch, OUT—output. The remaining symbols are standard.
Page Number: 65/101
Figure 7.5. Processor organization based on the FS (function slice) components: IM—instruction memory, I_D_U—instruction decode unit, DM_I/O_U—data memory input/output unit, DM—data memory.
Page Number: 66/101
Only a single-chip reduced architecture makes sense!
In a silicon environment, we can argue “RISC” or “CISC”.
In a GaAs environment, there is only one choice: “RISC”.
However, the RISC concept has to be significantly modified for efficient GaAs utilization.
Implication of the High Off/On Ratio on the Choice of Processor Design Philosophy
Page Number: 67/101
Assume a 10:1 advantage in on-chip switching speed, but only a 3:1 advantage in off-chip/off-package memory access.
Will the microprocessor be 10 times faster?
Or only 3 times faster?
Why the Information Bandwidth Problem?
The Reduced Philosophy:
•Large register file: most or all on-chip memory is used for the register file
•On-chip instruction cache is out of the question
•Instruction fetch must be from an off-chip environment
The Information Bandwidth Problem of GaAs
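The question posed on this slide (10 times faster, or only 3 times faster?) can be explored with a small sketch. It assumes a simple Amdahl-style model that is not part of the original slides: a fraction of the silicon execution time is spent waiting on off-chip memory and gains only 3:1, while the rest is on-chip switching and gains 10:1. The function name and the sample fractions are hypothetical.

```python
def overall_speedup(off_chip_fraction, on_chip_gain=10.0, off_chip_gain=3.0):
    """Estimate the overall speed-up of a GaAs processor over silicon.

    off_chip_fraction is the (assumed) share of the silicon execution
    time spent on off-chip/off-package memory access; the remainder is
    treated as on-chip switching time.
    """
    on_chip_fraction = 1.0 - off_chip_fraction
    # New execution time, normalized to a silicon execution time of 1.0.
    new_time = on_chip_fraction / on_chip_gain + off_chip_fraction / off_chip_gain
    return 1.0 / new_time


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"off-chip share {fraction:.2f}: speed-up {overall_speedup(fraction):.2f}x")
```

Under this model the result always lies between 3 and 10, and it falls toward 3 quickly as the off-chip share grows, which is why instruction fetch bandwidth dominates the design.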
Page Number: 68/101
• General purpose processing in defense and aerospace, and execution of compiled HLL code.
• General purpose processing and substitution of current CISC microprocessors.*
• Dedicated special-purpose applications in digital control and signal processing.*
• Multiprocessing of the SIMD/MIMD type, for numeric and symbolic applications.
Applications for GaAs Microprocessor
Page Number: 69/101
On-chip issues:
•Register file
•ALU
•Pipeline organization
•Instruction set
Off-chip issues:
•Cache
•Virtual memory management
•Coprocessing
•Multiprocessing
System software issues:
•Compilation
•Code optimization
Which Design Issues Are Affected?
Page Number: 70/101
Figure 7.6. Comparison of GaAs and silicon. Symbols CL and RC refer to the basic adder types (carry lookahead and ripple carry). Symbol B refers to the word size. (a) Complexity comparison. Symbol C[tc] refers to complexity, expressed in transistor count. (b) Speed comparison. Symbol D[ns] refers to propagation delay through the adder, expressed in nanoseconds. In the case of silicon technology, the CL adder is faster when the word size exceeds four bits (or a somewhat lower number, depending on the diagram in question). In the case of GaAs technology, the RC adder is faster for word sizes up to n bits (the actual value of n depends on the actual GaAs technology used).
Adder Design
Page Number: 71/101
Figure 7.7. Comparison of GaAs and silicon technologies: an example of the bit-serial adder. All symbols have their standard meanings.
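As a behavioral illustration of the bit-serial adder named in Figure 7.7, the sketch below models the usual structure: a single full adder plus a one-bit carry register, fed one bit pair per clock, least significant bit first. The function name and the default width are assumptions made for this example; the circuit in the figure itself is not reproduced here.

```python
def bit_serial_add(x, y, width=32):
    """Add two unsigned integers one bit per clock, LSB first.

    Returns (sum modulo 2**width, carry out of the top bit).
    """
    carry = 0   # models the one-bit carry flip-flop
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        # Full adder: sum bit and next carry.
        s = a ^ b ^ carry
        carry = (a & b) | (carry & (a ^ b))
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(bit_serial_add(5, 9, width=8))
    print(bit_serial_add(255, 1, width=8))
```

The hardware cost is one full adder regardless of word size, at the price of one clock per bit, which suits a technology with a small transistor budget and a very short gate delay.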
Page Number: 72/101
Figure 7.8. Comparison of GaAs and silicon technologies: design of the register cell: (a) an example of the register cell frequently used in the silicon technology; (b) an example of the register cell frequently used in the GaAs microprocessors. Symbol BL refers to the unique bit line in the four-transistor cell. Symbols A BUS and B BUS refer to the double bit lines in the seven-transistor cell. Symbol F refers to the refresh input. All other symbols have their standard meanings.
Register File Design
Page Number: 73/101
Pipeline design
Figure 7.9. Comparison of GaAs and silicon technologies: pipeline design—a possible design error: (a) two-stage pipeline typical of some silicon microprocessors; (b) the same two-stage pipeline when the off-chip delays are three times longer than on-chip delays (the off-chip delays are the same as in the silicon version). Symbols IF and DP refer to the instruction fetch and the ALU cycle (datapath). Symbol T refers to time.
Page Number: 74/101
Figure 7.10. Comparison of GaAs and silicon technologies: pipeline design, possible solutions: (a1) timing diagrams of a pipeline based on the IM (interleaved memory) or the MP (memory pipelining); (a2) a system based on the IM approach; (a3) a system based on the MP approach; (b) timing diagram of the pipeline based on the IP (instruction packing) approach. Symbols P, M, and MM refer to the processor, the memory, and the memory module. The other symbols were defined earlier.
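The effect of the IM (interleaved memory) approach can be sketched with a small cycle count. The model below is an assumption made for illustration, not the timing of the figure: instructions are fetched strictly sequentially, instruction k lives in memory module k mod M, each access occupies its module for a fixed number of processor cycles, and at most one fetch is issued per cycle.

```python
def fetch_cycles(num_instructions, memory_latency, num_modules):
    """Count cycles needed to fetch a sequential instruction stream.

    memory_latency: cycles one access occupies a memory module.
    num_modules:    number of interleaved memory modules (MM).
    """
    module_free_at = [0] * num_modules  # cycle at which each module is idle again
    issue_time = 0                      # earliest cycle for the next fetch request
    finish_time = 0
    for k in range(num_instructions):
        module = k % num_modules
        start = max(issue_time, module_free_at[module])
        finish_time = start + memory_latency
        module_free_at[module] = finish_time
        issue_time = start + 1          # one request per processor cycle
    return finish_time


if __name__ == "__main__":
    print(fetch_cycles(9, memory_latency=3, num_modules=1))
    print(fetch_cycles(9, memory_latency=3, num_modules=3))
```

With a latency of three cycles, a single module delivers one instruction every three cycles, while three interleaved modules deliver one per cycle after the initial latency.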
Page Number: 75/101
32-bit GaAs MICROPROCESSORS
Goals and project requirements:
•200 MHz clock rate
•32-bit parallel data path
•16 general purpose registers
•Reduced Instruction Set Computer (RISC) architecture
•24-bit word addressing
•Virtual memory addressing
•Up to four coprocessors connected to the CPU (coprocessors can be of any type and all different)
References:
1. Milutinović, V. (editor), “Special Issue on GaAs Microprocessor Technology,” IEEE Computer, October 1986.
2. Helbig, W., Milutinović, V., “The RCA DCFL E/D-MESFET GaAs Experimental RISC Machine,” IEEE Transactions on Computers, December 1988.
Page Number: 76/101
3. The outputs of two circuits cannot be tied together:
a. One cannot utilize phantom logic on the chip to implement functions like WIRED-OR (all outputs active). Circuits have a low “operating noise margin”.
b. One cannot use three-state logic on the chip to implement functions like MULTIPLE-SOURCE-BUS (only one output active). Circuits have no “off-state”.
c. Actually, if one insists on having a MULTIPLE-SOURCE-BUS on the chip, one can have it at the cost of only one active load and the need to precharge (both mean “constraints” and “slowdown” on the architecture level).
d. Fortunately, the logic function AND-OR is exactly what is needed to create a multiplexer, a perfect replacement for a bus.
Page Number: 77/101
MUX
Page Number: 78/101
Figure 7.11. The technological problems that arise from the usage of GaAs technology: (a) an example of the fan-out tree, which provides a fan-out of four, using logic elements with the fan-out of two; (b) an example of the logic element that performs a two-to-one one-bit multiplexing. Symbols a and b refer to data inputs. Symbol c refers to the control input. Symbol o refers to data output.
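Both parts of Figure 7.11 can be sketched in a few lines. The multiplexer is modeled as an AND-OR function of data inputs a and b and control input c; the polarity (c = 0 selects a) is an assumption, since the figure itself is not reproduced here. The second helper, also hypothetical, counts how many levels of limited-fan-out elements a fan-out tree needs.

```python
def mux2(a, b, c):
    """Two-to-one one-bit multiplexer built as AND-OR logic.

    Assumed polarity: c == 0 passes a, c == 1 passes b.
    """
    return (a & (1 - c)) | (b & c)


def fanout_tree_levels(required_fanout, element_fanout=2):
    """Levels of buffering needed to drive required_fanout loads
    when each logic element can drive only element_fanout inputs."""
    levels = 0
    reach = 1
    while reach < required_fanout:
        reach *= element_fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for c in (0, 1):
        print("c =", c, "->", mux2(a=1, b=0, c=c))
    print(fanout_tree_levels(4, element_fanout=2))
```

Each extra tree level adds a gate delay, so a low maximum fan-out translates directly into longer critical paths, while the multiplexer replaces the multiple-source bus that cannot be built with tied outputs.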
Page Number: 79/101
Figure 7.12. Some possible techniques for realization of PCBs (printed circuit boards): (a) the MS technique (microstrip); (b) the SL technique (stripline). Symbols $D_0$ and $Z_0$ refer to the signal delay and the characteristic impedance, respectively. The meaning of other symbols is defined in former figures, or they have standard meanings.

(a) Microstrip:
$$Z_0 = \frac{87}{\sqrt{\varepsilon_r + 1.41}} \ln\frac{5.98\,H}{0.8\,W + T}\ \Omega, \qquad D_0 = 1.016\sqrt{0.475\,\varepsilon_r + 0.67}\ \text{ns/ft}$$

(b) Stripline:
$$Z_0 = \frac{60}{\sqrt{\varepsilon_r}} \ln\frac{4B}{0.67\,\pi\,(0.8\,W + T)}\ \Omega, \qquad D_0 = 1.016\sqrt{\varepsilon_r}\ \text{ns/ft}$$
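The microstrip and stripline relations of Figure 7.12 can be evaluated numerically with a short sketch. It assumes the commonly quoted closed-form approximations, which match the constants visible in the figure (87, 1.41, 5.98, 0.8, 1.016, 0.475, 0.67 for microstrip; 60, 4, 0.67, 0.8, 1.016 for stripline). The symbol meanings are also assumptions: H is the dielectric thickness, B the spacing between ground planes, W and T the trace width and thickness (any consistent length unit), and eps_r the relative permittivity. The sample dimensions are hypothetical.

```python
import math


def microstrip(eps_r, h, w, t):
    """Return (Z0 in ohms, delay in ns/ft) for a microstrip trace."""
    z0 = 87.0 / math.sqrt(eps_r + 1.41) * math.log(5.98 * h / (0.8 * w + t))
    d0 = 1.016 * math.sqrt(0.475 * eps_r + 0.67)
    return z0, d0


def stripline(eps_r, b, w, t):
    """Return (Z0 in ohms, delay in ns/ft) for a stripline trace."""
    z0 = 60.0 / math.sqrt(eps_r) * math.log(4.0 * b / (0.67 * math.pi * (0.8 * w + t)))
    d0 = 1.016 * math.sqrt(eps_r)
    return z0, d0


if __name__ == "__main__":
    # Hypothetical board: eps_r = 4.5, dimensions in mils.
    print("microstrip:", microstrip(eps_r=4.5, h=10.0, w=10.0, t=1.0))
    print("stripline: ", stripline(eps_r=4.5, b=20.0, w=5.0, t=1.0))
```

For the same dielectric, the stripline delay per foot is larger than the microstrip delay, because the stripline field is entirely inside the dielectric; this matters when off-chip delays already dominate the cycle time.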
Page Number: 80/101
1. Deep Memory Pipelining: Optimal memory pipelining depends on the ratio of off-chip to on-chip delays, plus many other factors. Therefore, precise input from DP and CD people was crucial. Unfortunately, these data were not fully known at design time, and some solutions (e.g., PC-stack) had to work for various pipeline depths.
2. Latency Stages: One group of latency stages (WAIT) was associated with instruction fetch; the other group was associated with operand load.
3. Four Basic Opcode Classes:
•ALU
•LOAD/STORE
•BRANCH
•COPROCESSOR
4. Register zero is hardwired to zero.
The CPU Architecture
Page Number: 81/101
[Figure: CPU (with IR and GRF) and memory M in silicon; CPU with memories M3, M6, and M9 in GaAs.]
Page Number: 82/101
ALU CLASS
Page Number: 83/101
CATALYTIC MIGRATION
from the RISC ENVIRONMENT POINT-OF-VIEW
This research was sponsored by NCR
Page Number: 84/101
DEFINITION: DIRECT MIGRATION Migration of an entire hardware resource into the system software.
EXAMPLES:
Pipeline interlock.
Branch delay control.
ESSENCE: Examples that result in code* speed-up are very difficult to invent.
Page Number: 85/101
DELAYED CONTROL TRANSFER
Delayed Branch Scheme
time →
I1 fetch   | I1 execution (branch address calculation, branch target calculation)
           | I2 fetch                                                            | I2 execution
           |                                                                     | I3 fetch
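The delayed branch scheme can be illustrated with a toy interpreter. It is a minimal sketch under stated assumptions: one delay slot per branch, instructions represented as hypothetical (opcode, argument) pairs, and a branch placed in a delay slot treated as an ordinary instruction. The instruction after a branch (I2) always executes, and the branch target (I3) is fetched only afterwards.

```python
def run_delayed_branch(program, max_steps=100):
    """Return the list of program counters executed, in order.

    program is a list of (opcode, argument) pairs; the only opcode
    interpreted here is "branch", whose argument is the target index.
    """
    executed = []
    pc = 0
    pending_target = None  # target of a branch whose delay slot is running
    steps = 0
    while 0 <= pc < len(program) and steps < max_steps:
        opcode, argument = program[pc]
        executed.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            # The delay-slot instruction has just executed: now take the branch.
            next_pc = pending_target
            pending_target = None
        elif opcode == "branch":
            # The target is known only after this cycle, so the next
            # sequential instruction is already on its way.
            pending_target = argument
        pc = next_pc
        steps += 1
    return executed


if __name__ == "__main__":
    program = [
        ("branch", 3),   # I1: branch to index 3
        ("add", None),   # I2: delay slot, always executed
        ("sub", None),   # skipped
        ("mul", None),   # I3: branch target
    ]
    print(run_delayed_branch(program))
```

If the compiler cannot find a useful instruction for the delay slot, it must insert a nop there, which is the cost that the migration examples try to reduce.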
Page Number: 86/101
DEFINITION: Catalytic Migration
Migration based on the utilization of a catalyst. MIGRANT vs CATALYST
Figure 7.13. The catalytic migration concept. Symbols M, C, and P refer to the migrant, the catalyst, and the processor, respectively. The acceleration, achieved by the extraction of a migrant of a relatively large VLSI area, is achieved after adding a catalyst of a significantly smaller VLSI area.
ESSENCE:
Examples that result in code speed-up are much easier to invent.
Page Number: 87/101
METHODOLOGY:
Area estimation: Migrant
Area estimation: Catalyst
Real estate to invest: Difference
Investment strategy: R
Compile time algorithms
Analytical analysis
Simulation analysis
Implementational analysis
NOTE: Before the reinvestment, the migration may result in slow-down.
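The first three steps of the methodology amount to simple area bookkeeping, sketched below. All names and numbers are hypothetical; areas are in arbitrary units, and the sample migrant follows the (N-2)*W register-window pattern only as an illustration.

```python
def reinvestment_budget(migrant_area, catalyst_area):
    """Return the VLSI area left to reinvest after a catalytic migration.

    migrant_area:  area released by moving the migrant into software.
    catalyst_area: area consumed by the catalyst that makes this possible.
    """
    difference = migrant_area - catalyst_area
    if difference <= 0:
        raise ValueError("catalyst is not smaller than the migrant: nothing to reinvest")
    return difference


if __name__ == "__main__":
    # Hypothetical numbers: 8 windows of 16 registers, 1 area unit per register,
    # of which N - 2 windows are removed; the catalyst costs 30 units.
    n_windows, window_size, register_area = 8, 16, 1.0
    migrant = (n_windows - 2) * window_size * register_area
    print(reinvestment_budget(migrant, catalyst_area=30.0))
```

The speed-up comes only from how this difference is reinvested, which is why the note on the slide warns that the migration alone may slow the code down.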
Page Number: 88/101
(N-2)*W vs DMA
Figure 7.16. An example of the DW (double windows) type of catalytic migration: (a) before the migration; (b) after the migration.
Symbol M refers to the main store. The symbol L-bit DMA refers to the direct memory access which transfers L bits in one clock cycle. Symbol NW refers to the register file with N partially overlapping windows (as in the UCB-RISC processor), while the symbol DW refers to the register file of the same type, only this time with two partially overlapping windows. The addition of the L-bit DMA mechanism, in parallel to the execution using one window, enables the simultaneous transfer between the main store and the window which is currently not in use. This enables one to keep the contents of the nonexistent N – 2 windows in the main store, which not only keeps the resulting code from slowing down, but actually speeds it up, because the transistors released through the omission of N – 2 windows can be reinvested more appropriately.
Migrant: (N-2)*W
Catalyst: L-bit DMA
Page Number: 89/101
i: load r1, MA{MEM – 6}
i + 1: load r2, MA{MEM – 3}
Figure 7.14. An example of catalytic migration: type HW (hand walking): (a) before the migration; (b) after the migration. Symbols P and GRF refer to the processor and the general-purpose register file, respectively. Symbols RA and MA refer to the register address and the memory address in the load instruction. Symbol MEM – n refers to the main store which is n clocks away from the processor. Addition of another bus for the register address eliminates a relatively large number of nop instructions (which have to separate the interfering load instructions).
Page Number: 90/101
Figure 7.15. An example of catalytic migration: type II (ignore instruction): (a) before the migration; (b) after the migration. Symbol t refers to time, and symbol UI refers to the useful instruction. This figure shows the case in which the code optimizer has successfully eliminated only two nop instructions, and has inserted the ignore instruction, immediately after the last useful instruction. The addition of the ignore instruction and the accompanying decoder logic eliminates a relatively large number of nop instructions, and speeds up the code, through a better utilization of the instruction cache.
Page Number: 91/101
CODE INTERLEAVING
Figure 7.17. An example of the CI (code interleaving) catalytic migration: (a) before the migration; (b) after the migration. Symbols A and B refer to the parts of the code in two different routines that share no data dependencies. Symbols GRF and SGRF refer to the general purpose register file (GRF), and the subset of the GRF (SGRF). The sequential code of routine A is used to fill in the slots in routine B, and vice versa. This is enabled by adding new registers (SGRF) and some additional control logic. The speed-up is achieved through the elimination of nop instructions, and the increased efficiency of the instruction cache (a consequence of the reduced code size).
Page Number: 92/101
CLASSIFICATION:
CM
ICM   ACM
C-+   C++   -+   ++
EXAMPLES:
(N-2)*W vs DMA
RDEST BUS vs CFF
IGNORE
CODE INTERLEAVING
Page Number: 93/101
for i := 1 to N do:
    1. MAE
    2. CAE
    3. DFR
    4. RSD
    5. CTA
    6. AAP
    7. AAC
    8. SAP
    9. SAC
    10. SLL
end do
Figure 7.18. A methodological review of catalytic migration (intended for a detailed study of a new catalytic migration example). Symbols S and R refer to the speed-up and the initial register count. Symbol N refers to the number of generated ideas. The meaning of other symbols is as follows: MAE (migrant area estimate), CAE (catalyst area estimate), DFR (difference for reinvestment), RSD (reinvestment strategy developed), CTA (compile-time algorithm), AAC (analytical analysis of the complexity), AAP (analytical analysis of the performance), SAC (simulation analysis of the complexity), SAP (simulation analysis of the performance), SLL (summary of lessons learned).
Page Number: 94/101
RISCs FOR NN: Core + Accelerators
Figure 8.1. RISC architecture with on-chip accelerators. Accelerators are labeled ACC#1, ACC#2, …, and they are placed in parallel with the ALU. The rest of the diagram is the common RISC core. All symbols have standard meanings.
Page Number: 95/101
Figure 8.2. Basic problems encountered during the realization of a neural computer: (a) an electronic neuron; (b) an interconnection network for a neural network. Symbol D stands for the dendrites (inputs), symbol S stands for the synapses (resistors), symbol N stands for the neuron body (amplifier), and symbol A stands for the axon (output). The symbols , , , and stand for the input connections, and
the symbols , , , and stand for the output connections.
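The electronic neuron of Figure 8.2(a) can be written as a short software routine. This is a sketch under assumptions that are not in the figure: the synapses are modeled as numeric weights, the neuron body as an amplifier with a hypothetical gain that saturates at a hypothetical limit, and the axon as the returned value.

```python
def neuron_output(inputs, weights, gain=1.0, limit=1.0):
    """Model one neuron: weighted sum of the inputs, then a saturating amplifier.

    inputs:  values arriving on the dendrites (D).
    weights: synapse strengths (S), one per input.
    Returns the axon (A) value, clipped to the range [-limit, +limit].
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    total = sum(w * x for w, x in zip(weights, inputs))
    amplified = gain * total
    return max(-limit, min(limit, amplified))


if __name__ == "__main__":
    print(neuron_output([1.0, 0.5], [0.2, 0.4]))
    print(neuron_output([10.0, 10.0], [1.0, 1.0]))
```

The inner loop is a multiply-accumulate over all inputs, which is the operation a neural accelerator placed in parallel with the ALU would be expected to speed up.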
Page Number: 96/101
Figure 8.3. A system architecture with N-RISC processors as nodes. Symbol PE (processing element) represents one N-RISC, and refers to “hardware neuron.” Symbol PU (processing unit) represents the software routine for one neuron, and refers to “software neuron.” Symbol H refers to the host processor, symbol L refers to the 16-bit link, and symbol R refers to the routing algorithm based on the MP (message passing) method.
Page Number: 97/101
Figure 8.4. The architecture of an N-RISC processor. This figure shows two neighboring N-RISC processors, on the same ring. Symbols A, D, and M refer to the addresses, data, and memory, respectively. Symbols PLA (comm) and PLA (proc) refer to the PLA logic for the communication and processor subsystems, respectively. Symbol NLR refers to the register which defines the address of the neuron (name/layer register). Symbol refers to the only register in the N‑RISC processor. Other symbols are standard.
Page Number: 98/101
Figure 8.5. Example of an accelerator for neural RISC: (a) a three-layer neural network; (b) its implementation based on the reference [Distante91]. The squares in Figure 8.5.a stand for input data sources, and the circles stand for the network nodes. Symbols W in Figure 8.5.b stand for weights, and symbols F stand for the firing triggers. Symbols PE refer to the processing elements. Symbols W have two indices associated with them, to define the connections of the element (for example, and so on). The exact values of the indices are left to the reader to determine, as an exercise. Likewise, the PE symbols have one index associated with them, to determine the node they belong to. The exact values of these indices were also left out, so the reader should determine them, too.
Page Number: 99/101
Figure 8.6. VLSI layout for the complete architecture of Figure 8.5. Symbol T refers to the delay unit, while symbols IN and OUT refer to the inputs and the outputs, respectively
Page Number: 100/101
Figure 8.7. Timing for the complete architecture of Figure 8.5. Symbol t refers to time, symbol F refers to the moments of triggering, and symbol P refers to the ordinal number of the processing element.