Operational experiences with the TI Advanced Scientific Computer
by W. J. WATSON and H. M. CARR
Texas Instruments Incorporated
Austin, Texas
INTRODUCTION
Since 1966 a large computer development program has been
conducted by Texas Instruments. The goal for this effort was
to
provide needed capacity for supporting seismic processing,
plus offering a general purpose capability for large scientific
problems.
This development has resulted in the Advanced Scientific
Computer (ASC), a highly modular system offering a wide
spectrum of processor power, memory sizes, and I/O capability.
The
ASC is a high-speed, large-scale processing system
featuring extensive use of pipelining, multiple arithmetic
units, separate control processors, large and fast central
memory,
and
extensive user software aids. The central
processor has
both
scalar and vector instruction capabilities.
First delivered in 1972 and placed into operational status
during 1973, several operational ASC systems now offer
extremely high processing rates for particular classes of
problems.
OVERVIEW OF THE SYSTEM
The major subsystems of a typical configuration are shown
in Figure 1: the central memory, the central processor, the
peripheral processor, on-line bulk storage, a digital
communications interface, plus a selection of standard peripherals.
The
peripheral processor has been designed for executing
the
operating system. The central processor has been designed
expressly
to
provide high computing speeds when operating
upon large arrays of data. The central processor operates as
a slave to the peripheral processor. This design approach was
chosen
to
maximize the overlapping of system overhead tasks
with
the
execution of user programs. In operation
the
job
stream is analyzed by the peripheral processor. The language
processors, plus user object code, are executed by the central
processor. System control and I/O tasks are processed by the
peripheral processor. I/O is routed through high-speed,
head-per-track disc storage. A data communications interface
for the common carriers is provided for the support of remote
batch and interactive terminals. Standard types of peripherals
are also provided.
The central memory serves as the common
communications and access storage medium for these
subsystems.
CENTRAL MEMORY
The ASC central memory consists of a memory control
unit (MCU) and appropriately sized modules of high-speed
or medium-speed central memory. Optionally, a medium-speed
central memory extension can be used in conjunction with a
high-speed memory.
The MCU is organized as a two-way, 256-bit/channel
(8-word) parallel access traffic net between eight independent
processor ports and nine memory buses, with each processor
port
having full accessibility
to
all memories.
The
nine
memory buses are organized to provide eight-way interleaving
for
the
first eight buses with
the ninth
bus used for the central
memory extension. The MCU provides the facilities for
controlling access from the eight processor ports to a CM
having a 24-bit address space (16 million words). A port
expander can be utilized to expand the number of processor
ports. Figure 2 illustrates this structure.
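The eight-way interleaving described above can be sketched in Python. The paper does not say which address bits select the memory bus, so this sketch assumes low-order interleaving at octet granularity (consecutive 8-word octets fall on consecutive buses). The function and constant names are hypothetical.

```python
WORDS_PER_OCTET = 8      # one 256-bit transfer holds eight 32-bit words
INTERLEAVED_BUSES = 8    # buses 0-7; the ninth bus serves the extension
ADDRESS_BITS = 24        # 2**24 words = 16 million words


def locate(word_address):
    """Return (bus, octet_within_bus, word_within_octet).

    Assumption: consecutive octets rotate across the eight buses.
    """
    if not 0 <= word_address < 2 ** ADDRESS_BITS:
        raise ValueError("address outside the 24-bit space")
    octet, word = divmod(word_address, WORDS_PER_OCTET)
    octet_in_bus, bus = divmod(octet, INTERLEAVED_BUSES)
    return bus, octet_in_bus, word


def buses_for_stream(start, n_words):
    """Buses touched by a sequential stream, one entry per octet fetched."""
    return [locate(a)[0] for a in range(start, start + n_words, WORDS_PER_OCTET)]
```

Under this assumption a sequential stream of 64 words touches each of the eight buses exactly once, which is the access pattern interleaving is meant to favor.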
The semiconductor high-speed central memory modules
have a cycle time of 160 ns and a read time of 140 ns.
Additionally, all transfers are 256 bits (eight 32-bit words)
with a Hamming code providing single-bit error correction
and double-bit error detection for each 32-bit word.
High-speed central memory is typically divided into eight
equal-sized modules which allow for eight-way interleaving.
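To make the error-correction property concrete, here is a minimal Python sketch of an extended Hamming code over a 32-bit word. The paper does not give the ASC's check-bit layout. The 39-bit arrangement below (six Hamming check bits plus one overall parity bit) is a textbook construction used as an assumption, not the machine's actual encoding.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)
CODE_BITS = 39  # position 0: overall parity; positions 1..38: Hamming code
DATA_POSITIONS = [p for p in range(1, CODE_BITS) if p not in PARITY_POSITIONS]


def group_parity(bits, p):
    """Parity of every position whose index has bit p set."""
    return sum(bits[k] for k in range(1, CODE_BITS) if k & p) % 2


def encode(word):
    """Encode a 32-bit word as a list of 39 bits."""
    bits = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        bits[p] = group_parity(bits, p)   # makes each parity group even
    bits[0] = sum(bits) % 2               # makes the whole codeword even
    return bits


def decode(bits):
    """Return (word, status); word is None if the error cannot be corrected."""
    bits = list(bits)
    syndrome = sum(p for p in PARITY_POSITIONS if group_parity(bits, p))
    overall_odd = sum(bits) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < CODE_BITS:
        bits[syndrome] ^= 1               # syndrome 0 is the overall parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"      # e.g. two flipped bits
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status
```

Any single flipped bit yields an odd overall parity and a syndrome that points at the bad position. Two flipped bits leave the overall parity even with a non-zero syndrome, so they are detected but not corrected.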
[Figure 1: block diagram of the central memory, central processor (CP), peripheral processor (PP), disc storage, data communications (common carriers), and peripherals]
Figure 1-Major ASC subsystems
From the collection of the Computer History Museum (www.computerhistory.org)
390 National Computer Conference, 1974
[Figure 2: interleaved high-speed or medium-speed memory modules attached to the memory control unit (MCU), with primary and secondary memory access ports and an optional central memory extension of interleaved medium-speed memory modules]
Figure 2-Modular structure of the ASC central memory
The optional central memory extension allows large
amounts of medium-speed memory (1 μs semiconductor
technology) to be used in the normal address space of central
memory. Block transfer between memory extension and
high-speed memory is controlled by the peripheral processor
and proceeds at a rate of 40M words per second.
Memory mapping registers
and
protection registers are
used to facilitate central memory management and access
control of the ports.
CENTRAL PROCESSOR
The central processor provides both scalar (single operand)
and vector (array) instructions
at
the
machine level.
The
basic instruction size is 32 bits, with 16-, 32-, or 64-bit
operands. The single instruction stream, which contains a
mixture of scalar
and
vector instructions, is preprocessed by
the instruction processing unit.
The central processor design is such that one, two, three,
or four execution units or pipes can be provided. These
units employ the pipeline concept in both scalar and vector
modes. A single execution unit can have up to twelve scalar
instructions in process at one time. From one to four vector
results can be produced every 60 ns, depending on the
number of execution units provided.
The CP has 48 program-addressable registers. This group
of 32-bit registers consists of sixteen base address registers,
sixteen arithmetic registers, eight index registers, and
eight
vector parameter registers. This last group is used
to
extend
the
instruction format for the complete specification of vector
instructions.
The CP scalar instruction repertoire includes an extensive
set of load and store instructions: halfword, fullword, and
doubleword instructions, with immediate, magnitude, and
negative operand capabilities. Ability to load and store
register files and to load effective addresses is also available.
Arithmetic scalars include various add, subtract, multiply,
and divide instructions for halfword (16-bit) and fullword
(32-bit) fixed point numbers and fullword and doubleword
(64-bit) floating point numbers. Scalar logical instructions are
provided, as are arithmetic, logical, and circular shifts. Various
comparison instructions and combination comparison-logical
instructions are provided for halfwords, fullwords, and
doublewords. Many combinations of test and branching
instructions with incrementing or decrementing capability are
also available. Stacking and modifying arithmetic registers can
be done with single instructions. Subroutine linkage is
accomplished through branch and load instructions. Format
conversion for single and doublewords, as well as normalize
instructions, are available.
The vector capabilities of the CP are made available
through the use of VECTL (vector after loading vector
parameter file) and VECT (assumes parameter file is already
loaded) instructions. The vector repertoire includes such
arithmetic operations as add, subtract, multiply, divide,
vector dot product, matrix multiplication, and others for both
fixed point and floating point representations. Vector
instructions are also available for shifting; logical operations;
comparisons; format conversions; normalization; and special
operations such as Merge, Order, Search, Peak Pick, Select
and Replace, among others.
One important characteristic of the vector instruction
capability is
the
ability to encompass three dimensions of
addressability within a single vector instruction. This is
equivalent to a nest of three indexing loops in a conventional
machine.
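The equivalence between one vector instruction and a nest of three loops can be sketched as follows. The `VectorParams` fields are hypothetical stand-ins for what a vector parameter file might carry. The paper does not list the actual register contents, so the layout here is an assumption made only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class VectorParams:
    """Hypothetical three-level addressing description (not the ASC layout)."""
    length: int        # elements processed in the innermost level
    inner_count: int   # passes of the middle level
    inner_step: int    # address increment per middle-level pass
    outer_count: int   # passes of the outermost level
    outer_step: int    # address increment per outermost pass


def vector_multiply(mem, x0, y0, z0, p):
    """Z = X * Y over a three-level address pattern in a flat word memory."""
    for k in range(p.outer_count):
        for j in range(p.inner_count):
            base = k * p.outer_step + j * p.inner_step
            for i in range(p.length):
                mem[z0 + base + i] = mem[x0 + base + i] * mem[y0 + base + i]
```

For arrays dimensioned 4 x 4 x 4 and stored with the first index varying fastest, `VectorParams(3, 3, 4, 3, 16)` covers the leading 3 x 3 x 3 sub-block. This is the work a Fortran programmer would otherwise write as three nested DO loops.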
The basic structure of the CP, shown in Figure 3, has three
major components: the instruction processing unit (IPU) for
non-arithmetic stages of instruction processing for the CP
instruction stream, the memory buffer unit (MBU) to provide
operand interfacing with the central memory, and an
arithmetic unit (AU) to perform the specified arithmetic or
logical operations. Figure 3 shows a CP diagram for 2- or
4-pipeline CP's, each with a corresponding number of
MBU-AU pairs. Note that a memory port is required for the
IPU and, in addition, one memory port for each pipeline
(MBU-AU pair) in a CP.
[Figure 3: a two-pipeline CP and a four-pipeline CP, each showing primary memory ports feeding an IPU and the corresponding MBU-AU pairs]
Figure 3-Basic structure of the CP
A significant feature of the CP hardware is an operand
look-ahead capability which causes memory references to be
requested prior to the time of actual need. Double buffering
in multiple 8-word (octet) buffers for each pipeline provides
a smooth data flow to and from each arithmetic unit.
The pipelined AU achieves its highest sustained flow rate
in the vector mode, typically a result each 60 ns per AU, or an
average of 15 ns per result for a 4-pipe central processor.
Instruction processing unit
The primary function of the instruction processing unit
(IPU) is to supply a continuous stream of instructions for
execution by the other parts of the CP. One Central Memory
port is required to provide the instruction stream. Two 8-word
(octet) buffers are utilized to achieve a balanced stream of
instructions from memory to the IPU. Instructions are
transferred from memory in octets as are all other references
to memory for fetching or storing of information.
Up to 36 instructions in various stages of execution can be
overlapped within the 4-pipe CP. There are twenty positions
for instructions in the 2-pipe CP and twelve positions for
instructions in the 1-pipe CP. Four levels are contained
within the IPU, and eight levels are contained in each
arithmetic pipeline (MBU-AU pair). The IPU performs
routing of instructions to the MBU-AU pairs based on an
optimum use of arithmetic unit capability.
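The position counts quoted above follow directly from the level counts: four IPU levels plus eight levels per MBU-AU pair. A short Python check of that arithmetic (the function name is hypothetical):

```python
IPU_LEVELS = 4        # levels inside the instruction processing unit
LEVELS_PER_PIPE = 8   # levels in each MBU-AU pair


def instruction_positions(pipes):
    """Instructions that can be in flight for a CP with the given pipe count."""
    if pipes not in (1, 2, 3, 4):
        raise ValueError("the CP is offered with one to four pipes")
    return IPU_LEVELS + LEVELS_PER_PIPE * pipes
```

This gives 12, 20, and 36 positions for the 1-, 2-, and 4-pipe CP, matching the text. The same formula would give 28 for a 3-pipe CP, a figure the paper does not state explicitly.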
Vector processing is altered by software in order to
distribute segments of the vector for multiple pipe systems.
Several features are provided to alleviate
the
potential
problems of branches and instruction dependencies in the
instruction pipeline.
Memory buffer unit
The memory buffer unit (MBU) provides an interface
between central memory and the arithmetic unit. Its primary
function is to supply the arithmetic unit with a continuous
stream of operands from memory and to provide for the
storing of the results back to memory. All references to
memory, whether for fetching or storing, are made in 8-word
increments (octets).
The MBU has three double buffers, one octet per buffer,
called the
X
and Y buffers for
input
and the Z buffers
for output. This double buffering is provided
so that
pipeline
processing can be sustained at a high rate with minimal
memory access conflicts.
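A rough timing model shows why double buffering helps sustain the pipeline. It is a deliberately simplified sketch: it assumes one octet fetch costs one 160 ns memory cycle, that the AU consumes an octet in 8 x 60 ns, and that a single operand stream is the only traffic. None of these simplifications comes from the paper beyond the two timing figures.

```python
def stream_time_ns(octets, fetch_ns, compute_ns, double_buffered):
    """Total time to stream `octets` octets through one buffer pair.

    With a single buffer, fetch and compute alternate. With a double
    buffer, the next fetch overlaps the current compute, so the slower
    of the two sets the steady-state pace.
    """
    if octets == 0:
        return 0
    if double_buffered:
        return fetch_ns + (octets - 1) * max(fetch_ns, compute_ns) + compute_ns
    return octets * (fetch_ns + compute_ns)


FETCH_NS = 160          # memory cycle time quoted for high-speed modules
COMPUTE_NS = 8 * 60     # eight results at one per 60 ns clock
```

For 100 octets the model gives 64,000 ns unbuffered against 48,160 ns double buffered. In the double-buffered case the memory fetch is hidden behind the arithmetic except for the first octet.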
Arithmetic unit
The primary function of a CP arithmetic unit (AU) is to
perform the arithmetic operations specified by the operation
code of the instruction currently at the AU level. There is one
AU per pipeline in the CP, each having a 60 ns basic cycle
time. A distinguishing feature of an AU is the pipeline
structure which allows efficient execution of the arithmetic
part of all instructions. There are eight exclusive partitions of
the AU pipeline involved, each of which can provide an output
every 60 ns. These eight sections are (1) receiver register,
(2) exponent subtract, (3) align, (4) add, (5) normalize,
(6) multiply, (7) accumulate, and (8) output. Figure 4 shows how
different sections of the AU are utilized for execution of
particular instructions; i.e., floating point addition and fixed
point multiplication.
[Figure 4: the AU pipeline sections (receiver register, exponent subtract, align, multiply, add, normalize, accumulate, output) with the paths taken by a floating add and a fixed multiply]
Figure 4-Arithmetic unit pipeline
An AU is a 64-bit parallel operating unit for most scalar
and vector instructions. Exceptions are double length
multiply and all types of division. In these circumstances
various combinations of the components of the AU are
utilized; and, therefore, more than one clock cycle is required
to complete these arithmetic operations.
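The effect of the pipeline on vector timing can be sketched with the usual fill-then-stream model: the first result appears after one clock per section used, and each later result follows one clock behind. The section subsets assigned to each operation below are assumptions, since the paths drawn in Figure 4 are not recoverable from this copy. Only the eight section names and the 60 ns clock come from the text.

```python
CLOCK_NS = 60

SECTIONS = (
    "receiver register", "exponent subtract", "align", "add",
    "normalize", "multiply", "accumulate", "output",
)

# Assumed paths through the AU; the paper's Figure 4 is the authority.
PATHS = {
    "floating add": ("receiver register", "exponent subtract", "align",
                     "add", "normalize", "output"),
    "fixed multiply": ("receiver register", "multiply", "accumulate",
                       "output"),
}


def vector_time_ns(operation, n_results):
    """Time for one AU to deliver n_results, ignoring memory stalls."""
    if n_results <= 0:
        return 0
    depth = len(PATHS[operation])
    return (depth + n_results - 1) * CLOCK_NS
```

Under these assumptions a 1000-element floating add takes (6 + 999) x 60 ns, so the fill time is negligible for long vectors and the rate approaches one result per 60 ns.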
THE PERIPHERAL PROCESSOR
The peripheral processor (PP) is a powerful multiprocessor
designed
to
perform
the
control
and data
management
functions of the ASC. Several aspects of the implementation
of
the
peripheral processor concept greatly increase
the
effectiveness of
the
ASC system.
The PP is a collection of eight individual processors called
virtual processors (VP's). Each VP has its own program
counter along with arithmetic, index, base, and instruction
registers. The eight VP's share a read only memory, an
arithmetic unit, an instruction processing unit, and a central
memory buffer. Use of the common units is distributed among
the VP's using sixteen single 85 ns cycles. When an equally
distributed sequence of time units is used, each of the eight
VP's receives two 85 ns cycles every 1.4 μs. The typical PP
instruction requires two 85 ns cycles for completion. The
distribution of available time units can be dynamically varied
to suit particular processing requirements.
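The time-slot arithmetic above can be checked with a small Python sketch. Sixteen 85 ns slots make a 1360 ns rotation, which the text rounds to 1.4 μs. The schedule representation (a list of VP numbers, one per slot) and the function names are hypothetical.

```python
SLOT_NS = 85
SLOTS_PER_ROTATION = 16
VP_COUNT = 8


def equal_schedule():
    """Each of the eight VPs gets two of the sixteen slots."""
    return [slot % VP_COUNT for slot in range(SLOTS_PER_ROTATION)]


def rotation_ns(schedule):
    """Length of one full pass through the slot table."""
    return len(schedule) * SLOT_NS


def instructions_per_second(schedule, vp, cycles_per_instruction=2):
    """Approximate instruction rate of one VP for a given slot table."""
    slots = schedule.count(vp)
    return slots / cycles_per_instruction / (rotation_ns(schedule) * 1e-9)
```

With the equal schedule each VP completes about one typical instruction per rotation. A skewed table such as `[0] * 9 + [1, 2, 3, 4, 5, 6, 7]` would instead give VP 0 nine slots and the others one each, which illustrates how the distribution can be varied.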
The 4K 32-bit words of read only memory within the PP
are utilized for program storage and execution of those short
routines which are highly utilized by the VP's, such as
polling loops.
Because the PP is intended to perform control functions
rather than execute mathematical algorithms, the instruction
set is oriented toward control operations and does not require
multiplication, division, or floating point operations. The
instruction format is similar to that of the central processor,
using a 32-bit word for each instruction. Instructions are
provided for bit (1 bit), byte (8 bits), halfword (16 bits), and
fullword (32 bits) operations.
Each VP has direct access to the entire central memory for
program execution and data storage. Therefore, a single copy
of reentrant code can be executed simultaneously by more
than one VP.
The communications register (CR) file contains sixty-four
32-bit word registers which are program addressable by the
VP's. The CR file serves as the principal storage medium for
control information necessary for the coordination of all parts
of the ASC system.
DISC STORAGE
Disc storage is the principal secondary storage system for
the ASC system. Disc storage consists of head-per-track
(H/T) disc systems supplemented by positioning-arm disc
(PAD) systems.
The H/T disc system is a high-performance device whose
effective performance is further enhanced because the
operating system utilizes a shortest-access-time-first (SATF)
algorithm for data transfers. This combination of hardware
and software provides a very high effective transfer rate.
Each H/T disc module has a capacity of 25 million 32-bit
words with a transfer rate of approximately 500K words per
second. Using the shortest-access-time-first algorithm, access
time will average approximately 5 ms, which results in an
exceptionally fast effective transfer rate.
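Because a head-per-track disc has no arm movement, access time is purely rotational delay, and SATF amounts to serving whichever pending request will pass under the heads soonest. The sketch below illustrates that idea under simplifying assumptions that are not from the paper: requests are single sectors, the disc has a fixed number of sectors per revolution, and no new requests arrive while the queue drains.

```python
def rotational_delay(current, target, sectors):
    """Sectors that must pass before `target` reaches the heads."""
    return (target - current) % sectors


def satf_order(requests, start, sectors):
    """Order requests so each step serves the one with the least delay."""
    pending = list(requests)
    position = start
    order = []
    while pending:
        nxt = min(pending, key=lambda s: rotational_delay(position, s, sectors))
        pending.remove(nxt)
        order.append(nxt)
        position = (nxt + 1) % sectors   # heads sit just past the sector read
    return order


def total_time(order, start, sectors):
    """Total sector-times (delay plus transfer) to serve `order`."""
    position, total = start, 0
    for s in order:
        total += rotational_delay(position, s, sectors) + 1
        position = (s + 1) % sectors
    return total
```

For requests `[10, 2, 30, 5]` on a 32-sector track starting at sector 4, first-come-first-served costs 66 sector-times while the SATF order `[5, 10, 30, 2]` costs 31. That is the kind of gain in effective transfer rate the paragraph describes.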
DATA COMMUNICATIONS
The data communication system is very modular and, thus,
externally flexible in the various devices which may be
utilized for communication with the ASC. Data communications
are controlled by a data concentrator which, in turn,
interfaces to the MCU through a channel control device.
The
data
concentrator is a TI-980A minicomputer
equipped with special-purpose hardware communication
interface units on its direct memory access ports.
The data communications system presently supports
communication with three types of stations: high-performance
user terminals, other large computers, and remote
concentrators. The system can be easily extended to support
smaller terminals down to the teletype level. These stations
may be either remote or local.
Remote links are presently implemented with non-switched,
full duplex common carrier data transmission facilities. Data
is transferred over these links synchronously at rates
determined by the modems and common carrier bandwidths.
The data communication system supports transfer rates up to
a maximum of 240,000 bits per second.
PERIPHERALS
Standard types of magnetic tape drives, card equipment,
and printers have been interfaced with the ASC. These
interfaces attach to primary or secondary memory ports
through a variety of standard selected and multiplexed
data
channels. A subset of the system's peripherals can also be
interfaced via the communications register file.
SYSTEM SOFTWARE
Software design and development for the ASC system has
progressed in parallel with development of the hardware.
This was accomplished through the use of simulators,
meta-assemblers, and higher level programming languages
implemented on the systems supporting Texas Instruments'
Corporate Information Center. Thus, the first version of this
software was placed into operational status with the ASC
prototype machine. The major software capabilities are
discussed in the next few paragraphs with emphasis being
given to those attributes which provide comprehensive and
flexible programming facilities for the user.
ASC Fortran language
The most obvious interface between the ASC system and
a user is with the translation of the user-written program into
machine level instructions that efficiently utilize the special
hardware features in the system. Texas Instruments has
[Figure 5: configuration diagram. Recoverable labels: four H/T disc and disc interface units (25M words, 500K words/sec. each); vertical labels EXPANDER and MEMORY; CP; card readers (two, 1500 cards/min); line printers (three, 1200 lines/min); punches (two, 100 cards/min); text editing CRTs (two); operator communication (two CRTs); tape controller and tape switching unit; 6 dual density 9-track 800/1600 BPI tape drives; dual density 7-track 556/800 BPI tape drives; channel number 1 and channel number 2 secondary storage]
Figure 5-GFDL ASC configuration
GPOS performing all overhead functions in the Peripheral
Processor. The operating system isolates the control,
scheduling, and resource allocation algorithms for ease in
tuning the system to match the specific requirements of each
installation. The overall system architecture is maintained to
accommodate hardware and software system growth and
flexibility. GPOS, by its simplicity and modular design,
minimizes the system use of central memory with a small
resident system and the remainder of the system non-resident.
The design of GPOS exploits hardware features unique to
the ASC. Most important of these features is complete access
to central memory by the PP. Thus, a single reentrant copy
of code is available to all processors; and, only a branch
instruction is needed to switch a Virtual Processor from one
function to another. The Communications Register (CR) file
is used to allow one VP to control the other seven, while
common access to the rest of this file supports communication
between the processors and other system components.
OPERATIONAL HISTORY
The prototype ASC initially completed its checkout during
the Spring of 1971. The system (Serial 1) was available for
use as a software development tool and for customer
demonstrations for the remainder of 1971. In 1972 the prototype
was moved to a permanent location at the TI facility in
Austin. During the period of downtime, a retrofit of the
hardware was carried out to incorporate the latest version of
circuits and boards and to support a production environment.
System 1 was operational early in 1973 and is currently being
devoted to software development and support of application
program conversion to the ASC.
ASC 1 is configured with a one-pipe central processor,
128K words of high-speed central memory, 128K words of
memory extension, a complement of head-per-track disc
storage, a
data
communications interface, plus standard tape
and paper devices.
Experience with an ASC operating in a center devoted to
seismic production work is currently being gained in the TI
facility at Amstelveen, Holland. This system (Serial 2) was
delivered early in 1973 and essentially duplicates the
capabilities described for the prototype machine. Additionally,
several seismic interactive terminals are interfaced both
locally and remotely to this system.
Seismic operational requirements are characterized by
large data bases, much magnetic tape input and output, many
job steps composed of long computational sequences, and the
need to precisely control a complicated series of such jobs. In
addition to the high computational speeds available on the
ASC, the seismic center experience is showing that other
ASC features are valuable when applied to this application.
Head-per-track disc storage, management of the data bases
and scheduling by the dedicated virtual processors, and job
control available via the JSL language appear to match the
environment of seismic work. Applications programs are
written in standard Fortran, and no need has been found to
supplement the available compiler optimization by additional
hand coding. The system is well supporting the requirements
by generating significant improvements in unit processing
costs and by permitting new processing technologies to be
economically feasible. Improved productivity of geophysicists
and geologists through real-time interactive sessions is being
achieved. It is expected that the use of ASC for seismic
processing capacity will continue to grow at a rapid rate.
Operational experience has also been gained from the
application of the ASC to the U.S. Government data processing
problem of ballistic missile defense. Serial 3, a one-pipe
ASC with a configuration similar to the previously described
systems, was delivered to the U.S. Army in the Summer of
1973. It is to be used for research into processing techniques
employed in ballistic missile defense.
Application to long-range prediction of the earth's weather
is the intended use of the largest and fastest ASC to be built
to date. The National Oceanic and Atmospheric Administration
(NOAA) has contracted for an ASC (Serial #4) for its
Geophysical Fluid Dynamics Laboratory at Princeton
University. Delivery is scheduled for early in 1974. The ASC is
configured with a four-pipe central processor, one million
words of high-speed central memory, head-per-track disc, text
editing terminals, two channels of high density secondary
storage devices, and standard magnetic tape and paper
devices. This configuration is illustrated in Figure 5. Much
experience has been gained using benchmark programs
derived from weather models and the actual weather
prediction codes themselves. Emphasis has been upon Fortran
code generated by analysts and weather scientists instead of
hand-optimized machine language. Results obtained from the
system while undergoing final checkout at TI's facility showed
the speeds available to be several times faster than other
current computer systems.
For weather codes characterized
by
large
data
bases that
are updated frequently, sequences of heavy computational
work using
the
data, and mathematical operations performed
on long arrays of data, the ASC is proving to be a valuable
asset. The large central memory enables one
to
maintain
ample data
so
that the central processor is utilized to a very
high degree. The
I/O
and multiprogramming capabilities
managed
by
the operating system resident in the peripheral
processor also support high
CP
workloads.
TABLE I-Simple Examples of Vectors
(1)       DO 10 K=1, 50
          DO 10 J=1, 50
          DO 10 I=1, 50
       10 Z(I, J, K)=X(I, J, K) * Y(I, J, K)
(2)       Z=X*Y
(3)       VECTL (#460, B2)   VMF
TABLE II-Vector Instructions Produced from Weather Code
(1)       DO 100 K=1, 10
          DO 100 I=1, 144
          TBXY(I, K)=(T(I+1, K, J)+T(I, K, J)) * 0.5
          TXY(I, K)=(T(I+1, K, J)-T(I, K, J)) * RDX(JC)
          PBXY(I, K)=(PS(I+1, K, J)+PS(I, K, J)) * 0.5
      100 PXY(I, K)=(PS(I+1, K, J)-PS(I, K, J)) * RDX(JC)
(2)       VECTL (#3B8, B2)   VAF
          VECTL (#3C0, B2)   VMF
          VECTL (#3C8, B2)   VSF
          VECTL (#3D0, B2)   VMF
          VECTL (#3D8, B2)   VAF
          VECTL (#3E0, B2)   VMF
          VECTL (#3E8, B2)   VSF
          VECTL (#3F0, B2)   VMF
MAXIMIZING PERFORMANCE
Experience thus far has shown that for the applications
that have been considered by ASC users the most
cost-effective performance is realizable when the capabilities of
ASC Fortran and the optimizing compiler are used. Although
particular sequences of code can be found wherein hand
coding will improve the speed of execution, for the broad
range of programs where much applications code is involved,
compiler-generated object code is the best choice. American
National Standard Institute (ANS) Fortran is completely
sufficient, and vector instructions are readily produced from
this Fortran. ASC extensions to the Fortran are sometimes
found to be useful, not to provide unique access to some
hardware feature but to simplify notation involved in writing
the program so that the programmer can deal more directly with
the mathematics of the application.
The ASC system design allows easy user access to
performance enhancement through the use of additional central
processor pipes. Compiler software is responsible for both
the generation of vector instructions and the partitioning of
these vector operations over multiple pipes. Protection of the
user from vector hazard conditions is carried out by the
compiler. Partitioning of scalar instructions for multiple pipes
is carried out by the CP hardware. Extensive checks are made
by hardware to protect the user from illegal scalar conditions
that might occur. For mixtures of vector instructions and for
mixtures of scalars and vectors, the compiler prevents illegal
conditions by the use of directive instructions for the CP to
operate in either parallel mode (FORK) or sequential mode
(JOIN). Thus, the burden is on the system instead of the
user. Programs compiled for one-pipe ASC's will execute
correctly on multiple-pipe systems. Performance will be
increased via a recompilation for the multiple-pipe machine.
TABLE III-ASC Maximum Performance Rate

               ASC 1X (ONE AU)              ASC 4X (FOUR AU'S)
               32-BIT        64-BIT         32-BIT        64-BIT
               RESULTS/SEC   RESULTS/SEC    RESULTS/SEC   RESULTS/SEC
ADD            16 x 10^6     9.2 x 10^6     64 x 10^6     37 x 10^6
MULTIPLY       16 x 10^6     5.3 x 10^6     64 x 10^6     21 x 10^6
DOT PRODUCT    16 x 10^6     4.0 x 10^6     64 x 10^6     16 x 10^6

Some typical examples of efficient code produced from
present applications will illustrate the optimization level
provided by the system. Table I shows the type of instruction
generated by the compiler from a typical triple-nested DO
LOOP. (1) gives the Fortran source with three levels of
indexing, (2) is an alternate notation that could be used, and
(3) is the single vector instruction produced. It is a floating
vector multiply instruction preceded by the loading of the
vector parameter registers. Table II gives some typical code
found in weather models. A double-nested DO LOOP with
typical indexing conventions is shown in (1). (2) gives the
sequence of instructions produced by the ASC compiler. All
instructions are vectors, and the necessary indexing
information for addressing purposes is contained in each vector
parameter file. No scalar instructions are necessary in
this example.
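The Table II sequence can be read as eight whole-vector operations, two per Fortran statement. The Python sketch below mirrors that reading. The helper names echo the mnemonics, taking VAF, VSF, and VMF to be floating vector add, subtract, and multiply. The text confirms this only for VMF, so the other two are inferred from the names. Treating the multiplier as a scalar broadcast is a further assumption about how the constant operand is supplied.

```python
def vaf(a, b):
    """Elementwise floating add of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def vsf(a, b):
    """Elementwise floating subtract."""
    return [x - y for x, y in zip(a, b)]


def vmf(a, scalar):
    """Multiply every element by one scalar (assumed broadcast form)."""
    return [x * scalar for x in a]


def weather_step(t_next, t_here, ps_next, ps_here, rdx):
    """Four statements of the loop body as eight vector operations.

    t_next holds T(I+1, K, J) and t_here holds T(I, K, J) for all (I, K)
    at a fixed J, flattened into one list; likewise for PS.
    """
    tbxy = vmf(vaf(t_next, t_here), 0.5)     # VAF, VMF
    txy = vmf(vsf(t_next, t_here), rdx)      # VSF, VMF
    pbxy = vmf(vaf(ps_next, ps_here), 0.5)   # VAF, VMF
    pxy = vmf(vsf(ps_next, ps_here), rdx)    # VSF, VMF
    return tbxy, txy, pbxy, pxy
```

The order of operations (add, multiply, subtract, multiply, repeated twice) matches the mnemonic column of Table II, and no per-element loop control appears anywhere in the sequence.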
A powerful example of vector instruction capabilities is
found in the use of the hardware-implemented dot-product
operation. This operation consists of the multiplication of
appropriate elements of two arrays followed by the sum of the
products. To implement a matrix multiply operation from
Fortran, the ASC compiler uses a single dot-product
instruction and the complex indexing capability of the hardware to
carry out the full matrix multiply. Three levels of addressing
changes are implied in this case, and the hardware is designed
to comprehend this level of indexing complexity.
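A matrix multiply built from strided dot products can be sketched as below. The two outer loops stand in for the outer addressing levels that the text says the hardware handles within one instruction. The flat-memory layout (first index varying fastest, as in Fortran) and the function names are assumptions made for illustration.

```python
def dot(mem, a0, a_step, b0, b_step, n):
    """Sum of products of two strided vectors in a flat memory."""
    return sum(mem[a0 + i * a_step] * mem[b0 + i * b_step] for i in range(n))


def matmul(mem, a0, b0, c0, n):
    """C = A * B for n x n matrices stored with the first index fastest.

    A(i, k) lives at a0 + i + k*n, so row i of A has stride n.
    B(k, j) lives at b0 + k + j*n, so column j of B has stride 1.
    """
    for j in range(n):            # outermost addressing level
        for i in range(n):        # middle addressing level
            mem[c0 + i + j * n] = dot(mem, a0 + i, n, b0 + j * n, 1, n)
```

Each element of the result is one dot product of length n, so the whole multiply performs n^3 elementary multiply-add steps.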
The execution rate for the elementary operations of matrix
multiply is one result per clock cycle for a one-pipe CP, or a
rate of four results per clock cycle for a four-pipe CP. The
compiler partitions the total matrix multiply across the
appropriate number of pipes. Therefore, to complete a matrix
multiply of two N by N matrices, a four-pipe CP will require
approximately N^3/4 times the clock period in seconds. This
does not include the startup overhead necessary to fill the
pipelines with operands.
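As a worked instance of the estimate above, the following Python function evaluates N^3 divided by the number of pipes, times the 60 ns clock period. Like the text, it ignores pipeline startup. The function name is hypothetical.

```python
CLOCK_NS = 60  # basic AU cycle time from the text


def matmul_seconds(n, pipes):
    """Approximate time to multiply two n x n matrices, startup excluded."""
    elementary_ops = n ** 3
    return elementary_ops / pipes * CLOCK_NS * 1e-9
```

For N = 100 this gives roughly 0.015 seconds on a four-pipe CP and 0.06 seconds on a one-pipe CP.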
TABLE IV-Relative Computer Capacity* Third Generation Systems

MFR        MODEL               RELATIVE SPEED
IBM        S/360 MODEL 65      1
IBM        S/360 MODEL 75      1.5
CDC        6500                1.5
CDC        6600                2.5
IBM        S/370 MODEL 165     3.5
IBM        S/360 MODEL 91      5
HITACHI    HITAC 8800          5
IBM        S/360 MODEL 95      7
CDC        7600                8
IBM        S/360 MODEL 195     8

* Data taken from Table E, page 546, Program for the study conference
on the Modeling Aspects of GATE, Bulletin of the American
Meteorological Society, Vol. 54, No. 6, June, 1973.
It is the authors' opinion that performance indices for
array-oriented architectures are not meaningful when only the
Millions of Instructions Per Second (MIPS) factor is used.
Since a single vector instruction is equivalent to several scalar
instructions (typically Load, Operation, Increment and Test
Branch), and the number of data values used determines the
number of executions of these scalar instructions, MIPS ratings
are ambiguous at best.
Consider the performance of an ASC producing results per
second. In this context results per second is the rate at
which data fetched from central memory can be operated
upon and the results stored back into central memory.
Table III shows the maximum performance rates for one- and
four-pipe ASC systems performing typical arithmetic
operations. Assumptions are that the clock cycle is 60 nanoseconds
and that the pipelines are already filled with operands.
Vector dot product is a special case in the sense that the
results per second rate pertains to the elementary operations.
Another performance measure can be determined from the
present performance of ASC System 4 executing a particular
weather benchmark. Although the benchmark is not a full
weather prediction code, it does have the characteristic source
code sequences and reflects the ability of the Fortran compiler
to produce efficient code from a large applications package.
Execution time of the benchmark on the IBM Model 91 is
approximately 246 minutes, and present ASC timing with
checkout not finalized has already demonstrated approximately
30 minutes. This ratio of 8.2 is a measure of the total
system performance upon this program. It reflects a mix of
both scalar and vector instructions as well as I/O and other
system services. The design of the ASC has been directed
toward matching the real world mix of instructions
encountered in typical applications instead of sacrificing scalar
capability to provide vector capability.
In order to compare the observed ASC performance on the
Weather Benchmark, data found in the Bulletin of the
American Meteorological Society [1] is given in Table IV. Using
the IBM S/360 Model 65 as the basis of reference, each of the
systems listed is compared as to relative speed. Using the
observed ASC/M91 ratio of 8.2, the present ASC speed would
be 41 in the table.
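The figure of 41 follows from two numbers given in the text and Table IV: the benchmark ratio (246 minutes against 30 minutes) and the Model 91's relative speed of 5. A short Python check, with hypothetical function names:

```python
def speedup(reference_minutes, measured_minutes):
    """How many times faster the measured system ran the same job."""
    return reference_minutes / measured_minutes


def relative_speed(ratio_to_known, known_relative_speed):
    """Place a system on the table's scale via a system already on it."""
    return ratio_to_known * known_relative_speed


asc_vs_m91 = speedup(246, 30)                 # about 8.2
asc_relative = relative_speed(asc_vs_m91, 5)  # about 41 on the Table IV scale
```

The result rests on a single benchmark, so it places the ASC on the table's scale only to the extent that this weather code is representative.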
ACKNOWLEDGMENTS
It would not be possible to acknowledge all the contributors
to the development of the ASC; but particular recognition
should be given to Messrs. H. G. Cragon, W. D. Kastner,
E. H. Husband, D. R. Best, C. M. Stephenson, C. R. Hall,
F. A. Galindo, E. C. Garth, and N. M. Chandler, who
contributed significantly to the development of the hardware.
Software concepts are due in large part to the efforts of
Messrs. L. C. Dean, G. T. Boswell, A. E. Riccomi, F. A.
Little, W. Winkelman, W. L. Cohagan, and S. D. Nolte.
Many other members of the Texas Instruments staff have
also contributed immeasurably in the development of the
ASC.
REFERENCES
1. Program for the study conference on the Modeling Aspects of GATE,
Bulletin of the American Meteorological Society, Vol. 54, No. 6,
June 1973, page 546, Table E.