
Operational experiences with the TI Advanced Scientific Computer

by W. J. WATSON and H. M. CARR

Texas Instruments Incorporated

Austin, Texas

INTRODUCTION

Since 1966 a large computer development program has been conducted by Texas Instruments. The goal for this effort was to provide needed capacity for supporting seismic processing, plus offering a general purpose capability for large scientific problems.

This development has resulted in the Advanced Scientific Computer (ASC), a highly modular system offering a wide spectrum of processor power, memory sizes, and I/O capability.

The ASC is a high-speed, large-scale processing system featuring extensive use of pipelining, multiple arithmetic units, separate control processors, large and fast central memory, and extensive user software aids. The central processor has both scalar and vector instruction capabilities.

First delivered in 1972 and placed into operational status during 1973, several operational ASC systems now offer extremely high processing rates for particular classes of problems.

OVERVIEW OF THE SYSTEM

The major subsystems of a typical configuration are shown in Figure 1: the central memory, the central processor, the peripheral processor, on-line bulk storage, a digital communications interface, plus a selection of standard peripherals.

The peripheral processor has been designed for executing the operating system. The central processor has been designed expressly to provide high computing speeds when operating upon large arrays of data. The central processor operates as a slave to the peripheral processor. This design approach was chosen to maximize the overlapping of system overhead tasks with the execution of user programs. In operation the job stream is analyzed by the peripheral processor. The language processors, plus user object code, are executed by the central processor. System control and I/O tasks are processed by the peripheral processor. I/O is routed through high-speed, head-per-track disc storage. A data communications interface for the common carriers is provided for the support of remote batch and interactive terminals. Standard types of peripherals are also provided. The central memory serves as the common communications and access storage medium for these subsystems.


CENTRAL MEMORY

The ASC central memory consists of a memory control unit (MCU) and appropriately sized modules of high-speed or medium-speed central memory. Optionally, a medium-speed central memory extension can be used in conjunction with a high-speed memory.

The MCU is organized as a two-way, 256-bit/channel (8-word) parallel access traffic net between eight independent processor ports and nine memory buses, with each processor port having full accessibility to all memories. The nine memory buses are organized to provide eight-way interleaving for the first eight buses with the ninth bus used for the central memory extension. The MCU provides the facilities for controlling access from the eight processor ports to a CM having a 24-bit address space (16 million words). A port expander can be utilized to expand the number of processor ports. Figure 2 illustrates this structure.

The semiconductor high-speed central memory modules have a cycle time of 160 ns and a read time of 140 ns. Additionally, all transfers are 256 bits (eight 32-bit words) with a Hamming code providing single-bit error correction and double-bit error detection for each 32-bit word. High-speed central memory is typically divided into eight equal-sized modules which allow for eight-way interleaving.
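The effect of eight-way interleaving on sequential access can be illustrated with a small sketch. The paper does not give the actual address-to-module mapping, so the rule used here (consecutive octets rotate across the eight modules) is only an assumption, and the function names are hypothetical.

```python
WORDS_PER_OCTET = 8   # one 256-bit transfer holds eight 32-bit words
NUM_MODULES = 8       # eight-way interleaving over eight modules


def locate(word_address):
    """Map a word address to (module, row_in_module, word_in_octet).

    Assumption (not from the paper): consecutive octets are assigned
    to consecutive modules, wrapping around after the eighth.
    """
    octet, word = divmod(word_address, WORDS_PER_OCTET)
    row, module = divmod(octet, NUM_MODULES)
    return module, row, word


def modules_touched(start, count):
    """Modules visited by a sequential sweep of `count` words from `start`."""
    return [locate(a)[0] for a in range(start, start + count, WORDS_PER_OCTET)]


# A sweep over 64 consecutive words visits each module exactly once.
sweep = modules_touched(0, 64)
```

Under this assumed mapping a long sequential reference stream spreads evenly over the modules, so successive octet requests need not wait on the 160 ns cycle of any single module.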

Figure 1-Major ASC subsystems (central memory, central processor (CP), peripheral processor (PP), disc storage, data communications to common carriers, peripherals)

From the collection of the Computer History Museum (www.computerhistory.org)



390 National Computer Conference, 1974

Figure 2-Modular structure of the ASC central memory (interleaved high-speed or medium-speed memory modules; memory control unit (MCU); primary memory access ports; secondary memory access ports; optional extension of interleaved medium-speed memory modules)

The optional central memory extension allows large amounts of medium speed memory (1 µs semiconductor technology) to be used in the normal address space of central memory. Block transfer between memory extension and high-speed memory is controlled by the peripheral processor and will transfer at a rate of 40 M words per second.

Memory mapping registers and protection registers are used to facilitate central memory management and access control of the ports.

CENTRAL PROCESSOR

The central processor provides both scalar (single operand) and vector (array) instructions at the machine level. The basic instruction size is 32 bits, with 16-, 32-, or 64-bit operands. The single instruction stream, which contains a mixture of scalar and vector instructions, is preprocessed by the instruction processing unit.

The central processor design is such that one, two, three, or four execution units or pipes can be provided. These units employ the pipeline concept in both scalar and vector modes. A single execution unit can have up to twelve scalar instructions in process at one time. From one to four vector results can be produced every 60 ns, depending on the number of execution units provided.

The CP has 48 program-addressable registers. This group of 32-bit registers consists of sixteen base address registers, sixteen arithmetic registers, eight index registers, and eight vector parameter registers. This last group is used to extend the instruction format for the complete specification of vector instructions.

The CP scalar instruction repertoire includes an extensive set of load and store instructions: halfword, fullword, and doubleword instructions, with immediate, magnitude, and negative operand capabilities. Ability to load and store register files and to load effective addresses is also available. Arithmetic scalars include various adds, subtract, multiply, and divide for halfword (16-bit) and fullword (32-bit) fixed point numbers and fullword and doubleword (64-bit) floating point numbers. Scalar logical instructions are provided as are arithmetic, logical, and circular shifts. Various comparison instructions and combination comparison-logical instructions are provided for halfword, fullword, and doublewords. Many combinations of test and branching instructions with incrementing or decrementing capability are also available. Stacking and modifying arithmetic registers can be done with single instructions. Subroutine linkage is accomplished through branch and load instructions. Format conversion for single and doublewords, as well as normalize instructions, are available.

The vector capabilities of the CP are made available through the use of VECTL (vector after loading vector parameter file) and VECT (assumes parameter file is already loaded) instructions. The vector repertoire includes such arithmetic operations as add, subtract, multiply, divide, vector dot product, matrix multiplication, and others for both fixed point and floating point representations. Vector instructions are also available for shifting; logical operations; comparisons; format conversions; normalization; and special operations such as Merge, Order, Search, Peak Pick, Select and Replace, among others.

One important characteristic of the vector instruction capability is the ability to encompass three dimensions of addressability within a single vector instruction. This is equivalent to a nest of three indexing loops in a conventional machine.
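What "a nest of three indexing loops" means for one vector operation can be sketched in software. The following is an illustration only: the parameter names (`counts`, `x_steps`, and so on) are hypothetical stand-ins for whatever the vector parameter file actually holds, whose layout the paper does not describe.

```python
import operator


def vector_op_3d(op, x, y, z, counts, x_steps, y_steps, z_steps):
    """Apply `op` element by element over three nested index levels.

    x, y, z  : flat lists standing in for arrays in central memory
    counts   : (outer, middle, inner) loop counts
    *_steps  : address increment per level for each array
               (hypothetical; not the real parameter file format)
    """
    n_outer, n_middle, n_inner = counts
    for k in range(n_outer):
        for j in range(n_middle):
            for i in range(n_inner):
                xi = k * x_steps[0] + j * x_steps[1] + i * x_steps[2]
                yi = k * y_steps[0] + j * y_steps[1] + i * y_steps[2]
                zi = k * z_steps[0] + j * z_steps[1] + i * z_steps[2]
                z[zi] = op(x[xi], y[yi])
    return z


# Z = X * Y over a 2 x 3 x 4 block stored contiguously.
x = [float(v) for v in range(24)]
y = [2.0] * 24
z = [0.0] * 24
steps = (12, 4, 1)
vector_op_3d(operator.mul, x, y, z, (2, 3, 4), steps, steps, steps)
```

On a conventional machine each of the three loops costs increment, test, and branch instructions per iteration; the point of the hardware feature is that one instruction carries all three levels of counts and address increments.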

The basic structure of the CP, shown in Figure 3, has three major components: the instruction processing unit (IPU) for non-arithmetic stages of instruction processing for the CP instruction stream, the memory buffer unit (MBU) to provide operand interfacing with the central memory, and an arithmetic unit (AU) to perform the specified arithmetic or logical operations. Figure 3 shows a CP diagram for 2- or 4-pipeline CP's, each with a corresponding number of MBU-AU pairs. Note that a memory port is required for the IPU and, in addition, one memory port for each pipeline (MBU-AU pair) in a CP.

A significant feature of the CP hardware is an operand look-ahead capability which causes memory references to be requested prior to the time of actual need. Double buffering in multiple 8-word (octet) buffers for each pipeline provides a smooth data flow to and from each arithmetic unit.

Figure 3-Basic structure of the CP (two-pipeline and four-pipeline CP's, each with primary memory ports, an IPU, and MBU-AU pairs)

The pipelined AU achieves its highest sustained flow rate in the vector mode, typically a result each 60 ns per AU, or an average of 15 ns per result for a 4-pipe central processor.

Instruction processing unit

The primary function of the instruction processing unit (IPU) is to supply a continuous stream of instructions for execution by the other parts of the CP. One Central Memory port is required to provide the instruction stream. Two 8-word (octet) buffers are utilized to achieve a balanced stream of instructions from memory to the IPU. Instructions are transferred from memory in octets as are all other references to memory for fetching or storing of information.

Up to 36 instructions in various stages of execution can be overlapped within the 4-pipe CP. There are twenty positions for instructions in the 2-pipe CP and twelve positions for instructions in the 1-pipe CP. Four levels are contained within the IPU, and eight levels are contained in each arithmetic pipeline (MBU-AU pair). The IPU performs routing of instructions to the MBU-AU pairs based on an optimum use of arithmetic unit capability.
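The position counts quoted above follow directly from the level counts: four levels in the IPU plus eight in each MBU-AU pair. A short check, with hypothetical names, is:

```python
IPU_LEVELS = 4        # levels contained within the IPU
PIPELINE_LEVELS = 8   # levels in each arithmetic pipeline (MBU-AU pair)


def instruction_positions(pipes):
    """Instructions that can be in some stage of execution at once."""
    if pipes not in (1, 2, 3, 4):
        raise ValueError("the CP is described with one to four pipes")
    return IPU_LEVELS + PIPELINE_LEVELS * pipes


positions = {p: instruction_positions(p) for p in (1, 2, 4)}
```

This reproduces the twelve, twenty, and 36 positions stated for the 1-, 2-, and 4-pipe CP's. The same formula would give 28 for a three-pipe CP, a value the paper does not state.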

Vector processing is altered by software in order to distribute segments of the vector for multiple pipe systems. Several features are provided to alleviate the potential problems of branches and instruction dependencies in the instruction pipeline.

Memory buffer unit

The memory buffer unit (MBU) provides an interface between central memory and the arithmetic unit. Its primary function is to supply the arithmetic unit with a continuous stream of operands from memory and to provide for the storing of the results back to memory. All references to memory, whether for fetching or storing, are made in 8-word increments (octets).

The MBU has three double buffers, one octet per buffer, called the X and Y buffers for input and the Z buffers for output. This double buffering is provided so that pipeline processing can be sustained at a high rate with minimal memory access conflicts.

Arithmetic unit

The primary function of a CP arithmetic unit (AU) is to perform the arithmetic operations specified by the operation code of the instruction currently at the AU level. There is one AU per pipeline in the CP, each having a 60 ns basic cycle time. A distinguishing feature of an AU is the pipeline structure which allows efficient execution of the arithmetic part of all instructions. There are eight exclusive partitions of the AU pipeline involved, each of which can provide an output every 60 ns. These eight sections are (1) receiver register, (2) exponent subtract, (3) align, (4) add, (5) normalize, (6) multiply, (7) accumulate, and (8) output. Figure 4 shows how different sections of the AU are utilized for execution of particular instructions; e.g., floating point addition and fixed point multiplication.

Figure 4-Arithmetic unit pipeline (paths for floating add and fixed multiply through the receiver register, exponent subtract, align, add, normalize, multiply, accumulate, and output sections)

An AU is a 64-bit parallel operating unit for most scalar and vector instructions. Exceptions are double length multiply and all types of division. In these circumstances various combinations of the components of the AU are utilized; and, therefore, more than one clock cycle is required to complete these arithmetic operations.
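The relation between pipeline depth, the 60 ns cycle, and the sustained vector rate can be sketched with the standard timing model for a linear pipeline. This is an idealized illustration, not a description of the actual AU timing: it assumes every result passes through a fixed number of sections, one per cycle, with no memory conflicts, whereas the text notes that different instructions use different sections.

```python
CYCLE_NS = 60  # basic AU cycle time from the text


def pipeline_time_ns(n_results, stages_used=8, cycle_ns=CYCLE_NS):
    """Time to produce n results from an initially empty linear pipeline.

    The first result appears after `stages_used` cycles; each further
    result follows one cycle later (idealized assumption).
    """
    if n_results <= 0:
        return 0
    return (stages_used + n_results - 1) * cycle_ns


def effective_ns_per_result(n_results, stages_used=8, cycle_ns=CYCLE_NS):
    """Average time per result, including the pipeline fill time."""
    return pipeline_time_ns(n_results, stages_used, cycle_ns) / n_results


short_vector = effective_ns_per_result(10)     # fill time dominates
long_vector = effective_ns_per_result(1000)    # approaches 60 ns per result
```

Under these assumptions a 10-element vector averages 102 ns per result while a 1000-element vector averages about 60.4 ns, which is why the highest sustained rate is reached on long vectors.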

THE PERIPHERAL PROCESSOR

The peripheral processor (PP) is a powerful multiprocessor designed to perform the control and data management functions of the ASC. Several aspects of the implementation of the peripheral processor concept greatly increase the effectiveness of the ASC system.

The PP is a collection of eight individual processors called virtual processors (VP's). Each VP has its own program counter along with arithmetic, index, base, and instruction registers. The eight VP's share a read only memory, an arithmetic unit, an instruction processing unit, and a central memory buffer. Use of the common units is distributed among the VP's using sixteen single 85 ns cycles. When an equally distributed sequence of time units is used, each of the eight VP's receives two 85 ns cycles every 1.4 µs. The typical PP instruction requires two 85 ns cycles for completion. The distribution of available time units can be dynamically varied to suit particular processing requirements.
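The time-slot arithmetic above can be checked with a short sketch. The slot table representation and the function names are hypothetical; the paper states only the slot length, the number of slots, and that the distribution can be varied.

```python
from collections import Counter

SLOT_NS = 85                 # one time unit
SLOTS_PER_ROUND = 16         # sixteen units shared among the VP's
CYCLES_PER_INSTRUCTION = 2   # typical PP instruction


def round_time_ns():
    """Length of one full rotation through all sixteen slots."""
    return SLOT_NS * SLOTS_PER_ROUND


def equal_allocation(num_vps=8):
    """Slot table giving every VP the same share (VP index per slot)."""
    return [slot % num_vps for slot in range(SLOTS_PER_ROUND)]


def instructions_per_round(allocation):
    """Typical instructions each VP can complete in one rotation."""
    slots = Counter(allocation)
    return {vp: n // CYCLES_PER_INSTRUCTION for vp, n in sorted(slots.items())}


even = instructions_per_round(equal_allocation())
# A skewed table: VP 0 is given ten of the sixteen slots.
skewed = instructions_per_round([0] * 10 + [1, 2, 3, 4, 5, 6])
```

Sixteen slots of 85 ns give a 1360 ns rotation, which the text rounds to 1.4 µs. With the equal table each VP completes about one typical instruction per rotation; with the skewed table VP 0 completes five while the others accumulate only one slot per rotation.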

The 4K 32-bit words of read only memory within the PP are utilized for program storage and execution of those short routines which are highly utilized by the VP's, such as polling loops.

Because the PP is intended to perform control functions rather than execute mathematical algorithms, the instruction set is oriented toward control operations and does not require multiplication, division, or floating point operations. The instruction format is similar to that of the central processor, using a 32-bit word for each instruction. Instructions are provided for bit (1 bit), byte (8 bits), halfword (16 bits), and fullword (32 bits) operations.

Each VP has direct access to the entire central memory for program execution and data storage. Therefore, a single copy of reentrant code can be executed simultaneously by more than one VP.

The communications register (CR) file contains sixty-four 32-bit word registers which are program addressable by the VP's. The CR file serves as the principal storage medium for control information necessary for the coordination of all parts of the ASC system.

DISC STORAGE

Disc storage is the principal secondary storage system for the ASC system. Disc storage consists of head-per-track (H/T) disc systems supplemented by positioning-arm disc (PAD) systems.

The H/T disc system is a high-performance device whose effective performance is further enhanced because the operating system utilizes a shortest-access-time-first (SATF) algorithm for data transfers. This combination of hardware and software provides a very high effective transfer rate. Each H/T disc module has a capacity of 25 million 32-bit words with a transfer rate of approximately 500K words per second. Using the shortest-access-time-first algorithm, access time will average approximately 5 ms, which results in an exceptionally fast effective transfer rate.
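The idea behind shortest-access-time-first scheduling on a head-per-track disc can be sketched as follows. Because such a disc has no arm to move, the sketch treats access time as pure rotational delay. The request format, the sector counts, and the function names are assumptions for illustration; the paper does not describe how the operating system implements the algorithm.

```python
def rotational_delay(current_sector, target_sector, sectors_per_track):
    """Sectors that must pass under the heads before the target arrives."""
    return (target_sector - current_sector) % sectors_per_track


def satf_order(requests, start_sector, sectors_per_track):
    """Order requests so the one reachable soonest is always served next.

    requests: list of (name, first_sector, length_in_sectors) tuples
    """
    pending = list(requests)
    position = start_sector
    order = []
    while pending:
        nxt = min(
            pending,
            key=lambda r: rotational_delay(position, r[1], sectors_per_track),
        )
        pending.remove(nxt)
        order.append(nxt[0])
        # After the transfer the heads sit just past the last sector read.
        position = (nxt[1] + nxt[2]) % sectors_per_track
    return order


# Three queued transfers on a hypothetical 64-sector track, heads at sector 8.
queue = [("a", 50, 4), ("b", 10, 4), ("c", 12, 4)]
served = satf_order(queue, start_sector=8, sectors_per_track=64)
```

In this example request "b" is served first because it is two sectors away, then "a", and "c" waits for the next revolution since its start has already passed under the heads by the time "b" completes. A first-come-first-served order would have waited most of a revolution for "a" before touching "b".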

DATA COMMUNICATIONS

The data communication system is very modular and, thus, externally flexible in the various devices which may be utilized for communication with the ASC. Data communications are controlled by a data concentrator which, in turn, interfaces to the MCU through a channel control device. The data concentrator is a TI-980A minicomputer equipped with special-purpose hardware communication interface units on its direct memory access ports.

The data communications system presently supports communication with three types of stations: high-performance user terminals, other large computers, and remote concentrators. The system can be easily extended to support smaller terminals down to the teletype level. These stations may be either remote or local.

Remote links are presently implemented with non-switched, full duplex common carrier data transmission facilities. Data is transferred over these links synchronously at rates determined by the modems and common carrier bandwidths. The data communication system supports transfer rates up to a maximum of 240,000 bits per second.

PERIPHERALS

Standard types of magnetic tape drives, card equipment, and printers have been interfaced with the ASC. These interfaces attach to primary or secondary memory ports through a variety of standard selected and multiplexed data channels. A subset of the system's peripherals can also be interfaced via the communications register file.

SYSTEM SOFTWARE

Software design and development for the ASC system has progressed in parallel with development of the hardware. This was accomplished through the use of simulators, meta-assemblers, and higher level programming languages implemented on the systems supporting Texas Instruments' Corporate Information Center. Thus, the first version of this software was placed into operational status with the ASC prototype machine. The major software capabilities are discussed in the next few paragraphs with emphasis being given to those attributes which provide comprehensive and flexible programming facilities for the user.

ASC Fortran language

The most obvious interface between the ASC system and a user is with the translation of the user-written program into machine level instructions that efficiently utilize the special hardware features in the system. Texas Instruments has




Figure 5-GFDL ASC configuration (memory and expander; four head-per-track disc interface units, each 25M words at 500K words/sec; CP; two 1500 card/min card readers; three 1200 line/min line printers; two 100 card/min punches; two text editing CRT's; two operator communication CRT's; tape controller with tape switching unit; secondary storage channels 1 and 2; dual density 9-track 800/1600 BPI and 7-track 556/800 BPI tape drives)

GPOS performing all overhead functions in the Peripheral Processor. The operating system isolates the control, scheduling, and resource allocation algorithms for ease in tuning the system to match the specific requirements of each installation. The overall system architecture is maintained to accommodate hardware and software system growth and flexibility. GPOS, by its simplicity and modular design, minimizes the system use of central memory with a small resident system and the remainder of the system non-resident.

The design of GPOS exploits hardware features unique to the ASC. Most important of these features is complete access to central memory by the PP. Thus, a single reentrant copy of code is available to all processors; and, only a branch instruction is needed to switch a Virtual Processor from one function to another. The Communications Register (CR) file is used to allow one VP to control the other seven, while common access to the rest of this file supports communication between the processors and other system components.

OPERATIONAL HISTORY

The prototype ASC initially completed its checkout during the Spring of 1971. The system (Serial 1) was available for use as a software development tool and for customer demonstrations for the remainder of 1971. In 1972 the prototype was moved to a permanent location at the TI facility in Austin. During the period of downtime, a retrofit of the hardware was carried out to incorporate the latest version of circuits and boards and to support a production environment. System 1 was operational early in 1973 and is currently being devoted to software development and support of application program conversion to the ASC.

ASC 1 is configured with a one-pipe central processor, 128K words of high-speed central memory, 128K words of memory extension, a complement of head-per-track disc storage, a data communications interface, plus standard tape and paper devices.

Experience with an ASC operating in a center devoted to seismic production work is currently being gained in the TI facility at Amstelveen, Holland. This system (Serial 2) was delivered early in 1973 and essentially duplicates the capabilities described for the prototype machine. Additionally, several seismic interactive terminals are interfaced both locally and remotely to this system.

Seismic operational requirements are characterized by large data bases, much magnetic tape input and output, many job steps composed of long computational sequences, and the need to precisely control a complicated series of such jobs. In addition to the high computational speeds available on the ASC, the seismic center experience is showing that other ASC features are valuable when applied to this application.




Head-per-track disc storage, management of the data bases and scheduling by the dedicated virtual processors, and job control available via the JSL language appear to match the environment of seismic work. Applications programs are written in standard Fortran, and no need has been found to supplement the available compiler optimization by additional hand coding. The system is well supporting the requirements by generating significant improvements in unit processing costs and by permitting new processing technologies to be economically feasible. Improved productivity of geophysicists and geologists through real-time interactive sessions is being achieved. It is expected that the use of ASC for seismic processing capacity will continue to grow at a rapid rate.

Operational experience has also been gained from the application of the ASC to the U.S. Government data processing problem of ballistic missile defense. Serial 3, a one-pipe ASC with a configuration similar to the previously described systems, was delivered to the U.S. Army in the Summer of 1973. It is to be used for research into processing techniques employed in ballistic missile defense.

Application to long-range prediction of the earth's weather is the intended use of the largest and fastest ASC to be built to date. The National Oceanic and Atmospheric Administration (NOAA) has contracted for an ASC (Serial #4) for its Geophysical Fluid Dynamics Laboratory at Princeton University. Delivery is scheduled for early in 1974. The ASC is configured with a four-pipe central processor, one million words of high-speed central memory, head-per-track disc, text editing terminals, two channels of high density secondary storage devices, and standard magnetic tape and paper devices. This configuration is illustrated in Figure 5. Much experience has been gained using benchmark programs derived from weather models and the actual weather prediction codes themselves. Emphasis has been upon Fortran code generated by analysts and weather scientists instead of hand-optimized machine language. Results obtained from the system while undergoing final checkout at TI's facility showed the speeds available to be several times faster than other current computer systems.

For weather codes characterized by large data bases that are updated frequently, sequences of heavy computational work using the data, and mathematical operations performed on long arrays of data, the ASC is proving to be a valuable asset. The large central memory enables one to maintain ample data so that the central processor is utilized to a very high degree. The I/O and multiprogramming capabilities managed by the operating system resident in the peripheral processor also support high CP workloads.

TABLE I-Simple Examples of Vectors

(1)        DO 10 K=1,50
           DO 10 J=1,50
           DO 10 I=1,50
        10 Z(I,J,K)=X(I,J,K)*Y(I,J,K)

(2)     Z=X*Y

(3)     VECTL (#460, B2)     VMF

TABLE II-Vector Instructions Produced from Weather Code

(1)         DO 100 K=1,10
            DO 100 I=1,144
            TBXY(I,K)=(T(I+1,K,J)+T(I,K,J))*0.5
            TXY(I,K)=(T(I+1,K,J)-T(I,K,J))*RDX(JC)
            PBXY(I,K)=(PS(I+1,K,J)+PS(I,K,J))*0.5
        100 PXY(I,K)=(PS(I+1,K,J)-PS(I,K,J))*RDX(JC)

(2)     VECTL (#3B8, B2)     VAF
        VECTL (#3C0, B2)     VMF
        VECTL (#3C8, B2)     VSF
        VECTL (#3D0, B2)     VMF
        VECTL (#3D8, B2)     VAF
        VECTL (#3E0, B2)     VMF
        VECTL (#3E8, B2)     VSF
        VECTL (#3F0, B2)     VMF

MAXIMIZING PERFORMANCE

Experience thus far has shown that for the applications that have been considered by ASC users the most cost effective performance is realizable when the capabilities of ASC Fortran and the optimizing compiler are used. Although particular sequences of code can be found wherein hand coding will improve the speed of execution, for the broad range of programs where much applications code is involved, compiler-generated object code is the best choice. American National Standards Institute (ANS) Fortran is completely sufficient, and vector instructions are readily produced from this Fortran. ASC extensions to the Fortran are sometimes found to be useful, not to provide unique access to some hardware feature but to simplify notation involved in writing the program so that the programmer can deal more directly with the mathematics of the application.

The ASC system design allows easy user access to performance enhancement through the use of additional central processor pipes. Compiler software is responsible for both the generation of vector instructions and the partitioning of these vector operations over multiple pipes. Protection of the user from vector hazard conditions is carried out by the compiler. Partitioning of scalar instructions for multiple pipes is carried out by the CP hardware. Extensive checks are made by hardware to protect the user from illegal scalar conditions that might occur. For mixtures of vector instructions and for mixtures of scalars and vectors, the compiler prevents illegal conditions by the use of directive instructions for the CP to operate in either parallel mode (FORK) or sequential mode (JOIN). Thus, the burden is on the system instead of the user. Programs compiled for one-pipe ASC's will execute correctly on multiple-pipe systems. Performance will be increased via a recompilation for the multiple-pipe machine.

Some typical examples of efficient code produced from present applications will illustrate the optimization level provided by the system. Table I shows the type of instruction generated by the compiler from a typical triple-nested DO LOOP. (1) gives the Fortran source with three levels of indexing, (2) is an alternate notation that could be used, and (3) is the single vector instruction produced.





TABLE III-ASC Maximum Performance Rate

                  ASC 1X (ONE AU)              ASC 4X (FOUR AU'S)
                  32-BIT        64-BIT         32-BIT        64-BIT
                  RESULTS/SEC   RESULTS/SEC    RESULTS/SEC   RESULTS/SEC

ADD               16 x 10^6     9.2 x 10^6     64 x 10^6     37 x 10^6
MULTIPLY          16 x 10^6     5.3 x 10^6     64 x 10^6     21 x 10^6
DOT PRODUCT       16 x 10^6     4.0 x 10^6     64 x 10^6     16 x 10^6

It is a floating vector multiply instruction preceded by the loading of the vector parameter registers. Table II gives some typical code found in weather models. A double-nested DO LOOP with typical indexing conventions is shown in (1). (2) gives the sequence of instructions produced by the ASC compiler. All instructions are vectors, and the necessary indexing information for addressing purposes is contained in each vector parameter file. No scalar instructions are necessary in this example.

A powerful example of vector instruction capabilities is found in the use of the hardware-implemented dot-product operation. This operation consists of the multiplication of appropriate elements of two arrays followed by the sum of the products. To implement a matrix multiply operation from Fortran, the ASC compiler uses a single dot-product instruction and the complex indexing capability of the hardware to carry out the full matrix multiply. Three levels of addressing changes are implied in this case, and the hardware is designed to comprehend this level of indexing complexity.

The execution rate for the elementary operations of matrix multiply is one result per clock cycle for a one-pipe CP, or a rate of four results per clock cycle for a four-pipe CP. The compiler partitions the total matrix multiply across the appropriate number of pipes. Therefore, to complete a matrix multiply of two N by N matrices, a four-pipe CP will require approximately N^3/4 times the clock cycle time in seconds. This does not include the startup overhead necessary to fill the pipelines with operands.
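The N^3/4 estimate can be made concrete with a short sketch. The first function counts the elementary multiply-add operations in a matrix multiply built from dot products; the second applies the stated rate of one elementary operation per pipe per clock cycle. The 60 ns default is the AU cycle time given earlier, the function names are hypothetical, and startup overhead is ignored as in the text.

```python
def matmul_by_dot_products(a, b):
    """Multiply two N x N matrices, counting elementary multiply-adds."""
    n = len(a)
    ops = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):          # one dot product of length N
                acc += a[i][k] * b[k][j]
                ops += 1
            c[i][j] = acc
    return c, ops


def estimated_seconds(n, pipes=4, cycle_ns=60):
    """Approximate run time: N^3 elementary operations spread over the pipes."""
    return n ** 3 / pipes * cycle_ns * 1e-9


product, op_count = matmul_by_dot_products([[1, 2], [3, 4]], [[5, 6], [7, 8]])
t_100 = estimated_seconds(100)
```

For the 2 by 2 case the count is 8, that is N^3. For two 100 by 100 matrices on a four-pipe CP the estimate is 250,000 cycles, or about 15 ms at 60 ns per cycle.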

TABLE IV-Relative Computer Capacity* Third Generation Systems

MFR         MODEL               RELATIVE SPEED

IBM         S/360 MODEL 65      1
IBM         S/360 MODEL 75      1.5
CDC         6500                1.5
CDC         6600                2.5
IBM         S/370 MODEL 165     3.5
IBM         S/360 MODEL 91      5
HITACHI     HITAC 8800          5
IBM         S/360 MODEL 95      7
CDC         7600                8
IBM         S/360 MODEL 195     8

* Data taken from Table E, page 546, Program for the study conference on the Modeling Aspects of GATE, Bulletin of the American Meteorological Society, Vol. 54, No. 6, June, 1973.

It is the authors' opinion that performance indices for array-oriented architectures are not meaningful when only the Millions of Instructions Per Second (MIPS) factor is used. Since a single vector instruction is equivalent to several scalar instructions (typically Load, Operation, Increment and Test Branch), and the number of data values used determines the number of executions of these scalar instructions, MIPS ratings are ambiguous at best.

Consider the performance of an ASC producing results per second. In this context results per second is the rate at which data fetched from central memory can be operated upon and the results stored back into central memory. Table III shows the maximum performance rates for one- and four-pipe ASC systems performing typical arithmetic operations. Assumptions are that the clock cycle is 60 nanoseconds and that the pipelines are already filled with operands. Vector dot product is a special case in the sense that the results per second rate pertains to the elementary operations.

Another performance measure can be determined from the present performance of ASC System 4 executing a particular weather benchmark. Although the benchmark is not a full weather prediction code, it does have the characteristic source code sequences and reflects the ability of the Fortran compiler to produce efficient code from a large applications package. Execution speed of the benchmark on the IBM Model 91 is approximately 246 minutes, and present ASC timing with checkout not finalized has already demonstrated approximately 30 minutes. This ratio of 8.2 is a measure of the total system performance upon this program. It reflects a mix of both scalar and vector instructions as well as I/O and other system services. The design of the ASC has been directed toward matching the real world mix of instructions encountered in typical applications instead of sacrificing scalar capability to provide vector capability.

In order to compare the observed ASC performance on the Weather Benchmark, data found in the Bulletin of the American Meteorological Society (Reference 1) is given in Table IV. Using the IBM S/360 Model 65 as the basis of reference, each of the systems listed is compared as to relative speed. Using the observed ASC/M91 ratio of 8.2, the present ASC speed would be 41 in the table.
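The two figures quoted above, 8.2 and 41, follow from the benchmark times and the Model 91 entry in Table IV. A short check, with hypothetical names, is:

```python
MODEL_91_RELATIVE_SPEED = 5   # Table IV entry, relative to the S/360 Model 65


def speed_ratio(reference_minutes, measured_minutes):
    """How many times faster the measured system ran the same benchmark."""
    return reference_minutes / measured_minutes


def relative_to_model_65(ratio_vs_model_91):
    """Place a system on the Table IV scale via its ratio to the Model 91."""
    return ratio_vs_model_91 * MODEL_91_RELATIVE_SPEED


asc_vs_m91 = speed_ratio(246, 30)                    # benchmark minutes from the text
asc_table_entry = relative_to_model_65(asc_vs_m91)
```

The ratio is 246/30 = 8.2, and 8.2 x 5 = 41. The placement on the Table IV scale rests on a single benchmark, so it is a measure of total system performance on this program and not a general rating.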

ACKNOWLEDGMENTS

It would not be possible to acknowledge all the contributors to the development of the ASC; but particular recognition should be given to Messrs. H. G. Cragon, W. D. Kastner, E. H. Husband, D. R. Best, C. M. Stephenson, C. R. Hall, F. A. Galindo, E. C. Garth, and N. M. Chandler, who contributed significantly to the development of the hardware. Software concepts are due in large part to the efforts of Messrs. L. C. Dean, G. T. Boswell, A. E. Riccomi, F. A. Little, W. Winkelman, W. L. Cohagan, and S. D. Nolte. Many other members of the Texas Instruments staff have also contributed immeasurably in the development of the ASC.

REFERENCES

1. Program for the study conference on the Modeling Aspects of GATE, Bulletin of the American Meteorological Society, Vol. 54, No. 6, June, 1973, page 546, Table E.


