Operational Experiences with the TI Advanced Scientific Computer


  • 7/24/2019 Operational experiences with the TI Advanced Scientific Computer


    Operational experiences with the TI Advanced Scientific Computer

    by W. J. WATSON and H. M. CARR

    Texas Instruments Incorporated

    Austin, Texas

    INTRODUCTION

    Since 1966 a large computer development program has been conducted by Texas Instruments. The goal of this effort was to provide needed capacity for supporting seismic processing and to offer a general-purpose capability for large scientific problems.

    This development has resulted in the Advanced Scientific Computer (ASC), a highly modular system offering a wide spectrum of processor power, memory sizes, and I/O capability. The ASC is a high-speed, large-scale processing system featuring extensive use of pipelining, multiple arithmetic units, separate control processors, large and fast central memory, and extensive user software aids. The central processor has both scalar and vector instruction capabilities.

    The ASC was first delivered in 1972 and placed into operational status during 1973; several operational ASC systems now offer extremely high processing rates for particular classes of problems.

    OVERVIEW OF THE SYSTEM

    The major subsystems of a typical configuration are shown in Figure 1: the central memory, the central processor, the peripheral processor, on-line bulk storage, a digital communications interface, plus a selection of standard peripherals.

    The peripheral processor has been designed for executing the operating system. The central processor has been designed expressly to provide high computing speeds when operating upon large arrays of data. The central processor operates as a slave to the peripheral processor. This design approach was chosen to maximize the overlapping of system overhead tasks with the execution of user programs. In operation the job stream is analyzed by the peripheral processor. The language processors, plus user object code, are executed by the central processor. System control and I/O tasks are processed by the peripheral processor. I/O is routed through high-speed, head-per-track disc storage. A data communications interface for the common carriers is provided for the support of remote batch and interactive terminals. Standard types of peripherals are also provided. The central memory serves as the common communications and access storage medium for these subsystems.


    CENTRAL MEMORY

    The ASC central memory consists of a memory control unit (MCU) and appropriately sized modules of high-speed or medium-speed central memory. Optionally, a medium-speed central memory extension can be used in conjunction with a high-speed memory.

    The MCU is organized as a two-way, 256-bit/channel (8-word) parallel access traffic net between eight independent processor ports and nine memory buses, with each processor port having full accessibility to all memories. The nine memory buses are organized to provide eight-way interleaving for the first eight buses, with the ninth bus used for the central memory extension. The MCU provides the facilities for controlling access from the eight processor ports to a central memory (CM) having a 24-bit address space (16 million words). A port expander can be utilized to expand the number of processor ports. Figure 2 illustrates this structure.

    The semiconductor high-speed central memory modules have a cycle time of 160 ns and a read time of 140 ns. Additionally, all transfers are 256 bits (eight 32-bit words), with a Hamming code providing single-bit error correction and double-bit error detection for each 32-bit word. High-speed central memory is typically divided into eight equal-sized modules, which allows for eight-way interleaving.
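As a rough illustration of eight-way interleaving, the sketch below maps a word address to a memory module. The paper does not state the interleave granularity, so the assumption here is that consecutive 8-word octets rotate round-robin across the eight modules; the function name `locate` and the returned tuple layout are hypothetical.

```python
WORDS_PER_OCTET = 8   # one 256-bit transfer holds eight 32-bit words (from the text)
NUM_MODULES = 8       # eight-way interleaving (from the text)

def locate(word_address):
    """Return (module, octet_in_module, word_in_octet) for a word address.

    Assumption: successive octets are assigned to modules round-robin,
    so eight consecutive octet transfers each touch a different module.
    """
    octet, word_in_octet = divmod(word_address, WORDS_PER_OCTET)
    octet_in_module, module = divmod(octet, NUM_MODULES)
    return module, octet_in_module, word_in_octet

# Walk through the first few octets to see the rotation across modules.
for address in (0, 8, 16, 56, 64):
    print(address, locate(address))
```

Under this assumed mapping, a sequential sweep through memory spreads its octet requests over all eight modules before returning to the first, which is what lets a module's 160 ns cycle overlap with transfers from the others.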

    Figure 1-Major ASC subsystems: central memory, central processor (CP), peripheral processor (PP), disc storage, data communications (to common carriers), and peripherals

    From the collection of the Computer History Museum (www.computerhistory.org)


    390 National Computer Conference, 1974

    Figure 2-Modular structure of the ASC central memory: interleaved high-speed or medium-speed memory modules and an optional interleaved medium-speed memory extension, connected through the memory control unit (MCU) to primary and secondary memory access ports

    The optional central memory extension allows large amounts of medium-speed memory (1 μs semiconductor technology) to be used in the normal address space of central memory. Block transfer between memory extension and high-speed memory is controlled by the peripheral processor and proceeds at a rate of 40 M words per second.

    Memory mapping registers and protection registers are used to facilitate central memory management and access control of the ports.

    CENTRAL PROCESSOR

    The central processor provides both scalar (single operand) and vector (array) instructions at the machine level. The basic instruction size is 32 bits, with 16-, 32-, or 64-bit operands. The single instruction stream, which contains a mixture of scalar and vector instructions, is preprocessed by the instruction processing unit.

    The central processor design is such that one, two, three, or four execution units, or pipes, can be provided. These units employ the pipeline concept in both scalar and vector modes. A single execution unit can have up to twelve scalar instructions in process at one time. From one to four vector results can be produced every 60 ns, depending on the number of execution units provided.

    The CP has 48 program-addressable registers. This group of 32-bit registers consists of sixteen base address registers, sixteen arithmetic registers, eight index registers, and eight vector parameter registers. This last group is used to extend the instruction format for the complete specification of vector instructions.

    The CP scalar instruction repertoire includes an extensive set of load and store instructions: halfword, fullword, and doubleword instructions, with immediate, magnitude, and negative operand capabilities. The ability to load and store register files and to load effective addresses is also available. Arithmetic scalar instructions include various add, subtract, multiply, and divide instructions for halfword (16-bit) and fullword (32-bit) fixed point numbers and for fullword and doubleword (64-bit) floating point numbers. Scalar logical instructions are provided, as are arithmetic, logical, and circular shifts. Various comparison instructions and combination comparison-logical instructions are provided for halfwords, fullwords, and doublewords. Many combinations of test and branching instructions with incrementing or decrementing capability are also available. Stacking and modifying arithmetic registers can be done with single instructions. Subroutine linkage is accomplished through branch and load instructions. Format conversion for single and doublewords, as well as normalize instructions, are available.

    The vector capabilities of the CP are made available through the use of VECTL (vector after loading vector parameter file) and VECT (assumes parameter file is already loaded) instructions. The vector repertoire includes such arithmetic operations as add, subtract, multiply, divide, vector dot product, matrix multiplication, and others for both fixed point and floating point representations. Vector instructions are also available for shifting; logical operations; comparisons; format conversions; normalization; and special operations such as Merge, Order, Search, Peak Pick, Select, and Replace, among others.

    One important characteristic of the vector instruction capability is the ability to encompass three dimensions of addressability within a single vector instruction. This is equivalent to a nest of three indexing loops in a conventional machine.
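The equivalence between one three-dimensional vector instruction and a nest of three loops can be sketched in Python. This is only a model: the `(counts, strides)` descriptor below is a hypothetical stand-in for the vector parameter file, whose actual layout the paper does not give, and memory is represented as a flat list.

```python
from itertools import product

def nested_loops(x, y, counts, strides):
    """Conventional form: three explicit indexing loops."""
    (n_inner, n_mid, n_outer), (s_inner, s_mid, s_outer) = counts, strides
    z = [0.0] * len(x)
    for k in range(n_outer):
        for j in range(n_mid):
            for i in range(n_inner):
                addr = i * s_inner + j * s_mid + k * s_outer
                z[addr] = x[addr] * y[addr]
    return z

def single_vector_op(op, x, y, counts, strides):
    """Vector form: one call whose descriptor carries all three dimensions."""
    (n_inner, n_mid, n_outer), (s_inner, s_mid, s_outer) = counts, strides
    z = [0.0] * len(x)
    for k, j, i in product(range(n_outer), range(n_mid), range(n_inner)):
        addr = i * s_inner + j * s_mid + k * s_outer
        z[addr] = op(x[addr], y[addr])
    return z

# A 3 x 4 x 5 array stored column-major, as Fortran would store Z(I, J, K).
counts = (3, 4, 5)
strides = (1, 3, 12)
x = [float(n) for n in range(60)]
y = [2.0] * 60
same = single_vector_op(lambda a, b: a * b, x, y, counts, strides) == nested_loops(x, y, counts, strides)
print(same)
```

The point of the sketch is that the loop control lives in the descriptor rather than in separately issued branch and index instructions, so the caller issues a single operation for the whole array.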

    The basic structure of the CP, shown in Figure 3, has three major components: the instruction processing unit (IPU) for non-arithmetic stages of instruction processing for the CP instruction stream, the memory buffer unit (MBU) to provide operand interfacing with the central memory, and an arithmetic unit (AU) to perform the specified arithmetic or logical operations. Figure 3 shows a CP diagram for 2- or 4-pipeline CP's, each with a corresponding number of MBU-AU pairs. Note that a memory port is required for the IPU and, in addition, one memory port for each pipeline (MBU-AU pair) in a CP.

    Figure 3-Basic structure of the CP (two-pipeline and four-pipeline CP's with their primary memory ports and MBU-AU pairs)

    A significant feature of the CP hardware is an operand look-ahead capability which causes memory references to be requested prior to the time of actual need. Double buffering in multiple 8-word (octet) buffers for each pipeline provides a smooth data flow to and from each arithmetic unit. The pipelined AU achieves its highest sustained flow rate in the vector mode, typically a result each 60 ns per AU, or an average of 15 ns per result for a 4-pipe central processor.
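The vector-mode figures quoted above follow from dividing the 60 ns AU cycle by the number of pipes. A minimal sketch of that arithmetic, assuming every pipe delivers one result per cycle with no memory conflicts (a peak rate, not a measured one); the function names are illustrative only.

```python
AU_CYCLE_NS = 60.0  # one vector result per AU per cycle (from the text)

def average_ns_per_result(pipes):
    """Average time between vector results for a CP with `pipes` pipelines."""
    if pipes not in (1, 2, 3, 4):
        raise ValueError("the text describes CPs with one to four pipes")
    return AU_CYCLE_NS / pipes

def peak_results_per_second(pipes):
    """Peak vector result rate implied by the average time per result."""
    return 1e9 / average_ns_per_result(pipes)

for pipes in (1, 2, 3, 4):
    rate_millions = round(peak_results_per_second(pipes) / 1e6, 1)
    print(pipes, average_ns_per_result(pipes), rate_millions)
```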

    Instruction processing unit

    The primary function of the instruction processing unit (IPU) is to supply a continuous stream of instructions for execution by the other parts of the CP. One central memory port is required to provide the instruction stream. Two 8-word (octet) buffers are utilized to achieve a balanced stream of instructions from memory to the IPU. Instructions are transferred from memory in octets, as are all other references to memory for fetching or storing of information.

    Up to 36 instructions in various stages of execution can be overlapped within the 4-pipe CP. There are twenty positions for instructions in the 2-pipe CP and twelve positions for instructions in the 1-pipe CP. Four levels are contained within the IPU, and eight levels are contained in each arithmetic pipeline (MBU-AU pair). The IPU performs routing of instructions to the MBU-AU pairs based on an optimum use of arithmetic unit capability.
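The instruction-position counts quoted for the 1-, 2-, and 4-pipe CP are consistent with four IPU levels plus eight levels per MBU-AU pair. The short check below makes that explicit; the three-pipe value it produces is an extrapolation from the same formula, not a figure stated in the paper.

```python
IPU_LEVELS = 4        # levels within the IPU (from the text)
LEVELS_PER_PIPE = 8   # levels in each MBU-AU pair (from the text)

def instruction_positions(pipes):
    """Number of instructions that can be in process at once."""
    return IPU_LEVELS + LEVELS_PER_PIPE * pipes

for pipes in (1, 2, 4):
    print(pipes, instruction_positions(pipes))
```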

    Vector processing is altered by software in order to distribute segments of the vector for multiple-pipe systems. Several features are provided to alleviate the potential problems of branches and instruction dependencies in the instruction pipeline.

    Memory buffer unit

    The memory buffer unit (MBU) provides an interface between central memory and the arithmetic unit. Its primary function is to supply the arithmetic unit with a continuous stream of operands from memory and to provide for the storing of the results back to memory. All references to memory, whether for fetching or storing, are made in 8-word increments (octets).

    The MBU has three double buffers, one octet per buffer, called the X and Y buffers for input and the Z buffers for output. This double buffering is provided so that pipeline processing can be sustained at a high rate with minimal memory access conflicts.

    Arithmetic unit

    The primary function of a CP arithmetic unit (AU) is to perform the arithmetic operations specified by the operation code of the instruction currently at the AU level. There is one AU per pipeline in the CP, each having a 60 ns basic cycle time. A distinguishing feature of an AU is the pipeline structure, which allows efficient execution of the arithmetic part of all instructions. There are eight exclusive partitions of the AU pipeline involved, each of which can provide an output every 60 ns. These eight sections are (1) receiver register, (2) exponent subtract, (3) align, (4) add, (5) normalize, (6) multiply, (7) accumulate, and (8) output. Figure 4 shows how different sections of the AU are utilized for execution of particular instructions; i.e., floating point addition and fixed point multiplication.

    Figure 4-Arithmetic unit pipeline (section usage for floating add and fixed multiply)

    An AU is a 64-bit parallel operating unit for most scalar and vector instructions. Exceptions are double length multiply and all types of division. In these circumstances various combinations of the components of the AU are utilized; therefore, more than one clock cycle is required to complete these arithmetic operations.

    THE PERIPHERAL PROCESSOR

    The peripheral processor (PP) is a powerful multiprocessor designed to perform the control and data management functions of the ASC. Several aspects of the implementation of the peripheral processor concept greatly increase the effectiveness of the ASC system.

    The PP is a collection of eight individual processors called virtual processors (VP's). Each VP has its own program counter along with arithmetic, index, base, and instruction registers. The eight VP's share a read only memory, an arithmetic unit, an instruction processing unit, and a central memory buffer. Use of the common units is distributed among the VP's using sixteen single 85 ns cycles. When an equally distributed sequence of time units is used, each of the eight VP's receives two 85 ns cycles every 1.4 μs. The typical PP instruction requires two 85 ns cycles for completion. The distribution of available time units can be dynamically varied to suit particular processing requirements.
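The time-slot sharing among the virtual processors can be modeled as a table of sixteen entries, each naming the VP that owns one 85 ns cycle. The sketch below is illustrative only: the paper does not describe how the hardware stores or orders the slot assignments, so the list representation, the slot ordering, and the skewed example are assumptions.

```python
SLOT_NS = 85          # one PP time unit (from the text)
SLOTS_PER_ROUND = 16  # sixteen time units per round (from the text)
NUM_VPS = 8           # eight virtual processors (from the text)

def equal_schedule():
    """Hypothetical slot table giving every VP two slots per round."""
    return [slot % NUM_VPS for slot in range(SLOTS_PER_ROUND)]

def slots_per_vp(schedule):
    """Count how many time units each VP owns in one round."""
    counts = [0] * NUM_VPS
    for vp in schedule:
        counts[vp] += 1
    return counts

def round_time_ns(schedule):
    """Length of one full round of the slot table, in nanoseconds."""
    return len(schedule) * SLOT_NS

def instructions_per_round(schedule, cycles_per_instruction=2):
    """Typical instructions each VP completes per round (two cycles each)."""
    return [n // cycles_per_instruction for n in slots_per_vp(schedule)]

# A skewed table (assumed for illustration): VP 0 gets nine slots, the rest one each.
skewed = [0] * 9 + list(range(1, NUM_VPS))

print(round_time_ns(equal_schedule()), slots_per_vp(equal_schedule()))
print(slots_per_vp(skewed), instructions_per_round(skewed))
```

Sixteen 85 ns cycles make a round of 1360 ns, which matches the roughly 1.4 microsecond period quoted in the text, and with the equal table each VP completes about one typical instruction per round.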

    The 4K 32-bit words of read only memory within the PP are utilized for program storage and execution of those short routines which are highly utilized by the VP's, such as polling loops.

    Because the PP is intended to perform control functions rather than execute mathematical algorithms, the instruction set is oriented toward control operations and does not require multiplication, division, or floating point operations. The instruction format is similar to that of the central processor, using a 32-bit word for each instruction. Instructions are provided for bit (1 bit), byte (8 bits), halfword (16 bits), and fullword (32 bits) operations.

    Each VP has direct access to the entire central memory for program execution and data storage. Therefore, a single copy of reentrant code can be executed simultaneously by more than one VP.

    The communications register (CR) file contains sixty-four 32-bit word registers which are program addressable by the VP's. The CR file serves as the principal storage medium for control information necessary for the coordination of all parts of the ASC system.

    DISC STORAGE

    Disc storage is the principal secondary storage system for the ASC system. Disc storage consists of head-per-track (H/T) disc systems supplemented by positioning-arm disc (PAD) systems.

    The H/T disc system is a high-performance device whose effective performance is further enhanced because the operating system utilizes a shortest-access-time-first (SATF) algorithm for data transfers. This combination of hardware and software provides a very high effective transfer rate.

    Each H/T disc module has a capacity of 25 million 32-bit words with a transfer rate of approximately 500K words per second. Using the shortest-access-time-first algorithm, access time will average approximately 5 ms, which results in an exceptionally fast effective transfer rate.
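The idea behind shortest-access-time-first ordering on a head-per-track disc can be sketched as follows. With a head on every track there is no arm movement, so the sketch assumes access time is purely the rotational wait until a request's first sector arrives under the heads. The sector model, the request tuples, and the function names are all hypothetical; the paper does not describe the operating system's actual implementation.

```python
def rotational_wait(current_sector, start_sector, sectors_per_track):
    """Sectors that must pass before `start_sector` reaches the heads."""
    return (start_sector - current_sector) % sectors_per_track

def satf_order(requests, current_sector, sectors_per_track):
    """Order requests so the one reachable soonest is always served next.

    Each request is (name, start_sector, length_in_sectors).
    """
    pending = list(requests)
    order = []
    position = current_sector
    while pending:
        nearest = min(
            pending,
            key=lambda r: rotational_wait(position, r[1], sectors_per_track),
        )
        pending.remove(nearest)
        order.append(nearest[0])
        # After the transfer, the heads sit just past the end of the request.
        position = (nearest[1] + nearest[2]) % sectors_per_track
    return order

requests = [("a", 10, 2), ("b", 3, 1), ("c", 60, 4)]
print(satf_order(requests, current_sector=0, sectors_per_track=64))
print(satf_order(requests, current_sector=50, sectors_per_track=64))
```

Serving whichever pending request comes under the heads first means the queue order changes with the disc's current rotational position, which is why the second call above produces a different order from the first.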

    DATA COMMUNICATIONS

    The data communication system is very modular and, thus, externally flexible in the various devices which may be utilized for communication with the ASC. Data communications are controlled by a data concentrator which, in turn, interfaces to the MCU through a channel control device. The data concentrator is a TI-980A minicomputer equipped with special-purpose hardware communication interface units on its direct memory access ports.

    The data communications system presently supports communication with three types of stations: high-performance user terminals, other large computers, and remote concentrators. The system can be easily extended to support smaller terminals down to the teletype level. These stations may be either remote or local.

    Remote links are presently implemented with non-switched, full duplex common carrier data transmission facilities. Data is transferred over these links synchronously at rates determined by the modems and common carrier bandwidths. The data communication system supports transfer rates up to a maximum of 240,000 bits per second.

    PERIPHERALS

    Standard types of magnetic tape drives, card equipment, and printers have been interfaced with the ASC. These interfaces attach to primary or secondary memory ports through a variety of standard selected and multiplexed data channels. A subset of the system's peripherals can also be interfaced via the communications register file.

    SYSTEM SOFTWARE

    Software design and development for the ASC system has progressed in parallel with development of the hardware. This was accomplished through the use of simulators, meta-assemblers, and higher level programming languages implemented on the systems supporting Texas Instruments' Corporate Information Center. Thus, the first version of this software was placed into operational status with the ASC prototype machine. The major software capabilities are discussed in the next few paragraphs, with emphasis being given to those attributes which provide comprehensive and flexible programming facilities for the user.

    ASC Fortran language

    The most obvious interface between the ASC system and a user is with the translation of the user-written program into machine level instructions that efficiently utilize the special hardware features in the system. Texas Instruments has

    Figure 5-GFDL ASC configuration (four H/T disc units of 25M words at 500K words/sec each, with disc interfaces; CP; two 1500 card/min card readers; three 1200 line/min line printers; two 100 card/min punches; two text editing CRTs; operator communications; tape controller; two channels of secondary storage; tape switching unit; six dual density 9-track 800/1600 BPI tape drives; dual density 7-track 556/800 BPI tape drives)

    GPOS performing all overhead functions in the Peripheral Processor. The operating system isolates the control, scheduling, and resource allocation algorithms for ease in tuning the system to match the specific requirements of each installation. The overall system architecture is maintained to accommodate hardware and software system growth and flexibility. GPOS, by its simplicity and modular design, minimizes the system use of central memory with a small resident system and the remainder of the system non-resident.

    The design of GPOS exploits hardware features unique to the ASC. Most important of these features is complete access to central memory by the PP. Thus, a single reentrant copy of code is available to all processors, and only a branch instruction is needed to switch a Virtual Processor from one function to another. The Communications Register (CR) file is used to allow one VP to control the other seven, while common access to the rest of this file supports communication between the processors and other system components.

    OPERATIONAL HISTORY

    The prototype ASC initially completed its checkout during the Spring of 1971. The system (Serial 1) was available for use as a software development tool and for customer demonstrations for the remainder of 1971. In 1972 the prototype was moved to a permanent location at the TI facility in Austin. During the period of downtime, a retrofit of the hardware was carried out to incorporate the latest version of circuits and boards and to support a production environment. System 1 was operational early in 1973 and is currently being devoted to software development and support of application program conversion to the ASC.

    ASC 1 is configured with a one-pipe central processor, 128K words of high-speed central memory, 128K words of memory extension, a complement of head-per-track disc storage, a data communications interface, plus standard tape and paper devices.

    Experience with an ASC operating in a center devoted to seismic production work is currently being gained in the TI facility at Amstelveen, Holland. This system (Serial 2) was delivered early in 1973 and essentially duplicates the capabilities described for the prototype machine. Additionally, several seismic interactive terminals are interfaced both locally and remotely to this system.

    Seismic operational requirements are characterized by large data bases, much magnetic tape input and output, many job steps composed of long computational sequences, and the need to precisely control a complicated series of such jobs. In addition to the high computational speeds available on the ASC, the seismic center experience is showing that other ASC features are valuable when applied to this application. Head-per-track disc storage, management of the data bases and scheduling by the dedicated virtual processors, and job control available via the JSL language appear to match the environment of seismic work. Applications programs are written in standard Fortran, and no need has been found to supplement the available compiler optimization by additional hand coding. The system is supporting the requirements well by generating significant improvements in unit processing costs and by permitting new processing technologies to be economically feasible. Improved productivity of geophysicists and geologists through real-time interactive sessions is being achieved. It is expected that the use of ASC for seismic processing capacity will continue to grow at a rapid rate.

    Operational experience has also been gained from the application of the ASC to the U.S. Government data processing problem of ballistic missile defense. Serial 3, a one-pipe ASC with a configuration similar to the previously described systems, was delivered to the U.S. Army in the Summer of 1973. It is to be used for research into processing techniques employed in ballistic missile defense.

    Application to long-range prediction of the earth's weather is the intended use of the largest and fastest ASC to be built to date. The National Oceanic and Atmospheric Administration (NOAA) has contracted for an ASC (Serial #4) for its Geophysical Fluid Dynamics Laboratory at Princeton University. Delivery is scheduled for early in 1974. The ASC is configured with a four-pipe central processor, one million words of high-speed central memory, head-per-track disc, text editing terminals, two channels of high density secondary storage devices, and standard magnetic tape and paper devices. This configuration is illustrated in Figure 5. Much experience has been gained using benchmark programs derived from weather models and the actual weather prediction codes themselves. Emphasis has been upon Fortran code generated by analysts and weather scientists instead of hand-optimized machine language. Results obtained from the system while undergoing final checkout at TI's facility showed the speeds available to be several times faster than other current computer systems.

    For weather codes characterized by large data bases that are updated frequently, sequences of heavy computational work using the data, and mathematical operations performed on long arrays of data, the ASC is proving to be a valuable asset. The large central memory enables one to maintain ample data so that the central processor is utilized to a very high degree. The I/O and multiprogramming capabilities managed by the operating system resident in the peripheral processor also support high CP workloads.

    TABLE I-Simple Examples of Vectors

    (1)       DO 10 K=1,50
              DO 10 J=1,50
              DO 10 I=1,50
           10 Z(I,J,K)=X(I,J,K)*Y(I,J,K)

    (2)       Z=X*Y

    (3)       VECTL (#460, B2)    VMF

    TABLE II-Vector Instructions Produced from Weather Code

    (1)       DO 100 K=1,10
              DO 100 I=1,144
              TBXY(I,K)=(T(I+1,K,J)+T(I,K,J))*0.5
              TXY(I,K)=(T(I+1,K,J)-T(I,K,J))*RDX(JC)
              PBXY(I,K)=(PS(I+1,K,J)+PS(I,K,J))*0.5
          100 PXY(I,K)=(PS(I+1,K,J)-PS(I,K,J))*RDX(JC)

    (2)       VECTL (#3B8, B2)    VAF
              VECTL (#3C0, B2)    VMF
              VECTL (#3C8, B2)    VSF
              VECTL (#3D0, B2)    VMF
              VECTL (#3D8, B2)    VAF
              VECTL (#3E0, B2)    VMF
              VECTL (#3E8, B2)    VSF
              VECTL (#3F0, B2)    VMF

    MAXIMIZING PERFORMANCE

    Experience thus far has shown that, for the applications that have been considered by ASC users, the most cost-effective performance is realizable when the capabilities of ASC Fortran and the optimizing compiler are used. Although particular sequences of code can be found wherein hand coding will improve the speed of execution, for the broad range of programs where much applications code is involved, compiler-generated object code is the best choice. American National Standards Institute (ANS) Fortran is completely sufficient, and vector instructions are readily produced from this Fortran. ASC extensions to the Fortran are sometimes found to be useful, not to provide unique access to some hardware feature but to simplify notation involved in writing the program so that the programmer can deal more directly with the mathematics of the application.

    The ASC system design allows easy user access to performance
    enhancement through the use of additional central processor pipes.
    Compiler software is responsible for both the generation of vector
    instructions and the partitioning of these vector operations over
    multiple pipes. Protection of the user from vector hazard conditions
    is carried out by the compiler. Partitioning of scalar instructions
    for multiple pipes is carried out by the CP hardware. Extensive checks
    are made by hardware to protect the user from illegal scalar
    conditions that might occur. For mixtures of vector instructions and
    for mixtures of scalars and vectors, the compiler prevents illegal
    conditions by the use of directive instructions for the CP to operate
    in either parallel mode (FORK) or sequential mode (JOIN). Thus, the
    burden is on the system instead of the user. Programs compiled for
    one-pipe ASC's will execute correctly on multiple-pipe systems.
    Performance will be increased via a recompilation for the
    multiple-pipe machine.
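
    The partitioning of one vector operation over several pipes can be
    pictured with a small model. The Python sketch below is an
    illustration only and is not ASC compiler output: the function names
    `partition` and `vector_add_on_pipes` and the contiguous-slice
    splitting policy are assumptions made for this example. It divides a
    vector add into independent slices, one per pipe, and the result is
    the same whether one pipe or four are modeled.

```python
def partition(length, pipes):
    """Split range(length) into at most `pipes` contiguous (start, stop) slices.

    The contiguous-slice policy is an assumption of this sketch.
    """
    base, extra = divmod(length, pipes)
    bounds = []
    start = 0
    for p in range(pipes):
        # The first `extra` slices take one additional element each.
        stop = start + base + (1 if p < extra else 0)
        if stop > start:
            bounds.append((start, stop))
        start = stop
    return bounds


def vector_add_on_pipes(x, y, pipes):
    """Model a vector add whose elements are divided among `pipes` pipes."""
    z = [0.0] * len(x)
    for start, stop in partition(len(x), pipes):
        # Slices touch disjoint elements, so they could proceed side by side.
        for i in range(start, stop):
            z[i] = x[i] + y[i]
    return z
```

    Calling `vector_add_on_pipes(x, y, 1)` and
    `vector_add_on_pipes(x, y, 4)` on the same operands returns identical
    lists; only the grouping of the work differs.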

    Some typical examples of efficient code produced from present
    applications will illustrate the optimization level provided by the
    system. Table I shows the type of instruction generated by the
    compiler from a typical triple-nested DO LOOP. Item (1) gives the
    Fortran source with three levels of indexing, (2) is an alternate
    notation that could be used, and (3) is the single vector instruction
    produced.


    396 National Computer Conference, 1974

    TABLE III-ASC Maximum Performance Rate

                    ASC 1X (ONE AU)              ASC 4X (FOUR AU'S)
                    32-BIT        64-BIT         32-BIT        64-BIT
                    RESULTS/SEC   RESULTS/SEC    RESULTS/SEC   RESULTS/SEC

    ADD             16 x 10^6     9.2 x 10^6     64 x 10^6     37 x 10^6
    MULTIPLY        16 x 10^6     5.3 x 10^6     64 x 10^6     21 x 10^6
    DOT PRODUCT     16 x 10^6     4.0 x 10^6     64 x 10^6     16 x 10^6

    It is a floating vector multiply instruction preceded by the loading
    of the vector parameter registers. Table II gives some typical code
    found in weather models. A double-nested DO LOOP with typical indexing
    conventions is shown in (1). Item (2) gives the sequence of
    instructions produced by the ASC compiler. All instructions are
    vectors, and the necessary indexing information for addressing
    purposes is contained in each vector parameter file. No scalar
    instructions are necessary in this example.
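
    The way four Fortran statements become eight whole-array operations
    can be sketched in Python. This is a model under stated assumptions,
    not a description of the ASC instruction set: `vaf`, `vsf` and `vmf`
    are hypothetical helpers named after the mnemonics in Table II, the
    sketch handles a single row (fixed K and J) rather than both loop
    levels, and indexing is zero-based.

```python
def vaf(a, b):
    """Elementwise floating add (stand-in named after the VAF mnemonic)."""
    return [x + y for x, y in zip(a, b)]


def vsf(a, b):
    """Elementwise floating subtract (stand-in named after VSF)."""
    return [x - y for x, y in zip(a, b)]


def vmf(a, scalar):
    """Multiply every element by one scalar (stand-in named after VMF)."""
    return [x * scalar for x in a]


def weather_row(t, ps, rdx):
    """Apply the four Table II statements to one row of T and PS.

    `t` and `ps` hold n+1 values; each result has n values. The eight
    calls follow the order add, multiply, subtract, multiply, twice.
    """
    t_hi, t_lo = t[1:], t[:-1]        # T(I+1,...) and T(I,...)
    ps_hi, ps_lo = ps[1:], ps[:-1]    # PS(I+1,...) and PS(I,...)
    tbxy = vmf(vaf(t_hi, t_lo), 0.5)
    txy = vmf(vsf(t_hi, t_lo), rdx)
    pbxy = vmf(vaf(ps_hi, ps_lo), 0.5)
    pxy = vmf(vsf(ps_hi, ps_lo), rdx)
    return tbxy, txy, pbxy, pxy
```

    No per-element loop control appears in `weather_row`; each call
    covers the whole row, which mirrors the absence of scalar
    instructions noted above.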

    A powerful example of vector instruction capabilities is found in the
    use of the hardware-implemented dot-product operation. This operation
    consists of the multiplication of appropriate elements of two arrays
    followed by the sum of the products. To implement a matrix multiply
    operation from Fortran, the ASC compiler uses a single dot-product
    instruction and the complex indexing capability of the hardware to
    carry out the full matrix multiply. Three levels of addressing changes
    are implied in this case, and the hardware is designed to comprehend
    this level of indexing complexity.
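
    The three levels of addressing can be made concrete with a short
    Python model. It is a sketch under assumptions, not the ASC vector
    parameter file: the names `dot_product` and `matmul_flat` are
    hypothetical, and the matrices are stored row-major in flat lists for
    readability, whereas Fortran stores arrays column-major. The inner
    dot product steps through both operands at fixed strides, and two
    outer index levels move the starting addresses.

```python
def dot_product(mem_a, start_a, stride_a, mem_b, start_b, stride_b, length):
    """Sum of products of `length` elements taken at fixed strides."""
    total = 0.0
    for k in range(length):
        total += mem_a[start_a + k * stride_a] * mem_b[start_b + k * stride_b]
    return total


def matmul_flat(a, b, n):
    """Multiply two n-by-n matrices held row-major in flat lists."""
    c = [0.0] * (n * n)
    for i in range(n):        # first level: select row i of A
        for j in range(n):    # second level: select column j of B
            # Third level: row i of A has stride 1, column j of B has stride n.
            c[i * n + j] = dot_product(a, i * n, 1, b, j, n, n)
    return c
```

    For example, `matmul_flat([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2)`
    returns `[19.0, 22.0, 43.0, 50.0]`.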

    The execution rate for the elementary operations of matrix multiply is
    one result per clock cycle for a one-pipe CP, or a rate of four
    results per clock cycle for a four-pipe CP. The compiler partitions
    the total matrix multiply across the appropriate number of pipes.
    Therefore, to complete a matrix multiply of two N by N matrices, a
    four-pipe CP will require approximately (N^3)/4 times the clock
    period in seconds. This does not include the startup overhead
    necessary to fill the pipelines with operands.
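
    The estimate above is simple enough to check numerically. The Python
    sketch below assumes the 60 nanosecond clock cycle quoted with
    Table III and ignores pipeline startup, as the text does; the
    function name `matmul_time_seconds` is an assumption of this example.

```python
def matmul_time_seconds(n, pipes=4, clock_period_s=60e-9):
    """Approximate steady-state time to multiply two n-by-n matrices.

    Counts n**3 elementary multiply-add operations at one result per
    pipe per clock cycle. Pipeline startup overhead is not included.
    """
    return (n ** 3) / pipes * clock_period_s
```

    For N = 100 on four pipes this gives about 0.015 second, and four
    times that on a single pipe.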

    TABLE IV-Relative Computer Capacity* Third Generation Systems

    MFR        MODEL               RELATIVE SPEED

    IBM        S/360 MODEL 65      1
    IBM        S/360 MODEL 75      1.5
    CDC        6500                1.5
    CDC        6600                2.5
    IBM        S/370 MODEL 165     3.5
    IBM        S/360 MODEL 91      5
    HITACHI    HITAC 8800          5
    IBM        S/360 MODEL 95      7
    CDC        7600                8
    IBM        S/360 MODEL 195     8

    * Data taken from Table E, page 546, Program for the Study Conference
    on the Modeling Aspects of GATE, Bulletin of the American
    Meteorological Society, Vol. 54, No. 6, June, 1973.

    It is the authors' opinion that performance indices for array-oriented
    architectures are not meaningful when only the Millions of
    Instructions Per Second (MIPS) factor is used. Since a single vector
    instruction is equivalent to several scalar instructions (typically
    Load, Operation, Increment and Test Branch), and the number of data
    values used determines the number of executions of these scalar
    instructions, MIPS ratings are ambiguous at best.
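
    A short calculation shows how far apart two MIPS figures for the same
    work can be. The Python sketch below is illustrative only: it assumes
    four scalar instructions per data value (one reading of the list
    above), one result per 60 nanosecond clock cycle, and no startup
    cost. The function names are assumptions of this example.

```python
def mips(instruction_count, seconds):
    """Millions of instructions per second for a count and an elapsed time."""
    return instruction_count / seconds / 1e6


def vector_vs_scalar_mips(vector_length, clock_period_s=60e-9,
                          scalar_per_element=4):
    """Return (MIPS counting one vector instruction, MIPS counting scalar equivalents)."""
    # One result per clock cycle; pipeline startup is ignored.
    seconds = vector_length * clock_period_s
    as_vector = mips(1, seconds)
    as_scalar = mips(vector_length * scalar_per_element, seconds)
    return as_vector, as_scalar
```

    For a vector of 144 elements the two ratings differ by a factor of
    576, roughly 0.12 MIPS against 67 MIPS, although the machine does the
    same work in the same time.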

    Consider instead the performance of an ASC measured in results per
    second. In this context, results per second is the rate at which data
    fetched from central memory can be operated upon and the results
    stored back into central memory. Table III shows the maximum
    performance rates for one- and four-pipe ASC systems performing
    typical arithmetic operations. Assumptions are that the clock cycle is
    60 nanoseconds and that the pipelines are already filled with
    operands. Vector dot product is a special case in the sense that the
    results per second rate pertains to the elementary operations.
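
    The 32-bit entries of Table III follow from the stated assumptions.
    The Python sketch below assumes one result per pipe per clock cycle
    with full pipelines; the function name `peak_results_per_second` is
    an assumption of this example. It yields about 16.7 and 66.7 million
    results per second for one and four pipes, close to the 16 and
    64 x 10^6 printed in the table. The lower 64-bit rates are not
    modeled here.

```python
CLOCK_PERIOD_S = 60e-9  # 60 nanosecond clock cycle, as assumed in the text


def peak_results_per_second(pipes, clock_period_s=CLOCK_PERIOD_S):
    """Peak rate at one result per pipe per clock cycle, pipelines full."""
    return pipes / clock_period_s
```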

    Another performance measure can be determined from the present
    performance of ASC System 4 executing a particular weather benchmark.
    Although the benchmark is not a full weather prediction code, it does
    have the characteristic source code sequences and reflects the ability
    of the Fortran compiler to produce efficient code from a large
    applications package. Execution time of the benchmark on the IBM
    Model 91 is approximately 246 minutes, and present ASC timing with
    checkout not finalized has already demonstrated approximately 30
    minutes. This ratio of 8.2 is a measure of the total system
    performance upon this program. It reflects a mix of both scalar and
    vector instructions as well as I/O and other system services. The
    design of the ASC has been directed toward matching the real world mix
    of instructions encountered in typical applications instead of
    sacrificing scalar capability to provide vector capability.

    To compare the observed ASC performance on the weather benchmark with
    other systems, data found in the Bulletin of the American
    Meteorological Society¹ is given in Table IV. Using the IBM S/360
    Model 65 as the basis of reference, each of the systems listed is
    compared as to relative speed. Using the observed ASC/M91 ratio of 8.2
    and the Model 91 rating of 5, the present ASC speed would be 41 in the
    table.
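
    The arithmetic behind the figure of 41 can be written out. The Python
    sketch below uses only numbers quoted in the text and in Table IV
    (246 and 30 minutes, and a Model 91 relative speed of 5); the
    function name `asc_relative_speed` is an assumption of this example.

```python
def asc_relative_speed(m91_minutes=246.0, asc_minutes=30.0, m91_relative=5.0):
    """Return (ASC/M91 speed ratio, ASC speed relative to the Model 65)."""
    # Ratio of elapsed times on the same benchmark.
    ratio = m91_minutes / asc_minutes
    # Scale by the Model 91 rating to express the ASC on the Table IV scale.
    return ratio, ratio * m91_relative
```

    With the default arguments the ratio is 8.2 and the relative speed is
    41, about five times the rating of 8 listed for the CDC 7600 and the
    IBM S/360 Model 195.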

    ACKNOWLEDGMENTS

    It would not be possible to acknowledge all the contributors to the
    development of the ASC; but particular recognition should be given to
    Messrs. H. G. Cragon, W. D. Kastner, E. H. Husband, D. R. Best,
    C. M. Stephenson, C. R. Hall, F. A. Galindo, E. C. Garth, and
    N. M. Chandler, who contributed significantly to the development of
    the hardware. Software concepts are due in large part to the efforts
    of Messrs. L. C. Dean, G. T. Boswell, A. E. Riccomi, F. A. Little,
    W. Winkelman, W. L. Cohagan, and S. D. Nolte. Many other members of
    the Texas Instruments staff have also contributed immeasurably in the
    development of the ASC.

    REFERENCES

    1. Program for the Study Conference on the Modeling Aspects of GATE,
    Bulletin of the American Meteorological Society, Vol. 54, No. 6,
    June 1973, page 546, Table E.
