Synchronization and Communication in the T3E Multiprocessor




Background

• T3E is the second of Cray’s massively scalable multiprocessors (after T3D)

• Both are scalable up to 2048 processing elements

• Shared memory systems, but programmable with message passing (PVM or MPI, regarded as "more portable") or shared memory (HPF)


Challenges

• T3E (and T3D) attempted to overcome the inherent limitations of employing commodity microprocessors in very large multiprocessors

• Memory interface – cache-line-based accesses make single-word references inefficient

• Typical address spaces too small for use in big systems

• Non-cached references are often desirable (e.g. message to other processor)


T3D Strengths (used in T3E)

• External structure in each PE to expand address space

• Shared address space

• 3D torus interconnect

• Pipelined remote memory access with prefetch queue and non-cached stores


T3D: Room for improvement

• Over-engineered dedicated barrier network

• One outstanding cache line fill at a time (low load bandwidth)

• Too many ways to access remote memory

• Low single-node performance

• Unoptimized special hardware features (block transfer engine, DTB Annex, dedicated message queues and registers)


T3E Overview

• Each PE contains Alpha 21164, local memory, and control and routing chips

• Network links time-multiplexed at 5X system frequency

• Self-hosted running Unicos/mk

• No remote caching or board-level caches


E-Registers

• Extend physical address space

• Increase attainable memory pipelining

• Enable high single-word bandwidth

• Provide mechanisms for data distribution, messaging, and atomic memory operations

• In general, they improve on the inefficient individual structures of the T3D


Operations with E-Registers

• Appropriate operands are stored in appropriate E-registers by processor

• Processor then issues another store command to initiate operation

– Address specifies command and source or destination E-register

– Data specifies pointer to already stored operands and remote address index
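The two-store issue sequence above can be sketched as a toy model. This is a minimal simulation, not the real hardware interface: all names (`ERegisterFile`, `CMD_GET`, `CMD_PUT`) and the flat dict standing in for global memory are illustrative assumptions.

```python
# Toy model of the T3E's two-store E-register command sequence.
# All names and encodings here are illustrative; on the real machine
# the command and E-register number are encoded in the address of a
# memory-mapped store, and the store data carries the operand pointer
# and remote address index.

CMD_GET = 0  # read remote memory into an E-register
CMD_PUT = 1  # write an E-register's contents to remote memory

class ERegisterFile:
    def __init__(self, n_regs=512, memory=None):
        self.regs = [0] * n_regs        # E-register contents
        self.memory = memory or {}      # stands in for global memory

    def store_operand(self, ereg, value):
        """Step 1: processor stores operands into E-registers."""
        self.regs[ereg] = value

    def issue(self, command, ereg, addr_index):
        """Step 2: a second store triggers the operation."""
        if command == CMD_GET:
            self.regs[ereg] = self.memory.get(addr_index, 0)
        elif command == CMD_PUT:
            self.memory[addr_index] = self.regs[ereg]

# Example: Get the word at global address 0x40 into E-register 3
ef = ERegisterFile(memory={0x40: 1234})
ef.issue(CMD_GET, ereg=3, addr_index=0x40)
```

The key property the model preserves is that the processor never issues special instructions: both steps are ordinary stores, so commodity-CPU compatibility is maintained.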


Address Translation

• Global virtual addresses and virtual PE numbers formed outside processors

• Centrifuge used for efficient data distribution

• Specifying memory location on data bus enables bigger address space
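The centrifuge separates the bits of an element index into a virtual PE number and a local offset according to a software-supplied mask. A minimal sketch of that bit separation, assuming a simple mask convention (mask bits set to 1 select PE-number bits); the real hardware performs this in a few cycles during address formation:

```python
def centrifuge(index, mask, width=32):
    """Split an element index into (virtual PE, local offset).
    Bits of `index` selected by `mask` are compressed together to
    form the PE number; the remaining bits form the offset within
    that PE's memory. Mirrors the T3E centrifuge's bit separation."""
    pe = offset = 0
    pe_bit = off_bit = 0
    for b in range(width):
        bit = (index >> b) & 1
        if (mask >> b) & 1:
            pe |= bit << pe_bit
            pe_bit += 1
        else:
            offset |= bit << off_bit
            off_bit += 1
    return pe, offset

# Example: low 2 bits select the PE (cyclic distribution over 4 PEs)
pe, off = centrifuge(0b10110, mask=0b00011)  # index 22 -> PE 2, offset 5
```

Different masks yield different data distributions (cyclic, blocked, block-cyclic) without any per-reference software cost.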


Remote Reads/Writes

• All operations done by reading into E-registers (Gets) or writing from E-registers to memory (Puts)

• Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word)

• Large number of E-registers allows significant pipelining of Gets/Puts

– Limited by bus interface (256 B / 26.7 ns)

• High single-word load bandwidth – words can be gathered into contiguous E-registers, then moved into the cache as a block (instead of fetching a full cache line per word)
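The vector form described above can be sketched as a simple strided gather. A toy model only: the dict stands in for global memory, and the fixed count of 8 words matches the slide, but the function name and interface are invented for illustration.

```python
def vector_get(memory, base, stride, nwords=8):
    """Toy model of a T3E vector Get: transfer 8 words at an
    arbitrary stride into 8 consecutive E-registers (here, a list).
    A vector Put would do the mirror-image scatter back to memory."""
    return [memory[base + i * stride] for i in range(nwords)]

# Global memory stand-in where each word holds its own address
mem = {addr: addr for addr in range(64)}

# "Every 3rd word" from the slide: addresses 0, 3, 6, ..., 21
regs = vector_get(mem, base=0, stride=3)
```

Because eight words move per command, the command-issue overhead is amortized, which is what makes strided single-word access competitive with cache-line transfers.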


Atomic Memory Operations

• Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, Masked_Swap

• Can be performed on any memory location

• Performed like any E-register operation

– Operands in E-registers

– Triggered via store, sent over network

– Result sent back and stored in specified E-register
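The semantics of the four AMOs can be sketched as follows. This is purely a functional model: a dict stands in for global memory, and on the real machine each operation executes atomically at the remote memory controller, with the old value returned over the network into the designated E-register.

```python
# Functional semantics of the T3E's four atomic memory operations.
# Each returns the old value, as the hardware returns it to an
# E-register; the dict `mem` is an illustrative stand-in for memory.

def fetch_and_inc(mem, addr):
    old = mem[addr]; mem[addr] = old + 1; return old

def fetch_and_add(mem, addr, value):
    old = mem[addr]; mem[addr] = old + value; return old

def compare_and_swap(mem, addr, expected, new):
    old = mem[addr]
    if old == expected:          # swap only if the comparison holds
        mem[addr] = new
    return old

def masked_swap(mem, addr, mask, new):
    old = mem[addr]
    mem[addr] = (old & ~mask) | (new & mask)  # swap only masked bits
    return old
```

Fetch_&_Inc on a shared counter, for example, hands out unique ticket numbers to any number of PEs without locks.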


Messaging

T3D: Specific queue location of fixed size

T3E: Arbitrary number of queues, mapped to normal memory, of any size up to 128 MB

T3D: All incoming messages generated interrupts, adding significant penalties

T3E: Three options – interrupt, don’t interrupt (detected via polling), and interrupt after threshold number of messages


Messaging Specifics

• Message queues consist of Message Queue Control Words (MQCW)

• Messages assembled into 8 E-registers, SEND issued with address of MQCW

• Message queue is managed in software – avoids OS if polling is used
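The queue mechanics above can be sketched as a small simulation. The field layout and class names here are illustrative assumptions, not the actual MQCW format; the sketch only preserves the structure the slides describe: a hardware-advanced tail, a software-advanced head, and the three interrupt policies.

```python
# Software-managed message queue sketch, loosely modeling the T3E's
# MQCW-based queues. Names and layout are illustrative.

class MessageQueue:
    def __init__(self, size, threshold=None):
        self.buffer = [None] * size     # mapped to normal memory
        self.tail = 0                   # advanced by hardware on SEND
        self.head = 0                   # advanced by software on poll
        self.threshold = threshold      # None = pure polling, no interrupts

    def send(self, message):
        """Model of SEND: deliver an 8-word message at the tail."""
        if self.tail - self.head >= len(self.buffer):
            return False                # queue full, message rejected
        self.buffer[self.tail % len(self.buffer)] = message
        self.tail += 1
        # Interrupt options: never (poll), or only once the backlog
        # of undelivered messages crosses the configured threshold.
        if self.threshold is not None and \
           self.tail - self.head >= self.threshold:
            self.interrupt()
        return True

    def interrupt(self):
        pass                            # would trap to the OS on real hardware

    def poll(self):
        """Software drains the queue without involving the OS."""
        if self.head == self.tail:
            return None
        msg = self.buffer[self.head % len(self.buffer)]
        self.head += 1
        return msg
```

With polling, the entire receive path stays in user space, which is exactly the property the slide highlights.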


Synchronization

• Support for barriers and eurekas (message from one processor to group)

• 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers

• Synchronization packets use a dedicated high-priority virtual channel

– Propagated through a logical tree embedded in the 3D torus interconnect


Synchronization

• Simple barrier operation involves 2 states

– First arms all processors in group (S_ARM)

– Once all are armed, network notifies all of completion and processors return to S_BAR

• Eureka requires 3 states to ensure one eureka is received before the next is issued

– Eureka notification immediately followed by barrier
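The two-state barrier cycle can be modeled as a small state machine. The state names follow the slides; the AND-tree reduction, which the torus network performs in hardware, is modeled here as a simple check over all PEs in the group.

```python
# Toy model of the T3E barrier: each PE's BSU cycles between S_BAR
# (idle) and S_ARM (armed). When every PE in the group has armed,
# the logical AND-tree in the network reports completion and all
# BSUs return to S_BAR.

S_BAR, S_ARM = "S_BAR", "S_ARM"

class BarrierGroup:
    def __init__(self, n_pes):
        self.state = [S_BAR] * n_pes

    def arm(self, pe):
        """PE arrives at the barrier and arms its BSU."""
        self.state[pe] = S_ARM
        if all(s == S_ARM for s in self.state):
            # AND-tree satisfied: release everyone back to idle
            self.state = [S_BAR] * len(self.state)
            return True     # barrier complete
        return False        # still waiting on other PEs
```

The eureka's third state exists because, unlike a barrier, a eureka can fire before all PEs have acknowledged the previous one; trailing each eureka with a barrier prevents a new notification from overwriting an unseen one.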


Performance

Increasing number of E-registers allows greater pipelining and bandwidth (limited by control logic)

Effective bandwidth increases greatly with transfer size, since the fixed per-transfer overhead and startup latency are amortized over more data
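This amortization follows from the standard latency/bandwidth model. The formula is generic and the numbers below are purely illustrative placeholders, not measured T3E parameters.

```python
def effective_bandwidth(nbytes, startup_ns, peak_bytes_per_ns):
    """Latency/bandwidth model: a transfer costs a fixed startup plus
    the data time at the peak rate, so
        BW_eff(n) = n / (t_startup + n / BW_peak)
    Small transfers see a small fraction of peak; large ones approach it."""
    return nbytes / (startup_ns + nbytes / peak_bytes_per_ns)

# Illustrative parameters: 500 ns startup, 0.6 B/ns peak rate.
small = effective_bandwidth(8, 500, 0.6)       # single word: tiny fraction of peak
large = effective_bandwidth(65536, 500, 0.6)   # 64 KB: close to peak
```

This is why the paper's curves flatten out at large transfer sizes: the startup term becomes negligible relative to the data-transfer time.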


Performance

Transfer bandwidth is independent of stride, except for strides that repeatedly hit the same memory bank(s) (e.g. multiples of 4 or 8)

Several million AMOs per second are required before the memory system saturates and latency begins to increase


Performance

Very high message bandwidth is supported without latency increase

Hardware barrier is many times faster than an efficient software barrier (about 15× for 1024 PEs)


Conclusions

• E-registers allow a highly pipelined memory system and provide a common interface for all global memory operations

• Both messages and standard shared memory ops supported

• Fast hardware barrier supported with almost no extra cost

• No remote caching eliminates need for bulky coherence mechanisms and helps allow 2048 PE systems

• Paper provides no means of quantitative comparison to alternative systems