
Page 1: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

1

Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

Eric Freudenthal and Allan Gottlieb

{freudenthal, gottlieb}@nyu.edu

Page 2: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

2

Talk Summary

• Review Ultracomputer combining networks
  – MIMD architecture expected to provide high performance for hot spot traffic & centralized coordination
• Duplicating & debunking
  – High hot spot latency, slow centralized coordination
  – Why?
• Minor improvements to architecture
  – Significantly reduced hot spot latency
  – Improved coordination performance

Page 3: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

3

[Diagram: a 2^3 = 8 PE computer with an omega network. Processing Elements PE0–PE7 are connected through three stages of 2×2 switches to Memory Modules MM0–MM7; routing at the three stages is governed by destination-address bits 2^2, 2^1, 2^0 (see the sketch below).]

NUMA Connections:
• "Dance Hall": all processors equally distant from all memory modules
• "Boudoir": processors & memory modules can be co-resident
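As a small illustration of the destination-tag routing implied by the 2^2 2^1 2^0 labels above, here is a minimal C sketch; the port-numbering convention (0 = upper, 1 = lower) and the helper name are assumptions for illustration, not taken from the slides.

#include <stdio.h>

/* Destination-tag routing in a log2(N)-stage omega network: at stage s,
   each 2x2 switch examines one bit of the destination MM number,
   starting with the most significant bit (the 2^2 bit for 8 MMs). */
int output_port(int dest_mm, int stage, int stages)
{
    return (dest_mm >> (stages - 1 - stage)) & 1;   /* assumed: 0 = upper, 1 = lower */
}

int main(void)
{
    const int stages = 3;                           /* 2^3 = 8 PEs and 8 MMs */
    for (int s = 0; s < stages; s++)
        printf("stage %d -> port %d\n", s, output_port(3 /* MM3 */, s, stages));
    return 0;
}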

Page 4: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

4

Network congestion due to polling of single variable in MM3

• Each PE has a single outstanding reference to the same variable
  – Low offered load
• These references serialize at MM3
  – Switch queues in the "funnel" near MM3 (shown in red) fill
  – High memory latency results
• If switches could "combine" references to a single variable
  – A single MM operation would satisfy multiple requests
  – Lower network congestion & latency
  – The NYU Ultracomputer does this

[Diagram: the 8 PE omega network of the previous slide, with the funnel of switch queues leading to MM3 highlighted.]

Page 5: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

5

Fetch-and-add

FAA(X, e)

• Atomic operation

• Fetches old value of X and adds e to X

• Useful for busy waiting coordination

• Ultracomputer switches combine FAAs

• FAA(X,0) is equivalent to load X
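A minimal C11 sketch of the semantics listed above; atomic_fetch_add returns the old value and adds e as one atomic operation (on the Ultracomputer the operation is performed at the MM, or in combined form by the switches).

#include <stdatomic.h>

atomic_int X = 0;

/* FAA(X, e): fetch the old value of *loc and add e, atomically. */
int faa(atomic_int *loc, int e)
{
    return atomic_fetch_add(loc, e);
}

/* faa(&X, 0) returns the current value of X without changing it,
   i.e. it behaves like a load of X. */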

Page 6: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

6

Combining of Fetch & Add (and loads)

[Diagram: worked example of combining, starting with X = 0.]
• Four requests enter the network: FAA(X,1), FAA(X,2), FAA(X,4), FAA(X,8).
• First-stage switches combine pairs: FAA(X,1) and FAA(X,2) become FAA(X,3); FAA(X,4) and FAA(X,8) become FAA(X,12). Each combining switch records decombining information in its "wait buffer".
• A second-stage switch combines FAA(X,3) and FAA(X,12) into FAA(X,15), noting "lower port first, its addend = 12".
• The MM performs the single operation FAA(X,15): it returns 0 and leaves X = 15.
• The returned value is decombined on the way back: the 12-request receives 0, the 3-request receives 0 + 12 = 12, and the four original requests receive 0, 4, 12, and 13.
• End: X = 15.

Semantics equivalent to some serialization.
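A small C sketch of the combine/decombine bookkeeping illustrated above; the struct and function names are hypothetical, and the real switch performs this in hardware.

#include <stdio.h>

struct faa_req    { int addend; };        /* one FAA request to variable X   */
struct wait_entry { int first_addend; };  /* wait-buffer record: addend of
                                             the request serialized first    */

/* Combine two requests into one; remember how to split the reply. */
struct faa_req combine(struct faa_req first, struct faa_req second,
                       struct wait_entry *wb)
{
    wb->first_addend = first.addend;
    return (struct faa_req){ first.addend + second.addend };
}

/* Decombine: the memory returned old value v for the merged request. */
void decombine(int v, struct wait_entry wb, int *reply_first, int *reply_second)
{
    *reply_first  = v;                   /* serialized first: sees v          */
    *reply_second = v + wb.first_addend; /* serialized second: sees v + first */
}

int main(void)
{
    struct wait_entry wb;
    struct faa_req merged = combine((struct faa_req){12}, (struct faa_req){3}, &wb);
    int r12, r3;
    decombine(0, wb, &r12, &r3);         /* MM returned 0; X is now 15        */
    printf("FAA(X,%d) at the MM; replies: %d and %d\n", merged.addend, r12, r3);
    return 0;
}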

Page 7: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

7

Coordination with fetch-and-add

Spin-locks:
  Shared int L = 1
  lock():   while (faa(L,-1) < 1) {
                faa(L,+1)
                while (L < 1) ;     // spin with loads (FAA(L,0)) before retrying
            }
  unlock(): faa(L,+1)

Readers and Writers:
  constant int p = max readers
  Shared int C = p                  // p resources
  Reader():                         // take 1 instance
      while (faa(C,-1) < 1) {
          faa(C,+1)
          while (C < 1) ;
      }
      read()
      faa(C,+1)
  Writer():                         // take all p instances
      while (faa(C,-p) < p) {
          faa(C,+p)
          while (C < p) ;
      }
      write()
      faa(C,+p)
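For reference, a compilable C11 rendering of the spin-lock above, with atomic_fetch_add standing in for the hardware FAA (illustrative only; the readers/writers code follows the same pattern with -p/+p).

#include <stdatomic.h>

static atomic_int L = 1;                         /* 1 = unlocked */

static void lock(void)
{
    while (atomic_fetch_add(&L, -1) < 1) {       /* try to take the lock       */
        atomic_fetch_add(&L, +1);                /* failed: undo the decrement */
        while (atomic_load(&L) < 1)              /* spin with loads, FAA(L,0)  */
            ;
    }
}

static void unlock(void)
{
    atomic_fetch_add(&L, +1);
}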

Page 8: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

8

Characteristics of FAA Centralized Coordination Algorithms

• Many FAA coordination algorithms reference a small number of shared variables
• Spin-locks and r/w locks reference one
• An uncontended spin- or r/w-lock acquisition generates one shared access
  – Including multiple readers in the absence of writers
• FAA barrier and queue algorithms have similar characteristics
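As an example of the barrier algorithms referred to above, here is a minimal sense-reversing counter barrier built on fetch-and-add (a generic textbook formulation, not necessarily the exact Ultracomputer algorithm). Each arrival is one FAA on a single shared counter, i.e. a combinable hot spot reference.

#include <stdatomic.h>
#include <stdbool.h>

struct faa_barrier {
    atomic_int  count;      /* arrivals so far in this episode   */
    atomic_bool sense;      /* flips each time the barrier opens */
    int         n;          /* number of participants            */
};

void barrier_wait(struct faa_barrier *b, bool *local_sense)
{
    *local_sense = !*local_sense;                     /* this episode's sense */
    if (atomic_fetch_add(&b->count, 1) == b->n - 1) { /* last one to arrive   */
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, *local_sense);        /* release everyone     */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                         /* spin on the flag     */
    }
}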

Page 9: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

9

Combining Queue Design

Background: Guibas & Liang systolic FIFO (a chain of cells, each passing messages from its "in" port to its "out" port)

Ultracomputer Combining Queue:
[Diagram: the same systolic chain, with each cell augmented by a "chute" and an ALU, used to combine matching messages.]

No associative memory required.

Page 10: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

10

Summary of Baseline Ultracomputer

• Architecture reasonable and well motivated
  – Switches not prohibitively expensive
  – Serialization-free coordination algorithms
• Queues in switches permit high bandwidth
  – Low latency for random & mixed hot spot traffic
• NYU simulations (surprisingly) did not include 100% hot spot traffic
  – (Lee, Kruskal & Kuck did, but with different flow control)
  – In fact combining is helpful, but not as good as expected
  – Queues near the hot memory fill; others stay nearly empty
• Non-trivial queuing delays
  – Combining occurs only in full queues
• Low message "multiplicity"

Page 11: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

11

Rest of this talk
• Debunking: high latency despite Ultra3 flow control
  – Algorithms that minimize hot spot traffic outperform centralized ones
• Deconstructing: understanding the high latency
  – Reduced combining due to wait buffer exhaustion
  – Queuing delays in the network – reduced queue capacity helps
• Debugging: improvements to combining switches
  – Larger wait buffer needed
  – Adaptive reduction of queue capacity when combining occurs
• Duplication: centralized algorithms competitive
  – Much superior for concurrent-access locks

Page 12: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

12

Ultra III "baseline" switches: memory latency, one request / PE

[Plot: memory latency vs. accepted load for hot spot fractions of 0-10%, 20%, 40%, and 100%, with reference curves for the ideal case and for 100% hot spot traffic without combining; annotations of ~2x and ~4x mark the latency gaps above ideal.]

Page 13: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

13

Two “Fixes” to Ultra III Switch Design

• Problem: full wait buffers reduce combining
  – "Sufficient" wait buffer capacity → 45% latency reduction
• Problem: congestion in the "combining funnel"
  – Shortened queues → backpressure
    • Lower per-stage queuing delays
    • More non-empty queues → more combining, hence higher message "multiplicity"
  – Reduces latency another 30%; FAA algorithms now competitive

Page 14: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

14

What is the "best" queue length?

• Problem
  – Non-hot spot latency benefits from large queues
  – Hot spot latency benefits from small queues
• Solution (sketched below)
  – Detect switches engaged in combining
    • Multiple combined messages awaiting transmission
  – Adaptively reduce the capacity of these switches
    • Other switches unaffected
• Results
  – Reduced polling latency, good non-polling latency
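A schematic C sketch of the adaptive-capacity rule just described; the constants, field names, and the "more than one waiting combined message" threshold are illustrative assumptions, not values from the paper.

/* Hypothetical model of one switch output queue. */
struct out_queue {
    int occupancy;          /* messages currently enqueued             */
    int combined_waiting;   /* combined messages awaiting transmission */
};

enum { FULL_CAPACITY = 8, REDUCED_CAPACITY = 2 };   /* illustrative values */

/* A queue that is actively combining advertises a smaller capacity,
   creating backpressure; other queues keep their full capacity. */
int effective_capacity(const struct out_queue *q)
{
    return (q->combined_waiting > 1) ? REDUCED_CAPACITY : FULL_CAPACITY;
}

/* The upstream switch may forward a message only if this returns nonzero. */
int can_accept(const struct out_queue *q)
{
    return q->occupancy < effective_capacity(q);
}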

Page 15: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

15

Memory latency, 1024 PE systems, over a range of accepted load

• Baseline Ultra III switch
  – Limited wait buffer
  – Fixed queue size
• Waitbuf100
  – Baseline, plus sufficient wait buffer
• Improved
  – Waitbuf100, plus adaptive queue length
• Aggressive
  – Improved, plus combining from both ports & on the first slice
  – Potential clock rate reduction

[Plots: latency vs. accepted load for 100% hot spot, 20% hot spot, and uniform traffic.]

Page 16: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

16

Mellor-Crummey & Scott (MCS): local-spin coordination

• No hot spot polling
  – Each PE spins on a distinct shared variable in a co-located MM
  – Other parts of the algorithm may generate hot spot traffic
• Serialization-free barriers
  – Barrier satisfaction is "disseminated" without generating hot spot traffic
  – Each processor has log2(N) rendezvous
• Locks: global state in hot spot variables (see the sketch below)
  – Heads of linked lists (of blocked requestors)
  – Count of readers
  – Hot spot accesses benefit from combining
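For reference, a simplified C11 sketch of the MCS list-based queue lock mentioned above (local spinning on each thread's own node; adapted from the published algorithm, with details abridged).

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool                locked;
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;            /* global tail pointer (hot spot) */

void mcs_acquire(mcs_lock *L, mcs_node *me)
{
    atomic_store(&me->next, NULL);
    mcs_node *pred = atomic_exchange(L, me);     /* append self to the queue   */
    if (pred != NULL) {
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))         /* spin on our own node only  */
            ;
    }
}

void mcs_release(mcs_lock *L, mcs_node *me)
{
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(L, &expected, NULL))
            return;                              /* no successor: lock is free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                    /* wait for successor to link */
    }
    atomic_store(&succ->locked, false);          /* hand the lock to successor */
}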

Page 17: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

17

Synchronization: Barriers (MCS barriers are also serialization-free)

• Intense loop:
  – barrier
• Realistic loop:
  – Reference 15 or 30 shared variables
  – barrier

[Plot: barrier performance comparison; the arrow marks the "better" direction.]

Page 18: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

18

Reader-Writer Experiment

• Loop:
  – Determine whether reader or writer
  – "Sleep" for 100 cycles
  – Lock
  – Reference 10 shared variables
  – Unlock
• Reader-writer mix
  – All readers, all writers
  – 1 expected writer
    • P(writer) = 1/N
• Plots on the next slides
  – Rate at which reader and writer locks are granted (unit = rate/kc)
  – Greater values indicate greater progress

Page 19: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

19

All Readers / All Writers

• All Readers
  – Combining helps MCS
  – Serialization-free (FAA) algorithm is faster
• All Writers
  – Essentially a highly contended semaphore
  – Only the aggressive design competes

[Plot: lock-grant rates; the arrow marks the "better" direction.]

Page 20: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

20

1 Expected Writer

• Reader performance
  – FAA faster
  – MCS benefits from combining
• Writer performance
  – FAA generally faster
  – MCS benefits from combining

[Plot: lock-grant rates; the arrow marks the "better" direction.]

Page 21: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

21

Conclusions

• "Improved" architecture superior
  – Large wait buffers decrease hot spot latency
  – Adaptive queue capacity decreases latency
    • A general technique?
• Performance of FAA algorithms
  – R/W competitive with MCS
    • Much superior when readers dominate
    • Requires combining
  – Barrier near MCS
    • Faster with the aggressive design

Page 22: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

22

Relevance & Future Work

• Large shared memory systems are manufactured
  – Combining is not restricted to omega networks
    • Return messages must be routed to the combining sites
  – Combining demonstrated as useful for inter-process coordination
• Application of adaptive queue capacity modulation to other domains
  – Such as responding to flash-flood & DoS traffic
• Analytic model of queuing delays for hot spot combining under development

Page 23: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

23

Difficulties with aggressive (2-input, coupled) queues

Single-input queues are simpler:
• Dual-input combining queue built from two single-input combining queues
• Messages from different ports are ineligible for combining

Decoupled ALUs:
• Idea: remove the ALU from the transmission path
• Shorter clock intervals
  – Cycle time: max(transmission, ALU)
• Head item cannot combine
  – Combining less likely
  – ≥ 3 enqueued messages needed

[Diagram: coupled vs. decoupled ALU arrangements (ALUs and mux).]

Page 24: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

24

END

• Questions?