
Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

Eric Freudenthal and Allan Gottlieb

{freudenthal, gottlieb}@nyu.edu

Talk Summary

• Review Ultracomputer combining networks
  – MIMD architecture expected to provide high performance for hot spot traffic & centralized coordination

• Duplicating & debunking
  – High hot spot latency, slow centralized coordination
  – Why?

• Minor improvements to architecture
  – Significantly reduced hot spot latency
  – Improved coordination performance

2^3 PE computer with omega network

[Figure: Processing Elements, Switches, and Memory Modules; the routing labels on the switch stages are 2^2, 2^1, 2^0]

NUMA Connections

“Dance Hall” (all processors equally distant from all memory modules)

“Boudoir” (processors & memory modules can be co-resident)

Network congestion due to polling of single variable in MM3

• Each PE has a single outstanding reference to the same variable.
  – Low offered load

• These references serialize at MM3
  – Switch queues in “funnel” near MM3 (in red) fill
  – High memory latency results

• If switches could “combine” references to a single variable
  – A single MM operation would satisfy multiple requests
  – Lower network congestion & latency
  – NYU Ultracomputer does this

[Figure: omega network connecting PE0..PE7 to MM0..MM7, with the funnel of switch queues near MM3 shown in red]
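The funnel can be made concrete with a small routing model. This is only a sketch: it assumes standard destination-tag routing on an omega network (a perfect shuffle before each stage, then each switch picks its output from one destination bit, most significant first), which the slides do not spell out. The names `route` and `links_used` are illustrative, not taken from the talk.

```python
def route(src, dst, n_bits):
    """Return the line number occupied after each stage on the path src -> dst."""
    mask = (1 << n_bits) - 1
    pos, path = src, []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the line number left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The switch output is chosen by one destination bit, MSB first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def links_used(dst, n_bits):
    """Count the distinct links that carry traffic to dst after each stage."""
    paths = [route(src, dst, n_bits) for src in range(1 << n_bits)]
    return [len({p[stage] for p in paths}) for stage in range(n_bits)]


if __name__ == "__main__":
    print(route(5, 3, 3))     # path taken by PE5 when it references MM3
    print(links_used(3, 3))   # links shared by all eight PEs polling MM3
```

Under these assumptions, eight PEs polling MM3 share 4, then 2, then 1 links, which is the narrowing funnel in which the switch queues fill.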

Fetch-and-add

FAA(X, e)

• Atomic operation

• Fetches old value of X and adds e to X

• Useful for busy waiting coordination

• Ultracomputer switches combine FAAs

• FAA(X,0) is equivalent to load X
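As a minimal software sketch of these semantics, the class below models a shared variable with an atomic fetch-and-add. The class name `SharedInt` and the use of a `threading.Lock` to stand in for hardware atomicity are assumptions made for illustration; the Ultracomputer performs the operation at the memory module or in the switches.

```python
import threading


class SharedInt:
    """A shared integer supporting an atomic fetch-and-add."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def faa(self, e):
        """Atomically add e and return the value held before the add."""
        with self._guard:
            old = self.value
            self.value = old + e
            return old


if __name__ == "__main__":
    x = SharedInt(5)
    print(x.faa(3), x.value)  # old value, then updated value
    print(x.faa(0), x.value)  # FAA(X,0) behaves as a load of X
```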

Combining of Fetch & Add (and loads)

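[Figure: combining tree. Start: X=0. Requests FAA(X,1), FAA(X,2), FAA(X,4), and FAA(X,8) combine in the switches into FAA(X,3) and FAA(X,12), which combine into FAA(X,15). The “wait buffer” of the last switch records “12L”: lower port first, its addend = 12. Memory returns X:0. End: X=15.]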

Semantics equivalent to some serialization.
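The example above can be replayed in a few lines. This is a sketch under assumptions: a combined request is modelled as a `(first, second)` pair, the wait buffer entry is the total addend of `first`, and the order chosen inside the two first-stage switches is arbitrary (the slide only fixes "lower port first" at the last switch). The function names are illustrative.

```python
def addend(req):
    """Total addend of a request: an int leaf or a (first, second) pair."""
    if isinstance(req, int):
        return req
    return addend(req[0]) + addend(req[1])


def decombine(req, old):
    """Split a reply carrying `old` back into per-request (addend, reply) pairs."""
    if isinstance(req, int):
        return [(req, old)]
    first, second = req
    # The wait buffer holds addend(first): the request ordered second
    # sees the old value plus everything the first request added.
    return decombine(first, old) + decombine(second, old + addend(first))


def memory_op(req, x):
    """One memory operation serves the whole combined request."""
    return decombine(req, x), x + addend(req)


if __name__ == "__main__":
    # Lower port, FAA(X,12) = 4 + 8, is ordered before FAA(X,3) = 1 + 2.
    replies, x_end = memory_op(((4, 8), (1, 2)), 0)
    print(replies, x_end)
```

The replies match the serialization FAA(X,4), FAA(X,8), FAA(X,1), FAA(X,2), even though memory performed a single FAA(X,15).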

Coordination with fetch-and-add

Spin-locks:

  Shared int L = 1

  lock():
    while (faa(L,-1) < 1) {   // lock was already held
      faa(L,+1)               // undo the decrement
      while (L < 1) ;         // poll until the lock appears free
    }

  unlock():
    faa(L,+1)
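A runnable sketch of the spin-lock follows, using Python threads. The `SharedInt` class, the `threading.Lock` that stands in for hardware atomicity, the `time.sleep(0)` yield inside the polling loop, and the `run_workers` harness are all assumptions added for illustration; only `lock()` and `unlock()` follow the slide.

```python
import threading
import time


class SharedInt:
    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def faa(self, e):
        with self._guard:
            old = self.value
            self.value = old + e
            return old


def lock(L):
    while L.faa(-1) < 1:      # old value < 1: the lock was already held
        L.faa(+1)             # undo our decrement
        while L.value < 1:    # poll with plain loads until it looks free
            time.sleep(0)     # yield so other Python threads can run


def unlock(L):
    L.faa(+1)


def run_workers(n_threads=4, n_iters=25):
    """Increment an unprotected counter under the lock from several threads."""
    L = SharedInt(1)
    counter = [0]

    def worker():
        for _ in range(n_iters):
            lock(L)
            counter[0] += 1   # critical section
            unlock(L)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], L.value


if __name__ == "__main__":
    print(run_workers())
```

Every failed attempt undoes its own decrement, so L returns to 1 once all threads have finished.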

Readers and Writers:

  constant int p = max readers
  Shared int C = p            // p resources

  Reader() {                  // take 1 instance
    while (faa(C,-1) < 1) {
      faa(C,+1)
      while (C < 1) ;
    }
    read()
    faa(C,+1)
  }

  Writer() {                  // take all p instances
    while (faa(C,-p) < p) {
      faa(C,+p)
      while (C < p) ;
    }
    write()
    faa(C,+p)
  }
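The counter arithmetic can be checked with a sequential model. The sketch below replaces the spinning loops with hypothetical non-blocking `try_` variants so the outcome is deterministic; `SharedInt`, `P = 3`, and the helper names are assumptions for illustration and not part of the slide.

```python
class SharedInt:
    """Sequential stand-in for a shared variable with fetch-and-add."""

    def __init__(self, value):
        self.value = value

    def faa(self, e):
        old = self.value
        self.value += e
        return old


P = 3  # hypothetical maximum number of concurrent readers


def try_acquire(C, units):
    """Claim `units` instances; undo the claim if too few were available."""
    if C.faa(-units) < units:
        C.faa(+units)
        return False
    return True


def release(C, units):
    C.faa(+units)


def try_read_lock(C):
    return try_acquire(C, 1)   # a reader takes one instance


def try_write_lock(C):
    return try_acquire(C, P)   # a writer takes all P instances


if __name__ == "__main__":
    C = SharedInt(P)
    print([try_read_lock(C) for _ in range(P + 1)])  # last reader is refused
    print(try_write_lock(C))                         # readers still hold C
```

An uncontended reader or writer touches C once to acquire and once to release, which is why multiple readers proceed without serializing on anything but the combinable FAA.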

Characteristics of FAA Centralized Coordination Algorithms

• Many FAA coordination algorithms reference a small number of shared variables.

• Spin-locks and r/w locks reference one

• An uncontended spin- or r/w-lock generates one shared access
  – Including multiple readers in the absence of writers

• FAA barrier and queue algorithms have similar characteristics

Combining Queue Design

Background: Guibas & Liang Systolic FIFO

Ultracomputer Combining Queue

[Figures: systolic FIFO cells with “in” and “out” rows; combining queue with “in”, “out”, and “chute” rows]

No associative memory required

Summary of Baseline Ultracomputer

• Architecture reasonable and motivated
  – Switches not prohibitively expensive
  – Serialization-free coordination algorithms

• Queues in switches permit high bandwidth
  – Low latency for random & mixed hot spot traffic

• NYU simulations (surprisingly) did not include 100% hot spot traffic
  – (Lee, Kruskal, and Kuck did, but with different flow control)
  – In fact combining helpful, but not as good as expected
  – Queues near hot memory fill; others nearly empty
    • Non-trivial queuing delays
  – Combining only in full queues
    • Low message “multiplicity”

Rest of this talk

• Debunking: High latency despite Ultra3 flow control
  – Algorithms that minimize hot spot traffic outperform centralized.

• Deconstructing: Understanding of high latency
  – Reduced combining due to wait buffer exhaustion
  – Queuing delays in network; reduced queue capacity helps

• Debugging: Improvements to combining switches
  – Larger wait buffer needed
  – Adaptive reduction of queue capacity when combining occurs

• Duplication: Centralized algorithms competitive
  – Much superior for concurrent-access locks

Ultra III “baseline” switches
Memory Latency, one request / PE

[Plot: memory latency with one request per PE; one curve is labelled “100%, no combining”]

Two “Fixes” to Ultra III Switch Design

• Problem: Full wait buffers reduce combining
  – “Sufficient” waitbuf capacity → 45% latency reduction

• Problem: Congestion in “combining funnel”
  – Shortened queues → backpressure
    • Lower per-stage queuing delays
    • More non-empty queues
  – More combining, hence higher message “multiplicity”
    • Reduces latency another 30%
    • FAA algs now competitive

What is the “Best” queue length?

• Problem
  – Non-hot spot latency benefits from large queues
  – Hot-spot latency benefits from small queues

• Solution
  – Detect switches engaged in combining
    • Multiple combined messages awaiting transmission
  – Adaptively reduce capacity of these switches
    • Other switches unaffected

• Results
  – Reduced polling latency, good non-poll latency
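One way to picture the adaptive policy is a queue whose admission limit depends on how many combined messages it currently holds. The sketch below is an assumption-laden illustration, not the Ultra III switch logic: the class name, the capacities (8 and 2), and the threshold of two combined messages are all hypothetical values.

```python
from collections import deque


class AdaptiveQueue:
    """Switch output queue that shrinks while it is engaged in combining."""

    def __init__(self, full_capacity=8, reduced_capacity=2, combine_threshold=2):
        self.full_capacity = full_capacity
        self.reduced_capacity = reduced_capacity
        self.combine_threshold = combine_threshold
        self.items = deque()  # entries are (message, is_combined)

    def capacity(self):
        """Reduced capacity once several combined messages await transmission."""
        combined = sum(1 for _, is_combined in self.items if is_combined)
        if combined >= self.combine_threshold:
            return self.reduced_capacity
        return self.full_capacity

    def can_accept(self):
        # False exerts backpressure on the upstream switch.
        return len(self.items) < self.capacity()

    def enqueue(self, message, is_combined=False):
        if not self.can_accept():
            return False
        self.items.append((message, is_combined))
        return True

    def dequeue(self):
        return self.items.popleft()[0] if self.items else None


if __name__ == "__main__":
    q = AdaptiveQueue()
    for name, combined in [("a", False), ("b", True), ("c", True), ("d", False)]:
        print(name, q.enqueue(name, combined))
```

A queue carrying only uncombined traffic keeps its full capacity, so switches outside the funnel behave as before.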

Memory latency, 1024 PE Systems
Over a range of accepted load

• Baseline Ultra III switch
  – Limited wait buffer
  – Fixed queue size

• Waitbuf100
  – Baseline
  – Sufficient wait buffer

• Improved
  – Waitbuf100
  – Adaptive queue length

• Aggressive
  – Improved
  – Combines from both ports & on first slice
  – Potential clock rate reduction

[Plots: memory latency over accepted load; panels: 100% hot, 20% hot, Uniform]

Mellor-Crummey & Scott (MCS):
Local-spin coordination

• No hot spot polling
  – Each PE spins on distinct shared var in co-located MM
  – Other parts of algorithm may generate hot spot traffic

• Serialization-free barriers
  – Barrier satisfaction “disseminated” without generating hot spot traffic
  – Each processor has log2(N) rendezvous

• Locks: Global state in hot spot variables
  – Heads of linked lists (blocked requestors)
  – Count of readers
  – Hot spot accesses benefit from combining
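The log2(N) rendezvous count can be illustrated with a dissemination pattern. The sketch assumes that in round k processor i signals processor (i + 2^k) mod N; the slide says only that barrier satisfaction is "disseminated", so this pattern and the function name are assumptions. The model tracks which arrivals each processor has learned of rather than performing real synchronization.

```python
def dissemination_rounds(n):
    """Simulate which arrivals each of n processors knows of after each round."""
    rounds = (n - 1).bit_length()       # ceil(log2(n)) for n >= 1
    known = [{i} for i in range(n)]     # each processor knows its own arrival
    for k in range(rounds):
        step = 1 << k
        # Processor i hears from (i - step) mod n and inherits its knowledge.
        known = [known[i] | known[(i - step) % n] for i in range(n)]
    return rounds, known


if __name__ == "__main__":
    rounds, known = dissemination_rounds(8)
    print(rounds, all(k == set(range(8)) for k in known))
```

Each processor spins only on flags written by its own partners, so no single variable is polled by everyone.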

Synchronization: Barriers
MCS also serialization-free

• IntenseLoop:
  – barrier

• RealisticLoop:
  – Ref 15 or 30 shared vars
  – barrier

[Plots: barrier performance; an arrow marks the “Better” direction]

Reader-Writer Experiment

• Loop:
    Determine if reader or writer
    “Sleep” for 100 cycles
    Lock
    Reference 10 shared variables
    Unlock

• Reader-writer mix
  – All reader, all writer
  – 1 expected writer
    • P(writer) = 1/N

• Plots on next slides
  – Rate at which reader and writer locks are granted (unit = rate/kc)
  – Greater values indicate greater progress

All Readers / All Writers

• All Readers
  – Combining helps MCS
  – Serialization-free (FAA algorithm) faster

• All Writers
  – Essentially a highly contended semaphore
  – Only aggressive competes

[Plots: lock grant rates; an arrow marks the “Better” direction]

1 Expected Writer

• Reader performance
  – FAA faster
  – MCS benefits from combining

• Writer performance
  – FAA generally faster
  – MCS benefits from combining

[Plots: reader and writer lock grant rates; an arrow marks the “Better” direction]

Conclusions

• “Improved” architecture superior
  – Large wait buffers decrease hot spot latency
  – Adaptive queue capacity decreases latency
    • General technique?

• Performance of FAA algorithms
  – R/W competitive with MCS
    • Much superior when readers dominate
    • Require combining.
  – Barrier near MCS
    • Faster with aggressive design

Relevance & Future Work

• Large shared memory systems are manufactured
  – Combining not restricted to omega network
    • Return messages must be routed to combine sites
  – Combining demonstrated as useful for inter-process coordination.

• Application of adaptive queue capacity modulation to other domains
  – Such as responding to flash-flood & DoS traffic

• Analytic model of queuing delays for hot spot combining under development

Difficulties with aggressive (2-input, coupled) queues

Single input queues simpler
• Dual input combining queue built from two single-input combining queues
• Messages from different ports ineligible for combining

Decoupled ALUs
• Idea: remove ALU from transmission path
• Shorter clock intervals
  – max(transmission, ALU)
• Head item cannot combine
  – Combining less likely
  – ≥ 3 enqueued messages

[Figure: queue datapaths showing ALU and mux placement]

• Questions?
