Post on 22-Dec-2015
Architecture and Design of AlphaServer GS320
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren
ASPLOS’2000
Presented By: Alok Garg
Motivations
• Coherence Protocol
– Bandwidth limitations of snoopy-based protocols
– Inefficiencies in directory protocols
– Correctness issues related to rare protocol races
• Implementation of Consistency Models
– Burdens the common transaction flow
Paper Contributions
• Exploiting network ordering to simplify cache coherence protocol
• Solutions to decrease network occupancy
• Elegant solution for deadlock, livelock, starvation, and fairness problems
• Techniques for efficiently supporting memory ordering
Overview
• Architecture Overview
• MOESI Cache Coherence Protocol
• GS320 optimized Cache Coherence Protocol
• Alpha Consistency Model
• Consistency Model Implementation
• Performance
Block Diagram
[Figure: an 8x8 global crossbar switch connecting eight QBBs (QBB – Quad-Processor Building Block); link label: 1.6 GB/s.]
Quad-Processor Building Block (QBB)
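[Figure: one QBB built around a 10-port local crossbar switch; link labels: 1.6 GB/s and 3.2 GB/s.]
• Four processors (P), each with an L2 cache; up to 32 Alpha 21264 processors in the system
• SDRAM memory: 8 GB, 64-bit, 200 MHz
• I/O: 4 PCI buses (64-bit, 33 MHz), with a 64-entry cache
• Global Port
• DTAG: Duplicate Tag Store
• DIR: Directory
• TTT: Transactions In Transit Buffer
• Arbitration Point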
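• 14 bits per 64-byte memory line
• 6-bit owner field (for example, Owner = 0)
• 8 sharer bits: S0 S1 S2 S3 S4 S5 S6 S7
[Figure: QBB0 and QBB3, each with a DTAG and processors P0 to P3; messages shown: Forward, Invalidate, Invalidate.]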
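The slide gives 14 bits per 64-byte line, a 6-bit field, and sharer bits S0 to S7. A minimal Python sketch of one way such an entry could be packed follows. Treating the 6-bit field as the owner id, placing it in the high bits, assigning one sharer bit per QBB, and the names `pack_entry` and `unpack_entry` are all assumptions made for illustration, not the hardware encoding.

```python
OWNER_BITS = 6    # 6-bit field from the slide, read here as the owner id
SHARER_BITS = 8   # S0..S7, assumed to be one bit per QBB


def pack_entry(owner, sharers):
    """Pack an owner id and a collection of sharer QBB numbers into 14 bits."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner does not fit in 6 bits")
    vector = 0
    for qbb in sharers:
        if not 0 <= qbb < SHARER_BITS:
            raise ValueError("QBB number out of range")
        vector |= 1 << qbb
    return (owner << SHARER_BITS) | vector


def unpack_entry(entry):
    """Return (owner, sorted list of sharer QBB numbers)."""
    owner = entry >> SHARER_BITS
    vector = entry & ((1 << SHARER_BITS) - 1)
    sharers = [q for q in range(SHARER_BITS) if vector & (1 << q)]
    return owner, sharers


entry = pack_entry(owner=0, sharers={0, 3})
print(f"{entry:014b}", unpack_entry(entry))
```

Keeping sharers at QBB granularity means one set bit can stand for up to four processors, which is why an invalidation sent to a QBB may need to be fanned out locally.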
Crossbar Switch
Network bisection bandwidth:
• Global Switch (8x8): 12.8 GB/s
• Local Switch (10-port): 6.4 GB/s
MOESI - Directory
• States
– Invalid (I)
– Shared (S) : Valid, (potentially) shared, clean
– Exclusive (E) : Valid, exclusive, clean
– Modified (M) : Valid, exclusive, (potentially) dirty
– Owner (O) : Valid, (potentially) shared, (potentially) dirty; (potentially) responsible for supplying data instead of memory
• Request Messages
– Read (Rd) : Data needed in shared state (S/E)
– Read Exclusive (RE) : Data needed in Modified state (M)
– Exclusive (Ex) : Data needed in Modified state (M)
• Home node – Original owner of data (directory)
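The five states can be tabulated by three properties: valid, possibly shared, possibly dirty. The sketch below uses the textbook MOESI definitions, in which an Owner line may be dirty and its holder supplies the data. The `State` enum and the `supplies_data` property are hypothetical names for illustration, not part of the GS320 design.

```python
from enum import Enum


class State(Enum):
    # Each value is (valid, may_be_shared, may_be_dirty).
    I = (False, False, False)
    S = (True, True, False)
    E = (True, False, False)
    M = (True, False, True)
    O = (True, True, True)

    @property
    def valid(self):
        return self.value[0]

    @property
    def may_be_shared(self):
        return self.value[1]

    @property
    def may_be_dirty(self):
        return self.value[2]

    @property
    def supplies_data(self):
        # Assumption: a cache holding possibly-dirty data answers instead of memory.
        return self.may_be_dirty


for state in State:
    print(state.name, state.valid, state.may_be_shared, state.may_be_dirty)
```

The only difference between M and O in this table is the sharing bit, which is what lets a dirty line be shared without first writing it back to memory.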
MOESI Read-Exclusive
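[Figure: read-exclusive (RE) transaction. Initial states: H/D, N1/I, N2/O, N3/I, N4/I, N5/S. Messages shown: RE, Forward, Marker, Invalidate, Ack, Reply. Final states shown: N5/I, N2/I, N3/E.]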
GS320 Optimized Cache Coherence Protocol
• Dirty Sharing
• No negative acknowledgment
– 3 deadlock conditions due to races
Late Request Race Condition
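[Figure: late request race. Initial states: H/D, N1/I, N2/M, N3/I, N4/I, N5/I. Messages shown: Rd, Forward, Marker, Write Back, Ack, Reply. Other labels: N2/X, Write Buffer. Final state shown: N5/S.]
DEADLOCK?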
Early Request Race Condition
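[Figure: early request race. Initial states: H/D, N1/I, N2/I, N3/M, N4/I, N5/I. Messages shown: RE, Forward, Rd, Marker, Forward, Reply, Reply, Marker. States shown along the way: N3/I, N2/M, N2/O, N5/S.]
DEADLOCK?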
Crossbar Network
• Q0 Queue: Requests to Home Node (point-to-point order)
• Q1 Queue: Forwards, Replies, and Invalidations from Home Node (global order)
• Q2 Queue: Data Replies from Owner to Requester Node
Total Ordering on Q1!!
[Figure: P1 (cache holds A in state O) and P2 (cache holds B in state O), each with a Q1 inbound queue; home nodes HA and HB, each with a Q1 outbound queue; all connected through the crossbar switch. Ownership labels progress from A (P1), B (P2) to A (P2), B (P1) to A (X), B (Y). Requests shown: RE1(B), RE2(A), RE3(B), RE4(A). Queue contents shown: P1 – RE2(A), P2 – RE1(B), P1 – RE3(B), P2 – RE4(A).]
DEADLOCK?
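One way to picture a total order on Q1 is a switch that stamps every Q1 message at a single serialization point and delivers it to each destination in stamp order. The sketch below is a toy model under that assumption. `OrderedSwitch`, `same_relative_order`, and the message strings are hypothetical, and the real crossbar is not claimed to work this way internally.

```python
from collections import deque


class OrderedSwitch:
    """Toy switch: one serialization point stamps every Q1 message."""

    def __init__(self, nodes):
        self.inbound = {node: deque() for node in nodes}
        self.next_seq = 0

    def send(self, msg, destinations):
        # Stamp once, then append to every destination's inbound queue.
        seq = self.next_seq
        self.next_seq += 1
        for node in destinations:
            self.inbound[node].append((seq, msg))
        return seq


def same_relative_order(switch, a, b):
    """True if nodes a and b see the messages they share in the same order."""
    in_a = [seq for seq, _ in switch.inbound[a]]
    in_b = [seq for seq, _ in switch.inbound[b]]
    common = set(in_a) & set(in_b)
    return [s for s in in_a if s in common] == [s for s in in_b if s in common]


switch = OrderedSwitch(["P1", "P2"])
switch.send("Forward RE1(B)", ["P2"])
switch.send("Invalidate A", ["P1", "P2"])
switch.send("Forward RE2(A)", ["P1"])
switch.send("Invalidate B", ["P1", "P2"])
print(same_relative_order(switch, "P1", "P2"))
```

Because every node's inbound queue is a subsequence of the same global sequence, two nodes can never disagree about which of two Q1 messages came first, which is the property the protocol relies on to resolve races without negative acknowledgments.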
Desirable Characteristics
• Dirty sharing - efficient for migratory accesses
• All directory state changes take effect immediately; a transaction needs just a single access to the home node and directory
• Eliminate livelock, starvation, and fairness problems
• Writes can start as soon as Exclusive request is issued
Alpha Consistency Model
MB: Memory Barrier
[Figure: columns of LOAD and STORE operations arranged in program order, beginning at the oldest memory operation.]
Atomicity is not violated: Read others write early
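The role of MB can be sketched as a check on whether an execution order is a legal reordering of a program. The rule used below is a deliberate simplification and an assumption of this sketch: operations may be reordered freely unless an MB lies between them or they touch the same address. The function names and the tuple encoding of operations are hypothetical.

```python
def must_stay_ordered(first, second):
    """Assumed rule: order is kept across an MB and between same-address accesses."""
    if first[0] == "MB" or second[0] == "MB":
        return True
    return first[1] == second[1]


def is_legal_order(program, executed):
    """program: ops in program order; executed: program indices in execution order."""
    if sorted(executed) != list(range(len(program))):
        raise ValueError("executed must be a permutation of the program indices")
    position = {op_index: when for when, op_index in enumerate(executed)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            if must_stay_ordered(program[i], program[j]) and position[i] > position[j]:
                return False
    return True


program = [("STORE", "A"), ("MB", None), ("STORE", "B"), ("LOAD", "C")]
# LOAD C moves ahead of STORE B; both are after the MB.
print(is_legal_order(program, [0, 1, 3, 2]))
# STORE B moves ahead of the MB that precedes it in program order.
print(is_legal_order(program, [2, 0, 1, 3]))
```

Because each operation is ordered against the MB itself, operations on opposite sides of a barrier stay ordered transitively without a separate rule.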
Consistency Model Implementation
• Barrier Performance (Commit Event)
– Early acknowledgment of invalidates
– Early acknowledgment of forwards (of Exclusive, Read-Exclusive, and Read requests)
• Overall Performance
– Relax the total-order condition on Q1 at commit points
– Let replies (Q1->Q2) bypass forwards (Q1) and invalidations (Q1)
Early Acknowledgement of Invalidation Request
[Figure: P1 (cache: A = 0, B = 0) and P2 (cache: A = 0), each with a Q1 inbound queue, connected through the crossbar switch. Under SC, P1 runs A = 1; B = 1; and P2 runs u = B; v = A;. Question posed: can the outcome be u = 1, v = 0? Messages shown: EX, INVAL A, Rd, Marker, Commit, INV Ack. Annotations: commit points, MB, "Not a Race Condition", "Races".]
1. Optimize the memory barrier at P1 for write-to-write/read ordering
2. Commit events are placed in the Q1 queue for ordering purposes in the case of replies
3. Sufficient condition: commit events must not bypass invalidates
4. The memory barrier at P2 waits for all commits before proceeding
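The condition that commit events must not bypass invalidates can be illustrated with a toy model of P2's inbound Q1 queue. In the sketch below, `run_p2`, the message tuples, and the `latest` dictionary standing in for the rest of the memory system are all hypothetical simplifications, and a cache miss is assumed to return the newest value.

```python
from collections import deque


def run_p2(allow_bypass):
    """Return (u, v) for P2 executing u = B; v = A."""
    cache = {"A": 0}             # P2 starts with the old copy of A
    latest = {"A": 1, "B": 1}    # values after P1's two writes
    q1 = deque([("INVAL", "A"), ("COMMIT", "B")])  # P2's inbound Q1 queue

    if allow_bypass:
        # The commit for the read of B overtakes INVAL A, which stays queued.
        to_apply = [msg for msg in q1 if msg[0] == "COMMIT"]
    else:
        # FIFO delivery: everything ahead of the commit is applied first.
        to_apply = list(q1)

    for kind, addr in to_apply:
        if kind == "INVAL":
            cache.pop(addr, None)        # drop the stale copy
        else:
            cache[addr] = latest[addr]   # the read of B completes with new data

    u = cache["B"]
    v = cache["A"] if "A" in cache else latest["A"]  # a miss fetches the new A
    return u, v


print(run_p2(allow_bypass=False))
print(run_p2(allow_bypass=True))
```

With in-order delivery the stale copy of A is gone before P2 can observe B = 1, so the forbidden outcome u = 1, v = 0 appears only in the bypass case.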
Early Acknowledge of Forwards
[Figure: P1 (cache: A = 0, B = 0) and P2 (cache: A = 0), each with a Q1 inbound queue, connected through the crossbar switch. P1 runs A = 1; MB; B = 1; and P2 runs u = B; MB; v = A;. Question posed: can the outcome be u = 1, v = 0? Messages shown: Read B, Commit/Rd B, Fwd Ack, INVAL A, Commit/INV A. Annotations: commit points, bypass.]
Sufficient condition: commit events must not bypass invalidates, reads, or read-exclusive forwards
Optimization Summary
• Dirty Sharing
– Reduces home node traffic
• No negative acknowledgements
– Reduces network traffic (Home Node)
– Simple implementation of directory
– Removes livelock, starvation, and fairness problems
– Network total ordering avoids deadlocks
– Write optimization
• Bypass of replies in Q1 queue
– Improves overall performance
• Improves barrier performance
– Early invalidation acknowledgements
– Early forward responses (Rd, RE, EX)
– Memory ordering based on commit events