Architecture and Design of AlphaServer GS320 Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren ASPLOS’2000 Presented By: Alok Garg



Motivations

• Coherence Protocol
– Bandwidth limitations of snoopy-based protocol
– Inefficiencies in directory protocol
– Correctness issues related to rare protocol races

• Implementation of Consistency models
– Burdens the common transaction flow

Paper Contributions

• Exploiting network ordering to simplify cache coherence protocol

• Solutions to decrease network occupancy

• Elegant solution for deadlock, livelock, starvation, and fairness problems

• Techniques for efficiently supporting memory ordering

Overview

• Architecture Overview

• MOESI Cache Coherence Protocol

• GS320 optimized Cache Coherence Protocol

• Alpha Consistency Model

• Consistency Model Implementation

• Performance

Architecture Overview

Block Diagram

[Figure: eight QBBs connected through an 8x8 global crossbar switch; link label: 1.6 GB/s]

QBB – Quad-Processor Building Block

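Quad-Processor Building Block (QBB)

[Figure: QBB built around a 10-port local crossbar switch; link labels: 1.6 GB/s, 3.2 GB/s]

• 4 processors (P), each with an L2 cache
• SDRAM memory: 8 GB, 64-bit, 200 MHz
• 64-entry cache
• I/O: 4 PCI buses, 64-bit, 33 MHz
• Global Port
• DTAG: Duplicate Tag Store
• DIR: Directory
• TTT: Transactions In Transit Buffer
• Arbitration Point
• 32 Alpha 21264 processors in total (8 QBBs x 4 processors)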

The Directory

• 14 bits per 64-byte memory line
– 6-bit owner field (Owner = 0 in the example)
– 8 sharing bits, S0 to S7

[Figure labels: Forward, Invalidate, QBB0 DTAG, QBB3, P0 P1 P2 P3]

Crossbar Switch

Network bisection bandwidth:
– Global Switch (8x8): 12.8 GB/s
– Local Switch (10-port): 6.4 GB/s
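The 14-bit directory entry can be sketched as a small bit-packing helper. This is only an illustration: the slide gives the field widths (a 6-bit owner field plus the sharing bits S0 to S7), while the bit positions and the names `pack_entry` and `unpack_entry` are assumptions made for this example, not the hardware layout.

```python
OWNER_BITS = 6    # width of the owner field (from the slide)
SHARER_BITS = 8   # sharing bits S0..S7 (from the slide)


def pack_entry(owner, sharers):
    """Pack an owner id and a set of sharer indices into one 14-bit value.

    Assumed layout (not from the source): owner in the high 6 bits,
    sharing bit i stored at bit position i.
    """
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner does not fit in 6 bits")
    mask = 0
    for s in sharers:
        if not 0 <= s < SHARER_BITS:
            raise ValueError("sharer index must be 0..7")
        mask |= 1 << s
    return (owner << SHARER_BITS) | mask


def unpack_entry(entry):
    """Return (owner, sorted list of sharer indices)."""
    owner = entry >> SHARER_BITS
    sharers = [s for s in range(SHARER_BITS) if entry & (1 << s)]
    return owner, sharers


entry = pack_entry(owner=0, sharers={0, 3})
print(unpack_entry(entry))  # (0, [0, 3])
```

At 14 bits per 64-byte (512-bit) line, the directory costs 14/512 of the memory it tracks, or roughly 2.7%.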

MOESI Cache Coherence Protocol

MOESI - Directory

• States
– Invalid (I): No valid copy
– Shared (S): Valid, (potentially) shared, clean
– Exclusive (E): Valid, exclusive, clean
– Modified (M): Valid, exclusive, (potentially) dirty
– Owner (O): Valid, (potentially) shared, dirty; responsible for supplying data instead of memory (potentially)

• Request Messages
– Read (Rd): Data needed in shared state (S/E)
– Read Exclusive (RE): Data needed in Modified state (M)
– Exclusive (Ex): Modified state (M) needed for a line the requester already holds in shared state

• Home node – Original owner of data (directory)
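The five states can be summarized as a small lookup table. The sketch below follows the standard MOESI definitions, in which the Owner state is shared and dirty with respect to memory; the attribute names and the helpers `can_write` and `supplies_data` are hypothetical names chosen for this example.

```python
from enum import Enum


class State(Enum):
    # value = (valid, may_be_shared, may_be_dirty)
    I = (False, False, False)
    S = (True, True, False)
    E = (True, False, False)
    M = (True, False, True)
    O = (True, True, True)

    @property
    def valid(self):
        return self.value[0]

    @property
    def may_be_shared(self):
        return self.value[1]

    @property
    def may_be_dirty(self):
        return self.value[2]


def can_write(state):
    """A cache may write without a new request only if it holds the sole copy."""
    return state.valid and not state.may_be_shared


def supplies_data(state):
    """Assumption for this sketch: a cache with a dirty copy answers forwards."""
    return state.may_be_dirty


for st in State:
    print(st.name, can_write(st), supplies_data(st))
```

Under these definitions a cache in O can supply data to other readers but must still issue an Exclusive request before writing, because other copies may exist.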

MOESI Read

[Figure: nodes H/D, N1/I, N2/M, N3/I, N4/I, N5/I. Messages shown: Rd, Forward, Marker, Reply. State changes shown: N2 M to O, N5 I to S.]

MOESI Read-Exclusive

[Figure: nodes H/D, N1/I, N2/O, N3/I, N4/I, N5/S. Messages shown: RE, Forward, Marker, Invalidate, Ack, Reply. State changes shown: N5 S to I, N2 O to I, N3 I to E.]
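The Rd and RE flows from the two diagrams can be replayed with a toy state table. This is a minimal sketch, not the protocol implementation: the node names and state changes come from the slides, while the message directions (Marker from home to requester, Ack from sharer to requester) and the function names are assumptions made for this example.

```python
def read(states, owner, requester):
    """Rd: the home forwards the request to the owner, which replies with data."""
    log = [
        ("Rd", requester, "H/D"),
        ("Forward", "H/D", owner),
        ("Marker", "H/D", requester),   # direction assumed
        ("Reply", owner, requester),
    ]
    if states[owner] == "M":
        states[owner] = "O"             # owner keeps the line and keeps supplying it
    states[requester] = "S"
    return log


def read_exclusive(states, owner, requester):
    """RE: the owner hands over the line and every other sharer is invalidated."""
    log = [
        ("RE", requester, "H/D"),
        ("Forward", "H/D", owner),
        ("Marker", "H/D", requester),   # direction assumed
    ]
    for node, st in list(states.items()):
        if node not in (owner, requester) and st == "S":
            log.append(("Invalidate", "H/D", node))
            log.append(("Ack", node, requester))  # Ack destination assumed
            states[node] = "I"
    log.append(("Reply", owner, requester))
    states[owner] = "I"
    states[requester] = "E"             # end state as drawn on the slide
    return log


states = {"N1": "I", "N2": "M", "N3": "I", "N4": "I", "N5": "I"}
read(states, owner="N2", requester="N5")
print(states)   # N2 is now O, N5 is now S
for msg in read_exclusive(states, owner="N2", requester="N3"):
    print(msg)
print(states)   # N3 is now E, N2 and N5 are I
```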

GS320 Optimized Cache Coherence Protocol

• Dirty Sharing

• No negative acknowledgment
– 3 Deadlock Conditions due to races

Late Request Race Condition

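[Figure: nodes H/D, N1/I, N2/M, N3/I, N4/I, N5/I. N5 issues a Rd, which the home forwards to N2. N2 has meanwhile written the line back (N2/X). Messages shown: Rd, Forward, Marker, Write Back, Ack, Reply. The Reply is drawn from N2's write buffer and N5 ends in S.]

DEADLOCK?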

Early Request Race Condition

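[Figure: nodes H/D, N1/I, N2/I, N3/M, N4/I, N5/I. N2 issues an RE, which the home forwards to N3; N5 issues a Rd, which the home forwards to N2. Messages shown: RE, Forward, Rd, Marker, Forward, Reply, Reply, Marker. State changes shown: N3 M to I, N2 I to M to O, N5 I to S.]

DEADLOCK?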

Crossbar Network

Q0 Queue: Requests to Home Node (point-to-point order)

Q1 Queue: Forwards, Replies and Invalidations from Home Node (global order)

Q2 Queue: Data Replies from Owner to Requester Node

Total Ordering on Q1!!

[Figure: P1 caches A in state O and P2 caches B in state O; each has a Q1 inbound queue. Home nodes HA and HB each have a Q1 outbound queue into the crossbar switch. Directory labels, in order of appearance: A (P1) B (P2); RE1(B) RE2(A); A (P2) B (P1); RE4(A) RE3(B); A (X) B (Y). Forwards queued: P1 – RE2(A), P2 – RE1(B), P1 – RE3(B), P2 – RE4(A).]

DEADLOCK?
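The three-queue split and the total order on Q1 can be sketched as follows. This is a toy model under stated assumptions: the lane rules restate the three queue definitions on the slide, while `lane`, `OrderedSwitch`, and the use of a single sequence counter as the serialization point are hypothetical constructs for illustration only.

```python
from collections import deque
from itertools import count


def lane(kind, sender_is_home):
    """Pick the queue for a message, following the three definitions above."""
    if kind in ("Rd", "RE", "Ex"):
        return "Q0"                      # requests to the home node
    if sender_is_home and kind in ("Forward", "Reply", "Invalidate"):
        return "Q1"                      # traffic originating at the home node
    if kind == "Reply":
        return "Q2"                      # data reply from an owner
    raise ValueError(f"no lane defined for {kind}")


class OrderedSwitch:
    """Toy switch: every Q1 message is stamped at one serialization point."""

    def __init__(self, nodes):
        self._stamp = count()
        self.inbound = {n: deque() for n in nodes}

    def send_q1(self, msg, dests):
        stamp = next(self._stamp)        # one global sequence number per message
        for d in dests:
            self.inbound[d].append((stamp, msg))


def stamps_increasing(switch):
    """True if every inbound queue holds its messages in global stamp order."""
    return all(
        [s for s, _ in q] == sorted(s for s, _ in q)
        for q in switch.inbound.values()
    )


sw = OrderedSwitch(["P1", "P2"])
sw.send_q1("Forward RE1(B)", ["P2"])
sw.send_q1("Invalidate A", ["P1", "P2"])
sw.send_q1("Forward RE2(A)", ["P1"])
print(lane("Reply", sender_is_home=False), stamps_increasing(sw))
```

Because each queue preserves stamp order, any two nodes that both receive the same pair of Q1 messages see them in the same relative order, which is the property the protocol relies on.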

Desirable Characteristics

• Dirty sharing - efficient for migratory accesses

• All directory state changes take effect immediately; each transaction needs only a single access to the home node and directory

• Eliminate livelock, starvation, and fairness problems

• Writes can start as soon as Exclusive request is issued

Alpha Consistency Model

MB: Memory Barrier

[Figure: sequences of LOAD and STORE operations arranged in program order, starting from the oldest memory operation, illustrating ordering around an MB.]

Atomicity is not violated: Read others write early

Consistency Model Implementation

• Barrier Performance (Commit Event)
– Early acknowledge of Invalidates
– Early acknowledge of Forwards (of Exclusive, Read Exclusive and Read Requests)

• Overall Performance
– Relax total order condition on Q1 at commit points. Let replies (Q1->Q2) bypass forwards (Q1) and invalidations (Q1)

Early Acknowledgement of Invalidation Request

[Figure: P1 (cache: A = 0, B = 0) and P2 (cache: A = 0), each with a Q1 inbound queue on the crossbar switch.]

SC example: P1 executes A = 1; B = 1; and P2 executes u = B; v = A;

Question: can P2 observe u = 1, v = 0?

Messages shown: EX, INVAL A, Rd, Marker, Commit, INV Ack. Annotations: "Not a Race Condition", "Races", "MB", "Commit pt".

1. Optimize the memory barrier at P1 for write-to-write/read ordering
2. Commit events travel in the Q1 queue for ordering purposes in case of replies
3. Sufficient condition: commit events must not bypass invalidates
4. A memory barrier at P2 waits for all the commits before going ahead
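Why u = 1, v = 0 is the outcome to rule out can be checked by brute force. The sketch below enumerates every interleaving of the two programs under sequential consistency (each processor's operations stay in program order, and memory is a single shared copy). The function name `sc_outcomes` and the tuple encoding of operations are choices made for this example.

```python
from itertools import combinations

P1 = [("store", "A", 1), ("store", "B", 1)]
P2 = [("load", "B", "u"), ("load", "A", "v")]


def sc_outcomes(p1, p2):
    """Return every (u, v) pair reachable under sequential consistency."""
    outcomes = set()
    n = len(p1) + len(p2)
    # Choose which global slots belong to P1; the rest belong to P2.
    for slots in combinations(range(n), len(p1)):
        i1, i2 = iter(p1), iter(p2)
        order = [next(i1) if i in slots else next(i2) for i in range(n)]
        mem = {"A": 0, "B": 0}
        regs = {}
        for op, addr, x in order:
            if op == "store":
                mem[addr] = x
            else:
                regs[x] = mem[addr]
        outcomes.add((regs["u"], regs["v"]))
    return outcomes


print(sorted(sc_outcomes(P1, P2)))  # [(0, 0), (0, 1), (1, 1)]
```

The pair (1, 0) never appears: if P2 sees the new B, the earlier store to A must already be visible. Any early-acknowledgement scheme therefore has to keep that guarantee.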

Commit Points

[Figure: the system block diagram (eight QBBs on the 8x8 global crossbar switch, with DTAG, DIR and TTT) with the commit point marked.]

Early Acknowledge of Forwards

[Figure: P1 (cache: A = 0, B = 0) and P2 (cache: A = 0), each with a Q1 inbound queue on the crossbar switch, with commit points marked.]

P1 executes A = 1; MB; B = 1; and P2 executes u = B; MB; v = A;

Question: can P2 observe u = 1, v = 0?

Messages shown: Read B, Commit/RdB, Fwd Ack, INVAL A, Commit/INV A; one message is marked "bypass".

Sufficient condition: commit events must not bypass invalidates, reads and read-exclusive forwards
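The sufficient condition can be illustrated with a toy model of P2's Q1 inbound queue. This is a sketch under stated assumptions, not the hardware behavior: it assumes P1's stores have already reached memory, that INVAL A sits ahead of the commit event for P2's read of B in the queue, and that the data for B may arrive early. The function `run_p2` and its flag are hypothetical.

```python
from collections import deque


def run_p2(commit_may_bypass):
    """Run u = B; MB; v = A on P2 and return (u, v)."""
    memory = {"A": 1, "B": 1}   # assumed: P1's stores are already in memory
    cache = {"A": 0}            # P2 still holds the old copy of A
    q1_inbound = deque([("INVAL", "A"), ("COMMIT", "B")])

    # The data reply for B is allowed to bypass the queue contents.
    u = memory["B"]

    if commit_may_bypass:
        q1_inbound.rotate(1)    # the COMMIT jumps ahead of the INVAL

    # MB: stall until the commit event for the outstanding read is seen,
    # applying every invalidate that is ahead of it in the queue.
    while q1_inbound:
        kind, addr = q1_inbound.popleft()
        if kind == "INVAL":
            cache.pop(addr, None)
        elif kind == "COMMIT":
            break

    # v = A hits the cache if the stale copy survived, else reads memory.
    v = cache["A"] if "A" in cache else memory["A"]
    return u, v


print(run_p2(commit_may_bypass=False))  # (1, 1)
print(run_p2(commit_may_bypass=True))   # (1, 0)
```

When the commit event stays behind the invalidate, the MB forces the stale copy of A out before the second load runs. Letting the commit event bypass the invalidate produces the forbidden u = 1, v = 0.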

Optimization Summary

• Dirty Sharing – Reduces home node traffic

• No negative acknowledgements
– Reduces network traffic (Home Node)
– Simple implementation of directory
– Removes livelock, starvation, and fairness problems
– Network total ordering avoids deadlocks
– Write optimization

• Bypass of replies in Q1 queue
– Improves overall performance

• Improved barrier performance
– Early invalidation acknowledgements
– Early Forward responses (Rd, RE, EX)
– Memory ordering based on commit events

Performance

DOUBTS?