Efficient Hardware Support for the Partitioned Global Address Space
Holger Fröning and Heiner Litz
Computer Architecture Group, University of Heidelberg




Outline

Motivation & Goal
Architecture
Performance Evaluation
Conclusion


Motivation: PGAS

Cache-coherent shared memory does not scale: neither broadcast- nor directory-based cache protocols do (see also AMD's Probe Filter)

Partitioned Global Address Space (PGAS): locally coherent, globally non-coherent
Yelick 2006: "Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing."

Leverage local coherency advantages, avoid global coherency disadvantages


Motivation: Goal

PGAS relies on:
High-bandwidth bulk transfers
Fine-grain accesses for both communication and synchronization purposes

Goal: provide the best support for fine-grain accesses with minimal software overhead



Architecture: Overview

1. Lean Shared Memory Engine (SMFU): address translation, SrcTag management; stateless on the origin side; virtualized
2. Reliable network with in-order delivery (HT requests are expected to be answered)
3. Leverage HyperTransport's latency advantage and direct CPU connectivity
4. Minimal protocol conversion: CPU / HT / on-chip network / network



Architecture: Working Principle

1. LOAD / STORE instruction
2. HT request to SMFU
3. SMFU performs address translation and target node determination
4. Request is sent as an EXTOLL network packet to the target
5. SMFU performs SrcTag translation
6. HT request to target MC
7. MC handles the request
8. HT response to SMFU
9. SMFU re-translates the SrcTag
10. HT response encapsulated in an EXTOLL network packet
11. HT response to CPU
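On the origin side this sequence is transparent to software: assuming the SMFU's exported window is memory-mapped (the pointer and helper names below are illustrative, not part of the SMFU interface), a remote access is just an ordinary load or store, and steps 2 through 11 happen entirely in hardware:

```c
#include <stdint.h>

/* Illustrative origin-side view: 'remote_base' stands for a
 * hypothetical mapping of the SMFU window into the process's
 * address space. 'volatile' keeps the compiler from caching or
 * eliding the accesses, so each one really issues an HT request. */
static inline uint64_t remote_load(volatile uint64_t *remote_base,
                                   uint64_t offset_qwords)
{
    return remote_base[offset_qwords];  /* non-posted HT read  */
}

static inline void remote_store(volatile uint64_t *remote_base,
                                uint64_t offset_qwords, uint64_t v)
{
    remote_base[offset_qwords] = v;     /* posted HT write     */
}
```

The sketch deliberately contains no messaging calls: the point of the design is that remote communication needs no software protocol on the fast path.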


EXTOLL: custom FPGA-based high-performance interconnection network


Architecture: Egress & Ingress Unit


Architecture: Address Translation

gAddr = oAddr − oStartAddr
tNodeID = (gAddr & mask) >> shift_count
tAddr = (gAddr & ~mask) + tStartAddr

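As a sketch, the three translation steps can be written out in C; the configuration values (window bases, node-ID mask, and shift count) are illustrative, not taken from the SMFU implementation:

```c
#include <stdint.h>

/* Illustrative SMFU-style address translation. mask/shift_count
 * select the node-ID field inside the global address; the layout
 * below (1 GB per node, node ID in bits 30..37) is made up. */
typedef struct {
    uint64_t oStartAddr;   /* base of the exported window (origin) */
    uint64_t tStartAddr;   /* base of the target's memory region   */
    uint64_t mask;         /* bits holding the node ID in gAddr    */
    unsigned shift_count;  /* right-shift aligning the node ID     */
} smfu_cfg;

static uint64_t translate(const smfu_cfg *c, uint64_t oAddr,
                          uint64_t *tNodeID)
{
    uint64_t gAddr = oAddr - c->oStartAddr;          /* global addr */
    *tNodeID = (gAddr & c->mask) >> c->shift_count;  /* target node */
    return (gAddr & ~c->mask) + c->tStartAddr;       /* target addr */
}
```

With the example layout (window base 4 GB, mask 0x3FC0000000, shift 30), an origin address 2 GB + 0x40 into the window translates to node 2, target address 0x40.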


Architecture: Matching Store

[Figure: Ingress and egress units with requester, responder, and completer paths. On an HT non-posted request, the responder path assigns a free tSrcTag and stores oSrcTag & oNodeID; on the HT response, the matching store maps the tSrcTag back to oSrcTag & oNodeID.]

Stores origin information (node ID, HT SrcTag); only used on the target side
Ingress Responder Unit: stores oSrcTag and oNodeID; a tSrcTag is returned
Egress Responder Unit: uses the tSrcTag for lookup
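A minimal sketch of the matching store's allocate/lookup behavior; the table size and field widths are assumptions, not taken from the hardware:

```c
#include <stdint.h>

/* Illustrative SrcTag matching store: when a non-posted HT request
 * arrives at the target, a free target-side tag (tSrcTag) is
 * allocated and the origin's SrcTag and node ID are stored; the
 * response path looks them up again by tSrcTag and frees the entry. */
#define NUM_TAGS 32

typedef struct { uint8_t oSrcTag, oNodeID, valid; } entry;
static entry store[NUM_TAGS];

/* Ingress responder: allocate a free tSrcTag, remember origin info. */
int assign_tag(uint8_t oSrcTag, uint8_t oNodeID)
{
    for (int t = 0; t < NUM_TAGS; t++) {
        if (!store[t].valid) {
            store[t] = (entry){ oSrcTag, oNodeID, 1 };
            return t;             /* tSrcTag used toward the MC */
        }
    }
    return -1;                    /* no free tag: request must stall */
}

/* Egress responder: translate back to origin info, free the entry. */
int release_tag(int tSrcTag, uint8_t *oSrcTag, uint8_t *oNodeID)
{
    if (tSrcTag < 0 || tSrcTag >= NUM_TAGS || !store[tSrcTag].valid)
        return -1;
    *oSrcTag = store[tSrcTag].oSrcTag;
    *oNodeID = store[tSrcTag].oNodeID;
    store[tSrcTag].valid = 0;
    return 0;
}
```

The "stateless on origin side" property from the overview follows from this design: all per-request state lives in the target's matching store.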


Architecture: Framework

HT-Core: direct CPU connection; fully synchronous; efficient pipelined structure; incoming/outgoing latency: 12/6 cycles
HTAX: non-blocking crossbar; HT-derived protocol; 3 (+2) cycles latency

[Figure: Stack of HyperTransport IP Core, on-chip network, network switch, and network port.]


Architecture: Framework (cont.)

Network Switch: in-order delivery of packets; hardware retransmission; virtual output queuing; virtual channels; cut-through switching; source-path routing; credit-based flow control; fault tolerance; remote management access
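The credit-based flow control named above can be sketched as follows; the buffer depth is illustrative and the helper names are ours:

```c
/* Minimal credit-based flow-control sketch: the sender starts with
 * one credit per downstream buffer slot, spends a credit for each
 * packet sent, and regains one when the receiver frees a slot and
 * returns the credit. The sender can therefore never overrun the
 * receiver's buffer. */
typedef struct { unsigned credits; } link_tx;

static void link_init(link_tx *l, unsigned buffer_slots)
{
    l->credits = buffer_slots;
}

/* Returns 1 if the packet may be sent (consuming a credit),
 * 0 if the sender must stall until a credit is returned. */
static int try_send(link_tx *l)
{
    if (l->credits == 0)
        return 0;
    l->credits--;
    return 1;
}

/* Called when a credit comes back from the receiver. */
static void credit_return(link_tx *l)
{
    l->credits++;
}
```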



Performance Evaluation

Two nodes, each: 4x AMD Opteron 2.2 GHz quad-core; 16 GB RAM; standard Linux
Virtex-4 FX100 prototype: 156 MHz core clock; HT400 interface (1.6 GB/s); 6 links, each 6.24 Gbps; FPGA utilization 75%


Performance Evaluation: Limitations & Solutions

CPU microarchitecture:
Only one outstanding load transaction on MMIO space → move to DRAM space
Max. size of a load on uncacheable memory is 64 bit → cacheable memory
BIOS:
MMIO space per PCI B:D:F number is max. 256 MB → Extended MMIO (EMMIO), move to DRAM space


Performance Evaluation: Remote Loads

[Figure: Read latency for 4 and 8 byte loads: measurement (HT400, 156 MHz core clock) ≈1.89/1.98 µs; simulation (HT1000, 800 MHz core clock) ≈1.21/1.27 µs.]

[Figure: Read throughput for 4–64 byte loads, up to ~200 MB/s; series: non-posted measurement (max. 1 outstanding), non-posted simulation (max. 1 outstanding), non-posted simulation (max. 32 outstanding).]

Little's Law (1961, queuing theory):
Number outstanding N = 1
Response time R ≈ 2 µs (approx.)
Throughput X = N / R; with 8-byte loads: X · 8 B = 4 MB/s

Full round-trip latency, including main memory access
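The Little's Law arithmetic above can be checked directly; the helper name is ours:

```c
/* Little's Law: with N outstanding requests and response time R,
 * the request rate is X = N / R. Numbers from the slide: N = 1,
 * R = 2 us, 8-byte loads. */
static double throughput_MBps(double n_outstanding,
                              double response_time_s,
                              double bytes_per_request)
{
    double x = n_outstanding / response_time_s;  /* requests per second */
    return x * bytes_per_request / 1e6;          /* MB/s                */
}
```

throughput_MBps(1, 2e-6, 8) evaluates to about 4 MB/s, matching the measured read-throughput ceiling with a single outstanding load; this is why allowing 32 outstanding transactions matters so much in the simulation.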


Performance Evaluation: Remote Stores

At network peak bandwidth for 64 bytes
Outstanding transaction rate

[Figure: Write throughput for 8–128 byte stores (y-axis 0–1,400 MB/s); simulation (HT1000, 800 MHz core clock) vs. measurement (HT400, 156 MHz core clock).]

[Figure: Posted transaction rate [1/s] for 8–4,096 byte stores, on the order of millions of transactions per second.]


Performance Evaluation: Put It into Context!

Comparing SMFU vs. EXTOLL (API); two nodes, 4x 2.2 GHz K10 Opteron

[Figure: Throughput (MB/s, 0–400) and speedup (0–20) over message sizes of 8–8,192 bytes; series: Shared Memory (FPGA-based) remote stores, API over EXTOLL (FPGA-based) NetPIPE stream, MPI over IB (ASIC-based) NetPIPE stream; speedup of SMFU over EXTOLL and of SMFU over IB.]


Conclusion & Future

Conclusion:
Proof of concept of distributed shared memory
Fine-grained remote stores & loads
Efficient and slim design
First performance numbers outstanding and encouraging (taking the technology differences into account)

Future:
Atomic operations
DRAM space (requires coherency)
Consistency supporting strict and relaxed operations
Application-level evaluation
UPC/GASNet
Aggregating memory
…