Efficient Hardware Support for the Partitioned Global Address Space
Holger Fröning and Heiner Litz
Computer Architecture Group, University of Heidelberg




Outline

Motivation & Goal
Architecture
Performance Evaluation
Conclusion


Motivation: PGAS

Cache-coherent shared memory does not scale: neither broadcast- nor directory-based cache protocols do (see also AMD's Probe Filter)

Partitioned Global Address Space (PGAS): locally coherent, globally non-coherent
Yelick 2006: "Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing."

Leverage local coherency advantages, avoid global coherency disadvantages


Motivation: Goal

PGAS relies on:
High-bandwidth bulk transfers
Fine-grain accesses for both communication and synchronization purposes

Goal: provide the best support for fine-grain accesses with minimal software overhead



Architecture: Overview

1. Lean Shared Memory Engine (SMFU): address translation, SrcTag management; stateless on the origin side; virtualized
2. Reliable network with in-order delivery (HT requests are expected to be answered)
3. Leverage HyperTransport's latency advantage and direct CPU connectivity
4. Minimal protocol conversion: CPU / HT / on-chip network / network



Architecture: Working Principle

1. LOAD / STORE instruction
2. HT request to SMFU
3. SMFU performs address translation and target node determination
4. Request is sent as an EXTOLL network packet to the target
5. SMFU performs SrcTag translation
6. HT request to target MC
7. MC handles the request
8. HT response to SMFU
9. SMFU re-translates the SrcTag
10. HT response encapsulated in an EXTOLL network packet
11. HT response to CPU
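On the origin side this sequence is transparent to software: assuming the SMFU's exported window is memory-mapped (the pointer and helper names below are illustrative, not part of the SMFU interface), a remote access is just an ordinary load or store, and steps 2 through 11 happen entirely in hardware:

```c
#include <stdint.h>

/* Illustrative origin-side view: 'remote_base' stands for a
 * hypothetical mapping of the SMFU window into the process's
 * address space. 'volatile' keeps the compiler from caching or
 * eliding the accesses, so each one really issues an HT request. */
static inline uint64_t remote_load(volatile uint64_t *remote_base,
                                   uint64_t offset_qwords)
{
    return remote_base[offset_qwords];  /* non-posted HT read  */
}

static inline void remote_store(volatile uint64_t *remote_base,
                                uint64_t offset_qwords, uint64_t v)
{
    remote_base[offset_qwords] = v;     /* posted HT write     */
}
```

The sketch deliberately contains no messaging calls: the point of the design is that remote communication needs no software protocol on the fast path.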


EXTOLL: custom FPGA-based high-performance interconnection network


Architecture: Egress & Ingress Unit


Architecture: Address Translation

gAddr = oAddr − oStartAddr
tNodeID = (gAddr & mask) >> shift_count
tAddr = (gAddr & ~mask) + tStartAddr

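As a sketch, the three translation steps can be written out in C; the configuration values (window bases, node-ID mask, and shift count) are illustrative, not taken from the SMFU implementation:

```c
#include <stdint.h>

/* Illustrative SMFU-style address translation. mask/shift_count
 * select the node-ID field inside the global address; the layout
 * below (1 GB per node, node ID in bits 30..37) is made up. */
typedef struct {
    uint64_t oStartAddr;   /* base of the exported window (origin) */
    uint64_t tStartAddr;   /* base of the target's memory region   */
    uint64_t mask;         /* bits holding the node ID in gAddr    */
    unsigned shift_count;  /* right-shift aligning the node ID     */
} smfu_cfg;

static uint64_t translate(const smfu_cfg *c, uint64_t oAddr,
                          uint64_t *tNodeID)
{
    uint64_t gAddr = oAddr - c->oStartAddr;          /* global addr */
    *tNodeID = (gAddr & c->mask) >> c->shift_count;  /* target node */
    return (gAddr & ~c->mask) + c->tStartAddr;       /* target addr */
}
```

With the example layout (window base 4 GB, mask 0x3FC0000000, shift 30), an origin address 2 GB + 0x40 into the window translates to node 2, target address 0x40.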


Architecture: Matching Store

[Figure: Ingress and egress units with requester, responder, and completer paths. On an HT non-posted request, the responder path assigns a free tSrcTag and stores oSrcTag & oNodeID; on the HT response, the matching store maps the tSrcTag back to oSrcTag & oNodeID.]

Stores origin information (node ID, HT SrcTag); only used on the target side
Ingress Responder Unit: stores oSrcTag and oNodeID; a tSrcTag is returned
Egress Responder Unit: uses the tSrcTag for lookup
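A minimal sketch of the matching store's allocate/lookup behavior; the table size and field widths are assumptions, not taken from the hardware:

```c
#include <stdint.h>

/* Illustrative SrcTag matching store: when a non-posted HT request
 * arrives at the target, a free target-side tag (tSrcTag) is
 * allocated and the origin's SrcTag and node ID are stored; the
 * response path looks them up again by tSrcTag and frees the entry. */
#define NUM_TAGS 32

typedef struct { uint8_t oSrcTag, oNodeID, valid; } entry;
static entry store[NUM_TAGS];

/* Ingress responder: allocate a free tSrcTag, remember origin info. */
int assign_tag(uint8_t oSrcTag, uint8_t oNodeID)
{
    for (int t = 0; t < NUM_TAGS; t++) {
        if (!store[t].valid) {
            store[t] = (entry){ oSrcTag, oNodeID, 1 };
            return t;             /* tSrcTag used toward the MC */
        }
    }
    return -1;                    /* no free tag: request must stall */
}

/* Egress responder: translate back to origin info, free the entry. */
int release_tag(int tSrcTag, uint8_t *oSrcTag, uint8_t *oNodeID)
{
    if (tSrcTag < 0 || tSrcTag >= NUM_TAGS || !store[tSrcTag].valid)
        return -1;
    *oSrcTag = store[tSrcTag].oSrcTag;
    *oNodeID = store[tSrcTag].oNodeID;
    store[tSrcTag].valid = 0;
    return 0;
}
```

The "stateless on origin side" property from the overview follows from this design: all per-request state lives in the target's matching store.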


Architecture: Framework

HT-Core: direct CPU connection; fully synchronous; efficient pipelined structure; incoming/outgoing latency: 12/6 cycles
HTAX: non-blocking crossbar; HT-derived protocol; 3 (+2) cycles latency

[Figure: Stack of HyperTransport IP Core, on-chip network, network switch, and network port.]


Architecture: Framework (cont.)

Network Switch: in-order delivery of packets; hardware retransmission; virtual output queuing; virtual channels; cut-through switching; source-path routing; credit-based flow control; fault tolerance; remote management access
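The credit-based flow control named above can be sketched as follows; the buffer depth is illustrative and the helper names are ours:

```c
/* Minimal credit-based flow-control sketch: the sender starts with
 * one credit per downstream buffer slot, spends a credit for each
 * packet sent, and regains one when the receiver frees a slot and
 * returns the credit. The sender can therefore never overrun the
 * receiver's buffer. */
typedef struct { unsigned credits; } link_tx;

static void link_init(link_tx *l, unsigned buffer_slots)
{
    l->credits = buffer_slots;
}

/* Returns 1 if the packet may be sent (consuming a credit),
 * 0 if the sender must stall until a credit is returned. */
static int try_send(link_tx *l)
{
    if (l->credits == 0)
        return 0;
    l->credits--;
    return 1;
}

/* Called when a credit comes back from the receiver. */
static void credit_return(link_tx *l)
{
    l->credits++;
}
```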



Performance Evaluation

Two nodes, each: 4x AMD Opteron 2.2 GHz quad-core; 16 GB RAM; standard Linux
Virtex-4 FX100 prototype: 156 MHz core clock; HT400 interface (1.6 GB/s); 6 links, each 6.24 Gbps; FPGA utilization 75%


Performance Evaluation: Limitations & Solutions

CPU microarchitecture:
Only one outstanding load transaction on MMIO space → move to DRAM space
Max. size of a load on uncacheable memory is 64 bit → cacheable memory
BIOS:
MMIO space per PCI B:D:F number is max. 256 MB → Extended MMIO (EMMIO), move to DRAM space


Performance Evaluation: Remote Loads

[Figure: Read latency for 4 and 8 byte loads: measurement (HT400, 156 MHz core clock) ≈1.89/1.98 µs; simulation (HT1000, 800 MHz core clock) ≈1.21/1.27 µs.]

[Figure: Read throughput for 4–64 byte loads, up to ~200 MB/s; series: non-posted measurement (max. 1 outstanding), non-posted simulation (max. 1 outstanding), non-posted simulation (max. 32 outstanding).]

Little's Law (1961, queuing theory):
Number outstanding N = 1
Response time R ≈ 2 µs (approx.)
Throughput X = N / R; with 8-byte loads: X · 8 B = 4 MB/s

Full round-trip latency, including main memory access
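The Little's Law arithmetic above can be checked directly; the helper name is ours:

```c
/* Little's Law: with N outstanding requests and response time R,
 * the request rate is X = N / R. Numbers from the slide: N = 1,
 * R = 2 us, 8-byte loads. */
static double throughput_MBps(double n_outstanding,
                              double response_time_s,
                              double bytes_per_request)
{
    double x = n_outstanding / response_time_s;  /* requests per second */
    return x * bytes_per_request / 1e6;          /* MB/s                */
}
```

throughput_MBps(1, 2e-6, 8) evaluates to about 4 MB/s, matching the measured read-throughput ceiling with a single outstanding load; this is why allowing 32 outstanding transactions matters so much in the simulation.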


Performance Evaluation: Remote Stores

At network peak bandwidth for 64 bytes
Outstanding transaction rate

[Figure: Write throughput for 8–128 byte stores (y-axis 0–1,400 MB/s); simulation (HT1000, 800 MHz core clock) vs. measurement (HT400, 156 MHz core clock).]

[Figure: Posted transaction rate [1/s] for 8–4,096 byte stores, on the order of millions of transactions per second.]


Performance Evaluation: Put It into Context!

Comparing SMFU vs. EXTOLL (API); two nodes, 4x 2.2 GHz K10 Opteron

[Figure: Throughput (MB/s, 0–400) and speedup (0–20) over message sizes of 8–8,192 bytes; series: Shared Memory (FPGA-based) remote stores, API over EXTOLL (FPGA-based) NetPIPE stream, MPI over IB (ASIC-based) NetPIPE stream; speedup of SMFU over EXTOLL and of SMFU over IB.]


Conclusion & Future

Conclusion:
Proof of concept of distributed shared memory
Fine-grained remote stores & loads
Efficient and slim design
First performance numbers outstanding and encouraging (taking the technology differences into account)

Future:
Atomic operations
DRAM space (requires coherency)
Consistency supporting strict and relaxed operations
Application-level evaluation
UPC/GASNet
Aggregating memory
…