Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Dhiraj D. Kalamkar (Intel), Mainak Chaudhuri (IIT Kanpur), Mark Heinrich (University of Central Florida)



Page 1: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida

Page 2: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Talk in One Slide

Address re-mapping improves performance of important kernels
– Vertical reduction and transpose in this talk

But it requires custom hardware support in the memory controller
– Address translation, cache line assembly

We move this hardware support to software running on a directory protocol thread
– Can be a thread in an SMT or a core in a CMP
– Enjoys 1.45x and 1.29x speedup for reduction and transpose on a 16-node DSM multiprocessor

Page 3: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Sketch

Background
– Active Memory Techniques and the AMDU
– Flexible Directory Controller Architecture

Deconstructing the AMDU
– Parallel Reduction

Simulation Environment
Simulation Results
Summary

Page 4: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: AM Techniques

Focus on two kernels in this work
– Parallel vertical reduction of an array of vectors, and matrix transpose

Consider vertically reducing each column of an NxN matrix to produce a 1xN vector
– For ease of page distribution, a block-row partitioning among processors is carried out
– Each processor reduces its portion into a private 1xN vector
– A merge phase accumulates the private contributions

Page 5: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Parallel Reduction

[Figure: processors P0 through P3 each reduce their block of rows into a private vector; an all-to-all merge phase combines the private vectors into the final result.]

Page 6: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: AM Parallel Reduction

for j = 0 to N-1
  p_x[pid][j] = e;
  for i = pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
/* Merge phase: do not want this */
for j = pid*(N/P) to (pid+1)*(N/P)-1
  for i = 0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x
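The BARRIER/pid pseudocode above maps directly onto a threading library. Below is a minimal runnable C sketch using pthreads; N, P, the double element type, and identity e = 0 are illustrative assumptions, not the paper's configuration.

/* Minimal pthreads rendering of the baseline reduction above. */
#include <pthread.h>

#define N 1024
#define P 4

static double A[N][N];          /* input matrix, block-row partitioned */
static double p_x[P][N];        /* private per-thread partial vectors */
static double x[N];             /* final 1xN result vector */
static pthread_barrier_t bar;   /* stands in for BARRIER */

static void *reduce(void *arg) {
    int pid = (int)(long)arg;
    /* Phase 1: reduce the owned block of rows into the private vector. */
    for (int j = 0; j < N; j++) {
        p_x[pid][j] = 0.0;      /* e */
        for (int i = pid * (N / P); i < (pid + 1) * (N / P); i++)
            p_x[pid][j] += A[i][j];
    }
    pthread_barrier_wait(&bar);
    /* Phase 2: the merge phase the AM version eliminates. */
    for (int j = pid * (N / P); j < (pid + 1) * (N / P); j++)
        for (int i = 0; i < P; i++)
            x[j] += p_x[i][j];
    pthread_barrier_wait(&bar);
    return 0;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&bar, 0, P);
    for (long p = 0; p < P; p++)
        pthread_create(&t[p], 0, reduce, (void *)p);
    for (long p = 0; p < P; p++)
        pthread_join(t[p], 0);
    return 0;
}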

Page 7: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: AM Parallel Reduction

[Figure: P0 through P3 write to a special shadow space (not backed by memory); on cache eviction, the memory controller merges the shadow writebacks into the physical result vector.]

Page 8: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: AM Parallel Reduction

Original:

for j = 0 to N-1
  p_x[pid][j] = e;
  for i = pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j = pid*(N/P) to (pid+1)*(N/P)-1
  for i = 0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x

AM optimized:

/* AM optimized */
x' = AMInstall(x, N, sizeof(long long));
for j = 0 to N-1
  for i = pid*(N/P) to (pid+1)*(N/P)-1
    x'[pid][j] = x'[pid][j] + A[i][j];
BARRIER
Subsequent uses of x

Page 9: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Memory Control

Memory controller does the following:
– Identify requests to the shadow space (easy)
– Send an identity cache block to the local processor and handle coherence in the background
– Identify shadow space writebacks and accumulate the data in the evicted block with the in-memory data (requires a translation)
– On a normal space request, retrieve the corresponding shadow blocks from processors (P shadow blocks contribute to one normal block), accumulate them with in-memory data, and send the final result

Merge removed from the critical path
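A hedged C sketch of the shadow-writeback handling described above, as protocol software might express it; the helpers shadow_to_phys, dram_read, and dram_write are hypothetical names, not an API from the paper.

#include <stdint.h>

#define LINE_WORDS 16   /* 128-byte block of 8-byte words (assumed) */

uint64_t shadow_to_phys(uint64_t shadow_addr);                /* next slide */
void dram_read(uint64_t pa, int64_t line[LINE_WORDS]);        /* hypothetical */
void dram_write(uint64_t pa, const int64_t line[LINE_WORDS]); /* hypothetical */

void handle_shadow_writeback(uint64_t shadow_addr,
                             const int64_t evicted[LINE_WORDS]) {
    int64_t line[LINE_WORDS];
    uint64_t pa = shadow_to_phys(shadow_addr);  /* requires a translation */
    dram_read(pa, line);                        /* fetch in-memory data */
    for (int w = 0; w < LINE_WORDS; w++)
        line[w] += evicted[w];                  /* accumulate evicted block */
    dram_write(pa, line);
}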

Page 10: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Translation

Suppose the memory controller receives a shadow writeback to address A
– If the starting shadow address of the result vector is S, the offset is A-S
  S is a fixed number decided by the hardware and OS designers; also, the shadow space is contiguous
– Add A-S to the starting virtual address of the result vector (recall: the starting virtual address is communicated to the MC via AMInstall)
– Look up the memory-resident TLB with this address to get the physical address of the data to be written back (sketched below)
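The three translation steps above, sketched in C; SHADOW_BASE (the fixed S), result_base_va (recorded by AMInstall), and mtlb_lookup are assumed names and values.

#include <stdint.h>

#define SHADOW_BASE 0x4000000000ULL   /* S: fixed by HW/OS designers */

extern uint64_t result_base_va;   /* starting VA of the result vector,
                                     communicated to the MC via AMInstall */
uint64_t mtlb_lookup(uint64_t va);/* memory-resident TLB (hypothetical) */

uint64_t shadow_to_phys(uint64_t a) {
    uint64_t offset = a - SHADOW_BASE;      /* shadow space is contiguous */
    uint64_t va = result_base_va + offset;  /* VA of the written-back data */
    return mtlb_lookup(va);                 /* physical address */
}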

Page 11: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Memory Control

[Figure: memory controller block diagram. A shadow writeback arrives through the network interface at the coherence engine; the TLB translates it, and the merger combines it with the physical block fetched from SDRAM through the data buffer pool; the router connects the node to the network.]

Page 12: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: AMDU Pipeline

[Figure: AMDU pipeline. Stages: virtual address calculation from the base address supplied by the coherence engine; AMTLB lookup with AMTLB prefetch; directory address calculation; SDRAM/data buffer access. Outputs to the coherence engine: physical address, directory address, and application data/message buffer.]

Page 13: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Flexibility

Flexibility was a primary goal of AM
– Do not want to add new hardware for every new AM optimization

Two key components to achieve this goal
– A general-enough AMDU
– Integrate the control code of the AMDU into the software coherence protocol running on the coherence engine
– The coherence engine itself is a simple processor core in a CMP or a thread context in an SMT
– This work eliminates the AMDU and achieves maximum possible flexibility

Page 14: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Flexible Coherence

[Figure: PCPL1. The application thread (AT) runs on an OOO core with IL1/DL1 and L2; the protocol thread (PT) runs on an in-order core with its own IL1/DL1; the memory controller with AMDU, router, and SDRAM are on die.]

Page 15: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Flexible Coherence

[Figure: PCSL2. As above, but the in-order protocol core has only an IL1 and shares the OOO core's L2 for data.]

Page 16: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Flexible Coherence

[Figure: PCSL2PL1. The in-order protocol core has an IL1 and a private L1 data cache (PL1), and also shares the L2.]

Page 17: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Flexible Coherence

[Figure: SMTp. The application thread (AT) and protocol thread (PT) are hardware contexts of a single OOO SMT core sharing IL1/DL1 and L2; the memory controller with AMDU, router, and SDRAM are on die.]

Page 18: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Contributions

Two major contributions
– First implementation of AM techniques without any custom hardware in the MC
  Brings AM closer to adoption in commodity systems
– Evaluation of new flexible AM protocols on four different directory controller architectures
  Innovative use of contemporary dual-core and SMT nodes

Page 19: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Sketch

Background
– Active Memory Techniques and the AMDU
– Flexible Directory Controller Architecture

Deconstructing the AMDU
– Parallel Reduction

Simulation Environment
Simulation Results
Summary

Page 20: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU

Involves efficiently emulating the AMDU pipeline in the protocol code
– Virtual address calculation is easy: one shift and one 32-bit addition
– Directory address calculation is easy: one shift by a constant amount and one 40-bit addition (both sketched below)
– Challenging components:
  TLB
  Merger
  Dynamic cache line gather/scatter (needed for transpose)
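A sketch of the two easy calculations, assuming illustrative constants (128-byte lines, 8-byte directory entries, a fixed directory base); the slides specify only the 32-bit and 40-bit additions, so the field widths here are assumptions.

#include <stdint.h>

#define LINE_SHIFT      7             /* 128-byte cache lines (assumed) */
#define DIR_ENTRY_SHIFT 3             /* 8-byte directory entries (assumed) */
#define DIR_BASE        0x80000000ULL /* directory base in memory (assumed) */

/* Virtual address: one shift and one 32-bit addition. */
static inline uint32_t va_calc(uint32_t base_va, uint32_t line_index) {
    return base_va + (line_index << LINE_SHIFT);
}

/* Directory address: one shift by a constant and one 40-bit addition. */
static inline uint64_t dir_addr_calc(uint64_t pa) {
    return (DIR_BASE + ((pa >> LINE_SHIFT) << DIR_ENTRY_SHIFT))
           & ((1ULL << 40) - 1);
}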

Page 21: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: TLB

Two design options
– Emulate a direct-mapped software TLB in the protocol data area (sketched below)
  Each entry holds a tag, the translation, permission bits, and a valid bit
  Hit/miss detection in software
  On a miss, invoke the page walker or access the memory-resident page table
  Advantage: can be larger than a hardware TLB
– Share the application TLB: easy in SMTp, but
  Requires an extra port or interferes with application threads
  Other three architectures: floor-planning issues
  Not explored in this work
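A minimal sketch of the first option, the direct-mapped software TLB; entry count, page size, and the page_walk hook are assumptions.

#include <stdint.h>

#define SOFT_TLB_ENTRIES 1024   /* can be larger than a hardware TLB */
#define PAGE_SHIFT 12           /* 4 KB pages (assumed) */

typedef struct {
    uint64_t tag;     /* virtual page number */
    uint64_t pfn;     /* translation: physical frame number */
    uint8_t  perm;    /* permission bits */
    uint8_t  valid;   /* valid bit */
} soft_tlb_entry;

static soft_tlb_entry soft_tlb[SOFT_TLB_ENTRIES]; /* protocol data area */

/* Miss path: walks the memory-resident page table; the page table base
   lives in a reserved architectural register of the protocol thread. */
uint64_t page_walk(uint64_t vpn, uint8_t *perm);

uint64_t soft_tlb_translate(uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    soft_tlb_entry *e = &soft_tlb[vpn % SOFT_TLB_ENTRIES]; /* direct-mapped */
    if (!e->valid || e->tag != vpn) {     /* hit/miss detection in software */
        e->tag = vpn;
        e->pfn = page_walk(vpn, &e->perm);
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
}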

Page 22: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: TLB

Emulating a TLB in protocol software
– Handling a TLB miss requires the page table base address
  Don't want to trap to the kernel
  Load the page table base address into an architectural register of the protocol thread at application launch (this register cannot be used by the protocol compiler)
– TLB shootdown now needs to worry about keeping the soft TLB coherent
  Must invalidate the TLB in the protocol data area
  The starting address and size of the TLB area should be made known to the TLB shootdown kernel

Page 23: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

[Figure: the memory controller diagram as before, with the shadow writeback held in a message buffer (MEB) and the physical block loaded into a memory buffer (MYB) in the data buffer pool.]

Page 24: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

Naïve approach (sketched below)
– Writeback data is in a message buffer (MEB), and the physical block is loaded into a memory buffer (MYB)
– The protocol thread can access 64 bits of data at 8-byte-aligned offsets within a data buffer through uncached loads/stores (the data buffer pool is memory-mapped)
– Load 64-bit data from MYB and MEB into two general-purpose registers, merge them, store the result back to the same offset in MYB
– At the end of the loop, write MYB back to memory
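The naïve merge loop, sketched in C; the volatile accesses stand in for the uncached loads/stores to the memory-mapped buffers, and the buffer size follows the 128B example two slides ahead.

#include <stdint.h>

#define BUF_WORDS 16   /* 128-byte buffer, 8-byte aligned accesses */

void naive_merge(volatile int64_t *meb,    /* writeback data (MEB) */
                 volatile int64_t *myb) {  /* physical block (MYB) */
    for (int w = 0; w < BUF_WORDS; w++) {
        int64_t a = meb[w];    /* uncached load into a GPR */
        int64_t b = myb[w];    /* uncached load into a GPR */
        myb[w] = a + b;        /* merge, uncached store back to MYB */
    }
    /* ...then write MYB back to memory: 32 uncached loads and
       16 uncached stores in total for one 128B buffer. */
}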

Page 25: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

[Figure: naïve merge flow. Uncached loads move MEB and MYB words into the register file (RF), an add merges them, uncached stores write the results back to MYB, and a final uncached store writes the block to SDRAM.]

Page 26: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

Naïve approach
– For 128B data buffers: 32 uncached loads, 16 uncached stores (as opposed to 16 cycles in the AMDU pipe)
– Worse: uncached operations are often implemented as non-speculative in processor pipes

Improvement: caching the buffers
– Data buffers are already memory-mapped
– Treat them as standard cached memory
– Now can use cached loads and stores, which can issue speculatively and can be pipelined

Page 27: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

Caching the buffers (sketched below)
– The cache block(s) holding MYB must be flushed to memory at the end of all the merges
  Use the writeback-invalidate instruction available on all microprocessors
– The cache block(s) holding MEB must be invalidated at the end of all the merges
  Otherwise, the next time the same buffer is used, there is a danger that the protocol thread may see stale data in the cache
  Use the invalidate instruction available on all microprocessors
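A sketch of one cached merge with the flush/invalidate discipline above; wb_invalidate_line and invalidate_line are placeholder names for the writeback-invalidate and invalidate instructions, not a specific ISA's mnemonics.

#include <stdint.h>

#define BUF_BYTES  128
#define BUF_WORDS  (BUF_BYTES / 8)
#define LINE_BYTES 128   /* assumed cache line size */

void wb_invalidate_line(void *addr);  /* writeback-invalidate (placeholder) */
void invalidate_line(void *addr);     /* invalidate (placeholder) */

void cached_merge(int64_t *meb, int64_t *myb) {
    /* Ordinary cached loads/stores: issue speculatively, pipeline well. */
    for (int w = 0; w < BUF_WORDS; w++)
        myb[w] += meb[w];
    /* Flush MYB so the merged result reaches memory. */
    for (int b = 0; b < BUF_BYTES; b += LINE_BYTES)
        wb_invalidate_line((char *)myb + b);
    /* Invalidate MEB so a later reuse of this message buffer cannot
       observe stale cached data. */
    for (int b = 0; b < BUF_BYTES; b += LINE_BYTES)
        invalidate_line((char *)meb + b);
}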

Page 28: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

[Figure: cached merge flow. MEB and MYB miss and fill into the data cache (D$); cached loads, adds, and stores perform the merge through the register file; MYB is flushed, an uncached store writes the physical block to SDRAM, and MEB's line is invalidated.]

Page 29: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

Why flush the cache block(s) holding MYB at the end of the merge?
– Recall that there are P shadow blocks corresponding to one physical block
– Potentially there could be P writeback operations accessing the same physical block
– Caching each physical block across merges (not just during a merge) improves reuse
– Cannot cache at the same physical address, though: coherence problem in a shared cache
– Use a different address space: 0x2345680 gets cached at 0xc2345680 (bypasses MYB); see the sketch below
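The aliasing trick, sketched; the 0xc prefix follows the slide's example, and the constant is an assumption about where the bits above the installed DRAM capacity sit.

#include <stdint.h>

#define MERGE_ALIAS 0xc0000000ULL  /* assumed: above installed DRAM capacity */

/* 0x2345680 is cached at 0xc2345680; when the block is finally flushed,
   the memory controller ignores address bits above the installed DRAM
   capacity, so the data still lands at physical 0x2345680. */
static inline int64_t *merge_alias(uint64_t pa) {
    return (int64_t *)(uintptr_t)(MERGE_ALIAS | pa);
}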

Page 30: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

Caching across merges
– Flush the blocks at the end of all the merges (potentially P in number)
  Can be decided by the protocol thread from the directory state (shadow owner vector)
  Problem: the address of the flushed block is slightly different from the actual physical address (in the higher bits)
  The memory controller anyway ignores address bits higher than the installed DRAM capacity
– Must still flush MEB after every merge as usual

Page 31: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

[Figure: caching across merges. For each of the (potentially P) writebacks: miss/fill of MEB, cached loads, adds, cached stores, and an MEB invalidate; the aliased physical block fills once, stays in the data cache across merges, and is flushed to SDRAM only after the final merge.]

Page 32: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Merger

A three-dimensional optimization space
– Caching (C) or not caching (U) MEB during a merge, MYB during a merge, and merge results across merges
– Eight configurations: UUU, UUC, UCU, UCC, CUU, CUC, CCU, CCC

[Cube diagram of the eight configurations. Annotations: some corners are not viable; UCC is the best performing; caching MEB hurts.]

Page 33: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Sketch

Background
– Active Memory Techniques and the AMDU
– Flexible Directory Controller Architecture

Deconstructing the AMDU
– Parallel Reduction

Simulation Environment
Simulation Results
Summary

Page 34: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Environment

Each node is dual-core, with one OOO SMT core and one in-order core
– On-die memory controller and router
– All components are clocked at 2.4 GHz
– The SMT core has 32 KB IL1 and DL1 (dual-ported), a 2 MB L2 (3-cycle tag hit), and an 18-stage pipe
– DRAM bandwidth 6.4 GB/s per channel, 40 ns page hit, 80 ns page miss
– Hop time 10 ns, link bandwidth 3.2 GB/s, 2-way bristled hypercube
– 16 nodes, each capable of running up to two application threads

Page 35: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Environment

[Figure: the PCPL1 node configuration; the in-order protocol core has 32 KB IL1/DL1, in two variants labeled PCPL1_128KB and PCPL1_2MB.]

Page 36: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Environment

[Figure: the PCSL2 node configuration; the in-order protocol core has a 32 KB IL1 and shares the L2.]

Page 37: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Environment

[Figure: the PCSL2PL1 node configuration; the in-order protocol core has a 32 KB IL1 and a 128 KB private L1 data cache, and shares the L2.]

Page 38: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Environment

[Figure: the SMTp node configuration; the application thread (AT) and protocol thread (PT) are contexts of one OOO SMT core.]

Page 39: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Benchmark Applications

Parallel reduction [all prefetched]
– Mean Square Average (MSA) [micro]
– DenseMMM: C = A^T B
– SparseFlow: flow computation in a sparse multi-source graph
– Spark98 kernel: SMVP

Transpose [prefetched, tiled]
– Transpose [micro]
– SPLASH-2 FFT: only the forward transform
  Involves three tiled transpose phases
– FFTW: forward and inverse

Page 40: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Sketch

Background
– Active Memory Techniques and the AMDU
– Flexible Directory Controller Architecture

Deconstructing the AMDU
– Parallel Reduction

Simulation Environment
Simulation Results
Summary

Page 41: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results

Two key questions to answer
– How much speedup does our design achieve over a baseline that does not use AM protocols (both without the AMDU)?
– How much performance penalty do we pay due to the elimination of the hardwired AMDU?

Page 42: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: Spark98

[Bar chart: Spark98 results. Annotations: no performance loss; close; 20% speedup.]

Page 43: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Result Summary: Reduction

Very encouraging results
– Architectures without the AMDU come within 3% of architectures with the complex AMDU
– SMTp+UCC and PCSL2PL1+UCC are the most attractive architectures
  45% and 49% speedup with 16 application threads compared to the non-AM baseline

Page 44: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: 1D FFT

[Bar chart: 1D FFT results. Annotations: 4.1% gap; 2.3% gap.]

Page 45: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Result Summary: Transpose

Within 13.2% of AMDU performance
– On average, an 8.7% gap

SMTp+SoftTr delivers 29% speedup
PCSL2PL1+SoftTr delivers 23% speedup

Flashback: reduction summary
– Within 3% of AMDU performance
– 45% and 49% speedup for SMTp+UCC and PCSL2PL1+UCC

Architecturally, SMTp is more attractive (area overhead is small), but PCSL2PL1 may be easier to verify

Page 46: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Prior Research

The Impulse memory controller introduced the concept of address re-mapping
– Used in single-threaded systems
– Software-directed cache flush for coherence

Active Memory leveraged cache coherence to do address re-mapping
– Allowed seamless extensions to SMPs and DSMs
– Introduced the AMDU and flexibility in AM

This work closes the loop by bringing AM closer to commodity

Page 47: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Summary

Eliminates the custom hardware support in the memory controller traditionally needed for AM
– Parallel reduction performance comes within 3% of the AMDU
– Transpose performance comes within 13.2% of the AMDU (lack of efficient pipelining)
– The protocol thread architecture achieves 45% and 29% speedup for reduction and transpose
– The protocol core architecture with private L1 and shared L2 achieves 49% and 27%

Page 48: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida

THANK YOU!

Page 49: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Matrix Transpose

Original:

for i = pid*(N/P) to (pid+1)*(N/P)-1
  for j = 0 to N-1
    sum += A[i][j];
BARRIER
Transpose(A, A');
BARRIER
for i = pid*(N/P) to (pid+1)*(N/P)-1
  for j = 0 to N-1
    sum += A'[i][j];
BARRIER
Transpose(A', A);
BARRIER

AM optimized:

/* AM optimized transpose */
A' = AMInstall(A, N, N, sizeof(Complex));
for i = pid*(N/P) to (pid+1)*(N/P)-1
  for j = 0 to N-1
    sum += A[i][j];
BARRIER
for i = pid*(N/P) to (pid+1)*(N/P)-1
  for j = 0 to N-1
    sum += A'[i][j];
BARRIER

Page 50: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Memory Control

[Figure: memory controller for transpose. On a shadow get, the AMDU gathers words from SDRAM into the data buffer pool and the assembled block is sent out through the network interface; on a shadow put, the words are scattered back.]

Page 51: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Background: Memory Control

Memory controller does the following (a gather sketch follows below):
– Identify shadow space requests (easy)
– Gather the actual words from the requested column segment and assemble a cache block on read/read-exclusive requests (requires address computation and translation)
– In the case of a writeback, scatter the words to the column locations (requires address computation and translation)
– On normal space requests, retrieve the corresponding shadow blocks and assemble a cache block
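A hedged sketch of the gather step (the scatter on a writeback is symmetric); va_of_element, translate, and dram_read_word are hypothetical helpers.

#include <stdint.h>

#define LINE_WORDS 16   /* 8-byte words per 128-byte shadow block (assumed) */

uint64_t va_of_element(int row, int col);  /* address computation */
uint64_t translate(uint64_t va);           /* AMTLB / soft TLB lookup */
int64_t  dram_read_word(uint64_t pa);      /* hypothetical DRAM access */

/* Assemble one shadow cache block from a column segment: one word per
   row, each needing its own address computation and translation. */
void gather_column(int col, int first_row, int64_t block[LINE_WORDS]) {
    for (int w = 0; w < LINE_WORDS; w++) {
        uint64_t va = va_of_element(first_row + w, col);
        block[w] = dram_read_word(translate(va));
    }
}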

Page 52: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Deconstructing the AMDU: Dynamic Assembly

Required for packing a cache block with gathered words from different addresses
– Easy to move to software (sketched below)
– Issue addresses along with destination word offsets within a data buffer
– Each word offset in the data buffer comes with a valid bit (a la Stanford FLASH)
– Memory-mapped valid bits can be accessed by the protocol software through uncached loads and stores
– When all valid bits are set, the protocol software can initiate the data transfer
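A sketch of the valid-bit assembly loop; the valid array stands in for the memory-mapped valid bits, send_block for the data transfer, and the volatile accesses for uncached loads/stores. All names are placeholders.

#include <stdint.h>

#define LINE_WORDS 16

void send_block(const volatile int64_t *buf);  /* initiate the transfer */

void assemble_and_send(volatile int64_t *buf,      /* data buffer words */
                       volatile uint8_t *valid) {  /* memory-mapped bits */
    for (;;) {                      /* poll until every word has arrived */
        int all_set = 1;
        for (int w = 0; w < LINE_WORDS; w++)
            if (!valid[w]) { all_set = 0; break; }   /* uncached load */
        if (all_set) break;
    }
    send_block(buf);                /* all valid bits set: start transfer */
    for (int w = 0; w < LINE_WORDS; w++)
        valid[w] = 0;               /* uncached stores: clear for reuse */
}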

Page 53: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Complexity, Area, Power

The asynchronous interface between the AMDU and the protocol thread makes the hardware complex and hard to verify
– Worth eliminating the AMDU

Verifying the new protocol software is easier
– Regular code segments reused in many places: amortizes the cost

Area saving: at 65 nm, the area of the AMDU is 1.77 mm² (with a 1024-entry AMTLB)

Peak dynamic power saving: 2.81 W
– The high power density (1.6 W/mm²) could have generated a hot-spot

Page 54: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: MSA

[Bar chart: MSA results. Annotations: 12% speedup; good scalability.]

Page 55: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: DenseMMM

[Bar chart: DenseMMM results. Annotations: 7% speedup; 15% speedup; slow cache hit hurts; 2.3% faster than the AMDU.]

Page 56: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: SparseFlow

[Bar chart: SparseFlow results. Annotations: 22% gap; 49% gap; 2.9% gap; best UCC (2.57); 2.4% gap; not scalable.]

Page 57: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: Transpose Microbenchmark

[Bar chart: transpose microbenchmark results. Annotations: 51%; similar performance; 13.2% gap; 37% speedup; slow cache hit hurts.]

Page 58: Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

Simulation Results: FFTW

[Bar chart: FFTW results. Annotations: 8.7% gap; 2.3% gap.]