Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida
Talk in One Slide
- Address re-mapping improves performance of important kernels
  - Vertical reduction and transpose in this talk
- But it requires custom hardware support in the memory controller
  - Address translation, cache line assembly
- We move this hardware support to software running on a dir. protocol thread
  - Can be a thread in SMT or a core in CMP
  - Enjoys 1.45 and 1.29 speedup for reduction and transpose on a 16-node DSM multiproc.
Sketch
- Background
  - Active Memory Techniques and the AMDU
  - Flexible Directory Controller Architecture
- Deconstructing the AMDU
  - Parallel Reduction
- Simulation Environment
- Simulation Results
- Summary
Background: AM Techniques
- Focus on two kernels in this work
  - Parallel vertical reduction of an array of vectors, and matrix transpose
- Consider vertically reducing each column of an NxN matrix to produce a 1xN vector
  - For ease of page distribution, a block-row partitioning among processors is carried out
  - Each processor reduces its portion into a private 1xN vector
  - A merge phase accumulates the private contributions
Background: Parallel Reduction
[Figure: the matrix is block-row partitioned among P0-P3; each processor reduces its block into a private vector, followed by an all-to-all merge]
Background: AM Parallel Reduction
for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1    ← do not want this merge phase
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x
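The two phases above can be sketched in C as a sequential emulation of the P logical processors (array sizes and function names are illustrative, not from the talk):

```c
#define N 8
#define P 4

/* Phase 1: each logical processor pid reduces its block of rows of A
   into its private vector p_x[pid], starting from the identity e = 0. */
void reduce_private(long A[N][N], long p_x[P][N]) {
    for (int pid = 0; pid < P; pid++)
        for (int j = 0; j < N; j++) {
            p_x[pid][j] = 0;
            for (int i = pid * (N / P); i < (pid + 1) * (N / P); i++)
                p_x[pid][j] += A[i][j];
        }
}

/* Phase 2: the merge that the AM version removes from the critical path.
   (In the parallel code each pid merges only its own slice of columns.) */
void merge_private(long p_x[P][N], long x[N]) {
    for (int j = 0; j < N; j++) {
        x[j] = 0;
        for (int pid = 0; pid < P; pid++)
            x[j] += p_x[pid][j];
    }
}
```

In the AM-optimized version shown later, the phase-1 loop writes to the shadow vector x' instead and merge_private disappears: the memory controller performs the accumulation when shadow blocks are evicted.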
Background: AM Parallel Reduction
[Figure: P0-P3 accumulate into a special shadow space not backed by memory; on cache eviction, the shadow writeback is merged in the memory controller (MC) into the physical result vector]
Background: AM Parallel Reduction
for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x

/* AM optimized */
x' = AMInstall (x, N, sizeof (long long));
for j=0 to N-1
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    x'[pid][j] = x'[pid][j] + A[i][j];
BARRIER
Subsequent uses of x
Background: Memory Control
Memory controller does the following:
- Identify requests to the shadow space (easy)
- Send an identity cache block to the local processor and handle coherence in the background
- Identify shadow space writebacks and accumulate the data in the evicted block with the in-memory data (requires a translation)
- On a normal space request, retrieve corresponding shadow blocks from processors (P shadow blocks contribute to one normal block), accumulate them with in-memory data, and send the final result
Merge removed from critical path
Background: Translation
Suppose the memory controller receives a shadow writeback to address A:
- If the starting shadow address of the result vector is S, the offset is A-S
  - S is a fixed number decided by the hardware and OS designers; also, shadow space is contiguous
- Add A-S to the starting virtual address of the result vector (recall: the starting virtual address is communicated to the MC via AMInstall)
- Look up the memory-resident TLB with this address to get the physical address of the data to be written back
Background: Memory Control
[Figure: memory controller with network interface, router, coherence engine, TLB, merger, data buffer pool, and SDRAM; a shadow writeback is translated through the TLB and merged with the physical block]
Background: AMDU Pipeline
[Figure: AMDU pipeline — virtual address calculation from the base address, AMTLB lookup (with AMTLB prefetch) producing the physical address, directory address calculation, then SDRAM/data buffer access; the coherence engine supplies the base address and receives directory addresses and application data via the message buffer]
Background: Flexibility
Flexibility was a primary goal of AM
- Do not want to add new hardware for every new AM optimization
Two key components to achieve this goal:
- A general-enough AMDU
- Integrate control code of the AMDU into the software coherence protocol running on the coherence engine
- The coherence engine itself is a simple processor core in a CMP or a thread context in SMT
- This work eliminates the AMDU and achieves maximum possible flexibility
Background: Flexible Coherence
[Figures: four directory controller architectures for hosting the protocol thread (PT) beside the application thread (AT), each with AMDU, memory control, router, and SDRAM — PCPL1: OOO application core plus an in-order protocol core with private IL1/DL1; PCSL2: protocol core sharing the L2; PCSL2PL1: protocol core with a private L1 and shared L2; SMTp: PT runs as a thread context on the OOO SMT core]
Contributions
Two major contributions:
- First implementation of AM techniques without any custom hardware in the MC
  - Brings AM closer to adoption in commodity systems
- Evaluation of new flexible AM protocols on four different directory controller architectures
  - Innovative use of contemporary dual-core and SMT nodes
Deconstructing the AMDU
Involves efficiently emulating the AMDU pipeline in the protocol code
- Virtual address calculation is easy: one shift and one 32-bit addition
- Directory address calculation is easy: one shift by a constant amount and one 40-bit addition
- Challenging components:
  - TLB
  - Merger
  - Dynamic cache line gather/scatter (needed for transpose)
Deconstructing the AMDU: TLB
Two design options:
- Emulate a direct-mapped software TLB in the protocol data area
  - Each entry holds tag, translation, permission bits, valid bit
  - Hit/miss detection in software
  - On a miss, invoke the page walker or access the memory-resident page table
  - Advantage: can be larger than a hardware TLB
- Share the application TLB: easy in SMTp, but
  - Requires an extra port or interferes with app. threads
  - Other three architectures: floor-planning issues
  - Not explored in this work
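A minimal sketch of the first option, a direct-mapped software TLB in the protocol data area (the entry layout, page size, and function names are assumptions for illustration):

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define SOFT_TLB_ENTRIES 1024   /* can be larger than a hardware TLB */

/* One soft-TLB entry: tag, translation, permission bits, valid bit. */
typedef struct {
    uint64_t vpn;     /* tag: virtual page number */
    uint64_t pfn;     /* translation: physical frame number */
    uint8_t  perm;    /* permission bits */
    uint8_t  valid;
} soft_tlb_entry;

/* Hit/miss detection in software.  Returns the physical address on a hit,
   or -1 on a miss, in which case the protocol code would invoke the page
   walker or access the memory-resident page table. */
int64_t soft_tlb_lookup(soft_tlb_entry tlb[SOFT_TLB_ENTRIES], uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    soft_tlb_entry *e = &tlb[vpn & (SOFT_TLB_ENTRIES - 1)];  /* direct-mapped */
    if (!e->valid || e->vpn != vpn)
        return -1;
    return (int64_t)((e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1)));
}
```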
Deconstructing the AMDU: TLB
Emulating a TLB in protocol software
- Handling a TLB miss requires the page table base address
  - Don't want to trap to the kernel
  - Load the page table base address into an architectural register of the protocol thread at the time of application launch (this register cannot be used by the protocol compiler)
- TLB shootdown now needs to worry about keeping the soft TLB coherent
  - Must invalidate the TLB in the protocol data area
  - Starting address and size of the TLB area should be made known to the TLB shootdown kernel
Deconstructing the AMDU: Merger
[Figure: as in the earlier memory control diagram, with the shadow writeback data held in a message buffer (MEB) and the physical block loaded into a memory buffer (MYB) within the data buffer pool]
Deconstructing the AMDU: Merger
Naive approach:
- Writeback data is in the message buffer (MEB) and the physical block is loaded into a memory buffer (MYB)
- The protocol thread can access 64 bits of data at 8-byte aligned offsets within a data buffer through uncached loads/stores (the data buffer pool is memory-mapped)
- Load 64-bit data from MYB and MEB into two general-purpose registers, merge them, store the result back to the same offset in MYB
- At the end of the loop, write MYB back to memory
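The naive merge loop above, written out (the buffer layout and names are assumptions; in the real controller the loads and stores to the memory-mapped buffer pool would be uncached):

```c
#include <stdint.h>
#include <stddef.h>

#define BUF_BYTES 128
#define BUF_WORDS (BUF_BYTES / 8)   /* 16 64-bit words per data buffer */

/* For each 8-byte aligned offset: load one word from the memory buffer
   (MYB, holding the physical block) and one from the message buffer
   (MEB, holding the shadow writeback), add them, and store the sum back
   into MYB.  That is 32 loads and 16 stores per 128 B buffer. */
void merge_buffers(uint64_t myb[BUF_WORDS], uint64_t meb[BUF_WORDS]) {
    for (size_t w = 0; w < BUF_WORDS; w++)
        myb[w] = myb[w] + meb[w];
    /* ...then MYB is written back to memory at the end of the loop. */
}
```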
Deconstructing the AMDU: Merger
[Figure: naive merge — uncached loads from MYB and MEB into the register file, add, uncached store back into MYB, and a final uncached store of MYB to SDRAM]
Deconstructing the AMDU: Merger
Naive approach:
- For 128B data buffers, 32 uncached loads and 16 uncached stores (as opposed to 16 cycles in the AMDU pipe)
- Worse: uncached operations are often implemented as non-speculative in processor pipes
Improvement: caching the buffers
- Data buffers are already memory-mapped
- Treat them as standard cached memory
- Now can use cached loads and stores, which can issue speculatively and can be pipelined
Deconstructing the AMDU: Merger
Caching the buffers:
- The cache block(s) holding MYB must be flushed to memory at the end of all the merges
  - Use the writeback-invalidate instruction available on all microprocessors
- The cache block(s) holding MEB must be invalidated at the end of all the merges
  - Otherwise, next time the same buffer is used, there is a danger that the protocol thread may see stale data in the cache
  - Use the invalidate instruction available on all microprocessors
Deconstructing the AMDU: Merger
[Figure: cached merge — loads of MYB and MEB miss and fill the D$, the add goes through the register file with cached loads and stores, and at the end the MYB block is flushed (uncached store to SDRAM) and the MEB block invalidated]
Deconstructing the AMDU: Merger
Why flush the cache block(s) holding MYB at the end of the merge?
- Recall that there are P shadow blocks corresponding to one physical block
- Potentially there could be P writeback operations accessing the same physical block
- Caching each physical block across merges (not just during a merge) improves reuse
- Cannot cache at the same physical address though: coherence problem in a shared cache
- Use a different address space: 0x2345680 gets cached at 0xc2345680 (bypasses MYB)
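The re-mapped cache address can be formed by setting otherwise-unused high address bits; the exact bit pattern below is an assumption chosen to match the 0x2345680 → 0xc2345680 example on the slide:

```c
#include <stdint.h>

#define ALIAS_BITS 0xC0000000ULL   /* high bits above installed DRAM capacity */

/* Cache the physical block at an aliased address so it does not conflict
   with the application's own cached copy in a shared cache. */
uint64_t alias_for_caching(uint64_t pa) {
    return pa | ALIAS_BITS;
}

/* On the eventual flush, the memory controller ignores address bits above
   the installed DRAM capacity (assumed a power of two here), so the
   aliased block lands at the correct physical location. */
uint64_t dram_addr(uint64_t addr, uint64_t dram_capacity) {
    return addr & (dram_capacity - 1);
}
```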
Deconstructing the AMDU: Merger
Caching across merges:
- Flush the blocks at the end of all the merges (potentially P in number)
  - Can be decided by the protocol thread from the directory state (shadow owner vector)
  - Problem: the address of the flushed block is slightly different from the actual physical address (in higher bits)
  - Memory controller anyway ignores address bits higher than installed DRAM capacity
- Must still flush MEB after every merge as usual
Deconstructing the AMDU: Merger
[Figure: caching across merges — each merge runs loads, adds, and stores through the D$ with misses and fills only on first use, invalidating MEB after every merge; the physical block is flushed to SDRAM only once, at the end of all the merges]
Deconstructing the AMDU: Merger
A three-dimensional optimization space:
- Caching (C) or not caching (U) the MEB during a merge, the MYB during a merge, and the merge results across merges
- UUU, UUC, …, CCC
[Figure: the eight design points UUU, UUC, UCU, UCC, CUU, CUC, CCU, CCC — some combinations are not viable, caching MEB hurts, and UCC is the best performing]
Simulation Environment
Each node is dual-core with one OOO SMT core and one in-order core
- On-die memory controller and router
- All components are clocked at 2.4 GHz
- SMT core has 32 KB IL1 and DL1 (dual-ported), and 2 MB L2 (3-cycle tag hit), 18-stage pipe
- DRAM bandwidth 6.4 GB/s per channel, 40 ns page hit, 80 ns page miss
- Hop time 10 ns, link bandwidth 3.2 GB/s, 2-way bristled hypercube
- 16 nodes, each node capable of running up to two application threads
Simulation Environment
[Figures: the four evaluated configurations — PCPL1 with the protocol core's L1 sized 128 KB (PCPL1_128KB) or 2 MB (PCPL1_2MB), PCSL2, PCSL2PL1 with a 32 KB IL1 and a 128 KB private L1, and SMTp; application core L1 caches are 32 KB]
Benchmark Applications
- Parallel reduction [all prefetched]
  - Mean Square Average (MSA) [micro]
  - DenseMMM: C = A^T B
  - SparseFlow: flow computation in a sparse multi-source graph
  - Spark98 kernel: SMVP
- Transpose [prefetched, tiled]
  - Transpose [micro]
  - SPLASH-2 FFT: only forward transform; involves three tiled transpose phases
  - FFTW: forward and inverse
Simulation Results
Two key questions to answer:
- How much speedup does our design achieve over a baseline that does not use AM protocols (both without the AMDU)?
- How much performance penalty do we pay due to the elimination of the hardwired AMDU?
Simulation Results: Spark98
[Chart annotations: no performance loss; close; 20% speedup]
Result Summary: Reduction
Very encouraging results:
- Architectures without the AMDU come within 3% of architectures with the complex AMDU
- SMTp+UCC and PCSL2PL1+UCC are the most attractive architectures
  - 45% and 49% speedup with 16 application threads compared to the non-AM baseline
Simulation Results: 1D FFT
[Chart annotations: 4.1% gap; 2.3% gap]
Result Summary: Transpose
- Within 13.2% of AMDU performance
  - On average an 8.7% gap
- SMTp+SoftTr delivers 29% speedup
- PCSL2PL1+SoftTr delivers 23% speedup
Flashback: reduction summary
- Within 3% of AMDU performance
- 45% and 49% speedup for SMTp+UCC and PCSL2PL1+UCC
Architecturally, SMTp is more attractive (area overhead is small), but PCSL2PL1 may be easier to verify
Prior Research
- The Impulse memory controller introduced the concept of address re-mapping
  - Used in single-threaded systems
  - Software-directed cache flush for coherence
- Active memory leveraged cache coherence to do address re-mapping
  - Allowed seamless extensions to SMPs and DSMs
  - Introduced the AMDU and flexibility in AM
- This work closes the loop by bringing AM closer to commodity
Summary
Eliminates the custom hardware support in the memory controller traditionally needed for AM
- Parallel reduction performance comes within 3% of the AMDU
- Transpose performance comes within 13.2% of the AMDU (lack of efficient pipelining)
- Protocol thread architecture achieves 45% and 29% speedup for reduction and transpose
- Protocol core architecture with private L1 and shared L2 achieves 49% and 27% speedup
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida
THANK YOU!
Background: Matrix Transpose
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
Transpose (A, A');
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Transpose (A', A);
BARRIER

/* AM optimized transpose */
A' = AMInstall (A, N, N, sizeof(Complex));
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Background: Memory Control
[Figure: for transpose, a shadow get triggers the AMDU to gather words from SDRAM and assemble a cache block in the data buffer pool; a shadow put scatters the words back]
Background: Memory Control
Memory controller does the following:
- Identify shadow space requests (easy)
- Gather the actual words from the requested column segment and assemble a cache block on read/read-exclusive requests (requires address computation and translation)
- In the case of a writeback, scatter the words to the column locations (requires address computation and translation)
- On normal space requests, retrieve corresponding shadow blocks and assemble a cache block
Deconstructing the AMDU: Dynamic Assembly
Required for packing a cache block with gathered words from different addresses
- Easy to move to software
- Issue addresses along with destination word offsets within a data buffer
- Each word offset in the data buffer comes with a valid bit (a la Stanford FLASH)
- Memory-mapped valid bits can be accessed by the protocol software through uncached loads and stores
- When all valid bits are set, the protocol software can initiate the data transfer
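A sketch of the assembly buffer with per-word valid bits (the field names and the 16-word block size are illustrative; in hardware the valid bits are memory-mapped and reached through uncached loads/stores):

```c
#include <stdint.h>

#define BLOCK_WORDS 16   /* one 128 B cache block = 16 64-bit words */

/* Data buffer used to assemble a cache block from gathered words. */
typedef struct {
    uint64_t word[BLOCK_WORDS];
    uint16_t valid;   /* one valid bit per destination word offset */
} assembly_buffer;

/* A gathered word arrives with its destination offset inside the block. */
void deposit_word(assembly_buffer *b, unsigned offset, uint64_t data) {
    b->word[offset] = data;
    b->valid |= (uint16_t)(1u << offset);
}

/* When all valid bits are set, the protocol software can initiate the
   data transfer of the assembled block. */
int block_ready(const assembly_buffer *b) {
    return b->valid == 0xFFFF;
}
```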
Complexity, Area, Power
- The asynchronous interface between the AMDU and the protocol thread makes the hardware complex and hard to verify
  - Worth eliminating the AMDU
- Verifying new protocol software is easier
  - Regular code segments reused in many places: amortizes the cost
- Area saving: at 65 nm, the area of the AMDU is 1.77 mm² (with a 1024-entry AMTLB)
- Peak dynamic power saving: 2.81 W
  - High power density (1.6 W/mm²) could have generated a hot spot
Simulation Results: MSA
[Chart annotations: 12% speedup; good scalability]
Simulation Results: DenseMMM
[Chart annotations: 7% speedup; 15% speedup; slow cache hit hurts; 2.3% faster than AMDU]
Simulation Results: SparseFlow
[Chart annotations: 22% gap; 49% gap; 2.9% gap; best UCC (2.57); 2.4% gap; not scalable]
Simulation Results: Transpose Microbenchmark
[Chart annotations: 51%; similar performance; 13.2% gap; 37% speedup; slow cache hit hurts]
Simulation Results: FFTW
[Chart annotations: 8.7% gap; 2.3% gap]