Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida
Talk in One Slide
- Address re-mapping improves performance of important kernels
  - Vertical reduction and transpose in this talk
- But it requires custom hardware support in the memory controller
  - Address translation, cache line assembly
- We move this hardware support to software running on a dir. protocol thread
  - Can be a thread in SMT or a core in CMP
  - Enjoys 1.45 and 1.29 speedup for reduction and transpose on a 16-node DSM multiproc.
Sketch
- Background
  - Active Memory Techniques and the AMDU
  - Flexible Directory Controller Architecture
- Deconstructing the AMDU
  - Parallel Reduction
- Simulation Environment
- Simulation Results
- Summary
Background: AM Techniques
- Focus on two kernels in this work
  - Parallel vertical reduction of an array of vectors, and matrix transpose
- Consider vertically reducing each column of an NxN matrix to produce a 1xN vector
  - For ease of page distribution, a block-row partitioning among processors is carried out
  - Each processor reduces its portion into a private 1xN vector
  - A merge phase accumulates the private contributions
Background: Parallel Reduction
[Figure: the matrix is block-row partitioned among P0-P3; each processor reduces its block into a private vector, followed by an all-to-all merge]
Background: AM Parallel Reduction
for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1    ← do not want this merge phase
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x
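The two phases above can be sketched in C as a sequential emulation of the P logical processors (array sizes and function names are illustrative, not from the talk):

```c
#define N 8
#define P 4

/* Phase 1: each logical processor pid reduces its block of rows of A
   into its private vector p_x[pid], starting from the identity e = 0. */
void reduce_private(long A[N][N], long p_x[P][N]) {
    for (int pid = 0; pid < P; pid++)
        for (int j = 0; j < N; j++) {
            p_x[pid][j] = 0;
            for (int i = pid * (N / P); i < (pid + 1) * (N / P); i++)
                p_x[pid][j] += A[i][j];
        }
}

/* Phase 2: the merge that the AM version removes from the critical path.
   (In the parallel code each pid merges only its own slice of columns.) */
void merge_private(long p_x[P][N], long x[N]) {
    for (int j = 0; j < N; j++) {
        x[j] = 0;
        for (int pid = 0; pid < P; pid++)
            x[j] += p_x[pid][j];
    }
}
```

In the AM-optimized version shown later, the phase-1 loop writes to the shadow vector x' instead and merge_private disappears: the memory controller performs the accumulation when shadow blocks are evicted.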
Background: AM Parallel Reduction
[Figure: P0-P3 accumulate into a special shadow space not backed by memory; on cache eviction, the shadow writeback is merged in the memory controller (MC) into the physical result vector]
Background: AM Parallel Reduction
for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x

/* AM optimized */
x' = AMInstall (x, N, sizeof (long long));
for j=0 to N-1
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    x'[pid][j] = x'[pid][j] + A[i][j];
BARRIER
Subsequent uses of x
Background: Memory Control
Memory controller does the following:
- Identify requests to the shadow space (easy)
- Send an identity cache block to the local processor and handle coherence in the background
- Identify shadow space writebacks and accumulate the data in the evicted block with the in-memory data (requires a translation)
- On a normal space request, retrieve corresponding shadow blocks from processors (P shadow blocks contribute to one normal block), accumulate them with in-memory data, and send the final result
Merge removed from critical path
Background: Translation
Suppose the memory controller receives a shadow writeback to address A:
- If the starting shadow address of the result vector is S, the offset is A-S
  - S is a fixed number decided by the hardware and OS designers; also, shadow space is contiguous
- Add A-S to the starting virtual address of the result vector (recall: the starting virtual address is communicated to the MC via AMInstall)
- Look up the memory-resident TLB with this address to get the physical address of the data to be written back
Background: Memory Control
[Figure: memory controller with network interface, router, coherence engine, TLB, merger, data buffer pool, and SDRAM; a shadow writeback is translated through the TLB and merged with the physical block]
Background: AMDU Pipeline
[Figure: AMDU pipeline — virtual address calculation from the base address, AMTLB lookup (with AMTLB prefetch) producing the physical address, directory address calculation, then SDRAM/data buffer access; the coherence engine supplies the base address and receives directory addresses and application data via the message buffer]
Background: Flexibility
Flexibility was a primary goal of AM
- Do not want to add new hardware for every new AM optimization
Two key components to achieve this goal:
- A general-enough AMDU
- Integrate control code of the AMDU into the software coherence protocol running on the coherence engine
- The coherence engine itself is a simple processor core in a CMP or a thread context in SMT
- This work eliminates the AMDU and achieves maximum possible flexibility
Background: Flexible Coherence
[Figures: four directory controller architectures for hosting the protocol thread (PT) beside the application thread (AT), each with AMDU, memory control, router, and SDRAM — PCPL1: OOO application core plus an in-order protocol core with private IL1/DL1; PCSL2: protocol core sharing the L2; PCSL2PL1: protocol core with a private L1 and shared L2; SMTp: PT runs as a thread context on the OOO SMT core]
Contributions
Two major contributions:
- First implementation of AM techniques without any custom hardware in the MC
  - Brings AM closer to adoption in commodity systems
- Evaluation of new flexible AM protocols on four different directory controller architectures
  - Innovative use of contemporary dual-core and SMT nodes
Deconstructing the AMDU
Involves efficiently emulating the AMDU pipeline in the protocol code
- Virtual address calculation is easy: one shift and one 32-bit addition
- Directory address calculation is easy: one shift by a constant amount and one 40-bit addition
- Challenging components:
  - TLB
  - Merger
  - Dynamic cache line gather/scatter (needed for transpose)
Deconstructing the AMDU: TLB
Two design options:
- Emulate a direct-mapped software TLB in the protocol data area
  - Each entry holds tag, translation, permission bits, valid bit
  - Hit/miss detection in software
  - On a miss, invoke the page walker or access the memory-resident page table
  - Advantage: can be larger than a hardware TLB
- Share the application TLB: easy in SMTp, but
  - Requires an extra port or interferes with app. threads
  - Other three architectures: floor-planning issues
  - Not explored in this work
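A minimal sketch of the first option, a direct-mapped software TLB in the protocol data area (the entry layout, page size, and function names are assumptions for illustration):

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define SOFT_TLB_ENTRIES 1024   /* can be larger than a hardware TLB */

/* One soft-TLB entry: tag, translation, permission bits, valid bit. */
typedef struct {
    uint64_t vpn;     /* tag: virtual page number */
    uint64_t pfn;     /* translation: physical frame number */
    uint8_t  perm;    /* permission bits */
    uint8_t  valid;
} soft_tlb_entry;

/* Hit/miss detection in software.  Returns the physical address on a hit,
   or -1 on a miss, in which case the protocol code would invoke the page
   walker or access the memory-resident page table. */
int64_t soft_tlb_lookup(soft_tlb_entry tlb[SOFT_TLB_ENTRIES], uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    soft_tlb_entry *e = &tlb[vpn & (SOFT_TLB_ENTRIES - 1)];  /* direct-mapped */
    if (!e->valid || e->vpn != vpn)
        return -1;
    return (int64_t)((e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1)));
}
```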
Deconstructing the AMDU: TLB
Emulating a TLB in protocol software
- Handling a TLB miss requires the page table base address
  - Don't want to trap to the kernel
  - Load the page table base address into an architectural register of the protocol thread at the time of application launch (this register cannot be used by the protocol compiler)
- TLB shootdown now needs to worry about keeping the soft TLB coherent
  - Must invalidate the TLB in the protocol data area
  - Starting address and size of the TLB area should be made known to the TLB shootdown kernel
Deconstructing the AMDU: Merger
[Figure: as in the earlier memory control diagram, with the shadow writeback data held in a message buffer (MEB) and the physical block loaded into a memory buffer (MYB) within the data buffer pool]
Deconstructing the AMDU: Merger
Naive approach:
- Writeback data is in the message buffer (MEB) and the physical block is loaded into a memory buffer (MYB)
- The protocol thread can access 64 bits of data at 8-byte aligned offsets within a data buffer through uncached loads/stores (the data buffer pool is memory-mapped)
- Load 64-bit data from MYB and MEB into two general-purpose registers, merge them, store the result back to the same offset in MYB
- At the end of the loop, write MYB back to memory
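The naive merge loop above, written out (the buffer layout and names are assumptions; in the real controller the loads and stores to the memory-mapped buffer pool would be uncached):

```c
#include <stdint.h>
#include <stddef.h>

#define BUF_BYTES 128
#define BUF_WORDS (BUF_BYTES / 8)   /* 16 64-bit words per data buffer */

/* For each 8-byte aligned offset: load one word from the memory buffer
   (MYB, holding the physical block) and one from the message buffer
   (MEB, holding the shadow writeback), add them, and store the sum back
   into MYB.  That is 32 loads and 16 stores per 128 B buffer. */
void merge_buffers(uint64_t myb[BUF_WORDS], uint64_t meb[BUF_WORDS]) {
    for (size_t w = 0; w < BUF_WORDS; w++)
        myb[w] = myb[w] + meb[w];
    /* ...then MYB is written back to memory at the end of the loop. */
}
```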
Deconstructing the AMDU: Merger
[Figure: naive merge — uncached loads from MYB and MEB into the register file, add, uncached store back into MYB, and a final uncached store of MYB to SDRAM]
Deconstructing the AMDU: Merger
Naive approach:
- For 128B data buffers, 32 uncached loads and 16 uncached stores (as opposed to 16 cycles in the AMDU pipe)
- Worse: uncached operations are often implemented as non-speculative in processor pipes
Improvement: caching the buffers
- Data buffers are already memory-mapped
- Treat them as standard cached memory
- Now can use cached loads and stores, which can issue speculatively and can be pipelined
Deconstructing the AMDU: Merger
Caching the buffers:
- The cache block(s) holding MYB must be flushed to memory at the end of all the merges
  - Use the writeback-invalidate instruction available on all microprocessors
- The cache block(s) holding MEB must be invalidated at the end of all the merges
  - Otherwise, next time the same buffer is used, there is a danger that the protocol thread may see stale data in the cache
  - Use the invalidate instruction available on all microprocessors
Deconstructing the AMDU: Merger
[Figure: cached merge — loads of MYB and MEB miss and fill the D$, the add goes through the register file with cached loads and stores, and at the end the MYB block is flushed (uncached store to SDRAM) and the MEB block invalidated]
Deconstructing the AMDU: Merger
Why flush the cache block(s) holding MYB at the end of the merge?
- Recall that there are P shadow blocks corresponding to one physical block
- Potentially there could be P writeback operations accessing the same physical block
- Caching each physical block across merges (not just during a merge) improves reuse
- Cannot cache at the same physical address though: coherence problem in a shared cache
- Use a different address space: 0x2345680 gets cached at 0xc2345680 (bypasses MYB)
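The re-mapped cache address can be formed by setting otherwise-unused high address bits; the exact bit pattern below is an assumption chosen to match the 0x2345680 → 0xc2345680 example on the slide:

```c
#include <stdint.h>

#define ALIAS_BITS 0xC0000000ULL   /* high bits above installed DRAM capacity */

/* Cache the physical block at an aliased address so it does not conflict
   with the application's own cached copy in a shared cache. */
uint64_t alias_for_caching(uint64_t pa) {
    return pa | ALIAS_BITS;
}

/* On the eventual flush, the memory controller ignores address bits above
   the installed DRAM capacity (assumed a power of two here), so the
   aliased block lands at the correct physical location. */
uint64_t dram_addr(uint64_t addr, uint64_t dram_capacity) {
    return addr & (dram_capacity - 1);
}
```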
Deconstructing the AMDU: Merger
Caching across merges:
- Flush the blocks at the end of all the merges (potentially P in number)
  - Can be decided by the protocol thread from the directory state (shadow owner vector)
  - Problem: the address of the flushed block is slightly different from the actual physical address (in higher bits)
  - Memory controller anyway ignores address bits higher than installed DRAM capacity
- Must still flush MEB after every merge as usual
Deconstructing the AMDU: Merger
[Figure: caching across merges — each merge runs loads, adds, and stores through the D$ with misses and fills only on first use, invalidating MEB after every merge; the physical block is flushed to SDRAM only once, at the end of all the merges]
Deconstructing the AMDU: Merger
A three-dimensional optimization space:
- Caching (C) or not caching (U) the MEB during a merge, the MYB during a merge, and the merge results across merges
- UUU, UUC, …, CCC
[Figure: the eight design points UUU, UUC, UCU, UCC, CUU, CUC, CCU, CCC — some combinations are not viable, caching MEB hurts, and UCC is the best performing]
Simulation Environment
Each node is dual-core with one OOO SMT core and one in-order core
- On-die memory controller and router
- All components are clocked at 2.4 GHz
- SMT core has 32 KB IL1 and DL1 (dual-ported), and 2 MB L2 (3-cycle tag hit), 18-stage pipe
- DRAM bandwidth 6.4 GB/s per channel, 40 ns page hit, 80 ns page miss
- Hop time 10 ns, link bandwidth 3.2 GB/s, 2-way bristled hypercube
- 16 nodes, each node capable of running up to two application threads
Simulation Environment
[Figures: the four evaluated configurations — PCPL1 with the protocol core's L1 sized 128 KB (PCPL1_128KB) or 2 MB (PCPL1_2MB), PCSL2, PCSL2PL1 with a 32 KB IL1 and a 128 KB private L1, and SMTp; application core L1 caches are 32 KB]
Benchmark Applications
- Parallel reduction [all prefetched]
  - Mean Square Average (MSA) [micro]
  - DenseMMM: C = A^T B
  - SparseFlow: flow computation in a sparse multi-source graph
  - Spark98 kernel: SMVP
- Transpose [prefetched, tiled]
  - Transpose [micro]
  - SPLASH-2 FFT: only forward transform; involves three tiled transpose phases
  - FFTW: forward and inverse
Simulation Results
Two key questions to answer:
- How much speedup does our design achieve over a baseline that does not use AM protocols (both without the AMDU)?
- How much performance penalty do we pay due to the elimination of the hardwired AMDU?
Simulation Results: Spark98
[Chart annotations: no performance loss; close; 20% speedup]
Result Summary: Reduction
Very encouraging results:
- Architectures without the AMDU come within 3% of architectures with the complex AMDU
- SMTp+UCC and PCSL2PL1+UCC are the most attractive architectures
  - 45% and 49% speedup with 16 application threads compared to the non-AM baseline
Simulation Results: 1D FFT
[Chart annotations: 4.1% gap; 2.3% gap]
Result Summary: Transpose
- Within 13.2% of AMDU performance
  - On average an 8.7% gap
- SMTp+SoftTr delivers 29% speedup
- PCSL2PL1+SoftTr delivers 23% speedup
Flashback: reduction summary
- Within 3% of AMDU performance
- 45% and 49% speedup for SMTp+UCC and PCSL2PL1+UCC
Architecturally, SMTp is more attractive (area overhead is small), but PCSL2PL1 may be easier to verify
Prior Research
- The Impulse memory controller introduced the concept of address re-mapping
  - Used in single-threaded systems
  - Software-directed cache flush for coherence
- Active memory leveraged cache coherence to do address re-mapping
  - Allowed seamless extensions to SMPs and DSMs
  - Introduced the AMDU and flexibility in AM
- This work closes the loop by bringing AM closer to commodity
Summary
Eliminates the custom hardware support in the memory controller traditionally needed for AM
- Parallel reduction performance comes within 3% of the AMDU
- Transpose performance comes within 13.2% of the AMDU (lack of efficient pipelining)
- Protocol thread architecture achieves 45% and 29% speedup for reduction and transpose
- Protocol core architecture with private L1 and shared L2 achieves 49% and 27% speedup
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Dhiraj D. Kalamkar, Intel
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida
THANK YOU!
Background: Matrix Transpose
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
Transpose (A, A');
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Transpose (A', A);
BARRIER

/* AM optimized transpose */
A' = AMInstall (A, N, N, sizeof(Complex));
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Background: Memory Control
[Figure: for transpose, a shadow get triggers the AMDU to gather words from SDRAM and assemble a cache block in the data buffer pool; a shadow put scatters the words back]
Background: Memory Control
Memory controller does the following:
- Identify shadow space requests (easy)
- Gather the actual words from the requested column segment and assemble a cache block on read/read-exclusive requests (requires address computation and translation)
- In the case of a writeback, scatter the words to the column locations (requires address computation and translation)
- On normal space requests, retrieve corresponding shadow blocks and assemble a cache block
Deconstructing the AMDU: Dynamic Assembly
Required for packing a cache block with gathered words from different addresses
- Easy to move to software
- Issue addresses along with destination word offsets within a data buffer
- Each word offset in the data buffer comes with a valid bit (a la Stanford FLASH)
- Memory-mapped valid bits can be accessed by the protocol software through uncached loads and stores
- When all valid bits are set, the protocol software can initiate the data transfer
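A sketch of the assembly buffer with per-word valid bits (the field names and the 16-word block size are illustrative; in hardware the valid bits are memory-mapped and reached through uncached loads/stores):

```c
#include <stdint.h>

#define BLOCK_WORDS 16   /* one 128 B cache block = 16 64-bit words */

/* Data buffer used to assemble a cache block from gathered words. */
typedef struct {
    uint64_t word[BLOCK_WORDS];
    uint16_t valid;   /* one valid bit per destination word offset */
} assembly_buffer;

/* A gathered word arrives with its destination offset inside the block. */
void deposit_word(assembly_buffer *b, unsigned offset, uint64_t data) {
    b->word[offset] = data;
    b->valid |= (uint16_t)(1u << offset);
}

/* When all valid bits are set, the protocol software can initiate the
   data transfer of the assembled block. */
int block_ready(const assembly_buffer *b) {
    return b->valid == 0xFFFF;
}
```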
Complexity, Area, Power
- The asynchronous interface between the AMDU and the protocol thread makes the hardware complex and hard to verify
  - Worth eliminating the AMDU
- Verifying new protocol software is easier
  - Regular code segments reused in many places: amortizes the cost
- Area saving: at 65 nm, the area of the AMDU is 1.77 mm² (with a 1024-entry AMTLB)
- Peak dynamic power saving: 2.81 W
  - High power density (1.6 W/mm²) could have generated a hot spot
Simulation Results: MSA
[Chart annotations: 12% speedup; good scalability]
Simulation Results: DenseMMM
[Chart annotations: 7% speedup; 15% speedup; slow cache hit hurts; 2.3% faster than AMDU]
Simulation Results: SparseFlow
[Chart annotations: 22% gap; 49% gap; 2.9% gap; best UCC (2.57); 2.4% gap; not scalable]
Simulation Results: Transpose Microbenchmark
[Chart annotations: 51%; similar performance; 13.2% gap; 37% speedup; slow cache hit hurts]
Simulation Results: FFTW
[Chart annotations: 8.7% gap; 2.3% gap]