An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems
Taeweon Suh ┼, Shih-Lien L. Lu ¥, and Hsien-Hsin S. Lee §
Platform Validation Engineering, Intel ┼
Microprocessor Technology Lab, Intel ¥
ECE, Georgia Institute of Technology §
August 27, 2007
Motivation and Contribution
Evaluation of coherence traffic efficiency
– Why important?
  • Understand the impact of coherence traffic on system performance
  • Reflect the findings in the communication architecture
– Problems with traditional methods
  • They evaluate the protocols themselves
  • Software simulations
  • Experiments on SMP machines: ambiguous
– Solution
  • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
Cache Coherence Protocol
Example
– MESI protocol: a snoop-based protocol
– Protocol states: Modified, Exclusive, Shared, Invalid
[Figure: Processor 0 (MESI) and Processor 1 (MESI) on a shared bus with main memory, which initially holds 1234; both caches start in state I]
Example operation sequence:
  P0: read         → P0: E 1234, P1: I (data from memory)
  P1: read         → P0: S 1234, P1: S 1234 (shared)
  P1: write (abcd) → P0: I (invalidate), P1: M abcd
  P0: read         → P0: S abcd, P1: S abcd (cache-to-cache)
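To make the sequence above concrete, here is a minimal MESI trace in Python. This is an illustrative model written for this transcript, not the protocol logic of any system discussed, and it covers only the transitions the example exercises.

```python
# Minimal MESI model tracing the slide's example; an illustrative sketch,
# not the hardware. States: Modified, Exclusive, Shared, Invalid.

class Cache:
    def __init__(self):
        self.state, self.data = "I", None

def read(requester, other, memory):
    if requester.state != "I":
        return                              # already holds a valid copy
    if other.state == "M":                  # snoop hit on a modified line:
        requester.data = other.data         # cache-to-cache transfer ...
        memory[0] = other.data              # ... and MESI also updates memory
        requester.state = other.state = "S"
    elif other.state in ("E", "S"):         # line is shared
        requester.data = memory[0]
        requester.state = other.state = "S"
    else:                                   # miss served by memory alone
        requester.data = memory[0]
        requester.state = "E"

def write(writer, other, value):
    other.state, other.data = "I", None     # invalidate the other copy
    writer.state, writer.data = "M", value  # dirty in the writer's cache

p0, p1, mem = Cache(), Cache(), ["1234"]
read(p0, p1, mem)                           # P0: read  -> P0 E 1234
read(p1, p0, mem)                           # P1: read  -> both S 1234 (shared)
write(p1, p0, "abcd")                       # P1: write -> P0 I, P1 M abcd
read(p0, p1, mem)                           # P0: read  -> cache-to-cache
print(p0.state, p0.data, p1.state, p1.data) # S abcd S abcd
```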
Previous Work 1
MemorIES (2000)
– Memory Instrumentation and Emulation System from IBM T.J. Watson
– L3 cache and/or coherence protocol emulation
  • Plugged into the 6xx bus of an RS/6000 SMP machine
– Passive emulator
Previous Work 2
ACE (2006)
– Active Cache Emulation
– Active L3 cache size emulation with timing
– Time dilation
Evaluation Methodology
Goal
– Measure the intrinsic delay of coherence traffic and evaluate its efficiency
Shortcomings in a multiprocessor environment
– Nearly impossible to isolate the impact of coherence traffic on system performance
– Even worse, there are non-deterministic factors:
  • Arbitration delay
  • Stalls in the pipelined bus
[Figure: Processors 0–3, all MESI, on a shared bus with a memory controller and main memory; a "cache-to-cache transfer" between two processors crosses the shared bus]
Evaluation Methodology (continued)
Our methodology (sketched in code below)
– Use an Intel server system equipped with two Pentium-IIIs
– Replace one Pentium-III with an FPGA
– Implement a cache in the FPGA
– Save evicted cache lines into the cache
– Supply data using a cache-to-cache transfer when the Pentium-III next requests the line
– Measure execution time of benchmarks and compare with the baseline
[Figure: the remaining Pentium-III (MESI) and the FPGA with its data cache (D$) on the front-side bus (FSB), plus the memory controller and 2GB of SDRAM; the "cache-to-cache transfer" flows from the FPGA to the Pentium-III]
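The FPGA's role on the bus can be summarized behaviorally. The sketch below is a hypothetical Python model written for illustration only; the actual design is RTL driving the P6 bus signals, and its cache is direct-mapped rather than the dictionary used here.

```python
# Behavioral sketch of the FPGA acting as the second "processor" on the FSB.
# Hypothetical model for illustration; not the actual RTL design.

class FpgaCacheEmulator:
    def __init__(self):
        self.lines = {}                     # address -> captured cache line

    def on_writeback(self, addr, data):
        # The Pentium-III evicts a dirty line: capture it from the bus.
        self.lines[addr] = data

    def on_read(self, addr):
        # The Pentium-III issues a burst read: snoop it and, on a hit,
        # supply the line by cache-to-cache transfer instead of memory.
        if addr in self.lines:
            return "HITM", self.lines[addr] # snoop hit: FPGA drives the data
        return "MISS", None                 # memory controller supplies data

fpga = FpgaCacheEmulator()
fpga.on_writeback(0x1000, b"evicted line")
print(fpga.on_read(0x1000))                 # ('HITM', b'evicted line')
print(fpga.on_read(0x2000))                 # ('MISS', None)
```

Comparing benchmark execution time against the baseline then exposes the intrinsic cost of the extra cache-to-cache transfers.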
Evaluation Equipment
[Photo: the Intel server system with one Pentium-III and the FPGA board installed, connected to a logic analyzer and to a host PC via UART]
Evaluation Equipment (continued)
[Photo: the FPGA board, showing the Xilinx Virtex-II FPGA, the FSB interface, logic analyzer ports, and LEDs]
Implementation
Simplified P6 FSB timing diagram
– Cache-to-cache transfer on the P6 FSB
[Timing diagram: FSB signals ADS#, A[35:3]# (address), HIT#, HITM#, TRDY#, DRDY#, DBSY#, and D[63:0]# (data0–data3), laid over the FSB pipeline stages: request 1, request 2, error 1, error 2, snoop, response, and data. HITM# asserted in the snoop phase signals a snoop hit, TRDY# indicates the memory controller is ready to accept data, and a new ADS# assertion starts the next pipelined transaction]
Implementation (continued)
Implemented modules in the FPGA (the cache lookup is sketched in code below)
– State machines
  • To keep track of FSB transactions
  • Taking evicted data from the FSB
  • Initiating cache-to-cache transfers
– Direct-mapped caches
  • The cache size in the FPGA varies from 1KB to 256KB
  • Note that the Pentium-III has a 256KB 4-way set-associative L2
– Statistics module
[Block diagram: the Xilinx Virtex-II FPGA on the front-side bus (FSB), containing state machines (write-back, cache-to-cache, the rest), a direct-mapped cache with tag and data arrays, and statistics registers; connected to a PC via UART and to a logic analyzer]
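For the direct-mapped cache, the tag/index split follows directly from the cache and line sizes. A minimal sketch, assuming 32-byte lines (one 4×8B FSB burst); split_address is our illustrative helper, not a name from the design.

```python
# Direct-mapped tag/index split for the FPGA cache; a sketch assuming
# 32-byte lines (one 4x8B burst on the 64-bit FSB data bus).

LINE_BYTES = 32

def split_address(addr, cache_bytes):
    num_sets = cache_bytes // LINE_BYTES     # direct-mapped: one line per set
    index = (addr // LINE_BYTES) % num_sets  # selects the tag/data entry
    tag = (addr // LINE_BYTES) // num_sets   # compared against the stored tag
    return index, tag

# A 1KB cache has 32 sets; the largest 256KB configuration has 8192 sets.
print(split_address(0x12345678, 1 * 1024))    # (19, 298261)
print(split_address(0x12345678, 256 * 1024))  # (691, 1165)
```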
Experiment Environment and Method
Operating system
– Red Hat Linux 2.4.20-8
Natively run SPEC2000 benchmarks
– The choice of benchmarks does not affect the evaluation as long as a reasonable amount of bus traffic is generated
The FPGA sends statistics to the PC via UART (a toy model of this module follows below):
– # cache-to-cache transfers on the FSB per second
– # invalidation transactions on the FSB per second
  • Read-for-ownership transactions:
    – 0-byte memory read with invalidation (upon an upgrade miss)
    – Full-line (4×8B) memory read with invalidation
– # burst-read (4×8B) transactions on the FSB per second
More metrics
– Hit rate in the FPGA's cache
– Execution time difference compared to the baseline
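As a rough picture of how these statistics could be gathered, here is a toy Python model: free-running counters sampled once per second and reported to the host. All names are ours; the real module is simply a set of registers in the FPGA read out over the UART.

```python
# Toy model of the statistics module; hypothetical, for illustration only.

class StatsModule:
    FIELDS = ("cache_to_cache", "invalidation", "burst_read")

    def __init__(self):
        self.counts = dict.fromkeys(self.FIELDS, 0)

    def bump(self, field):
        self.counts[field] += 1             # incremented per FSB transaction

    def report_and_clear(self):
        # Sampled once per second and sent to the host PC via UART.
        line = " ".join(f"{k}={v}" for k, v in self.counts.items())
        self.counts = dict.fromkeys(self.FIELDS, 0)
        return line

stats = StatsModule()
stats.bump("cache_to_cache")
stats.bump("burst_read")
print(stats.report_and_clear())  # cache_to_cache=1 invalidation=0 burst_read=1
```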
Experiment Results
Average # cache-to-cache transfers per second
[Bar chart: average # cache-to-cache transfers/sec (100k to 800k) vs. cache size in the FPGA (1KB to 256KB), for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and their average; annotated values: 433.3K/sec and 804.2K/sec]
Experiment Results (continued)
Average increase of invalidation traffic per second
[Bar chart: average increase of invalidation traffic/sec (0 to 300k) vs. cache size in the FPGA (1KB to 256KB), for the same benchmarks and their average; annotated values: 157.5K/sec and 306.8K/sec]
Experiment Results (continued)
Average hit rate in the FPGA's cache
[Bar chart: average hit rate (%) (0 to 70) vs. cache size in the FPGA (1KB to 256KB), for the same benchmarks and their average; annotated values: 16.9% and 64.89%]
Hit rate = (# cache-to-cache transfers) / (# full-cache-line data reads)
Experiment Results (continued)
Average execution time increase
– Baseline: benchmark execution on a single Pentium-III without the FPGA, so data is always supplied from main memory
[Chart: average execution time increase over the baseline (sec), from -20 to 200, vs. cache size in the FPGA (1KB to 256KB); annotated values: 191 seconds and 171 seconds. Average baseline execution time: 5635 seconds (93 min)]
Run-time Breakdown
Estimate the run-time of each type of coherence traffic
– with the 256KB cache in the FPGA

                     | Invalidation traffic | Cache-to-cache transfer
Latency              | 5 ~ 10 FSB cycles    | 10 ~ 20 FSB cycles
Estimated run-time   | 69 ~ 138 seconds     | 381 ~ 762 seconds

Estimated time = (avg. occurrences/sec) × (avg. total execution time) × (clock period/cycle) × (latency of each traffic type in cycles)

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds.
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase!
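The arithmetic of the formula above can be reproduced directly. A minimal sketch, assuming a 133 MHz FSB (7.5 ns clock period), which the slide does not state; the occurrence rates passed in below are placeholders for illustration, not the measured values.

```python
# Worked version of the run-time estimate; a sketch under stated assumptions.

FSB_PERIOD_S = 7.5e-9   # assumption: 133 MHz front-side bus clock
TOTAL_TIME_S = 5635     # average baseline execution time from the slide

def estimated_time(occurrences_per_sec, latency_cycles):
    # occurrences/sec x total seconds = total occurrences, then
    # x cycles/occurrence x seconds/cycle = seconds spent in that traffic
    return occurrences_per_sec * TOTAL_TIME_S * latency_cycles * FSB_PERIOD_S

# Placeholder rates, for illustration only:
print(estimated_time(300e3, 5))   # invalidation traffic: ~63 seconds
print(estimated_time(900e3, 10))  # cache-to-cache transfer: ~380 seconds
```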
Conclusion
Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
– Coherence traffic in the Pentium-III-based Intel server system is not as efficient as expected
  • The main reason is that, in MESI, main memory must be updated at the same time as a cache-to-cache transfer
Opportunities for performance enhancement
– For faster cache-to-cache transfers
  • Cache line buffers in the memory controller
    – As long as buffer space is available, the memory controller can take data
  • MOESI would help shorten the latency
    – Main memory need not be updated upon a cache-to-cache transfer
– For faster invalidation traffic
  • Advancing the snoop phase to an earlier stage
Questions, Comments?
Thanks for your attention!
Backup Slides
Motivation
Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by the state transitions of the coherence protocol
– Trace-based simulations were mostly used for these protocol evaluations
Software simulations are too slow for broad-range analysis of system behaviors
– In addition, it is very difficult to model real-world effects such as I/O exactly
The system-wide performance impact of coherence traffic has not been explicitly investigated on real systems
This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA
Motivation and Contribution
Evaluation of coherence traffic efficiency
– Motivation
  • The memory wall keeps rising
    – Important to understand the impact of communication among processors
  • Traditionally, evaluation of coherence protocols focused on the protocols themselves
    – Software-based simulation
  • FPGA technology
    – The original Pentium fits into one Xilinx Virtex-4 LX200
    – Recent emulation effort: the RAMP consortium
– Contribution
  • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique
[Slide images: MemorIES (ASPLOS 2000) and a BEE2 board]
Cache Coherence Protocols
A well-known technique for keeping data consistent among multiple processors with caches
Classification
– Snoop-based protocols
  • Rely on broadcasting on a shared bus
    – Based on shared memory (symmetric access to main memory)
  • Limited scalability
  • Used to build small-scale multiprocessor systems
    – Very popular in servers and workstations
– Directory-based protocols
  • Message-based communication via an interconnection network
    – Based on distributed shared memory (DSM): cache-coherent non-uniform memory access (ccNUMA)
  • Scalable
  • Used to build large-scale systems
  • Actively studied in the 1990s
Cache Coherence Protocols (continued)
Snoop-based protocols
– Invalidation-based protocols
  • Invalidate shared copies when writing
  • 1980s: Write-Once, Synapse, Berkeley, and Illinois
  • Current protocols adopt different combinations of the states (M, O, E, S, and I):
    – MEI: PowerPC 750, MIPS64 20Kc
    – MSI: Silicon Graphics 4D series
    – MESI: Pentium class, AMD K6, PowerPC 601
    – MOESI: AMD64, UltraSPARC
– Update-based protocols
  • Update shared copies when writing
  • Examples: the Dragon and Firefly protocols
Cache Coherence Protocols (continued)
Directory-based protocols
– Memory-based schemes
  • Keep a directory at the granularity of a cache line in the home node's memory
    – One dirty bit, and one presence bit for each node (see the sketch after this list)
  • Storage overhead due to the directory
  • Examples: Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin
– Cache-based schemes
  • Keep only a head pointer for each cache line in the home node's directory
    – Forward and backward pointers are kept in the caches of each node
  • Long latency due to the serialization of messages
  • Examples: Sequent NUMA-Q, Convex Exemplar, and Data General
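As a concrete illustration of the memory-based bookkeeping described above, here is a hypothetical directory entry in Python, with one dirty bit and one presence bit per node. This is illustrative only; real machines such as DASH or Origin pack and manage these bits differently.

```python
# Hypothetical memory-based directory entry, kept per cache line in the
# home node's memory; for illustration only.

class DirectoryEntry:
    def __init__(self, num_nodes):
        self.dirty = False                   # some node holds a modified copy
        self.presence = [False] * num_nodes  # which nodes cache the line

    def record_read(self, node):
        self.presence[node] = True           # add a sharer on a read miss

    def record_write(self, node):
        # The writer becomes the sole owner; all other copies are invalidated.
        self.presence = [n == node for n in range(len(self.presence))]
        self.dirty = True

entry = DirectoryEntry(num_nodes=4)
entry.record_read(0)
entry.record_read(2)
entry.record_write(2)
print(entry.dirty, entry.presence)           # True [False, False, True, False]
```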
Emulation Initiatives for Protocol Evaluation
RPM (mid-to-late ’90s)
– Rapid Prototyping engine for Multiprocessors from the Univ. of Southern California
– ccNUMA full-system emulation
  • A SPARC IU/FPU core is used as the CPU in each node, and the rest (L1, L2, etc.) is implemented with 8 FPGAs
  • Nodes are connected through Futurebus+
FPGA Initiatives for Evaluation
Other cache emulators
– RACFCS (1997)
  • Reconfigurable Address Collector and Flying Cache Simulator from Yonsei Univ. in Korea
  • Plugged into the Intel486 bus
    – Passively collects addresses
– HACS (2002)
  • Hardware Accelerated Cache Simulator from Brigham Young Univ.
  • Plugged into the FSB of a Pentium Pro-based system
– ACE (2006)
  • Active Cache Emulator from Intel Corp.
  • Plugged into the FSB of a Pentium-III-based system