An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems
Taeweon Suh ┼, Shih-Lien L. Lu ¥, and Hsien-Hsin S. Lee §
Platform Validation Engineering, Intel ┼
Microprocessor Technology Lab, Intel ¥
ECE, Georgia Institute of Technology §
August 27, 2007
Motivation and Contribution
Evaluation of coherence traffic efficiency
– Why important?
  • Understand the impact of coherence traffic on system performance
  • Reflect the findings in the communication architecture
– Problems with traditional methods
  • They evaluate the protocols themselves
  • Software simulations
  • Experiments on SMP machines: ambiguous
– Solution
  • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
Cache Coherence Protocol
Example
– MESI protocol: a snoop-based protocol
– Protocol states: Modified, Exclusive, Shared, Invalid
[Figure: Processor 0 (MESI) and Processor 1 (MESI) on a shared bus with main memory, which initially holds 1234; both caches start in state I]
Example operation sequence:
  P0: read         → P0: E 1234, P1: I (data from memory)
  P1: read         → P0: S 1234, P1: S 1234 (shared)
  P1: write (abcd) → P0: I (invalidate), P1: M abcd
  P0: read         → P0: S abcd, P1: S abcd (cache-to-cache)
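To make the sequence above concrete, here is a minimal MESI trace in Python. This is an illustrative model written for this transcript, not the protocol logic of any system discussed, and it covers only the transitions the example exercises.

```python
# Minimal MESI model tracing the slide's example; an illustrative sketch,
# not the hardware. States: Modified, Exclusive, Shared, Invalid.

class Cache:
    def __init__(self):
        self.state, self.data = "I", None

def read(requester, other, memory):
    if requester.state != "I":
        return                              # already holds a valid copy
    if other.state == "M":                  # snoop hit on a modified line:
        requester.data = other.data         # cache-to-cache transfer ...
        memory[0] = other.data              # ... and MESI also updates memory
        requester.state = other.state = "S"
    elif other.state in ("E", "S"):         # line is shared
        requester.data = memory[0]
        requester.state = other.state = "S"
    else:                                   # miss served by memory alone
        requester.data = memory[0]
        requester.state = "E"

def write(writer, other, value):
    other.state, other.data = "I", None     # invalidate the other copy
    writer.state, writer.data = "M", value  # dirty in the writer's cache

p0, p1, mem = Cache(), Cache(), ["1234"]
read(p0, p1, mem)                           # P0: read  -> P0 E 1234
read(p1, p0, mem)                           # P1: read  -> both S 1234 (shared)
write(p1, p0, "abcd")                       # P1: write -> P0 I, P1 M abcd
read(p0, p1, mem)                           # P0: read  -> cache-to-cache
print(p0.state, p0.data, p1.state, p1.data) # S abcd S abcd
```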
Previous Work 1
MemorIES (2000)
– Memory Instrumentation and Emulation System from IBM T.J. Watson
– L3 cache and/or coherence protocol emulation
  • Plugged into the 6xx bus of an RS/6000 SMP machine
– Passive emulator
Previous Work 2
ACE (2006)
– Active Cache Emulation
– Active L3 cache size emulation with timing
– Time dilation
Evaluation Methodology
Goal
– Measure the intrinsic delay of coherence traffic and evaluate its efficiency
Shortcomings in a multiprocessor environment
– Nearly impossible to isolate the impact of coherence traffic on system performance
– Even worse, there are non-deterministic factors:
  • Arbitration delay
  • Stalls in the pipelined bus
[Figure: Processors 0–3, all MESI, on a shared bus with a memory controller and main memory; a "cache-to-cache transfer" between two processors crosses the shared bus]
Evaluation Methodology (continued)
Our methodology (sketched in code below)
– Use an Intel server system equipped with two Pentium-IIIs
– Replace one Pentium-III with an FPGA
– Implement a cache in the FPGA
– Save evicted cache lines into the cache
– Supply data using a cache-to-cache transfer when the Pentium-III next requests the line
– Measure execution time of benchmarks and compare with the baseline
[Figure: the remaining Pentium-III (MESI) and the FPGA with its data cache (D$) on the front-side bus (FSB), plus the memory controller and 2GB of SDRAM; the "cache-to-cache transfer" flows from the FPGA to the Pentium-III]
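The FPGA's role on the bus can be summarized behaviorally. The sketch below is a hypothetical Python model written for illustration only; the actual design is RTL driving the P6 bus signals, and its cache is direct-mapped rather than the dictionary used here.

```python
# Behavioral sketch of the FPGA acting as the second "processor" on the FSB.
# Hypothetical model for illustration; not the actual RTL design.

class FpgaCacheEmulator:
    def __init__(self):
        self.lines = {}                     # address -> captured cache line

    def on_writeback(self, addr, data):
        # The Pentium-III evicts a dirty line: capture it from the bus.
        self.lines[addr] = data

    def on_read(self, addr):
        # The Pentium-III issues a burst read: snoop it and, on a hit,
        # supply the line by cache-to-cache transfer instead of memory.
        if addr in self.lines:
            return "HITM", self.lines[addr] # snoop hit: FPGA drives the data
        return "MISS", None                 # memory controller supplies data

fpga = FpgaCacheEmulator()
fpga.on_writeback(0x1000, b"evicted line")
print(fpga.on_read(0x1000))                 # ('HITM', b'evicted line')
print(fpga.on_read(0x2000))                 # ('MISS', None)
```

Comparing benchmark execution time against the baseline then exposes the intrinsic cost of the extra cache-to-cache transfers.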
Evaluation Equipment
[Photo: the Intel server system with one Pentium-III and the FPGA board installed, connected to a logic analyzer and to a host PC via UART]
Evaluation Equipment (continued)
[Photo: the FPGA board, showing the Xilinx Virtex-II FPGA, the FSB interface, logic analyzer ports, and LEDs]
Implementation
Simplified P6 FSB timing diagram
– Cache-to-cache transfer on the P6 FSB
[Timing diagram: FSB signals ADS#, A[35:3]# (address), HIT#, HITM#, TRDY#, DRDY#, DBSY#, and D[63:0]# (data0–data3), laid over the FSB pipeline stages: request 1, request 2, error 1, error 2, snoop, response, and data. HITM# asserted in the snoop phase signals a snoop hit, TRDY# indicates the memory controller is ready to accept data, and a new ADS# assertion starts the next pipelined transaction]
Implementation (continued)
Implemented modules in the FPGA (the cache lookup is sketched in code below)
– State machines
  • To keep track of FSB transactions
  • Taking evicted data from the FSB
  • Initiating cache-to-cache transfers
– Direct-mapped caches
  • The cache size in the FPGA varies from 1KB to 256KB
  • Note that the Pentium-III has a 256KB 4-way set-associative L2
– Statistics module
[Block diagram: the Xilinx Virtex-II FPGA on the front-side bus (FSB), containing state machines (write-back, cache-to-cache, the rest), a direct-mapped cache with tag and data arrays, and statistics registers; connected to a PC via UART and to a logic analyzer]
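For the direct-mapped cache, the tag/index split follows directly from the cache and line sizes. A minimal sketch, assuming 32-byte lines (one 4×8B FSB burst); split_address is our illustrative helper, not a name from the design.

```python
# Direct-mapped tag/index split for the FPGA cache; a sketch assuming
# 32-byte lines (one 4x8B burst on the 64-bit FSB data bus).

LINE_BYTES = 32

def split_address(addr, cache_bytes):
    num_sets = cache_bytes // LINE_BYTES     # direct-mapped: one line per set
    index = (addr // LINE_BYTES) % num_sets  # selects the tag/data entry
    tag = (addr // LINE_BYTES) // num_sets   # compared against the stored tag
    return index, tag

# A 1KB cache has 32 sets; the largest 256KB configuration has 8192 sets.
print(split_address(0x12345678, 1 * 1024))    # (19, 298261)
print(split_address(0x12345678, 256 * 1024))  # (691, 1165)
```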
Experiment Environment and Method
Operating system
– Red Hat Linux 2.4.20-8
Natively run SPEC2000 benchmarks
– The choice of benchmarks does not affect the evaluation as long as a reasonable amount of bus traffic is generated
The FPGA sends statistics to the PC via UART (a toy model of this module follows below):
– # cache-to-cache transfers on the FSB per second
– # invalidation transactions on the FSB per second
  • Read-for-ownership transactions:
    – 0-byte memory read with invalidation (upon an upgrade miss)
    – Full-line (4×8B) memory read with invalidation
– # burst-read (4×8B) transactions on the FSB per second
More metrics
– Hit rate in the FPGA's cache
– Execution time difference compared to the baseline
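As a rough picture of how these statistics could be gathered, here is a toy Python model: free-running counters sampled once per second and reported to the host. All names are ours; the real module is simply a set of registers in the FPGA read out over the UART.

```python
# Toy model of the statistics module; hypothetical, for illustration only.

class StatsModule:
    FIELDS = ("cache_to_cache", "invalidation", "burst_read")

    def __init__(self):
        self.counts = dict.fromkeys(self.FIELDS, 0)

    def bump(self, field):
        self.counts[field] += 1             # incremented per FSB transaction

    def report_and_clear(self):
        # Sampled once per second and sent to the host PC via UART.
        line = " ".join(f"{k}={v}" for k, v in self.counts.items())
        self.counts = dict.fromkeys(self.FIELDS, 0)
        return line

stats = StatsModule()
stats.bump("cache_to_cache")
stats.bump("burst_read")
print(stats.report_and_clear())  # cache_to_cache=1 invalidation=0 burst_read=1
```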
Experiment Results
Average # cache-to-cache transfers per second
[Bar chart: average # cache-to-cache transfers/sec (100k to 800k) vs. cache size in the FPGA (1KB to 256KB), for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and their average; annotated values: 433.3K/sec and 804.2K/sec]
Experiment Results (continued)
Average increase of invalidation traffic per second
[Bar chart: average increase of invalidation traffic/sec (0 to 300k) vs. cache size in the FPGA (1KB to 256KB), for the same benchmarks and their average; annotated values: 157.5K/sec and 306.8K/sec]
Experiment Results (continued)
Average hit rate in the FPGA's cache
[Bar chart: average hit rate (%) (0 to 70) vs. cache size in the FPGA (1KB to 256KB), for the same benchmarks and their average; annotated values: 16.9% and 64.89%]
Hit rate = (# cache-to-cache transfers) / (# full-cache-line data reads)
Experiment Results (continued)
Average execution time increase
– Baseline: benchmark execution on a single Pentium-III without the FPGA, so data is always supplied from main memory
[Chart: average execution time increase over the baseline (sec), from -20 to 200, vs. cache size in the FPGA (1KB to 256KB); annotated values: 191 seconds and 171 seconds. Average baseline execution time: 5635 seconds (93 min)]
Run-time Breakdown
Estimate the run-time of each type of coherence traffic
– with the 256KB cache in the FPGA

                     | Invalidation traffic | Cache-to-cache transfer
Latency              | 5 ~ 10 FSB cycles    | 10 ~ 20 FSB cycles
Estimated run-time   | 69 ~ 138 seconds     | 381 ~ 762 seconds

Estimated time = (avg. occurrences/sec) × (avg. total execution time) × (clock period/cycle) × (latency of each traffic type in cycles)

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds.
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase!
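The arithmetic of the formula above can be reproduced directly. A minimal sketch, assuming a 133 MHz FSB (7.5 ns clock period), which the slide does not state; the occurrence rates passed in below are placeholders for illustration, not the measured values.

```python
# Worked version of the run-time estimate; a sketch under stated assumptions.

FSB_PERIOD_S = 7.5e-9   # assumption: 133 MHz front-side bus clock
TOTAL_TIME_S = 5635     # average baseline execution time from the slide

def estimated_time(occurrences_per_sec, latency_cycles):
    # occurrences/sec x total seconds = total occurrences, then
    # x cycles/occurrence x seconds/cycle = seconds spent in that traffic
    return occurrences_per_sec * TOTAL_TIME_S * latency_cycles * FSB_PERIOD_S

# Placeholder rates, for illustration only:
print(estimated_time(300e3, 5))   # invalidation traffic: ~63 seconds
print(estimated_time(900e3, 10))  # cache-to-cache transfer: ~380 seconds
```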
Conclusion
Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
– Coherence traffic in the Pentium-III-based Intel server system is not as efficient as expected
  • The main reason is that, in MESI, main memory must be updated at the same time as a cache-to-cache transfer
Opportunities for performance enhancement
– For faster cache-to-cache transfers
  • Cache line buffers in the memory controller
    – As long as buffer space is available, the memory controller can take data
  • MOESI would help shorten the latency
    – Main memory need not be updated upon a cache-to-cache transfer
– For faster invalidation traffic
  • Advancing the snoop phase to an earlier stage
Questions, Comments?
Thanks for your attention!
Backup Slides
Motivation
Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by the state transitions of the coherence protocol
– Trace-based simulations were mostly used for these protocol evaluations
Software simulations are too slow for broad-range analysis of system behaviors
– In addition, it is very difficult to model real-world effects such as I/O exactly
The system-wide performance impact of coherence traffic has not been explicitly investigated on real systems
This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA
Motivation and Contribution
Evaluation of coherence traffic efficiency
– Motivation
  • The memory wall keeps rising
    – Important to understand the impact of communication among processors
  • Traditionally, evaluation of coherence protocols focused on the protocols themselves
    – Software-based simulation
  • FPGA technology
    – The original Pentium fits into one Xilinx Virtex-4 LX200
    – Recent emulation effort: the RAMP consortium
– Contribution
  • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique
[Slide images: MemorIES (ASPLOS 2000) and a BEE2 board]
Cache Coherence Protocols
A well-known technique for keeping data consistent among multiple processors with caches
Classification
– Snoop-based protocols
  • Rely on broadcasting on a shared bus
    – Based on shared memory (symmetric access to main memory)
  • Limited scalability
  • Used to build small-scale multiprocessor systems
    – Very popular in servers and workstations
– Directory-based protocols
  • Message-based communication via an interconnection network
    – Based on distributed shared memory (DSM): cache-coherent non-uniform memory access (ccNUMA)
  • Scalable
  • Used to build large-scale systems
  • Actively studied in the 1990s
Cache Coherence Protocols (continued)
Snoop-based protocols
– Invalidation-based protocols
  • Invalidate shared copies when writing
  • 1980s: Write-Once, Synapse, Berkeley, and Illinois
  • Current protocols adopt different combinations of the states (M, O, E, S, and I):
    – MEI: PowerPC 750, MIPS64 20Kc
    – MSI: Silicon Graphics 4D series
    – MESI: Pentium class, AMD K6, PowerPC 601
    – MOESI: AMD64, UltraSPARC
– Update-based protocols
  • Update shared copies when writing
  • Examples: the Dragon and Firefly protocols
Cache Coherence Protocols (continued)
Directory-based protocols
– Memory-based schemes
  • Keep a directory at the granularity of a cache line in the home node's memory
    – One dirty bit, and one presence bit for each node (see the sketch after this list)
  • Storage overhead due to the directory
  • Examples: Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin
– Cache-based schemes
  • Keep only a head pointer for each cache line in the home node's directory
    – Forward and backward pointers are kept in the caches of each node
  • Long latency due to the serialization of messages
  • Examples: Sequent NUMA-Q, Convex Exemplar, and Data General
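As a concrete illustration of the memory-based bookkeeping described above, here is a hypothetical directory entry in Python, with one dirty bit and one presence bit per node. This is illustrative only; real machines such as DASH or Origin pack and manage these bits differently.

```python
# Hypothetical memory-based directory entry, kept per cache line in the
# home node's memory; for illustration only.

class DirectoryEntry:
    def __init__(self, num_nodes):
        self.dirty = False                   # some node holds a modified copy
        self.presence = [False] * num_nodes  # which nodes cache the line

    def record_read(self, node):
        self.presence[node] = True           # add a sharer on a read miss

    def record_write(self, node):
        # The writer becomes the sole owner; all other copies are invalidated.
        self.presence = [n == node for n in range(len(self.presence))]
        self.dirty = True

entry = DirectoryEntry(num_nodes=4)
entry.record_read(0)
entry.record_read(2)
entry.record_write(2)
print(entry.dirty, entry.presence)           # True [False, False, True, False]
```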
Emulation Initiatives for Protocol Evaluation
RPM (mid-to-late ’90s)
– Rapid Prototyping engine for Multiprocessors from the Univ. of Southern California
– ccNUMA full-system emulation
  • A SPARC IU/FPU core is used as the CPU in each node, and the rest (L1, L2, etc.) is implemented with 8 FPGAs
  • Nodes are connected through Futurebus+
FPGA Initiatives for Evaluation
Other cache emulators
– RACFCS (1997)
  • Reconfigurable Address Collector and Flying Cache Simulator from Yonsei Univ. in Korea
  • Plugged into the Intel486 bus
    – Passively collects addresses
– HACS (2002)
  • Hardware Accelerated Cache Simulator from Brigham Young Univ.
  • Plugged into the FSB of a Pentium Pro-based system
– ACE (2006)
  • Active Cache Emulator from Intel Corp.
  • Plugged into the FSB of a Pentium-III-based system