TRANSCRIPT
SAM: Optimizing Multithreaded Cores for Speculative Parallelism
MALEEN ABEYDEERA, SUVINAY SUBRAMANIAN, MARK JEFFREY, JOEL EMER, DANIEL SANCHEZ
PACT 2017
Executive Summary
Analyzes the interplay between hardware multithreading and speculative parallelism (e.g., Thread-Level Speculation and Transactional Memory)
Conventional multithreading causes performance pathologies on speculative workloads
• Increase in aborted work
• Inefficient use of speculation resources
Why? All threads are treated equally
Speculation-Aware Multithreading (SAM)
• Prioritize threads running tasks more likely to commit
SAM makes multithreading more useful
Outline
• Background on speculative parallelism
• Pitfalls of speculative parallelism with conventional multithreading
• SAM on in-order cores
• SAM on out-of-order cores
Background on Speculative Parallelism
Parallelize tasks when the dependences are not known in advance
Hardware executes all tasks in parallel, aborting upon conflicts
Which task to abort? Conflict resolution policy
Speculative parallelism
• Ordered, e.g., Thread-Level Speculation (TLS): program order dictates the conflict resolution order
• Unordered, e.g., Hardware Transactional Memory: any execution order is valid, but high-performance conflict resolution policies define an order
Implicit order among all tasks in any speculative system
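The idea of an implicit order can be made concrete with a small sketch. It assumes each task carries an integer `order` (program order under TLS, or whatever order the conflict resolution policy imposes under HTM) and that the policy always aborts the later task. The names `Task` and `taskToAbort`, and the tie-break on `id`, are illustrative assumptions, not part of the talk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical task descriptor. `order` is the task's position in the
// conflict resolution order; a smaller value means an earlier task.
struct Task {
    std::uint64_t id;
    std::uint64_t order;
};

// On a conflict between two distinct tasks, pick the later one to abort so
// the earlier task, which is more likely to commit, keeps running.
inline const Task& taskToAbort(const Task& a, const Task& b) {
    assert(a.id != b.id && "a task cannot conflict with itself");
    if (a.order != b.order) return (a.order > b.order) ? a : b;
    return (a.id > b.id) ? a : b;  // assumed tie-break: abort the larger id
}
```

With tasks `{1, 10}` and `{2, 20}`, the sketch picks task 2 to abort regardless of argument order.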
Baseline System - Swarm [Jeffrey, MICRO’15]
Timestamped tasks

    // Discrete event simulation task: simulate one gate input toggling at time ts.
    void desTask(Timestamp ts, GateInput* input) {
        Gate* g = input->gate();
        bool toggledOutput = g->simulateToggle(input);
        if (toggledOutput) {
            // The output changed: schedule a child task for each connected input.
            for (GateInput* i : g->connectedInputs()) {
                swarm::enqueue(desTask, ts + delay(g, i), i);
            }
        }
    }

Tasks create children tasks (function ptr, timestamp, args)
Tasks appear to execute in timestamp order
Unordered execution via equal timestamps
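To illustrate what "tasks appear to execute in timestamp order" means, here is a sequential stand-in for the task model. It is a minimal sketch, not the Swarm API: `SequentialTaskQueue` and its methods are hypothetical, task arguments are captured in the callable, and ties between equal timestamps are broken by insertion order as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

// Hypothetical sequential model: runs one task at a time, always picking the
// pending task with the smallest timestamp.
class SequentialTaskQueue {
public:
    using TaskFn = std::function<void(SequentialTaskQueue&, Timestamp)>;

    // Plays the role of an enqueue call: schedule `fn` at timestamp `ts`.
    void enqueue(Timestamp ts, TaskFn fn) {
        pending_.push(Entry{ts, nextSeq_++, std::move(fn)});
    }

    // Runs until no task remains; returns timestamps in execution order.
    std::vector<Timestamp> run() {
        std::vector<Timestamp> executed;
        while (!pending_.empty()) {
            Entry e = pending_.top();
            pending_.pop();
            executed.push_back(e.ts);
            e.fn(*this, e.ts);  // the task may enqueue children
        }
        return executed;
    }

private:
    struct Entry {
        Timestamp ts;
        std::uint64_t seq;  // insertion order, used only to break ties
        TaskFn fn;
    };
    struct Later {
        bool operator()(const Entry& a, const Entry& b) const {
            if (a.ts != b.ts) return a.ts > b.ts;
            return a.seq > b.seq;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, Later> pending_;
    std::uint64_t nextSeq_ = 0;
};
```

A speculative system runs such tasks in parallel and out of order, but the committed result must match what this sequential loop would produce.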
Swarm Microarchitecture
[Figure: 16-tile, 64-core CMP with four Mem / IO interfaces. Tile organization: four cores, each with L1I/D caches, plus an L2, an L3 slice, a router, and a task unit.]
Equal timestamps: global order via Virtual Time (VT)
Virtual Time = (Timestamp, Tiebreaker)
Tasks execute out of order, but commit in VT order
Commit queue: state of tasks waiting to commit
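A virtual time can be modeled as a (timestamp, tiebreaker) pair compared lexicographically. The sketch below assumes 64-bit fields and a hypothetical helper `nextToCommit`; neither is specified in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Virtual time: a timestamp plus a tiebreaker. Field widths are assumptions.
struct VirtualTime {
    std::uint64_t timestamp;
    std::uint64_t tiebreaker;
};

// Lexicographic order: compare timestamps first, then tiebreakers.
inline bool operator<(const VirtualTime& a, const VirtualTime& b) {
    return std::tie(a.timestamp, a.tiebreaker) <
           std::tie(b.timestamp, b.tiebreaker);
}

// Index of the task that must commit next: the one with the smallest VT.
inline std::size_t nextToCommit(const std::vector<VirtualTime>& vts) {
    assert(!vts.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < vts.size(); ++i) {
        if (vts[i] < vts[best]) best = i;
    }
    return best;
}
```

Two tasks with equal timestamps, such as 52:7 and 52:9, are still totally ordered because their tiebreakers differ.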
Pitfalls of Speculation-Oblivious Multithreading
System configuration:
• 64-core SMT system
• In-order cores with 2-wide issue
• Speculation-oblivious round-robin order
[Figure: issue-slot breakdown. Legend: micro-ops issued from committed tasks; micro-ops issued from aborted tasks; resource stalls; no ready micro-ops to issue.]
Insights:
1. Multithreading can be highly beneficial
However, multithreading can also lead to:
2. Increased aborts
3. Inefficient use of speculation resources
Unlikely-to-commit tasks hurt the throughput of likely-to-commit ones
Speculation-Aware Multithreading
Prioritize threads according to their conflict resolution priorities
• Reduce speculation resource stalls (tasks commit early)
• Reduce aborts (focus resources on tasks likely to commit)
SAM on in-order cores
[Figure: in-order SMT pipeline. Blocks: Fetch, Decode, thread micro-op queues, SMT Issue, register files, Pipe 0 (Int ALU, FP ALU), Pipe 1 (Int ALU, Mem/DCache), and the task unit. The task unit sends conflict resolution priority updates (virtual times) to the core. The virtual times of the threads (52:9, 52:7, 17:19, and 5:4 in the example) are sorted into SAM issue priorities 1 to 4 (higher is better), and a max over the ready threads selects the issue thread ID.]
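The sort-then-max selection in the figure can be sketched in software as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the thread with the earliest virtual time receives the highest priority, and the sketch picks a single thread per call rather than modeling 2-wide issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <tuple>
#include <vector>

struct VirtualTime {
    std::uint64_t timestamp;
    std::uint64_t tiebreaker;
};

inline bool operator<(const VirtualTime& a, const VirtualTime& b) {
    return std::tie(a.timestamp, a.tiebreaker) <
           std::tie(b.timestamp, b.tiebreaker);
}

// Rank threads by virtual time. The thread with the earliest VT gets the
// highest priority (N for N threads); the latest VT gets priority 1.
inline std::vector<int> samPriorities(const std::vector<VirtualTime>& vts) {
    std::vector<std::size_t> order(vts.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Sort thread indices from latest VT to earliest VT.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return vts[b] < vts[a]; });
    std::vector<int> priority(vts.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank) {
        priority[order[rank]] = static_cast<int>(rank) + 1;
    }
    return priority;
}

// Pick the ready thread with the highest priority, or -1 if none is ready.
inline int selectIssueThread(const std::vector<int>& priority,
                             const std::vector<bool>& ready) {
    assert(priority.size() == ready.size());
    int best = -1;
    for (std::size_t t = 0; t < priority.size(); ++t) {
        if (!ready[t]) continue;
        if (best < 0 || priority[t] > priority[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(t);
        }
    }
    return best;
}
```

For the example virtual times 52:9, 52:7, 17:19, and 5:4, the thread running 5:4 is preferred whenever it has a ready micro-op; if it stalls, the thread running 17:19 issues instead.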
Experimental Methodology
Baseline system
• Swarm + Wait-N-GoTM [Jafri et al., ASPLOS’13] conflict resolution techniques
• Cycle-accurate, event-driven, Pin-based simulator
• Model systems up to 64 cores
• Cores: 2-wide issue, up to 8 threads per core
Benchmarks
• Ordered: Swarm [Jeffrey et al., MICRO’15, MICRO’16], 8 benchmarks
• Unordered: STAMP [Minh et al., IISWC’08], 8 benchmarks
SAM makes multithreading more effective
[Figure: performance on ordered and unordered benchmarks for 1 Thread, 8 Thread Round Robin, and 8 Thread SAM.]
8-threaded cores outperform single-threaded cores by 1.85x
With SAM, the benefit increases to 2.33x
Why does SAM help?
[Figure: issue-slot breakdown. Legend: micro-ops issued (Committed, Aborted); unused issue slots by reason (Resource, Not ready, Other).]
• SAM matches round robin (RR) when there are no pathologies
• SAM reduces wasted work
• SAM reduces resource stalls
SAM on out-of-order cores
Unlike in-order cores, priorities affect pipeline efficiency
• A single thread can clog core resources
• Increased wrong-path execution
Despite these, prioritizing tasks is better
Need for aggressive prioritization affects core design
• Shared, not partitioned ROBs
[Figure: out-of-order SMT pipeline. Blocks: Fetch, Decode, thread micro-op queues, SMT Issue, issue buffer, physical register file, reorder buffer, Pipe 0, Pipe 1. Per-thread values in the example: in-flight uops (for ICount) 3, 9, 4, 2; SAM priorities 3, 4, 2, 1; conflict resolution priorities 2, 3, 2, 1, updated from the task unit.]
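For contrast with SAM, the ICount baseline favors the thread with the fewest micro-ops in flight, which is why the figure tracks in-flight uops per thread. A minimal sketch, assuming a hypothetical `icountSelect` helper and ignoring readiness and issue width:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ICount-style selection: prefer the thread with the fewest in-flight
// micro-ops. Ties go to the lowest thread index (an assumption).
inline std::size_t icountSelect(const std::vector<unsigned>& inFlightUops) {
    assert(!inFlightUops.empty());
    std::size_t best = 0;
    for (std::size_t t = 1; t < inFlightUops.size(); ++t) {
        if (inFlightUops[t] < inFlightUops[best]) best = t;
    }
    return best;
}
```

With the in-flight counts from the figure (3, 9, 4, 2), this picks the last thread, whatever its conflict resolution priority.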
SAM tradeoffs with out-of-order cores
Baseline policy: ICount (IC)
[Figure: issue-slot breakdown for sssp with 8 threads. Legend: micro-ops issued (Committed, Aborted); unused issue slots by reason (Resource, Not ready, Other); Wrong path.]
SAM is more beneficial with dynamically shared ROBs
• Reduces aborts + resource stalls
But reduced pipeline efficiency
• Increase in wrong-path issues + not-ready stalls
Adaptive SAM policy
Hardware counters to track cycles
• Cycles lost to task-level speculation = Aborted + Resource
• Cycles lost to pipeline inefficiencies = Not ready + Wrong path
If (Aborted + Resource) > (Not ready + Wrong path): use SAM; otherwise: use ICount
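The adaptive decision compares two sums of counters. The sketch below assumes a hypothetical `IssueSlotCounters` struct sampled over some interval; the interval length and the behavior on a tie (ICount here) are assumptions, not details from the talk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-core counters accumulated over one sampling interval.
struct IssueSlotCounters {
    std::uint64_t aborted;    // cycles spent on tasks that later aborted
    std::uint64_t resource;   // cycles lost to speculation resource stalls
    std::uint64_t notReady;   // cycles with no ready micro-op
    std::uint64_t wrongPath;  // cycles spent on wrong-path micro-ops
};

enum class IssuePolicy { SAM, ICount };

// Use SAM when task-level speculation costs more cycles than pipeline
// inefficiencies; otherwise fall back to ICount.
inline IssuePolicy choosePolicy(const IssueSlotCounters& c) {
    const std::uint64_t speculationLoss = c.aborted + c.resource;
    const std::uint64_t pipelineLoss = c.notReady + c.wrongPath;
    return speculationLoss > pipelineLoss ? IssuePolicy::SAM
                                          : IssuePolicy::ICount;
}
```

For example, counters of 100 aborted, 50 resource, 30 not-ready, and 20 wrong-path cycles select SAM, because 150 exceeds 50.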
SAM on OoO cores (all benchmarks)
[Figure: average over all benchmarks. Legend: micro-ops issued (Committed, Aborted); unused issue slots by reason (Resource, Not ready, Other); Wrong path.]
At 8 threads/core:
• Multithreading improves performance over single-threaded cores by 1.1x
• With SAM, improvement rises to 1.5x
Adaptive policy slightly increases performance at 2 and 4 threads
Conclusion
Conventional multithreading causes performance pathologies on speculative workloads
• Increase in aborted work
• Inefficient use of speculation resources
Speculation-Aware Multithreading (SAM)
• Prioritize threads running tasks more likely to commit
SAM makes multithreading more useful
Questions?