a low‐overhead asynchronous interconnection network for...
TRANSCRIPT
ALow‐OverheadAsynchronousALow‐OverheadAsynchronousInterconnectionNetworkforInterconnectionNetworkforGALSChipMultiprocessorsGALSChipMultiprocessorsMichaelN.HorakMichaelN.Horak,,UniversityofMarylandUniversityofMaryland
StevenM.NowickStevenM.Nowick,,ColumbiaUniversityColumbiaUniversity
MatthewMatthewCarlbergCarlberg,,UCBerkeleyUCBerkeley
UziUziVishkinVishkin,,UniversityofMarylandUniversityofMaryland
In ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-10)
ChallengesforDesigningNetworks‐on‐ChipChallengesforDesigningNetworks‐on‐Chip
•• PowerConsumptionPowerConsumption–– WillexceedfuturepowerbudgetsbyafactorofWillexceedfuturepowerbudgetsbyafactorof!"#!"#[1][1]
–– Globalclocks:consumelargefractionofoverallpowerGlobalclocks:consumelargefractionofoverallpower
•• PerformanceBottlenecksPerformanceBottlenecks–– LargenetworklatenciescauseperformancedegradationLargenetworklatenciescauseperformancedegradation
•• IncreasedDesignerResourcesIncreasedDesignerResources–– ManytechniquesareincompatiblewithcurrentCADtoolsManytechniquesareincompatiblewithcurrentCADtools
–– DifficultiesintegratingheterogeneousmodulesDifficultiesintegratingheterogeneousmodules•• ChipspartitionedintoChipspartitionedinto!"#$%&#'($%!%)*(+,!-%).!"#$%&#'($%!%)*(+,!-%).
[1]J.D.Owens,W.J.Dally,R.Ho,D.N.[1]J.D.Owens,W.J.Dally,R.Ho,D.N.JayasimhaJayasimha,S.W.,S.W.KecklerKeckler,andL.‐S.,andL.‐S.PehPeh..Researchchallengesforon‐chipinterconnectionnetworks.Researchchallengesforon‐chipinterconnectionnetworks.IEEEMicroIEEEMicro,27(5):96‐108,2007.,27(5):96‐108,2007.
PotentialAdvantagesofAsynchronousDesignPotentialAdvantagesofAsynchronousDesign
•• LowerPowerLowerPower–– Noclockpowerconsumed:Noclockpowerconsumed:withoutwithoutclockgatingclockgating–– IdlecomponentsinherentlyconsumelowpowerIdlecomponentsinherentlyconsumelowpower
•• GreaterFlexibility/ModularityGreaterFlexibility/Modularity–– NoclockdistributionNoclockdistribution–– EasierintegrationbetweenmultipletimingdomainsEasierintegrationbetweenmultipletimingdomains–– SupportsreusablecomponentsSupportsreusablecomponents
•• LowerSystemLatencyLowerSystemLatency–– End‐to‐endtrafficwithoutclocksynchronizationEnd‐to‐endtrafficwithoutclocksynchronization
•• MoreResilienttoOn‐ChipVariationsMoreResilienttoOn‐ChipVariations–– CorrectoperationdependsonlocalizedtimingconstraintsCorrectoperationdependsonlocalizedtimingconstraints
Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System
•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]
[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.
Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System
•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]
•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric
[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.
Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System
•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]
•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric
•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks
[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.
Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System
•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]
•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric
•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks
•• Mixed‐TimingInterfacesMixed‐TimingInterfaces–– ProviderobustcommunicationProviderobustcommunicationbetweenSyncandbetweenSyncandAsyncAsyncdomainsdomains
[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.
AdvancesinGALSNetworks‐on‐ChipAdvancesinGALSNetworks‐on‐Chip•• CommercialDesignsCommercialDesigns
–– SilistixSilistix,Inc.,Inc.(J.Bainbridge,S.(J.Bainbridge,S.FurberFurber.IEEEMicro‐02).IEEEMicro‐02)
•• CHAINCHAIN™™workstoolsuite:heterogeneousworkstoolsuite:heterogeneousSOCsSOCs
–– FulcrumMicrosystemsFulcrumMicrosystems(A.Lines.Micro‐04)(A.Lines.Micro‐04)
•• FocalPointFocalPointchips:chips:high‐performanceEthernetroutinghigh‐performanceEthernetrouting
•• RecentRecentWorkWork–– AsynchronousNetwork‐on‐Chip(AsynchronousNetwork‐on‐Chip(ANoCANoC))((BeigneBeigne,,ClermidyClermidy,,VivetVivetetal.Async‐05)etal.Async‐05)
•• Wormholepacket‐switchedWormholepacket‐switchedNoCNoCwithlow‐latencyservicewithlow‐latencyservice
–– MANGOMANGOClocklessClocklessNetwork‐on‐ChipNetwork‐on‐Chip(T.(T.BjerregaardBjerregaard.DATE‐05).DATE‐05)
•• Offersquality‐of‐service(Offersquality‐of‐service(QoSQoS)guarantees)guarantees
–– RasPRasPOn‐ChipNetworkOn‐ChipNetwork(S.Hollis,S.W.Moore.ICCD‐06)(S.Hollis,S.W.Moore.ICCD‐06)
•• Utilizeshigh‐speedpulse‐basedsignalingUtilizeshigh‐speedpulse‐basedsignaling
–– SpiNNakerSpiNNakerProjectProject(Khan,Lester,(Khan,Lester,PlanaPlana,,FurberFurberetal.IJCNN‐08)etal.IJCNN‐08)
•• Massively‐parallelneuralsimulationMassively‐parallelneuralsimulation
GALSGALSNOCsNOCs:TypicalCurrentTargets:TypicalCurrentTargets•• Low‐toModerate‐PerformanceEmbeddedSystemsLow‐toModerate‐PerformanceEmbeddedSystems
–– 200‐500MHz200‐500MHz–– HighsystemlatencyHighsystemlatency
•• ““Four‐PhaseReturn‐to‐ZeroFour‐PhaseReturn‐to‐Zero””ProtocolsProtocols–– Tworound‐trips/linkTworound‐trips/linkpertransactionpertransaction
•• ““Delay‐InsensitiveDataDelay‐InsensitiveData””Encoding(dual‐rail,1‐of‐4)Encoding(dual‐rail,1‐of‐4)–– Lowercodingefficiencythansingle‐railLowercodingefficiencythansingle‐rail
•• Complex‐FunctionalityRouterNodesComplex‐FunctionalityRouterNodes–– 5‐portrouterswithlayeredservices(5‐portrouterswithlayeredservices(QoSQoS,etc.),etc.)–– Highlatency/highareaHighlatency/higharea
•• CustomCircuitTechniques:CustomCircuitTechniques:–– Pulse‐basedsignaling,low‐swingPulse‐basedsignaling,low‐swingsignallingsignalling–– DynamicDynamiclogic,specializedcellslogic,specializedcells
OutlineOutline
•• IntroductionIntroduction
•• TargetGALSNetworkDesignTargetGALSNetworkDesign•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork
•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives
•• ExperimentalResultsExperimentalResults
•• ConclusionsConclusions
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors–– Medium‐toHigh‐PerformanceMedium‐toHigh‐Performance
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming[3][3]–– MostgeneralGALStimingmodelMostgeneralGALStimingmodel
–– SupportmultiplesynchronousdomainswithunrelatedclockingSupportmultiplesynchronousdomainswithunrelatedclocking
–– PromotesreuseofIntellectualProperty(IP)modulesPromotesreuseofIntellectualProperty(IP)modules
[3]D.Messerschmitt,[3]D.Messerschmitt,““SynchronizationinDigitalSystemDesignSynchronizationinDigitalSystemDesign””,,IEEEJournalonSelectedAreasinCommunications,October1990IEEEJournalonSelectedAreasinCommunications,October1990
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming
•• TransitionSignalingTransitionSignaling(Two‐Phase)(Two‐Phase)–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““four‐phasehandshakingfour‐phasehandshaking””
•• 2roundtriplinkcommunications2roundtriplinkcommunicationspertransactionpertransaction
–– BenefitsofTwo‐Phase:BenefitsofTwo‐Phase:•• 1roundtriplinkcommunication1roundtriplinkcommunicationpertransactionpertransaction
•• improvedthroughput,powerimprovedthroughput,power……..
–– ChallengeofTwo‐Phase:ChallengeofTwo‐Phase:designinglightweightimplementationsdesigninglightweightimplementations•• Mostexisting2‐phasedesignsuse:Mostexisting2‐phasedesignsuse:
–– complexslowregisters:doublelatch,double‐edge‐triggered,capture/passcomplexslowregisters:doublelatch,double‐edge‐triggered,capture/pass»» [Seitz/Su[Seitz/Su““MosaicMosaic””93,93,BrunvandBrunvand91,Sutherland89]91,Sutherland89]
–– customcircuitcomponentscustomcircuitcomponents
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming
•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling
•• Single‐RailBundledDataSingle‐RailBundledData–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““delay‐insensitivedelay‐insensitive””linkencodingslinkencodings
•• providegreattiming‐robustness==>cost=providegreattiming‐robustness==>cost=poorcodingefficiencypoorcodingefficiency
•• examples:dual‐rail,1‐of‐4examples:dual‐rail,1‐of‐4
–– ““Single‐RailBundledDataSingle‐RailBundledData””benefits:benefits:•• re‐usesynchronousre‐usesynchronousdatapathsdatapaths::1wire/bit1wire/bit++addedadded““requestrequest””•• excellentcodingefficiencyexcellentcodingefficiency
–– Challenge:requiresmatcheddelayforChallenge:requiresmatcheddelayfor““requestrequest””signalsignal•• 1‐sidedtimingconstraint:1‐sidedtimingconstraint:““requestrequest””mustarriveafterdatastablemustarriveafterdatastable
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming
•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling
•• Single‐RailBundledDataSingle‐RailBundledData
•• HighPerformanceHighPerformance–– LowSystem‐LevelLatencyLowSystem‐LevelLatency
•• minimizeend‐to‐enddelayunderlighttomoderatetrafficminimizeend‐to‐enddelayunderlighttomoderatetraffic
–– HighSustainedThroughputHighSustainedThroughput•• maximizesteady‐statethroughputunderheavytrafficmaximizesteady‐statethroughputunderheavytraffic
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming
•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling
•• Single‐RailBundledDataSingle‐RailBundledData
•• HighPerformanceHighPerformance
•• StandardCellMethodologyStandardCellMethodology–– UseexistingstandardcelllibrariesUseexistingstandardcelllibraries
•• onlyexception:onlyexception:analogarbitercircuitanalogarbitercircuit
–– Challenge:Challenge:timinganalysisusingexistingtoolstiminganalysisusingexistingtools
TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors
•• ““HeterochronousHeterochronous””TimingTiming
•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling
•• Single‐RailBundledDataSingle‐RailBundledData
•• HighPerformanceHighPerformance
•• StandardCellMethodologyStandardCellMethodology
•• Fine‐GrainedFine‐GrainedNetworkTopologyNetworkTopology–– LightweightnetworknodesLightweightnetworknodes
•• low‐functionalitylow‐functionalitylow‐radixroutercomponentslow‐radixroutercomponents
•• avoids5‐portrouterwithNorth/South/East/West/Localportsavoids5‐portrouterwithNorth/South/East/West/Localports
TargetGALSNetworkDesignTargetGALSNetworkDesign
OutlineOutline
•• IntroductionIntroduction•• TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Background:XMTProcessor/Background:XMTProcessor/MoTMoTNetworkNetwork–– eXpliciteXplicitMulti‐Threading(XMT)ArchitectureMulti‐Threading(XMT)Architecture–– Mesh‐of‐Trees(Mesh‐of‐Trees(MoTMoT)NetworkTopology)NetworkTopology–– SynchronousRouterNodesSynchronousRouterNodes
•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions
XMTParallelArchitectureXMTParallelArchitecture
•• XMT=XMT=““eXpliciteXplicitMulti‐ThreadingMulti‐Threading””(1997‐present)[4](1997‐present)[4]–– LedbyProf.UziLedbyProf.UziVishkinVishkinatUniversityofMaryland,CollegeParkatUniversityofMaryland,CollegePark
•• BasedonParallelRandomAccessModel(PRAM)BasedonParallelRandomAccessModel(PRAM)–– LargestbodyofparallelalgorithmictheoryLargestbodyofparallelalgorithmictheory
•• EaseofProgrammabilityEaseofProgrammability–– XMT‐ClanguageXMT‐Clanguage+optimizingcompiler+optimizingcompiler–– Single‐ProgramMultiple‐Data(SPMD)programmingmethodologySingle‐ProgramMultiple‐Data(SPMD)programmingmethodology
•• DemonstratedtoProvideSignificantSpeedupsDemonstratedtoProvideSignificantSpeedups–– Performswellonirregularcomputations(BFS,ray‐tracing)Performswellonirregularcomputations(BFS,ray‐tracing)–– 100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]
[4]D.Naishlos,J.Nuzman,C.‐W.Tseng,andU.Vishkin.“Towardsafirstverticalprototypingofanextremelyfine‐grainedparallelprogrammingapproach”,SPAA2001
[5]P.[5]P.GuGuandU.andU.VishkinVishkin,,““Casestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipCasestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipmultiprocessormultiprocessor””,,JournalofEmbeddedComputing,April2006JournalofEmbeddedComputing,April2006
XMTParallelArchitectureXMTParallelArchitecture
•• ProcessingClustersProcessingClusters–– Groupofsimplepipelinedcores,Groupofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)
–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization
–– ““IOSIOS””=independence‐of‐order=independence‐of‐ordersemantics:semantics:noWAW/WAR/RAWnoWAW/WAR/RAWdatahazardsbetweenthreadsdatahazardsbetweenthreads
XMTParallelArchitectureXMTParallelArchitecture
•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)
–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittlewithlittleornosynchronizationornosynchronization
•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache
–– NocachecoherenceproblemNocachecoherenceproblem
XMTParallelArchitectureXMTParallelArchitecture
•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)
–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization
•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache
–– NocachecoherenceproblemNocachecoherenceproblem
•• NOCChallenge:NOCChallenge:highbandwidth/lowpowerrequirementshighbandwidth/lowpowerrequirements–– Manyconcurrentmemoryrequests(load/store)Manyconcurrentmemoryrequests(load/store)–– Shortpackets:1‐2flits/dynamically‐varyingtrafficShortpackets:1‐2flits/dynamically‐varyingtraffic–– LowlatencyrequiredforsystemperformanceLowlatencyrequiredforsystemperformance
ProposedXMTParallelArchitecture:ProposedXMTParallelArchitecture:withGALSInterconnectionNetworkwithGALSInterconnectionNetwork
GALSNetwork…
…
Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology
•• VariantofclassicVariantofclassicMoTMoT
•• Nfan‐outtreesNfan‐outtrees–– RoutingonlyRoutingonly
–– RootatsourceterminalsRootatsourceterminals
•• Nfan‐intreesNfan‐intrees–– ArbitrationonlyArbitrationonly
–– RootatdestinationterminalsRootatdestinationterminals
$%&'(%)('*+,*-('+.
Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology•• HighThroughputHighThroughput
–– Uniqueroutingpaths(source/sink)Uniqueroutingpaths(source/sink)–– AvoidsinterferencepenaltiesAvoidsinterferencepenalties
•• FixedPathLengthFixedPathLength–– LogarithmicdepthLogarithmicdepth
•• DistributedLow‐RadixRoutingDistributedLow‐RadixRouting–– LimitedfunctionalitynodesLimitedfunctionalitynodes–– WormholedeterministicroutingWormholedeterministicrouting
•• ShowntoPerformWellforShowntoPerformWellforCMPsCMPs–– Providesveryhighsustainedthroughput[6]Providesveryhighsustainedthroughput[6]–– Highsaturationthroughput:Highsaturationthroughput:~91%~91%
[6]A.O.Balkan,G.[6]A.O.Balkan,G.QuQu,U.,U.VishkinVishkin,,““Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐chipparallelismchipparallelism””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009
SynchronousRoutingPrimitiveSynchronousRoutingPrimitive•• Fan‐OutComponentFan‐OutComponent[7][7]
–– 1Input,2Outputs1Input,2Outputs
–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism
•• SignaltopreviousstagewhenSignaltopreviousstagewhennewdatacanbeacceptednewdatacanbeaccepted
•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””[[CarloniCarlonietal.,TCAD01]etal.,TCAD01]
–– 2‐RegisterFIFO:2‐RegisterFIFO:B0,B1B0,B1
–– Allows1flit/cycleinsteady‐stateAllows1flit/cycleinsteady‐state•• AcceptnewdataandforwardstoreddataconcurrentlyAcceptnewdataandforwardstoreddataconcurrently
–– Cost:Cost:1extraauxiliaryregister1extraauxiliaryregister((flipflop‐basedflipflop‐based))
[7][7]A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)
SynchronousArbitrationPrimitiveSynchronousArbitrationPrimitive
•• Fan‐InComponentFan‐InComponent[7][7]–– 2Inputs,1Output2Inputs,1Output
–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism
•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””–– 2‐Stage2‐StageFIFOsFIFOsateachinputportateachinputport–– Whenempty,latency=1cycleWhenempty,latency=1cycle
–– Whenstalled,latency=2+cyclesWhenstalled,latency=2+cycles•• Dependsonback‐pressureandsynchronousarbitrationDependsonback‐pressureandsynchronousarbitration
–– Cost:Cost:totalof4registerstotalof4registers(flip‐flopbased)(flip‐flopbased)
[7}[7}A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)
OutlineOutline
•• IntroductionIntroduction
•• TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork
•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives–– Routingprimitive(Fan‐out)Routingprimitive(Fan‐out)
–– Arbitrationprimitive(Fan‐in)Arbitrationprimitive(Fan‐in)–– Mixed‐timinginterfacesMixed‐timinginterfaces
•• ExperimentalResultsExperimentalResults
•• ConclusionsConclusions
NewRoutingPrimitiveNewRoutingPrimitive
Req0Ack0Data0
Req1Ack1Data1
ReqAck
B(oolean)
Data_In
Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)
NewRoutingPrimitiveNewRoutingPrimitive
Req0Ack0Data0
Req1Ack1Data1
ReqAck
B(oolean)
Data_In
Binary Routing SignalBinary Routing Signal
NewRoutingPrimitiveNewRoutingPrimitive
Req0Ack0Data0
Req1Ack1Data1
ReqAck
B(oolean)
Data_In
Data ChannelsData Channels
Latch Control 0
Toggle 0
LATCH
Req0
AckReq Ack0
Latch Control 1
Toggle 1
LATCH
Req1
AckReq Ack1
B
B
Data1
Data0
Data_In
,*-('+./0%'1'('23,*-('+./0%'1'('23
Req0Req1Ack
Latch Control 0
Toggle 0
LATCH
Req0
AckReq Ack0
Latch Control 1
Toggle 1
LATCH
Req1
AckReq Ack1
B
B
Data1
Data0
Data_In
LatchLatchControllersControllers
Req0Req1Ack
,*-('+./0%'1'('23,*-('+./0%'1'('23
Req Ack Req1 Ack1
Latch ControllerLatch Controller
Latch Control 0
Toggle 0
LATCH
Req0
AckReq Ack0
Latch Control 1
Toggle 1
LATCH
Req1
AckReq Ack1
B
B
Data1
Data0
Data_In
Normally OpaqueNormally OpaqueLatch RegistersLatch Registers
,*-('+./0%'1'('23,*-('+./0%'1'('23
Req0Req1Ack
Latch Control 0
Toggle 0
LATCH
Req0
AckReq Ack0
Latch Control 1
Toggle 1
LATCH
Req1
AckReq Ack1
B
B
Data1
Data0
Data_In
4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";
,*-('+./0%'1'('23,*-('+./0%'1'('23
Req0Req1Ack
<)(3+=><)(3+=>
Latch Control 0
Toggle 0
LATCH
Req0
AckReq Ack0
Latch Control 1
Toggle 1
LATCH
Req1
AckReq Ack1
B
B
Data1
Data0
Data_In
4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";
,*-('+./0%'1'('23,*-('+./0%'1'('23
Req0Req1Ack
?@%*-.@A-(?@%*-.@A-(
/0,!()'1$/0,!()'1$((.$-*'.$-*'
NewArbitrationPrimitiveNewArbitrationPrimitive
Req0Ack0Data0
Req_OutAck_InData_Out
Req1Ack1Data1
Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)
NewArbitrationPrimitiveNewArbitrationPrimitive
Req0Ack0Data0
Req_OutAck_InData_Out
Req1Ack1Data1
Data ChannelsData Channels
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
Mutual ExclusionMutual ExclusionElement (Element (MutexMutex))
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
Request ProtectionRequest ProtectionLatchesLatches(Normally Opaque)(Normally Opaque)
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
Data + Request Latch RegisterData + Request Latch Register (only one bank of latches required)(only one bank of latches required)
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
Acknowledgment Protection LatchesAcknowledgment Protection Latches(normally transparent)(normally transparent)
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%
4)()A)(@4)()A)(@
F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J
$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
<)(3+=><)(3+=>
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%
4)()A)(@4)()A)(@
F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J
$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
<)(3+=><)(3+=>
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
Req_Out
Ack_In
Data0Data1
Data_Out
Mux_Select
LATCH
B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%
4)()A)(@4)()A)(@
F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J
$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
?@%*-.@A-(?@%*-.@A-((bestcase,(bestcase,alternatingalternatinginputs)inputs)
/0,!()'1$/0,!()'1$((.$-*'.$-*'
WormholeRoutingCapabilityWormholeRoutingCapability
•• Goal:Goal:supporttransmissionofmulti‐flitpacketssupporttransmissionofmulti‐flitpackets–– example:XMTexample:XMT””storestorepacketspackets””=2flits=2flits(address+data)(address+data)
•• Solution:Solution:add1extraadd1extra““gluebitgluebit””toeachflittoeachflit–– Gluebit=1Gluebit=1notlastflitinpacketnotlastflitinpacket
–– Enhancedarbitrationprimitive:Enhancedarbitrationprimitive:biasbiasmutexmutexdecisiondecision•• ““winner‐take‐allwinner‐take‐all””strategy[strategy[Dally/TowlesDally/Towles]]
•• headerflittakesoverheaderflittakesovermutexmutex:glue=1:glue=1
•• lastflitreleaseslastflitreleasesmutexmutex:glue=:glue=00
Mutex
Ack1
Ack0
L4
L3
L1
L2
0
1
L5
L6
L7
Req0
Req1
L8
L9
Req_Out
Ack_In
Data0Data1
Glue0
Glue1
Mux_Select
Data_OutLATCH
L+@)+=35/B8*C/D*+(%*8/E+'(L+@)+=35/B8*C/D*+(%*8/E+'(
4)()A)(@4)()A)(@
<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23
glue0 bit
glue1 bit
WormholeControl
LinearPipelinePrimitiveLinearPipelinePrimitive
Req_OutAck_InData_Out
ReqAckData
-- Can beCan be inserted for buffering:inserted for buffering: to improve system-level throughput to improve system-level throughput-- Basis for Basis for design of new fan-in/fan-out primitivesdesign of new fan-in/fan-out primitives
[8]M.SinghandS.M.Nowick.“MOUSETRAP:High‐SpeedTransition‐SignalingAsynchronousPipelines,”IEEETransactionsonVLSISystems,vol.15:11,pp.1256‐1269(Nov.2007)
LinearPipelinePrimitiveLinearPipelinePrimitive
Req_OutAck_InData_Out
ReqAckData
Handshaking Signals (Request and Acknowledgment)Handshaking Signals (Request and Acknowledgment)
LinearPipelinePrimitiveLinearPipelinePrimitive
Req_OutAck_InData_Out
ReqAckData
Data ChannelsData Channels
Mixed‐TimingInterfacesMixed‐TimingInterfaces
•• UseExistingSynchronizingUseExistingSynchronizingFIFOsFIFOs[9][9](withsmall(withsmallmodifications)modifications)
–– SupportsSupports““heterochronousheterochronous””timingdomainstimingdomains–– NomodificationtoexistingcomponentsNomodificationtoexistingcomponents
•• ModularDesignModularDesign–– ReusableReusablePutPutandandGetGetcomponents(eithercomponents(eitherAsyncAsyncorSync)orSync)–– EachFIFOisarrayofidenticalcellsEachFIFOisarrayofidenticalcells
•• SupportsLow‐PowerOperationSupportsLow‐PowerOperation–– CircularFIFO:datadoesnotmoveCircularFIFO:datadoesnotmove
[9]T.[9]T.ChelceaChelceaandS.Nowick,andS.Nowick,““RobustInterfacesforMixed‐TimingSystemsRobustInterfacesforMixed‐TimingSystems””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004
OutlineOutline
•• IntroductionIntroduction
•• TargetGALSNetworkDesignTargetGALSNetworkDesign
•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork
•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives
•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions
EvaluationMethodologyEvaluationMethodology
•• DirectComparisonwithSynchronousDirectComparisonwithSynchronousMoTMoTNetworkNetwork–– IdenticalTechnology:IdenticalTechnology:IBM90nmCMOSprocessIBM90nmCMOSprocess
–– IdenticalFunctionalityIdenticalFunctionality:Sameroutingandarbitrationprimitives:Sameroutingandarbitrationprimitives
–– IdenticalTopology:IdenticalTopology:8‐terminalnetworkswithsame8‐terminalnetworkswithsamefloorplanfloorplan
•• EvaluateatMultipleLevelsofIntegrationEvaluateatMultipleLevelsofIntegration–– IsolatedAsynchronousPrimitivesIsolatedAsynchronousPrimitives(post‐layout)(post‐layout)
–– 8‐TerminalAsynchronousNetwork8‐TerminalAsynchronousNetwork(pre‐layoutwithwireestimates,(pre‐layoutwithwireestimates,‐‐interconnectionoflaid‐outrouterprimitives)‐‐interconnectionoflaid‐outrouterprimitives)
–– 8‐TerminalGALSNetwork8‐TerminalGALSNetwork
–– XMTArchitectureCo‐SimulationonParallelKernelsXMTArchitectureCo‐SimulationonParallelKernels
ToolFlowToolFlow
•• ImplementedinIBM90nmtechnologyImplementedinIBM90nmtechnology–– PlacedandroutedwithCadenceSOCEncounterPlacedandroutedwithCadenceSOCEncounter
–– Simulatedasgate‐levelSimulatedasgate‐levelVerilogVerilogwithextracteddelayswithextracteddelays
•• StandardCellMethodologyStandardCellMethodology–– ARM90nmStandardCells(IBMCMOS9SF)ARM90nmStandardCells(IBMCMOS9SF)
•• Exception:MutualExclusionElementException:MutualExclusionElement–– DesignedusingtransistormodelsfromIBM90nmPDKDesignedusingtransistormodelsfromIBM90nmPDK
–– SimulatedinCadenceSimulatedinCadenceSpectreSpectre
–– MeasureddelaystocalibrateMeasureddelaystocalibrateVerilogVerilogbehavioralmodelbehavioralmodel
RoutingPrimitiveComparison:RoutingPrimitiveComparison:AreaandPowerAreaandPower
225.61.822.062.06988.6988.6Synchronous
0.60.560.370.37358.4358.4Asynchronous
((μμW)W)((μμW)W)((pJpJ))((μμmm22))
IdleIdlePowerPower
LeakageLeakagePowerPower
Energy/Energy/PacketPacketAreaArea
•• Area:Area:–– 64%lessarea:64%lessarea:resultoflightweightresultoflightweightdatastoragedatastorage
•• 2flip‐flopregisters+extraMUX/DEMUX(sync)2flip‐flopregisters+extraMUX/DEMUX(sync)vsvs..2latchregisters(2latchregisters(asyncasync))
•• MUX/DEMUXoverhead(sync)MUX/DEMUXoverhead(sync)
•• Energy/Packet(1flit):Energy/Packet(1flit):–– 82%lessenergyperpacket82%lessenergyperpacket–– Steady‐statemeasurementonrandomtrafficSteady‐statemeasurementonrandomtraffic
RoutingPrimitiveComparison:RoutingPrimitiveComparison:LatencyandThroughputLatencyandThroughput
1.931.931.931.931.931.93516516Synchronous
1.701.701.341.341.071.07546546Asynchronous
AlternatingAlternatingRandomRandomSingleSingle((psps))
Maximum Throughput Maximum Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type
•• Synchronous:Synchronous:UsingMaxClockRateUsingMaxClockRate(1.93GHz)(1.93GHz)
•• Latency:Latency:–– //546546psps((asyncasync)vs.516)vs.516psps(sync)(sync)
•• MaxThroughput(Giga‐flits/sec):MaxThroughput(Giga‐flits/sec):–– Single‐portedtraffic:Single‐portedtraffic:MMNMMNofsyncmax.ofsyncmax.(noconcurrency)(noconcurrency)
–– Randomtraffic:Randomtraffic:O"NO"Nofsyncmax.ofsyncmax.–– Alternatingtraffic:Alternatingtraffic:PPNPPNofsyncMax.ofsyncMax.(mostconcurrency)(mostconcurrency)
……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages
ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:AreaandPowerAreaandPower
388.64.133.533.532240.32240.3Synchronous
0.50.500.330.33349.3349.3Asynchronous
((μμW)W)((μμW)W)((pJpJ))((μμmm22))
IdleIdlePowerPower
LeakageLeakagePowerPower
Energy/Energy/PacketPacketAreaAreaComponent TypeComponent Type
•• Area:Area:–– 84%lessarea84%lessarea–– Duetolow‐overheaddatastorageDuetolow‐overheaddatastorage
•• 4flip‐flopregisters(sync)4flip‐flopregisters(sync)vs.vs.11latchregister(latchregister(asyncasync))
•• Energy/Packet(1flit):Energy/Packet(1flit):–– 91%lessenergyperpacket91%lessenergyperpacket–– Measuredsteady‐statepacketsarrivingatbothinputportsMeasuredsteady‐statepacketsarrivingatbothinputports
ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:LatencyandThroughputLatencyandThroughput
2.092.092.092.09474474Synchronous
2.042.041.081.08489489Asynchronous
Both PortsBoth PortsSingleSingle((psps))
Max. Throughput Max. Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type
•• Synchronous:UsingMaxClockRate(2.09GHz)Synchronous:UsingMaxClockRate(2.09GHz)
•• Latency:Latency:–– 489489psps((asyncasync))vs.474vs.474psps(sync)(sync)
•• Max.Throughput(Giga‐flits/sec):Max.Throughput(Giga‐flits/sec):–– SinglePortonly:SinglePortonly:M!NM!Nofsynchronousmax.ofsynchronousmax.–– TrafficatBothPorts:TrafficatBothPorts:QPNQPNofsynchronousmax.ofsynchronousmax.
……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages
8‐TerminalNetworkEvaluation8‐TerminalNetworkEvaluation
•• Head‐on‐HeadComparisonwithSyncNetworkHead‐on‐HeadComparisonwithSyncNetwork
•• ProjectedNetworkLayoutProjectedNetworkLayout–– Pre‐layoutPre‐layoutasyncasyncnetworknetwork
–– UsesUsespost‐layoutprimitivespost‐layoutprimitives,treatedashardIPmacros,with,treatedashardIPmacros,withassignedwiredelaysassignedwiredelays
–– ExtrapolatewiredelaysbasedonASICExtrapolatewiredelaysbasedonASICfloorplanfloorplanofSyncofSyncMoTMoT
•• ExperimentalSetupExperimentalSetup–– EvaluateperformanceunderEvaluateperformanceunderuniformlyrandominputtrafficuniformlyrandominputtraffic
–– 32‐bitflits32‐bitflits
Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout
•• BasedonBasedonFloorplanFloorplanofSynchronousofSynchronousMoTMoTTestASICTestASIC–– Designed/fabricatedatUMDinMarch2007Designed/fabricatedatUMDinMarch2007[10][10]
•• Networkdividedinto4partitions(P0,P1,P2,P3)Networkdividedinto4partitions(P0,P1,P2,P3)–– Fan‐InTreesexistentirelywithinonepartitionFan‐InTreesexistentirelywithinonepartition–– Fan‐OutTreesdistributedamongpartitionsFan‐OutTreesdistributedamongpartitions
•• AsynchronousProjectionMethodologyAsynchronousProjectionMethodology–– TreatasynchronousprimitivesareTreatasynchronousprimitivesarehardIPmacroshardIPmacros
•• allrouting,arbitrationprimitiveshavesametimingallrouting,arbitrationprimitiveshavesametiming
–– EvenlydistributeEvenlydistributegroupsofprimitivesgroupsofprimitives–– AssignAssigninter‐primitivewiredelaysinter‐primitivewiredelaysbasedonpositionbasedonposition
•• delaysonwiresassignedbasedontechnologyspecificationsdelaysonwiresassignedbasedontechnologyspecifications
[10]A.O.Balkan,M.N.[10]A.O.Balkan,M.N.HorakHorak,G.,G.QuQu,U.,U.VishkinVishkin..““Layout‐accuratedesignandimplementationofahigh‐Layout‐accuratedesignandimplementationofahigh‐throughputinterconnectionnetworkforsingle‐chipparallelprocessingthroughputinterconnectionnetworkforsingle‐chipparallelprocessing””,,HotInterconnects,August2007HotInterconnects,August2007
Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout
ExampleFan-Out Tree
P0P0
P1P1
P2P2
P3P3
CurrentCADToolFlows:SyncCurrentCADToolFlows:Syncvsvs..AsyncAsync
•• SynchronousSynthesis:SynchronousSynthesis:–– Automaticplace/routeoptimizationsAutomaticplace/routeoptimizations
–– Includescellresizing/repeaterinsertionIncludescellresizing/repeaterinsertion
•• AsynchronousSynthesis:AsynchronousSynthesis:–– Limitedoptimization:Limitedoptimization:hardmacros+regularmanualplacementhardmacros+regularmanualplacement–– Nocellresizing/repeaterinsertionNocellresizing/repeaterinsertion……muchpotentialforfutureperformanceimprovementmuchpotentialforfutureperformanceimprovement
•• CurrentlyDoNotDefineNecessaryTimingConstraintsCurrentlyDoNotDefineNecessaryTimingConstraints–– Noautomaticpath‐lengthmatchingNoautomaticpath‐lengthmatching–– NecessarytoenforcebundlingconstraintNecessarytoenforcebundlingconstraint
AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:400MHzSyncvs.400MHzSyncvs.AsyncAsync
Comparablethroughputfor entire
range of Sync
Sync hasat least 4.3x
higher latencyfor all Syncinput rates
Sync Max.Input Rate:102.4 Gbps
Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency
AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:800MHzSyncvs.800MHzSyncvs.AsyncAsync
Comparablethroughputfor entire
range of Sync
Sync has>1.7x
higher latencyfor input rates
up to 73%of Sync max.(150 Gbps)
Sync Max.Input Rate:204.8 Gbps
Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency
AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:1.36GHzSyncvs.1.36GHzSyncvs.AsyncAsync
Comparablethroughput
for ratesup to
55% of Sync max.(190 Gbps)
Lower latencyfor input
rates up to 43% of
Sync max.(150 Gbps)
Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency
Sync Max.Input Rate:348.2 Gbps
GALSNetworkPerformanceComparisonGALSNetworkPerformanceComparison
•• ExperimentalSetupExperimentalSetup–– CreateterminalstogeneratetrafficandrecordmeasurementsCreateterminalstogeneratetrafficandrecordmeasurements
–– TerminalsgenerateTerminalsgenerateuniformlyrandominputtrafficuniformlyrandominputtraffic
•• ResultsNormalizedtoClockRateResultsNormalizedtoClockRate–– ThroughputunitsThroughputunits(normalized)(normalized)::flitspercycleperportflitspercycleperport
–– LatencyunitsLatencyunits(normalized)(normalized)::#clockcycles#clockcycles
–– Syncnetworkresults:Syncnetworkresults:alwayssamealwayssamerelativetoclockcyclesrelativetoclockcycles
–– AsyncAsyncnetworkresults:networkresults:varyvarywithclockratewithclockrate
GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:400MHzGALSvs.Sync400MHzGALSvs.Sync
Comparablethroughput
for alltraffic rates
Sync has52% higher
latencyup to 80%input traffic
GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:600MHzGALSvs.Sync600MHzGALSvs.Sync
Comparablethroughputup to 65%input traffic
Lower latencyup to 60%input traffic
GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:800MHzGALSvs.Sync800MHzGALSvs.Sync
Comparablethroughputup to 52%input traffic
Lower latencyup to 29%
input traffic,comparable
latencyup to 40%input traffic
XMTParallelKernelSimulationsXMTParallelKernelSimulations
•• Goal:IntegratewithSynchronousXMTParallelArchitectureGoal:IntegratewithSynchronousXMTParallelArchitecture–– XMTXMTVerilogVerilogRTLdescriptionwithGALSnetworkRTLdescriptionwithGALSnetwork
•• XMTParallelKernelsXMTParallelKernels–– ArraySummation(add)ArraySummation(add)
•• Computesumof3millionelementsinarrayComputesumof3millionelementsinarray
–– MatrixMultiplication(MatrixMultiplication(mmulmmul))•• Computeproductoftwo64x64matricesComputeproductoftwo64x64matrices
–– Breadth‐FirstSearch(Breadth‐FirstSearch(bfsbfs))•• RunXMTBFSalgorithmwith100,000verticesand1millionedgesRunXMTBFSalgorithmwith100,000verticesand1millionedges
–– ArrayIncrement(ArrayIncrement(a_inca_inc))•• Incrementall32kelementsofanarrayIncrementall32kelementsofanarray
XMTParallelKernelSimulationsXMTParallelKernelSimulations
•• XMTProcessorConfigurationXMTProcessorConfiguration–– 8ProcessingClusters(168ProcessingClusters(16TCUsTCUseach)=each)=128128TCUTCU’’sstotaltotal
–– 8DistributedL1D‐CacheModules(64KBtotal)8DistributedL1D‐CacheModules(64KBtotal)
•• SimulateGALSXMTatDifferentClockFrequenciesSimulateGALSXMTatDifferentClockFrequencies–– 200,400,700MHz200,400,700MHz
•• CompareSpeedupsRelativetoSynchronousXMTCompareSpeedupsRelativetoSynchronousXMT–– Valuesgreaterthan1.0indicatebetterperformanceValuesgreaterthan1.0indicatebetterperformance
GALSXMTPerformanceComparisonGALSXMTPerformanceComparison
GALS XMThas similar
performance for200, 400 MHz
Only moderate degradationat 700 MHz(a_inc: 37%
decrease)
(Graph arranged in order of increasing network utilization)(Graph arranged in order of increasing network utilization)
ConclusionsConclusions•• NewGALSNetworkforChipMultiprocessorsNewGALSNetworkforChipMultiprocessors
–– Low‐overheadnetworkforLow‐overheadnetworkfor““heterochronousheterochronous””InterfacesInterfaces
•• DesignofTwoNewAsynchronousRouterCellsDesignofTwoNewAsynchronousRouterCells–– RoutingRoutingandandarbitrationarbitrationcircuitscircuits
•• OverviewofResultsOverviewofResults–– RouterPrimitivesRouterPrimitives
•• 64‐84%lessarea,82‐91%lessenergy/packet64‐84%lessarea,82‐91%lessenergy/packet•• Latency&throughput(forbalancedtraffic)=~2Latency&throughput(forbalancedtraffic)=~2Gflits/secGflits/sec
–– System‐LevelPerformanceSystem‐LevelPerformance•• AsyncAsyncnetworkcomparisonwith800MHzsyncnetwork:networkcomparisonwith800MHzsyncnetwork:
–– ComparablethroughputComparablethroughputacrossallinputtrafficacrossallinputtraffic
–– 1.7xlowerlatency1.7xlowerlatencyuptoupto73%maxinputtraffic73%maxinputtraffic
•• GALSnetworkcomparisonwith800MHzsyncnetwork:GALSnetworkcomparisonwith800MHzsyncnetwork:–– ComparablethroughputComparablethroughputuptoupto52%maxinputtraffic52%maxinputtraffic
–– LowerlatencyLowerlatencyuptoupto29%maxinputtraffic29%maxinputtraffic
FutureDirectionsFutureDirections•• ArchitecturalOptimizationArchitecturalOptimization
–– InsertlinearpipelinestagesonlongwirestoimprovethroughputInsertlinearpipelinestagesonlongwirestoimprovethroughput
•• CircuitOptimizationCircuitOptimization–– Improvedesignsofrouting/arbitrationprimitivesImprovedesignsofrouting/arbitrationprimitives–– Mixed‐timingFIFOoptimizationsMixed‐timingFIFOoptimizations
•• AsynchronousTopologyOptimizationAsynchronousTopologyOptimization–– AreaimprovementsAreaimprovementsusinghybridusinghybridMoT‐ButterflyMoT‐Butterfly[Balkanetal.,DAC‐08][Balkanetal.,DAC‐08]
•• IntegratewithSynchronousPhysicalCADToolFlowIntegratewithSynchronousPhysicalCADToolFlow–– GoalGoal=leverage=leverageexistingcommercialexistingcommercialtechniquestechniques
•• TimingconstraintspecificationandsynthesisofTimingconstraintspecificationandsynthesisofunclockedunclockedtimingpathstimingpaths
•• BuildonautomatedBuildonautomatedasyncasyncflowofflowof[[Quinton/Greenstreet/WiltonQuinton/Greenstreet/WiltonTVLSITVLSI‘‘08]08]
•• Optimizedplacement,routing,gateresizingandrepeaterinsertionOptimizedplacement,routing,gateresizingandrepeaterinsertion
•• TargetAlternativeParallelArchitectures/MemorySystemsTargetAlternativeParallelArchitectures/MemorySystems
TypesofMixed‐Timing(GALS)SystemsTypesofMixed‐Timing(GALS)Systems
•• PseudochronousPseudochronous–– SameFrequency,ConstantPhaseDifferenceSameFrequency,ConstantPhaseDifference
•• MesochronousMesochronous–– SameFrequency,UndefinedPhaseDifferenceSameFrequency,UndefinedPhaseDifference
•• PlesiochronousPlesiochronous–– NearlyexactFrequencyandPhaseDifferenceNearlyexactFrequencyandPhaseDifference
•• HeterochronousHeterochronous–– UndefinedFrequencyandPhaseDifferenceUndefinedFrequencyandPhaseDifference
MOUSETRAPAsynchronousPipelinesMOUSETRAPAsynchronousPipelines
•• FastCommunicationFastCommunication–– TransitionTransitionsignalingsignaling(2‐phase)handshaking(2‐phase)handshaking
•• Synchronous‐StyleChannelEncodingSynchronous‐StyleChannelEncoding–– Single‐railbundleddataprotocolSingle‐railbundleddataprotocol
•• LowLatencyLowLatency–– 1TransparentDLatchdelayforemptystage1TransparentDLatchdelayforemptystage
•• Minimal‐OverheadLatchControllerMinimal‐OverheadLatchController–– 1XNORGate1XNORGate
reqN
ackN-1
reqN+1
ackN
Data Latch
Latch Controller
doneN
Data in Data out
Stage NStage N-1 Stage N+1
En
MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)
Stagescommunicateusingtransition-signaling:
reqN
ackN-1
reqN+1
ackN
Data Latch
Latch Controller
doneN
Data in Data out
Stage NStage N-1 Stage N+1
En
MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)
Stagescommunicateusingtransition-signaling:
1 transition1 transitionper data item!per data item!
One Data Item
BasicMixed‐ClockFIFO(Sync‐Sync)BasicMixed‐ClockFIFO(Sync‐Sync)
cell cell cell cell cell
Get
Con
trolle
r
Empty Detector
Full DetectorPut
Controller
full
req_put
data_putCLK_put
CLK_getdata_getreq_get
valid_getempty
•• Sync‐SyncFIFOSync‐SyncFIFO:usesSynchronous:usesSynchronousPutPutandandGetGetModulesModules–– Sync‐Syncisoneof4mixed‐timingSync‐Syncisoneof4mixed‐timingFIFOsFIFOs
•• MixedMixedAsyncAsync+Sync+SyncFIFOFIFO’’ss:modularchanges:modularchanges–– Sync‐AsyncSync‐Async::usesSynchronoususesSynchronousPutPut(top)andAsynchronous(top)andAsynchronousGetGet–– Async‐SyncAsync‐Sync::usesSynchronoususesSynchronousGetGet(bottom)andAsynchronous(bottom)andAsynchronousPutPut