a low‐overhead asynchronous interconnection network for...

84
A Low‐Overhead Asynchronous A Low‐Overhead Asynchronous Interconnection Network for Interconnection Network for GALS Chip Multiprocessors GALS Chip Multiprocessors Michael N. Horak Michael N. Horak, University of Maryland University of Maryland Steven M. Nowick Steven M. Nowick, Columbia University Columbia University Matthew Matthew Carlberg Carlberg, UC Berkeley UC Berkeley Uzi Uzi Vishkin Vishkin, University of Maryland University of Maryland In ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-10)

Upload: doananh

Post on 30-Jun-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

ALow‐OverheadAsynchronousALow‐OverheadAsynchronousInterconnectionNetworkforInterconnectionNetworkforGALSChipMultiprocessorsGALSChipMultiprocessorsMichaelN.HorakMichaelN.Horak,,UniversityofMarylandUniversityofMaryland

StevenM.NowickStevenM.Nowick,,ColumbiaUniversityColumbiaUniversity

MatthewMatthewCarlbergCarlberg,,UCBerkeleyUCBerkeley

UziUziVishkinVishkin,,UniversityofMarylandUniversityofMaryland

In ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-10)

ChallengesforDesigningNetworks‐on‐ChipChallengesforDesigningNetworks‐on‐Chip

•• PowerConsumptionPowerConsumption–– WillexceedfuturepowerbudgetsbyafactorofWillexceedfuturepowerbudgetsbyafactorof!"#!"#[1][1]

–– Globalclocks:consumelargefractionofoverallpowerGlobalclocks:consumelargefractionofoverallpower

•• PerformanceBottlenecksPerformanceBottlenecks–– LargenetworklatenciescauseperformancedegradationLargenetworklatenciescauseperformancedegradation

•• IncreasedDesignerResourcesIncreasedDesignerResources–– ManytechniquesareincompatiblewithcurrentCADtoolsManytechniquesareincompatiblewithcurrentCADtools

–– DifficultiesintegratingheterogeneousmodulesDifficultiesintegratingheterogeneousmodules•• ChipspartitionedintoChipspartitionedinto!"#$%&#'($%!%)*(+,!-%).!"#$%&#'($%!%)*(+,!-%).

[1]J.D.Owens,W.J.Dally,R.Ho,D.N.[1]J.D.Owens,W.J.Dally,R.Ho,D.N.JayasimhaJayasimha,S.W.,S.W.KecklerKeckler,andL.‐S.,andL.‐S.PehPeh..Researchchallengesforon‐chipinterconnectionnetworks.Researchchallengesforon‐chipinterconnectionnetworks.IEEEMicroIEEEMicro,27(5):96‐108,2007.,27(5):96‐108,2007.

PotentialAdvantagesofAsynchronousDesignPotentialAdvantagesofAsynchronousDesign

•• LowerPowerLowerPower–– Noclockpowerconsumed:Noclockpowerconsumed:withoutwithoutclockgatingclockgating–– IdlecomponentsinherentlyconsumelowpowerIdlecomponentsinherentlyconsumelowpower

•• GreaterFlexibility/ModularityGreaterFlexibility/Modularity–– NoclockdistributionNoclockdistribution–– EasierintegrationbetweenmultipletimingdomainsEasierintegrationbetweenmultipletimingdomains–– SupportsreusablecomponentsSupportsreusablecomponents

•• LowerSystemLatencyLowerSystemLatency–– End‐to‐endtrafficwithoutclocksynchronizationEnd‐to‐endtrafficwithoutclocksynchronization

•• MoreResilienttoOn‐ChipVariationsMoreResilienttoOn‐ChipVariations–– CorrectoperationdependsonlocalizedtimingconstraintsCorrectoperationdependsonlocalizedtimingconstraints

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks

•• Mixed‐TimingInterfacesMixed‐TimingInterfaces–– ProviderobustcommunicationProviderobustcommunicationbetweenSyncandbetweenSyncandAsyncAsyncdomainsdomains

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

AdvancesinGALSNetworks‐on‐ChipAdvancesinGALSNetworks‐on‐Chip•• CommercialDesignsCommercialDesigns

–– SilistixSilistix,Inc.,Inc.(J.Bainbridge,S.(J.Bainbridge,S.FurberFurber.IEEEMicro‐02).IEEEMicro‐02)

•• CHAINCHAIN™™workstoolsuite:heterogeneousworkstoolsuite:heterogeneousSOCsSOCs

–– FulcrumMicrosystemsFulcrumMicrosystems(A.Lines.Micro‐04)(A.Lines.Micro‐04)

•• FocalPointFocalPointchips:chips:high‐performanceEthernetroutinghigh‐performanceEthernetrouting

•• RecentRecentWorkWork–– AsynchronousNetwork‐on‐Chip(AsynchronousNetwork‐on‐Chip(ANoCANoC))((BeigneBeigne,,ClermidyClermidy,,VivetVivetetal.Async‐05)etal.Async‐05)

•• Wormholepacket‐switchedWormholepacket‐switchedNoCNoCwithlow‐latencyservicewithlow‐latencyservice

–– MANGOMANGOClocklessClocklessNetwork‐on‐ChipNetwork‐on‐Chip(T.(T.BjerregaardBjerregaard.DATE‐05).DATE‐05)

•• Offersquality‐of‐service(Offersquality‐of‐service(QoSQoS)guarantees)guarantees

–– RasPRasPOn‐ChipNetworkOn‐ChipNetwork(S.Hollis,S.W.Moore.ICCD‐06)(S.Hollis,S.W.Moore.ICCD‐06)

•• Utilizeshigh‐speedpulse‐basedsignalingUtilizeshigh‐speedpulse‐basedsignaling

–– SpiNNakerSpiNNakerProjectProject(Khan,Lester,(Khan,Lester,PlanaPlana,,FurberFurberetal.IJCNN‐08)etal.IJCNN‐08)

•• Massively‐parallelneuralsimulationMassively‐parallelneuralsimulation

GALSGALSNOCsNOCs:TypicalCurrentTargets:TypicalCurrentTargets•• Low‐toModerate‐PerformanceEmbeddedSystemsLow‐toModerate‐PerformanceEmbeddedSystems

–– 200‐500MHz200‐500MHz–– HighsystemlatencyHighsystemlatency

•• ““Four‐PhaseReturn‐to‐ZeroFour‐PhaseReturn‐to‐Zero””ProtocolsProtocols–– Tworound‐trips/linkTworound‐trips/linkpertransactionpertransaction

•• ““Delay‐InsensitiveDataDelay‐InsensitiveData””Encoding(dual‐rail,1‐of‐4)Encoding(dual‐rail,1‐of‐4)–– Lowercodingefficiencythansingle‐railLowercodingefficiencythansingle‐rail

•• Complex‐FunctionalityRouterNodesComplex‐FunctionalityRouterNodes–– 5‐portrouterswithlayeredservices(5‐portrouterswithlayeredservices(QoSQoS,etc.),etc.)–– Highlatency/highareaHighlatency/higharea

•• CustomCircuitTechniques:CustomCircuitTechniques:–– Pulse‐basedsignaling,low‐swingPulse‐basedsignaling,low‐swingsignallingsignalling–– DynamicDynamiclogic,specializedcellslogic,specializedcells

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives

•• ExperimentalResultsExperimentalResults

•• ConclusionsConclusions

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors–– Medium‐toHigh‐PerformanceMedium‐toHigh‐Performance

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming[3][3]–– MostgeneralGALStimingmodelMostgeneralGALStimingmodel

–– SupportmultiplesynchronousdomainswithunrelatedclockingSupportmultiplesynchronousdomainswithunrelatedclocking

–– PromotesreuseofIntellectualProperty(IP)modulesPromotesreuseofIntellectualProperty(IP)modules

[3]D.Messerschmitt,[3]D.Messerschmitt,““SynchronizationinDigitalSystemDesignSynchronizationinDigitalSystemDesign””,,IEEEJournalonSelectedAreasinCommunications,October1990IEEEJournalonSelectedAreasinCommunications,October1990

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• TransitionSignalingTransitionSignaling(Two‐Phase)(Two‐Phase)–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““four‐phasehandshakingfour‐phasehandshaking””

•• 2roundtriplinkcommunications2roundtriplinkcommunicationspertransactionpertransaction

–– BenefitsofTwo‐Phase:BenefitsofTwo‐Phase:•• 1roundtriplinkcommunication1roundtriplinkcommunicationpertransactionpertransaction

•• improvedthroughput,powerimprovedthroughput,power……..

–– ChallengeofTwo‐Phase:ChallengeofTwo‐Phase:designinglightweightimplementationsdesigninglightweightimplementations•• Mostexisting2‐phasedesignsuse:Mostexisting2‐phasedesignsuse:

–– complexslowregisters:doublelatch,double‐edge‐triggered,capture/passcomplexslowregisters:doublelatch,double‐edge‐triggered,capture/pass»» [Seitz/Su[Seitz/Su““MosaicMosaic””93,93,BrunvandBrunvand91,Sutherland89]91,Sutherland89]

–– customcircuitcomponentscustomcircuitcomponents

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““delay‐insensitivedelay‐insensitive””linkencodingslinkencodings

•• providegreattiming‐robustness==>cost=providegreattiming‐robustness==>cost=poorcodingefficiencypoorcodingefficiency

•• examples:dual‐rail,1‐of‐4examples:dual‐rail,1‐of‐4

–– ““Single‐RailBundledDataSingle‐RailBundledData””benefits:benefits:•• re‐usesynchronousre‐usesynchronousdatapathsdatapaths::1wire/bit1wire/bit++addedadded““requestrequest””•• excellentcodingefficiencyexcellentcodingefficiency

–– Challenge:requiresmatcheddelayforChallenge:requiresmatcheddelayfor““requestrequest””signalsignal•• 1‐sidedtimingconstraint:1‐sidedtimingconstraint:““requestrequest””mustarriveafterdatastablemustarriveafterdatastable

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance–– LowSystem‐LevelLatencyLowSystem‐LevelLatency

•• minimizeend‐to‐enddelayunderlighttomoderatetrafficminimizeend‐to‐enddelayunderlighttomoderatetraffic

–– HighSustainedThroughputHighSustainedThroughput•• maximizesteady‐statethroughputunderheavytrafficmaximizesteady‐statethroughputunderheavytraffic

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance

•• StandardCellMethodologyStandardCellMethodology–– UseexistingstandardcelllibrariesUseexistingstandardcelllibraries

•• onlyexception:onlyexception:analogarbitercircuitanalogarbitercircuit

–– Challenge:Challenge:timinganalysisusingexistingtoolstiminganalysisusingexistingtools

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance

•• StandardCellMethodologyStandardCellMethodology

•• Fine‐GrainedFine‐GrainedNetworkTopologyNetworkTopology–– LightweightnetworknodesLightweightnetworknodes

•• low‐functionalitylow‐functionalitylow‐radixroutercomponentslow‐radixroutercomponents

•• avoids5‐portrouterwithNorth/South/East/West/Localportsavoids5‐portrouterwithNorth/South/East/West/Localports

TargetGALSNetworkDesignTargetGALSNetworkDesign

OutlineOutline

•• IntroductionIntroduction•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:XMTProcessor/Background:XMTProcessor/MoTMoTNetworkNetwork–– eXpliciteXplicitMulti‐Threading(XMT)ArchitectureMulti‐Threading(XMT)Architecture–– Mesh‐of‐Trees(Mesh‐of‐Trees(MoTMoT)NetworkTopology)NetworkTopology–– SynchronousRouterNodesSynchronousRouterNodes

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions

XMTParallelArchitectureXMTParallelArchitecture

•• XMT=XMT=““eXpliciteXplicitMulti‐ThreadingMulti‐Threading””(1997‐present)[4](1997‐present)[4]–– LedbyProf.UziLedbyProf.UziVishkinVishkinatUniversityofMaryland,CollegeParkatUniversityofMaryland,CollegePark

•• BasedonParallelRandomAccessModel(PRAM)BasedonParallelRandomAccessModel(PRAM)–– LargestbodyofparallelalgorithmictheoryLargestbodyofparallelalgorithmictheory

•• EaseofProgrammabilityEaseofProgrammability–– XMT‐ClanguageXMT‐Clanguage+optimizingcompiler+optimizingcompiler–– Single‐ProgramMultiple‐Data(SPMD)programmingmethodologySingle‐ProgramMultiple‐Data(SPMD)programmingmethodology

•• DemonstratedtoProvideSignificantSpeedupsDemonstratedtoProvideSignificantSpeedups–– Performswellonirregularcomputations(BFS,ray‐tracing)Performswellonirregularcomputations(BFS,ray‐tracing)–– 100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]

[4]D.Naishlos,J.Nuzman,C.‐W.Tseng,andU.Vishkin.“Towardsafirstverticalprototypingofanextremelyfine‐grainedparallelprogrammingapproach”,SPAA2001

[5]P.[5]P.GuGuandU.andU.VishkinVishkin,,““Casestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipCasestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipmultiprocessormultiprocessor””,,JournalofEmbeddedComputing,April2006JournalofEmbeddedComputing,April2006

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupofsimplepipelinedcores,Groupofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization

–– ““IOSIOS””=independence‐of‐order=independence‐of‐ordersemantics:semantics:noWAW/WAR/RAWnoWAW/WAR/RAWdatahazardsbetweenthreadsdatahazardsbetweenthreads

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittlewithlittleornosynchronizationornosynchronization

•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache

–– NocachecoherenceproblemNocachecoherenceproblem

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization

•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache

–– NocachecoherenceproblemNocachecoherenceproblem

•• NOCChallenge:NOCChallenge:highbandwidth/lowpowerrequirementshighbandwidth/lowpowerrequirements–– Manyconcurrentmemoryrequests(load/store)Manyconcurrentmemoryrequests(load/store)–– Shortpackets:1‐2flits/dynamically‐varyingtrafficShortpackets:1‐2flits/dynamically‐varyingtraffic–– LowlatencyrequiredforsystemperformanceLowlatencyrequiredforsystemperformance

ProposedXMTParallelArchitecture:ProposedXMTParallelArchitecture:withGALSInterconnectionNetworkwithGALSInterconnectionNetwork

GALSNetwork…

Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology

•• VariantofclassicVariantofclassicMoTMoT

•• Nfan‐outtreesNfan‐outtrees–– RoutingonlyRoutingonly

–– RootatsourceterminalsRootatsourceterminals

•• Nfan‐intreesNfan‐intrees–– ArbitrationonlyArbitrationonly

–– RootatdestinationterminalsRootatdestinationterminals

$%&'(%)('*+,*-('+.

Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology•• HighThroughputHighThroughput

–– Uniqueroutingpaths(source/sink)Uniqueroutingpaths(source/sink)–– AvoidsinterferencepenaltiesAvoidsinterferencepenalties

•• FixedPathLengthFixedPathLength–– LogarithmicdepthLogarithmicdepth

•• DistributedLow‐RadixRoutingDistributedLow‐RadixRouting–– LimitedfunctionalitynodesLimitedfunctionalitynodes–– WormholedeterministicroutingWormholedeterministicrouting

•• ShowntoPerformWellforShowntoPerformWellforCMPsCMPs–– Providesveryhighsustainedthroughput[6]Providesveryhighsustainedthroughput[6]–– Highsaturationthroughput:Highsaturationthroughput:~91%~91%

[6]A.O.Balkan,G.[6]A.O.Balkan,G.QuQu,U.,U.VishkinVishkin,,““Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐chipparallelismchipparallelism””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009

SynchronousRoutingPrimitiveSynchronousRoutingPrimitive•• Fan‐OutComponentFan‐OutComponent[7][7]

–– 1Input,2Outputs1Input,2Outputs

–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism

•• SignaltopreviousstagewhenSignaltopreviousstagewhennewdatacanbeacceptednewdatacanbeaccepted

•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””[[CarloniCarlonietal.,TCAD01]etal.,TCAD01]

–– 2‐RegisterFIFO:2‐RegisterFIFO:B0,B1B0,B1

–– Allows1flit/cycleinsteady‐stateAllows1flit/cycleinsteady‐state•• AcceptnewdataandforwardstoreddataconcurrentlyAcceptnewdataandforwardstoreddataconcurrently

–– Cost:Cost:1extraauxiliaryregister1extraauxiliaryregister((flipflop‐basedflipflop‐based))

[7][7]A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)

SynchronousArbitrationPrimitiveSynchronousArbitrationPrimitive

•• Fan‐InComponentFan‐InComponent[7][7]–– 2Inputs,1Output2Inputs,1Output

–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism

•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””–– 2‐Stage2‐StageFIFOsFIFOsateachinputportateachinputport–– Whenempty,latency=1cycleWhenempty,latency=1cycle

–– Whenstalled,latency=2+cyclesWhenstalled,latency=2+cycles•• Dependsonback‐pressureandsynchronousarbitrationDependsonback‐pressureandsynchronousarbitration

–– Cost:Cost:totalof4registerstotalof4registers(flip‐flopbased)(flip‐flopbased)

[7}[7}A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives–– Routingprimitive(Fan‐out)Routingprimitive(Fan‐out)

–– Arbitrationprimitive(Fan‐in)Arbitrationprimitive(Fan‐in)–– Mixed‐timinginterfacesMixed‐timinginterfaces

•• ExperimentalResultsExperimentalResults

•• ConclusionsConclusions

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Binary Routing SignalBinary Routing Signal

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Data ChannelsData Channels

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

LatchLatchControllersControllers

Req0Req1Ack

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req Ack Req1 Ack1

Latch ControllerLatch Controller

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

Normally OpaqueNormally OpaqueLatch RegistersLatch Registers

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

<)(3+=><)(3+=>

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

?@%*-.@A-(?@%*-.@A-(

/0,!()'1$/0,!()'1$((.$-*'.$-*'

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

Data ChannelsData Channels

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Mutual ExclusionMutual ExclusionElement (Element (MutexMutex))

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Request ProtectionRequest ProtectionLatchesLatches(Normally Opaque)(Normally Opaque) ‏‏

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Data + Request Latch RegisterData + Request Latch Register (only one bank of latches required)(only one bank of latches required)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Acknowledgment Protection LatchesAcknowledgment Protection Latches(normally transparent)(normally transparent)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

<)(3+=><)(3+=>

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

<)(3+=><)(3+=>

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

?@%*-.@A-(?@%*-.@A-((bestcase,(bestcase,alternatingalternatinginputs)inputs)

/0,!()'1$/0,!()'1$((.$-*'.$-*'

WormholeRoutingCapabilityWormholeRoutingCapability

•• Goal:Goal:supporttransmissionofmulti‐flitpacketssupporttransmissionofmulti‐flitpackets–– example:XMTexample:XMT””storestorepacketspackets””=2flits=2flits(address+data)(address+data)

•• Solution:Solution:add1extraadd1extra““gluebitgluebit””toeachflittoeachflit–– Gluebit=1Gluebit=1notlastflitinpacketnotlastflitinpacket

–– Enhancedarbitrationprimitive:Enhancedarbitrationprimitive:biasbiasmutexmutexdecisiondecision•• ““winner‐take‐allwinner‐take‐all””strategy[strategy[Dally/TowlesDally/Towles]]

•• headerflittakesoverheaderflittakesovermutexmutex:glue=1:glue=1

•• lastflitreleaseslastflitreleasesmutexmutex:glue=:glue=00

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

L8

L9

Req_Out

Ack_In

Data0Data1

Glue0

Glue1

Mux_Select

Data_OutLATCH

L+@)+=35/B8*C/D*+(%*8/E+'(L+@)+=35/B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

glue0 bit

glue1 bit

WormholeControl

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

-- Can beCan be inserted for buffering:inserted for buffering: to improve system-level throughput to improve system-level throughput-- Basis for Basis for design of new fan-in/fan-out primitivesdesign of new fan-in/fan-out primitives

[8]M.SinghandS.M.Nowick.“MOUSETRAP:High‐SpeedTransition‐SignalingAsynchronousPipelines,”IEEETransactionsonVLSISystems,vol.15:11,pp.1256‐1269(Nov.2007)

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

Handshaking Signals (Request and Acknowledgment)Handshaking Signals (Request and Acknowledgment)

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

Data ChannelsData Channels

Mixed‐TimingInterfacesMixed‐TimingInterfaces

•• UseExistingSynchronizingUseExistingSynchronizingFIFOsFIFOs[9][9](withsmall(withsmallmodifications)modifications)

–– SupportsSupports““heterochronousheterochronous””timingdomainstimingdomains–– NomodificationtoexistingcomponentsNomodificationtoexistingcomponents

•• ModularDesignModularDesign–– ReusableReusablePutPutandandGetGetcomponents(eithercomponents(eitherAsyncAsyncorSync)orSync)–– EachFIFOisarrayofidenticalcellsEachFIFOisarrayofidenticalcells

•• SupportsLow‐PowerOperationSupportsLow‐PowerOperation–– CircularFIFO:datadoesnotmoveCircularFIFO:datadoesnotmove

[9]T.[9]T.ChelceaChelceaandS.Nowick,andS.Nowick,““RobustInterfacesforMixed‐TimingSystemsRobustInterfacesforMixed‐TimingSystems””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives

•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions

EvaluationMethodologyEvaluationMethodology

•• DirectComparisonwithSynchronousDirectComparisonwithSynchronousMoTMoTNetworkNetwork–– IdenticalTechnology:IdenticalTechnology:IBM90nmCMOSprocessIBM90nmCMOSprocess

–– IdenticalFunctionalityIdenticalFunctionality:Sameroutingandarbitrationprimitives:Sameroutingandarbitrationprimitives

–– IdenticalTopology:IdenticalTopology:8‐terminalnetworkswithsame8‐terminalnetworkswithsamefloorplanfloorplan

•• EvaluateatMultipleLevelsofIntegrationEvaluateatMultipleLevelsofIntegration–– IsolatedAsynchronousPrimitivesIsolatedAsynchronousPrimitives(post‐layout)(post‐layout)

–– 8‐TerminalAsynchronousNetwork8‐TerminalAsynchronousNetwork(pre‐layoutwithwireestimates,(pre‐layoutwithwireestimates,‐‐interconnectionoflaid‐outrouterprimitives)‐‐interconnectionoflaid‐outrouterprimitives)

–– 8‐TerminalGALSNetwork8‐TerminalGALSNetwork

–– XMTArchitectureCo‐SimulationonParallelKernelsXMTArchitectureCo‐SimulationonParallelKernels

ToolFlowToolFlow

•• ImplementedinIBM90nmtechnologyImplementedinIBM90nmtechnology–– PlacedandroutedwithCadenceSOCEncounterPlacedandroutedwithCadenceSOCEncounter

–– Simulatedasgate‐levelSimulatedasgate‐levelVerilogVerilogwithextracteddelayswithextracteddelays

•• StandardCellMethodologyStandardCellMethodology–– ARM90nmStandardCells(IBMCMOS9SF)ARM90nmStandardCells(IBMCMOS9SF)

•• Exception:MutualExclusionElementException:MutualExclusionElement–– DesignedusingtransistormodelsfromIBM90nmPDKDesignedusingtransistormodelsfromIBM90nmPDK

–– SimulatedinCadenceSimulatedinCadenceSpectreSpectre

–– MeasureddelaystocalibrateMeasureddelaystocalibrateVerilogVerilogbehavioralmodelbehavioralmodel

RoutingPrimitiveComparison:RoutingPrimitiveComparison:AreaandPowerAreaandPower

225.61.822.062.06988.6988.6Synchronous

0.60.560.370.37358.4358.4Asynchronous

((μμW)W)((μμW)W)((pJpJ))((μμmm22))

IdleIdlePowerPower

LeakageLeakagePowerPower

Energy/Energy/PacketPacketAreaArea

•• Area:Area:–– 64%lessarea:64%lessarea:resultoflightweightresultoflightweightdatastoragedatastorage

•• 2flip‐flopregisters+extraMUX/DEMUX(sync)2flip‐flopregisters+extraMUX/DEMUX(sync)vsvs..2latchregisters(2latchregisters(asyncasync))

•• MUX/DEMUXoverhead(sync)MUX/DEMUXoverhead(sync)

•• Energy/Packet(1flit):Energy/Packet(1flit):–– 82%lessenergyperpacket82%lessenergyperpacket–– Steady‐statemeasurementonrandomtrafficSteady‐statemeasurementonrandomtraffic

RoutingPrimitiveComparison:RoutingPrimitiveComparison:LatencyandThroughputLatencyandThroughput

1.931.931.931.931.931.93516516Synchronous

1.701.701.341.341.071.07546546Asynchronous

AlternatingAlternatingRandomRandomSingleSingle((psps))

Maximum Throughput Maximum Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type

•• Synchronous:Synchronous:UsingMaxClockRateUsingMaxClockRate(1.93GHz)(1.93GHz)

•• Latency:Latency:–– //546546psps((asyncasync)vs.516)vs.516psps(sync)(sync)

•• MaxThroughput(Giga‐flits/sec):MaxThroughput(Giga‐flits/sec):–– Single‐portedtraffic:Single‐portedtraffic:MMNMMNofsyncmax.ofsyncmax.(noconcurrency)(noconcurrency)

–– Randomtraffic:Randomtraffic:O"NO"Nofsyncmax.ofsyncmax.–– Alternatingtraffic:Alternatingtraffic:PPNPPNofsyncMax.ofsyncMax.(mostconcurrency)(mostconcurrency)

……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages

ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:AreaandPowerAreaandPower

388.64.133.533.532240.32240.3Synchronous

0.50.500.330.33349.3349.3Asynchronous

((μμW)W)((μμW)W)((pJpJ))((μμmm22))

IdleIdlePowerPower

LeakageLeakagePowerPower

Energy/Energy/PacketPacketAreaAreaComponent TypeComponent Type

•• Area:Area:–– 84%lessarea84%lessarea–– Duetolow‐overheaddatastorageDuetolow‐overheaddatastorage

•• 4flip‐flopregisters(sync)4flip‐flopregisters(sync)vs.vs.11latchregister(latchregister(asyncasync))

•• Energy/Packet(1flit):Energy/Packet(1flit):–– 91%lessenergyperpacket91%lessenergyperpacket–– Measuredsteady‐statepacketsarrivingatbothinputportsMeasuredsteady‐statepacketsarrivingatbothinputports

ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:LatencyandThroughputLatencyandThroughput

2.092.092.092.09474474Synchronous

2.042.041.081.08489489Asynchronous

Both PortsBoth PortsSingleSingle((psps))

Max. Throughput Max. Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type

•• Synchronous:UsingMaxClockRate(2.09GHz)Synchronous:UsingMaxClockRate(2.09GHz)

•• Latency:Latency:–– 489489psps((asyncasync))vs.474vs.474psps(sync)(sync)

•• Max.Throughput(Giga‐flits/sec):Max.Throughput(Giga‐flits/sec):–– SinglePortonly:SinglePortonly:M!NM!Nofsynchronousmax.ofsynchronousmax.–– TrafficatBothPorts:TrafficatBothPorts:QPNQPNofsynchronousmax.ofsynchronousmax.

……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages

8‐TerminalNetworkEvaluation8‐TerminalNetworkEvaluation

•• Head‐on‐HeadComparisonwithSyncNetworkHead‐on‐HeadComparisonwithSyncNetwork

•• ProjectedNetworkLayoutProjectedNetworkLayout–– Pre‐layoutPre‐layoutasyncasyncnetworknetwork

–– UsesUsespost‐layoutprimitivespost‐layoutprimitives,treatedashardIPmacros,with,treatedashardIPmacros,withassignedwiredelaysassignedwiredelays

–– ExtrapolatewiredelaysbasedonASICExtrapolatewiredelaysbasedonASICfloorplanfloorplanofSyncofSyncMoTMoT

•• ExperimentalSetupExperimentalSetup–– EvaluateperformanceunderEvaluateperformanceunderuniformlyrandominputtrafficuniformlyrandominputtraffic

–– 32‐bitflits32‐bitflits

Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout

•• BasedonBasedonFloorplanFloorplanofSynchronousofSynchronousMoTMoTTestASICTestASIC–– Designed/fabricatedatUMDinMarch2007Designed/fabricatedatUMDinMarch2007[10][10]

•• Networkdividedinto4partitions(P0,P1,P2,P3)Networkdividedinto4partitions(P0,P1,P2,P3)–– Fan‐InTreesexistentirelywithinonepartitionFan‐InTreesexistentirelywithinonepartition–– Fan‐OutTreesdistributedamongpartitionsFan‐OutTreesdistributedamongpartitions

•• AsynchronousProjectionMethodologyAsynchronousProjectionMethodology–– TreatasynchronousprimitivesareTreatasynchronousprimitivesarehardIPmacroshardIPmacros

•• allrouting,arbitrationprimitiveshavesametimingallrouting,arbitrationprimitiveshavesametiming

–– EvenlydistributeEvenlydistributegroupsofprimitivesgroupsofprimitives–– AssignAssigninter‐primitivewiredelaysinter‐primitivewiredelaysbasedonpositionbasedonposition

•• delaysonwiresassignedbasedontechnologyspecificationsdelaysonwiresassignedbasedontechnologyspecifications

[10]A.O.Balkan,M.N.[10]A.O.Balkan,M.N.HorakHorak,G.,G.QuQu,U.,U.VishkinVishkin..““Layout‐accuratedesignandimplementationofahigh‐Layout‐accuratedesignandimplementationofahigh‐throughputinterconnectionnetworkforsingle‐chipparallelprocessingthroughputinterconnectionnetworkforsingle‐chipparallelprocessing””,,HotInterconnects,August2007HotInterconnects,August2007

Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout

ExampleFan-Out Tree

P0P0

P1P1

P2P2

P3P3

CurrentCADToolFlows:SyncCurrentCADToolFlows:Syncvsvs..AsyncAsync

•• SynchronousSynthesis:SynchronousSynthesis:–– Automaticplace/routeoptimizationsAutomaticplace/routeoptimizations

–– Includescellresizing/repeaterinsertionIncludescellresizing/repeaterinsertion

•• AsynchronousSynthesis:AsynchronousSynthesis:–– Limitedoptimization:Limitedoptimization:hardmacros+regularmanualplacementhardmacros+regularmanualplacement–– Nocellresizing/repeaterinsertionNocellresizing/repeaterinsertion……muchpotentialforfutureperformanceimprovementmuchpotentialforfutureperformanceimprovement

•• CurrentlyDoNotDefineNecessaryTimingConstraintsCurrentlyDoNotDefineNecessaryTimingConstraints–– Noautomaticpath‐lengthmatchingNoautomaticpath‐lengthmatching–– NecessarytoenforcebundlingconstraintNecessarytoenforcebundlingconstraint

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:400MHzSyncvs.400MHzSyncvs.AsyncAsync

Comparablethroughputfor entire

range of Sync

Sync hasat least 4.3x

higher latencyfor all Syncinput rates

Sync Max.Input Rate:102.4 Gbps

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:800MHzSyncvs.800MHzSyncvs.AsyncAsync

Comparablethroughputfor entire

range of Sync

Sync has>1.7x

higher latencyfor input rates

up to 73%of Sync max.(150 Gbps)

Sync Max.Input Rate:204.8 Gbps

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:1.36GHzSyncvs.1.36GHzSyncvs.AsyncAsync

Comparablethroughput

for ratesup to

55% of Sync max.(190 Gbps)

Lower latencyfor input

rates up to 43% of

Sync max.(150 Gbps)

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

Sync Max.Input Rate:348.2 Gbps

GALSNetworkPerformanceComparisonGALSNetworkPerformanceComparison

•• ExperimentalSetupExperimentalSetup–– CreateterminalstogeneratetrafficandrecordmeasurementsCreateterminalstogeneratetrafficandrecordmeasurements

–– TerminalsgenerateTerminalsgenerateuniformlyrandominputtrafficuniformlyrandominputtraffic

•• ResultsNormalizedtoClockRateResultsNormalizedtoClockRate–– ThroughputunitsThroughputunits(normalized)(normalized)::flitspercycleperportflitspercycleperport

–– LatencyunitsLatencyunits(normalized)(normalized)::#clockcycles#clockcycles

–– Syncnetworkresults:Syncnetworkresults:alwayssamealwayssamerelativetoclockcyclesrelativetoclockcycles

–– AsyncAsyncnetworkresults:networkresults:varyvarywithclockratewithclockrate

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:400MHzGALSvs.Sync400MHzGALSvs.Sync

Comparablethroughput

for alltraffic rates

Sync has52% higher

latencyup to 80%input traffic

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:600MHzGALSvs.Sync600MHzGALSvs.Sync

Comparablethroughputup to 65%input traffic

Lower latencyup to 60%input traffic

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:800MHzGALSvs.Sync800MHzGALSvs.Sync

Comparablethroughputup to 52%input traffic

Lower latencyup to 29%

input traffic,comparable

latencyup to 40%input traffic

XMTParallelKernelSimulationsXMTParallelKernelSimulations

•• Goal:IntegratewithSynchronousXMTParallelArchitectureGoal:IntegratewithSynchronousXMTParallelArchitecture–– XMTXMTVerilogVerilogRTLdescriptionwithGALSnetworkRTLdescriptionwithGALSnetwork

•• XMTParallelKernelsXMTParallelKernels–– ArraySummation(add)ArraySummation(add)

•• Computesumof3millionelementsinarrayComputesumof3millionelementsinarray

–– MatrixMultiplication(MatrixMultiplication(mmulmmul))•• Computeproductoftwo64x64matricesComputeproductoftwo64x64matrices

–– Breadth‐FirstSearch(Breadth‐FirstSearch(bfsbfs))•• RunXMTBFSalgorithmwith100,000verticesand1millionedgesRunXMTBFSalgorithmwith100,000verticesand1millionedges

–– ArrayIncrement(ArrayIncrement(a_inca_inc))•• Incrementall32kelementsofanarrayIncrementall32kelementsofanarray

XMTParallelKernelSimulationsXMTParallelKernelSimulations

•• XMTProcessorConfigurationXMTProcessorConfiguration–– 8ProcessingClusters(168ProcessingClusters(16TCUsTCUseach)=each)=128128TCUTCU’’sstotaltotal

–– 8DistributedL1D‐CacheModules(64KBtotal)8DistributedL1D‐CacheModules(64KBtotal)

•• SimulateGALSXMTatDifferentClockFrequenciesSimulateGALSXMTatDifferentClockFrequencies–– 200,400,700MHz200,400,700MHz

•• CompareSpeedupsRelativetoSynchronousXMTCompareSpeedupsRelativetoSynchronousXMT–– Valuesgreaterthan1.0indicatebetterperformanceValuesgreaterthan1.0indicatebetterperformance

GALSXMTPerformanceComparisonGALSXMTPerformanceComparison

GALS XMThas similar

performance for200, 400 MHz

Only moderate degradationat 700 MHz(a_inc: 37%

decrease)

(Graph arranged in order of increasing network utilization)(Graph arranged in order of increasing network utilization)

ConclusionsConclusions•• NewGALSNetworkforChipMultiprocessorsNewGALSNetworkforChipMultiprocessors

–– Low‐overheadnetworkforLow‐overheadnetworkfor““heterochronousheterochronous””InterfacesInterfaces

•• DesignofTwoNewAsynchronousRouterCellsDesignofTwoNewAsynchronousRouterCells–– RoutingRoutingandandarbitrationarbitrationcircuitscircuits

•• OverviewofResultsOverviewofResults–– RouterPrimitivesRouterPrimitives

•• 64‐84%lessarea,82‐91%lessenergy/packet64‐84%lessarea,82‐91%lessenergy/packet•• Latency&throughput(forbalancedtraffic)=~2Latency&throughput(forbalancedtraffic)=~2Gflits/secGflits/sec

–– System‐LevelPerformanceSystem‐LevelPerformance•• AsyncAsyncnetworkcomparisonwith800MHzsyncnetwork:networkcomparisonwith800MHzsyncnetwork:

–– ComparablethroughputComparablethroughputacrossallinputtrafficacrossallinputtraffic

–– 1.7xlowerlatency1.7xlowerlatencyuptoupto73%maxinputtraffic73%maxinputtraffic

•• GALSnetworkcomparisonwith800MHzsyncnetwork:GALSnetworkcomparisonwith800MHzsyncnetwork:–– ComparablethroughputComparablethroughputuptoupto52%maxinputtraffic52%maxinputtraffic

–– LowerlatencyLowerlatencyuptoupto29%maxinputtraffic29%maxinputtraffic

FutureDirectionsFutureDirections•• ArchitecturalOptimizationArchitecturalOptimization

–– InsertlinearpipelinestagesonlongwirestoimprovethroughputInsertlinearpipelinestagesonlongwirestoimprovethroughput

•• CircuitOptimizationCircuitOptimization–– Improvedesignsofrouting/arbitrationprimitivesImprovedesignsofrouting/arbitrationprimitives–– Mixed‐timingFIFOoptimizationsMixed‐timingFIFOoptimizations

•• AsynchronousTopologyOptimizationAsynchronousTopologyOptimization–– AreaimprovementsAreaimprovementsusinghybridusinghybridMoT‐ButterflyMoT‐Butterfly[Balkanetal.,DAC‐08][Balkanetal.,DAC‐08]

•• IntegratewithSynchronousPhysicalCADToolFlowIntegratewithSynchronousPhysicalCADToolFlow–– GoalGoal=leverage=leverageexistingcommercialexistingcommercialtechniquestechniques

•• TimingconstraintspecificationandsynthesisofTimingconstraintspecificationandsynthesisofunclockedunclockedtimingpathstimingpaths

•• BuildonautomatedBuildonautomatedasyncasyncflowofflowof[[Quinton/Greenstreet/WiltonQuinton/Greenstreet/WiltonTVLSITVLSI‘‘08]08]

•• Optimizedplacement,routing,gateresizingandrepeaterinsertionOptimizedplacement,routing,gateresizingandrepeaterinsertion

•• TargetAlternativeParallelArchitectures/MemorySystemsTargetAlternativeParallelArchitectures/MemorySystems

BACKUPSLIDESBACKUPSLIDES

TypesofMixed‐Timing(GALS)SystemsTypesofMixed‐Timing(GALS)Systems

•• PseudochronousPseudochronous–– SameFrequency,ConstantPhaseDifferenceSameFrequency,ConstantPhaseDifference

•• MesochronousMesochronous–– SameFrequency,UndefinedPhaseDifferenceSameFrequency,UndefinedPhaseDifference

•• PlesiochronousPlesiochronous–– NearlyexactFrequencyandPhaseDifferenceNearlyexactFrequencyandPhaseDifference

•• HeterochronousHeterochronous–– UndefinedFrequencyandPhaseDifferenceUndefinedFrequencyandPhaseDifference

MOUSETRAPAsynchronousPipelinesMOUSETRAPAsynchronousPipelines

•• FastCommunicationFastCommunication–– TransitionTransitionsignalingsignaling(2‐phase)handshaking(2‐phase)handshaking ‏‏

•• Synchronous‐StyleChannelEncodingSynchronous‐StyleChannelEncoding–– Single‐railbundleddataprotocolSingle‐railbundleddataprotocol

•• LowLatencyLowLatency–– 1TransparentDLatchdelayforemptystage1TransparentDLatchdelayforemptystage

•• Minimal‐OverheadLatchControllerMinimal‐OverheadLatchController–– 1XNORGate1XNORGate

reqN

ackN-1

reqN+1

ackN

Data Latch

Latch Controller

doneN

Data in Data out

Stage NStage N-1 Stage N+1

En

MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)

Stagescommunicateusingtransition-signaling:

reqN

ackN-1

reqN+1

ackN

Data Latch

Latch Controller

doneN

Data in Data out

Stage NStage N-1 Stage N+1

En

MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)

Stagescommunicateusingtransition-signaling:

1 transition1 transitionper data item!per data item!

One Data Item

BasicMixed‐ClockFIFO(Sync‐Sync)BasicMixed‐ClockFIFO(Sync‐Sync)

cell cell cell cell cell

Get

Con

trolle

r

Empty Detector

Full DetectorPut

Controller

full

req_put

data_putCLK_put

CLK_getdata_getreq_get

valid_getempty

•• Sync‐SyncFIFOSync‐SyncFIFO:usesSynchronous:usesSynchronousPutPutandandGetGetModulesModules–– Sync‐Syncisoneof4mixed‐timingSync‐Syncisoneof4mixed‐timingFIFOsFIFOs

•• MixedMixedAsyncAsync+Sync+SyncFIFOFIFO’’ss:modularchanges:modularchanges–– Sync‐AsyncSync‐Async::usesSynchronoususesSynchronousPutPut(top)andAsynchronous(top)andAsynchronousGetGet–– Async‐SyncAsync‐Sync::usesSynchronoususesSynchronousGetGet(bottom)andAsynchronous(bottom)andAsynchronousPutPut