Transcript

ALow‐OverheadAsynchronousALow‐OverheadAsynchronousInterconnectionNetworkforInterconnectionNetworkforGALSChipMultiprocessorsGALSChipMultiprocessorsMichaelN.HorakMichaelN.Horak,,UniversityofMarylandUniversityofMaryland

StevenM.NowickStevenM.Nowick,,ColumbiaUniversityColumbiaUniversity

MatthewMatthewCarlbergCarlberg,,UCBerkeleyUCBerkeley

UziUziVishkinVishkin,,UniversityofMarylandUniversityofMaryland

In ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-10)

ChallengesforDesigningNetworks‐on‐ChipChallengesforDesigningNetworks‐on‐Chip

•• PowerConsumptionPowerConsumption–– WillexceedfuturepowerbudgetsbyafactorofWillexceedfuturepowerbudgetsbyafactorof!"#!"#[1][1]

–– Globalclocks:consumelargefractionofoverallpowerGlobalclocks:consumelargefractionofoverallpower

•• PerformanceBottlenecksPerformanceBottlenecks–– LargenetworklatenciescauseperformancedegradationLargenetworklatenciescauseperformancedegradation

•• IncreasedDesignerResourcesIncreasedDesignerResources–– ManytechniquesareincompatiblewithcurrentCADtoolsManytechniquesareincompatiblewithcurrentCADtools

–– DifficultiesintegratingheterogeneousmodulesDifficultiesintegratingheterogeneousmodules•• ChipspartitionedintoChipspartitionedinto!"#$%&#'($%!%)*(+,!-%).!"#$%&#'($%!%)*(+,!-%).

[1]J.D.Owens,W.J.Dally,R.Ho,D.N.[1]J.D.Owens,W.J.Dally,R.Ho,D.N.JayasimhaJayasimha,S.W.,S.W.KecklerKeckler,andL.‐S.,andL.‐S.PehPeh..Researchchallengesforon‐chipinterconnectionnetworks.Researchchallengesforon‐chipinterconnectionnetworks.IEEEMicroIEEEMicro,27(5):96‐108,2007.,27(5):96‐108,2007.

PotentialAdvantagesofAsynchronousDesignPotentialAdvantagesofAsynchronousDesign

•• LowerPowerLowerPower–– Noclockpowerconsumed:Noclockpowerconsumed:withoutwithoutclockgatingclockgating–– IdlecomponentsinherentlyconsumelowpowerIdlecomponentsinherentlyconsumelowpower

•• GreaterFlexibility/ModularityGreaterFlexibility/Modularity–– NoclockdistributionNoclockdistribution–– EasierintegrationbetweenmultipletimingdomainsEasierintegrationbetweenmultipletimingdomains–– SupportsreusablecomponentsSupportsreusablecomponents

•• LowerSystemLatencyLowerSystemLatency–– End‐to‐endtrafficwithoutclocksynchronizationEnd‐to‐endtrafficwithoutclocksynchronization

•• MoreResilienttoOn‐ChipVariationsMoreResilienttoOn‐ChipVariations–– CorrectoperationdependsonlocalizedtimingconstraintsCorrectoperationdependsonlocalizedtimingconstraints

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

Mixed‐Timing(GALS)SystemMixed‐Timing(GALS)System

•• GloballyAsynchronous,GloballyAsynchronous,LocallySynchronous[2]LocallySynchronous[2]

•• AsynchronousNetworkAsynchronousNetwork–– ClocklessClocklessnetworkfabricnetworkfabric

•• SynchronousTerminalsSynchronousTerminals–– DifferentunrelatedclocksDifferentunrelatedclocks

•• Mixed‐TimingInterfacesMixed‐TimingInterfaces–– ProviderobustcommunicationProviderobustcommunicationbetweenSyncandbetweenSyncandAsyncAsyncdomainsdomains

[2]D.[2]D.ChapiroChapiro..Globally‐AsynchronousLocally‐SynchronousSystems.Globally‐AsynchronousLocally‐SynchronousSystems.PhDthesis,StanfordUniv.,1984.PhDthesis,StanfordUniv.,1984.

AdvancesinGALSNetworks‐on‐ChipAdvancesinGALSNetworks‐on‐Chip•• CommercialDesignsCommercialDesigns

–– SilistixSilistix,Inc.,Inc.(J.Bainbridge,S.(J.Bainbridge,S.FurberFurber.IEEEMicro‐02).IEEEMicro‐02)

•• CHAINCHAIN™™workstoolsuite:heterogeneousworkstoolsuite:heterogeneousSOCsSOCs

–– FulcrumMicrosystemsFulcrumMicrosystems(A.Lines.Micro‐04)(A.Lines.Micro‐04)

•• FocalPointFocalPointchips:chips:high‐performanceEthernetroutinghigh‐performanceEthernetrouting

•• RecentRecentWorkWork–– AsynchronousNetwork‐on‐Chip(AsynchronousNetwork‐on‐Chip(ANoCANoC))((BeigneBeigne,,ClermidyClermidy,,VivetVivetetal.Async‐05)etal.Async‐05)

•• Wormholepacket‐switchedWormholepacket‐switchedNoCNoCwithlow‐latencyservicewithlow‐latencyservice

–– MANGOMANGOClocklessClocklessNetwork‐on‐ChipNetwork‐on‐Chip(T.(T.BjerregaardBjerregaard.DATE‐05).DATE‐05)

•• Offersquality‐of‐service(Offersquality‐of‐service(QoSQoS)guarantees)guarantees

–– RasPRasPOn‐ChipNetworkOn‐ChipNetwork(S.Hollis,S.W.Moore.ICCD‐06)(S.Hollis,S.W.Moore.ICCD‐06)

•• Utilizeshigh‐speedpulse‐basedsignalingUtilizeshigh‐speedpulse‐basedsignaling

–– SpiNNakerSpiNNakerProjectProject(Khan,Lester,(Khan,Lester,PlanaPlana,,FurberFurberetal.IJCNN‐08)etal.IJCNN‐08)

•• Massively‐parallelneuralsimulationMassively‐parallelneuralsimulation

GALSGALSNOCsNOCs:TypicalCurrentTargets:TypicalCurrentTargets•• Low‐toModerate‐PerformanceEmbeddedSystemsLow‐toModerate‐PerformanceEmbeddedSystems

–– 200‐500MHz200‐500MHz–– HighsystemlatencyHighsystemlatency

•• ““Four‐PhaseReturn‐to‐ZeroFour‐PhaseReturn‐to‐Zero””ProtocolsProtocols–– Tworound‐trips/linkTworound‐trips/linkpertransactionpertransaction

•• ““Delay‐InsensitiveDataDelay‐InsensitiveData””Encoding(dual‐rail,1‐of‐4)Encoding(dual‐rail,1‐of‐4)–– Lowercodingefficiencythansingle‐railLowercodingefficiencythansingle‐rail

•• Complex‐FunctionalityRouterNodesComplex‐FunctionalityRouterNodes–– 5‐portrouterswithlayeredservices(5‐portrouterswithlayeredservices(QoSQoS,etc.),etc.)–– Highlatency/highareaHighlatency/higharea

•• CustomCircuitTechniques:CustomCircuitTechniques:–– Pulse‐basedsignaling,low‐swingPulse‐basedsignaling,low‐swingsignallingsignalling–– DynamicDynamiclogic,specializedcellslogic,specializedcells

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives

•• ExperimentalResultsExperimentalResults

•• ConclusionsConclusions

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors–– Medium‐toHigh‐PerformanceMedium‐toHigh‐Performance

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming[3][3]–– MostgeneralGALStimingmodelMostgeneralGALStimingmodel

–– SupportmultiplesynchronousdomainswithunrelatedclockingSupportmultiplesynchronousdomainswithunrelatedclocking

–– PromotesreuseofIntellectualProperty(IP)modulesPromotesreuseofIntellectualProperty(IP)modules

[3]D.Messerschmitt,[3]D.Messerschmitt,““SynchronizationinDigitalSystemDesignSynchronizationinDigitalSystemDesign””,,IEEEJournalonSelectedAreasinCommunications,October1990IEEEJournalonSelectedAreasinCommunications,October1990

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• TransitionSignalingTransitionSignaling(Two‐Phase)(Two‐Phase)–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““four‐phasehandshakingfour‐phasehandshaking””

•• 2roundtriplinkcommunications2roundtriplinkcommunicationspertransactionpertransaction

–– BenefitsofTwo‐Phase:BenefitsofTwo‐Phase:•• 1roundtriplinkcommunication1roundtriplinkcommunicationpertransactionpertransaction

•• improvedthroughput,powerimprovedthroughput,power……..

–– ChallengeofTwo‐Phase:ChallengeofTwo‐Phase:designinglightweightimplementationsdesigninglightweightimplementations•• Mostexisting2‐phasedesignsuse:Mostexisting2‐phasedesignsuse:

–– complexslowregisters:doublelatch,double‐edge‐triggered,capture/passcomplexslowregisters:doublelatch,double‐edge‐triggered,capture/pass»» [Seitz/Su[Seitz/Su““MosaicMosaic””93,93,BrunvandBrunvand91,Sutherland89]91,Sutherland89]

–– customcircuitcomponentscustomcircuitcomponents

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData–– MostexistingGALSMostexistingGALSNOCsNOCsuseuse““delay‐insensitivedelay‐insensitive””linkencodingslinkencodings

•• providegreattiming‐robustness==>cost=providegreattiming‐robustness==>cost=poorcodingefficiencypoorcodingefficiency

•• examples:dual‐rail,1‐of‐4examples:dual‐rail,1‐of‐4

–– ““Single‐RailBundledDataSingle‐RailBundledData””benefits:benefits:•• re‐usesynchronousre‐usesynchronousdatapathsdatapaths::1wire/bit1wire/bit++addedadded““requestrequest””•• excellentcodingefficiencyexcellentcodingefficiency

–– Challenge:requiresmatcheddelayforChallenge:requiresmatcheddelayfor““requestrequest””signalsignal•• 1‐sidedtimingconstraint:1‐sidedtimingconstraint:““requestrequest””mustarriveafterdatastablemustarriveafterdatastable

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance–– LowSystem‐LevelLatencyLowSystem‐LevelLatency

•• minimizeend‐to‐enddelayunderlighttomoderatetrafficminimizeend‐to‐enddelayunderlighttomoderatetraffic

–– HighSustainedThroughputHighSustainedThroughput•• maximizesteady‐statethroughputunderheavytrafficmaximizesteady‐statethroughputunderheavytraffic

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance

•• StandardCellMethodologyStandardCellMethodology–– UseexistingstandardcelllibrariesUseexistingstandardcelllibraries

•• onlyexception:onlyexception:analogarbitercircuitanalogarbitercircuit

–– Challenge:Challenge:timinganalysisusingexistingtoolstiminganalysisusingexistingtools

TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Shared‐MemoryChipMultiprocessorsShared‐MemoryChipMultiprocessors

•• ““HeterochronousHeterochronous””TimingTiming

•• Transition(Two‐Phase)SignalingTransition(Two‐Phase)Signaling

•• Single‐RailBundledDataSingle‐RailBundledData

•• HighPerformanceHighPerformance

•• StandardCellMethodologyStandardCellMethodology

•• Fine‐GrainedFine‐GrainedNetworkTopologyNetworkTopology–– LightweightnetworknodesLightweightnetworknodes

•• low‐functionalitylow‐functionalitylow‐radixroutercomponentslow‐radixroutercomponents

•• avoids5‐portrouterwithNorth/South/East/West/Localportsavoids5‐portrouterwithNorth/South/East/West/Localports

TargetGALSNetworkDesignTargetGALSNetworkDesign

OutlineOutline

•• IntroductionIntroduction•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:XMTProcessor/Background:XMTProcessor/MoTMoTNetworkNetwork–– eXpliciteXplicitMulti‐Threading(XMT)ArchitectureMulti‐Threading(XMT)Architecture–– Mesh‐of‐Trees(Mesh‐of‐Trees(MoTMoT)NetworkTopology)NetworkTopology–– SynchronousRouterNodesSynchronousRouterNodes

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions

XMTParallelArchitectureXMTParallelArchitecture

•• XMT=XMT=““eXpliciteXplicitMulti‐ThreadingMulti‐Threading””(1997‐present)[4](1997‐present)[4]–– LedbyProf.UziLedbyProf.UziVishkinVishkinatUniversityofMaryland,CollegeParkatUniversityofMaryland,CollegePark

•• BasedonParallelRandomAccessModel(PRAM)BasedonParallelRandomAccessModel(PRAM)–– LargestbodyofparallelalgorithmictheoryLargestbodyofparallelalgorithmictheory

•• EaseofProgrammabilityEaseofProgrammability–– XMT‐ClanguageXMT‐Clanguage+optimizingcompiler+optimizingcompiler–– Single‐ProgramMultiple‐Data(SPMD)programmingmethodologySingle‐ProgramMultiple‐Data(SPMD)programmingmethodology

•• DemonstratedtoProvideSignificantSpeedupsDemonstratedtoProvideSignificantSpeedups–– Performswellonirregularcomputations(BFS,ray‐tracing)Performswellonirregularcomputations(BFS,ray‐tracing)–– 100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]100xspeedupforVHDLcircuitsimulationscomparedtoserial[5]

[4]D.Naishlos,J.Nuzman,C.‐W.Tseng,andU.Vishkin.“Towardsafirstverticalprototypingofanextremelyfine‐grainedparallelprogrammingapproach”,SPAA2001

[5]P.[5]P.GuGuandU.andU.VishkinVishkin,,““Casestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipCasestudyofgate‐levellogicsimulationonanextremelyfine‐grainedchipmultiprocessormultiprocessor””,,JournalofEmbeddedComputing,April2006JournalofEmbeddedComputing,April2006

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupofsimplepipelinedcores,Groupofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization

–– ““IOSIOS””=independence‐of‐order=independence‐of‐ordersemantics:semantics:noWAW/WAR/RAWnoWAW/WAR/RAWdatahazardsbetweenthreadsdatahazardsbetweenthreads

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittlewithlittleornosynchronizationornosynchronization

•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache

–– NocachecoherenceproblemNocachecoherenceproblem

XMTParallelArchitectureXMTParallelArchitecture

•• ProcessingClustersProcessingClusters–– Groupsofsimplepipelinedcores,Groupsofsimplepipelinedcores,e.g.e.g.16ThreadControlUnits(TCU)16ThreadControlUnits(TCU)

–– EachTCUexecutestocompletionEachTCUexecutestocompletionwithlittletonosynchronizationwithlittletonosynchronization

•• DistributedCachesDistributedCaches–– SharedglobalL1datacacheSharedglobalL1datacache

–– NocachecoherenceproblemNocachecoherenceproblem

•• NOCChallenge:NOCChallenge:highbandwidth/lowpowerrequirementshighbandwidth/lowpowerrequirements–– Manyconcurrentmemoryrequests(load/store)Manyconcurrentmemoryrequests(load/store)–– Shortpackets:1‐2flits/dynamically‐varyingtrafficShortpackets:1‐2flits/dynamically‐varyingtraffic–– LowlatencyrequiredforsystemperformanceLowlatencyrequiredforsystemperformance

ProposedXMTParallelArchitecture:ProposedXMTParallelArchitecture:withGALSInterconnectionNetworkwithGALSInterconnectionNetwork

GALSNetwork…

Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology

•• VariantofclassicVariantofclassicMoTMoT

•• Nfan‐outtreesNfan‐outtrees–– RoutingonlyRoutingonly

–– RootatsourceterminalsRootatsourceterminals

•• Nfan‐intreesNfan‐intrees–– ArbitrationonlyArbitrationonly

–– RootatdestinationterminalsRootatdestinationterminals

$%&'(%)('*+,*-('+.

Mesh‐of‐TreesNetworkTopologyMesh‐of‐TreesNetworkTopology•• HighThroughputHighThroughput

–– Uniqueroutingpaths(source/sink)Uniqueroutingpaths(source/sink)–– AvoidsinterferencepenaltiesAvoidsinterferencepenalties

•• FixedPathLengthFixedPathLength–– LogarithmicdepthLogarithmicdepth

•• DistributedLow‐RadixRoutingDistributedLow‐RadixRouting–– LimitedfunctionalitynodesLimitedfunctionalitynodes–– WormholedeterministicroutingWormholedeterministicrouting

•• ShowntoPerformWellforShowntoPerformWellforCMPsCMPs–– Providesveryhighsustainedthroughput[6]Providesveryhighsustainedthroughput[6]–– Highsaturationthroughput:Highsaturationthroughput:~91%~91%

[6]A.O.Balkan,G.[6]A.O.Balkan,G.QuQu,U.,U.VishkinVishkin,,““Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐Mesh‐of‐Treesandalternativeinterconnectionnetworksforsingle‐chipparallelismchipparallelism””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009IEEETransactionsonVeryLargeScaleIntegrationSystems,April2009

SynchronousRoutingPrimitiveSynchronousRoutingPrimitive•• Fan‐OutComponentFan‐OutComponent[7][7]

–– 1Input,2Outputs1Input,2Outputs

–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism

•• SignaltopreviousstagewhenSignaltopreviousstagewhennewdatacanbeacceptednewdatacanbeaccepted

•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””[[CarloniCarlonietal.,TCAD01]etal.,TCAD01]

–– 2‐RegisterFIFO:2‐RegisterFIFO:B0,B1B0,B1

–– Allows1flit/cycleinsteady‐stateAllows1flit/cycleinsteady‐state•• AcceptnewdataandforwardstoreddataconcurrentlyAcceptnewdataandforwardstoreddataconcurrently

–– Cost:Cost:1extraauxiliaryregister1extraauxiliaryregister((flipflop‐basedflipflop‐based))

[7][7]A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)

SynchronousArbitrationPrimitiveSynchronousArbitrationPrimitive

•• Fan‐InComponentFan‐InComponent[7][7]–– 2Inputs,1Output2Inputs,1Output

–– SynchronousFlowControlSynchronousFlowControl•• Back‐pressuremechanismBack‐pressuremechanism

•• BasedonBasedon““Latency‐InsensitiveDesignLatency‐InsensitiveDesign””–– 2‐Stage2‐StageFIFOsFIFOsateachinputportateachinputport–– Whenempty,latency=1cycleWhenempty,latency=1cycle

–– Whenstalled,latency=2+cyclesWhenstalled,latency=2+cycles•• Dependsonback‐pressureandsynchronousarbitrationDependsonback‐pressureandsynchronousarbitration

–– Cost:Cost:totalof4registerstotalof4registers(flip‐flopbased)(flip‐flopbased)

[7}[7}A.O.Balkan,G.Qu,U.Vishkin.““AMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelAMesh‐of‐TreesInterconnectionNetworkforSingle‐ChipParallelProcessingProcessing””,,IEEEASAPSymposium(2006)IEEEASAPSymposium(2006)

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives–– Routingprimitive(Fan‐out)Routingprimitive(Fan‐out)

–– Arbitrationprimitive(Fan‐in)Arbitrationprimitive(Fan‐in)–– Mixed‐timinginterfacesMixed‐timinginterfaces

•• ExperimentalResultsExperimentalResults

•• ConclusionsConclusions

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Binary Routing SignalBinary Routing Signal

NewRoutingPrimitiveNewRoutingPrimitive

Req0Ack0Data0

Req1Ack1Data1

ReqAck

B(oolean) ‏

Data_In

Data ChannelsData Channels

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

LatchLatchControllersControllers

Req0Req1Ack

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req Ack Req1 Ack1

Latch ControllerLatch Controller

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

Normally OpaqueNormally OpaqueLatch RegistersLatch Registers

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

<)(3+=><)(3+=>

Latch Control 0

Toggle 0

LATCH

Req0

AckReq Ack0

Latch Control 1

Toggle 1

LATCH

Req1

AckReq Ack1

B

B

Data1

Data0

Data_In

4)()/)+5/6/7'.+)84)()/)+5/6/7'.+)8)%%'23/96:";)%%'23/96:";

,*-('+./0%'1'('23,*-('+./0%'1'('23

Req0Req1Ack

?@%*-.@A-(?@%*-.@A-(

/0,!()'1$/0,!()'1$((.$-*'.$-*'

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

Handshaking Signals (Request / Acknowledge)Handshaking Signals (Request / Acknowledge)

NewArbitrationPrimitiveNewArbitrationPrimitive

Req0Ack0Data0

Req_OutAck_InData_Out

Req1Ack1Data1

Data ChannelsData Channels

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Mutual ExclusionMutual ExclusionElement (Element (MutexMutex))

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Request ProtectionRequest ProtectionLatchesLatches(Normally Opaque)(Normally Opaque) ‏‏

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Data + Request Latch RegisterData + Request Latch Register (only one bank of latches required)(only one bank of latches required)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Acknowledgment Protection LatchesAcknowledgment Protection Latches(normally transparent)(normally transparent)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

<)(3+=><)(3+=>

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

<)(3+=><)(3+=>

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'( <)(=@/D*+(%*883%<)(=@/D*+(%*883%

4)()A)(@4)()A)(@

F3C/5)()/)%%'237GF3C/5)()/)%%'237GH*88*C35/&>/,3I-37(JH*88*C35/&>/,3I-37(J<K/'7/'+'(')88>/*A)I-3J<K/'7/'+'(')88>/*A)I-3J

$%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

?@%*-.@A-(?@%*-.@A-((bestcase,(bestcase,alternatingalternatinginputs)inputs)

/0,!()'1$/0,!()'1$((.$-*'.$-*'

WormholeRoutingCapabilityWormholeRoutingCapability

•• Goal:Goal:supporttransmissionofmulti‐flitpacketssupporttransmissionofmulti‐flitpackets–– example:XMTexample:XMT””storestorepacketspackets””=2flits=2flits(address+data)(address+data)

•• Solution:Solution:add1extraadd1extra““gluebitgluebit””toeachflittoeachflit–– Gluebit=1Gluebit=1notlastflitinpacketnotlastflitinpacket

–– Enhancedarbitrationprimitive:Enhancedarbitrationprimitive:biasbiasmutexmutexdecisiondecision•• ““winner‐take‐allwinner‐take‐all””strategy[strategy[Dally/TowlesDally/Towles]]

•• headerflittakesoverheaderflittakesovermutexmutex:glue=1:glue=1

•• lastflitreleaseslastflitreleasesmutexmutex:glue=:glue=00

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

L8

L9

Req_Out

Ack_In

Data0Data1

Glue0

Glue1

Mux_Select

Data_OutLATCH

L+@)+=35/B8*C/D*+(%*8/E+'(L+@)+=35/B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

glue0 bit

glue1 bit

WormholeControl

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

-- Can beCan be inserted for buffering:inserted for buffering: to improve system-level throughput to improve system-level throughput-- Basis for Basis for design of new fan-in/fan-out primitivesdesign of new fan-in/fan-out primitives

[8]M.SinghandS.M.Nowick.“MOUSETRAP:High‐SpeedTransition‐SignalingAsynchronousPipelines,”IEEETransactionsonVLSISystems,vol.15:11,pp.1256‐1269(Nov.2007)

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

Handshaking Signals (Request and Acknowledgment)Handshaking Signals (Request and Acknowledgment)

LinearPipelinePrimitiveLinearPipelinePrimitive

Req_OutAck_InData_Out

ReqAckData

Data ChannelsData Channels

Mixed‐TimingInterfacesMixed‐TimingInterfaces

•• UseExistingSynchronizingUseExistingSynchronizingFIFOsFIFOs[9][9](withsmall(withsmallmodifications)modifications)

–– SupportsSupports““heterochronousheterochronous””timingdomainstimingdomains–– NomodificationtoexistingcomponentsNomodificationtoexistingcomponents

•• ModularDesignModularDesign–– ReusableReusablePutPutandandGetGetcomponents(eithercomponents(eitherAsyncAsyncorSync)orSync)–– EachFIFOisarrayofidenticalcellsEachFIFOisarrayofidenticalcells

•• SupportsLow‐PowerOperationSupportsLow‐PowerOperation–– CircularFIFO:datadoesnotmoveCircularFIFO:datadoesnotmove

[9]T.[9]T.ChelceaChelceaandS.Nowick,andS.Nowick,““RobustInterfacesforMixed‐TimingSystemsRobustInterfacesforMixed‐TimingSystems””,,IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004IEEETransactionsonVeryLargeScaleIntegrationSystems,August2004

OutlineOutline

•• IntroductionIntroduction

•• TargetGALSNetworkDesignTargetGALSNetworkDesign

•• Background:Background:XMTProcessor/XMTProcessor/MoTMoTNetworkNetwork

•• AsynchronousNetworkPrimitivesAsynchronousNetworkPrimitives

•• ExperimentalResultsExperimentalResults•• ConclusionsConclusions

EvaluationMethodologyEvaluationMethodology

•• DirectComparisonwithSynchronousDirectComparisonwithSynchronousMoTMoTNetworkNetwork–– IdenticalTechnology:IdenticalTechnology:IBM90nmCMOSprocessIBM90nmCMOSprocess

–– IdenticalFunctionalityIdenticalFunctionality:Sameroutingandarbitrationprimitives:Sameroutingandarbitrationprimitives

–– IdenticalTopology:IdenticalTopology:8‐terminalnetworkswithsame8‐terminalnetworkswithsamefloorplanfloorplan

•• EvaluateatMultipleLevelsofIntegrationEvaluateatMultipleLevelsofIntegration–– IsolatedAsynchronousPrimitivesIsolatedAsynchronousPrimitives(post‐layout)(post‐layout)

–– 8‐TerminalAsynchronousNetwork8‐TerminalAsynchronousNetwork(pre‐layoutwithwireestimates,(pre‐layoutwithwireestimates,‐‐interconnectionoflaid‐outrouterprimitives)‐‐interconnectionoflaid‐outrouterprimitives)

–– 8‐TerminalGALSNetwork8‐TerminalGALSNetwork

–– XMTArchitectureCo‐SimulationonParallelKernelsXMTArchitectureCo‐SimulationonParallelKernels

ToolFlowToolFlow

•• ImplementedinIBM90nmtechnologyImplementedinIBM90nmtechnology–– PlacedandroutedwithCadenceSOCEncounterPlacedandroutedwithCadenceSOCEncounter

–– Simulatedasgate‐levelSimulatedasgate‐levelVerilogVerilogwithextracteddelayswithextracteddelays

•• StandardCellMethodologyStandardCellMethodology–– ARM90nmStandardCells(IBMCMOS9SF)ARM90nmStandardCells(IBMCMOS9SF)

•• Exception:MutualExclusionElementException:MutualExclusionElement–– DesignedusingtransistormodelsfromIBM90nmPDKDesignedusingtransistormodelsfromIBM90nmPDK

–– SimulatedinCadenceSimulatedinCadenceSpectreSpectre

–– MeasureddelaystocalibrateMeasureddelaystocalibrateVerilogVerilogbehavioralmodelbehavioralmodel

RoutingPrimitiveComparison:RoutingPrimitiveComparison:AreaandPowerAreaandPower

225.61.822.062.06988.6988.6Synchronous

0.60.560.370.37358.4358.4Asynchronous

((μμW)W)((μμW)W)((pJpJ))((μμmm22))

IdleIdlePowerPower

LeakageLeakagePowerPower

Energy/Energy/PacketPacketAreaArea

•• Area:Area:–– 64%lessarea:64%lessarea:resultoflightweightresultoflightweightdatastoragedatastorage

•• 2flip‐flopregisters+extraMUX/DEMUX(sync)2flip‐flopregisters+extraMUX/DEMUX(sync)vsvs..2latchregisters(2latchregisters(asyncasync))

•• MUX/DEMUXoverhead(sync)MUX/DEMUXoverhead(sync)

•• Energy/Packet(1flit):Energy/Packet(1flit):–– 82%lessenergyperpacket82%lessenergyperpacket–– Steady‐statemeasurementonrandomtrafficSteady‐statemeasurementonrandomtraffic

RoutingPrimitiveComparison:RoutingPrimitiveComparison:LatencyandThroughputLatencyandThroughput

1.931.931.931.931.931.93516516Synchronous

1.701.701.341.341.071.07546546Asynchronous

AlternatingAlternatingRandomRandomSingleSingle((psps))

Maximum Throughput Maximum Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type

•• Synchronous:Synchronous:UsingMaxClockRateUsingMaxClockRate(1.93GHz)(1.93GHz)

•• Latency:Latency:–– //546546psps((asyncasync)vs.516)vs.516psps(sync)(sync)

•• MaxThroughput(Giga‐flits/sec):MaxThroughput(Giga‐flits/sec):–– Single‐portedtraffic:Single‐portedtraffic:MMNMMNofsyncmax.ofsyncmax.(noconcurrency)(noconcurrency)

–– Randomtraffic:Randomtraffic:O"NO"Nofsyncmax.ofsyncmax.–– Alternatingtraffic:Alternatingtraffic:PPNPPNofsyncMax.ofsyncMax.(mostconcurrency)(mostconcurrency)

……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages

ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:AreaandPowerAreaandPower

388.64.133.533.532240.32240.3Synchronous

0.50.500.330.33349.3349.3Asynchronous

((μμW)W)((μμW)W)((pJpJ))((μμmm22))

IdleIdlePowerPower

LeakageLeakagePowerPower

Energy/Energy/PacketPacketAreaAreaComponent TypeComponent Type

•• Area:Area:–– 84%lessarea84%lessarea–– Duetolow‐overheaddatastorageDuetolow‐overheaddatastorage

•• 4flip‐flopregisters(sync)4flip‐flopregisters(sync)vs.vs.11latchregister(latchregister(asyncasync))

•• Energy/Packet(1flit):Energy/Packet(1flit):–– 91%lessenergyperpacket91%lessenergyperpacket–– Measuredsteady‐statepacketsarrivingatbothinputportsMeasuredsteady‐statepacketsarrivingatbothinputports

ArbitrationPrimitiveComparison:ArbitrationPrimitiveComparison:LatencyandThroughputLatencyandThroughput

2.092.092.092.09474474Synchronous

2.042.041.081.08489489Asynchronous

Both PortsBoth PortsSingleSingle((psps))

Max. Throughput Max. Throughput (GFPS)(GFPS)LatencyLatencyComponent TypeComponent Type

•• Synchronous:UsingMaxClockRate(2.09GHz)Synchronous:UsingMaxClockRate(2.09GHz)

•• Latency:Latency:–– 489489psps((asyncasync))vs.474vs.474psps(sync)(sync)

•• Max.Throughput(Giga‐flits/sec):Max.Throughput(Giga‐flits/sec):–– SinglePortonly:SinglePortonly:M!NM!Nofsynchronousmax.ofsynchronousmax.–– TrafficatBothPorts:TrafficatBothPorts:QPNQPNofsynchronousmax.ofsynchronousmax.

……expectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstagesexpectsignificantfutureimprovementsbyinsertingsmall#ofFIFOstages

8‐TerminalNetworkEvaluation8‐TerminalNetworkEvaluation

•• Head‐on‐HeadComparisonwithSyncNetworkHead‐on‐HeadComparisonwithSyncNetwork

•• ProjectedNetworkLayoutProjectedNetworkLayout–– Pre‐layoutPre‐layoutasyncasyncnetworknetwork

–– UsesUsespost‐layoutprimitivespost‐layoutprimitives,treatedashardIPmacros,with,treatedashardIPmacros,withassignedwiredelaysassignedwiredelays

–– ExtrapolatewiredelaysbasedonASICExtrapolatewiredelaysbasedonASICfloorplanfloorplanofSyncofSyncMoTMoT

•• ExperimentalSetupExperimentalSetup–– EvaluateperformanceunderEvaluateperformanceunderuniformlyrandominputtrafficuniformlyrandominputtraffic

–– 32‐bitflits32‐bitflits

Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout

•• BasedonBasedonFloorplanFloorplanofSynchronousofSynchronousMoTMoTTestASICTestASIC–– Designed/fabricatedatUMDinMarch2007Designed/fabricatedatUMDinMarch2007[10][10]

•• Networkdividedinto4partitions(P0,P1,P2,P3)Networkdividedinto4partitions(P0,P1,P2,P3)–– Fan‐InTreesexistentirelywithinonepartitionFan‐InTreesexistentirelywithinonepartition–– Fan‐OutTreesdistributedamongpartitionsFan‐OutTreesdistributedamongpartitions

•• AsynchronousProjectionMethodologyAsynchronousProjectionMethodology–– TreatasynchronousprimitivesareTreatasynchronousprimitivesarehardIPmacroshardIPmacros

•• allrouting,arbitrationprimitiveshavesametimingallrouting,arbitrationprimitiveshavesametiming

–– EvenlydistributeEvenlydistributegroupsofprimitivesgroupsofprimitives–– AssignAssigninter‐primitivewiredelaysinter‐primitivewiredelaysbasedonpositionbasedonposition

•• delaysonwiresassignedbasedontechnologyspecificationsdelaysonwiresassignedbasedontechnologyspecifications

[10]A.O.Balkan,M.N.[10]A.O.Balkan,M.N.HorakHorak,G.,G.QuQu,U.,U.VishkinVishkin..““Layout‐accuratedesignandimplementationofahigh‐Layout‐accuratedesignandimplementationofahigh‐throughputinterconnectionnetworkforsingle‐chipparallelprocessingthroughputinterconnectionnetworkforsingle‐chipparallelprocessing””,,HotInterconnects,August2007HotInterconnects,August2007

Projected8‐TerminalNetworkLayoutProjected8‐TerminalNetworkLayout

ExampleFan-Out Tree

P0P0

P1P1

P2P2

P3P3

CurrentCADToolFlows:SyncCurrentCADToolFlows:Syncvsvs..AsyncAsync

•• SynchronousSynthesis:SynchronousSynthesis:–– Automaticplace/routeoptimizationsAutomaticplace/routeoptimizations

–– Includescellresizing/repeaterinsertionIncludescellresizing/repeaterinsertion

•• AsynchronousSynthesis:AsynchronousSynthesis:–– Limitedoptimization:Limitedoptimization:hardmacros+regularmanualplacementhardmacros+regularmanualplacement–– Nocellresizing/repeaterinsertionNocellresizing/repeaterinsertion……muchpotentialforfutureperformanceimprovementmuchpotentialforfutureperformanceimprovement

•• CurrentlyDoNotDefineNecessaryTimingConstraintsCurrentlyDoNotDefineNecessaryTimingConstraints–– Noautomaticpath‐lengthmatchingNoautomaticpath‐lengthmatching–– NecessarytoenforcebundlingconstraintNecessarytoenforcebundlingconstraint

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:400MHzSyncvs.400MHzSyncvs.AsyncAsync

Comparablethroughputfor entire

range of Sync

Sync hasat least 4.3x

higher latencyfor all Syncinput rates

Sync Max.Input Rate:102.4 Gbps

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:800MHzSyncvs.800MHzSyncvs.AsyncAsync

Comparablethroughputfor entire

range of Sync

Sync has>1.7x

higher latencyfor input rates

up to 73%of Sync max.(150 Gbps)

Sync Max.Input Rate:204.8 Gbps

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

AsyncAsyncNetworkPerformanceComparison:NetworkPerformanceComparison:1.36GHzSyncvs.1.36GHzSyncvs.AsyncAsync

Comparablethroughput

for ratesup to

55% of Sync max.(190 Gbps)

Lower latencyfor input

rates up to 43% of

Sync max.(150 Gbps)

Note: sync max. input rate limited by clock frequencyNote: sync max. input rate limited by clock frequency

Sync Max.Input Rate:348.2 Gbps

GALSNetworkPerformanceComparisonGALSNetworkPerformanceComparison

•• ExperimentalSetupExperimentalSetup–– CreateterminalstogeneratetrafficandrecordmeasurementsCreateterminalstogeneratetrafficandrecordmeasurements

–– TerminalsgenerateTerminalsgenerateuniformlyrandominputtrafficuniformlyrandominputtraffic

•• ResultsNormalizedtoClockRateResultsNormalizedtoClockRate–– ThroughputunitsThroughputunits(normalized)(normalized)::flitspercycleperportflitspercycleperport

–– LatencyunitsLatencyunits(normalized)(normalized)::#clockcycles#clockcycles

–– Syncnetworkresults:Syncnetworkresults:alwayssamealwayssamerelativetoclockcyclesrelativetoclockcycles

–– AsyncAsyncnetworkresults:networkresults:varyvarywithclockratewithclockrate

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:400MHzGALSvs.Sync400MHzGALSvs.Sync

Comparablethroughput

for alltraffic rates

Sync has52% higher

latencyup to 80%input traffic

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:600MHzGALSvs.Sync600MHzGALSvs.Sync

Comparablethroughputup to 65%input traffic

Lower latencyup to 60%input traffic

GALSNetworkPerformanceComparison:GALSNetworkPerformanceComparison:800MHzGALSvs.Sync800MHzGALSvs.Sync

Comparablethroughputup to 52%input traffic

Lower latencyup to 29%

input traffic,comparable

latencyup to 40%input traffic

XMTParallelKernelSimulationsXMTParallelKernelSimulations

•• Goal:IntegratewithSynchronousXMTParallelArchitectureGoal:IntegratewithSynchronousXMTParallelArchitecture–– XMTXMTVerilogVerilogRTLdescriptionwithGALSnetworkRTLdescriptionwithGALSnetwork

•• XMTParallelKernelsXMTParallelKernels–– ArraySummation(add)ArraySummation(add)

•• Computesumof3millionelementsinarrayComputesumof3millionelementsinarray

–– MatrixMultiplication(MatrixMultiplication(mmulmmul))•• Computeproductoftwo64x64matricesComputeproductoftwo64x64matrices

–– Breadth‐FirstSearch(Breadth‐FirstSearch(bfsbfs))•• RunXMTBFSalgorithmwith100,000verticesand1millionedgesRunXMTBFSalgorithmwith100,000verticesand1millionedges

–– ArrayIncrement(ArrayIncrement(a_inca_inc))•• Incrementall32kelementsofanarrayIncrementall32kelementsofanarray

XMTParallelKernelSimulationsXMTParallelKernelSimulations

•• XMTProcessorConfigurationXMTProcessorConfiguration–– 8ProcessingClusters(168ProcessingClusters(16TCUsTCUseach)=each)=128128TCUTCU’’sstotaltotal

–– 8DistributedL1D‐CacheModules(64KBtotal)8DistributedL1D‐CacheModules(64KBtotal)

•• SimulateGALSXMTatDifferentClockFrequenciesSimulateGALSXMTatDifferentClockFrequencies–– 200,400,700MHz200,400,700MHz

•• CompareSpeedupsRelativetoSynchronousXMTCompareSpeedupsRelativetoSynchronousXMT–– Valuesgreaterthan1.0indicatebetterperformanceValuesgreaterthan1.0indicatebetterperformance

GALSXMTPerformanceComparisonGALSXMTPerformanceComparison

GALS XMThas similar

performance for200, 400 MHz

Only moderate degradationat 700 MHz(a_inc: 37%

decrease)

(Graph arranged in order of increasing network utilization)(Graph arranged in order of increasing network utilization)

ConclusionsConclusions•• NewGALSNetworkforChipMultiprocessorsNewGALSNetworkforChipMultiprocessors

–– Low‐overheadnetworkforLow‐overheadnetworkfor““heterochronousheterochronous””InterfacesInterfaces

•• DesignofTwoNewAsynchronousRouterCellsDesignofTwoNewAsynchronousRouterCells–– RoutingRoutingandandarbitrationarbitrationcircuitscircuits

•• OverviewofResultsOverviewofResults–– RouterPrimitivesRouterPrimitives

•• 64‐84%lessarea,82‐91%lessenergy/packet64‐84%lessarea,82‐91%lessenergy/packet•• Latency&throughput(forbalancedtraffic)=~2Latency&throughput(forbalancedtraffic)=~2Gflits/secGflits/sec

–– System‐LevelPerformanceSystem‐LevelPerformance•• AsyncAsyncnetworkcomparisonwith800MHzsyncnetwork:networkcomparisonwith800MHzsyncnetwork:

–– ComparablethroughputComparablethroughputacrossallinputtrafficacrossallinputtraffic

–– 1.7xlowerlatency1.7xlowerlatencyuptoupto73%maxinputtraffic73%maxinputtraffic

•• GALSnetworkcomparisonwith800MHzsyncnetwork:GALSnetworkcomparisonwith800MHzsyncnetwork:–– ComparablethroughputComparablethroughputuptoupto52%maxinputtraffic52%maxinputtraffic

–– LowerlatencyLowerlatencyuptoupto29%maxinputtraffic29%maxinputtraffic

FutureDirectionsFutureDirections•• ArchitecturalOptimizationArchitecturalOptimization

–– InsertlinearpipelinestagesonlongwirestoimprovethroughputInsertlinearpipelinestagesonlongwirestoimprovethroughput

•• CircuitOptimizationCircuitOptimization–– Improvedesignsofrouting/arbitrationprimitivesImprovedesignsofrouting/arbitrationprimitives–– Mixed‐timingFIFOoptimizationsMixed‐timingFIFOoptimizations

•• AsynchronousTopologyOptimizationAsynchronousTopologyOptimization–– AreaimprovementsAreaimprovementsusinghybridusinghybridMoT‐ButterflyMoT‐Butterfly[Balkanetal.,DAC‐08][Balkanetal.,DAC‐08]

•• IntegratewithSynchronousPhysicalCADToolFlowIntegratewithSynchronousPhysicalCADToolFlow–– GoalGoal=leverage=leverageexistingcommercialexistingcommercialtechniquestechniques

•• TimingconstraintspecificationandsynthesisofTimingconstraintspecificationandsynthesisofunclockedunclockedtimingpathstimingpaths

•• BuildonautomatedBuildonautomatedasyncasyncflowofflowof[[Quinton/Greenstreet/WiltonQuinton/Greenstreet/WiltonTVLSITVLSI‘‘08]08]

•• Optimizedplacement,routing,gateresizingandrepeaterinsertionOptimizedplacement,routing,gateresizingandrepeaterinsertion

•• TargetAlternativeParallelArchitectures/MemorySystemsTargetAlternativeParallelArchitectures/MemorySystems

BACKUPSLIDESBACKUPSLIDES

TypesofMixed‐Timing(GALS)SystemsTypesofMixed‐Timing(GALS)Systems

•• PseudochronousPseudochronous–– SameFrequency,ConstantPhaseDifferenceSameFrequency,ConstantPhaseDifference

•• MesochronousMesochronous–– SameFrequency,UndefinedPhaseDifferenceSameFrequency,UndefinedPhaseDifference

•• PlesiochronousPlesiochronous–– NearlyexactFrequencyandPhaseDifferenceNearlyexactFrequencyandPhaseDifference

•• HeterochronousHeterochronous–– UndefinedFrequencyandPhaseDifferenceUndefinedFrequencyandPhaseDifference

MOUSETRAPAsynchronousPipelinesMOUSETRAPAsynchronousPipelines

•• FastCommunicationFastCommunication–– TransitionTransitionsignalingsignaling(2‐phase)handshaking(2‐phase)handshaking ‏‏

•• Synchronous‐StyleChannelEncodingSynchronous‐StyleChannelEncoding–– Single‐railbundleddataprotocolSingle‐railbundleddataprotocol

•• LowLatencyLowLatency–– 1TransparentDLatchdelayforemptystage1TransparentDLatchdelayforemptystage

•• Minimal‐OverheadLatchControllerMinimal‐OverheadLatchController–– 1XNORGate1XNORGate

reqN

ackN-1

reqN+1

ackN

Data Latch

Latch Controller

doneN

Data in Data out

Stage NStage N-1 Stage N+1

En

MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)

Stagescommunicateusingtransition-signaling:

reqN

ackN-1

reqN+1

ackN

Data Latch

Latch Controller

doneN

Data in Data out

Stage NStage N-1 Stage N+1

En

MOUSETRAP:ABasicFIFOMOUSETRAP:ABasicFIFO(nocomputation)(nocomputation)

Stagescommunicateusingtransition-signaling:

1 transition1 transitionper data item!per data item!

One Data Item

BasicMixed‐ClockFIFO(Sync‐Sync)BasicMixed‐ClockFIFO(Sync‐Sync)

cell cell cell cell cell

Get

Con

trolle

r

Empty Detector

Full DetectorPut

Controller

full

req_put

data_putCLK_put

CLK_getdata_getreq_get

valid_getempty

•• Sync‐SyncFIFOSync‐SyncFIFO:usesSynchronous:usesSynchronousPutPutandandGetGetModulesModules–– Sync‐Syncisoneof4mixed‐timingSync‐Syncisoneof4mixed‐timingFIFOsFIFOs

•• MixedMixedAsyncAsync+Sync+SyncFIFOFIFO’’ss:modularchanges:modularchanges–– Sync‐AsyncSync‐Async::usesSynchronoususesSynchronousPutPut(top)andAsynchronous(top)andAsynchronousGetGet–– Async‐SyncAsync‐Sync::usesSynchronoususesSynchronousGetGet(bottom)andAsynchronous(bottom)andAsynchronousPutPut


Top Related