
Whitepaper

NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110

The Fastest, Most Efficient HPC Architecture Ever Built

V1.0

Table of Contents

Kepler GK110 - The Next Generation GPU Computing Architecture
Kepler GK110 - Extreme Performance, Extreme Efficiency
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit
    NVIDIA GPUDirect
An Overview of the GK110 Kepler Architecture
    Performance per Watt
    Streaming Multiprocessor (SMX) Architecture
        SMX Processing Core Architecture
        Quad Warp Scheduler
        New ISA Encoding: 255 Registers per Thread
        Shuffle Instruction
        Atomic Operations
        Texture Improvements
    Kepler Memory Subsystem - L1, L2, ECC
        64 KB Configurable Shared Memory and L1 Cache
        48 KB Read-Only Data Cache
        Improved L2 Cache
        Memory Protection Support
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit - Efficiently Keeping the GPU Utilized
    NVIDIA GPUDirect
Conclusion
Appendix A - Quick Refresher on CUDA
    CUDA Hardware Execution

Kepler GK110 - The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems. By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.

Kepler GK110 - Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market. Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60-65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

Kepler GK110 die photo

The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments ranging from personal workstations to supercomputers:

- Dynamic Parallelism - adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

- Hyper-Q - enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.

- Grid Management Unit - enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

- NVIDIA GPUDirect - a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
- Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 full-chip block diagram

Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                       FERMI     FERMI     KEPLER    KEPLER
                                       GF100     GF104     GK104     GK110
Compute Capability                     2.0       2.1       3.0       3.5
Threads / Warp                         32        32        32        32
Max Warps / Multiprocessor             48        48        64        64
Max Threads / Multiprocessor           1536      1536      2048      2048
Max Thread Blocks / Multiprocessor     8         8         16        16
32-bit Registers / Multiprocessor      32768     32768     65536     65536
Max Registers / Thread                 63        63        63        255
Max Threads / Thread Block             1024      1024      1024      1024
Shared Memory Size Configurations      16K       16K       16K       16K
(bytes)                                48K       48K       32K       32K
                                                           48K       48K
Max X Grid Dimension                   2^16-1    2^16-1    2^32-1    2^32-1
Hyper-Q                                No        No        No        Yes
Dynamic Parallelism                    No        No        No        Yes

Compute Capability of Fermi and Kepler GPUs
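
For reference, these parameters can be queried at run time through the CUDA runtime API; a minimal sketch (ours, not from the whitepaper):

```cuda
// Query a device's compute capability and a few of the limits tabulated above.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                    // device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("  max threads/block: %d, regs/block: %d, shared/block: %zu bytes\n",
           prop.maxThreadsPerBlock, prop.regsPerBlock, prop.sharedMemPerBlock);
    return 0;
}
```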

Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance. Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

SMX Processing Core Architecture

Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.

One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.

Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall that the 2x shader clock was introduced in the G80 Tesla architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.

Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

a) Register scoreboarding for long latency operations (texture and load)
b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
c) Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue. For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.

Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references, i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.
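
As a concrete illustration (ours, not from the whitepaper), a butterfly-style warp reduction built from the XOR shuffle intrinsic; Kepler-era CUDA exposed this as __shfl_xor(), which current toolkits spell __shfl_xor_sync():

```cuda
// Warp-wide sum using XOR "butterfly" shuffles: five exchange steps for a
// 32-lane warp, with no shared memory and no __syncthreads() required.
__device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset); // exchange with lane ^ offset
    return val;   // every lane now holds the full warp sum
}
```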

Atomic Operations

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.

Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results.

Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports atomicMin, atomicMax, atomicAnd, atomicOr, and atomicXor.

Other atomic operations which are not supported natively (for example, 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction; a sketch of that pattern follows below.
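
As an illustration of the CAS-emulation pattern (the helper name is ours; the technique follows the approach described in the CUDA programming guide), a double-precision atomic add built on the native 64-bit atomicCAS:

```cuda
// Sketch: emulating a 64-bit floating-point atomic add with atomicCAS.
// atomicCAS on 64-bit words is native on GK110; the loop retries until no
// other thread has modified the word between the read and the swap.
__device__ double atomicAddDouble(double *addr, double val)
{
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);          // retry if another thread intervened
    return __longlong_as_double(old);  // value seen before our add
}
```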

Texture Improvements

The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi: each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a slot in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi. With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
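
A minimal sketch of the bindless model through the CUDA 5.0 texture-object API (function names and the linear-memory setup are illustrative):

```cuda
#include <cuda_runtime.h>

// The texture is just a handle: it can be created at any time and passed to a
// kernel like any other argument, with no binding-table slot involved.
__global__ void gather(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);   // read through the texture path
}

cudaTextureObject_t makeTexObject(float *d_buf, size_t bytes)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_buf;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```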

Kepler Memory Subsystem - L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
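
The split is selected per kernel through the CUDA runtime API; a sketch, with a hypothetical kernel name:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel();   // hypothetical kernel

void configure()
{
    // Request the new Kepler 32 KB shared / 32 KB L1 configuration for this
    // kernel; PreferShared and PreferL1 select the 48/16 and 16/48 splits.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
}
```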

48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations. In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios. Use of this path is managed automatically by the compiler: access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keyword will be tagged by the compiler to be loaded through the Read-Only Data Cache.
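
A sketch of how a kernel opts in (our example): qualifying the input pointer as const and __restrict__ tells the compiler the data is read-only and unaliased, so its loads can be tagged for the 48 KB read-only cache.

```cuda
// The load of in[i] is eligible for the read-only data cache path because the
// pointer is declared both const and __restrict__.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = alpha * in[i];
}
```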

Improved L2 Cache

The Kepler GK110 GPU features 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads. Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

In Kepler GK110, any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.
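
A minimal sketch of a device-side launch (kernel names are ours; Dynamic Parallelism requires compute capability 3.5 and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

```cuda
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Launched from the GPU and queued through the Grid Management Unit,
        // with no round trip to the host CPU.
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // device-side wait for the child grid
    }                              // (CUDA 5 device runtime)
}
```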

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application. Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load-balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

One example would be dynamically setting up a grid for a numerical simulation: typically, grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or overspending compute resources on regions of less interest. With Dynamic Parallelism, the grid resolution can be determined dynamically at run time in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

Image attribution: Charles Reid

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed-resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

Hyper-Q permits more simultaneous connections between CPU and GPU.

Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.
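
A sketch of the usage pattern (kernel names are ours): work issued depth-first to independent CUDA streams serializes behind Fermi's single work queue, but runs concurrently on GK110 since each stream maps to its own hardware queue.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() { /* independent work */ }
__global__ void kernelB() { /* independent work */ }

int main()
{
    const int N = 32;                       // one stream per hardware queue
    cudaStream_t streams[N];
    for (int i = 0; i < N; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < N; ++i) {           // depth-first issue order
        kernelA<<<1, 256, 0, streams[i]>>>();
        kernelB<<<1, 256, 0, streams[i]>>>();
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < N; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```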

Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

Hyper-Q working with CUDA streams: in the Fermi model shown on the left, only (C,P) and (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.

Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi; however, new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

The redesigned Kepler HOST-to-GPU workflow, showing the new Grid Management Unit, which manages actively dispatching grids, pauses dispatch, and holds pending and suspended grids.

NVIDIA GPUDirect

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
- Significantly improved MPI_Send / MPI_Recv efficiency between GPU and other nodes in a network (see the sketch below)
- Eliminates CPU bandwidth and latency bottlenecks
- Works with a variety of 3rd-party network, capture, and storage devices
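
As an illustrative sketch (assuming a CUDA-aware MPI implementation that exploits GPUDirect RDMA; the function is ours), a device buffer can be handed straight to MPI with no host staging copy:

```cuda
#include <mpi.h>

// d_buf points to GPU memory; with GPUDirect RDMA the NIC reads and writes
// it directly, so no cudaMemcpy to a host bounce buffer is needed.
void exchangeHalo(float *d_buf, int count, int peer)
{
    MPI_Sendrecv_replace(d_buf, count, MPI_FLOAT,
                         peer, /*sendtag=*/0,
                         peer, /*recvtag=*/0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```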

Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.

Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry. Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.

Appendix A - Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.

CUDA Hardware Execution

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi, SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
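
A minimal sketch of this model (ours): each scalar thread computes one element, and consecutive threads in a warp touch consecutive addresses, which the hardware coalesces.

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // coalesced accesses
}

// Host-side launch: a grid of ceil(n/256) thread blocks, 256 threads each.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```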

Notice

ALL INFORMATION PROVIDED IN THIS WHITE PAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2012 NVIDIA Corporation. All rights reserved.