Whitepaper
NVIDIA's Next Generation CUDA™ Compute Architecture:
Kepler™ GK110
The Fastest, Most Efficient HPC Architecture Ever Built
V1.0
-
Table of Contents

Kepler GK110 - The Next Generation GPU Computing Architecture
Kepler GK110 - Extreme Performance, Extreme Efficiency
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit
    NVIDIA GPUDirect
An Overview of the GK110 Kepler Architecture
    Performance per Watt
    Streaming Multiprocessor (SMX) Architecture
        SMX Processing Core Architecture
        Quad Warp Scheduler
        New ISA Encoding: 255 Registers per Thread
        Shuffle Instruction
        Atomic Operations
        Texture Improvements
    Kepler Memory Subsystem - L1, L2, ECC
        64 KB Configurable Shared Memory and L1 Cache
        48 KB Read-Only Data Cache
        Improved L2 Cache
        Memory Protection Support
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit - Efficiently Keeping the GPU Utilized
    NVIDIA GPUDirect
Conclusion
Appendix A - Quick Refresher on CUDA
    CUDA Hardware Execution
Kepler GK110 - The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer-aided engineering, computational fluid dynamics, and data analysis. NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems. By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.
Kepler GK110 - Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market. Kepler GK110 will provide over 1 TFLOP of double precision throughput with greater than 80% DGEMM efficiency, versus 60-65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

Kepler GK110 Die Photo
The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments ranging from personal workstations to supercomputers:

Dynamic Parallelism - adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

Hyper-Q - enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.

Grid Management Unit - enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

NVIDIA GPUDirect - NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third-party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.
An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs. Key features of the architecture that will be discussed below in more depth include:

- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
- Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 Full chip block diagram
Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                        FERMI        FERMI        KEPLER       KEPLER
                                        GF100        GF104        GK104        GK110
Compute Capability                      2.0          2.1          3.0          3.5
Threads / Warp                          32           32           32           32
Max Warps / Multiprocessor              48           48           64           64
Max Threads / Multiprocessor            1536         1536         2048         2048
Max Thread Blocks / Multiprocessor      8            8            16           16
32-bit Registers / Multiprocessor       32768        32768        65536        65536
Max Registers / Thread                  63           63           63           255
Max Threads / Thread Block              1024         1024         1024         1024
Shared Memory Size Configurations       16K, 48K     16K, 48K     16K, 32K,    16K, 32K,
(bytes)                                                           48K          48K
Max X Grid Dimension                    2^16-1       2^16-1       2^32-1       2^32-1
Hyper-Q                                 No           No           No           Yes
Dynamic Parallelism                     No           No           No           Yes

Compute Capability of Fermi and Kepler GPUs
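The whitepaper does not include code, but the parameters in the table above can be queried at run time through the standard CUDA runtime API. A minimal sketch, assuming CUDA 5.0 or later is installed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor give the Compute Capability (3.5 for GK110);
        // other fields correspond to rows of the table above.
        printf("Device %d: %s, CC %d.%d, warp size %d, %d regs/block, "
               "%d max threads/block\n",
               dev, prop.name, prop.major, prop.minor,
               prop.warpSize, prop.regsPerBlock, prop.maxThreadsPerBlock);
    }
    return 0;
}
```

Code that must run on both Fermi and Kepler can branch on `prop.major >= 3 && prop.minor >= 5` before using GK110-only features such as Dynamic Parallelism.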
Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance. Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.
Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).
SMX Processing Core Architecture

Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation. One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.

Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall that the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.

Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

a) Register scoreboarding for long-latency operations (texture and load)
b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
c) Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue. For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.
New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.

Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references - i.e., any thread reads from any other thread. Useful shuffle subsets including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp are also available as CUDA intrinsics.

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.

Atomic Operations

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.
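As an illustration of warp-level data exchange without shared memory, the sketch below sums a value across a warp using the Kepler-era `__shfl_down` intrinsic (later CUDA versions replace it with `__shfl_down_sync`). The helper name `warpReduceSum` is illustrative, not from the whitepaper:

```cuda
// Tree reduction within a warp: each step halves the number of lanes
// still carrying partial sums. No shared memory or __syncthreads needed.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);  // fetch val from the lane `offset` higher
    return val;  // lane 0 ends up holding the sum of all 32 lanes
}
```

Because the exchange happens in registers via Shuffle, a store-and-load pair through shared memory is replaced by a single instruction per step, matching the performance advantage described above.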
Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x, to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results.

Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:

- atomicMin
- atomicMax
- atomicAnd
- atomicOr
- atomicXor

Other atomic operations which are not supported natively (for example, 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.

Texture Improvements

The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi - each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a slot in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi. With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
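The CAS-based emulation mentioned above can be sketched for a 64-bit floating-point add, following the well-known pattern from the CUDA programming documentation. The helper name `atomicAddDouble` is illustrative:

```cuda
// Emulated 64-bit floating-point atomic add built on the native 64-bit
// atomicCAS, for GPUs (like GK110) without a native double atomicAdd.
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long int* addr_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        // Attempt to swap in (assumed + val); atomicCAS returns the value
        // actually found at the address.
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread modified the value between read and CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

The loop only repeats under contention, so with GK110's improved atomic throughput the common uncontended case completes in a single CAS.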
Kepler Memory Subsystem - L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64-bit and larger load operations is also doubled compared to the Fermi SM, to 256 B per core clock.

48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.
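The shared memory / L1 split described above is selected through the CUDA runtime. A minimal sketch, assuming a hypothetical shared-memory-heavy kernel named `stencil`:

```cuda
#include <cuda_runtime.h>

__global__ void stencil(float* data) {
    extern __shared__ float tile[];   // stage data in shared memory
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main() {
    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    // Request the new Kepler 32 KB / 32 KB shared-memory/L1 split for this
    // kernel. Alternatives:
    //   cudaFuncCachePreferShared -> 48 KB shared / 16 KB L1
    //   cudaFuncCachePreferL1     -> 16 KB shared / 48 KB L1
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferEqual);
    stencil<<<1, 256, 256 * sizeof(float)>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

The setting is a preference, not a guarantee; the driver may choose another configuration if the requested split cannot be honored.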
In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full-speed unaligned memory access patterns, among other scenarios. Use of this path is managed automatically by the compiler - access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keywords will be tagged by the compiler to be loaded through the Read-Only Data Cache.

Improved L2 Cache

The Kepler GK110 GPU features 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high-speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

ECC check-bit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory-bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC check-bit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.
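The compiler-directed read-only path described above can be illustrated with a simple kernel. Marking a pointer both `const` and `__restrict__` tells the compiler the data cannot be written through any alias, making its loads eligible for the Read-Only Data Cache (the kernel itself is an illustrative example, not from the whitepaper):

```cuda
// Loads of x can be routed through the 48 KB Read-Only Data Cache on GK110,
// because const + __restrict__ guarantees x is never written in this kernel.
__global__ void saxpy(int n, float a,
                      const float* __restrict__ x,  // eligible for read-only path
                      float*       __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

On compute capability 3.5, the `__ldg()` intrinsic can also be used to request a read-only-cache load for an individual access when the qualifiers alone are not sufficient for the compiler to prove read-only-ness.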
Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads. Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.
Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

In Kepler GK110, any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application. Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load-balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.
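A kernel launching another kernel can be sketched as follows. This is a minimal illustrative example (the kernel names are hypothetical); it requires compute capability 3.5 and compilation with `-arch=sm_35 -rdc=true -lcudadevrt`:

```cuda
// Child grid: simple per-element work.
__global__ void childKernel(float* data) {
    data[threadIdx.x] *= 2.0f;
}

// Parent grid: sizes and launches the child based on a value computed
// on the GPU, with no CPU round trip.
__global__ void parentKernel(float* data, const int* workCount) {
    if (threadIdx.x == 0 && *workCount > 0) {
        childKernel<<<1, *workCount>>>(data);  // device-side launch
        cudaDeviceSynchronize();               // wait for the child grid
    }
    // ...continue processing the child's results here...
}
```

The key point is that `*workCount` - an intermediate, on-GPU result - determines the child launch configuration at run time, which is exactly the data-dependent pattern the text describes.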
One example would be dynamically setting up a grid for a numerical simulation. Typically, grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or over-spending compute resources on regions of less interest.

With Dynamic Parallelism, the grid resolution can be determined dynamically at run time in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

Image attribution: Charles Reid

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed-resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.
Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

Hyper-Q permits more simultaneous connections between CPU and GPU.

Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.
Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

Hyper-Q working with CUDA streams: In the Fermi model shown on the left, only (C,P) and (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.
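From the programmer's side, exploiting Hyper-Q requires nothing beyond ordinary CUDA streams; the hardware maps each stream onto its own work queue. A minimal sketch with hypothetical kernels `kernelA` and `kernelB`:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float* d) { d[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* d) { d[threadIdx.x] *= 2.0f; }

int main() {
    const int N_STREAMS = 32;  // matches GK110's 32 hardware connections
    cudaStream_t streams[N_STREAMS];
    float* buf[N_STREAMS];
    for (int i = 0; i < N_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 256 * sizeof(float));
        // Depth-first launch order: on Fermi these would serialize through
        // one hardware queue; on GK110 each stream gets its own queue, so
        // the (A, B) pairs from different streams can overlap.
        kernelA<<<1, 256, 0, streams[i]>>>(buf[i]);
        kernelB<<<1, 256, 0, streams[i]>>>(buf[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < N_STREAMS; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Note the depth-first launch order here is exactly the pattern that caused false serialization on Fermi; on GK110 the same code runs concurrently with no changes, as the text describes.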
Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi; however, new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.
The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which manages the actively dispatching grids, pauses dispatch, and holds pending and suspended grids.
NVIDIA GPUDirect

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
- Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network
- Eliminates CPU bandwidth and latency bottlenecks
- Works with a variety of 3rd-party network, capture, and storage devices
Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry. Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.
Appendix A - Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.
Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.

CUDA Hardware Execution

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
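The advice above - same code path, nearby addresses - can be made concrete with a minimal kernel (illustrative, not from the whitepaper). Consecutive threads compute consecutive global indices, so each warp touches one contiguous span of memory and its accesses coalesce:

```cuda
// Each thread handles one element; lane k of a warp touches element
// base + k, so a warp's 32 loads/stores coalesce into contiguous
// 128-byte transactions. The single `if` guard keeps all lanes on the
// same code path except at the very end of the array.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}
```

A typical launch pairs a block size that is a multiple of the warp size (e.g., 256) with enough blocks to cover the array: `scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);`.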
Notice

ALL INFORMATION PROVIDED IN THIS WHITE PAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2012 NVIDIA Corporation. All rights reserved.