FireSim: Fast, Cycle-Accurate Datacenter Simulation in the Public Cloud
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović
https://fires.im @firesimproject
Why simulate datacenters?
Next-gen datacenters won't be built only from commodity components:
• Deeper memory/storage hierarchies, e.g. 3D XPoint
• The end of Moore's Law
• Custom Silicon in the Cloud
• Fast networks, e.g. Silicon Photonics
• New DC organizations, e.g. disaggregation [1]
… and custom HW is changing faster than ever
FPGAs:
Agile HW Design for ASICs: [2]
So, what does our simulator need to do?
• Model HW at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
• Run real SW:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Make it usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems, architecture
Evaluating Computer Systems
(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator
(1) Build the hardware: “Tapeout”
• Architects build hardware by writing RTL (Verilog, VHDL, etc.)
  • Considered “hard”
• Validate/unit test in software RTL simulation
• Push through ASIC tools (licenses very expensive + sign NDAs for tools/process)
• Iterate for “Quality of Results” (QoR), aka hitting your target frequency, area, passing design rules of the process
• Send a final design to the fab, give them millions of $, wait weeks/months for the result
• Get chips back, network together into a DC
(2) Write/Use a software simulator
• Lots of options out there
• Easy to prototype new HW: write C++ code
• Also easy to model something that you can't really build
• Very slow to run (at best 100s of KIPS for a single node)
• Either run microbenchmarks or use sampling/skip-forward
• Once you come up with a good design, then write RTL
• Can network together many instantiations of a SW simulator to simulate a datacenter
(3) Build a HW-accelerated simulator: DIABLO
• Need to hand-write RTL models (even harder than “tapeout-ready” RTL)
• Tied to a custom host platform
• DIABLO [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like memcached
  • $100k+ host platform, custom built
  • Abstract processor, switch models
• Can't take this RTL and tape it out
[Figure: DIABLO prototype]
Evaluating Systems for Evaluating Computer Systems
Metric               SW Simulator             HW-accel Simulator    Build the real HW
Cost                 $                        $$$                   $$$$$
Time-to-first-cycle  Fastest (recompile C++)  Medium (FPGA tools)   Slowest (CAD tools + fab)
Runtime slowdown     100,000x                 1,000x                1x
Can tape out?        X                        X/✔                   ✔
Run real SW          X (too slow)             ✔                     ✔
Commodity platform   ✔                        X                     X
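To make the "Runtime slowdown" row concrete, the following minimal sketch converts each slowdown factor from the table into the wall-clock time needed to simulate one second of target execution. Only the three slowdown factors come from the table; the function and variable names are illustrative.

```python
# Slowdown factors copied from the "Runtime slowdown" row of the table.
# Names below are illustrative, not part of any FireSim tooling.
SLOWDOWN = {
    "SW simulator": 100_000,
    "HW-accel simulator": 1_000,
    "Real HW": 1,
}


def wall_clock_seconds(target_seconds: float, slowdown: int) -> float:
    """Host time needed to simulate `target_seconds` of target execution."""
    return target_seconds * slowdown


for name, factor in SLOWDOWN.items():
    secs = wall_clock_seconds(1.0, factor)
    print(f"{name}: {secs:,.0f} s ({secs / 3600:.2f} h) per simulated second")
```

At 100,000x, one simulated second costs more than a day of host time, which is why the table marks software simulators as too slow for real software stacks.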
How do we improve?
Useful hardware trends:
• Open ISA
• Open, Silicon-Proven SoC Implementations
• High-Productivity Hardware Design Language
• FPGAs in the Cloud
So, what are we building?
• A datacenter simulator that…
  • Models servers, accelerators, switches, and datacenter interconnect
  • Uses a commodity host platform with FPGAs (EC2 F1)
  • Lets users work with:
    • RTL (Chisel/Verilog) for customizing server blades, building accelerators, etc.
    • Software models (C++) for switches
  • Automates high-performance simulated network transport + the process of using FPGAs, mapping simulation to the host, and building a simulator out of RTL and software models
  • Runs real software stacks at reasonable speed (Linux + apps)
FireSim Target Design (Servers + Network)
• Server blades, each with:
  • [email protected]
  • 16 KiB I$, 16 KiB D$, 256 KiB L2
  • 16 GB DRAM
  • 200 Gbps Ethernet NIC
  • Optional accelerators
• High-performance network:
  • Parameterizable BW/link latency
    • e.g. 200 Gbps, 2 μs
  • Easy to add your own link layer
    • We provide Ethernet
  • Switch models with configurable # of ports
  • Configurable topology
[Figure: FireSim Server Blade block diagram: four Rocket cores (each with L1I, L1D, and an accelerator), L2, TileLink2 on-chip interconnect, DRAM, NIC (to DC network), and other devices]
Mapping a simulation to EC2 F1
• Server simulation
  • Lots of inherent parallelism
  • We have the RTL: transform it (FAME-1)
  • Put it on the FPGAs
• Network simulation
  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + host network
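One common way to coordinate distributed simulations like these is conservative lockstep: every server simulation advances by exactly one link latency worth of target cycles, then hands its outbound packets to the switch model, which delivers them at the start of the next window. The sketch below illustrates that idea only; the slides do not spell out the mechanism FireSim uses, and `ServerSim`, `switch_step`, `run`, and the 4-cycle latency are all hypothetical.

```python
from collections import defaultdict

LINK_LATENCY = 4  # target cycles; hypothetical value chosen for illustration


class ServerSim:
    """Stand-in for one FPGA-hosted server simulation (hypothetical)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.cycle = 0
        self.received = []  # (arrival_cycle, src, payload)

    def advance(self, window, inbound):
        """Advance `window` target cycles; return packets sent in that window."""
        self.received.extend(inbound)
        outbound = []
        if self.node_id == 0 and self.cycle == 0:
            # Node 0 sends a single packet to node 1 at target cycle 1.
            outbound.append((1, 1, "ping"))  # (send_cycle, dst, payload)
        self.cycle += window
        return outbound


def switch_step(all_outbound, latency):
    """CPU-hosted switch model: route packets and stamp their arrival cycle."""
    inbound = defaultdict(list)
    for src, packets in all_outbound.items():
        for send_cycle, dst, payload in packets:
            inbound[dst].append((send_cycle + latency, src, payload))
    return inbound


def run(num_nodes, windows, latency=LINK_LATENCY):
    sims = [ServerSim(i) for i in range(num_nodes)]
    inbound = defaultdict(list)
    for _ in range(windows):
        # Every simulation advances by the same window before any exchange.
        outbound = {s.node_id: s.advance(latency, inbound[s.node_id]) for s in sims}
        inbound = switch_step(outbound, latency)
    return sims


sims = run(num_nodes=2, windows=3)
print(sims[1].received, sims[1].cycle)
```

Because the window equals the link latency, a packet sent anywhere inside window k can only arrive inside window k+1, so delivering it at the window boundary never violates target-time causality, and the host network is touched once per window rather than once per cycle.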
[Figure: one f1.16xlarge: eight FPGA-hosted "Server Simulation(s)" blocks connected over host PCIe to a CPU-hosted switch model, with host Ethernet (EC2 network) leaving the instance]
FAME-1 Transforming RTL
• Given RTL, we want to automatically transform a design into decoupled, cycle-accurate simulator RTL that we can run on the FPGA
• See MIDAS/Strober [5, 6] from CARRV '17 and ISCA '16
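The idea of a decoupled, cycle-accurate simulator can be illustrated in software: the target design advances one target cycle only when a token is available on every input channel, and otherwise holds its state while the host keeps ticking. The sketch below is a minimal Python analogy under that assumption, not the MIDAS transformation itself; `Fame1Wrapper`, `accumulate`, and the `data_in` channel are hypothetical names, and a real implementation would also check for space on its output channels.

```python
from collections import deque


class Fame1Wrapper:
    """Host-decoupled wrapper around a target step function (illustrative)."""

    def __init__(self, step_fn, state, input_names):
        self.step_fn = step_fn
        self.state = state
        self.inputs = {name: deque() for name in input_names}
        self.outputs = deque()
        self.target_cycle = 0
        self.host_cycle = 0

    def enqueue(self, name, token):
        """Deliver one target cycle's worth of input on channel `name`."""
        self.inputs[name].append(token)

    def host_tick(self):
        """One host cycle: fire the target only if every input has a token."""
        self.host_cycle += 1
        if not all(self.inputs.values()):  # an empty deque is falsy
            return False  # stall: target state is held unchanged
        tokens = {name: q.popleft() for name, q in self.inputs.items()}
        self.state, out = self.step_fn(self.state, tokens)
        self.outputs.append(out)
        self.target_cycle += 1
        return True


def accumulate(state, tokens):
    """Toy target design: a running sum of its input."""
    new_state = state + tokens["data_in"]
    return new_state, new_state


sim = Fame1Wrapper(accumulate, state=0, input_names=["data_in"])
sim.host_tick()             # no token yet: stalls
sim.enqueue("data_in", 3)
sim.host_tick()             # fires target cycle 0
sim.host_tick()             # stalls again
sim.enqueue("data_in", 5)
sim.host_tick()             # fires target cycle 1
print(list(sim.outputs), sim.target_cycle, sim.host_cycle)
```

The target's outputs depend only on the token sequence, not on how many host cycles were spent stalled, which is what lets a simulation stay cycle-accurate while waiting on slow host resources such as PCIe or the network.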
[Figure: FireSim Server Blade (Rocket cores with L1I, L1D, and accelerators; L2; TileLink2 interconnect; DRAM; NIC to DC network; other devices) shown next to the FAME-1 Transformed Server Blade, which exposes NIC In/Out and Other Dev. In/Out channels and replaces DRAM with a DDR3 model]
Single-host-node simulation metrics
• We pack four quad-core server simulations per FPGA
  • = 32 server simulations per f1.16xlarge
  • = 128 simulated cores per f1.16xlarge
  • One simulation management thread per FPGA
• 32-port, 200 Gbps per-port ToR switch model
  • One thread per port (16xlarge has 64 vCPUs)
• Runs at ~5 MHz -> ~400 million insts/sec
• $13.20/hr on-demand, ~$2.60/hr spot
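The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers from the slide; the derived quantities (FPGAs per instance, host threads, implied instructions per core-cycle, cost per simulated server) are back-of-the-envelope results under the assumption that the reported rates apply uniformly to all simulated cores.

```python
# Numbers taken from the slide.
SIMS_PER_FPGA = 4
CORES_PER_SIM = 4
SIMS_PER_INSTANCE = 32
SWITCH_PORTS = 32
VCPUS = 64
SIM_RATE_HZ = 5e6                 # "~5 MHz"
REPORTED_INSTS_PER_SEC = 400e6    # "~400 million insts/sec"
ON_DEMAND_USD_PER_HR = 13.20
SPOT_USD_PER_HR = 2.60

# Derived quantities (not stated on the slide).
fpgas_per_instance = SIMS_PER_INSTANCE // SIMS_PER_FPGA
cores_per_instance = SIMS_PER_INSTANCE * CORES_PER_SIM
host_threads = SWITCH_PORTS + fpgas_per_instance  # port threads + per-FPGA managers
core_cycles_per_sec = cores_per_instance * SIM_RATE_HZ
implied_insts_per_cycle = REPORTED_INSTS_PER_SEC / core_cycles_per_sec
on_demand_per_server = ON_DEMAND_USD_PER_HR / SIMS_PER_INSTANCE
spot_per_server = SPOT_USD_PER_HR / SIMS_PER_INSTANCE

print(f"FPGAs per f1.16xlarge implied by the packing: {fpgas_per_instance}")
print(f"Simulated cores per instance: {cores_per_instance}")
print(f"Host threads needed: {host_threads} of {VCPUS} vCPUs")
print(f"Aggregate core-cycles/s: {core_cycles_per_sec:.3g}")
print(f"Implied insts per core-cycle: {implied_insts_per_cycle:.3f}")
print(f"Cost per simulated server-hour: ${on_demand_per_server:.4f} on-demand, "
      f"${spot_per_server:.4f} spot")
```

The 32 switch-port threads plus one management thread per FPGA fit comfortably within the 64 vCPUs, which is consistent with hosting the ToR switch model on the same instance as the FPGAs.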
[Figure: Target network topology on one 16xlarge: a ToR switch connected to Node 1 … Node 32 over 200 Gbps, 2 μs links]
[Figure: “Supernode” (one FPGA): four 4-core RC simulations on a single FPGA, sharing host memory channels and host PCIe]
Reproducing tail latency effects from real systems
• Leverich and Kozyrakis show effects of thread imbalance in memcached in [3]
• We can observe this effect in simulation
Scaling to a 1024-Node RISC-V Datacenter
• 1024 server blades (4096 cores), 32 ToR switches
  • 32 f1.16xlarges
• 1 Root + 4 Aggregation switches
  • 5 m4.16xlarges
• Runs at 3.4 MHz (13 billion insts/s)
• Sample memcached run: [figure not preserved]
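The topology described above can be written down as a small tree and checked against the slide's totals. In this sketch the switch and node counts and the 3.4 MHz rate come from the slide; the even split of eight ToR switches per aggregation switch, the four cores per blade (4096 / 1024), and all names are assumptions made for illustration.

```python
NODES_PER_TOR = 32
CORES_PER_NODE = 4      # 4096 cores / 1024 blades
SIM_RATE_HZ = 3.4e6     # "Runs at 3.4 MHz"


def build_topology(num_aggs=4, num_tors=32, nodes_per_tor=NODES_PER_TOR):
    """Build root -> aggregation -> ToR -> node as nested dicts.

    Assumes ToR switches are divided evenly among aggregation switches.
    """
    tors_per_agg = num_tors // num_aggs
    root = {"name": "root", "children": []}
    node_id = 0
    for a in range(num_aggs):
        agg = {"name": f"agg{a}", "children": []}
        for t in range(tors_per_agg):
            tor = {"name": f"tor{a * tors_per_agg + t}", "children": []}
            for _ in range(nodes_per_tor):
                tor["children"].append({"name": f"node{node_id}", "children": []})
                node_id += 1
            agg["children"].append(tor)
        root["children"].append(agg)
    return root


def count(tree, prefix):
    """Count tree entries whose name starts with `prefix`."""
    own = 1 if tree["name"].startswith(prefix) else 0
    return own + sum(count(child, prefix) for child in tree["children"])


topo = build_topology()
num_nodes = count(topo, "node")
num_tors = count(topo, "tor")
num_upper_switches = count(topo, "agg") + count(topo, "root")
total_cores = num_nodes * CORES_PER_NODE
core_cycles_per_sec = total_cores * SIM_RATE_HZ

print(f"{num_nodes} nodes, {total_cores} cores, {num_tors} ToR switches")
print(f"{num_upper_switches} root/aggregation switches")
print(f"Aggregate core-cycles/s: {core_cycles_per_sec:.4g}")
```

With one ToR switch and its 32 blades per f1.16xlarge, 32 ToR switches map to the 32 f1.16xlarges on the slide, and the five root and aggregation switches match the five m4.16xlarges. The aggregate of about 13.9 billion core-cycles per second is in line with the quoted 13 billion insts/s.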
Summing Up
• We can prototype a datacenter built on any RISC-V core
  • You bring the SoCs, accelerators
• Simulation is automatically built and deployed
  • ssh into the simulated system, just like a real cluster
[Figure: RISC-V SoCs, accelerators, and network topology feed into FireSim, which produces an automatically deployed, high-performance, distributed simulation]
Thanks!
Talk to us about:
• Automating building/distributing simulations across EC2 instances
• OoO cores (e.g. BOOM) / integrating your high-performance core
• “Functional” network simulation
  • e.g. automatically run all of SPECInt06-ref on Rocket Chip @ 150 MHz, in the cloud, in <1 day
  • Check out our demo/blog post on the AWS Compute Blog
• TCO of simulation in the cloud
• RISC-V in the cloud
• Scaling further
https://fires.im @firesimproject
References
[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation. OSDI '16.
[2] Y. Lee et al. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[6] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA '16.