firesim: fast, cycle-accurate datacenter simulation in the ... · single-host-node simulation...

FireSim: Fast, Cycle-Accurate Datacenter Simulation in the Public Cloud

Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović

https://fires.im @firesimproject


Page 2: Why simulate datacenters?

Next-gen datacenters won't be built only from commodity components:

• Deeper memory/storage hierarchies, e.g. 3D XPoint
• The end of Moore's Law
• Custom silicon in the cloud
• Fast networks, e.g. silicon photonics
• New DC organizations, e.g. disaggregation [1]

Page 3: …and custom HW is changing faster than ever

• FPGAs
• Agile HW design for ASICs [2]

Page 4: So, what does our simulator need to do?

• Model HW at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
• Run real SW:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Make it usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems, architecture

Page 5: Evaluating Computer Systems

(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator

Page 6: (1) Build the hardware: "Tapeout"

• Architects build hardware by writing RTL (Verilog, VHDL, etc.)
  • Considered "hard"
• Validate/unit test in software RTL simulation
• Push through ASIC tools (licenses very expensive + sign NDAs for tools/process)
• Iterate for "Quality of Results" (QoR), aka hitting your target frequency, area, passing design rules of the process
• Send a final design to the fab, give them millions of $, wait weeks/months for the result
• Get chips back, network together into a DC

Page 7: (2) Write/Use a software simulator

• Lots of options out there
• Easy to prototype new HW: write C++ code
• Also easy to model something that you can't really build
• Very slow to run (at best 100s of KIPS for a single node)
  • Either run microbenchmarks or use sampling/skip-forward
• Once you come up with a good design, then write RTL
• Can network together many instantiations of a SW simulator to simulate a datacenter

Page 8: (3) Build a HW-accelerated simulator: DIABLO

• Need to hand-write RTL models (even harder than "tapeout-ready" RTL)
• Tied to a custom host platform
• DIABLO [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like memcached
  • $100k+ host platform, custom built
  • Abstract processor, switch models
• Can't take this RTL and tape it out

[Figure: DIABLO Prototype]

Page 9: Evaluating Systems for Evaluating Computer Systems

Metric               SW Simulator             HW-accel Simulator    Build the real HW
Cost                 $                        $$$                   $$$$$
Time-to-first-cycle  Fastest (recompile C++)  Medium (FPGA tools)   Slowest (CAD tools + fab)
Runtime slowdown     100,000x                 1,000x                1x
Can tape out?        X                        X/✔                   ✔
Run real SW          X (too slow)             ✔                     ✔
Commodity platform   ✔                        X                     X
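
To make the "Runtime slowdown" row concrete, the sketch below converts each slowdown factor into the wall-clock time needed to simulate one second of target execution. The factors come from the table; the one-second workload and the helper names are assumptions chosen only for illustration.

```python
# Slowdown factors as listed in the comparison table.
SLOWDOWN = {
    "SW simulator": 100_000,
    "HW-accel simulator": 1_000,
    "Real HW": 1,
}


def wall_clock_seconds(target_seconds: float, slowdown: int) -> float:
    """Host time needed to simulate `target_seconds` of target execution."""
    return target_seconds * slowdown


def format_duration(seconds: float) -> str:
    """Render a duration in the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.1f} seconds"


# Hypothetical workload: one second of target time.
for name, factor in SLOWDOWN.items():
    print(f"{name}: {format_duration(wall_clock_seconds(1.0, factor))}")
```

Under these numbers, one target second costs roughly 28 hours in a software simulator and about 17 minutes in a hardware-accelerated one, which is why the table marks software simulation as too slow for real software stacks.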

Page 10: How do we improve?

Useful hardware trends:

• Open ISA
• Open, silicon-proven SoC implementations
• High-productivity hardware design language
• FPGAs in the cloud

Page 11: So, what are we building?

• A datacenter simulator that…
  • Models servers, accelerators, switches, and datacenter interconnect
  • Uses a commodity host platform with FPGAs (EC2 F1)
  • Lets users work with:
    • RTL (Chisel/Verilog) for customizing server blades, building accelerators, etc.
    • Software models (C++) for switches
  • Automates high-performance simulated network transport + the process of using FPGAs, mapping simulation to the host, and building a simulator out of RTL and software models
  • Runs real software stacks at reasonable speed (Linux + apps)

Page 12: FireSim Target Design (Servers + Network)

• Server blades, each with:
  • [email protected]
  • 16 KiB I$, 16 KiB D$, 256 KiB L2
  • 16 GB DRAM
  • 200 Gbps Ethernet NIC
  • Optional accelerators
• High-performance network:
  • Parameterizable BW/link latency
    • e.g. 200 Gbps, 2 μs
  • Easy to add your own link-layer
    • We provide Ethernet
  • Switch models with configurable # of ports
  • Configurable topology

[Figure: FireSim Server Blade Block Diagram. Blocks shown: four Rocket Cores, each with L1I, L1D, and Accel; TileLink2 On-Chip Interconnect; L2; DRAM; NIC (to DC network); Other Devices.]

Page 13: Mapping a simulation to EC2 F1

• Server simulation
  • Lots of inherent parallelism
  • We have the RTL: transform it (FAME-1)
  • Put it on the FPGAs
• Network simulation
  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + host network

[Figure: one f1.16xlarge. Eight Server Simulation(s) blocks, host PCIe, a CPU running the Switch Model, and host Ethernet (EC2 network).]

Page 14: FAME-1 Transforming RTL

• Given RTL, we want to automatically transform a design into decoupled, cycle-accurate simulator RTL that we can run on the FPGA
• See MIDAS/Strober [5, 6] from CARRV '17 / ISCA '16

[Figure: the FireSim Server Blade (four Rocket Cores, each with L1I, L1D, and Accel; TileLink2 Interconnect; L2; NIC; DRAM; Other Devices; link to the DC network) shown next to the FAME-1 Transformed Server Blade, which has the same blocks plus NIC In, NIC Out, Other Dev. In, and Other Dev. Out interfaces, and a DDR3 Model where the original shows DRAM.]
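
The idea of a decoupled, cycle-accurate simulator can be illustrated with a toy model. In the sketch below, a target design advances one target cycle only when an input token is available and there is room for an output token; otherwise its state is frozen for that host cycle. This is a hand-written Python illustration of token gating under assumed names (`TokenChannel`, `Fame1Accumulator`, `run`), not the MIDAS transformation itself, and the accumulator target is made up for the example.

```python
from collections import deque


class TokenChannel:
    """Bounded FIFO holding one token per target cycle."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fifo = deque()

    def can_enq(self) -> bool:
        return len(self.fifo) < self.capacity

    def can_deq(self) -> bool:
        return len(self.fifo) > 0

    def enq(self, token) -> None:
        self.fifo.append(token)

    def deq(self):
        return self.fifo.popleft()


class Fame1Accumulator:
    """Toy target: a register that adds its input every target cycle."""

    def __init__(self, inp: TokenChannel, out: TokenChannel):
        self.inp, self.out = inp, out
        self.acc = 0
        self.target_cycle = 0

    def host_step(self) -> bool:
        """Run one host cycle; return True if the target advanced."""
        if not (self.inp.can_deq() and self.out.can_enq()):
            return False  # stall: target state stays frozen
        x = self.inp.deq()
        self.out.enq(self.acc)  # output is the register value before update
        self.acc += x
        self.target_cycle += 1
        return True


def run(inputs, supply_every: int):
    """Feed one input token every `supply_every` host cycles."""
    inp, out = TokenChannel(2), TokenChannel(2)
    dut = Fame1Accumulator(inp, out)
    pending, outputs, host_cycles = deque(inputs), [], 0
    while len(outputs) < len(inputs):
        if pending and host_cycles % supply_every == 0 and inp.can_enq():
            inp.enq(pending.popleft())
        dut.host_step()
        if out.can_deq():
            outputs.append(out.deq())
        host_cycles += 1
    return outputs, host_cycles


fast, fast_cycles = run([1, 2, 3, 4], supply_every=1)
slow, slow_cycles = run([1, 2, 3, 4], supply_every=3)
print(fast, fast_cycles)
print(slow, slow_cycles)
```

Both runs produce the same per-target-cycle outputs even though the slower host takes more host cycles, which is the property that lets FPGA-hosted server simulations stay cycle-accurate while waiting on tokens from the host.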

Page 15: Single-host-node simulation metrics

• We pack four quad-core server simulations per FPGA
  • = 32 server simulations per f1.16xlarge
  • = 128 simulated cores per f1.16xlarge
  • One simulation management thread per FPGA
• 32-port, 200 Gbps per-port ToR switch model
  • One thread per port (16xlarge has 64 vCPUs)
• Runs at ~5 MHz -> ~400 million insts/sec
• $13.20/hr on-demand, ~$2.60/hr spot

[Figure: Target Network Topology on one 16xlarge. A ToR switch connects Node 1 through Node 32 over 200 Gbps, 2 us links.]

[Figure: "Supernode" (one FPGA). A single FPGA holds four 4-core RC blocks, with host memory channels and host PCIe.]
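
The per-instance numbers on this slide follow from simple arithmetic, sketched below. The FPGA count per instance is derived from the slide's own figures (32 simulations at four per FPGA), and the cost-per-simulated-node and implied instructions-per-cycle values are back-of-the-envelope derivations for illustration, not figures reported in the talk.

```python
# Figures stated on the slide.
SIMS_PER_FPGA = 4
SIMS_PER_INSTANCE = 32
CORES_PER_SIM = 4
TARGET_HZ = 5e6                 # "~5 MHz"
AGGREGATE_INSTS_PER_SEC = 400e6  # "~400 million insts/sec"
ON_DEMAND_USD_PER_HR = 13.20
SPOT_USD_PER_HR = 2.60

# Derived quantities.
fpgas_per_instance = SIMS_PER_INSTANCE // SIMS_PER_FPGA
simulated_cores = SIMS_PER_INSTANCE * CORES_PER_SIM

# Cost of one simulated server for one hour of host time.
on_demand_per_node_hr = ON_DEMAND_USD_PER_HR / SIMS_PER_INSTANCE
spot_per_node_hr = SPOT_USD_PER_HR / SIMS_PER_INSTANCE

# Average instructions per target cycle per core implied by the slide's
# rounded throughput figure (an approximation).
implied_ipc = AGGREGATE_INSTS_PER_SEC / (simulated_cores * TARGET_HZ)

print(f"FPGAs per f1.16xlarge: {fpgas_per_instance}")
print(f"Simulated cores: {simulated_cores}")
print(f"On-demand cost per simulated node-hour: ${on_demand_per_node_hr:.4f}")
print(f"Spot cost per simulated node-hour: ${spot_per_node_hr:.4f}")
print(f"Implied average IPC: {implied_ipc:.3f}")
```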

Page 16: Reproducing tail latency effects from real systems

• Leverich and Kozyrakis show effects of thread imbalance in memcached in [3]
• We can observe this effect in simulation

Page 17: Scaling to a 1024-Node RISC-V Datacenter

• 1024 server blades (4096 cores), 32 ToR switches
  • 32 f1.16xlarges
• 1 Root + 4 Aggregation switches
  • 5 m4.16xlarges
• Runs at 3.4 MHz (13 billion insts/s)
• Sample memcached run:
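
The scale figures above can be cross-checked with a short sketch that also builds the root/aggregation/ToR tree as a nested dictionary. The even split of ToR switches across aggregation switches, the node numbering, and the `build_topology` helper are assumptions made for illustration; the F1 price is the on-demand rate quoted on the single-node metrics slide, and the m4.16xlarge cost is left out because the talk does not give it.

```python
NODES_PER_F1 = 32
CORES_PER_NODE = 4
F1_INSTANCES = 32
TOR_SWITCHES = 32
AGG_SWITCHES = 4
F1_ON_DEMAND_USD_PER_HR = 13.20

total_nodes = NODES_PER_F1 * F1_INSTANCES
total_cores = total_nodes * CORES_PER_NODE
f1_cost_per_hr = F1_INSTANCES * F1_ON_DEMAND_USD_PER_HR


def build_topology(num_aggs: int, num_tors: int, nodes_per_tor: int) -> dict:
    """Build a root -> aggregation -> ToR -> nodes tree.

    Assumes ToR switches are split evenly across aggregation switches.
    """
    assert num_tors % num_aggs == 0, "even split assumed"
    tors_per_agg = num_tors // num_aggs
    root = {"name": "root", "children": []}
    for a in range(num_aggs):
        agg = {"name": f"agg{a}", "children": []}
        for t in range(tors_per_agg):
            tor_id = a * tors_per_agg + t
            first = tor_id * nodes_per_tor
            agg["children"].append({
                "name": f"tor{tor_id}",
                "nodes": list(range(first, first + nodes_per_tor)),
            })
        root["children"].append(agg)
    return root


topology = build_topology(AGG_SWITCHES, TOR_SWITCHES, NODES_PER_F1)
print(f"nodes={total_nodes} cores={total_cores}")
print(f"ToRs per aggregation switch: {len(topology['children'][0]['children'])}")
print(f"F1 on-demand cost: ${f1_cost_per_hr:.2f}/hr")
```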

Page 18: Summing Up

• We can prototype a datacenter built on any RISC-V core
  • You bring the SoCs, accelerators
• Simulation is automatically built and deployed
  • ssh into the simulated system, just like a real cluster

[Figure: RISC-V SoCs, Accelerators, and Network Topology feed into FireSim, which produces an automatically deployed, high-performance, distributed simulation.]

Page 19: Thanks!

Talk to us about:

• Automating building/distributing simulations across EC2 instances
• OoO cores (e.g. BOOM) / integrating your high-performance core
• "Functional" network simulation
  • e.g. Automatically run all of SPECInt06-ref on Rocket Chip @ 150 MHz, in the cloud, in < 1 day
  • Check out our demo/blog post on the AWS Compute Blog
• TCO of simulation in the cloud
• RISC-V in the cloud
• Scaling further

[email protected]

https://fires.im @firesimproject

Page 20: References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation. OSDI '16.
[2] Y. Lee et al. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanovic. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[6] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA '16.