![Page 1: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/1.jpg)
RAMP-White / FAST-MP
Hari Angepat and Derek Chiou
Electrical and Computer Engineering
University of Texas at Austin
Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale
![Page 2: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/2.jpg)
RAMP-White Overview
Use existing FPGA processor implementations to build scalable, flexible, coherent shared memory platforms that run standard operating systems
Standard ISA/OS enables more complex applications such as software emulators (QEMU) when desired.
![Page 3: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/3.jpg)
RAMP White Architecture
Classic shared memory machine design
[Figure: block diagram with two each of Processor, Intersection Unit, Router, and IO/MEM]
![Page 4: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/4.jpg)
RAMP White Architecture
[Figure: RAMP-White block diagram. Components: Processor, Proc shim, Intersection Unit, NIU, Ring Router, Periph shim, Peripheral bus, IO Devices, DRAM]
![Page 5: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/5.jpg)
RAMP White Architecture
Model CMP/SMP targets
• Coherent shared memory platform
• Single image OS
RAMP scalability (1K cores) via spatial and temporal replication
![Page 6: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/6.jpg)
RAMP White Architecture
Ability to use commodity cores:
• Sparc V8: Leon3 soft‐core
• PowerPC: PPC405 hard‐core
• Configurable coherence protocol, engines
![Page 7: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/7.jpg)
RAMP White Architecture
Configurable modules:
• NIC, network, coherence engine, intersection unit
Modules connected by Connectors:
• Point‐to‐point FIFOs that can model target time if required
Shim adapters
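One way to picture a connector that models target time is a FIFO whose entries become visible to the receiver only after a fixed number of target cycles. The sketch below is illustrative only: the `Connector` class, its `capacity` and `latency` parameters, and its method names are hypothetical and are not taken from the RAMP-White implementation.

```python
from collections import deque


class Connector:
    """Hypothetical point-to-point FIFO with an optional target-time delay."""

    def __init__(self, capacity, latency=0):
        self.capacity = capacity   # maximum number of in-flight messages
        self.latency = latency     # target cycles between send and earliest receive
        self._queue = deque()      # entries are (ready_cycle, message)

    def can_send(self):
        return len(self._queue) < self.capacity

    def send(self, message, now):
        """Enqueue a message at target cycle `now`; return False if full."""
        if not self.can_send():
            return False
        self._queue.append((now + self.latency, message))
        return True

    def receive(self, now):
        """Return the oldest message if it is ready at cycle `now`, else None."""
        if self._queue and self._queue[0][0] <= now:
            return self._queue.popleft()[1]
        return None
```

With `latency=0` the connector behaves as a plain bounded FIFO, which corresponds to the case where target time does not need to be modeled.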
![Page 8: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/8.jpg)
RAMP-White Status
Working:
• Multi processor Leon
• Soft‐fp kernel and userspace as initramfs
• Standard pthread Splash benchmarks
Still debugging:
• Multichip crossing with scalable interrupt components
• Integration with parametrizable FAST cache model
See me during retreat if interested in Alpha release
![Page 9: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/9.jpg)
Prototype (See at Demo)
Hardware
• Sparc V8 32‐bit soft‐core processor (Leon3)
• 50 MHz core clock, soft‐FP, 16KB Icache, Dcache bypassed
• GRLIB components {serial, ethernet, ddr, jtag}
Software
• Linux SMP 2.6.21 for Leon3
• Pthread‐based Splash2 benchmarks
• RAM disk rootfs with simple userspace apps
Platform
• BEE2 control FPGA with JTAG‐based programming
• Ethernet for kernel loading/debugging
![Page 10: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/10.jpg)
FAST-MP
![Page 11: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/11.jpg)
FAST-MP: High Level Goal
Multi‐resolution coherent shared memory target emulation
• Predict performance/power for wide range of micro‐architectures at accuracies ranging from cycle accurate to functional‐only
• Capable of running real ISAs aided by binary translation (x86, Sparc, PowerPC, etc), operating systems (unmodified Windows, Linux), compilers, applications (SQLServer, Apache, etc)
• Extensible/flexible (new instructions, different micro architectures)
![Page 12: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/12.jpg)
Performance Modeling on RAMP-White
RAMP‐White host predicts RAMP‐White target performance perfectly
• Predicting performance of arbitrary micro‐architectures requires additional support
FAST (FPGA Accelerated Simulation Techniques) uses a timing model to predict performance of arbitrary micro‐architectures
• Special purpose structure designed to predict time
• Very small (complex model in a fraction of an FPGA)
• Uses same functional model for any micro‐architecture
White as a scalable functional model for FAST‐MP
![Page 13: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/13.jpg)
FAST (FPGA Accelerated Simulation)
Speculative functional model (FM) with checkpoint/rollback of the FM when the FM and timing model (TM) paths diverge
• Example: branch mispredict/resolve
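The checkpoint/rollback idea can be sketched with a toy functional model that saves its state before every instruction and restores it when told that its path has diverged. This is a minimal sketch, not the FAST implementation: the `FunctionalModel` class, its state layout, and the `rollback` signature are hypothetical, and the "instruction" is a stand-in that only increments a register.

```python
import copy


class FunctionalModel:
    """Toy functional model: state is a register dict plus a program counter."""

    def __init__(self):
        self.state = {"pc": 0, "regs": {"r1": 0}}
        self.checkpoints = {}   # instruction sequence number -> saved state
        self.seq = 0            # sequence number of the next instruction

    def step(self):
        """Checkpoint, execute one toy instruction, and return a trace entry."""
        self.checkpoints[self.seq] = copy.deepcopy(self.state)
        entry = (self.seq, self.state["pc"])
        self.state["regs"]["r1"] += 1   # stand-in for real instruction semantics
        self.state["pc"] += 4
        self.seq += 1
        return entry

    def rollback(self, seq, new_pc):
        """Restore the state saved before instruction `seq` and redirect fetch."""
        self.state = copy.deepcopy(self.checkpoints[seq])
        self.state["pc"] = new_pc
        # Discard checkpoints belonging to the squashed instructions.
        self.checkpoints = {s: c for s, c in self.checkpoints.items() if s < seq}
        self.seq = seq
```

In this sketch the timing model would call `rollback(seq, new_pc)` when it decides that the target would have fetched from a different address at instruction `seq`, for example after a modeled branch misprediction.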
![Page 14: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/14.jpg)
FAST-MP Approach
Multicore functional model executes as it wishes
• Functional instruction stream generated (per core) and sent to timing model
• Rollback when functional model execution differs from timing model
  • Branch mispredictions, address speculation, etc.
Possible for functional model to access memory in different order than target
![Page 15: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/15.jpg)
FAST-MP Memory Reordering
All memory references tagged with a version number
FM passes a version number in trace to TM
• Essentially a precondition on the validity of the given trace
If TM version != FM version
• Freeze timing models (to avoid corrupting TM)
• Rollback functional models to restore correct memory/architectural state
• Use TM directed order to re‐execute
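The version check can be illustrated as follows. The slides do not say how versions are assigned, so this sketch assumes one counter per address that is incremented on every write; `TraceEntry`, `VersionedMemory`, and `first_mismatch` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    core: int
    addr: int
    is_write: bool
    version: int     # version of `addr` that the functional model observed


class VersionedMemory:
    """Tracks one version counter per address; each write bumps it (assumption)."""

    def __init__(self):
        self.versions = {}

    def access(self, core, addr, is_write):
        """Perform an access and return a trace entry tagged with the version seen."""
        seen = self.versions.get(addr, 0)
        if is_write:
            self.versions[addr] = seen + 1
        return TraceEntry(core, addr, is_write, seen)


def first_mismatch(tm_order):
    """Replay entries in timing-model order and return the index of the first
    entry whose recorded version differs from the version seen in that order,
    or None if the functional trace is valid for this order."""
    tm_mem = VersionedMemory()
    for i, entry in enumerate(tm_order):
        expected = tm_mem.access(entry.core, entry.addr, entry.is_write).version
        if expected != entry.version:
            return i     # freeze the TM here and roll the FM back to this entry
    return None
```

For example, if core 0 writes an address and core 1 then reads it in functional order, the read carries version 1. If the timing model orders the read first, the replay sees version 0, and the mismatch marks the point where the functional models must be rolled back and re-executed in the TM-directed order.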
![Page 16: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/16.jpg)
White
[Figure: block diagram with two each of Processor, Intersection Unit, Router, and IO/MEM]
![Page 17: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/17.jpg)
White + Timing Model
PowerPC/Sparc ISA with arbitrary timing model
[Figure: the White block diagram (two each of Processor, Intersection Unit, Router, and IO/MEM) with two Timing Model blocks and a Net model added]
![Page 18: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/18.jpg)
White + VM + Timing Model
Sparc ISA with QEMU to emulate any ISA
Requires trace/rollback:
• Hardware
• Software (QEMU) ‐ can also be hardware accelerated
[Figure: the White block diagram (two each of Processor, Intersection Unit, Router, and IO/MEM) running an SMP OS with two QEMU x86 VCPUs, with two X86 Timing Model blocks, two Timing Model blocks, and a Net model added]
![Page 19: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/19.jpg)
Probability of Reordered Memory Ops
Functionally‐driven speculation in an MP is costly if timing‐ordered memory references conflict
• Preliminary study of atomic operations in x86 applications
• Use Pin dynamic instrumentation tool to monitor every atomic operation running a multi‐threaded app
• Analyze inter‐atomic distance for existing shared memory workloads (Splash2, Parsec)
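The analysis step can be sketched as a pass over a trace of atomic operations. The trace format `(cycle, thread_id, address)` and both function names are assumptions made for illustration; the slides do not describe the actual Pin tool output. "Interprocessor reuse distance" is interpreted here as the cycle gap between an atomic operation and the immediately preceding atomic operation on the same address, counted only when that preceding operation came from a different thread.

```python
from collections import Counter


def interprocessor_reuse_distances(trace):
    """trace: iterable of (cycle, thread_id, address) atomic operations,
    sorted by cycle. Returns the cycle gap to the immediately preceding
    atomic operation on the same address, for each operation whose
    predecessor was issued by a different thread."""
    last = {}          # address -> (cycle, thread_id) of the latest atomic op
    distances = []
    for cycle, thread, addr in trace:
        prev = last.get(addr)
        if prev is not None and prev[1] != thread:
            distances.append(cycle - prev[0])
        last[addr] = (cycle, thread)
    return distances


def histogram(distances, bucket=2500):
    """Fraction of distances falling in each `bucket`-cycle bin."""
    total = len(distances)
    if total == 0:
        return {}
    counts = Counter((d // bucket) * bucket for d in distances)
    return {start: n / total for start, n in sorted(counts.items())}
```

Short distances are the cases of interest: they indicate atomic operations from different processors landing close together in time, which is where a functional order is most likely to differ from the timing-model order.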
![Page 20: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/20.jpg)
Interprocessor Atomic Reuse Distance
[Figure: percent of atomic operations (y axis, 0% to 35%) versus interprocessor reuse distance in cycles (x axis, 0 to 90,000). Series: FFT, LU, Ocean, Radix, BlackScholes, BodyTrack, FaceSim, Ferret, FluidAnimate, FreqMine, Swaptions]
![Page 21: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/21.jpg)
Task Size Scaling on Intel CMPs
[Figure: speedup normalized to serial implementation (y axis, 0.00 to 4.00) versus task size in cycles (x axis, 0 to 8,000). Series: Xeon5140 and XeonX3230, each with 1, 2, and 4 threads]
![Page 22: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/22.jpg)
FAST-MP Can Be Less Than Accurate!
Nearly accurate
• Functional model backpressured by timing model
• Don’t want to overflow buffers
• Each functional core roughly at correct instruction relative to other cores
• Do not rollback to reorder memory operations
  • Still correct, just locks taken in different order
• Eliminate rollback overheads, probably quite accurate
  • Model RAMP‐White on FAST‐MP to check accuracy
Functional + cache
• Run with just cache simulators
Etc.
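The backpressure point in the "nearly accurate" mode can be illustrated with a bounded trace buffer per core: a functional core stalls when its buffer is full, so it can never run more than the buffer capacity ahead of the timing model. This is a toy sketch under assumed parameters; `TraceBuffer`, `run`, and the per-tick rates are hypothetical and not taken from FAST‐MP.

```python
from collections import deque


class TraceBuffer:
    """Bounded per-core buffer between a functional core and the timing model."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def full(self):
        return len(self.entries) >= self.capacity


def run(num_cores, capacity, fm_steps_per_tick, tm_steps_per_tick, ticks):
    """Advance functional cores and the timing model in lockstep ticks.
    Returns (instructions produced, instructions consumed) per core."""
    buffers = [TraceBuffer(capacity) for _ in range(num_cores)]
    produced = [0] * num_cores
    consumed = [0] * num_cores
    for _ in range(ticks):
        for core, buf in enumerate(buffers):
            # The functional core stalls as soon as its buffer is full.
            for _ in range(fm_steps_per_tick):
                if buf.full():
                    break
                buf.entries.append(produced[core])
                produced[core] += 1
            # The timing model drains the buffer at its own rate.
            for _ in range(tm_steps_per_tick):
                if buf.entries:
                    buf.entries.popleft()
                    consumed[core] += 1
    return produced, consumed
```

Because every core is held to within `capacity` instructions of the timing model, the cores also stay roughly aligned with each other, which is the property the slide relies on when it skips rollbacks for memory reordering.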
![Page 23: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/23.jpg)
QEMU on White-Leon3
QEMU 0.9.1 with patches
• Some issues remaining with Dyngen for V8 ISA with Leon3 cross compiler
For initial Linux boot:
• X86 instructions: 1
• QEMU uOPs: ~3.1
• Sparc instructions: ~22.5
  • High overheads involved in address computation, segmentation checks, software TLB, etc.
Can modify/replace Leon3 to improve efficiency
• MicroOP‐based processor
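The expansion ratios above imply a rough emulation rate, worked through below. The instruction counts come from the slide and the 50 MHz clock from the prototype description; the cycles-per-instruction value is an assumption chosen only to make the arithmetic concrete, so the resulting rate is an optimistic estimate rather than a measurement.

```python
x86_instructions = 1.0
qemu_uops = 3.1             # approximate, from the initial Linux boot
sparc_instructions = 22.5   # approximate, from the initial Linux boot

# Host SPARC instructions needed per QEMU micro-op.
sparc_per_uop = sparc_instructions / qemu_uops

core_clock_hz = 50e6        # Leon3 prototype core clock
assumed_cpi = 1.0           # ASSUMPTION: not measured; real CPI is likely higher

# Emulated x86 instructions per second under the assumed CPI.
x86_per_second = core_clock_hz / (assumed_cpi * sparc_instructions)

print(f"SPARC instructions per QEMU uOP: {sparc_per_uop:.2f}")
print(f"Emulated x86 instructions per second (optimistic): {x86_per_second:,.0f}")
```

Reducing the number of host instructions per micro-op, for example with a micro-op based processor as the slide suggests, would raise the emulation rate proportionally.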
![Page 24: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/24.jpg)
Conclusions
Initial RAMP‐White Alpha design functional
FAST‐MP
• Provide various ISAs
• Cycle‐accuracy to purely functional
• Developing power models
FAST‐MP will run on top of RAMP‐White as well as on standard multicore systems
![Page 25: RAMP-White / FAST-MP](https://reader034.vdocuments.site/reader034/viewer/2022042311/625b5dd9f0c33574641dcfcb/html5/thumbnails/25.jpg)
Questions…