![Page 1: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/1.jpg)
FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Jie Zhang and Myoungsoo Jung
Computer Architecture and Memory Systems Lab
![Page 2: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/2.jpg)
Executive Summary
Traditional heterogeneous compute system:
• Long data path between accelerator and storage;
• Accelerators consume high power.
Power: Intel 750 SSD 22W, DRAM 7W, CPU 91W, Xeon Phi 300W
FlashAbacus: Abacus and NAND Flash (10W, 6W); low-power; no data movement
Major Results
• Performance: 127% better than the traditional heterogeneous system.
• Energy: 78% less energy than the traditional approach.
![Page 3: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/3.jpg)
Example: Top-500 HPC trends
Systems using coprocessors/accelerators: 18%
Accelerators are a promising solution, but they also face several challenges.
![Page 4: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/4.jpg)
Challenge1: power consumption
High power consumption makes accelerators difficult to adopt in low-power systems.
[Chart: power consumption; values 300W, 180W, 20W]
![Page 5: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/5.jpg)
Challenge2: data movement overhead
32% storage
23% movement
45% computation
![Page 6: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/6.jpg)
Challenge2: data movement overhead
17% storage
64% movement
19% computation
![Page 7: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/7.jpg)
Challenge2: data movement
Discrete Hardware:
i) Storage to device memory
ii) Device to host-side DRAM
iii) Host-side DRAM to user process
iv) User process to accelerator DRAM
[Figure: data path through the SSD (storage media, IO controller, DRAM), the host (main CPU, northbridge, DRAM), and the accelerator (EMPs, cache, memory)]
![Page 8: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/8.jpg)
Challenge2: data movement
Discrete Software Stack:
User Space: Data-intensive Application, I/O Runtime, Acc. Runtime
Kernel Space: File System, HBA Driver, Acc. Driver
Device Space: SSD (Firmware), Accelerator
Storage S/W Stack: Data-intensive Application → I/O Runtime → File System → HBA Driver → Firmware → SSD
Acc. S/W Stack: Data-intensive Application → Acc. Runtime → Acc. Driver → Accelerator
![Page 9: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/9.jpg)
Challenge3: accelerator utilization
Low-power compute systems are sensitive to serial program code.
![Page 10: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/10.jpg)
Challenge3: accelerator utilization
79% 76%
![Page 11: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/11.jpg)
FlashAbacus
Our solution, FlashAbacus:
i. Reduce power consumption;
ii. Eliminate redundant data copies and long data paths;
iii. Improve core utilization.
[Chart: power consumption; labels 300W, 180W, 20W, FlashAbacus]
![Page 12: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/12.jpg)
A glance at the hardware
Heterogeneous Platform: many-core host (cores, memory, NorthBridge, IO controller) with a discrete accelerator (EMPs, cache, memory) and discrete storage (SSD)
Our Platform: a single accelerator combining processor cores and flash
![Page 13: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/13.jpg)
Inside Accelerator
[Figure: accelerator block diagram]
• GPDSP cores: LWP0, LWP1, ..., LWPn (roles shown: Flashvisor, Storengine, kernel execution)
• Peripheral components: PCIe, controller, north bridge, scratchpad, shared memory (DDR3L), PSC
• Networks: Tier-1 network, Tier-2 network
• Flash-based storage: flash backbone with FPGA controllers, each attached to several flash packages
![Page 14: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/14.jpg)
Programming model
FlashAbacus programming model (host and accelerator): kernelGen, kernelOffload, kernelExe, dataSave (optional); loop (optional)
Traditional programming model:
• I/O Runtime calls: fopen(), malloc(), fread(), fwrite(), free(), fclose()
• Acc. Runtime calls: Acc-Malloc(), Acc-Memcpy(), Acc-kernel(), Acc-Memcpy(), Acc-Free()
• Phases: Prologue, Body (loop), Epilogue
[Figure labels: parallel, serial, serial]
![Page 15: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/15.jpg)
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution
![Page 16: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/16.jpg)
Fuse flash in a multi-core system
[Figure: data access model linking LWP, L2$, DRAM, and Flash; paths labeled a, b, c]
Open questions: Storage access w/o OS? Storage management?
![Page 17: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/17.jpg)
Flash Virtualization
Flashvisor:
No OS/FS
• Directly expose flash address space to LWPs.
• Map flash address space to internal DRAM.
Manage storage access
• Maintain a simplified page mapping table.
• Translate from LBA to PPN.
Protection & access control
• Maintain a range lock for parallel data access.
Storengine: manages flash background tasks such as garbage collection and log dumping.
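To make the "simplified page mapping table" and the LBA-to-PPN translation more concrete, here is a minimal sketch. The class name `PageTable`, the mapping at page-group granularity, and the parameter `pages_per_group` are assumptions chosen for illustration; the slides do not give the actual table layout.

```python
class PageTable:
    """Toy page-mapping table (hypothetical layout): one entry per logical page group."""

    def __init__(self, num_groups, pages_per_group):
        self.pages_per_group = pages_per_group
        # entries[logical_group] holds the physical page group, or None if unmapped
        self.entries = [None] * num_groups

    def map_group(self, logical_group, physical_group):
        self.entries[logical_group] = physical_group

    def translate(self, lba):
        """Translate a logical page number into a physical page number."""
        group, offset = divmod(lba, self.pages_per_group)
        physical_group = self.entries[group]
        if physical_group is None:
            raise KeyError(f"logical page {lba} is not mapped")
        return physical_group * self.pages_per_group + offset


table = PageTable(num_groups=4, pages_per_group=8)
table.map_group(1, 3)
# logical page 10 = group 1, offset 2, which lands in physical group 3, offset 2
ppn = table.translate(10)
```

Mapping whole groups rather than single pages keeps the table small, which is one plausible reading of "simplified"; a per-page table would work the same way with `pages_per_group=1`.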
![Page 18: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/18.jpg)
Flash Virtualization
Read
1. Message (Kernel to Flashvisor)
2. Lock inquiry (range lock)
3. Page table lookup (scratchpad)
4. I/O (FPGA, flash)
5. DMA (LPDDR3)
6. Read
[Figure: address translation from logical address to physical address through the page table; fields shown: index, Ch#, page group#, pkg#]
[Figure: range lock kept as an RB tree whose nodes hold start pages; a lookup searches by start page]
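The lock inquiry in step 2 can be sketched as follows. This is an illustration only: it keeps locked ranges in a list sorted by start page and uses `bisect` as a stand-in for the RB tree on the slide, and the names `RangeLock`, `try_acquire`, and `release`, as well as the reader/writer modes, are assumptions rather than FlashAbacus interfaces.

```python
import bisect


class RangeLock:
    """Toy range lock over flash pages, ordered by start page (hypothetical API)."""

    def __init__(self):
        # sorted list of (start, end, mode, owner); end is exclusive
        self._locks = []

    def try_acquire(self, start, length, mode, owner):
        """Lock pages [start, start + length) in mode "r" or "w"; False on conflict."""
        end = start + length
        # only ranges that start before `end` can overlap the request
        limit = bisect.bisect_left(self._locks, (end,))
        for _, held_end, held_mode, _ in self._locks[:limit]:
            overlaps = held_end > start
            if overlaps and (mode == "w" or held_mode == "w"):
                return False
        bisect.insort(self._locks, (start, end, mode, owner))
        return True

    def release(self, start, length, mode, owner):
        self._locks.remove((start, start + length, mode, owner))


lock = RangeLock()
first_reader = lock.try_acquire(0, 8, "r", "k0")    # pages 0..7
second_reader = lock.try_acquire(4, 8, "r", "k1")   # overlapping readers may share
blocked_writer = lock.try_acquire(6, 2, "w", "k2")  # conflicts with both readers
```

Overlapping readers are allowed here, while a writer needs its range exclusively. A balanced tree would make the search logarithmic; the sorted list keeps the sketch short.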
![Page 19: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/19.jpg)
Flash Virtualization
Write
1. Write (Kernel to LPDDR3)
2. Message (Kernel to Flashvisor)
3. Lock inquiry (range lock)
4. Reclaim block (Storengine: garbage collection, page table snapshot)
5. I/O and DMA (FPGA, flash)
6. Page table update (scratchpad)
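The write steps can be approximated with a toy out-of-place update model. Everything named here (`ToyFlash`, `reclaim`, `write`) is hypothetical, and reclaiming single pages is a simplification, since real flash is erased in blocks. The sketch only shows the order of events: reclaim space if needed, write to a fresh physical page, then update the page table.

```python
class ToyFlash:
    """Toy out-of-place write path with page-level reclaim (illustrative only)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # free physical pages
        self.table = {}                     # logical page -> physical page
        self.invalid = set()                # stale physical pages awaiting reclaim

    def reclaim(self):
        # stand-in for the background garbage collection attributed to Storengine
        self.free.extend(sorted(self.invalid))
        self.invalid.clear()

    def write(self, lpn):
        if not self.free:
            self.reclaim()
        if not self.free:
            raise RuntimeError("flash is full")
        ppn = self.free.pop(0)
        old = self.table.get(lpn)
        if old is not None:
            self.invalid.add(old)  # the previous copy becomes stale
        self.table[lpn] = ppn      # page table update comes last
        return ppn


flash = ToyFlash(num_pages=2)
first = flash.write(0)   # uses physical page 0
second = flash.write(0)  # rewrite goes to physical page 1; page 0 becomes stale
third = flash.write(0)   # no free page left, so page 0 is reclaimed and reused
```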
![Page 20: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/20.jpg)
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution
![Page 21: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/21.jpg)
Parallel Kernel Execution
Parallel execution model:
• Conventional: master thread; requires OS thread management and host-accelerator communication.
• FlashAbacus: no host intervention.
[Figure: the host/user submits App1(), App2(), App3(); inside the accelerator, Flashvisor handles address management while Kernel 0, Kernel 1, and Kernel 2 of each application execute in parallel over storage (LPDDR3, FPGAs)]
![Page 22: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/22.jpg)
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Bind a user application to a specific LWP.
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels to idle LWPs.
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. InterSt: the kernels of an application run back to back on its bound LWP. InterDy: kernels are spread over idle LWPs; the reduction in kernel latency is marked SAVED.]
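The difference between the two policies can be illustrated with a small simulation. It assumes that all kernels arrive at T0 and are independent, and the function names and kernel durations are made up for the example; none of them come from the slides.

```python
import heapq


def inter_static(apps, num_lwps):
    """InterSt sketch: application i is bound to LWP i % num_lwps."""
    free_at = [0] * num_lwps
    finish = {}
    for app_id, kernels in enumerate(apps):
        lwp = app_id % num_lwps
        for kernel_id, duration in enumerate(kernels):
            free_at[lwp] += duration  # kernels of one app run back to back
            finish[(app_id, kernel_id)] = free_at[lwp]
    return finish


def inter_dynamic(apps, num_lwps):
    """InterDy sketch: each kernel goes to the LWP that becomes idle first."""
    idle = [(0, lwp) for lwp in range(num_lwps)]  # min-heap of (free time, LWP)
    finish = {}
    for app_id, kernels in enumerate(apps):
        for kernel_id, duration in enumerate(kernels):
            free_time, lwp = heapq.heappop(idle)
            finish[(app_id, kernel_id)] = free_time + duration
            heapq.heappush(idle, (free_time + duration, lwp))
    return finish


# two applications with two kernels each; durations are arbitrary time units
apps = [[3, 2], [4, 1]]
static = inter_static(apps, num_lwps=4)
dynamic = inter_dynamic(apps, num_lwps=4)
```

With four LWPs, the static policy leaves two of them unused and finishes the last kernel at time 5, while the dynamic policy uses all four and finishes at time 4.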
![Page 23: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/23.jpg)
Fine-granule Scheduling
Partition kernel into microblocks:
An example of FDTD-2D (kernel split into microblocks; each microblock is divided into screens)
Microblock 0:
  FOR j = 0..3
    ey[0][j] = _fict_[0]
  ENDFOR
Microblock 1:
  FOR i = 0..3
    FOR j = 0..3
      ey[i][j] = ...
    ENDFOR
    FOR j = 1..3
      ...
    ENDFOR
  ENDFOR
Microblock 2:
  FOR i = 0..3
    FOR j = 0..3
      ...
    ENDFOR
  ENDFOR
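A runnable approximation of this partitioning is sketched below. The slide elides the right-hand sides, so the update formulas and the coefficients 0.5 and 0.7 are assumptions modeled on the usual FDTD-2D stencil, and the function names are hypothetical. The point is only that the rows of the last microblock can be split into screens that run in any order.

```python
def microblock0(ey, fict, t):
    # boundary row of ey takes the source value for time step t
    for j in range(len(ey[0])):
        ey[0][j] = fict[t]


def microblock1(ex, ey, hz):
    # update ey (interior rows) and ex (interior columns) from hz
    nx, ny = len(hz), len(hz[0])
    for i in range(nx):
        if i > 0:
            for j in range(ny):
                ey[i][j] -= 0.5 * (hz[i][j] - hz[i - 1][j])
        for j in range(1, ny):
            ex[i][j] -= 0.5 * (hz[i][j] - hz[i][j - 1])


def microblock2_screen(ex, ey, hz, rows):
    # one screen of microblock 2: each row of hz reads only ex and ey
    ny = len(hz[0])
    for i in rows:
        for j in range(ny - 1):
            hz[i][j] -= 0.7 * (ex[i][j + 1] - ex[i][j] + ey[i + 1][j] - ey[i][j])


def make_grid(n, seed):
    return [[float((i * n + j + seed) % 5) for j in range(n)] for i in range(n)]


def step(screen_order):
    ex, ey, hz = make_grid(4, 1), make_grid(4, 2), make_grid(4, 3)
    microblock0(ey, [1.0], 0)
    microblock1(ex, ey, hz)
    for rows in screen_order:  # microblock 2 runs after ex and ey are final
        microblock2_screen(ex, ey, hz, rows)
    return hz


in_order = step([range(0, 2), range(2, 3)])
swapped = step([range(2, 3), range(0, 2)])
```

Both screen orders produce the same `hz`, because every screen writes a disjoint set of rows and reads only values finalized by the earlier microblocks.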
![Page 24: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/24.jpg)
Fine-granule Scheduling
Intra-kernel out-of-order scheduling (IntraO3):
• Schedule microblocks from all kernels across LWPs.
• Pros: maximizes core utilization
• Cons: must ensure that concurrently running microblocks have no dependencies
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. Legend: 1, 2 denote Microblock 0; a, b denote Microblock 1. Microblocks from different kernels are interleaved across LWPs; the reduction in kernel latency is marked SAVED.]
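A greedy simulation can illustrate why interleaving microblocks across kernels helps. It is a sketch under stated assumptions: each kernel is a list of microblocks, each microblock is a list of screen durations, screens of one microblock may run in parallel, and a microblock may start only after the previous microblock of the same kernel has finished. The function names and durations are illustrative, and the in-order variant is included only for comparison.

```python
import heapq


def run_microblock(lwps, screens, ready_time):
    """Place each screen on the earliest-free LWP; return the microblock finish time."""
    done = ready_time
    for duration in screens:
        start = max(heapq.heappop(lwps), ready_time)
        heapq.heappush(lwps, start + duration)
        done = max(done, start + duration)
    return done


def intra_in_order(kernels, num_lwps):
    """Kernels run one after another; only screens of one microblock overlap."""
    lwps = [0] * num_lwps  # min-heap of LWP free times
    now, finish = 0, []
    for kernel in kernels:
        for screens in kernel:
            now = run_microblock(lwps, screens, now)
        finish.append(now)
    return finish


def intra_out_of_order(kernels, num_lwps):
    """Any kernel's next microblock may be dispatched once its predecessor is done."""
    lwps = [0] * num_lwps
    ready = [(0, k, 0) for k in range(len(kernels))]  # (ready time, kernel, microblock)
    heapq.heapify(ready)
    finish = [0] * len(kernels)
    while ready:
        ready_time, k, m = heapq.heappop(ready)
        done = run_microblock(lwps, kernels[k][m], ready_time)
        if m + 1 < len(kernels[k]):
            heapq.heappush(ready, (done, k, m + 1))  # dependency within the kernel
        else:
            finish[k] = done
    return finish


# k0: two microblocks with two screens each; k1: two microblocks with one screen each
kernels = [[[2, 2], [1, 1]], [[2], [1]]]
in_order = intra_in_order(kernels, num_lwps=4)
out_of_order = intra_out_of_order(kernels, num_lwps=4)
```

In this example the in-order schedule finishes the second kernel at time 6, while the out-of-order schedule overlaps it with the first kernel and finishes both at time 3. The dependency rule is still enforced: a microblock never starts before its predecessor in the same kernel completes.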
![Page 25: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/25.jpg)
Experiment Setup
System configuration:
• Host: Xeon 2620-v3
• LWPs: 8 @ 1GHz
• SSD access latency: Read Lat.=25us, Write Lat.=800us
• Workloads: Polybench benchmark suite
Accelerator Configuration:
• SIMD: uses OpenMP, with discrete storage and accelerator;
• InterSt: FlashAbacus with static inter-kernel scheduling;
• InterDy: FlashAbacus with dynamic inter-kernel scheduling;
• IntraO3: FlashAbacus with out-of-order intra-kernel scheduling.
![Page 26: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/26.jpg)
Evaluation
Time series analysis
• IntraO3 has a shorter storage access time than SIMD because it eliminates the data movement overhead.
• IntraO3 has a shorter compute time because dynamic scheduling improves core utilization.
[Figure: time series showing Storage Access and Compute phases]
![Page 27: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/27.jpg)
Evaluation
Energy
• FlashAbacus drastically reduces the energy of data movement.
![Page 28: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/28.jpg)
Thank you
![Page 29: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/29.jpg)
Backup
![Page 30: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/30.jpg)
Performance Evolution in Computing
• Single-Core Era. Constrained by: power, complexity
• Multi-Core Era. Constrained by: power, scalability
• Heterogeneous System Era. Enabled by: data parallelism, high-performance accelerators (Intel Xeon Phi, GPGPU)
![Page 31: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/31.jpg)
Challenge2: data movement
Storage access accounts for a large ratio of total execution time.
![Page 32: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/32.jpg)
Parallel Kernel Execution
Manage kernel scheduling to maximize the execution throughput of all LWPs.
Parallel execution model:
• Single application: the host/user submits App(); Kernel 0, Kernel 1, Kernel 2, ..., Kernel n execute in parallel inside the accelerator, with Flashvisor handling address management over storage (LPDDR3, FPGAs).
• Multiple applications: the host/user submits App1(), App2(), App3(); Kernel 0, Kernel 1, and Kernel 2 of each application execute in parallel, with Flashvisor handling address management over storage (LPDDR3, FPGAs).
![Page 33: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/33.jpg)
Inside Accelerator
[Figure: accelerator block diagram with LWP0, LWP1, ..., LWPn; Tier-1 and Tier-2 networks; PCIe, controller, north bridge, scratchpad, shared memory (DDR3L), PSC; flash backbone with FPGA controllers and flash packages]
![Page 34: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/34.jpg)
Programming model
[Figure: numbered steps across three phases (kernel offload, kernel schedule, kernel execution) involving the host, PCIe, Flashvisor, DRAM, PSC, and an LWP; step labels shown: INT, download, sleep, invoke, load]
![Page 35: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/35.jpg)
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Bind a user application to a specific LWP.
• Pros: equivalent, no starvation
• Cons: low core utilization
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels to idle LWPs.
• Pros: good performance when kernels are sufficient
• Cons: poor performance when kernels are few
![Page 36: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/36.jpg)
Kernel Scheduling Strategies
Inter-kernel scheduling (static):
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0; the kernels of an application run back to back on its bound LWP]
Inter-kernel scheduling (dynamic):
[Figure: the same kernels spread over idle LWPs; the reduction in kernel latency is marked SAVED]
![Page 37: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/37.jpg)
Kernel Scheduling Strategies
Solution: partition kernel into microblocks:
An example of FDTD-2D (each microblock is divided into screens)
Microblock 0:
  FOR j = 0..3
    ey[0][j] = _fict_[0]
  ENDFOR
Microblock 1:
  FOR i = 0..3
    FOR j = 0..3
      ey[i][j] = ...
    ENDFOR
    FOR j = 1..3
      ...
    ENDFOR
  ENDFOR
Microblock 2:
  FOR i = 0..3
    FOR j = 0..3
      ...
    ENDFOR
  ENDFOR
![Page 38: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/38.jpg)
Kernel Scheduling Strategies
Intra-kernel scheduling (in-order):
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. Legend: 1, 2 denote Microblock 0; a, b denote Microblock 1. Kernels run one after another with their microblocks spread across LWPs; the reduction in kernel latency is marked SAVED.]
![Page 39: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/39.jpg)
Kernel Scheduling Strategies
Intra-kernel scheduling (out-of-order):
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. Legend: 1, 2 denote Microblock 0; a, b denote Microblock 1. Microblocks from different kernels are interleaved across LWPs; the reduction in kernel latency is marked SAVED.]
![Page 40: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/40.jpg)
Fine-granule Scheduling
Intra-kernel in-order scheduling (IntraIo):
• Execute kernels serially and schedule microblocks across all LWPs.
• Pros: reduces the complexity of microblock scheduling
• Cons: cannot maximize core utilization
[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. Legend: 1, 2 denote Microblock 0; a, b denote Microblock 1. The reduction in kernel latency is marked SAVED.]
![Page 41: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/41.jpg)
Evaluation
Throughput
• InterSt/IntraIo perform better than SIMD thanks to the integration of the accelerator and NAND flash.
• InterDy/IntraO3 perform better than InterSt/IntraIo because dynamic scheduling improves core utilization.
![Page 42: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power](https://reader035.vdocuments.site/reader035/viewer/2022070108/60296d98d7752c178737e7f1/html5/thumbnails/42.jpg)
Evaluation
Energy
• InterDy/IntraO3 show SSD access energy similar to SIMD, as they access the same amount of data.
• InterDy/IntraO3 consume even less computation energy than SIMD, as dynamic scheduling ensures that kernels can execute in parallel.
• FlashAbacus drastically reduces the energy of data movement.