cloud - nuts and bolts of computational storage · ˃hadoop job across multiple data nodes with...
TRANSCRIPT
© Copyright 2018 Xilinx
Presented By
Gopi Jandhyala
Deboleena Minz Sakalley
Seong Kim
Sonal Santan
October 2nd, 2018
The Nuts & Bolts of
Computational Storage Platform
© Copyright 2018 Xilinx
Agenda
˃ Today’s ChallengesProblem Statement
Solution Proposal
Solution Illustration
˃ Computational Storage PlatformInfrastructure
Developer Tools
Applications
˃ Solution Proof PointsPostgres DB Acceleration
Compression
˃ Summary
>> 2
© Copyright 2018 Xilinx
Today’s Challenges
>> 3
˃ Exponential data growth driven by
unstructured data, Eg. video
˃ The new brick wall: Performance
bottlenecks & power implications of
moving data back and forth to compute
Source: Patrick Cheesman
The Cambrian Explosion… of Data
˃ Computing hitting one brick
wall after another (the end of Moore’s law,
Dennard Scaling, Amdahl’s law)
˃ Inevitable evolution towards
Heterogeneous Computing
˃ Accelerators critical to scaling
performance cost/power efficientlySource: David Patterson, ACM 2018
© Copyright 2018 Xilinx>> 5
Moving Data to Compute ….
CPU0 CPU1
SS
D0
SS
D1
SS
D2
SS
Dn
CPU0 CPU1
SS
D0
SS
D1
SS
D2
SS
Dn
11 1
CPU0 CPU1
SS
D0
SS
D1
SS
D2
SS
Dn
11 1
2 2 2 2
Storage
AdapterStorage
Adapter
The New Brick Wall
Storage
AdapterStorage
Adapter
Storage
AdapterStorage
Adapter
1-10 TB / server 10-100 TB / server 100-1000 TB / server
Power
Non data-intensive acceleration
Data-intensive acceleration performance
Performance PowerPerformance PowerPerformance
Non data-intensive acceleration
Data-intensive acceleration performance
Non data-intensive acceleration
Data-intensive acceleration performance
© Copyright 2018 Xilinx
Moving Compute to Data ….
>> 5
CPU0 CPU1
FPGA
Storage
Accel
IO
DD
R
DD
R
CPU0 CPU1
FPGA
Storage
Accel
IO
DD
R
DD
R
CPU1CPU0 CPU1
FPGA
Storage
Accel
IO
DD
R
DD
R
The solution
˃ “Offload” storage centric
workloads
˃ “Split personality” – offload or
storage
˃ Accelerate storage services
“Inline” Encryption, compression, hashing
˃ Tighter integration
˃ Compute near StorageInline, offload and more
Search, Bigdata
© Copyright 2018 Xilinx>> 7
˃ Search for an image across 100TB
˃ Sequential scan of 100TB drive data into CPU
and ML based image classification accelerator
I/O and processing bottlenecks
High power consumption
˃ For same search embedded ML based
image classification accelerator scans drive
data locally and responds with match/not
match
Eliminate I/O and processing bottlenecks
Localized data scanning optimizes system power
Finding The Needle In the Haystack
Solution Illustration
CPU0 CPU1
SS
D0
SS
D1
SS
D2
SS
Dn
24 x 4TB
NVMe SSDs
24 x 4TB
NVMe SSDs with
Adaptive Acceleration
Storage
AdapterStorage
Adapter
Storage
AdapterStorage
Adapter
CPU0 CPU1
SS
D0
SS
D1
SS
D2
SS
Dn
Compute To Data
© Copyright 2018 Xilinx>> 8
˃ Count the total twitter feeds for a key word in
the last 24 hours
˃ Hadoop job across multiple data nodes with
each node CPU sequentially scanning the
saved data
I/O and processing bottlenecks
High power consumption
˃ Push the counting work to each drive within data
node thus enabling higher number of jobs each data
node can undertake
˃ Drive responds with the individual count from each
job that the CPU can simply aggregate
Eliminate I/O and processing bottlenecks
Localized data scanning optimizes system power
Finding The Needle In the Haystack
Solution Illustration
Compute To Data
© Copyright 2018 Xilinx
Agenda
˃ Today’s ChallengesProblem Statement
Solution Proposal
Solution Illustration
˃ Computational Storage PlatformInfrastructure
Developer Tools
Applications
˃ Solution Proof PointsPostgres DB Acceleration
Compression
˃ Summary
>> 8
© Copyright 2018 Xilinx
Xilinx Computation Storage Platform
>> 9
Adaptable Platform
High Speed
Interconnect
Device
Memory
ARM
ProcessorsSystem
Mgmt
Profiler &
Debugger
SDAccel
Compiler
HW & SW
EmulationOpenCL, C,
C++, Verilog
Database
Services
Storage
Services
Rich Media
Services
Infrastructure
Developer
Applications
Spark, Postgres,
Maria DB, HadoopCompression, Dedup,
Erasure Codes
Image Search, Video
Transcoding
Partial
Reconfig
© Copyright 2018 Xilinx
Platform Architecture
˃ Device Memory
Exposed to CPU as PCIe BAR
Used by SDAccel APIs to create buffers directly in the device (instead of Host DDR)
Can be used as P2P buffer target for direct data transfer from the storage
Shared by local ARM cores and the HW accelerators
Infrastructure
Device Memory
System
Mgmt
ARM
Cores
Kernels
Storage
Host
Debug
FPGA Accelerator
>> 11
© Copyright 2018 Xilinx
Platform Architecture
˃ ARM Cores
Quad Core ARM Cortex A53 up to 1.3GHz
Enable SW-Assist libraries for HW accelerators (Kernels)‒ Allows for SW/HW partitioning of kernels
Help run scheduling activities across multiple kernel instances
Can be used for authentication of end user accelerator binaries
Enable board management controls
Infrastructure
Device Memory
System
Mgmt
ARM
Cores
Kernels
Storage
Host
Debug
FPGA Accelerator
>> 12
© Copyright 2018 Xilinx
Platform Architecture
˃ System Management
Device sensors to monitor current, temperature, voltages etc.
System monitoring alerts to Host
˃ Debug Hooks
Enables performance monitoring
Provides insight on the interconnect/kernel efficiency
Stall monitoring
ILA/Chipscope for signal level debug
Infrastructure
Device Memory
System
Mgmt
ARM
Cores
Kernels
Storage
Host
Debug
FPGA Accelerator
>> 13
© Copyright 2018 Xilinx
Platform Architecture
˃ Kernels
Can be written in Verilog/VHDL, C, C++ or OpenCL
Dynamically programmed via PR (Partial reconfiguration) during run time
Multiple instances of kernels can be programmed for higher throughput‒ Auto-stitched by SDAccel compiler/linker
Use device memory for input and output buffers
Infrastructure
Device Memory
System
Mgmt
ARM
Cores
Kernels
Storage
Host
Debug
FPGA Accelerator
>> 14
© Copyright 2018 Xilinx
Platform Architecture
˃ High-Speed Interconnect
High Speed fabric connects all critical components‒ PCIe interface to the Host and Storage
‒ High speed internal AXI bus to memory and kernels
BW greater than the storage medium
˃ Storage
Tight integration with the platform
Direct data transfer to/from the FPGA buffers to enable faster and efficient kernel access‒ Bypass host DDR!
Infrastructure
Device Memory
System
Mgmt
ARM
Cores
Kernels
Storage
Host
Debug
FPGA Accelerator
>> 15
© Copyright 2018 Xilinx
Computational Storage Platform
DD
R C
trlr
SSD
FPGA
Memory Platform “Shell”
ARM CPUs
Accelerator “Role”
Shell, Role & Partial Reconfiguration
Fuzzy Search
Kernel
Fuzzy Search
Kernel
Exact Search
Kernel
Exact Search
Kernel
Search
App
Plugin
Levenstein
Distance
Kernel
Levenstein
Distance
Kernel
ML Inf.
Kernel
ML Inf.
Kernel
ML Inf.
Kernel
ML Inf.
Kernel
ML Inf.
Kernel
ML Inf.
Kernel
ML Inf
App
Plugin
Infrastructure
>> 16
© Copyright 2018 Xilinx
Tools and Infrastructure
>> 16
This Photo by
Unknown
Author is
licensed under
CC BY-NC-ND
Developer
˃ Compiler
Powerful SDAccel compiler
Many options to choose from – OpenCL, C++, C and Verilog/VHDL
˃ Partial Reconfiguration
Compiler generates image for the Role in the Platform
˃ Open Source Xilinx RunTime
Feature full runtime stack with user space and Linux kernel drivers
Multi-threading and Multi-process safe
˃ Platform Design
Platform details including Shell captured as DSA
Compiler
C C++ OpenCL
RTL
DSA
© Copyright 2018 Xilinx
Developer Productivity Tools
>> 17
This Photo by
Unknown Author is
licensed under CC
BY-NC-ND
Developer
˃ HW & SW Emulation
Emulation flows for faster debug
SW emulation‒ Debug kernel functionality
‒ Sequential execution
HW emulation‒ Synthesized kernel emulated
‒ Parallel/event based execution
˃ Profiler & Debug
Performance, Stall and Execution profiling/monitoring
ILA/Chipscope level debug also supported
˃ Mgmt Tools
Query board status
Perform board admin operations
Debugging Profiling
Libraries
Platform
Runtime
© Copyright 2018 Xilinx
Agenda
˃ Today’s ChallengesProblem Statement
Solution Proposal
Solution Illustration
˃ Computational Storage PlatformInfrastructure
Developer Tools
Applications
˃ Solution Proof PointsPostgres DB Acceleration
Compression
˃ Summary
>> 18
© Copyright 2018 XilinxPage 19
0
10
20
30
40
50
60
70
80
90
1GB 5GB 10GB
Tim
e in
Sec
With memory thrashing
X86 Compute Near Storage
0
5
10
15
20
25
1GB 5GB 10GB
Tim
e in
Sec
Without memory thrashing
X86 Compute Near Storage
PostgreSQL Query6
Solution Proofpoint #1
˃ PostgreSQL
PostgreSQL Query6 accelerated by SDAccel stack on computational storage platform
Minor modifications of PostgreSQL code allowed the data movement from storage device directly to the accelerator, thus eliminating host DDR copies altogether
˃ Results
Measurements show 3-5x on idle machines and > 10x on machines dealing with memory intensive apps
© Copyright 2018 Xilinx
PostgreSQL Query6 – x86 DDR BW Comparison
Solution Proofpoint #1
0
200
400
600
800
1000
1200
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
X86
DD
R B
/W. M
B/s
ec
Time – 1Unit = 0.1 secs
DDR B/W comparison – x86 vs. Compute Offload vs Computational Storage; 1GB dataset
X86 Computational Storage
Enabling Complete Compute Offload and Minimize Memory Usage
>> 21
© Copyright 2018 Xilinx
Compression
>> 21
Solution Proofpoint #2
˃ Comparison among FPGA acceleration, ASIC (Intel
QAT), CPU
˃ In addition >8x lower latency than CPU, e.g. 1MB
file
FPGA Acceleration: 1.99ms
ZLIB-1: 16.66ms
FPGA – High Comp*
P2P With External Switch
S
S
D
S
S
DFPGA
X86 Root-Complex
x86
PCIe Switch (Optional)
S
S
D
FPGA – High Thruput*
* Partner Data on Xilinx FPGA
© Copyright 2018 Xilinx
Compression – GZIP x86 DDR BW Comparison
Solution Proofpoint #2
0
100
200
300
400
500
600
Mem
ory
(M
B/s
)
Time
Read
Write
Total
Compute Offload Data Transfer
Computational
Storage Data Transfer
>>23
© Copyright 2018 Xilinx
Agenda
˃ Today’s ChallengesProblem Statement
Solution Proposal
Solution Illustration
˃ Computational Storage PlatformInfrastructure
Developer Tools
Applications
˃ Solution Proof PointsPostgres DB Acceleration
Compression
˃ Summary
>> 23
© Copyright 2018 Xilinx
Summary
˃ Exponential data growth driving the computational storage opportunity to offload
compute functions closer to memory and storage
˃ FPGA enabled adaptable storage will enable differentiation and unlock efficiency for
storage workloads
˃ Building on the success of Xilinx compute acceleration platform, Xilinx Computational
Storage Platform provides ease of application portability and tremendous returns for
workloads that are have storage affinity
>> 25
© Copyright 2018 Xilinx