The Nuts & Bolts of Computational Storage Platform

Presented By: Gopi Jandhyala, Deboleena Minz Sakalley, Seong Kim, Sonal Santan

October 2nd, 2018

© Copyright 2018 Xilinx



Agenda

˃ Today’s Challenges
    Problem Statement
    Solution Proposal
    Solution Illustration
˃ Computational Storage Platform
    Infrastructure
    Developer Tools
    Applications
˃ Solution Proof Points
    Postgres DB Acceleration
    Compression
˃ Summary

Today’s Challenges

˃ Exponential data growth driven by unstructured data, e.g. video
˃ The new brick wall: performance bottlenecks and power implications of moving data back and forth to compute

[Figure: The Cambrian Explosion… of Data. Source: Patrick Cheesman]

˃ Computing is hitting one brick wall after another (the end of Moore’s law, Dennard scaling, Amdahl’s law)
˃ Inevitable evolution towards heterogeneous computing
˃ Accelerators are critical to scaling performance cost- and power-efficiently

[Figure source: David Patterson, ACM 2018]

Moving Data to Compute ….

The New Brick Wall

[Figure: three server configurations (1-10 TB / server, 10-100 TB / server, 100-1000 TB / server), each showing CPU0 and CPU1 attached through storage adapters to SSD0 … SSDn. Each panel carries Performance and Power axes and the labels “Non data-intensive acceleration” and “Data-intensive acceleration performance”.]

Moving Compute to Data ….

The solution

[Figure: three diagrams, each showing CPU0 and CPU1 with an FPGA block labelled Storage, Accel, IO and DDR.]

˃ “Offload” storage-centric workloads
˃ “Split personality” – offload or storage
˃ Accelerate storage services
    “Inline” encryption, compression, hashing
˃ Tighter integration
˃ Compute near storage
    Inline, offload and more
    Search, big data

Solution Illustration: Finding The Needle In the Haystack

˃ Search for an image across 100TB
˃ Sequential scan of 100TB of drive data into the CPU and an ML-based image classification accelerator
    I/O and processing bottlenecks
    High power consumption
˃ For the same search, an embedded ML-based image classification accelerator scans drive data locally and responds with match/no match
    Eliminates I/O and processing bottlenecks
    Localized data scanning optimizes system power

[Figure: two server diagrams with CPU0 and CPU1 attached through storage adapters to SSD0 … SSDn. One uses 24 x 4TB NVMe SSDs; the other, labelled “Compute To Data”, uses 24 x 4TB NVMe SSDs with Adaptive Acceleration.]
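The contrast between the two configurations above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Drive`, `host_side_search` and `drive_side_search` are hypothetical names, a plain byte comparison stands in for the ML-based image classifier, and the byte counters model data crossing the storage link rather than measuring real I/O.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Hypothetical model of one SSD holding a list of stored images (as bytes)."""
    images: list
    bytes_sent_to_host: int = 0

    def read_all(self):
        # Conventional path: every stored image crosses the link to the host.
        for img in self.images:
            self.bytes_sent_to_host += len(img)
            yield img

    def search_locally(self, matches):
        # Compute-to-data path: the drive scans its own data and returns
        # only a one-byte match / no-match answer.
        found = any(matches(img) for img in self.images)
        self.bytes_sent_to_host += 1
        return found


def host_side_search(drives, matches):
    """Pull all data into the host and classify it there."""
    found = False
    for d in drives:
        for img in d.read_all():
            found = found or matches(img)
    return found


def drive_side_search(drives, matches):
    """Ask each drive to scan locally; the host only combines the answers."""
    return any([d.search_locally(matches) for d in drives])


def make_drives():
    # 4 drives x 100 images x 1 KiB each; exactly one image is the needle.
    drives = [Drive([bytes([i % 251]) * 1024 for i in range(100)]) for _ in range(4)]
    drives[2].images[37] = b"needle" + bytes(1018)
    return drives


def is_needle(img):
    # Stand-in for the image classifier (an assumption for illustration).
    return img.startswith(b"needle")


conventional = make_drives()
offloaded = make_drives()
found_a = host_side_search(conventional, is_needle)
found_b = drive_side_search(offloaded, is_needle)
moved_a = sum(d.bytes_sent_to_host for d in conventional)
moved_b = sum(d.bytes_sent_to_host for d in offloaded)
print(found_a, found_b, moved_a, moved_b)
```

Both paths find the same image; the difference is in how many bytes leave the drives, which is the bottleneck the slide describes.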

Solution Illustration: Finding The Needle In the Haystack

˃ Count the total Twitter feeds for a keyword in the last 24 hours
˃ Hadoop job across multiple data nodes, with each node’s CPU sequentially scanning the saved data
    I/O and processing bottlenecks
    High power consumption
˃ Push the counting work to each drive within a data node, enabling each data node to undertake a higher number of jobs
˃ Each drive responds with the individual count from each job, which the CPU can simply aggregate
    Eliminates I/O and processing bottlenecks
    Localized data scanning optimizes system power

[Figure label: Compute To Data]
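The push-down pattern above (each drive counts locally, the CPU only aggregates) can be sketched as follows. The function names, the `(age_hours, text)` record layout and the sample data are all hypothetical; threads stand in for drives working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def count_on_drive(records, keyword, max_age_hours=24):
    """Work pushed to one drive: count recent local records containing the keyword."""
    kw = keyword.lower()
    return sum(
        1 for age_hours, text in records
        if age_hours <= max_age_hours and kw in text.lower()
    )


def count_across_drives(drives, keyword):
    """Host side: launch one counting job per drive and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        partial_counts = list(pool.map(lambda recs: count_on_drive(recs, keyword), drives))
    return partial_counts, sum(partial_counts)


# Hypothetical sample data: each inner list is what one drive stores,
# as (age in hours, text) pairs.
drives = [
    [(1, "FPGA launch today"), (30, "old fpga post"), (2, "lunch was great")],
    [(5, "new fpga board"), (23, "FPGA vs GPU"), (3, "storage news")],
    [(4, "nothing relevant here")],
]
partial, total = count_across_drives(drives, "fpga")
print(partial, total)
```

Only one integer per drive travels back to the host, regardless of how much data each drive scanned.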


Xilinx Computational Storage Platform

Adaptable Platform

˃ Infrastructure
    High Speed Interconnect
    Device Memory
    ARM Processors
    System Mgmt
    Partial Reconfig
˃ Developer
    Profiler & Debugger
    SDAccel Compiler
    HW & SW Emulation
    OpenCL, C, C++, Verilog
˃ Applications
    Database Services: Spark, Postgres, MariaDB, Hadoop
    Storage Services: Compression, Dedup, Erasure Codes
    Rich Media Services: Image Search, Video Transcoding

Platform Architecture

[Figure: Infrastructure diagram of the FPGA accelerator with blocks labelled Device Memory, System Mgmt, ARM Cores, Kernels, Storage, Host and Debug.]

˃ Device Memory
    Exposed to the CPU as a PCIe BAR
    Used by SDAccel APIs to create buffers directly in the device (instead of host DDR)
    Can be used as a P2P buffer target for direct data transfer from the storage
    Shared by the local ARM cores and the HW accelerators

˃ ARM Cores
    Quad-core ARM Cortex-A53, up to 1.3GHz
    Enable SW-assist libraries for HW accelerators (kernels)
        ‒ Allows for SW/HW partitioning of kernels
    Help run scheduling activities across multiple kernel instances
    Can be used for authentication of end-user accelerator binaries
    Enable board management controls

˃ System Management
    Device sensors to monitor current, temperature, voltages, etc.
    System monitoring alerts to the host
˃ Debug Hooks
    Enable performance monitoring
    Provide insight into interconnect/kernel efficiency
    Stall monitoring
    ILA/ChipScope for signal-level debug

˃ Kernels
    Can be written in Verilog/VHDL, C, C++ or OpenCL
    Dynamically programmed via PR (partial reconfiguration) at run time
    Multiple instances of a kernel can be programmed for higher throughput
        ‒ Auto-stitched by the SDAccel compiler/linker
    Use device memory for input and output buffers
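The point about programming multiple kernel instances for higher throughput can be pictured as a dispatch pattern: split the input buffer, hand one chunk to each instance, and combine the partial results. The sketch below models only that pattern. `kernel`, `split` and `run_on_instances` are hypothetical names, a byte-sum stands in for the real accelerator function, and Python threads are not a model of FPGA performance.

```python
from concurrent.futures import ThreadPoolExecutor


def kernel(chunk):
    """Stand-in for one accelerator kernel instance: a byte-sum over its chunk."""
    return sum(chunk)


def split(buffer, parts):
    """Cut a non-empty input buffer into at most `parts` contiguous chunks."""
    step = -(-len(buffer) // parts)  # ceiling division
    return [buffer[i:i + step] for i in range(0, len(buffer), step)]


def run_on_instances(buffer, instances):
    """Dispatch one chunk per kernel instance and combine the partial results."""
    chunks = split(buffer, instances)
    with ThreadPoolExecutor(max_workers=instances) as pool:
        partials = list(pool.map(kernel, chunks))
    return sum(partials)


data = bytes(range(256)) * 64          # 16 KiB of sample input
single = run_on_instances(data, 1)
quad = run_on_instances(data, 4)
print(single == quad)
```

The result is independent of the number of instances, which is the property that lets a scheduler scale the instance count without changing the application.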

˃ High-Speed Interconnect
    High-speed fabric connects all critical components
        ‒ PCIe interface to the host and storage
        ‒ High-speed internal AXI bus to memory and kernels
    Bandwidth greater than that of the storage medium
˃ Storage
    Tight integration with the platform
    Direct data transfer to/from the FPGA buffers for faster and more efficient kernel access
        ‒ Bypasses host DDR!
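Why bypassing host DDR matters can be shown with simple arithmetic. The model below is an assumption made for illustration, not a measurement from the slides: on a staged path the dataset is written into host DDR and then read back out towards the FPGA, while on the direct (P2P) path only the result reaches host DDR. `host_ddr_traffic` and the 8-byte result size are hypothetical.

```python
def host_ddr_traffic(dataset_bytes, result_bytes, p2p):
    """Bytes written to plus read from host DDR for one offloaded scan.

    Simplified model (an assumption, not a measurement):
    - staged path: SSD -> host DDR (write), host DDR -> FPGA (read),
      then the result lands in host DDR (write)
    - P2P path: SSD -> FPGA device memory directly; only the result
      is written to host DDR
    """
    if p2p:
        return result_bytes
    return 2 * dataset_bytes + result_bytes


GB = 10**9
for size_gb in (1, 5, 10):
    staged = host_ddr_traffic(size_gb * GB, 8, p2p=False)
    direct = host_ddr_traffic(size_gb * GB, 8, p2p=True)
    print(f"{size_gb:>2} GB dataset: staged={staged} B, p2p={direct} B")
```

In this model the host DDR traffic of the direct path does not grow with the dataset at all, which leaves that bandwidth free for other applications on the host.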

Computational Storage Platform: Shell, Role & Partial Reconfiguration

[Figure: FPGA diagram with the labels DDR Ctrlr, SSD, Memory, ARM CPUs, Platform “Shell” and Accelerator “Role”. Two example Role configurations are shown: a Search App Plugin with two Fuzzy Search kernels, two Exact Search kernels and two Levenshtein Distance kernels, and an ML Inf App Plugin with six ML Inf. kernels.]

Tools and Infrastructure

˃ Compiler
    Powerful SDAccel compiler
    Many options to choose from: OpenCL, C++, C and Verilog/VHDL
˃ Partial Reconfiguration
    Compiler generates the image for the Role in the Platform
˃ Open Source Xilinx RunTime
    Full-featured runtime stack with user-space and Linux kernel drivers
    Multi-threading and multi-process safe
˃ Platform Design
    Platform details, including the Shell, captured as a DSA

[Figure labels: Compiler; C, C++, OpenCL, RTL; DSA]

Developer Productivity Tools

˃ HW & SW Emulation
    Emulation flows for faster debug
    SW emulation
        ‒ Debug kernel functionality
        ‒ Sequential execution
    HW emulation
        ‒ Synthesized kernel emulated
        ‒ Parallel/event-based execution
˃ Profiler & Debug
    Performance, stall and execution profiling/monitoring
    ILA/ChipScope-level debug also supported
˃ Mgmt Tools
    Query board status
    Perform board admin operations

[Figure labels: Debugging, Profiling, Libraries, Platform, Runtime]


Solution Proofpoint #1: PostgreSQL Query6

[Figure: two charts of time in seconds for 1GB, 5GB and 10GB datasets, comparing x86 with Compute Near Storage. One chart is “With memory thrashing” (axis 0 to 90 sec); the other is “Without memory thrashing” (axis 0 to 25 sec).]

˃ PostgreSQL
    PostgreSQL Query6 accelerated by the SDAccel stack on the computational storage platform
    Minor modifications to the PostgreSQL code allowed data to move from the storage device directly to the accelerator, eliminating host DDR copies altogether
˃ Results
    Measurements show a 3-5x speedup on idle machines and > 10x on machines dealing with memory-intensive apps
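To make the offloaded work concrete, here is a sketch of a scan-filter-aggregate query evaluated next to the data. It rests on an assumption the slides do not state: that “Query6” refers to TPC-H Query 6 with its default parameters (ship date within 1994, discount between 0.05 and 0.07, quantity below 24, summing extended price times discount). The sample rows and the function name `query6_near_storage` are made up for illustration.

```python
from datetime import date

# Each row: (l_shipdate, l_discount, l_quantity, l_extendedprice). Made-up sample rows.
rows = [
    (date(1994, 3, 1), 0.06, 10, 1000.0),   # qualifies
    (date(1994, 7, 15), 0.05, 23, 2000.0),  # qualifies
    (date(1995, 1, 1), 0.06, 10, 5000.0),   # shipped too late
    (date(1994, 5, 5), 0.02, 10, 5000.0),   # discount out of range
    (date(1994, 9, 9), 0.07, 30, 5000.0),   # quantity too high
]


def query6_near_storage(rows):
    """Filter and aggregate next to the data; only one number goes back to the host."""
    revenue = 0.0
    for shipdate, discount, quantity, price in rows:
        if (date(1994, 1, 1) <= shipdate < date(1995, 1, 1)
                and 0.05 <= discount <= 0.07
                and quantity < 24):
            revenue += price * discount
    return revenue


print(query6_near_storage(rows))
```

A query of this shape reads every row but returns a single scalar, which is why it benefits from moving the scan to the storage side.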

Solution Proofpoint #1: PostgreSQL Query6 – x86 DDR BW Comparison

[Figure: DDR B/W comparison – x86 vs. Compute Offload vs. Computational Storage; 1GB dataset. Y axis: x86 DDR B/W in MB/sec (0 to 1200). X axis: time, 1 unit = 0.1 secs (1 to 34). Legend: x86, Computational Storage.]

Enabling complete compute offload and minimizing memory usage

Solution Proofpoint #2: Compression

˃ Comparison among FPGA acceleration, ASIC (Intel QAT) and CPU
˃ In addition, >8x lower latency than CPU, e.g. for a 1MB file
    FPGA acceleration: 1.99ms
    ZLIB-1: 16.66ms

[Figure: comparison chart with series including “FPGA – High Comp*” and “FPGA – High Throughput*”.]

[Figure: P2P With External Switch, showing x86, the x86 root complex, an optional PCIe switch, SSDs and the FPGA.]

* Partner data on Xilinx FPGA
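The “>8x” figure follows directly from the two latencies quoted above. The sketch below computes that ratio and, as a rough way to get a CPU baseline on one's own machine, times zlib at level 1 on 1 MB of data. Two assumptions are involved: that “ZLIB-1” means zlib compression level 1, and that the repetitive sample payload (hypothetical) is a fair stand-in, which it is not for real files since repetitive data compresses unusually fast.

```python
import time
import zlib

# Latencies quoted on the slide for a 1MB file.
fpga_ms = 1.99
zlib1_ms = 16.66
speedup = zlib1_ms / fpga_ms
print(f"latency ratio: {speedup:.2f}x")

# Local CPU baseline: zlib level 1 on 1 MB of sample data (hypothetical input).
payload = (b"computational storage " * 50000)[:1_000_000]
start = time.perf_counter()
compressed = zlib.compress(payload, 1)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"zlib level 1 on this machine: {elapsed_ms:.2f} ms, {len(compressed)} bytes out")
```

The measured time depends on the CPU and the input, so it should be read as a sanity check and not as a reproduction of the slide's numbers.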

Solution Proofpoint #2: Compression – GZIP x86 DDR BW Comparison

[Figure: x86 memory bandwidth in MB/s (0 to 600) over time, with Read, Write and Total series, covering two phases: Compute Offload Data Transfer and Computational Storage Data Transfer.]


Summary

˃ Exponential data growth is driving the computational storage opportunity to offload compute functions closer to memory and storage
˃ FPGA-enabled adaptable storage will enable differentiation and unlock efficiency for storage workloads
˃ Building on the success of the Xilinx compute acceleration platform, the Xilinx Computational Storage Platform provides ease of application portability and tremendous returns for workloads that have storage affinity
