TRANSCRIPT
Cognitive SSD: A Deep Learning Engine for
In-Storage Data Retrieval
Shengwen Liang1,2, Ying Wang1,2, Youyou Lu3, Zhe Yang3
Huawei Li1,2, Xiaowei Li1,2
1State Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing
2University of Chinese Academy of Sciences
3Tsinghua University
1/28
Outline – Cognitive SSD
2/28
[Figure*: a conventional unstructured data retrieval system (requests and results flow between the database and a CPU with DRAM and NAND flash) suffers from inaccuracy and power inefficiency; Cognitive SSD combines deep learning and graph search with near-data processing to achieve both accuracy and power efficiency.]
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Evaluation
• Conclusion
3/28
Unstructured Data
4/28
Text files and documents, audio, video, and images are unstructured data – data that cannot be directly recognized by the computer.
• Unstructured data occupies up to 80% of storage capacity in data centers [1].
• It faces intensive retrieval/analysis requests.
• A fast and energy-efficient data retrieval solution is therefore needed.
Problem – Software
5/28
• The performance bottleneck migrates from hardware (SSD: 50-75us [2]) to software (the I/O stack: 60.8us [3]).
[Figure: device latency falls from 2-5ms (HDD) to 10us (Optane SSD), yet every request still traverses the kernel I/O stack: Application → VFS/File System → Block I/O Layer → SCSI Layer → Device Driver → NAND flash SSD hardware (50-75us [2]).]
Problem – Hardware
6/28
• Massive data movement incurs energy and latency overhead in the conventional memory hierarchy.
[Figure: data travels from the HDD/SSD over the I/O interface (SATA, PCIe 4-lane) to the DRAM cache level, through the L1/L2/L3 caches, and finally to the compute unit.]
Showcase
7/28
Content-Based Unstructured Data Retrieval System:
Request → Data Preprocessing → Feature Mapping → Feature Matching → Ranking → Retrieval
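The four retrieval stages above can be sketched end to end in Python; every function here is an illustrative stand-in (the real feature mapping is a deep network), not the Cognitive SSD API:

```python
# Minimal sketch of a content-based retrieval pipeline:
# preprocess -> feature mapping -> feature matching -> ranking.
import numpy as np

def preprocess(image):
    """Normalize the raw input (here: scale pixel values to [0, 1])."""
    return np.asarray(image, dtype=np.float32) / 255.0

def feature_mapping(image):
    """Stand-in for the deep network: map an image to a feature vector."""
    return image.reshape(-1)[:64]          # placeholder feature extractor

def feature_matching(query_vec, database):
    """Score the query against every database entry (L2 distance)."""
    return [(i, float(np.linalg.norm(query_vec - v)))
            for i, v in enumerate(database)]

def rank(scores, top_k=5):
    """Return the IDs of the top-k closest entries."""
    return [i for i, _ in sorted(scores, key=lambda s: s[1])[:top_k]]

def retrieve(image, database, top_k=5):
    q = feature_mapping(preprocess(image))
    return rank(feature_matching(q, database), top_k)
```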
Solutions
8/28
Conventional feature representation and database indexing lead to bad retrieval accuracy.
• Deep Hashing [4]: a network of convolution, pooling, fully connected, and hash layers maps each item to a binary code (e.g., 010010101) – a better feature representation.
• Graph Search [5]: fast and highly accurate retrieval.
Deep Learning Hashing + Graph Search = DLG
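A minimal sketch of the deep-hashing idea: the hash layer's activations are binarized with sign(), and items are compared by Hamming distance. The random projection below merely stands in for a trained CNN; the 48-bit code length matches the slides, everything else is an assumption:

```python
# Deep hashing sketch: binarize hash-layer activations, compare codes
# by Hamming distance. A random projection replaces the trained CNN.
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64, 48))     # feature dim 64 -> 48-bit code

def deep_hash(feature):
    """Binarize the hash-layer activations into a 48-bit {0,1} code."""
    return (feature @ PROJ > 0).astype(np.uint8)

def hamming(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(code_a != code_b))
```

Hamming distance over such short codes is what makes the later graph search cheap: each comparison is a popcount rather than a floating-point vector distance.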
Near-data processing:
• Simplifies the software stack and shortens the data path – the internal bandwidth of an SSD can be 16x higher than its external bandwidth [6].
• Offers a user-visible software abstraction and scalability across different applications.
Outline
• Background and Motivation
• Cognitive SSD System
• Overview
• High-level library
• Firmware and hardware
• DLG-x Accelerator
• Evaluation
• Conclusion
9/28
Cognitive SSD System – Overview
10/28
• User-level space: user applications built on Caffe/Pytorch, the DLG-x compiler, the user library (DLG library), and the device driver run on the host OS.
• Cognitive SSD hardware: behind the PCIe interface, the basic firmware and Cognitive SSD runtime translate NVMe-protocol requests into configuration commands (DLG_config), task commands (DLG_hashing), and I/O commands (SSD_read), which drive the DLG-x accelerator and the NAND flash controller; instructions and parameters reside in the NAND flash array.
Cognitive SSD System – High-Level Library
11/28
Challenge: how to keep the model and parameters configurable while remaining scalable?
• How to update the deep learning model and graph parameters? The deep learning model is trained with Caffe/Pytorch; the DLG-x compiler then generates instructions and constructs the graph parameters, and the data plane of the user library transmits them to the device as a DLG-x configuration.
• How to dispatch a request? The task plane of the user library sends task requests through the device driver; a user-defined API (DLG_analysis) provides scalability to different applications.
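How a host application might exercise the two planes can be sketched as follows; the class and its methods are hypothetical wrappers around the DLG_config/DLG_hashing/DLG_analysis commands named above, not the actual user library:

```python
# Hypothetical host-side wrapper around the Cognitive SSD command set.
class CognitiveSSD:
    def __init__(self):
        self.instructions = None
        self.graph_params = None

    # --- data plane: push compiled model and graph parameters ---
    def dlg_config(self, instructions, graph_params):
        self.instructions = instructions
        self.graph_params = graph_params

    # --- task plane: dispatch retrieval tasks ---
    def dlg_hashing(self, item):
        assert self.instructions is not None, "configure the device first"
        return ("hash", item)            # placeholder for the binary code

    def dlg_analysis(self, request):
        """User-defined entry point for applications beyond image retrieval."""
        return ("result", request)
```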
Cognitive SSD System – Firmware and Hardware
12/28
The firmware pairs a DLG task scheduler with an I/O scheduler. A task request arrives over the I/O interface; the DLG configurator receives the DLG-x configuration (DLG-x instructions plus the deep learning and graph parameters) via data transmission and loads it into the DLG-x accelerator, whose deep learning unit and graph search engine access instructions and parameters in the NAND flash array through the NAND flash controller.
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Deep learning unit
• Graph search unit
• Evaluation
• Conclusion
13/28
DLG-x Accelerator – Deep Learning Unit
14/28
Challenge: how to supply data to the accelerator without DRAM? In a conventional SSD, DRAM caches data and stores various controller metadata [7]; the DLG-x instead accesses the NAND flash directly through the flash controllers.
[Figure: the deep learning unit comprises an instruction queue, double-buffered weight buffers (Weight Buffer-0/1) and I/O buffers (InOut Buffer-0/1), and a neural processing engine with convolution, pooling, and activation units; the adjacent graph search engine contains a vertex detector, vertex arbitrator, and address generator, and both engines connect directly to the NAND flash controllers.]
[Chart: per-layer channel counts for CONV-1 to CONV-5 and FC-6 to FC-8 (CONV: convolution layer; FC: fully connected layer).]
DLG-x Accelerator – Data Layout
15/28
Challenge: how to fully utilize the internal bandwidth of flash?
• Parameters are striped across flash channels (Channel-0/1/2, each behind its own flash memory controller, FMC) to maximize the parallelism of the NAND flash.
• Issuing the read page cache command [8] instead of the plain read command lets the cache register stream out one page while the page register loads the next, improving throughput by 33%.
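The 33% figure is consistent with a simple pipelining model of the read page cache command; the timing constants below are illustrative assumptions chosen to reproduce that gain, not measurements from the slides:

```python
# Toy model: plain page reads vs. cached (pipelined) page reads.
# t_read = array-to-page-register time, t_xfer = register-to-controller
# transfer time; the concrete microsecond values are assumptions.
def plain_read_time(n_pages, t_read, t_xfer):
    return n_pages * (t_read + t_xfer)        # fully serialized

def cached_read_time(n_pages, t_read, t_xfer):
    # The first array read cannot be hidden; afterwards, array reads
    # overlap transfers, so each page costs max(t_read, t_xfer).
    return t_read + n_pages * max(t_read, t_xfer)

if __name__ == "__main__":
    n, tR, tX = 100, 25, 75                   # assumed microsecond timings
    speedup = plain_read_time(n, tR, tX) / cached_read_time(n, tR, tX)
    print(f"throughput improvement: {speedup - 1:.0%}")
```

With these assumed timings the model yields roughly a one-third throughput improvement; the real gain depends on the actual ratio of array-read to transfer time.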
DLG-x Accelerator – Graph Search Unit
16/28
• Offline stage: K-NN graph construction – every data entry becomes a vertex linked to its nearest neighbors.
• Online stage: graph search – the vertex detector, vertex arbitrator, and address generator of the graph search engine traverse the graph, sharing the weight and InOut buffers with the neural processing engine and fetching vertices through the NAND flash controllers.
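The online stage can be sketched as a best-first walk over the K-NN graph. This is a generic graph-search sketch under assumed data structures, not the exact hardware algorithm; for brevity it explores the whole connected component instead of terminating early:

```python
# Best-first search over a K-NN graph: start from an entry vertex and
# repeatedly expand the closest unexpanded candidate's neighbors.
import heapq

def graph_search(graph, dist, query, entry, top_k=3):
    """graph: vertex -> list of neighbor vertices; dist(query, v) -> float."""
    visited = {entry}
    candidates = [(dist(query, entry), entry)]   # min-heap by distance
    results = []
    while candidates:
        d, v = heapq.heappop(candidates)
        results.append((d, v))
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                heapq.heappush(candidates, (dist(query, u), u))
    return [v for _, v in sorted(results)[:top_k]]
```

Production graph searches stop once the nearest unexpanded candidate is farther than the current k-th best result; the hardware's vertex arbitrator plays that pruning role.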
DLG-x Accelerator – Data Layout (graph)
17/28
Challenge: how to avoid bandwidth waste?
A vertex entry is far smaller than a 16KB flash page: ID (32 bits) + hash code (48-512 bits) = 80-544 bits << 16KB; even the 25 neighbors of a vertex occupy only 25 x (80-544) bits = 250-1700 bytes << 16KB. Fetching entries one by one therefore causes read amplification and low bandwidth utilization.
Solution: a page-granularity data layout that packs the neighbors of a vertex into one page and the neighbors of those neighbors into the next page of the same block (e.g., neighbors of vertex 0 in page 0, their neighbors in page 1). A traversal that previously issued one flash read per vertex (read vertex 0, read vertex 3, read vertex 2, ...) now serves the same vertices from a single read of page 0 (and page 1), reducing flash accesses.
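A toy model using the entry sizes above (32-bit ID plus a 512-bit hash code against a 16KB page) illustrates how many flash reads page-granularity packing saves; the neighbor counts are the slide's example values:

```python
# Reads needed to fetch a vertex's two-hop neighborhood when each
# entry is read individually vs. packed contiguously into 16KB pages.
PAGE_BYTES = 16 * 1024
ENTRY_BYTES = (32 + 512) // 8          # ID + 512-bit hash code = 68 bytes

def reads_per_entry(n_entries):
    """One flash read per entry: severe read amplification."""
    return n_entries

def reads_per_page(n_entries):
    """Entries packed contiguously: one read per 16KB page."""
    total = n_entries * ENTRY_BYTES
    return -(-total // PAGE_BYTES)     # ceiling division

if __name__ == "__main__":
    n = 25 + 25 * 25                   # neighbors + neighbors-of-neighbors
    print(reads_per_entry(n), "->", reads_per_page(n), "flash reads")
```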
Cognitive SSD System – Case Study
18/28
• Deploy: a deep learning model trained in Caffe/Pytorch and the graph parameters are compiled by the DLG-compiler and deployed through the task plane and task scheduler onto the DLG-x accelerator (deep learning unit + graph search unit) and NAND flash.
• Serve: a user image request enters through the data plane and I/O scheduler, which injects the image data into the accelerator.
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Evaluation
• Conclusion
19/28
Evaluation Setup
20/28

Hardware:
  B-CPU:               2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD
  B-GPU:               2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, NVIDIA GTX 1080Ti
  B-FPGA:              2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, ZC706 board
  B-DLG-x:             2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, ZC706 board
  Cognitive SSD + CPU: 2x Xeon E5-2630, 32GB DRAM, 3x 1TB PCIe SSD, OpenSSD
  Cognitive SSD:       ARM dual Cortex-A9, 2GB DRAM, 1TB NAND flash, OpenSSD

Software: Ubuntu 14.04, Caffe [9], Crow web framework [10].
Workload: Content-Based Image Retrieval system (CBIR).

OpenSSD platform:
1. Zynq FPGA chip – DLG-x and flash controller
2. Dual Cortex-A9 – firmware
3. 1GB DRAM
4. 8-channel NAND flash
5. Ethernet
6. PCIe Gen 2 (maximum 8 lanes)
[Figure: OpenSSD board with NAND flash, flash controller, DLG-x, PCIe, and Ethernet.]
Evaluation – DLG Algorithm
21/28

Datasets:
  CIFAR-10:   60000 images, 50000/10000 train/validate, 10 labels
  Caltech256: 29780 images, 26790/2990 train/validate, 256 labels
  SUN397:     108754 images, 98049/10705 train/validate, 397 labels
  ImageNet:   1331167 images, 1281167/50000 train/validate, 1000 labels

[Figure: precision vs. the number of retrieved results T on (a) CIFAR-10, (b) Caltech-256, (c) SUN397, and (d) ImageNet, comparing deep-hashing networks AlexNet(48), VGG-16(48), ResNet-18(48), and ResNet-50(48/64) (hash code length in parentheses) against ITQ, LSH, A+ITQ, and AC+ITQ.]

• The DLG solution achieves better retrieval accuracy than the conventional hashing solutions regardless of the choice of T.
• The DLG solution remains robust when deployed on a real-world system.
Evaluation – DLG-x
22/28

Performance of deep hashing on DLG-x:
  Hash AlexNet:   DLG-x 38ms / 9.1W;  CPU 114ms / 186W;  GPU 1.83ms / 164W
  Hash ResNet-18: DLG-x 94ms / 9.4W;  CPU 121ms / 185W;  GPU 7.13ms / 112W

[Figure: latency and speedup of graph search vs. the number of top retrieved results on (a)-(b) CIFAR-10 and (c)-(d) ImageNet, comparing brute-force sort, graph search on CPU, and DLG-x.]

• Faster than the CPU solution: up to 37.12x and 12.5x speedup on CIFAR-10 and ImageNet, respectively.
• Outperforms the brute-force sort method.
• More power-efficient than the GPU solution.
Evaluation – Cognitive SSD System
23/28
[Figure: performance of the Cognitive SSD system on ImageNet, demonstrated through a web demo.]
• Compared to B-CPU, the Cognitive SSD system reduces latency by 69.9% on average.
• Cognitive SSD achieves 2.44x higher power efficiency than the B-GPU system.
Evaluation – Cognitive SSD Cluster
24/28
Cognitive SSD as a server.
[Figure: a conventional cluster system (CMC) attaches SSDs to host machines over PCIe behind the Internet, whereas the host-free cluster system (HFC) exposes Cognitive SSDs, each with its own ARM core, directly.]

Evaluation – Cognitive SSD Cluster
25/28
• The power efficiency of the HFC system is better than the other baselines when user request rates are low.
• The HFC system would achieve even better power efficiency if the Cortex-A9 processor were replaced by a recent Cortex-A series processor.
Conclusion
• Cognitive SSD provides a more power-efficient solution for unstructured data retrieval.
• The DLG-x accelerator integrates deep learning and graph search into one chip and directly accesses data from NAND flash without crossing multiple memory hierarchies.
• FPGA-based prototype evaluations show that Cognitive SSD outperforms the other solutions in power efficiency.
26/28
Q&A
27/28
If you have any questions, please contact us
Email: [email protected]
Cognitive SSD System – Scalability
28/28
• The Cognitive SSD system also supports other applications and is not limited to image data retrieval.
• The task plane provides a user-defined API (DLG_analysis) that enables users to deploy other applications without significant modification.
[Figure: a query issued through a DLG_analysis request can drive video retrieval (returning retrieval results) or video recognition (returning recognition results).]
References
[1] The biggest data challenges that you might not even know you have, May 2016. https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/.
[2] “MLC 128Gb to 512Gb Async/Sync NAND,” p. 239, 2017.
[3] Y. Son, N. Y. Song, H. Han, H. Eom, and H. Y. Yeom, “A User-Level File System for Fast Storage Devices,” in 2014 International Conference on Cloud and Autonomic Computing, 2014,
pp. 258–264.
[4] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep Supervised Hashing for Fast Image Retrieval,” 2016, pp. 2064–2072.
[5] C. Fu, C. Wang, and D. Cai, “Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph,” arXiv:1707.00143 [cs], Jul. 2017.
[6] E. Doller, A. Akel, J. Wang, K. Curewitz, and S. Eilert, “DataCenter 2020: Near-memory acceleration for data-oriented applications,” in 2014 Symposium on VLSI Circuits Digest of
Technical Papers, 2014, pp. 1–4.
[7] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427 [cs], Nov. 2017.
[8] R. Micheloni, L. Crippa, and A. Marelli, Eds., Inside NAND Flash memories. Heidelberg ; New York: Springer, 2010.
[9] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv:1408.5093 [cs], Jun. 2014.
[10] J. Ha, crow: Crow is very fast and easy to use C++ micro web framework (inspired by Python Flask). 2018.
[11] http://www.thessdreview.com/ces-2019/intel-teases-h10-ssd-intel-optane-memory-with-qlc-3d-nand-in-single-m-2-module-ces-2019-udate/
* The picture is modified from the web [11] and shown for display only; it is not the actual Cognitive SSD system, which is shown on page 20.