TRANSCRIPT
Cognitive SSD: A Deep Learning Engine for
In-Storage Data Retrieval
Shengwen Liang1,2, Ying Wang1,2, Youyou Lu3, Zhe Yang3
Huawei Li1,2, Xiaowei Li1,2
1State Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing
2University of Chinese Academy of Sciences
3Tsinghua University
1/28
Outline – Cognitive SSD
2/28
[Figure*: a conventional unstructured data retrieval system (requests and results flow between the database and a CPU with DRAM and NAND flash) suffers from inaccuracy and power inefficiency; Cognitive SSD combines deep learning and graph search with near-data processing to achieve both accuracy and power efficiency.]
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Evaluation
• Conclusion
3/28
Unstructured Data
4/28
Text files and documents, audio, video, and images are unstructured data – data that cannot be directly recognized by the computer.
• Unstructured data occupies up to 80% of storage capacity in data centers [1].
• It faces intensive retrieval/analysis requests.
• A fast and energy-efficient data retrieval solution is therefore needed.
Problem – Software
5/28
• The performance bottleneck migrates from hardware (SSD: 50-75us [2]) to software (the I/O stack: 60.8us [3]).
[Figure: device latency falls from 2-5ms (HDD) to 10us (Optane SSD), yet every request still traverses the kernel I/O stack: Application → VFS/File System → Block I/O Layer → SCSI Layer → Device Driver → NAND flash SSD hardware (50-75us [2]).]
Problem – Hardware
6/28
• Massive data movement incurs energy and latency overhead in the conventional memory hierarchy.
[Figure: data travels from the HDD/SSD over the I/O interface (SATA, PCIe 4-lane) to the DRAM cache level, through the L1/L2/L3 caches, and finally to the compute unit.]
Showcase
7/28
Content-Based Unstructured Data Retrieval System:
Request → Data Preprocessing → Feature Mapping → Feature Matching → Ranking → Retrieval
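The four retrieval stages above can be sketched end to end in Python; every function here is an illustrative stand-in (the real feature mapping is a deep network), not the Cognitive SSD API:

```python
# Minimal sketch of a content-based retrieval pipeline:
# preprocess -> feature mapping -> feature matching -> ranking.
import numpy as np

def preprocess(image):
    """Normalize the raw input (here: scale pixel values to [0, 1])."""
    return np.asarray(image, dtype=np.float32) / 255.0

def feature_mapping(image):
    """Stand-in for the deep network: map an image to a feature vector."""
    return image.reshape(-1)[:64]          # placeholder feature extractor

def feature_matching(query_vec, database):
    """Score the query against every database entry (L2 distance)."""
    return [(i, float(np.linalg.norm(query_vec - v)))
            for i, v in enumerate(database)]

def rank(scores, top_k=5):
    """Return the IDs of the top-k closest entries."""
    return [i for i, _ in sorted(scores, key=lambda s: s[1])[:top_k]]

def retrieve(image, database, top_k=5):
    q = feature_mapping(preprocess(image))
    return rank(feature_matching(q, database), top_k)
```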
Solutions
8/28
Conventional feature representation and database indexing lead to bad retrieval accuracy.
• Deep Hashing [4]: a network of convolution, pooling, fully connected, and hash layers maps each item to a binary code (e.g., 010010101) – a better feature representation.
• Graph Search [5]: fast and highly accurate retrieval.
Deep Learning Hashing + Graph Search = DLG
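A minimal sketch of the deep-hashing idea: the hash layer's activations are binarized with sign(), and items are compared by Hamming distance. The random projection below merely stands in for a trained CNN; the 48-bit code length matches the slides, everything else is an assumption:

```python
# Deep hashing sketch: binarize hash-layer activations, compare codes
# by Hamming distance. A random projection replaces the trained CNN.
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64, 48))     # feature dim 64 -> 48-bit code

def deep_hash(feature):
    """Binarize the hash-layer activations into a 48-bit {0,1} code."""
    return (feature @ PROJ > 0).astype(np.uint8)

def hamming(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(code_a != code_b))
```

Hamming distance over such short codes is what makes the later graph search cheap: each comparison is a popcount rather than a floating-point vector distance.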
Near-data processing:
• Simplifies the software stack and shortens the data path – the internal bandwidth of an SSD can be 16x higher than its external bandwidth [6].
• Offers a user-visible software abstraction and scalability across different applications.
Outline
• Background and Motivation
• Cognitive SSD System
• Overview
• High-level library
• Firmware and hardware
• DLG-x Accelerator
• Evaluation
• Conclusion
9/28
Cognitive SSD System – Overview
10/28
• User-level space: user applications built on Caffe/Pytorch, the DLG-x compiler, the user library (DLG library), and the device driver run on the host OS.
• Cognitive SSD hardware: behind the PCIe interface, the basic firmware and Cognitive SSD runtime translate NVMe-protocol requests into configuration commands (DLG_config), task commands (DLG_hashing), and I/O commands (SSD_read), which drive the DLG-x accelerator and the NAND flash controller; instructions and parameters reside in the NAND flash array.
Cognitive SSD System – High-Level Library
11/28
Challenge: how to keep the model and parameters configurable while remaining scalable?
• How to update the deep learning model and graph parameters? The deep learning model is trained with Caffe/Pytorch; the DLG-x compiler then generates instructions and constructs the graph parameters, and the data plane of the user library transmits them to the device as a DLG-x configuration.
• How to dispatch a request? The task plane of the user library sends task requests through the device driver; a user-defined API (DLG_analysis) provides scalability to different applications.
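How a host application might exercise the two planes can be sketched as follows; the class and its methods are hypothetical wrappers around the DLG_config/DLG_hashing/DLG_analysis commands named above, not the actual user library:

```python
# Hypothetical host-side wrapper around the Cognitive SSD command set.
class CognitiveSSD:
    def __init__(self):
        self.instructions = None
        self.graph_params = None

    # --- data plane: push compiled model and graph parameters ---
    def dlg_config(self, instructions, graph_params):
        self.instructions = instructions
        self.graph_params = graph_params

    # --- task plane: dispatch retrieval tasks ---
    def dlg_hashing(self, item):
        assert self.instructions is not None, "configure the device first"
        return ("hash", item)            # placeholder for the binary code

    def dlg_analysis(self, request):
        """User-defined entry point for applications beyond image retrieval."""
        return ("result", request)
```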
Cognitive SSD System – Firmware and Hardware
12/28
The firmware pairs a DLG task scheduler with an I/O scheduler. A task request arrives over the I/O interface; the DLG configurator receives the DLG-x configuration (DLG-x instructions plus the deep learning and graph parameters) via data transmission and loads it into the DLG-x accelerator, whose deep learning unit and graph search engine access instructions and parameters in the NAND flash array through the NAND flash controller.
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Deep learning unit
• Graph search unit
• Evaluation
• Conclusion
13/28
DLG-x Accelerator – Deep Learning Unit
14/28
Challenge: how to supply data to the accelerator without DRAM? In a conventional SSD, DRAM caches data and stores various controller metadata [7]; the DLG-x instead accesses the NAND flash directly through the flash controllers.
[Figure: the deep learning unit comprises an instruction queue, double-buffered weight buffers (Weight Buffer-0/1) and I/O buffers (InOut Buffer-0/1), and a neural processing engine with convolution, pooling, and activation units; the adjacent graph search engine contains a vertex detector, vertex arbitrator, and address generator, and both engines connect directly to the NAND flash controllers.]
[Chart: per-layer channel counts for CONV-1 to CONV-5 and FC-6 to FC-8 (CONV: convolution layer; FC: fully connected layer).]
DLG-x Accelerator – Data Layout
15/28
Challenge: how to fully utilize the internal bandwidth of flash?
• Parameters are striped across flash channels (Channel-0/1/2, each behind its own flash memory controller, FMC) to maximize the parallelism of the NAND flash.
• Issuing the read page cache command [8] instead of the plain read command lets the cache register stream out one page while the page register loads the next, improving throughput by 33%.
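The 33% figure is consistent with a simple pipelining model of the read page cache command; the timing constants below are illustrative assumptions chosen to reproduce that gain, not measurements from the slides:

```python
# Toy model: plain page reads vs. cached (pipelined) page reads.
# t_read = array-to-page-register time, t_xfer = register-to-controller
# transfer time; the concrete microsecond values are assumptions.
def plain_read_time(n_pages, t_read, t_xfer):
    return n_pages * (t_read + t_xfer)        # fully serialized

def cached_read_time(n_pages, t_read, t_xfer):
    # The first array read cannot be hidden; afterwards, array reads
    # overlap transfers, so each page costs max(t_read, t_xfer).
    return t_read + n_pages * max(t_read, t_xfer)

if __name__ == "__main__":
    n, tR, tX = 100, 25, 75                   # assumed microsecond timings
    speedup = plain_read_time(n, tR, tX) / cached_read_time(n, tR, tX)
    print(f"throughput improvement: {speedup - 1:.0%}")
```

With these assumed timings the model yields roughly a one-third throughput improvement; the real gain depends on the actual ratio of array-read to transfer time.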
DLG-x Accelerator – Graph Search Unit
16/28
• Offline stage: K-NN graph construction – every data entry becomes a vertex linked to its nearest neighbors.
• Online stage: graph search – the vertex detector, vertex arbitrator, and address generator of the graph search engine traverse the graph, sharing the weight and InOut buffers with the neural processing engine and fetching vertices through the NAND flash controllers.
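The online stage can be sketched as a best-first walk over the K-NN graph. This is a generic graph-search sketch under assumed data structures, not the exact hardware algorithm; for brevity it explores the whole connected component instead of terminating early:

```python
# Best-first search over a K-NN graph: start from an entry vertex and
# repeatedly expand the closest unexpanded candidate's neighbors.
import heapq

def graph_search(graph, dist, query, entry, top_k=3):
    """graph: vertex -> list of neighbor vertices; dist(query, v) -> float."""
    visited = {entry}
    candidates = [(dist(query, entry), entry)]   # min-heap by distance
    results = []
    while candidates:
        d, v = heapq.heappop(candidates)
        results.append((d, v))
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                heapq.heappush(candidates, (dist(query, u), u))
    return [v for _, v in sorted(results)[:top_k]]
```

Production graph searches stop once the nearest unexpanded candidate is farther than the current k-th best result; the hardware's vertex arbitrator plays that pruning role.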
DLG-x Accelerator – Data Layout (graph)
17/28
Challenge: how to avoid bandwidth waste?
A vertex entry is far smaller than a 16KB flash page: ID (32 bits) + hash code (48-512 bits) = 80-544 bits << 16KB; even the 25 neighbors of a vertex occupy only 25 x (80-544) bits = 250-1700 bytes << 16KB. Fetching entries one by one therefore causes read amplification and low bandwidth utilization.
Solution: a page-granularity data layout that packs the neighbors of a vertex into one page and the neighbors of those neighbors into the next page of the same block (e.g., neighbors of vertex 0 in page 0, their neighbors in page 1). A traversal that previously issued one flash read per vertex (read vertex 0, read vertex 3, read vertex 2, ...) now serves the same vertices from a single read of page 0 (and page 1), reducing flash accesses.
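A toy model using the entry sizes above (32-bit ID plus a 512-bit hash code against a 16KB page) illustrates how many flash reads page-granularity packing saves; the neighbor counts are the slide's example values:

```python
# Reads needed to fetch a vertex's two-hop neighborhood when each
# entry is read individually vs. packed contiguously into 16KB pages.
PAGE_BYTES = 16 * 1024
ENTRY_BYTES = (32 + 512) // 8          # ID + 512-bit hash code = 68 bytes

def reads_per_entry(n_entries):
    """One flash read per entry: severe read amplification."""
    return n_entries

def reads_per_page(n_entries):
    """Entries packed contiguously: one read per 16KB page."""
    total = n_entries * ENTRY_BYTES
    return -(-total // PAGE_BYTES)     # ceiling division

if __name__ == "__main__":
    n = 25 + 25 * 25                   # neighbors + neighbors-of-neighbors
    print(reads_per_entry(n), "->", reads_per_page(n), "flash reads")
```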
Cognitive SSD System – Case Study
18/28
• Deploy: a deep learning model trained in Caffe/Pytorch and the graph parameters are compiled by the DLG-compiler and deployed through the task plane and task scheduler onto the DLG-x accelerator (deep learning unit + graph search unit) and NAND flash.
• Serve: a user image request enters through the data plane and I/O scheduler, which injects the image data into the accelerator.
Outline
• Background and Motivation
• Cognitive SSD System
• DLG-x Accelerator
• Evaluation
• Conclusion
19/28
Evaluation Setup
20/28

Hardware:
  B-CPU:               2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD
  B-GPU:               2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, NVIDIA GTX 1080Ti
  B-FPGA:              2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, ZC706 board
  B-DLG-x:             2x Xeon E5-2630, 32GB DRAM, 4x 1TB PCIe SSD, ZC706 board
  Cognitive SSD + CPU: 2x Xeon E5-2630, 32GB DRAM, 3x 1TB PCIe SSD, OpenSSD
  Cognitive SSD:       ARM dual Cortex-A9, 2GB DRAM, 1TB NAND flash, OpenSSD

Software: Ubuntu 14.04, Caffe [9], Crow web framework [10].
Workload: Content-Based Image Retrieval system (CBIR).

OpenSSD platform:
1. Zynq FPGA chip – DLG-x and flash controller
2. Dual Cortex-A9 – firmware
3. 1GB DRAM
4. 8-channel NAND flash
5. Ethernet
6. PCIe Gen 2 (maximum 8 lanes)
[Figure: OpenSSD board with NAND flash, flash controller, DLG-x, PCIe, and Ethernet.]
Evaluation – DLG Algorithm
21/28

Datasets:
  CIFAR-10:   60000 images, 50000/10000 train/validate, 10 labels
  Caltech256: 29780 images, 26790/2990 train/validate, 256 labels
  SUN397:     108754 images, 98049/10705 train/validate, 397 labels
  ImageNet:   1331167 images, 1281167/50000 train/validate, 1000 labels

[Figure: precision vs. the number of retrieved results T on (a) CIFAR-10, (b) Caltech-256, (c) SUN397, and (d) ImageNet, comparing deep-hashing networks AlexNet(48), VGG-16(48), ResNet-18(48), and ResNet-50(48/64) (hash code length in parentheses) against ITQ, LSH, A+ITQ, and AC+ITQ.]

• The DLG solution achieves better retrieval accuracy than the conventional hashing solutions regardless of the choice of T.
• The DLG solution remains robust when deployed on a real-world system.
Evaluation – DLG-x
22/28

Performance of deep hashing on DLG-x:
  Hash AlexNet:   DLG-x 38ms / 9.1W;  CPU 114ms / 186W;  GPU 1.83ms / 164W
  Hash ResNet-18: DLG-x 94ms / 9.4W;  CPU 121ms / 185W;  GPU 7.13ms / 112W

[Figure: latency and speedup of graph search vs. the number of top retrieved results on (a)-(b) CIFAR-10 and (c)-(d) ImageNet, comparing brute-force sort, graph search on CPU, and DLG-x.]

• Faster than the CPU solution: up to 37.12x and 12.5x speedup on CIFAR-10 and ImageNet, respectively.
• Outperforms the brute-force sort method.
• More power-efficient than the GPU solution.
Evaluation – Cognitive SSD System
23/28
[Figure: performance of the Cognitive SSD system on ImageNet, demonstrated through a web demo.]
• Compared to B-CPU, the Cognitive SSD system reduces latency by 69.9% on average.
• Cognitive SSD achieves 2.44x higher power efficiency than the B-GPU system.
Evaluation – Cognitive SSD Cluster
24/28
Cognitive SSD as a server.
[Figure: a conventional cluster system (CMC) attaches SSDs to host machines over PCIe behind the Internet, whereas the host-free cluster system (HFC) exposes Cognitive SSDs, each with its own ARM core, directly.]

Evaluation – Cognitive SSD Cluster
25/28
• The power efficiency of the HFC system is better than the other baselines when user request rates are low.
• The HFC system would achieve even better power efficiency if the Cortex-A9 processor were replaced by a recent Cortex-A series processor.
Conclusion
• Cognitive SSD provides a more power-efficient solution for unstructured data retrieval.
• The DLG-x accelerator integrates deep learning and graph search into one chip and directly accesses data from NAND flash without crossing multiple memory hierarchies.
• FPGA-based prototype evaluations show that Cognitive SSD outperforms the other solutions in power efficiency.
26/28
Q&A
27/28
If you have any questions, please contact us
Email: [email protected]
Cognitive SSD System – Scalability
28/28
• The Cognitive SSD system also supports other applications and is not limited to image data retrieval.
• The task plane provides a user-defined API (DLG_analysis) that enables users to deploy other applications without significant modification.
[Figure: a query issued through a DLG_analysis request can drive video retrieval (returning retrieval results) or video recognition (returning recognition results).]
References
[1] The biggest data challenges that you might not even know you have, May 2016. https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/.
[2] “MLC 128Gb to 512Gb Async/Sync NAND,” p. 239, 2017.
[3] Y. Son, N. Y. Song, H. Han, H. Eom, and H. Y. Yeom, “A User-Level File System for Fast Storage Devices,” in 2014 International Conference on Cloud and Autonomic Computing, 2014,
pp. 258–264.
[4] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep Supervised Hashing for Fast Image Retrieval,” 2016, pp. 2064–2072.
[5] C. Fu, C. Wang, and D. Cai, “Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph,” arXiv:1707.00143 [cs], Jul. 2017.
[6] E. Doller, A. Akel, J. Wang, K. Curewitz, and S. Eilert, “DataCenter 2020: Near-memory acceleration for data-oriented applications,” in 2014 Symposium on VLSI Circuits Digest of
Technical Papers, 2014, pp. 1–4.
[7] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427 [cs], Nov. 2017.
[8] R. Micheloni, L. Crippa, and A. Marelli, Eds., Inside NAND Flash memories. Heidelberg ; New York: Springer, 2010.
[9] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv:1408.5093 [cs], Jun. 2014.
[10] J. Ha, crow: Crow is very fast and easy to use C++ micro web framework (inspired by Python Flask). 2018.
[11] http://www.thessdreview.com/ces-2019/intel-teases-h10-ssd-intel-optane-memory-with-qlc-3d-nand-in-single-m-2-module-ces-2019-udate/
* The picture is modified from the web [11] and shown for display only; it is not the actual Cognitive SSD system, which is shown on page 20.