General Architectures for DNN – Electronic Systems, course IA-5LIL0, Lecture 10 (2019-10-18)
TRANSCRIPT
Electrical Engineering – Electronic Systems group
Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen
General Architectures for DNN
Recap
• Inference and Learning Principles
• Improving Network Efficiency – focuses on reducing the number of MACs and weights
• Loop Transformations – software tricks to use the memory hierarchy effectively
• Quantization – e.g. fp32 (1 sign bit, 8 exponent bits, 23 mantissa bits) vs. fp16 (1 sign bit, 5 exponent bits, 10 mantissa bits)
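The fp32/fp16 layouts above can be observed directly in numpy (an illustrative sketch, not from the lecture material):

```python
import numpy as np

# Quantizing fp32 to fp16 drops mantissa bits (23 -> 10), so values lose precision.
x = np.float32(3.14159265)   # fp32: 1 sign, 8 exponent, 23 mantissa bits
x16 = np.float16(x)          # fp16: 1 sign, 5 exponent, 10 mantissa bits
print(float(x), float(x16))  # a small rounding error appears in the fp16 value

# fp16 also has a much smaller dynamic range: large fp32 values overflow.
print(np.float16(1e6))       # overflows to inf (max fp16 is ~65504)
```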
Outline for Next Two Lectures
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures
Introduction
• Machine learning plays a major role in today's world
• Hardware is increasingly marketed as made for "AI"
Compute Intensity of DNN
• Compute intensity is roughly proportional to the accuracy of the DNN
Source: Scaling for edge inference, Nature Electronics, 2018.
Energy Efficiency Requirements
• Ranges from cloud to edge devices (low-power embedded applications)
• Different energy budgets and compute capabilities

Compute capability: Edge Node (mW) → Embedded Device (W) → Cloud Server (kW) → HPC Cloud (MW)
Hardware Platform for DNN

Workload:
1. Inference
2. Training
3. Meta Learning

Compute Platforms:
1. High-performance computing
2. Embedded systems

[Figure: flexibility vs. energy efficiency trade-off across CPU, DSP, GPU, FPGA, ASIP, and ASIC – roughly a ~1000x* spread in performance/area]

*Markovic, EE292 Class, Stanford, 2013
Deep Convolutional Neural Networks
• Convolution layers contribute more than 90% of the overall computation, dominating runtime and energy consumption
Source: ICIP Tutorial, 2019
Convolution Layer
• Input fmap of size H × W, filter weights of size R × S, output fmap of size E × F
• One output element: Number of MACs = R × S
• One full output fmap (E = H − R + 1, F = W − S + 1): Number of MACs = R × S × (H − R + 1) × (W − S + 1)
• With C input channels (the filter also has C channels): Number of MACs = R × S × (H − R + 1) × (W − S + 1) × C
• With M filters producing M output fmaps: Number of MACs = R × S × (H − R + 1) × (W − S + 1) × C × M
• For a batch of N input fmaps: Number of MACs = R × S × (H − R + 1) × (W − S + 1) × C × M × N
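The final MAC-count formula can be sketched as a small helper (assumes stride 1 and no padding, as in the slides; the example layer sizes are hypothetical):

```python
def conv_macs(H, W, R, S, C=1, M=1, N=1):
    """MACs for a conv layer: H x W input, R x S filter, C channels,
    M filters, batch N; stride 1, no padding (valid convolution)."""
    E = H - R + 1  # output fmap height
    F = W - S + 1  # output fmap width
    return R * S * E * F * C * M * N

# Tiny check: 5x5 input, 3x3 filter -> 3x3 output, 9 MACs each = 81
print(conv_macs(5, 5, 3, 3))  # -> 81

# Example layer: 224x224 input, 3x3 filter, 3 channels, 64 filters
print(conv_macs(224, 224, 3, 3, C=3, M=64))
```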
Fully Connected Layer
1. Height and width of the output fmap are 1 (E = F = 1)
2. Filters are as large as the input fmaps (R = H, S = W)
Fully Connected Layer
• Reduces to a matrix multiplication: an M × CHW weight matrix times a CHW × N input matrix yields an M × N output (CRS = CHW, since R = H and S = W)
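The reduction of a fully connected layer to matrix multiplication can be sketched with numpy (toy sizes; all names here are hypothetical):

```python
import numpy as np

# A fully connected layer over a batch of N inputs is one matrix multiply:
# (M x CHW) weights times (CHW x N) flattened inputs -> (M x N) outputs.
C, H, W, M, N = 3, 4, 4, 10, 2                 # toy layer sizes
weights = np.random.randn(M, C * H * W)
inputs = np.random.randn(N, C, H, W)

flat = inputs.reshape(N, C * H * W).T          # CHW x N
out = weights @ flat                           # M x N
print(out.shape)                               # -> (10, 2)
```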
Compute Intensity of Popular CNNs
Overview of Microprocessor Designs
Source: Time Moore, Liming Xiu, 2019.
Intrinsic Compute Capability
• TPU vs. CPU: ~200x difference in performance/Watt for deep learning
Source: XETAL-II, 2010
Source of Inefficiency
• More than 50% of the energy is spent on cache and control logic (results do not include DRAM power)
How to improve?
• Reduce control overhead
• Improve the cache hierarchy
• Multi-core/cluster concepts provide an additional performance gain
[Figure: energy breakdown – compute vs. cache & control]
Source: Computing's Energy Problem, ISSCC 2014
Reduce Control Overhead: SIMD Extensions
• Intel – SSE (Streaming SIMD Extensions, 4 × 32-bit single precision) [SSE2, SSE3, SSSE3, SSE4]; AVX, AVX2, AVX-512 – 256/512-bit
• AMD – 3DNow!
• Arm – VFP (single/double-precision co-processor); NEON – 128-bit; SVE – 128- to 2048-bit
• Qualcomm – 4 × 1024-bit → 4096 bits
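The SIMD idea behind these extensions, one instruction applied to a whole vector of lanes at once, can be modeled with numpy, whose array operations dispatch to vectorized kernels internally (illustrative only; real SSE/AVX lane widths are fixed by the ISA):

```python
import numpy as np

# Scalar version: one multiply-add per loop iteration.
def mac_scalar(a, b, acc):
    for i in range(len(a)):
        acc[i] += a[i] * b[i]
    return acc

# SIMD-style version: the same multiply-add expressed over whole vectors,
# executed by vectorized kernels rather than an interpreted per-element loop.
def mac_vector(a, b, acc):
    return acc + a * b

a = np.arange(8, dtype=np.float32)
b = np.full(8, 2, dtype=np.float32)
print(mac_vector(a, b, np.zeros(8, dtype=np.float32)))  # same result as the scalar loop
```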
DNN-specific extensions: reduced precision and instruction-set extensions
Example: Intel Cascade Lake, 2019 (VNNI – Vector Neural Network Instructions)
• 28 cores
SIMD Extensions for DNNs (VNNI)
• Mixed-precision mode – INT8 × INT8 products accumulated into INT32
• VNNI – FMA in a single cycle, compared to 3 cycles with normal SIMD instructions
• Some architectures support a "2x2 dot-product" as well
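A minimal sketch of the mixed-precision mode, with numpy standing in for the actual VNNI instruction:

```python
import numpy as np

# INT8 x INT8 products, accumulated in INT32 so partial sums cannot overflow
# (an int8 accumulator would wrap immediately: 127 * 127 = 16129 >> 127).
a = np.array([127, -128, 100, 50], dtype=np.int8)
b = np.array([127, 127, -100, 2], dtype=np.int8)

acc = np.sum(a.astype(np.int32) * b.astype(np.int32), dtype=np.int32)
print(int(acc))  # 16129 - 16256 - 10000 + 100 = -10027
```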
Brain Floating Point
• bfloat16: same dynamic range as IEEE FP32 (8 exponent bits), but less precision (7 mantissa bits)
• Example use-cases: Google TPU, Cooper Lake Xeon processors
• Another option – "posit" floating point (an adaptable FP format)
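numpy has no native bfloat16, but the format can be modeled by truncating an fp32 bit pattern to its top 16 bits (sign, 8 exponent bits, 7 mantissa bits) — a sketch with round-toward-zero, not a production conversion:

```python
import numpy as np

def to_bfloat16(x):
    """Truncate an fp32 value to bfloat16 precision (keep the top 16 bits)."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Same dynamic range as fp32: 1e38 stays finite (fp16 would overflow)...
print(to_bfloat16(1e38))
# ...but only 7 mantissa bits of precision remain.
print(float(to_bfloat16(3.14159265)))  # -> 3.140625
```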
How about BNNs?
• With 1-bit weights and activations, a MAC reduces to bitwise logic: y = popcount(W XNOR X)
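The popcount/XNOR formula computes a dot product of ±1 vectors; a sketch using Python integers as packed bit vectors, mapping +1 to bit 1 and −1 to bit 0 (`bnn_dot` is a hypothetical helper):

```python
def bnn_dot(w_bits, x_bits, n):
    """Dot product of two length-n {-1,+1} vectors from their bit packings."""
    matches = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # XNOR: 1 where signs agree
    pop = bin(matches).count("1")                  # popcount of agreements
    return 2 * pop - n                             # agreements minus disagreements

# w = [+1, -1, +1, +1] -> 0b1011 ; x = [+1, +1, -1, +1] -> 0b1101
print(bnn_dot(0b1011, 0b1101, 4))  # true dot product: +1 -1 -1 +1 = 0
```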
Software Stack for DNN
• Parallelizing compiler
• Inline assembly
• Intrinsics
• Optimized libraries – MKL-DNN, clDNN, BLAS, Arm NN, Arm CMSIS-NN, and many more
Example: Arm NN Library
[Figure: software stack – TensorFlow models running on top of Arm NN]
Comparison among CNN Libraries on CPU
• Caffe backends – ATLAS, OpenBLAS, MKL, OpenMP, and CaffeConTroll
• Performance depends on the quality of the library optimizations for the target
Source: Evaluating the energy efficiency of deep CNNs, Da Li, 2016. *ConvNet on a Xeon E5 (16-core)
Distributed Learning and Inference
Source: Large Scale Distributed Deep Networks, Jeffrey Dean et al., Google, 2012
Distributed DL – Approach
Distributed DL – Performance
• Models with more parameters benefit more from the use of additional machines
Domain Specific Processors (VLIW-SIMD)
• Processors optimized for a specific application domain (e.g., vision, signal processing)
• Examples: Qualcomm Hexagon, Movidius (Intel), Ceva, and many more
• Support for DNN:
  • Instruction-set extensions
  • DNN accelerator in the execution pipeline
Programming Model
Hexagon DSP (Qualcomm)
Hexagon over Quad CPU
Source: Hot Chips
Hexagon – Power Breakdown
• Less overhead in control logic and memory compared to CPUs
Another Example: Ceva DSP
Source: AnandTech
Final Example: Movidius v2
Source: Hot Chips 2014
Final Example: Intel Neural Compute Stick
Graphics Processing Units (GPUs)
• SIMD vs. GPU:
  • GPUs use threads instead of vectors
  • GPUs have "shared memory" spaces
How Are Threads Scheduled?
Example: NVIDIA Fermi (2009)
• Streaming Multiprocessors (SMs)
  • 32 CUDA cores per SM
  • ALU – 32/64-bit
  • FP – single/double precision (with FMA)
  • SFU – sine, cosine, sqrt, etc.
• Clock – 1.5 GHz (estimated)
• Peak performance – 1.5 TFLOPS
Example: NVIDIA Volta (2017)
• Streaming Multiprocessors (SMs)
  • 80 SMs
  • 64 INT32/FP32 cores per SM
  • 32 FP64 CUDA cores per SM
  • 8 tensor cores per SM – each performs a 4×4 matrix multiply; together 512 FMAs = 1024 FP ops per SM per cycle
• Clock – 1.53 GHz
• Peak tensor throughput – 125 TFLOPS (1.53 GHz × 80 SMs × 1024 FP ops)
![Page 73: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/73.jpg)
Internals of Tensor Core
• Modes of operation – Volta
• FP16 – A, B, C are FP16
• Mixed-precision – A and B are FP16, C is FP32
• Turing GPUs – support 1-, 2-, 4-, and 8-bit data types (int4, int8 on Tensor cores)
73
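The mixed-precision mode can be emulated numerically: FP16 operands, FP32 accumulation. This is a functional sketch of the arithmetic, not the actual tensor-core datapath:

```python
import numpy as np

# Emulate Volta mixed-precision mode: A and B held in FP16,
# products accumulated into an FP32 result D = A @ B + C.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Widen the FP16 products to FP32 before accumulation,
# as the tensor core does internally.
D = A.astype(np.float32) @ B.astype(np.float32) + C
assert D.dtype == np.float32 and D.shape == (4, 4)
```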
![Page 74: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/74.jpg)
Scheduling Example for 16x16x16 Gemm
74
Source: Anandtech
![Page 79: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/79.jpg)
How to use Tensor cores
• cuBLAS, cuDNN, etc.
• Library takes care of tiling and storage hierarchy
• Opcode: HMMA (Matrix Multiply Accumulate)
79
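What the library does conceptually can be sketched as tiling a large GEMM into the 16x16x16 fragments a warp feeds to the tensor cores. The tile size matches the scheduling example above, but the loop order and storage handling here are illustrative, not cuBLAS internals:

```python
import numpy as np

T = 16  # fragment size: 16x16x16 GEMM per warp-level HMMA sequence

def tiled_gemm(A, B):
    """C = A @ B computed tile-by-tile, the way a GEMM library
    decomposes the problem for tensor cores (illustrative only)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):
                # each 16x16x16 fragment maps to one tensor-core op sequence
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

A = np.random.rand(32, 48).astype(np.float32)
B = np.random.rand(48, 64).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```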
![Page 80: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/80.jpg)
GPU Performance
80
• We still need the CPU to some extent
![Page 81: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/81.jpg)
GPU vs CPU Performance
• CPU – 16-core Intel Xeon E5-2650 v2 @ 2.6 GHz
• Benchmark: AlexNet
• Lower batch sizes lead to under-utilization on all devices
• K20 has less memory than Titan X
81
![Page 82: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/82.jpg)
Concluding Remarks
• The compute and data requirements of DNNs are quite high, and a major part of the computation comes from matrix multiplications (i.e. MAC ops)
• Common DNN-specific extensions in generic architectures are:
1. Instruction-set extensions – generally SIMD support at reduced precision
2. A DNN accelerator on the datapath (co-processor, Tensor core, etc.)
• The effective performance of a platform depends on both hardware capability and software support (the programming model and libraries used to realize the network)
• Energy efficiency remains a limitation of generic platforms for DNNs
82
![Page 83: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/83.jpg)
Reference
• Da Li and Xinbo Chen, "Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs," 2016
83
![Page 84: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/84.jpg)
Backup
84
![Page 85: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/85.jpg)
Multithreading Categories
85
Figure: issue-slot diagrams over time (processor cycles) comparing super-scalar, fine-grained multithreading, coarse-grained multithreading, simultaneous multithreading (SMT), and multiprocessing; shading distinguishes Threads 1–5 and idle slots.
![Page 86: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/86.jpg)
Example: IBM Power4 (Superscalar)
86
![Page 87: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/87.jpg)
Example: IBM Power5
• Supports 2 threads
87
• 2 fetch units (PC), 2 initial decodes
• 2 commits (architected register sets)
![Page 88: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/88.jpg)
Power5 Thread Performance
• Relative priority of each thread is controllable in hardware
• For balanced operation, both threads run slower than if they "owned" the machine
88
![Page 89: General Architectures for DNN - Electronic Systemsheco/courses/IA-5LIL0/Lecture10... · 2019-10-18 · Outline for Next two Lectures • Introduction • Background on DNN computations](https://reader030.vdocuments.site/reader030/viewer/2022041022/5ed2e74d6d79440c0a443c58/html5/thumbnails/89.jpg)
Any guess on the largest chip so far?
89
Source: Cerebras, Hot Chips 2019