ece 5775 high-level digital design automation fall 2018 · outline motivation for specialized...
TRANSCRIPT
![Page 1: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/1.jpg)
Specialized Computing
ECE 5775High-Level Digital Design Automation
Fall 2018
![Page 2: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/2.jpg)
▸ All students enrolled in CMS & Piazza– Email instructor otherwise
▸ Lab 1 to be released next Monday
1
Announcements
![Page 3: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/3.jpg)
Outline
▸ Motivation for specialized computing– Key driving forces from applications and technology– Main sources of inefficiency in general-purpose
computing
▸ FPGA introduction– Basic building blocks– Classical homogeneous FPGA architectures– Modern heterogeneous FPGA architectures
2
![Page 4: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/4.jpg)
3
30 Years of Hot Chips Symposium (www.hotchips.org)
8/28/2018 StarryHeavensAbove
https://mp.weixin.qq.com/s/BPtDYSXE7buKEWfDYAPKbw 1/13
8/28/2018 StarryHeavensAbove
https://mp.weixin.qq.com/s/BPtDYSXE7buKEWfDYAPKbw 1/13
8/28/2018 StarryHeavensAbove
https://mp.weixin.qq.com/s/BPtDYSXE7buKEWfDYAPKbw 2/13
8/28/2018 StarryHeavensAbove
https://mp.weixin.qq.com/s/BPtDYSXE7buKEWfDYAPKbw 2/13
Hot Chips 30 (2018)Hot Chips 20 (2008)
Hot Chips 10 (1998)Hot Chips 1 (1989)
![Page 5: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/5.jpg)
▸ The resurgence of AI is revolutionizing many crucial aspects of the world
– Self-driving vehicles – Image recognition and scene modeling – Game AI (e.g. AlphaGo)
▸ A key method is the deep neural networks (DNNs)4
Application Trend: The Great AI Awakening
![Page 6: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/6.jpg)
▸ DNNs require enormous amount of compute – For example, ResNet50 (70 layers) performs 7.7 billion operations
required to classify one image
5
Modern DNNs are Computationally “Expensive”
X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi. Scaling for Edge Inference of Neural Networks. Nature Electronics, vol 1, Apr 2018.
© 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
PERSPECTIVE NATURE ELECTRONICS
the interconnect. Neurocube44 stacked the hybrid memory cube die on a single-instruction multiple-data processor, while TETRIS45 combined a hybrid memory cube with a spatial architecture. Unlike general DNN accelerators, near-data processing achieves optimal efficiency by using more area for computing. In order to achieve higher efficiency, some works have even moved the DRAM on chip. DaDianNao23 adopted embedded DRAM for high-density on-chip memory, which achieves a 150-fold reduction in energy at the cost of larger chip size. There are also some works that moved computing units to sensors, thereby further reducing the cost of memory access. ShiDiannao26 put vision processing in the sensor with no DRAM, yielding a 63-fold improvement in energy effi-ciency. RedEye46 even omitted analogue-to-digital conversion and performed DNN computation in the analogue domain at the sensor.
Non-von Neumann architectures have also been explored to reduce computation and memory consumption. One such approach adopts non-volatile resistive memories as programmable resis-tive elements. Because computation is performed in the analogue domain, it can be extremely fast with ReRAM arrays47. The approach also brings high density and high energy efficiency as computation and memory are packed in the same chip area, thereby involving minimal data movement. ISAAC48 adopted multicycle approach to perform high-precision calculations with limited memory using
25.1 million memristors. PRIME49 employed a large memristor array for multi-level computation. Jain et al.50 and Wang et al.51 proposed the use of spin-transfer torque magnetic RAM for DNN computation.
Recently, representative array-level demonstrations have been reported. These include IBM’s 500 × 661 phase change mem-ory array for handwritten-digit recognition using the Modified National Institute of Standards and Technology (MNIST) data-base52, Tsinghua’s 128 × 8 analogue resistive RAM array for face rec-ognition53, UCSB’s 12 × 12 crossbar array for pattern recognition54, and UCSB’s floating-gate array for MNIST image recognition55. Non-von Neumann architectures with memristors have several drawbacks: a large analogue-to-digital/digital-to-analogue conver-sion overhead, limited size of the memristor array, and energy and time overheads for memristor writing. It was recently shown that the analogue-to-digital conversion overhead can be eliminated by training the networks in the analogue domain54, and memristor writing can also be mitigated56. Although non-von Neumann archi-tectures with non-volatile resistive memories have considerable potential in both performance and energy efficiency, a number of requirements are yet to be met: special materials and device engi-neering to support the requirements of synaptic devices, increased array size, DNN mapping and EDA tools, and large-scale prototype
Num
ber o
f ope
ratio
ns (×
109 )
10
20
5
1
50
20% 10% 4%Top-five error
AlexNet(2012)
OverFeat(2013)
GoogLeNet(2014)
VGG-16
VGG-19(2014)
Inception-v1
Inception-v2
Inception-v3
Inception-v4(2016)
MobileNetShuffleNet
Xception
ResNeXt-101 DPN-131
PolyNet
NASNet-A(N=7)
ResNet110
ResNet152(2015)
ResNet50
NASNet-A(N=5)
NASNet-A(N=4)
NASNet-A(N=7)
2011
Cambricon
DaDianNao
Diannao
NeuFlow
ShiDianNao
Arria II EP2AGZ350Strati IV EP4SGX230
Stratix 10 GX 2800Arria V GX660
Year
20
100
200
50
10
500
5
1,000
Per
form
ance
den
sity
(gig
aops
s–1
mm
–2)
PD of GPUs PD of ASICs PD of FPGAs
Moore’s law trend forperformance according to ref. 35
Moore’s law end
GTX 690 GTX Titan
GTX Titan X
GTX 1080
P100 V100
Eyeriss
TPU
EIE
ParkMoves
Myriad2
Arria V5AGZE7
20172012 2013 2014 2015 2016
65 nm 45 nm 40 nm 28 nm 16 nm 12 nm
DNNs in academia withoutstructural optimization
DNNs in academia withstructural optimization
GTX Titan Z
a
b
GTX 1080 Ti
Fig. 3 | Gap between required number of operations and performance density. a, Number of operations versus top-five error rate for leading DNN designs from ImageNet classification competition. b, Performance density (PD) of leading GPU, ASIC and FPGA platforms. To catch up with the required number of operations, simply increasing the chip area is not feasible. Only the leading DNNs are labelled with a year. The y axis is in log scale. Data taken from refs 1–31,119,120.
NATURE ELECTRONICS | VOL 1 | APRIL 2018 | 216–222 | www.nature.com/natureelectronics218
Compute Density
![Page 7: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/7.jpg)
6
On Crash Course with the End of “Cheap” Technology Scaling
End of Growth of Single Program Speed?
22
End of the
Line?2X /
20 yrs(3%/yr)
RISC2X / 1.5 yrs
(52%/yr)
CISC2X / 3.5 yrs
(22%/yr)
End of DennardScaling
⇒Multicore2X / 3.5
yrs(23%/yr)
Am-dahl’sLaw⇒
2X / 6 yrs(12%/yr)
Based on SPECintCPU. Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018
▸ End of Dennard scaling: power becomes the key constraint
– Amdahl’s Law and dark silicon prevent “easy” multicore scaling
![Page 8: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/8.jpg)
▸ Classical scaling– Frequency increases at
constant power profiles• Performance improves “for free"!
▸ Leakage limited scaling– Vth virtually stopped scaling
due to exponentially increasing leakage power
– VDD scaling nearly stopped as well to maintain performance
▸ Dark silicon – Power constraints limit how
much of the chip can be activated at any one time (not 100% anymore)
7
End of Dennard Scaling and Dark SiliconClassical Dennard scalingDevice (transistor) # S2
Capacitance / device 1/SVoltage (Vdd) 1/SFrequency STotal power 1
Leakage limited scalingDevice (transistor) # S2
Capacitance / device 1/SVoltage (Vdd) ~1Frequency ~1Total power S
![Page 9: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/9.jpg)
8
Tradeoff between Compute Efficiency and Flexibility
Why is general-purpose CPU less energy efficient?
EFFICIENCYFLEXIBILITY
Control
Unit (CU)
Registers
Arithmetic Logic
Unit (ALU)
CPUs GPUs ASICsFPGAs
![Page 10: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/10.jpg)
9
Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
L1-I$
BranchPredictor
FetchDecode Rename
RAT
Free list
Reservation-station
Schedule
Int RF
FP RF
RegisterRead/write
D-cache
LSQ + TLB
ALU
FPU
ROB
Commit
Typical Superscalar OoO Pipeline
Parameter Value
Fetch/issue/retire width 4
# Integer ALUs 3
# FP ALUs 2
# ROB entries 96
# Reservation station entries 64
L1 I-cache 32 KB, 8-way set associative
L1 D-cache 32 KB, 8-way set associative
L2 cache 6 MB, 8-way set associative
[source: Jason Cong, ISLPED’14 keynote]
![Page 11: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/11.jpg)
10
Energy Breakdown of Pipeline Components
Fetch unit9%
Decode6%
Rename12% Register
files3%
Scheduler11%
Int ALU14%
Mul/div4%
FPU8%
Memory10%
Misc23%
L1-I$
BranchPredictor
FetchDecode Rename
RAT
Free list
Reservation-station
Schedule
Int RF
FP RF
RegisterRead/write
D-cache
LSQ + TLB
ALU
FPU
ROB
Commit
Control
![Page 12: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/12.jpg)
11
Removing “Non-Computing” Portions
Fetch unit9%
Decode6%
Rename12% Register
files3%
Scheduler11%
Int ALU14%
Mul/div4%
FPU8%
Memory10%
Misc23%
▸ “Computing” portion: 10% (memory) + 26% (compute) = 36%
![Page 13: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/13.jpg)
Energy Comparison of Processor ALUs and Dedicated Units
▸ Why are processor units so expensive?
– ALU can perform multiple operations
• Add/sub/bitwise XOR/OR/AND
– 64-bit ALU– Dynamic/domino logic
used to run at high frequency
• Higher power dissipation
Operation Processor ALU
45 nm TSMC standard cell
32-bit add 0.122 nJ@2 GHz
0.002 nJ @1 GHz
32-bitmultiply
0.120 nJ@2 GHz
0.007 nJ @1 GHz
Single-precisionFP operation
0.150 nJ @ 2GHz
0.008 nJ @500 MHz
12
![Page 14: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/14.jpg)
13
Energy Breakdown with Standard-Cell ASICs
Fetch unit9%
Decode6%
Rename12% Register files
3%
Scheduler11%
Int ALU0.2%
Mul/div0.2%
FPU0.4%
ALU/FPU savings25.0%
Memory10%
Misc23%
▸ “Computing” portion: 10% (memory) + ~1% (compute) = 11%Still, only 10X gain is attainable?
![Page 15: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/15.jpg)
14
Additional Energy Savings from Specialization
▸ Customized data types– Exploit data range information to reduce bitwidth/precision and
simply arithmetic operations
▸ Specialized memory architecture– Exploit regular memory access patterns to minimize energy per
memory read/write
▸ Specialized communication architecture– Exploit data movement patterns to optimize the
structure/topology of on-chip interconnection network
These techniques combined can lead to another 10-100X energy efficiency improvement over GPPs
![Page 16: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/16.jpg)
15
Tradeoff between Compute Efficiency and Flexibility
EFFICIENCYFLEXIBILITY
Control
Unit (CU)
Registers
Arithmetic Logic
Unit (ALU)
CPUs GPUs ASICsFPGAs
What makes FPGA an interesting compute substrate?
![Page 17: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/17.jpg)
What is an FPGA?
▸ FPGA: Field-Programmable Gate Array– An integrated circuit designed to be configured by a customer or
a designer after manufacturing (wikipedia)
▸ Components in an FPGA Chip– Programmable logic blocks– Programmable interconnects– Programmable I/Os
16
![Page 18: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/18.jpg)
▸ SRAM-based implementation is popular– Non-standard technology means older technology generation
17
Three Important Pieces
Pass transistor (controlled by an SRAM bit)
Multiplexer (controlled by SRAM bits)
Lookup table (LUT, formed by SRAM bits)
LUT
![Page 19: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/19.jpg)
▸ Any function of k variables can be implemented with a 2k:1 multiplexer
18
Multiplexer as a Universal Gate
01010101
00110011
00001111
CoutSCinBA01101001
01010101
00110011
00001111
CoutSCinBA
01101001
00010111
01010101
00110011
00001111
CoutSCinBA01101001
00010111
01010101
00110011
00001111
CoutSumCinBA
01234567
S2
8:1 MUX
S1 S0
Cout
![Page 20: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/20.jpg)
▸ How many distinct 3-input 1-output Boolean functions exist?
▸ What about K inputs?
19
How Many Functions?
![Page 21: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/21.jpg)
20
Look-Up Table (LUT)
0/10/10/10/10/10/10/10/1
MU
X… Y
x2
A 3-input LUT
§ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2k SRAM bits– Delay is independent of logic
function
x1 x0
![Page 22: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/22.jpg)
▸ Implement a 2:1 MUX using a network of 2-input LUTs. Use the minimum number of LUTs
21
Exercise: Implementing Logic with LUTs
2-LUTY
MU
X
I0
I1
SBuilding block:
2-input LUT
![Page 23: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/23.jpg)
▸ A 6-input LUT can be two 5-input LUTs with common inputs (used in Xilinx 7-series FPGAs)
– Any function of six variables or two independent functions of five variables
22
Example: Implementing 6-LUT with 5-LUTs
5-LUT
A5A4A3A2A1
5-LUT
A5A4A3A2A1
A6
A5A4A3A2A1
O6
O5
6-LUT
![Page 24: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/24.jpg)
23
A Logic Element
LUT
▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed▸ The LUT and FF combined form a logic element
![Page 25: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/25.jpg)
24
A Logic Block
▸ A logic block clusters multiple logic elements
▸ Example: In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
– Two independent carry chains per CLB for implementing adders
– Each slice contains four LUTs
CIN CIN
COUT COUT
SLICE
SLICE
Crossbar Sw
itch
![Page 26: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/26.jpg)
Traditional Homogeneous FPGA Architecture
25
Logic block
Switchblock
Routing track
![Page 27: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/27.jpg)
Modern Heterogeneous Field-Programmable System-on-Chip
[Figure credit: embeddedrelated.com]
▸ Island-style configurable mesh routing▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors– Specialization leads to higher performance and lower power
26
![Page 28: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/28.jpg)
▸ Built-in components for fast arithmetic operation optimized for DSP applications – Essentially a multiply-accumulate core with many
other features
– Fixed logic and connections, functionality may be configured using control signals at run time
– Much faster than LUT-based implementation (ASIC vs. LUT)
27
Dedicated DSP Blocks
![Page 29: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/29.jpg)
28
Example: Xilinx DSP48E Slice
§25x18 signed multiplier§48-bit add/subtract/accumulate§48-bit logic operations§SIMD operations (12/24 bit)§Pipeline registers for high speed
[source: Xilinx Inc.]
![Page 30: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/30.jpg)
▸ Example: Xilinx 18K/36K block RAMs – 32k x 1 to 512 x 72 in one
36K block– Simple dual-port and true
dual-port configurations– Built-in FIFO logic– 64-bit error correction
coding per 36K block
[source: Xilinx Inc.] 29
DIADIPAADDRAWEAENA
CLKA
DIBDIPB
WEBADDRB
ENB
DOA
CLKB
DOPA
DOPBDOB
18K/36K block RAM
Dedicated Block RAMs (BRAMs)
![Page 31: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/31.jpg)
Embedded FPGA System-on-Chip
Xilinx Zynq All Programmable System-on-Chip[Source: Xilinx Inc.]
Dual ARM Cortex-A9 + NEON SIMD extension @600MHz~1GHz
Up to 350K logic cells2MB Block RAM 900 DSP48s
30
![Page 32: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/32.jpg)
31
FPGA as an Accelerator for Cloud Computing
Bloc
k RA
M
Bloc
k RA
M
~2 Million Logic Blocks
~5000DSP Blocks
~300MbBlock RAM
Xilinx UltraScale+ VU9P [Figure source: David Pellerin, AWS]
Amazon Machine
Image (AMI)
EC2 F1 Instance
CPU Application
on F1
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
DDR-4 Attached Memory
FPGA Link
PCIeDDR
Controllers
Launch Instanceand Load AFI
Amazon EC2 F1 instances
![Page 33: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/33.jpg)
32
FPGA Deployment in Datacenter
[source: Microsoft, BrainWave, HotChips’2017]
▸ FPGAs deployed in Microsoft datacenters to accelerate various web, database, and AI services
– e.g., project BrainWave claimed ~40Teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs
![Page 34: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/34.jpg)
▸ Massive amount of fine-grained parallelism– Highly parallel and/or deeply pipelined to achieve maximum
parallelism– Distributed data/control dispatch
▸ Silicon configurable to fit algorithm– Compute the exact algorithm at the desired level of numerical
accuracy• Bit-level sizing and sub-cycle chaining
– Customized memory hierarchy
▸ Performance/watt advantage– Low power consumption compared to CPU and GPGPUs
• Low clock speed• Specialized architecture blocks
33
Summary: FPGA as a Programmable Accelerator
![Page 35: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/35.jpg)
▸ Fixed-point data types▸ Basics of algorithm analysis
34
Next Class
![Page 36: ECE 5775 High-Level Digital Design Automation Fall 2018 · Outline Motivation for specialized computing – Key driving forces from applications and technology – Main sources of](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec8982e80f1eb73380b716c/html5/thumbnails/36.jpg)
▸ These slides contain/adapt materials developed by– Prof. Jason Cong (UCLA)
35
Acknowledgements