fpgas - computer sciencecr4bd/6354/f2016/slides/lec22-slides... · fpgaprograms:rtl e.g.:verilog...
TRANSCRIPT
![Page 1: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/1.jpg)
FPGAs
1
![Page 2: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/2.jpg)
To read more…
This day’s papers:Brown and Rose, ”Architecture of FPGAs and CPLDs: A Tutorial”. (noreview required)Putnam et al, ”A Reconfigurable Fabric for Accelerating Large-ScaleDatacenter Services”
1
![Page 3: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/3.jpg)
reconfigurable hardware‘normal’ processor reconfig. HWstream of instructions set of wiringsfetch 1+ instruction/cycle milliseconds+ to reconfigurelots of control logic lots of routingfixed, fast functional units flexible, slower functional units
2
![Page 4: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/4.jpg)
the accelerator concept
second processor specialized for particularcomputation
examples:
GPUs — vector computations
FPGAs — ???
custom chips — ??? (next week)
3
![Page 5: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/5.jpg)
FPGA structure
Brown and Rose, Figure 2 4
![Page 6: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/6.jpg)
FPGA programs: RTL
e.g.: Verilog
determines wiring between gates, registers, memories
everything happens in parallel every cycle
manually specify what’s in registers, etc.
same languages used to design real processors
5
![Page 7: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/7.jpg)
RTL example
module counter(clock,reset,value);input clock;input reset;output value;
reg [32:0] count;
always @ (posedge reset or posedge clock)if (reset)begin
count <= 0;end
elsebegin
count <= count + 1'b1;end
assign value = count;endmodule
6
![Page 8: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/8.jpg)
A note about HW programming
not intuitiveattempts at easier interfaces:
“schematic capture” — draw circuit diagramcommon, doesn’t seem great at scale
higher-level tools, e.g., Chisel (Berkeley researchproject)
compile to RTL; used at scale
automatic translation of C-like language (C to gates)Very mixed reputation — very hard compilers problemBut see Aladdin paper
7
![Page 9: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/9.jpg)
FPGA design pipeline
Brown and Rose, Figure 7 8
![Page 10: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/10.jpg)
FPGA: place and route
RTL compiles to “gate list”
needs to turn into what components in the FPGA toconnect
not straightforward; hours+ to compute if FPGAnearly full
effects performance — longer wires/more switches
9
![Page 11: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/11.jpg)
Programmable switches: example
Example switch: transistor + SRAM cell(SRAM cell ≈ 1-bit register)
SRAM cell continously outputs stored value
can be written by seperate circuit (not shown)
Brown and Rose, Figure 5 10
![Page 12: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/12.jpg)
Programmable switches: example
Brown and Rose, Figure 5 11
![Page 13: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/13.jpg)
FPGA routing example
12
![Page 14: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/14.jpg)
FPGA logic block example (1)
13
![Page 15: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/15.jpg)
FPGA logic block example (2)
14
![Page 16: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/16.jpg)
FPGA configuration
what to do for every switch
just loading values into memory that controls switch
15
![Page 17: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/17.jpg)
FPGA efficiency
most transistors perform routing, not computation
much longer signal paths than in CPUsslower clock rates for same task
development tool usefulness/quality is not great
16
![Page 18: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/18.jpg)
FPGA: more complex logic
many FPGAs include specialized fixed functionalityRAMadders, multipliersfloating point unitscommon DSP computationsfull embedded-class CPU cores…
could implement these using fully programmablelogic
but slower/bigger
17
![Page 19: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/19.jpg)
review comments
what are FPGAs good for anyways?
versus/combined with GPUs/CPUs?
other large-scale deployments?
programmability?
18
![Page 20: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/20.jpg)
Catapult challenges
datacenter logisticscost (only 10% more???)power density (cooling, power distribution)physical space
programs across multiple FPGAsneeds fast FPGA-to-FPGA communicationcentralized allocation
failure handling
19
![Page 21: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/21.jpg)
The Shell
23% of FPGA (configurable) area:
20
![Page 22: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/22.jpg)
CPU to FPGA transfers
10 µs for 16 KB — approx 15 GB/s
(about maximum PCIe 3.0 transfer rate)
21
![Page 23: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/23.jpg)
Catapult roles
hand-coded Verilog (RTL language)
hand partitioned across FPGAs?
precise duplication of existing software
22
![Page 24: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/24.jpg)
Search engine architecture
searchquery cache
top-levelaggregator
(TLA)
MLA
MLA
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
ranking service
query
query
docum
ents
documents
documents, queryrankings
23
![Page 25: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/25.jpg)
Search engine architecture
searchquery cache
top-levelaggregator
(TLA)
MLA
MLA
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
ranking service
query
query
docum
ents
documents
documents, queryrankings
23
![Page 26: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/26.jpg)
Search engine architecture
searchquery cache
top-levelaggregator
(TLA)
MLA
MLA
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
ranking service
query
query
docum
ents
documents
documents, queryrankings
23
![Page 27: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/27.jpg)
Search engine architecture
searchquery cache
top-levelaggregator
(TLA)
MLA
MLA
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
ranking service
query
query
docum
ents
documents
documents, query
rankings
23
![Page 28: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/28.jpg)
Search engine architecture
searchquery cache
top-levelaggregator
(TLA)
MLA
MLA
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
index shard
ranking service
query
query
docum
ents
documents
documents, query
rankings
23
![Page 29: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/29.jpg)
Overall Motivation
24
![Page 30: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/30.jpg)
FPGA operation
recieve: document, some features via shared memoryoutput: scoreeach FPGA runs a macropipeline stage — 8 µs(1600 clock cycles)
25
![Page 31: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/31.jpg)
Queue Manager
“model reload”
can only store one model at a time — takes 250 µsto load from external RAM
on FPGA memories: approx. 40MB capacity(distributed)
trick: proess queries for same model together
26
![Page 32: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/32.jpg)
Feature Extraction FSMs
parallel finite-state machines
essentially regexes compiled to gates?
fully pipelined
27
![Page 33: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/33.jpg)
Feature Expressions
speialized mathematical expressions
custom multithreaded processor
model determines what the expressions are
mostly integer — small FPGA area — but some FP
split across multiple FPGAs
threads priority-scheduled
28
![Page 34: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/34.jpg)
“Complex” logic area
29
![Page 35: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/35.jpg)
What are FPGAs good for?
bit-twiddling (lots of simple CPU instrs.)?
inherently parallel programs?perhaps even if different operations — hard for GPUs
low-latency I/O interface and processing?
prototyping CPUs, GPUs
30
![Page 36: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/36.jpg)
What are FPGAs bad at?
floating point, other ‘big’ arithmetic operationspurpose-built, denser ALUs just win
caching lots of data?… but sometimes dedicated SRAM blocks
being easy to program wellprogramming FPGAs ≈ processor design!
31
![Page 37: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/37.jpg)
FPGAs versus GPUs
both good at doing massively parallel computations
FPGAs better at exploiting multiple instructionparallelism?
FPGAs can be lower latency for simple operations
FPGAs much worse at floatingpoint/non-small-integer calculations?
32
![Page 38: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/38.jpg)
Interlude: Homework 3
33
![Page 39: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/39.jpg)
Homework 3 supplied kernel
what does the supplied kernel do?0 1 2 … 255 256 257 258 … 511 512 …
255 511 …
34
![Page 40: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/40.jpg)
Exam topics
Memory hierarchy — caches, TLBsPipelining, instruction scheduling, VLIWMultiple issue/out-of-order:
register renaming and reservation stationsreorder buffers and branch predictionhardware multithreading
Multicore shared memory:cache coherency protocols/networksrelaxed memory models and sequential consistencysynchronization: spin locks, transaction memory, etc.
Vector machines, GPUs, other accelerators35
![Page 41: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/41.jpg)
Next time: Custom ASICs
higher dev cost/higher efficiency
two papers:one on: automating design of custom ASIC accelerators(Aladdin)another: a case study using that (Minerva)
all these things probably apply to FPGA stuff
36
![Page 42: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/42.jpg)
Preview: Minerva
Deep Neural Networks — machine learning models
accelerating evaluating DNNs (making predictionsfrom a pre-trained model)
mathematical tradeoffs (remove “unimportant”things from model)
architectural tradeoffs
37
![Page 43: FPGAs - Computer Sciencecr4bd/6354/F2016/slides/lec22-slides... · FPGAprograms:RTL e.g.:Verilog determineswiringbetweengates,registers,memories everythinghappensinparalleleverycycle](https://reader034.vdocuments.site/reader034/viewer/2022051601/5add87867f8b9a4a268da93a/html5/thumbnails/43.jpg)
Previre: Aladdin
Tool (used by Minerva) for quickly evaluatingaccelerator designs
Produces fast estimates
Complements existing high-level synthesis (“C togates”-like) tools
38