Post on 20-Dec-2015
Chapter 7
Hardware Accelerators
Prof. Chung-Ta King (金仲達), Department of Computer Science, National Tsing Hua University
(Slides are taken from the textbook slides)
Operating Systems-2
Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator
Accelerated systems
- Use an additional computational unit dedicated to some functions:
  - Hardwired logic.
  - Extra CPU.
- Hardware/software co-design: joint design of hardware and software architectures.
Accelerated system architecture
[Figure: block diagram of CPU, accelerator, memory, and I/O, with arrows labeled request, data, and result.]
Accelerator vs. co-processor
- A co-processor connects to the internals of the CPU and executes instructions.
  - Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.
  - Accelerator is controlled by registers, just like I/O devices.
  - CPU and accelerator may also communicate via shared memory, using synchronization mechanisms.
  - Designed to perform a specific function.
Accelerator implementations
- Application-specific integrated circuit (ASIC).
- Field-programmable gate array (FPGA).
- Standard component.
  - Example: graphics processor.
System design tasks
- Similar to designing a heterogeneous multiprocessor architecture.
  - Processing element (PE): CPU, accelerator, etc.
- Program the system.
Why accelerators?
- Better cost/performance.
  - Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
  - CPU cost is a non-linear function of performance, so it can be better to split the application across multiple cheaper PEs.
[Figure: CPU cost plotted against performance.]
Why accelerators? (cont'd)
- Better real-time performance.
  - Put time-critical functions on less-loaded processing elements.
  - Remember rate-monotonic scheduling (RMS) utilization: extra CPU cycles must be reserved to meet deadlines.
[Figure: cost vs. performance, marking the deadline and the deadline with RMS overhead.]
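As a reminder of why RMS leaves cycles unused, the classical Liu and Layland bound can be written out. The symbols $C_i$ (execution time of task $i$) and $T_i$ (its period) are notation introduced here for illustration; they do not appear in the slides.

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i} \le n\left(2^{1/n} - 1\right), \qquad \lim_{n \to \infty} n\left(2^{1/n} - 1\right) = \ln 2 \approx 0.693$$

Under this sufficient test, a CPU running many RMS tasks is only guaranteed to meet deadlines up to roughly 69% utilization, which is the reserved headroom the slide refers to.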
Why accelerators? (cont'd)
- Good for processing I/O in real-time.
- May consume less energy.
- May be better at streaming data.
- May not be able to do all the work on even the largest single CPU.
Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator
Accelerated system design
- First, determine that the system really needs to be accelerated.
  - How much faster is the accelerator on the core function?
  - How much data transfer overhead?
- Design the accelerator itself.
- Design CPU interface to accelerator.
Performance analysis
- Critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time.
  - Data transfer time.
  - Synchronization with the master CPU.
Accelerator execution time
- Total accelerator execution time: $t_{accel} = t_{in} + t_x + t_{out}$
  - $t_{in}$: data input; $t_x$: accelerated computation; $t_{out}$: data output.
- $t_{in}$ and $t_{out}$ must reflect the time for bus transactions.
Data input/output times
- Bus transactions include:
  - flushing register/cache values to main memory;
  - time required for CPU to set up transaction;
  - overhead of data transfers by bus packets, handshaking, etc.
Accelerator speedup
- Assume loop is executed n times.
- Compare accelerated system to non-accelerated system:
  $S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})]$
  - $t_{CPU}$: execution time on CPU.
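The speedup expression above can be evaluated with a few lines of C. This is a minimal sketch: the function names and the cycle counts in the usage note are made up for illustration, and times are kept as integer cycle counts to avoid floating-point rounding.

```c
/* Per-iteration accelerator time: t_accel = t_in + t_x + t_out. */
long accel_time(long t_in, long t_x, long t_out)
{
    return t_in + t_x + t_out;
}

/* Speedup as defined on the slide: S = n * (t_CPU - t_accel).
   A negative result means the accelerated system is slower. */
long accel_speedup(long n, long t_cpu, long t_in, long t_x, long t_out)
{
    return n * (t_cpu - accel_time(t_in, t_x, t_out));
}
```

With hypothetical values n = 100, t_CPU = 50, t_in = 5, t_x = 10, and t_out = 5, the accelerator saves 3000 cycles. If t_CPU were only 15, the same transfer overhead would make the result negative, which is why data transfer time has to be counted.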
Single- vs. multi-threaded
- One critical factor is available parallelism:
  - single-threaded/blocking: CPU waits for accelerator;
  - multithreaded/non-blocking: CPU continues to execute along with accelerator.
- To multithread, CPU must have useful work to do.
  - But software must also support multithreading.
Two modes of operation
[Figure: CPU and accelerator timelines for the single-threaded and multi-threaded cases, each showing processes P1 to P4 and accelerator operation A1.]
Execution time analysis
- Single-threaded:
  - Count execution time of all component processes.
- Multi-threaded:
  - Find longest path through execution.
- Sources of parallelism:
  - Overlap I/O and accelerator computation.
    - Perform operations in batches, read in second batch of data while computing on first batch.
  - Find other work to do on the CPU.
    - May reschedule operations to move work after accelerator initiation.
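The two analysis rules above can be compared on a small example. The sketch below assumes a particular dependency structure that the transcript does not spell out: P1 and P2 run first, then accelerator operation A1 overlaps with P3, and P4 runs after both finish. The function names and timings are hypothetical.

```c
/* Single-threaded (blocking): the CPU waits for the accelerator,
   so the execution times of all components simply add up. */
long single_threaded_time(const long times[], int count)
{
    long total = 0;
    for (int k = 0; k < count; k++)
        total += times[k];
    return total;
}

/* Multi-threaded (non-blocking), for the assumed graph
   P1 -> P2 -> {A1 in parallel with P3} -> P4.
   The longest path uses the slower of the two parallel branches. */
long multi_threaded_time(long p1, long p2, long a1, long p3, long p4)
{
    long parallel = (a1 > p3) ? a1 : p3;
    return p1 + p2 + parallel + p4;
}
```

For assumed times P1 = 3, P2 = 2, A1 = 6, P3 = 4, P4 = 1, the single-threaded total is 16 and the multi-threaded longest path is 12; the saving equals the shorter of the two overlapped branches.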
Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator
Accelerator/CPU interface
- Accelerator registers provide control registers for CPU.
- Data registers can be used for small data objects.
- Accelerator may include special-purpose read/write logic.
  - Especially valuable for large data transfers.
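A register-based interface of this kind is often driven from C through a struct overlaid on the device's registers. The sketch below is illustrative only: the register layout, bit assignments, and function names are all hypothetical, since the slides do not define a register map.

```c
#include <stdint.h>

/* Hypothetical register layout; a real accelerator defines its own map. */
typedef struct {
    volatile uint32_t ctrl;     /* control register: bit 0 starts an operation */
    volatile uint32_t status;   /* status register: bit 0 is set when done */
    volatile uint32_t data_in;  /* data register for a small operand */
    volatile uint32_t data_out; /* data register for a small result */
} accel_regs;

#define ACCEL_CTRL_START  0x1u
#define ACCEL_STATUS_DONE 0x1u

/* Write the operand first, then set the start bit. */
void accel_start(accel_regs *regs, uint32_t operand)
{
    regs->data_in = operand;
    regs->ctrl = ACCEL_CTRL_START;
}

/* Non-blocking check of the done bit. */
int accel_done(const accel_regs *regs)
{
    return (regs->status & ACCEL_STATUS_DONE) != 0;
}

/* Read the result; only meaningful once accel_done() returns nonzero. */
uint32_t accel_result(const accel_regs *regs)
{
    return regs->data_out;
}
```

On real hardware the pointer would refer to the device's mapped register range. For scaffolding, an ordinary `accel_regs` variable can stand in for the device so the CPU-side code is exercised without the accelerator core.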
Caching problems
- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data. A problematic sequence:
  - CPU reads location S.
  - Accelerator writes location S.
  - CPU writes location S. BAD
Synchronization
- As with cache, main memory writes to shared memory may cause invalidation:
  - CPU reads S.
  - Accelerator writes S.
  - CPU reads S.
Partitioning
- Divide functional specification into units.
  - Map units onto PEs.
  - Units may become processes.
- Determine proper level of parallelism: a single unit f3(f1(),f2()) vs. separate units f1(), f2(), and f3().
Partitioning methodology
- Divide the control/data flow graph (CDFG) into pieces, shuffle functions between pieces.
- Hierarchically decompose CDFG to identify possible partitions.
Partitioning example
[Figure: CDFG with Block 1, Block 2, Block 3 and conditions cond 1 and cond 2, partitioned into P1 to P5.]
Scheduling and allocation
- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps.
  - Alternatively allocate, then schedule.
Example: scheduling and allocation
[Figure: task graph in which P1 and P2 feed P3 through data transfers d1 and d2; hardware platform with two PEs, M1 and M2.]
Example process execution times

| Process | M1 | M2 |
|---------|----|----|
| P1      | 5  | 5  |
| P2      | 5  | 6  |
| P3      | -  | 5  |
Example communication model
- Assume communication within PE is free.
- Cost of communication from P1 to P3 is d1 = 2; cost of P2->P3 communication is d2 = 4.
First design
- Allocate P1, P2 -> M1; P3 -> M2.
[Figure: schedule timeline (time axis 5 to 20) with rows M1, M2, and network, showing P1, P2, d2, and P3.]
- Time = 19
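The total of 19 for the first design can be reproduced from the execution-time table and the communication costs. The sketch below makes two assumptions that the transcript does not state explicitly: P1 runs before P2 on M1, and the network carries one transfer at a time. The function names are hypothetical.

```c
/* Return the later of two time points. */
long later(long a, long b)
{
    return (a > b) ? a : b;
}

/* Finish time for the first design: P1 and P2 on M1, P3 on M2.
   Arguments are the execution times on the allocated PEs and the
   communication costs d1 (P1 -> P3) and d2 (P2 -> P3). */
long first_design_time(long t_p1, long t_p2, long t_p3, long d1, long d2)
{
    long p1_end = t_p1;                        /* P1 runs first on M1 */
    long p2_end = p1_end + t_p2;               /* P2 follows on the same PE */
    long d1_end = p1_end + d1;                 /* P1's output crosses the network */
    long d2_end = later(p2_end, d1_end) + d2;  /* d2 waits for P2 and a free network */
    long p3_start = later(d1_end, d2_end);     /* P3 needs both inputs */
    return p3_start + t_p3;
}
```

With the example values (5, 5, 5, 2, 4) the function returns 19: P2 ends at 10, d2 occupies the network until 14, and P3 then runs for 5 time units on M2.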
Second design
- Allocate P1 -> M1; P2, P3 -> M2.
[Figure: schedule timeline (time axis 5 to 20) with rows M1, M2, and network, showing P1, P2, d2, and P3.]
- Time = 18
System integration and debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build scaffolding to test the accelerator.
- Hardware/software co-simulation can be useful.
Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator
Concept
- Build accelerator for block motion estimation, one step in video compression.
- Perform two-dimensional correlation:
[Figure: Frame 1 with repeated f2 blocks overlaid.]
Block motion estimation
- MPEG divides frame into 16 x 16 macroblocks for motion estimation.
- Search for best match within a search range.
- Measure similarity with sum-of-absolute-differences (SAD):
  $\sum_{i}\sum_{j} |M(i,j) - S(i - o_x, j - o_y)|$
Best match
- Best match produces motion vector for motion block.
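Full search algorithm

    bestx = 0; besty = 0;
    bestsad = MAXSAD;
    /* try every offset in the search range, end points included
       (2*SEARCHSIZE + 1 offsets per dimension) */
    for (ox = -SEARCHSIZE; ox <= SEARCHSIZE; ox++) {
        for (oy = -SEARCHSIZE; oy <= SEARCHSIZE; oy++) {
            int result = 0;
            /* sum of absolute differences for this offset */
            for (i = 0; i < MBSIZE; i++) {
                for (j = 0; j < MBSIZE; j++) {
                    result += iabs(mb[i][j] - search[i - ox + XCENTER][j - oy + YCENTER]);
                }
            }
            /* remember the lowest SAD seen so far and its offset */
            if (result <= bestsad) {
                bestsad = result;
                bestx = ox;
                besty = oy;
            }
        }
    }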
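For experimentation, the full search can be wrapped in a self-contained function and checked on synthetic data. This is a sketch under stated assumptions: the `motion_vector` struct, the `AREA` constant, the centering values, and the `demo_search` helper are all hypothetical, the standard `abs` replaces `iabs`, and both end points of the search range are included so that 17 offsets are tried per dimension.

```c
#include <limits.h>
#include <stdlib.h>

#define MBSIZE     16                           /* macroblock is 16 x 16 */
#define SEARCHSIZE 8                            /* offsets run from -8 to +8 */
#define AREA       (MBSIZE + 2 * SEARCHSIZE + 1) /* 33 x 33 = 1,089 pixels */
#define XCENTER    SEARCHSIZE                   /* assumed: offset 0 maps to index 8 */
#define YCENTER    SEARCHSIZE

typedef struct { int x, y, sad; } motion_vector;

/* Exhaustive search: compute the SAD at every offset, keep the lowest. */
motion_vector full_search(int mb[MBSIZE][MBSIZE], int search[AREA][AREA])
{
    motion_vector best = { 0, 0, INT_MAX };
    for (int ox = -SEARCHSIZE; ox <= SEARCHSIZE; ox++) {
        for (int oy = -SEARCHSIZE; oy <= SEARCHSIZE; oy++) {
            int result = 0;
            for (int i = 0; i < MBSIZE; i++)
                for (int j = 0; j < MBSIZE; j++)
                    result += abs(mb[i][j]
                                  - search[i - ox + XCENTER][j - oy + YCENTER]);
            if (result <= best.sad) {
                best.sad = result;
                best.x = ox;
                best.y = oy;
            }
        }
    }
    return best;
}

/* Simple pattern with an obvious answer: a macroblock of positive
   values is copied into an all-zero search area at a known offset,
   so exactly one offset gives a SAD of 0. */
motion_vector demo_search(int true_ox, int true_oy)
{
    static int mb[MBSIZE][MBSIZE];
    static int search[AREA][AREA];

    for (int r = 0; r < AREA; r++)
        for (int c = 0; c < AREA; c++)
            search[r][c] = 0;

    for (int i = 0; i < MBSIZE; i++) {
        for (int j = 0; j < MBSIZE; j++) {
            mb[i][j] = 1 + i + j;
            search[i - true_ox + XCENTER][j - true_oy + YCENTER] = mb[i][j];
        }
    }
    return full_search(mb, search);
}
```

Calling `demo_search(3, -5)` should recover the offset (3, -5) with a SAD of 0, since every other offset leaves at least one positive macroblock pixel compared against a zero background pixel.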
Computational requirements
- Let MBSIZE = 16, SEARCHSIZE = 8.
- Search area is 8 + 8 + 1 in each dimension.
- Must perform: $n_{ops} = (16 \times 16) \times (17 \times 17) = 73984$ ops
- CIF format has 352 x 288 pixels -> 22 x 18 macroblocks.
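The figures above can be combined into a per-frame operation count. The sketch below is a back-of-the-envelope calculation; the function names are hypothetical, and it assumes the frame dimensions are exact multiples of the macroblock size, as they are for CIF.

```c
/* Absolute-difference operations for one macroblock:
   (mbsize x mbsize) pixels at each of (2*searchsize + 1)^2 offsets. */
long ops_per_macroblock(int mbsize, int searchsize)
{
    long offsets = 2L * searchsize + 1;
    return (long)mbsize * mbsize * offsets * offsets;
}

/* Number of whole macroblocks in a frame of the given size. */
long macroblocks_per_frame(int width, int height, int mbsize)
{
    return (long)(width / mbsize) * (height / mbsize);
}

/* Total operations to run the full search on every macroblock of a frame. */
long ops_per_frame(int width, int height, int mbsize, int searchsize)
{
    return macroblocks_per_frame(width, height, mbsize)
         * ops_per_macroblock(mbsize, searchsize);
}
```

For a CIF frame this gives 22 x 18 = 396 macroblocks and 396 x 73984 = 29,297,664 operations per frame, before multiplying by the frame rate.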
Accelerator requirements

| Item                 | Requirement                              |
|----------------------|------------------------------------------|
| name                 | block motion estimator                   |
| purpose              | block motion est. in PC                  |
| inputs               | macroblocks, search areas                |
| outputs              | motion vectors                           |
| functions            | compute motion vectors with full search  |
| performance          | as fast as possible                      |
| manufacturing cost   | hundreds of dollars                      |
| power                | from PC power supply                     |
| physical size/weight | PCI card                                 |
Accelerator data types, basic classes
- Motion-vector: x, y : pos
- Macroblock: pixels[] : pixelval
- Search-area: pixels[] : pixelval
- PC: memory[]
- Motion-estimator: compute-mv()
Sequence diagram
[Figure: sequence diagram between :PC and :Motion-estimator, showing a compute-mv() call and memory[] accesses labeled Search area and macroblocks.]
Architectural considerations
- Requires large amount of memory:
  - macroblock has 256 pixels;
  - search area has 1,089 pixels.
- May need external memory (especially if buffering multiple macroblocks/search areas).
Motion estimator organization
[Figure: block diagram with address generator, search area, macroblock, network, ctrl, PE 0 through PE 15, comparator, and motion vector output.]
Pixel schedules

    PE 0              PE 1              PE 2
    |M(0,0)-S(0,0)|
    |M(0,1)-S(0,1)|   |M(0,0)-S(0,1)|
    |M(0,2)-S(0,2)|   |M(0,1)-S(0,2)|   |M(0,0)-S(0,2)|

[Callouts in the slide: M(0,0) and S(0,2).]
System testing
- Testing requires a large amount of data.
- Use simple patterns with obvious answers for initial tests.
- Extract sample data from JPEG pictures for more realistic tests.