slides 115 of hello
-
8/11/2019 Slides 115 of hello
University of Michigan, Electrical Engineering and Computer Science
Application-Specific Processing on a
General Purpose Core via Transparent
Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park,
Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
-
A Case for Customization
General-purpose processors handle many applications fairly well, but
Each application has different requirements
Need for efficient execution
Impressive design wins through customization
Performance, power, area
Up to 3.5x speedup [Hot Chips 16]
-
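Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of primitive operations (XOR, MPY, LD, SHR, AND, MOV) shown next to a version in which part of the graph is collapsed into a single CUSTOM instruction.]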
-
Traditional vs. Transparent Customization
Traditional: high non-recurring engineering (NRE) costs
Transparent: universal accelerator, no ISA change
[Figure: groups of CPUs labeled Traditional and Transparent; the diagram includes a Compute Accelerator (CCA).]
-
Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
Exploits subgraph parallelism
Allows natural data propagation
[Figure: rows of function units (FU) fed by inputs IN 1 and IN 2, and a pipeline diagram with Fetch, Issue, two ALUs, the CCA, and WB.]
-
CCA Shape: 164.gzip
[Figure: subgraph shape for 164.gzip; the nodes are Or, And, and Mov operations.]
-
CCA Shape: Blowfish
[Figure: subgraph shape for Blowfish; the nodes are And, Xor, Add, and Mov operations.]
-
CCA Utilization
Dynamic % of subgraphs using each FU (rows 1-8 by columns 1-7 of the FU array)
Row      1      2      3      4      5      6      7
1      100   59.0   22.9   13.1    6.5    4.2    0.3
2     91.1   50.6    9.9    4.1    0.6    0.2    0.0
3     57.4   17.8    6.3    2.9    0.1    0.0    0.0
4     18.5    8.3    1.6    0.1    0.0    0.0    0.0
5      8.7    2.1    0.1    0.0    0.0    0.0    0.0
6      2.1    1.2    0.1    0.0    0.0    0.0    0.0
7      1.2    0.1    0.1    0.0    0.0    0.0    0.0
8      0.1    0.1    0.0    0.0    0.0    0.0    0.0
-
CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories: logicals, adds
Subgraphs rarely have more than 3 dependent adds
Opcode     %
Add        28.7
And        12.5
Move       11.7
Sext       10.4
Lshift     9.8
Or         8.7
Xor        5.1
Sub        4.8
Rshift     2.4
Compare    0.4
-
Proposed CCA Design
4 inputs/2 outputs
Two FU types
Arith/logic
Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA block diagram with inputs I1 to I4 and outputs O1 and O2.]
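The input, output, and depth limits above can be read as a feasibility test on a candidate subgraph. The following is a minimal sketch of such a test, not the authors' selection algorithm. The `Op` record, the function names, and the default depth limit of 7 are assumptions made for illustration; registers are plain strings and immediates are left out of the source list.

```python
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    """One operation of a candidate subgraph (hypothetical record type)."""
    opcode: str
    dest: str
    srcs: Tuple[str, ...]  # source registers only; immediates are omitted


def live_ins(ops):
    """Registers the subgraph reads before any op inside it writes them."""
    defined, ins = set(), []
    for op in ops:
        for src in op.srcs:
            if src not in defined and src not in ins:
                ins.append(src)
        defined.add(op.dest)
    return ins


def depth(ops):
    """Length of the longest chain of dependent ops (rows needed)."""
    level = {}
    for op in ops:
        level[op.dest] = 1 + max((level.get(s, 0) for s in op.srcs), default=0)
    return max(level.values(), default=0)


def fits(ops, live_out, max_in=4, max_out=2, max_depth=7):
    """True if the subgraph respects the input, output, and depth limits.

    live_out is the set of registers still needed after the subgraph;
    max_depth=7 is an assumed default, not a value fixed by this slide.
    """
    outs = [op.dest for op in ops if op.dest in live_out]
    return (len(live_ins(ops)) <= max_in
            and len(outs) <= max_out
            and depth(ops) <= max_depth)


# The ADD/XOR chain from the deck's selection example.
example = [
    Op("ADD", "r4", ("r1",)),
    Op("XOR", "r5", ("r4", "r2")),
    Op("ADD", "r6", ("r5", "r3")),
    Op("XOR", "r7", ("r6", "r8")),
]
```

Calling `fits(example, {"r7"})` checks the chain under the assumption that only `r7` is used afterwards; which registers are actually live after the subgraph is not stated in the slides.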
-
Synthesis of CCA
Synopsys design tools, 130nm library
Depth  Configuration         Control (bits)  Delay (ns)  Cell area (mm^2)  Subgraphs supported
7      6A-4L-4A-3L-2A-2L-1L  245             5.62        0.48              99.3%
6      6A-4L-4A-3L-2A-1L     229             4.56        0.45              95.1%
5      6A-4L-4A-2L-1L        197             3.50        0.40              87.6%
4      6A-4L-3A-2L           172             3.19        0.38              81.8%
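The configuration strings in the table can be unpacked mechanically. The sketch below assumes that each dash-separated token is one row of the array, that the number is the count of function units in that row, and that `A` and `L` stand for the arith/logic and logic FU types; the slides do not spell this encoding out, so treat it as an interpretation. The function names are hypothetical.

```python
def parse_config(config):
    """Split a string such as '6A-4L-3A-2L' into (count, kind) rows."""
    rows = []
    for token in config.split("-"):
        count, kind = int(token[:-1]), token[-1]
        if kind not in ("A", "L"):
            raise ValueError(f"unknown FU kind in token {token!r}")
        rows.append((count, kind))
    return rows


def summarize(config):
    """Depth and per-type FU totals under the assumed encoding."""
    rows = parse_config(config)
    return {
        "depth": len(rows),
        "arith_logic_fus": sum(n for n, kind in rows if kind == "A"),
        "logic_fus": sum(n for n, kind in rows if kind == "L"),
    }
```

Under this reading, the number of tokens in each configuration matches the Depth column of the table, which is some evidence that the interpretation of rows is right.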
-
CCA Utilization
Two axes: Selection (static or dynamic) and Realization (static or dynamic)
Dynamic selection, dynamic realization
+ No ISA change
+ No recompile
- Simple selection
- Hardware complexity
Static selection, dynamic realization
+ Powerful selection
+ Simple hardware
- Some ISA change
- Recompile necessary
Static selection, static realization (ASIPs)
- ISA change
- High NRE
-
Dynamic Selection, Dynamic Realization
Detect and replace subgraphs in the fill unit of the trace cache
[Figure: pipeline diagram with blocks I-Cache, Trace Cache, Decode, Execute, Retire, Trace Construction, and Subgraph Selection and Insertion.]
Original instruction sequence:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR
After subgraph replacement:
LSR r2, r2, #4
LD r3
CUSTOM
SHR
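The replacement step can be illustrated in software. The sketch below takes a trace and the indices of the selected instructions, checks that sliding them down to the position of the last selected instruction breaks no register dependence, and emits one CUSTOM in their place. It is a simplified model, not the fill-unit hardware. The parsing convention (first register operand is the destination, `#` marks an immediate) and the function names are assumptions.

```python
import re


def parse(text):
    """Return (opcode, dest, srcs); assumes the first register is the dest."""
    parts = re.split(r"[,\s]+", text.strip())
    regs = [p for p in parts[1:] if not p.startswith("#")]
    return parts[0], (regs[0] if regs else None), regs[1:]


def can_collapse(trace, selected):
    """Check that selected ops can sink to the last selected position."""
    last = max(selected)
    for i in selected:
        _, s_dest, s_srcs = parse(trace[i])
        for j in range(i + 1, last):
            if j in selected:
                continue
            _, n_dest, n_srcs = parse(trace[j])
            if s_dest in n_srcs:
                return False  # a skipped op needs the selected op's result
            if n_dest is not None and (n_dest in s_srcs or n_dest == s_dest):
                return False  # a skipped op overwrites a register in use
    return True


def collapse(trace, selected):
    """Replace the selected instructions with a single CUSTOM."""
    if not can_collapse(trace, selected):
        raise ValueError("selection cannot be collapsed safely")
    last = max(selected)
    out = []
    for i, ins in enumerate(trace):
        if i == last:
            out.append("CUSTOM")
        elif i not in selected:
            out.append(ins)
    return out


trace = [
    "ADD r4, r1, #1", "LSR r2, r2, #4", "XOR r5, r4, r2", "LD r3",
    "ADD r6, r5, r3", "XOR r7, r6, r8", "SHR",
]
result = collapse(trace, {0, 2, 4, 5})
```

With the two ADD and two XOR instructions selected, `result` holds the same four-instruction sequence shown on the slide. How the hardware chooses which instructions to select is not modeled here.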
-
Simulation
SimpleScalar
ARM instruction set
4-wide execution, 1 compute accelerator
128 RUU entries
32k inst. trace cache, 256 inst. traces
5000 cycle selection/insert latency
L1 I-cache: 32k, 2 way, 2 cycle hit
L1 D-cache: 32k, 4 way, 2 cycle hit
-
Varying CCA Latency
[Figure: bar chart of speedup (y-axis, 1.00 to 1.45) for SPECint, MediaBench, and Encryption benchmarks at CCA latencies of 6, 4, 2, and 1.]
-
Static Selection, Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
Control bits stored in a table and inserted at decode
[Figure: pipeline diagram with blocks I-Cache, Decode, Execute, and Retire, plus a Control Table.]
Original instruction sequence:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR
After compiler selection:
LSR r2, r2, #4
LD r3
CCA_Start #2
ADD r4, r1, #1
XOR r5, r4, r2
ADD r6, r5, r3
XOR r7, r6, r8
CCA_End
SHR
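A rough software model of the decode step may help here. The sketch assumes that the operand of `CCA_Start` (the `#2` in the example) is an index into the control table and that everything up to the matching `CCA_End` is replaced by one CUSTOM carrying the table entry; the slides do not define the operand, so this reading is a guess. The function name and the table contents are hypothetical.

```python
def decode(stream, control_table):
    """Collapse each CCA_Start ... CCA_End region into one CUSTOM op.

    Returns a list of (instruction, control_bits) pairs; control_bits is
    None for ordinary instructions. The meaning of the CCA_Start operand
    as a table index is an assumption.
    """
    out, i = [], 0
    while i < len(stream):
        ins = stream[i]
        if ins.startswith("CCA_Start"):
            index = int(ins.split("#")[1])
            end = stream.index("CCA_End", i)  # ValueError if unmatched
            out.append(("CUSTOM", control_table[index]))
            i = end + 1
        else:
            out.append((ins, None))
            i += 1
    return out


stream = [
    "LSR r2, r2, #4", "LD r3", "CCA_Start #2",
    "ADD r4, r1, #1", "XOR r5, r4, r2", "ADD r6, r5, r3", "XOR r7, r6, r8",
    "CCA_End", "SHR",
]
# Placeholder control bits; real entries would come from the compiler.
decoded = decode(stream, {2: 0b1011})
```

The instructions between the markers stay in the binary, which is consistent with the slide's note that only some ISA change is needed: a core that treats the markers as no-ops could still run the original instructions.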
-
Dynamic vs. Static Selection
[Figure: bar chart of speedup (y-axis, 1.00 to 1.45) comparing Dynamic Selection and Static Selection for SPECint, MediaBench, and Encryption benchmarks.]
-
Summary
Transparent instruction set customization
Benefits of customization without changing ISA
Presented design of a compute accelerator
Handles the majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
Table-based static selection, dynamic realization
Trace cache based dynamic selection, dynamic realization