A hardware-software co-design approach with separated verification/synthesis between computation and communication
Masahiro Fujita
VLSI Design and Education Center, The University of Tokyo
2
State-of-the-art SoC design
System on a chip (SoC)
C-based design description down to implementation
[Figure: SoC block diagram. CPU, DSP, HW1, HW2, Mem1, and Mem2 are mapped onto IP cores 1-6, each attached to the interconnect through a bus interface (Bus IF).]
void main() { a = read(); b = read(); c = func(a, b); write(c);}
IP library
3
Design reuse is extremely important in SoC designs
IP (Intellectual Property) core reuse
Existing designs have already been verified
Interfaces may or may not match
[Figure: IP library (CPU, memory, memory controllers, MPEG, analog I/F, power source, buses). Example: for an MPEG video system, select IP with the required functionality.]
4
Need protocol transducers…
[Figure: a system of CPU, MPEG, RAM, custom HW, and DMAC blocks is built from CPU, RAM, and DMAC IP cores on an interconnect (bus). The IP cores use different on-chip bus protocols (Protocol A, Protocol B), so a transducer is inserted between them.]
The IP's functionality is satisfactory, but its interface does not match
Communication on the interconnect is based on different protocols
Solution
Insert a "protocol transducer" for conversions
Protocol transducers should be generated automatically
5
Basic ways of thinking and our proposal
We would like to come up with a methodology for large and complicated system designs
Design reuse is a key
Separation of concerns is essential
Computation and communication (control and datapath) must be clearly separated in some way
What we propose
A new way to design communication protocols
(Special mechanisms for rectification after manufacturing)
[Figure: system hierarchy from mechanical and analog parts through the PCB down to the SoC (CPU, DSP, H/W, memories, SDRAM on a bus); a system contains multiples of these.]
6
Today's topic
Propose design methods for communication interface design with clear separation between computation and communication
How the separation helps design efficiency
[Figure: the SoC block diagram with each block split into a computation part and an interface/communication part; the interface part handles requests (Req.) and responses (Res.).]
7
Our ongoing related research
Propose various rectification methods for computation and communication
Designs can be debugged after manufacturing
Propose different mechanisms for computation and communication
[Figure: an IP core with an in-field programmable bus protocol IF; the original circuit is augmented with programmable elements (LUTs).]
8
Outline
Motivation
Background: state-of-the-art design methodology
C-based design
Proposed method and its application to interface designs for computing elements
Key technology for IP reuse
Separation of concerns: computation and communication (control and datapath)
Application to dynamically reconfigurable computing (if time allows)
[Figure: the protocol transducer example again: IP cores using Protocol A and Protocol B are connected through a transducer.]
9
For improving design productivity
Start the design at higher abstraction levels
C-language-based HW descriptions are 100 to 10,000 times more compact than gate-level descriptions
The number of lines that one designer can describe per day is limited
Reuse of existing designs
So-called IP reuse in LSI designs
The key is to separate computation and communication
[Figure: abstraction levels (high-level C/C++, RTL, gates), with computation separated from interface/communication.]
10
Design issues for large complicated systems
Starting from C/C++ designs/specifications
Extraction of parallelism
Partitioning into HW and SW
Based on profiling: performance-critical parts (mostly loops) are assigned to HW
void main() { a = read(); b = read(); c = func(a, b); write(c);}
[Figure: design flow from the C description to a bus-based system (CPU, DSP, H/W, memories, SDRAM) through compilation, high-level synthesis, and IP reuse from the IP library.]
11
C/C++ based design and specification languages for SoC designs
SystemC and SpecC are the most common
Based on C/C++
Structural hierarchy: behavior, module
Parallelism with event-based synchronization: par, wait, notify
Channels
Others …
Support hardware-software co-design
[Figure: behaviors b1 and b2, each a block of C code, run in parallel under par; one issues notify and the other wait, and they communicate through channels or through shared variables.]
12
Claim in this talk: Separation of concerns
Even inside the interface, control and datapath should be clearly separated
[Figure: separating computation from interface/communication alone is not sufficient! Both parts are further split into control and datapath.]
Separation of computation and communication
A protocol is a collection of sequences
Each sequence can operate independently
[Figure: a protocol consists of a hardware definition (port and signal names, etc.) and sequences 1-4 (read, write, 4-burst read, 4-burst write). Each sequence has two automata: automaton 1 for request or blocking, automaton 2 for response, with transitions such as (stb==1), ack<=1, ack<=0. All sequences share the initial state i.]
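The structure on this slide can be sketched as a small data model. This is only an illustration, not the format used by the actual tool: the class names, the state name "s1", and the handshake transitions are assumptions made for the example; only the signal names stb/ack, the four sequence kinds, and the shared initial state i come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Automaton:
    """Control-only automaton: transitions[state] = [(guard, action, next_state)]."""
    initial: str
    transitions: dict = field(default_factory=dict)

    def add(self, src, guard, action, dst):
        self.transitions.setdefault(src, []).append((guard, action, dst))


@dataclass
class Sequence:
    name: str
    request: Automaton   # automaton 1: request (or the whole blocking sequence)
    response: Automaton  # automaton 2: response


@dataclass
class Protocol:
    ports: list                                   # hardware definition
    sequences: dict = field(default_factory=dict)

    def add_sequence(self, seq):
        # All sequences must share the initial state "i".
        if seq.request.initial != "i" or seq.response.initial != "i":
            raise ValueError("sequence must start in the shared initial state")
        self.sequences[seq.name] = seq


def handshake():
    """Hypothetical two-state handshake, used only to fill the example."""
    a = Automaton("i")
    a.add("i", "stb==1", "ack<=1", "s1")
    a.add("s1", "true", "ack<=0", "i")
    return a


proto = Protocol(ports=["stb", "ack"])
for name in ("read", "write", "burst_read4", "burst_write4"):
    proto.add_sequence(Sequence(name, handshake(), handshake()))
```

Because every sequence is an independent object hanging off the same initial state, each one can later be synthesized on its own.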
13
Goal: automatic generation of “correct” bus interfaces
Communication protocols can be arbitrarily complicated
Blocking, non-blocking, out-of-order, tags, etc.
Deal only with state-of-the-art bus protocols
Specification documents are over 200 pages
Mostly, only subsets of them are actually used
Formally verify the protocol definitions given as automata, and automatically generate interface circuits from them
If necessary, change their functionality in the field
Assuming C-based designs
14
State-of-the-art on-chip bus protocol example: OCP (Open Core Protocol)
Interface protocol proposed by OCP-IP
Configurable interface protocol
Data/address width, burst/out-of-order features, …
In the basic configuration, the interface has 8 signals (including clock and reset)
Full specification documents are over 200 pages
More than 30 different transactions/sequences
[Figure: OCP master and OCP slave. Master-to-slave signals: MCmd, MAddr, MData. Slave-to-master signals: SCmdAccept, SResp, SData.]
15
What a protocol transducer does
Change from protocol A to protocol B
Protocols can be very complicated
Over 30 different commands defined in the protocols
Manuals over 200 pages
Transactions (sequences) such as burst, out-of-order modes, …
Each transaction (sequence) is sent/received one at a time
Protocols can be defined with FSMs/automata
State-of-the-art protocols may need extremely large and complicated FSMs/automata
[Figure: the transducer example again, converting between Protocol A and Protocol B.]
16
Our scenario
Start with C/C++ based descriptions for SoC
Apply control/data flow analysis for computation
Use a protocol converter generator in communication interface synthesis
Convert the descriptions' protocol to the target protocol
Protocols themselves are formally verified with model checkers
[Figure: design flow. Original designs in C/C++ go through HW/SW synthesis (manual/automatic) to generated (scheduled/allocated) HW/SW, with verification through SDG traversal. Protocol extraction yields the protocol in the design; the protocol converter generator combines it with the target protocol from the protocol library to produce a protocol converter in HW/SW. Model checking is applied to the protocol definitions.]
Far fewer states need to be checked than in the actual designs
[1] K. Tanabe, S. Sasaki, and M. Fujita, "Program Slicing for System Level Designs in SpecC", in Proc. of the IASTED, pp. 252-258, Nov. 2004.
[2] S. Sasaki, M. Fujita, et al., FSEN 05.
17
Example: Non-Blocking protocol conversion
MASTER: OCP (Single Read, Non-Posted Write)
SLAVE: OCP (Single Read, Single Write)
[Figure: the master's Single Read and Non-Posted Write requests and responses are converted to and from the slave's Single Read and Single Write requests and responses.]
18
Conversion example
[Figure: generated transducer between master and slave. An FSM for requests and an FSM for responses (both FIFO-ready) are connected through a FIFO (2 bit x 4) with WData, RData, PUSH, RST, and CLK ports. Master-side signals: M_MCmd, M_MAddr, M_MData, M_SCmdAccept, M_SResp, M_SData. Slave-side signals: S_MCmd, S_MAddr, S_MData, S_SCmdAccept, S_SResp, S_SData. A Single Read request becomes a Single Read request and a Non-Posted Write request becomes a Single Write request; Single Read and Non-Posted Write responses are returned to the master.]
19
How protocol transducer is realized
Intuitive understanding of the problem
Follow the two protocols ⇒ compute the product of the two FSMs/automata and follow it
[Figure: the definitions of Protocol A (master) and Protocol B (slave) are FSMs/automata describing clock-wise behavior, e.g. (stb==1), ack<=1, ack<=0. Exploration ([1] + ours) produces the target: a protocol transducer, also an FSM/automaton, that relays requests and responses between the two.]
[1] R. Passerone, J. A. Rowson, A. Sangiovanni-Vincentelli, "Automatic Transducer Synthesis of Interfaces between Incompatible Protocols", DAC'98, pp. 8-13.
20
Simple computation of product
Follow the two automata
Compute the product of the two
Eliminate nodes/paths that violate dependencies
[Figure: Protocol A (master; states A, B, C; ctrl and 8-bit data) issues {Ctrl=0}, {Ctrl=1, D:=data1}, {Ctrl=0, D:=data2}. Protocol B (slave; states D, E, F; ctrl and 8-bit data) expects {Ctrl=0}, {Ctrl=1, Rcvd1:=DO}, {Ctrl=0, Rcvd2:=DO}. On the transducer side the allowed moves are A→(A or B), B→(B or C), D→(D or E), E→F. The product graph contains the state pairs AD, AE, BD, BE, BF, CD, CE, CF. Nodes where data would be sent although not yet received are marked as dependency violations and removed; the minimum-latency path is then taken [1].]
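The product computation with dependency pruning can be sketched in a few lines. This is a simplified illustration of the idea on this slide, not the algorithm of [1] or of the tool: the dictionary encoding, the function name, and the rule that data received in a cycle can only be forwarded from the next cycle on are assumptions made for the example.

```python
from collections import deque

# Hypothetical encoding of the two automata from the slide.
# Master side: state -> (next state, data item the transducer receives on that move)
MASTER = {"A": ("B", "data1"), "B": ("C", "data2")}
# Slave side: state -> (next state, data item the transducer must send on that move)
SLAVE = {"D": ("E", "data1"), "E": ("F", "data2")}


def explore_product(master, slave, start, goal):
    """Breadth-first product exploration that prunes dependency violations.

    Returns (minimum-latency path, set of pruned edges).
    """
    queue = deque([(start, frozenset(), [start])])
    seen = {(start, frozenset())}
    pruned = set()
    while queue:
        (m, s), have, path = queue.popleft()
        if (m, s) == goal:
            return path, pruned
        # Each side may stall (stay put) or take its next protocol step.
        m_moves = [(m, None)] + ([master[m]] if m in master else [])
        s_moves = [(s, None)] + ([slave[s]] if s in slave else [])
        for m2, received in m_moves:
            for s2, needed in s_moves:
                if (m2, s2) == (m, s):
                    continue  # both sides stall: no progress
                if needed is not None and needed not in have:
                    # Data would be sent before it has been received.
                    pruned.add(((m, s), (m2, s2)))
                    continue
                have2 = have | {received} if received else have
                key = ((m2, s2), have2)
                if key not in seen:
                    seen.add(key)
                    queue.append(((m2, s2), have2, path + [(m2, s2)]))
    return None, pruned


path, pruned = explore_product(MASTER, SLAVE, ("A", "D"), ("C", "F"))
```

Under these assumptions the search returns the path AD, BD, CE, CF, and moves such as AD→AE are pruned because the slave would be handed data1 before the master has delivered it.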
21
Need separation between control and datapath
If there is a loop in the automata, the product computation never terminates
Data values are different each time the loop is traversed
[Figure: the same Protocol A / Protocol B example, now with a loop. The product path AD, BD, CE, CF returns to AD, but the data values are different, so the two AD nodes are not the same state and the product needs to be expanded more and more.]
22
Protocols can be very complicated
State-of-the-art protocols introduce many features for higher throughput
[Figure: timing diagrams between protocol master and protocol slave (a request carries address/data, a response carries data). Blocking (low throughput): Req1 → Res1, then Req2 → Res2. Split transaction (non-blocking): Req1, Req2, Req3 are issued while Res1, Res2 are still returning. Out-of-order transaction: requests Req1, Req2, Req3 are answered as Res1, Res3, Res2. Burst transaction: Addr1-Addr4 with Data1-Data4. Single-address burst transaction: Addr1 followed by Data1-Data4.]
23
Problems and solutions
Simple product computation has essential problems for realistic on-chip bus protocols
If there is a loop in control, it does not terminate
If the automata become large, it may not terminate in practice
The protocol must be represented as one automaton
Cannot deal with non-blocking protocols
The above problems come from the lack of separation between control and datapath
Solutions: with separation of computation and communication (control and datapath), the following can be realized
Hiding loops
Protocols are represented hierarchically with automata
24
Separation of communication and computation (data transfer)
Data values are abstracted away
Only the data id is watched in the communication part
Actual data transfer is realized by the computation part
Id matching is guaranteed by agreement between computation and communication
A new request is accepted only after the previous request has been accepted
If necessary, a FIFO (buffer) is inserted to keep not-yet-serviced sequences
Multiple simultaneous responses may arrive before the current response is finished
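A minimal sketch of this split, assuming a bounded FIFO of ids on the communication side and a plain dictionary on the computation side. The class names, the FIFO depth, and the example values are hypothetical and only illustrate that the control part never looks at data values.

```python
from collections import deque


class ControlPart:
    """Communication side: sees only data ids, never data values."""

    def __init__(self, depth):
        self.depth = depth
        self.pending = deque()  # ids of not-yet-serviced requests

    def accept_request(self, data_id):
        if len(self.pending) >= self.depth:
            return False  # FIFO full: the master has to wait
        self.pending.append(data_id)
        return True

    def next_response(self):
        return self.pending.popleft() if self.pending else None


class DataPart:
    """Computation side: moves the actual values, keyed by the same ids."""

    def __init__(self):
        self.store = {}

    def write(self, data_id, value):
        self.store[data_id] = value

    def read(self, data_id):
        return self.store.pop(data_id)


ctrl, data = ControlPart(depth=2), DataPart()
accepted = []
for data_id, value in [(0, 0x11), (1, 0x22), (2, 0x33)]:
    ok = ctrl.accept_request(data_id)
    if ok:
        data.write(data_id, value)
    accepted.append(ok)

served = []
while (data_id := ctrl.next_response()) is not None:
    served.append((data_id, data.read(data_id)))
```

The third request is refused because the FIFO holds only two outstanding ids; the state space of the control part depends on the FIFO depth, not on the data width.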
25
Separation of computation and communication inside protocol transducers
In the protocol definition, control and data are specified separately
Introduce two FSMs, for request and response, to describe complicated protocols uniformly
The FIFO can be made arbitrarily complicated if we like
Even arithmetic computation is possible
[Figure: transducer between a Protocol A master and a Protocol B slave. First as a single FSM handling requests and responses; then split into a request FSM (Req. FSM) and a response FSM (Res. FSM).]
26
Protocols can be very complicated
State-of-the-art protocols introduce many features for higher throughput
[Figure: the same timing diagrams as on slide 22: blocking, split (non-blocking), out-of-order, burst, and single-address burst transactions.]
27
For more complicated protocols…
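[Figure: from the protocol definitions of Protocol A and Protocol B (each with Req. and Res. automata), the transducer between the Protocol A master and the Protocol B slave is built from a request FSM, a send FSM, and a receive FSM around a newly introduced FIFO. Automaton products (X) are taken between the request automata and between the request/response automata and the FIFO write (WR) and read (RD) control automata.]
Pros: can deal with more complicated protocols
Cons: more latency due to multiple FIFOs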
28
Now we can resolve it
Elimination of loops (to initial states)
[Figure: two automata with initial state i (states A, B and C, D). An ending state e is introduced in place of the return to i, exploration is performed, and the ending states are then eliminated.]
Elimination of intermediate loops
[Figure: an automaton with states X, Y, Z, U, W containing an intermediate loop; the loop is replaced with a super state before exploration.]
SS = loops are replaced with super states
Concentrating on control only
Data parts are processed separately! [2]
[2] S. Watanabe, K. Seto, Y. Ishikawa, S. Komatsu, M. Fujita, "Protocol Transducer Synthesis using Divide and Conquer Approach," Proc. of the 12th Asia and South Pacific Design Automation Conference, pp. 280-285, 2007.
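The first step (loops back to the initial state) can be illustrated as a graph rewrite. This is a sketch under simplifying assumptions, not the procedure of [2]: automata are plain adjacency dictionaries without guards, and the function names are made up for the example.

```python
def introduce_ending_state(transitions, initial="i", ending="e"):
    """Redirect every edge that returns to the initial state to an ending state."""
    opened = {
        src: [ending if dst == initial else dst for dst in dsts]
        for src, dsts in transitions.items()
    }
    opened.setdefault(ending, [])
    return opened


def eliminate_ending_state(transitions, initial="i", ending="e"):
    """After exploration, fold the ending state back into the initial state."""
    return {
        src: [initial if dst == ending else dst for dst in dsts]
        for src, dsts in transitions.items()
        if src != ending
    }


def has_cycle(transitions, start):
    """Depth-first search for a cycle reachable from start."""
    on_path, done = set(), set()

    def visit(state):
        if state in on_path:
            return True
        if state in done:
            return False
        on_path.add(state)
        found = any(visit(nxt) for nxt in transitions.get(state, []))
        on_path.discard(state)
        done.add(state)
        return found

    return visit(start)


sequence = {"i": ["A"], "A": ["B"], "B": ["i"]}
opened = introduce_ending_state(sequence)
restored = eliminate_ending_state(opened)
```

With the loop opened, the automaton is acyclic and a product exploration over it is guaranteed to finish; folding e back into i afterwards restores the original behavior.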
29
How to deal with multiple complicated transactions
A protocol is a collection of sequences
Each sequence can operate independently
True for state-of-the-art protocols with separation between computation and communication
[Figure: a protocol consists of a hardware definition (port and signal names, etc.) and sequences 1-4 (read, write, 4-burst read, 4-burst write). Each sequence has two automata: automaton 1 for request or blocking, automaton 2 for response, with transitions such as (stb==1), ack<=1, ack<=0. All sequences share the initial state i. [2]]
30
Hierarchical synthesis owing to comp. and comm. separation
[Figure: Protocol A (sequences A1, A2) and Protocol B (sequences B1, B2). Graph exploration over pairs of sequences yields partial transducer 1 and partial transducer 2, each starting from the initial state i; these are added together into the transducer.]
Merge the generated FSMs that have the same initial state
Sequence-level synthesis followed by a merge process [2]
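The merge step can be sketched as a union of transition tables that keeps only the initial state in common. This is an illustration, not the implementation of [2]: the guard strings, state names, and the `t<index>_` renaming scheme are assumptions.

```python
def merge_partial_transducers(partials, initial="i"):
    """Merge FSMs that share the initial state into one FSM.

    Each FSM maps state -> [(guard, next_state)]. Non-initial states are
    renamed per partial transducer so that they cannot collide.
    """
    def rename(index, state):
        return state if state == initial else f"t{index}_{state}"

    merged = {initial: []}
    for index, fsm in enumerate(partials):
        for src, edges in fsm.items():
            target = merged.setdefault(rename(index, src), [])
            target.extend((guard, rename(index, dst)) for guard, dst in edges)
    return merged


# Two hypothetical partial transducers, one per sequence.
read_part = {"i": [("cmd==RD", "s1")], "s1": [("accept", "i")]}
write_part = {"i": [("cmd==WR", "s1")], "s1": [("accept", "i")]}
merged = merge_partial_transducers([read_part, write_part])
```

The merged FSM branches at i on the command and then follows exactly one partial transducer until it returns to i, which is why each sequence can be synthesized separately.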
31
The protocol transducer synthesis (1)
Transducers including blocking protocols
[Figure: for a blocking protocol, automaton-level synthesis produces the transducer FSM between the Protocol A master and the Protocol B slave directly from the automata starting at i. For a non-blocking (out-of-order) protocol, the request and response automata are first composed, together with the automata for other sequences, to generate a blocking automaton.]
32
The protocol transducer synthesis (2)
Out-of-order to out-of-order → out-of-order processing
Tags are sent out as they are
Non-blocking and out-of-order → in-order processing
Transducer generates tags and reorders responses
[Figure: automaton-level synthesis is applied separately to the request automata and to the response automata of Protocol A and Protocol B, giving the request FSM and the response FSM of the protocol transducer between master and slave.]
The FIFO memorizes sequences whose responses have not yet been received
33
The protocol transducer synthesis (3)
In state-of-the-art on-chip bus protocols:
All masters have waiting mechanisms for requests
Some slaves do not have waiting mechanisms for responses
Ex: OCP
OCP request (read sequence): waits with the SCmdAccept signal
OCP response (read sequence): finishes in exactly one cycle (no waiting mechanism)
Restrictions on protocol definitions for automaton-level synthesis:
Master does not have waiting mechanisms but slave has (request)
Slave does not have waiting mechanisms but master has (response)
The next transaction may start before the transducer returns to its initial state → some requests/responses may not be processed
34
The protocol transducer synthesis (4)
Responses are guaranteed to be processed with a FIFO
[Figure: from the protocol definitions of Protocol A and Protocol B (each with Req. and Res. automata), the transducer between the Protocol A master and the Protocol B slave is built from a request FSM, a send FSM, and a receive FSM around a newly introduced FIFO. The request FSM comes from automaton-level synthesis; the send and receive FSMs come from automaton-level synthesis with the FIFO read/write control automata (no waiting).]
Pros: can deal with more complicated protocols
Cons: more latency due to multiple FIFOs
35
Tool implementation
Planned to be distributed freely from OCP-IP
Currently under evaluation at Toshiba
36
Experimental results
Athlon 64 2 GHz + 1 GB RAM
Implemented in over 12,000 lines of C++
Input: hierarchical automaton descriptions in XML
Output: synthesizable RTL Verilog
Logic synthesis: Xilinx ISE
RTL simulator: ModelSim XE

Master's protocol   Slave's protocol   Type         Sequences   Synth. time   Gate count
OCP                 AHB                (NB, BK)     4           1.1 s         2,352
AHB                 OCP                (BK, NB)     4           1.3 s         1,843
OCP                 OCP                (NB, NB)     2           1.9 s         1,568
OCP                 Tagged OCP         (NB, OoO)    2           2.2 s         3,514
Tagged OCP          AXI                (OoO, OoO)   2           4.8 s         1,377
AXI                 OCP                (OoO, NB)    2           4.9 s         1,731
OCP                 AXI                (NB, OoO)    26          257.8 s       61,205

(NB: non-blocking, BK: blocking, OoO: out-of-order)
No one has ever synthesized these before!
37
Rectification after manufacturing
Transducer FSMs can be implemented with programmable devices
This makes run-time change of protocols possible
[Figure: processing element 1 (Protocol A) and processing element 2 (Protocol B) are connected through a programmable transducer holding one FSM per sequence (A, B, C), default output values, and a register file. The protocol definition again consists of the hardware definition (port and signal names, etc.) and sequences 1-4 (read, write, 4-burst read, 4-burst write).]
38
Conclusion
The following have been shown through an example, protocol transducer synthesis:
Separation of concerns is essential
Hierarchical definition of protocols
Complete separation between computation and communication
State-of-the-art protocols can be processed efficiently
Even formal verification becomes possible
Rectification after manufacturing can be handled
39
Future issues
Bit-width conversion
Ex: 16-bit write → two 8-bit writes
Need ways to compose multiple sequences
[Figure: composing the 8-bit write sequence with itself gives the sequence Write8bit*2, which is then matched against the 16-bit write sequence with the existing methods.]
Dynamic change of transfer times in burst mode
Use super states
[Figure: separate the data transfer part into a super state (SS), determine the loop count, and repeat the super state by the loop count.]
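Both future issues can be pictured with a small data-level sketch. It is not a synthesis method: the function names are made up, and the low-byte-first ordering and byte addressing are assumptions chosen for the example.

```python
def split_write16(addr, data16):
    """Express one 16-bit write as two 8-bit writes (low byte first, assumed)."""
    if not 0 <= data16 <= 0xFFFF:
        raise ValueError("data16 must fit in 16 bits")
    low = data16 & 0xFF
    high = (data16 >> 8) & 0xFF
    return [(addr, low), (addr + 1, high)]


def burst_write16(addr, values):
    """Repeat the same transfer step once per element.

    The loop body plays the role of the super state; len(values) is the
    loop count, known only at run time.
    """
    writes = []
    for offset, value in enumerate(values):
        writes.extend(split_write16(addr + 2 * offset, value))
    return writes


single = split_write16(0x10, 0xABCD)
burst = burst_write16(0x00, [0x0102, 0x0304])
```

The control structure is the same for every loop count; only the number of repetitions of the super state changes.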
Application to dynamically reconfigurable processors/protocol transducers
41
Hardware OS
Portions to be reconfigured in dynamically reconfigurable architectures
Load and unload functional blocks dynamically
Schedule functional blocks dynamically
Communicate among functional blocks
Hardware OS (operating system)
Self (partial) reconfiguration on FPGA
Load and unload circuit blocks (hardware tasks)
Just like "processes" in multitasking software
Provide ways to communicate among hardware tasks
42
Example of hardware OS
Walder et al.
Task slot: rectangular areas to load hardware tasks
Interconnect: shared bus for communications
OS module: scheduling and loading hardware tasks
[Figure: a dynamically reconfigurable device with an OS module and four task slots on an interconnect; a circuit block is loaded into a task slot as a hardware task.]
Herbert Walder, Marco Platzner, "Reconfigurable Hardware Operating Systems: From Design Concepts to Realizations," Proceedings of ERSA'03, pp. 284-287, 2003.
43
Interconnect
Various topologies have been proposed
They assume that all functional blocks use the same protocol
This is not true in general, so protocol transducers are needed
[Figure: functional blocks (FB) connected by a) a shared bus, b) a mesh network of switch boxes (SWBOX), c) a tree network of switches (S).]
44
How to build protocol transducers
Proposal: dynamically reconfigurable protocol transducers
Optimizing protocol transducers
A universal protocol transducer for ({A, D} ⇔ {B, C}) is simply too complicated and consumes too many hardware resources
Load minimum protocol transducers dynamically
Save hardware resources
[Figure: IP1 (Protocol A) talks to IP2 (Protocol B) through an A-to-B transducer; after reconfiguration IP1 talks to IP3 (Protocol C) through an A-to-C transducer; after another reconfiguration IP4 (Protocol D) talks to IP3 through a D-to-C transducer.]
45
Basic idea: Use our protocol transducer synthesis method
Selecting partial protocol transducers dynamically
[Figure: in the design phase, our synthesis method generates a library of partial transducers 1-5. At run time, the partial transducers needed for a connection are selected from the library, e.g. partial transducers 1 and 3 for A to B between IP1 (Protocol A) and IP2 (Protocol B), or partial transducers 2 and 4 for A to C. In the case of static architectures, the selected partial transducers are composed into one protocol transducer.]
46
Architecture of dynamically reconfigurable protocol transducers
Place functional blocks and partial protocol transducers in task slots
Like a hardware OS
Partial protocol transducers are dynamically loaded and unloaded
[Figure: task slots on a shared bus holding functional blocks and partial transducers.]
Direct communication when protocols match
Through a protocol transducer when protocols do not match
Dynamically loaded when necessary
47
Dynamic reconfiguration of protocol transducers
When loading functional blocks, their partial protocol transducers are also loaded
Unload partial protocol transducers that are no longer in use
[Figure: for the required conversions (A→C: Write, A→C: Read), the functional block library and the partial transducer library are searched; the functional block and the matching partial transducer (e.g. A→C: Read) are loaded and placed into task slots.]
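The load/unload policy can be sketched as reference counting over a library of partial transducers. This is a software illustration only: the class name, the (source protocol, target protocol, sequence) key, and the slot limit are assumptions, and real loading would involve partial reconfiguration of the device.

```python
class TransducerManager:
    """Loads partial transducers on demand and unloads them when unused."""

    def __init__(self, library, slots):
        self.library = library  # available (src_protocol, dst_protocol, sequence) keys
        self.slots = slots      # number of task slots usable for transducers
        self.loaded = {}        # key -> number of connections using it

    def connect(self, src, dst, sequences):
        """Return the partial transducers that had to be newly loaded."""
        if src == dst:
            return []  # protocols match: direct communication
        needed = [(src, dst, seq) for seq in sequences]
        missing = [key for key in needed if key not in self.library]
        if missing:
            raise KeyError(f"not in library: {missing}")
        new = [key for key in needed if key not in self.loaded]
        if len(self.loaded) + len(new) > self.slots:
            raise RuntimeError("no free task slot")
        for key in needed:
            self.loaded[key] = self.loaded.get(key, 0) + 1
        return new

    def disconnect(self, src, dst, sequences):
        """Return the partial transducers that became unused and were unloaded."""
        unloaded = []
        for seq in sequences:
            key = (src, dst, seq)
            self.loaded[key] -= 1
            if self.loaded[key] == 0:
                del self.loaded[key]
                unloaded.append(key)
        return unloaded


library = {("A", "C", "Read"), ("A", "C", "Write"), ("A", "B", "Read")}
manager = TransducerManager(library, slots=2)
step1 = manager.connect("A", "C", ["Read"])
step2 = manager.connect("A", "A", ["Read"])
step3 = manager.connect("A", "C", ["Read", "Write"])
step4 = manager.disconnect("A", "C", ["Read", "Write"])
step5 = manager.disconnect("A", "C", ["Read"])
```

Only the sequences actually required by the loaded functional blocks occupy task slots, which is the resource saving the slides argue for.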