a building block computing system for ai...
TRANSCRIPT
A building block computing
system for AI applications
Keio University
Hideharu Amano
Background
• Rapidly Increasing
Requirements of
Embedded Systems
– Smart Phones,
Tablets, Car
electronics, Robotics,
IoTs, etc.
– High Performance,
Low Energy
Consumption, Low
Cost….
• Rapidly Increasing
NRE (Non-Recurring
Engineering) cost of
advanced LSIs.
– Complicated masks
– Design/Verification
• SoC (System-on-
Chip) is difficult to be
implemented for each
target.
Building Block Computing Systems
Building Block Computing
CPU Memory Accelerator1
Memory
CPU
MemoryMemory
CPU
MemoryMemory
Accelerator1Accelerator2
Accelerator2
CPU
Accelerator2Accelerator2Accelerator1
Accelerator1
CPU
1 step: Various Combination can be done after chip-fabrication
2 step: Chips can be replaced by users → Field Stackable
Key Technique: Inductive Coupling Through-Chip Interface(TCI)
Today’s talk
• What is TCI?
• The first prototype: Cube-1
• Intellectual Property of TCI
• Plug-and-play inter-chip network
• A heterogeneous AI system: Cube-2
3D IC technology for going verticalTwo
chips
(face
-to
-fa
ce)
Microbump
Through silicon via
Capacitive coupling
Inductive coupling
Wired Wireless
Scalability
Flexibility
Mor
e t
han
thre
e c
hips
Block diagram of scalable 3D NoC using inductive-coupling ThruChip Interface (TCI).
Rx TxTx RxTx
Network
Interface
TCI
3D Interconnect TCI
Packet En/DecodeNetworkInterface Packet En/Decode
Router
Accelerator Core
Rx TxTx RxTx
CPU Core
Rx TxTx
Network Interface
Data Link
Clock Link
Uplink Downlink
Accelerator 1
Accelerator 2
Host CPU
Header Payload
1323335
35bit Packet Structure
Transfer Type:Command/AddressSingle Packet DataBurst Multi-Packet Data
Benefits of Inductive Coupling TCI
• More than 8Gbps speed is achieved within 10mW power dissipation.
• The error rate is less than 10^-9.– No error correction circuits are required.
• Coils are built by metal wires of common CMOS.– No additional process is required.
• No ESD (Electro-Static Discharge) protection devices are required unlike TSV.
• Stacking cost is much lower than that for TSV.– Just grind chips to 40-80um thickness and stack
them as in a common interposer inside the package.
• First commercial products with TCI will be shipped this year from PEZY.
Problems• The footprint of a coil is much larger than a via
for TSV. – Chip does not scale in the vertical direction.
– Footprint of both coils and TSVs is increased in the advanced technology.
– Digital circuits can be implemented in coils.
→ The real footprint is not larger than that of TSV.
• Power and system clocks are supplied by bonding wires.– Now, the power supply by inductive coupling is tried.
– It may take some time….
• Heat dissipation cannot be done by TCI.– The bonding agent used in the current prototype well
transfers heat.
– But heat dissipation more than 1W chips is a challenge.
Today’s talk
• What is TCI?
• The first prototype: Cube-1
• Intellectual Property of TCI
• Plug-and-play inter-chip network
• A heterogeneous AI system: Cube-2
The first prototype Cube-1Built in 2013. To demonstrate that the TCI can be used in a real system.
Geyser-Cube
CMA-Cube
InductiveCoupling
Ring based packet switching network is formedThe number of accelerators can be
changed.
Microphotograph of stacked test chips.
Host CPU
Accelerator 1
Accelerator 2
Accelerator 3
Host CPU + Accelerator x3 Chip Stack
Fabricated in 65nm CMOS
Host CPU Chip
Accelerator Chip
TEG
Netw
ork
IF
Netw
ork
IF
m-C
on
tro
lle
r
MIPS CPU
Core
8x8 PE Array
TCI Tx
TCI Rx
Rx
Rx
Tx
Tx
TCI
Cube-1 Cube-2
Process Fujitsu e-shuttle 65nm
12metal
Renesas SOTB 65nm
7metal
Area 2.1mm X 4.2mm 3mm X 6mm
TCI 3Gb/s 240um 2.5Gb/s 240um
CPU
Cache
TLB
MIPS R3000
4KW 2way sep.
16-entry shared
MIPS R3000
4KW 2way sep.
16-entry shared
CMA CMA-Cube 8x8 CC-SOTB 12x8
Supply Voltage 0.6-1.2V 0.3-1.2V
Target Frequency 50-100MHz 50-100MHz
Chip Thickness 40-80um 80um
Cube-1 vs. Cube-2
0
100,000
200,000
300,000
400,000
500,000
600,000
Alpha Gray Sepia
Ex
ec
uti
on
Tim
e [
cyc
les
]
0
20
40
60
80
100
120
140
Alpha Gray SepiaAlpha Gray Sepia
Ener
gy (
μ J
)
Exec
uti
on
Tim
e (m
sec)
200
400
600
800
1000
1200
2-chipstack
GeyserOnly
GeyserOnly
2-chipstack
CMA
Geyser
Execution Time Energy Reduction
Evaluation Results from a 2-chip stack system
評価結果• Cube-1 Quad-CoreにおけるJPEGデコーダの実行• 128x96pixelの画像をデコード
0
100000
200000
300000
400000
500000
600000
700000
800000
実行サイクル数
(cyc
le)
other
convert yuv to rgb
store intermediate
inverse dct
inverse quantization
huffman decode
collect result
dma trans
processing
image data trans
Cube-1 Quad-Core No
3.15倍の性能
Evaluation Results
•Using three accelerators.
•The target image block: 128 x 96 pixel
3.15 times speed up
Ex
ecu
tio
n c
ycle
s
Measured PER and TCI power dissipation dependence on supply voltage.
10-9
10-8
10-7
10-6
10-5
10-4
5
6
7
8
9
10
0.8 0.9 1 1.1 1.2 1.3Supply Voltage [V]
TC
I P
ow
er
Dis
sip
ati
on
[m
W/c
h]
Pack
et
Err
or
Rate
(P
ER
)
35bit Burst Packet Transmission @ 50MHz System
Clock
Continuous >1 Hour
Error-Free Operation
@ Nominal Supply Voltage
5.8mW/c
h @
0.92V
Demonstration system.
DisplayOutput
Chip Stack
Mother Board
Daughter Board
FPGA
Cube-1 was a good starting point
Please check Micro
2013.11/12http://www.am.ics.keio.ac.jp/kaken_s
Demonstration in HotChips2014,
Coolchips2015, and FPL2015.
The problems on Cube-1• Not stable with three chips
– Interference between vertical inductors which
must be placed for the ring network.
• OK. It’s an interesting system. But what for?
– Only two chip types were provided.
• For development of various types of chips:
– Easy design.
– Automatic performance tuning.
– Thermal problem on stacking.
– Practical power supply.
– Comprehensive Researches are needed.
What is the next step?
• Intellectual Property of the TCI
– It must be embedded in the LSI design flow.
– Higher level layers are needed.
• A new plug-on network based on the IP
• Automatic performance tuning
– Software approach
– Hardware approach
• Thermal dissipation problems
The new target systems Cube-2 for the AI application
Today’s talk
• What is TCI?
• The first prototype: Cube-1
• Intellectual Property of TCI
• Plug-and-play inter-chip network
• A heterogeneous AI system: Cube-2
TCI-IP Physical layer
Txclk
TXDATA
ClockCounter
35
TXWrite
LA
TC
HTxBusy
MU
X
TX
TX
RX
DE
MU
X
RX
LA
TC
H
S/E BitDetector
Pari
tyC
hec
ker
RXDATA
OverW Detector
RxReady
RxReadOverWrite
35 42 42 44
7
TxSleepRxSleep
RxLatch
StartBit ->3bit
Payload ->35bit
Parity ->2bit
Endbit ->4bitPhysical layer = Inductors + Tx/Rx + SERDES
TCI physical layer
CLK
TxRx
receiver
SERDES
transceiver
SERDES
Data
TxRx
TXWRITE
TXBUSYRXREADY
RXREAD35bit data
35bit data
PERR, OVW,
Flame error
2GHz clk
Can be used for
the both direction
Timing chart
TxWrite
TxData0
TxLATCH0
TxBusy
Txclk
Rxclk
RxLatch
RxLATCH0
RxReady
RxRead
Serial transfer synchronized with Txclk
Fully Depleted Silicon on
Insulator(FD-SOI):
SOTB MOSFET
P-sub
(a) (b)
N-well P-wellSTI
Box layer
Back Gate
Deep n-well
☆Dopant less Structure (low variability)
☆Formed on an ultra thin BOX layer (about 10nm)
High efficiency of leakage reduction with body bias control
Body bias
24
Body Bias Control of SOTB
VBN > VS (Forward body bias : FBB)
VBP < VS
• High speed
• Large leakage current
leakagedelay
Body bias (+:forward -:reverse)[V]
It is important to decide optimal voltages of body bias
VBN < VS (Reverse body bias : RBB)
VBP > VS
• Low leakage current
• Low speed
Delay
[s]Pow
er[W
]
VBN
VBP
25See [Okuhara IEEE TVLSI 2017]
TCI IP
80μm
thickness
36bit/50MHz
with normal
bias.
Including
SER/DES
circuits.
Half-duplex
data transfer
Clock Data
CC SOTB Layout
TCI IP
×4sets
(Up/Down)
Renesas Electronics 65nm SOTB 7Metal layers
6mm X 3mm
Three layers for the TCI• Full duplex bi-directional channel is formed
with two physical IPs.
• The flow control is embedded into the link
controller.
• Various networks can be easily formed.
TCI
transceiverTCI receiver
Link Controller
Router
Physical layout. SERDES is embedded.
Verilog Soft core. 8-Virtual Channels.
Flow control. 17flits packet buffer
Verilog Soft core. Parameterized router.
3-stage pipeline.
Open in the VDEC web site
Today’s talk
• What is TCI?
• The first prototype: Cube-1
• Intellectual Property of TCI
• Plug-and-play inter-chip network
• A heterogeneous AI system: Cube-2
Bonding wire
Bonding wire
Bonding wire
Bonding wire
The ring network
on Cube-1
TX RXTXRX
TX RXTXRX
TX RXTXRX
TX RXTXRX
Netw
ork
IF
m-C
on
tro
lle
r
8x8 PE Array
Rx
Rx
Tx
Tx
TCI
The escalator network• Just by stacking chips with three layers IP, any numbers
of chips can be stacked.
• The simplest 3-port routers are used.
• Vertical coils never interfere with each other.
RXRX
TXTX
Router
TCIRX
RX
TXTX
Router
TCIRX RXTXTX
IP Core
Router
TCI
RX RXTXTX
IP Core
Router
TCI
Chip 0
Chip 1
Chip 2
Chip 3
The spacer chips
Drops
Spacer chips
Ring vs. Escalator
Esc. Esc. (no VC.)
The overhead of flow-control2 or 17-flit packets
Uniform Bit reverse
34The overhead can be ignored→
0
40
80
120
160
0 0.2 0.4 0.6
Avera
ge l
ate
ncy [
cycle
]
Injection rate [flit/node/cycle]
Piggyback
Ideal
0
40
80
120
160
200
0 0.2 0.4 0.6
Avera
ge l
ate
ncy [
cycle
]
Injection rate [flit/node/cycle]
Piggyback
Ideal
Today’s talk
• What is TCI?
• The first prototype: Cube-1
• Intellectual Property of TCI
• Plug-and-play inter-chip network
• A heterogeneous AI system: Cube-2
A heterogeneous multicore
system: Cube-2Image recognition for robotics, cars or other
edge systems.
• More than a single chip-stacking.
– Embedded CPU : GeyserTT
– Shared memory chip: SMTT
• Accelerator chips
– CC-SOTB2
– SNACC
– KVS
• Various combination of chips.
Cube-1 Cube-2
Process Fujitsu e-shuttle 65nm
12metal
Renesas SOTB 65nm
7metal
Area 2.1mm X 4.2mm 3mm X 6mm or
6mm X 6mm
TCI 3Gb/s 240um 2.5Gb/s 240um
Supply Voltage 0.6-1.2V 0.3-1.2V
Target Frequency 50-100MHz 50-100MHz
Chip Thickness 40-80um 80um
Cube-1 vs. Cube-2
Geyser-TT (Twin Tower)
TCI for a single tower
TCI for the twin tower
MIPS R3000 Compatible CPU
with three TCI IPs
Basic CPU
Core
TLB
I Cache
DMAC
D
Cache Router
I/F
Router
TCI
External Bus
Controller
Geyser CPU Core (MIPS
ISA)Geyser Internal Bus
Router
I/F
Router
TCI
Geyser External Bus: Main Memory, I/Os
Geyser CPU
TCI for Twin
Tower
Geyser-TT
TCI
TCI for Single
Tower
KVS
SNACC
CC-SOTB2
GeyserTT
A simple stacking
An escalator network is
formed.
Twin-tower with a single core
KVS
SNACC
GeyserTT
SNACC
CC
KVS
SNACC
Geyser TT is used as
a bridge of two chip
stacks
Shared Memory for
twin-tower (SMTT)• A large memory system is needed in
various configuration.
– A chip with DRAM module is needed.
– From the limitation of money and power
bedget, only 250KB static RAM is provided.
– Ultramemory’s DRAM with TCI will hopefully
take place in the future.
• Multiple hosts in multiple towers can
exchange the data with the shared
memory.
• Synchronization memory is also provided.
SMTT chipTCI for the twin tower
Twin-tower with multi cores
Shared Memory
KVS
SNACC
GeyserTT
SNACC
CC
GeyserTT
Accelerator family
• CC-SOTB2
– Coarse Grained Reconfigurable Array (CGRA)
– Low power image processing
• SNACC
– Convolutional Neural Network Accelerator
• KVS
– Key-value store accelerator
• All chips have the same TCI IP.
CC-SOTB2
12x8 PE array
PE array
with 12x8 PEVariable
Pipeline
structure
12x2
interleaved/banked
memory
Data manipulator
for flexible
load/store
Energy optimization with body
bias control
• The previous presentation by Anh-Vu-
san[McSoC2017]
• FPL2017 etc.
• More than 800MOPS/mW was achieved.
PE row
TCI IP
SNACC: a convolutional neural
network accelerator
TCI IPSIMD
cores
• SIMD cores
• Dedicated instruction set and ALUs
• Yesterday presentation by Sakamoto-
san[McSoC2017]
KVS Chip
• Memory search
• Set/Get operations
[HotInterconnect
2016]
Streaming processing in Cube-2
• Shared memory using cache/DMA transfer
– All memory modules of accelerators are
mapped on the same address space.
– Memory modules of accelerators are treated
as the 2nd-level cache.
– DMA transfer between accelerators is
triggered automatically.
• Streaming processing between chips can
be efficiently done.
Streaming processing between
chipsGeyserTT
CC
SOTB2
SNACC
KVS
Data
Write
Exec.
Data
DMA
Data
DMA
Data
Read
Exec.
Exec.
End
End
Read
Req.
Data
Write
Data
DMA
Execution of accelerators and communication between chips can
be overlapped.
Data
DMA
The current status
• 3-chip stack is now operational.
• Chips are available on this December.
Special thanks to:
• Prof. Kuroda (Keio Univ.)
• Prof. Kondo & Dr. Sakamoto (Univ. of Tokyo)
• Prof. Matsutani (Keio Univ.)
• Prof. Nakamura (Univ. of Tokyo)
• Prof. Namiki (Tokyo Agriculture & Tech.)
• Prof. Usami (Shibaura Inst. Tech.)
This work has been supported by JSPS
Kakenhi S Grant # 25220002
Thank you !
Please visit our web-site:
– http://www.am.ics.keio.ac.jp/kaken_s
For other projects:
– http://www.am.ics.keio.ac.jp