A Reconfigurable Signal Processing IC with embedded
FPGA and Multi-Port Flash Memory
M. Borgatti, L. Calì, G. De Sandre,
B. Forêt, D. Iezzi, F. Lertora, G. Muzzi,
M. Pasotti, M. Poles, P.L. Rolandi
STMicroelectronics - Central R&D - Italy

Outline of Presentation
• Project motivation and background
• System architecture
  – Reconfigurable core
  – Memory subsystem
• System performance
  – Application example: embedded face recognition system
• Energy efficiency, measurements
• SoC integration and design flow
  – System 2 RTL and RTL 2 Layout
• Summary

Project motivation and background
• Conflicting industry trends
  – Economics of system integration
    • Even more complex SoC
    • More integration
    • Cost effectiveness and performance (per unit)
  – Increasing design complexity and risks
  – Increasing NREs
  – Shorter time-to-market and product life
• Strong need for:
  – Faster project turnaround
  – Lower risk
• Usage of re-configurable silicon fabrics

Project motivation and background
• Pragmatic approach proposed:
  – Reconfigurable architecture
  – Joins a statically extensible processor with e-FPGA
  – Tight connection to Flash memory subsystem
  – Open architecture with flexible programmable I/O
• Programmable platform approach
  – Simple model for programmers

Programmable Platform Approach
[Figure: programmable platform flow. Labels: System Applications Family; System Application; Silicon process + Enabling technologies; Platform Compilation; Config. Proc + e-FPGA; Application Compilation; Programmable platform]

System Architecture
[Figure: system block diagram. Labels: Extensible MPU; Inst. Ext. I/F; 8KB I$; 8KB D$; bus bridge; e-FPGA; General Purpose I/O Lines; I2C BUS; M/S AHB I/F; INTs; DMA & FPGA Prog. I/F; Buffer I/F; GP I/O; 64 bit APB BUS; 1kB Buffer; AHB/APB Bridge; 64 bit AHB BUS; I2C Master; I/O registers; 48 kB SRAM; Flash Mem with FP, CP and DP ports]

e-FPGA Purposes
• Processor ISA extensions
  – Simplest programmer’s model
  – Specific interface to the MPU datapath
  – Impact on processor performance
  – Impact on processor energy efficiency
  – Efficiency limited by instruction stream decoding
• Bus-mapped co-processor
  – Maximum benefits in speed/power
• Flexible I/O

e-FPGA – Microprocessor interface
[Figure: e-FPGA to microprocessor interface. Labels: Clock Ctrl; Other FPGA Purposes; Instruction extension; Pipe Control; Decode; Register File; Instruction; Result; Microprocessor clock; e-FPGA Clock]
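
Flash Memory Architecture
[Figure: flash memory block diagram. Labels: DP (Data Port), CP (Code Port), FP (FPGA Port); 8-bit uP; uP I/F; PMA; DFT; Power Block; four 2Mb modules (#0 to #3); 128-bit Memory Sub-System Crossbar; module-side widths 128, 128, 128, 128; port-side widths 64, 64, 32]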

Flash Memory Subsystem
• Modular approach
  – Customizable array of N independent 2Mb modules
• 3 content-specific ports (CP, DP, FP)
• HW support for filesystem implem. (DP)
  – Defrag
  – Compression
  – Virtual erase
• 2Mb Module features:
  – 128b I/O
  – 40ns access time (400MB/s peak throughput)
  – Power management and arbitration
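
The 400MB/s peak figure for a 2Mb module follows from the 128b word and the 40ns access time quoted above. A minimal sketch of that arithmetic, assuming one full word is delivered per access time (the function name is illustrative, not from the slides):

```python
def peak_throughput_mb_s(word_bits: int, access_time_ns: float) -> float:
    """Peak read throughput in MB/s, assuming one word per access time."""
    bytes_per_access = word_bits / 8          # 128 bits -> 16 bytes
    accesses_per_second = 1e9 / access_time_ns
    return bytes_per_access * accesses_per_second / 1e6


# Figures taken from the slide: 128b I/O, 40ns access time.
module_peak = peak_throughput_mb_s(word_bits=128, access_time_ns=40)
print(f"Module peak throughput: {module_peak:.0f} MB/s")
```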

System Memory Hierarchy
[Figure: memory hierarchy diagram. Labels: 32-bit uP Register File; 64-bit AHB Bus; AHB Bridge; DMA; 512-B Buffer; 64-bit CP I/F; 64-bit DP I/F; 32-bit FPGA P I/F; 2 x 64- + 1 x 32-bit Memory Port I/Fs; 64-bit Port CP; 64-bit Port DP; 32-bit Port FP; 6x4 128-bit Crossbar; 4 x Flash Memory Controller Logic; 4 x 16384 x 128-bit Memory Module]
• AHB Peak Throughput:
  – 800MB/s
• e-FPGA
  – 400MB/s
  – (50MB/s sustained)
• Total Aggregate Peak
  – 1.2GB/s
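
The peak figures above are consistent with one transfer per clock on each path at 100 MHz. The slides do not state this derivation, so the clock rate and the one-transfer-per-cycle behaviour in the sketch below are assumptions, and the names are illustrative:

```python
CLOCK_MHZ = 100  # assumption: the 100 MHz clock used elsewhere in the deck


def port_peak_mb_s(width_bits: int, clock_mhz: int = CLOCK_MHZ) -> int:
    """Peak MB/s for a port moving one full-width word per clock cycle."""
    return (width_bits // 8) * clock_mhz


ahb_peak = port_peak_mb_s(64)      # 64-bit AHB bus
efpga_peak = port_peak_mb_s(32)    # 32-bit e-FPGA port
aggregate_peak = ahb_peak + efpga_peak

print(ahb_peak, efpga_peak, aggregate_peak)  # values in MB/s
```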

Application Ex.: Face Recognition
• Target application:
  – Recognize a face out of twenty
  – Low-resolution images from CMOS cameras
• Potential applications:
  – Low cost smart toys
  – Advanced human-machine interfaces
  – Color CMOS camera processors
• Image preprocessing: Bayer filter
• Face location: based on Hough transform
• Face recognition: Line-Based
• Recognition rates over 90 %
• Scale-invariant
• Tolerant to changes in illumination intensity

Processor Extension (I)
[Figure: datapath of the distance extension. Labels: Processor Load Unit; 64-bit register; two 4-segm. groups; operators −, x, +; constants ‘8’ and ‘16’; Result]
• 8-issue, 8-bit L2 distance
• Complexity:
  – 23 8-bit OPS
  – 6 64-bit OPS
• 1GOPS peak throughput
  – Distance computation
• 10k equiv. ASIC gates
• Mapped to e-FPGA
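
As a software reference for what an 8-lane, 8-bit L2 distance on a 64-bit register computes, the sketch below unpacks eight unsigned bytes from each operand and accumulates the squared differences. The lane ordering, the unsigned interpretation, and returning the squared distance (with no final root) are assumptions; the slides do not give these details.

```python
def packed_l2_squared(a: int, b: int) -> int:
    """Sum of squared differences over eight 8-bit lanes of two 64-bit words."""
    total = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF   # extract lane from first operand
        y = (b >> (8 * lane)) & 0xFF   # extract lane from second operand
        diff = x - y
        total += diff * diff
    return total


# Usage: pack two 8-pixel segments into 64-bit words and compare them.
seg_a = int.from_bytes(bytes([1, 2, 3, 4, 5, 6, 7, 8]), "little")
seg_b = int.from_bytes(bytes(8), "little")
print(packed_l2_squared(seg_a, seg_b))
```

In hardware the eight lanes would be processed in parallel rather than in a loop; the loop here only fixes the expected result.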

Processor Extension (II)
[Figure: datapath of the square root extension. Labels: Number; Remaind.; root; shifts >>1, <<1, <<2, >>2, >>30; operators +, −, >; constants +1, +2; Result]
• Fixed-point square root kernel
• Complexity:
  – 12 32-bit OPS
• 2k equiv. ASIC gates
• Mapped to e-FPGA
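
The slides do not name the square root algorithm. The shifts by 1, 2 and 30 in the diagram are consistent with a digit-by-digit (shift-and-subtract) integer square root, so the sketch below uses that method as an assumption, not as the authors' implementation:

```python
def isqrt32(n: int) -> int:
    """Floor of the square root of a 32-bit unsigned integer, bit by bit."""
    if not 0 <= n < 2**32:
        raise ValueError("n must fit in 32 unsigned bits")
    root = 0
    remainder = n
    bit = 1 << 30                  # highest power of four within 32 bits
    while bit > remainder:         # skip leading result bits that are zero
        bit >>= 2
    while bit:
        if remainder >= root + bit:
            remainder -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


print(isqrt32(1_000_000))
```

Only shifts, adds, subtracts and compares are used, which is why this style of kernel maps to a small amount of logic.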

Performance: Processing Time @ 100 MHz

Algorithm Stage                       RISC w/ basic DSP   RISC w/ basic DSP + uP Ext.   Speed-Up
Bayer Filter                          58 msec             24.7 msec                     x 2.3
Edge Detection                        4.5 msec            2.5 msec                      x 1.8
Face Detection                        1.5 sec             382 msec                      x 4
Face Recognition (20-face database)   9.15 sec            860 msec                      x 10.6
Totals                                10.7 sec            1.26 sec                      x 8.5
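
The speed-up column is the ratio of the two processing times. A short sketch that recomputes it from the table values (times converted to milliseconds; the dictionary layout is illustrative):

```python
# (baseline ms, with uP extensions ms), copied from the table above
stages_ms = {
    "Bayer Filter": (58.0, 24.7),
    "Edge Detection": (4.5, 2.5),
    "Face Detection": (1500.0, 382.0),
    "Face Recognition": (9150.0, 860.0),
}


def speedup(baseline_ms: float, extended_ms: float) -> float:
    """Speed-up factor: baseline time divided by accelerated time."""
    return baseline_ms / extended_ms


for name, (base, ext) in stages_ms.items():
    print(f"{name}: x {speedup(base, ext):.2f}")

total_base = sum(base for base, _ in stages_ms.values())
total_ext = sum(ext for _, ext in stages_ms.values())
total_speedup = speedup(total_base, total_ext)
print(f"Totals: x {total_speedup:.2f}")
```

The recomputed factors agree with the table to within the rounding of the published times.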

Energy Efficiency vs. Flexibility
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility (coverage). Labels: Dedicated HW; FPGA-mapped CoProcessors; ASIPs, DSPs; uP + FPGA Instructions; Embedded Processors; Energy-Flexibility Gap. From: Zhang et al., ISSCC 2000]

Performance: Energy Efficiency

Algorithm Stage                       Speed-Up   Energy Gain   Energy x Delay Gain
Bayer Filter                          x 2.3      x 1.4         x 3.2
Edge Detection                        x 1.8      x 0.95        x 1.7
Face Detection                        x 4        x 2.9         x 11.6
Face Recognition (20-face database)   x 10.6     x 9           x 95.4
Totals                                x 8.5      x 6.7         x 57
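
Since the energy-delay product is energy multiplied by time, its gain is the product of the energy gain and the speed-up. A small sketch checking the last column against the other two (values copied from the table; the variable names are illustrative):

```python
# stage: (speed-up, energy gain, reported energy x delay gain)
rows = {
    "Bayer Filter": (2.3, 1.4, 3.2),
    "Edge Detection": (1.8, 0.95, 1.7),
    "Face Detection": (4.0, 2.9, 11.6),
    "Face Recognition": (10.6, 9.0, 95.4),
    "Totals": (8.5, 6.7, 57.0),
}


def energy_delay_gain(speed_up: float, energy_gain: float) -> float:
    """Gain in energy x delay: product of the two individual gains."""
    return speed_up * energy_gain


for name, (speed_up, energy_gain, reported) in rows.items():
    computed = energy_delay_gain(speed_up, energy_gain)
    print(f"{name}: computed x {computed:.2f}, reported x {reported}")
```

Note that an energy gain below 1, as for Edge Detection, means that stage uses slightly more energy with the extension, even though it runs faster.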

SoC Integration
[Figure: system-to-RTL design flow. Labels: Functional model (untimed); C; Partitioning / I/F Synthesis / Refinement; HW/SW Libraries; SW Apps; uP ISS; uP, AHB/APB Bus, Peripherals; HW (RTL); VHDL (e-FPGA); Soft Hardware (eFPGA); Inst. Ext. Verilog; eFPGA mapping; eFPGA HARD MACRO; Cycle Accurate Simulation; Performance Analysis]
[Figure: RTL-to-layout design flow. Labels: Inst. Ext.; Synthesis; Mapping (P&R); FPGA Timing DB; Bit-stream; CPU core, IPs; Interface RTL code; Flash; RAM; Coproc. I/O I/F; eFPGA core; Con.; Synthesis; Floorplanning / P&R; Netlist + Timing Database; Static Timing Analysis, Dynamic Verification; Static Timing Analysis (SoC + eFPGA); Silicon fab]

Chip Layout

Process             0.18um CMOS 2P/6M Embedded Flash
Flash Memory (x4)   256kB x 9 sectors
                    128-bit word
                    1MB/s write throughput
                    400MB/s read throughput
SRAM Memory         Main: 48kB (64-bit)
                    I$: 8kB (64-bit)
                    D$: 8kB (64-bit)
                    Buffers: 4x256B
Chip size           8.4 x 8.4 mm2 (e-FPGA size: 8.2 mm2)
I/O                 24 inputs + 24 outputs (tristate) + 8 bidirs
Supply              2.7-3.6V (external), 1.8V (core)

[Figure: chip layout view. Labels: 48 KB SRAM; Buffers; Embedded FPGA; TAGS; 8+8 KB I$ + D$; 32b uP + AHB & APB + 250k GATES; 1MB FLASH Memory; DFT; Flash Ports]

Chip Performances and Power Consumption
Processor maximum speed: 125MHz (WCMIL)
Reconfiguration speed: 500us @ 100MHz clock
Chip average power consumption: 300mW @ 100MHz, 1.8V

Summary
• e-FPGAs allow architectural tradeoffs for reconfigurable embedded systems:
  – Processor ISA extensions
  – Bus-mapped co-processor
  – Flexible I/O
• Modular, content-specific, multiport e-Flash
• Performance figures:
  – Up to 10x speedup
  – Up to 9x energy reduction
  – Dynamic reconfiguration in 500 us
• Specific design-flow for system and RTL

Acknowledgements:
The authors thank all the colleagues of NVM-DP Dept., A. Maurelli, F. Piazza and L. Fumagalli.