Post on 21-Dec-2015
Adaptive System on a Chip (aSoC) for Low-Power Signal Processing
Andrew Laffely, Jian Liang, Prashant Jain, Ning Weng, Wayne Burleson, Russell Tessier
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst
{alaffely, jliang, pjain, nweng, burleson, tessier}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Overview
• Motivation
  • Video Processing
• Architecture
• Dynamic Power Management
  • Core, Interconnect, and Clock
Problem
• Wireless video processing requires
  • High throughput
  • Low power
  • Flexibility
System on a Chip Solutions
• Take advantage of parallelism
  • Possibly improved performance
• Allow use and reuse of existing integrated components
• These benefits hold only if
  • The application can be partitioned
  • The appropriate architecture is used
Proposed Architecture: aSoC
• High throughput
  • Heterogeneous processor elements
    • Use the right tool for the job
  • Fast and predictable interconnect
• Flexible
  • Runtime reconfiguration of cores and interconnect
• Power consumption
  • Implement power saving features in both cores and interconnect
  • Use reconfiguration to dynamically control power consumption
aSoC: adaptive System on a Chip
• Tiled SoC architecture
[Figure: array of tiles holding cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
  • Pick and place cores which best perform the given application
  • Increase performance
  • Save power
• Cores may be any number of tiles in size
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with an interconnect mesh
  • Restricted to near neighbor communications
  • Creates pipeline
  • Decreases cycle time
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with a fixed interconnect mesh
• Using a communication interface (CI) to manage data
• Network port (Coreport) for each core
• Each CI uses a memory and FSM to repetitively process a predefined schedule of communications
• Crossbar
Stream Control
• Instruction memory
  • Holds the predetermined schedule of communications
• PC
  • Selects and synchronizes the communications
• Decoder
  • Sets crossbar
• Controller
  • Sets PC
  • Interprets incoming configuration commands
• Crossbar
  • Any input to any set of outputs
[Figure: communication interface block diagram: a crossbar with North, South, East, West, and Core inputs and outputs, controlled by the Decoder/Controller, PC, and Instruction Memory, with a Local Config input]
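The stream-control blocks listed above can be pictured with a small software model. This is only a sketch under stated assumptions: the class name `CommunicationInterface`, the dictionary encoding of an instruction, and the lowercase port names are hypothetical and are not taken from the aSoC hardware.

```python
from dataclasses import dataclass


@dataclass
class CommunicationInterface:
    """Toy model of one tile's communication interface (CI).

    schedule plays the role of the instruction memory: one entry per cycle,
    each entry mapping an input port to the list of output ports it drives.
    """
    schedule: list
    pc: int = 0

    def step(self, inputs):
        """Route one cycle of data through the crossbar, then advance the PC."""
        routes = self.schedule[self.pc]            # decoder reads the current instruction
        outputs = {}
        for src, dsts in routes.items():
            if src in inputs:
                for dst in dsts:                   # any input to any set of outputs
                    outputs[dst] = inputs[src]
        self.pc = (self.pc + 1) % len(self.schedule)  # the schedule repeats
        return outputs


# Hypothetical two-cycle schedule: send a core word east, then fan a west word out.
ci = CommunicationInterface(schedule=[{"core": ["east"]}, {"west": ["east", "core"]}])
first = ci.step({"core": "pixel"})
second = ci.step({"west": "word"})
```

After the two calls the PC has wrapped back to the first instruction, which mirrors the way the CI processes its predefined schedule repetitively.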
Example: Communication
• A given application requires periodic communications from Core A to Core C
• aSoC uses a prescheduled communication STREAM
  • Core A places the data in a dedicated STREAM between the two tiles
  • Core C pulls the data from that STREAM
• The tile to tile communication uses 3 cycles
[Figure: tiles holding Core A, Core B, and Core C; legend: Stream A-D]
Example: Stream
[Figure, built up over four slides: tiles A, B, and C annotated with the scheduled steps of the stream; legend: Stream A-D]
1. Tile A: Core to East
2. Tile B: West to East
3. Tile C: West to Core
• LoopBack: after step 3 the schedule returns to step 1
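The Core to East, West to East, West to Core sequence can be traced in a few lines of code. The sketch below assumes one hop per cycle and a single latched link between neighboring tiles; the function `run_stream` and the tuple encoding of the schedule are hypothetical names chosen for illustration.

```python
def run_stream(word, schedule):
    """Move one word along a row of tiles, one scheduled hop per cycle."""
    link = None      # value latched on the wire into the next tile
    trace = []
    for cycle, (tile, src, dst) in enumerate(schedule, start=1):
        # The sending core injects the word; other tiles read their west input.
        value = word if src == "core" else link
        trace.append((cycle, tile, f"{src} to {dst}"))
        if dst == "core":
            return value, trace      # the receiving core pulls the word
        link = value                 # available to the neighbor on the next cycle
    return None, trace


SCHEDULE = [("A", "core", "east"), ("B", "west", "east"), ("C", "west", "core")]
delivered, trace = run_stream(0x2A, SCHEDULE)
```

The trace has three entries, matching the 3 cycles quoted for the transfer from Core A to Core C.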
Static Scheduled Communications
• Creates system scalability by “eliminating” network congestion
• Many interconnect segments managed with time division multiplexing
• High aggregate bandwidth
• Improves SoC performance by up to a factor of 8
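One way to read the claim that static scheduling "eliminates" congestion: if every stream is assigned its link and time slot before run time, a schedule can be checked offline for two streams claiming the same link in the same slot. The checker below is a sketch of that idea; `find_conflicts`, the tile names, and the `(cycle, tile, port)` hop encoding are assumptions made for illustration, not part of the aSoC tool flow.

```python
from collections import defaultdict


def find_conflicts(streams, period):
    """Return the link slots claimed by more than one stream.

    streams maps a stream name to its hops, each hop being (cycle, tile, port).
    period is the schedule length; cycles wrap around because the schedule repeats.
    """
    claims = defaultdict(list)
    for name, hops in streams.items():
        for cycle, tile, port in hops:
            claims[(cycle % period, tile, port)].append(name)
    return {slot: names for slot, names in claims.items() if len(names) > 1}


# Streams A and B share the same links but use different time slots.
ok = {
    "A": [(0, "t0", "east"), (1, "t1", "east")],
    "B": [(1, "t0", "east"), (2, "t1", "east")],
}
# Stream C lands on A's first slot once cycle 3 wraps around a 3-cycle schedule.
clash = dict(ok, C=[(3, "t0", "east")])

no_conflicts = find_conflicts(ok, period=3)
conflicts = find_conflicts(clash, period=3)
```

An empty result means the time division multiplexing is collision free, so no arbitration is needed while the schedule runs.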
Power Consumption?
• Provide reconfiguration methods for cores and CI
• Develop programmable clocking systems at each tile
Power Aware Core
• Custom motion estimation core
• Choose search method
  • Full search: 600-960 mW, depending on bit width and pel sub-sampling
  • Spiral search: 76 mW
  • Three step search: 25 mW
Data taken with Synopsys™ Power Compiler at the RTL level
aSoC Support
• Multiple streams in and out through dedicated coreports
• Easy to manage on both sides of the port
• Schedule configuration streams in with the data
• Stream A: Input Frame
• Stream B: Configuration (Choose search mode and size)
• Stream C: Motion Vectors
[Figure: Motion Estimation core with coreports in1, in2, out1, and out2, carrying Stream A, Stream B, and Stream C]
Reconfigurable Interconnect
• P-frame
  [Figure: Input Frame, ME, MC, a subtraction (- / +), and DCT]
• I-frame
  [Figure: Input Frame and DCT]
aSoC Support
• Lumped ME, MC, and summation into one double core
[Figure: DCT core next to a double-size Motion Estimation & Compensation core]
aSoC Support: P-Frame
[Figure: DCT core and Motion Estimation & Compensation core, with Input Frame (Stream A) and Difference Frame (Stream B)]
aSoC Support: Schedule Change
[Figure: DCT core and Motion Estimation & Compensation core, with Input Frame (Stream A), Difference Frame (Stream B), and Configuration Streams (C & D)]
aSoC Support: Schedule Change
[Figure: DCT core and Motion Estimation & Compensation core, with Input Frame (Stream A), Difference Frame (Stream B), and Configuration (Stream C); the PC is shown against Schedule 1 and Schedule 2]
aSoC Support: Schedule Change
[Figure: DCT core and Motion Estimation & Compensation core, with Input Frame (Stream A) and Configuration (Stream D); the PC is shown against Schedule 1 and Schedule 2]
aSoC Support: Schedule Change
[Figure: DCT core and Motion Estimation & Compensation core, with Input Frame (Stream A’) and Configuration (Stream D); the PC is shown against Schedule 1 and Schedule 2]
aSoC Support: I-Frame
[Figure: DCT core with Input Frame (Stream A’); the Motion Estimation & Compensation core is OFF]
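The schedule change between P-frame and I-frame operation can be sketched as a PC that is redirected to a second schedule already stored in the instruction memory. Everything named here is hypothetical (`ScheduledInterface`, `configure`, and the schedule labels); the stream labels are the ones shown on the slides, and the real controller encoding is not given in the source.

```python
class ScheduledInterface:
    """Toy CI holding several schedules and a PC that selects within the active one."""

    def __init__(self, schedules):
        self.schedules = schedules            # name -> list of per-cycle instructions
        self.active = next(iter(schedules))   # start on the first stored schedule
        self.pc = 0

    def configure(self, schedule_name):
        """Controller action for a configuration command: jump to another schedule."""
        self.active = schedule_name
        self.pc = 0

    def step(self):
        """Issue the current instruction and advance the PC within the active schedule."""
        instructions = self.schedules[self.active]
        instruction = instructions[self.pc]
        self.pc = (self.pc + 1) % len(instructions)
        return instruction


ci = ScheduledInterface({
    "schedule_1_p_frame": ["stream A", "stream B"],
    "schedule_2_i_frame": ["stream A'"],
})
p_frame_cycles = [ci.step() for _ in range(3)]
ci.configure("schedule_2_i_frame")    # a configuration stream requests the change
i_frame_cycles = [ci.step() for _ in range(2)]
```

Because both schedules are resident, the switch costs only a PC update rather than a reload of the instruction memory in this model.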
Operating Frequency?
• Interconnect synchronized
  • H-tree clock distribution
• Core frequencies depend on critical path
  • Tile provides clock reference
  • Coreport provides asynchronous boundary
• Dynamic core configuration requires dynamic clock configuration
  • aSoC clock reference provides multiples of interconnect clock (… 4x, 2x, 1x, 0.5x, 0.25x, …)
  • Configured through the tile controller
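Given a clock reference restricted to the multiples listed above (… 4x, 2x, 1x, 0.5x, 0.25x, …), a tile controller would need to pick one for each core mode. The sketch below assumes the multiples are powers of two and chooses the slowest one that still meets the core's required frequency; `clock_multiplier` is a hypothetical helper, and the frequencies in the usage lines are illustrative inputs only.

```python
import math


def clock_multiplier(required_mhz, interconnect_mhz):
    """Smallest power-of-two multiple of the interconnect clock that meets required_mhz."""
    exponent = math.ceil(math.log2(required_mhz / interconnect_mhz))
    return 2.0 ** exponent


# Illustrative inputs: a 6.34 MHz interconnect clock and three core requirements.
fast_mode = clock_multiplier(9.9, 6.34)     # needs more than 1x, so 2x is chosen
slow_mode = clock_multiplier(2.75, 6.34)    # 0.25x would be too slow, so 0.5x is chosen
same_rate = clock_multiplier(6.34, 6.34)    # exactly the interconnect rate
```

Picking the slowest sufficient multiple is one plausible policy for saving power; the source does not state which policy the tile controller uses.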
Mixed vs. Fixed Core Frequencies
• Cores not designed with clock gating
• Core power from Synopsys RTL simulation
• Interconnect from SPICE
• Assumes 10 cycle schedule, 4 pixels/word

                         Optimal Independent Frequencies    Fixed Worst Case 105 MHz
Core: Mode               Frequency (MHz)    Power (mW)      Power (mW)
ME: Full Search          105                973             973
ME: Spiral               9.9                76              659
ME: Three Step Search    2.75               25              580
DCT                      9.6                54              349
Interconnect             6.34               0.14            0.81
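The size of the saving in each row is easier to compare as a percentage. The snippet below copies the power figures from the table and computes the reduction of optimal independent clocking relative to the fixed 105 MHz clock; the helper name `savings_percent` is hypothetical.

```python
# (core and mode, power at optimal frequency in mW, power at fixed 105 MHz in mW)
ROWS = [
    ("ME: Full Search", 973, 973),
    ("ME: Spiral", 76, 659),
    ("ME: Three Step Search", 25, 580),
    ("DCT", 54, 349),
    ("Interconnect", 0.14, 0.81),
]


def savings_percent(optimal_mw, fixed_mw):
    """Power reduction of the optimal clock relative to the fixed clock, in percent."""
    return 100.0 * (fixed_mw - optimal_mw) / fixed_mw


for name, optimal, fixed in ROWS:
    print(f"{name:22s} {savings_percent(optimal, fixed):5.1f}% lower")
```

Full search shows no saving because its optimal frequency is already the 105 MHz worst case.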
Current Density and Clocking
• Red: fixed worst case clocking
• Short spikes of high current
• Green: optimal independent clocking
• Slow and low
• Optimal clocking eliminates current spikes (improved battery life)
[Figure: current versus time, from process start to deadline, for ME: Full Search, ME: Spiral, ME: Three Step Search, and DCT]
Configuration Overhead
• Configuration adds up to 2 streams per tile
  • Only 2 required for data
• Total BW = 5 x T x N
  • 5 streams/(cycle, tile)
  • T tiles
  • N cycles in schedule
• Single tile can support up to 50 different streams in 10 cycle schedule
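The 50-stream figure follows directly from the bandwidth expression on this slide. A minimal check, with the hypothetical function name `total_stream_slots`:

```python
STREAMS_PER_CYCLE_PER_TILE = 5   # from the slide: 5 streams/(cycle, tile)


def total_stream_slots(tiles, cycles):
    """Total BW = 5 x T x N, counted in stream slots per schedule."""
    return STREAMS_PER_CYCLE_PER_TILE * tiles * cycles


single_tile = total_stream_slots(tiles=1, cycles=10)
```

With T = 1 and N = 10 this gives the 50 stream slots quoted for a single tile.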
[Figure: DCT core with Input Frame (Stream B), Transform Frame (Stream D), and configuration streams]
Configuration Power Overhead
• Configuration streams used infrequently
  • Once/Macro block or Once/Frame
• Architecture disables unused streams
  • Data valid bit already used for flow control
• Only 4-9% of interconnect power is due to configuration streams
Conclusion
• aSoC supports dynamic power management through reconfiguration of
  • Cores
  • Interconnect
  • Clocks
• Low configuration overhead in both
  • Communication bandwidth
  • Power
Future Work
• Add reconfigurable voltage supplies at each tile
• Finish test chip
• Import larger applications
Questions
aSoC: adaptive System on a Chip
[Figure: tiled array of cores (DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation) with callouts labeling the Cores, Interconnect, Interface, and Tile]
Partitioning
• Automated partitioning is a nontrivial problem
• For small signal processing systems, user-defined partitioning may be possible
• Key: Perfectly partitioning the system may not be possible
  • How can the SoC mitigate the penalty?