Technische Universität Dortmund
Automatic mapping to tightly coupled memories and cache locking
Peter Marwedel1,2, Heiko Falk1, Robert Pyka1, Lars Wehmeyer2
1TU Dortmund2Informatik Centrum Dortmund (ICD)
http://ls12-www.cs.uni-dortmund.de, http://www.icd.de
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 2 -
Problems with memory speeds
Speed gap between processor and main DRAM increases
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
2
4
8
2 4 5
Speed
years
CPU Per
form
ance
(1.5
-2 p
.a.)
DRAM (1.07 p.a.)
31
2x every 2 years
10
Similar problems also for embedded systems & MPSoCs
In the future:Memory access times >> processor cycle times
“Memory wall” problem
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 3 -
Problems with memory energy
[Segars 01 according to Vahid@ISSS01]
Caches consume muchof the available energy
[O. Vargas (Infineon): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ TU Dortmund) for providing this source.
Memories a major consumer of energy: Example: mobile phone
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 4 -
Tightly coupled memories/Scratch pad memories (SPM): Fast, energy-efficient, timing-predictable
Address space
ARM7TDMI cores, well-known for low power consumption
scratch pad memory
0
FFF..
ExampleExample
Small; no tag memory
TCM/SPMs are small, physically separate memories mapped into the address space;
Selection is by an appropriate address decoder (simple!)
SPM
select
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 5 -
Migration of data and instructions to TCM/SPM
Which objects (array, loop, etc.) to be stored in SPM?
Non-overlaying (“static”) memory allocation:
Objects always in TCM while application is running
Overlaying (“dynamic”) allocation:
Moving objects back and forth between hierarchy levels
Processor
Scratch pad memory,capacity SSP
main memory
?
For i .{ }
for j ..{ }
while ...
Repeat
function ...
Array ...
Int ...
Array
Example:
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 6 -
A survey of algorithms for scratchpad allocation
Non-overlaying (“static”) approachesGain gk & size sk for each object k.Maximise gain G = gk, respecting SSP sk. Knapsack
• Code, static data, stack, heap,• Partitioning, handling large arrays
Overlaying (“dynamic”) approaches• single/multiple hierarchy levels• for single process• for static number of multiple processes• for dynamic number of multiple processes• not using/using MMU
Survey of algorithms: http://ls12-www.cs.uni-dortmund.de/publications/ papers/2007-marwedel-acaces.zip
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 7 -
Partitioning
# of partitions
number of partitions of size:
4k 2k 1k 512 256 128 64
7 0 1 1 1 1 1 2
6 0 1 1 1 1 2 0
5 0 1 1 1 2 0 0
4 0 1 1 2 0 0 0
3 0 1 2 0 0 0 0
2 0 2 0 0 0 0 0
1 1 0 0 0 0 0 0
Example of all considered memory partitions for a total capacity of 4096 bytes
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 8 -
Results for a non-overlaying approach:parts of GSM coder/decoder
Energy model: based on ARM evaluation board
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 9 -
Using these ideas with an gcc-based tool flow
Source is split into 2 different files by specially developed memory optimizer tool *.
applicationsource
profile Info.
main mem. src
spm src.
linker script*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es)
*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es) .exe
.ld linker
ARM-GCCCompiler
ARM-GCCCompiler
.c
.c
.c
.txt
Memory optimizer(Incl. ICD-C*)
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 10 -
Hybrid Context Switch
Hybrid Context Switch (Hybrid) Disjoint + Shared SPM regions Good for all scratchpads
Scratchpad
Process P1,P2, P3
Process P1
Process P2
Process P3
Process P1Process P2Process P3
P1
P2
P3Saving/Restoring at context switch
Saving/Restoring at context switch
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 11 -
Multi-process Scratchpad Allocation: Results
For small SPMs (64B-512B) Saving is better For large SPMs (1kB- 4kB) Non-Saving is better Hybrid is the best for all SPM sizes. Energy reduction @ 4kB SPM is 27% for Hybrid approach
80
90
100
110
120
130
140
150
160
64 128 256 512 1024 2048 4096
Scratchpad Size (bytes)
En
erg
y C
on
sum
ptio
n (
mJ)
Energy (SPA) Energy (Non-Saving)
Energy (Saving) CopyEnergy (Saving)
Energy (Hybrid) CopyEnergy (Hybrid)
27%
SPA: Single Process Approach
edge detection,
adpcm, g721, mpeg
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 12 -
Approach overview
App. 2App. 2
App. 1App. 1
App. nApp. n AllocationManager
AllocationManager
Standard Compiler(GCC)
Standard Compiler(GCC)
OperatingSystem RTEMS
OperatingSystem RTEMS
Compile-timeTransformations
Compile-timeTransformations
Profit values / Allocation hints
2 steps: compile-time analysis & runtime decisions
No need to know all applications at compile-time
Capable of managing runtime allocated memory objects
Integrated into an embedded operating system
Using MPArm simulator from U. Bologna
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 13 -
Comparison of SPMM to Caches for SORT
Baseline: Main memory only SPMM peak energy reduction by
83% at 4k Bytes scratchpad Cache peak: 75% at 2k 2-way
cache
SPMM capable ofoutperforming caches
OS and libraries are not considered yet
Chunk allocation results:
SPM Size Δ 4-way
1024 74,81%
2048 65,35%
4096 64,39%
8192 65,64%
16384 63,73%
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 14 -
Worst case timing analysis using aiT
C program
SPM size
executable
Actualperformance
Worst caseexecution time
memory-awarecompiler
ARMulator
aiT
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 15 -
Results for G.721
L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time analysis, (WCET), 2004
L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005
Using Scratchpad: Using Unified Cache:
aiT timing analyzer
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 16 -
Locking of I-Caches
Many caches allow “locking” or “freezing” (no replacements).
Can be used to improve timing predictability: Load promising Functions into Cache
Optimizations: Worst case paths can change during optimization Requires coupling of timing analysis tool and compiler
[Heiko Falk, Sascha Plazar, Henrik Theiling: Compile-Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, CODES/ISSS, 2007]
ANSI-C source
Compiler Linker
LinkerWCET-
optimized exeI-Cache
Optimizer
ait WCET Analysis
exe
startup code
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 17 -
Relative WCETs after I-Cache Locking
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
64 128 256 512 1024 2048 4096 8192 16384
Cache Size [bytes]
Re
l. W
CE
T [
%]
ADPCM G723 Statemate Compress MPEG2
(ARM920T)
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 18 -
More information
2006
• http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip
• http://ls12-www.cs.uni-dortmund.de/publications/global_index.html
• http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedel-acaces.zip
• http://ls12-www.cs.uni-dortmund.de/publications/global_index.html
2007
P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
TU Dortmund
- 19 -
Conclusion
Major impact of the memory system on system speed, energy consumption and timing predictability.
Memory hierarchies comprising TCMs/SPMs are fast, energy-efficient and timing predictable.
Algorithms have been designed for• Code, static data, stack, heap• Single and multiple memory hierarchy levels• non-overlaying and overlaying allocation• saving, non-saving and hybrid context switches• mono- and multiprocessor systems.
Very large improvements in terms of the considered figures of merit.
Compatible with existing tool flows