soc programming in the era of the internet of things, machine learning and emerging ... ·...
Post on 17-May-2020
1 Views
Preview:
TRANSCRIPT
SoC programming in the era of the Internet of Things, machine learning and emerging technologies
Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany
Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019
Systems on Chip (SoC): Evolution
q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades
3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999
Smart health
Aerospace
AI, MLNLP
Smart transport
Infotainment system
Virtual 3D
Control systems
i/po/p
Robotics HCI
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
SoC design and programming: Handling complexity
q Advances in design methodologies
q Model-based programming
4 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Transistors
Gate level
Register transfer level
Electronic system level
Application Architecture
Results
Optimization
SoC design and programming: Handling complexity
q Advances in design methodologies
q Model-based programming
5 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Transistors
Gate level
Register transfer level
Electronic system level
Application Architecture
Results
Optimization
Exciting innovations in q Modeling languagesq Programming languages and compilers
q Costs models of hardwareq System simulators
q Design space exploration (DSE) methodologies
New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads
q Complexity of emerging technologies
Dataflow and Hybrid DSE
© Prof. J. Castrillon. GdR SoC2. Montpellier, 20196
Dataflow programming
q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications
q Good match for signal processing & multi-media
q The whyq Explicit parallelism
q Often: Determinism q Better analyzability (scheduling, mapping, optimization)
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);
PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);
switch (transTarget) {case TransM VP:
PnTransformTemplateInstantiate(D, S);
ErasePnProcessTemplates(D);
PnPrintForM VP(D, S);break;
case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces,
streamFactory);ErasePnDefs(D);
break;case TransVPUtg:
PrintForVPUtg(D, S, streamFactory);
ErasePnDefs(D);break;
case TransVPUmap:PrintForVPUmap(D, S,
strM appingFileName, streamFactory);
ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
}}
#include "PnTransform.h"#include "PnTVPUtg.h"
#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;
voidclang::PnTransform(TransTarget transTarget, bool traces,
const std::string & strM appingFileName, ASTContext
&Ctx,Sema &S, const
llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =
Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {
case TransSystemC:
case TransVPUtg:case TransVPUmap:
PnTCopying(D, S);break;
default:;
}
PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces,
streamFactory);ErasePnDefs(D);break;
case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;
case TransVPUmap:
PrintForVPUmap(D, S, strM appingFileName, streamFactory);
ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
&BasePath) {assert(transTarget != TransInvalid);
TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();
PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {
case TransSystemC:case TransVPUtg:case TransVPUmap:
PnTCopying(D, S);break;
default:;
}
7
Dataflow compilation
q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation
q Runtime systemsq Code generation for heterogeneous multicores
8 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
LM
RISC
LM
RISC
LM
DSP
LM
DSP
LM
DSP
LM
DSP
ARM
SM
AHB Bus
LM: Local memorySM: Shared memory
qnt
src snk
qnt
qntdct
qnt
dctvle
dct
dct init
MJPEG application
[Springer14]
Example: HW acceleration and SW-defined radio
q Application: MIMO OFDM receiverq Hardware
q Platform 1: Baseline software q Platform 2: Optimized software
q Platform 3: Optimized SW + HW
9 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
ARM Mem
VLIW VLIW VLIW
src sink
Library 1: SW implementations
Library 2: SW optimized
MemMem
FFT
ARM Mem
VLIW FFT Viterbi
src sink
Custom Interconnect
Demap Library 3: SW+HW accel.7680
128000
1758241,758
1,00E+00
1,00E+02
1,00E+04
1,00E+06
1)bsp1 (sw,
unoptimized)
2)bsp2 (sw,optimized)
3)bsp3 (hw)
Rate
(bps
)
Achieved rate @ 100 MHz
[SDR’10, ALOG’11]
System dynamics
q Applications not so static anymore!
q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant
q Run-time predictability, robustness & isolation
11 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Smart health
Aerospace
AI, MLNLP
Smart transport
Infotainment system
Virtual 3D
Control systems
i/po/p
Robotics HCI
EUF Pareto front @ compile-time
Data-level parallelism: Scalable and adaptive
q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)
12 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
P1 P2 P3
Meta-information & configurations
Changes @ runtime
P1
P12
P3P22
P32
P1
P12
P3P32
P42
or
P22
[PARMA-DITAM’18]
Exploiting symmetries
q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)
q Symmetry: Allows transformations w/o changing the outcome
è No need to analyze all possible mappings (prune search space)
è Work on formalization via inverse semi-groups and efficient algorithms
13 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
=?
15
Symmetries in Odroid: Example
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
PE7
PE6PE5
PE4PE3
PE2PE1
PE0
PE7
PE6PE5
PE4PE3
PE2PE1
PE0D
A B C
DA
B
C
ϕ
PE1
PE2
PE3
PE4
PE5PE6
PE7
PE0
PE1
PE2
PE3
PE4
PE5
PE6PE7
PE0
Mappings Architecture Subgraphs
Cortex A7 Cortex A15
Equivalent mappings
Graph isomorphism
Mappings Architecture subgraphs
[IESS’15, ACM TACO’17]
Flexible mappings: Generalized Tetris
q Given multiple canonical configs by compiler, select one at run-time
q Exploit mapping equivalences and similarities
16 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Mapping 3 configuration1Mapping 2
configuration1Mapping configuration1
RuntimeMapping
configuration*
[SCOPES’17b]
Flexible mappings: Run-time analysis
q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO
q 1x AF
q 4 x AF q 2 x AF + 2 x MIMO
q 3 mappings to two processors q T1: Best CPU time
q T2: Best wall-clock timeq T3: GBM heuristic
17 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
●●●
12
14
16
18
CFS Dyn T1 T2 T3
CPU
tim
e [s
]
●
●
●
●●
●●
●
●
10
12
14
16
CFS Dyn T1 T2 T3
Ener
gy [J
]
●
●
●
●●●●
●●
4
5
6
7
8
CFS Dyn T1 T2 T3
Wal
l−cl
ock
time
[s]
Single AF
[Castrill12]
[SCOPES’17b]
Flexible mappings: Multi-application results (1)
18 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●●
●●
●
●
●●
6.5
7.0
7.5
8.0
8.5
9.0
9.5
CFS Dyn T1 T2 T3
Wal
l−cl
ock
time
[s]
●●
●●●●●●
●●●
●●●●●●●●●●●●●
●●●
●
●●●●●●
●●●●●●●
10
20
30
40
AF MIMO
CPU
time
[s]
●●●● ●●
●● ●●●●●●●
●●●
●●●●
●● ●●
10
20
30
AF MIMO
Wal
l−clo
ck ti
me
[s] More
predictable performance
Comparable performance to dynamic mapping
●
●
●
●
●●●●●●
●
●●●●
●
●●●
●
● ● ● ●
● ●
●●
●
●●
●
●●
● ● ● ●
12.5
15.0
17.5
20.0
CFS Dyn T1 T2 T3
CPU
tim
e [s
]
[SCOPES’17b]
Robustness
q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control
q Application data, unexpected interrupts, unexpected OS decisions
è Can we reason about robustness of mapping to external factors?
20 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Mapping 3 configuration1Mapping 2
configuration1Mapping configuration1
RuntimeMapping
configuration* Perturb Modified behavior(correct?)
Design centering
q Design centering: Find a mapping that can better tolerate variations while staying feasible
q Studied field, in e.g., biology, circuit design or manufacturing systems.
q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping
21 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
optimum
design center in feasible region
x1
x2
feasible region
[SCOPES’17a]
Evaluation
q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)
23 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
MIMO-OFDM
[SCOPES’17a]
Machine Learning & emerging technologies
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
ML revolution: Frameworks and architectures
q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …
q Lots of traction in hardware architectures: TPU, V100, …
q Lot’s of resources for training, less on inference on edge-devices!
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201926
[Chen, OSDI’18]Example flow: TVM
Domain-specific abstractions
q Commonality: Tensor expression languages
q Increase programmer’s productivityq From compiler perspective: No abstraction toll
q Easier access to informationq Larger score for optimization
27 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
var input A : matrix &var input u : tensorIN &
v = (A # A # A # u . [[5 8] [3 7] [1 6]])
[RWDSL’18]
vs
Correct by construction
q Especially important in embedded systemsq Correct by designq No abstraction leaks
q Current efforts in formal semantics for safe code generation
28 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
[Array’19]
A = placeholder((m,h), name='A')
B = placeholder((n,h), name='B')
k = reduce_axis((0, h), name='k')
C = compute((m, n), lambda i, j:
sum(A[k, i] * B[k, j], axis=k))
!"# = %&'(
)*&"+&#
TeML: Results
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow
29
[GPCE’18]
Emerging technologies: Racetrack memories
q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures
q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains
q One port shared for many bitsq Domains move at high speeds (1000 ms-1)
q Sequential: Game changer for current HW/SW stackq Memory management
q Integration with other memory architectures q Data layout and allocation
[Parkin-Nature’15]
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201930
[TVLSI’18, TVLSI’19]
Compiler research on placement
q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations
q Compiler pass to reorganize data for higher-level objectsq Variable allocation
q Tensor allocation and program transformation
31
[ISLPED’19]
[ArXiv’19]
[LCTES’19]
Simulating RTMs
q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201932
https://github.com/tud-ccc/RTSim[IEEE CAL’19]
Architecture and data layout optimization
q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other
optimizations
33[LCTES’19]
Architecture and data layout optimization
q Data-layout: Reduce the number of shifts
34[LCTES’19]
n
DB
C:2n
DB
C:2n+1
DB
C:3n-1
R0
R1
Rn-1
C00 C01
n
R0
R1
Rn-1
DB
C:0
DB
C:1
DB
C:n-1
A00 A01 A0n-1
A10 A1n-1
An-1n-1
C0 Cn-1DBC:n DBC:n+1 DBC:2n-1
B00
B10
Bn-10
A00 A01 A0n-1
compulsory shifts
overhead shifts
B00 B10 Bn-10
compulsory shifts
overhead shifts
C1
B01
B11
Bn-11
C (Bank-2)Ã (Bank-0) ~B (Bank-1)
Latency comparison vs SRAM
q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)
35
0
0.5
1
1.5
2
2.5
3
4 8 16 32 64 128 256 512 1024 2048 Avg
Nor
mal
ized
late
ncy
Tensor size
SRAM RTM-naïve RTM-opt RTM-opt-ps
[LCTES’19]
Energy comparison vs SRAM
q Higher savings due to less leakage powerq 74% average improvement
36
0
0.2
0.4
0.6
0.8
1
1.2
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
4 8 16 32 64 128 256 512 1024 2048
Nor
mal
ized
ene
rgy
Tensor size
Leakage Energy Read/Write Energy Shift Energy
[LCTES’19]
Discussion
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Summary
q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability
q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)
q Expression DSLs to ease high-level manipulation and transformation
q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201938
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
References
[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011
[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).
[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.
[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,
ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.
[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.
[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.
[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.
[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.
[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.
[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp
[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019
[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.
41
top related