soc programming in the era of the internet of things, machine learning and emerging ... ·...

SoC programming in the era of the Internet of Things, machine learning and emerging technologies

Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019

Systems on Chip (SoC): Evolution

q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades

3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

L2A15L1

VLIW DSP

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming

Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

Exciting innovations in q Modeling languagesq Programming languages and compilers

q Costs models of hardwareq System simulators

q Design space exploration (DSE) methodologies

New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads

q Complexity of emerging technologies

Dataflow and Hybrid DSE

Dataflow programming

q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)

PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);

case TransInvalid:assert(false);break;

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:

PrintForVPUmap(D, S, strM appingFileName, streamFactory);

case TransInvalid:assert(false);break;

&BasePath) {assert(transTarget != TransInvalid);

TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();

PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

Dataflow compilation

q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation

q Runtime systemsq Code generation for heterogeneous multicores

AHB Bus

LM: Local memorySM: Shared memory

src snk

qntdct

dctvle

dct init

MJPEG application

[Springer14]

Example: HW acceleration and SW-defined radio

q Application: MIMO OFDM receiverq Hardware

q Platform 1: Baseline software q Platform 2: Optimized software

q Platform 3: Optimized SW + HW

ARM Mem

VLIW VLIW VLIW

src sink

Library 1: SW implementations

Library 2: SW optimized

MemMem

ARM Mem

VLIW FFT Viterbi

src sink

Custom Interconnect

Demap Library 3: SW+HW accel.7680

128000

1758241,758

1,00E+00

1,00E+02

1,00E+04

1,00E+06

1)bsp1 (sw,

unoptimized)

2)bsp2 (sw,optimized)

3)bsp3 (hw)

Achieved rate @ 100 MHz

[SDR’10, ALOG’11]

System dynamics

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)

P1 P2 P3

Meta-information & configurations

Changes @ runtime

[PARMA-DITAM’18]

Exploiting symmetries

q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)

q Symmetry: Allows transformations w/o changing the outcome

è No need to analyze all possible mappings (prune search space)

è Work on formalization via inverse semi-groups and efficient algorithms

Symmetries in Odroid: Example

PE6PE5

PE4PE3

PE2PE1

PE6PE5

PE4PE3

PE2PE1

PE5PE6

PE6PE7

Mappings Architecture Subgraphs

Cortex A7 Cortex A15

Equivalent mappings

Graph isomorphism

Mappings Architecture subgraphs

[IESS’15, ACM TACO’17]

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[SCOPES’17b]

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic

●●●

CFS Dyn T1 T2 T3

●●

CFS Dyn T1 T2 T3

●●●●

●●

CFS Dyn T1 T2 T3

l−cl

Single AF

[Castrill12]

[SCOPES’17b]

Flexible mappings: Multi-application results (1)

●●

●● ●

●●

CFS Dyn T1 T2 T3

l−cl

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●●●●●●

●●●●●●●

AF MIMO

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

AF MIMO

l−clo

[s] More

predictable performance

Comparable performance to dynamic mapping

●●●●●●

●●●●

●●●

● ● ● ●

● ●

●●

● ● ● ●

CFS Dyn T1 T2 T3

[SCOPES’17b]

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping

optimum

design center in feasible region

feasible region

[SCOPES’17a]

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)

MIMO-OFDM

[SCOPES’17a]

Machine Learning & emerging technologies

ML revolution: Frameworks and architectures

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Lots of traction in hardware architectures: TPU, V100, …

q Lot’s of resources for training, less on inference on edge-devices!

[Chen, OSDI’18]Example flow: TVM

Domain-specific abstractions

q Commonality: Tensor expression languages

q Increase programmer’s productivityq From compiler perspective: No abstraction toll

q Easier access to informationq Larger score for optimization

var input A : matrix &var input u : tensorIN &

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

[RWDSL’18]

Correct by construction

q Especially important in embedded systemsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation

[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

!"# = %&'(

)*&"+&#

TeML: Results

q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow

[GPCE’18]

Emerging technologies: Racetrack memories

q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures

q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains

q One port shared for many bitsq Domains move at high speeds (1000 ms-1)

q Sequential: Game changer for current HW/SW stackq Memory management

q Integration with other memory architectures q Data layout and allocation

[Parkin-Nature’15]

[TVLSI’18, TVLSI’19]

Compiler research on placement

q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations

q Compiler pass to reorganize data for higher-level objectsq Variable allocation

q Tensor allocation and program transformation

[ISLPED’19]

[ArXiv’19]

[LCTES’19]

Simulating RTMs

q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …

https://github.com/tud-ccc/RTSim[IEEE CAL’19]

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

33[LCTES’19]

Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

34[LCTES’19]

C:2n+1

C:3n-1

C00 C01

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C (Bank-2)Ã (Bank-0) ~B (Bank-1)

Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

4 8 16 32 64 128 256 512 1024 2048 Avg

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]

Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

4 8 16 32 64 128 256 512 1024 2048

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]

Discussion

Summary

q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability

q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)

q Expression DSLs to ease high-level manipulation and transformation

q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)

References

[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).

[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,

ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.

[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.

[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.

[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.

[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.

[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.

[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019

[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.

soc programming in the era of the internet of things, machine learning and emerging ... ·...

Documents

programming ubiquitous things · 3/23/2021 1 © paulo...

smooth operator: programming the little things that games...

programming in c: how to get things...

trump voters and criminal justice reform table 1-1...

a programming language for the internet of things · a...

programming webassembly with rustmedia.pragprog.com ›...

learning by playing, playing by learning · basic...

object-oriented programming in...

· 0016/soc/soc/11-12 krushna chandra sahoo 16...

third-party product abstraction for internet of things...

zynq-7000 all programmable soc software …...zynq-7000 ap...

linear programming on recursive additive...

cs12230 introduction to programming tying things together

miwot: an application-oriented programming supporting...

6 things about programming that every computer programmer...

programming the internet of things: why devices need apis

programming for the internet of things

asynchronous programming deadlock all the things!

facilitating the internet of things with policy...

iotivity connecting things with iot - tizen · programming...