autotm: automatic tensor movement in heterogeneous …
TRANSCRIPT
![Page 1: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/1.jpg)
AutoTM: Automatic Tensor Movement in HeterogeneousMemory Systems using Integer Linear Programming
Mark Hildebrand1,Jawad Khan2, Sanjeev Trika2,
Jason Lowe-Power1, Venkatesh Akella1
1 University of California, Davis2 Intel Corporation
https://github.com/darchr/AutoTM
March 12, 2020
1/29
![Page 2: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/2.jpg)
Executive Summary
ProblemAutomatic two-level memory management for Deep Neural Networks
Idea• Profile Guided Optimization
• Model as an Integer Linear Program (ILP)
Results• Replace 50-80% DRAM with NVDIMMs with geometric mean 27.1% performance
loss.
• 3x better performance than real hardware cache.
2/29
![Page 3: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/3.jpg)
Outline
Background
AutoTMProfilingILP Modeling
Results
Wrap Up
3/29
![Page 4: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/4.jpg)
Why Deep Neural Networks
to train large models on a single machine?Can we use multiple levels of memory
Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
4/29
![Page 5: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/5.jpg)
Why Deep Neural Networks
to train large models on a single machine?Can we use multiple levels of memory
Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
4/29
![Page 6: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/6.jpg)
Why Deep Neural Networks
to train large models on a single machine?Can we use multiple levels of memory
Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
4/29
![Page 7: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/7.jpg)
Heterogeneous Memory Systems
• Two types of memory.
• Same memory controller.
• Both are byte addressable.
• NVDIMMs for high capacity and low cost
Challenges
• All tensors in NVDIMMs memory is too slow.
• DRAM as a cache for NVDIMMs also too slow.
• Intelligent memory management required.
NVDIMM Style
5/29
![Page 8: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/8.jpg)
Heterogeneous Memory Systems
• Two types of memory.
• Same memory controller.
• Both are byte addressable.
• NVDIMMs for high capacity and low cost
Challenges
• All tensors in NVDIMMs memory is too slow.
• DRAM as a cache for NVDIMMs also too slow.
• Intelligent memory management required.
NVDIMM Style
5/29
![Page 9: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/9.jpg)
Outline
Background
AutoTMProfilingILP Modeling
Results
Wrap Up
6/29
![Page 10: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/10.jpg)
AutoTM
GoalMinimize execution time
• Arbitrary computation graph
• Size constraint on fast memory
How• Place tensors in fast or slow memory.
• Optimal tensor movement
Strategy
• Profile kernel performance.
• Model tensor assignment as ILP.
7/29
![Page 11: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/11.jpg)
AutoTM
GoalMinimize execution time
• Arbitrary computation graph
• Size constraint on fast memory
How• Place tensors in fast or slow memory.
• Optimal tensor movement
Strategy
• Profile kernel performance.
• Model tensor assignment as ILP.
7/29
![Page 12: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/12.jpg)
AutoTM
GoalMinimize execution time
• Arbitrary computation graph
• Size constraint on fast memory
How• Place tensors in fast or slow memory.
• Optimal tensor movement
Strategy
• Profile kernel performance.
• Model tensor assignment as ILP.
7/29
![Page 13: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/13.jpg)
Kernel Profiling
Profile performance of kernels for all tensorIO locations.
Kernel IO Tensor Locations
K2
T1 T2 T3
DRAM DRAM DRAMDRAM DRAM PMMDRAM PMM DRAMDRAM PMM PMMPMM DRAM DRAMPMM DRAM PMMPMM PMM DRAMPMM PMM PMM
Table: Profile space for kernel K2.
DRAMDRAMDRAM
DRAMDRAMPMM
DRAMPMMDRAM
DRAMPMMPMM
PMMDRAMDRAM
PMMDRAMPMM
PMMPMMDRAM
PMMPMMPMM
0
1
2
Data In:Weight:
Data Out:
Perform
ance
relative
toallIO
inDRAM
8/29
![Page 14: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/14.jpg)
Kernel Profiling
Profile performance of kernels for all tensorIO locations.
Kernel IO Tensor Locations
K2
T1 T2 T3
DRAM DRAM DRAMDRAM DRAM PMMDRAM PMM DRAMDRAM PMM PMMPMM DRAM DRAMPMM DRAM PMMPMM PMM DRAMPMM PMM PMM
Table: Profile space for kernel K2.
DRAMDRAMDRAM
DRAMDRAMPMM
DRAMPMMDRAM
DRAMPMMPMM
PMMDRAMDRAM
PMMDRAMPMM
PMMPMMDRAM
PMMPMMPMM
0
1
2
Data In:Weight:
Data Out:
Perform
ance
relative
toallIO
inDRAM
8/29
![Page 15: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/15.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
9/29
![Page 16: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/16.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
10/29
![Page 17: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/17.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
11/29
![Page 18: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/18.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
12/29
![Page 19: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/19.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
13/29
![Page 20: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/20.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
14/29
![Page 21: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/21.jpg)
Tensor Lifetime Flow Network
Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.
15/29
![Page 22: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/22.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk +∑t∈T
Mt
16/29
![Page 23: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/23.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk +∑t∈T
Mt
16/29
![Page 24: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/24.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk︸ ︷︷ ︸Kernel
ExecutionTime
+∑t∈T
Mt
K : Set of Kernels
ρk : Run time of kernel k
ExampleRun time of kernel k2
17/29
![Page 25: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/25.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk︸ ︷︷ ︸Kernel
ExecutionTime
+∑t∈T
Mt
K : Set of Kernels
ρk : Run time of kernel k
ExampleRun time of kernel k2
17/29
![Page 26: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/26.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk︸ ︷︷ ︸Kernel
ExecutionTime
+∑t∈T
Mt︸ ︷︷ ︸Tensor
MovementTime
T : Set of Tensors
Mt : Time moving tensor t
ExampleTime moving tensor t1
18/29
![Page 27: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/27.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk︸ ︷︷ ︸Kernel
ExecutionTime
+∑t∈T
Mt︸ ︷︷ ︸Tensor
MovementTime
T : Set of Tensors
Mt : Time moving tensor t
ExampleTime moving tensor t1
18/29
![Page 28: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/28.jpg)
ILP Modeling
Objective Function
Computation time
min∑k∈K
ρk︸ ︷︷ ︸Kernel
ExecutionTime
+∑t∈T
Mt︸ ︷︷ ︸Tensor
MovementTime
ConstraintsLimit DRAM at eachkernel∑t∈L(k)
‖t‖IDRAMt,k ≤ Limit ∀k
19/29
![Page 29: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/29.jpg)
Variations of AutoTM
Name DescriptionPMM
SystemGPU
System
Static Tensor’s can’t move 3 7
Synchronous Tensor’s move but blockcomputation
3 3
Asynchronous Tensor movement con-current with computa-tion
l 3
20/29
![Page 30: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/30.jpg)
Outline
Background
AutoTMProfilingILP Modeling
Results
Wrap Up
21/29
![Page 31: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/31.jpg)
Experiments!
Software
• Modified the ngraph1 compiler.
• Julia’s JuMP2 package for ILPmodeling.
• Gurobi3 as the ILP solver.
Hardware
• 1.5 TB OptaneTM DC PMM
• 384 GiB DRAM
Workloads −−−−−−−−−−−−−−→
Conventional Batchsize Memory (GB)
Inception v4 1024 111Vgg 19 2048 143
Resnet 200 512 132DenseNet 264 512 115
Large Batchsize Memory (GB)
Inception v4 6144 659Vgg 416 128 658
Resnet 200 2560 651DenseNet 264 3072 688
1https://github.com/NervanaSystems/ngraph2https://github.com/JuliaOpt/JuMP.jl3gurobi.com
22/29
![Page 32: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/32.jpg)
Experiments!
Software
• Modified the ngraph1 compiler.
• Julia’s JuMP2 package for ILPmodeling.
• Gurobi3 as the ILP solver.
Hardware
• 1.5 TB OptaneTM DC PMM
• 384 GiB DRAM
Workloads −−−−−−−−−−−−−−→
Conventional Batchsize Memory (GB)
Inception v4 1024 111Vgg 19 2048 143
Resnet 200 512 132DenseNet 264 512 115
Large Batchsize Memory (GB)
Inception v4 6144 659Vgg 416 128 658
Resnet 200 2560 651DenseNet 264 3072 688
1https://github.com/NervanaSystems/ngraph2https://github.com/JuliaOpt/JuMP.jl3gurobi.com
22/29
![Page 33: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/33.jpg)
Experiments!
Software
• Modified the ngraph1 compiler.
• Julia’s JuMP2 package for ILPmodeling.
• Gurobi3 as the ILP solver.
Hardware
• 1.5 TB OptaneTM DC PMM
• 384 GiB DRAM
Workloads −−−−−−−−−−−−−−→
Conventional Batchsize Memory (GB)
Inception v4 1024 111Vgg 19 2048 143
Resnet 200 512 132DenseNet 264 512 115
Large Batchsize Memory (GB)
Inception v4 6144 659Vgg 416 128 658
Resnet 200 2560 651DenseNet 264 3072 688
1https://github.com/NervanaSystems/ngraph2https://github.com/JuliaOpt/JuMP.jl3gurobi.com
22/29
![Page 34: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/34.jpg)
Scaling Performance - Inception V4
Performance of Inception v4 - Batchsize 1024
0 20 40 60 80 100 120
1
2
3
4
5
Dram Limit (GB)Lower is Better
SlowdownLower is Better
synchronous
23/29
![Page 35: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/35.jpg)
Scaling Performance - Inception V4
Performance of Inception v4 - Batchsize 1024
0 20 40 60 80 100 120
1
2
3
4
5
Dram Limit (GB)Lower is Better
SlowdownLower is Better
synchronous
◦ Just using PMM is too slow.
24/29
![Page 36: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/36.jpg)
Scaling Performance - Inception V4
Performance of Inception v4 - Batchsize 1024
0 20 40 60 80 100 120
1
2
3
4
5
Dram Limit (GB)Lower is Better
SlowdownLower is Better
synchronous
◦ Just using PMM is too slow.24/29
![Page 37: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/37.jpg)
Scaling Performance - Inception V4
Performance of Inception v4 - Batchsize 1024
0 20 40 60 80 100 120
1
2
3
4
5
Dram Limit (GB)Lower is Better
SlowdownLower is Better
synchronous
◦ Best performance when working-set fits in memory.
25/29
![Page 38: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/38.jpg)
Scaling Performance - Inception V4
Performance of Inception v4 - Batchsize 1024
0 20 40 60 80 100 120
1
2
3
4
5
Dram Limit (GB)Lower is Better
SlowdownLower is Better
synchronous
◦ Best performance when working-set fits in memory.25/29
![Page 39: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/39.jpg)
Comparison Against 2LM
Vgg416(320)
Inception v4 (6144)
Resnet200 (2560)
DenseNet 264 (3072)
0
1
2
3
Speedup over 2LMHigher is Better
static-AutoTM sync-AutoTM
2LM DRAM Cache
26/29
![Page 40: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/40.jpg)
Comparison Against 2LM
Vgg416(320)
Inception v4 (6144)
Resnet200 (2560)
DenseNet 264 (3072)
0
1
2
3
Speedup over 2LMHigher is Better
static-AutoTM sync-AutoTM
2LM DRAM Cache
26/29
![Page 41: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/41.jpg)
Comparison Against 2LM
Vgg416(320)
Inception v4 (6144)
Resnet200 (2560)
DenseNet 264 (3072)
0
1
2
3
Speedup over 2LMHigher is Better
static-AutoTM sync-AutoTM
2LM DRAM Cache • Avoid Dirty Writebacks
• Lower Memory Contention
26/29
![Page 42: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/42.jpg)
Comparison Against 2LM
Vgg416(320)
Inception v4 (6144)
Resnet200 (2560)
DenseNet 264 (3072)
0
1
2
3
Speedup over 2LMHigher is Better
static-AutoTM sync-AutoTM
2LM DRAM CacheSoftware managementoutperforms hardwaremanagement by up to 3x.
26/29
![Page 43: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/43.jpg)
Outline
Background
AutoTMProfilingILP Modeling
Results
Wrap Up
27/29
![Page 44: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/44.jpg)
Limitations
• Static computation graphs.
• Kernel profiling overhead.
• ILP solution times.
• ILP solution may be hard to interpret.
Vgg19 Inception v4 Resnet200 DenseNet 264
101
102
103
ILP Solution Time (s)Logarithmic
static-AutoTM sync-AutoTM
28/29
![Page 45: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/45.jpg)
Limitations
• Static computation graphs.
• Kernel profiling overhead.
• ILP solution times.
• ILP solution may be hard to interpret.
Vgg19 Inception v4 Resnet200 DenseNet 264
101
102
103
ILP Solution Time (s)Logarithmic
static-AutoTM sync-AutoTM
28/29
![Page 46: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/46.jpg)
Conclusion
AutoTM: A technique for managing tensors in heterogeneous memory systems.
• Profiling for Kernel Performance.
• Use ILP to optimally assign tensor location and movement.
• Three formulations: Static, Synchronous, Asynchronous.
We show
• Reduce DRAM requirement.
• Significant performance improvement over hardware solutions.
Code Available: https://github.com/darchr/AutoTM
29/29
![Page 47: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/47.jpg)
Common Questions
30/29
![Page 48: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/48.jpg)
Asynchronous Movement on PMMs
• Interference between DRAM and PMM.
• Low bandwidth and difficulty of DMA.
• Performance of kernels greatly impacted due to copy kernels.
31/29
![Page 49: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/49.jpg)
GPU
12345678910
64 128 256 32 64 128 32 64 128 64 128Inception v4 Resnet200 DenseNet 264 Vgg19
Speedupover
CudaM
allocM
anag
edsynchronous asynchronous oracle
32/29
![Page 50: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/50.jpg)
RNNs - more complex models
• AutoTM is limited to static computation graphs.• RNNs have dynamic behavior (i.e. unrolling based on sequence length).• RNNs can be implemented statically.• Key ideas from AutoTM can be used for dynamic workloads.
0 20 40 60 80 100 120
0
0.5
1
DRAM Limit (GB)
Percentof
Kernel
IOin
DRAM
static: input tensorsstatic: output tensors
synchronous: input tensorssynchronous: output tensors
33/29
![Page 51: AutoTM: Automatic Tensor Movement in Heterogeneous …](https://reader031.vdocuments.site/reader031/viewer/2022012416/6171835984945e59d13aba02/html5/thumbnails/51.jpg)
Concluding Conclusion
AutoTM: A technique for managing tensors in heterogeneous memory systems.
• Profiling for Kernel Performance.
• Use ILP to optimally assign tensor location and movement.
• Three formulations: Static, Synchronous, Asynchronous.
We show
• Reduce DRAM requirement.
• Significant performance improvement over hardware solutions.
Code Available: https://github.com/darchr/AutoTM
34/29