
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Tal Ben-Nun and Torsten Hoefler, ETH Zurich

Presenter: Derrick Blakely (University of Virginia), https://qdata.github.io/deep2Read

March 6, 2019


Outline

1 Background

2 Parallel Computing and Communication

3 Neural Network Concurrency

4 Parameter Sharing and Consistency

5 Frameworks

6 Conclusion


Background


Deep Learning is Using More Nodes


More GPUs, More MPI


Parallel Computing and Communication


Parallel Computing Basics: Computation Graphs

Critical path: the longest chain of dependent operations; it determines the earliest possible completion time

Depth D: time needed to execute the critical path

T1 = W: execution time on a single processor, equal to the total work

T∞ = D: execution time with unlimited processors

Average parallelism: T1/T∞ = W/D
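
To make these quantities concrete, the following is a minimal sketch (not from the slides; the toy DAG and unit node costs are assumptions) that computes the total work W, the critical-path depth D, and the average parallelism W/D for a small computation graph:

    # Toy computation DAG: node -> list of predecessors, one unit of work per node.
    graph = {
        "a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"], "f": ["d", "e"],
    }

    def work_and_depth(dag):
        """Return (W, D): total work and critical-path length for unit-cost nodes."""
        depth = {}
        def longest_path_to(node):
            if node not in depth:
                preds = dag[node]
                depth[node] = 1 + (max(longest_path_to(p) for p in preds) if preds else 0)
            return depth[node]
        W = len(dag)                               # one unit of work per node
        D = max(longest_path_to(n) for n in dag)   # length of the critical path
        return W, D

    W, D = work_and_depth(graph)
    print(f"W = {W}, D = {D}, average parallelism W/D = {W / D:.2f}")  # W = 6, D = 4, 1.50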


Parallel Computing Basics: Granularity

Granularity: G ≈ T_comp / T_comm

Alternatively: G = min_{n∈V} w(n) / max_{e∈E} c(e), where w(n) is the computation cost of node n and c(e) is the communication cost of edge e


Goals

Maximize parallelism

Minimize communication and load imbalance

Tradeoff: high parallelism vs. low communication

Tradeoff: high load balance vs. low communication

High parallelism and high load balance are often compatible


Communication

Want to perform the following AllReduce:

y = x_1 ⊕ x_2 ⊕ … ⊕ x_{P−1} ⊕ x_P

P: number of processing elements

x_i: length-m vector of data items stored on processing element i

γ: size of a data item (bytes)

G : computation cost per byte

L: network latency


Naive: Linear-Depth Reduction

T = γmG(P − 1) + L(P − 1)

Average parallelism: T1/T∞ = W/D = (P − 1)/(P − 1) = 1


Tree and Butterfly AllReduce

Tree: T = 2γmG log P + 2L log P

Butterfly: T = γmG log P + L log P


Ring (or Pipeline) AllReduce
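
The original slide shows this algorithm as a figure. As a stand-in, here is a minimal single-process simulation of the ring pattern (a reduce-scatter phase followed by an all-gather phase); the worker count, chunk layout, and the use of + as the reduction operator are illustrative assumptions, not from the slides:

    # Single-process simulation of ring AllReduce: reduce-scatter, then all-gather.
    # data[p][c] is chunk c held by worker p (a list of numbers); after the call,
    # every worker holds the elementwise sum over all workers.
    def ring_allreduce(data):
        P = len(data)
        # Reduce-scatter: in step s, worker p sends chunk (p - s) mod P to worker p+1,
        # which adds it into its own copy of that chunk.
        for s in range(P - 1):
            for p in range(P):
                c = (p - s) % P
                dst = (p + 1) % P
                data[dst][c] = [a + b for a, b in zip(data[dst][c], data[p][c])]
        # All-gather: in step s, worker p forwards its completed chunk (p + 1 - s) mod P.
        for s in range(P - 1):
            for p in range(P):
                c = (p + 1 - s) % P
                dst = (p + 1) % P
                data[dst][c] = list(data[p][c])
        return data

    # Example: 4 workers, each holding the value p+1 in a vector of 4 chunks of size 2.
    workers = [[[float(p + 1)] * 2 for _ in range(4)] for p in range(4)]
    ring_allreduce(workers)
    print(workers[0])  # every worker now holds [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0], [10.0, 10.0]]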


AllReduce

Ring: T = 2γmG(P − 1)/P + 2L(P − 1)

Reduce-scatter + all-gather: T = 2γmG(P − 1)/P + 2L log P

Lower bound: T ≥ 2γmG(P − 1)/P + L log P
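
A quick way to compare these bounds is to plug in numbers. The sketch below evaluates each cost model from the preceding slides; the parameter values are made-up examples, not measurements from the paper:

    from math import log2

    P = 16            # processing elements
    m = 25_000_000    # data items per vector (e.g., gradient elements)
    gamma = 4         # bytes per item (float32)
    G = 1e-9          # computation cost per byte (seconds)
    L = 5e-6          # network latency per message (seconds)

    costs = {
        "naive linear reduction":       gamma*m*G*(P - 1) + L*(P - 1),
        "tree":                         2*gamma*m*G*log2(P) + 2*L*log2(P),
        "butterfly":                    gamma*m*G*log2(P) + L*log2(P),
        "ring":                         2*gamma*m*G*(P - 1)/P + 2*L*(P - 1),
        "reduce-scatter + all-gather":  2*gamma*m*G*(P - 1)/P + 2*L*log2(P),
    }
    for name, t in costs.items():
        print(f"{name:>28s}: {t * 1e3:8.2f} ms")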


Neural Network Concurrency


Within-Layer Concurrency


Data Parallelism

Simple and efficient

The model must be replicated on every worker, so large models can exhaust GPU memory
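
Below is a minimal data-parallel training sketch using PyTorch DistributedDataParallel (one of the AllReduce-based setups mentioned later under Frameworks). It is illustrative only: the model, batch shapes, and launch via torchrun are assumptions, not part of the slides.

    # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def main():
        dist.init_process_group(backend="nccl")           # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(1024, 10).cuda(local_rank)      # stand-in model (assumption)
        model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

        for _ in range(100):                              # dummy training loop
            x = torch.randn(32, 1024, device=local_rank)  # each rank gets its own minibatch shard
            y = torch.randint(0, 10, (32,), device=local_rank)
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()                               # DDP all-reduces gradients here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()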


Model Parallelism

Conserves GPU memory

Can work well with multiple GPUs on the same system

The full minibatch must be sent to every worker

Back-prop requires all-to-all communication
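
As a concrete (hypothetical) illustration of within-layer model parallelism, the sketch below splits one linear layer's output neurons across two GPUs. The class name, sizes, and two-GPU assumption are mine, not from the slides; note how the full minibatch goes to every shard and the partial outputs must be gathered, which is where the all-to-all style communication arises.

    import torch
    import torch.nn as nn

    class ColumnParallelLinear(nn.Module):
        """One linear layer whose output neurons are sharded across devices."""
        def __init__(self, in_features, out_features, devices=("cuda:0", "cuda:1")):
            super().__init__()
            assert out_features % len(devices) == 0
            shard = out_features // len(devices)
            self.devices = devices
            self.shards = nn.ModuleList(
                nn.Linear(in_features, shard).to(d) for d in devices
            )

        def forward(self, x):
            # Broadcast the same minibatch to every shard, compute partial outputs,
            # then gather (concatenate) the column shards on the first device.
            outs = [shard(x.to(d)) for shard, d in zip(self.shards, self.devices)]
            return torch.cat([o.to(self.devices[0]) for o in outs], dim=1)

    layer = ColumnParallelLinear(512, 256)   # assumes 2 visible GPUs
    y = layer(torch.randn(32, 512))          # y lives on cuda:0, shape (32, 256)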


Layer-by-layer Concurrency: Pipelining

Avoids out-of-memory errors

Sparse communication: each GPU only communicates with the next GPU in the pipeline

Must ensure the computation of the stages actually overlaps

Latency is linear in the number of processors (pipeline depth)
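
A naive two-stage pipeline sketch is below (purely illustrative; the stage contents, devices, and micro-batch count are assumptions). Splitting the minibatch into micro-batches is what lets the stages work on different pieces at once; a real pipeline schedule (e.g., GPipe-style) manages this explicitly, along with the backward pass and buffering.

    import torch
    import torch.nn as nn

    # First half of the model on cuda:0, second half on cuda:1 (assumes 2 GPUs).
    stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
    stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def pipelined_forward(x, micro_batches=4):
        outputs = []
        for chunk in x.chunk(micro_batches):
            h = stage0(chunk.to("cuda:0"))           # stage 0 processes a micro-batch...
            outputs.append(stage1(h.to("cuda:1")))   # ...and only talks to the next stage
        return torch.cat(outputs)

    y = pipelined_forward(torch.randn(64, 1024))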


Graph Parallelism


Parameter Sharing and Consistency


Centralized Parameter Sharing: Parameter Server

T = 2PγmG + 2L

Ensures consistency

“Stragglers” cause poor utilization
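
A toy synchronous parameter-server step, written as plain NumPy to show the push/aggregate/pull pattern (single-process sketch; the array sizes and learning rate are arbitrary assumptions):

    import numpy as np

    def parameter_server_step(params, worker_grads, lr=0.1):
        """params: np.ndarray; worker_grads: list of np.ndarray, one per worker."""
        avg_grad = np.mean(worker_grads, axis=0)   # server-side reduction of pushed gradients
        params -= lr * avg_grad                    # server applies the update
        return params                              # every worker pulls the same copy

    params = np.zeros(4)
    grads = [np.random.randn(4) for _ in range(8)]   # P = 8 workers push their gradients
    params = parameter_server_step(params, grads)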


Asynchronous Parameter Server

Better utilization, faster training

Slow agents cause parameter divergence


Stale Synchronous Parameter Server

Tradeoff between statistical performance (convergence per update) and hardware performance (utilization, updates per second)

Having a parameter server at all can still bottleneck training


Decentralized Parameter Sharing

T = 2γmG(P − 1)/P + 2L log P

MPI or NCCL provides efficient AllReduce implementations out of the box

Avoids the parameter-server bottleneck

The “straggler” problem remains
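
In code, decentralized sharing amounts to an explicit AllReduce over gradients. A minimal sketch with torch.distributed is below (it assumes the process group is already initialized, e.g., as in the data-parallelism example above):

    import torch.distributed as dist

    def average_gradients(model):
        """All-reduce and average the gradients of every parameter, in place."""
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # backend picks ring/tree
                p.grad /= world_size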


Stale Synchronous Decentralized Parameter Sharing


Asynchronous Decentralized Parameter Sharing


Frameworks


Neural Net Frameworks

TensorFlow: supports both parameter-server training and AllReduce (MPI, NCCL, TCP/IP)

PyTorch: AllReduce with MPI, NCCL, or Gloo


Why not use Hadoop or Spark?

Don’t natively support GPU runtimes

Require the JVM, which adds overhead

Designed for fault tolerance, not speed

Only support data parallelism

NN training is cyclic: the same working set must be revisited many times, a poor fit for acyclic MapReduce-style dataflow

Execution must be synchronous

TensorFlow and PyTorch are better optimized for deep learning


Conclusion


Ongoing Distributed ML Challenges

1 Optimization and Theory

2 Make algorithms more scalable

3 Better software for distributing ML

4 Develop distributed ML systems: consistency, fault tolerance, communication latency, resource management


Up-and-Coming ML Topics

Impact of compression and quantization?

Challenges unique to networks with dynamic control flow?

Best ways to implement and utilize graph parallelism?

Hierarchical tasks?

Automated architecture searches?


References

Ben-Nun, Tal, and Torsten Hoefler. “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.” arXiv preprint arXiv:1802.09941 (2018).
