TRANSCRIPT
Deep3: Leveraging Three Levels of Parallelism for Efficient Deep Learning [1]
Bita Darvish Rouhani, Azalia Mirhoseini, Farinaz Koushanfar
DAC - 2017
Presenter: Mohammad Motamedi
• The computational cost of deep neural networks hinders their widespread usage.
• Challenges
  • The first challenge is the costly, iterative nature of deep network training.
  • The second challenge is mapping the computations onto increasingly multi-core and/or heterogeneous modern architectures.
• Solutions come from two communities:
  • Data scientists
  • Computer engineers
LEPS – UC Davis 2
• Hypothesis
  • Devising platform-aware signal transformations can strongly benefit the underlying learning task through holistic customization to the limits of the hardware resources.
• Deep3
  • An automated, end-to-end learning framework that optimizes the training procedure based on hardware constraints.
• Identifying the intrinsic platform characteristics
• Proposing an optimal graph traversal
• Dividing the network into multiple smaller networks
• Distributed, asynchronous training
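A minimal sketch of the "dividing the network into multiple smaller networks" step, assuming a simple random-subset rule (the function name and sampling rule are illustrative, not the paper's exact method):

```python
import numpy as np

def subsample_network(layer_sizes, keep_ratio, rng):
    """Pick a random subset of units per layer; the induced sub-network
    is what one worker would train asynchronously.
    (Illustrative sketch; not the paper's exact sampling rule.)"""
    return [rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
            for n in layer_sizes]

rng = np.random.default_rng(0)
idx = subsample_network([64, 128, 10], keep_ratio=0.25, rng=rng)
print([len(i) for i in idx])  # [16, 32, 2]
```

Each worker then only receives and updates the weights indexed by its own subset, which keeps per-worker communication and computation small.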
Send_Thread(threadID, SG, SL, history):
    send_count = 0
    while not done_flag:
        IndexL = NetworkSubsampling(SL)        # pick a sub-network index set
        DL_local = SG.get_weights(IndexL)      # slice the global weights
        comm.send(DL_local, dest=threadID)     # ship them to the worker
        history[threadID].append(IndexL)
        send_count += 1
Receive_Thread(threadID, Q, Q_Lock):
    Rcounter = 0
    while not done_flag:
        delta_WL = comm.recv(source=threadID)  # worker's weight update
        lock(Q_Lock)
        Q.put([delta_WL, threadID, Rcounter])  # enqueue for the coordinator
        release(Q_Lock)
        Rcounter += 1
Main:
    if Pid == 0:  # Parameter Coordinator
        Q = Queue()
        DLglob = RandomInitialization()
        itr = 0
        history = []
        done_flag = False
        for proc in range(Np):                 # one send/receive pair per worker
            createSendThread(proc)
            createReceiveThread(proc)
        while error() or itr < Max_itr:
            [delta_WL, threadID, Rcounter] = Q.get()
            IndexL = history[threadID][Rcounter]       # sub-network this delta belongs to
            DLglob = SG.get_weights(IndexL)
            SG.set_weights(DLglob + delta_WL, IndexL)  # apply the asynchronous update
            itr = itr + 1
        done_flag = True
        broadcast(done_flag)
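The coordinator/worker pattern in the pseudocode above can be sketched as a runnable toy in Python, with threads standing in for processes and a constant vector standing in for a worker's gradient step (all names and constants below are illustrative):

```python
import threading, queue
import numpy as np

# Global weight vector plus a work queue, mimicking the coordinator/worker
# split in the pseudocode. (Sketch only: real workers would run SGD on a
# sub-network and send deltas over comm, not a constant step.)
W = np.zeros(8)
Q = queue.Queue()
N_WORKERS, PUSHES = 2, 5

def worker(tid, rng):
    for _ in range(PUSHES):
        idx = rng.choice(len(W), size=4, replace=False)  # NetworkSubsampling stand-in
        delta = np.full(len(idx), 0.1)                   # stand-in for a gradient step
        Q.put((tid, idx, delta))

def coordinator():
    for _ in range(N_WORKERS * PUSHES):
        tid, idx, delta = Q.get()   # asynchronous: apply deltas in arrival order
        W[idx] += delta             # SG.set_weights(DLglob + delta_WL, IndexL)

threads = [threading.Thread(target=worker, args=(t, np.random.default_rng(t)))
           for t in range(N_WORKERS)] + [threading.Thread(target=coordinator)]
for t in threads: t.start()
for t in threads: t.join()
print(round(W.sum(), 6))  # total update mass: 2 workers * 5 pushes * 4 indices * 0.1
```

Because updates are applied in whatever order they arrive, the final weights depend on scheduling, but the total applied update does not; that is the essence of the asynchronous scheme.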
Runtime model:

T ≈ n_push × ( α·Σ_{s=1}^{S−1} n_s·n_{s+1} + β·Σ_{s=1}^{S} n_s + T_BP )
  + 2·n_push × ( latency + N_bit·Σ_{s=1}^{S−1} n_s·n_{s+1} / BW )

where n_s is the number of neurons in layer s of an S-layer network, T_BP is the back-propagation time, N_bit the number of bits per weight, and BW the communication bandwidth.
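The runtime model above can be evaluated numerically as a quick sanity check (the function name and all constant values below are illustrative assumptions, not from the paper):

```python
def deep3_runtime(n, n_push, alpha, beta, T_BP, latency, N_bit, BW):
    """Evaluate the slide's runtime model for layer sizes n[0..S-1].
    alpha/beta are per-operation costs; all constants are placeholders."""
    S = len(n)
    mac = sum(n[s] * n[s + 1] for s in range(S - 1))  # per-pass multiply-accumulates
    act = sum(n)                                       # per-pass activations
    compute = n_push * (alpha * mac + beta * act + T_BP)
    comm = 2 * n_push * (latency + N_bit * mac / BW)   # round-trip per push
    return compute + comm

# Made-up numbers for a tiny 64-32-10 network on a 1 Gb/s link:
t = deep3_runtime([64, 32, 10], n_push=100, alpha=1e-9, beta=1e-9,
                  T_BP=1e-4, latency=1e-5, N_bit=32, BW=1e9)
print(t > 0)  # True
```

Splitting the estimate into a compute term and a communication term is what lets the framework trade sub-network size against link latency and bandwidth for a given platform.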
• Platform 1: Jetson TK1
  • 192 CUDA cores
  • Quad-core ARM Cortex-A15
• Platform 2: Intel Core i7 hosting a Xilinx Virtex-6
• Platform 3: Intel Core i7-2600K
• Remote sensing
  • Sample classes: corn, wheat, woods, soybeans
• Human activity recognition
  • Accelerometer and gyroscope data
  • Classes: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
• Audio: word classification
  • 26 classes
1. Rouhani, Bita Darvish, Azalia Mirhoseini, and Farinaz Koushanfar. "Deep3: Leveraging three levels of parallelism for efficient deep learning." Proceedings of the 54th Annual Design Automation Conference (DAC). ACM, 2017.