TRANSCRIPT
Deep3: Leveraging Three Levels of Parallelism for Efficient Deep Learning [1]
Bita Darvish Rouhani, Azalia Mirhoseini, Farinaz Koushanfar
DAC - 2017
Presenter: Mohammad Motamedi
• The computational cost of deep neural networks hinders their widespread usage.
• Challenges
  • The first challenge is the costly, iterative nature of deep network training.
  • The second challenge is mapping the computations onto increasingly multi-core and/or heterogeneous modern architectures.
• Solutions come from two communities:
  • Data scientists
  • Computer engineers
LEPS – UC Davis 2
• Hypothesis
  • Devising platform-aware signal transformations can strongly benefit the underlying learning task through holistic customization to the limits of the hardware resources.
• Deep3
  • An automated, end-to-end learning framework that optimizes the training procedure based on hardware constraints.
• Identifying the intrinsic platform characteristics
• Proposing an optimal graph traversal
• Dividing the network into multiple smaller networks
• Distributed, asynchronous training
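A minimal sketch of the "dividing the network into multiple smaller networks" step, assuming a simple random-subset rule (the function name and sampling rule are illustrative, not the paper's exact method):

```python
import numpy as np

def subsample_network(layer_sizes, keep_ratio, rng):
    """Pick a random subset of units per layer; the induced sub-network
    is what one worker would train asynchronously.
    (Illustrative sketch; not the paper's exact sampling rule.)"""
    return [rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
            for n in layer_sizes]

rng = np.random.default_rng(0)
idx = subsample_network([64, 128, 10], keep_ratio=0.25, rng=rng)
print([len(i) for i in idx])  # [16, 32, 2]
```

Each worker then only receives and updates the weights indexed by its own subset, which keeps per-worker communication and computation small.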
Send_Thread(threadID, SG, SL, history):
    send_count = 0
    while not done_flag:
        IndexL = NetworkSubsampling(SL)        # pick a sub-network index set
        DL_local = SG.get_weights(IndexL)      # slice the global weights
        comm.send(DL_local, dest=threadID)     # ship them to the worker
        history[threadID].append(IndexL)
        send_count += 1
Receive_Thread(threadID, Q, Q_Lock):
    Rcounter = 0
    while not done_flag:
        delta_WL = comm.recv(source=threadID)  # worker's weight update
        lock(Q_Lock)
        Q.put([delta_WL, threadID, Rcounter])  # enqueue for the coordinator
        release(Q_Lock)
        Rcounter += 1
Main:
    if Pid == 0:  # Parameter Coordinator
        Q = Queue()
        DLglob = RandomInitialization()
        itr = 0
        history = []
        done_flag = False
        for proc in range(Np):                 # one send/receive pair per worker
            createSendThread(proc)
            createReceiveThread(proc)
        while error() or itr < Max_itr:
            [delta_WL, threadID, Rcounter] = Q.get()
            IndexL = history[threadID][Rcounter]       # sub-network this delta belongs to
            DLglob = SG.get_weights(IndexL)
            SG.set_weights(DLglob + delta_WL, IndexL)  # apply the asynchronous update
            itr = itr + 1
        done_flag = True
        broadcast(done_flag)
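The coordinator/worker pattern in the pseudocode above can be sketched as a runnable toy in Python, with threads standing in for processes and a constant vector standing in for a worker's gradient step (all names and constants below are illustrative):

```python
import threading, queue
import numpy as np

# Global weight vector plus a work queue, mimicking the coordinator/worker
# split in the pseudocode. (Sketch only: real workers would run SGD on a
# sub-network and send deltas over comm, not a constant step.)
W = np.zeros(8)
Q = queue.Queue()
N_WORKERS, PUSHES = 2, 5

def worker(tid, rng):
    for _ in range(PUSHES):
        idx = rng.choice(len(W), size=4, replace=False)  # NetworkSubsampling stand-in
        delta = np.full(len(idx), 0.1)                   # stand-in for a gradient step
        Q.put((tid, idx, delta))

def coordinator():
    for _ in range(N_WORKERS * PUSHES):
        tid, idx, delta = Q.get()   # asynchronous: apply deltas in arrival order
        W[idx] += delta             # SG.set_weights(DLglob + delta_WL, IndexL)

threads = [threading.Thread(target=worker, args=(t, np.random.default_rng(t)))
           for t in range(N_WORKERS)] + [threading.Thread(target=coordinator)]
for t in threads: t.start()
for t in threads: t.join()
print(round(W.sum(), 6))  # total update mass: 2 workers * 5 pushes * 4 indices * 0.1
```

Because updates are applied in whatever order they arrive, the final weights depend on scheduling, but the total applied update does not; that is the essence of the asynchronous scheme.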
Runtime model:

T ≈ n_push × ( α·Σ_{s=1}^{S−1} n_s·n_{s+1} + β·Σ_{s=1}^{S} n_s + T_BP )
  + 2·n_push × ( latency + N_bit·Σ_{s=1}^{S−1} n_s·n_{s+1} / BW )

where n_s is the number of neurons in layer s of an S-layer network, T_BP is the back-propagation time, N_bit the number of bits per weight, and BW the communication bandwidth.
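The runtime model above can be evaluated numerically as a quick sanity check (the function name and all constant values below are illustrative assumptions, not from the paper):

```python
def deep3_runtime(n, n_push, alpha, beta, T_BP, latency, N_bit, BW):
    """Evaluate the slide's runtime model for layer sizes n[0..S-1].
    alpha/beta are per-operation costs; all constants are placeholders."""
    S = len(n)
    mac = sum(n[s] * n[s + 1] for s in range(S - 1))  # per-pass multiply-accumulates
    act = sum(n)                                       # per-pass activations
    compute = n_push * (alpha * mac + beta * act + T_BP)
    comm = 2 * n_push * (latency + N_bit * mac / BW)   # round-trip per push
    return compute + comm

# Made-up numbers for a tiny 64-32-10 network on a 1 Gb/s link:
t = deep3_runtime([64, 32, 10], n_push=100, alpha=1e-9, beta=1e-9,
                  T_BP=1e-4, latency=1e-5, N_bit=32, BW=1e9)
print(t > 0)  # True
```

Splitting the estimate into a compute term and a communication term is what lets the framework trade sub-network size against link latency and bandwidth for a given platform.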
• Platform 1: Jetson TK1
  • 192 CUDA cores
  • Quad-core ARM Cortex-A15
• Platform 2: Intel Core i7 hosting a Xilinx Virtex-6
• Platform 3: Intel Core i7-2600K
• Remote sensing
  • Sample classes: corn, wheat, woods, soybeans
• Human activity recognition
  • Accelerometer and gyroscope data
  • Classes: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
• Audio: word classification
  • 26 classes
1. Rouhani, Bita Darvish, Azalia Mirhoseini, and Farinaz Koushanfar. "Deep3: Leveraging three levels of parallelism for efficient deep learning." Proceedings of the 54th Annual Design Automation Conference (DAC). ACM, 2017.