
Page 1:

Restricted Boltzmann Machines on Multi-Core Processors

By Sai Prasad Nooka and Stavan Karia

Page 2:

Overview

• What is Machine Learning?

• Brain vs. Processors

• Motivation & Goal

• Introduction to Artificial Neural Networks

• What is a Deep Neural Network?

• Why Deep Neural Networks?

• Boltzmann Machines

• Restricted Boltzmann Machines

• Semi-Supervised Learning

• Learning Feature Hierarchy

• RBM Implementation

• Compute Unified Device Architecture (CUDA)

• Implementation on GPU

• Results

• Conclusion

Page 3:

What is Machine Learning?

• A digit-recognition demo: http://www.cs.toronto.edu/~hinton/adi/index.htm

• This model learns to generate combinations of labels and images.

[Figure (Hinton): the demo network. A 28 x 28 pixel image and 10 label neurons feed two 500-neuron hidden layers, topped by 2000 top-level neurons.]

Page 4:

Brain vs. Processors

• The brain is made up of billions of cells called neurons, with highly parallel and adaptive connections.

• Chips now have a similar number of transistors, but their connections are not adaptive.

Switching time:

• Neurons switch at a frequency of about 10 kHz.

• Processor switching frequencies are approaching 10 GHz, so in this respect processors are far faster.

Connections:

• In the brain, each neuron is interconnected with thousands of other neurons.

• In processors, there are at most about 10 connections per transistor.

Page 5:

Motivation & Goal

http://publications.csail.mit.edu/abstracts/abstracts07/brussell2/brussell2.html

Page 6:

Motivation & Goal

• The goal is to solve practical problems using novel learning algorithms inspired by the brain, and to make computers more user friendly.

• Try to achieve human-like performance on problems such as:

  • Object detection

  • Speech recognition

Page 7:

Introduction to Artificial Neural Networks

[Figure: a single artificial neuron, with weighted inputs summed and passed through an activation function.]
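
To make the figure concrete, here is a minimal host-side sketch of one artificial neuron (hypothetical weights and inputs; a sigmoid activation is assumed), compilable with nvcc or any C++ compiler:

    // neuron.cu -- a single artificial neuron: weighted sum of the inputs
    // passed through a sigmoid activation. All values are made up.
    #include <cmath>
    #include <cstdio>

    float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    // Output of one neuron: activation(bias + sum_i w[i] * x[i]).
    float neuron(const float *x, const float *w, float bias, int n) {
        float sum = bias;
        for (int i = 0; i < n; ++i) sum += w[i] * x[i];
        return sigmoid(sum);
    }

    int main() {
        float x[3] = {1.0f, 0.0f, 1.0f};   // inputs
        float w[3] = {0.5f, -0.3f, 0.8f};  // connection weights
        printf("output = %f\n", neuron(x, w, -0.2f, 3));  // sigmoid(1.1) ~ 0.75
        return 0;
    }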

Page 8:

Introduction to Artificial Neural Networks

http://neuralnetworksanddeeplearning.com/chap5.html

Page 9:

What is a Deep Neural Network?

http://neuralnetworksanddeeplearning.com/chap5.html

Page 10:

Why Deep Neural Networks?

• There is a high chance that back-propagation gets stuck in a local minimum.

• It is very slow in networks with multiple hidden layers.

• It requires labeled training data.

These limitations of back-propagation motivate the unsupervised, layer-wise training approach discussed in the following slides.

Page 11:

Boltzmann Machines

• Boltzmann Machines were introduced by Hinton & Sejnowski ['83].

• Boltzmann Machines have bidirectional connections.

• Each neuron has a binary-valued state ('on' or 'off').

• Boltzmann Machines learn the complex regularities in the training data.

• State transitions are probabilistic.

• The learning algorithm is very slow in networks with many layers, which gave rise to Restricted Boltzmann Machines.

Page 12:

Restricted Boltzmann Machines (RBM)

• RBMs are Boltzmann Machines with the following restrictions:

  • There are no connections between any two visible units.

  • There are no connections between any two hidden units.

• With these restrictions, the hidden units are conditionally independent given a visible vector, so their states can be computed in parallel.

Page 13:

Semi-Supervised Learning

[Figure: unlabeled images (all cars/elephants) are used for feature learning; labeled "elephant"/"car" examples and a test set complete the setup. Source: Caltech-101]

Page 14:

Learning Feature Hierarchy

[Figure: a feature hierarchy learned layer by layer. Credit: Akshay N Hegde]

Page 15:

Compute Unified Device Architecture (CUDA)

• CUDA is a general-purpose architecture that allows parallel computation on NVIDIA GPUs.

• The CUDA programming model uses C and C++ to create special functions, called kernels, that define data-parallel computations.

• Kernels are executed by many threads on the GPU, which operates as a coprocessor/accelerator to the CPU.

• To run a kernel, its threads must first be organized into blocks that can run independently of each other (see the sketch below).
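
To illustrate this model, here is a minimal, self-contained CUDA sketch (not from the presentation): one kernel squares an array, with its threads organized into independent blocks:

    // square.cu -- minimal CUDA example: one kernel, threads grouped into blocks.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles one array element, identified by its block index
    // and its thread index within the block.
    __global__ void square(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1024;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch 4 independent blocks of 256 threads each (4 * 256 = n).
        square<<<4, 256>>>(dev, n);

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        printf("host[3] = %f\n", host[3]);  // 9.0
        return 0;
    }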

Page 16:

RBM Implementation

• An RBM assigns a probability to each joint configuration (v, h) of visible and hidden units: p(v, h) = e^(-E(v, h)) / Z.

• Z is the partition function, given by:

  Z = Σ_(v,h) e^(-E(v, h))

• Given a random input v, the probability of hidden unit j being 1 is:

  p(h_j = 1 | v) = σ(b_j + Σ_i v_i w_ij)

  where σ(x) = 1 / (1 + e^(-x)) is the sigmoid function.

• Similarly, given a random hidden vector, the state of visible unit i is set to 1 with probability:

  p(v_i = 1 | h) = σ(a_i + Σ_j h_j w_ij)

Source: [2]
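
As an illustration only (hypothetical names, and one thread per hidden unit rather than the paper's block-per-neuron kernels), the sketch below computes p(h_j = 1 | v) on the GPU:

    // hidden_probs.cu -- p(h_j = 1 | v) = sigma(b_j + sum_i v_i * w_ij),
    // computed with one thread per hidden unit.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void hiddenProbs(const float *v, const float *w, const float *b,
                                float *ph, int I, int J) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= J) return;
        float sum = b[j];
        for (int i = 0; i < I; ++i)
            sum += v[i] * w[j * I + i];      // weights stored as a J x I matrix
        ph[j] = 1.0f / (1.0f + expf(-sum));  // sigmoid
    }

    int main() {
        const int I = 4, J = 2;
        float v[I] = {1, 0, 1, 1}, w[J * I] = {0}, b[J] = {0}, ph[J];
        float *dv, *dw, *db, *dp;
        cudaMalloc(&dv, sizeof v);  cudaMemcpy(dv, v, sizeof v, cudaMemcpyHostToDevice);
        cudaMalloc(&dw, sizeof w);  cudaMemcpy(dw, w, sizeof w, cudaMemcpyHostToDevice);
        cudaMalloc(&db, sizeof b);  cudaMemcpy(db, b, sizeof b, cudaMemcpyHostToDevice);
        cudaMalloc(&dp, sizeof ph);
        hiddenProbs<<<1, 32>>>(dv, dw, db, dp, I, J);
        cudaMemcpy(ph, dp, sizeof ph, cudaMemcpyDeviceToHost);
        printf("p(h_0 = 1 | v) = %f\n", ph[0]);  // 0.5 with zero weights and biases
        return 0;
    }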

Page 17:

RBM Implementation

• Updating the weights and biases by contrastive divergence, where ⟨·⟩_data and ⟨·⟩_recon denote averages over the data and over the reconstruction, and η is the learning rate:

  Δw_ij = η (⟨v_i h_j⟩_data - ⟨v_i h_j⟩_recon)   (6)

  Δa_i = η (⟨v_i⟩_data - ⟨v_i⟩_recon)   (7)

  Δb_j = η (⟨h_j⟩_data - ⟨h_j⟩_recon)   (8)

Source: [2]
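
A minimal CUDA sketch of update (6), assuming a single sample so the averages reduce to products (hypothetical names; not the paper's Correct Weights kernel):

    // correct_weights.cu -- weight update (6) for a single sample:
    // w_ij += eta * (v_i * h_j - v'_i * h'_j). One thread per weight.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void correctWeights(float *w, const float *v, const float *h,
                                   const float *vr, const float *hr,
                                   int I, int J, float eta) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // visible index
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // hidden index
        if (i < I && j < J)
            w[j * I + i] += eta * (v[i] * h[j] - vr[i] * hr[j]);
    }

    int main() {
        const int I = 2, J = 2;
        float w[J * I] = {0}, v[I] = {1, 0}, h[J] = {1, 1}, vr[I] = {0, 0}, hr[J] = {1, 0};
        float *dw, *dv, *dh, *dvr, *dhr;
        cudaMalloc(&dw, sizeof w);   cudaMemcpy(dw, w, sizeof w, cudaMemcpyHostToDevice);
        cudaMalloc(&dv, sizeof v);   cudaMemcpy(dv, v, sizeof v, cudaMemcpyHostToDevice);
        cudaMalloc(&dh, sizeof h);   cudaMemcpy(dh, h, sizeof h, cudaMemcpyHostToDevice);
        cudaMalloc(&dvr, sizeof vr); cudaMemcpy(dvr, vr, sizeof vr, cudaMemcpyHostToDevice);
        cudaMalloc(&dhr, sizeof hr); cudaMemcpy(dhr, hr, sizeof hr, cudaMemcpyHostToDevice);
        correctWeights<<<dim3(1, 1), dim3(16, 16)>>>(dw, dv, dh, dvr, dhr, I, J, 0.1f);
        cudaMemcpy(w, dw, sizeof w, cudaMemcpyDeviceToHost);
        printf("w[0] = %f\n", w[0]);  // 0.1 * (1*1 - 0*1) = 0.1
        return 0;
    }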

Page 18:

Algorithm 1

Source: [2]

Page 19:

Algorithm 2

Source: [2]

Page 20:

RBM Implementation on GPU

• RBM kernels:

  • Compute Status Hidden Units

  • Compute Status Visible Units

  • Correct Weights

Page 21:

Sequence of GPU kernel calls per epoch

Source: [2]

Page 22:

Kernel Implementation

• Compute Status Hidden Units kernel & Compute Status Visible Units kernel:

  • Each neuron in the visible or hidden layer is handled by one block, which sums the values computed by its threads using a reduction process and then computes the output of the neuron for the active sample (see the sketch below).

  • The order in which the weight matrix is placed in memory affects both kernels. The weights are stored as a J x I matrix, which favors Compute Status Hidden Units, since that kernel is executed more often.

• Correct Weights kernel, 1st method:

  • The kernel sums the values of all samples in each block.

  • Each thread gathers and sums the values for all the samples; a reduction process then takes place in order to compute the weight and bias updates.

Source: [2]
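
A sketch of the block-per-neuron pattern described above, assuming one block per hidden unit and a power-of-two thread count (illustrative names, not the paper's code):

    // block_per_neuron.cu -- each block computes one hidden unit: its threads
    // partially sum v_i * w_ij over a strided slice, a shared-memory tree
    // reduction combines the partial sums, and thread 0 applies the sigmoid.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define THREADS 128  // threads per block (power of two for the reduction)

    __global__ void hiddenUnitBlock(const float *v, const float *w, const float *b,
                                    float *ph, int I) {
        __shared__ float partial[THREADS];
        int j = blockIdx.x;  // this block's hidden unit

        float sum = 0.0f;
        for (int i = threadIdx.x; i < I; i += blockDim.x)
            sum += v[i] * w[j * I + i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            ph[j] = 1.0f / (1.0f + expf(-(b[j] + partial[0])));
    }

    int main() {
        const int I = 256, J = 4;
        float *dv, *dw, *db, *dp, ph[J];
        cudaMalloc(&dv, I * sizeof(float));      cudaMemset(dv, 0, I * sizeof(float));
        cudaMalloc(&dw, J * I * sizeof(float));  cudaMemset(dw, 0, J * I * sizeof(float));
        cudaMalloc(&db, J * sizeof(float));      cudaMemset(db, 0, J * sizeof(float));
        cudaMalloc(&dp, J * sizeof(float));
        hiddenUnitBlock<<<J, THREADS>>>(dv, dw, db, dp, I);  // one block per neuron
        cudaMemcpy(ph, dp, sizeof ph, cudaMemcpyDeviceToHost);
        printf("p(h_0 = 1 | v) = %f\n", ph[0]);  // 0.5 with zeroed data
        return 0;
    }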

Page 23:

Kernel Implementation

• Correct Weights kernel, 2nd method:

  • Each block has 16 x 16 threads; the first dimension of the block (x) is associated with an input unit 'i', while the second dimension (y) is associated with a hidden unit 'j'. Each thread within a block processes all the samples (see the sketch below).

Source: [2]
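
A kernel-only sketch of this second method, under the same assumptions as the earlier examples (illustrative names and a sample-major data layout; the launch configuration is shown in comments):

    // correct_weights_v2.cu -- second method: 16 x 16 thread blocks, x indexes
    // a visible unit i, y a hidden unit j; each thread loops over all N samples
    // and accumulates its own (i, j) statistics, so no reduction is needed.
    #include <cuda_runtime.h>

    __global__ void correctWeights2(float *w, const float *v, const float *h,
                                    const float *vr, const float *hr,
                                    int I, int J, int N, float eta) {
        int i = blockIdx.x * 16 + threadIdx.x;  // visible unit
        int j = blockIdx.y * 16 + threadIdx.y;  // hidden unit
        if (i >= I || j >= J) return;

        float delta = 0.0f;
        for (int n = 0; n < N; ++n)  // data phase minus reconstruction phase
            delta += v[n * I + i] * h[n * J + j] - vr[n * I + i] * hr[n * J + j];
        w[j * I + i] += eta * delta / N;
    }

    // Example launch covering the whole J x I weight matrix:
    //   dim3 block(16, 16);
    //   dim3 grid((I + 15) / 16, (J + 15) / 16);
    //   correctWeights2<<<grid, block>>>(dw, dv, dh, dvr, dhr, I, J, N, 0.1f);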

Page 24:

Comparison between two methods

Fig. Proportion of time spent, per epoch, in each task/kernel (as measured on a GTX 280 device).

Source: [2]

Page 25:

Limitations of the 1st Approach

• Two main problems are related to memory access:

  • Memory accesses are not coalesced, so cache performance is not at its best.

  • Many blocks try to access exactly the same memory addresses, generating memory conflicts.

Source: [2]

Page 26:

Experiment Setup

Data set: MNIST

Number of samples: 60,000

Number of visible units: 784 (28 x 28)

CPU: Intel dual-core i5-2410M with 8 GB of memory

GPU: NVIDIA GeForce GTX 460

• The number of hidden units and the number of training samples are varied across runs.

Source: [2]

Page 27:

Results

[Figure: speedup plots; the sample size increases within each plot, and the number of hidden units increases across the horizontal dimension.]

Source: [2]

Page 28:

Analysis

• The GPU speedups obtained range from 22 to 46 times.

• For example, with N = 60,000 samples and 800 hidden units, training takes 40 minutes per epoch on the CPU but only 53 seconds per epoch on the GPU.

Factor                  | Change    | Speedup             | Execution time
------------------------|-----------|---------------------|-------------------
Number of samples       | Increases | Tremendous increase | Drastic fall
Number of hidden units  | Increases | Sub-linear increase | Mediocre reduction

Source: [2]

Page 29:

Conclusion

• Training a Deep Belief Network model is time consuming and computationally expensive.

• With the help of GPUs, taking advantage of their inherently parallel architecture, many experiments can be run in a short period.

Source: [2]

Page 30:

References

[1] G. O. Young, "Synthetic structure of industrial plastics (Book style with paper title and editor)," in Plastics, 2nd ed., vol. 3, J. Peters, Ed. New York: McGraw-Hill, 1964, pp. 15-64.

[2] Noel Lopes, Bernardete Ribeiro and Joao Goncalves, "Restricted Boltzmann Machines and Deep Belief Networks on Multi-Core Processors," in WCCI 2012 IEEE World Congress on Computational Intelligence, June 10-15, 2012, Brisbane, Australia.

Page 31:

Thank You