Multimedia Data Mining Using Deep Learning
Agenda
Aims
Multimedia Data Mining
Artificial Neural Networks
Deep learning
Challenges
Discussion
Aims
Analyze multimedia data for:
Object/face recognition
Voice commands
Natural Language Processing
Classification
Automatic caption generation
Record linkage (entity resolution)
Multimedia Data Mining I
Multimedia data mining:
Unprecedented amount of Multimedia data
since Web 2.0 and Social Media
Prosumer data
Uses algorithms to extract useful patterns
and relations from image, audio and video
data
Traditional methods often not satisfactory
Unsuitable for high dimensionality
Multimedia Data Mining II
Multimedia data mining has been
improved using deep learning in:
Visual data mining
Natural Language Processing
Deep learners are:
Machine Learning schemes
Usually multi-layered artificial neural
networks
Artificial Neural Networks I
Artificial Neural Networks:
Suitable to give good approximations for
complex problems
Consist of perceptrons (the neurons)
and weighted connections (the axons)
Artificial Neural Networks II
Perceptron (Neuron)
Linear classifier
Data linearly separable using a hyperplane
Where w = weights, a = real-valued feature vector,
a_0 = bias input
Binary classifier f(a) that maps its input
vector a to a single, binary output value
Separating hyperplane: w_0 a_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k = 0
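A minimal sketch of the decision rule above, assuming the bias input a_0 is fixed to 1 and that the output is 1 when the weighted sum is positive. The function name and the example weights are illustrative and not taken from the slides.

```python
def perceptron_output(weights, features):
    """Return 1 if the weighted sum is positive, otherwise 0.

    weights[0] is the bias weight w_0; the bias input a_0 is assumed to be 1.
    """
    inputs = [1.0] + list(features)  # prepend a_0 = 1
    activation = sum(w * a for w, a in zip(weights, inputs))
    return 1 if activation > 0 else 0


# Hypothetical weights for the hyperplane a_1 + a_2 = 1.5 (a logical AND).
and_weights = [-1.5, 1.0, 1.0]
for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, perceptron_output(and_weights, point))
```

Only the point (1, 1) lies on the positive side of this hyperplane, so it is the only input mapped to 1.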
Artificial Neural Networks III
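[Figure: perceptron with a constant bias input 1 (weight w_0) and attribute inputs a_1, a_2, a_3 (weights w_1, w_2, w_3)]
f(a) = \sum_k w_k a_k + b
Output class: f(a) > 0 or f(a) < 0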
Artificial Neural Networks III
Training data
name      sex     mask  cape  tie  ears  smokes  class
Batman    male    yes   yes   no   yes   no      Good
Robin     male    yes   yes   no   no    no      Good
Alfred    male    no    no    yes  no    no      Good
Penguin   male    no    no    yes  no    yes     Bad
Catwoman  female  yes   no    no   yes   no      Bad
Joker     male    no    no    no   no    no      Bad
Test data
name      sex     mask  cape  tie  ears  smokes  class
Batgirl   female  yes   yes   no   yes   no      ?
Riddler   male    yes   no    no   no    no      ?
Supervised learning
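As a rough illustration of supervised learning on this table, the sketch below trains a single perceptron with the classic perceptron learning rule. The 0/1 encoding of the attributes (male = 1, yes = 1, Good = 1), the learning rate and the epoch limit are assumptions made for this example and do not come from the slides.

```python
# Attribute order: sex (male=1), mask, cape, tie, ears, smokes; label 1 = Good.
TRAIN = {
    "Batman":   ([1, 1, 1, 0, 1, 0], 1),
    "Robin":    ([1, 1, 1, 0, 0, 0], 1),
    "Alfred":   ([1, 0, 0, 1, 0, 0], 1),
    "Penguin":  ([1, 0, 0, 1, 0, 1], 0),
    "Catwoman": ([0, 1, 0, 0, 1, 0], 0),
    "Joker":    ([1, 0, 0, 0, 0, 0], 0),
}
TEST = {
    "Batgirl": [0, 1, 1, 0, 1, 0],
    "Riddler": [1, 1, 0, 0, 0, 0],
}


def predict(weights, features):
    """weights[0] is the bias weight; the remaining weights match the attributes."""
    activation = weights[0] + sum(w * a for w, a in zip(weights[1:], features))
    return 1 if activation > 0 else 0


def train_perceptron(examples, learning_rate=1.0, max_epochs=1000):
    """Perceptron rule: move the weights toward every misclassified example."""
    weights = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, features)
            if error != 0:
                mistakes += 1
                weights[0] += learning_rate * error
                for i, a in enumerate(features):
                    weights[i + 1] += learning_rate * error * a
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights


weights = train_perceptron(list(TRAIN.values()))
for name, features in TEST.items():
    print(name, "Good" if predict(weights, features) else "Bad")
```

Under this encoding the six training rows are linearly separable, so the loop stops once all of them are classified correctly. The predictions for the two test rows depend on the learned weights and should be read as an illustration only.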
Artificial Neural Networks IV
Not all data is linearly separable
Artificial Neural Networks V
Multilayer Perceptron
Perceptrons organized in several layers
A layer is fully interconnected with the next
layer
All nodes except the input nodes are perceptrons
Feedforward neural network
Uses backpropagation for training
Error propagated back to minimize loss function
Artificial Neural Networks VI
Multilayer perceptron can be used for
non-linear, multiclass classification
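A small sketch of why an extra layer helps with data that is not linearly separable: XOR cannot be separated by one hyperplane, but two hidden perceptrons plus one output perceptron can represent it. The weights below are hand-picked for illustration rather than learned, and the function names are hypothetical.

```python
def step(z):
    """Threshold activation used by each perceptron."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Two-layer perceptron network with fixed, hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1: fires for OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2: fires for AND
    return step(h_or - h_and - 0.5)   # output: OR but not AND


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_mlp(x1, x2))
```

In practice the weights would be learned with backpropagation instead of being set by hand.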
Artificial Neural Networks VII
Gradient descent optimization method
for learning weights
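A minimal sketch of gradient descent for a single linear neuron, assuming a mean squared error loss. The toy data (y = 2x + 1), the learning rate and the number of steps are assumptions chosen for this example.

```python
def fit_linear_neuron(xs, ys, learning_rate=0.1, steps=2000):
    """Learn w and b for the prediction w * x + b by gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dLoss/dw
        grad_b = 2.0 / n * sum(errors)                             # dLoss/db
        w -= learning_rate * grad_w  # step against the gradient
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_neuron(xs, ys)
print(round(w, 3), round(b, 3))  # approaches w = 2, b = 1
```

Backpropagation applies the same update to every weight in a multilayer network, using the chain rule to obtain each gradient.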
Artificial Neural Networks VIII
Model complexity has to be appropriate:
neither too simple nor too complex
(Occam’s razor)
Schapire 2004
Artificial Neural Networks IX
Schapire 2004
Artificial Neural Networks X
For building an accurate classifier:
Enough training examples
Good performance on training set
Classifier that is not too complex,
to avoid overfitting
ANNs allow approximate solutions for
very complex problems
Support Vector Machines (SVM) are a
much simpler alternative to ANN
Deep learning I
Deep learning
No clear distinction from shallow learners
Multiple layers of non-linear processing
units
Each layer represents features at a higher
level
Forms a hierarchical representation
The majority of deep learners are ANNs
Deep learning II
Deep learning neural networks
Use Rectified Linear Units (ReLU)
Learn faster
Half-wave rectifier
f(z) = max(z, 0)
Use backpropagation for adjusting the
weights
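The rectifier f(z) = max(z, 0) can be sketched in a few lines. The gradient helper is an addition for illustration: it shows the value backpropagation would use, with the common convention (an assumption here) that the gradient at z = 0 is taken as 0.

```python
def relu(z):
    """Half-wave rectifier: passes positive values, clamps the rest to zero."""
    return max(z, 0.0)


def relu_grad(z):
    """Gradient used during backpropagation; 0 at z = 0 by convention."""
    return 1.0 if z > 0 else 0.0


print([relu(z) for z in (-2.0, -0.5, 0.0, 1.5)])       # [0.0, 0.0, 0.0, 1.5]
print([relu_grad(z) for z in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.0]
```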
Deep learning III - ConvNet
LeNet 2015
Deep learning IV - ConvNet
Convolutional neural networks
Inspired by the animal visual cortex
Visual cortex is the most powerful visual
processing system in existence
Typically two stages:
Convolutional stage
Pooling stage
Characterized by
sparse connectivity
shared weights
Deep learning V - ConvNet
Shared weights
Subsets share weights and bias to form
feature map
Replicated across entire visual field
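A rough sketch of weight sharing: one small filter is slid over the whole image, so every position of the feature map is computed with the same weights. The example computes a "valid" cross-correlation (no kernel flip, as is usual in ConvNets) and omits the bias for brevity. The image and the filter values are made up for illustration.

```python
def convolve2d_valid(image, kernel):
    """Slide one shared kernel over the image and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # The same kernel weights are reused at every position (r, c).
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions
print(convolve2d_valid(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0]]
```

The filter responds only in the column where the image changes from 0 to 1, which is the sense in which a filter activates on a specific feature.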
Deep learning VI - ConvNet
Each layer accepts a 3D input volume and
transforms it into a 3D output volume
Filters activate when a specific feature is
detected
CS231n 2015
Deep learning VII - ConvNet
Receptive field spans all feature maps
LeNet 2015
Deep learning VIII - ConvNet
MaxPooling
Non-linear down-sampling
Partitions input into non-overlapping
rectangles
Outputs maximum value for each sub-region
Minimizes computation for next layer
Reduces dimensionality of intermediate
representations
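Max pooling as described above can be sketched as follows, assuming square, non-overlapping windows and an input whose height and width are multiples of the window size. The feature map values are made up for illustration.

```python
def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size sub-region."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool(feature_map))  # [[4, 2], [2, 8]]
```

The 4 x 4 input shrinks to 2 x 2, so the next layer has a quarter of the values to process.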
Deep learning IX - ConvNet
Convolutional and sampling sublayers
UFLDL 2015
Deep learning X - ConvNet
Image after cascading a convolutional
layer with max-pooling
Similar to edge detector
Deep learning XI - RNN
Recurrent neural networks
Contain directed cycles
Take sequences as input; no fixed-size
input and output vectors, e.g. natural
speech
Deep learning XII - RNN
No fixed size of computations
Much simpler than ConvNets
Maintain inner state exhibiting dynamic
temporal behavior
Optimized through backpropagation
Can be extended with long-term memory
extensions
Don’t necessarily need sequences of inputs
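A minimal sketch of the inner state of a recurrent network, assuming an Elman-style update with a tanh activation and scalar inputs. All weights are hypothetical fixed values; a real network would learn them.

```python
import math


def rnn_step(state, x, w_in, w_rec, bias):
    """One update: new_state[j] = tanh(w_in[j] * x + sum_k w_rec[j][k] * state[k] + bias[j])."""
    new_state = []
    for j in range(len(state)):
        total = bias[j] + w_in[j] * x
        total += sum(w_rec[j][k] * state[k] for k in range(len(state)))
        new_state.append(math.tanh(total))
    return new_state


def run_rnn(sequence, w_in, w_rec, bias):
    """Feed a sequence of any length through the same weights, one item at a time."""
    state = [0.0] * len(bias)
    for x in sequence:
        state = rnn_step(state, x, w_in, w_rec, bias)
    return state


# Hypothetical weights for a hidden state with two units.
W_IN = [0.5, -0.3]
W_REC = [[0.1, 0.4], [-0.2, 0.3]]
BIAS = [0.0, 0.1]

short = run_rnn([1.0, 0.5], W_IN, W_REC, BIAS)
longer = run_rnn([1.0, 0.5, -1.0, 0.25, 0.0], W_IN, W_REC, BIAS)
print(short)
print(longer)
```

Both sequences are handled by the same function although their lengths differ, and the final state depends on the order of the inputs because each step feeds the previous state back in.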
Deep learning XIII - RNN
Training RNN is a non-linear global
optimization problem
Trained using stochastic gradient descent
Non-linear, differentiable activation
function, e.g. rectifier
Trained through backpropagation through
time (BPTT)
Genetic algorithms can be used for training
Deep learning XIV - RNN
Many different architectures for RNN
Examples: Elman simple recurrent network (SRN), spiking neural network
Deep learning XV - RNN
RNN learns to read house numbers
RNN learns to paint house numbers
Karpathy 2015
Deep learning XVI - RNN
RNN used for
Transcribe speech to text
Speech synthesis
Machine translation
Deep learning XVII
Combining ConvNets and RNN for
image descriptions
Regions described using language as label space, using a ConvNet
Language synthesis using an RNN
Karpathy & Fei-Fei 2014
Deep learning XVIII
ConvNet and RNN can be combined
Automated caption generation
Deep learning XIX
Automatic feature extraction
No closed vocabulary set
Alignment of segments of sentences to
region on the image
Karpathy & Fei-Fei 2014
Deep learning XX
Other applications
Object recognition
Movie classification
Handwriting recognition
Record linkage
Challenges I
Main disadvantage: large volumes of
training data needed
Overfitting if not enough training data
Optimization difficult
Finding relevant information
Privacy-preserving data mining
Challenges II
Describing actions
Discussion
Future research in
Attention based models
Finding relevant information
Data democratization and Internet of
Things
Unsupervised learning
Semantic data modeling
Reasoning
Thank you for your attention
Questions?
References
Zhao, X, Li, X & Zhang, Z 2015, 'Multimedia Retrieval via Deep Learning to Rank', IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1487-91,
<http://ieeexplore.ieee.org.ezproxy.usq.edu.au/xpls/abs_all.jsp?arnumber=7054452>.
Yu, W, Zhuang, F, He, Q & Shi, Z 2015, 'Learning deep representations via extreme learning machines', Neurocomputing, vol. 149, Part A,
pp. 308-15, <http://www.sciencedirect.com/science/article/pii/S0925231214011461>.
Xu, K, Ba, J, Kiros, R, Cho, K, Courville, A, Salakhutdinov, R, Zemel, R & Bengio, Y 2015, 'Show, Attend and Tell: Neural Image Caption
Generation with Visual Attention', Proceedings of the 32nd International Conference on Machine Learning from Data: Artificial Intelligence
and Statistics, vol. 37.
Xin, J, Wang, Z, Qu, L & Wang, G 2015, 'Elastic extreme learning machine for big data classification', Neurocomputing, vol. 149, Part A, pp.
464-71, <http://www.sciencedirect.com/science/article/pii/S0925231214011503>.
Weston, J, Chopra, S & Bordes, A 2015, 'Memory Networks', in 3rd International Conference on Learning Representations: proceedings of
the 3rd International Conference on Learning Representations, San Diego, viewed <http://arxiv.org/pdf/1410.3916v10.pdf>.
Weilong, H, Xinbo, G, Dacheng, T & Xuelong, L 2015, 'Blind Image Quality Assessment via Deep Learning', Neural Networks and Learning
Systems, IEEE Transactions on, vol. 26, no. 6, pp. 1275-86.
Wang, Y, Li, D, Du, Y & Pan, Z 2015, 'Anomaly detection in traffic using L1-norm minimization extreme learning machine', Neurocomputing,
vol. 149, Part A, pp. 415-25, <http://www.sciencedirect.com/science/article/pii/S0925231214011382>.
Vinyals, O, Toshev, A, Bengio, S & Erhan, D 2015, 'Show and Tell: A Neural Image Caption Generator', Google,
<http://arxiv.org/pdf/1411.4555v1.pdf>.
Noda, K, Yamaguchi, Y, Nakadai, K, Okuno, H & Ogata, T 2015, 'Audio-visual speech recognition using deep learning', Applied Intelligence,
vol. 42, no. 4, pp. 722-37, <http://dx.doi.org/10.1007/s10489-014-0629-7>.
Mao, W, Zhao, S, Mu, X & Wang, H 2015, 'Multi-dimensional extreme learning machine', Neurocomputing, vol. 149, Part A, pp. 160-70,
<http://www.sciencedirect.com/science/article/pii/S0925231214011540>.
Liu, X, Wang, L, Huang, G-B, Zhang, J & Yin, J 2015, 'Multiple kernel extreme learning machine', Neurocomputing, vol. 149, Part A, pp. 253-64,
<http://www.sciencedirect.com/science/article/pii/S0925231214011199>.
LeCun, Y, Bengio, Y & Hinton, G 2015, 'Deep learning', Nature, vol. 521, no. 7553, pp. 436-44, <http://dx.doi.org/10.1038/nature14539>.
Srivastava, N, Hinton, G, Krizhevsky, A, Sutskever, I & Salakhutdinov, R 2014, 'Dropout: a simple way to prevent neural networks from
overfitting', J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-58.
Karpathy, A & Fei-Fei, L 2014, 'Deep visual-semantic alignments for generating image descriptions', arXiv preprint arXiv:1412.2306.