Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks

Presented by Guangyang Qi

Outline
- Introduction
- Overview of the paper
- Method explanation
- RNN & CNN motivation
- Approaches in this paper
- Experiment analysis
- Brainstorming and conclusion

Introduction
- Dynamic hand gesture detection: high diversity; automated methods lag human accuracy of 88.4%
- Processing of unsegmented input streams: RNN with connectionist temporal classification (CTC)
- Video sequences, action recognition: CNN is always a good choice

Environment for data collection: RGB, optical flow, depth, IR-left, and IR-disparity

Environment for data collection. (Top) Driving simulator with main monitor displaying simulated driving scenes and a user interface for prompting gestures, (A) a SoftKinetic depth camera (DS325) recording depth and RGB frames, and (B) a DUO 3D camera capturing stereo IR. Both sensors capture 320×240 pixels at 30 frames per second. (Bottom) Examples of each modality, from left: RGB, optical flow, depth, IR-left, and IR-disparity.


Twenty-five dynamic hand gesture classes

Network Architecture

Recurrent Neural Networks

More RNN use case

- Image classification
- Image captioning (image → sentence of words)
- Sentiment analysis (positive or negative sentiment)
- Machine translation
- Video classification

From left to right: (1) Vanilla processing without an RNN, from fixed-size input to fixed-size output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like.
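The key point above — one fixed recurrent transformation reused at every time step — can be sketched in a few lines of NumPy. The sizes and weights here are arbitrary, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
input_size, hidden_size = 4, 8

# One fixed set of weights, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(h, x):
    """Vanilla RNN recurrence: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    return np.tanh(W_hh @ h + W_xh @ x + b_h)

def run_rnn(xs):
    """Apply the same transformation to a sequence of any length."""
    h = np.zeros(hidden_size)
    hs = []
    for x in xs:
        h = rnn_step(h, x)
        hs.append(h)
    return np.array(hs)

# The same weights handle a 5-step and a 12-step sequence.
short = run_rnn(rng.normal(size=(5, input_size)))
long_ = run_rnn(rng.normal(size=(12, input_size)))
print(short.shape, long_.shape)  # (5, 8) (12, 8)
```

Because the weights do not depend on the sequence length, the same network covers all five input/output patterns above.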

Long short-term memory(LSTM)

Vanishing / Exploding Gradients

Why CNN?

CNN inspiration - Visual Cortex Cell Recording

The visual cortex has small regions of cells that are sensitive to specific regions of the visual field.

Let's talk about straight edges, simple colors, and curves

Curve detector

The value is much lower!

When the curve detector does not match, the response value is very low!
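The curve-detector intuition reduces to a single cross-correlation: multiply the filter with an image patch elementwise and sum. A patch containing the curve gives a large activation; a patch without it gives a small one. The filter and patch values below are made up purely for illustration:

```python
import numpy as np

# A hypothetical 3x3 "curve" filter (a diagonal pattern), values invented for illustration.
curve_filter = np.array([[0., 0., 1.],
                         [0., 1., 0.],
                         [1., 0., 0.]])

def response(patch, filt):
    """Filter response = elementwise product summed (cross-correlation at one location)."""
    return float(np.sum(patch * filt))

matching_patch = np.array([[0., 0., 9.],
                           [0., 9., 0.],
                           [9., 0., 0.]])  # bright pixels lying along the curve

flat_patch = np.full((3, 3), 3.0)          # same total brightness, but no curve

print(response(matching_patch, curve_filter))  # 27.0 (high activation)
print(response(flat_patch, curve_filter))      # 9.0  (much lower)
```

Sliding this computation across the whole image produces the activation map of one convolutional filter.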

Going Deeper Through the Network

Pre-training the 3D-CNN
- 8 convolutional layers with 3×3×3 filters
- 2 fully-connected layers
- Softmax
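As a sketch of the shape and parameter bookkeeping for such a network: the per-layer channel counts below follow the C3D-style design this architecture builds on, but they are an assumption — the slides themselves only state 8 convolutional layers with 3×3×3 filters.

```python
# Shape bookkeeping for a hypothetical 8-layer 3D-CNN with 3x3x3 filters.
# Channel counts are an assumption (C3D-style); only "8 conv layers" is given.
channels = [64, 128, 256, 256, 512, 512, 512, 512]

def conv3d_out_shape(in_shape, out_ch, k=3, pad=1, stride=1):
    """Output shape of one 3D convolution over a (C, D, H, W) volume."""
    c, d, h, w = in_shape
    f = lambda n: (n + 2 * pad - k) // stride + 1
    return (out_ch, f(d), f(h), f(w))

shape = (3, 16, 112, 112)  # RGB clip: 16 frames of 112x112 (assumed input size)
params = 0
for out_ch in channels:
    in_ch = shape[0]
    params += in_ch * out_ch * 3 * 3 * 3 + out_ch  # weights + biases
    shape = conv3d_out_shape(shape, out_ch)

print(shape)   # padding=1 with 3x3x3 filters preserves D, H, W: (512, 16, 112, 112)
print(params)  # total convolutional parameter count
```

Pooling layers (omitted here) would shrink the temporal and spatial dimensions between convolutions.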

Cost function
- Connectionist temporal classification (CTC)
- From gestures to labellings
- Probability of observing a particular sequence

CTC is applied to identify and correctly label the nucleus of the gesture. The CTC loss is the negative log-likelihood:

L_CTC = −ln p(y|X)

CTC is a cost function designed for sequence prediction in unsegmented or weakly segmented input streams. Computation of p(y|X) is simplified by dynamic programming. First, we create an assistant vector y′ by adding a no gesture label before and after each gesture clip in y, so that y′ contains |y′| = P′ = 2P + 1 labels.

While CTC is used as a training cost function only, it affects the architecture of the network by adding the extra no gesture class label. For pre-segmented video classification, we simply remove the no-gesture output and renormalize probabilities by the ℓ1-norm after modality fusion.
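The dynamic-programming recursion for p(y|X) over the blank-augmented label sequence can be sketched as follows. This is a plain (non-log-space) version for clarity; real implementations work in log space for numerical stability:

```python
import numpy as np

def ctc_prob(probs, labels, blank=0):
    """p(y|X) via the CTC forward (dynamic-programming) recursion.

    probs:  (T, K) per-frame class probabilities from the network
    labels: target sequence without blanks, e.g. [2, 3]
    The augmented sequence y' inserts a blank ("no gesture") before and
    after every label, giving |y'| = 2P + 1 symbols.
    """
    aug = [blank]
    for l in labels:
        aug += [l, blank]
    S, T = len(aug), probs.shape[0]

    alpha = np.zeros((T, S))          # alpha[t, s]: prob of prefixes of y' at frame t
    alpha[0, 0] = probs[0, aug[0]]
    if S > 1:
        alpha[0, 1] = probs[0, aug[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                       # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]              # advance one symbol
            # Skip a blank, allowed only between two different non-blank labels.
            if s > 1 and aug[s] != blank and aug[s] != aug[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, aug[s]]

    # Valid endings: last label or trailing blank.
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

# Toy check: 2 frames, 2 classes (blank=0, gesture=1), uniform probabilities.
# Of the four length-2 paths, three collapse to [1], so p = 3 * 0.25 = 0.75.
probs = np.full((2, 2), 0.5)
p = ctc_prob(probs, [1])
print(p)            # 0.75
print(-np.log(p))   # the CTC loss: negative log-likelihood
```

Summing over all frame-level paths that collapse to the label sequence is exactly the "probability of observing a particular sequence" the slide refers to.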


Learning rule

Optimize the network parameters with stochastic gradient descent (SGD)

- momentum term μ = 0.9
- E is the cost function, λ is the weight decay
- ε is the learning rate

Why not tune the momentum? Probably only a slight difference, in my personal experience.

To optimize the network parameters W with respect to either of the loss functions, we use stochastic gradient descent (SGD) with a momentum term μ = 0.9. We update each parameter θ ∈ W at every back-propagation step i by:

Δθ_i = μΔθ_{i−1} − ε(⟨∂E/∂θ⟩_batch + λθ)

where ε is the learning rate, ⟨∂E/∂θ⟩_batch is the gradient of the chosen cost function E with respect to the parameter θ averaged over the mini-batch, and λ is the weight decay parameter. To prevent gradient explosion in the recurrent layers during training, we apply a soft gradient clipping operator J(·) [29] with a threshold of 10.
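A minimal NumPy sketch of this update rule: μ = 0.9 and the clipping threshold of 10 come from the text, while the learning-rate and weight-decay values are placeholders, and the soft clipping operator J(·) is approximated here by simple norm rescaling.

```python
import numpy as np

def clip(g, threshold=10.0):
    """Approximate soft gradient clipping: rescale g if its norm exceeds the threshold."""
    n = np.linalg.norm(g)
    return g * (threshold / n) if n > threshold else g

def sgd_momentum_step(theta, velocity, grad, lr=1e-3, mu=0.9, weight_decay=5e-4):
    """One update: v_i = mu*v_{i-1} - lr*(grad + wd*theta); theta += v_i."""
    velocity = mu * velocity - lr * (clip(grad) + weight_decay * theta)
    return theta + velocity, velocity

# Toy usage: minimize E(theta) = theta^2, whose gradient is 2*theta.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * theta
    theta, v = sgd_momentum_step(theta, v, grad, lr=0.05, weight_decay=0.0)
print(theta)  # approaches the minimum at 0
```

The momentum term accumulates past gradients, which smooths the descent direction across noisy mini-batches.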

Objective function

Experimental analysis

Comparison of our method to the state-of-the-art methods and human predictions with various modalities

- Depth is better than the other individual modalities
- All modalities combined is best (still lower than human accuracy)

What about the other methods with all modalities?

Experimental analysis

Comparison of 2D-CNN and 3D-CNN trained with different architectures on depth or color data. (CTC denotes training without drop-out of feature maps.)

- 3D-CNN is obviously better
- CTC and RNN results are similar
- Even without an RNN, the results are not bad at all
- What about an RNN with drop-out of feature maps?

Experimental analysis


Accuracy of a linear SVM (C = 1) trained on features extracted from different networks and layers (final fully-connected layer fc and recurrent layer rnn).

The SVM classifier demonstrates a further improvement in performance, though only about a 0.2% difference.
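The table above reports a linear SVM with C = 1 trained on extracted features. As a self-contained sketch, here is a minimal hinge-loss linear SVM trained by subgradient descent; the data below is synthetic stand-in "features", not actual network activations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for features extracted from a network layer (fc or rnn).
pos = rng.normal(loc=+1.0, size=(50, 16))
neg = rng.normal(loc=-1.0, size=(50, 16))
X = np.vstack([pos, neg])
y = np.array([1.0] * 50 + [-1.0] * 50)

# Minimal linear SVM: subgradient descent on the C-weighted hinge loss
# 0.5*||w||^2 + C * mean(max(0, 1 - y*(Xw + b))), with C = 1 as in the slide.
w, b, C, lr = np.zeros(16), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                          # samples violating the margin
    grad_w = w - C * (y[active] @ X[active]) / len(y)
    grad_b = -C * y[active].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)  # well-separated clusters, so accuracy close to 1.0
```

In practice one would use an off-the-shelf solver (e.g. a library LinearSVC) on the real extracted features; the point here is only the max-margin objective.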

Brainstorming & Conclusion
- Online classification with zero or negative lag (beat humans?)
- R3DCNN for other areas: emotion prediction, Leap Motion

References
- Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S. and Kautz, J., Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks.
- Graves, A., 2012. Connectionist temporal classification. In Supervised Sequence Labelling with Recurrent Neural Networks (pp. 61-93). Springer Berlin Heidelberg.
- Deshpande, A. (2016). A Beginner's Guide To Understanding Convolutional Neural Networks. [online] Adeshpande3.github.io. Available at: https://adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/ [Accessed 20 Feb. 2017].
- Visual Cortex Cell Recording. (2017). [online] YouTube. Available at: https://www.youtube.com/watch?v=Cw5PKV9Rj3o [Accessed 20 Feb. 2017].
- CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning, LSTM. (2017). [online] YouTube. Available at: https://www.youtube.com/watch?v=iX5V1WpxxkY&t=864s [Accessed 20 Feb. 2017].

Thank you & any questions?