Deep Learning for Computer Vision (3/4): Video Analytics @ La Salle 2016


@DocXavi

Deep Learning for Computer Vision: Video Analytics, 5 May 2016

Xavier Giró-i-Nieto

Master en Creació Multimedia

Outline

1. Recognition
2. Optical Flow
3. Object Tracking


Recognition

Demo: Clarifai

MIT Technology Review: “A Start-up’s Neural Network Can Understand Video” (3/2/2015)

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015.

Previous lectures

Slides extracted from the ReadCV seminar by Victor Campos.

Recognition: DeepVideo

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

Recognition: DeepVideo: Demo


Recognition: DeepVideo: Architectures


Unsupervised learning [Le et al.’11] vs. supervised learning [Karpathy et al.’14]

Recognition: DeepVideo: Features


Recognition: DeepVideo: Multiscale


Recognition: DeepVideo: Results



Recognition: C3D

Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015


Recognition: C3D: Demo

K. Simonyan and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ICLR 2015.

Recognition: C3D: Spatial dimension

Spatial dimensions (XY) of the kernels are fixed to 3×3, following Simonyan & Zisserman (ICLR 2015).


Recognition: C3D: Temporal dimension

3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.

(Figure: temporal depth variants vs. 2D ConvNets.)


A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets
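As a rough illustration of this homogeneous design (a sketch with placeholder channel widths, not the exact C3D configuration), a stack of 3D convolutions with 3×3×3 kernels can be written in PyTorch as:

```python
import torch
import torch.nn as nn

# Sketch of a homogeneous 3D ConvNet block: every convolution uses a 3x3x3
# kernel (3x3 in space, temporal depth 3) with padding 1, so the clip
# resolution is preserved until pooling.
c3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # first block pools only spatially
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),   # later blocks also pool over time
)

# Input: a batch of clips shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
print(c3d_block(clip).shape)   # torch.Size([1, 128, 8, 28, 28])
```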

Recognition: C3D: Temporal dimension


No gain when varying the temporal depth across layers.



Recognition: C3D: Architecture

Feature vector


Recognition: C3D: Feature vector

The video sequence is split into 16-frame clips with an 8-frame overlap.


Recognition: C3D: Feature vector

Each 16-frame clip produces a 4096-dim feature vector; these are averaged over all clips of the video and L2-normalized into a single 4096-dim video descriptor.
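A minimal sketch of this pipeline, assuming a hypothetical extract_clip_feature helper that stands in for the C3D forward pass up to its 4096-dim layer:

```python
import torch
import torch.nn.functional as F

def video_descriptor(frames, extract_clip_feature, clip_len=16, stride=8):
    """Split a video into 16-frame clips with an 8-frame overlap, extract a
    4096-dim feature per clip, then average over clips and L2-normalize.
    frames: tensor of shape (num_frames, 3, H, W).
    extract_clip_feature: hypothetical stand-in for the C3D forward pass."""
    clip_feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, stride):
        clip_feats.append(extract_clip_feature(frames[start:start + clip_len]))
    descriptor = torch.stack(clip_feats).mean(dim=0)   # average over clips
    return F.normalize(descriptor, p=2, dim=0)         # L2 norm

# Example with a dummy feature extractor.
video = torch.randn(64, 3, 112, 112)
desc = video_descriptor(video, lambda clip: torch.randn(4096))
print(desc.shape)   # torch.Size([4096])
```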


Recognition: C3D: Visualization

Based on Deconvnets by Zeiler and Fergus [ECCV 2014]. See [ReadCV Slides] for more details.


Recognition: C3D: Compactness


Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the state of the art on the other 2 benchmarks.

Recognition: C3D: Performance


Recognition: C3D: Software

Implementation by Michael Gygli (GitHub).

Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694-4702. 2015.

Recognition: Beyond Short Snippets


Recognition: ImageNet Video

[ILSVRC 2015 Slides and videos]


Recognition: ImageNet Video

Kai Kang et al., Object Detection in Videos with Tubelets and Multi-Context Cues (ILSVRC 2015) [video] [poster]


Optical Flow

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).

Optical Flow: FlowNet


End-to-end supervised learning of optical flow.

Optical Flow: FlowNet (contracting)


Option A: Stack both input images together and feed them through a generic network.


Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.


Correlation layer: convolution of data patches from the two layers to be combined.
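As an illustration, here is a simplified correlation layer sketch in PyTorch (single-pixel patches and an exhaustive displacement search; the actual FlowNet layer also limits and strides the search, so treat this as a toy version):

```python
import torch
import torch.nn.functional as F

def correlation_layer(f1, f2, max_disp=4):
    """For every spatial position of f1, compute the dot product with f2
    shifted by every displacement in [-max_disp, max_disp]^2, giving one
    correlation map per displacement."""
    B, C, H, W = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)          # zero-pad so shifts stay aligned
    maps = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            f2_shift = f2_pad[:, :, dy:dy + H, dx:dx + W]
            maps.append((f1 * f2_shift).sum(dim=1, keepdim=True))  # channel-wise dot product
    return torch.cat(maps, dim=1) / C           # (B, (2*max_disp+1)**2, H, W)

# Example: combine features from the two processing streams.
f1, f2 = torch.randn(1, 256, 48, 64), torch.randn(1, 256, 48, 64)
print(correlation_layer(f1, f2).shape)          # torch.Size([1, 81, 48, 64])
```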

Optical Flow: FlowNet (expanding)


Upconvolutional layers: unpooling of feature maps followed by convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contracting part.
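A sketch of one expanding step, with placeholder layer sizes; the upconvolution is written here as a transposed convolution, one common way to implement unpooling + convolution (the coarse flow prediction that FlowNet also concatenates at each step is omitted):

```python
import torch
import torch.nn as nn

class ExpandingBlock(nn.Module):
    """Upsample coarse features and concatenate them with the corresponding
    feature map from the contracting part (placeholder sizes, not FlowNet's)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # "Upconvolution": 2x upsampling via transposed convolution.
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, skip):
        up = torch.relu(self.upconv(coarse))
        merged = torch.cat([up, skip], dim=1)   # reuse contracting-part features
        return torch.relu(self.conv(merged))

# Example: coarse 512-channel map at 12x16, skip connection of 256 channels at 24x32.
block = ExpandingBlock(in_ch=512, skip_ch=256, out_ch=256)
out = block(torch.randn(1, 512, 12, 16), torch.randn(1, 256, 24, 32))
print(out.shape)   # torch.Size([1, 256, 24, 32])
```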

Optical Flow: FlowNet


Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).

ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.

Data augmentation
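A sketch of the photometric part of that augmentation in PyTorch (the parameter ranges below are illustrative assumptions, not the paper's values); the geometric transformations would additionally have to be applied to the ground-truth flow field, which is omitted here:

```python
import torch

def photometric_augment(img1, img2):
    """Apply the same random brightness/contrast/gamma/color change to both
    frames of a training pair, plus independent additive Gaussian noise.
    Inputs: float tensors of shape (3, H, W) with values in [0, 1]."""
    brightness = float(torch.empty(1).uniform_(-0.1, 0.1))
    contrast = float(torch.empty(1).uniform_(0.8, 1.2))
    gamma = float(torch.empty(1).uniform_(0.7, 1.5))
    color = torch.empty(3, 1, 1).uniform_(0.9, 1.1)      # per-channel multiplier

    def apply(img):
        img = img + 0.02 * torch.randn_like(img)          # additive Gaussian noise
        img = (img - 0.5) * contrast + 0.5 + brightness   # contrast and brightness
        img = (img * color).clamp(0.0, 1.0)               # color change
        return img.pow(gamma)                             # gamma
    return apply(img1), apply(img2)

frame1, frame2 = torch.rand(3, 384, 512), torch.rand(3, 384, 512)
aug1, aug2 = photometric_augment(frame1, frame2)
```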



Object tracking: MDNet

Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015).


Object tracking: MDNet: Architecture


Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
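A sketch of the multi-domain idea in PyTorch (the shared layers below are placeholders, not MDNet's actual conv1-fc5 stack): each of the K training sequences gets its own binary target-vs-background branch on top of shared layers, and a single fresh branch is created for a new test sequence:

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    """Shared layers trained on all sequences + one classification branch
    per training sequence (domain). Placeholder layer sizes."""
    def __init__(self, num_domains):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(96 * 3 * 3, 512), nn.ReLU(),
        )
        # One target/background classifier per training sequence.
        self.domain_branches = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)]
        )

    def forward(self, x, domain):
        return self.domain_branches[domain](self.shared(x))

    def new_test_branch(self):
        # At test time the domain branches are discarded and replaced by a
        # single fresh branch, fine-tuned online on the new sequence.
        return nn.Linear(512, 2)

net = MultiDomainNet(num_domains=50)
scores = net(torch.randn(8, 3, 107, 107), domain=3)   # (8, 2) target/background scores
```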

Object tracking: MDNet: Online update


MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.
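For illustration, a minimal sketch of that selection step (the tensor shapes and the candidate sampling around it are assumptions):

```python
import torch

def hard_negative_mining(neg_scores, num_hard):
    """neg_scores: (N, 2) classifier outputs for N negative candidate boxes,
    where column 1 is the 'target' (positive) score. Returns the indices of
    the negatives the classifier is most confident about, i.e. the hardest."""
    return torch.topk(neg_scores[:, 1], k=num_hard).indices

# Example: keep the 96 hardest of 1024 sampled negatives for the online update.
neg_scores = torch.randn(1024, 2)
hard_idx = hard_negative_mining(neg_scores, num_hard=96)
```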

Object tracking: FCNT

Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]


Focus on conv4-3 and conv5-3 of the VGG-16 network pre-trained for ImageNet image classification.

(Figure: feature maps from conv4-3 and conv5-3.)
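A sketch of extracting those two feature maps with torchvision's ImageNet-pretrained VGG-16 (indices 21 and 28 correspond to conv4_3 and conv5_3 in torchvision's standard VGG-16 layout; the weights API assumes a recent torchvision):

```python
import torch
from torchvision import models

vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def conv43_conv53(x):
    """Return the conv4-3 and conv5-3 feature maps for a batch x of shape
    (B, 3, H, W), preprocessed as torchvision's ImageNet models expect."""
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg_features):
            x = layer(x)
            if i == 21:
                feats["conv4_3"] = x
            elif i == 28:
                feats["conv5_3"] = x
                break
    return feats

out = conv43_conv53(torch.randn(1, 3, 224, 224))
print(out["conv4_3"].shape, out["conv5_3"].shape)
# torch.Size([1, 512, 28, 28]) torch.Size([1, 512, 14, 14])
```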

Object tracking: FCNT: Specialization


Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.

Object tracking: FCNT: Localization


Although trained for image classification, feature maps in conv5-3 enable object localization, but they are not discriminative enough to distinguish different objects of the same category.

Object tracking: Localization

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.

Other works have shown how feature maps in convolutional layers allow object localization.

Object tracking: FCNT: Localization


On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…


Object tracking: FCNT: Architecture


SNet = Specific Network (updated online)

GNet = General Network (fixed)

Object tracking: FCNT: Results


Object tracking: DeepTracking

P. Ondruska and I. Posner, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” in The Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, Arizona USA, 2016. [code]


Object tracking: Compact features

Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code]
