Deep Learning for Computer Vision (3/4): Video Analytics @ La Salle 2016


@DocXavi

Deep Learning for Computer Vision: Video Analytics, 5 May 2016

Xavier Giró-i-Nieto

Master en Creació Multimedia

Outline

1. Recognition
2. Optical Flow
3. Object Tracking


Recognition

Demo: Clarifai

MIT Technology Review: “A Start-up’s Neural Network Can Understand Video” (3/2/2015)

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015.

Previous lectures

Slides extracted from the ReadCV seminar by Victor Campos.

Recognition: DeepVideo

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

Recognition: DeepVideo: Demo


Recognition: DeepVideo: Architectures


Unsupervised learning [Le et al.’11] vs. supervised learning [Karpathy et al.’14]

Recognition: DeepVideo: Features


Recognition: DeepVideo: Multiscale


Recognition: DeepVideo: Results



Recognition: C3D

Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015


Recognition: C3D: Demo

K. Simonyan and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ICLR 2015.

Recognition: C3D: Spatial dimension

Spatial dimensions (XY) of the kernels are fixed to 3×3, following Simonyan & Zisserman (ICLR 2015).


Recognition: C3D: Temporal dimension

3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.

(Figure: temporal depth variants vs. 2D ConvNets.)


A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets
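As a rough illustration of this homogeneous design (a sketch with placeholder channel widths, not the exact C3D configuration), a stack of 3D convolutions with 3×3×3 kernels can be written in PyTorch as:

```python
import torch
import torch.nn as nn

# Sketch of a homogeneous 3D ConvNet block: every convolution uses a 3x3x3
# kernel (3x3 in space, temporal depth 3) with padding 1, so the clip
# resolution is preserved until pooling.
c3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # first block pools only spatially
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),   # later blocks also pool over time
)

# Input: a batch of clips shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
print(c3d_block(clip).shape)   # torch.Size([1, 128, 8, 28, 28])
```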

Recognition: C3D: Temporal dimension


No gain when varying the temporal depth across layers.



Recognition: C3D: Architecture

Feature vector


Recognition: C3D: Feature vector

The video sequence is split into 16-frame clips with an 8-frame overlap.


Recognition: C3D: Feature vector

Each 16-frame clip produces a 4096-dim feature vector; these are averaged over all clips of the video and L2-normalized into a single 4096-dim video descriptor.
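A minimal sketch of this pipeline, assuming a hypothetical extract_clip_feature helper that stands in for the C3D forward pass up to its 4096-dim layer:

```python
import torch
import torch.nn.functional as F

def video_descriptor(frames, extract_clip_feature, clip_len=16, stride=8):
    """Split a video into 16-frame clips with an 8-frame overlap, extract a
    4096-dim feature per clip, then average over clips and L2-normalize.
    frames: tensor of shape (num_frames, 3, H, W).
    extract_clip_feature: hypothetical stand-in for the C3D forward pass."""
    clip_feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, stride):
        clip_feats.append(extract_clip_feature(frames[start:start + clip_len]))
    descriptor = torch.stack(clip_feats).mean(dim=0)   # average over clips
    return F.normalize(descriptor, p=2, dim=0)         # L2 norm

# Example with a dummy feature extractor.
video = torch.randn(64, 3, 112, 112)
desc = video_descriptor(video, lambda clip: torch.randn(4096))
print(desc.shape)   # torch.Size([4096])
```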


Recognition: C3D: Visualization

Based on Deconvnets by Zeiler and Fergus [ECCV 2014]. See [ReadCV Slides] for more details.


Recognition: C3D: Compactness


Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the state of the art on the other 2 benchmarks.

Recognition: C3D: Performance


Recognition: C3D: Software

Implementation by Michael Gygli (GitHub).

Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694-4702. 2015.

Recognition: Beyond Short Snippets


Recognition: ImageNet Video

[ILSVRC 2015 Slides and videos]


Recognition: ImageNet Video

Kai Kang et al., Object Detection in Videos with Tubelets and Multi-Context Cues (ILSVRC 2015) [video] [poster]


Optical Flow

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).

Optical Flow: FlowNet


End-to-end supervised learning of optical flow.

Optical Flow: FlowNet (contracting)


Option A: Stack both input images together and feed them through a generic network.


Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.


Correlation layer: convolution of data patches from the two layers to be combined.
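As an illustration, here is a simplified correlation layer sketch in PyTorch (single-pixel patches and an exhaustive displacement search; the actual FlowNet layer also limits and strides the search, so treat this as a toy version):

```python
import torch
import torch.nn.functional as F

def correlation_layer(f1, f2, max_disp=4):
    """For every spatial position of f1, compute the dot product with f2
    shifted by every displacement in [-max_disp, max_disp]^2, giving one
    correlation map per displacement."""
    B, C, H, W = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)          # zero-pad so shifts stay aligned
    maps = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            f2_shift = f2_pad[:, :, dy:dy + H, dx:dx + W]
            maps.append((f1 * f2_shift).sum(dim=1, keepdim=True))  # channel-wise dot product
    return torch.cat(maps, dim=1) / C           # (B, (2*max_disp+1)**2, H, W)

# Example: combine features from the two processing streams.
f1, f2 = torch.randn(1, 256, 48, 64), torch.randn(1, 256, 48, 64)
print(correlation_layer(f1, f2).shape)          # torch.Size([1, 81, 48, 64])
```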

Optical Flow: FlowNet (expanding)


Upconvolutional layers: unpooling of feature maps followed by convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contracting part.
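A sketch of one expanding step, with placeholder layer sizes; the upconvolution is written here as a transposed convolution, one common way to implement unpooling + convolution (the coarse flow prediction that FlowNet also concatenates at each step is omitted):

```python
import torch
import torch.nn as nn

class ExpandingBlock(nn.Module):
    """Upsample coarse features and concatenate them with the corresponding
    feature map from the contracting part (placeholder sizes, not FlowNet's)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # "Upconvolution": 2x upsampling via transposed convolution.
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, skip):
        up = torch.relu(self.upconv(coarse))
        merged = torch.cat([up, skip], dim=1)   # reuse contracting-part features
        return torch.relu(self.conv(merged))

# Example: coarse 512-channel map at 12x16, skip connection of 256 channels at 24x32.
block = ExpandingBlock(in_ch=512, skip_ch=256, out_ch=256)
out = block(torch.randn(1, 512, 12, 16), torch.randn(1, 256, 24, 32))
print(out.shape)   # torch.Size([1, 256, 24, 32])
```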

Optical Flow: FlowNet


Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).

ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.

Data augmentation
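A sketch of the photometric part of that augmentation in PyTorch (the parameter ranges below are illustrative assumptions, not the paper's values); the geometric transformations would additionally have to be applied to the ground-truth flow field, which is omitted here:

```python
import torch

def photometric_augment(img1, img2):
    """Apply the same random brightness/contrast/gamma/color change to both
    frames of a training pair, plus independent additive Gaussian noise.
    Inputs: float tensors of shape (3, H, W) with values in [0, 1]."""
    brightness = float(torch.empty(1).uniform_(-0.1, 0.1))
    contrast = float(torch.empty(1).uniform_(0.8, 1.2))
    gamma = float(torch.empty(1).uniform_(0.7, 1.5))
    color = torch.empty(3, 1, 1).uniform_(0.9, 1.1)      # per-channel multiplier

    def apply(img):
        img = img + 0.02 * torch.randn_like(img)          # additive Gaussian noise
        img = (img - 0.5) * contrast + 0.5 + brightness   # contrast and brightness
        img = (img * color).clamp(0.0, 1.0)               # color change
        return img.pow(gamma)                             # gamma
    return apply(img1), apply(img2)

frame1, frame2 = torch.rand(3, 384, 512), torch.rand(3, 384, 512)
aug1, aug2 = photometric_augment(frame1, frame2)
```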



Object tracking: MDNet

Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015).


Object tracking: MDNet: Architecture


Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
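A sketch of the multi-domain idea in PyTorch (the shared layers below are placeholders, not MDNet's actual conv1-fc5 stack): each of the K training sequences gets its own binary target-vs-background branch on top of shared layers, and a single fresh branch is created for a new test sequence:

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    """Shared layers trained on all sequences + one classification branch
    per training sequence (domain). Placeholder layer sizes."""
    def __init__(self, num_domains):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(96 * 3 * 3, 512), nn.ReLU(),
        )
        # One target/background classifier per training sequence.
        self.domain_branches = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)]
        )

    def forward(self, x, domain):
        return self.domain_branches[domain](self.shared(x))

    def new_test_branch(self):
        # At test time the domain branches are discarded and replaced by a
        # single fresh branch, fine-tuned online on the new sequence.
        return nn.Linear(512, 2)

net = MultiDomainNet(num_domains=50)
scores = net(torch.randn(8, 3, 107, 107), domain=3)   # (8, 2) target/background scores
```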

Object tracking: MDNet: Online update


MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.
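For illustration, a minimal sketch of that selection step (the tensor shapes and the candidate sampling around it are assumptions):

```python
import torch

def hard_negative_mining(neg_scores, num_hard):
    """neg_scores: (N, 2) classifier outputs for N negative candidate boxes,
    where column 1 is the 'target' (positive) score. Returns the indices of
    the negatives the classifier is most confident about, i.e. the hardest."""
    return torch.topk(neg_scores[:, 1], k=num_hard).indices

# Example: keep the 96 hardest of 1024 sampled negatives for the online update.
neg_scores = torch.randn(1024, 2)
hard_idx = hard_negative_mining(neg_scores, num_hard=96)
```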

Object tracking: FCNT

Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]


Focus on conv4-3 and conv5-3 of the VGG-16 network pre-trained for ImageNet image classification.

(Figure: feature maps from conv4-3 and conv5-3.)
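A sketch of extracting those two feature maps with torchvision's ImageNet-pretrained VGG-16 (indices 21 and 28 correspond to conv4_3 and conv5_3 in torchvision's standard VGG-16 layout; the weights API assumes a recent torchvision):

```python
import torch
from torchvision import models

vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def conv43_conv53(x):
    """Return the conv4-3 and conv5-3 feature maps for a batch x of shape
    (B, 3, H, W), preprocessed as torchvision's ImageNet models expect."""
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg_features):
            x = layer(x)
            if i == 21:
                feats["conv4_3"] = x
            elif i == 28:
                feats["conv5_3"] = x
                break
    return feats

out = conv43_conv53(torch.randn(1, 3, 224, 224))
print(out["conv4_3"].shape, out["conv5_3"].shape)
# torch.Size([1, 512, 28, 28]) torch.Size([1, 512, 14, 14])
```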

Object tracking: FCNT: Specialization


Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.

Object tracking: FCNT: Localization


Although trained for image classification, feature maps in conv5-3 enable object localization, but they are not discriminative enough to distinguish different objects of the same category.

Object tracking: Localization

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.

Other works have shown how feature maps in convolutional layers allow object localization.

Object tracking: FCNT: Localization


On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…


Object tracking: FCNT: Architecture


SNet = Specific Network (updated online)

GNet = General Network (fixed)

Object tracking: FCNT: Results


Object tracking: DeepTracking

P. Ondruska and I. Posner, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” in The Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, Arizona USA, 2016. [code]


Object tracking: Compact features

Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code]
