CNN-Based Object Detection in Large Video Images
WangTao
IQIYI Ltd., 2016.4
Outline
• Introduction
- Background
- Challenge
• Our approach
- System framework
- Object detection
- Scene recognition
- Body segmentation
- Same style matching
• Experiments
• Conclusion
Background
• Image retrieval
• Video advertising
Video out applications
Challenge
• Real video data vs. image dataset
- Cluttered background
- Multiple objects
- Small objects
- Varying pose/position
- Partial occlusion
Our task
• Problems:
- Content-based object retrieval in large video images
- High accuracy for same style matching
- High speed on a large video database
• Solution:
- Accurate object detection + scene classification
- Discriminative DNN features and PCA/LDA transformation
- Speed-up by parallel indexing and hierarchical filtering
System framework
[Figure: system framework, with an indexing pipeline and a query pipeline]
• Indexing: video key frame → scene classification, object detection → body segmentation → CNN feature → indexing database
• Query: query image → Faster-RCNN rect → body segmentation → CNN feature, scene classification → match → distance sort → result
Object detection (I)
• Object detection by Faster R-CNN
- Faster R-CNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS2015]
- Trained on MS COCO (300k images) + video images (10k images)
- More general: suited to images with multiple objects
• Multi-class object detection, including
- Clothes (skirt, jacket, trousers)
- Bags (handbag, backpack, trolley case)
- Electronics (mobile phone, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
- Glasses, necklace, hat
- Shoes
Object detection (II)
• Object detection by CNN regression
• Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR2014]
• Efficient for single-object images in which Faster R-CNN does not recognize the object
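One way to read the two detection slides together is as a fallback: use the Faster R-CNN detections when any are confident, and otherwise regress a single box for the whole image. The sketch below illustrates only that control flow. The names `detect_with_fallback` and `regress_box`, the score threshold, and the use of `None` for the missing score and label are assumptions made for illustration, not details given in the slides.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]                   # (x1, y1, x2, y2) in pixels
Detection = Tuple[Box, Optional[float], Optional[str]]    # (box, score, class label)


def detect_with_fallback(
    detections: List[Detection],
    regress_box: Callable[[], Box],
    score_thresh: float = 0.5,   # hypothetical threshold, not from the slides
) -> List[Detection]:
    """Keep confident detector outputs; otherwise fall back to one regressed box."""
    confident = [d for d in detections
                 if d[1] is not None and d[1] >= score_thresh]
    if confident:
        return confident
    # Assumed fallback: a single regressed box with no score and no class label.
    return [(regress_box(), None, None)]


# Toy usage: the regressor is a stand-in lambda, not a real CNN.
dets = [((0.0, 0.0, 10.0, 10.0), 0.9, "handbag"),
        ((5.0, 5.0, 8.0, 8.0), 0.2, "hat")]
print(detect_with_fallback(dets, lambda: (1.0, 1.0, 2.0, 2.0)))
print(detect_with_fallback([], lambda: (1.0, 2.0, 3.0, 4.0)))
```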
Body Segmentation
• Constrained by human body parts
• CNN-based body segmentation [Jonathan Long, CVPR2015]
• Bounding box, body mask, body parsing
[Figure: original image and segmentation image]
Scene classification
• CNN-based scene classification [Bolei Zhou, NIPS2014]
• Pipeline: video key frame → is scene? (yes/no) → CNN-based scene classification → tags
[Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]
Multi-frame fusion
Scene classification: Precision 65.8%, Recall 74%
[email protected]: Precision 83.8%, Recall 56.7%
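The slides do not say how multi-frame fusion is computed. A minimal sketch, assuming the per-frame class probabilities of one shot are averaged and that low-confidence results are rejected (which would trade recall for precision, as in the second row above), could look like this. The function name `fuse_frames` and the `min_confidence` parameter are hypothetical.

```python
from typing import Dict, List, Optional


def fuse_frames(frame_probs: List[Dict[str, float]],
                min_confidence: float = 0.0) -> Optional[str]:
    """Average per-frame class probabilities and return the top class.

    Returns None when there are no frames or when the best averaged
    probability is below min_confidence (an assumed rejection rule).
    """
    if not frame_probs:
        return None
    totals: Dict[str, float] = {}
    for probs in frame_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    best_label, best_total = max(totals.items(), key=lambda kv: kv[1])
    if best_total / len(frame_probs) < min_confidence:
        return None
    return best_label


# Toy scores for three key frames of the same shot.
frames = [
    {"kitchen": 0.6, "office": 0.4},
    {"kitchen": 0.3, "office": 0.7},
    {"kitchen": 0.8, "office": 0.2},
]
print(fuse_frames(frames))                      # kitchen
print(fuse_frames(frames, min_confidence=0.8))  # None
```

A single frame (the second one) would have voted for "office"; averaging over the shot smooths that out.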
Scene classes
0 kitchen, 1 dining, 2 bakery, 3 ice_cream_parlor, 4 bathroom, 5 washing_room, 6 bedroom, 7 living_room, 8 office, 9 children_room, 10 nursery, 11 toyshop, 12 shoe_shop, 13 jewelry_shop,
14 outdoor_ice_world, 15 indoor_ice_skating_rink, 16 baseball, 17 football, 18 basketball_court, 19 swimming_pool, 20 track, 21 bowling_alley, 22 billiards, 23 tennis, 24 volleyball, 25 gymnasium, 26 pleasure_ground, 27 hospital_room,
28 dentists, 29 drugstore, 30 music_studio, 31 music_store, 32 sandbeach, 33 hairsalon, 34 bar, 35 pagoda, 36 bamboo_forest, 37 mountain, 38 coast, 39 creek, 40 waterfall, 41 grass, 42 other
Same style matching
• SIFT feature matching
- Normalization of SIFT
- Dimension: 128 dim x 400 pts
- MAP 22%
• CNN feature of ImageNet 1k classifier
- Model: VGG19
- Layers: fc7
- Dimension: 4096 600
- MAP 28%
• CNN feature of same style classifier
- Model: VGG19
- Layers: fc7
- Dimension: 4096 600
- MAP 34%
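The slide does not specify which SIFT normalization is used. Since the reference list cites Arandjelović and Zisserman, one plausible candidate is the RootSIFT mapping from that paper: L1-normalize the descriptor, then take the element-wise square root. The sketch below shows that mapping on a plain list; treating it as the normalization meant here is an assumption.

```python
import math
from typing import List


def root_sift(descriptor: List[float], eps: float = 1e-7) -> List[float]:
    """L1-normalize a SIFT descriptor, then take the element-wise square root."""
    l1 = sum(abs(v) for v in descriptor) + eps
    return [math.sqrt(abs(v) / l1) for v in descriptor]


# Toy 4-dim stand-in for a real 128-dim SIFT descriptor.
desc = [4.0, 0.0, 12.0, 0.0]
rs = root_sift(desc)
print(rs)                      # approximately [0.5, 0.0, 0.866, 0.0]
print(sum(v * v for v in rs))  # approximately 1.0, i.e. unit L2 norm
```

After this mapping, Euclidean distance between descriptors corresponds to the Hellinger kernel on the original histograms, which is the motivation given in the cited paper.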
Multi-feature fusion
• Same class matching classifier trained on ImageNet (21k classes, 15M images)
• Same style matching classifier trained on 1239 queries of 1M images
• Speed
- Nvidia K40 GPU, 10x faster than CPU i7
- Faster R-CNN speed: 200 ms/frame, image size 1920x1080
- VGG19 feature speed: 60 ms/frame, image size 256x256
CNN model                     Feature dim   MAP
Inception_bn1k                1024          24%
Inception_21k                 1024          34%
Vgg19_caffe                   4096          34%
Inception_21k + vgg19_caffe   5120          43%
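The fused dimension in the table (1024 + 4096 = 5120) is consistent with simple concatenation of the two features. A minimal sketch under that assumption follows; the per-feature L2 normalization before concatenation and the names `l2_normalize` and `fuse_features` are illustrative choices, not stated in the slides.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def fuse_features(*features: Sequence[float]) -> List[float]:
    """Concatenate features after L2-normalizing each one separately."""
    fused: List[float] = []
    for feat in features:
        fused.extend(l2_normalize(feat))
    return fused


inception_feat = [0.1] * 1024   # placeholder for an Inception_21k feature
vgg_feat = [0.2] * 4096         # placeholder for a Vgg19_caffe fc7 feature
fused = fuse_features(inception_feat, vgg_feat)
print(len(fused))               # 5120
```

Normalizing each part first keeps the higher-dimensional feature from dominating the distance purely because of its scale.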
Experiments
• MAP on 3M testing images, trained on 1M images
• Speed-up
- Parallel FLANN tree indexing
- Hierarchical filtering by object classes, 10x faster
- Query speed: 1 s/image on 5000 teleplays with 2M images
VGG19 model   Full image   Object rectangle   PCA+LDA   Inception-21k   MAP
√             √            ×                  ×         ×               27.8%
√             ×            √                  ×         ×               34.2%
√             ×            √                  √         ×               37.3%
√             ×            √                  ×         √               43.1%
√             ×            √                  √         √               46.1%
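For readers unfamiliar with the metric in the table, the sketch below computes mean average precision (MAP) over ranked retrieval results. The slides do not state which AP variant was used; this is the common definition that normalizes by the number of relevant items per query, and the toy ids are made up for illustration.

```python
from typing import List, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """AP for one query: sum of precision@k at each relevant hit, over |relevant|."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(results: List[Sequence[str]],
                           ground_truth: List[Set[str]]) -> float:
    """Mean of per-query AP values; 0.0 when there are no queries."""
    aps = [average_precision(r, g) for r, g in zip(results, ground_truth)]
    return sum(aps) / len(aps) if aps else 0.0


# Two toy queries: ranked result ids and the ids that truly match.
results = [["a", "x", "b"], ["y", "c"]]
ground_truth = [{"a", "b"}, {"c"}]
print(mean_average_precision(results, ground_truth))  # about 0.667
```

In the example, the first query scores (1/1 + 2/3) / 2 ≈ 0.833 and the second scores (1/2) / 1 = 0.5, so the mean is about 0.667.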
Query system GUI
Query examples on image dataset
Query examples on video dataset
Conclusion
• Cropping to the object bounding box is important for recognition: MAP rises from 27.8% (full image) to 34.2% (object rectangle)
• Fusing same style matching and same class matching features gives higher accuracy
• PCA and LDA further improve accuracy and speed
• GPU speeds up CNN feature extraction (10x faster than CPU i7)
• Speed up query by parallel indexing and hierarchical filtering
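Hierarchical filtering by object class can be pictured as a two-stage lookup: restrict the search to database entries whose detected class matches the query, then sort only those candidates by feature distance. The sketch below is a simplified illustration under that reading. The class `ClassFilteredIndex` is hypothetical, and the brute-force distance sort inside each bucket stands in for the parallel FLANN tree index mentioned in the slides.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Entry = Tuple[str, List[float]]  # (image id, feature vector)


class ClassFilteredIndex:
    """Toy index: one bucket of features per detected object class."""

    def __init__(self) -> None:
        self._buckets: Dict[str, List[Entry]] = defaultdict(list)

    def add(self, object_class: str, image_id: str,
            feature: Sequence[float]) -> None:
        self._buckets[object_class].append((image_id, list(feature)))

    def query(self, object_class: str, feature: Sequence[float],
              top_k: int = 5) -> List[Tuple[str, float]]:
        # Stage 1: keep only entries with the same object class.
        candidates = self._buckets.get(object_class, [])
        # Stage 2: sort the surviving candidates by Euclidean distance.
        scored = sorted((math.dist(feature, f), image_id)
                        for image_id, f in candidates)
        return [(image_id, d) for d, image_id in scored[:top_k]]


index = ClassFilteredIndex()
index.add("handbag", "frame_002", [1.0, 1.0])
index.add("handbag", "frame_001", [0.0, 1.0])
index.add("shoes", "frame_003", [0.0, 1.0])

hits = index.query("handbag", [0.1, 1.0], top_k=2)
print([image_id for image_id, _ in hits])  # ['frame_001', 'frame_002']
```

The "shoes" entry is never compared against the query, which is where the speed-up comes from when the database holds many classes.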
References
• Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
• Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
• Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
• Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
• Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. arXiv:1411.4038.
• Zheng, S., et al. "Conditional random fields as recurrent neural networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
• Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR. 2015.
• Zhou, Bolei, et al. "Learning deep features for scene recognition using Places database." Advances in Neural Information Processing Systems. 2014.
• Zhou, Bolei, et al. "Object detectors emerge in deep scene CNNs." International Conference on Learning Representations. 2015.
• Wu, Ruobing, et al. "Harvesting discriminative meta objects with deep CNN features for scene classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
• Szegedy, Christian, et al. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.