cnn based object detection in large video images · 2016. 4. 12. · • li shen, zhouchen lin and...

CNN Based Object Detection in Large Video Images

WangTao, [email protected]

IQIYI ltd. 2016.4

Outline

• Introduction

- Background

- Challenge

• Our approach

- System framework

- Object detection

- Scene recognition

- Body segmentation

- Same style matching

• Experiments

• Conclusion

Background

• Image retrieval

• Video advertising

Video out applications

Challenge

• Real video data vs. image dataset

- Cluttered background

- Multiple objects

- Small objects

- Varying pose/position

- Partial occlusion

Our task

• Problems:

• Content based object retrieval in large video images

• High accuracy for same style matching

• High speed in large video database

• Solution:

• Accurate object detection + scene classification

• Discriminative DNN features with PCA/LDA transformation

• Speed up by parallel indexing and hierarchical filtering

System framework

[Figure: system framework, with an indexing path and a query path]

• Indexing path blocks: video key frame, scene classification, object detection, body segmentation, CNN feature, indexing database

• Query path blocks: query image, Faster-RCNN rect, body segmentation, CNN feature, scene classification, match, distance sort, result

Object detection (I)

• Object detection by Faster R-CNN

• Faster R-CNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS 2015]

• Trained on MS COCO (300k images) + video images (10k images)

• More widely applicable to images with multiple objects

• Multi-class object detection, including:

- Clothes (skirt, jacket, trousers)

- Bags (handbag, backpack, trolley case)

- Electronics (mobile phone, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)

- Glasses, necklace, hat

- Shoes

Object detection (II)

• Object detection by CNN regression

• Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR2014]

• Efficient for single-object images that Faster R-CNN fails to recognize
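The division of labor between the two detectors can be sketched as below. This is a minimal illustration, assuming a fallback order that the slides only imply: keep the Faster R-CNN boxes when any of them pass a score threshold, otherwise fall back to a single regressed box. The names `Box` and `choose_boxes` and the 0.5 threshold are hypothetical, and both detectors are stand-ins supplied by the caller.

```python
from typing import Callable, List, NamedTuple


class Box(NamedTuple):
    """A detected rectangle with a class label and a confidence score."""
    label: str
    score: float
    x1: int
    y1: int
    x2: int
    y2: int


def choose_boxes(
    detections: List[Box],
    regress_box: Callable[[], Box],
    min_score: float = 0.5,  # hypothetical threshold, not from the slides
) -> List[Box]:
    """Prefer multi-object detections; fall back to single-box regression."""
    kept = [b for b in detections if b.score >= min_score]
    if kept:
        # Return confident detections, highest score first.
        return sorted(kept, key=lambda b: b.score, reverse=True)
    # No confident detection: assume one dominant object and regress its box.
    return [regress_box()]
```

Passing the regression step as a callable means it only runs when the detector returns nothing usable.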

Body Segmentation

• Constrained by human body parts

• CNN-based body segmentation [Jonathan Long, CVPR 2015]

• Bounding box, body mask, body parsing

[Figure: original image and segmentation image]

Scene classification

• CNN based Scene classification [Bolei Zhou, NIPS2014]

[Figure: scene classification pipeline. Blocks: video key frame, "Is scene?" (yes/no), CNN-based scene classification, tags, multi-frame fusion. Example panels: non-scene images; scene images of kitchen, office, living room, and bedroom.]

• Scene classification: precision 65.8%, recall 74%

• [email protected]: precision 83.8%, recall 56.7%
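One way to read "multi-frame fusion" is averaging per-frame class probabilities over neighboring key frames before assigning a tag. The slides do not spell out the fusion rule, so the sketch below is an assumption: `fuse_frames`, the mean-probability rule, and the `min_conf` rejection threshold are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def fuse_frames(
    frame_probs: List[Dict[str, float]],
    min_conf: float = 0.5,  # hypothetical rejection threshold
) -> Tuple[Optional[str], float]:
    """Average class probabilities over frames and return (label, confidence).

    Returns (None, confidence) when the best mean probability is below min_conf.
    """
    if not frame_probs:
        return None, 0.0
    totals: Dict[str, float] = {}
    for probs in frame_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    # Classes missing from a frame contribute zero for that frame.
    best_label, best_total = max(totals.items(), key=lambda kv: kv[1])
    mean_conf = best_total / len(frame_probs)
    if mean_conf < min_conf:
        return None, mean_conf
    return best_label, mean_conf
```

Raising `min_conf` rejects more uncertain frames, which generally trades recall for precision.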

Scene classes

• 0 kitchen • 1 dining • 2 bakery • 3 ice_cream_parlor • 4 bathroom • 5 washing_room • 6 bedroom • 7 living_room • 8 office • 9 children_room • 10 nursery • 11 toyshop • 12 shoe_shop • 13 jewelry_shop

• 14 outdoor_ice_world • 15 indoor_ice_skating_rink • 16 baseball • 17 football • 18 basketball_court • 19 swimming_pool • 20 track • 21 bowling_alley • 22 billiards • 23 tennis • 24 volleyball • 25 gymnasium • 26 pleasure_ground • 27 hospital_room

• 28 dentists • 29 drugstore • 30 music_studio • 31 music_store • 32 sandbeach • 33 hairsalon • 34 bar • 35 pagoda • 36 bamboo_forest • 37 mountain • 38 coast • 39 creek • 40 waterfall • 41 grass • 42 other

Same style matching

• SIFT feature matching

- Normalization of SIFT

- Dimension: 128 dim x 400 pts

- MAP: 22%

• CNN feature of ImageNet 1k classifier

- Model: VGG19

- Layer: fc7

- Dimension: 4096 600

- MAP: 28%

• CNN feature of same style classifier

- Model: VGG19

- Layer: fc7

- Dimension: 4096 600

- MAP: 34%
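The slides do not say which SIFT normalization was used. One well-known option, from the Arandjelović and Zisserman paper listed in the references, is RootSIFT: L1-normalize each descriptor, then take the element-wise square root. The sketch below assumes that choice; `root_sift` is an illustrative name, not the author's code.

```python
import math
from typing import List, Sequence


def root_sift(descriptor: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Map a SIFT descriptor to RootSIFT.

    SIFT entries are non-negative; abs() only guards against bad input.
    The result has (approximately) unit L2 norm.
    """
    l1 = sum(abs(v) for v in descriptor) + eps
    return [math.sqrt(abs(v) / l1) for v in descriptor]
```

Comparing RootSIFT vectors with Euclidean distance corresponds to comparing the original descriptors with the Hellinger kernel, which is the motivation given in that paper.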

Multi-feature fusion

• Same class matching classifier trained on ImageNet 21k classes (15M images)

• Same style matching classifier trained on 1239 queries (1M images)

• Speed

- Nvidia K40 GPU, 10x faster than CPU i7

- Faster R-CNN speed: 200 ms/frame, image size 1920x1080

- VGG19 feature speed: 60 ms/frame, image size 256x256

CNN models                     Feature dim   MAP
Inception_bn1k                 1024          24%
Inception_21k                  1024          34%
Vgg19_caffe                    4096          34%
Inception_21k + vgg19_caffe    5120          43%
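The fused dimension in the table (1024 + 4096 = 5120) is consistent with simple concatenation of the two feature vectors. A minimal sketch of that idea follows; L2-normalizing each vector before concatenation is an assumption made here so that neither model dominates the distance, and the slides do not state whether it was done. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def fuse_features(*features: Sequence[float]) -> List[float]:
    """Concatenate per-model features after normalizing each one."""
    fused: List[float] = []
    for feat in features:
        fused.extend(l2_normalize(feat))
    return fused
```

For example, `fuse_features(inception_feat, vgg_feat)` with a 1024-dim and a 4096-dim input returns a 5120-dim vector.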

Experiments

• MAP on 3M test images; models trained on 1M images

• Speed up

- Parallel FLANN tree indexing

- Hierarchical filtering by object classes, 10x faster speed

- Query speed: 1 s/image on 5000 teleplays with 2M images
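Hierarchical filtering by object class can be sketched as a two-step lookup: restrict the candidates to the bucket for the query's detected class, then sort only those candidates by feature distance. The toy index below uses brute-force Euclidean distance in place of FLANN trees, and `ClassFilteredIndex` with its methods is a hypothetical name, not the system's actual API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class ClassFilteredIndex:
    """Toy index: one bucket of (item_id, feature) pairs per object class."""

    def __init__(self) -> None:
        self._buckets: Dict[str, List[Tuple[str, Sequence[float]]]] = defaultdict(list)

    def add(self, item_id: str, label: str, feature: Sequence[float]) -> None:
        self._buckets[label].append((item_id, feature))

    def query(
        self, label: str, feature: Sequence[float], top_k: int = 5
    ) -> List[Tuple[str, float]]:
        """Return up to top_k (item_id, distance) pairs from the label's bucket."""
        # Step 1: class filter. Other classes are never compared.
        candidates = self._buckets.get(label, [])
        # Step 2: distance sort within the surviving bucket.
        scored = sorted((math.dist(feature, f), item_id) for item_id, f in candidates)
        return [(item_id, d) for d, item_id in scored[:top_k]]
```

Because each bucket is independent, the per-class searches could also run in parallel, which is in the spirit of the parallel indexing mentioned above.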

VGG19 model   Full image   Object rectangle   PCA+LDA   Inception-21k   MAP
√             √            ×                  ×         ×               27.8%
√             ×            √                  ×         ×               34.2%
√             ×            √                  √         ×               37.3%
√             ×            √                  ×         √               43.1%
√             ×            √                  √         √               46.1%
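For readers unfamiliar with the metric, MAP (mean average precision) can be computed as in the sketch below. It uses the standard definition over a full ranked list; the slides do not describe the evaluation protocol (rank cutoff, how relevance was judged), so this is not necessarily the exact procedure behind the numbers above.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved count as zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the per-query AP over all (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```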

Query system GUI

Query examples on image dataset

Query examples on video dataset

Conclusion

• Object bounding boxes are important for recognizing objects

• Fusing same style matching features with same class matching features gives higher accuracy

• PCA and LDA further improve accuracy and speed

• GPU is about 10x faster than CPU for CNN feature extraction

• Speed up query by parallel indexing and hierarchical filtering

References

• Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

• Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

• Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

• Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.

• Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR, 2015, arXiv:1411.4038

• Conditional Random Fields as Recurrent Neural Networks. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr ICCV 2015.

• Li Shen, Zhouchen Lin and Qingming Huang, Learning deep convolutional neural networks for places2 scene recognition, CoRR, 2015

• Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba and Aude Oliva, Learning Deep Features for Scene Recognition using Places Database, NIPS, 2014

• Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba, Object detectors emerge in deep scene cnns, ICLR, 2015

• Ruobing Wu, Baoyuan Wang, Wenping Wang and Yizhou Yu, Harvesting discriminative meta objects with deep CNN features for Scene Classification, ICCV, 2015

• Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, Rethinking the Inception Architecture for Computer Vision, arXiv:1512.00567, 2015
