SUPER: Towards Real-time Event Recognition in Internet Videos
Yu-Gang Jiang
School of Computer Science
Fudan University
Shanghai, China
ygj@fudan.edu.cn
ACM ICMR 2012, Hong Kong, June 2012
Speeded Up Event Recognition
ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, Jun. 2012.
The Problem
• Recognize high-level events in videos
  We're particularly interested in Internet consumer videos
• Applications
  Video search
  Personal video collection management
  Smart advertising
  Intelligence analysis
  …
Our Objective
Improve Efficiency
Maintain Accuracy
The Baseline Recognition Framework
• Feature extraction: SIFT, spatial-temporal interest points (STIP), MFCC audio feature
• Classifier: χ2 kernel SVM
• Fusion: late average fusion
Best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task.
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
Three Audio-Visual Features…
• SIFT (visual) – D. Lowe, IJCV '04
• STIP (visual) – I. Laptev, IJCV '05
• MFCC (audio) … 16ms 16ms
Bag-of-words Representation
• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
[Figure: Bag-of-SIFT. Vocabulary generation: keypoint extraction (DoG, Hessian Affine) → SIFT feature space → Vocabulary 1, Vocabulary 2. BoW representation: BoW histograms using soft-weighting.]
Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
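A minimal sketch of a soft-weighted bag-of-words histogram, to make the idea concrete. It assumes the usual reading of the cited soft-weighting scheme: each descriptor votes for its top-N nearest words, and the vote for the i-th nearest word is scaled by 1/2^(i-1). The function name `soft_weight_bow`, the default `top_n=4`, and the similarity `1 / (1 + distance)` are illustrative assumptions, not taken from the slides.

```python
import math


def soft_weight_bow(descriptors, vocabulary, top_n=4):
    """Build a soft-weighted BoW histogram (illustrative sketch).

    descriptors: list of feature vectors (e.g. SIFT, STIP, or MFCC frames)
    vocabulary:  list of word centroids with the same dimensionality
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Rank all words by Euclidean distance to this descriptor.
        ranked = sorted((math.dist(d, w), k) for k, w in enumerate(vocabulary))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            sim = 1.0 / (1.0 + dist)  # assumed similarity measure
            # The i-th nearest word (rank = i - 1) gets weight sim / 2^(i-1).
            hist[k] += sim / (2 ** rank)
    return hist


vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
print(soft_weight_bow([(0.0, 0.0)], vocab, top_n=2))
```

With `top_n=1` this reduces to hard assignment weighted by similarity; larger values spread each descriptor over neighbouring words.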
Baseline Speed…
• 4 factors on speed: feature, classifier, fusion, frame sampling
Time per stage of the baseline (seconds):
• SIFT: 82.0
• STIP: 916.8
• MFCC: 2.36
• χ2 kernel SVM classifier: ~2.00
• Late average fusion: <<1
Feature efficiency is measured in seconds needed to process an 80-second video sequence (for SIFT: 0.5 fps).
Classification time is measured by classifying a video with the classifiers of all 20 categories.
Total: 1003 seconds per video!
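The total can be checked with a few lines of arithmetic. The assignment of the extracted numbers to stages follows the order of the baseline diagram (SIFT, STIP, MFCC, classifier), which is an assumption because the slide layout is lost. The dictionary name `baseline_seconds` is hypothetical.

```python
# Per-stage time in seconds for one 80-second video (stage mapping is assumed).
baseline_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "classifier": 2.0,  # the slide gives this only approximately
}

total = sum(baseline_seconds.values())
stip_share = baseline_seconds["STIP"] / total

print(f"total = {total:.2f} s, STIP share = {stip_share:.1%}")
```

Under this reading, STIP extraction alone accounts for over 90% of the baseline time.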
Dataset: Columbia Consumer Videos (CCV)
20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.
Feature Options
• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY
Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009.
Suggested feature combinations:
Classifier Kernels
• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)
Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
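A minimal sketch of the three kernel options, assuming the common exponential form of the chi-square kernel and the exact sorted-table evaluation trick described by Maji et al. for intersection-kernel SVMs. All names here (`chi2_kernel`, `hi_kernel`, `FastHI`, `slow_decision`) and the `gamma` parameter are illustrative, not the authors' implementation.

```python
import bisect
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel: exp(-gamma * sum((x-y)^2 / (x+y)))."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)


def hi_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def slow_decision(x, support, coef, bias=0.0):
    """Direct SVM decision value: one kernel evaluation per support vector."""
    return sum(c * hi_kernel(x, s) for c, s in zip(coef, support)) + bias


class FastHI:
    """Exact HI-kernel decision function using per-dimension sorted tables."""

    def __init__(self, support, coef, bias=0.0):
        self.bias = bias
        self.tables = []
        for d in range(len(support[0])):
            pairs = sorted((s[d], c) for s, c in zip(support, coef))
            values = [v for v, _ in pairs]
            prefix_cv = [0.0]  # running sum of coef * value
            prefix_c = [0.0]   # running sum of coef
            for v, c in pairs:
                prefix_cv.append(prefix_cv[-1] + c * v)
                prefix_c.append(prefix_c[-1] + c)
            self.tables.append((values, prefix_cv, prefix_c))

    def decision(self, x):
        total = self.bias
        for xd, (values, prefix_cv, prefix_c) in zip(x, self.tables):
            k = bisect.bisect_left(values, xd)  # support values below xd
            # min(xd, v) is v for the k smaller values and xd for the rest.
            total += prefix_cv[k] + xd * (prefix_c[-1] - prefix_c[k])
        return total


support = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
coef = [0.7, -1.2, 0.4]  # hypothetical alpha_i * y_i values
x = [0.6, 0.3]
print(slow_decision(x, support, coef), FastHI(support, coef).decision(x))
```

The direct form costs one kernel evaluation per support vector, while the table form needs only one binary search per feature dimension, which is where the speed-up of the fast HI kernel comes from.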
Multi-modality Fusion
• Early Fusion: feature concatenation
• Kernel Fusion: Kf = K1 + K2 + …
• Late Fusion: fusion of classification scores
MFCC, DURF, SSIM, CM, GIST, LBP
MFCC, DURF
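The three fusion options can be sketched in a few lines. This is an illustration on plain Python lists, with hypothetical function names. Late fusion is shown as score averaging, matching the late average fusion of the baseline.

```python
def early_fusion(features):
    """Early fusion: concatenate per-modality feature vectors into one."""
    return [v for f in features for v in f]


def kernel_fusion(kernels):
    """Kernel fusion: Kf = K1 + K2 + ..., summed element-wise."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*kernels)]


def late_average_fusion(scores):
    """Late fusion: average the per-modality classification scores."""
    return sum(scores) / len(scores)


mfcc, durf = [0.1, 0.9], [0.4, 0.2, 0.4]  # toy feature vectors
print(early_fusion([mfcc, durf]))
print(kernel_fusion([[[1, 2], [3, 4]], [[10, 20], [30, 40]]]))
print(late_average_fusion([0.2, 0.4, 0.9]))
```

Early fusion trains a single classifier on the concatenated vector, whereas late fusion needs one classifier per modality, which is one reason early fusion is cheaper at test time.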
Frame Sampling
• DURF: Uniformly sampling 16 frames per video seems sufficient.
K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
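A minimal sketch of uniform frame sampling. Picking the midpoint of each of 16 equal segments is an assumption made for illustration, as is the function name `uniform_frame_indices`; the slides only state that 16 uniformly sampled frames per video seem sufficient.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Return indices of frames sampled uniformly across a video."""
    if num_frames <= num_samples:
        return list(range(num_frames))  # short video: keep every frame
    step = num_frames / num_samples
    # Take the middle frame of each of the num_samples equal segments.
    return [int(step * (i + 0.5)) for i in range(num_samples)]


# An 80-second video at 30 fps has 2400 frames.
print(uniform_frame_indices(2400))
```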
Frame Sampling
• MFCC: Sampling audio frames is always harmful.
Summary
• Feature: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes
220-fold speed-up!
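To put the speed-up in perspective, a quick calculation, assuming the 220-fold figure is measured against the 1003-second baseline for an 80-second video (the slides do not state the reference point explicitly). Variable names are hypothetical.

```python
baseline_total = 1003.0  # seconds per 80-second video (baseline)
speed_up = 220.0         # reported speed-up factor
video_length = 80.0      # seconds

fast_total = baseline_total / speed_up
realtime_factor = video_length / fast_total

print(f"about {fast_total:.2f} s per video, {realtime_factor:.1f}x faster than real time")
```

Under that assumption, processing takes roughly 4.6 seconds per video, well below the video's own duration.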
Demo…
email: ygj@fudan.edu.cn
THANK YOU!