Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling
Future Work
YLI Event Detection Dataset
Growth of Multimedia Data
On YouTube today…
• Number of viewers: 1 billion
• Hours of video watched every month: 6 billion
• Video uploaded every minute: 100 hours
• Average mobile views per day: 1 billion
Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Gerald Friedland, Julia Bernd, Kurt Keutzer
Memory and Speedup
Summary
References
[1] www.yli-corpus.org
[2] B. Elizalde, et al. An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content. International Symposium on Multimedia (ISM), 2013.
[3] Ravanelli et al. Audio Concept Classification with Hierarchical Deep Neural Networks. European Signal Processing Conference (EUSIPCO), 2014.
audioCaffe: DNN Framework for Audio Analysis
Method                        Train Time (hr)  Training Speedup  Test Time (hr)  Testing Speedup  Model Size (MB)
i-Vector [2] (baseline)       2.71             1x                7.8             1x               5100
DNN: All Frames Input (ours)  10.33            0.26x             NA              NA               48
DNN: Sparse Sampled (ours)    0.034            78.4x             0.748           10.4x            3
• 78x speedup in training and 10.4x speedup in testing
  • Accelerate development and deployment of systems that use deep neural nets
• 200x reduction in input video feature size
  • Real-time analytics for applications such as video advertisement placement
• 18x reduction in deep neural network size
  • 48 MB reduced to 3 MB
  • Enabling content analysis system deployment on mobile devices
• 11 percentage-point accuracy improvement in video event classification
  • Can leverage this to improve intrusion detection
• 5x accuracy improvement in video event detection
  • Can leverage this to improve content-based video search
• Mobile and embedded deployment of audio-based event recognition
• Use audio and visual cues
The 2011 Vancouver riot brought 1,600 hours of user-generated video to the city's police department.
THE PROBLEM: Manual video monitoring and evidence collection are intractable when we are flooded with video data.
We train our models 78.4x faster than the previous state of the art.
Method                                                       Per-Frame Classification Accuracy  Per-Video Classification Accuracy
DNN (2000:2000:2000:10), Dense Sampled w/ All Frames Input   18.3%                              27.4% [3]
DNN (2000:2000:2000:10), Sparse Sampled w/ 100 Frames Input  28.6%                              36.8%
DNN (600:600:10), Sparse Sampled w/ 100 Frames Input         29.3%                              37.4%
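The poster does not say how the 100 input frames are chosen or how per-frame outputs are turned into a per-video label. The sketch below shows one plausible reading, assuming uniform random sampling without replacement and averaging of per-frame class posteriors. The function names, the seed, and the toy data are hypothetical and are not taken from audioCaffe.

```python
import random

def sparse_sample_frames(frames, num_frames=100, seed=None):
    """Keep at most `num_frames` frames, chosen uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    if len(frames) <= num_frames:
        return list(frames)
    # Sort the chosen indices so the kept frames stay in temporal order.
    keep = sorted(rng.sample(range(len(frames)), num_frames))
    return [frames[i] for i in keep]

def classify_video(frame_posteriors):
    """Average per-frame class posteriors and return the best class index (assumed rule)."""
    num_classes = len(frame_posteriors[0])
    n = len(frame_posteriors)
    mean = [sum(p[c] for p in frame_posteriors) / n for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])

# Toy video: 5000 frames, each a 3-dimensional feature vector (hypothetical sizes).
frames = [[float(t)] * 3 for t in range(5000)]
sampled = sparse_sample_frames(frames, num_frames=100, seed=0)
print(len(sampled))  # 100

# Toy per-frame posteriors over 3 classes; class 1 has the highest mean.
predicted = classify_video([[0.1, 0.7, 0.2], [0.3, 0.3, 0.4], [0.2, 0.6, 0.2]])
print(predicted)  # 1
```

With this scheme the network only ever sees a fixed-size subset of each video, which is consistent with the shorter training and test times reported for the sparse-sampled models.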
Code available at https://github.com/ashrafk/audioCaffeInitial
Event classification accuracy
Every video in the test set contains 1 of 10 possible events.
Event Category                    i-vector [2] (baseline)  audioCaffe (ours): DNN (600:600:10), Sparse Sampled
Birthday Party                    0.37                     1.10
Flash Mob                         0.12                     0.89
Getting a Vehicle Unstuck         0.12                     0.61
Parade                            0.32                     1.62
Person Attempting a Board Trick   0.23                     1.34
Person Grooming an Animal         0.11                     1.12
Person Hand-Feeding an Animal     0.28                     1.71
Person Landing a Fish             0.10                     0.67
Wedding Ceremony                  0.32                     0.92
Working on a Woodworking Project  0.19                     1.81
Overall MAP                       0.22%                    1.18%
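As a consistency check on the table, the sketch below recomputes the overall figure as the unweighted mean of the ten per-event values. It assumes the per-event numbers are average precision values in percent, which the poster does not state explicitly. The variable and function names are illustrative.

```python
# Per-event scores copied from the table (assumed to be average precision, in percent).
ivector = {
    "Birthday Party": 0.37, "Flash Mob": 0.12, "Getting a Vehicle Unstuck": 0.12,
    "Parade": 0.32, "Person Attempting a Board Trick": 0.23,
    "Person Grooming an Animal": 0.11, "Person Hand-Feeding an Animal": 0.28,
    "Person Landing a Fish": 0.10, "Wedding Ceremony": 0.32,
    "Working on a Woodworking Project": 0.19,
}
audiocaffe = {
    "Birthday Party": 1.10, "Flash Mob": 0.89, "Getting a Vehicle Unstuck": 0.61,
    "Parade": 1.62, "Person Attempting a Board Trick": 1.34,
    "Person Grooming an Animal": 1.12, "Person Hand-Feeding an Animal": 1.71,
    "Person Landing a Fish": 0.67, "Wedding Ceremony": 0.92,
    "Working on a Woodworking Project": 1.81,
}

def mean_score(scores):
    """Unweighted mean over event categories."""
    return sum(scores.values()) / len(scores)

map_ivector = mean_score(ivector)
map_audiocaffe = mean_score(audiocaffe)
print(round(map_ivector, 2), round(map_audiocaffe, 2))  # 0.22 1.18
print(round(map_audiocaffe / map_ivector, 1))           # 5.5
```

The recomputed means match the "Overall MAP" row, and their ratio is in line with the roughly 5x detection improvement quoted in the summary.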
Our pipeline
Event detection accuracy
In the test set, just 1 in 44 videos contains an event.
Framework features:
• Speed
  • Fastest publicly available CPU/GPU CNN framework
  • Optimal memory usage for handling large-scale audio data
• Ease of use
  • Simple definition of layer for network description
    name: "mnist-small"
    layers {
      layer {
        name: "mnist"
        type: "data"
        source: "data/mnist-train-leveldb"
        batchsize: 64
        scale: 0.00390625
      }
      top: "data"
      top: "label"
    }
    layers {
      layer {
        name: "ip"
        type: "innerproduct"
        num_output: 10
      }
      bottom: "data"
      top: "ip"
    }
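The network description above declares a data layer whose inputs are multiplied by 0.00390625 (1/256), followed by a single inner-product (fully connected) layer with 10 outputs. The sketch below shows what that forward computation amounts to in plain Python. It is not audioCaffe code: the 28x28 input size, the random weights, and the function names are assumptions made for illustration.

```python
import random

SCALE = 0.00390625   # 1/256, from the "scale" field of the data layer
NUM_OUTPUT = 10      # from the "num_output" field of the inner-product layer

def inner_product(x, weights, bias):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward(pixels, weights, bias):
    """Scale the raw input, then apply the inner-product layer."""
    scaled = [p * SCALE for p in pixels]
    return inner_product(scaled, weights, bias)

rng = random.Random(0)
num_inputs = 28 * 28  # assumed MNIST image size; not stated in the config
weights = [[rng.gauss(0.0, 0.01) for _ in range(num_inputs)] for _ in range(NUM_OUTPUT)]
bias = [0.0] * NUM_OUTPUT
pixels = [rng.randrange(256) for _ in range(num_inputs)]  # one hypothetical image

scores = forward(pixels, weights, bias)
print(len(scores))  # 10
```

In the framework itself the data layer would supply batches of 64 such inputs at a time, as set by the batchsize field.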
Event Detection on the YLI Dataset
2/3 of mobile web traffic will be video by 2018
Yahoo-Livermore-ICSI (YLI) Multimedia Event Detection Dataset [1]
• Total of 700,000 videos collected from Flickr
• 50,000 videos are labeled so far
• YLI is similar to TRECVID-MED
• Unlike TRECVID, YLI is openly available to all researchers
                                        Training set  Test set
Foreground (contains an event)          1000          1000
Background (does not contain an event)  5000          43000
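The split sizes above determine how rare an event video is in each set. The short sketch below works out that ratio from the table. The dictionary layout and the function name are illustrative only.

```python
# Video counts copied from the training/test split table.
splits = {
    "train": {"foreground": 1000, "background": 5000},
    "test": {"foreground": 1000, "background": 43000},
}

def one_in_n(split_name):
    """Return N such that 1 in N videos of the split contains an event."""
    counts = splits[split_name]
    total = counts["foreground"] + counts["background"]
    return total / counts["foreground"]

for name in splits:
    print(f"{name}: 1 in {one_in_n(name):.0f} videos contains an event")
# train: 1 in 6 videos contains an event
# test: 1 in 44 videos contains an event
```

The test set is far more skewed toward background than the training set, which makes detection a much harder setting than the 10-way classification task.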
University of California, Berkeley and International Computer Science Institute
Deep Learning to improve the speed and accuracy of video event recognition
fewer parameters → slightly better accuracy
sparse sampling → better accuracy