

Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Future Work

YLI Event Detection Dataset

Growth of Multimedia Data

On YouTube today…

• Number of viewers: 1 billion
• Hours of video watched every month: 6 billion
• Video uploaded every minute: 100 hours
• Average mobile views per day: 1 billion

Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Gerald Friedland, Julia Bernd, Kurt Keutzer

Memory and Speedup

Summary

References
[1] www.yli-corpus.org
[2] B. Elizalde et al. An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content. International Symposium on Multimedia (ISM), 2013.
[3] Ravanelli et al. Audio Concept Classification with Hierarchical Deep Neural Networks. European Signal Processing Conference (EUSIPCO), 2014.

audioCaffe: DNN Framework for Audio Analysis

Method                        Train Time (hr)  Training Speedup  Test Time (hr)  Testing Speedup  Model Size (MB)
i-Vector [2] (baseline)       2.71             1x                7.8             1x               5100
DNN: All Frames Input (ours)  10.33            0.26x             NA              NA               48
DNN: Sparse Sampled (ours)    0.034            78.4x             0.748           10.4x            3

• 78x speedup in training and 10.4x speedup in testing
  • Accelerates development and deployment of systems that use deep neural nets
• 200x reduction in input video feature size
  • Real-time analytics for applications such as video advertisement placement
• 18x reduction in deep neural network size
  • 48 MB reduced to 3 MB
  • Enables content analysis system deployment on mobile devices
• 11 percentage-point accuracy improvement in video event classification
  • Can leverage this to improve intrusion detection
• 5x accuracy improvement in video event detection
  • Can leverage this to improve content-based video search

• Mobile and embedded deployment of audio-based event recognition
• Use audio and visual cues

The 2011 Vancouver riot brought 1,600 hours of user-generated video into the city’s police department.

THE PROBLEM: Manual video monitoring and evidence collection are intractable when we are flooded by video data.

We train our models 78.4x faster than the previous state of the art.

Method                                                        Per-Frame Classification Accuracy  Per-Video Classification Accuracy
DNN (2000:2000:2000:10), Dense Sampled w/ All Frames Input    18.3%                              27.4% [3]
DNN (2000:2000:2000:10), Sparse Sampled w/ 100 Frames Input   28.6%                              36.8%
DNN (600:600:10), Sparse Sampled w/ 100 Frames Input          29.3%                              37.4%
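The sparse-sampling rows above feed the network 100 frames per video rather than every frame, and report both per-frame and per-video accuracy. The poster does not say how the 100 frames are chosen or how per-frame outputs are combined into a per-video label, so the following is only a minimal sketch under two stated assumptions: frames are drawn uniformly at random, and the video label is a majority vote over per-frame predictions. The names `sparse_sample`, `classify_video`, and `classify_frame` are hypothetical and do not come from audioCaffe.

```python
import random
from collections import Counter


def sparse_sample(frames, num_samples=100, seed=0):
    """Pick up to num_samples frames uniformly at random, kept in temporal order.

    Assumption: uniform random selection. The poster only states that
    100 frames per video are used as input.
    """
    if len(frames) <= num_samples:
        return list(frames)
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(frames)), num_samples))
    return [frames[i] for i in indices]


def classify_video(frames, classify_frame, num_samples=100):
    """Label a video by majority vote over its sampled frames.

    Assumption: majority voting. classify_frame is a hypothetical
    stand-in for the trained per-frame DNN classifier.
    """
    if not frames:
        raise ValueError("video has no frames")
    sampled = sparse_sample(frames, num_samples)
    votes = Counter(classify_frame(frame) for frame in sampled)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy video: 5000 "frames" represented by their indices.
    video = list(range(5000))
    # Toy per-frame classifier: event 7 for most frames, event 2 otherwise.
    label = classify_video(video, lambda f: 7 if f % 10 else 2)
    print(len(sparse_sample(video)), label)
```

Sampling before classification is what would reduce both the input feature size and the compute per video in such a scheme; the vote then turns the per-frame decisions into one per-video decision.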

Code available at https://github.com/ashrafk/audioCaffeInitial

Event classification accuracy

Every video in the test set contains 1 of 10 possible events.

Event Category                     i-vector [2] (baseline)  audioCaffe (ours), DNN (600:600:10), Sparse Sampled
Birthday Party                     0.37                     1.10
Flash Mob                          0.12                     0.89
Getting a Vehicle Unstuck          0.12                     0.61
Parade                             0.32                     1.62
Person Attempting a Board Trick    0.23                     1.34
Person Grooming an Animal          0.11                     1.12
Person Hand-Feeding an Animal      0.28                     1.71
Person Landing a Fish              0.10                     0.67
Wedding Ceremony                   0.32                     0.92
Working on a Woodworking Project   0.19                     1.81
Overall MAP                        0.22%                    1.18%

Our pipeline

Event detection accuracy

In the test set, just 1 in 44 videos contains an event.

Framework features:
• Speed
  • Fastest publicly available CPU/GPU CNN framework
  • Optimal memory usage for handling large-scale audio data
• Ease of use
  • Simple definition of layer for network description

name: "mnist-small"
layers {
  layer {
    name: "mnist"
    type: "data"
    source: "data/mnist-train-leveldb"
    batchsize: 64
    scale: 0.00390625
  }
  top: "data"
  top: "label"
}
layers {
  layer {
    name: "ip"
    type: "innerproduct"
    num_output: 10
  }
  bottom: "data"
  top: "ip"
}

Event Detection on the YLI Dataset


2/3 of mobile web traffic will be video by 2018

Yahoo-Livermore-ICSI (YLI) Multimedia Event Detection Dataset [1]
• Total of 700,000 videos collected from Flickr
• 50,000 videos are labeled so far
• YLI is similar to TRECVID-MED
• Unlike TRECVID, YLI is openly available to all researchers

                                         Training set  Test set
Foreground (contains an event)           1000          1000
Background (does not contain an event)   5000          43000
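The foreground and background counts above determine how rare events are in each split. As a quick check of the arithmetic, the short sketch below derives the "1 in N" ratio from the table; the function name `one_in_n` is hypothetical and used only for illustration.

```python
def one_in_n(foreground, background):
    """Return N such that 1 in N videos of the split contains an event."""
    return (foreground + background) / foreground


# Counts taken from the dataset table.
train_ratio = one_in_n(1000, 5000)   # training set
test_ratio = one_in_n(1000, 43000)   # test set

print(f"training set: 1 in {train_ratio:g} videos contains an event")
print(f"test set:     1 in {test_ratio:g} videos contains an event")
```

The test split is far more imbalanced than the training split, which is why detection on the test set is a harder task than classification among videos already known to contain an event.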

University of California, Berkeley and International Computer Science Institute

Deep Learning to improve the speed and accuracy of video event recognition

fewer parameters → slightly better accuracy

sparse sampling → better accuracy
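To make the "fewer parameters" annotation concrete, the sketch below counts the weights and biases of a fully connected network. It rests on two assumptions that the poster does not confirm: that the notation 600:600:10 lists hidden-layer widths followed by a 10-way output layer, and that the per-frame input has 60 features (INPUT_DIM is a purely hypothetical value chosen for illustration). The resulting counts are therefore illustrative and are not the sizes of the authors' models.

```python
INPUT_DIM = 60  # hypothetical per-frame feature dimension, not from the poster


def count_parameters(layer_sizes):
    """Count weights plus biases of a fully connected network.

    layer_sizes lists the input width followed by each layer's width,
    for example [60, 600, 600, 10].
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


small = count_parameters([INPUT_DIM, 600, 600, 10])
large = count_parameters([INPUT_DIM, 2000, 2000, 2000, 10])

print(f"600:600:10        -> {small:,} parameters")
print(f"2000:2000:2000:10 -> {large:,} parameters")
```

Under these assumptions the hidden-to-hidden weight matrices dominate the count, so narrowing and removing hidden layers shrinks the network much faster than it shrinks the input or output layers.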