
Post on 27-Dec-2015


Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Future Work

YLI Event Detection Dataset

Growth of Multimedia Data

On YouTube today…

• Number of viewers: 1 billion
• Hours of video watched every month: 6 billion
• Video uploaded every minute: 100 hours
• Average mobile views per day: 1 billion

Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Gerald Friedland, Julia Bernd, Kurt Keutzer

Memory and Speedup

Summary

References

[1] www.yli-corpus.org
[2] B. Elizalde, et al. An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content. International Symposium on Multimedia (ISM), 2013.
[3] Ravanelli et al. Audio concept classification with Hierarchical Deep Neural Networks. European Signal Processing Conference (EUSIPCO), 2014.

audioCaffe: DNN framework for audio Analysis

Method | Train Time (hr) | Training Speedup | Test Time (hr) | Testing Speedup | Model Size (MB)
i-Vector [2] (baseline) | 2.71 | 1x | 7.8 | 1x | 5100
DNN: All Frames Input (ours) | 10.33 | 0.26x | NA | NA | 48
DNN: Sparse Sampled (ours) | 0.034 | 78.4x | 0.748 | 10.4x | 3
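The speedup columns are ratios of baseline time to method time. The short sketch below recomputes them from the rounded times in the table; the variable and function names are illustrative only. Note that the rounded training time of 0.034 hr gives roughly 79.7x rather than the reported 78.4x, presumably because the reported figure was computed from unrounded timings.

```python
# Times in hours, copied from the Memory and Speedup table.
ivector_train, ivector_test = 2.71, 7.8          # i-Vector baseline
dnn_all_train = 10.33                            # DNN, all frames input
dnn_sparse_train, dnn_sparse_test = 0.034, 0.748 # DNN, sparse sampled


def speedup(baseline_hours, method_hours):
    """Speedup of a method relative to the baseline (>1 means faster)."""
    return baseline_hours / method_hours


all_frames_train_speedup = speedup(ivector_train, dnn_all_train)
sparse_train_speedup = speedup(ivector_train, dnn_sparse_train)
sparse_test_speedup = speedup(ivector_test, dnn_sparse_test)

print(f"All frames, training: {all_frames_train_speedup:.2f}x")
print(f"Sparse sampled, training: {sparse_train_speedup:.1f}x")
print(f"Sparse sampled, testing: {sparse_test_speedup:.1f}x")
```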

• 78x speedup in training and 10.4x speedup in testing
  • Accelerate development and deployment of systems that use deep neural nets
• 200x reduction in input video feature size
  • Real time analytics for applications such as video advertisement placement
• 18x reduction in deep neural network size
  • 48MB reduced to 3MB
  • Enabling content analysis system deployment in mobile devices
• 11 percentage-point accuracy improvement in video event classification
  • Can leverage this to improve intrusion detection
• 5x accuracy improvement in video event detection
  • Can leverage this to improve content based video search

• Mobile and embedded deployment of audio based event recognition
• Use audio and visual cues

The 2011 Vancouver riot brought 1,600 hours of user-generated video to the city’s police department.

THE PROBLEM: Manual video monitoring and evidence collection is intractable when we are flooded by video data.

We train our models 78.4x faster than the previous state of the art.

Method | Per-Frame Classification Accuracy | Per-Video Classification Accuracy
DNN (2000:2000:2000:10), Dense Sampled w/ All Frames Input | 18.3% | 27.4% [3]
DNN (2000:2000:2000:10), Sparse Sampled w/ 100 Frames Input | 28.6% | 36.8%
DNN (600:600:10), Sparse Sampled w/ 100 Frames Input | 29.3% | 37.4%
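The poster does not spell out how the 100 frames are chosen or how per-frame predictions become a per-video label. The sketch below is one plausible reading, under two stated assumptions: frames are drawn uniformly at random without replacement, and the video label is a majority vote over per-frame labels. The function names `sparse_sample` and `per_video_label` are hypothetical and are not taken from audioCaffe.

```python
import random
from collections import Counter


def sparse_sample(frames, num_samples=100, seed=0):
    """Pick up to num_samples frames uniformly at random, keeping temporal order.

    Assumption: uniform random sampling; the poster only states that
    100 frames per video are used as input.
    """
    if len(frames) <= num_samples:
        return list(frames)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(frames)), num_samples))
    return [frames[i] for i in chosen]


def per_video_label(frame_predictions):
    """Majority vote over per-frame labels (an assumed aggregation rule)."""
    return Counter(frame_predictions).most_common(1)[0][0]


# Toy usage: 5000 placeholder "frames" and made-up per-frame predictions.
frames = list(range(5000))
sampled = sparse_sample(frames)
toy_predictions = ["Parade" if i < 60 else "Flash Mob" for i in range(len(sampled))]
video_label = per_video_label(toy_predictions)
print(len(sampled), video_label)
```

Sampling a fixed number of frames makes the input size independent of video length, which is consistent with the reduced feature size and training time reported on the poster.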

Code available at https://github.com/ashrafk/audioCaffeInitial

Event classification accuracy

Every video in the test set contains 1 of 10 possible events

Event Category | i-vector [2] (baseline) | audioCaffe (ours), DNN (600:600:10), Sparse Sampled
Birthday Party | 0.37 | 1.10
Flash Mob | 0.12 | 0.89
Getting a Vehicle Unstuck | 0.12 | 0.61
Parade | 0.32 | 1.62
Person Attempting a Board Trick | 0.23 | 1.34
Person Grooming an Animal | 0.11 | 1.12
Person Hand-Feeding an Animal | 0.28 | 1.71
Person Landing a Fish | 0.10 | 0.67
Wedding Ceremony | 0.32 | 0.92
Working on a Woodworking Project | 0.19 | 1.81
Overall MAP | 0.22% | 1.18%
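As a check on the table, the sketch below assumes the per-event figures are average precision values in percent and that Overall MAP is their unweighted mean across the ten events. Under that assumption the means come out at about 0.216 and 1.179, which match the reported 0.22% and 1.18% after rounding.

```python
# Per-event values copied from the table: (i-vector baseline, audioCaffe).
# Assumption: these are average precision values expressed in percent.
ap_percent = {
    "Birthday Party": (0.37, 1.10),
    "Flash Mob": (0.12, 0.89),
    "Getting a Vehicle Unstuck": (0.12, 0.61),
    "Parade": (0.32, 1.62),
    "Person Attempting a Board Trick": (0.23, 1.34),
    "Person Grooming an Animal": (0.11, 1.12),
    "Person Hand-Feeding an Animal": (0.28, 1.71),
    "Person Landing a Fish": (0.10, 0.67),
    "Wedding Ceremony": (0.32, 0.92),
    "Working on a Woodworking Project": (0.19, 1.81),
}

num_events = len(ap_percent)
ivector_map = sum(base for base, _ in ap_percent.values()) / num_events
audiocaffe_map = sum(ours for _, ours in ap_percent.values()) / num_events
improvement = audiocaffe_map / ivector_map

print(f"i-vector MAP: {ivector_map:.2f}%")
print(f"audioCaffe MAP: {audiocaffe_map:.2f}%")
print(f"Relative improvement: about {improvement:.1f}x")
```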

Our pipeline

Event detection accuracy

In the test set, just 1 in 44 videos contains an event.

Framework features:
• Speed
  • Fastest publicly available CPU/GPU CNN framework
  • Optimal memory usage for handling large-scale audio data
• Ease of use
  • Simple definition of layer for network description

name: "mnist-small"
layers {
  layer {
    name: "mnist"
    type: "data"
    source: "data/mnist-train-leveldb"
    batchsize: 64
    scale: 0.00390625
  }
  top: "data"
  top: "label"
}
layers {
  layer {
    name: "ip"
    type: "innerproduct"
    num_output: 10
  }
  bottom: "data"
  top: "ip"
}

Event Detection on the YLI Dataset

ashrafkhalid@berkeley.edu

2/3 of mobile web traffic will be video by 2018

Yahoo-Livermore-ICSI (YLI) Multimedia Event Detection Dataset [1]
• Total of 700,000 videos collected from Flickr
• 50,000 videos are labeled so far
• YLI is similar to TRECVID-MED
• Unlike TRECVID, YLI is openly available to all researchers

Videos | Training set | Test set
Foreground (contain an event) | 1000 | 1000
Background (does not contain an event) | 5000 | 43000
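The split sizes above determine how rare events are in each set. A small sketch of that arithmetic follows (the dictionary layout and function name are illustrative); it reproduces the "1 in 44" figure for the test set and shows that the training set is far less imbalanced, at 1 in 6.

```python
# Video counts copied from the dataset split table.
train = {"foreground": 1000, "background": 5000}
test = {"foreground": 1000, "background": 43000}


def foreground_fraction(split):
    """Fraction of videos in a split that contain an event."""
    total = split["foreground"] + split["background"]
    return split["foreground"] / total


train_fraction = foreground_fraction(train)
test_fraction = foreground_fraction(test)

print(f"Training set: 1 in {1 / train_fraction:.0f} videos contains an event")
print(f"Test set: 1 in {1 / test_fraction:.0f} videos contains an event")
```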

University of California, Berkeley and International Computer Science Institute

Deep Learning to improve the speed and accuracy of video event recognition

fewer parameters → slightly better accuracy

sparse sampling → better accuracy
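To make "fewer parameters" concrete, the sketch below counts weights and biases in a fully connected network. It assumes that a specification such as 2000:2000:2000:10 lists the hidden layer widths followed by the 10 output classes. The input feature dimension is not given on the poster, so `input_dim = 60` is a purely hypothetical placeholder, and the resulting counts are illustrative rather than the authors' figures.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer widths."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


input_dim = 60  # hypothetical: the poster does not state the feature dimension

large_net = count_parameters([input_dim, 2000, 2000, 2000, 10])
small_net = count_parameters([input_dim, 600, 600, 10])

print(f"DNN (2000:2000:2000:10): {large_net:,} parameters")
print(f"DNN (600:600:10): {small_net:,} parameters")
```

Because the hidden-to-hidden weight matrices dominate, shrinking the hidden width from 2000 to 600 and dropping one layer removes most of the parameters regardless of the exact input size.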
