Audio and (Multimedia) Methods for Video Analysis
Dr. Gerald Friedland, [email protected]
CS-298 Seminar, Fall 2011
More on Sound
• Responses to Email Questions
• Useful Filters
• More Features
• Typical Machine Learning on Audio
• More on Tools
Q&A
Why didn’t you talk about Loudspeakers?
Loudspeakers vs Microphones
Source: http://www.mediacollege.com/audio/microphones/dynamic.html
Source: http://www.physics.org
Loudspeaker Characteristics
Frequency dependent! (As are the response patterns of Microphones!)
Q&A
- Speakers don’t really help with video analysis...
- Exception: Multichannel!
Q&A
How is the directionality graph for Microphones generated?
- According to IEC 60268-4
Q&A
Does lossy compression influence audio analysis?
- Yes, but the degree is unknown.
- Common sense: it’s better to train your models on data processed with the same compression algorithm.
Q&A
Why is the CD sampling rate 44100 Hz and not just 44000?
- Because 44100 is divisible by both 25 fps (PAL) and 30 fps (NTSC) video frame rates
Useful Filters
• Low-, High-, Band-pass, Graphic Equalizer
• Dynamic Range Compression
• Filter by Example: Spectral Subtraction, Wiener Filter
• Handling Multiple Channels
Remember: Fourier Transform
DFT:  X[k] = Σ n=0..N-1 x[n] · e^(−2πi·k·n/N)
iDFT: x[n] = (1/N) · Σ k=0..N-1 X[k] · e^(2πi·k·n/N)
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
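The two transforms above can be sketched directly in pure Python. This is an O(N²) illustration of the definitions, not production code; real applications use an FFT library.

```python
# Minimal sketch of the DFT and iDFT definitions (O(N^2), illustration only).
import cmath

def dft(x):
    """DFT: X[k] = sum_n x[n] * e^(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * e^(2*pi*i*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]
```

Applying `idft(dft(x))` recovers the original samples up to floating-point rounding, and `dft(x)[0]` is the DC coefficient (the sum of the samples).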
Remember: Fast Fourier Transform
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
// Input: An array X of N samples. N must be an integer power of two.
// s is the stride.
// Output: The Discrete Fourier Transform of X in the array Y.
FFT(X, N, s)
  IF (N == 1) THEN
    Y[0] := X[0]                                   // Trivial size-1 case: Sample = DC coefficient
  ELSE
    Y[0,...,N/2-1] := FFT(X, N/2, 2*s)             // DFT of (X[0], X[2s], X[4s], ...)
    Y[N/2,...,N-1] := FFT(X[s,...,N-1], N/2, 2*s)  // DFT of (X[s], X[s+2s], X[s+4s], ...)
    FOR k := 0 TO N/2-1                            // Combine the two halves into one
      t := Y[k]
      Y[k]     := t + exp(-2*pi*i*k/N) * Y[k+N/2]
      Y[k+N/2] := t - exp(-2*pi*i*k/N) * Y[k+N/2]
  FFT := Y
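The recursive radix-2 pseudocode above can be written as runnable Python (the explicit stride parameter is replaced by list slicing, which does the same even/odd split; input length must be a power of two):

```python
# Runnable recursive radix-2 Cooley-Tukey FFT, mirroring the pseudocode.
import cmath

def fft(x):
    N = len(x)
    if N == 1:
        return [x[0]]                    # size-1 case: sample = DC coefficient
    even = fft(x[0::2])                  # DFT of x[0], x[2], x[4], ...
    odd = fft(x[1::2])                   # DFT of x[1], x[3], x[5], ...
    Y = [0j] * N
    for k in range(N // 2):              # combine the two halves into one
        w = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # twiddle factor
        Y[k] = even[k] + w
        Y[k + N // 2] = even[k] - w
    return Y
```

For example, `fft([1.0, 1.0, 1.0, 1.0])` yields `[4, 0, 0, 0]` (all energy in the DC bin), and an impulse `[1, 0, 0, 0]` yields a flat spectrum.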
Remember: Convolution Theorem
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
Let f and g be two functions, f * g their convolution, and F the Fourier transform. Then:

F{f * g} = F{f} · F{g}

i.e., convolution in the time domain becomes pointwise multiplication in the frequency domain.
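The theorem can be checked numerically for the discrete (circular) case, where DFT(f ⊛ g) equals the bin-wise product of the DFTs:

```python
# Numerical sanity check of the discrete convolution theorem.
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def circular_convolve(f, g):
    # (f ⊛ g)[n] = sum_m f[m] * g[(n - m) mod N]
    N = len(f)
    return [sum(f[m] * g[(n - m) % N] for m in range(N)) for n in range(N)]

f = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 0.0, 0.5, 0.0]
lhs = dft(circular_convolve(f, g))
rhs = [a * b for a, b in zip(dft(f), dft(g))]
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```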
Bandpass Filters
Dynamic Range Compression
Dynamic Range Compression
Dark blue: original; light blue: dynamic-range compressed.
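A minimal sketch of the idea: samples above a threshold are attenuated by a fixed ratio. The `threshold` and `ratio` values here are illustrative, and real compressors add attack/release smoothing, which is omitted:

```python
# Minimal static dynamic-range compressor sketch (no attack/release).
def compress(samples, threshold=0.5, ratio=4.0):
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            # attenuate only the part of the magnitude above the threshold
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if s >= 0 else -mag)
    return out
```

With a 4:1 ratio and 0.5 threshold, a sample of 0.9 becomes 0.6, while samples below the threshold pass unchanged.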
Filter by Example: Spectral Subtraction
Dark blue: Original with humming noise, Light blue: Corrected using spectral subtraction of humming.
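Spectral subtraction can be sketched per frame: estimate the noise magnitude spectrum (e.g., from a silent segment), subtract it from the frame's magnitude spectrum with a floor at zero, keep the noisy phase, and transform back. The function name and the O(N²) transforms are illustrative only:

```python
# Minimal spectral-subtraction sketch on one frame.
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def spectral_subtract(frame, noise_mag):
    X = dft(frame)
    cleaned = []
    for Xk, Nk in zip(X, noise_mag):
        mag = max(abs(Xk) - Nk, 0.0)            # subtract noise magnitude, floor at 0
        cleaned.append(cmath.rect(mag, cmath.phase(Xk)))  # keep the noisy phase
    return [c.real for c in idft(cleaned)]
```

With an all-zero noise estimate the frame passes through unchanged, which is a quick way to test the plumbing.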
Filter by Example: Wiener Filter
Assume:

x(t) = s(t) + n(t) and ŝ(t) = g(t) * x(t)

where:
• s(t) is the original signal
• n(t) is the noise footprint
• ŝ(t) is the estimated signal
• g(t) is the Wiener filter
Filter by Example: Wiener Filter
Define the error:

e(t) = s(t + α) − ŝ(t)

with α being the delay. Thus:

e²(t) = (s(t + α) − ŝ(t))² (squared error)
Filter by Example: Wiener Filter
Results in:

G(s(t)) = X(s(t)) / (X(s(t)) + N(s(t)))

where:
• X(s(t)) is the power spectrum of the signal s(t),
• N(s(t)) is the power spectrum of the noise, and
• G(s(t)) is the frequency-space Wiener filter.
Problem: Do we know N(s(t))?
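The resulting gain can be sketched per frequency bin, given power-spectrum estimates of signal and noise; obtaining the noise estimate in practice is exactly the open problem just raised:

```python
# Minimal per-bin Wiener gain sketch: G = X / (X + N) for power spectra.
def wiener_gain(signal_power, noise_power):
    return [x / (x + n) if (x + n) > 0 else 0.0
            for x, n in zip(signal_power, noise_power)]
```

The gain is always between 0 and 1: with no noise a bin passes unchanged (gain 1), and with noise power equal to signal power the bin is halved.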
Multichannel Audio
• Microphone array recordings popular with the speech community
• Stereo popular since the 60s
• Most cameras have stereo microphones
• Most DVDs are encoded in Dolby Digital 5.1 and in Dolby Surround on the stereo track
Dolby Digital
Source: http://www.aspecttv.co.uk/surround-sound-setup/
Dolby Surround
How to Cope with Multiple Channels?
• Treat them individually (different channel = different content)
• Combine them, ideally while reducing noise
• Cross-infer between channels
• Any combination of the above
Combining Channels
• Beamforming, two major techniques:
  - Delay-Sum
  - Frequency-Domain Beamformer
Delay-Sum Beamforming
More Information: http://www.xavieranguera.com/phdthesis/
Delay-Sum Beamforming
Output:
- Sum of all channels (lower noise)
- Delays between microphones (directional signal)
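Delay-and-sum can be sketched with integer sample delays: shift each channel so the target source aligns across microphones, then average. In practice the delays are estimated by cross-correlation (TDOA); here they are given:

```python
# Minimal delay-and-sum beamformer sketch with known integer delays.
def delay_and_sum(channels, delays):
    length = len(channels[0])
    out = []
    for n in range(length):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = n + d                 # advance this channel by its delay
            acc += ch[idx] if 0 <= idx < len(ch) else 0.0
        out.append(acc / len(channels))  # average: coherent signal adds, noise averages out
    return out
```

For an impulse that reaches the second microphone one sample later, delays of [0, 1] re-align the channels so the averaged output reproduces the impulse at full amplitude.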
More on Features
• MFCC (Mel-Frequency Cepstral Coefficients)
Other (not explained here):
• LPC (Linear Prediction Coefficients)
• PLP (Perceptual Linear Predictive) features
• RASTA (see Morgan et al.)
• MSG (Modulation Spectrogram)
MFCC: Idea
MFCCs compute the power cepstrum of the signal:

Audio Signal → Pre-emphasis → Windowing → FFT → Mel-Scale Filterbank → Log-Scale → DCT → MFCC
MFCC: Mel Scale
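The Mel scale maps physical frequency to perceived pitch; one common formula (O'Shaughnessy's, one of several variants in use) can be sketched as:

```python
# Hz <-> Mel conversion using the common 2595 * log10(1 + f/700) formula.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The constants are chosen so that 1000 Hz maps to approximately 1000 mel; below about 1 kHz the scale is roughly linear, above it roughly logarithmic.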
MFCC: Result
MFCC Variants and Derivatives

Derivatives:
• LFCC (no Mel scale)
• AMFCC (anti-Mel scale)

Parameters:
• MFCC12 (often used for ASR)
• MFCC19 (often used in speaker ID, diarization)
• “delta”: differences of coefficients between frames (“first derivative”)
• “deltadelta”: “second derivative”
• Short-term: usually calculated on a 10–50 ms window
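Delta features can be sketched as a simple two-point central difference of each coefficient across frames (toolkits such as HTK typically use a regression over several neighboring frames instead; edges are handled by repeating the boundary frame):

```python
# Minimal two-point "delta" feature sketch over a list of frames,
# where each frame is a list of coefficients (e.g., 12 MFCCs).
def deltas(frames):
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]                  # clamp at the edges
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out
```

"Deltadelta" features are then simply `deltas(deltas(frames))`.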
Typical Machine Learning for Audio Analysis
•Gaussian Mixture Models (today)
• HMMs
• Bayesian Information Criterion
• Supervector Approaches
Reminder: K-Means
Choose k initial means µi at random
loop
  for all samples xj: assign each xj to its closest mean µi
  for all means µi: recompute µi as the average of all samples xj assigned to it
until the means µi are no longer updated significantly
Algorithm Outline (Expectation Maximization)
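The outline above can be sketched as runnable Python for 1-D samples (the `seed` and iteration cap are illustrative conveniences, not part of the algorithm):

```python
# Runnable 1-D k-means sketch following the outline above.
import random

def kmeans(samples, k, iters=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(samples, k)                     # k initial means at random
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in samples:                              # assign each sample to its closest mean
            i = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[i].append(x)
        new_means = [sum(c) / len(c) if c else means[i]  # average the members of each cluster
                     for i, c in enumerate(clusters)]
        if new_means == means:                         # stop once the means no longer change
            break
        means = new_means
    return sorted(means)
```

On two well-separated 1-D clusters the means converge to the cluster centers within a few iterations.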
Reminder: Gaussian Mixtures
Reminder: Training of Mixture Models
Goal: Find mixture weights ai (and parameters µi, σi) for

p(x) = Σi ai · N(x; µi, σi²)

Expectation: compute the responsibilities

rij = ai · N(xj; µi, σi²) / Σk ak · N(xj; µk, σk²)

Maximization: re-estimate the parameters

ai = (1/n) · Σj rij
µi = Σj rij · xj / Σj rij
σi² = Σj rij · (xj − µi)² / Σj rij
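The E- and M-steps can be sketched for a 1-D mixture with scalar variances (fixed iteration count and a small variance floor are illustrative simplifications; real implementations check log-likelihood convergence):

```python
# Runnable EM sketch for a 1-D Gaussian mixture.
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, mus, vars_, weights, iters=50):
    k = len(mus)
    for _ in range(iters):
        # E-step: responsibility r[j][i] of component i for sample j
        r = []
        for x in xs:
            p = [weights[i] * gauss(x, mus[i], vars_[i]) for i in range(k)]
            z = sum(p)
            r.append([pi / z for pi in p])
        # M-step: re-estimate weights, means, variances from responsibilities
        for i in range(k):
            ri = sum(rj[i] for rj in r)
            weights[i] = ri / len(xs)
            mus[i] = sum(rj[i] * x for rj, x in zip(r, xs)) / ri
            vars_[i] = max(sum(rj[i] * (x - mus[i]) ** 2
                               for rj, x in zip(r, xs)) / ri, 1e-6)
    return weights, mus, vars_
```

On data drawn from two well-separated clumps, the means converge to the clump centers and the weights stay a proper distribution (they sum to 1).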
Magic Duo
~ 90% of audio papers use the combination of MFCCs and Gaussian Mixture Models to model audio signals!
More Tools: HTK
Open Source Speech Recognition Toolkit: http://htk.eng.cam.ac.uk/
More Tools: A nice collection of them
Princeton SoundLab:http://soundlab.cs.princeton.edu/software/
More Tools: BeamformIT
Open Source Beamformer: http://beamformit.sourceforge.net/
Next Week
• High-Level Design of Typical Audio Analytics
• Measuring Error
• Multimedia
Organizing the Presentations
1) Shaunack Chatterjee
2) David Sun
3) David Sheffield
4) Armin Samii
5) Jonathan Kolker
6) Venkye Enkambaram
7) Tim Althoff
8) Giulia Fanti
9) Shuo-Yiin Chang
10) Ekatarina Gonina
11) Jaeyoung Choi