“Pixels that Sound”
Find pixels that correspond (correlate!?) to sound
Kidron, Schechner, Elad, CVPR 2005
Audio-Visual Analysis: Applications
• Lip reading – detection of lips (or person)
Slaney, Covell (2000)
Bregler, Konig (1994)
• Analysis and synthesis of music from motion
Murphy, Andersen, Jensen (2003)
• Source separation based on vision
Li, Dimitrova, Li, Sethi (2003)
Smaragdis, Casey (2003)
Nock, Iyengar, Neti (2002)
Fisher, Darrell, Freeman, Viola (2001)
Hershey, Movellan (1999)
• Tracking
Vermaak, Gangnet, Blake, Pérez (2001)
• Biological systems
Gutfreund, Zheng, Knudsen (2002)
Problem: Different Modalities
[Diagram: camera and microphone feed audio-visual analysis]
Visual data
• 25 frames/sec
• Each frame: 576 x 720 pixels
Audio data
• 44.1 kHz, few bands
• Not stereophonic
Kidron, Schechner, Elad, Pixels that Sound
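Note: the mismatch between the two modalities can be made concrete by turning the figures on this slide into per-frame sizes. The sketch below is only arithmetic on the slide's numbers; the variable names are illustrative, and treating the audio as one raw sample stream is an assumption (the slide only says "few bands").

```python
# Sizes taken from the slide; names are illustrative only.
FRAME_RATE = 25            # video frames per second
FRAME_SHAPE = (576, 720)   # pixel grid of one frame
AUDIO_RATE = 44_100        # audio samples per second (44.1 kHz)

# Number of visual measurements in a single frame.
pixels_per_frame = FRAME_SHAPE[0] * FRAME_SHAPE[1]

# Number of raw audio samples that fall inside one video frame.
audio_samples_per_frame = AUDIO_RATE // FRAME_RATE

print(pixels_per_frame)         # 414720
print(audio_samples_per_frame)  # 1764
```

Each frame therefore carries several hundred thousand visual values against a handful of audio band values, which is the imbalance the later slides refer to.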
Previous Work
• Pointwise correlation
Nock, Iyengar, Neti (2002)
Hershey, Movellan (1999)
Ill-posed (lack of data)
• Canonical Correlation Analysis (CCA)
Smaragdis, Casey (2003)
Li, Dimitrova, Li, Sethi (2003)
Slaney, Covell (2000)
Cluster of pixels - linear superposition
• Mutual Information (MI)
Fisher et al. (2001)
Cutler, Davis (2000)
Bregler, Konig (1994)
Not typical
Highly complex
[Figure: CCA applies one projection to the video features (Pixel #1, Pixel #2, Pixel #3) and one to the audio features (Band #1, Band #2), yielding the optimal visual components]
Visual Projection
Video features → projection → 1D variable v
Video features:
• Pixel intensities
• Transform coefficients (wavelet)
• Image differences
Audio Projection
Audio features → projection → 1D variable a
Audio features:
• Average energy per frame
• Transform coefficients per frame
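Note: "average energy per frame" can be read as the mean squared amplitude of the audio samples that fall inside each video frame. The sketch below follows that reading. The function name and the default rates (44.1 kHz audio, 25 frames/sec video) are assumptions for illustration, not the authors' code.

```python
import numpy as np

def average_energy_per_frame(samples, audio_rate=44_100, frame_rate=25):
    """Mean squared amplitude of the audio within each video frame."""
    samples = np.asarray(samples, dtype=float)
    hop = audio_rate // frame_rate          # audio samples per video frame
    n_frames = len(samples) // hop          # drop an incomplete last frame
    blocks = samples[: n_frames * hop].reshape(n_frames, hop)
    return np.mean(blocks ** 2, axis=1)

# Two frames of synthetic audio: silence, then a constant amplitude of 0.5.
signal = np.concatenate([np.zeros(1764), 0.5 * np.ones(1764)])
energy = average_energy_per_frame(signal)
print(energy)
```

The result has one value per video frame, so it can be aligned row by row with the visual features.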
Canonical Correlation
Video | Audio
• Representation
• Projections (per time window)
• Random variables (time dependent)
Correlation coefficient
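Note: the chain from representation to projection to correlation coefficient can be sketched numerically. The symbols V, A, w_v and w_a (feature matrices with one row per frame, and their projection weights) are assumed names, since the slide's own formulas did not survive extraction, and the data are synthetic.

```python
import numpy as np

def projection_correlation(V, A, w_v, w_a):
    """Correlation coefficient between the 1D projections v = V w_v and a = A w_a."""
    v = V @ w_v
    a = A @ w_a
    v = v - v.mean()    # remove the temporal mean of each projection
    a = a - a.mean()
    return float(v @ a / np.sqrt((v @ v) * (a @ a)))

rng = np.random.default_rng(0)
n_frames = 200
source = rng.standard_normal(n_frames)

V = rng.standard_normal((n_frames, 3))
V[:, 1] = source                                   # pixel #2 follows the sound
A = np.column_stack([source, rng.standard_normal(n_frames)])

# Weights that select the correlated pixel versus an unrelated pixel.
rho_good = projection_correlation(V, A, np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0]))
rho_bad = projection_correlation(V, A, np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0]))
print(rho_good, rho_bad)
```

Weights that pick the pixel driven by the sound give a coefficient of 1 up to rounding, while weights on an unrelated pixel give a value near 0.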
CCA Formulation
Yields an eigenvalue problem: Knutsson, Borga, Landelius (1995)
• Canonical correlation: equivalent to the largest eigenvalue
• Projections: equivalent to the corresponding eigenvectors
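Note: a minimal sketch of the textbook CCA eigenproblem, assuming the standard form C_vv^{-1} C_va C_aa^{-1} C_av w_v = rho^2 w_v. The slide's own equation was lost, so this form, the function name and the symbol names are assumptions. The example uses many frames and few features so that the covariances can be inverted.

```python
import numpy as np

def cca_first_pair(V, A):
    """Largest canonical correlation and its projection vectors (textbook form)."""
    V = V - V.mean(axis=0)
    A = A - A.mean(axis=0)
    n = V.shape[0]
    C_vv = V.T @ V / n
    C_aa = A.T @ A / n
    C_va = V.T @ A / n
    # M = C_vv^{-1} C_va C_aa^{-1} C_av; its eigenvalues are squared correlations.
    M = np.linalg.solve(C_vv, C_va) @ np.linalg.solve(C_aa, C_va.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(eigvals.real))
    w_v = eigvecs[:, k].real
    w_a = np.linalg.solve(C_aa, C_va.T @ w_v)   # matching audio weights, up to scale
    rho = float(np.sqrt(max(eigvals[k].real, 0.0)))
    return rho, w_v, w_a

rng = np.random.default_rng(0)
n_frames = 500
source = rng.standard_normal(n_frames)
V = rng.standard_normal((n_frames, 4))
V[:, 2] += 3.0 * source                          # feature 3 carries the sound
A = np.column_stack([source + 0.1 * rng.standard_normal(n_frames),
                     rng.standard_normal(n_frames)])

rho, w_v, w_a = cca_first_pair(V, A)
print(rho, np.argmax(np.abs(w_v)))
```

The largest eigenvalue gives the squared canonical correlation, and the dominant entry of w_v points at the feature that carries the sound.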
Visual Data
[Figure: data matrix with axes t (frames) and spatial location (pixel intensities)]
Rank Deficiency
[Figure: matrix equation over the axes t (frames) and spatial location (pixel intensities)]
Estimation of Covariance
Rank deficient
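Note: the rank deficiency can be checked numerically. With fewer frames than pixels, the estimated visual covariance cannot have full rank. The sizes below (30 frames, 200 pixels) are illustrative assumptions, far smaller than a real frame.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 200

V = rng.standard_normal((n_frames, n_pixels))   # one row per frame
V = V - V.mean(axis=0)                          # remove the temporal mean

C_vv = V.T @ V / n_frames                       # empirical pixel covariance
rank = int(np.linalg.matrix_rank(C_vv))

print(C_vv.shape, rank)
```

The covariance is 200 x 200, but its rank is bounded by the number of frames minus one (one degree of freedom is lost to the mean removal), so it has no inverse.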
Ill-Posedness
Prior solutions:
• Use many more frames → poor temporal resolution
• Aggressive spatial pruning → poor spatial resolution
• Trivial regularization
Impossible to invert!
A General Problem
• Large number of weights
• Small amount of data
The problem is ILL-POSED
Overfitting is likely
An Equivalent Problem
Minimizing
Maximizing
Single Audio Band
(The denominator is non-zero)
Minimizing
Known data
A has a single column, and
[Figure: linear system relating V and a over time, with audio values a(t_i): a(1), a(2), ..., a(30)]
Full correlation if V w = a (w: the visual weights)
Underdetermined system!
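Note: a small numerical sketch of why the system is underdetermined. With fewer frames (equations) than pixels (unknowns), many different weight vectors reproduce the audio exactly. The symbol w for the visual weights, the sizes and the random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 200

V = rng.standard_normal((n_frames, n_pixels))   # 30 equations, 200 unknowns
a = rng.standard_normal(n_frames)               # single audio band over time

# One exact solution of V @ w = a.
w_one, *_ = np.linalg.lstsq(V, a, rcond=None)

# A direction in the null space of V: adding it leaves V @ w unchanged.
_, _, Vt = np.linalg.svd(V)                     # Vt is 200 x 200
null_vec = Vt[-1]
w_other = w_one + 5.0 * null_vec

print(np.allclose(V @ w_one, a), np.allclose(V @ w_other, a))
```

Both weight vectors fit the audio perfectly, so full correlation alone does not single out the pixels that actually produce the sound.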
Detected correlated pixels
“Out of clutter, find simplicity.
From discord, find harmony.”
Albert Einstein
Sparse Solution
ℓ0-norm minimum
• Non-convex
• Exponential complexity
The ℓ1-norm criterion
ℓ1-norm minimum
• Sparse
• Convex
• Polynomial complexity
in common situations
Donoho, Elad (2005)
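Note: minimizing the ℓ1 norm under a linear equality constraint can be written as a linear program. The sketch below uses the standard split w = p - q with p, q ≥ 0 and SciPy's `linprog`. The function name, sizes and synthetic data are assumptions, and this is one common formulation rather than necessarily the one used by the authors.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Solve min ||w||_1 subject to V @ w = a as a linear program."""
    n = V.shape[1]
    # Split w = p - q with p, q >= 0; at the optimum sum(p) + sum(q) = ||w||_1.
    cost = np.ones(2 * n)
    A_eq = np.hstack([V, -V])
    result = linprog(cost, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n] - result.x[n:]

rng = np.random.default_rng(3)
n_frames, n_pixels = 40, 200
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[17, 99]] = [1.0, -2.0]          # only two pixels are tied to the sound
a = V @ w_true

w_l1 = min_l1_solution(V, a)
n_active = int(np.sum(np.abs(w_l1) > 1e-6))
print(n_active)
```

The returned weights satisfy the constraint exactly and typically have few non-zero entries, which is the localization behaviour the slide attributes to the ℓ1 criterion.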
The Minimum Norm Solution
ℓ2-norm minimum
• Energy spread
• Solving using the ℓ2-norm (pseudo-inverse, SVD, QR)
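Note: the energy spread of the minimum ℓ2-norm solution can be seen with the pseudo-inverse. In the sketch below, the audio is generated from three pixels only, yet the minimum-norm weights are spread over nearly all pixels. Sizes, data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_pixels = 30, 200

V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[17, 42, 99]] = [1.0, -2.0, 1.5]     # three pixels generate the audio
a = V @ w_true

# Minimum l2-norm solution of V @ w = a via the pseudo-inverse.
w_l2 = np.linalg.pinv(V) @ a

# Count the pixels that receive a non-negligible weight.
n_significant = int(np.sum(np.abs(w_l2) > 1e-3))
print(n_significant)
```

The solution fits the audio exactly and has a smaller ℓ2 norm than the generating weights, but it no longer points at the three source pixels.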
Linear programming
Fully correlated
Sparse
No parameters to tweak
Polynomial
Audio-visual events
Maximum correlation: Eigenproblem
Minimum objective function G
Multiple Audio Bands - Solution
ℓ1-ball
Non-convex constraint
• Convex
• Linear
The optimization problem:
Multiple Audio Bands
[Figure: ℓ1 ball with faces S1, S2, S3, S4]
Optimization over each face is:
• Each face: linear programming
• No parameters to tweak
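Note: one plausible reading of "each face: linear programming" is sketched below. It assumes the problem is to minimize the ℓ1 norm of the visual weights w_v subject to V w_v = A w_a, with the audio weights w_a on the surface of the ℓ1 ball. That formulation, the function name and the data are assumptions, since the slide's equations were lost. Fixing the sign pattern of w_a selects one face, on which the constraint becomes linear. Two bands give 2^2 = 4 faces, which matches the four labels S1 to S4.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def best_face_solution(V, A):
    """Solve one LP per face of the l1 ball for w_a and keep the best result."""
    n_v, n_a = V.shape[1], A.shape[1]
    best = None
    for signs in itertools.product([1.0, -1.0], repeat=n_a):
        s = np.array(signs)
        # Variables: p, q >= 0 with w_v = p - q, and u >= 0 with w_a = s * u.
        cost = np.concatenate([np.ones(2 * n_v), np.zeros(n_a)])
        fit_rows = np.hstack([V, -V, -A * s])                     # V w_v - A w_a = 0
        face_row = np.concatenate([np.zeros(2 * n_v), np.ones(n_a)])[None, :]  # sum(u) = 1
        A_eq = np.vstack([fit_rows, face_row])
        b_eq = np.concatenate([np.zeros(V.shape[0]), [1.0]])
        res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success and (best is None or res.fun < best[0]):
            w_v = res.x[:n_v] - res.x[n_v:2 * n_v]
            w_a = s * res.x[2 * n_v:]
            best = (float(res.fun), w_v, w_a)
    if best is None:
        raise RuntimeError("no face produced a feasible solution")
    return best

rng = np.random.default_rng(4)
n_frames, n_pixels = 20, 100
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[5, 60]] = [1.0, 1.0]
# Band 1 is produced by two pixels; band 2 is unrelated noise.
A = np.column_stack([V @ w_true, rng.standard_normal(n_frames)])

cost, w_v, w_a = best_face_solution(V, A)
print(cost, w_a)
```

The number of faces grows as 2 to the power of the number of audio bands, so this enumeration only suits the "few bands" setting described earlier in the deck.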
Sharp & Dynamic, Despite Distraction
[Figure: result frames 9, 42, 68, 115, 146, 169 and frames 51, 106, 83, 177]
• Sparse
• Localization on the proper elements
• False alarm – temporally inconsistent
• Handling dynamics
Performing in Audio Noise
ℓ2-norm: Energy Spread
[Figure: Movie #1 and Movie #2, frames 83 and 146]
ℓ1-norm: Localization
[Figure: Movie #1 and Movie #2, frames 83 and 146]
The “Chorus Ambiguity”
Who’s talking?
Synchronized talk
Not unique (ambiguous)
Possible solutions:
• Left
• Right
• Both
[Figure: two panels, each titled with a norm (subscripts lost in extraction), with axes feature 1 and feature 2; one solution is labeled "Both"]
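Note: the ambiguity can be reproduced with two identical visual features that both match the audio. In the sketch below (synthetic data, illustrative names), the "left", "right" and "both" weightings all fit the audio and all have the same ℓ1 norm, so the ℓ1 criterion alone cannot choose among them, while the minimum ℓ2-norm solution splits the weight evenly.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 30)
talk = np.sin(2 * np.pi * 3 * t)         # synchronized signal shared by both speakers

V = np.column_stack([talk, talk])        # feature 1 and feature 2 are identical
a = talk                                 # single audio band

candidates = {
    "left": np.array([1.0, 0.0]),
    "right": np.array([0.0, 1.0]),
    "both": np.array([0.5, 0.5]),
}

# Every candidate reproduces the audio, and all have the same l1 norm.
fits = {name: bool(np.allclose(V @ w, a)) for name, w in candidates.items()}
l1_norms = {name: float(np.abs(w).sum()) for name, w in candidates.items()}

# The minimum l2-norm solution spreads the weight over both features.
w_l2 = np.linalg.pinv(V) @ a

print(fits, l1_norms, w_l2)
```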