Transcript
- Slide 1
- (Duck Ju Kim)
- Slide 2
- Problems: What is the objective of content-based video analysis? Why does supervised identification have limitations? Why should integrated media data be used?
- Slide 3
- Introduction: Analysis (structured organization, embedded semantics); Indexing (tagging semantic units, limited machine perception); Skimming (abstraction & presentation, video browsing)
- Slide 4
- Event Detection Approach: Shot detection gives low-level structure that does not correspond directly to video semantics; scene extraction gives higher-level context but includes much unimportant content; event extraction works at a higher semantic level and better reveals, represents, and abstracts the content
- Slide 5
- Speaker Identification Approach: Standard speech databases (YOHO, HUB4, SWITCHBOARD); integration of media cues (speaker recognition + facial analysis, speech cues + visual cues); supervised identification relies on fixed speaker models, suffers from insufficient training data, and requires data collection before processing
- Slide 6
- Video Skimming Approach: Pre-developed schemes show discontinuous semantic flow and ignore embedded audio cues; the proposed scheme computes six types of features, evaluates their importance, and assembles the important events
- Slide 7
- Content Pre-analysis: Shot detection with a color histogram-based approach; extract keyframes (the first and last frames of each shot); classify audio content into silence, speech, music, and environmental sounds; analyze visual content to detect human faces
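The slide only names a color histogram-based approach for shot detection. A minimal Python sketch of one common realization follows, thresholding frame-to-frame histogram differences; the bin count, distance metric, and threshold are illustrative assumptions, not the presenter's settings.

```python
# Minimal sketch of color histogram-based shot detection (illustrative only).
# Uses OpenCV; bin count and threshold are arbitrary choices, not from the slides.
import cv2
import numpy as np

def detect_shots(video_path, threshold=0.4, bins=16):
    """Return frame indices where the color histogram changes abruptly."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Per-channel histogram, concatenated and normalized to sum to 1.
        hist = np.concatenate([
            cv2.calcHist([frame], [c], None, [bins], [0, 256]).ravel()
            for c in range(3)
        ])
        hist /= hist.sum() + 1e-9
        if prev_hist is not None:
            # L1 distance between consecutive histograms; a large jump marks a cut.
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

Keyframes can then be taken as the first and last frames of each detected shot, as stated on the slide.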
- Slide 8
- Movie Event Extraction: Thematic topics develop through actions or dialogs. What to extract? Two-speaker dialogs, multiple-speaker dialogs, and hybrid events
- Slide 9
- Movie Event Extraction: How to extract? Shot sink computation (grouping close and similar shots); sink clustering and characterization (periodic, partly-periodic, non-periodic); event extraction and classification; post-processing
- Slide 10
- Shot Sink Computation: A shot sink is a pool of close and similar shots, computed from visual information with a window-based sweep algorithm
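The slide names the window-based sweep algorithm without detail. The sketch below is a hedged interpretation: sweep a temporal window over the shot list and pool shots whose keyframe histograms are similar; the window size and similarity threshold are assumptions.

```python
# Hedged sketch of pooling close and similar shots into "sinks".
# The window size and similarity threshold are illustrative assumptions.
import numpy as np

def build_shot_sinks(shot_features, window=10, sim_threshold=0.8):
    """shot_features: one keyframe feature vector (e.g. normalized color histogram) per shot.
    Returns a list of sinks, each a list of shot indices."""
    sinks = []
    assigned = [False] * len(shot_features)
    for i, feat in enumerate(shot_features):
        if assigned[i]:
            continue
        sink = [i]
        assigned[i] = True
        # Sweep only over shots inside the temporal window following shot i.
        for j in range(i + 1, min(i + 1 + window, len(shot_features))):
            if assigned[j]:
                continue
            # Histogram intersection as a simple visual similarity measure.
            sim = np.minimum(feat, shot_features[j]).sum()
            if sim > sim_threshold:
                sink.append(j)
                assigned[j] = True
        sinks.append(sink)
    return sinks
```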
- Slide 11
- Shot Sink Clustering: Sinks are clustered and characterized as periodic, partly-periodic, or non-periodic according to the degree of shot repetition. To determine sink periodicity, calculate the relative temporal distances, compute their mean and standard deviation, and group the sinks with the K-means algorithm
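Following the slide, each sink can be summarized by the mean and standard deviation of the relative temporal distances between its shots and then grouped with K-means. A minimal sketch using scikit-learn follows; k = 3 (periodic / partly-periodic / non-periodic) and the length normalization are assumptions.

```python
# Sketch of sink characterization: mean/std of relative temporal distances + K-means.
# Uses scikit-learn; k=3 mirrors the periodic / partly-periodic / non-periodic split.
import numpy as np
from sklearn.cluster import KMeans

def characterize_sinks(sinks, num_shots, k=3):
    """sinks: list of lists of shot indices. Returns one cluster label per sink."""
    features = []
    for sink in sinks:
        idx = np.sort(np.asarray(sink))
        # Relative temporal distances between consecutive shots, normalized by movie length.
        gaps = np.diff(idx) / max(num_shots, 1) if len(idx) > 1 else np.array([0.0])
        features.append([gaps.mean(), gaps.std()])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(np.asarray(features))
    return labels
```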
- Slide 12
- Slide 13
- Integrating Speech & Face Information: False alarms arise when a montage presentation is mistaken for a spoken dialog, or a multiple-speaker dialog for a two-speaker dialog. Solutions for reducing them: integrate embedded audio information (speech shot ratio calculation) and include facial cues (face detection)
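The slide mentions a speech shot ratio without defining it. One plausible reading, sketched below, is the fraction of a candidate event's shots whose audio is classified as speech; the exact definition is not shown on the slide.

```python
# Hedged sketch: fraction of shots in an event whose audio class is "speech".
def speech_shot_ratio(event_shots, audio_class):
    """event_shots: shot indices of the candidate event.
    audio_class: dict mapping shot index -> 'silence' | 'speech' | 'music' | 'environment'."""
    if not event_shots:
        return 0.0
    speech = sum(1 for s in event_shots if audio_class.get(s) == 'speech')
    return speech / len(event_shots)
```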
- Slide 14
- Adaptive Speaker Identification: Shot detection & audio classification; face detection & mouth tracking; speech segmentation/clustering; initial speaker modeling; audiovisual-based speaker identification; unsupervised speaker model adaptation
- Slide 15
- Slide 16
- Face Detection & Mouth Tracking: Detection and recognition of talking faces. Notation: eye positions (x1, y1) and (x2, y2); distance between the eyes and the mouth: dist; mouth center: (x, y)
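The slide lists the eye positions, the eye-to-mouth distance dist, and the mouth center, but the geometric relation itself did not survive extraction. The sketch below is a hedged reconstruction assuming the mouth center lies at distance dist from the midpoint of the eyes, along the perpendicular to the eye line; this assumption is not confirmed by the slide.

```python
# Hedged reconstruction of the mouth-center estimate; the exact relation on the
# slide is not shown, so this perpendicular-offset formula is an assumption.
import math

def estimate_mouth_center(x1, y1, x2, y2, dist):
    """Estimate the mouth center (x, y) from the two eye positions and the
    eye-to-mouth distance 'dist'."""
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # midpoint of the eyes
    ex, ey = x2 - x1, y2 - y1                  # eye-line direction
    norm = math.hypot(ex, ey) or 1.0
    # Unit vector perpendicular to the eye line, pointing toward the mouth
    # (image y grows downward).
    px, py = -ey / norm, ex / norm
    if py < 0:
        px, py = -px, -py
    return mx + dist * px, my + dist * py
```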
- Slide 17
- Speech Segmentation
- Slide 18
- Speech Clustering: Given two separate segments X1 and X2, form the joined segment X = {X1, X2}. For a cluster C containing n homogeneous speech segments, compute Dist(X, C); a negative value means the segments are considered to come from the same speaker
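The distance formula on the slide did not survive extraction, so the sketch below implements only the stated merge rule (a negative Dist means the joined segment and the cluster belong to the same speaker) and leaves the distance function abstract.

```python
# Sketch of the merge rule only; the actual Dist(X, C) definition is not shown on
# the slide, so `dist_fn` is left as a caller-supplied function.
def cluster_speech_segments(segments, dist_fn):
    """segments: list of speech segments (e.g. feature arrays).
    dist_fn(segment, cluster_segments) -> float; negative means same speaker."""
    clusters = []  # each cluster is a list of homogeneous speech segments
    for seg in segments:
        distances = [dist_fn(seg, c) for c in clusters]
        if distances and min(distances) < 0:
            clusters[distances.index(min(distances))].append(seg)
        else:
            clusters.append([seg])
    return clusters
```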
- Slide 19
- Initial Speaker Modeling: Required for the identification process; exploits the inter-relations between facial and speech cues. For each target cast member A: find a speech shot where A is talking, collect all the speech segments, and build the initial model as a Gaussian Mixture Model (GMM)
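A minimal sketch of building the initial speaker model as a GMM over the collected speech segments, using scikit-learn; the feature type (e.g. MFCC frames) and the number of mixture components are assumptions, since the slide does not specify them.

```python
# Sketch: fit a GMM on all speech frames collected for one cast member.
# Feature extraction (e.g. MFCCs) and the component count are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_initial_speaker_model(speech_segments, n_components=16):
    """speech_segments: list of (n_frames, n_features) arrays for one cast member."""
    frames = np.vstack(speech_segments)
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(frames)
    return gmm
```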
- Slide 20
- Likelihood-based Speaker Identification: GMM model notation with mixture components j = 1, 2, ..., m; for the i-th enrolled speaker with model Mi, compute the log-likelihood between the observed segment X and Mi
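The log-likelihood expression itself did not come through. For a GMM with weights w_j, means mu_j, and covariances Sigma_j (j = 1, ..., m), the standard form the slide most likely refers to is given below; picking the model with the largest log-likelihood is the usual identification rule, stated here as an assumption.

```latex
% Standard GMM log-likelihood of segment X = {x_1, ..., x_T} under model M_i,
% and the usual maximum-likelihood identification rule.
\log p(X \mid M_i) = \sum_{t=1}^{T} \log \sum_{j=1}^{m} w_j^{(i)} \,
    \mathcal{N}\!\left(x_t;\, \mu_j^{(i)}, \Sigma_j^{(i)}\right),
\qquad
\hat{i} = \arg\max_i \, \log p(X \mid M_i)
```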
- Slide 21
- Audiovisual Integration for Speaker Identification: Finalizes the speaker identification task by integrating audio and video cues. Examine the existence of temporal overlap; if the overlap ratio exceeds a threshold, assign the face vector to the cluster, otherwise set the face vector to null; the result yields the speaker identity
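A hedged sketch of the overlap test: compute the temporal overlap between a speech segment and a detected talking-face track and compare the ratio against a threshold. Normalizing by the speech-segment duration is an assumption; the slide only states "overlap ratio > threshold".

```python
# Sketch of the audiovisual overlap test. Normalizing by the speech-segment
# duration is an assumption made for illustration.
def overlap_ratio(speech_start, speech_end, face_start, face_end):
    overlap = max(0.0, min(speech_end, face_end) - max(speech_start, face_start))
    duration = max(speech_end - speech_start, 1e-9)
    return overlap / duration

def assign_face_to_cluster(speech_seg, face_track, threshold=0.5):
    """Return the face vector if the temporal overlap is large enough, else None."""
    r = overlap_ratio(speech_seg['start'], speech_seg['end'],
                      face_track['start'], face_track['end'])
    return face_track['vector'] if r > threshold else None
```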
- Slide 22
- Unsupervised Speaker Model Adaptation: Updating the speaker model. Three approaches: average-based, MAP-based, and Viterbi-based model adaptation
- Slide 23
- Average-based Model Adaptation: Compute BIC distances and compare the minimum distance d_min with a threshold T. If d_min < T, average the adaptation data into the closest existing component; if d_min > T, initialize a new mixture component. Then update the weight of each component
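The BIC distance is not written out on the slide; as a reference, the standard Delta-BIC between two Gaussian-modeled data sets (here, the adaptation data and an existing mixture component) is shown below, with the penalty weight lambda as an assumption.

```latex
% Standard Delta-BIC between data sets of sizes n_1 and n_2 with covariances
% Sigma_1, Sigma_2, pooled covariance Sigma, feature dimension d, penalty weight lambda.
\Delta\mathrm{BIC} = \frac{n_1 + n_2}{2}\log|\Sigma|
  - \frac{n_1}{2}\log|\Sigma_1| - \frac{n_2}{2}\log|\Sigma_2|
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log(n_1 + n_2)
```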
- Slide 24
- MAP-based Model Adaptation: μ_i : mean of the i-th mixture density b_i; L_i : occupation likelihood of the adaptation data; μ̄_i : mean of the observed adaptation data
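Using the symbols the slide defines (L_i the occupation likelihood of the adaptation data, μ̄_i the mean of the observed adaptation data, μ_i the prior component mean), the usual MAP mean-update rule takes the form below; the relevance factor tau is an assumption, since the slide's own equation did not survive extraction.

```latex
% Standard MAP adaptation of the i-th component mean; tau is a relevance factor.
\hat{\mu}_i = \frac{L_i\,\bar{\mu}_i + \tau\,\mu_i}{L_i + \tau}
```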
- Slide 25
- Viterbi-based Model Adaptation: Allows different feature vectors to come from different mixture components, but with a hard decision: any vector either occupies a component or it does not, using an indicator function instead of a probability function over the mixture components
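A hedged reading of the indicator-function idea: each adaptation vector is hard-assigned to its single most likely component, and the component mean is re-estimated only from the vectors assigned to it; the slide does not show the exact update, so this is the generic form.

```latex
% Hard (Viterbi-style) assignment: an indicator replaces the soft posterior.
q_t(j) = \begin{cases}
  1 & j = \arg\max_{j'} \, w_{j'}\,\mathcal{N}(x_t;\,\mu_{j'},\Sigma_{j'}) \\
  0 & \text{otherwise}
\end{cases},
\qquad
\hat{\mu}_j = \frac{\sum_t q_t(j)\,x_t}{\sum_t q_t(j)}
```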
- Slide 26
- Event-based Movie Skimming: Event feature extraction (six types of mid- to high-level features); evaluation of importance; movie skim generation (assemble the major events into the final skim)
- Slide 27
- Event Feature Extraction: Music ratio, speech ratio, sound loudness, action level, present cast, and theme topic; values are normalized by dividing by the largest value
- Slide 28
- Event Feature Extraction: M : number of features extracted; N : number of events; a_{i,j} : value of the j-th feature in the i-th event
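Putting the two slides together, the event features form an N×M matrix a_{i,j} normalized by dividing by the largest value of each feature. A minimal sketch of that column-wise max normalization follows; treating the normalization per feature is the straightforward reading, not an explicit statement on the slide.

```python
# Sketch: normalize the N x M event-feature matrix by each feature's largest value.
import numpy as np

def normalize_event_features(a):
    """a: (N, M) array, a[i, j] = value of the j-th feature in the i-th event."""
    col_max = np.abs(a).max(axis=0)
    col_max[col_max == 0] = 1.0  # avoid division by zero for constant features
    return a / col_max
```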
- Slide 29
- Movie Skim Generation: Choose important events based on the user's feature preferences and the event importance vector
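The slide does not show how the user's feature preferences and the event features are combined. A simple preference-weighted-sum interpretation, an assumption rather than the presenter's formula, is sketched below.

```python
# Hedged sketch: event importance as a preference-weighted sum of normalized features.
import numpy as np

def event_importance(a_norm, user_pref):
    """a_norm: (N, M) normalized feature matrix; user_pref: length-M preference weights.
    Returns a length-N importance vector; the highest-scoring events form the skim."""
    user_pref = np.asarray(user_pref, dtype=float)
    return a_norm @ user_pref

def select_events(importance, k):
    """Indices of the k most important events, returned in temporal order."""
    return sorted(np.argsort(importance)[::-1][:k])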
- Slide 30
- Event Detection Results: Correctness of the event classification; system performance evaluation with the hybrid class excluded
- Slide 31
- Slide 32
- Speaker Identification Results: Evaluation of the adaptive speaker identification system in terms of false acceptance (FA), false rejection (FR), and identification accuracy (IA)
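The slide lists the three metrics without defining them; the commonly used definitions below are an assumption for readability and are not taken from the slides.

```latex
% Common definitions of the three metrics (assumed, not taken from the slides).
\mathrm{FA} = \frac{\#\{\text{segments wrongly accepted as a target speaker}\}}{\#\{\text{test segments}\}},\quad
\mathrm{FR} = \frac{\#\{\text{target-speaker segments rejected}\}}{\#\{\text{target-speaker segments}\}},\quad
\mathrm{IA} = \frac{\#\{\text{correctly identified segments}\}}{\#\{\text{test segments}\}}
```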
- Slide 33
- Slide 34
- Average-based, MAP-based, and Viterbi-based adaptation
- Slide 35
- Slide 36
- Movie Skimming Results: Qualitative evaluation is difficult, so a quantitative measure is based on a user study with a 5-point scale (1 to 5) covering visual comprehension, audio comprehension, semantic continuity, good abstraction, quick browsing, and video skipping