Dulce Ponceleon
Searching Video Collections: Representation, Indexing, Browsing and Evaluation
Part II
Universidad de Chile December 2002
CIW-DCC, Universidad de Chile
Audio Features in Multimedia
Features depend on audio category:
  Speech
  Music
  Sounds (e.g., explosions, street noise, etc.)
Features:
  Energy, Loudness
  Pitch
  Cepstral Coefficients
  Beat
  Harmonics
Audio Indexing
Features: Pitch, Loudness, Energy, Mel Cepstral Coefficients, Zero-Crossing Rate
Speech-Music Discrimination
Speaker Identification
Music Retrieval
  Query-by-Humming
  Beat Analysis
Foreground/background
  Need to find tiger without regard to background
  Audio sounds are often isolated
Towards the "Google system" for audio retrieval...
Examples
Pitch
Vowels
Speech Sounds
Speech sounds are created by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener's ears or to a microphone through the air, where speech and other sounds take the form of waves.
It is not possible to read the phonemes in a waveform, but if we analyze the waveform into its frequency components, we obtain a spectrogram which can be deciphered.
We apply a mathematical technique called Fourier analysis to the speech waveform in order to discover what frequencies are present at any given moment in the speech signal. The result of Fourier analysis is a spectrum.
Vowels can usually be distinguished by the frequency values of the first two or three formants, which are called F1, F2, and F3.
The audible frequency range in human beings extends from 20 Hz to 20,000 Hz (20 kHz).
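To make the Fourier analysis step concrete, here is a minimal sketch that computes the magnitude spectrum of one short frame with a direct discrete Fourier transform. The 8 kHz sample rate, the 64-sample frame, and the 1000 Hz test tone are assumptions chosen for illustration and do not come from the slides; a practical system would likely use a windowed FFT over many overlapping frames to build a spectrogram.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame (bins 0 .. N/2) via a direct DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

SAMPLE_RATE = 8000   # Hz, assumed for illustration
FRAME_LEN = 64       # samples per frame, assumed for illustration

# Synthetic frame: a pure 1000 Hz tone standing in for a speech frame.
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(FRAME_LEN)]

mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN  # bin index -> frequency in Hz
```

Stacking such spectra for consecutive frames, one column per frame, gives the time-frequency picture that a spectrogram displays.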
Spectrograms
Applications
Recognition of Speech
Recognition of Silence
Recognition of Music
  Music + Security = active area
Newscast Recognition
Recognition of Commercials
Applications ...
Recognition of Music
  Recognition of the song to link to metadata
Broadcast monitoring:
  Monitor radio programs, scheduled transmission of advertisement spots, ensure composers' royalties for played material, statistical analysis of played material
Music Sales
  Record signatures of music/sound for small handheld devices
Audio Fingerprinting
  A compact representation of the signal features for matching. It captures the essence of the music item and thus can be used as a fingerprint of the music item.
Audio Classification
Audio-waveform
Cepstral Coefficients
Pitch
Energy/Loudness
Speech Recognition
Music/Sound
Segmentation
Musclefish.com
[Figure: waveform of the utterance "A Huge Tapestry Hung in Her Hallway" (top panel) and its zero crossing rate (bottom panel), both plotted over the same horizontal axis from 0 to 3.5.]
Zero Crossing Rate
Note: the zero crossing rate goes way up in the quiet regions (where there is only noise) and way down when the energy is high (which happens during voiced sounds, when the signal is repetitive).
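The behaviour described in the note can be reproduced with a minimal sketch. The definition used here (fraction of adjacent sample pairs whose signs differ) is one common convention, and the two synthetic frames (a loud 100 Hz sine standing in for a voiced sound, low-level random noise standing in for a quiet region) are assumptions for illustration, not data from the slides.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)

SAMPLE_RATE = 8000  # Hz, assumed for illustration

# "Voiced" frame: loud, repetitive, low-frequency signal.
voiced = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(800)]

# "Quiet" frame: nothing but low-amplitude noise.
rng = random.Random(0)
quiet = [rng.uniform(-0.01, 0.01) for _ in range(800)]

zcr_voiced = zero_crossing_rate(voiced)
zcr_quiet = zero_crossing_rate(quiet)
```

The noise frame changes sign on roughly half of its samples, while the sine changes sign only about twice per period, which is why the rate is a cheap cue for separating voiced speech from silence or noise.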
[Figure: spectrogram of "A Huge Tapestry Hung in Her Hallway", with panels for the filterbank output, the DCT coefficients, and the DCT reconstruction. Axes: time (frame number) versus frequency.]
Music Retrieval
[Diagram labels: Music Database (MIDI, Waveform); Query (Waveform, Humming); Query by Humming; Query Name by Example; Hold microphone to the radio]
Audio Indexing Example: Acoustic Identification
Predicted words (stemmed): anim (.108), hors (.105), left (.086), trot (.065), approach (.059), track (.047), walk (.040), depart (.037)
True label: animals: horses, two horses trot past on rough track left to right
Predicted words (stemmed): bird (.11), ambienc (.107), jungl (.104), morn (.094), Africa (.093), anim (.054), bark (.029), dog (.020), cricket (.018)
True label: jungle, Africa, Africa: morning ambience, birds
[Slaney02]
Semantic Audio Retrieval
Acoustic Space
Semantic Space
Mixture of Probability Experts
[Diagram: mappings between semantic space and acoustic space, with example labels "Step", "Whinny", "Horse" and "Trot"; the two directions are labeled Acoustic Retrieval and Semantic Retrieval.]
Speech: A Brief History
Recognition started late 1960s to early 1970s
Multimedia indexing late 1980s
Research has been ongoing
  Carnegie Mellon, Columbia University, Georgia Institute of Technology, and University of Texas
Products have been available for only about five years
  BBN, IBM, CMU, Fast-Talk, and ScanSoft
Acceptable performance and accuracy level for commercial use in the last 18 months
Products are generally integrated into larger systems.
Speech Indexing
Large Vocabulary Automatic Speech Recognition (ASR)
Combined with Text Information Retrieval
Combined with Phonetic Retrieval for OOV words
Combined with Query Expansion/Document Expansion based on N-best or external corpus
Automatic Speech Recognition (ASR)
Closed vocabulary (100,000 words)
Misses out-of-vocabulary (OOV) words, like names of people, places, products, companies, acronyms, etc.
Types of errors: word split, word join, word substitution
Typical error rate: 30%-50% word error rate (WER)
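For readers unfamiliar with WER, the sketch below computes it in the usual way: the minimum number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The function name is hypothetical, and the sentence pair is the "arm the door" example from the Phonetic Speech Retrieval slide of this deck; evaluation tools typically also normalize case and punctuation, which is skipped here.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )
    return dist[len(ref)][len(hyp)] / len(ref)

reference = "you can now arm the door and emergency"
hypothesis = "you can now are on the door and emergency"
wer = word_error_rate(reference, hypothesis)
```

Here "arm" becomes "are on", which counts as one substitution plus one insertion against eight reference words, so the score is 2/8.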
SpeechBot Retrieval Interface
Copyrights Cambridge Research Labs
Phonetic Speech Retrieval
Example: training video, pilot work with Boeing
The original speech: "...you can now arm the door and emergency ..."
Speech recognition result: "...you can now are on the door and emergency ..."
The query "arm the door" is missed by text search, but can be found by phonetic search.
Approaches to Phonetic Speech Retrieval
Phone recognition with phonetic string index [Schauble95]
Combined word and phonetic IR using Phone Lattice Scanning [James95, Jones96]
Inverted Index of Phonetic Substrings [Witbrock97]
Confusion Matrix based Phonetic Indexing [van Leeuwen, Srinivasan00]
Word and Phone Lattices
Lattice: directed acyclic graph capturing the multiple hypotheses of an ASR system
Can be generated for words or sub-words (phones)
May be shallow or deep
Hypothetical decoding of the phrase "Please be quite sure":
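The lattice figure from the slide did not survive extraction, so the following is only an illustrative stand-in: a tiny word lattice stored as a directed acyclic graph, with every arc carrying a word and a probability. The alternative words and all probabilities are invented for this sketch; they are not the lattice shown in the original slide.

```python
# Hypothetical word lattice: node -> list of (word, next_node, probability).
# All words other than the target phrase, and all probabilities, are made up.
LATTICE = {
    0: [("please", 1, 0.6), ("plea", 1, 0.4)],
    1: [("be", 2, 0.7), ("bee", 2, 0.3)],
    2: [("quite", 3, 0.55), ("quiet", 3, 0.45)],
    3: [("sure", 4, 0.8), ("shore", 4, 0.2)],
    4: [],  # final node
}

def all_paths(lattice, node, final):
    """Yield (word sequence, path probability) for every path node -> final."""
    if node == final:
        yield [], 1.0
        return
    for word, nxt, prob in lattice[node]:
        for words, p in all_paths(lattice, nxt, final):
            yield [word] + words, prob * p

paths = list(all_paths(LATTICE, 0, 4))
best_words, best_prob = max(paths, key=lambda wp: wp[1])

# Indexing every word on every arc, not just the best path, is what lets
# a query match a word that the 1-best transcript got wrong.
index_terms = sorted({w for arcs in LATTICE.values() for w, _, _ in arcs})
```

A deeper lattice simply has more competing arcs per position; a shallow one approaches the single best transcript.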
Using n-best Data for Indexing
Expand document/query representation with n-best words
Index sounds-like phones using (n-best) common meta-phone.
Metaphone: Group of phones
  Atl: AA, AE, AH, AO, AW, AX, AXR, AY, EH, ER
  Ctl: CH, JH, SH
  Etl: EI, IH, IX, IY
  Gtl: B, BD, DD, GD
  Ttl: TH, F
  Ntl: N, NG
  Gtl: G, D
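As a rough illustration of how meta-phones make confusable phones share an index key, the sketch below maps a phone sequence to meta-phone keys. Only four of the groups from the table are used, the function name is hypothetical, and the rule that an ungrouped phone is kept as its own key is an assumption, not something stated on the slide.

```python
# Subset of the meta-phone table; the remaining groups would be added the same way.
GROUPS = {
    "Atl": ["AA", "AE", "AH", "AO", "AW", "AX", "AXR", "AY", "EH", "ER"],
    "Ctl": ["CH", "JH", "SH"],
    "Etl": ["EI", "IH", "IX", "IY"],
    "Ntl": ["N", "NG"],
}

# Invert the table: phone -> meta-phone key.
PHONE_TO_KEY = {phone: key for key, phones in GROUPS.items() for phone in phones}

def metaphone_keys(phones):
    """Replace each phone by its meta-phone; ungrouped phones map to themselves (assumption)."""
    return [PHONE_TO_KEY.get(p, p) for p in phones]

keys_a = metaphone_keys(["SH", "IY", "N"])
keys_b = metaphone_keys(["CH", "IH", "NG"])
```

Both sequences collapse to the same keys, so an index built on meta-phones would retrieve either one for a query containing the other.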
Phonetic Retrieval: A System Overview
Indexing: Audio input -> Speech to Phones -> Timed phonetic transcript -> Generate Keys (use meta-phones) -> Keys list -> Populate Index -> Phonetic Index
Retrieval: Text query -> Text to Phones -> Phonetic query -> Generate Keys (use meta-phones) -> Keys list -> Retrieve & Merge -> List of candidates -> Bayesian edit distance -> Ranked results
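Similarity Matching using Phonetic Confusion Matrix
C(o_i, q_j) is the probability of recognizing phone o_i when the actual phone in the audio is q_j.
The Bayesian Edit Distance D(o, q) is the log-likelihood of the best editing sequence which converts the query string q to the actual string o.
\[
C(o_i, q_j) =
\begin{pmatrix}
P(\mathrm{AA}\mid\mathrm{AA}) & P(\mathrm{AA}\mid\mathrm{AE}) & \cdots & P(\mathrm{AA}\mid\mathrm{Z}) & P(\mathrm{AA}\mid\mathrm{ZH}) \\
P(\mathrm{AE}\mid\mathrm{AA}) & P(\mathrm{AE}\mid\mathrm{AE}) & \cdots & P(\mathrm{AE}\mid\mathrm{Z}) & P(\mathrm{AE}\mid\mathrm{ZH}) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
P(\mathrm{Z}\mid\mathrm{AA}) & P(\mathrm{Z}\mid\mathrm{AE}) & \cdots & P(\mathrm{Z}\mid\mathrm{Z}) & P(\mathrm{Z}\mid\mathrm{ZH}) \\
P(\mathrm{ZH}\mid\mathrm{AA}) & P(\mathrm{ZH}\mid\mathrm{AE}) & \cdots & P(\mathrm{ZH}\mid\mathrm{Z}) & P(\mathrm{ZH}\mid\mathrm{ZH})
\end{pmatrix}
\]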
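One way to read the definition of D(o, q) is as a dynamic program over edit sequences, where a substitution of query phone q_j by observed phone o_i contributes log C(o_i, q_j). The sketch below follows that reading. The insertion and deletion probabilities, the floor for phone pairs missing from the table, and every number in the toy confusion table are assumptions made for illustration; the slides do not specify them, and the actual system may score edits differently.

```python
import math

def bayesian_edit_distance(observed, query, confusion,
                           p_ins=0.01, p_del=0.01, floor=1e-6):
    """Log-likelihood of the best edit sequence turning `query` into `observed`.

    confusion[(o, q)] = probability of recognizing phone o when the actual
    phone is q. p_ins, p_del and floor are assumed values, not from the slides.
    """
    n, m = len(query), len(observed)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:   # query phone recognized as observed phone
                p = confusion.get((observed[j - 1], query[i - 1]), floor)
                candidates.append(best[i - 1][j - 1] + math.log(p))
            if i > 0:             # query phone missing from the recognizer output
                candidates.append(best[i - 1][j] + math.log(p_del))
            if j > 0:             # recognizer output contains an extra phone
                candidates.append(best[i][j - 1] + math.log(p_ins))
            best[i][j] = max(candidates)
    return best[n][m]

# Toy confusion table with made-up probabilities.
CONFUSION = {
    ("Z", "Z"): 0.7, ("ZH", "Z"): 0.2,
    ("AA", "AA"): 0.8, ("AE", "AA"): 0.1,
}

query = ["Z", "AA"]
score_exact = bayesian_edit_distance(["Z", "AA"], query, CONFUSION)
score_confusable = bayesian_edit_distance(["ZH", "AE"], query, CONFUSION)
score_unrelated = bayesian_edit_distance(["AA", "Z"], query, CONFUSION)
```

Candidates returned by the key lookup can then be ranked by this score, so that transcripts differing from the query only by likely recognizer confusions rank above unrelated ones.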
An ASR Example: ScanSoft
Copyright Virage Video Logger
Virage Video Logging Interface
Speech Indexing Performance
Note: speech corpora, queries and metrics vary significantly; therefore it is very difficult to compare precision-recall (p-r) numbers directly
Two classes of evaluations:
Document/story retrieval, where p-r numbers are reported as a % of full text retrieval
  Cambridge reports 82-85% relative precision compared to perfect text retrieval with WER 47%
  CMU reports ~80.2% relative precision compared to perfect text retrieval with WER 50.7%
Word spotting evaluation against manual full text transcriptions
  VMR from Cambridge reports average precision of 0.315
  TNO reports 100/hr for a 3 phone word and 8/hr for a 6 phone word [van Leeuwen]
In Vocabulary Words by Length
[Figure: precision-recall plot, both axes from 0 to 1, with separate curves for short words and long words.]
Overall 74% precision at 50% recall.
DATA SET
  100 hours of broadcast news (1.04M words)
QUERIES
  24,000 single-word queries
EVALUATION
  Compare retrieved matches with ground-truth manual, time-aligned transcript (objective)
  Use exact word match (conservative)
Mixing Document Collections
Poorly translated documents less likely to be retrieved
Problem faced by Cross Language Information Retrieval community
Problem: how to address ranking bias at a per-document level
Some ideas: detect collection source, estimate noise/error rate and compensate in ranking scheme
Performance Claims
Fast-Talk says it can
  Index a one-hour audio file in 5 min
  Process 30 hours of content per second in response to a specific, 10-phoneme search query (2.53-GHz Pentium CPU)
Recognition faster than 1.5 x real-time
On 100 hours: how long a query took to run (CIKM talk)
Multiple Languages Support
Demand is slowly growing
IBM's ViaVoice is available in 9 languages
BBN's Audio Indexer: generates searchable transcripts in Arabic, Chinese, English or Spanish in real time on a standard PC
Porting a product is expensive and time-consuming
  Collect and transcribe acoustic data
  Train and evaluate new acoustic models
Conquered Challenges
Real time performance
Speaker Independent Technology
  Gender, age, dialect, style, etc.
Acoustic Models tuned to different environments: telephony, TV or radio
Language Models
Speaker identification
Speaker segmentation
Reduced background noise
Tools to customize your language model
Training
Fast algorithms for indexing and searching
Open Challenges
Users have high expectations
Precision and price might impact widespread adoption in some cases
  court reporting, medical dictation
  expensive systems, as high as $100,000
Conversational speech recognition (e.g., meetings)
How to go beyond incremental improvements
What about the killer app?
Multiple Languages
Searching Video Collections: Representation, Indexing, Browsing and Evaluation
Introduction to Multimedia Information Retrieval
Effective MMIR
  Multimedia Representation
  Multimedia Indexing
  Query Formulation
  Multimedia Retrieval
  Browsing
  Distribution/Streaming
  Evaluation
Multimedia IR Applications
Conclusions
What is a Multimedia Query?
Keywords
Natural language query
With Multimedia:
  sample image
  basic art tools (sketch, shape, etc.)
  Query by Humming
MMIR Query Processing
Query formulation beyond words
  Sample image, color images, texture, shapes, etc. (i.e., hard to input)
  Find a picture of a satellite
Ambiguous and incomplete queries
Inexact media representation
Relevance detected easily
Text Retrieval/Search Strategies
Query Compared with Document Representation
Boolean Search
Matching Functions (SMART system/cosine correlation)
Serial Search
Cluster based retrieval
Retrieval Models
Key elements
  Document and query representation, similarity measure, retrieval function
Retrieval Models
  Boolean: Text, Speech
  Statistical vector space: Text, ASR Text, Images
  Probabilistic: Images, Audio, Video
Distinction between model and implementation
Multimedia Retrieval Implementation using Models
Concept Layer: Relationships/Events
Object Layer: Blocks of attributes
Feature Layer: Colors, shapes, textures, text
Media/Data Layer: Images, video, audio, ...
Similarity Measures
Distance Measures
  Mean Character Difference
  Minkowski Metric
    Manhattan, Euclidean, Chebyshev
Correlation Coefficients
  Cosine Measure
Association Coefficients
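The distance measures listed above can be written down in a few lines. This is a minimal sketch with hypothetical function names: the Minkowski metric of order p gives the Manhattan distance for p = 1 and the Euclidean distance for p = 2, the Chebyshev distance is its limit as p grows, and the cosine measure compares direction rather than magnitude.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u, v = [0.0, 0.0], [3.0, 4.0]
manhattan_uv = minkowski(u, v, 1)
euclidean_uv = minkowski(u, v, 2)
chebyshev_uv = chebyshev(u, v)
```

Distances shrink as items become more alike, while the cosine measure grows, so a retrieval function has to sort in the appropriate direction for the measure it uses.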
Combining Indexes: Multimodal Retrieval
Weighted Sum with different Normalization Schemes
Adaptive Weights: Each Modality is Query Dependent; use thresholds as measure of similarity
Domain Knowledge Modeling: represent knowledge as concept tree, frames, semantic nets
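The weighted-sum option can be sketched as follows. Min-max normalization is just one possible normalization scheme, and the modality names, scores and weights below are invented for illustration; the slides do not say which scheme or weights any particular system used.

```python
def min_max_normalize(scores):
    """Rescale one modality's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(modality_scores, weights):
    """Rank documents by the weighted sum of their normalized per-modality scores."""
    fused = {}
    for name, scores in modality_scores.items():
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical raw scores on very different scales.
modality_scores = {
    "asr": {"shot1": 2.0, "shot2": 8.0, "shot3": 5.0},   # text match score
    "cbr": {"shot1": 0.9, "shot2": 0.1, "shot3": 0.5},   # visual similarity
}
weights = {"asr": 0.7, "cbr": 0.3}                       # assumed weights

ranking = weighted_fusion(modality_scores, weights)
```

Making the weights depend on the query, as the adaptive-weights bullet suggests, would amount to choosing a different `weights` dictionary per query.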
Semantic-based Retrieval Example
Keywords: rose flower plant leaves
Copyright Berkeley Blobworld system
Semantic-based Retrieval Example
Query on
“Rose”
Copyright Berkeley Blobworld System
Semantic-based Retrieval Example
Query on
Copyright Berkeley Blobworld system
Semantic-based Retrieval Example
Query on
and
“Rose”
Copyright Berkeley Blobworld system
Semantic-based Retrieval Example
Appearance counts!
Semantics counts!
Copyright Berkeley Blobworld System
Effective MMIR
Overview
Media Representation
Media Indexing
Query Formulation
Media Retrieval
Browsing
Distribution/Streaming
Evaluation
TREC Goals
To increase research in information retrieval based on large-scale collections
To provide an open forum for exchange of research ideas to increase communication among research, academia and government
To improve evaluation methodologies and measures for text retrieval
To create a series of collections covering different aspects of text retrieval [Voorhees00]
TREC Tasks and Evaluations
Traditional precision and recall measures with binary relevance judgments
Three types of automatic tasks:
  Ad-hoc task: new queries against a static collection
  Routing and Filtering tasks: old queries against new data
  Specialized tasks: question answering, known-item task
Interactive task: different functional systems are compared by giving tasks to human searchers, who use a real-time interface to an experimental system to gain the best possible results in under 5 minutes.
Past SDR Test Collections
TREC-6 '97
  Broadcast News Collection: 43 hours, 1996-97; 1,451 stories; ~276 words/story
  Baseline ASR Transcripts: IBM (50% WER)
  Paradigm: Known-Item (% at rank 1)
  Queries: 50
TREC-7 '98
  Broadcast News Collection: 87 hours, 1996-97; 2,866 stories; ~269 words/story
  Baseline ASR Transcripts: NIST/CMU SPHINX (33.8% / 46.6% WER)
  Paradigm: Ad Hoc (MAP)
  Queries: 23
TREC-8 '99
  Broadcast News Collection: 557 hours, Jan-Jun 1998; 21,754 stories; ~169 words/story
  Baseline ASR Transcripts: NIST/BBN Byblos (27.5% / 26.7% WER)
  Paradigm: Ad Hoc (MAP)
  Queries: 49
MAP = Mean Average Precision
IR Metrics
Traditional TREC ad-hoc metric:
  Mean Average Precision (MAP) using TREC_EVAL
Created assessment pools for each topic using top 100 of all retrieval runs
  Mean pool size: 596 (2.1% of all segments)
  Min pool size: 209
  Max pool size: 1309
NIST assessors created reference relevance assessments from topic pools
Somewhat artificial for boundary unknown conditions
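As a reminder of what MAP measures, here is a minimal sketch with binary relevance judgments: average precision for one topic is the mean of the precision values at each rank where a relevant document appears, divided over all relevant documents, and MAP is the mean of that value over topics. The function names, topics and document IDs are made up; this is not the TREC_EVAL program itself, which reports many more statistics.

```python
def average_precision(ranked, relevant):
    """Average of precision at each rank where a relevant document is retrieved."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant docs), one entry per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical topics.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```

Because each relevant document is credited with the precision at its rank, pushing relevant items toward the top of the list raises the score even when the retrieved set stays the same.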
Known Story Boundary Condition
Retrieval using pre-segmented news stories
  systems given index of story boundaries for recognition with IDs for retrieval
    excluded non-news segments
  stories are treated as documents
  systems produce rank-ordered list of Story IDs
Document-based scoring:
  score as in other TREC Ad Hoc tests using TREC_EVAL
Unknown Story Boundary Condition
Retrieval using continuous speech stream
  systems process entire broadcasts for ASR and retrieval with no provided segmentation
  systems output a single time marker for each relevant excerpt to indicate topical passages
    this task does NOT attempt to determine topic boundaries
Time-based scoring:
  map to a story ID ("dummy" ID for retrieved non-stories and duplicates)
  score as usual using TREC_EVAL
  penalizes for duplicate retrieved stories
  story-based scoring somewhat artificial but expedient
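The mapping step of time-based scoring can be sketched as below: each returned time marker is looked up in the reference story boundaries, and markers that fall outside any story or hit an already-retrieved story receive a dummy ID, which then counts as a non-relevant document in the usual ranked-list scoring. The data layout, the function name and the dummy ID format are assumptions for illustration, not the actual NIST scoring software.

```python
from bisect import bisect_right

def map_times_to_stories(times, stories):
    """Map ranked time markers to story IDs.

    stories: list of (start, end, story_id), sorted by start, non-overlapping.
    Non-story regions and repeated stories get a unique dummy ID.
    """
    starts = [start for start, _, _ in stories]
    seen = set()
    mapped = []
    for rank, t in enumerate(times):
        idx = bisect_right(starts, t) - 1   # last story starting at or before t
        story_id = None
        if idx >= 0:
            _, end, sid = stories[idx]
            if t < end:
                story_id = sid
        if story_id is None or story_id in seen:
            mapped.append(f"dummy-{rank}")  # non-story or duplicate
        else:
            seen.add(story_id)
            mapped.append(story_id)
    return mapped

# Hypothetical broadcast: two stories with a commercial break in between.
stories = [(0, 60, "s1"), (90, 200, "s2")]
ranked_ids = map_times_to_stories([30, 75, 45, 100], stories)
```

The second marker lands in the gap between stories and the third repeats story s1, so both become dummies and push the genuinely new story s2 further down the scored list.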
TREC SDR Conclusions
Ad hoc retrieval in broadcast news domain appears to be a "solved problem"
  systems perform well at finding relevant passages in transcripts produced by a variety of recognizers on full unsegmented news broadcasts
  performance on own recognizer comparable to human reference
Just beginning to investigate use of non-lexical information
Caveat emptor
  ASR may still pose serious problems for Question Answering domain where content errors are fatal
TREC Video Retrieval
Challenge:
  Answering "semantic" queries for video content
    e.g., "retrieve video showing rocket launches"
    e.g., "retrieve clips of people water skiing"
  Fully automated content analysis (w/o transcript or manual annotation of content)
Types of Queries:
  Automatic queries: feature extraction (e.g., color, texture, edges, motion) and content-based retrieval (CBR) using examples
  Interactive queries: CBR + statistical modeling of features
    Off-line: automatic classification of video content using statistical models of generic concepts (e.g., scenes, events, objects)
    Query time: user selection of classifiers (models) and example content (features) in iterative search process
Compare with automatic speech recognition (ASR) approach
Assumptions and Limitations
Assumptions:
  Example content provided with query, and/or
  Statement of information need is available
Statistical modeling of specific semantics:
Limitations:
  Large number of semantic concepts are relevant to any video
  Insufficient training data (need training video content + labels)
  Not all concepts are easily modeled from simple visual features
Advantages:
  Feasible to train small number of statistical models for generic concepts (e.g., indoors vs. outdoors, nature vs. man-made)
  Complex concepts composed from generic concepts: e.g., waterskiing: outdoors + water + people + boat
  Complements content-based retrieval (CBR): e.g., CBR -> Model 1 -> Model 2 -> CBR -> … -> Results
Shot Boundary Detection
SMPTE 00:12:45:20
Detects cuts, dissolves, fades and other gradual changes
Compare multiple pairs of frames: 1, 3 and 7 frames apart
Processes decoded frames
  Supports MPEG, QT, AVI, live feed, ...
No knobs (e.g., "sensitivity") or tuning by user
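The slides do not describe how the frame pairs are compared, so the following is only a guess at the general idea, not the detector being presented: frames are reduced to normalized histograms, and the differences at gaps of 1, 3 and 7 frames are used to tell an abrupt cut (large difference between adjacent frames) from a gradual change (difference that grows steadily with the gap). The histogram feature, the decision rule and the fixed threshold are all assumptions of this sketch; the system on the slide hides any such parameters from the user.

```python
def hist_diff(h1, h2):
    """Half the L1 distance between two normalized histograms (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

def classify_frame(frames, t, threshold=0.5, eps=1e-9):
    """Label frame t as 'cut', 'gradual' or 'none' using gaps of 1, 3 and 7 frames."""
    if t < 7:
        raise ValueError("need at least 7 preceding frames")
    d1, d3, d7 = (hist_diff(frames[t - gap], frames[t]) for gap in (1, 3, 7))
    if d1 >= threshold:
        return "cut"        # adjacent frames already differ strongly
    if d7 >= threshold and eps < d1 < d3 < d7:
        return "gradual"    # change accumulates steadily over several frames
    return "none"

def mix(alpha):
    """Two-bin histogram blending scene A (alpha=0) into scene B (alpha=1)."""
    return [1.0 - alpha, alpha]

# Hard cut: 10 frames of scene A followed by 10 frames of scene B.
cut_clip = [mix(0.0)] * 10 + [mix(1.0)] * 10

# Dissolve: scene A fades into scene B over 8 steps.
dissolve_clip = [mix(0.0)] * 8 + [mix(k / 8) for k in range(1, 8)] + [mix(1.0)] * 8

label_cut = classify_frame(cut_clip, 10)
label_after_cut = classify_frame(cut_clip, 12)
label_dissolve = classify_frame(dissolve_clip, 14)
```

Looking at several gaps at once is what lets a detector see slow transitions such as dissolves and fades, which produce only small differences between adjacent frames.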
TREC Video Shot Boundary Detection Results
[Figure: NIST precision-recall for all types of edits. Recall (horizontal axis) and precision (vertical axis) both range from 0.5 to 1; points are plotted for the numbered systems and their average (Avg).]
TREC Video Retrieval Results
[Figure: TREC Video Retrieval (IBM Research General Search Results). Bar chart of the number of hits (0 to 20) per TREC Video Retrieval topic (vt11 through vt74) for three runs: IBM_A_ASR, IBM_A_CBR and IBM_A_C+S.]
ASR = Speech only; CBR = Content-based only; C+S = Combination
TREC Video Retrieval 2002
Shot Boundary Detection not a separate task
Retrieval task to use NIST generated shots
Manual queries only, interactive and automatic retrieval tasks
MAP to be used as evaluation measure
No "Known Item" evaluation (?)
Searching Video Collections: Representation, Indexing, Browsing and Evaluation
Introduction to Multimedia Information Retrieval
Effective MMIR
  Multimedia Representation
  Multimedia Indexing
  Query Formulation
  Multimedia Retrieval
  Browsing
  Distribution/Streaming
  Evaluation
Multimedia IR Applications
Conclusions