searching video collections: representation, indexing ... · played material, statistical analysis...

Dulce Ponceleon

Searching Video Collections: Representation, Indexing, Browsing and Evaluation

Part II

Universidad de Chile December 2002

CIW-DCC, Universidad de Chile

Audio Features in Multimedia

Features depend on audio category
    Speech
    Music
    Sounds (e.g., explosions, street noise, etc.)

Features
    Energy, Loudness
    Pitch
    Cepstral Coefficients
    Beat
    Harmonics


Audio Indexing

Features: Pitch, Loudness, Energy, Mel Cepstral Coefficients, Zero-Crossing Rate
Speech-Music Discrimination
Speaker Identification
Music Retrieval
    Query-by-Humming
    Beat Analysis
Foreground/background
    Need to find tiger without regard to background
    Audio sounds are often isolated

Towards the "Google system" for audio retrieval...


Examples

Pitch

Vowels


Speech Sounds

Speech sounds are created by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener's ears or to a microphone through the air, where speech and other sounds take the form of waves.

It is not possible to read the phonemes in a waveform, but if we analyze the waveform into its frequency components, we obtain a spectrogram which can be deciphered.

We apply a mathematical technique called Fourier analysis to the speech waveform to discover which frequencies are present at any given moment in the speech signal. The result of Fourier analysis is a spectrum.

Vowels can usually be distinguished by the frequency values of the first two or three formants, which are called F1, F2, and F3.

The audible frequency range in human beings extends from 20 Hz to 20,000 Hz (20 kHz).
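A minimal sketch of the Fourier analysis step described above. It assumes a naive pure-Python DFT (slow, for illustration only) and a synthetic 200 Hz tone standing in for real speech; the function names and the 8 kHz sample rate are illustrative assumptions, not taken from the slides.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags

def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest non-DC component in the spectrum."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)

# Synthetic test signal: a pure 200 Hz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(400)]
peak_hz = dominant_frequency(tone, sample_rate)
print(peak_hz)
```

A spectrogram repeats this analysis on short overlapping frames and stacks the resulting spectra over time; formants such as F1 and F2 then show up as bands of high energy.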


Spectrograms


Applications

Recognition of Speech
Recognition of Silence
Recognition of Music
    Music + Security = active area
Newscast Recognition
Recognition of Commercials


Applications ...

Recognition of Music
    Recognition of the song to link to metadata
Broadcast monitoring:
    Monitor radio programs, scheduled transmission of advertisement spots, ensure composers' royalties for played material, statistical analysis of played material
Music Sales
    Record signatures of music/sound for small handheld devices
Audio Fingerprinting
    A compact representation of the signal features for matching. It captures the essence of the music item and thus can be used as a fingerprint of the music item.


Audio Classification

Audio-waveform

Cepstral Coefficients

Pitch

Energy/Loudness

Speech Recognition

Music/Sound

Segmentation

Musclefish.com


[Figure: waveform of the utterance "A Huge Tapestry Hung in Her Hallway" (horizontal axis 0 to 3.5, amplitude -0.5 to 0.5), plotted above its zero crossing rate (0 to 200)]

Zero Crossing Rate

Note that the zero crossing rate goes way up in the quiet regions (where there is only noise) and way down when the energy is high, which happens during voiced sounds, when the signal is repetitive.
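The behaviour noted above can be reproduced with a small sketch. It assumes a synthetic signal (low-level random noise followed by a 120 Hz sine standing in for a voiced sound) and a frame size of 400 samples; all names and values are illustrative assumptions.

```python
import math
import random

def split_frames(signal, size, hop):
    """Cut the signal into fixed-size frames, advancing by hop samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

random.seed(0)
rate = 8000
quiet = [random.uniform(-0.01, 0.01) for _ in range(800)]              # noise only
voiced = [0.5 * math.sin(2 * math.pi * 120 * t / rate) for t in range(800)]
signal = quiet + voiced

# One (zero crossings, energy) pair per frame: two quiet frames, two voiced.
stats = [(zero_crossings(f), energy(f)) for f in split_frames(signal, 400, 400)]
for zc, en in stats:
    print(zc, round(en, 6))
```

The quiet frames show many zero crossings and almost no energy, while the voiced frames show few crossings and high energy, which is why the pair is a common cue for speech/silence and speech/music discrimination.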


Spectrogram of "A Huge Tapestry Hung in Her Hallway"

[Figure panels: Filterbank output, DCT Coefficients, DCT Reconstruction. Axes: Time (frame number) vs. Frequency]
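The panels in this figure correspond to the usual way cepstral coefficients are obtained: take the log of the filterbank energies in one frame and apply a discrete cosine transform (DCT). A minimal sketch follows, assuming an unnormalized DCT-II and made-up filterbank energies for a single frame; the values and function names are illustrative assumptions.

```python
import math

def dct2(values):
    """Unnormalized DCT-II of a sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]

def idct2(coeffs, keep=None):
    """Invert dct2, optionally using only the first `keep` coefficients."""
    n = len(coeffs)
    keep = n if keep is None else keep
    out = []
    for i in range(n):
        total = coeffs[0] / 2
        for k in range(1, keep):
            total += coeffs[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
        out.append(2 * total / n)
    return out

# Hypothetical filterbank energies for one frame (8 bands).
energies = [1.0, 2.0, 4.0, 8.0, 4.0, 2.0, 1.0, 0.5]
log_energies = [math.log(e) for e in energies]

cepstrum = dct2(log_energies)          # "DCT Coefficients"
full = idct2(cepstrum)                 # exact reconstruction
smooth = idct2(cepstrum, keep=3)       # "DCT Reconstruction" from 3 coefficients
print([round(c, 3) for c in cepstrum[:3]])
```

Keeping only the first few coefficients gives a smoothed spectral envelope, which is the compact description used as a feature vector.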


Music Retrieval

Music Database

Query by Humming

Query Name by Example

Query

MIDI

Waveform

Waveform

Humming

Hold microphone to the radio



Audio Indexing Example: Acoustic Identification

Predicted words (stemmed): anim (.108), hors (.105), left (.086), trot (.065), approach (.059), track (.047), walk (.040), depart (.037)

True label: animals: horses, two horses trot past on rough track left to right

Predicted words (stemmed): bird (.11), ambienc (.107), jungl (.104), morn (.094), Africa (.093), anim (.054), bark (.029), dog (.020), cricket (.018)

True label: jungle, Africa, Africa: morning ambience, birds

[Slaney02]


Semantic Audio Retrieval

Acoustic Space
Semantic Space
Mixture of Probability Experts

[Diagram: mapping between Semantic Space and Acoustic Space, with example labels Step, Whinny, Horse, Trot]

Acoustic Retrieval    Semantic Retrieval


Speech: A Brief History

Recognition started in the late 60s to early 70s
Multimedia indexing started in the late 80s
Research has been ongoing
    Carnegie Mellon, Columbia University, Georgia Institute of Technology, and University of Texas
Products have been available for only about five years
    BBN, IBM, CMU, Fast-Talk, and ScanSoft
Acceptable performance and accuracy level for commercial use in the last 18 months
Products are generally integrated into larger systems.


Speech Indexing

Large Vocabulary Automatic Speech Recognition (ASR)
Combined with Text Information Retrieval
Combined with Phonetic Retrieval for OOV words
Combined with Query Expansion/Document Expansion based on N-best or external corpus


Automatic Speech Recognition (ASR)

Closed vocabulary (100,000 words)
Misses out-of-vocabulary (OOV) words, like names of people, places, products, companies, acronyms, etc.
Types of errors: word split, word join, word substitution
Typical error rate: 30%-50% word error rate (WER)


SpeechBot Retrieval Interface

Copyright Cambridge Research Labs


Phonetic Speech Retrieval

Example: training video, pilot work with Boeing
The original speech: "...you can now arm the door and emergency ... "
Speech recognition result: "...you can now are on the door and emergency ... "
The query "arm the door" is missed by text search, but can be found by phonetic search.
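To make the error concrete, the sketch below computes the word error rate (WER) for the excerpt above as word-level edit distance divided by the reference length, and shows that a plain text match for the query fails on the recognizer output. The function names are illustrative; only the two quoted word sequences come from the slide (with the ellipses dropped).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

reference = "you can now arm the door and emergency"
hypothesis = "you can now are on the door and emergency"

wer = word_error_rate(reference, hypothesis)
found_by_text_search = "arm the door" in hypothesis
print(wer, found_by_text_search)
```

Here one substitution ("arm" to "are") and one insertion ("on") over eight reference words give a WER of 0.25 for this excerpt, and the text query is missed.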


Approaches to Phonetic Speech Retrieval

Phone recognition with phonetic string index [Schauble95]
Combined word and phonetic IR using Phone Lattice Scanning [James95, Jones96]
Inverted Index of Phonetic Substrings [Witbrock97]
Confusion Matrix based Phonetic Indexing [van Leeuwen, Srinivasan00]


Word and Phone Lattices

Lattice: directed acyclic graph capturing the multiple hypotheses of an ASR system
Can be generated for words or sub-words (phones)
May be shallow or deep
Hypothetical decoding of the phrase "Please be quite sure":
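A word lattice can be sketched as a small directed acyclic graph whose edges carry competing word hypotheses. The alternatives below ("pleas", "bee", "quiet", "shore") are invented for illustration and are not the contents of the original figure; node names are likewise hypothetical.

```python
# Each node maps to its outgoing edges: (word hypothesis, next node).
LATTICE = {
    "n0": [("please", "n1"), ("pleas", "n1")],
    "n1": [("be", "n2"), ("bee", "n2")],
    "n2": [("quite", "n3"), ("quiet", "n3")],
    "n3": [("sure", "n4"), ("shore", "n4")],
}

def hypotheses(lattice, node, final):
    """Enumerate every word sequence on a path from node to final."""
    if node == final:
        return [[]]
    result = []
    for word, nxt in lattice.get(node, []):
        for rest in hypotheses(lattice, nxt, final):
            result.append([word] + rest)
    return result

all_paths = hypotheses(LATTICE, "n0", "n4")

# Indexing every word on every edge keeps alternatives that the single
# best path would drop.
index_terms = {word for edges in LATTICE.values() for word, _ in edges}
print(len(all_paths), sorted(index_terms))
```

A deeper lattice keeps more alternatives per time span, which raises recall at the cost of a larger index and more false matches.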


Using n-best Data for Indexing

Expand document/query representation with n-best words
Index sounds-like phones using (n-best) common meta-phone.

Group of phones                            Metaphone
AA, AE, AH, AO, AW, AX, AXR, AY, EH, ER    Atl
CH, JH, SH                                 Ctl
EI, IH, IX, IY                             Etl
B, BD, DD, GD                              Gtl
TH, F                                      Ttl
N, NG                                      Ntl
G, D                                       Gtl
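A minimal sketch of key generation with meta-phones, using the groups from the table as they appear in the extracted slide (including the two rows labeled Gtl). Passing unlisted phones through unchanged and using trigram keys are assumptions made for illustration.

```python
# (metaphone class, phones) rows as listed on the slide.
META_PHONE_GROUPS = [
    ("Atl", ["AA", "AE", "AH", "AO", "AW", "AX", "AXR", "AY", "EH", "ER"]),
    ("Ctl", ["CH", "JH", "SH"]),
    ("Etl", ["EI", "IH", "IX", "IY"]),
    ("Gtl", ["B", "BD", "DD", "GD"]),
    ("Ttl", ["TH", "F"]),
    ("Ntl", ["N", "NG"]),
    ("Gtl", ["G", "D"]),
]
PHONE_TO_META = {p: meta for meta, phones in META_PHONE_GROUPS for p in phones}

def to_meta_phones(phones):
    """Replace each phone by its class; unlisted phones pass through (assumption)."""
    return [PHONE_TO_META.get(p, p) for p in phones]

def generate_keys(phones, n=3):
    """Overlapping n-grams of meta-phones, joined into string keys."""
    meta = to_meta_phones(phones)
    return ["-".join(meta[i:i + n]) for i in range(len(meta) - n + 1)]

# Two phone strings that differ only by easily confused phones.
keys_a = generate_keys(["SH", "IY", "N"])
keys_b = generate_keys(["CH", "IH", "NG"])
print(keys_a, keys_b)
```

Because confusable phones collapse to the same class, a recognition error inside a class still produces the same index key.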


Phonetic Retrieval – A System Overview

Indexing:
    Audio input → Speech to Phones → Timed phonetic transcript → Generate Keys (use meta-phones) → Keys list → Populate Index → Phonetic Index

Retrieval:
    Text query → Text to Phones → Phonetic query → Generate Keys (use meta-phones) → Keys list → Retrieve & Merge → List of candidates → Bayesian edit distance → Ranked results

Stores shared by both paths: Phonetic Index, Timed phonetic transcript
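The two paths in this overview can be sketched end to end on toy data. This is an illustration under stated assumptions, not the system on the slide: keys are plain phone trigrams (no meta-phone mapping), ranking uses ordinary Levenshtein distance in place of the Bayesian edit distance, and the phone strings and clip names are hypothetical.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Plain Levenshtein distance over phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def trigram_keys(phones):
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def build_index(transcripts):
    """Indexing path: map each key to the (clip, position) pairs where it occurs."""
    index = defaultdict(set)
    for clip_id, entries in transcripts.items():
        phones = [phone for _, phone in entries]
        for pos, key in enumerate(trigram_keys(phones)):
            index[key].add((clip_id, pos))
    return index

def search(query, transcripts, index):
    """Retrieval path: look up keys, merge candidates, rank by edit distance."""
    candidates = set()
    for offset, key in enumerate(trigram_keys(query)):
        for clip_id, pos in index.get(key, ()):
            candidates.add((clip_id, max(pos - offset, 0)))
    ranked = []
    for clip_id, start in candidates:
        entries = transcripts[clip_id]
        window = [phone for _, phone in entries[start:start + len(query)]]
        ranked.append((edit_distance(query, window), clip_id, entries[start][0]))
    return sorted(ranked)  # (distance, clip, start time in ms)

def with_times(phones, step_ms=100):
    """Attach hypothetical timestamps, one phone every step_ms milliseconds."""
    return [(i * step_ms, p) for i, p in enumerate(phones)]

transcripts = {
    "clipA": with_times("N AW AA R M DH AX D AO R".split()),  # "now arm the door"
    "clipB": with_times("OW P AX N DH AX D AO R".split()),    # "open the door"
}
index = build_index(transcripts)
results = search("AA R M DH AX D AO R".split(), transcripts, index)
print(results)
```

The timed transcript is what lets a ranked hit be turned back into a playback position in the audio.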


An ASR Example: ScanSoft


Copyright Virage Video Logger

Virage Video Logging Interface


Speech Indexing Performance

Note: speech corpora, queries and metrics vary significantly; therefore it is very difficult to compare precision-recall (p-r) numbers directly
Two classes of evaluations:
    Document/story retrieval, where p-r numbers are reported as a % of full text retrieval
        Cambridge reports 82-85% relative precision compared to perfect text retrieval with WER 47%
        CMU reports ~80.2% relative precision compared to perfect text retrieval with WER 50.7%
    Word spotting evaluation against manual full text transcriptions
        VMR from Cambridge reports average precision of 0.315
        TNO reports 100/hr for a 3 phone word and 8/hr for a 6 phone word [van Leeuwen]


In Vocabulary Words by Length

[Figure: precision-recall curves for short words and long words. Axes: Recall (0 to 1) vs. Precision (0 to 1)]

Overall 74% precision at 50% recall.

DATA SET

100 hours of broadcast news (1.04M words)

QUERIES: 24,000 single-word queries

EVALUATION

Compare retrieved matches with ground-truth manual, time-aligned transcript (objective)

Use exact word match (conservative)
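The evaluation described above can be sketched as follows: each retrieved match counts as correct only if the same word appears in the time-aligned ground truth close to the reported time. The 0.5 second tolerance, the function name and the sample data are assumptions for illustration; the slide does not state how time alignment was scored.

```python
def precision_recall(hits, truth, tolerance=0.5):
    """hits and truth are lists of (word, time in seconds).

    A hit is correct if an unused ground-truth entry has exactly the same
    word and lies within `tolerance` seconds (tolerance is an assumption).
    """
    unmatched = list(truth)
    correct = 0
    for word, time in hits:
        for ref in unmatched:
            if ref[0] == word and abs(ref[1] - time) <= tolerance:
                unmatched.remove(ref)   # each true occurrence is used once
                correct += 1
                break
    precision = correct / len(hits) if hits else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical single-word query "door".
truth = [("door", 12.0), ("door", 40.5), ("door", 75.0), ("door", 90.2)]
hits = [("door", 12.1), ("door", 40.4), ("door", 60.0)]

precision, recall = precision_recall(hits, truth)
print(round(precision, 3), recall)
```

Sweeping a score threshold on the retrieved matches and repeating this computation yields one precision-recall curve per query set, such as the short-word and long-word curves in the figure.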


Mixing Document Collections

Poorly translated documents less likely to be retrieved
Problem faced by Cross Language Information Retrieval Community
Problem: How to address ranking bias at a per document level
Some ideas: Detect collection source, estimate noise/error rate and compensate in ranking scheme


Performance Claims

Fast-Talk says it can
    Index a one-hour audio file in 5 min
    Process 30 hours of content per second in response to a specific, 10-phoneme search query (2.53-GHz Pentium CPU)
Recognition faster than 1.5 x real-time
On 100 hours: how long a query took to run (CIKM talk)


Multiple Languages Support

Demand is slowly growing
IBM's Via-Voice is available in 9 languages
BBN's Audio Indexer: generates searchable transcripts in Arabic, Chinese, English or Spanish in real time on a standard PC
Porting a product is expensive and time-consuming
    Collect and transcribe acoustic data
    Train and evaluate new acoustic models


Conquered Challenges

Real time performance
Speaker Independent Technology
    Gender, age, dialect, style, etc.
Acoustic Models tuned to different environments: telephony, TV or radio
Language Models
Speaker identification
Speaker segmentation
Reduced background noise
Tools to customize your language model
Training
Fast algorithms for indexing and searching


Open Challenges

Users have high expectations
Precision and price might impact widespread adoption in some cases
    court reporting, medical dictation
    expensive systems, as high as $100,000
Conversational speech recognition (e.g., meetings)
How to go beyond incremental improvements
What about the killer app?
Multiple Languages



Introduction to Multimedia Information Retrieval
Effective MMIR
    Multimedia Representation
    Multimedia Indexing
    Query Formulation
    Multimedia Retrieval
    Browsing
    Distribution/Streaming
    Evaluation
Multimedia IR Applications
Conclusions


What is a Multimedia Query?

Keywords
Natural language query
With Multimedia:
    sample image
    basic art tools (sketch, shape, etc.)
    Query by Humming


MMIR Query Processing

Query formulation beyond words
    Sample image, color images, texture, shapes, etc. (i.e., hard to input)
    Find a picture of a satellite
Ambiguous and incomplete queries
Inexact media representation
Relevance detected easily


Text Retrieval/Search Strategies

Query Compared with Document Representation
Boolean Search
Matching Functions (SMART system/cosine correlation)
Serial Search
Cluster based retrieval


Retrieval Models

Key elements
    Document and query representation, similarity measure, retrieval function
Retrieval Models
    Boolean: Text, Speech
    Statistical vector space: Text, ASR Text, Images
    Probabilistic: Images, Audio, Video
Distinction between model and implementation


Multimedia Retrieval Implementation using Models

Concept Layer: Relationships/Events
Object Layer: Blocks of attributes
Feature Layer: Colors, shapes, textures, text
Media/Data Layer: Images, video, audio, ...


Similarity Measures

Distance Measures
    Mean Character Difference
    Minkowski Metric
        Manhattan, Euclidean, Chebyshev
Correlation Coefficients
    Correlation Coefficients
    Cosine Measure
Association Coefficients
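For reference, the Minkowski metric of order p between feature vectors x and y is (sum over i of |x_i - y_i|^p)^(1/p); p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p → infinity the Chebyshev distance. A short sketch with illustrative function names and vectors:

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p; p = math.inf gives the Chebyshev distance."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x, y = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(x, y, 1)
euclidean = minkowski(x, y, 2)
chebyshev = minkowski(x, y, math.inf)
similarity = cosine([1.0, 0.0], [1.0, 1.0])
print(manhattan, euclidean, chebyshev, round(similarity, 4))
```

Distances grow as items become less alike, whereas the cosine measure grows as they become more alike, so the two families must be converted to a common direction before they are combined.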


Combining Indexes: Multimodal Retrieval

Weighted Sum with different Normalization Schemes
Adaptive Weights: Each Modality is Query Dependent – use thresholds as measure of similarity
Domain Knowledge Modeling: represent knowledge as concept tree, frames, semantic nets
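A minimal sketch of the weighted-sum approach, assuming min-max normalization and fixed equal weights. The modality names, scores and weights are hypothetical; other normalization schemes (rank-based, z-score) would slot into the same structure.

```python
def min_max(scores):
    """Rescale one modality's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}

def fuse(modalities, weights):
    """Weighted sum of normalized scores; returns items ranked best first."""
    fused = {}
    for name, scores in modalities.items():
        for item, s in min_max(scores).items():
            fused[item] = fused.get(item, 0.0) + weights[name] * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores on very different scales.
modalities = {
    "speech": {"shot1": 12.0, "shot2": 3.0, "shot3": 0.0},
    "visual": {"shot1": 0.2, "shot2": 0.9, "shot3": 0.4},
}
weights = {"speech": 0.5, "visual": 0.5}

ranked = fuse(modalities, weights)
print(ranked)
```

Without normalization the speech scores would dominate simply because of their scale; query-dependent (adaptive) weights would replace the fixed `weights` dictionary.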


Semantic-based Retrieval Example

Keywords: rose flower plant leaves

Copyright Berkeley Blobworld system



Query on

“Rose”

Copyright Berkeley Blobworld System



Query on

Copyright Berkeley Blobworld system


Semantic-based Retrieval Example

Query on

and

“Rose”

Copyright Berkeley Blobworld system



Appearance counts!

Semantics counts!

Copyright Berkeley Blobworld System


Effective MMIR

Overview
Media Representation
Media Indexing
Query Formulation
Media Retrieval
Browsing
Distribution/Streaming
Evaluation


TREC Goals

To increase research in information retrieval based on large-scale collections
To provide an open forum for exchange of research ideas to increase communication among research, academia and government
To improve evaluation methodologies and measures for text retrieval
To create a series of collections covering different aspects of text retrieval [Voorhees00]


TREC Tasks and Evaluations

Traditional precision and recall measures with binary relevance judgments
Three types of automatic tasks:
    Ad-hoc task: new queries against existing (static) data
    Routing and Filtering tasks: old queries against new data
    Specialized tasks: question answering, known-item task
Interactive task: different functional systems are compared by giving tasks to human searchers, who use a real-time interface to an experimental system to obtain the best possible results in under 5 minutes.


Past SDR Test Collections

                           TREC-6 '97                TREC-7 '98                           TREC-8 '99
Broadcast News Collection  43 Hours                  87 Hours                             557 Hours
                           1996-97                   1996-97                              Jan-Jun 1998
                           1,451 Stories             2,866 Stories                        21,754 Stories
                           ~276 words/story          ~269 words/story                     ~169 words/story
Baseline ASR Transcripts   IBM (50% WER)             NIST/CMU SPHINX (33.8% / 46.6% WER)  NIST/BBN Byblos (27.5% / 26.7% WER)
Paradigm                   Known-Item (% at rank 1)  Ad Hoc (MAP)                         Ad Hoc (MAP)
Queries                    50                        23                                   49

MAP = Mean Average Precision


IR Metrics

Traditional TREC ad-hoc metric:
    Mean Average Precision (MAP) using TREC_EVAL
Created assessment pools for each topic using top 100 of all retrieval runs
    Mean pool size: 596 (2.1% of all segments)
    Min pool size: 209
    Max pool size: 1309
NIST assessors created reference relevance assessments from topic pools
Somewhat artificial for boundary unknown conditions
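A minimal sketch of the two ideas on this slide: building an assessment pool from the top-ranked items of several runs, and computing (non-interpolated) average precision and MAP from binary relevance judgments. This is an illustration, not the TREC_EVAL implementation; topic and story IDs are made up.

```python
def assessment_pool(runs_for_topic, depth=100):
    """Union of the top `depth` items from every run submitted for one topic."""
    pool = set()
    for ranked in runs_for_topic:
        pool.update(ranked[:depth])
    return pool

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item found."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """Average of the per-topic average precision over all judged topics."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)

# Hypothetical run and relevance judgments for two topics.
run = {"topic1": ["s1", "s2", "s3", "s4"], "topic2": ["s5", "s6"]}
qrels = {"topic1": {"s1", "s3"}, "topic2": {"s6"}}

map_score = mean_average_precision(run, qrels)
pool = assessment_pool([["s1", "s2", "s3"], ["s3", "s4", "s5"]], depth=2)
print(round(map_score, 4), sorted(pool))
```

Only pooled items are judged by the assessors; anything outside the pool is treated as not relevant when the metric is computed.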


Known Story Boundary Condition

Retrieval using pre-segmented news stories
    systems given index of story boundaries for recognition with IDs for retrieval
        excluded non-news segments
        stories are treated as documents
    systems produce rank-ordered list of Story IDs
Document-based scoring:
    score as in other TREC Ad Hoc tests using TREC_EVAL


Unknown Story Boundary Condition

Retrieval using continuous speech stream
    systems process entire broadcasts for ASR and retrieval with no provided segmentation
    systems output a single time marker for each relevant excerpt to indicate topical passages
        this task does NOT attempt to determine topic boundaries
Time-based scoring:
    map to a story ID ("dummy" ID for retrieved non-stories and duplicates)
    score as usual using TREC_EVAL
    penalizes for duplicate retrieved stories
    story-based scoring somewhat artificial but expedient
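The mapping step in time-based scoring can be sketched as below. The story boundaries, times and the naming of dummy IDs are assumptions for illustration; the actual NIST scoring scripts may differ in detail.

```python
def map_times_to_stories(times, stories):
    """Convert retrieved time markers into story IDs for document-style scoring.

    `stories` is a list of (story_id, start, end) in seconds. A time that
    falls outside every story, or inside a story already retrieved, gets a
    unique dummy ID, so it counts as a non-relevant retrieved item.
    """
    seen = set()
    mapped = []
    for n, t in enumerate(times):
        story = next((sid for sid, start, end in stories if start <= t < end), None)
        if story is None or story in seen:
            mapped.append(f"dummy{n}")
        else:
            seen.add(story)
            mapped.append(story)
    return mapped

# Hypothetical broadcast: three stories with a non-news gap from 150 s to 180 s.
stories = [("st1", 0, 60), ("st2", 60, 150), ("st3", 180, 240)]
ranked_times = [70, 20, 95, 160]

mapped = map_times_to_stories(ranked_times, stories)
print(mapped)
```

The resulting list can be scored like any ranked list of document IDs; the third marker repeats story st2 and the fourth lands in the gap, so both lower precision.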


TREC SDR Conclusions

Ad hoc retrieval in broadcast news domain appears to be a "solved problem"
    systems perform well at finding relevant passages in transcripts produced by a variety of recognizers on full unsegmented news broadcasts
    performance on own recognizer comparable to human reference
Just beginning to investigate use of non-lexical information
Caveat Emptor
    ASR may still pose serious problems for Question Answering domain where content errors are fatal


TREC Video Retrieval

Challenge:
    Answering "semantic" queries for video content
        e.g., "retrieve video showing rocket launches"
        e.g., "retrieve clips of people water skiing"
    Fully automated content analysis (w/o transcript or manual annotation of content)
Types of Queries:
    Automatic queries: feature extraction (e.g., color, texture, edges, motion) and content-based retrieval (CBR) using examples
    Interactive queries: CBR + statistical modeling of features
        Off-line: automatic classification of video content using statistical models of generic concepts (e.g., scenes, events, objects)
        Query time: user selection of classifiers (models) and example content (features) in iterative search process
Compare with automatic speech recognition (ASR) approach


Assumptions and Limitations

Assumptions:
    Example content provided with query, and/or
    Statement of information need is available
Statistical modeling of specific semantics:
    Limitations:
        Large number of semantic concepts are relevant to any video
        Insufficient training data (need training video content + labels)
        Not all concepts are easily modeled from simple visual features
    Advantages:
        Feasible to train a small number of statistical models for generic concepts (e.g., indoors vs. outdoors, nature vs. man-made)
        Complex concepts composed from generic concepts: e.g., waterskiing = outdoors + water + people + boat
        Complements content-based retrieval (CBR): e.g., CBR → Model 1 → Model 2 → CBR → … → Results


Shot Boundary Detection

SMPTE 00:12:45:20

Detects cuts, dissolves, fades and other gradual changes
Compares multiple pairs of frames: 1, 3 and 7 frames apart
Processes decoded frames
    Supports MPEG, QT, AVI, live feed, …
No knobs (e.g., "sensitivity") or tuning by user
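The idea of comparing frames 1, 3 and 7 apart can be illustrated with a toy sketch. This is not the detector described on the slide: it assumes each decoded frame has been reduced to a normalized two-bin histogram, uses a simple histogram difference, and relies on a fixed internal threshold of 0.5 chosen only for the example.

```python
def hist_diff(a, b):
    """Half the L1 distance between two normalized histograms (range 0 to 1)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / 2

def classify_transition(hists, i, threshold=0.5):
    """Label the change starting at frame i as 'cut', 'gradual' or 'none'."""
    def diff(gap):
        j = i + gap
        return hist_diff(hists[i], hists[j]) if j < len(hists) else 0.0

    d1, d3, d7 = diff(1), diff(3), diff(7)
    if d1 > threshold:
        return "cut"            # adjacent frames already differ strongly
    if d1 < d3 < d7 and d7 > threshold:
        return "gradual"        # difference builds up steadily over the gaps
    return "none"

scene_a, scene_b = [1.0, 0.0], [0.0, 1.0]

# Hard cut between frames 7 and 8.
hard_cut = [scene_a] * 8 + [scene_b] * 8

# Eight-frame dissolve from scene A to scene B starting at frame 8.
dissolve = ([scene_a] * 8
            + [[1 - k / 8, k / 8] for k in range(1, 9)]
            + [scene_b] * 8)

labels = (classify_transition(hard_cut, 7),
          classify_transition(hard_cut, 3),
          classify_transition(dissolve, 8))
print(labels)
```

The wider gaps are what expose dissolves and fades: consecutive frames barely differ, but frames 7 apart do.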


TREC Video Shot Boundary Detection Results

[Figure: NIST precision-recall for all types of edits. Axes: Recall (0.5 to 1) vs. Precision (0.5 to 1); data points labeled by number, plus Avg]


TREC Video Retrieval Results

[Figure: TREC Video Retrieval (IBM Research General Search Results). Number of hits (0 to 20) per TREC Video Retrieval topic (vt11 to vt74) for three runs: IBM_A_ASR, IBM_A_CBR, IBM_A_C+S]

ASR = Speech only    CBR = Content-based only    C+S = Combination


TREC Video Retrieval 2002

Shot Boundary Detection not a separate task
Retrieval task to use NIST generated shots
Manual queries only, interactive and automatic retrieval tasks
MAP to be used as evaluation measure
No "Known Item" evaluation (?)



