Survey of Approaches to Information Retrieval of Speech Messages
Kenney Ng, Spoken Language Systems Group, Laboratory for Computer Science
Massachusetts Institute of Technology
Presenter: Chia-Hao Lee
Outline
• Introduction
• Information Retrieval
• Text Retrieval
• Differences between text and speech media
• Information Retrieval of Speech Messages
Introduction
• Process, organize, and analyze the data.
• Present the data in human-usable form.
• Find the "interesting" pieces of information efficiently.
• Increasingly large portions of information are in spoken-language form:
  – recorded speech messages
  – radio and television broadcasts
• Development of automatic methods.
Information Retrieval: Definition
• "connected with the representation, storage, organization, and accessing of information items."
• Return the best matches to a "request" formulated from the user's information need.
• There is no restriction on the type of document:
  – Text Retrieval, Document Retrieval
  – Image Retrieval, Speech Retrieval
  – Multi-media Retrieval
Database Retrieval vs. Information Retrieval
• Similarity:
  – The existence of an organized collection of information items.
  – The use of a request formulated by a user to access the items.
• Differences:
  – Goal: database retrieval returns specific facts (an answer that exactly matches the request); information retrieval returns documents relevant to the user's request.
  – Structure: well defined vs. not well defined.
  – The request: a complete specification of the user's information need vs. an incomplete specification.
  – Type of answer: a specific fact or piece of information vs. a general topic or subject area that the user wants to find out more about.
Information Retrieval: Component Processes
• Creating document representations (indexing)
• Creating request representations (query formation)
• Comparing representations (retrieval)
• Evaluating retrieved documents (relevance feedback)
Information Retrieval: Performance
• Recall: the fraction of all the relevant documents in the entire collection that are retrieved in response to a query.
• Precision: the fraction of the retrieved documents that are relevant.
• Average precision: the precision values obtained at each new relevant document in the ranked output for an individual query are averaged.
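To make the three measures concrete, here is a minimal Python sketch; the ranked list and relevance judgments are illustrative, and unretrieved relevant documents contribute zero to average precision (the usual convention):

```python
def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(set(ranked) & relevant) / len(relevant)

def precision(ranked, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(ranked) & relevant) / len(ranked)

def average_precision(ranked, relevant):
    """Average of the precision values at each new relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Example: 2 of the 3 relevant documents appear at ranks 1 and 3.
print(average_precision(["d1", "d5", "d2", "d4"], {"d1", "d2", "d3"}))
# (1/1 + 2/3) / 3 ≈ 0.556
```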
Information Retrieval: Related Information Processes
Information Filtering vs. Retrieval

                   | User request | Data collection | The user | Training data
Filtering          | static       | dynamic         | passive  | yes
Retrieval (ad hoc) | dynamic      | static          | active   | no
Information Retrieval: Related Information Processes
Information Categorization vs. Clustering

               | Goal                                                   | Labeled data | Training data
Categorization | classify or assign labels to documents                 | yes          | yes
Clustering     | discover structure in a collection of unlabelled data  | no           | no
Text Retrieval
• Indexing and Document Representation
• Query Formation
• Matching Query and Document Representations
Text Retrieval: Indexing and Document Representation
• Terms and Keywords
  – A list of words extracted from the full-text document.
  – Construct a stop list to remove useless words, those too common to be important.
  – The use of synonyms:
    • Construct a dictionary structure that maps words to classes.
    • Replace each word by its class.
  – A tradeoff exists between normalization and discrimination in the indexing process.
Text Retrieval: Index Term Weighting
• Term frequency
  – The frequency of occurrence of each term in the document.
  – For term $t_k$ in document $d_i$: $\mathrm{tf}(d_i, t_k)$.
• Inverse document frequency
  – Weight each term inversely proportional to the number of documents in which the term occurs.
  – For term $t_k$, with $N$ the total number of documents and $n_{t_k}$ the number of documents containing term $t_k$:

$$ \mathrm{idf}(t_k) = \log\frac{N}{n_{t_k}} $$
Text Retrieval: Index Term Weighting
• Weights of terms
  – Terms that occur frequently in particular documents but rarely in the overall collection should receive a large weight.
  – Combining term frequency and inverse document frequency, with length normalization:

$$ w_{ik} = w(d_i, t_k) = \frac{\mathrm{tf}(d_i,t_k)\,\mathrm{idf}(t_k)}{\sqrt{\sum_j \left[\mathrm{tf}(d_i,t_j)\,\mathrm{idf}(t_j)\right]^2}} = \frac{\mathrm{tf}(d_i,t_k)\,\log(N/n_{t_k})}{\sqrt{\sum_j \left[\mathrm{tf}(d_i,t_j)\,\log(N/n_{t_j})\right]^2}} $$
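As an illustration of this weighting scheme, a minimal Python sketch on a toy corpus; the documents and the whitespace tokenizer are illustrative, not from the survey:

```python
import math
from collections import Counter

docs = [
    "speech retrieval of recorded speech messages".split(),
    "text retrieval of text documents".split(),
    "broadcast news speech recognition".split(),
]
N = len(docs)

# n_tk: the number of documents containing term t_k
df = Counter(term for d in docs for term in set(d))

def tfidf_vector(doc):
    """w(d_i, t_k) = tf(d_i, t_k) * log(N / n_tk), length-normalized."""
    tf = Counter(doc)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

print(tfidf_vector(docs[0]))
```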
Text Retrieval: Query Formation
• Relevance Feedback
  – The IR system automatically modifies a query based on user feedback about documents retrieved in an initial run.
  – Advantages: add new terms to the query; re-weight existing query terms.

$$ q_{\mathrm{new}} = q_{\mathrm{old}} + \sum_{d_i \in \mathrm{rel}} d_i \;-\; \sum_{d_i \in \mathrm{nonrel}} d_i $$
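A minimal sketch of this feedback formula, using dicts as sparse term-weight vectors; dropping negative weights at the end is a common practical choice, not part of the formula:

```python
def relevance_feedback(q_old, rel_docs, nonrel_docs):
    """q_new = q_old + sum of relevant doc vectors - sum of non-relevant ones."""
    q_new = dict(q_old)
    for d in rel_docs:              # add terms from relevant documents
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) + w
    for d in nonrel_docs:           # subtract terms from non-relevant documents
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) - w
    return {t: w for t, w in q_new.items() if w > 0}  # drop negative weights
```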
Text Retrieval: Query Formation
• Another approach to relevance feedback: compute a "relevance weight" for each term $t_k$, where $p_k$ and $q_k$ are the probabilities that $t_k$ occurs in relevant and non-relevant documents:

$$ RW(t_k) = \log\frac{p_k}{1-p_k} - \log\frac{q_k}{1-q_k} = \log\frac{p_k(1-q_k)}{q_k(1-p_k)} $$

• The weight can be used to re-weight the terms in the initial query.
Text Retrieval: Matching Query and Document Representations
• Boolean Model, Extended Boolean Model
• Vector Space Model
• Probabilistic Models

Method                                                            | Model
Exact-match: divide the collection into matched and unmatched    | Boolean Model
Best-match: give each document a score and rank the collection   | Vector Space or Probabilistic Models
Text Retrieval: Boolean Model
• Document representation
  – Binary-valued variables: true if the term is present in the document, false if it is absent.
  – The document can be represented as a binary vector.
• Query
  – Boolean operators: AND, OR, and NOT.
• Matching function
  – Standard rules of Boolean logic.
  – If the document representation satisfies the query expression, that document matches the query.
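A minimal sketch of exact-match Boolean retrieval; the nested-tuple query syntax is an illustrative choice, not from the survey:

```python
# Documents are sets of terms (equivalent to binary vectors);
# queries are nested tuples such as ("AND", "speech", ("NOT", "text")).
def matches(doc_terms, query):
    if isinstance(query, str):                 # a bare term
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(doc_terms, a) for a in args)
    if op == "OR":
        return any(matches(doc_terms, a) for a in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(op)

docs = [{"speech", "retrieval"}, {"text", "retrieval"}]
query = ("AND", "retrieval", ("NOT", "text"))
print([i for i, d in enumerate(docs) if matches(d, query)])  # [0]
```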
Text Retrieval: Extended Boolean Model
• The retrieval decision of the Boolean model is too harsh.
• The extended Boolean model (for the AND query):

$$ \mathrm{sim}(q_{\mathrm{and}}, d) = 1 - \left[\frac{(1-d_1)^p + (1-d_2)^p + \cdots + (1-d_K)^p}{K}\right]^{1/p} $$

• This is maximal for a document containing all the terms and decreases as the number of matching terms decreases.
• For the OR query:

$$ \mathrm{sim}(q_{\mathrm{or}}, d) = \left[\frac{d_1^p + d_2^p + \cdots + d_K^p}{K}\right]^{1/p} $$

• This is minimal for a document that contains none of the terms and increases as the number of matching terms increases.
• The variable p is a constant in the range 1 ≤ p ≤ ∞ that is determined empirically; it is typically in the range 2 ≤ p ≤ 5.
• The model gives a "soft" Boolean matching function.
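A minimal sketch of these two p-norm matching functions, assuming each document is given as a list of K term weights in [0, 1]:

```python
def sim_and(d, p=2.0):
    """Maximal (1.0) when all query terms are fully present."""
    K = len(d)
    return 1.0 - (sum((1.0 - dk) ** p for dk in d) / K) ** (1.0 / p)

def sim_or(d, p=2.0):
    """Minimal (0.0) when none of the query terms are present."""
    K = len(d)
    return (sum(dk ** p for dk in d) / K) ** (1.0 / p)

print(sim_and([1.0, 1.0]), sim_or([0.0, 0.0]))  # 1.0 0.0
```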
Text Retrieval: Vector Space Model
• Documents and queries are represented as vectors in a K-dimensional space, where K is the number of indexing terms.
• The indexing terms are assumed to form an orthogonal basis for the vector space, which implicitly assumes that the indexing terms are independent.
• Similarity is the cosine of the angle between the query and document vectors:

$$ \mathrm{sim}(q, d) = \frac{\sum_{k=1}^{K} q_k d_k}{\sqrt{\sum_{k=1}^{K} (q_k)^2}\;\sqrt{\sum_{k=1}^{K} (d_k)^2}} $$
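A minimal sketch of this cosine similarity, with sparse term-weight dicts standing in for the K-dimensional vectors:

```python
import math

def cosine(q, d):
    """Cosine of the angle between sparse query and document vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```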
Text Retrieval: Probabilistic Model (1/6)
• Bayes' Decision Rule
  – $p(R|d,q)$ denotes the probability that document d is relevant to query q.
  – $p(\bar R|d,q)$ denotes the probability that document d is non-relevant to query q.
  – $C_r$ is the cost of retrieving a non-relevant document.
  – $C_n$ is the cost of not retrieving a relevant document.
  – The expected cost of retrieving an extraneous document is $C_r\,p(\bar R|d,q)$; retrieve document d when

$$ C_n\,p(R|d,q) \;\ge\; C_r\,p(\bar R|d,q) $$
Text Retrieval: Probabilistic Model (2/6)
• How do we compute the posterior probabilities $p(R|d,q)$ and $p(\bar R|d,q)$?
• Based on Bayes' rule:

$$ p(R|d,q) = \frac{p(d|R,q)\,p(R|q)}{p(d|q)}, \qquad p(\bar R|d,q) = \frac{p(d|\bar R,q)\,p(\bar R|q)}{p(d|q)} $$

• $p(R|q)$ and $p(\bar R|q)$ are the prior probabilities of relevance and non-relevance of a document.
• $p(d|R,q)$ and $p(d|\bar R,q)$ are the likelihoods or class-conditional probabilities.
• Taking the ratio, the normalizer $p(d|q)$ cancels:

$$ \frac{p(R|d,q)}{p(\bar R|d,q)} = \frac{p(d|R,q)\,p(R|q)}{p(d|\bar R,q)\,p(\bar R|q)} $$
Text Retrieval: Probabilistic Model (3/6)
• Now we have to estimate $p(d|R,q)$ and $p(d|\bar R,q)$.
• To simplify the functions, we make the following assumptions:
  – The document vectors are binary, indicating the presence or absence of each indexing term.
  – Each term has a binomial distribution.
  – There are no interactions between the terms.
• For $k = 1,\ldots,K$ indexing terms, where $d_k \in \{0,1\}$ is the kth term in the document vector d, let $p_k = p(d_k{=}1|R,q)$ and $q_k = p(d_k{=}1|\bar R,q)$. Then:

$$ p(d|R,q) = \prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}, \qquad p(d|\bar R,q) = \prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k} $$
Text Retrieval: Probabilistic Model (4/6, 5/6)
• Taking the log of the posterior ratio gives a scoring function that is linear in the document terms:

$$ \begin{aligned} \mathrm{sim}(q,d) &= \log\frac{p(R|d,q)}{p(\bar R|d,q)} = \log\frac{p(R|q)\,p(d|R,q)}{p(\bar R|q)\,p(d|\bar R,q)} \\ &= \log\frac{p(R|q)\prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}}{p(\bar R|q)\prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k}} \\ &= \sum_{k=1}^{K} d_k \log\frac{p_k(1-q_k)}{q_k(1-p_k)} + \sum_{k=1}^{K}\log\frac{1-p_k}{1-q_k} + \log\frac{p(R|q)}{p(\bar R|q)} \\ &= \sum_{k=1}^{K} d_k w_k + C \end{aligned} $$
Text Retrieval: Probabilistic Model (6/6)
• $w_k$ is the same as the relevance weight of the kth index term:

$$ w_k = \log\frac{p_k(1-q_k)}{q_k(1-p_k)} $$

• Assume $p_k$ is a constant value, 0.5, and estimate $q_k$ by the overall frequency $n_k/N$:

$$ w_k = \log\frac{\tfrac{1}{2}\left(1-\tfrac{n_k}{N}\right)}{\tfrac{n_k}{N}\left(1-\tfrac{1}{2}\right)} = \log\frac{N-n_k}{n_k} = \log\left(\frac{N}{n_k}-1\right) $$
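A minimal sketch of this scoring function under the stated assumptions ($p_k = 0.5$, $q_k = n_k/N$); representing documents and queries as term sets is an illustrative choice:

```python
import math

def term_weight(n_k, N, p_k=0.5):
    """w_k = log(p_k(1-q_k) / (q_k(1-p_k))) with q_k = n_k/N.
    Assumes 0 < n_k < N so the logarithm is defined."""
    q_k = n_k / N
    return math.log(p_k * (1 - q_k) / (q_k * (1 - p_k)))

def sim(doc_terms, query_terms, doc_freq, N):
    """sim(q, d) = sum_k d_k * w_k over the query terms present in the doc."""
    return sum(term_weight(doc_freq[t], N)
               for t in query_terms if t in doc_terms)
```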
Text Retrieval: Poisson Model
• Unlike the models above with binary document vectors, in this model each document vector contains the number of occurrences of each indexing term in the document.
• In the model, the probability that a document d in class R contains $n_k$ occurrences of the kth indexing term is:

$$ p(d_k = n_k \,|\, R, q) = \frac{e^{-\lambda_{k,R}}\,\lambda_{k,R}^{\,n_k}}{n_k!} $$

where $\lambda_{k,R}$ is the mean parameter for the kth indexing term in class R documents.
Text Retrieval: Poisson Model
• Similarly, for documents in class $\bar R$ we have:

$$ p(d_k = n_k \,|\, \bar R, q) = \frac{e^{-\lambda_{k,\bar R}}\,\lambda_{k,\bar R}^{\,n_k}}{n_k!} $$

• So we can form the posterior ratio; the $n_k!$ terms cancel:

$$ \frac{p(R \,|\, d_k{=}n_k, q)}{p(\bar R \,|\, d_k{=}n_k, q)} = \frac{p(d_k{=}n_k|R,q)\,p(R|q)}{p(d_k{=}n_k|\bar R,q)\,p(\bar R|q)} = e^{-(\lambda_{k,R}-\lambda_{k,\bar R})}\left(\frac{\lambda_{k,R}}{\lambda_{k,\bar R}}\right)^{n_k}\frac{p(R|q)}{p(\bar R|q)} $$
Text Retrieval: Poisson Model
• Indexing terms are selected as those with a large separation of the Poisson mean parameters:

$$ z_k = \frac{\lambda_{k,R} - \lambda_{k,\bar R}}{\lambda_{k,R} + \lambda_{k,\bar R}} $$
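A minimal sketch of per-term Poisson scoring by log-likelihood ratio; the dict-based count and parameter representation is illustrative:

```python
import math

def poisson_llr(counts, lam_topic, lam_other):
    """Sum per-term log-likelihood ratios of the two Poisson models;
    the n_k! terms cancel in the ratio. Assumes all means are > 0."""
    s = 0.0
    for k, n_k in counts.items():
        s += (lam_other[k] - lam_topic[k]) \
             + n_k * math.log(lam_topic[k] / lam_other[k])
    return s
```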
Text Retrieval: Dependence Model
• In the above models, we have assumed that the indexing terms are independent of each other.
• Modeling full dependence requires the chain-rule expansion:

$$ p(d|R,q) = p(d_1|R,q)\,p(d_2|d_1,R,q)\cdots p(d_K|d_1,d_2,\ldots,d_{K-1},R,q) $$

• But this is computationally impractical and there is not enough data, so only "partial" dependence between the indexing terms is used, with each term conditioned on at most one earlier term $d_{i_{j(k)}}$:

$$ p(d|R,q) = p(d_{i_1}|R,q)\prod_{k=2}^{K} p(d_{i_k}\,|\,d_{i_{j(k)}},R,q) $$
Differences between Text and Speech Media
• Speech is a richer and more expressive medium than text (mood, tone).
• The retrieval models must be robust to noise and errors in the transcription.
• It is hard to accurately extract and represent the contents of a speech message in a form that can be efficiently stored and searched, because of multiple microphones, multiple speakers, and so on.
Information Retrieval of Speech Messages
• Speech Message Retrieval
  – Large Vocabulary Word Recognition Approach
  – Sub-Word Unit Approaches
  – Word Spotting Approaches
• Speech Message Classification and Sorting
  – Topic Identification
  – Topic Spotting
  – Topic Clustering
Large Vocabulary Word Recognition Approach
• Suggested by CMU in the Informedia digital video library project.
• A user can interact with the text retrieval system to obtain video clips stored in the library that are relevant to his request.
• Pipeline: sound track of video → large-vocabulary speech recognizer (Sphinx-II) → textual transcript → natural language understanding → full-text information retrieval system.
Sub-Word Unit Approaches
• Syllabic Units
• Phonetic Units

Syllabic Units
• VCV (vowel-consonant-vowel) features
  – Sub-word units consist of a maximal sequence of consonants enclosed between two maximal sequences of vowels.
  – e.g., INFORMATION has the VCV features INFO, ORMA, ATIO.
  – A subset of these features is taken as the indexing terms.
Syllabic Units
• Criteria for selecting a feature
  – The feature occurs frequently enough for a reliable acoustic model to be trained for it.
  – It does not occur so frequently that its ability to discriminate between different messages is poor.
• Process: the query is converted into VCV features with tf*idf weights, compared against the document representations using the cosine similarity function, and the documents with the highest scores are returned.
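A minimal sketch of VCV feature extraction as described above, operating on letter strings with a fixed vowel set (a real system would work on phone sequences):

```python
import re

def vcv_features(word):
    """Maximal consonant run enclosed between two maximal vowel runs."""
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    feats = []
    for i in range(len(runs) - 2):
        v1, c, v2 = runs[i : i + 3]
        if v1[0] in "aeiou" and c[0] not in "aeiou":
            feats.append(v1 + c + v2)
    return feats

print(vcv_features("information"))  # ['info', 'orma', 'atio']
```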
Syllabic Units
• Major problem
  – The acoustic confusability of the VCV-feature-based approach is not taken into account during the selection of indexing features; they are selected based only on the text transcription.
  – So it may have a high false-alarm rate.
Phonetic Units
• Variable-length phone sequences are used as indexing features.
  – These features can be viewed as "pseudo-words" and were shown to be useful for detecting or spotting topics in recorded military radio broadcasts.
  – An automatic procedure based on "digital trees" is used to search the possible subsequences.
  – A Hidden Markov Model (HMM) phone recognizer with 52 monophone models is used to process the speech.
• More domain-independent than a word-based system.
Word Spotting Approaches
• Falls between the simple phonetic approach and complex large-vocabulary recognition.
• Word spotting has been used in two different ways:
  1. A small, fixed number of keywords is selected a priori for both recognition and indexing.
  2. The speech messages in the collection are processed and stored in a form (e.g., a phone lattice) that allows arbitrary keywords to be searched for after they are specified by the user.
Speech Message Classification and Sorting
• Topic Identification
  – There are K keywords.
  – $n_k$ is a binary value indicating the presence or absence of keyword $w_k$.
  – Find the topic $T_i$ that maximizes the score $S_i$.
Speech Message Classification and Sorting
• Topic Identification
  – With 6 topics and the top-scoring 40 words for each, there are 240 keywords in total.
  – Using these keywords on the text transcriptions of the speech messages, 82.4% classification accuracy was achieved.
  – A genetic algorithm was then used to reduce the number of keywords to 126, with a small drop in classification performance to 78.2%.
Topic Identification
• The topic-dependent unigram language models
  – K is the number of keywords in the indexing vocabulary.
  – $n_k$ is the number of times keyword $w_k$ occurs in the speech message.
  – $p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$ S_i = \log\prod_{k=1}^{K} p(w_k|T_i)^{n_k} = \sum_{k=1}^{K} n_k \log p(w_k|T_i) $$
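A minimal sketch of topic identification with this unigram score; the toy topic models are illustrative and assume every keyword has a (smoothed) nonzero probability under every topic:

```python
import math

def topic_score(counts, model):
    """S_i = sum_k n_k * log p(w_k | T_i)."""
    return sum(n * math.log(model[w]) for w, n in counts.items())

def identify(counts, topic_models):
    """Return the topic T_i that maximizes S_i."""
    return max(topic_models, key=lambda t: topic_score(counts, topic_models[t]))

# Smoothed toy models: every keyword has nonzero probability in every topic.
models = {
    "weather": {"rain": 0.30, "wind": 0.20, "score": 0.01, "team": 0.01},
    "sports":  {"rain": 0.01, "wind": 0.01, "score": 0.40, "team": 0.30},
}
print(identify({"rain": 2, "score": 1}, models))  # weather
```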
Topic Identification

Number of keywords                                                                  | Topic classification accuracy
All 8431 words in the recognition vocabulary                                        | 72.5%
A subset of 4600 words selected by a χ² hypothesis test based on contingency tables to pick the "important" keywords | 74%
203 words, after a genetic algorithm search to reduce the set further               | 70%
Topic Identification
• The length-normalized topic score
  – N is the total number of words in the speech message.
  – K is the number of keywords in the indexing vocabulary.
  – $n_k$ is the number of times keyword $w_k$ occurs in the speech message.
  – $p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$ S_i = \frac{1}{N}\sum_{k=1}^{K} n_k \log p(w_k|T_i) $$
Topic Identification
• With 750 keywords, the classification accuracy is 74.6%.
Topic Identification
• The topic model is extended to a mixture of multinomials:
  – M is the number of multinomial model components.
  – $\pi_m$ is the weight of the mth multinomial component.
  – K is the number of keywords in the indexing vocabulary.
  – $n_k$ is the number of times keyword $w_k$ occurs in the speech message.
  – $p(w_k|T_i,m)$ is the occurrence probability of keyword $w_k$ in component m of the class $T_i$ model.

$$ S_i = \log\sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p(w_k|T_i,m)^{n_k} $$
Topic Identification
• Experiments indicate that the more complex models do not perform as well as the simple single-mixture model.
Topic Spotting
• A "usefulness" measure of how discriminating a word is for the topic:

$$ u(w_k, T) = p(w_k|T)\,\log\frac{p(w_k|T)}{p(w_k|\bar T)} $$

• $p(w_k|T)$ and $p(w_k|\bar T)$ are the probabilities of detecting the keyword in the topic and in unwanted (non-topic) messages.
• This measure selects words that occur often in the topic and have high discriminability.
Topic Spotting
• Performed by accumulating, over a window of speech (typically 60 seconds), the log-likelihood ratios of the detected keywords to produce a topic score for that region of the speech message:

$$ s = \sum_{k=1}^{K} n_k \log\frac{p(w_k|T)}{p(w_k|\bar T)} $$
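A minimal sketch of this windowed scoring, assuming each keyword detection carries a timestamp in seconds; the detection list and probability tables are illustrative:

```python
import math

def window_scores(detections, p_topic, p_other, window=60.0):
    """detections: list of (time_sec, keyword) pairs.
    Returns (window_start, score) for a window anchored at each detection."""
    return [(t0, sum(math.log(p_topic[w] / p_other[w])
                     for t, w in detections if t0 <= t < t0 + window))
            for t0, _ in detections]
```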
Topic Spotting
• Models that try to capture dependencies between the keywords are also examined.
• $\mathbf{w}$ represents the vector of keywords and $\lambda_k$ is the coefficient of the kth model term; the log-linear model re-weights the per-keyword log-likelihood ratios:

$$ \log\frac{p(T|\mathbf{w})}{p(\bar T|\mathbf{w})} = \sum_{k=1}^{K} \lambda_k n_k \log\frac{p(w_k|T)}{p(w_k|\bar T)} + \log\frac{p(T)}{p(\bar T)} $$

• Their experiments show that using a carefully chosen log-linear model can give topic spotting performance that is better than using the basic model that assumes keyword independence.
Topic Clustering
• Tries to discover structure or relationships between messages in a collection.
• The clustering process has three steps: tokenization, similarity computation, and clustering.
• Tokenization: come up with a suitable representation of the speech message that can be used in the next two steps.
• Similarity computation: every pair of messages must be compared; an N-gram model is used.
• Clustering: uses hierarchical tree clustering or nearest-neighbor classification. It works well on true transcription texts, with figure-of-merit (FOM) rates around 90%; performance with speech input is worse, dropping to about 70% FOM using recognition output, unigram language models, and tree-based clustering.
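A minimal sketch of the clustering step as single-link agglomerative clustering over pairwise similarities; the threshold and the sim function (which would come from the N-gram comparison step) are assumptions:

```python
def cluster(messages, sim, threshold):
    """Greedily merge clusters whenever any cross-cluster pair of
    messages is at least `threshold` similar (single link)."""
    clusters = [[m] for m in messages]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(sim(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```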